Slow transfers of datalad dataset to Google Drive sibling set up with RClone

Summary of what happened:

Transferring a datalad dataset to a Google Drive sibling set up with RClone is extremely slow.

Command used (and if a helper script was used, a link to the helper script or the command generated):

Using datalad push --to gdrive, where gdrive is the RClone sibling.

Version:

rclone --version
rclone v1.60.1

  • os/version: darwin 12.6.2 (64 bit)
  • os/kernel: 21.6.0 (x86_64)
  • os/type: darwin
  • os/arch: amd64
  • go/version: go1.19.3
  • go/linking: dynamic
  • go/tags: none

datalad --version
datalad 0.17.10

Environment (Docker, Singularity, custom installation):

Installed on an M1 MacBook Pro using Homebrew.

Data formatted according to a validatable standard? Please provide the output of the validator:

Relevant log outputs (up to 20 lines):

Screenshots / relevant information:

Here’s how the transfer is going:

datalad push --to gdrive
Transfer data to 'gdrive':  50%|████████████████                | 2.00/4.00 [00:00<00:00, 4.49k Steps/s]
Total:   2%|█▉                                                                                           | 201M/9.63G [59:53<46:47:35, 56.0k Bytes/s]

The question was also discussed a bit on the DataLad Matrix channel; overall we haven't arrived at a conclusion for the slow speeds. It would be useful to see what the load on the system is at that point, e.g. which are the busiest processes according to top, and how much I/O and wait time there is according to iostat, etc. (not sure whether those are all present on macOS, I am a Linux person). dstat could also be a nice tool for observing various factors over time.
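To make that concrete, here is a minimal sketch of monitoring commands for macOS, run in a second terminal while the push is in progress. The flags shown are the macOS variants of top and iostat (Linux versions differ), and dstat itself is Linux-only, so this is an assumed substitution, not something from the thread:

```shell
# Snapshot of the busiest processes by CPU (macOS top: -l 1 = one sample,
# -o cpu = sort by CPU, -n 10 = show top 10 processes)
top -l 1 -o cpu -n 10

# Disk throughput and CPU/wait summary, refreshed every 2 seconds
# (macOS iostat; press Ctrl-C to stop)
iostat -w 2
```

If rclone or a git-annex helper process dominates CPU while network throughput stays low, that points at per-request overhead rather than raw bandwidth.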

Thanks. One thing to mention: while seeing these slow speeds with DataLad, I copied a big file directly using the same rclone remote, and that ran at the speeds I would expect given my connection. The speeds through DataLad were more like 50-60 KB/s (brought me back to the old dial-up days).

Any suggestions on how to investigate this further are welcome.

Well, we would need to identify the reason/culprit. git-annex-remote-rclone is really just a small bash script (see the git-annex-remote-rclone repository on GitHub), and DataLad just calls git-annex here. That bash script might indeed have some limitations/overhead, and we might need to identify them.

Do you maybe have lots of small annexed files in that repo? (That is our case/bottleneck with the zarr datasets complementing https://github.com/dandisets/.)

In any case, I would first see what speeds I get with a plain git annex copy --to=gdrive, then see whether parallelization improves things (git annex copy --to=gdrive -J5). If all of that works fast and datalad push doesn't, then "let's talk" about troubleshooting something in DataLad. If it is slow as well, rerun the copy as git annex copy --to=gdrive --debug and look at the interactions with the rclone special remote; that may reveal the culprit (too many too-small files, or too many requests with not much actual traffic).

Hi @yarikoptic I apologize for the late response, I was away in the field.

To answer your questions, yes, the repository has a lot of small files. I will try the plain git annex copy route and report back. Thanks!

To add: the -J5 seems to have helped, but I couldn't figure out how to display the speeds of the transfers. Overall, though, the speeds still seem rather slow even with the plain git annex command.
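Not from the thread itself, but one possible way to surface per-file progress from git-annex, assuming a reasonably recent version that supports machine-readable progress output:

```shell
# Emit JSON progress records (byte counts per file) during the copy;
# requires --json for --json-progress to take effect
git annex copy --to=gdrive -J5 --json --json-progress
```

The byte-count records, together with wall-clock time, give a rough per-file throughput; alternatively, watching the overall network rate in Activity Monitor (or with nettop on macOS) during the copy gives the aggregate speed.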