"datalad push" stuck on "update availability (75%)"

Hi DataLad aficionados (and experts),

I am trying to push a new DataLad dataset to a fresh (empty) repository on GIN with datalad push --to gin, following the nice guide in the DataLad Handbook.

However, after some initial action (several outputs), the CLI gets stuck on Update availability for 'gin': 75% ... for a long time. I have cancelled and re-run this several times now, and one time I got this error:

[INFO] Update availability information
CommandError: 'git -c diff.ignoreSubmodules=none push --progress --porcelain gin master:master git-annex:git-annex' failed with exitcode 128 under /vol2/appelhoff/mpib_sp_eeg
Enumerating objects: 5387, done.
Counting objects: 100% (5387/5387), done.
Delta compression using up to 48 threads
CommandError: 'ssh -o ControlPath=/home/appelhoff/.cache/datalad/sockets/c9a23143 git@gin.g-node.org 'git-receive-pack '"'"'/sappelhoff/mpib_sp_eeg.git'"'"''' failed with exitcode 255
fatal: the remote end hung up unexpectedly
Compressing objects: 100% (3662/3662), done.
fatal: the remote end hung up unexpectedly

What could be the problem?

  • The gin repo is private (for now)
  • I am doing all of this from a Linux server that I am logged into via SSH from my local machine
  • My SSH key (RSA) is stored on that server (/home/appelhoff/.ssh/id_rsa), and the DataLad dataset is on that server too, under /vol2/appelhoff/... (but that shouldn’t be a problem, right?). GIN has my public key stored and shows that the key has been used, yet no data arrives in the repo; it is still empty/fresh
  • before doing “the real thing” that I am doing here (trying to push ~40GB of real data), I pushed some smaller toy files (~1GB) to a toy repo on GIN from my local machine, and that worked without problems (so is the server the problem? a potential firewall that the IT put in place?)
  • git annex version 8.x, datalad version 0.14.3

update: I cloned my toy dataset to the server, added a text file there, and datalad pushed back to GIN. And that worked, so a firewall or ssh problem from the server side can be excluded.

Does GIN have a problem with me trying to push a huge amount of data at once?

Or could it have something to do with the GIN repo that I want to push to being completely empty?

update: It worked now. I suspect the issue arose because I created the DataLad dataset with the text2git configuration and then committed a large number of text files. I noticed that this was problematic because not only datalad push --to gin but also a plain git push -u gin master was running into problems.

I solved it by running git gc --aggressive (took >1 hour) and then running datalad push --to gin again.

So the lesson is probably to use the text2git configuration with care, and when in doubt, run git gc --aggressive to see if it fixes the situation.
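
For anyone landing here with the same symptom, the fix described above can be sketched roughly as follows. This is a minimal sketch and not an official recipe: the function name repack_and_push and its arguments are made up for illustration, and the sibling name gin is simply the one used in this thread. The git count-objects -vH calls are only there to show how much the object store shrinks.

```shell
#!/bin/sh
# Sketch: repack a dataset's Git object store, then retry the push.
# repack_and_push is a hypothetical helper, not part of git or DataLad.
# The body runs in a subshell, so the caller's working directory is untouched.
repack_and_push() (
    dataset_dir=$1       # path to the DataLad dataset
    sibling=${2:-gin}    # sibling name; "gin" is the one used in this thread

    cd "$dataset_dir" || exit 1

    # Object counts and sizes before repacking.
    git count-objects -vH

    # Aggressive repack; this took more than an hour in the case above.
    git gc --aggressive || exit 1

    # Same report afterwards, for comparison.
    git count-objects -vH

    datalad push --to "$sibling"
)
```

It would be called as repack_and_push /vol2/appelhoff/mpib_sp_eeg gin. Nothing runs until the function is called, so it is safe to paste into a shell first and read through.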

Thanks for the report and the analysis. AFAIK GIN does not automatically reject “heavy” Git repos (that have loads of data directly committed).

I personally uploaded >50GB to GIN via DataLad last week, so sheer size, and even a large number of files, was no issue; this should work.

Yes. If you have, e.g., a bunch of XML or HTML files, you might end up with too much under Git.
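
A quick way to check whether this applies to a dataset is to look for large regular files in the working tree. The following is only a rough sketch: list_large_files and the 1 MiB default are illustrative assumptions, not anything from DataLad. In a dataset with locked annexed files, annexed content shows up as symlinks, so find -type f skips it, and what gets listed is mostly files held directly in Git (plus anything untracked or unlocked).

```shell
#!/bin/sh
# Sketch: list regular files above a size threshold, skipping .git.
# list_large_files is a hypothetical helper; the default threshold of
# 1048576 bytes (1 MiB) is an arbitrary choice for illustration.
list_large_files() {
    dataset_dir=$1
    min_bytes=${2:-1048576}

    # Prune any .git directory, then print regular files larger than min_bytes.
    # The "c" suffix makes find compare sizes in bytes.
    find "$dataset_dir" -name .git -prune -o \
        -type f -size +"${min_bytes}"c -print
}
```

For example, list_large_files . 5242880 would list files above 5 MiB in the current dataset.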

It can also depend on how you added the files: if you added them one at a time, you might end up with a long git-annex branch history. You might even want to run git annex forget to squash it.

If your .git/objects directory is large, triggering git gc is indeed a good idea too. If it is still large afterwards, see above ;)
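
To put numbers on both points, something like the following could be used. This is a sketch under assumptions: the helper names objects_size_kb and annex_branch_commits are invented here, and what counts as "large" or "long" is not specified in this thread, so no thresholds are built in.

```shell
#!/bin/sh
# Sketch: two small diagnostics for a dataset. Both helper names are hypothetical.

# Disk usage of .git/objects in kilobytes (assumes .git is a directory).
objects_size_kb() {
    du -sk "$1/.git/objects" | cut -f1
}

# Number of commits on the git-annex branch; a very high count suggests
# the long branch history mentioned above.
annex_branch_commits() {
    git -C "$1" rev-list --count git-annex
}

# If the branch history turns out to be the problem, the command named in
# this thread is "git annex forget"; read its man page before running it,
# since it rewrites the git-annex branch.
```

Running objects_size_kb . before and after git gc shows whether the repack helped.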

@sappelhoff you were faster than I was at troubleshooting that one, but I confirm that I solved it in a similar way.

For the record, this problem arose when trying to push MRIQC or fMRIPrep output to GIN. Those HTML reports sure are pretty, but they are BIG, and with a text2git config they might not play too well with Git.

So I tweaked the config to make sure that files above a certain size were annexed, and it worked well and they all lived happily ever after…
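
The exact rule used is not given in this post, so the following is only one way such a tweak might look. It appends an annex.largefiles rule to the dataset's .gitattributes so that binary files and anything above a size threshold go to the annex. The helper name add_size_rule and the 100kb default are assumptions made for illustration. Because a later line in .gitattributes overrides an earlier one for the same attribute, appending the rule takes precedence over an existing * rule.

```shell
#!/bin/sh
# Sketch: append a size-based annex.largefiles rule to .gitattributes.
# add_size_rule is a hypothetical helper; 100kb is an arbitrary threshold.
add_size_rule() {
    dataset_dir=$1
    threshold=${2:-100kb}

    # gitattributes values cannot contain spaces, hence the parentheses
    # around each term of the git-annex expression.
    printf '* annex.largefiles=((mimeencoding=binary)or(largerthan=%s))\n' \
        "$threshold" >> "$dataset_dir/.gitattributes"
}
```

After calling add_size_rule . 100kb, the change to .gitattributes still needs to be saved (for example with datalad save). Note that the rule only affects files added from then on; files already committed to Git stay in Git.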