Datalad push --to gin errors

Hi Datalad Team,

I recently updated to Datalad v0.17.9 and used the very useful datalad create-sibling-gin command to set up GIN repos as siblings for my superdataset and its subdatasets.

I successfully datalad pushed one of my subdatasets, but when trying another one (containing mriqc output) with datalad push --to gin, I get the following error message, regardless of whether I include the --data anything or -f checkdatapresent options.

I also noticed that during pushing I sometimes get the Connection to gin.g-node.org closed by remote host message, after which pushing resumes automatically (though sometimes I need to re-enter my GIN SSH password).

Any suggestions on how to fix this would be welcome!

Thanks,

Lukas

CommandError: 'git -c diff.ignoreSubmodules=none push --progress --porcelain gin master:master git-annex:git-annex' failed with exitcode 128 under /data/proj_erythritol/proj_erythritol_4a/mriqc
Delta compression using up to 64 threads
CommandError: 'ssh -o ControlPath=/home/luna.kuleuven.be/u0027997/.cache/datalad/sockets/53dce49f git@gin.g-node.org 'git-receive-pack '"'"'/labgas/proj_erythritol_4a-mriqc.git'"'"''' failed with exitcode 255
send-pack: unexpected disconnect while reading sideband packet
fatal: the remote end hung up unexpectedly

Just adding here that in another subdataset, I got a different error when executing datalad push --to gin --data anything -f checkdatapresent:

CommandError: 'git -c diff.ignoreSubmodules=none annex copy --batch -z --to gin --json --json-error-messages --json-progress -c annex.dotfiles=true' failed with exitcode 1 under /data/proj_erythritol/proj_erythritol_4a/secondlevel [info keys: stdout_json]
to gin…
Transfer failed
Transfer failed [14 times]
verification of content failed [28 times]
copy: 14 failed

apparently I saw that error before: FTR: git push error - send-pack: unexpected disconnect while reading sideband packet · Issue #6130 · datalad/datalad · GitHub, but it remained a mystery.

Also, what is your git-annex version? there were recent tune-ups (like initremote type=git not working for an unknown reason) which might relate.

What do you see if you run these manually:

git annex copy --to=gin SOMESAMPLEFILE
git push gin master:master

Thanks a lot for the quick response Yarik!

Here is the version info of git-annex (output of git annex version)

git-annex version: 10.20220822-1~ndall+1
build flags: Assistant Webapp Pairing Inotify DBus DesktopNotify TorrentParser MagicMime Feeds Testsuite S3 WebDAV
dependency versions: aws-0.22 bloomfilter-2.0.1.0 cryptonite-0.26 DAV-1.3.4 feed-1.3.0.1 ghc-8.8.4 http-client-0.6.4.1 persistent-sqlite-2.10.6.2 torrent-10000.1.1 uuid-1.3.13 yesod-1.6.1.0
key/value backends: SHA256E SHA256 SHA512E SHA512 SHA224E SHA224 SHA384E SHA384 SHA3_256E SHA3_256 SHA3_512E SHA3_512 SHA3_224E SHA3_224 SHA3_384E SHA3_384 SKEIN256E SKEIN256 SKEIN512E SKEIN512 BLAKE2B256E BLAKE2B256 BLAKE2B512E BLAKE2B512 BLAKE2B160E BLAKE2B160 BLAKE2B224E BLAKE2B224 BLAKE2B384E BLAKE2B384 BLAKE2BP512E BLAKE2BP512 BLAKE2S256E BLAKE2S256 BLAKE2S160E BLAKE2S160 BLAKE2S224E BLAKE2S224 BLAKE2SP256E BLAKE2SP256 BLAKE2SP224E BLAKE2SP224 SHA1E SHA1 MD5E MD5 WORM URL X*
remote types: git gcrypt p2p S3 bup directory rsync web bittorrent webdav adb tahoe glacier ddar git-lfs httpalso borg hook external
operating system: linux x86_64
supported repository versions: 8 9 10
upgrade supported from repository versions: 0 1 2 3 4 5 6 7 8 9 10

The repo is indeed private, but repos for the subdatasets I managed to push without problems are private too, so this is probably not the main issue?

Manually running git annex copy and git push resulted in a similar error

u0027997@gbw-s-labgas01:/data/proj_erythritol/proj_erythritol_4a/mriqc$ git annex copy --to=gin sub-003_T1w.html
u0027997@gbw-s-labgas01:/data/proj_erythritol/proj_erythritol_4a/mriqc$ git push gin master:master
Enter passphrase for key '/home/luna.kuleuven.be/u0027997/.ssh/id_ed25519':
Enumerating objects: 660, done.
Counting objects: 100% (660/660), done.
Delta compression using up to 64 threads
Connection to gin.g-node.org closed by remote host.
fatal: the remote end hung up unexpectedly
Compressing objects: 100% (624/624), done.
fatal: the remote end hung up unexpectedly

Any idea what the cause of the other error may be? That one looks different…

Thanks a ton,

Lukas

On another note, I also get the following error messages when running several datalad commands, although the commands do not seem to be prevented from completing correctly (at least as far as I can tell):

[ERROR ] Internal error, cannot import interface 'datalad_hirni.commands.import_dicoms': ImportError(cannot import name 'AnnotatePaths' from 'datalad.interface.annotate_paths' (/opt/anaconda3/lib/python3.8/site-packages/datalad/interface/annotate_paths.py))
[ERROR ] Skipping unusable command interface 'datalad_hirni.commands.import_dicoms.ImportDicoms' from extension 'hirni'
[ERROR ] Internal error, cannot import interface 'datalad_hirni.commands.spec4anything': ImportError(cannot import name 'AnnotatePaths' from 'datalad.interface.annotate_paths' (/opt/anaconda3/lib/python3.8/site-packages/datalad/interface/annotate_paths.py))
[ERROR ] Skipping unusable command interface 'datalad_hirni.commands.spec4anything.Spec4Anything' from extension 'hirni'
[ERROR ] Internal error, cannot import interface 'datalad_hirni.commands.dicom2spec': ImportError(cannot import name 'AnnotatePaths' from 'datalad.interface.annotate_paths' (/opt/anaconda3/lib/python3.8/site-packages/datalad/interface/annotate_paths.py))
[ERROR ] Skipping unusable command interface 'datalad_hirni.commands.dicom2spec.Dicom2Spec' from extension 'hirni'

Any idea whether I should do something about this, and what can be done if yes?

Thanks again and happy Thanksgiving,

Lukas

no immediate idea… I would try googling it, sorry

if you aren’t using hirni (AFAIK not actively developed/maintained ATM), just pip uninstall datalad-hirni to get those warnings/errors out of the way.

Thanks a lot Yarik!

I successfully uninstalled datalad-hirni, so that should be solved.

I Googled the “verification of content failed” error, but the only thing I seem to get is my own post here :wink:

Any idea on how to solve the unexpected disconnect error?

Lukas

my fear is that “unexpected disconnect” is very gin-specific and ideally should be troubleshooted together with someone among the gin admins, but I am not sure how active/available they are. maybe you would spot some peculiarity which makes this dataset different from the others for you (e.g. a big .git/objects?)

Thanks Yarik - I got in touch with the GIN people via e-mail and they got back to me. I am currently trying their suggestion to push with the GIN CLI rather than datalad commands (but am still solving issues with my recent git-annex version, which does not seem to be compatible with the GIN CLI) - will keep you posted

.git/objects is big, but not bigger than in other subdatasets which I do manage to push with datalad without problems, so this is unlikely to be the issue?

Further suggestions for both errors are welcome!

Cheers,

Lukas

how big is big? you might prefer to adjust your .gitattributes going forward to get more content into the annex.

You might like running GIT_TRACE_PACKET=true git push to see more information about what is going on at the git level during the push.

I am curious to discover what the GIN CLI would do differently here, and whether that would resolve the case

Thanks Yarik, and happy New Year!

The GIN CLI did not help since it is not compatible with git-annex versions > 8, whereas I am running version 10. I provided them with more details, which they are now looking at from the GIN side.

This is what my .gitattributes for my superdataset looks like

* annex.backend=MD5E
**/.git* annex.largefiles=nothing
** annex.largefiles=((mimeencoding=binary)and(largerthan=0))
**/code/** annex.largefiles=nothing
**/*.json annex.largefiles=nothing
**/*.tsv annex.largefiles=nothing
**/*.txt annex.largefiles=nothing
**/*.log annex.largefiles=nothing
**/*.html annex.largefiles=nothing
**/*.h5 annex.largefiles=nothing

This is for one of the big/problematic subdatasets

* annex.backend=MD5E
**/.git* annex.largefiles=nothing
** annex.largefiles=((mimeencoding=binary)and(largerthan=0))

Any advice there?

When running GIT_TRACE_PACKET=true git push I get the following error

fatal: The current branch master has no upstream branch.
To push the current branch and set the remote as upstream, use

    git push --set-upstream gin master

To have this happen automatically for branches without a tracking
upstream, see 'push.autoSetupRemote' in 'git help config'.

I can obviously do this, but can I also just prefix my usual datalad push command with GIT_TRACE_PACKET=true?

Thanks a ton again,

Lukas

again – how big is big? (output of du -scm .git/objects). you could also inspect which large files are stored directly in git (i.e. not as symlinks) with straight du or ncdu or some other helper. Judging from your problematic .gitattributes – it might be some large text files, such as XML files produced by FreeSurfer.
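To make that inspection concrete, here is a sketch (wrapped in a throwaway demo repo so it runs anywhere; in a real dataset you would only run the du and git ls-tree lines, and the file names here are made up):

```shell
set -e
# throwaway demo repo; in a real dataset, just run the du/ls-tree lines below
demo=$(mktemp -d) && cd "$demo" && git init -q
git config user.email demo@example.com && git config user.name demo
head -c 200000 /dev/urandom > report.svg   # stand-in for a large file in git
echo '{}' > meta.json
git add . && git commit -qm "demo commit"

# total size of the git object store, in megabytes
du -scm .git/objects

# largest files tracked directly in git (blob size is the 4th column);
# annexed files appear as tiny symlink blobs, so large sizes mean "in git"
git ls-tree -r -l HEAD | sort -k4 -n -r | head -20
```

git ls-tree -l prints the stored blob size per tracked path, which makes any non-annexed heavyweights easy to spot.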

So do GIT_TRACE_PACKET=true git push gin
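A runnable sketch of what that trace looks like, using a throwaway local remote as a stand-in for gin (with gin the transport is ssh, but the pkt-line traffic on stderr is analogous):

```shell
set -e
demo=$(mktemp -d) && cd "$demo"
git init -q --bare remote.git              # stand-in for the GIN remote
git init -q -b master work && cd work
git config user.email demo@example.com && git config user.name demo
echo hi > f && git add f && git commit -qm init
git remote add gin "file://$demo/remote.git"

# stderr now shows "packet:" lines: ref advertisement, capabilities,
# the pack being sent, and the remote's status report
GIT_TRACE_PACKET=true git push gin master:master
```

In the failing case, the last few packet lines before “unexpected disconnect while reading sideband packet” would show how far the exchange got before gin dropped the connection.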

Thanks Yarik!

Here is the output of du -scm .git/objects for the subdataset

911	.git/objects
911	total

A straight du .git/objects identifies the following as the largest item

668260	.git/objects/pack

Any suggestion to tackle/prevent the problem by improving .gitattributes in the current or future datasets?

I am still getting the upstream-branch error when trying the git push command; should I simply set the upstream branch as suggested by the error message?

Thanks,

Lukas

yes – 1GB .git/objects - too large :-/

I can recommend what I already recommended

you could also inspect which large files are directly in git (i.e. not symlink) with straight du or ncdu or some other helper.

i.e. first identify what you commit to git instead of git-annex and what is heavy.

Thanks a ton again!

It looks like a bunch of larger .svg figures are being saved to git rather than annex for this particular subdataset.

Shall I add **/*.svg annex.largefiles=(anything) to my .gitattributes for this particular subdataset?

Maybe I should also get rid of some rules in the superdataset (see above), as some .tsv files created by fmriprep as well as the html reports also seem to take up quite some space?
Or maybe it is better to implement a purely size-based rule for the entire superdataset and its subdatasets (by modifying .gitattributes at the superdataset level)?

Will datalad saving and/or pushing afterwards move those contents from git to git annex, or do I need to do some manual work to achieve this?

Lukas

sounds like that would make sense! As for the rest of the changes – (un)fortunately I do not know a “one rule to rule them all” yet, but generally you might indeed prefer to stick closer to “in case of doubt – go to git-annex”, since it is easier to move content into git than the other way around: once committed to git, it would be dragged along in the git history.
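For what it is worth, a purely size-based superdataset .gitattributes could look something like the sketch below (the 100kb threshold is an arbitrary illustration, not a recommendation):

```
* annex.backend=MD5E
**/.git* annex.largefiles=nothing
** annex.largefiles=(largerthan=100kb)
```

Everything above the threshold then goes to the annex regardless of file type, which avoids maintaining per-extension rules like the .svg one.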

well, “it depends” on your configuration. Try and see. In general / by default, datalad push pushes everything, so all annexed files should be pushed as well. If you set an annex “wanted” expression for the git-annex remote, datalad push would by default push only what is “wanted”.

Thanks again Yarik!

I added that line to .gitattributes and then datalad saved the subdataset, but that did not move the .svg files into git annex (no symlinks).

When running datalad push --to gin --data anything -f checkdatapresent, everything goes smoothly until the very end, where I run into the verification of content failed error I mentioned above, ending in copy: 8038 failed. The number of failed copies is considerably lower in other subdatasets where I encounter this issue, but the same error occurs.

I used git-filter-repo --analyze to identify the culprits in several subdatasets (produces very useful reports) and removed all those extension-based rules from .gitattributes in my superdataset, followed by datalad save -r.

In another subdataset with a large .git/objects, I run into the remote end hung up unexpectedly error I mentioned above, so there datalad pushing to gin seems to result in a different issue, but maybe the underlying cause is the same?

Would it help to use git gc (--aggressive) or even git-filter-repo? Or do you see other options to move the too-many/large files from git to git annex? Would using .gitignore help (at least to prevent this in the future)?

Thanks a lot in advance again.

Best wishes,

Lukas

I think you might need to explicitly re-run git annex add on the files to get them annexed following the adjusted .gitattributes.
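Since git annex add skips unmodified files that are already checked into git, re-annexing would likely need an explicit untrack step first. A sketch as a self-contained demo (the figure.svg name is made up; in a real dataset you would run the rm/add/commit steps directly, ideally trying them on a clone first):

```shell
set -e
# throwaway demo repo standing in for the dataset (pure git; in a real
# dataset the `git annex add` step mentioned below does the re-annexing)
demo=$(mktemp -d) && cd "$demo" && git init -q
git config user.email demo@example.com && git config user.name demo
echo '<svg/>' > figure.svg
git add figure.svg && git commit -qm "svg accidentally committed to git"
echo '**/*.svg annex.largefiles=anything' > .gitattributes

# untrack from git while keeping the working copy on disk
git rm --cached figure.svg

# now `git annex add figure.svg` would route it to the annex per the
# updated .gitattributes, followed by `git commit` to record the move
git status --porcelain
```

Note that this only changes the current tree: the old blobs stay in the git history, so .git/objects will not shrink without rewriting history.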

it could be something related to having a huge .git/objects – maybe the remote site forbids it (like github recently started to do for commits with more than 50MB diffs etc)… might be traffic, etc

Would it help to use git gc (--aggressive) or even git-filter-repo?

if you do want to reduce .git/objects – yes, you would need something like that! but RTFM – there are aspects to all of that, e.g. you need to expire the git reflog to make git gc effective. I would recommend “practicing” on a copy or a clone first. An alternative, if you do not care about history etc, is to just do smth like mv .git /tmp/git-moved; git init; git annex init; datalad save -m "Readding afresh" :wink: but YMMV, and in either case you could experience pains if you already have clones etc.
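To illustrate the reflog caveat, a self-contained sketch in a throwaway repo (the blob size is arbitrary):

```shell
set -e
# throwaway repo with a large blob reachable only via the reflog
demo=$(mktemp -d) && cd "$demo" && git init -q
git config user.email demo@example.com && git config user.name demo
echo start > README && git add README && git commit -qm init
head -c 1000000 /dev/urandom > blob.bin
git add blob.bin && git commit -qm "add big blob"
git reset -q --hard HEAD~1   # blob commit now unreachable, except via reflog

du -sk .git/objects                      # before: still holds the ~1MB blob
git reflog expire --expire=now --all     # drop reflog entries keeping it alive
git gc --aggressive --prune=now --quiet  # repack and prune unreachable objects
du -sk .git/objects                      # after: noticeably smaller
```

Without the reflog expire step, git gc would keep the blob, since reflog entries count as reachable; that is why gc alone often seems to have no effect.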

Thanks Yarik!

Looks like git annex add skips files that were already added to git and are unmodified (based on the docs as well as a quick test I did), so that may not help unfortunately.

Not sure whether I want to move into the heavy machinery to solve this issue, but I may give it a shot (I will probably need your help, maybe when I visit Tor at Dartmouth in spring). In the meantime, I have already adapted .gitattributes consistently everywhere to keep .git/objects smaller in current/future datasets.

I also sent the link to this thread to Thomas at GIN, and will keep you posted about any suggestion he may have.

Any suggestions or other encounters of similar issues are welcome in the meantime!

Best wishes,

Lukas

Hi Yarik et al,

Just wanted to let you know that I somewhat coincidentally found a solution to the “verification of content failed” problem (not sure all steps are needed, but the git annex fsck is for sure vital).

git annex unused
git annex dropunused --force
git annex dropunused --force --from=gin
git annex fsck
git annex copy --to=gin --all
datalad push --to gin --data anything -f all

Hope this is useful for other users!

Cheers,

Lukas