I recently updated to DataLad v0.17.9 and used the very useful datalad create-sibling-gin command to set up GIN repos as siblings for my superdataset and its subdatasets.
I successfully datalad pushed one of my subdatasets, but when trying another one (containing MRIQC output) with datalad push --to gin I get the following error message, regardless of whether I include the --data anything or -f checkdatapresent options.
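For context, the commands boiled down to something like this (the sibling option and names are abbreviated from my setup, so treat them as placeholders):

datalad create-sibling-gin -s gin labgas/proj_erythritol_4a-mriqc
datalad push --to gin --data anything -f checkdatapresent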
I also noticed that during pushing I sometimes get the Connection to gin.g-node.org closed by remote host message, after which pushing resumes automatically (though sometimes I need to re-enter my GIN SSH passphrase).
Any suggestions on how to fix this would be welcome!
Thanks,
Lukas
CommandError: 'git -c diff.ignoreSubmodules=none push --progress --porcelain gin master:master git-annex:git-annex' failed with exitcode 128 under /data/proj_erythritol/proj_erythritol_4a/mriqc
Delta compression using up to 64 threads
CommandError: 'ssh -o ControlPath=/home/luna.kuleuven.be/u0027997/.cache/datalad/sockets/53dce49f git@gin.g-node.org 'git-receive-pack '"'"'/labgas/proj_erythritol_4a-mriqc.git'"'"''' failed with exitcode 255
send-pack: unexpected disconnect while reading sideband packet
fatal: the remote end hung up unexpectedly
Just adding here that in another subdataset, I got a different error when executing datalad push --to gin --data anything -f checkdatapresent:
CommandError: 'git -c diff.ignoreSubmodules=none annex copy --batch -z --to gin --json --json-error-messages --json-progress -c annex.dotfiles=true' failed with exitcode 1 under /data/proj_erythritol/proj_erythritol_4a/secondlevel [info keys: stdout_json]
to gin…
Transfer failed
Transfer failed [14 times]
verification of content failed [repeated 28 times]
copy: 14 failed
The repo is indeed private, but repos for the subdatasets I managed to push without problems are private too, so this is probably not the main issue?
Manually running git annex copy and git push resulted in a similar error:
u0027997@gbw-s-labgas01:/data/proj_erythritol/proj_erythritol_4a/mriqc$ git annex copy --to=gin sub-003_T1w.html
u0027997@gbw-s-labgas01:/data/proj_erythritol/proj_erythritol_4a/mriqc$ git push gin master:master
Enter passphrase for key '/home/luna.kuleuven.be/u0027997/.ssh/id_ed25519':
Enumerating objects: 660, done.
Counting objects: 100% (660/660), done.
Delta compression using up to 64 threads
Connection to gin.g-node.org closed by remote host.
fatal: the remote end hung up unexpectedly
Compressing objects: 100% (624/624), done.
fatal: the remote end hung up unexpectedly
Any idea what the cause of the other error may be? That one looks different…
On another note, I also get the following error messages when running several datalad commands, although the commands do not seem to be prevented from completing correctly (at least as far as I can tell):
[ERROR ] Internal error, cannot import interface 'datalad_hirni.commands.import_dicoms': ImportError(cannot import name 'AnnotatePaths' from 'datalad.interface.annotate_paths' (/opt/anaconda3/lib/python3.8/site-packages/datalad/interface/annotate_paths.py))
[ERROR ] Skipping unusable command interface 'datalad_hirni.commands.import_dicoms.ImportDicoms' from extension 'hirni'
[ERROR ] Internal error, cannot import interface 'datalad_hirni.commands.spec4anything': ImportError(cannot import name 'AnnotatePaths' from 'datalad.interface.annotate_paths' (/opt/anaconda3/lib/python3.8/site-packages/datalad/interface/annotate_paths.py))
[ERROR ] Skipping unusable command interface 'datalad_hirni.commands.spec4anything.Spec4Anything' from extension 'hirni'
[ERROR ] Internal error, cannot import interface 'datalad_hirni.commands.dicom2spec': ImportError(cannot import name 'AnnotatePaths' from 'datalad.interface.annotate_paths' (/opt/anaconda3/lib/python3.8/site-packages/datalad/interface/annotate_paths.py))
[ERROR ] Skipping unusable command interface 'datalad_hirni.commands.dicom2spec.Dicom2Spec' from extension 'hirni'
Any idea whether I should do something about this, and what can be done if yes?
My fear is that "unexpected disconnect" is very GIN-specific and ideally should be troubleshot together with someone among the GIN admins, but I am not sure how active/available they are. Maybe you would spot some peculiarity which makes this dataset different from the others for you (e.g. a big .git/objects?).
Thanks Yarik - I got in touch with the GIN people via e-mail and they got back to me. I am currently trying their suggestion to push with the GIN CLI rather than datalad commands (still working out issues with my recent git-annex version, which does not seem to be compatible with the GIN CLI) - will keep you posted.
.git/objects is big, but not bigger than in other subdatasets which I do manage to push using datalad without problems, so that is unlikely to be the issue?
The GIN CLI did not help, since it is not compatible with git-annex versions above 8, whereas I am running version 10. I provided them with more details, which they are now looking at from the GIN side.
This is what my .gitattributes for my superdataset looks like
When running GIT_TRACE_PACKET=true git push I get the following error:
fatal: The current branch master has no upstream branch.
To push the current branch and set the remote as upstream, use
git push --set-upstream gin master
To have this happen automatically for branches without a tracking
upstream, see 'push.autoSetupRemote' in 'git help config'.
I can obviously do this, but can I also run my usual datalad push command with GIT_TRACE_PACKET=true set?
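For reference, what I have in mind (assuming the variable is simply inherited by the git processes that datalad spawns underneath, which I have not verified):

GIT_TRACE_PACKET=true datalad push --to gin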
Again - how big is big? (output of du -scm .git/objects). You could also inspect which large files are directly in git (i.e. not symlinks) with straight du or ncdu or some other helper. Judging from your problematic .gitattributes, it might be some large text files, such as XML files produced by FreeSurfer.
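For example, a commonly used bit of plain git plumbing to list the biggest blobs kept directly in git history (nothing datalad- or GIN-specific, and the head count is arbitrary):

git rev-list --objects --all \
  | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
  | awk '/^blob/ {print $3, $4}' \
  | sort -rn | head -20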
It looks like a bunch of larger .svg figures are being saved to git rather than annex for this particular subdataset.
Shall I add **/*.svg annex.largefiles=(anything) to my .gitattributes for this particular subdataset?
Maybe I should also get rid of some rules in the superdataset (see above), as some .tsv files created by fMRIPrep as well as the HTML reports also seem to take up quite some space?
Or maybe better to implement a purely size-based rule for the entire superdataset and its subdatasets (by modifying .gitattributes at the superdataset level)?
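To make the two options above concrete, I guess they would look roughly like this in .gitattributes (the 100kb threshold is just an arbitrary example, not a recommendation):

**/*.svg annex.largefiles=anything
* annex.largefiles=(largerthan=100kb)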
Will datalad saving and/or pushing afterwards move those contents from git to git annex, or do I need to do some manual work to achieve this?
Sounds like it would make sense! As for the rest of the changes - (un)fortunately I do not know of "one rule to rule them all" yet, but generally you might indeed prefer to stick closer to "in case of doubt, go to git-annex", since it is easier to move from annex to git than the other way around: anything committed to git gets dragged along in git history.
Well, "it depends" on your configuration. Try and see. In general/by default, datalad push would push everything, so all annexed files should be pushed as well. If you set an annex wanted expression for the git-annex remote, datalad push would push only what is "wanted" by default.
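A minimal sketch (the expression itself is made up just for illustration):

git annex wanted gin 'include=*.nii.gz or include=*.svg'
datalad push --to gin   # would then transfer only content matching the expression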
I added that line to .gitattributes and then datalad saved the subdataset, but that does not move the .svg files into git annex (no symlinks).
When running datalad push --to gin --data anything -f checkdatapresent everything goes smoothly until the very end, where I run into the verification of content failed error I mentioned above, ending in copy: 8038 failed. The number of failed copies is considerably lower in other subdatasets where I encounter this issue, but the same error occurs.
I used git-filter-repo --analyze to identify the culprits in several subdatasets (produces very useful reports) and removed all those extension-based rules from .gitattributes in my superdataset, followed by datalad save -r.
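For anyone else needing this, the analysis itself is a one-liner, and the reports end up under .git/filter-repo/analysis/:

git filter-repo --analyze
ls .git/filter-repo/analysis/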
In another subdataset with large .git/objects, I run into the remote end hung up unexpectedly error I mentioned above, so there datalad pushing to GIN seems to fail differently, but maybe the underlying issue is the same?
Would it help to use git gc (--aggressive) or even git-filter-repo? Or do you see other options to move too many/too large files from git to git-annex? Would using .gitignore help (at least to prevent this in the future)?
I think you might need to explicitly re-git annex add the files to get them annexed following the adjusted .gitattributes.
It would be something related to having a huge .git/objects - maybe the remote site forbids it (like GitHub recently started to do for commits with more than 50MB diffs etc.), might be traffic, etc.
Would it help to use git gc (--aggressive) or even git-filter-repo?
If you do want to reduce .git/objects - yes, you would need something like that! But RTFM - there are aspects to all of this, e.g. you need to expire the git reflog to make git gc effective. I would recommend "practicing" on a copy or a clone first. An alternative, if you do not care about history etc., is to just do something like mv .git /tmp/git-moved; git init; git annex init; datalad save -m "Readding afresh", but YMMV, and in either case you could experience pains if you already have clones.
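E.g. the usual sequence would be something like (again, on a throwaway clone first):

git reflog expire --expire=now --all
git gc --aggressive --prune=now
du -scm .git/objects   # check whether it actually shrank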
Looks like git annex add skips files that were already added to git and are unmodified (based on the docs as well as a quick test I did), so that may not help, unfortunately.
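Perhaps untracking the files from git first and then re-adding would work - a hypothetical sketch I have not actually tried (the figures/ path is made up):

git rm --cached figures/*.svg      # untrack from git, keep files on disk
git annex add figures/*.svg        # re-add; should now be annexed per the updated .gitattributes
datalad save -m "move svg figures from git to annex"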
Not sure whether I want to get into the heavy machinery to solve this issue, but I may give it a shot (I will probably need your help, maybe when I visit Tor at Dartmouth in spring). I have already adapted .gitattributes consistently everywhere to keep .git/objects smaller in current/future datasets.
I also sent the link to this thread to Thomas at GIN, and will keep you posted about any suggestion he may have.
Any suggestions or other encounters of similar issues are welcome in the meantime!
Just wanted to let you know that I somewhat coincidentally found a solution to the “verification of content failed” problem (not sure all steps are needed, but the git annex fsck is for sure vital).
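Roughly, the key part was along these lines (reconstructed from memory, so treat the exact options as an assumption):

git annex fsck    # the vital step; possibly scoped to the remote with --from gin in my case
datalad push --to gin --data anything -f checkdatapresent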