Datalad push --to gin errors

thanks for sharing! not yet sure what to make of it… maybe you have/remember what the output from running annex fsck was – did it find any issues?

Yes it did! Bad file size, no known copies, and others, but it seems to fix them all, allowing git annex copy and datalad push to function normally.

“fix” might be the wrong word, as you might be missing some target annexed data files. Might want to check whether you have any broken symlinks in the tree.

Thanks Yarik!

Datalad status showed a clean tree.

Anything else I should check?

Lukas

something like find -xtype l on Linux could give you the dangling links. Also, git annex find --not --in here would return a list of files in the tree for which content is known to not be present locally (e.g. moved away by fsck because the checksum mismatched).
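For example (a minimal sketch, run from the dataset root; the dataset path is hypothetical):

cd /path/to/dataset
find . -xtype l   #dangling (broken) symlinks in the working tree
git annex find --not --in here   #annexed files whose content is not present locally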

Thanks Yarik!

The files for which content could not be verified previously do indeed appear in the output of git annex find --not --in here, and they can no longer be opened locally nor downloaded from GIN.

Any idea how to retrieve them? Where would/could they have been moved by fsck?

All those files are .mat files created by Matlab scripts (and perhaps appended to by later scripts, which may be what caused the content-verification problem). Any idea how this can be prevented in the future?

Lukas

Just FYI, the problem I encounter seems to be described in this datalad issue and this datalad issue, but they do not seem to provide a solution.

Hi Yarik,

Here is another update.

1. With the help of Thomas @GIN, I found a simple solution to the following errors mentioned above

Connection to gin.g-node.org closed by remote host.
fatal: the remote end hung up unexpectedly

and

send-pack: unexpected disconnect while reading sideband packet

namely

cd /data/proj_erythritol_4a/mriqc

git gc   #garbage collection and cleanup

datalad siblings -s gin remove

datalad create-sibling-gin labgas/proj_erythritol_4a-mriqc -s gin --private   #create fresh remote repo

datalad push --to gin

Hence, a simple git gc, combined with removing the corrupt gin repo and recreating it, followed by a simple datalad push, does the trick. This may be useful for other users encountering such problems.

2. Verification of content error

See my previous replies - git annex fsck solves the pushing issue, but the files identified by fsck are not available anymore.

Not a disaster since they can be recreated by rerunning scripts, but any advice on how to retrieve them and/or prevent the problem in the future would still be highly appreciated!

Cheers,

Lukas

well, the “broken” files are still there under .git/annex/bad/ – maybe they would come in handy?
otherwise – I don’t quite remember the full story here well enough to identify the moment which “broke” the files. Maybe it was a situation due to some odd filesystem like NFS, e.g. some files were not fully synced to the drive before being added to git-annex, so the checksums it computed did not correspond, or something like that??? Looking at the files under bad/ might give an idea – if they aren’t broken, then maybe the checksum was indeed computed on partial content or something like that?
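For example, a quick first look at what fsck moved aside (just a sketch):

ls -l .git/annex/bad/   #files moved aside by fsck, named by their annex key
file .git/annex/bad/*   #rough check whether the content still looks like valid MAT-files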

Thanks a lot Yarik!

They are under .git/annex/bad indeed, and not broken!

Is there an easy way to retrieve/restore them to their original location/name from there?

Hi @yarikoptic, any further thoughts on this?

Thanks a lot!

Cheers,

Lukas

I think so. Let me simulate the situation and “fix” it with a bash script like this:

#!/bin/bash

export PS4='> '

#set -u
set -x

cd "$(mktemp -d ${TMPDIR:-/tmp}/dl-XXXXXXX)"

git init
git annex init


# partial content
echo -n 123 > 1234
# is annexed/committed (moved into annex)
git annex add 1234
git commit -m 123 1234
# while file is being modified/grows
# we need to simulate it by jumping permissions
key=$(readlink -f 1234)
chmod +w "$key"
echo -n 4 >> 1234
chmod -w "$key"

git annex fsck

ls -l 1234
ls -lL 1234 && { echo "should be broken"; exit 1; } || :

for k in .git/annex/bad/*; do 
    kn=$(basename $k)
    target=$(find . -lname "*/$kn")
    mv "$k" "$target"
    git annex add "$target"
done
git commit -m 'Moved files from annex/bad into the tree assuming that they are not bad'

which results in

> ls -l 1234
lrwxrwxrwx 1 yoh yoh 178 May  3 10:52 1234 -> .git/annex/objects/ZV/3J/SHA256E-s3--a665a45920422f9d417e4867efdc4fb8a04a1f3fff1fa07e998e86f7f7a27ae3/SHA256E-s3--a665a45920422f9d417e4867efdc4fb8a04a1f3fff1fa07e998e86f7f7a27ae3
> ls -lL 1234
ls: cannot access '1234': No such file or directory
> :
> for k in .git/annex/bad/*
>> basename .git/annex/bad/SHA256E-s3--a665a45920422f9d417e4867efdc4fb8a04a1f3fff1fa07e998e86f7f7a27ae3
> kn=SHA256E-s3--a665a45920422f9d417e4867efdc4fb8a04a1f3fff1fa07e998e86f7f7a27ae3
>> find . -lname '*/SHA256E-s3--a665a45920422f9d417e4867efdc4fb8a04a1f3fff1fa07e998e86f7f7a27ae3'
> target=./1234
> mv .git/annex/bad/SHA256E-s3--a665a45920422f9d417e4867efdc4fb8a04a1f3fff1fa07e998e86f7f7a27ae3 ./1234
> git annex add ./1234
add 1234 
ok                                
(recording state in git...)
> git commit -m 'Moved files from annex/bad into the tree assuming that they are not bad'
[master c241adb] Moved files from annex/bad into the tree assuming that they are not bad
 1 file changed, 1 insertion(+), 1 deletion(-)

but I would triple check first if those are indeed “not bad”. Could you share the output of

ls -l .git/annex/bad – e.g. how do the actual file sizes compare to the ones annex saw when adding (and thus recorded in the key name)?
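A small helper along these lines (just a sketch, assuming MD5E-style keys of the form MD5E-s<size>--<md5>.<ext>) could compare each file against the size and md5 encoded in its key name:

for f in .git/annex/bad/*; do
    key=$(basename "$f")
    keysize=$(echo "$key" | sed -e 's/^.*E-s//' -e 's/--.*//')   #size recorded in the key
    keymd5=$(echo "$key" | sed -e 's/.*--//' -e 's/\..*//')      #md5 recorded in the key
    echo "$key: size $(stat -c %s "$f") (key: $keysize), md5 $(md5sum < "$f" | cut -d' ' -f1) (key: $keymd5)"
done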

Lovely, thanks @yarikoptic

Here is the output, which interestingly seems to suggest that for most of them, the size matches, except for one

u0027997@gbw-s-labgas01:/data/proj_erythritol/proj_erythritol_4a/secondlevel$ ls -l .git/annex/bad
total 728156
-rw-rw-rw- 1 u0139539 domain users 112685050 Nov 10 12:24 MD5E-s112685050--510b95676125dc1cd81c56654afc455b.mat
-rw-rw-rw- 1 u0139539 domain users 112812769 Nov 10 12:24 MD5E-s112812769--e49ce95c63fc2a67e84149b2d68d23d8.mat
-rw-rw-rw- 1 u0139539 domain users 11451364 Nov 10 12:36 MD5E-s11451375--7b984e0aae15255a486bccbd91b65eaf.mat
-r--r--r-- 1 u0027997 domain users 194616 Jun 28 2022 MD5E-s194616--912e6e5700cdcc112fba58f044c31acc.mat
-r--r--r-- 1 u0027997 domain users 2294136 Jul 19 2022 MD5E-s392416--2efea85a8575493124d17a906ec06fbc.mat
-rw-rw-rw- 1 u0139539 domain users 506174651 Nov 10 12:36 MD5E-s506174651--84976e010c2df3443b806e8d81b07b22.mat
-r--r--r-- 1 u0027997 domain users 757 Jul 26 2022 MD5E-s757--bd1727ec426877f3ddacdeb65fb85169.mat

indeed odd – e.g. in this case it became smaller. A possible explanation is that those files already existed and were open for writing “in place” and then not synced properly, so NFS did not show the updated state. We are confronted with similarly odd behavior even with tiny test files, even outside of the datalad context (e.g. in dandi-cli).
Altogether – there is hope that those files are indeed not bad, but I would say there is no guarantee of that, right?

BTW – another question – for this repo, did git-annex figure out the need for pidlock? (What is the output of git config annex.pidlock?) Maybe setting it could somehow help to prevent such issues (just a wild hope/guess)…
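A minimal sketch of checking and enabling it via the standard git config mechanism:

git config annex.pidlock   #check; empty output means it is not set
git config annex.pidlock true   #enable for this repository only
git config --global annex.pidlock true   #or user-wide, e.g. for datasets living on NFS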

Agreed, good hope but no guarantee!

No output of git config annex.pidlock in this subdataset! Should I turn it on?

Shall I try to run your bash script from the for loop onwards?

FWIW I do have it set user-wide (~/.gitconfig) on a server where I know that I deal mostly with datasets on NFS. So far I have not had gotchas similar to yours AFAIK. YMMV :wink:

Your call. No warranty of any kind etc :wink: I would probably do it manually on one or two files first to see how it goes :wink:

Hi @yarikoptic,

The following manual approach based on your script (though more convoluted, I see now) works:

1. Remove corrupt file in target directory

git annex unannex <corrupt_filename>
rm <corrupt_filename>
datalad save

2. Move file from .git/annex/bad to target directory

mv .git/annex/bad/MD5E… target_dir/<corrupt_filename>
datalad save

After doing this for all corrupt files, datalad push --to gin runs without errors, and all files are available locally and on GIN again without being corrupt.

For larger numbers of files, an automated/looped solution would obviously be needed, and your loop may do the trick, I guess?

I would dread recommending any automated solution here. Ideally the solution should revolve around not running into such a situation, and trying to figure out how/what to fix to prevent it! :wink:

Haha obviously agree!

Still not clear to me what caused the issue in the first place, but both types of errors are at least solved.

Thanks a ton for all the help @yarikoptic!