Datalad not downloading to NFS mount ("notneeded"). Works on local dir

I’m having some confusing issues getting data with datalad. I’ve used datalad on our server before to grab OpenNeuro datasets (albeit ones with far fewer files), so I’m not sure whether the problem is related to this specific repo, with its many files, or whether something changed on our servers.

I can clone the repo, though I get a message about config file download failure:

$ datalad clone https://github.com/ReproBrainChart/HBN_CPAC.git
[INFO   ] Remote origin not usable by git-annex; setting annex-ignore                                                                                                                                                                                
[INFO   ] https://github.com/ReproBrainChart/HBN_CPAC.git/config download failed: Not Found                                                                                                                                                          
[INFO   ] RIA store unavailable. -caused by- file:///cbica/comp_space/RBC/tmp_dir/output_ria/ria-layout-version not found, self.ria_store_url: ria+file:///cbica/comp_space/RBC/tmp_dir/output_ria, self.store_base_pass: /cbica/comp_space/RBC/tmp_dir/output_ria, self.store_base_pass_push: None, path: <class 'pathlib.PosixPath'> /cbica/comp_space/RBC/tmp_dir/output_ria/ria-layout-version -caused by- [Errno 2] No such file or directory: '/cbica/comp_space/RBC/tmp_dir/output_ria/ria-layout-version'                                                                                                                                                                                                                                              
[INFO   ] RIA store unavailable. -caused by- file:///cbica/comp_space/RBC/tmp_dir/input_ria/ria-layout-version not found, self.ria_store_url: ria+file:///cbica/comp_space/RBC/tmp_dir/input_ria, self.store_base_pass: /cbica/comp_space/RBC/tmp_dir/input_ria, self.store_base_pass_push: None, path: <class 'pathlib.PosixPath'> /cbica/comp_space/RBC/tmp_dir/input_ria/ria-layout-version -caused by- [Errno 2] No such file or directory: '/cbica/comp_space/RBC/tmp_dir/input_ria/ria-layout-version'                                                                   

When I use “datalad get”, I get a “notneeded” message, and nothing is downloaded:

$ datalad get cpac_RBCv0/sub-NDARTB661TVR/ses-HBNsiteCBIC/func/sub-NDARTB661TVR_ses-HBNsiteCBIC_task-rest_run-?_atlas-Schaefer2018p200n17_space-MNI152NLin6ASym_reg-aCompCor_desc-PearsonNilearn_correlations.tsv
action summary:
  get (notneeded: 2)

I’m running this on an NFS mount, so I’m not sure if that’s causing the problem (though again it worked on other datasets). If I datalad clone to a local scratch directory, I get the same config download failure message, but when I datalad get it does download eventually after a long initial “hang”. I get why NFS might slow things down, but I’m confused why I would get a “notneeded” message, as though the file already exists.
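For reference, whether an annexed file's content is really present can be checked without datalad. In a standard (non-adjusted) git-annex checkout, a file whose content has not been fetched is a symlink whose target under .git/annex/objects does not exist. Below is a minimal Python sketch assuming that symlink layout; the function name `annex_content_state` is made up for illustration and is not part of datalad or git-annex.

```python
import os


def annex_content_state(path):
    """Classify a path in a git-annex working tree (symlink layout assumed)."""
    if os.path.islink(path):
        # os.path.exists() follows the link, so it is False for a dangling symlink.
        if os.path.exists(path):
            return "present"
        return "dangling symlink (content not fetched)"
    return "regular file" if os.path.exists(path) else "missing"
```

If this reports a dangling symlink for a file while datalad get answers “notneeded”, the two disagree about the state of the clone, which points at the clone itself rather than at the download.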

I’m running all of this on an Ubuntu 22.04 server, with datalad 1.0.1.

Any ideas what might be going on?

@kjamison can you describe how you worked around the problem?

I was able to get the data by cloning the repo to a local temporary/scratch location on the server (not NFS), downloading the data there, then copying the data back to the NFS location.

~$ cd /tmp/
/tmp$ datalad clone https://github.com/ReproBrainChart/HBN_CPAC.git

[INFO   ] Remote origin not usable by git-annex; setting annex-ignore                                                                                                                                                                                
[INFO   ] https://github.com/ReproBrainChart/HBN_CPAC.git/config download failed: Not Found 
[INFO   ] RIA store unavailable. -caused by- file:///cbica/comp_space/RBC/tmp_dir/output_ria/ria-layout-version not found, self.ria_store_url: ria+file:///cbica/comp_space/RBC/tmp_dir/output_ria, self.store_base_pass: /cbica/comp_space/RBC/tmp_dir/output_ria, self.store_base_pass_push: None, path: <class 'pathlib.PosixPath'> /cbica/comp_space/RBC/tmp_dir/output_ria/ria-layout-version -caused by- [Errno 2] No such file or directory: '/cbica/comp_space/RBC/tmp_dir/output_ria/ria-layout-version' 
[INFO   ] RIA store unavailable. -caused by- file:///cbica/comp_space/RBC/tmp_dir/input_ria/ria-layout-version not found, self.ria_store_url: ria+file:///cbica/comp_space/RBC/tmp_dir/input_ria, self.store_base_pass: /cbica/comp_space/RBC/tmp_dir/input_ria, self.store_base_pass_push: None, path: <class 'pathlib.PosixPath'> /cbica/comp_space/RBC/tmp_dir/input_ria/ria-layout-version -caused by- [Errno 2] No such file or directory: '/cbica/comp_space/RBC/tmp_dir/input_ria/ria-layout-version' 
install(ok): /tmp/HBN_CPAC (dataset)

/tmp$ cd HBN_CPAC
/tmp/HBN_CPAC$ datalad get cpac_RBCv0/sub-NDARTB661TVR/ses-HBNsiteCBIC/func/sub-NDARTB661TVR_ses-HBNsiteCBIC_task-rest_run-?_atlas-Schaefer2018p200n17_space-MNI152NLin6ASym_reg-aCompCor_desc-PearsonNilearn_correlations.tsv

get(ok): cpac_RBCv0/sub-NDARTB661TVR/ses-HBNsiteCBIC/func/sub-NDARTB661TVR_ses-HBNsiteCBIC_task-rest_run-2_atlas-Schaefer2018p200n17_space-MNI152NLin6ASym_reg-aCompCor_desc-PearsonNilearn_correlations.tsv (file) [from fcp-indi...]               
get(ok): cpac_RBCv0/sub-NDARTB661TVR/ses-HBNsiteCBIC/func/sub-NDARTB661TVR_ses-HBNsiteCBIC_task-rest_run-1_atlas-Schaefer2018p200n17_space-MNI152NLin6ASym_reg-aCompCor_desc-PearsonNilearn_correlations.tsv (file) [from fcp-indi...]               
action summary:
  get (ok: 2)

/tmp/HBN_CPAC$ find cpac_RBCv0/ -type l -exec test -e {} \; -print0 | rsync -avL --files-from=- --from0 ./ ~/mydata/

Edit: The last line copies the file contents to the target destination, following only valid symlinks (i.e. files I actually downloaded). Note that the output in ~/mydata then consists of plain files, no longer datalad-compliant symlinks pointing into the git-annex object store.
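The same copy step can be sketched in Python for anyone without rsync at hand. This is only an illustration of the find/rsync line above, not a replacement for it; the function name `copy_resolved_symlinks` and its arguments are hypothetical.

```python
import os
import shutil
from pathlib import Path


def copy_resolved_symlinks(src_root, dest_root):
    """Copy the content behind every non-dangling symlink under src_root."""
    src_root, dest_root = Path(src_root), Path(dest_root)
    copied = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for name in filenames:
            link = Path(dirpath) / name
            # Keep only symlinks whose target exists, i.e. files that were fetched.
            if link.is_symlink() and link.exists():
                target = dest_root / link.relative_to(src_root)
                target.parent.mkdir(parents=True, exist_ok=True)
                # copyfile follows the symlink, so the result is a regular file.
                shutil.copyfile(link, target)
                copied.append(target)
    return copied
```

Something like `copy_resolved_symlinks("cpac_RBCv0", os.path.expanduser("~/mydata/cpac_RBCv0"))` would then mirror what the rsync call does, with paths kept relative to the source directory.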

something funny about that NFS, the question is “what” :wink:

What is the output of

head cpac_RBCv0/sub-NDARTB661TVR/ses-HBNsiteCBIC/func/sub-NDARTB661TVR_ses-HBNsiteCBIC_task-rest_run-?_atlas-Schaefer2018p200n17_space-MNI152NLin6ASym_reg-aCompCor_desc-PearsonNilearn_correlations.tsv

git branch

git annex get --debug cpac_RBCv0/sub-NDARTB661TVR/ses-HBNsiteCBIC/func/sub-NDARTB661TVR_ses-HBNsiteCBIC_task-rest_run-?_atlas-Schaefer2018p200n17_space-MNI152NLin6ASym_reg-aCompCor_desc-PearsonNilearn_correlations.tsv

git annex version

on NFS?

$ readlink cpac_RBCv0/sub-NDARTB661TVR/ses-HBNsiteCBIC/func/sub-NDARTB661TVR_ses-HBNsiteCBIC_task-rest_run-1_atlas-Schaefer2018p200n17_space-MNI152NLin6ASym_reg-aCompCor_desc-PearsonNilearn_correlations.tsv

../../../../.git/annex/objects/pf/1K/MD5E-s1017618--032c6f14790f382174d4dc99c2b84049.tsv/MD5E-s1017618--032c6f14790f382174d4dc99c2b84049.tsv
$ head cpac_RBCv0/sub-NDARTB661TVR/ses-HBNsiteCBIC/func/sub-NDARTB661TVR_ses-HBNsiteCBIC_task-rest_run-?_atlas-Schaefer2018p200n17_space-MNI152NLin6ASym_reg-aCompCor_desc-PearsonNilearn_correlations.tsv

head: cannot open 'cpac_RBCv0/sub-NDARTB661TVR/ses-HBNsiteCBIC/func/sub-NDARTB661TVR_ses-HBNsiteCBIC_task-rest_run-1_atlas-Schaefer2018p200n17_space-MNI152NLin6ASym_reg-aCompCor_desc-PearsonNilearn_correlations.tsv' for reading: No such file or directory
$ git branch
  git-annex
* main
$ git annex get --debug cpac_RBCv0/sub-NDARTB661TVR/ses-HBNsiteCBIC/func/sub-NDARTB661TVR_ses-HBNsiteCBIC_task-rest_run-?_atlas-Schaefer2018p200n17_space-MNI152NLin6ASym_reg-aCompCor_desc-PearsonNilearn_correlations.tsv

[2024-06-04 10:28:28.390455461] (Utility.Process) process [3271590] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","-c","annex.debug=true","ls-files","--stage","-z","--error-unmatch","--","cpac_RBCv0/sub-NDARTB661TVR/ses-HBNsiteCBIC/func/sub-NDARTB661TVR_ses-HBNsiteCBIC_task-rest_run-1_atlas-Schaefer2018p200n17_space-MNI152NLin6ASym_reg-aCompCor_desc-PearsonNilearn_correlations.tsv","cpac_RBCv0/sub-NDARTB661TVR/ses-HBNsiteCBIC/func/sub-NDARTB661TVR_ses-HBNsiteCBIC_task-rest_run-2_atlas-Schaefer2018p200n17_space-MNI152NLin6ASym_reg-aCompCor_desc-PearsonNilearn_correlations.tsv"]
[2024-06-04 10:28:28.391040989] (Utility.Process) process [3271591] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","-c","annex.debug=true","cat-file","--batch-check=%(objectname) %(objecttype) %(objectsize)","--buffer"]
[2024-06-04 10:28:28.391647521] (Utility.Process) process [3271592] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","-c","annex.debug=true","cat-file","--batch=%(objectname) %(objecttype) %(objectsize)","--buffer"]
[2024-06-04 10:28:28.392363548] (Utility.Process) process [3271593] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","-c","annex.debug=true","show-ref","git-annex"]
error: pathspec 'cpac_RBCv0/sub-NDARTB661TVR/ses-HBNsiteCBIC/func/sub-NDARTB661TVR_ses-HBNsiteCBIC_task-rest_run-1_atlas-Schaefer2018p200n17_space-MNI152NLin6ASym_reg-aCompCor_desc-PearsonNilearn_correlations.tsv' did not match any file(s) known to git
error: pathspec 'cpac_RBCv0/sub-NDARTB661TVR/ses-HBNsiteCBIC/func/sub-NDARTB661TVR_ses-HBNsiteCBIC_task-rest_run-2_atlas-Schaefer2018p200n17_space-MNI152NLin6ASym_reg-aCompCor_desc-PearsonNilearn_correlations.tsv' did not match any file(s) known to git
Did you forget to 'git add'?
[2024-06-04 10:28:28.426796389] (Utility.Process) process [3271593] done ExitSuccess
[2024-06-04 10:28:28.427295087] (Utility.Process) process [3271594] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","-c","annex.debug=true","show-ref","--hash","refs/heads/git-annex"]
[2024-06-04 10:28:28.432535812] (Utility.Process) process [3271594] done ExitSuccess
[2024-06-04 10:28:28.433373947] (Utility.Process) process [3271595] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","-c","annex.debug=true","log","refs/heads/git-annex..e198a732e3e5cedc6793cccfe1df1ca0e28092aa","--pretty=%H","-n1"]
[2024-06-04 10:28:28.531586848] (Utility.Process) process [3271595] done ExitSuccess
[2024-06-04 10:28:28.532719514] (Utility.Process) process [3271596] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","-c","annex.debug=true","cat-file","--batch=%(objectname) %(objecttype) %(objectsize)","--buffer"]
[2024-06-04 10:28:28.535624299] (Utility.Process) process [3271596] done ExitSuccess
[2024-06-04 10:28:28.535688313] (Utility.Process) process [3271592] done ExitSuccess
[2024-06-04 10:28:28.535725265] (Utility.Process) process [3271591] done ExitSuccess
[2024-06-04 10:28:28.535751666] (Utility.Process) process [3271590] done ExitFailure 1
get: 1 failed
$ git annex version

git-annex version: 10.20230626-g8594d49
build flags: Assistant Webapp Pairing Inotify DBus DesktopNotify TorrentParser MagicMime Benchmark Feeds Testsuite S3 WebDAV
dependency versions: aws-0.22 bloomfilter-2.0.1.0 cryptonite-0.29 DAV-1.3.4 feed-1.3.2.0 ghc-8.10.7 http-client-0.7.9 persistent-sqlite-2.13.0.3 torrent-10000.1.1 uuid-1.3.15 yesod-1.6.1.2
key/value backends: SHA256E SHA256 SHA512E SHA512 SHA224E SHA224 SHA384E SHA384 SHA3_256E SHA3_256 SHA3_512E SHA3_512 SHA3_224E SHA3_224 SHA3_384E SHA3_384 SKEIN256E SKEIN256 SKEIN512E SKEIN512 BLAKE2B256E BLAKE2B256 BLAKE2B512E BLAKE2B512 BLAKE2B160E BLAKE2B160 BLAKE2B224E BLAKE2B224 BLAKE2B384E BLAKE2B384 BLAKE2BP512E BLAKE2BP512 BLAKE2S256E BLAKE2S256 BLAKE2S160E BLAKE2S160 BLAKE2S224E BLAKE2S224 BLAKE2SP256E BLAKE2SP256 BLAKE2SP224E BLAKE2SP224 SHA1E SHA1 MD5E MD5 WORM URL X*
remote types: git gcrypt p2p S3 bup directory rsync web bittorrent webdav adb tahoe glacier ddar git-lfs httpalso borg hook external
operating system: linux x86_64
supported repository versions: 8 9 10
upgrade supported from repository versions: 0 1 2 3 4 5 6 7 8 9 10
local repository version: 10

how large is this beast??? I tried to clone but ran out of space in /tmp after cloning 11GB! (smells like .gitattributes was not set “optimally” and lots of files were added directly into git instead of git-annex)

so what happens if you run

git status

and of the command that git annex actually ran (quoted above), which I reassembled by pasting its argument list into an ipython session:

In [1]: ' '.join(["--git-dir=.git","--work-tree=.","--literal-pathspecs","-c","annex.debug=true","ls-files","--stage","-z","--error-unmatch","--","cpac_RBCv0/sub-NDARTB661TVR/ses-H
   ...: BNsiteCBIC/func/sub-NDARTB661TVR_ses-HBNsiteCBIC_task-rest_run-1_atlas-Schaefer2018p200n17_space-MNI152NLin6ASym_reg-aCompCor_desc-PearsonNilearn_correlations.tsv","cpac_RB
   ...: Cv0/sub-NDARTB661TVR/ses-HBNsiteCBIC/func/sub-NDARTB661TVR_ses-HBNsiteCBIC_task-rest_run-2_atlas-Schaefer2018p200n17_space-MNI152NLin6ASym_reg-aCompCor_desc-PearsonNilearn_
   ...: correlations.tsv"])
Out[1]: '--git-dir=.git --work-tree=. --literal-pathspecs -c annex.debug=true ls-files --stage -z --error-unmatch -- cpac_RBCv0/sub-NDARTB661TVR/ses-HBNsiteCBIC/func/sub-NDARTB661TVR_ses-HBNsiteCBIC_task-rest_run-1_atlas-Schaefer2018p200n17_space-MNI152NLin6ASym_reg-aCompCor_desc-PearsonNilearn_correlations.tsv cpac_RBCv0/sub-NDARTB661TVR/ses-HBNsiteCBIC/func/sub-NDARTB661TVR_ses-HBNsiteCBIC_task-rest_run-2_atlas-Schaefer2018p200n17_space-MNI152NLin6ASym_reg-aCompCor_desc-PearsonNilearn_correlations.tsv'

just prepend with git :wink:
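A slightly more defensive variant of the same trick, as a sketch: `shlex.join` (Python 3.8+) quotes any argument that contains spaces or shell metacharacters, which a plain ' '.join would not. The paths here contain nothing that needs quoting, so both give the same result in this case. The helper name `debug_args_to_command` is made up for illustration.

```python
import shlex


def debug_args_to_command(args, program="git"):
    """Turn an argument list from git-annex --debug output into a shell command line."""
    # shlex.join quotes each argument as needed for a POSIX shell.
    return shlex.join([program, *args])


args = ["--git-dir=.git", "--work-tree=.", "ls-files", "--stage", "--", "some dir/file.tsv"]
print(debug_args_to_command(args))
```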

I get 19GB after a fresh datalad clone. find ./ -type l | wc -l returns 4110978

$ git status | head -n 20
On branch main
Your branch is up to date with 'origin/main'.

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
	deleted:    .datalad/.gitattributes
	deleted:    .datalad/config
	deleted:    .gitattributes
	deleted:    .gitignore
	deleted:    .gitmodules
	deleted:    CHANGELOG.md
	... (keeps going forever)
$ git --git-dir=.git --work-tree=. --literal-pathspecs -c annex.debug=true ls-files --stage -z --error-unmatch -- cpac_RBCv0/sub-NDARTB661TVR/ses-HBNsiteCBIC/func/sub-NDARTB661TVR_ses-HBNsiteCBIC_task-rest_run-1_atlas-Schaefer2018p200n17_space-MNI152NLin6ASym_reg-aCompCor_desc-PearsonNilearn_correlations.tsv cpac_RBCv0/sub-NDARTB661TVR/ses-HBNsiteCBIC/func/sub-NDARTB661TVR_ses-HBNsiteCBIC_task-rest_run-2_atlas-Schaefer2018p200n17_space-MNI152NLin6ASym_reg-aCompCor_desc-PearsonNilearn_correlations.tsv
error: pathspec 'cpac_RBCv0/sub-NDARTB661TVR/ses-HBNsiteCBIC/func/sub-NDARTB661TVR_ses-HBNsiteCBIC_task-rest_run-1_atlas-Schaefer2018p200n17_space-MNI152NLin6ASym_reg-aCompCor_desc-PearsonNilearn_correlations.tsv' did not match any file(s) known to git
error: pathspec 'cpac_RBCv0/sub-NDARTB661TVR/ses-HBNsiteCBIC/func/sub-NDARTB661TVR_ses-HBNsiteCBIC_task-rest_run-2_atlas-Schaefer2018p200n17_space-MNI152NLin6ASym_reg-aCompCor_desc-PearsonNilearn_correlations.tsv' did not match any file(s) known to git
Did you forget to 'git add'?

Edit: Just to clarify, these commands, and the previous reply, were executed in my NFS location. If I am working in a local scratch dir, git status gives me nothing to commit, working tree clean

With this number of files, it should ideally have been partitioned into subdatasets, but that is largely orthogonal to the overall “byte size”… FTR, my initial attempt on smaug crashed with:

$> datalad clone https://github.com/ReproBrainChart/HBN_CPAC.git
install(error): /mnt/btrfs/datasets/incoming/HBN_CPAC (dataset) [Failed to clone from any candidate source URL. Encountered errors per each url were:                               
- https://github.com/ReproBrainChart/HBN_CPAC.git                                                                                                                                   
  CommandError: 'git -c diff.ignoreSubmodules=none -c core.quotepath=false clone --progress https://github.com/ReproBrainChart/HBN_CPAC.git /mnt/btrfs/datasets/incoming/HBN_CPAC' failed with exitcode 128 [err: 'Cloning into '/mnt/btrfs/datasets/incoming/HBN_CPAC'...
error: RPC failed; curl 92 HTTP/2 stream 5 was not closed cleanly: CANCEL (err 8)
error: 4061 bytes of body are still expected
fetch-pack: unexpected disconnect while reading sideband packet
fatal: early EOF
fatal: fetch-pack: invalid index-pack output']
- https://github.com/ReproBrainChart/HBN_CPAC.git/.git
  CommandError: 'git -c diff.ignoreSubmodules=none -c core.quotepath=false clone --progress https://github.com/ReproBrainChart/HBN_CPAC.git/.git /mnt/btrfs/datasets/incoming/HBN_CPAC' failed with exitcode 128 [err: 'Cloning into '/mnt/btrfs/datasets/incoming/HBN_CPAC'...
remote: Not Found
fatal: repository 'https://github.com/ReproBrainChart/HBN_CPAC.git/.git/' not found']]

And given the dirty git status output, I expect that you also hit some initial crash: the checkout did not complete successfully, which left your local clone in a very odd state. Not yet sure what to blame here: the huge size, NFS, or both. Still waiting for my clone on NFS to complete with some result.
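A quick way to spot such a half-finished checkout right after cloning, as a rough sketch: count the entries that `git status --porcelain` reports as deleted in the index (first status column `D`). A fresh, healthy clone should report none. The function names are hypothetical, and on a repository with millions of files the git call itself may take a while.

```python
import subprocess


def count_staged_deletions(porcelain_text):
    """Count entries that `git status --porcelain` marks as deleted in the index."""
    # Porcelain v1 lines look like "XY path"; X is the index (staged) status.
    return sum(1 for line in porcelain_text.splitlines() if line[:1] == "D")


def staged_deletions_in(repo_dir):
    """Run git in repo_dir and count staged deletions (needs git on PATH)."""
    out = subprocess.run(
        ["git", "-C", repo_dir, "status", "--porcelain"],
        check=True, capture_output=True, text=True,
    ).stdout
    return count_staged_deletions(out)
```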

Well, my “orthogonal” did not survive testing against this beast: here the number of files (4 million) is such that it does crank up the overall size considerably! If I clone with --depth 1 (thus not fetching objects for the git history), I get a more “concise” clone, which does save on the amount of data in .git/objects, and thus on what has to be transferred, but not really on run time, I think.

$> git clone --depth 1 https://github.com/ReproBrainChart/HBN_CPAC.git HBN_CPAC-depth1
Cloning into 'HBN_CPAC-depth1'...
remote: Enumerating objects: 2032629, done.
remote: Counting objects: 100% (2032629/2032629), done.
remote: Compressing objects: 100% (2025288/2025288), done.
remote: Total 2032629 (delta 4907), reused 2032598 (delta 4880), pack-reused 0
Receiving objects: 100% (2032629/2032629), 241.54 MiB | 8.81 MiB/s, done.
Resolving deltas: 100% (4907/4907), done.
Checking connectivity: 2032629, done.
Updating files: 100% (4110999/4110999), done.
git clone --depth 1 https://github.com/ReproBrainChart/HBN_CPAC.git   77.16s user 117.79s system 52% cpu 6:14.08 total

$> du -scm HBN_CPAC*/.git/objects
296     HBN_CPAC-depth1/.git/objects
1316    HBN_CPAC/.git/objects
1612    total

The rest of the required GBs is indeed due to the roughly 4 KB needed per symlink: about 4 million symlinks × 4 KB ≈ 16 GB.
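The back-of-the-envelope estimate can be written out with the numbers from this thread. It is a sketch that assumes every symlink occupies one 4 KiB filesystem block, which depends on the filesystem and the link length; the function name is made up.

```python
def symlink_footprint_bytes(n_symlinks, bytes_per_symlink=4096):
    """Disk space taken by symlinks if each one occupies a full block (assumption)."""
    return n_symlinks * bytes_per_symlink


n_links = 4110978     # from `find ./ -type l | wc -l` above
objects_mib = 1316    # from `du -scm HBN_CPAC/.git/objects` above

links_gib = symlink_footprint_bytes(n_links) / 2**30
total_gib = links_gib + objects_mib / 1024
print(f"symlinks: {links_gib:.1f} GiB, plus .git/objects: {total_gib:.1f} GiB")
```

The total lands somewhat below the 19GB reported for a fresh clone; the remainder presumably comes from things this sketch does not model, such as directories and the git index.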