Hi! I’m trying to clone the HBN CPAC data from RBC with datalad. I’ve tried running it on different servers, on Linux and on Windows, but I always get the same issue: the cloning process stops after 50%. Running it in debug mode gave “Filename too long”, changing the path and enabling long paths did not resolve it. Cloning and downloading data from PNC Freesurfer works.
Where is that dataset? `datalad wtf -S system` might help, as it shows the maximum path length on a particular mount point or the maximum filename length for the current folder (IIRC). It is then worth comparing those limits to the names you see in that HBN dataset.
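On Linux, those limits can also be queried directly with `getconf`, independent of datalad (a generic POSIX check for whichever filesystem the clone directory lives on):

```shell
# Query the limits for the filesystem backing the current directory:
getconf NAME_MAX .   # max length of a single filename component (often 255)
getconf PATH_MAX .   # max length of a full path (often 4096 on Linux)
```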
I don’t have the issue with long paths anymore, but I still have no luck downloading files from the repository. It looks like the data cloned successfully, but when I attempt `datalad get` or `git annex get`, nothing happens. `datalad status` also gets stuck. Again, no issues with other repositories.
I can only repeat my question, since otherwise I can’t troubleshoot anything from the data provided. `git annex whereis` on a specific file might tell you where to find that particular file, and `git annex info` might tell you about the known repositories, etc.
I expressed my unsolicited opinion at Ideally this dataset should be made more "modular" and lightweight · Issue #2 · ReproBrainChart/HBN_CPAC · GitHub.
And indeed the filenames are quite long; `git-sizer` reported:
```
Processing blobs: 6077403
Processing trees: 3894754
Processing commits: 2538
Matching commits to trees: 2538
Processing annotated tags: 0
Processing references: 12
```
| Name | Value | Level of concern |
| ---------------------------- | --------- | ------------------------------ |
| Overall repository size | | |
| * Trees | | |
| * Count | 3.89 M | ** |
| * Blobs | | |
| * Count | 6.08 M | **** |
| | | |
| Biggest objects | | |
| * Commits | | |
| * Maximum size [1] | 76.7 KiB | * |
| * Maximum parents [1] | 1.63 k | !!!!!!!!!!!!!!!!!!!!!!!!!!!!!! |
| * Trees | | |
| * Maximum entries [2] | 4.10 k | **** |
| | | |
| Biggest checkouts | | |
| * Number of directories [2] | 1.91 M | !!!!!!!!!!!!!!!!!!!!!!!!!!!!!! |
| * Maximum path length [3] | 274 B | ** |
| * Number of files [2] | 2.03 M | !!!!!!!!!!!!!!!!!!!!!!!!!!!!!! |
| * Number of symlinks [4] | 4.12 M | !!!!!!!!!!!!!!!!!!!!!!!!!!!!!! |
[1] 5b71e594d5503aa77f2a0f8f3882c94f0f953837
[2] 404ea015a493c111f8e792c3a26843fe8f7b2374 (refs/heads/git-annex^{tree})
[3] c7f5b92c9155a916cc657f7c5e66fcd811388739 (refs/remotes/origin/complete-pass-0.1^{tree})
[4] ddfd4d13e1355396b7bf5d8c17cee398488b591a (82f5f92aac2a1090cedbab805ec73336f196a024:cpac_RBCv0)
So the maximum path length is 274 bytes, which might cause trouble for some setups (e.g., it exceeds the default 260-character `MAX_PATH` limit on Windows).
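One quick way to check this in a checkout (a sketch assuming a POSIX shell with `find` and `awk`; run it from the top of the cloned dataset):

```shell
# Print the length of the longest path in the working tree, plus the path
# itself; compare against your platform's limit (e.g. 260 characters on
# Windows without long-path support enabled):
find . -type f -o -type l | awk '
    length($0) > max { max = length($0); longest = $0 }
    END { printf "%d %s\n", max, longest }'
```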
FWIW, the initial clone went fine for me and did not take all that long (about 5 minutes):
```
(dev3) 2 10998.....................................:Tue 01 Oct 2024 09:25:08 PM EDT:.
smaug:/mnt/btrfs/datasets/datalad/crawl-misc/hbn
$> git clone https://github.com/ReproBrainChart/HBN_CPAC
Cloning into 'HBN_CPAC'...
remote: Enumerating objects: 9974692, done.
remote: Counting objects: 100% (2729/2729), done.
remote: Compressing objects: 100% (882/882), done.
remote: Total 9974692 (delta 773), reused 2625 (delta 672), pack-reused 9971963 (from 1)
Receiving objects: 100% (9974692/9974692), 1.03 GiB | 24.52 MiB/s, done.
Resolving deltas: 100% (151303/151303), done.
Updating files: 100% (4110998/4110998), done.
git clone https://github.com/ReproBrainChart/HBN_CPAC 121.75s user 120.73s system 76% cpu 5:15.44 total
```
But then getting an individual file was more “involved”:
```
$> git annex find --not --in here | head -n 1 | xargs git annex get
Remote origin not usable by git-annex; setting annex-ignore
https://github.com/ReproBrainChart/HBN_CPAC/config download failed: Not Found
git-annex: <stdout>: hFlush: resource vanished (Broken pipe)
get cpac_RBCv0/sub-NDARAA075AMK/ses-HBNsiteSI/anat/sub-NDARAA075AMK_ses-HBNsiteSI_desc-brain_mask.json (from fcp-indi...) (scanning for annexed files...)
...
get cpac_RBCv0/sub-NDARAA075AMK/ses-HBNsiteSI/anat/sub-NDARAA075AMK_ses-HBNsiteSI_desc-brain_mask.json (from fcp-indi...) (scanning for annexed files...)
ok
(recording state in git...)
git annex find --not --in here 28.83s user 16.07s system 60% cpu 1:14.09 total
head -n 1 0.00s user 0.00s system 0% cpu 1:14.07 total
xargs git annex get 924.98s user 1110.91s system 52% cpu 1:04:25.98 total
```
So it took over an hour, but it succeeded just fine. Which file did you try to get, and what error/output did you receive?
FWIW, a subsequent `git annex get` was much speedier, albeit still long (due to the exorbitant size of the repo).
Thank you!
Cloning took longer for me, but it seems to have worked (the output is the same as above). But when I run the exact same command (`git annex find --not --in here | head -n 1 | xargs git annex get`), I get no output. Same issue when I just try `git annex get`. Any ideas?
“No output” as in:
- A: it exits without output, with exit code Y?
- B: it “works” for a while without outputting anything, and you give up waiting after X hours?
It exits without output; the exit code is 0.
That is interesting. If you run

```
git annex find --not --in here
```

alone and there is no output, it would mean that annex thinks (and may be right?) that all files are already present locally. If that is not true, run

```
git annex --debug find --not --in here
```

and share what it says; it might point to the culprit. Also check `git annex version`, since it had better be recentish if we are getting this deep.
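As an aside: with GNU `xargs`, an empty `find` output still invokes the downstream command once with no file arguments, which can contribute to exactly this kind of silent, exit-code-0 run; `xargs -r` (`--no-run-if-empty`) skips the invocation entirely. A harmless demonstration with `echo` standing in for `git annex get`:

```shell
# Plain GNU xargs runs the command once even on empty input;
# xargs -r (--no-run-if-empty) skips it entirely:
printf '' | xargs echo marker      # prints "marker"
printf '' | xargs -r echo marker   # prints nothing
```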
Will you be at SfN? Then maybe we could troubleshoot interactively together. You can find me at the DANDI booth most of the time.