Hi! I’m trying to clone the HBN CPAC data from RBC with datalad. I’ve tried running it on different servers, on Linux and on Windows, but I always get the same issue: the cloning process stops after 50%. Running it in debug mode gave “Filename too long”, changing the path and enabling long paths did not resolve it. Cloning and downloading data from PNC Freesurfer works.
Where is that dataset? `datalad wtf -S system` might help, as it shows the maximum path length on a particular mount point or the maximum filename length for the current folder (IIRC). It is then worth comparing those limits to the names you see in that HBN dataset.
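On Linux, those limits can also be queried directly with `getconf`, independent of datalad (a generic POSIX check for whichever filesystem the clone directory lives on):

```shell
# Query the limits for the filesystem backing the current directory:
getconf NAME_MAX .   # max length of a single filename component (often 255)
getconf PATH_MAX .   # max length of a full path (often 4096 on Linux)
```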
I don’t have the issue with long paths anymore, but I still have no luck downloading files from the repository. It looks like the data cloned successfully, but when I attempt `datalad get` or `git annex get`, nothing happens. `datalad status` also gets stuck. Again, no issues with other repositories.
I can only repeat my question, since otherwise I can’t troubleshoot anything from the data provided. `git annex whereis` on a specific file might tell you where to find that particular file, and `git annex info` might tell you about the known repositories, etc.
I expressed my unsolicited opinion at Ideally this dataset should be made more "modular" and lightweight · Issue #2 · ReproBrainChart/HBN_CPAC · GitHub.
And indeed the filenames are quite long; `git-sizer` reported:
```
Processing blobs: 6077403
Processing trees: 3894754
Processing commits: 2538
Matching commits to trees: 2538
Processing annotated tags: 0
Processing references: 12
```
| Name | Value | Level of concern |
| ---------------------------- | --------- | ------------------------------ |
| Overall repository size | | |
| * Trees | | |
| * Count | 3.89 M | ** |
| * Blobs | | |
| * Count | 6.08 M | **** |
| | | |
| Biggest objects | | |
| * Commits | | |
| * Maximum size [1] | 76.7 KiB | * |
| * Maximum parents [1] | 1.63 k | !!!!!!!!!!!!!!!!!!!!!!!!!!!!!! |
| * Trees | | |
| * Maximum entries [2] | 4.10 k | **** |
| | | |
| Biggest checkouts | | |
| * Number of directories [2] | 1.91 M | !!!!!!!!!!!!!!!!!!!!!!!!!!!!!! |
| * Maximum path length [3] | 274 B | ** |
| * Number of files [2] | 2.03 M | !!!!!!!!!!!!!!!!!!!!!!!!!!!!!! |
| * Number of symlinks [4] | 4.12 M | !!!!!!!!!!!!!!!!!!!!!!!!!!!!!! |
[1] 5b71e594d5503aa77f2a0f8f3882c94f0f953837
[2] 404ea015a493c111f8e792c3a26843fe8f7b2374 (refs/heads/git-annex^{tree})
[3] c7f5b92c9155a916cc657f7c5e66fcd811388739 (refs/remotes/origin/complete-pass-0.1^{tree})
[4] ddfd4d13e1355396b7bf5d8c17cee398488b591a (82f5f92aac2a1090cedbab805ec73336f196a024:cpac_RBCv0)
So the maximum path length is 274 bytes, which might cause trouble for some setups (e.g., it exceeds the default 260-character `MAX_PATH` limit on Windows).
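One quick way to check this in a checkout (a sketch assuming a POSIX shell with `find` and `awk`; run it from the top of the cloned dataset):

```shell
# Print the length of the longest path in the working tree, plus the path
# itself; compare against your platform's limit (e.g. 260 characters on
# Windows without long-path support enabled):
find . -type f -o -type l | awk '
    length($0) > max { max = length($0); longest = $0 }
    END { printf "%d %s\n", max, longest }'
```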
FWIW, the initial clone went fine for me and did not take all that long (about 5 minutes):
```
(dev3) 2 10998.....................................:Tue 01 Oct 2024 09:25:08 PM EDT:.
smaug:/mnt/btrfs/datasets/datalad/crawl-misc/hbn
$> git clone https://github.com/ReproBrainChart/HBN_CPAC
Cloning into 'HBN_CPAC'...
remote: Enumerating objects: 9974692, done.
remote: Counting objects: 100% (2729/2729), done.
remote: Compressing objects: 100% (882/882), done.
remote: Total 9974692 (delta 773), reused 2625 (delta 672), pack-reused 9971963 (from 1)
Receiving objects: 100% (9974692/9974692), 1.03 GiB | 24.52 MiB/s, done.
Resolving deltas: 100% (151303/151303), done.
Updating files: 100% (4110998/4110998), done.
git clone https://github.com/ReproBrainChart/HBN_CPAC 121.75s user 120.73s system 76% cpu 5:15.44 total
```
But then getting an individual file was more “involved”:
```
$> git annex find --not --in here | head -n 1 | xargs git annex get
Remote origin not usable by git-annex; setting annex-ignore
https://github.com/ReproBrainChart/HBN_CPAC/config download failed: Not Found
git-annex: <stdout>: hFlush: resource vanished (Broken pipe)
get cpac_RBCv0/sub-NDARAA075AMK/ses-HBNsiteSI/anat/sub-NDARAA075AMK_ses-HBNsiteSI_desc-brain_mask.json (from fcp-indi...) (scanning for annexed files...)
...
get cpac_RBCv0/sub-NDARAA075AMK/ses-HBNsiteSI/anat/sub-NDARAA075AMK_ses-HBNsiteSI_desc-brain_mask.json (from fcp-indi...) (scanning for annexed files...)
ok
(recording state in git...)
git annex find --not --in here 28.83s user 16.07s system 60% cpu 1:14.09 total
head -n 1 0.00s user 0.00s system 0% cpu 1:14.07 total
xargs git annex get 924.98s user 1110.91s system 52% cpu 1:04:25.98 total
```
So it took over an hour, but it succeeded just fine. Which file did you try to get, and what error/output did you receive?
FWIW, a subsequent `git annex get` was much speedier, albeit still long (due to the exorbitant size of the repo).
Thank you!
Cloning took longer for me, but it seems to have worked (the output is the same as above). But when I run the exact same command (`git annex find --not --in here | head -n 1 | xargs git annex get`), I get no output. Same issue when I just try `git annex get`. Any ideas?
“No output” as in:
- A: it exits without output, with exit code Y?
- B: it “works” for a while without outputting anything, and you give up waiting after X hours?
It exits without output; the exit code is 0.
That is interesting. If you run

```
git annex find --not --in here
```

alone and there is no output, it would mean that annex thinks (and may be right?) that all files are already present locally. If that is not true, run

```
git annex --debug find --not --in here
```

and share what it says; it might point to the culprit. Also check `git annex version`, since it had better be recentish if we are getting this deep.
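As an aside: with GNU `xargs`, an empty `find` output still invokes the downstream command once with no file arguments, which can contribute to exactly this kind of silent, exit-code-0 run; `xargs -r` (`--no-run-if-empty`) skips the invocation entirely. A harmless demonstration with `echo` standing in for `git annex get`:

```shell
# Plain GNU xargs runs the command once even on empty input;
# xargs -r (--no-run-if-empty) skips it entirely:
printf '' | xargs echo marker      # prints "marker"
printf '' | xargs -r echo marker   # prints nothing
```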
Will you be at SfN? Then maybe we could troubleshoot interactively together. You can find me at the DANDI booth most of the time.