Datalad get partial file (header?)

Hi @yarikoptic (likely),

I was curious if you’ve thought about datalad just getting headers for each nifti file instead of the entire file. The use-case is being able to build statistical models that require metadata about the nifti file but not the nifti file itself (to save space/time).

Or maybe a better workaround is to create new git repos with only header data, but it would be cool to be able to fetch/sniff only a certain number of bytes from a file in git-annex/datalad.

Thoughts/ideas?

James

datalad-fuse (GitHub: datalad/datalad-fuse), a DataLad extension to provide FUSE file system access, is the WiP toward this. It relies on fsspec for the actual “sparse cached access” and uses http* URLs for the annexed files. If you want to use it programmatically (e.g. to populate that repo of headers, if so very much desired), you could use FsspecAdapter, as done in datalad-fuse/fsspec_head.py (at commit 0063f7b0310151ca868bc64489ee6452a10753bc of datalad/datalad-fuse), to get an open file instance you can read from etc. For a turnkey solution, use datalad fusefs, e.g.:

/tmp > datalad install ///openneuro/ds000001
[INFO   ] access to 1 dataset sibling s3-PRIVATE not auto-enabled, enable with:                           
|               datalad siblings -d "/tmp/ds000001" enable -s s3-PRIVATE                                  
install(ok): /tmp/ds000001 (dataset)

/tmp > mkdir ds000001-mounted

exit:1 /tmp > datalad fusefs -d ds000001 --foreground ds000001-mounted &
[1] 104174

/tmp > du -scm ds000001
2	ds000001
2	total

/tmp > nib-ls ds000001-mounted/sub-01/func/sub-01_task-balloonanalogrisktask_run-0*nii.gz         
ds000001-mounted/sub-01/func/sub-01_task-balloonanalogrisktask_run-01_bold.nii.gz int16 [ 64,  64,  33, 300] 3.12x3.12x4.00x2.00
ds000001-mounted/sub-01/func/sub-01_task-balloonanalogrisktask_run-02_bold.nii.gz int16 [ 64,  64,  33, 300] 3.12x3.12x4.00x2.00
ds000001-mounted/sub-01/func/sub-01_task-balloonanalogrisktask_run-03_bold.nii.gz int16 [ 64,  64,  33, 300] 3.12x3.12x4.00x2.00


/tmp > du -scm ds000001                                                                  
17	ds000001
17	total

/tmp > du -scm --apparent-size ds000001
137	ds000001
137	total

So there are now some “sparse” files that would take over 130 MB if downloaded in full, but only 15 MB or so was fetched (IIRC the default block size is about 5 MB, and there are 3 files) to get those headers for nib-ls. The data is cached under ds000001/.git/datalad/cache/fsspec.

Hope this helps
