Is there a way to restrict file size for datalad clone / install?

I am trying to use datalad to get the OpenNeuro superdataset.

My goals are to get only the participants.tsv files for all of the openneuro datasets (i.e. datalad subdatasets of the OpenNeuro superdataset) and (ideally) nothing else. If I cannot get just the participants.tsv files, I would like to keep the total size of each downloaded dataset to something very small (e.g. below 1M).

Here are some things I have tried:

Recursive install without getting data:

datalad install ///openneuro -r

This gives me all openneuro datasets, including the participants.tsv files and a lot of other, individually small files. However, some datasets (e.g. ds000031) have a ton of these “smaller” files so the whole dataset still ends up being over 1G of data:

ds000031 on  master
➜ du -h --max-depth=1 | sort -h
12K	./.datalad
5.5M	./sub-01
50M	./.git
1.3G	./sourcedata
1.4G	.

Clone and then get without data

I then tried to only clone the super dataset with:
datalad clone ///openneuro
This gives me a super lean superdataset (yay) because the subdatasets are empty. I then go in and datalad get the individual subdatasets with the -n/--no-data flag, e.g.:
datalad get -n ds000031

This seems to get me the same outcome as running datalad install recursively on the superdataset. That is, I still get all of the small files and end up with an overall large dataset footprint.

Clone and then get with exact file path

I cloned the empty superdataset again (datalad clone ///openneuro) and then requested the file paths as described in the docs:
for d in */; do datalad get "${d}participants.tsv"; done

This gets me the participants.tsv file but it also seems to get the rest of the small files. Again, same outcome as recursive install or dataset level get.

So via all three methods, I get a lot of small files that for some datasets sum up to quite a large size. Is there some way to:

  • set a hard limit on the size of files retrieved via datalad install or datalad get?
  • force get to retrieve only the exact file I request by path?
  • get the absolute minimum of files in a datalad dataset?

That would probably be the “best” way ATM to get a full clone of the datasets but without any files get'ed. But there is no way to avoid fetching content that is under git, not git-annex, in those datasets, and some have too much content committed directly into git:

(git)smaug:/mnt/datasets/datalad/crawl/openneuro[master]git
$> du -scm */.git/objects | sort -n | tail
97	ds001734/.git/objects
100	ds003495/.git/objects
170	ds002894/.git/objects
182	ds000201/.git/objects
249	ds003620/.git/objects
286	ds002790/.git/objects
630	ds002785/.git/objects
697	ds003846/.git/objects
741	ds003097/.git/objects
5481	total
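A quick way to tell which case a particular file falls under: in a plain (non-adjusted) checkout, annexed files appear in the worktree as symlinks pointing into .git/annex/objects, while files committed straight into git are regular files. A minimal sketch (the helper name and the example path are made up for illustration):

```shell
# Sketch: distinguish annexed files (symlinks) from files committed
# straight into git (regular files) in a plain, non-adjusted checkout.
is_annexed() {
  # git-annex replaces annexed files with symlinks into .git/annex/objects;
  # a regular file means its content lives in git itself and is always cloned.
  [ -L "$1" ]
}

# Usage inside a cloned dataset (hypothetical path):
#   is_annexed sub-01/anat/sub-01_T1w.nii.gz && echo "annexed"
```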

Given that all participants.tsv files are straight in git, and thus on GitHub, you could do something like

for ds in ds*; do curl https://raw.githubusercontent.com/OpenNeuroDatasets/$ds/master/participants.tsv; done

on a non-recursive clone, and just redirect each one into where you want it.
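A slightly fuller version of that loop, writing each file into the corresponding subdataset directory of a non-recursive clone (a sketch: the URL layout follows the OpenNeuroDatasets GitHub mirrors mentioned above, and the raw_url helper is just for illustration):

```shell
# Build the raw GitHub URL for a dataset's participants.tsv
# (assumes each dataset is mirrored in the OpenNeuroDatasets org, as above).
raw_url() {
  echo "https://raw.githubusercontent.com/OpenNeuroDatasets/$1/master/participants.tsv"
}

# In a non-recursive clone of ///openneuro: fetch each participants.tsv
# into its (otherwise empty) subdataset directory.
for ds in ds*/; do
  [ -d "$ds" ] || continue   # skip if the glob did not match anything
  ds=${ds%/}                 # strip the trailing slash
  curl -fsSL "$(raw_url "$ds")" -o "$ds/participants.tsv"
done
```

The -f flag makes curl fail (rather than save an HTML error page) for datasets that lack a participants.tsv.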


Thanks Yarik! I somehow hadn’t realized the difference between annexed files and files committed straight into git (and thus on GitHub) here. Your suggestion with curl is indeed the best way to get only the participants.tsv files.