I am trying to use DataLad to obtain the OpenNeuro superdataset (`///openneuro`).
My goal is to get only the `participants.tsv` files for all of the OpenNeuro datasets (i.e. the subdatasets of the OpenNeuro superdataset) and, ideally, nothing else. If I cannot get just the `participants.tsv` files, I would at least like to keep the total size of each downloaded dataset very small (e.g. below 1 MB).
Here are some things I have tried:
Recursive install without getting data:

```shell
datalad install ///openneuro -r
```
This gives me all OpenNeuro datasets, including the `participants.tsv` files and a lot of other, individually small files. However, some datasets (e.g. ds000031) have so many of these "smaller" files that the whole dataset still ends up being over 1 GB of data:
```shell
# in ds000031, on branch master
➜ du -h --max-depth=1 | sort -h
12K   ./.datalad
5.5M  ./sub-01
50M   ./.git
1.3G  ./sourcedata
1.4G  .
```
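To spot which subdatasets blow up like this, the same `du` check can be looped over every subdataset directory. A minimal sketch (demonstrated here on a throwaway directory tree standing in for the real clone; in practice you would run the loop from inside the openneuro superdataset):

```shell
# Survey the on-disk footprint of each subdataset so oversized ones
# (like ds000031 above) stand out. The ds* names and file below are
# dummies standing in for a real recursive install.
tmp=$(mktemp -d)
mkdir -p "$tmp/ds000001" "$tmp/ds000031/sourcedata"
head -c 4096 /dev/zero > "$tmp/ds000031/sourcedata/blob"  # dummy content
cd "$tmp"
for d in */; do du -sk "$d"; done | sort -n
```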
Clone and then get without data

I then tried to clone only the superdataset with:

```shell
datalad clone ///openneuro
```

This gives me a super lean superdataset (yay) because the subdatasets are empty. I then go in and `datalad get` the individual subdatasets with the `-n`/`--no-data` flag, e.g.:

```shell
datalad get -n ds000031
```

This seems to get me the same outcome as running `datalad install` recursively on the superdataset. That is, I still get all of the small files and end up with an overall large dataset footprint.
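For completeness, this per-subdataset step can be applied across the board in one loop. A minimal sketch, guarded so it is a no-op when datalad or a local clone is unavailable (the `openneuro` directory name is an assumption about where the clone landed):

```shell
# Install each subdataset without fetching any annexed file content,
# i.e. the single `datalad get -n ds000031` call applied to every
# subdataset. `openneuro` as the clone directory is an assumption.
if ! command -v datalad >/dev/null 2>&1; then
  echo "datalad not on PATH; skipping"
elif [ ! -d openneuro ]; then
  echo "no openneuro clone in $PWD; skipping"
else
  cd openneuro
  for d in */; do
    datalad get -n "$d"
  done
fi
```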
Clone and then get with exact file path

I cloned the empty superdataset again (`datalad clone ///openneuro`) and then requested the file paths as described in the docs:

```shell
for d in */; do datalad get "${d}/participants.tsv"; done
```

This gets me the `participants.tsv` files, but it also seems to fetch the rest of the small files. Again, same outcome as the recursive install or the dataset-level get.
So via all three methods, I get a lot of small files that, for some datasets, add up to quite a large size. Is there some way to:

- set a hard limit on the size of files fetched via `datalad install` or `datalad get`?
- force get to fetch only the file I request by its specific path?
- get the absolute minimum number of files in a DataLad dataset?