I am currently trying to aggregate some subjects from OpenNeuro by different dataset properties, for example age, scan type, fieldmaps, etc.
After cloning OpenNeuro: datalad install ///openneuro
I search with specific filters (for example male, 40 y.o.): datalad -c datalad.search.index-egrep-documenttype=all search bids.subject.age:40 bids.subject.sex:male
How can I list all the values that exist for a given field? For example, to list all manufacturers on OpenNeuro, that could look like: datalad list bids.Manufacturer
How can I check whether some files exist across all the sub-datasets? For example, to check for different fieldmaps (phase difference maps): datalad exists *_magnitude1.json
Thank you for showing interest in datalad search. FWIW: since there is ongoing work on refactoring metadata storage etc., and there are numerous issues with OpenNeuro datasets complicating streamlining this process, I have stopped extracting/aggregating metadata for OpenNeuro (OpenNeuro itself doesn't do that, so it was up to us, DataLad, to do it). Now that I see there is interest, I will try to find time to re-introduce metadata extraction, hopefully within a week or two. Meanwhile, metadata will not be complete. Back to the specific questions:
How can I list all the values that exist for a given field?
$> datalad search --show-keys full 'bids.Manufacturer$'
bids.Manufacturer
in 128 datasets
has 20 unique values: 'ANT'; 'Agilent'; 'Biosemi'; 'Brain Vision'; 'Bruker BioSpin MRI GmbH'; 'Bruker'; 'CTF'; 'Elekta/Neuromag'; 'GE 3 Tesla MR750'; 'GE MEDICAL SYSTEMS'; 'GE'; 'General Electrics'; 'Neurofile NT'; 'Philips Medical Systems'; 'Philips'; 'SIEMENS '; 'SIEMENS'; 'Siemens'; 'g.tec'; 'gtec'
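Note that those 20 "unique" values contain near-duplicates that differ only in case or trailing whitespace ('SIEMENS ', 'SIEMENS', 'Siemens'). A quick generic sketch (not a datalad feature; the vendor strings below are just a subset copied from the output above) to collapse them before eyeballing the list:

```shell
# Normalize a few of the manufacturer strings from the search output:
# strip trailing whitespace, lowercase, then deduplicate.
printf '%s\n' 'SIEMENS ' 'SIEMENS' 'Siemens' 'Philips' 'Philips Medical Systems' \
  | sed 's/[[:space:]]*$//' \
  | tr '[:upper:]' '[:lower:]' \
  | sort -u
# philips
# philips medical systems
# siemens
```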
How to check if some files exists for all the sub-datasets?
That's probably not possible efficiently/with a single command ATM. I would probably have done

$> datalad -c datalad.search.index-egrep-documenttype=all search 'path:.*_magnitude1\.json'

and then post-processed the list to identify which datasets have that file (and thus be able to tell which don't). FTR: within the extracted metadata there are no hits for such files ATM :-/ The issue we have "on file" for exactly such a feature is datalad/datalad#2935 ("a way to limit search output with a single (or specified # of) hit per dataset"). Maybe I will eventually get back to the "postponed" https://github.com/datalad/datalad/pull/3948 some time soon and include it while at it (since this seems to be a common use-case).
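As a hypothetical sketch of that post-processing step: assuming the search output can be reduced to one matching file path per line with the subdataset as the first path component (an assumption about the output format, here fed with mock paths), extracting the datasets that have at least one hit is just:

```shell
# Mock search output: one matched path per line, dataset ID first.
# cut keeps the first path component; sort -u deduplicates per dataset.
printf '%s\n' \
  'ds000117/sub-01/fmap/sub-01_magnitude1.json' \
  'ds000117/sub-02/fmap/sub-02_magnitude1.json' \
  'ds000221/sub-01/fmap/sub-01_magnitude1.json' \
  | cut -d/ -f1 | sort -u
# ds000117
# ds000221
```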
Meanwhile you could just install all subdatasets and do something like
$> for ds in *; do find "$ds" -iname '*_magnitude1.json' | grep -q . || echo "$ds"; done
ds000001
ds000002
ds000003
ds000005
ds000006
ds000007
ds000008
ds000009
ds000011
ds000017
ds000030
...
which do not have any such file. The same loop with && instead of || gives the ones which have some.
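To illustrate the pattern on a throwaway tree (synthetic directories, not real OpenNeuro clones): flipping the || to && lists the datasets that DO contain a matching file.

```shell
# Build a minimal mock layout: ds000001 without a fieldmap, ds000002 with one.
tmp=$(mktemp -d)
mkdir -p "$tmp/ds000001" "$tmp/ds000002/sub-01/fmap"
touch "$tmp/ds000002/sub-01/fmap/sub-01_magnitude1.json"
cd "$tmp"
# && echoes only datasets where find/grep found at least one match.
for ds in *; do find "$ds" -iname '*_magnitude1.json' | grep -q . && echo "$ds"; done
# ds000002
```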
Perhaps in addition to the approach Yarik nicely described, I catalog several metadata fields for all of our public datasets in this Google Sheet. It could help you find datasets that match what you are looking for (e.g. ages, modalities).
@yarikoptic The commands that you showed will really help me, thank you for that. @franklin I will definitely check your Google Sheet; this is also really useful for the community, thank you!