I am currently trying to aggregate some subjects from OpenNeuro by different dataset properties, for example age, scan type, fieldmaps, etc.
After cloning OpenNeuro: datalad install ///openneuro
I search with specific filters (for example male, 40 y.o.): datalad -c datalad.search.index-egrep-documenttype=all search bids.subject.age:40 bids.subject.sex:male
How can I list all the values that exist for a given field? For example, to list all manufacturers on OpenNeuro, that could look like: datalad list bids.Manufacturer
How can I check whether some files exist across all the sub-datasets? For example, to check for different fieldmaps (phase difference maps): datalad exists *_magnitude1.json
Thank you for showing interest in datalad search. FWIW: since there is ongoing work on refactoring metadata storage etc., and there are numerous issues with OpenNeuro datasets complicating streamlining this process, I have stopped extracting/aggregating metadata for OpenNeuro (OpenNeuro itself doesn't do that, so it was up to us, DataLad, to do it). Now that I see there is interest, I will try to find time to re-introduce metadata extraction, hopefully within a week or two. Meanwhile, metadata will not be complete. Back to the specific questions:
How can I list all the values that exist for a given field?
$> datalad search --show-keys full 'bids.Manufacturer$'
bids.Manufacturer
in 128 datasets
has 20 unique values: 'ANT'; 'Agilent'; 'Biosemi'; 'Brain Vision'; 'Bruker BioSpin MRI GmbH'; 'Bruker'; 'CTF'; 'Elekta/Neuromag'; 'GE 3 Tesla MR750'; 'GE MEDICAL SYSTEMS'; 'GE'; 'General Electrics'; 'Neurofile NT'; 'Philips Medical Systems'; 'Philips'; 'SIEMENS '; 'SIEMENS'; 'Siemens'; 'g.tec'; 'gtec'
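Note that those 20 "unique" values contain near-duplicates that differ only in case or trailing whitespace ('SIEMENS ', 'SIEMENS', 'Siemens'). A quick generic sketch (not a datalad feature; the vendor strings below are just a subset copied from the output above) to collapse them before eyeballing the list:

```shell
# Normalize a few of the manufacturer strings from the search output:
# strip trailing whitespace, lowercase, then deduplicate.
printf '%s\n' 'SIEMENS ' 'SIEMENS' 'Siemens' 'Philips' 'Philips Medical Systems' \
  | sed 's/[[:space:]]*$//' \
  | tr '[:upper:]' '[:lower:]' \
  | sort -u
# philips
# philips medical systems
# siemens
```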
How to check if some files exists for all the sub-datasets?
That's probably not possible efficiently/with a single command ATM. I would probably have done

$> datalad -c datalad.search.index-egrep-documenttype=all search 'path:.*_magnitude1\.json'

and then post-processed the list to identify which datasets have that file (and thus be able to tell which don't). FTR: within the extracted metadata there are no hits for such files ATM :-/ The issue we have "on file" for exactly such a feature is datalad/datalad#2935 ("a way to limit search output with a single (or specified # of) hit per dataset"). Maybe I will eventually get back to the "postponed" https://github.com/datalad/datalad/pull/3948 some time soon and include it while at it (since this seems to be a common use-case).
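As a hypothetical sketch of that post-processing step: assuming the search output can be reduced to one matching file path per line with the subdataset as the first path component (an assumption about the output format, here fed with mock paths), extracting the datasets that have at least one hit is just:

```shell
# Mock search output: one matched path per line, dataset ID first.
# cut keeps the first path component; sort -u deduplicates per dataset.
printf '%s\n' \
  'ds000117/sub-01/fmap/sub-01_magnitude1.json' \
  'ds000117/sub-02/fmap/sub-02_magnitude1.json' \
  'ds000221/sub-01/fmap/sub-01_magnitude1.json' \
  | cut -d/ -f1 | sort -u
# ds000117
# ds000221
```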
Meanwhile you could just install all subdatasets and do something like
$> for ds in *; do find "$ds" -iname '*_magnitude1.json' | grep -q . || echo "$ds"; done
ds000001
ds000002
ds000003
ds000005
ds000006
ds000007
ds000008
ds000009
ds000011
ds000017
ds000030
...
which do not have any such file. The same loop with && instead of || gives the ones which have some.
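To illustrate the pattern on a throwaway tree (synthetic directories, not real OpenNeuro clones): flipping the || to && lists the datasets that DO contain a matching file.

```shell
# Build a minimal mock layout: ds000001 without a fieldmap, ds000002 with one.
tmp=$(mktemp -d)
mkdir -p "$tmp/ds000001" "$tmp/ds000002/sub-01/fmap"
touch "$tmp/ds000002/sub-01/fmap/sub-01_magnitude1.json"
cd "$tmp"
# && echoes only datasets where find/grep found at least one match.
for ds in *; do find "$ds" -iname '*_magnitude1.json' | grep -q . && echo "$ds"; done
# ds000002
```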
Perhaps in addition to the approach Yarik nicely described, I catalog several metadata fields for all of our public datasets in this Google Sheet. It could help you find datasets that match what you are looking for (e.g. ages, modalities).
@yarikoptic The commands that you showed will really help me, thank you for that. @franklin I will definitely check your Google Sheet; this is also really useful for the community, thank you!