I am currently trying to aggregate some subjects from OpenNeuro based on different dataset properties, for example by age, scan type, fieldmaps, etc.
After cloning openneuro:
datalad install ///openneuro
I search for specific filters (for example male 40yo) with:
datalad -c datalad.search.index-egrep-documenttype=all search bids.subject.age:40 bids.subject.sex:male
How can I list all the values that exist for a given field? For example, to list all manufacturers on OpenNeuro, that could look like:
datalad list bids.Manufacturer
How can I check whether some files exist across all the sub-datasets? For example, to check for different fieldmaps (phase difference map):
datalad exists *_magnitude1.json
Thank you all!
Thank you for showing interest in datalad search. FWIW, since there is ongoing work on refactoring metadata storage etc., and there are numerous issues with openneuro datasets complicating streamlining this process, I had stopped extracting/aggregating metadata for openneuro (openneuro itself doesn't do that, so it was up to us, datalad, to do it). Now that I see that there is interest, I will try to find time to re-introduce extraction of metadata, hopefully within a week or two. Meanwhile, metadata will not be complete. Back to the specific questions:
- How can I list all the values that exist for a given field?
$> datalad search --show-keys full 'bids.Manufacturer$'
in 128 datasets
has 20 unique values: 'ANT'; 'Agilent'; 'Biosemi'; 'Brain Vision'; 'Bruker BioSpin MRI GmbH'; 'Bruker'; 'CTF'; 'Elekta/Neuromag'; 'GE 3 Tesla MR750'; 'GE MEDICAL SYSTEMS'; 'GE'; 'General Electrics'; 'Neurofile NT'; 'Philips Medical Systems'; 'Philips'; 'SIEMENS '; 'SIEMENS'; 'Siemens'; 'g.tec'; 'gtec'
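Note that the list contains near-duplicates differing only in case or trailing whitespace (e.g. 'SIEMENS ' vs 'Siemens'). If you want to treat those as one vendor, a small post-processing sketch (using a few of the values above as sample input) could look like:

```shell
# Collapse variants that differ only by trailing whitespace or case:
# trim, lowercase, then keep unique values.  Sample values taken from
# the --show-keys output above.
printf '%s\n' 'SIEMENS ' 'SIEMENS' 'Siemens' 'Philips' 'Philips Medical Systems' \
  | sed 's/[[:space:]]*$//' \
  | tr '[:upper:]' '[:lower:]' \
  | sort -u
```

This reduces the five sample spellings to three distinct vendors.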
- How to check if some files exists for all the sub-datasets?
that is probably not possible ATM with a single command / efficiently. I would probably have done
datalad -c datalad.search.index-egrep-documenttype=all search 'path:.*_magnitude1\.json'
and then post-processed the list to identify which datasets have that file (and thus be able to tell which don't). FTR: within the extracted metadata there are no hits for such files ATM :-/ The issue we have "on file" for exactly such a feature is "a way to limit search output with a single (or specified # of) hit per dataset" (Issue #2935 · datalad/datalad · GitHub). Maybe I will eventually get back to the "postponed" https://github.com/datalad/datalad/pull/3948 some time soon and include it while at it (since it seems to be a common use-case).
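For the post-processing step, a rough sketch, assuming the search prints one matching path per line with the dataset name as the first path component (the paths below are made up purely for illustration):

```shell
# Reduce a list of matching file paths to the set of datasets that
# contain at least one hit, by keeping only the first path component.
# These example paths are hypothetical, not real search output.
printf '%s\n' \
  'ds000001/sub-01/fmap/sub-01_magnitude1.json' \
  'ds000001/sub-02/fmap/sub-02_magnitude1.json' \
  'ds000117/sub-01/fmap/sub-01_magnitude1.json' \
  | cut -d/ -f1 | sort -u
```

Comparing that set against the full list of installed subdatasets would then tell you which ones lack the file.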
Meanwhile, you could just install all subdatasets and do something like
$> for ds in *; do find "$ds" -iname '*_magnitude1.json' | grep -q . || echo "$ds"; done
which lists the datasets that do not have any, and
here are the ones which have some:
$> for ds in *; do find "$ds" -iname '*_magnitude1.json' | grep -q . && echo "$ds"; done
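The same loop idea can be extended to report a per-dataset count of matching files rather than just presence/absence. Here is a self-contained sketch that builds two fake dataset directories (names made up for illustration) so it runs anywhere:

```shell
# Build a throwaway directory tree with one fake dataset containing a
# *_magnitude1.json and one without, then count matches per dataset.
tmp=$(mktemp -d)
mkdir -p "$tmp/ds-a/sub-01/fmap" "$tmp/ds-b/sub-01/fmap"
touch "$tmp/ds-a/sub-01/fmap/sub-01_magnitude1.json"
for ds in "$tmp"/*; do
    # grep -c avoids the whitespace padding some wc implementations add
    n=$(find "$ds" -iname '*_magnitude1.json' | grep -c .)
    echo "$(basename "$ds"): $n"
done
rm -rf "$tmp"
```

In practice you would run the loop body over the installed subdatasets instead of the throwaway tree.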
Perhaps in addition to the approach Yarik nicely described: I catalog several metadata fields for all of our public datasets in this Google Sheet. It could help you find datasets that match what you are looking for (e.g. ages, modalities).
@yarikoptic The commands that you showed will really help me, thank you for that.
@franklin I will definitely check your Google Sheet; this is also really useful for the community, thank you!