Return hash and URL for every nifti file in Datalad super-dataset

I would like to see which of the QC metrics in the MRIQC Web API were obtained using publicly available data. The MRIQC Web API stores hashes of every input file. AFAIK datalad uses the same hashes. Can I use datalad to return the hash and URL (S3 or HTTP) for every NIfTI file in the super-dataset?

Well, with this PR (https://github.com/datalad/datalad/pull/2950), if your hash is MD5 (or otherwise matches the hash used by git-annex backend) then – SURE!

Here is output on a sample ds000001 dataset:

(git-annex)hopa:~/datalad/openfmri/ds000001[master]git
$> datalad_ -f '{path}: {metadata[datalad_core][annex-key]} {metadata[datalad_core][url]}' -c datalad.search.index-egrep-documenttype=all search path:.*\.nii.gz 
/home/yoh/datalad/openfmri/ds000001/sub-01/anat/sub-01_inplaneT2.nii.gz: MD5E-s669578--0017a7174b9fdebeb1e57f36027bfb96.nii.gz ['http://openneuro.s3.amazonaws.com/ds000001/ds000001_R1.1.0/uncompressed/sub001/anatomy/inplane001.nii.gz?versionId=ystKDnaPkdzSwzdRPZH0PtMMknZJCQV4', 'http://openneuro.s3.amazonaws.com/ds000001/ds000001_R2.0.1/uncompressed/sub-01/anat/sub-01_inplaneT2.nii.gz?versionId=wVn1tHi1XBn7avQ9Q1oXuoqE2Splyc.X', 'http://openneuro.s3.amazonaws.com/ds000001/ds000001_R2.0.2/uncompressed/sub-01/anat/sub-01_inplaneT2.nii.gz?versionId=RQFsRCUlj0X77.OTMJZ01Dcx5MRjfljQ', 'http://openneuro.s3.amazonaws.com/ds000001/ds000001_R2.0.3/uncompressed/sub-01/anat/sub-01_inplaneT2.nii.gz?versionId=XdlgqNsFbXeHxzHsdZkR1CUNxozBcL8W', 'http://openneuro.s3.amazonaws.com/ds000001/ds000001_R2.0.4/uncompressed/sub-01/anat/sub-01_inplaneT2.nii.gz?versionId=ZY7d3uaj43tD6oPwq8JCuRyEBD91OgCY']
/home/yoh/datalad/openfmri/ds000001/sub-01/func/sub-01_task-balloonanalogrisktask_run-01_bold.nii.gz: MD5E-s47258494--99452aee04e7ab70b735a11a4f6f4f7a.nii.gz ['http://openneuro.s3.amazonaws.com/ds000001/ds000001_R2.0.4/uncompressed/sub-01/func/sub-01_task-balloonanalogrisktask_run-01_bold.nii.gz?versionId=Wy_03zgeQNifQr_IQO4wIjSSiieCtBuN', 'http://openneuro.s3.amazonaws.com/ds000001/ds000001_R2.0.4/uncompressed/sub-01/func/sub-01_task-balloonanalogrisktask_run-01_bold.nii.gz?versionId=gKVoxHwYIVymbhb267Xnw.yY0xF30ixs']
/home/yoh/datalad/openfmri/ds000001/sub-01/func/sub-01_task-balloonanalogrisktask_run-02_bold.nii.gz: MD5E-s47298484--b6749f0bfe58576c02956130751a53a6.nii.gz ['http://openneuro.s3.amazonaws.com/ds000001/ds000001_R2.0.4/uncompressed/sub-01/func/sub-01_task-balloonanalogrisktask_run-02_bold.nii.gz?versionId=TVNY535ikKrBTOySmxEDmYD.jVxZBAYB', 'http://openneuro.s3.amazonaws.com/ds000001/ds000001_R2.0.4/uncompressed/sub-01/func/sub-01_task-balloonanalogrisktask_run-02_bold.nii.gz?versionId=i4llASMRuWo1jmssxIQnHeeTiGLi4ACZ']
/home/yoh/datalad/openfmri/ds000001/sub-01/func/sub-01_task-balloonanalogrisktask_run-03_bold.nii.gz: MD5E-s47362134--7108659fb22117167624472507f20e2f.nii.gz ['http://openneuro.s3.amazonaws.com/ds000001/ds000001_R2.0.4/uncompressed/sub-01/func/sub-01_task-balloonanalogrisktask_run-03_bold.nii.gz?versionId=5XNQVwUMNcbdo4zKC_uVTEId0uYCL884', 'http://openneuro.s3.amazonaws.com/ds000001/ds000001_R2.0.4/uncompressed/sub-01/func/sub-01_task-balloonanalogrisktask_run-03_bold.nii.gz?versionId=ab1aP2TFsYfeausttdjeAXKuSqExO9L8']
/home/yoh/datalad/openfmri/ds000001/sub-01/anat/sub-01_T1w.nii.gz: MD5E-s5663237--4608ffbd6b78ce3a325eb338fa556589.nii.gz ['http://openneuro.s3.amazonaws.com/ds000001/ds000001_R1.1.0/uncompressed/sub001/anatomy/highres001.nii.gz?versionId=8TJ17W9WInNkQPdiQ9vS7wo8ZJ9llF80', 'http://openneuro.s3.amazonaws.com/ds000001/ds000001_R2.0.1/uncompressed/sub-01/anat/sub-01_T1w.nii.gz?versionId=qap.MnvLhQkkiWNwEPB4UaTewS4EndCo', 'http://openneuro.s3.amazonaws.com/ds000001/ds000001_R2.0.2/uncompressed/sub-01/anat/sub-01_T1w.nii.gz?versionId=n.2cGDI.yRsOgUjQEOw9esd6dSERQdgq', 'http://openneuro.s3.amazonaws.com/ds000001/ds000001_R2.0.3/uncompressed/sub-01/anat/sub-01_T1w.nii.gz?versionId=vS_.31LRwlKEZ7lqro58B3IyOYAr_bfb', 'http://openneuro.s3.amazonaws.com/ds000001/ds000001_R2.0.4/uncompressed/sub-01/anat/sub-01_T1w.nii.gz?versionId=miHvOjH0rAHzGi7gdEwhsr1SHC9fprJA']
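To compare these against the hashes stored by the MRIQC Web API, the bare MD5 digest has to be pulled out of each annex key. Below is a minimal Python sketch that does this, assuming each line looks like the sample output above (`<path>: <annex-key> [<urls>]`) and that the key follows the common `BACKEND-s<size>--<digest>[.ext]` layout. The function name `parse_search_line` is hypothetical, and keys carrying extra fields (for example chunking information) are not handled.

```python
import ast
import re

# Assumed line layout, taken from the sample output above:
#   <path>: <annex-key> ['<url>', '<url>', ...]
LINE_RE = re.compile(r"^(?P<path>.+?): (?P<key>\S+) (?P<urls>\[.*\])$")

# Simplified git-annex key layout: BACKEND-s<size>--<hexdigest>[.ext]
KEY_RE = re.compile(r"^(?P<backend>[A-Z0-9]+)-s(?P<size>\d+)--(?P<digest>[0-9a-f]+)")


def parse_search_line(line):
    """Split one line of search output into path, key parts and URLs.

    Returns None if the line does not match the assumed layout. If the
    key is missing (e.g. reported as N/A), the key fields are None.
    """
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    key = KEY_RE.match(match.group("key"))
    return {
        "path": match.group("path"),
        "backend": key.group("backend") if key else None,
        "size": int(key.group("size")) if key else None,
        "digest": key.group("digest") if key else None,
        # The URL list is printed as a Python list literal of strings.
        "urls": ast.literal_eval(match.group("urls")),
    }
```

Feeding every line of the search output through this function gives a digest and a list of URLs per file, which can then be collected into a dictionary keyed by digest for lookup.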

But if you would like to have other checksums, then someone would probably need to provide an additional extractor to extract those specifically into a dedicated metadata record… but I guess on any sizeable dataset it would take a while to do so.
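To give a rough idea of what computing another checksum involves, here is a minimal Python sketch using the standard `hashlib` module. It is not a datalad extractor, only an illustration: the function name `file_digest` is hypothetical, and it assumes the file content is actually present locally (annexed files that have not been fetched would need to be retrieved first). Every byte of every file has to be read, which is why this gets slow on large datasets.

```python
import hashlib


def file_digest(path, algorithm="md5", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in fixed-size chunks.

    `algorithm` is any name accepted by hashlib.new(), e.g. "md5" or
    "sha256". Chunked reading keeps memory use flat for large images.
    """
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```

With `algorithm="md5"` the result should be comparable to the digest part of an `MD5E` key; other algorithm names produce the alternative checksums mentioned above.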

edit: well, after the PR is merged, I will need to rerun metadata extraction to add those new records


Chris, Did you run that search already?

Nope - got roped into other stuff. Interested to see what you find.

Alright, I’ve installed the datalad superdataset and installed datalad from the GitHub master branch, but I am just seeing N/A where the MD5 key should be. Any chance you could rebuild that index?

MH02086115MACLT:ds000001 nielsond$ datalad -f '{path}: {metadata[datalad_core][annex-key]} {metadata[datalad_core][url]}' -c datalad.search.index-egrep-documenttype=all search path:.*\.nii.gz

/Users/nielsond/data/datalad/datasets.datalad.org/openfmri/ds000001/sub-01/anat/sub-01_T1w.nii.gz: N/A ['http://openneuro.s3.amazonaws.com/ds000001/ds000001_R1.1.0/uncompressed/sub001/anatomy/highres001.nii.gz?versionId=8TJ17W9WInNkQPdiQ9vS7wo8ZJ9llF80', 'http://openneuro.s3.amazonaws.com/ds000001/ds000001_R2.0.1/uncompressed/sub-01/anat/sub-01_T1w.nii.gz?versionId=qap.MnvLhQkkiWNwEPB4UaTewS4EndCo', 'http://openneuro.s3.amazonaws.com/ds000001/ds000001_R2.0.2/uncompressed/sub-01/anat/sub-01_T1w.nii.gz?versionId=n.2cGDI.yRsOgUjQEOw9esd6dSERQdgq', 'http://openneuro.s3.amazonaws.com/ds000001/ds000001_R2.0.3/uncompressed/sub-01/anat/sub-01_T1w.nii.gz?versionId=vS_.31LRwlKEZ7lqro58B3IyOYAr_bfb', 'http://openneuro.s3.amazonaws.com/ds000001/ds000001_R2.0.4/uncompressed/sub-01/anat/sub-01_T1w.nii.gz?versionId=miHvOjH0rAHzGi7gdEwhsr1SHC9fprJA']

Thanks @yarikoptic!

I’m now able to run
datalad -f '{path}: {metadata[annex][key]} {metadata[datalad_core][url]}' -c datalad.search.index-egrep-documenttype=all search path:.*\.nii.gz
from the base of the directory holding all the datasets, and I get hashes for every dataset whose index you’ve been able to regenerate.
