Total amount of fMRI hours on OpenNeuro?

I’m trying to estimate the total number of hours of fMRI data available in OpenNeuro. I would like to be able to do this without actually downloading everything that’s on there. How should I do this?

AFAICT, I can get the number of subjects per dataset and the number of fMRI datasets from metadata alone using the API. Getting the number of hours of data is trickier. My plan is to use the GraphQL API to list files under func directories, then download each TSV events file and compute the time from the first onset to the final onset + duration. I can imagine a bunch of ways this could fail because of bad event tagging, etc. Do you think this is a reasonable approach, or is there an easier/more reliable way of getting the total amount of hours of data in OpenNeuro?
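
Something like this is what I had in mind for a single run, as a rough sketch, assuming a well-formed BIDS events.tsv with the required onset and duration columns (the path in the comment is just an example):

import pandas as pd

def run_duration_from_events(events_tsv):
    # Rough per-run estimate: first onset to the end of the latest-ending event.
    events = pd.read_csv(events_tsv, sep="\t", na_values="n/a")
    onsets = pd.to_numeric(events["onset"], errors="coerce")
    durations = pd.to_numeric(events["duration"], errors="coerce").fillna(0)
    return (onsets + durations).max() - onsets.min()

# e.g. run_duration_from_events("sub-01/func/sub-01_task-xyz_events.tsv")  # seconds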

Context: one of my hopes is to rebuild this famous graph (Stevenson & Kording 2011) in terms of cumulative amount of freely available data by year.

Hi @pmin,

The problem with your events.tsv approach is that resting-state scans are not required to have events files, so this would lead to an underestimate.

Depending on your tolerance for downloading files, you could use events.tsv as an estimate for non-rest files, and then for the remaining files (rest, and others that do not have events for whatever reason) download just one file, get its duration, and assume the other acquisitions in the dataset use the same parameters. You can then multiply by the number of files per dataset to get the amount of time per dataset, as in the sketch below.
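
A rough sketch of that bookkeeping (the names here are made up, and the per-dataset inputs are whatever you manage to collect):

def dataset_hours(event_durations, sampled_duration, n_runs_without_events):
    # event_durations: per-run durations (s) estimated from events.tsv files
    # sampled_duration: duration (s) of the one run you downloaded as a stand-in
    # n_runs_without_events: number of runs lacking events.tsv (rest, etc.)
    total_seconds = sum(event_durations) + sampled_duration * n_runs_without_events
    return total_seconds / 3600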

There is probably a better way to do this, but nothing comes to mind immediately.

Best,
Steven

Another interesting direction: it looks like many, if not all, OpenNeuro datasets are also on BrainLife.io. They may have an API that would allow you to get the information you need from the cloud without doing much locally. It could be worth asking in their Slack channel: https://brainlife.slack.com/ (@Soichi_Hayashi?)

Consider using datalad-fuse and downloading the ///openneuro super-dataset. You should then be able to write a tool that opens the image headers and calculates the TR * nvols for all BOLD files without fetching the entirety of every dataset.

It will take a lot of time, nonetheless, as there are >150k BOLD files, so that’s a good chunk of requests to S3.
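
For example, with the files visible through datalad-fuse, the per-run duration is just TR times the number of volumes; a minimal sketch with nibabel, assuming a 4D NIfTI that stores the TR in pixdim[4]:

import nibabel as nb

def bold_duration_seconds(path):
    # TR * number of volumes, read from the NIfTI header only.
    header = nb.load(path).header
    tr = float(header["pixdim"][4])   # repetition time; check xyzt_units (s vs ms)
    nvols = int(header["dim"][4])     # number of time points in the 4D series
    return tr * nvols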

@effigies A difficulty I foresee is that NIfTI files in the archives are gzip’ed. I don’t know of a good way to read just enough data from S3 to be able to partially uncompress a gz file and then read the header from the decompressed file.

Fetching a kilobyte will be enough to read the header. Here’s a quick tool that will read from a stream without reading past what you need:

#!/usr/bin/env python
import sys, gzip, nibabel as nb

# Decompress the partial gzip stream arriving on stdin; the NIfTI-1 header
# sits in the first 348 bytes, so ~1 kB of compressed data is plenty.
with gzip.GzipFile(fileobj=sys.stdin.buffer, mode='rb') as img:
    header = nb.Nifti1Image.from_stream(img).header

# dim[4] is the number of volumes (time points) in the series.
print(header['dim'][4])

You could then do something like:

$ datalad fsspec-head -c 1024 ds000001/sub-01/func/sub-01_task-balloonanalogrisktask_run-01_bold.nii.gz | read-nvols
300

@pmin

I am already tracking quite a few things about all the open BIDS DataLad datasets out there in this repo:

So far I have been lazy and only collected the total dataset size, but I could start adding more info given what was done in this thread.

You could then grab these numbers from the TSVs in the repo if you want.

@Remi-Gau I think you’ve pointed me at that resource before, but I just had another look. Note that there are FreeSurfer derivatives inside the fMRIPrep derivatives, under sourcedata/freesurfer, e.g.: https://github.com/OpenNeuroDerivatives/ds000117-fmriprep/tree/main/sourcedata/freesurfer

Yup, I think I have an issue somewhere to ease fetching of those nested FreeSurfer data, based on the assumption that if the fMRIPrep data are on OpenNeuro, I should be able to fetch the FreeSurfer data too.

@pmin
Currently fetching NIfTI headers from all BOLD files on OpenNeuro (actually only from the first participant of each dataset, assuming the content is the same for the other participants).
Will keep you posted.

@effigies is it expected that for some datasets the datalad fsspec-head command would give a no valid URL found type of error? (working from my phone atm but can send the exact error later)

Here is the error (with the datalad command that produced it):

 ds004274
  Getting 'scan' duration for sub-1
   sub-1/ses-002/func/sub-1_ses-002_task-PSAP_bold.nii.gz
datalad fsspec-head -d /home/remi/openneuro/ds004274 -c 1024 sub-1/ses-002/func/sub-1_ses-002_task-PSAP_bold.nii.gz | python /home/remi/github/origami/cohort_creator/tools/read_nb_vols
[ERROR] Could not find a usable URL for sub-1/ses-002/func/sub-1_ses-002_task-PSAP_bold.nii.gz within /home/remi/openneuro/ds004274 
Traceback (most recent call last):
  File "/home/remi/github/origami/cohort_creator/tools/read_nb_vols", line 11, in <module>
    header = nb.Nifti1Image.from_stream(img).header
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/remi/miniconda3/lib/python3.11/site-packages/nibabel/filebasedimages.py", line 554, in from_stream
    return klass.from_file_map(klass._filemap_from_iobase(io_obj))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/remi/miniconda3/lib/python3.11/site-packages/nibabel/analyze.py", line 960, in from_file_map
    header = klass.header_class.from_fileobj(hdrf)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/remi/miniconda3/lib/python3.11/site-packages/nibabel/nifti1.py", line 707, in from_fileobj
    hdr = klass(raw_str, endianness, check)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/remi/miniconda3/lib/python3.11/site-packages/nibabel/nifti1.py", line 694, in __init__
    super().__init__(binaryblock, endianness, check)
  File "/home/remi/miniconda3/lib/python3.11/site-packages/nibabel/analyze.py", line 252, in __init__
    super().__init__(binaryblock, endianness, check)
  File "/home/remi/miniconda3/lib/python3.11/site-packages/nibabel/wrapstruct.py", line 160, in __init__
    raise WrapStructError('Binary block is wrong size')
nibabel.wrapstruct.WrapStructError: Binary block is wrong size

Note that the error is not systematic (I have not investigated further to see if there is any pattern as to when it happens).

First, let me say, what a cool idea!

TL;DR: the “no valid URL found” error can be removed by running git annex enableremote s3-PUBLIC public=yes in a given dataset, although I don’t fully understand how and why that matters.

Here are some observations I made:

@Remi-Gau I can replicate the error for the dataset & file you mention.

This is git annex whereis for the file in question:

❱ git annex whereis sub-1/ses-002/func/sub-1_ses-002_task-PSAP_bold.nii.gz
whereis sub-1/ses-002/func/sub-1_ses-002_task-PSAP_bold.nii.gz (2 copies)
  	3f850df4-16ea-4c85-8c80-5cbd206f6980 -- OpenNeuro
   	e5fdc610-8352-4db3-b405-9912525a22df -- [s3-PUBLIC]
ok

Looking at ds004274 and comparing various things to another randomly selected dataset (ds004271), the main difference I see is that the s3-PUBLIC special remote has the parameter public set to no:

❱ git annex info s3-PUBLIC
...
type: S3
creds: not available
bucket: openneuro.org
...
public: no
...

After changing the special remote configuration:

❱ git annex enableremote s3-PUBLIC public=yes
enableremote s3-PUBLIC ok

git annex whereis also shows the S3 HTTPS URL:

❱ git annex whereis sub-1/ses-002/func/sub-1_ses-002_task-PSAP_bold.nii.gz
whereis sub-1/ses-002/func/sub-1_ses-002_task-PSAP_bold.nii.gz (2 copies)
  	3f850df4-16ea-4c85-8c80-5cbd206f6980 -- OpenNeuro
   	e5fdc610-8352-4db3-b405-9912525a22df -- [s3-PUBLIC]

  s3-PUBLIC: https://s3.amazonaws.com/openneuro.org/ds004274/sub-1/ses-002/func/sub-1_ses-002_task-PSAP_bold.nii.gz?versionId=K5i2HiaxVC03VwFvxqyEZQTIFEsWIje4
ok

And fsspec-head works without issues:

❱ datalad fsspec-head -c 1024 sub-1/ses-002/func/sub-1_ses-002_task-PSAP_bold.nii.gz | /tmp/read-nvols
500

According to git-annex S3 special remote docs:

“public - Deprecated. This enables public read access to files sent to the S3 remote using ACLs. Note that Amazon S3 buckets created after April 2023 do not support using ACLs in this way and a Bucket Policy must instead be used. This should only be set for older buckets.”

This public=no seems to throw off some interactions with S3. DataLad / git-annex can get the file, but (as seen when running with datalad --log-level debug or with git annex get directly) some of the internals report “S3 bucket does not allow public access; Set both AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to use S3”. And git annex drop errors out with the same message unless used with --reckless availability.
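
For anyone hitting the same thing across many datasets, here is a rough sketch of applying that workaround to every locally installed dataset (the root path is just an example, and it assumes the special remote is called s3-PUBLIC everywhere):

import subprocess
from pathlib import Path

openneuro_root = Path("/home/remi/openneuro")  # wherever the datasets were installed

for dataset in sorted(openneuro_root.glob("ds*")):
    # Re-enable the S3 special remote with public access so that an
    # HTTPS URL can be resolved for each annexed key.
    subprocess.run(
        ["git", "annex", "enableremote", "s3-PUBLIC", "public=yes"],
        cwd=dataset,
        check=False,  # some datasets may not have this remote; just move on
    )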

Cool, let me try that to see how much more info I can collect.

So far this is working MUCH better!!

Is there any news about this? :blush:

I have made some progress, but not as much as I wished because Nilearn is keeping me busy. :wink:

In brief, install the cohort_creator package:

GitHub - neurodatascience/cohort_creator: Creates a neuroimaging cohort by aggregating data across datasets.

I need to cut a new release with the new features, so for now either install directly from main or clone and install locally.

pip install git+https://github.com/neurodatascience/cohort_creator.git

The listings of datasets known to the package (and a whole bunch of information associated with them) are stored in TSVs in this folder.
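
Once duration information lands in those listings, totalling things up could look something like this (a sketch only; the file name and column name below are hypothetical, so check the actual TSVs in the repo):

import pandas as pd

# Hypothetical file and column names; inspect the repo's TSVs for the real ones.
datasets = pd.read_csv("openneuro.tsv", sep="\t")
total_hours = datasets["total_duration_seconds"].sum() / 3600
print(f"Total fMRI time: {total_hours:.1f} hours")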

You can launch a Dash app from the terminal with

cohort_creator browse

This should display the listing of all datasets that you can filter to update the graphs.
