Total amount of fMRI hours on OpenNeuro?

I’m trying to estimate the total number of hours of fMRI data available in OpenNeuro. I would like to be able to do this without actually downloading everything that’s on there. How should I do this?

AFAICT, I can get the number of subjects per dataset and the number of fMRI datasets from metadata alone using the API. Getting the number of hours of data is trickier. My plan is to use the GraphQL API to list files under func directories, then download each TSV events file and compute the time from the first onset to the final onset + duration. I can imagine a bunch of ways this could fail because of bad event tagging, etc. Do you think this is a reasonable approach, or is there an easier/more reliable way of getting the total amount of hours of data in OpenNeuro?
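
Something like this is what I had in mind for a single run, as a rough sketch, assuming a well-formed BIDS events.tsv with the required onset and duration columns (the path in the comment is just an example):

import pandas as pd

def run_duration_from_events(events_tsv):
    # Rough per-run estimate: first onset to the end of the latest-ending event.
    events = pd.read_csv(events_tsv, sep="\t", na_values="n/a")
    onsets = pd.to_numeric(events["onset"], errors="coerce")
    durations = pd.to_numeric(events["duration"], errors="coerce").fillna(0)
    return (onsets + durations).max() - onsets.min()

# e.g. run_duration_from_events("sub-01/func/sub-01_task-xyz_events.tsv")  # seconds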

Context: one of my hopes is to rebuild this famous graph (Stevenson & Kording 2011) in terms of cumulative amount of freely available data by year.

Hi @pmin,

The problem with your events.tsv approach is that resting-state scans are not required to have events files, so this would lead to an underestimate.

Depending on your tolerance for downloading files, you could use events.tsv as an estimate for non-rest files, and then for the remaining files (rest, and others that do not have events for whatever reason) download just one file, get its duration, and assume the other acquisitions in the dataset use the same parameters. You can then multiply by the number of files per dataset to get the amount of time per dataset, as in the sketch below.
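
A rough sketch of that bookkeeping (the names here are made up, and the per-dataset inputs are whatever you manage to collect):

def dataset_hours(event_durations, sampled_duration, n_runs_without_events):
    # event_durations: per-run durations (s) estimated from events.tsv files
    # sampled_duration: duration (s) of the one run you downloaded as a stand-in
    # n_runs_without_events: number of runs lacking events.tsv (rest, etc.)
    total_seconds = sum(event_durations) + sampled_duration * n_runs_without_events
    return total_seconds / 3600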

There is probably a better way to do this, but nothing comes to mind immediately.

Best,
Steven

Another interesting direction: it looks like many, if not all, OpenNeuro datasets are also on BrainLife.io. They may have an API that would allow you to get the information you need from the cloud without doing much locally. It could be worth asking in their Slack channel: https://brainlife.slack.com/ (@Soichi_Hayashi?)

Consider using datalad-fuse and downloading the ///openneuro super-dataset. You should then be able to write a tool that opens the image headers and calculates the TR * nvols for all BOLD files without fetching the entirety of every dataset.

It will take a lot of time, nonetheless, as there are >150k BOLD files, so that’s a good chunk of requests to S3.
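
For example, with the files visible through datalad-fuse, the per-run duration is just TR times the number of volumes; a minimal sketch with nibabel, assuming a 4D NIfTI that stores the TR in pixdim[4]:

import nibabel as nb

def bold_duration_seconds(path):
    # TR * number of volumes, read from the NIfTI header only.
    header = nb.load(path).header
    tr = float(header["pixdim"][4])   # repetition time; check xyzt_units (s vs ms)
    nvols = int(header["dim"][4])     # number of time points in the 4D series
    return tr * nvols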

@effigies A difficulty I foresee is that NIfTI files in the archives are gzip’ed. I don’t know of a good way to read just enough data from S3 to be able to partially uncompress a gz file and then read the header from the decompressed file.

Fetching a kilobyte will be enough to read the header. Here’s a quick tool that will read from a stream without reading past what you need:

#!/usr/bin/env python
import sys, gzip, nibabel as nb

# Decompress the partial gzip stream arriving on stdin; the NIfTI-1 header
# sits in the first 348 bytes, so ~1 kB of compressed data is plenty.
with gzip.GzipFile(fileobj=sys.stdin.buffer, mode='rb') as img:
    header = nb.Nifti1Image.from_stream(img).header

# dim[4] is the number of volumes (time points) in the series.
print(header['dim'][4])

You could then do something like:

$ datalad fsspec-head -c 1024 ds000001/sub-01/func/sub-01_task-balloonanalogrisktask_run-01_bold.nii.gz | read-nvols
300

@pmin

I am already tracking quite a few things about all the open BIDS DataLad datasets out there in this repo:

So far I have been lazy and only collected the total dataset size, but I could start adding more info given what was done in this thread.

You could then grab these numbers from the TSVs in the repo if you want.

@Remi-Gau I think you’ve pointed me at that resource before, but I just had another look. Note that there are FreeSurfer derivatives inside the fMRIPrep derivatives, under sourcedata/freesurfer, e.g.: https://github.com/OpenNeuroDerivatives/ds000117-fmriprep/tree/main/sourcedata/freesurfer

Yup, I think I have an issue somewhere to ease fetching of those nested FreeSurfer data, based on the assumption that if the fMRIPrep data are on OpenNeuro, I should be able to fetch the FreeSurfer data too.

@pmin
Currently fetching NIfTI headers from all BOLD files on OpenNeuro (actually only from the first participant of each dataset, assuming the content is the same for the other participants).
Will keep you posted.

@effigies is it expected that for some datasets the datalad fsspec-head command would give a no valid URL found type of error? (working from my phone atm but can send the exact error later)

Here is the error (with the datalad command that produced it):

 ds004274
  Getting 'scan' duration for sub-1
   sub-1/ses-002/func/sub-1_ses-002_task-PSAP_bold.nii.gz
datalad fsspec-head -d /home/remi/openneuro/ds004274 -c 1024 sub-1/ses-002/func/sub-1_ses-002_task-PSAP_bold.nii.gz | python /home/remi/github/origami/cohort_creator/tools/read_nb_vols
[ERROR] Could not find a usable URL for sub-1/ses-002/func/sub-1_ses-002_task-PSAP_bold.nii.gz within /home/remi/openneuro/ds004274 
Traceback (most recent call last):
  File "/home/remi/github/origami/cohort_creator/tools/read_nb_vols", line 11, in <module>
    header = nb.Nifti1Image.from_stream(img).header
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/remi/miniconda3/lib/python3.11/site-packages/nibabel/filebasedimages.py", line 554, in from_stream
    return klass.from_file_map(klass._filemap_from_iobase(io_obj))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/remi/miniconda3/lib/python3.11/site-packages/nibabel/analyze.py", line 960, in from_file_map
    header = klass.header_class.from_fileobj(hdrf)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/remi/miniconda3/lib/python3.11/site-packages/nibabel/nifti1.py", line 707, in from_fileobj
    hdr = klass(raw_str, endianness, check)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/remi/miniconda3/lib/python3.11/site-packages/nibabel/nifti1.py", line 694, in __init__
    super().__init__(binaryblock, endianness, check)
  File "/home/remi/miniconda3/lib/python3.11/site-packages/nibabel/analyze.py", line 252, in __init__
    super().__init__(binaryblock, endianness, check)
  File "/home/remi/miniconda3/lib/python3.11/site-packages/nibabel/wrapstruct.py", line 160, in __init__
    raise WrapStructError('Binary block is wrong size')
nibabel.wrapstruct.WrapStructError: Binary block is wrong size

Note that the error is not systematic (I have not investigated further to see if there is any pattern as to when it happens).

First, let me say, what a cool idea!

TL;DR: the “no valid URL found” error can be removed by running git annex enableremote s3-PUBLIC public=yes in a given dataset, although I don’t fully understand how and why that matters.

Here are some observations I made:

@Remi-Gau I can replicate the error for the dataset & file you mention.

This is git annex whereis for the file in question:

❱ git annex whereis sub-1/ses-002/func/sub-1_ses-002_task-PSAP_bold.nii.gz
whereis sub-1/ses-002/func/sub-1_ses-002_task-PSAP_bold.nii.gz (2 copies)
  	3f850df4-16ea-4c85-8c80-5cbd206f6980 -- OpenNeuro
   	e5fdc610-8352-4db3-b405-9912525a22df -- [s3-PUBLIC]
ok

Looking at ds004274 and comparing various things to another randomly selected dataset (ds004271), the main difference I see is that the s3-PUBLIC special remote has the parameter public set to no:

❱ git annex info s3-PUBLIC
...
type: S3
creds: not available
bucket: openneuro.org
...
public: no
...

After changing the special remote configuration:

❱ git annex enableremote s3-PUBLIC public=yes
enableremote s3-PUBLIC ok

git annex whereis also shows the S3 HTTPS URL:

❱ git annex whereis sub-1/ses-002/func/sub-1_ses-002_task-PSAP_bold.nii.gz
whereis sub-1/ses-002/func/sub-1_ses-002_task-PSAP_bold.nii.gz (2 copies)
  	3f850df4-16ea-4c85-8c80-5cbd206f6980 -- OpenNeuro
   	e5fdc610-8352-4db3-b405-9912525a22df -- [s3-PUBLIC]

  s3-PUBLIC: https://s3.amazonaws.com/openneuro.org/ds004274/sub-1/ses-002/func/sub-1_ses-002_task-PSAP_bold.nii.gz?versionId=K5i2HiaxVC03VwFvxqyEZQTIFEsWIje4
ok

And fsspec-head works without issues:

❱ datalad fsspec-head -c 1024 sub-1/ses-002/func/sub-1_ses-002_task-PSAP_bold.nii.gz | /tmp/read-nvols
500

According to git-annex S3 special remote docs:

“public - Deprecated. This enables public read access to files sent to the S3 remote using ACLs. Note that Amazon S3 buckets created after April 2023 do not support using ACLs in this way and a Bucket Policy must instead be used. This should only be set for older buckets.”

This public=no seems to throw off some interactions with S3. DataLad / git-annex can get the file, but (as seen when running with datalad --log-level debug or with git annex get directly) some of the internals report “S3 bucket does not allow public access; Set both AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to use S3”. And git annex drop errors out with the same message unless used with --reckless availability.
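
For anyone hitting the same thing across many datasets, here is a rough sketch of applying that workaround to every locally installed dataset (the root path is just an example, and it assumes the special remote is called s3-PUBLIC everywhere):

import subprocess
from pathlib import Path

openneuro_root = Path("/home/remi/openneuro")  # wherever the datasets were installed

for dataset in sorted(openneuro_root.glob("ds*")):
    # Re-enable the S3 special remote with public access so that an
    # HTTPS URL can be resolved for each annexed key.
    subprocess.run(
        ["git", "annex", "enableremote", "s3-PUBLIC", "public=yes"],
        cwd=dataset,
        check=False,  # some datasets may not have this remote; just move on
    )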

Cool, let me try that to see how much more info I can collect.

So far this is working MUCH better!!

Is there any news about this? :blush:

I have made some progress, but not as much as I wished because Nilearn is keeping me busy. :wink:

In brief, install the cohort_creator package:

GitHub - neurodatascience/cohort_creator: Creates a neuroimaging cohort by aggregating data across datasets.

I need to cut a new release with the new features, so for now either install directly from main or clone and install locally.

pip install git+https://github.com/neurodatascience/cohort_creator.git

The listings of datasets known to the package (and a whole bunch of information associated with them) are stored in TSVs in this folder.
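
Once duration information lands in those listings, totalling things up could look something like this (a sketch only; the file name and column name below are hypothetical, so check the actual TSVs in the repo):

import pandas as pd

# Hypothetical file and column names; inspect the repo's TSVs for the real ones.
datasets = pd.read_csv("openneuro.tsv", sep="\t")
total_hours = datasets["total_duration_seconds"].sum() / 3600
print(f"Total fMRI time: {total_hours:.1f} hours")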

You can launch a Dash app from the terminal with

cohort_creator browse

This should display the listing of all datasets that you can filter to update the graphs.
