Estimating storage needs for imaging studies — benchmarks & suggestions?

Hi all,

As we’re budgeting for some new imaging studies, we’re looking at our data storage needs. As long as I’ve been working in imaging, estimating those needs has been a challenge. There’s a persistent feeling that “we’re using too much storage,” but as I think about it, I don’t know that anyone has a clear idea of how much storage we should be using.

So! I’m wondering if anyone out there can give an estimate of roughly how much we might expect our raw data to grow as we process and analyze it, and share techniques that have helped to keep data usage at “reasonable” levels.

For what it’s worth, we’re looking to collect fairly standard MRI data: anatomical, task and resting fMRI, and diffusion. FSL is our main analysis package.

Any advice (even anecdotes) y’all have would be more than welcome.

Not sure if this is advice or anecdote, but here’s an example from one subject from OpenFMRI with some preprocessing done.

% du -Lsh /data/bids/ds000114/sub-07                    
260M    /data/bids/ds000114/sub-07
% du -Lsh /data/out/ds000114/derivatives/fmriprep/sub-07  
1.4G    /data/out/ds000114/derivatives/fmriprep/sub-07
% du -Lsh /data/out/ds000114/derivatives/freesurfer/sub-07
276M    /data/out/ds000114/derivatives/freesurfer/sub-07

So just from preprocessing, it’s pretty easy to get a 6-7x expansion on top of the original data. Obviously the details of your FSL pipeline are going to make a difference, as well as how many steps of your analysis need to be kept, but perhaps this is a useful data point.
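
For anyone who wants to redo that arithmetic with their own numbers, here is a rough Python sketch. The `to_mib` helper is a made-up convenience for parsing `du -h` style sizes (it is not part of any package), and the three sizes are the ones from the listing above.

```python
def to_mib(size: str) -> float:
    """Convert a `du -h` style size such as '260M' or '1.4G' to MiB."""
    units = {"K": 1 / 1024, "M": 1.0, "G": 1024.0, "T": 1024.0 ** 2}
    return float(size[:-1]) * units[size[-1]]

# Sizes reported by `du -Lsh` for sub-07.
raw = to_mib("260M")
derivatives = to_mib("1.4G") + to_mib("276M")  # fmriprep + freesurfer

ratio = derivatives / raw
print(f"derivatives are about {ratio:.1f}x the raw data")
```

Since `du -h` rounds its output, the ratio is only good to about one decimal place.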

How many steps need to be kept is probably central to the question of “how to keep data usage at ‘reasonable’ levels”. If I can plug nipype, one nice thing about organizing your workflows as nipype pipelines is that the working directory can generally be deleted once the pipeline is fully run and the desired outputs are saved. (This is how fmriprep, for example, is organized. The intermediate results are substantially larger than those saved as “derivatives”.)
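
The general pattern (scratch directory for intermediates, copy out what you want, delete the rest) can be sketched without nipype at all. What follows is a plain standard-library illustration, not nipype's or fmriprep's API; `run_pipeline` is a hypothetical stand-in for whatever actually does the processing.

```python
import shutil
import tempfile
from pathlib import Path

def run_pipeline(work_dir: Path) -> Path:
    """Hypothetical pipeline: writes bulky intermediates plus one final result."""
    (work_dir / "intermediate_step1.dat").write_bytes(b"\0" * 1_000_000)
    (work_dir / "intermediate_step2.dat").write_bytes(b"\0" * 1_000_000)
    final = work_dir / "result.txt"
    final.write_text("the output you actually want to keep\n")
    return final

def run_and_clean(out_dir: Path):
    """Run in a scratch directory, keep the final output, then drop the scratch."""
    out_dir.mkdir(parents=True, exist_ok=True)
    work_dir = Path(tempfile.mkdtemp(prefix="work_"))
    final = run_pipeline(work_dir)
    kept = out_dir / final.name
    shutil.copy2(final, kept)
    # Only reached if the pipeline and the copy succeeded, so a failed run
    # leaves its working directory behind for debugging.
    shutil.rmtree(work_dir)
    return kept, work_dir

if __name__ == "__main__":
    kept, work_dir = run_and_clean(Path(tempfile.mkdtemp()) / "derivatives")
    print("kept:", kept, "| scratch still exists:", work_dir.exists())
```

Deleting the scratch directory only after a successful run is a deliberate choice here: it matches the "once the pipeline is fully run" caveat and keeps intermediates around when something breaks.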

Checking out the relative sizes of raw and preprocessed HCP data might be another useful data point.

I don’t immediately have a comparison for statistical analyses, but those should be much smaller than resampled BOLD series, which are going to be the bulk of the preprocessing outputs.

Thanks! That’s quite helpful.

The fmriprep growth is kind of scary — we haven’t used it much to date but I think we will be starting pretty soon. I’m guessing you’ve sampled the functional data from 4mm to 2mm or something similar?

The fmriprep growth is kind of scary

Yeah, I made two errors:

  1. I thought I’d only preprocessed one functional run, so I removed the others from the raw data calculation. That should have been 408M, making it a 4-5x inflation.
  2. I resampled the BOLD series to 4 spaces: T1w (aligned to T1w, original voxel size); MNI (aligned to MNI, original voxel size); fsaverage5 (~10k vertices per hemisphere); fsnative (~164k mesh per hemisphere).

This was a stress test; you probably would not actually use all 4 outputs.

Here’s the actual listing of the individual files (for one run):


24M     sub-07/ses-test/func/sub-07_ses-test_task-fingerfootlips_bold.nii.gz


7.9M    sub-07/ses-test/func/sub-07_ses-test_task-fingerfootlips_bold_space-fsaverage5.L.func.gii
7.8M    sub-07/ses-test/func/sub-07_ses-test_task-fingerfootlips_bold_space-fsaverage5.R.func.gii
87M     sub-07/ses-test/func/sub-07_ses-test_task-fingerfootlips_bold_space-fsnative.L.func.gii
87M     sub-07/ses-test/func/sub-07_ses-test_task-fingerfootlips_bold_space-fsnative.R.func.gii
33M     sub-07/ses-test/func/sub-07_ses-test_task-fingerfootlips_bold_space-MNI152NLin2009cAsym_preproc.nii.gz
80M     sub-07/ses-test/func/sub-07_ses-test_task-fingerfootlips_bold_space-T1w_preproc.nii.gz

I know I’m being dense here, but if your MNI152NLin2009cAsym and T1w ones are both in original voxel size, why is T1w so much larger?

Not dense. I should have followed my instinct and shared the nib-ls output, but thought it’d be too cluttering. The difference is that T1w is saved on disk as float32, while MNI is saved as int16 (to be clear, it uses scaling factors, so it will be loaded as a float and have the correct range).
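
To make the int16-plus-scaling idea concrete, here is a minimal standard-library sketch. It mimics the linear scaling that the NIfTI header expresses through `scl_slope` and `scl_inter` (value ≈ stored × slope + inter), but it is not nibabel's actual scaling code, and the sample values are invented for illustration.

```python
from array import array
import math

def quantize_int16(values):
    """Store floats as int16 plus a linear scale: value ~= stored * slope + inter."""
    lo, hi = min(values), max(values)
    slope = (hi - lo) / 65535 or 1.0  # fall back to 1.0 for constant data
    inter = lo + 32768 * slope        # so that lo maps to -32768
    stored = array(
        "h",
        (max(-32768, min(32767, round((v - inter) / slope))) for v in values),
    )
    return stored, slope, inter

# Made-up "BOLD-like" values, purely for illustration.
values = [1200.0 + 1000.0 * math.sin(i / 50.0) for i in range(10_000)]

as_float32 = array("f", values)
as_int16, slope, inter = quantize_int16(values)

# Undo the scaling, as a reader would when loading the file.
restored = [s * slope + inter for s in as_int16]
max_error = max(abs(a - b) for a, b in zip(values, restored))

print("float32 bytes:", len(as_float32) * as_float32.itemsize)
print("int16 bytes:  ", len(as_int16) * as_int16.itemsize)
print("max round-trip error:", max_error, "step size:", slope)
```

The stored array is half the size, and the round-trip error stays within half a quantization step, which is why the loaded data still has the correct range.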

I’ll have to check whether we’ve already fixed T1w to use int16, or if that’s on our to-do list.

Okay then one last question. I’d have imagined the MNI one to be basically the same size as the original — is it larger just because it’s randomly compressed a bit differently?

Also: fmriprep doesn’t mask the bold series, I take it?

I guess that was two questions

The difference might come from a different field of view/grid size. @effigies could you post nib-ls outputs?

Hi all. Sorry, busy weekend.

Original file:

sub-07/ses-test/func/sub-07_ses-test_task-fingerfootlips_bold.nii.gz int16 [ 64,  64,  30, 184] 4.00x4.00x4.00x2.50   sform


sub-07/ses-test/func/sub-07_ses-test_task-fingerfootlips_bold_space-MNI152NLin2009cAsym_preproc.nii.gz    int16  [ 49,  58,  49, 184] 4.00x4.00x4.00x2.50
sub-07/ses-test/func/sub-07_ses-test_task-fingerfootlips_bold_space-T1w_preproc.nii.gz                   float32 [ 67,  67,  68, 184] 4.00x4.00x4.00x2.50

Ok, so the larger field of view and increased numerical precision explain why the T1w outputs are so much larger than the MNI ones. Changing the numerical precision should yield a larger decrease in size than modifying the field of view (zeros get compressed well).
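
The shapes and data types in the nib-ls listing are enough to estimate the uncompressed size of each file. A small sketch of that calculation (it ignores the NIfTI header and gzip compression, so it gives an upper bound rather than the on-disk size):

```python
import math

BYTES_PER_VALUE = {"int16": 2, "float32": 4}

# dtype and shape copied from the nib-ls output above.
images = {
    "original bold": ("int16", (64, 64, 30, 184)),
    "MNI152NLin2009cAsym preproc": ("int16", (49, 58, 49, 184)),
    "T1w preproc": ("float32", (67, 67, 68, 184)),
}

def uncompressed_bytes(dtype, shape):
    """Number of values in the array times the bytes per value."""
    return math.prod(shape) * BYTES_PER_VALUE[dtype]

sizes = {name: uncompressed_bytes(*spec) for name, spec in images.items()}
for name, n_bytes in sizes.items():
    print(f"{name:30s} {n_bytes / 2**20:7.1f} MiB")
```

By this estimate the T1w grid holds roughly 2.2 times as many voxels as the MNI grid, and float32 doubles the bytes per voxel on top of that.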

To avoid derailing this thread further, I’ve opened an issue at fMRIPrep. I hope you can get more advice from groups with different experiences.

Thank you both. One last question (and I know this is getting to be pretty fmriprep-specific) — does fmriprep apply any sort of mask to the functional data?

It provides a mask in each volumetric output space, but does not apply the mask to the outputs. I believe this is largely in the interest of transparency, though it does come at the cost of space.
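
The space cost of leaving data unmasked mostly shows up after compression: voxels set to zero compress to almost nothing, while noisy out-of-brain voxels do not. A toy standard-library sketch of that effect, using `zlib` as a stand-in for the gzip compression in `.nii.gz`; the voxel count, the 40% "brain" fraction, and the intensity range are all invented for illustration.

```python
import random
import zlib
from array import array

random.seed(0)
n_voxels = 20_000

# Hypothetical mask: pretend the first 40% of voxels are inside the brain.
in_brain = [i < n_voxels * 0.4 for i in range(n_voxels)]

# Noisy int16-range values everywhere, inside and outside the "brain".
signal = [random.randint(800, 1200) for _ in range(n_voxels)]

unmasked = array("h", signal)
masked = array("h", (v if keep else 0 for v, keep in zip(signal, in_brain)))

unmasked_gz = len(zlib.compress(unmasked.tobytes()))
masked_gz = len(zlib.compress(masked.tobytes()))

print("raw bytes (both):  ", len(unmasked.tobytes()))
print("compressed, as-is: ", unmasked_gz)
print("compressed, masked:", masked_gz)
```

Both arrays take the same space uncompressed; only the compressed sizes differ. If you apply a mask yourself, keeping the unmasked file (or at least the mask) somewhere preserves the transparency mentioned above.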

The next release will include CIFTI outputs, which should be an attractive disk-space-saving alternative.

How so? I’ve looked at the specs for cifti and nifti-2 and don’t see anything jumping out about space savings…

CIFTI only stores values in the surface (like gifti) and in selected subcortical regions. Data from white matter and outside of the brain are discarded.
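
As a rough sense of scale, one can compare the number of values stored per time point. The figures below are assumptions brought in for illustration and do not come from this thread: a 2 mm MNI152 grid of 91 x 109 x 91 voxels, and the 91,282 "grayordinates" of the HCP 91k convention. The CIFTI files fMRIPrep actually writes may use a different layout.

```python
import math

# Assumed for illustration: full 2 mm MNI152 volume grid.
volume_shape = (91, 109, 91)
volume_values = math.prod(volume_shape)

# Assumed for illustration: HCP-style 91k grayordinates
# (cortical surface vertices plus selected subcortical voxels).
grayordinates = 91_282

savings = volume_values / grayordinates
print(f"volume: {volume_values} values per time point")
print(f"CIFTI:  {grayordinates} values per time point")
print(f"roughly {savings:.1f}x fewer values before compression")
```

This counts values before compression, so the gap on disk against a gzipped volume will be smaller than the raw ratio suggests.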
