Differences between FMRIPREP runs on the same data. What causes them?

Hi all,

If we run FMRIPREP twice on the same data, we get different results. That’s well known, and such variability can be avoided with the use of options such as --random-seed 0, --skull-strip-fixed-seed, --nthreads 1, and --omp-nthreads 1.

However, there is no obvious way to choose one seed over another in --random-seed, and for the analyses we are running, the choice of seed causes non-negligible differences in the final results.

Before we establish a procedure that runs FMRIPREP many times to cancel out such variability, we’d like to understand what causes it. We thought ANTs could be the culprit, but we also observe temporal variability in the *_desc-preproc_bold.nii.gz, which would rule ANTs out since registration should affect all volumes equally. We also considered ICA-AROMA, but the latter should (as we understand it) not affect the *_desc-preproc_bold.nii.gz files, only produce covariates in the *_desc-confounds_timeseries.tsv file.

The screenshot below shows maps of the correlations between the time series from pairs of FMRIPREP runs. That is, for every voxel in the *_desc-preproc_bold.nii.gz file, we correlate that voxel’s time series with the homologous voxel in the *_desc-preproc_bold.nii.gz from a different FMRIPREP run (which presumably used a different random seed). Note that the variability isn’t random: it has some spatial structure. Note also that it isn’t negligible. While we’ve set the colorbar between 0.95 and 1, we see correlations as low as 0.27 in some places, and have even seen negative ones!
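
For reference, here is a minimal sketch of how such a voxelwise correlation map can be computed with nibabel and NumPy (this is not our actual script, and the file and path names are hypothetical):

import numpy as np
import nibabel as nib

# Preprocessed BOLD from two FMRIPREP runs of the same data (hypothetical paths)
img_a = nib.load("runE/sub-s9900_task-rest_space-MNI152NLin2009cAsym_desc-preproc_bold.nii.gz")
img_b = nib.load("runF/sub-s9900_task-rest_space-MNI152NLin2009cAsym_desc-preproc_bold.nii.gz")

a = img_a.get_fdata().reshape(-1, img_a.shape[-1])  # voxels x time
b = img_b.get_fdata().reshape(-1, img_b.shape[-1])

# Pearson correlation of each voxel's time series with its homologue in the other run
a_c = a - a.mean(axis=1, keepdims=True)
b_c = b - b.mean(axis=1, keepdims=True)
denom = np.sqrt((a_c ** 2).sum(axis=1) * (b_c ** 2).sum(axis=1))
with np.errstate(invalid="ignore", divide="ignore"):
    r = (a_c * b_c).sum(axis=1) / denom  # NaN wherever a voxel has zero variance

nib.Nifti1Image(r.reshape(img_a.shape[:3]), img_a.affine).to_filename("runE_vs_runF_corrmap.nii.gz")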

What could be causing this?

Thanks!

All the best,

Anderson

Just fyi, the calls to FMRIPREP were (the run names are in the headers of the windows in the screenshot):

# run E
fmriprep [REDACTED]/BIDS/AP2 ${TMPDIR}/s9900.out participant --participant_label s9900 -w ${TMPDIR}/s9900.wrk --notrack --output-space MNI152NLin2009cAsym:res-2 fsaverage --nthreads 1 --omp-nthreads 1 --skip_bids_validation --use-aroma

# run F
fmriprep [REDACTED]/BIDS/AP2 ${TMPDIR}/s9900.out participant --participant_label s9900 -w ${TMPDIR}/s9900.wrk --notrack --output-space MNI152NLin2009cAsym:res-2 fsaverage --nthreads 1 --omp-nthreads 1 --skip_bids_validation --use-aroma

# run G
fmriprep [REDACTED]/BIDS/AP2 ${TMPDIR}/s9900.out participant --participant_label s9900 -w ${TMPDIR}/s9900.wrk --notrack --output-space MNI152NLin2009cAsym:res-2 fsaverage --nthreads 1 --omp-nthreads 1 --skip_bids_validation

# run H
fmriprep [REDACTED]/BIDS/AP2 ${TMPDIR}/s9900.out participant --participant_label s9900 -w ${TMPDIR}/s9900.wrk --notrack --output-space MNI152NLin2009cAsym:res-2 fsaverage --nthreads 1 --omp-nthreads 1 --skip_bids_validation   

I’d like to kindly bump this thread up.

Also, to facilitate finding this in the future, I’m writing here “FMRIPREP is not reproducible”, to help search engines.

What version of fMRIPrep is this? What is the platform you are running on?

Perhaps @mgxd can correct me, but I believe you can modify and set arbitrary seeds through the config file. This config file is not meant for the user to tweak, but I think in this case it could be useful before setting up the costly experiments you were mentioning.

The documentation for the config file can be found here: Developers - API — fmriprep version documentation

Unfortunately, I can now see this is not documented, but here is where we added a feature to track seeds (ENH: Keep track and allow reuse of random seeds with the config module by mgxd · Pull Request #2086 · nipreps/fmriprep · GitHub).

As you will find, there’s a way to set the ANTs seed, but also other seeds (e.g., for Python’s random number generator).
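
As an illustration (a rough sketch, not official documentation): if you locate the TOML file in which fMRIPrep saves its config (typically under the working or output log directory; the exact path and key names below are assumptions and may differ between versions), you can inspect the seeds it recorded along these lines:

import tomllib  # Python >= 3.11; otherwise use the third-party "toml" package

# Hypothetical path: look for a file named fmriprep.toml under the working or
# output/log directory of the run you want to inspect.
with open("s9900.wrk/fmriprep.toml", "rb") as f:
    cfg = tomllib.load(f)

# The config module groups the random seeds in a dedicated section
# (e.g., master, ants, numpy); key names may differ across versions.
print(cfg.get("seeds", {}))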

I agree this is puzzling and worrisome.

The most parsimonious explanation would be a bug: we might promise to control the seeds with the command-line option and then fail to honor that promise because we did not catch the error during the quality assessment of the related PRs. The bug could affect the whole system or only subparts (e.g., only ANTs, only Python, etc.).

There’s also the possibility that some pieces of the workflow are not deterministic (although we believed they were).

Would you be able to share scripts or any materials (including data, if you have identified datasets more susceptible to larger undesired variability) so that we can replicate this in our setting? Having more eyes on the problem will definitely speed up the resolution.

@pbellec’s team is starting to look into the run-to-run reproducibility of the results, so I would imagine we will have to bring this up in our upcoming fMRIPrep roundup. @sgiavasis and Michael Milham’s group have poked fMRIPrep in many ways and also found several issues we should address. I believe they mostly found unreliabilities around the CompCor results in the confounds file. In their whole-brain connectivity analysis of repeatability, though, fMRIPrep did not fare badly (actually, Steve can correct me, but I would say it did quite well).

Lastly, I will take some issue with the claim “FMRIPREP is not reproducible” and hope everyone rather understands it as “a version or versions of fMRIPrep have been found not to be run-to-run repeatable to the last bit, for (yet) unknown reasons”. Although I agree the latter is also far from ideal, IMHO the reproducibility of scientific results is a much bigger question than the variability you found (thanks for reporting it, otherwise we would not be focusing our attention on this). This particular problem could range from a sneaky bug to a design flaw. But even in the latter case, the many contributors who have donated their time to the tool, especially to making it thoroughly described through documentation and improvements to the reports, deserve bug/issue reporting that is fair and precise; at the very least, something like “fMRIPrep seemingly disrespects seeding parameters and therefore is not reproducible”. Also, the fact that your team got this far in understanding the tool, the outputs, and the code (honestly, you have come a long way) only speaks to the thoroughness of your research, and also to the transparency with which the fMRIPrep team has worked towards that very much desired reproducibility. I take it the claim was meant as bait to call for our attention and therefore read it without the blanket implications, but, if anything, it just put me off for a while (and I would guess some other potential respondents felt similarly).

That said, I wish we had the financial muscle to support users and the tool immediately as questions arise. But we don’t have it for the time being, and all users of fMRIPrep and MRIQC are experiencing large latencies in receiving responses from any person related to the tools. This is an issue that we probably want to address in our meet-up.

I agree that this particular issue with repeatability is very important, and we will try to set up the necessary resources as soon as possible to investigate it. We want to thank you in advance for taking the time to report it, and we will very much appreciate any help, as requested above, with code or follow-ups about your experience.


A couple of quick comments.

@oesteban As far as I understand the post, the variations arise when the random seeding is not controlled. These variations are, as Anderson said, expected. I am still not sure whether fmriprep is entirely reproducible even with controlled seeds, but I don’t think that’s what this is about.

@oesteban Another thought: at least in some parts of the brain, the variations are fairly large. Ideally this would be reduced or even suppressed, but I am pretty sure this would happen with any fMRI processing pipeline (with random seeding) at the moment; it is not an fmriprep issue per se.

@anderson My guess is that registration and distortion correction can lead to large changes in a voxel’s signal. But that’s just a guess. As Oscar mentioned, @ltetrel, myself, and @glatard’s team have started systematically quantifying the propagation of numerical noise across fmriprep runs. At the moment our objective is not to identify which parts of the pipeline are causing instabilities, but rather to quantify how big one can expect them to be (just like you did). The objective is to offer users a way to test that their particular installation of fmriprep does produce the expected results (with numerical rounding errors creeping into computations across OSes / OS versions, even in a container).

Tracking down and reducing sources of variability is certainly of interest but not immediately on our radar, although it likely will be soon. Happy to discuss the topic on the next fmriprep round-up call, and curious to hear more about the results of your investigation.


Hi Oscar, hi Pierre,

Thank you for commenting. We did indeed invest a substantial amount of time, if not so much in understanding FMRIPREP, definitely in using it instead of other available pipelines. There was a lot of frustration at the beginning, but eventually we managed to get it running, not without help here in the forum. We’ve now been using it for all projects in our group, and you can expect to see it cited in many papers coming from us (by “us” I mean our group at the NIH). For us, the main reason for using FMRIPREP over other pipelines is that, without the need for us to write our own scripts, FMRIPREP already produces a variety of outputs in both surface and volume representations. We free up our postdocs’ time by having the tool do it for them. Plus, most of the job is done with FSL and FreeSurfer tools, which is also great.

The issue in the question is that the variability in the time series between two seeds, which we’d expect to be negligible, turns out to be quite substantial, and when we consider the whole analysis, it does affect the results. We are using FMRIPREP 20.2.1 on CentOS 7. We used to use a Conda environment, but because FMRIPREP and MRIQC have conflicting module dependencies (part of the frustration…), we surrendered to using the containerized versions of both.

In the initial post I ruled out ANTs but, in fact, it’s possible that it is the cause: if registration happens differently for two different seeds, then this will affect the time series correlations, i.e., it’s as if we were correlating one voxel not with its homologue but with some neighboring region that ended up in the same place because of variation in registration.

We can’t share the data easily, but I can confirm that this affects all subjects (200+) and all seeds in this project. If you want, I can run it on other datasets we have, with other sequences. Although we do have scripts, this is observed with a vanilla run of FMRIPREP, i.e., called directly from the command line.

The workaround we are adopting is to run FMRIPREP 20 times, each time with a different seed, then merge the relevant outputs that we need later on (a minimal sketch of the merging appears after the list):

  • Labels and masks are merged by taking the mode across seeds.
  • Other imaging data (in NIFTI or GIFTI) are merged by taking the mean across seeds.
  • AROMA components (from the confounds file) are merged via CCA, producing a kind of “union” of the spaces represented by these vectors.
  • DVARS are merged by averaging.
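
A minimal sketch of the merging for the first two items (this is not our production code; the directory layout and file names are hypothetical, and the CCA and DVARS steps are omitted):

import numpy as np
import nibabel as nib
from scipy import stats  # stats.mode with keepdims requires SciPy >= 1.9

# One FMRIPREP output tree per seed (hypothetical layout)
seed_dirs = [f"seed-{i:02d}" for i in range(20)]

def merge_labels(relpath):
    # Voxelwise mode across seeds, for labels and masks
    imgs = [nib.load(f"{d}/{relpath}") for d in seed_dirs]
    data = np.stack([img.get_fdata() for img in imgs], axis=-1)
    merged = stats.mode(data, axis=-1, keepdims=False).mode
    return nib.Nifti1Image(merged.astype(np.int16), imgs[0].affine)

def merge_images(relpath):
    # Voxelwise mean across seeds, for continuous imaging data
    # (stacking 20 4D runs is memory-hungry; chunking may be needed in practice)
    imgs = [nib.load(f"{d}/{relpath}") for d in seed_dirs]
    data = np.stack([img.get_fdata() for img in imgs], axis=-1)
    return nib.Nifti1Image(data.mean(axis=-1), imgs[0].affine)

# Example usage (hypothetical file names):
# mask = merge_labels("sub-s9900_space-MNI152NLin2009cAsym_desc-brain_mask.nii.gz")
# bold = merge_images("sub-s9900_task-rest_space-MNI152NLin2009cAsym_desc-preproc_bold.nii.gz")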

Then we move on to the between-subject analyses, with the hope that these 20 runs will be enough to minimize the seed variability.

Not sure what else to say…

Many thanks!

Cheers,

Anderson

It occurs to me that we’re running this on a cluster. Every time we run, the task lands on a different node. I don’t think there is any difference between the nodes (and, besides, we are using the container anyway), but if you think it would help diagnose the problem by ruling out compute-node variability, I can run a subject or two on the same machine and then compare. Just let me know and I’ll happily do it.

I realize this thread has gone a bit cold, but let me add a few potentially useful references regarding measuring and addressing numerical instabilities in neuroimaging pipelines.

@gkiar’s thesis from earlier this year:
This is Your Brain on Disk: The Impact of Numerical Instabilities in Neuroscience

And this preprint posted in August:
Accurate simulation of operating system updates in neuroimaging using Monte-Carlo arithmetic
A Salari, Y Chatelain, G Kiar, @glatard


Hi there - I am trying to understand the variability in fmriprep, and I’m wondering what progress you have made with this objective of quantifying the expected differences. Do you have any updates? And since this post, have you taken on the task of trying to identify which parts of the pipeline cause instabilities?

thanks!

Hi @kolydic,

There has been some progress, thanks for the ping! We have created a working group on https://mattermost.brainhack.org; check the channel ~fmriprep_reproducibility. We have met a number of times already. You are welcome to join the channel and the meetings if you want to follow and possibly contribute to the effort.

In brief, @ltetrel has an app which can run fmriprep with or without arithmetic noise added and quantify the variations across runs. We’ve been able to confirm that, when controlling for random seeds and on a single system, fmriprep is 100% reproducible. We should have an easy-to-deploy test this fall. The next step is to quantify numerical variations across a series of systems and then identify the sources of variation, which will be led by Yohan Chatelain and @glatard’s group.

Hi @pbellec - sorry for the slow reply, I set aside this project for a bit. Could you clue me in on the latest updates? Thank you for the invitation! (I checked out the channel, but it is hard to succinctly piece together where things are)

Also, I commented again here that I tried fmriprep with the flags that have been mentioned as necessary to make it deterministic, and yet I still see some (minor) variability in raw timecourses and anatomical reconstruction when I start from scratch using the same DICOMs and a random seed read from a file. Do you have any insight as to why this is still not deterministic? Thanks!!