Differences between FMRIPREP runs on same data. What causes them?

Hi Oscar, hi Pierre,

Thank you for commenting. We did indeed invest a substantial amount of time, if not in understanding, then certainly in using, FMRIPREP instead of other available pipelines. There was a lot of frustration at the beginning, but eventually we managed to get it running, not without help here on the forum. We’ve now been using it for all projects in our group, and you can expect to see it cited in many papers coming from us (by “us” I mean our group at the NIH). For us, the main reason for choosing FMRIPREP over other pipelines is that it already produces a variety of outputs in both surface and volume representations, without us having to write our own scripts. We free up our postdocs’ time by letting the tool do that work for them. Plus, most of the job is done with FSL and FreeSurfer tools, which is also great.

The issue in the question is that the time series correlation between two seeds, which we’d expect to be negligible, turns out to be quite substantial, and once we consider the whole analysis, it does affect the results. We are using FMRIPREP 20.2.1 on CentOS 7. We used to use a Conda environment, but because FMRIPREP and MRIQC have conflicting module dependencies (part of the frustration…), we surrendered to using the containerized versions of both.

In the initial post I ruled out ANTs but, in fact, it’s possible that it is the cause: if registration turns out differently for two different seeds, that will affect the time series correlations, i.e., it’s as if we were correlating a voxel not with its homologue but with some neighbouring region that landed in the same place because of variation in the registration.
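The intuition can be shown with a toy example (all numbers made up, numpy only): in a spatially smooth image, a voxel correlates perfectly with itself but only partially with a voxel a few positions away, so a small registration shift between runs is enough to deflate the correlation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_t, n_vox = 300, 40

# White noise per voxel, then a moving average along the voxel axis,
# so that neighbouring voxels are correlated, as in a smooth image.
noise = rng.standard_normal((n_t, n_vox))
smooth = np.stack(
    [noise[:, max(0, i - 2):i + 3].mean(axis=1) for i in range(n_vox)],
    axis=1,
)

seed = smooth[:, 20]
r_same = np.corrcoef(seed, smooth[:, 20])[0, 1]   # perfect registration
r_shift = np.corrcoef(seed, smooth[:, 23])[0, 1]  # 3-voxel misregistration
print(r_same, r_shift)
```

With a 5-voxel smoothing window, a 3-voxel shift leaves only partial overlap, so `r_shift` lands well below 1 while `r_same` is exactly 1.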

We can’t share the data easily, but I can confirm that this affects all subjects (200+) and all seeds in this project. If you want, I can run it on other datasets we have, with other sequences. Although we do have scripts, this is observed with a vanilla run of FMRIPREP, i.e., called directly from the command line.

The workaround has been to run FMRIPREP 20 times, each time with a different seed, and then merge the relevant outputs that we need later on:

  • Labels and masks are merged by taking the mode across seeds.
  • Other imaging data (in NIFTI or GIFTI) are merged by taking the mean across seeds.
  • AROMA components (from the confounds file) are merged via CCA, producing a kind of “union” of the spaces represented by these vectors.
  • DVARS are merged by averaging.
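For completeness, launching the 20 differently-seeded runs can be scripted with a loop like the one below. This is only a sketch: the paths and participant label are placeholders, and it assumes a recent fMRIPrep that accepts the `--random-seed` option. The `echo` makes it a dry run; remove it to actually launch the jobs.

```shell
for seed in $(seq 1 20); do
  echo fmriprep-docker /data/bids /data/derivatives/seed-${seed} participant \
    --participant-label 001 \
    --random-seed ${seed}
done
```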

Then we move on to the between-subject analyses, with the hope that these 20 runs will be enough to minimize the seed variability.
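A minimal numpy sketch of the merging rules, assuming the 20 per-seed outputs have already been loaded as arrays stacked along a first "seed" axis (in practice one would read and write the images with nibabel; the CCA merge of AROMA components is omitted here):

```python
import numpy as np

def merge_labels(stack):
    """Mode across seeds; stack has shape (n_seeds, ...), integer labels."""
    labels = np.unique(stack)
    # count, per voxel, how many seeds assigned each label
    counts = np.stack([(stack == lab).sum(axis=0) for lab in labels])
    return labels[np.argmax(counts, axis=0)]

def merge_images(stack):
    """Mean across seeds for continuous data (NIFTI/GIFTI data arrays)."""
    return stack.mean(axis=0)

def merge_dvars(stack):
    """Average DVARS traces across seeds; stack shape (n_seeds, n_timepoints)."""
    return stack.mean(axis=0)
```

The mode is computed without scipy by counting label occurrences per voxel; ties resolve to the lowest label, which is a design choice one may want to revisit.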

Not sure what else to say…

Many thanks!

Cheers,

Anderson