Fmriprep errors and error recovery?

toddt · October 17, 2017, 5:46pm

Using version 0.6.6 of fmriprep, I tried to run a batch of ~60 subjects on a cluster.

These ran until recon-all, and then failed when the BIDS/derivatives/freesurfer/fsaverage directory couldn’t be found/created. [1]

I manually created that directory and ran fmriprep again. This time all subjects completed recon-all, but a sampler node then failed during mri_vol2surf because the fsaverage5 directory could not be found. [2]

I manually copied the fsaverage5 directory into the appropriate location and reran fmriprep, but nothing seems to happen: fmriprep launches, warns me about missing fieldmaps, builds a base workflow for each BOLD run, then completes. [3]

So, my footnoted questions:
1,2 – I assume that fmriprep is supposed to create the fsaverage and fsaverage5 directories, and this seems to have worked when I ran a single subject for a different dataset. Any idea why it’d fail for a batch of subjects?

3 – I’d thought that after a nipype crash, rerunning the workflow would pick up where it left off. This seems to not be happening. Any thoughts on how to resolve it? I can delete my working directory, which I think will force it to rerun a bunch of time-consuming steps, but I’d rather not re-run the recon-alls.

Thanks!
Todd

ChrisGorgolewski · October 17, 2017, 5:56pm

Could you provide exact command lines you used and the logs with the failures you mentioned?

effigies · October 17, 2017, 5:57pm

Hi @toddt.

The most likely cause here is a race condition. You’re probably running a bunch of subjects independently in a new directory, so several of them see that the directory doesn’t exist, attempt to create it, and end up interfering with each other. I’ve had this crash some subjects’ runs, so my strategy is to put a “sleep 1m” before all but the first run in the batch file, to give the first run time to do the common setup before the others start. However, that’s not a very good solution, and we really should fix this on our end by doing proper locking. I’ll open an issue over at GitHub about that.
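
For anyone who wants a stopgap in the meantime, here is a minimal sketch of the locking idea. It relies on the fact that mkdir either creates a directory or fails, so only one job can win. The function name setup_once, the ".lock" suffix, and the ".setup_done" marker are all made up for illustration. This is not what fmriprep does internally, and it does not handle a job that dies while holding the lock.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of fmriprep): let exactly one job perform a
# shared setup step while the other jobs wait for it to finish.
setup_once() {
    local target="$1"                     # shared directory all jobs need
    local lock="${target}.lock"           # assumed lock directory name
    local marker="${target}/.setup_done"  # assumed "setup finished" marker

    mkdir -p "$(dirname "$target")"

    # Fast path: an earlier job already finished the setup.
    if [ -e "$marker" ]; then
        return 0
    fi

    if mkdir "$lock" 2>/dev/null; then
        # This job won the race, so it does the (idempotent) setup.
        mkdir -p "$target"
        # ... any one-time copying of shared files would go here ...
        touch "$marker"
        rmdir "$lock"
    else
        # Another job holds the lock; wait until it signals completion.
        until [ -e "$marker" ]; do
            sleep 1
        done
    fi
}
```

Each job script would call something like setup_once "/path/to/shared/dir" before the step that needs that directory. Whether pre-creating a directory this way is enough for a given fmriprep version is an assumption you would need to check.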

That said, I’ve never run across a situation where the fsaverage(5) files failed to be copied when they were needed. Hopefully that gets fixed when we handle locking correctly, but it’s not clear to me what would be causing this issue.

As to your third question, I’ve noticed that, since 0.6.6, most of our log output appears only when there are errors. We did some work to try to make it a bit quieter, as errors were getting lost in the noise of normal functioning, but may have overshot our target. So what you’re describing is consistent with a successful run. Can you check your outputs, to make sure that it didn’t complete successfully?
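
One quick way to check is to look for a per-subject report. Here is a rough sketch, assuming the reports are written as sub-<label>.html in an fmriprep folder inside the output directory (please verify that against your own output tree; the function name is just for illustration):

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the labels of subjects that have no HTML report.
# Assumes reports live at <outdir>/fmriprep/sub-<label>.html; adjust the path
# if your layout differs.
missing_reports() {
    local outdir="$1"
    shift
    local label
    for label in "$@"; do
        # -s is true only if the file exists and is not empty
        if [ ! -s "${outdir}/fmriprep/sub-${label}.html" ]; then
            echo "$label"
        fi
    done
}
```

Calling missing_reports "$OUTDIR" 01 02 03 would list any of those labels without a report. A report being present does not by itself prove every step finished, so it is still worth opening it.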

toddt · October 17, 2017, 6:48pm

Hi, @ChrisGorgolewski and @effigies,

The command was:

singularity exec -B $base:/mnt -B $scratch:/workdir docker://poldracklab/fmriprep \
fmriprep $DATADIR $OUTDIR participant --participant_label $subject --nthreads 8 \
--mem_mb 10000 --ignore slicetiming -w $WORKDIR

And this was run 5 times. I’ve attached each of the outputs in a public dropbox folder, here:

slurm-run1.out – failed to find the BIDS/derivatives/freesurfer directory, so I created it, then…
slurm-run2.out – some of my BOLD run names are not BIDS-compliant, so I fixed them, then…
slurm-run3.out – the working directory knew about the old mis-named files, so I deleted the working dir, then…
slurm-run4.out – Everything runs well until it can’t find fsaverage5, so I copied that directory, then…
slurm-run5.out – No output, but the report looks good, so maybe this is actually ok.

Thanks for looking into this!
Todd