Fmriprep v1.0.12 hanging


#1

Hi experts,
I am attempting to update from v1.0.3 to v1.0.12 of fmriprep and I’m hitting a snag, so I wanted to reach out to see if anyone has encountered this issue before. Due to the constraints of our cluster, I’m working from a manually prepared environment. Things work fine with v1.0.3, but processing appears to stall when I run v1.0.12 (which is set up in a separate environment). I’m trying both versions on the same test data set. I’ve saved the workflow output for the successful v1.0.3 run here – https://www.dropbox.com/s/2dupcfzhhibnf8t/log_test_v1.0.3.pbs.o5964352.odt?dl=0 – and the output from a stalled v1.0.12 run here – https://www.dropbox.com/s/wmh6ea8p1rxaw5q/log_test_v.1.0.12.pbs.o5974287.odt?dl=0. I am having trouble figuring out what the issue is since I can’t find any explicit errors; things just hang during surface reconstruction. Version 1.0.12 seemed to install properly, and all the relevant dependencies work without issue for v1.0.3. Has anyone encountered anything like this before, and/or do you have suggestions for debugging?
Thanks,
Steve


#2

Oh, I forgot to specify - the run with v1.0.3 finishes in a little over 14 hours; the one with v1.0.12 runs for as long as the walltime allows (i.e., more than twice as long) without finishing.
Thanks!


#3

How much memory are you allocating, and are you able to see whether any cores are being used? There’s a known issue in Python where some processes killed by the operating system are not reported back to the parent process. The only short-term fix we have for this is to increase memory availability.


#4

Thanks, Chris. I didn’t specify the memory allocation, so the job would have run with the default limit of 256 GB. I requested 1 node with 1 processor per node (ppn). (Apologies if I’m not answering your question or using the correct terms here - I’m still fairly new to batch processing.) This configuration has consistently worked with v1.0.3 (once I straightened out some issues caused by packages installed locally in my home dir). Is the latest version more resource-demanding than previous versions? Regardless, I will try requesting more memory to see if that works. Thanks for the help!


#5

Well, 256 GB ought to be plenty. It’s usually in the 8-10 GB range where we see problems.


#6

Ah, well thanks for the suggestion anyway - I’ll keep searching!


#7

You can also set --n-cpus to a low number (my impression is that 8 is fairly safe, but check out this beautiful post about the matter - How much RAM/CPUs is reasonable to run pipelines like fmriprep?) and/or limit the number of subjects processed in parallel with --participant-label.
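As a minimal sketch of what that could look like in a job script - the BIDS, output, and work paths and the subject label are placeholders, so adapt them to your setup:

```shell
# Build a resource-capped fmriprep invocation; all paths and the
# participant label here are hypothetical examples.
cmd="fmriprep /path/to/bids /path/to/out participant \
  --participant-label 18 \
  --n-cpus 8 \
  -w /path/to/work"
echo "$cmd"

# Only execute when fmriprep is actually on PATH (e.g., with the
# conda env activated); otherwise just show the command.
if command -v fmriprep >/dev/null 2>&1; then
    eval "$cmd"
fi
```

Capping --n-cpus bounds how many nipype tasks run at once, which is usually the simplest way to keep peak memory in check on a shared node.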


#8

Thanks for sharing, Oscar - that is a great post! Unfortunately, I suspect the issue I’ve hit may not be related to memory limitations, but the info in the post is still very useful.


#9

It would be great if we could rule out the possibility that you are running out of memory. Even if you really can access the 256 GB, if you have, say, 64 cores, then FMRIPREP is probably trying to allocate more than those 256 GB (provided you are running enough subjects to keep 64 tasks in parallel). It is an extreme scenario, yet possible.

On the other hand, if your processes are dying for some other reason, then we have to look into this more closely, because that would be new.


#10

That makes sense - I’ve just started another test run with --n-cpus set to 8, which seemed plenty safe for the test case based on the post you shared. I was also able to submit the job to a high-memory node (I requested 800 GB for the job). I should be able to see whether processing stalls within the first several hours, as it always seems to hang at about the same place.

The vexing thing is that everything works fine with v1.0.3 when I use the same fmriprep command on a copy of the same data (a test set with a T1w, fieldmaps, and a single functional run for one subject). A colleague in my dept was able to get v1.0.12 running on the cluster by setting things up with virtualenv (although, as I understand it, including the --write-graph option caused an issue). The IT support I worked with had me create a conda env instead. I have a ticket in to see if I need to switch to an alternative setup. In the meantime, I thought I might be making an error when creating a new env and installing fmriprep from scratch, so I also tried cloning the working conda env with v1.0.3 and running pip install --upgrade fmriprep, but that was also unsuccessful. Very confusing to me… I very much look forward to being able to use Singularity or Docker.

Thanks very much again for the help with debugging – I’ll post a follow up once the latest test either finishes or stalls.


#11

Hi @Steve_Wilson any problems after limiting the number of cpus?


#12

Thanks a lot for following up @oesteban. I kept hitting the same issue even after limiting the number of cpus and submitting the processing job to a high-memory system. However, I think I may have stumbled on the issue, or at least part of it. I made some adjustments and started a test run this morning, so I should know either way soon. I’ll try to summarize what I found, in case it turns out to be useful for others.
Processing always stalled during the second phase of FreeSurfer processing. I don’t have experience with FreeSurfer outside of fmriprep, so I’m still very much learning how to interpret the output. Long story short, I eventually noticed that a bunch of “XL defect” flags were popping up in the processing log, which I think I’ve narrowed down to an orientation mismatch issue.

For the previous version of fmriprep I was using (v1.0.3), the first FreeSurfer processing step took the reoriented T1w image as input (work/anat_preproc_wf/anat_template_wf/t1_reorient/sub-##_T1w_ras.nii.gz). However, the latest version I’m using (updated to v1.0.13) takes the T1w from the bids directory, prior to any change to the qform/sform. The qform and sform were not the same for my images, so this resulted in a mismatch between the image fed into the initial FreeSurfer step and the mask from the T1w workflow, which I think led to the XL defect issues (and the inability to finish reconstruction). At least, I hope that this was the problem!
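In case it helps anyone spot this kind of mismatch up front, here is a sketch of the check I could have done first (the filename is a placeholder, and it assumes FSL’s fslhd is available on the cluster):

```shell
img="sub-18_T1w.nii.gz"   # hypothetical path to the raw T1w in the BIDS tree

if command -v fslhd >/dev/null 2>&1; then
    # fslhd prints the qform rows as qto_xyz:1..4 and the sform rows as
    # sto_xyz:1..4; if the two blocks of numbers differ, the qform and
    # sform disagree for this image.
    fslhd "$img" | grep -E '^(qto|sto)_xyz'
else
    echo "fslhd not found; source the FSL setup script first"
fi
```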

Another very minor thing I noticed as I was trying to figure this out: the final report mentions that the qform has been copied from the sform for relevant images, but at least in my case, it looks like the sform was actually copied to the qform.

I will be sure to follow up either way once my current test run finishes.
Thanks again!


#13

Thanks a lot, this is incredibly useful feedback! cc/ @effigies @ChrisGorgolewski


#14

A quick update - the test run finished up this morning, and I am happy to say that it ran to completion. So, it looks like making the qform and sform equal for the images (I used fslorient) before starting preprocessing took care of the problem I was having with the FreeSurfer steps. I think all of the relevant output looked fine. I did hit some error messages in the log, but I think they are unrelated to my previous problem. One error was that the program could not find ICA_AROMA.py. I export the path to the directory with the ICA-AROMA script in a file that is sourced at runtime, but maybe I need to add this information to my .bashrc in addition or instead.
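For reference, a sketch of the fslorient fix I mean (the filename is a placeholder, and this shows one copy direction; -copyqform2sform goes the other way if that better matches your data):

```shell
img="sub-18_T1w.nii.gz"   # placeholder; I ran this per image before preprocessing

if command -v fslorient >/dev/null 2>&1; then
    # Overwrite the qform with the sform in place, so the two affines agree
    # before fmriprep/FreeSurfer ever see the file.
    fslorient -copysform2qform "$img"
else
    echo "fslorient not found; source the FSL setup script first"
fi
```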

The other error had to do with the carpet plot outputs, which were not created. I’ve pasted the error output at the bottom of the post. I haven’t had much of an opportunity to figure out what is going wrong, but I’m guessing that the creation of the carpet plots failed because the final confound tsv files were not produced due to the ICA-AROMA issue noted above. There are separate tsv files for the other confounds in the work directory, but the aggregate confound file was not produced. I have checked that the images/files referenced in the command all exist and look okay (.cache/stanford-crn/mni_icbm152_nlin_asym_09c/1mm_parc.nii.gz, bold_bold_trans_wf/bold_reference_wf/enhance_and_skullstrip_bold_wf/combine_masks/ref_image_corrected_brain_mask_maths.nii.gz, bold_reg_wf/bbreg_wf/fsl2itk_fwd/affine.txt, and anat_preproc_wf/t1_2_mni/ants_t1_to_mniComposite.h5).

My next step is to try to fix the ICA-AROMA issue and see if that takes care of the rest.


Traceback (most recent call last):
File "/gpfs/group/sjw42/default/sjw42_collab/sw/miniconda/20180514_v.1.0.13/lib/python3.6/site-packages/niworkflows/nipype/pipeline/plugins/multiproc.py", line 68, in run_node
result['result'] = node.run(updatehash=updatehash)
File "/gpfs/group/sjw42/default/sjw42_collab/sw/miniconda/20180514_v.1.0.13/lib/python3.6/site-packages/niworkflows/nipype/pipeline/engine/nodes.py", line 480, in run
result = self._run_interface(execute=True)
File "/gpfs/group/sjw42/default/sjw42_collab/sw/miniconda/20180514_v.1.0.13/lib/python3.6/site-packages/niworkflows/nipype/pipeline/engine/nodes.py", line 564, in _run_interface
return self._run_command(execute)
File "/gpfs/group/sjw42/default/sjw42_collab/sw/miniconda/20180514_v.1.0.13/lib/python3.6/site-packages/niworkflows/nipype/pipeline/engine/nodes.py", line 644, in _run_command
result = self._interface.run(cwd=outdir)
File "/gpfs/group/sjw42/default/sjw42_collab/sw/miniconda/20180514_v.1.0.13/lib/python3.6/site-packages/niworkflows/nipype/interfaces/base/core.py", line 520, in run
runtime = self._run_interface(runtime)
File "/gpfs/group/sjw42/default/sjw42_collab/sw/miniconda/20180514_v.1.0.13/lib/python3.6/site-packages/niworkflows/interfaces/fixes.py", line 24, in _run_interface
runtime, correct_return_codes)
File "/gpfs/group/sjw42/default/sjw42_collab/sw/miniconda/20180514_v.1.0.13/lib/python3.6/site-packages/niworkflows/nipype/interfaces/base/core.py", line 1020, in _run_interface
self.raise_exception(runtime)
File "/gpfs/group/sjw42/default/sjw42_collab/sw/miniconda/20180514_v.1.0.13/lib/python3.6/site-packages/niworkflows/nipype/interfaces/base/core.py", line 957, in raise_exception
).format(**runtime.dictcopy()))
RuntimeError: Command:
antsApplyTransforms --default-value 0 --dimensionality 3 --float 1 --input /storage/home/sjw42/.cache/stanford-crn/mni_icbm152_nlin_asym_09c/1mm_parc.nii.gz --interpolation MultiLabel --output 1mm_parc_trans.nii.gz --reference-image /gpfs/group/sjw42/default/ASH/test-fmriprep/work/v13b_fs/fmriprep_wf/single_subject_18_wf/func_preproc_task_cardguess_run_02_wf/bold_bold_trans_wf/bold_reference_wf/enhance_and_skullstrip_bold_wf/combine_masks/ref_image_corrected_brain_mask_maths.nii.gz --transform [ /gpfs/group/sjw42/default/ASH/test-fmriprep/work/v13b_fs/fmriprep_wf/single_subject_18_wf/func_preproc_task_cardguess_run_02_wf/bold_reg_wf/bbreg_wf/fsl2itk_fwd/affine.txt, 1 ] --transform [ /gpfs/group/sjw42/default/ASH/test-fmriprep/work/v13b_fs/fmriprep_wf/single_subject_18_wf/anat_preproc_wf/t1_2_mni/ants_t1_to_mniComposite.h5, 1 ]
Standard output:

Standard error:

Return code: 1


#15

Thanks a lot! You are likely hitting https://github.com/poldracklab/fmriprep/issues/1127 now; that problem was fixed with the release of 1.0.14, a quick hotfix we had to put out.


#16

As for the ICA_AROMA.py file, you would need to add its directory to the PATH environment variable.
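A sketch of what that looks like - the install directory here is a stand-in created on the fly purely to demonstrate the lookup; replace it with wherever ICA_AROMA.py actually lives:

```shell
# Stand-in for the real ICA-AROMA install dir (hypothetical); in practice
# you would point AROMA_DIR at the directory containing ICA_AROMA.py.
AROMA_DIR="$(mktemp -d)"
printf '#!/bin/sh\necho ICA-AROMA stub\n' > "$AROMA_DIR/ICA_AROMA.py"
chmod +x "$AROMA_DIR/ICA_AROMA.py"

# Prepend the directory so child processes (fmriprep's nodes) inherit it.
export PATH="$AROMA_DIR:$PATH"
command -v ICA_AROMA.py   # prints the resolved path when the lookup succeeds
```

Note that for batch jobs the export needs to happen in the job script (or a file the job sources), since .bashrc is generally only read by interactive shells.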


#17

I upgraded to 1.0.14 and everything with the carpet plots went fine, and I got everything sorted out with setting the path to ICA_AROMA.py. Thanks @oesteban!