fMRIPrep stalling during multiband fMRI processing

Hi,

I am running fMRIPrep on an HPC system using Singularity 2.6 and fmriprep-1.1.8. I have tried different combinations of --omp-nthreads, --nthreads, and memory allocations. The workflow proceeds through FreeSurfer successfully and then begins preprocessing the series of multiband images (6 runs of 2x2x2 mm data, where each run contains ~200 TRs). On the latest run I requested 32G of memory, 8 CPUs, and 4 omp-nthreads.

After processing 2 of the 6 multiband runs and before completing the third run, the job stalls. Slurm continues to show the process as active and python processes are still running on the node, but no external programs are running. One of the python processes is marked as “defunct”.

The final lines of slurm.out (before issuing scancel on the job) are:

181102-12:32:01,537 nipype.workflow INFO:
[Node] Setting-up "fmriprep_wf.single_subject_1013_wf.func_preproc_task_recog_run_01_wf.bold_confounds_wf.non_steady_state" in "/scratch/brad/projects/pcmri/1013_work/fmriprep_wf/single_subject_1013_wf/func_preproc_task_recog_run_01_wf/bold_confounds_wf/non_steady_state".
181102-12:32:01,545 nipype.workflow INFO:
[Node] Running "validate" ("fmriprep.interfaces.images.ValidateImage")
181102-12:32:01,571 nipype.workflow INFO:
[Node] Running "non_steady_state" ("nipype.algorithms.confounds.NonSteadyStateDetector")
181102-12:32:01,597 nipype.workflow INFO:
[Node] Finished "fmriprep_wf.single_subject_1013_wf.func_preproc_task_recog_run_01_wf.bold_bold_trans_wf.bold_reference_wf.validate".
181102-12:32:03,533 nipype.workflow INFO:
[Node] Setting-up "fmriprep_wf.single_subject_1013_wf.func_preproc_task_recog_run_01_wf.bold_bold_trans_wf.bold_reference_wf.gen_ref" in "/scratch/brad/projects/pcmri/1013_work/fmriprep_wf/single_subject_1013_wf/func_preproc_task_recog_run_01_wf/bold_bold_trans_wf/bold_reference_wf/gen_ref"
[Node] Running "gen_ref" ("niworkflows.interfaces.registration.EstimateReferenceImage")

There are no log files containing a record of a crash in the output folder.

When I cancel the process, the following is added to the slurm output:

slurmstepd: error: *** JOB 8810392 ON gra337 CANCELLED AT 2018-11-02T09:57:07 ***
slurmstepd: error: Detected 1 oom-kill event(s) in step 8810392.batch cgroup.

Any thoughts on how I can debug this? Do I need more than 32G of memory, or should I use the --low-mem flag?

thanks,

Brad

This looks like a memory issue… Are you running AROMA? Perhaps our memory estimation for the AROMA node is off. You can try artificially lowering the amount of memory fMRIPrep will see by passing --mem_mb 20000. This will reduce the level of parallelization, but might avoid the stalling.
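For example (the bind path, image name, and other options below are placeholders; keep the rest of your command as it is and just add the flag):

singularity run --cleanenv -B /scratch fmriprep-1.1.8.simg \
    /path/to/bids /path/to/output participant \
    --mem_mb 20000 --nthreads 8 --omp-nthreads 4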

I’m not currently using AROMA. I previously had mem_mb set to 30000, so I will reduce it to 15000. I note that the only time fMRIPrep has completed was when I ran it with 1 CPU.

I reran with mem_mb set to 16000, keeping everything else the same (32G of memory, 8 CPUs).

Things proceed fine until processing the third EPI run. The job remains active but there is no longer any output to the slurm log file. When I ssh to the node running the job, I can see using ‘top’ that antsApplyTransforms is still running, using 98% CPU and 0.2% memory.

When I look at the “work” folder and search for the most recent output, I see that in the bold_mni_trans_wf/bold_to_mni_transform/ folder for one of the runs there are 152 files, the last one being vol0152_xform-00152.nii.gz. The time stamp of this file coincides with the last output in the slurm log file.

But there are 279 volumes in the run, so it appears that antsApplyTransforms is tripping up in some way.

I ran fMRIPrep on another, smaller dataset (lower-resolution EPI) and saw the same thing: antsApplyTransforms was still using CPU cycles and had produced partial output for a run before seemingly spinning its wheels for hours.

Any ideas?

thanks.

Brad

I reran with 40G requested for the job, no FreeSurfer, and using only 1 CPU. The dataset has 7 EPI runs, each 96x96x26 with 300 TRs.

I ssh’d to the node and ran top. “susan” is running, but a “python” process (second to last in the listing below) is using 0.031t of resident memory and 32.6G of virtual memory (not sure why top lists the former in terabytes).

This may be why 32G was insufficient. But it is still not clear to me why python is using so much memory. Is it holding the full dataset in memory?

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
15165 brad 20 0 3436476 3.247g 2308 R 98.0 0.6 4:03.40 susan
15340 brad 20 0 137448 2392 1232 R 1.3 0.0 0:02.41 top
32486 brad 20 0 1245908 249268 30944 S 0.3 0.0 4:42.35 fmriprep
15164 brad 20 0 4488 644 556 S 0.0 0.0 0:00.44 sh
15219 brad 20 0 185144 2884 1108 S 0.0 0.0 0:00.05 sshd
15220 brad 20 0 126624 3088 1716 S 0.0 0.0 0:00.05 bash
32484 brad 20 0 113336 1456 1132 S 0.0 0.0 0:00.03 slurm_script
32503 brad 20 0 48740 10600 3108 S 0.0 0.0 0:00.03 python
32504 brad 20 0 81064 12508 4040 S 0.0 0.0 0:00.05 python
32753 brad 20 0 32.692g 0.031t 33176 S 0.0 6.3 44:52.19 python
32762 brad 20 0 48740 8924 3108 S 0.0 0.0 0:00.04 python

This is even more mysterious. SUSAN is only part of the ICA-AROMA workflow, so it should not be running at all. Perhaps the node is shared with another job?

Pinging @oesteban and @effigies - maybe you guys have some ideas what can be happening.

@bbuchsbaum one more thing you should try is passing --cleanenv to singularity (see https://fmriprep.readthedocs.io/en/stable/installation.html#running-a-singularity-image). Perhaps there is some weird interaction between host and container environment.
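For example (the bind path and image name here are placeholders; the flag goes right after singularity run):

singularity run --cleanenv -B /scratch fmriprep-1.1.8.simg \
    /path/to/bids /path/to/output participant ...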

Actually, sorry, I did run that job with AROMA, but I don’t think AROMA is the culprit, since the same thing was happening without it. In the above case, “susan” actually succeeded, but some time later the process stalled on antsApplyTransforms. And when I say “stalled”, I mean that antsApplyTransforms continued to run for hours but did not advance to the next step. The only hint of a memory issue appears when I cancel the job and an OOM message is appended to the slurm log.

Note, I do use the --cleanenv flag already.

One theory I have is that antsApplyTransforms malfunctions on your data for some unknown reason and goes into an infinite loop, producing terminal output which is stored in memory by the python supervisor process (which would explain the growing memory footprint). We would need access to some example data for which this problem occurs to be able to track it down further.

This is clearly a memory issue. The reason it takes that much memory is Python’s poor parallelization model. We are working on nipype to improve memory consumption.

As you correctly identified, the problem is generally an excessive allocation of virtual memory. This happens because of the continuous forking, which duplicates the virtual memory allocation every time a fork happens (and that means thousands of times during an fMRIPrep run). Even though we try in nipype to clean up every so often, memory blows up with large datasets.

When running on a PC or a very permissive HPC system, memory overcommitment is generally allowed and fMRIPrep does not get killed. But in most cases the OOM killer will kick in. If one of the workers is the unfortunate process picked to die, then fMRIPrep is left hanging.

That said, antsApplyTransforms is a good candidate for filling memory up, especially when your BOLD runs are very long, because it parallelizes internally and memory does not get cleaned up between forks.
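To make the forking point concrete, here is a minimal, Linux-only sketch (my own illustration, not nipype or fMRIPrep code) showing that a forked child starts out with essentially the same virtual-memory allocation as its parent:

import multiprocessing as mp
import os

def vm_size_gb():
    # Read this process's virtual memory size (VmSize, reported in kB) from /proc.
    with open("/proc/self/status") as fh:
        for line in fh:
            if line.startswith("VmSize:"):
                return int(line.split()[1]) / (1024 ** 2)

def child():
    print(f"child  pid={os.getpid()}  VmSize ~{vm_size_gb():.2f} GB")

if __name__ == "__main__":
    ballast = bytearray(1024 ** 3)  # parent holds ~1 GB before forking
    print(f"parent pid={os.getpid()}  VmSize ~{vm_size_gb():.2f} GB")
    # With the 'fork' start method every worker inherits the parent's full
    # (virtual) address space, so a fat supervisor process means fat workers.
    p = mp.get_context("fork").Process(target=child)
    p.start()
    p.join()

The pages are copy-on-write, so physical RAM is not duplicated right away, but on systems with stricter overcommit settings the duplicated allocation still counts against you.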

Can you post the command line, and how many subjects are you planning to process?

You can try using the LegacyMultiProc plugin, which is better able to handle accumulating memory within Python processes.

Nipype moved to a new MultiProc backend a few versions back, which doesn’t run into the failure mode where a process killed by the OS simply disappears and Python, receiving no signal, waits forever. However, it is less tweakable, and a lot of the optimizations that went into LegacyMultiProc became impossible to replicate. Unfortunately, it appears that whichever we choose as the default will cause problems for a different set of users.
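If you want to try it, something along these lines should work: write a nipype plugin configuration file (the numbers below are only examples; set n_procs and memory_gb to match your allocation) and point fMRIPrep at it with --use-plugin.

plugin: LegacyMultiProc
plugin_args:
  n_procs: 8
  memory_gb: 16
  maxtasksperchild: 1
  raise_insufficient: false

Then add --use-plugin /path/to/plugin.yml to your existing singularity/fmriprep command. As far as I understand, maxtasksperchild is handed to the underlying multiprocessing pool; setting it to 1 recycles each worker after a single task, which keeps per-worker memory from accumulating.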

I’m curious if this will help. There is something going on with memory management for sure, but maybe this is not the whole story.

My money is still on something weird going on with antsApplyTransforms. From what I understand, it’s always the third run that stalls (for various combinations of memory and thread limits), and it’s antsApplyTransforms that runs indefinitely, not the python supervisor.

antsApplyTransforms seems to be the issue.

I reran with 1 CPU and 1 omp-nthreads, no AROMA.

Everything is fine (top shows python < 1.5GB) until antsApplyTransforms runs. After the first run, it goes from 1.5GB to 5GB. Then some other processes intervene (flirt, antsRegistration, etc.) and the python process stays put at 5GB. But then when antsApplyTransforms is invoked for the second run, the memory starts climbing again at what appears to be a fairly steady rate.

So what happens is that once you get to the third or fourth run, the memory runs out. Some sort of resource is being used (per volume?) in the python process that is not released after each antsApplyTransforms run.

command line below:

singularity run --cleanenv -B /home -B /project -B /scratch -B /localscratch \
    /home/brad/singularity/fmriprep-1.1.8.simg \
    /project/6019291/brad/marie_video prepped participant \
    -w /scratch/brad/projects/marie_video/work_1001 \
    --mem_mb 16000 --output-space template --template-resampling-grid native \
    --nthreads 1 --omp-nthreads 1 --use-syn-sdc \
    --fs-license-file /home/brad/license.txt --fs-no-reconall \
    --resource-monitor --participant-label 1001

Also, one more detail: this is not an especially big dataset. Originally (as per the title) I was processing a larger multiband dataset, but in the examples above I switched to a more modestly sized dataset:

7 EPI runs, 300 images per run, 96x96x26.

FixHeaderApplyTransforms seems like a suspect, at least to someone who has no familiarity with the codebase. The _copyxforms function loads each image into memory to change the header. Perhaps these images are somehow not getting garbage collected.
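For reference, a rough, hypothetical sketch of what a header-fixing step like that might look like with nibabel. This is only my guess at the pattern, not the actual niworkflows/fMRIPrep code; copy_xform_sketch and its arguments are invented for illustration.

import nibabel as nb
import numpy as np

def copy_xform_sketch(out_file, ref_file):
    """Copy the qform/sform from a reference image onto out_file's header."""
    ref = nb.load(ref_file)
    img = nb.load(out_file)
    # The voxel data is pulled fully into memory here; if the resulting
    # arrays/images linger in the parent python process, memory could grow
    # with every volume processed.
    data = np.asarray(img.dataobj)
    header = img.header.copy()
    header.set_sform(ref.affine, int(ref.header["sform_code"]))
    header.set_qform(ref.affine, int(ref.header["qform_code"]))
    nb.Nifti1Image(data, ref.affine, header).to_filename(out_file)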

That appears to have done the trick. I still see memory climb during antsApplyTransforms, but then it goes back down. The highest I saw the main python process get was 3GB, whereas before it would increase inexorably (even with nthreads=1).

With regard to the parameters in the plugin file, do they override the fMRIPrep command-line parameters? For example, does “n_procs” override the equivalent fMRIPrep option? Same for memory_gb?


No. If you specify the number of processes and amount of memory in both locations, the CLI options take precedence. The idea was that you might use a common plugin file, and occasionally want to tweak the other parameters.

I’m glad it helped. I’ve been reading that the new backend (concurrent.futures) is resulting in high memory consumption in other contexts, as well. It looks like we’ll need to look more deeply into its exact operation… annoying, as it was advertised as a drop-in replacement for the old, broken system.

Maybe we can try to push this forward? https://bugs.python.org/issue22393

I’m afraid that the big change from multiprocessing to concurrent.futures is that the latter does not use the forkserver mode. Therefore all the cleanup and garbage collection we do is effectively bypassed :frowning:

Any thoughts on why the memory problem persisted with nthreads=1?
Also, why does it appear to be linked to antsApplyTransforms?

This fact was also observed by fphsFischmeister here:

“I further noticed from the log, that this seems to occur always then transforming bold images”

see:

I don’t think it’s exactly an antsApplyTransforms problem, but rather a Python problem related to a workaround we made for an antsApplyTransforms limitation.

antsApplyTransforms does not allow us to pass per-volume transforms (e.g., for head motion correction) in a list of concatenated transforms, so we have to break the BOLD series apart, run antsApplyTransforms on each volume independently, and re-concatenate the volumes into a series. For large datasets, treating these as individual nodes in the processing graph resulted in a massive proliferation of files, which caused users to hit cluster quotas on the maximum number of files.

So we wrote an interface that performs the split internally, applies the transforms, and cleans up the excess files. It uses Python’s multiprocessing capabilities, effectively becoming a scheduler within the larger nipype scheduler. As memory builds up in the main process, each new execution starts by forking, i.e., making a full copy of the main process, before executing antsApplyTransforms.
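For anyone curious, here is a rough, self-contained sketch of that split / per-volume transform / merge pattern. It is an illustration only, not the actual fMRIPrep/niworkflows interface; the function names, arguments, and file naming below are made up.

import os
import subprocess
from multiprocessing import Pool

import nibabel as nb

def _apply_one(args):
    # Resample a single 3D volume with its own (e.g. head-motion) transform
    # by shelling out to antsApplyTransforms.
    in_vol, xform, reference, out_vol = args
    subprocess.check_call([
        "antsApplyTransforms",
        "-i", in_vol,
        "-r", reference,
        "-t", xform,
        "-o", out_vol,
    ])
    return out_vol

def split_apply_merge(bold_file, xforms, reference, out_file, n_procs=4):
    # Split the 4D BOLD series into individual 3D volumes on disk.
    img = nb.load(bold_file)
    tasks = []
    for i, vol in enumerate(nb.four_to_three(img)):
        in_vol = f"vol{i:04d}.nii.gz"
        vol.to_filename(in_vol)
        tasks.append((in_vol, xforms[i], reference, f"vol{i:04d}_xform.nii.gz"))

    # Apply the per-volume transforms in parallel. Each worker is created by
    # forking the current process, so a parent with a large memory footprint
    # makes every worker start out equally large (the issue discussed above).
    with Pool(processes=n_procs) as pool:
        resampled = pool.map(_apply_one, tasks)

    # Re-concatenate the transformed volumes into a single 4D series.
    nb.concat_images([nb.load(f) for f in resampled]).to_filename(out_file)

    # Clean up the per-volume intermediates to avoid file-count quotas.
    for in_vol, _, _, out_vol in tasks:
        os.remove(in_vol)
        os.remove(out_vol)
    return out_file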