I am running fMRIPrep on an HPC system using Singularity 2.6 and fmriprep-1.1.8. I have tried different variations of omp-nthreads, nthreads, and memory allocations. The workflow proceeds through the FreeSurfer stage successfully and then begins preprocessing the series of multiband images (6 runs of 2x2x2 mm data, where each run contains ~200 TRs). On the latest run I requested 32 GB of memory, 8 CPUs, and 4 omp-nthreads.
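For reference, my invocation looks roughly like this (the BIDS and output paths and image name are placeholders, not my exact values; the work directory matches the paths in the log below):

#!/bin/bash
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=24:00:00

# Bind /scratch into the container and run fMRIPrep on one subject
singularity run -B /scratch:/scratch \
  /path/to/fmriprep-1.1.8.simg \
  /scratch/brad/projects/pcmri/bids /scratch/brad/projects/pcmri/derivatives participant \
  --participant-label 1013 \
  --nthreads 8 --omp-nthreads 4 --mem_mb 32000 \
  -w /scratch/brad/projects/pcmri/1013_work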
After processing 2 of the 6 multiband runs and before completing the third run, the job stalls. Slurm continues to show the process as active and python processes are still running on the node, but no external programs are running. One of the python processes is marked as “defunct”.
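I checked the node by attaching a shell to the running allocation and inspecting the process table, roughly like this (srun --jobid works on our scheduler; on some clusters you can instead ssh directly to the node):

# Open an interactive shell on the node where the job is running
srun --jobid=8810392 --pty bash
# List fmriprep/python processes; the dead child shows as <defunct>
ps aux | grep -E 'python|fmriprep'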
The final lines of slurm.out (before issuing scancel on the job) are:
181102-12:32:01,537 nipype.workflow INFO:
[Node] Setting-up "fmriprep_wf.single_subject_1013_wf.func_preproc_task_recog_run_01_wf.bold_confounds_wf.non_steady_state" in "/scratch/brad/projects/pcmri/1013_work/fmriprep_wf/single_subject_1013_wf/func_preproc_task_recog_run_01_wf/bold_confounds_wf/non_steady_state".
181102-12:32:01,545 nipype.workflow INFO:
[Node] Running "validate" ("fmriprep.interfaces.images.ValidateImage")
181102-12:32:01,571 nipype.workflow INFO:
[Node] Running "non_steady_state" ("nipype.algorithms.confounds.NonSteadyStateDetector")
181102-12:32:01,597 nipype.workflow INFO:
[Node] Finished "fmriprep_wf.single_subject_1013_wf.func_preproc_task_recog_run_01_wf.bold_bold_trans_wf.bold_reference_wf.validate".
181102-12:32:03,533 nipype.workflow INFO:
[Node] Setting-up "fmriprep_wf.single_subject_1013_wf.func_preproc_task_recog_run_01_wf.bold_bold_trans_wf.bold_reference_wf.gen_ref" in "/scratch/brad/projects/pcmri/1013_work/fmriprep_wf/single_subject_1013_wf/func_preproc_task_recog_run_01_wf/bold_bold_trans_wf/bold_reference_wf/gen_ref"
[Node] Running "gen_ref" ("niworkflows.interfaces.registration.EstimateReferenceImage")
There are no crash logs in the output folder. When I cancel the job, the following is appended to the slurm output:
slurmstepd: error: *** JOB 8810392 ON gra337 CANCELLED AT 2018-11-02T09:57:07 ***
slurmstepd: error: Detected 1 oom-kill event(s) in step 8810392.batch cgroup.
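If I read this correctly, the cgroup OOM killer terminated a child process (presumably the defunct python process) while the parent kept waiting on it, which would explain the stall with no crash log. Peak memory for the job can be queried after the fact with sacct, e.g.:

# Report requested vs. peak resident memory for each job step
sacct -j 8810392 --format=JobID,State,ReqMem,MaxRSS,Elapsed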
Any thoughts on how I can debug this? Do I need more than 32 GB of memory, or should I try the --low-mem flag?