Memory allocation issues with fMRIPrep, Singularity and HPC

fmriprep
singularity

#1

Hello,

I’ve been having issues running fmriprep through singularity on our HPC.

#!/bin/bash
#PBS -l nodes=1:ppn=16,walltime=48:00:00,mem=128gb,vmem=356gb
#PBS -N fMRI-Prep-178950
#PBS -j oe
#PBS -o fMRI-Prep-178950.log

module load singularity-2.5.2

singularity run --cleanenv -B /scratch/kqc/fMRIPrep/sub-178950:/tmp /home/ccp_hcp/fmriprep/SingularityImages/fmriprep-1.1.7.simg --fs-license-file /tmp/.license/freesurfer/license.txt -w /tmp /tmp/fMRIPrep_input /tmp/fMRIPrep_output participant --participant_label 178950 --n-cpus 12 --omp-nthreads 8 --mem-mb 120000 --low-mem -vvvv

It seems like fMRIPrep will run for a while, then it gets stuck:

181028-15:46:16,670 nipype.workflow DEBUG:
	 Cannot allocate job 912 (40.65GB, 8 threads).
	 Needed files: /tmp/fMRIPrep_output/fmriprep/sub-178950/ses-01/func/sub-178950_ses-01_task-Cuedts_acq-MB42p4PA_run-02_bold_space-MNI152NLin2009cAsym_preproc.nii;/tmp/fMRIPrep_output/fmriprep/sub-178950/ses-01/func/sub-178950_ses-01_task-Cuedts_acq-MB42p4PA_run-02_bold_space-MNI152NLin2009cAsym_preproc.mat;/tmp/fmriprep_wf/single_subject_178950_wf/func_preproc_ses_01_task_Cuedts_acq_MB42p4PA_run_02_wf/func_derivatives_wf/ds_bold_mni/_0x13e3b85e385d1ce67651aafcea4e3c80_unfinished.json;/tmp/fmriprep_wf/single_subject_178950_wf/func_preproc_ses_01_task_Cuedts_acq_MB42p4PA_run_02_wf/func_derivatives_wf/ds_bold_mni/_inputs.pklz;/tmp/fmriprep_wf/single_subject_178950_wf/func_preproc_ses_01_task_Cuedts_acq_MB42p4PA_run_02_wf/func_derivatives_wf/ds_bold_mni/_node.pklz
181027-07:33:37,330 nipype.workflow DEBUG:
	 Needed dirs: /tmp/fmriprep_wf/single_subject_178950_wf/func_preproc_ses_01_task_Cuedts_acq_MB42p4PA_run_02_wf/func_derivatives_wf/ds_bold_mni/_report
181027-07:33:37,332 nipype.workflow DEBUG:
	 Removing files: 
181027-07:33:37,335 nipype.workflow DEBUG:
	 saved results in /tmp/fmriprep_wf/single_subject_178950_wf/func_preproc_ses_01_task_Cuedts_acq_MB42p4PA_run_02_wf/func_derivatives_wf/ds_bold_mni/result_ds_bold_mni.pklz
181027-07:33:37,336 nipype.workflow DEBUG:
	 [Node] Writing post-exec report to "/tmp/fmriprep_wf/single_subject_178950_wf/func_preproc_ses_01_task_Cuedts_acq_MB42p4PA_run_02_wf/func_derivatives_wf/ds_bold_mni/_report/report.rst"
181027-07:33:37,339 nipype.workflow DEBUG:
	 Aggregate: False
181027-07:33:37,340 nipype.workflow INFO:
	 [Node] Finished "fmriprep_wf.single_subject_178950_wf.func_preproc_ses_01_task_Cuedts_acq_MB42p4PA_run_02_wf.func_derivatives_wf.ds_bold_mni".
181027-07:33:37,340 nipype.workflow INFO:
	 [Job 852] Completed (fmriprep_wf.single_subject_178950_wf.func_preproc_ses_01_task_Cuedts_acq_MB42p4PA_run_02_wf.func_derivatives_wf.ds_bold_mni).
181027-07:33:37,673 nipype.workflow DEBUG:
	 Cannot allocate job 879 (43.58GB, 8 threads).
181027-07:33:37,673 nipype.workflow DEBUG:
	 Cannot allocate job 883 (43.58GB, 8 threads).
181027-07:33:37,673 nipype.workflow DEBUG:
	 Cannot allocate job 908 (40.65GB, 8 threads).
181027-07:33:37,673 nipype.workflow DEBUG:
	 Cannot allocate job 912 (40.65GB, 8 threads).
181027-07:33:37,673 nipype.workflow DEBUG:
	 Cannot allocate job 937 (40.77GB, 8 threads).
181027-07:33:37,673 nipype.workflow DEBUG:
	 Cannot allocate job 941 (40.77GB, 8 threads).
181027-07:33:38,986 nipype.workflow DEBUG:
	 Progress: 1287 jobs, 1234/1/6 (done/running/ready), 1/52 (pending_tasks/waiting).
181027-07:33:38,987 nipype.workflow DEBUG:
	 Tasks currently running: 1. Pending: 1.
181027-07:33:38,995 nipype.workflow INFO:
	 [MultiProc] Running 1 tasks, and 6 jobs ready. Free memory (GB): 73.70/117.19, Free processors: 4/12.
                     Currently running:
                       * fmriprep_wf.single_subject_178950_wf.func_preproc_ses_01_task_Cuedts_acq_MB42p4PA_run_02_wf.bold_t1_trans_wf.bold_to_t1w_transform
181027-07:33:39,307 nipype.workflow DEBUG:
	 Cannot allocate job 879 (43.58GB, 8 threads).
181027-07:33:39,307 nipype.workflow DEBUG:
	 Cannot allocate job 883 (43.58GB, 8 threads).
181027-07:33:39,307 nipype.workflow DEBUG:
	 Cannot allocate job 908 (40.65GB, 8 threads).
181027-07:33:39,307 nipype.workflow DEBUG:
	 Cannot allocate job 912 (40.65GB, 8 threads).

Then it continues to repeat:

181027-07:33:39,307 nipype.workflow DEBUG:
         Cannot allocate job 912 (40.65GB, 8 threads).

It cycles through the same set of job numbers until it hits the walltime and the job is killed.

Looking at the log files, I’m thinking this could be because I’m setting --omp-nthreads 8 while only allotting 12 CPUs, meaning that fMRIPrep could only have one 8-thread process running at a time. Is this the case?


#2

Yes, fMRIPrep will not attempt to run two 8-thread processes at the same time when a maximum of 12 cores may be allocated. (It’s not clear to me why you’re limiting --n-cpus to 12 when you specify ppn=16 in your submission parameters, but there are good reasons to, if the 8-core jobs end up taking more memory than we expect.)
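The allocation check behind those “Cannot allocate job” messages can be sketched roughly like this (an illustrative simplification, not nipype’s actual code; the resource numbers are taken from the log above):

```python
# Illustrative sketch of a MultiProc-style scheduling check: a job is
# dispatched only if BOTH its memory estimate and its thread count fit
# in the currently free resources. This is NOT nipype's actual code.

def can_allocate(job_mem_gb, job_threads, free_mem_gb, free_procs):
    """Return True if the job fits in the free memory and processors."""
    return job_mem_gb <= free_mem_gb and job_threads <= free_procs

# With --n-cpus 12 and one 8-thread task already running, only 4
# processors are free, so an 8-thread job waits even though its
# 40.65 GB fits comfortably in the 73.70 GB of free memory:
print(can_allocate(40.65, 8, free_mem_gb=73.70, free_procs=4))  # False

# The same job with a smaller --omp-nthreads would be schedulable:
print(can_allocate(40.65, 2, free_mem_gb=73.70, free_procs=4))  # True
```

This is why an 8-thread ceiling with only 12 CPUs serializes the multithreaded steps: once one 8-thread task is running, every other 8-thread candidate fails the processor check.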

That said, the job that it indicates is running shouldn’t take ~48 hours. How many volumes do you have in your BOLD series? And are you sure that it doesn’t move on to other runs?


#3

Okay, I’ve resubmitted using all 16 cores with --omp-nthreads 2. I’ve been playing with the parameters for a while trying to get it to run; I didn’t realize that each process would utilize the maximum number of threads given by --omp-nthreads. That’s why some of the parameters were wonky.

We have 8 BOLD runs with an average of 600 volumes each.

Looking through the log, this is the final update before it begins to repeat the “Cannot allocate job” message until the walltime:

181027-07:37:05,491 nipype.workflow DEBUG:
	 Cannot allocate job 883 (43.58GB, 8 threads).
181027-07:37:05,493 nipype.workflow DEBUG:
	 Cannot allocate job 908 (40.65GB, 8 threads).
181027-07:37:05,493 nipype.workflow DEBUG:
	 Cannot allocate job 912 (40.65GB, 8 threads).
181027-07:37:05,493 nipype.workflow DEBUG:
	 Cannot allocate job 937 (40.77GB, 8 threads).
181027-07:37:05,493 nipype.workflow DEBUG:
	 Cannot allocate job 941 (40.77GB, 8 threads).
181027-07:37:31,433 nipype.workflow DEBUG:
	 Cannot allocate job 883 (43.58GB, 8 threads).
181027-07:37:31,444 nipype.workflow DEBUG:
	 Cannot allocate job 908 (40.65GB, 8 threads).
181027-07:37:31,445 nipype.workflow DEBUG:
	 Cannot allocate job 912 (40.65GB, 8 threads).
181027-07:37:31,445 nipype.workflow DEBUG:
	 Cannot allocate job 937 (40.77GB, 8 threads).
181027-07:37:31,445 nipype.workflow DEBUG:
	 Cannot allocate job 941 (40.77GB, 8 threads).
exception calling callback for <Future at 0x7fb9f25f2da0 state=finished raised BrokenProcessPool>
Traceback (most recent call last):
  File "/usr/local/miniconda/lib/python3.6/concurrent/futures/_base.py", line 324, in _invoke_callbacks
    callback(self)
  File "/usr/local/miniconda/lib/python3.6/site-packages/nipype/pipeline/plugins/multiproc.py", line 143, in _async_callback
    result = args.result()
  File "/usr/local/miniconda/lib/python3.6/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/usr/local/miniconda/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
181027-07:37:31,793 nipype.workflow DEBUG:
	 Cannot allocate job 883 (43.58GB, 8 threads).
181027-07:37:31,794 nipype.workflow DEBUG:
	 Cannot allocate job 908 (40.65GB, 8 threads).
181027-07:37:31,794 nipype.workflow DEBUG:
	 Cannot allocate job 912 (40.65GB, 8 threads).
181027-07:37:31,794 nipype.workflow DEBUG:
	 Cannot allocate job 937 (40.77GB, 8 threads).
181027-07:37:31,794 nipype.workflow DEBUG:
	 Cannot allocate job 941 (40.77GB, 8 threads).
exception calling callback for <Future at 0x7fb9f25f20f0 state=finished raised BrokenProcessPool>
Traceback (most recent call last):
  File "/usr/local/miniconda/lib/python3.6/concurrent/futures/_base.py", line 324, in _invoke_callbacks
    callback(self)
  File "/usr/local/miniconda/lib/python3.6/site-packages/nipype/pipeline/plugins/multiproc.py", line 143, in _async_callback
    result = args.result()
  File "/usr/local/miniconda/lib/python3.6/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/usr/local/miniconda/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
181027-07:37:33,788 nipype.workflow DEBUG:
	 Cannot allocate job 883 (43.58GB, 8 threads).
181027-07:37:33,789 nipype.workflow DEBUG:
	 Cannot allocate job 908 (40.65GB, 8 threads).
181027-07:37:33,789 nipype.workflow DEBUG:
	 Cannot allocate job 912 (40.65GB, 8 threads).
181027-07:37:33,789 nipype.workflow DEBUG:
	 Cannot allocate job 937 (40.77GB, 8 threads).
181027-07:37:33,789 nipype.workflow DEBUG:
	 Cannot allocate job 941 (40.77GB, 8 threads).
181027-07:37:35,670 nipype.workflow DEBUG:
	 Cannot allocate job 883 (43.58GB, 8 threads).
181027-07:37:35,670 nipype.workflow DEBUG:
	 Cannot allocate job 908 (40.65GB, 8 threads).
181027-07:37:35,670 nipype.workflow DEBUG:
	 Cannot allocate job 912 (40.65GB, 8 threads).
181027-07:37:35,670 nipype.workflow DEBUG:
	 Cannot allocate job 937 (40.77GB, 8 threads).

Thanks


#4

Most processes take up 1 core, but a number of them are able to take advantage of thread-level parallelism. --omp-nthreads controls the maximum for those processes, but no process will claim 8 threads if it only needs 1. However, if a process is able to use 8, there’s no way to dynamically say “there are only 4 available, so tell this specific process to use 4”. (Though that would be nice…)

The BrokenProcessPool issue is one that we’ve been struggling with. It usually means excessive memory usage led to a process being killed by the operating system, and it’s related to the execution plugin. You can try using the LegacyMultiProc plugin, as described here:
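For reference, a plugin configuration can be passed to fMRIPrep via --use-plugin. A minimal sketch, assuming the standard nipype plugin-file format; the filename and the resource values (n_procs, memory_gb) are placeholders to adapt to your own job request:

```python
# Write a nipype plugin configuration file selecting LegacyMultiProc.
# The plugin/plugin_args layout follows nipype's plugin-file convention;
# n_procs and memory_gb are placeholder values -- match them to your
# PBS request (here, the 16 cores / ~120 GB from the submission script).
plugin_cfg = """\
plugin: LegacyMultiProc
plugin_args:
  n_procs: 16
  memory_gb: 120
  maxtasksperchild: 1
"""

with open("legacy_multiproc.yml", "w") as f:
    f.write(plugin_cfg)

# Then add to the fMRIPrep command line, e.g.:
#   --use-plugin /tmp/legacy_multiproc.yml
```

Note that when --use-plugin is given, the plugin file's resource settings take the place of --n-cpus/--mem-mb for the workflow scheduler, so keep them consistent with your scheduler request.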


#5

Great! This fix seems to have worked for me.

Thanks for the help!