FMRIPREP on HPC: job exits without warning or error

dseok · April 21, 2020, 5:49pm

I am running fmriprep (fmriprep-20.0.6.simg) on an HPC system (SGE) using Singularity. My jobs exit without warning or error, and the SGE exit status is always zero. Here is my command:

singularity run --cleanenv -B ${data_root}:/data ${img} \
${niidir} ${outdir} participant \
--participant-label ${sub} \
--bids-filter-file bids_filter_file_default.json \
--fs-no-reconall \
-vvv --resource-monitor \
--low-mem \
--fs-license-file ${fs_license}

and here is how I submit my command:

qsub -cwd -l h_vmem=64.0G,h_rt=72:00:00 \
-j y -o fmriprep_${sub}_$JOB_ID.o \
run_fmriprep.sh

Memory usage (as tracked by an in-house memrec utility, the max_vmem field in the SGE report, and the “Memory available” note from nipype) never seems to exceed the allocated 64G. None of the following options change this behavior:

--low-mem flag
--mem-mb option
--use-plugin plugin.yml, with a .yml file that specifies the appropriate number of threads (1) and memory
setting --nthreads 1 and --omp-nthreads 1

In all cases, my job runs for around 10 minutes and then unceremoniously exits, with a zero exit status from the scheduler and no notification in the stdout. Here are the last 10 lines of the stdout of one attempt:

200421-00:42:17,502 nipype.workflow DEBUG:
Tasks currently running: 1. Pending: 1.
200421-00:42:17,506 nipype.workflow INFO:
[MultiProc] Running 1 tasks, and 10 jobs ready. Free memory (GB): 60.50/60.55, Free processors: 0/1.
Currently running:
* fmriprep_wf.single_subject_300700_wf.func_preproc_ses_01_task_rest_run_001_wf.bold_hmc_wf.fsl2itk
200421-00:42:17,506 nipype.workflow DEBUG:
No resources available
200421-00:42:19,506 nipype.workflow DEBUG:
No resources available
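
One way to check whether the container itself failed, as opposed to the scheduler reporting zero for the wrapper script, is to log the exit status inside run_fmriprep.sh. The following is a minimal sketch, not the script from this post: run_and_report is a hypothetical helper name, and it assumes the wrapper is a POSIX-style shell script.

```shell
#!/bin/sh
# Hypothetical helper (not part of fmriprep or SGE): run a command,
# log its exit status to stderr, and return that status unchanged.
run_and_report() {
    "$@"
    status=$?
    echo "command '$1' finished with exit status ${status}" >&2
    return "${status}"
}

# Sketch of use inside run_fmriprep.sh, with the variables from the post:
# run_and_report singularity run --cleanenv -B "${data_root}:/data" "${img}" \
#     "${niidir}" "${outdir}" participant --participant-label "${sub}"
# exit $?
```

By shell convention, a status above 128 means the process was terminated by a signal (128 plus the signal number), so 137 would point to SIGKILL rather than a normal exit.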

mgxd · April 21, 2020, 7:06pm

@dseok it looks to me like your job may be timing out - try setting the soft run time limit (s_rt) in addition to h_rt.
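
As a sketch of that suggestion: SGE sends the job SIGUSR1 when s_rt is exceeded, before the hard kill at h_rt, so the soft limit is usually set a little below the hard one. The helper name soft_limit and the 600-second default margin below are assumptions for illustration, not SGE defaults.

```shell
#!/bin/sh
# Hypothetical helper: given a hard limit as H:MM:SS and an optional
# margin in seconds (assumed default: 600), print a soft limit that
# is that many seconds shorter, in the same H:MM:SS form.
soft_limit() (
    hard=$1
    margin=${2:-600}
    h=${hard%%:*}; rest=${hard#*:}
    m=${rest%%:*}; s=${rest#*:}
    # Drop one leading zero so values like "08" are not read as octal.
    h=${h#0}; m=${m#0}; s=${s#0}
    total=$(( ${h:-0} * 3600 + ${m:-0} * 60 + ${s:-0} - margin ))
    printf '%d:%02d:%02d\n' \
        $(( total / 3600 )) $(( total % 3600 / 60 )) $(( total % 60 ))
)

# Sketch of use with the submission command from the first post:
# qsub -cwd -l h_vmem=64.0G,h_rt=72:00:00,s_rt=$(soft_limit 72:00:00) \
#     -j y -o fmriprep_${sub}_$JOB_ID.o run_fmriprep.sh
```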

dseok · April 22, 2020, 6:04pm

@mgxd It turns out I just needed to allocate significantly more memory to the job (~20GB) than the amount I specify in the fmriprep call (5GB). Not sure if anyone has ideas about why such a big discrepancy is needed.
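
A minimal sketch of the workaround, for anyone who lands here: derive the scheduler request from the fmriprep memory setting with a headroom factor. The helper name h_vmem_for is hypothetical, and the default factor of 4 is only the ratio that worked in this case (5GB to ~20GB), not a documented recommendation. One plausible reason for the gap, not confirmed in this thread, is that h_vmem limits virtual memory, which can be much larger than the resident memory that --mem-mb is meant to budget for.

```shell
#!/bin/sh
# Hypothetical helper: print an SGE h_vmem value for a given fmriprep
# --mem-mb setting, scaled by a headroom factor.
# The default factor of 4 is an assumption taken from this thread.
h_vmem_for() (
    mem_mb=$1
    factor=${2:-4}
    echo "$(( mem_mb * factor ))M"
)

# Sketch of use (mem_mb is a hypothetical variable that the wrapper
# script would also pass to fmriprep as --mem-mb):
# mem_mb=5000
# qsub -cwd -l h_vmem=$(h_vmem_for "${mem_mb}"),h_rt=72:00:00 \
#     -j y -o fmriprep_${sub}_$JOB_ID.o run_fmriprep.sh
```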