fMRIPrep (v1.5.4) runtime paradoxes and omp-nthreads

I would like to report some strange findings on memory consumption and runtime for fMRIPrep.

Background:

I am tuning the runtime parameters for our server. We use SGE job scheduling, with qsub commands to book CPU and memory. fMRIPrep runs as a Singularity container (converted from the Docker container, tag 1.5.4). Our subjects all have longitudinal data: about 3 annual sessions with functional data, but a total of 7 sessions overall, since the first 3-4 years have no functional data (so I guess the anatomical pipeline is constructing a template from 7+ T1w files from all sessions).
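
For context, the container build and the job submission look roughly like the sketch below; the image name, script name, parallel environment, and memory resource are placeholders and site-specific guesses rather than our exact setup:

# one-time conversion of the Docker release to a Singularity image (names are placeholders)
singularity build fmriprep-1.5.4.simg docker://poldracklab/fmriprep:1.5.4

# book 2 slots plus a memory limit; the PE name and h_vmem resource vary by site configuration
qsub -pe smp 2 -l h_vmem=7G run_fmriprep_subject.sh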

Findings:

Running a subject with these characteristics requires more than the 8 GB suggested in the documentation: RAM consumption on some test subjects has been 8.5-13.9 GB.
We allot 2 slots (i.e., 2 CPU cores), but since the server has multithreading enabled I pass --nthreads 4 to fMRIPrep (we know this speeds up ANTs registration). As for --omp-nthreads, I remember the advice was to set it to NTHREADS - 1, but I once tried OMPTHREADS=NTHREADS and, for some strange reason, the processing time was higher. So I ran a test on the same subject, at the same time, with all parameters identical except --omp-nthreads: one run with omp=3, the other with omp=4. I expected omp=4 to be faster, or at least the same. On the contrary, omp=4 was about 50 minutes slower (out of ~17 hours of processing), while its memory consumption was lower. At this point I am a bit confused by these discrepancies in runtime and memory consumption, but thought I would report them here. It looks like I need to use omp=3 after all. Here are the details of the test.
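
For reference, the variables in the call below were set roughly like this; all paths and the subject label are placeholders, and only OMPTHREADS differed between the two runs:

# placeholder values assumed for this test
SINGULARITY_IMAGE=/path/to/fmriprep-1.5.4.simg
BIDS=/path/to/bids_dataset
OUTFOLDER=/path/to/derivatives
WORKDIR=/path/to/work
FS_LICENSE=/path/to/freesurfer/license.txt
SUBJECTID=0001        # hypothetical participant label
NTHREADS=4            # 2 slots x 2 hardware threads
OMPTHREADS=3          # 4 in the second run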

singularity run --cleanenv $SINGULARITY_IMAGE \
    $BIDS $OUTFOLDER \
    participant \
    --participant-label $SUBJECTID \
    --longitudinal \
    --nthreads $NTHREADS \
    --omp-nthreads $OMPTHREADS \
    --write-graph \
    --force-syn \
    --fs-license-file $FS_LICENSE \
    --work-dir $WORKDIR \
    --cifti-output \
    --resource-monitor \
    --notrack \
    --no-submm-recon \
    --use-aroma \
    --output-spaces \
      MNI152NLin2009cAsym \
      MNI152NLin2009cAsym:res-2 \
      OASIS30ANTs \
      fsaverage \
      fsaverage5 \
      T1w \
      func

Runtime omp=3:

 User             = dorian
 Queue            = all.q@mri
 Host             = mri
 Start Time       = 12/29/2019 14:53:42
 End Time         = 12/30/2019 07:58:16
 User Time        = 02:42:52
 System Time      = 00:00:39
 Wallclock Time   = 17:04:34
 CPU              = 40:03:21
 Max vmem         = 11.574G
 Exit Status      = 0

Runtime omp=4:

 User             = dorian
 Queue            = all.q@mri
 Host             = mri
 Start Time       = 12/29/2019 14:53:57
 End Time         = 12/30/2019 08:52:46
 User Time        = 01:19:25
 System Time      = 00:00:43
 Wallclock Time   = 17:58:49
 CPU              = 42:29:57
 Max vmem         = 10.324G
 Exit Status      = 0
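
(The two summaries above come from the scheduler's job report; on SGE, something like the following should show essentially the same numbers, wallclock, CPU time, maxvmem, and exit status, for a finished job. The job ID is a placeholder.)

qacct -j 1234567    # query SGE accounting records for the completed job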

By the way, in an attempt to keep memory consumption low I tried --low-mem, but it made little difference: a couple of test subjects still needed 10+ GB. The only thing I have not tried is --mem-mb.
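
If I do get around to it, I suppose it would just be added to the same call as above, something like the untested sketch below; the 10000 MB value is only a guess based on the consumption we saw, and as I understand it this bounds fMRIPrep's own resource estimation rather than enforcing a hard limit:

# untested: the original call with an explicit memory ceiling (value is a guess)
singularity run --cleanenv $SINGULARITY_IMAGE \
    $BIDS $OUTFOLDER \
    participant \
    --participant-label $SUBJECTID \
    --nthreads $NTHREADS \
    --omp-nthreads $OMPTHREADS \
    --mem-mb 10000 \
    --fs-license-file $FS_LICENSE \
    --work-dir $WORKDIR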

The speed advantage gained by having omp-nthreads < nthreads is that it leaves at least one core open to churn through single-thread jobs while large jobs are consuming the other cores.

So the speedup with fewer omp threads is normal, then. Is the associated higher memory footprint also expected?