How much RAM/CPUs is reasonable to run pipelines like fmriprep?

fmriprep

#1

Hi,

I am trying to get an idea of what systems people are using for running things like fmriprep. At the moment I am on a system that has 4 CPUs and 16GB of RAM. fMRIPrep (1.0.0-rc12) is currently taking about 20 hours per subject without ICA-AROMA (MB2 dataset, 6 runs, 216 volumes per run). I need to set the --low-mem flag, otherwise I get memory errors. Before 1.0.0-rc12 I couldn’t run fmriprep at all without it crashing.

Our IT department is in the process of setting up a VM for the lab and I was wondering if anyone could advise on what kinds of systems they are using, or what specs (RAM, CPUs) are reasonable in general, to be able to run fmriprep, mriqc etc without running into memory errors all the time.

Thanks so much,
Kristina


#2

Hi Kristina,

It makes sense that it wouldn’t run at all before 1.0.0-rc12 with 16GB of RAM. We made great progress in that release candidate in understanding how memory was being used. We are still working on the issue, and it is one of the highest-priority items on the roadmap.

With 4 CPUs and 16GB of RAM, I would run fmriprep with the following options: --nthreads 2 --omp-nthreads 4 --mem-mb 16000.
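For reference, a full command line with those options might look like the sketch below. The paths and the participant label are placeholders, and the command is only assembled and echoed here for illustration:

```shell
# Sketch of a fmriprep invocation sized for a 4-CPU / 16GB machine.
# /data/bids, /data/out and the participant label are placeholders.
cmd="fmriprep /data/bids /data/out participant \
  --participant-label 01 \
  --nthreads 2 --omp-nthreads 4 --mem-mb 16000 --low-mem"
echo "$cmd"
```

--low-mem is included since you already need it; it trades extra disk usage for lower memory pressure.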

To talk about performance, we first need to clarify that running FreeSurfer’s recon-all will add 6-12 hours to your processing time; on a machine like the one you describe (4 CPUs), probably more. That would account for most of your 20 hours, unless you were using the --no-freesurfer option.

That said, with FreeSurfer pre-run, I can tell you (based on my experience) that a dataset like the one you describe would take:

  • About 2h 30min on an 8-CPU, 32GB RAM desktop.
  • About 1h to 1h 30min on a 16-CPU, 64GB RAM compute node of our cluster.

We just released FMRIPREP 1.0, and we are making a lot of progress on memory consumption. For example, before buying more memory, make sure that your system can overcommit plenty of memory. This means that your /proc/sys/vm/overcommit_memory is set to 0 and/or that you have a large value in /proc/sys/vm/overcommit_ratio. If your system runs with cgroups, again make sure that the virtual memory limits are large. Right now, the main bottleneck for FMRIPREP is virtual memory. In terms of physical memory, I haven’t seen many datasets go over 12 GB RSS, and those are datasets with many more and much longer runs than yours.
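To inspect those knobs on a Linux box, a quick sketch (the sysctl line is commented out because it needs root and is best discussed with your admin first):

```shell
# Kernel overcommit policy: 0 = heuristic (the recommended setting here),
# 1 = always overcommit, 2 = strict accounting.
cat /proc/sys/vm/overcommit_memory
# Only consulted when the policy above is 2:
cat /proc/sys/vm/overcommit_ratio
# Per-process virtual memory limit (ulimit/cgroups); "unlimited" is ideal:
ulimit -v
# To switch to heuristic overcommit (root required):
# sudo sysctl vm.overcommit_memory=0
```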

We also added a new --resource-monitor flag to FMRIPREP. You could use it to profile memory consumption in your setting. If you wanted to try that, we’d be very happy to assist you and make good use of your feedback.

Thanks very much and happy fmriprep’ing,

Oscar


#3

Hey,

I have the same question, but for running multiple samples in parallel. I want to run a large dataset (N ~ 1000) efficiently on an HPC. If I run each container on a full node, I have 32 cores and 128G of memory. If I ask an fmriprep instance with 64G of memory and varying numbers of CPUs to run either 1 or 4 samples, I get roughly the same processing times as @oesteban.
At 32 cores, 1 sample and 4 samples take the same amount of time to complete, but the 4-sample instance uses more memory (not sure these memory numbers make too much sense, but that’s what the scheduler tells me).
So there seems to be some minimal runtime that I can’t get below by throwing more cores at it - possibly because a portion of the pipeline consists of sequential jobs. So in a 32-core, 64G-memory instance I have some room left to run additional samples in parallel.

Possibly the number of threads each task can use would also have an influence here. But before I run more tests I’d like to ask if you have found some rule of thumb for running efficient instances. Something like: 3 samples at 16 cores and 32G of memory take an hour and have a consistently high CPU and RAM usage. I would like to use that to estimate how many jobs I will request on the HPC and for how long.

Also, I’d be interested in using the --resource-monitor flag but don’t know what to do with it after I set it. Happy to try it out if you have some starting points.

Best,
Seb


#4

Hi Seb,

First of all, wow! Pretty neat job of benchmarking here! I think you’ve taken this farther than I’ve ever tried :D

At 32 cores, 1 sample and 4 samples take the same amount of time to complete, but the 4-sample instance uses more memory (not sure these memory numbers make too much sense, but that’s what the scheduler tells me).
So there seems to be some minimal runtime that I can’t get below by throwing more cores at it - possibly because a portion of the pipeline consists of sequential jobs.

Yes, there is a limit and the reason is what you mentioned: sequential jobs.

So in a 32 cores and 64G memory instance I have some room left to run additional samples in parallel.

This is partially true. FMRIPREP is still a bit inefficient when allocating memory. Even though we did a good job of keeping physical memory (typically referred to as RSS) low (one only needs to look at your graph), FMRIPREP has some trouble with its virtual memory footprint.

If your system is very flexible about overcommitting memory, this will probably not surface as a problem for you. But typically, HPCs set hard limits and the OOM killer will kick in.

So, to really find the hard limit in your setting you will need to add more samples to your tests and look at the scheduler’s output logs. When FMRIPREP attempts to get past your 64GB limit you will probably see some warnings. At some number of samples, the kernel will kill your job. However, given that you won’t go over 32 cores, and seeing that the runtime is the same for 1 or 4 subjects at 32, maybe 32 is the maximum number of parallelizable tasks regardless of the number of samples. If that’s correct, then the memory issues should not appear for you.

Possibly the number of threads each task can use would also have an influence here. But before I run more tests I’d like to ask if you have found some rule of thumb for running efficient instances. Something like: 3 samples at 16 cores and 32G of memory take an hour and have a consistently high CPU and RAM usage.

I generally run 1 subject per node. Since we started to face memory issues I haven’t tried more, but I’m likely under-using the HPC resources here. In our case, the memory allocation blows up after 10 parallel jobs (--n_procs 10).

By increasing the number of threads per task (--omp_nthreads) you may speed up the process a lot (and also hit memory issues). There are a couple of bottlenecks that scale very well with the number of threads per task. So you are on the right path and your intuitions are impeccable.

I’m very much looking forward to hearing where you get with this :)

Cheers,
Oscar


#5

Hey Oscar,

Thanks for the info. I had another look with the --nthreads option, keeping the memory allocation fixed at 64G. There seems to be a bit of an effect on processing speed for 4 vs 8 threads. But overall I think 1h is the lowest this thing goes in my case:


Not sure how stable the numbers for nthreads > 8 are. They may well look the same if I ran this again. But if not, something strange does indeed seem to happen around nthreads >= 10. For the 4-sample run, the runtime seems to converge towards 1.5 hours, as opposed to 1h for the smaller samples. What is the default if --nthreads is not set?

I had a look at the reported resource footprint. It seems that VMem scales more with the number of cores than with the number of threads:


In fact, if you plot the memory footprint over the number of parallel samples, it becomes clearer:

Not sure why that is. But it seems pretty stable. This is not (as much?) the case for physical memory footprint:

So I think what this tells me is that I am probably best off running one instance per subject, requesting 8 cores with 8 threads max and 10G of memory plus a safety margin. My compute nodes have 32 cores and 128G, so I can run 4 instances on one node. While that’s probably a bit slower than running 4 subjects in parallel in one instance, I will get scheduled faster and I believe the resource bill is also lower.
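That layout could be sketched as a per-subject batch script. Everything below is a hypothetical example, not a tested recipe: it assumes SLURM as the scheduler, placeholder paths, and a 16G request standing in for ~10G of observed RSS plus a margin.

```shell
#!/bin/bash
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
#SBATCH --time=04:00:00
# Hypothetical per-subject job; submit once per subject, e.g.:
#   sbatch fmriprep_subject.sh 01
# Four such jobs fit on a 32-core / 128G node.
fmriprep /data/bids /data/out participant \
  --participant-label "$1" \
  --nthreads 8 --omp-nthreads 8 --mem-mb 16000
```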

I uploaded the benchmark data if you want to play around with it. Let me know if you want me to dump the full scheduler output for you somewhere.

Best,
Seb


#6

This is awesome. We would really love it if you submitted a pull request to our documentation adding all this benchmarking. I think it is very useful and should be part of FMRIPREP’s docs.

Let me know if you would like to contribute it - I can help you set everything up.


#7

Hey Oscar,

sure, happy to - I’ll check it out over the weekend. How would you like me to proceed?

Best,
Seb