How much RAM/CPUs is reasonable to run pipelines like fmriprep?

kwiebels · December 7, 2017, 4:00am

Hi,

I am trying to get an idea of what systems people are using for running things like fmriprep. At the moment I am on a system that has 4 CPUs and 16GB of RAM. fMRIprep (1.0.0-rc12) is currently taking about 20 hours per subject without ICA-AROMA (MB2 dataset, 6 runs, 216 volumes per run). I need to set the low-mem flag, otherwise I get memory errors. Before 1.0.0-rc12 I couldn’t run fmriprep at all without it crashing.

Our IT department is in the process of setting up a VM for the lab and I was wondering if anyone could advise on what kinds of systems they are using, or what specs (RAM, CPUs) are reasonable in general, to be able to run fmriprep, mriqc etc without running into memory errors all the time.

Thanks so much,
Kristina

oesteban · December 7, 2017, 5:02am

Hi Kristina,

It makes all sense that it wouldn’t run at all before 1.0.0-rc12 with 16MB of RAM. We made great progress in that rc understanding how memory was being used. We are dealing with the issue and it is one of the lines in the roadmap with the highest priority.

With 4 CPUs, 16GB of RAM I would run fmriprep with the following options: --nthreads 2 --omp-nthreads 4 --mem-mb 16000.

To talk about performance, we first need to clarify that executing FreeSurfer’s recon-all will add 6-12 h. to your processing. In a machine like the one you describe (4CPUS) probably more. So that would account for most of your 20 hours, unless you were using the --no-freesurfer option.

That said, with FreeSurfer pre-run, I can tell you (based on my experience) that a dataset like the one you describe would take:

About 2h 30min in an 8CPU 32GB RAM desktop.
About 1h ~ 1h 30min in a 16 CPU, 64 GB RAM compute node of our cluster.

We just released FMRIPREP 1.0, and we are making a lot of progress about memory consumption. For example, before buying more memory, make sure that you can overcommit a lot of memory. This means that your /proc/sys/vm/overcommit_memory is set to 0 and/or that you have a large value in /proc/sys/vm/overcommit_ratio. If your system runs with cgroups, again make sure that you have large limits of virtual memory. Right now, the main bottleneck for FMRIPREP is the virtual memory. In terms of physical memory, I haven’t seen many datasets over 12 GB RSS, and those are datasets with a lot more and a lot lengthier runs compared to your dataset.

We also enabled a new --resource-monitor flag for FMRIPREP. You could use it to profile memory consumption in your settings. If you wanted to try that, we’ll be very happy to assist you and make good use of your feedback.

Thanks very much and happy fmriprep’ing,

Oscar

surchs · March 2, 2018, 2:02am

Hey,

I have the same question but for running multiple samples in parallel. I want to run a large dataset (N ~ 1000) efficiently on an HPC. If I run each container on a full node, I have 32 cores and 128G of memory. If I ask for either 1 or 4 samples to be run by a fmriprep instance with 64G of memory and varying numbers of CPUs, I get roughly the same processing times as @oesteban.

At 32 cores, 1 sample and 4 samples take the same amount of completion time but the 4 sample instance uses more memory (not sure these memory numbers make too much sense but that’s what the scheduler tells me).
So there seems to be some minimal runtime that I can’t get below by throwing more cores at it - possibly because a portion of the pipeline has sequential jobs. So in a 32 cores and 64G memory instance I have some room left to run additional samples in parallel.

Possibly the number of threads each task can use would also have an influence here. But before I run more tests I’d like to ask if you have found some rule of thumb for running efficient instances. Something like: 3 samples at 16 cores and 32G of memory take an hour and have a consistently high CPU and RAM usage. I would like to use that to estimate how many jobs I will request on the HPC and for how long.

Also, I’d be interested to use the --resource-monitor flag but don’t know what to do with it after I set it. Happy to try out if you have some starting points.

Best,
Seb

oesteban · March 2, 2018, 5:32pm

Hi Seb,

First of all, wow!. Pretty neat job of benchmarking here! I think you’ve taken this farther I’ve ever tried :D.

At 32 cores, 1 sample and 4 samples take the same amount of completion time but the 4 sample instance uses more memory (not sure these memory numbers make too much sense but that’s what the scheduler tells me).
So there seems to be some minimal runtime that I can’t get below by throwing more cores at it - possibly because a portion of the pipeline has sequential jobs.

Yes, there is a limit and the reason is what you mentioned: sequential jobs.

So in a 32 cores and 64G memory instance I have some room left to run additional samples in parallel.

This is partially true. FMRIPREP is still a bit inefficient when allocating memory. Even though we made a great job at keeping physical memory (typically referred to as RSS) low (we just need to look at your graph), FMRIPREP has some trouble with the virtual memory fingerprint.

If your system is very flexible about overcommitting memory this will probably not surface as a problem for you. But typically, HPCs will set hard limits and the OOM killer will kick in.

So, to really find the hard limit on your settings you will need to add more samples to your tests and look for the output logs of the scheduler. When FMRIPREP attempts to get pass your 64GB limit you will probably have some warnings. At some number of samples, the kernel will kill your job. However, given that you won’t go over 32 cores and seeing that the runtime is the same for 1 or 4 subjects at 32, maybe 32 is the maximum number of parallelizable tasks regardless the number of samples. If that’s correct, then the memory issues should not appear for you.

Possibly the number of threads each task can use would also have an influence here. But before I run more tests I’d like to ask if you have found some rule of thumb for running efficient instances. Something like: 3 samples at 16 cores and 32G of memory take an hour and have a consistently high CPU and RAM usage.

I generally run 1 subject per node. Since we started to face memory issues I haven’t tried more, but I’m likely under-using the HPC resource here. On our case, the memory allocation blows up after 10 parallel jobs (--n_procs 10).

By augmenting the number of threads per task (--omp_nthreads) you may speed up the process a lot (and also hit memory issues). But there are a couple of bottlenecks that will scale very well with the number of threads per task. So you are in the right path and your intuitions are impecable.

I’m very looking forward to hearing where you got with this

Cheers,
Oscar

surchs · March 7, 2018, 2:43am

Hey Oscar,

Thanks for the info. I had another look with the nthreads option. I kept the memory allocation fixed at 64G. There seems to be a bit of an effect on processing speed for 4 vs 8 threads. But overall I think 1h is the lowest this thing goes in my case:

Not sure how stable the numbers for nthreads > 8 are. They may well look the same if I kept running this again. But if not, something strange does indeed seem to happen around nthreads >= 10. For the 4 sample run, the runtime seems to converge towards 1.5 hours as opposed to 1h for the smaller samples. What is the default if nthreds is not set?

I had a look at the reported resource footprint. It seems that VMem scales more with the number of cores than the threads:

In fact, if you plot the the memory footprint over the number of parallel samples, it becomes clearer:

Not sure why that is. But it seems pretty stable. This is not (as much?) the case for physical memory footprint:

So I think what this tells me is that I am probably best off running one instance per subject and requesting 8 cores with 8 threads max and 10+safety G of memory. My compute nodes have 32 cores and 128G, so I can run 4 instances on one. While that’s probably a bit slower than running 4 subjects in parallel in one instance, I will get scheduled faster and I believe that the resource bill is also lower.

I uploaded the Benchmark data if you want to play around with it. Let me know if you want me to dump the full scheduler output for you somewhere.

Best,
Seb

oesteban · March 8, 2018, 6:00pm

This is awesome. We would really love that you submitted a pull request to our documentation where you add all this benchmarking. I think this is very useful and should be added to FMRIPREP.

Let me know if you would like to contribute it, I can help setting everything up.

surchs · March 9, 2018, 2:01pm

Hey Oscar,

sure, happy to - I’ll check it out over the weekend. How would you like me to proceed?

Best,
Seb

Carlos_Z · October 10, 2020, 12:49am

Hello,

I was interested to get an update on this topic with the newer versions of fmriprep. I was also curious as to whether there is a way to run the recon-all command in parallel, as shown in this link

https://andysbrainbook.readthedocs.io/en/latest/FreeSurfer/FS_ShortCourse/FS_04_ReconAllParallel.html

According to his freesurfer tutorial, Andy Jahn reommends running in parallel according to the number of logical cores (in his example, 8) and claims this reduces the time substantially. If the recon-all command is what takes up the most time in fmriprep, can an option be set to run in parallel, and do you think this would reduce the overall preprocessing time?

Alternatively, would running fmriprep without recon-all work, and if so, what would be the disavantages during group level analyses?

I will be using a 4 core, 8GB RAM machine (It’s all I have access to right now ), and I need to preprocess 110 subjects, but can’t afford to wait ~100 days to start second levels.

Another limitation is I have been assigned a windows machine. Would you recommend running fmriprep on a docker container within windows, or perhaps running it in an ubuntu environment through virtual box / wsl ?

Any advice would be greatly appreciated!!!

ahutton · October 10, 2020, 3:22pm

If the recon-all command is what takes up the most time in fmriprep, can an option be set to run in parallel, and do you think this would reduce the overall preprocessing time?

Your hardware is effectively limited to running 1 subject at a time anyway, so I don’t think you’ll see much gain in running 4 subjects at 1 core/2GB vs. using 4 cores/8GB per subject. I’m not sure it’ll even run on 2GB of memory.

Alternatively, would running fmriprep without recon-all work, and if so, what would be the disavantages during group level analyses?

Yes. The disadvantage is that you wouldn’t have the surface reconstruction. If your analysis doesn’t need/use it, there’s no disadvantage.

Another limitation is I have been assigned a windows machine. Would you recommend running fmriprep on a docker container within windows, or perhaps running it in an ubuntu environment through virtual box / wsl ?

Docker on Windows.

Carlos_Z · October 10, 2020, 4:07pm

Thanks so much for your helpful reply!!

If you don’t mind, I had a couple of follow up question to your last answer. I’m still learning how to use docker, but I noticed in the fMRIprep set up guide that some dependencies need to be pre-installed (fsl, afni, freesurfer, etc). Some of them don’t seem to run on windows, so would they have to be containerized also? Or how exactly would this work?

Also my neuroimaging experience is exclusively Linux-based, and added to the fact so many of the tools and documentation are more Linux/MacOS oriented (than windows at least), if at all possible I would much prefer to work on Linux. So is your advice against VM/wsl Linux based on significant performance decreases or other factors, and would it be a strong suggestion against, or do you think it’s workable?

Thanks again!

ahutton · October 10, 2020, 4:37pm

fmriprep has a Docker image with all software dependencies installed (link). If you build that image, almost everything is taken care of.

My advice was based on you being on a Windows machine already. You should be using the Docker/Singularity image of fmriprep regardless of OS, which means you would start on Windows, start an Linux VM, then run an Ubuntu-based Docker container. There’s no benefit to that middle part. (And if you’re on Windows, it already has an Ubuntu Bash terminal.)
In general, I recommend that you use whatever you have. VMs aren’t really necessary, and pipelines really ought to be containerized anyway. The only hiccup for you is getting Docker to work on Windows, and the rest is just shoving pipelines into Docker images.

Carlos_Z · October 10, 2020, 7:45pm

Ahh ok, it all makes sense now. Again, thanks so much!