HPC: how many resources to allocate, and running parallel processes

Hi again,

TractoFlow has produced good results for me on ~70 datasets with single-shell data (b=750). Now I am aiming to process a subset of 41 subjects with multi-shell data. Each dataset has 8 b0 volumes, 32 b=750 directions, and 60 b=2000 directions (100 volumes in total).

I plan to run this on my local HPC server, but I’m running into memory issues when I reach the Eddy step. Here’s an example command and my process so far:

nextflow -c /home/blgeerae/Programs/tractoflow-2.0.0/singularity.conf run /home/blgeerae/Programs/tractoflow-2.0.0/main.nf --root /home/blgeerae/Data/tractoflow/Multishell --dti_shells "0 750" --fodf_shells "0 750 2000" -with-singularity /home/blgeerae/Programs/tractoflow-2.0.0/tractoflow_2.0.0_8b39aee_2019_04_26.img -resume

  • First I tried running TractoFlow on the full cohort (41 datasets, 100 volumes each) with 40 CPUs and 185GB of memory allocated, using the command above. The process ran out of memory and stalled, and I cancelled it after 6 days of processing.

  • Next I tried running subjects in parallel, submitting 41 separate jobs. To facilitate this, each subject's data was moved into its own folder and each script was pointed at a different --root directory. All scripts were executed from the same directory, in the hope that all results would land in the same /results/ folder (which also means all scripts share the same /work/ folder). For these I tried adding --processes 8 or --processes 4 with 8 CPUs and 32GB of memory allocated, but no job has completed successfully yet. A couple of different errors have popped up:

    • Process requirement exceed available CPUs -- req:8; avail: 4. Oddly, this appeared on a job with 8 CPUs allocated and the --processes 4 flag.
    • ERROR ~ Unable to acquire lock on session with ID 48e8c351-9ab9-46b1-ae63-8anaanf61503 (or a similar ID). This has appeared on many different jobs.
    • Other jobs ran until they hit the time limit; none completed, and none got through the eddy step.

So, a couple questions:

  • Can anyone comment on how much memory I should allocate to successfully process multi-shell data with 92 directions and 8 b0s? Either as one unified script or as a separate script per subject.
  • Is this ‘unable to acquire lock…’ error due to multiple TractoFlow jobs running in the same directory and accessing the same /work/ and /results/ folders? I had hoped this would be simpler than creating a separate folder for each job and consolidating results later.

I know I can skip the eddy step, but if possible I’d prefer to keep it, so that the methods stay the same between my single-shell and multi-shell processing.

Thank you for your time!

You shouldn’t run one script per subject. TractoFlow will crash as soon as it notices more than one instance on the same working directory, as you have already seen with the “acquire lock” error. It would probably be possible to put each subject’s data in a different folder at the same level as your results and work folders (so that each subject gets its own job while all results still go to the same place), but you shouldn’t have to resort to that. It would also cause problems: the mean FRF is calculated across subjects and used for tractography, and I don’t think that mean would be computed correctly if each subject ran separately.

Have you tried running on multiple nodes using the code listed in the documentation?

Thank you, that makes sense to me. Running on multiple nodes is a new concept to me, I have not tried it! Sounds like a good option moving forward, although the question of how many cpus and memory will be required for a job of this size remains. Unfortunately I’m not familiar with the available tools to estimate that.

Are you running TractoFlow as a background job with slurm/sbatch? If so, I think there are specific commands to see detailed usage information. I don’t know them off the top of my head, though.
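For what it's worth, Slurm's accounting tools can report peak memory for running or finished jobs. A couple of commands worth trying on the cluster (the job ID 1234567 is a placeholder; seff is a contributed tool that not every site installs):

```shell
# Per-step resource usage for a job; MaxRSS is the peak resident
# memory of each step, which tells you how close eddy got to the limit.
sacct -j 1234567 --format=JobID,JobName,Elapsed,MaxRSS,State

# If available, seff prints a one-page efficiency summary for a finished
# job, including memory used versus memory requested.
seff 1234567
```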

I recommend keeping --processes at its default; this maximizes the number of parallel processes, which should speed things up overall (unless there is not enough memory per process). You can try changing --processes_eddy, but be mindful about changing the number of threads: my understanding is that they are set that way for reproducibility.
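Concretely, that advice applied to the command earlier in the thread might look like the sketch below: --processes is simply omitted (left at its default) and only the eddy parallelism is set. The value 4 is purely illustrative, not a recommendation:

```shell
# Same invocation as above, with --processes left at its default and only
# the eddy parallelism set explicitly (the value 4 is illustrative).
nextflow -c /home/blgeerae/Programs/tractoflow-2.0.0/singularity.conf \
    run /home/blgeerae/Programs/tractoflow-2.0.0/main.nf \
    --root /home/blgeerae/Data/tractoflow/Multishell \
    --dti_shells "0 750" \
    --fodf_shells "0 750 2000" \
    --processes_eddy 4 \
    -with-singularity /home/blgeerae/Programs/tractoflow-2.0.0/tractoflow_2.0.0_8b39aee_2019_04_26.img \
    -resume
```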

Yes, I’m using slurm, though I’m only marginally familiar with the tool. I’m looking further into this and asking my local resources for support.

Thank you for the suggestion on --processes. I’ve submitted a job requesting resources from two nodes, with --processes left at the default, so fingers crossed. Perhaps two nodes with 40 CPUs and 185GB of memory each will be enough to complete this work!

Making good progress learning to work with the available tools; thanks for bearing with me. I’ve gotten processing to work on subsets of subjects (1 and 5 datasets), but I’m still running into eddy failures when running TractoFlow on the whole (n=41) dataset, which I now understand to be the preferred approach. Currently my jobs are running on one node with 120GB of memory available (the short wait times on these nodes have helped me troubleshoot more quickly).

When eddy fails, an exit status of (2) or (139) is often reported. Can anyone help me interpret these? So far the best explanation I’ve found is:

(2): incorrect usage, generally a missing argument or an invalid option
(139): the process tried to access memory it does not have permission to access (a segmentation fault), which can also happen when the system runs out of memory
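The 139 in particular is a shell convention rather than anything eddy-specific: a process killed by signal N is reported as exit status 128 + N, and signal 11 is SIGSEGV (segmentation fault). A quick demonstration:

```shell
# A process killed by SIGSEGV (signal 11) is reported by the shell
# as exit status 128 + 11 = 139.
status=0
bash -c 'kill -SEGV $$' || status=$?
echo "exit status: $status"   # prints: exit status: 139
```

(Note that the out-of-memory killer sends SIGKILL, signal 9, which shows up as 137 instead; a 139 from eddy can still be memory-related if an allocation fails and a bad pointer gets dereferenced.)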

So I’m continuing to try running these jobs with as much memory as I can muster. Any other suggestions?

Thank you for your continued help!

Did you make any edits to the main.nf file? Also, maybe try updating to the newest TractoFlow version; I have not used 2.0.0. If you do update, you no longer need the -c argument to specify the singularity.conf file.

Hi @blgeerae,

What is the voxel size of your diffusion data? This parameter is very important to take into account. To give you an example, one of our databases has approximately 100 directions and 7 or 8 b0s, and it runs easily (with eddy) on 40 CPUs and 192GB of RAM.

No edits were made to the main.nf file. Updating to the newest version of TractoFlow sounds wise; I can give that a shot.

The voxel size for my diffusion data is 0.86 x 0.86 x 2.2mm, with a 256x256x62 matrix. Is this comparable to the database you were referencing?
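For scale, that matrix size makes the raw series fairly large. A back-of-the-envelope estimate of one copy of the data (assuming 32-bit floats; eddy's actual footprint will be several times this, since it holds working copies plus its model):

```shell
# Approximate size of one float32 copy of the DWI series:
# 256 x 256 x 62 voxels per volume, 100 volumes, 4 bytes per voxel.
bytes=$((256 * 256 * 62 * 100 * 4))
echo "$((bytes / 1024 / 1024)) MiB per copy"   # prints: 1550 MiB per copy
```

By comparison, a 2x2x2mm acquisition with a typical 128x128x70-ish matrix is roughly a quarter of that per volume, which may partly explain why the reference database runs comfortably at 192GB.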

My test jobs have been running on 16 cores and 120GB of memory, and I’ve been seeing a ~50% success rate on eddy processes. I’ll go ahead and submit a job on a larger node with 40 cores and more memory, and cross my fingers!

Hi @blgeerae ,

Were you able to make TractoFlow work properly?
It’s true that your dataset is a little different from what we are used to seeing (2x2x2mm raw DWI).

If you manage to make it work, could you reply with the parameters you used and mark this thread as solved? It helps sort the different issues.

Thank you in advance

Indeed, updating TractoFlow from 2.0.0 to 2.1.1 helped my processing run more smoothly. In the end I also bypassed the eddy step, as it proved to be the largest hang-up for my data and, in my trials, skipping it did not change my results much.