Anyone have any experience running ANTs’ template construction inside a Singularity container? I’m running it on our HPC with ~20 mouse brains, and the scheduler keeps killing the job for exceeding its memory limit. I’m using local parallelization and threw a ton of memory at it (1 job with 28 cores and 1.2TB of RAM), and it still failed even with just six local jobs. It appears to get through the rigid registration (~25 minutes) before failing.
Here’s the call I’m using inside the container:
${ANTSPATH}/antsMultivariateTemplateConstruction.sh \
-d 3 \
-o ${OUTDIR}/t_ \
-i 4 \
-g 0.2 \
-j 6 \
-c 2 \
-k 1 \
-w 1 \
-m 100x75x50x10 \
-r 1 \
-s CC \
-t GR \
${INPUTDIR}/*LPI.nii.gz
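For anyone unfamiliar with the flags, my understanding (paraphrasing the script’s usage text; corrections welcome):
# -d 3              image dimensionality
# -o ${OUTDIR}/t_   output prefix for the template and intermediate files
# -i 4              number of template-construction iterations
# -g 0.2            gradient step size for the template shape update
# -j 6              number of cpu cores (only used with -c 2)
# -c 2              execution control (2 = run jobs locally in parallel via PEXEC)
# -k 1              number of modalities
# -w 1              modality weight(s)
# -m 100x75x50x10   max iterations at each registration level
# -r 1              rigid-body registration of inputs before template creation
# -s CC             similarity metric (cross-correlation)
# -t GR             transformation model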
And here’s the error message:
--------------------------------------------------------------------------------------
Starting ANTS registration on max 6 cpucores. Iteration: 1 of 4
Progress can be viewed in job*_0_metriclog.txt
--------------------------------------------------------------------------------------
Using max 6 parallel threads
Running sh /derivatives/study_template/output_new/job0_r.sh
Running sh /derivatives/study_template/output_new/job10_r.sh
Running sh /derivatives/study_template/output_new/job11_r.sh
Running sh /derivatives/study_template/output_new/job12_r.sh
Running sh /derivatives/study_template/output_new/job13_r.sh
Running sh /derivatives/study_template/output_new/job14_r.sh
=>> PBS: job killed: mem 1322081096kb exceeded limit 1310720000kb
I was able to get it to run on 3 cores, but it took ~5 days to get through the first of 4 iterations, so that is far from ideal.
If anyone has thoughts, that’d be great.
The Singularity container is the big issue: this pipeline can submit jobs to the cluster and parallelize that way, but not from inside a container.
I have a modified version that uses qbatch (https://github.com/pipitone/qbatch), and I added a “container breakout” mode to qbatch which, with some tooling, may solve the issue.
For an example of how to “wrap” a qbatch-enabled singularity pipeline, see https://github.com/CoBrALab/MAGeTDocker/blob/master/mb-container
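The gist of that wrapper is roughly this (a minimal sketch, not the actual mb-container script; the container path and bind points are placeholders):

#!/bin/bash
# Minimal sketch of a container wrapper: run the pipeline command inside the
# container while binding the working directory, so that qbatch-generated job
# scripts land on the host filesystem where the "container breakout" mode can
# hand them to the real scheduler.
CONTAINER=/path/to/pipeline.simg   # placeholder image path
# Bind the current directory and the data tree, then pass the requested
# command straight through to the container.
singularity exec -B "$PWD" -B /derivatives "$CONTAINER" "$@"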
I’ll find a place to stick my qbatch-enabled template building version.
But as a follow-up: what resolution is this data? I can’t conceive of running out of RAM with 1.2TB; I run high-resolution ANTs registrations on 128GB, and my typical runs are on 32GB machines.
Hi @gdevenyi,
I need to look more closely at qbatch, but from my quick glance through it, I wasn’t sure how it would submit the multiple jobs that the ANTs template builder ends up creating, particularly since new jobs get created after each iteration. If you do come up with the qbatch-enabled template-building version and don’t mind sharing, I’d definitely be interested.
As far as the images go, they are 75 micron isotropic mouse brains, each weighing in at ~10MB. So I agree that running out of memory, even with the local parallel execution method, seems odd.
Sorry, I promised my modified version:
If you specify the SLURM (-c 5) mode, it uses qbatch to submit jobs instead.
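So your call would look the same, just with the backend flag changed (a sketch, assuming the qbatch-enabled fork):

# Identical to the original call except -c 5, which hands each generated
# job*_r.sh to qbatch (and thence to the scheduler) instead of running them
# under local PEXEC; -j no longer applies.
${ANTSPATH}/antsMultivariateTemplateConstruction.sh \
  -d 3 \
  -o ${OUTDIR}/t_ \
  -i 4 \
  -g 0.2 \
  -c 5 \
  -k 1 \
  -w 1 \
  -m 100x75x50x10 \
  -r 1 \
  -s CC \
  -t GR \
  ${INPUTDIR}/*LPI.nii.gz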
I think the best you can do here is figure out how to get ANTs to talk to the cluster.
Alternatively, I’d try to request the allocation of the whole machine (28 cores and all the RAM) and run 28 parallel registrations.
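For the whole-machine route, a sketch of the PBS submission (your paths and image name are placeholders; the key extra is capping each registration to a single ITK thread so 28 parallel jobs don’t each try to spawn a thread per core):

#!/bin/bash
#PBS -l nodes=1:ppn=28
#PBS -l mem=1250gb
# The SINGULARITYENV_ prefix makes Singularity export the variable inside the
# container; ANTs respects ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS, so each
# registration process stays single-threaded.
export SINGULARITYENV_ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=1
# Same template-construction call as before, but with 28 local jobs.
singularity exec -B /derivatives /path/to/container.simg \
  bash -c '${ANTSPATH}/antsMultivariateTemplateConstruction.sh \
    -d 3 -o ${OUTDIR}/t_ -i 4 -g 0.2 -j 28 -c 2 \
    -k 1 -w 1 -m 100x75x50x10 -r 1 -s CC -t GR \
    ${INPUTDIR}/*LPI.nii.gz'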
75 micron isotropic is well within the memory limits of what I’ve seen for scans.
I don’t know what -t GR is in your call; it doesn’t seem to be an option.
Thanks for your thoughts @gdevenyi, I do appreciate it. I’ll keep monkeying with it and see if I can get it to work. Ironically, I did request the whole machine, all the RAM, and 21 registrations (all the brains that were reasonable to register), and it still died. I’m sure there’s a flag or option I’m neglecting that would keep it from bloating the RAM; I just haven’t found it yet.
Regarding the script file you provided, I’d still need to run that from outside the container, right?
And lastly, -t GR … That should be SyN. I don’t know where the older version of the option snuck in from. Thanks for catching that.