MRIQC freezes on Compute Canada

Hi,

I’m running MRIQC on Compute Canada and have managed to get further along than I previously did (220/336 subjects) by setting `--cpus-per-task=16` and `--mem-per-cpu=4G`. But it once again seems to have frozen, and there’s no error message. The last statement was:

`2018-10-18 16:39:10,378 nipype.workflow:INFO [MultiProc] Running 16 tasks, and 1832 jobs ready. Free memory (GB): 28.07/30.00, Free processors: 0/16`

followed by the list of what it’s currently running. Does anyone know whether it may restart once processors are freed up, or whether I should change the memory and CPU settings? It’s been about an hour since that last statement, and I have 17 hours left on this sbatch request.

Hi @Liza_Levitis thanks for the feedback.

Can you provide the exact command line you are submitting to the queue? You may benefit from splitting the work by subject and collating the tasks into a job array. Is Compute Canada a SLURM system?

Hi @oesteban, thanks for the super quick reply! :slight_smile:

Yes, CC is a SLURM system.

This is what I had submitted:

```bash
#!/bin/bash
#SBATCH --time=24:00:00
#SBATCH --account=rpp-aevans-ab
#SBATCH --cpus-per-task=16
#SBATCH --mem-per-cpu=4G

module load singularity
singularity run -B /project \
    ~/projects/rpp-aevans-ab/llevitis/sing_images/mriqc-0.14.2.simg \
    ~/projects/rpp-aevans-ab/llevitis/DIAN/Nifti/ \
    ~/projects/rpp-aevans-ab/llevitis/DIAN/mriqc_results/ \
    participant --n_procs 16 --ants-nthreads 8 -f --mem_gb 30 -vv \
    -w ~/projects/rpp-aevans-ab/llevitis/DIAN/mriqc_results/
```

Great.

I would definitely recommend splitting the job into a job array (if CC supports it). A starter script can be found here: https://gist.github.com/oesteban/5947d28caf6c3750e0a2f2aa09102702#file-mriqc-subjects-sh. You’ll need to create a text file with all your subjects (one participant ID per line; see the sketch below) and correct the file name in the script. Then set the appropriate value in the script header to match the number of subjects/lines in the text file (e.g., for 265 subjects, `#SBATCH --array=1-265`). The downside of this strategy is that you’ll need to manually run `mriqc data/ out/ group` afterwards to generate the group-level reports.
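One way to build that subject list from your BIDS directory (a minimal sketch; the `subjects.txt` file name is just a placeholder):

```bash
# List the sub-* directories, strip the "sub-" prefix to get bare
# participant labels (one per line), and count them for the --array range.
ls -d ~/projects/rpp-aevans-ab/llevitis/DIAN/Nifti/sub-* \
    | xargs -n1 basename | sed 's/^sub-//' > subjects.txt
wc -l subjects.txt
```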

One other thing you can do to make sure MRIQC is not stalled is to increase the verbosity of the output with `-vv` or `-vvv`. The more v’s, the more verbose the output.

You can tell MRIQC to analyze a particular participant using `--participant_label` with a different value for each subjob.
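Putting that together, a per-subject array script in the spirit of the gist might look like this. This is only a sketch: `subjects.txt`, the walltime, the resource values, and the `mriqc_work/` directory are assumptions to adjust for your data.

```bash
#!/bin/bash
#SBATCH --time=03:00:00
#SBATCH --account=rpp-aevans-ab
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=4G
#SBATCH --array=1-265

# Pick this task's participant label from subjects.txt
# (SLURM_ARRAY_TASK_ID is 1-based, matching sed's line numbering).
SUBJECT=$(sed -n "${SLURM_ARRAY_TASK_ID}p" subjects.txt)

module load singularity
singularity run -B /project \
    ~/projects/rpp-aevans-ab/llevitis/sing_images/mriqc-0.14.2.simg \
    ~/projects/rpp-aevans-ab/llevitis/DIAN/Nifti/ \
    ~/projects/rpp-aevans-ab/llevitis/DIAN/mriqc_results/ \
    participant --participant_label "$SUBJECT" \
    --n_procs 8 --mem_gb 16 \
    -w ~/projects/rpp-aevans-ab/llevitis/DIAN/mriqc_work/   # hypothetical work dir
```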

I’ve submitted similar jobs, @Liza_Levitis, and had them continue even after “stalling” for a couple of hours. The safer strategy would definitely be to re-submit with more verbose reporting and individual subject jobs, but I suspect it may still finish as-is!

Elizabeth

Thank you all for the advice! I’ll wait it out and see if the job starts back up; if not, I’ll split the job into a job array as @oesteban suggested.

You can always ssh into one of the nodes you requested and run `top` to check whether MRIQC seems to be doing anything relevant. Since you are using Singularity, your processes will be exposed directly (meaning you’ll see `antsRegistration`, `flirt`, `3dSkullStrip`, etc. running directly, not just a python process).
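For example (the node name is illustrative; `squeue`’s `%N` field tells you where each job landed):

```bash
squeue -u "$USER" -o "%.10i %.20j %N"   # your job IDs, names, and node lists
ssh cdr768                              # hop onto the node running the job
top -u "$USER"                          # watch for antsRegistration, 3dSkullStrip, ...
```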

@oesteban - quick question: if you’re dividing the job into a job array, would it be sufficient to request ~3 hours instead of 24 hours? My understanding is that it would be 3 hours per SLURM array task.

That really depends on how many images you have per subject.

For T1w images only, each subject takes around 10 minutes per T1w image.

For BOLD runs, I think it was more like 20 minutes per run.

You can use `-m T1w` and `-m bold` to split the load further; see the sketch below. You can also split by sessions, tasks, and/or runs.
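For instance, you could keep two copies of the array script above, identical except for the modality filter appended to the mriqc arguments, and size the walltimes to those per-image estimates (script names are hypothetical):

```bash
# Each script adds "-m T1w" or "-m bold" to the mriqc call shown earlier.
sbatch --time=00:45:00 mriqc_array_t1w.sh    # ~10 min per T1w image
sbatch --time=02:00:00 mriqc_array_bold.sh   # ~20 min per BOLD run
```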

Cheers

Just a general thing regarding Compute Canada: they have recurring file-system problems (I know firsthand about Cedar; Graham seems similar) that can make things slow to a crawl in all kinds of ways. So there is a good chance this is related to the HPC acting up for a while, after which it gets better. I have found that re-submitting helps in these cases.

A job array also helps because if your jobs are all submitted at the same time and then stall, they just hit the walltime and quit. With an array you can set a maximum number of jobs to run at the same time (see below), so even if some jobs stall and die, the remaining ones can still run provided the problem is short-lived. It also helps with getting scheduled faster.
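In SLURM, that cap goes on the array range itself with a `%` suffix:

```bash
#SBATCH --array=1-265%20   # 265 tasks total, at most 20 running concurrently
```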

@surchs yep, lesson learned to submit big jobs as a job array on CC. I did that yesterday, and MRIQC ran without any problems! :slight_smile:

@oesteban - I got this message for some scans: `Detected a zero-filled frame, has the original image been rotated?` But these scans don’t look rotated, and there are no blank slices. I saw that this had been reported as a bug before (https://github.com/poldracklab/mriqc/issues/637). Was this bug ever addressed?

No, I haven’t found time to address that issue and I agree it is pretty annoying.

Happy to take contributions to fix it! :smiley:

Cool, will look into it!