QSIPrep not finishing processing a participant within 24 hours

Summary of what happened:

Hi,
I am back to running QSIPrep on an HPC system on some in-house data containing T1w images, AP/PA fieldmaps, and dMRI data.

The running time for a single subject seems hard to predict, even though my allocation request stays the same across the job submissions for the set of BIDS studies in my dataset.

On Wednesday I had a job running for 12 hours for a study with a single participant, and the job did not complete; all anat processing seems to have completed:

sub-ID_desc-aseg_dseg.nii.gz
sub-ID_desc-brain_mask.nii.gz
sub-ID_desc-preproc_T1w.nii.gz
sub-ID_dseg.nii.gz
sub-ID_from-MNI152NLin2009cAsym_to-T1w_mode-image_xfm.h5
sub-ID_from-orig_to-T1w_mode-image_xfm.txt
sub-ID_from-T1wACPC_to-T1wNative_mode-image_xfm.mat
sub-ID_from-T1wNative_to-T1wACPC_mode-image_xfm.mat
sub-ID_from-T1w_to-MNI152NLin2009cAsym_mode-image_xfm.h5

but when the allocated time ran out, the dwi folder contained only:

sub-ID_acq-dir99_confounds.tsv
sub-ID_acq-dir99_desc-SliceQC_dwi.json

I had a participant with the same pre-processing steps whose job completed in slightly over 6 hours.

Yesterday at 9:35 AM I re-submitted the job, this time allocating a 24-hour runtime. After the time allocation ran out today, I see that the last log trace is from 9:36 AM yesterday, with the message:

(...)

250109-09:36:43,35 nipype.workflow INFO:
	 [Node] Setting-up "qsiprep_wf.single_subject_ID_wf.anat_preproc_wf.output_grid_wf.deoblique_autobox" in "/workdir/qsiprep_wf/single_subject_ID_wf/anat_preproc_wf/output_grid_wf/deoblique_autobox".

That is, the log shows almost no entries for the processing that was actually done. I am really confused by this.

I see in the output directory that by 9:37 AM yesterday it had already pre-processed the anat data, and supposedly all of those files had been written.

However, the dwi folder contains the same two files as on Wednesday; i.e., no real progress was made over a reasonable number of hours (>12).

So:

  • It seems weird that, for the same allocation (except for the time limit), the runtime varies that much across participants whose scanning protocol was the same. Am I missing something?
  • It seems weird to me that the log files vary that much in their contents across different runs. Am I missing something?
  • I am wondering what I may be missing to get the pre-processing to complete for this participant, as there is no trace of failure in the logs.

Thanks.

Command used (and if a helper script was used, a link to the helper script or the command generated):

I am running the following command in my SLURM script:

cmd="singularity run --cleanenv \
  --bind ${in_bids_dirname} \
  --bind ${out_dirname} \
  --bind ${fs_license_fname}:${mapped_fs_license_fname} \
  --bind ${work_dirname} \
  ${qsiprep_singularity_fname} \
  ${in_bids_dirname} \
  ${out_dirname} \
  participant \
  --participant-label ${participant} \
  --output-resolution ${output_resolution} \
  --fs-license-file ${mapped_fs_license_fname} \
  --work_dir ${work_dirname} \
  --skip_bids_validation"

Version:

0.19.0

Environment (Docker, Singularity / Apptainer, custom installation):

Singularity

Data formatted according to a validatable standard? Please provide the output of the validator:

PASTE VALIDATOR OUTPUT HERE

Relevant log outputs (up to 20 lines):

The log file contained the following last lines:

(...)
250108-18:27:15,268 nipype.workflow INFO:
	 [Node] Finished "split_eddy_lps", elapsed time 113.47129s.
250108-18:27:16,206 nipype.workflow INFO:
	 [Node] Setting-up "qsiprep_wf.single_subject_ID_wf.dwi_finalize_acq_dir99_wf.transform_dwis_t1.get_interpolation" in "/workdir/qsiprep_wf/single_subject_ID_wf/dwi_finalize_acq_dir99_wf/transform_dwis_t1/get_interpolation".
250108-18:27:16,436 nipype.workflow INFO:
	 [Node] Executing "get_interpolation" <qsiprep.interfaces.images.ChooseInterpolator>
[1. 1. 1.] [1.3499999 1.3499999 1.3499999]
250108-18:27:16,548 nipype.interface WARNING:
	 Using BSpline interpolation for upsampling
250108-18:27:16,549 nipype.workflow INFO:
	 [Node] Finished "get_interpolation", elapsed time 0.109116s.

Screenshots / relevant information:


Hi @jhlegarreta,

In the future, it is helpful to use the software support post template. You can see I edited it back in for you this time.

It is not clear what resources you are devoting to the job; supplying that information would help.

Also, that version is old; we recommend updating.

Best,
Steven

In the future, it is helpful to use the software support post template. You can see I edited it back in for you this time.

Sorry for that. Did not realize templates existed. Thanks.

It is not clear what resources you are devoting to the job; supplying that information would help.

Requesting one node with 32 GB:

#SBATCH --time=24:00:00
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=8
#SBATCH --gpus=0
#SBATCH --mem=32G
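
On re-reading this, `--ntasks=1` together with `--ntasks-per-node=8` looks redundant, since QSIPrep runs as a single (multi-process) SLURM task. A request along these lines is probably closer to what I intend; this is a sketch, not what I actually submitted:

#SBATCH --time=24:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G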

Also, that version is old; we recommend updating.

Will see if I manage to use a newer version.

Hi @jhlegarreta,

Does seff $jobid reveal any memory leaks? Or did the job stay within the allotment?

Best,
Steven

This is what I get from sacct (seff is not installed on the system):

$ sacct -j <my_job_id> --format=JobID,JobName,State,Elapsed,CPUTime,AllocCPUS,MaxRSS,MaxVMSize,ReqMem
JobID           JobName      State    Elapsed    CPUTime  AllocCPUS     MaxRSS  MaxVMSize     ReqMem 
------------ ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- 
7422244_0    qsiprep_p+    TIMEOUT 1-00:00:22 8-00:02:56          8                              32G 
7422244_0.b+      batch  CANCELLED 1-00:00:23 8-00:03:04          8   3418676K   4561824K            

I launched an array job, in this case with only one participant.
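
While a job is still running, `sstat` can also report the live memory of the batch step even without seff; something along these lines should work (I have not relied on it much, so treat it as a sketch):

sstat -j <my_job_id>.batch --format=JobID,AveCPU,AveRSS,MaxRSS,MaxVMSize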

Thanks. Okay, let's revisit after seeing if the error persists on the most recent version (1.0.0rc1).

Best,
Steven

The latest version seems to be 0.24.0:
https://hub.docker.com/r/pennbbl/qsiprep/tags

Am I missing something?

The latest versions going forward are on the pennlinc dockerhub repo.

Using 1.0.0rc1, the process stalled again.

The last few lines of the log output read:

250110-23:17:35,606 nipype.workflow INFO:
	 [Node] Finished "raw_src_qc", elapsed time 43.038833s.
250110-23:22:43,595 nipype.interface INFO:
	 Generating denoising visual report

This time I allocated resources for 12 hours; sacct shows the following:

 JobID           JobName      State    Elapsed    CPUTime  AllocCPUS     AveCPU     AveRSS     MaxRSS  MaxVMSize     ReqMem TRESUsageInAve 
------------ ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- -------------- 
7427540_0    qsiprep_p+    TIMEOUT   12:00:11 4-00:01:28          8                                                    32G                
7427540_0.b+      batch OUT_OF_ME+   12:00:13 4-00:01:44          8 1-06:09:11  26716128K  26716128K  33798976K            cpu=1-06:09:1+

I am not sure why some items above exceed the 12 hours (presumably CPUTime is just Elapsed × AllocCPUS, i.e. 12 h × 8 CPUs ≈ 4 days). The job started at 17:54, so between the last log entry at 23:22 and the end of the allocation at 05:54, I do not know what the pipeline was doing for more than 6 hours.
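
To confirm whether the batch step was actually killed for memory rather than just timing out, something along these lines should show the state, exit code, and peak usage per step (standard sacct fields):

sacct -j <my_job_id> --units=G --format=JobID,State,ExitCode,Elapsed,MaxRSS,ReqMem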

The last DWI file was written at 23:14, and I see the following files:

sub-ID_from-ACPC_to-anat_mode-image_xfm.mat
sub-ID_from-ACPC_to-MNI152NLin2009cAsym_mode-image_xfm.h5
sub-ID_from-anat_to-ACPC_mode-image_xfm.mat
sub-ID_from-MNI152NLin2009cAsym_to-ACPC_mode-image_xfm.h5
sub-ID_from-orig_to-anat_mode-image_xfm.txt
sub-ID_space-ACPC_desc-aseg_dseg.nii.gz
sub-ID_space-ACPC_desc-brain_mask.nii.gz
sub-ID_space-ACPC_desc-preproc_T1w.json
sub-ID_space-ACPC_desc-preproc_T1w.nii.gz
sub-ID_space-ACPC_dseg.nii.gz

Any suggestions on how to move forward with this?

Is it still the exact same subjects stalling on the new version?

Is it still the exact same subjects stalling on the new version?

Yes.

Any chance you can share the raw data from a subject?

Any chance you can share the raw data from a subject?

Unfortunately not, Steven.

Then, besides trying to increase the memory allotment, I’m not sure what else I can suggest, sorry.

Hi @jhlegarreta, I think this is one of the cases where there is a subprocess that crashes and nipype just continues to hang indefinitely. Is there anything in your log directory that shows there was a crash? It might not even show up in standard error in the terminal.
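
A quick way to look, in case it helps (the variable names are the ones from your submission script; crash files written by nipype all start with "crash-"):

find ${work_dirname} ${out_dirname} -name "crash-*"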

Thanks for chiming in @mattcieslak.

I had not checked the err file, TBH; attaching it. Quickly skimming through the log file of the SLURM job, there is no apparent trace of any crash or error; the err file of the SLURM job does show some errors, though. There is a memory-related message in there (oom), and there are other messages reporting issues, but I am not sure which step caused them, or whether the only possible interpretation is insufficient memory: 32 GB seems generous enough to me to pre-process a participant whose DWI does not seem to be excessively big (100 directions, 260 MB at 1.5 mm voxel size).

That being said, I increased the memory to 48 GB and re-ran the job for this participant, and it completed in 23 minutes. For other participants, I had seen similar behavior when a job hit the time limit and I re-ran it with the same memory allocation (32 GB) but a longer runtime (e.g., 8 or 12 hours vs. 6). I was already puzzled by such a tremendous and seemingly random runtime difference.

I am fine allocating 48 GB going forward, but I would like to understand what’s going on here: what I should look at (besides the err file :-S) or try when tracking down the source of the stalling behavior, and whether I am missing something in all this.

error_7427540_0.txt (6.7 KB)
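
In case it is useful, the memory-related lines are easy to pull out of that err file with a plain grep (nothing QSIPrep-specific):

grep -inE "oom|out of memory|killed" error_7427540_0.txt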

Thanks.

The memory blow-up usually happens because of synthseg or synthstrip. Nothing significant has changed in eddy or the other processing steps. We’ve seen synthseg use 60 GB of memory in some cases. Hopefully there will be an efficiency improvement soon that we can include.

I am not sure how synthseg/synthstrip come into play in this case, as I am not requesting GPU resources, so unless they have a fallback that does not require a GPU, they should not be an issue here.

I will try other participants/datasets with the increased memory allocation.