QSIRecon working directory

Summary of what happened:

Hi! I’m running QSIRecon on a cluster where jobs in a certain partition are pre-empted every four hours. They get canceled and requeued. I had thought that QSIRecon would find the work directory and pick up from where the previous job had left off, but that doesn’t seem to be the case. Is QSIRecon supposed to recognize old work directories, or am I doing something wrong?

Command used (and if a helper script was used, a link to the helper script or the command generated):

apptainer run -e -B ${input_dir} -B ${fs_license_path} -B /gscratch/scrubbed/mphagen $qsirecon_container \
        $input_dir ${output_dir} participant \
        --fs-subjects-dir "${input_dir}/${subject_id}/T1w" \
        --participant-label $subject_id \
        --input-type hcpya \
        --atlases 4S156Parcels \
        --stop-on-first-crash \
        --resource-monitor \
        --n-cpus 16 \
        -w "/gscratch/scrubbed/mphagen/${subject_id}" \
        --recon-spec mrtrix_multishell_msmt_ACT-hsvs \
        --fs-license-file "${fs_license_path}/fs_license.txt" \
        -vvv

Version:

qsirecon_1.1.0

Environment (Docker, Singularity / Apptainer, custom installation):

Apptainer

Screenshots / relevant information:

Full sbatch
#!/bin/bash
#SBATCH --job-name=qsirecon
#SBATCH --mail-type=END
#SBATCH --mail-user=mphagen@uw.edu
#SBATCH --mem=30G
#SBATCH --account=stf
#SBATCH --partition=ckpt
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --time=28:00:00 # Max runtime in DD-HH:MM:SS format.
#SBATCH --output=qsirecon_logs/%j_%a.out # where STDOUT goes
#SBATCH --error=qsirecon_logs/%j_%a.error # where STDERR goes
#SBATCH --export=NONE
#SBATCH --array=45-55

eval "$(/gscratch/escience/mphagen/miniforge/bin/conda shell.bash hook)"

# Your programs to run.

#Print current script for debugging
cat $0

#Activate conda environment
conda activate datalad_env

#Define paths and variables
fs_license_path=/gscratch/escience/mphagen/connectivity-processing/code
qsirecon_container="/gscratch/escience/gkolpin/qsirecon_1.1.0.sif"
input_dir="/gscratch/escience/mphagen/connectivity-processing/data/human-connectome-project-openaccess/HCP1200"
output_dir=${input_dir}/derivatives/qsirecon
subject_file="/gscratch/escience/mphagen/connectivity-processing/code/test_subjects.txt"
subject_id=$( sed -n ${SLURM_ARRAY_TASK_ID}p $subject_file )
echo $subject_id

#Get our data
bash datalad_get.sh $input_dir $subject_id

#Run QSIRecon
apptainer run -e -B ${input_dir} -B ${fs_license_path} -B /gscratch/scrubbed/mphagen $qsirecon_container \
        $input_dir ${output_dir} participant \
        --fs-subjects-dir "${input_dir}/${subject_id}/T1w" \
        --participant-label $subject_id \
        --input-type hcpya \
        --atlases 4S156Parcels \
        --stop-on-first-crash \
        --resource-monitor \
        --n-cpus 16 \
        -w "/gscratch/scrubbed/mphagen/${subject_id}" \
        --recon-spec mrtrix_multishell_msmt_ACT-hsvs \
        --fs-license-file "${fs_license_path}/fs_license.txt" \
        -vvv

The sbatch logs from the original jobs unfortunately have been getting overwritten by the re-queued jobs, so I only have the re-queued logs.
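
One possible way to keep the earlier logs is sketched below. It assumes the logs are overwritten because a requeued job keeps its job ID, so the same `%j_%a` file is reopened and truncated, and that the cluster honors sbatch's `--open-mode` option. The `attempt_banner` function is a hypothetical helper, not part of Slurm or QSIRecon; it only marks where each attempt starts in the appended log.

```shell
#!/bin/bash
# Append to existing STDOUT/STDERR files on requeue instead of truncating them.
#SBATCH --open-mode=append

# Hypothetical helper: print a separator so each attempt is easy to find in the log.
# SLURM_RESTART_COUNT is set by Slurm after a requeue; an unset value is treated
# here as the first attempt (an assumption of this sketch).
attempt_banner() {
    printf '===== job %s, restart count %s =====\n' \
        "${SLURM_JOB_ID:-unknown}" "${SLURM_RESTART_COUNT:-0}"
}

attempt_banner        # goes to the --output file
attempt_banner >&2    # goes to the --error file
```

With this in place, the output of the pre-empted attempt should remain above the banner of the requeued attempt, which makes it possible to see which nodes had already finished before the cancellation.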

Log from requeued job
250807-08:19:08,48 cli INFO:
	 Telemetry system to collect crashes and errors is enabled - thanks for your feedback! Use option ``--notrack`` to opt out.
Subject(s) to run: ['300719']
250807-08:19:14,26 nipype.workflow DEBUG:
	 (ingress2qsirecon_single_subject_300719_wf.input_node, ingress2qsirecon_single_subject_300719_wf.parse_layout_node): No edge data
250807-08:19:14,26 nipype.workflow DEBUG:
	 (ingress2qsirecon_single_subject_300719_wf.input_node, ingress2qsirecon_single_subject_300719_wf.parse_layout_node): new edge data: {'connect': [('subject_layout', 'subject_layout')]}
250807-08:19:14,26 nipype.workflow DEBUG:
	 (ingress2qsirecon_single_subject_300719_wf.parse_layout_node, ingress2qsirecon_single_subject_300719_wf.conform_dwi): No edge data
250807-08:19:14,26 nipype.workflow DEBUG:
	 (ingress2qsirecon_single_subject_300719_wf.parse_layout_node, ingress2qsirecon_single_subject_300719_wf.conform_dwi): new edge data: {'connect': [('dwi', 'dwi_in_file'), ('bvals', 'bval_in_file'), ('bvecs', 'bvec_in_file'), ('bids_dwi', 'dwi_out_file'), ('bids_bvals', 'bval_out_file'), ('bids_bvecs', 'bvec_out_file')]}
250807-08:19:14,26 nipype.workflow DEBUG:
	 (ingress2qsirecon_single_subject_300719_wf.conform_dwi, ingress2qsirecon_single_subject_300719_wf.create_bmatrix): No edge data
250807-08:19:14,26 nipype.workflow DEBUG:
	 (ingress2qsirecon_single_subject_300719_wf.conform_dwi, ingress2qsirecon_single_subject_300719_wf.create_bmatrix): new edge data: {'connect': [('bval_out_file', 'bvals_file'), ('bvec_out_file', 'bvecs_file')]}
250807-08:19:14,26 nipype.workflow DEBUG:
	 (ingress2qsirecon_single_subject_300719_wf.parse_layout_node, ingress2qsirecon_single_subject_300719_wf.create_bmatrix): No edge data
250807-08:19:14,26 nipype.workflow DEBUG:
	 (ingress2qsirecon_single_subject_300719_wf.parse_layout_node, ingress2qsirecon_single_subject_300719_wf.create_bmatrix): new edge data: {'connect': [('bids_bmtxt', 'bmtxt_file')]}
250807-08:19:14,26 nipype.workflow DEBUG:
	 (ingress2qsirecon_single_subject_300719_wf.conform_dwi, ingress2qsirecon_single_subject_300719_wf.create_bfile): No edge data
250807-08:19:14,26 nipype.workflow DEBUG:
	 (ingress2qsirecon_single_subject_300719_wf.conform_dwi, ingress2qsirecon_single_subject_300719_wf.create_bfile): new edge data: {'connect': [('bval_out_file', 'bval_file'), ('bvec_out_file', 'bvec_file')]}
250807-08:19:14,26 nipype.workflow DEBUG:
	 (ingress2qsirecon_single_subject_300719_wf.parse_layout_node, ingress2qsirecon_single_subject_300719_wf.create_bfile): No edge data
250807-08:19:14,26 nipype.workflow DEBUG:
	 (ingress2qsirecon_single_subject_300719_wf.parse_layout_node, ingress2qsirecon_single_subject_300719_wf.create_bfile): new edge data: {'connect': [('bids_b', 'b_file_out')]}
250807-08:19:14,27 nipype.workflow DEBUG:
	 (ingress2qsirecon_single_subject_300719_wf.parse_layout_node, ingress2qsirecon_single_subject_300719_wf.template_dimensions): No edge data
250807-08:19:14,27 nipype.workflow DEBUG:
	 (ingress2qsirecon_single_subject_300719_wf.parse_layout_node, ingress2qsirecon_single_subject_300719_wf.template_dimensions): new edge data: {'connect': [('t1w_brain', 't1w_list')]}
250807-08:19:14,27 nipype.workflow DEBUG:
	 (ingress2qsirecon_single_subject_300719_wf.template_dimensions, ingress2qsirecon_single_subject_300719_wf.conform_t1w): No edge data
250807-08:19:14,27 nipype.workflow DEBUG:
	 (ingress2qsirecon_single_subject_300719_wf.template_dimensions, ingress2qsirecon_single_subject_300719_wf.conform_t1w): new edge data: {'connect': [('target_shape', 'target_shape'), ('target_zooms', 'target_zooms')]}
250807-08:19:14,27 nipype.workflow DEBUG:
	 (ingress2qsirecon_single_subject_300719_wf.parse_layout_node, ingress2qsirecon_single_subject_300719_wf.conform_t1w): No edge data
250807-08:19:14,27 nipype.workflow DEBUG:
	 (ingress2qsirecon_single_subject_300719_wf.parse_layout_node, ingress2qsirecon_single_subject_300719_wf.conform_t1w): new edge data: {'connect': [('t1w_brain', 'in_file'), ('bids_t1w_brain', 'out_file')]}
250807-08:19:14,27 nipype.workflow DEBUG:
	 (ingress2qsirecon_single_subject_300719_wf.parse_layout_node, ingress2qsirecon_single_subject_300719_wf.conform_mask): No edge data
250807-08:19:14,27 nipype.workflow DEBUG:
	 (ingress2qsirecon_single_subject_300719_wf.parse_layout_node, ingress2qsirecon_single_subject_300719_wf.conform_mask): new edge data: {'connect': [('brain_mask', 'in_file'), ('bids_brain_mask', 'out_file')]}
250807-08:19:14,27 nipype.workflow DEBUG:
	 (ingress2qsirecon_single_subject_300719_wf.template_dimensions, ingress2qsirecon_single_subject_300719_wf.conform_mask): No edge data
250807-08:19:14,27 nipype.workflow DEBUG:
	 (ingress2qsirecon_single_subject_300719_wf.template_dimensions, ingress2qsirecon_single_subject_300719_wf.conform_mask): new edge data: {'connect': [('target_shape', 'target_shape'), ('target_zooms', 'target_zooms')]}
250807-08:19:14,28 nipype.workflow DEBUG:
	 (ingress2qsirecon_single_subject_300719_wf.parse_layout_node, ingress2qsirecon_single_subject_300719_wf.create_dwiref): No edge data
250807-08:19:14,28 nipype.workflow DEBUG:
	 (ingress2qsirecon_single_subject_300719_wf.parse_layout_node, ingress2qsirecon_single_subject_300719_wf.create_dwiref): new edge data: {'connect': [('bvals', 'bval_file'), ('bids_dwi', 'dwi_series'), ('bids_dwiref', 'b0_average')]}
250807-08:19:14,50 nipype.workflow DEBUG:
	 (ingress2qsirecon_single_subject_300719_wf.conform_t1w, ingress2qsirecon_single_subject_300719_wf.anat_nlin_normalization): No edge data
250807-08:19:14,50 nipype.workflow DEBUG:
	 (ingress2qsirecon_single_subject_300719_wf.conform_t1w, ingress2qsirecon_single_subject_300719_wf.anat_nlin_normalization): new edge data: {'connect': [('out_file', 'moving_image')]}
250807-08:19:14,50 nipype.workflow DEBUG:
	 (ingress2qsirecon_single_subject_300719_wf.conform_mask, ingress2qsirecon_single_subject_300719_wf.anat_nlin_normalization): No edge data
250807-08:19:14,50 nipype.workflow DEBUG:
	 (ingress2qsirecon_single_subject_300719_wf.conform_mask, ingress2qsirecon_single_subject_300719_wf.anat_nlin_normalization): new edge data: {'connect': [('out_file', 'moving_mask')]}
250807-08:19:14,50 nipype.workflow DEBUG:
	 (ingress2qsirecon_single_subject_300719_wf.anat_nlin_normalization, ingress2qsirecon_single_subject_300719_wf.save_outputs_node): No edge data
250807-08:19:14,50 nipype.workflow DEBUG:
	 (ingress2qsirecon_single_subject_300719_wf.anat_nlin_normalization, ingress2qsirecon_single_subject_300719_wf.save_outputs_node): new edge data: {'connect': [('composite_transform', 'to_template_nonlinear_transform_in'), ('inverse_composite_transform', 'from_template_nonlinear_transform_in')]}
250807-08:19:14,74 nipype.workflow DEBUG:
	 Creating flat graph for workflow: ingress2qsirecon_wf
250807-08:19:14,76 nipype.workflow DEBUG:
	 expanding workflow: ingress2qsirecon_wf
250807-08:19:14,76 nipype.workflow DEBUG:
	 processing node: ingress2qsirecon_wf.ingress2qsirecon_single_subject_300719_wf
250807-08:19:14,76 nipype.workflow DEBUG:
	 expanding workflow: ingress2qsirecon_wf.ingress2qsirecon_single_subject_300719_wf
250807-08:19:14,76 nipype.workflow DEBUG:
	 processing node: ingress2qsirecon_single_subject_300719_wf.input_node
250807-08:19:14,77 nipype.workflow DEBUG:
	 processing node: ingress2qsirecon_single_subject_300719_wf.parse_layout_node
250807-08:19:14,77 nipype.workflow DEBUG:
	 processing node: ingress2qsirecon_single_subject_300719_wf.conform_dwi
250807-08:19:14,77 nipype.workflow DEBUG:
	 processing node: ingress2qsirecon_single_subject_300719_wf.create_bmatrix
250807-08:19:14,77 nipype.workflow DEBUG:
	 processing node: ingress2qsirecon_single_subject_300719_wf.create_bfile
250807-08:19:14,77 nipype.workflow DEBUG:
	 processing node: ingress2qsirecon_single_subject_300719_wf.template_dimensions
250807-08:19:14,77 nipype.workflow DEBUG:
	 processing node: ingress2qsirecon_single_subject_300719_wf.conform_t1w
250807-08:19:14,77 nipype.workflow DEBUG:
	 processing node: ingress2qsirecon_single_subject_300719_wf.conform_mask
250807-08:19:14,77 nipype.workflow DEBUG:
	 processing node: ingress2qsirecon_single_subject_300719_wf.create_dwiref
250807-08:19:14,77 nipype.workflow DEBUG:
	 processing node: ingress2qsirecon_single_subject_300719_wf.anat_nlin_normalization
250807-08:19:14,77 nipype.workflow DEBUG:
	 processing node: ingress2qsirecon_single_subject_300719_wf.save_outputs_node
250807-08:19:14,77 nipype.workflow DEBUG:
	 finished expanding workflow: ingress2qsirecon_wf.ingress2qsirecon_single_subject_300719_wf
250807-08:19:14,77 nipype.workflow DEBUG:
	 finished expanding workflow: ingress2qsirecon_wf
250807-08:19:14,77 nipype.workflow INFO:
	 Workflow ingress2qsirecon_wf settings: ['check', 'execution', 'logging', 'monitoring']
250807-08:19:14,79 nipype.workflow DEBUG:
	 PE: expanding iterables
250807-08:19:14,79 nipype.workflow DEBUG:
	 [Node] parse_layout_node - setting input subject_layout = {'original_name': '300719', 'subject': '300719', 'session': None, 'path': PosixPath('/gscratch/escience/mphagen/connectivity-processing/data/human-connectome-project-openaccess/HCP1200/300719'), 'bids_base': PosixPath('/gscratch/scrubbed/mphagen/300719/bids/sub-300719'), 'MNI_template': 

Middle of log truncated because of character limits.

 [Node] Finished "ds_report_odfs", elapsed time 0.432812s.
250807-09:27:07,159 nipype.workflow DEBUG:
	 Needed files: /gscratch/escience/mphagen/connectivity-processing/data/human-connectome-project-openaccess/HCP1200/derivatives/qsirecon/derivatives/qsirecon-MRtrix3_act-HSVS/sub-300719/figures/sub-300719_space-T1w_desc-wmFOD_odfs.png;/gscratch/scrubbed/mphagen/300719/qsirecon_1_1_wf/sub-300719_mrtrix_multishell_msmt_hsvs/sub_300719_space_T1w_desc_preproc_recon_wf/msmt_csd/ds_report_odfs/_0x2a20698d9cf3cc5d79d1f4dd257735c8_unfinished.json;/gscratch/scrubbed/mphagen/300719/qsirecon_1_1_wf/sub-300719_mrtrix_multishell_msmt_hsvs/sub_300719_space_T1w_desc_preproc_recon_wf/msmt_csd/ds_report_odfs/_inputs.pklz;/gscratch/scrubbed/mphagen/300719/qsirecon_1_1_wf/sub-300719_mrtrix_multishell_msmt_hsvs/sub_300719_space_T1w_desc_preproc_recon_wf/msmt_csd/ds_report_odfs/_node.pklz
250807-09:27:07,159 nipype.workflow DEBUG:
	 Needed dirs: /gscratch/scrubbed/mphagen/300719/qsirecon_1_1_wf/sub-300719_mrtrix_multishell_msmt_hsvs/sub_300719_space_T1w_desc_preproc_recon_wf/msmt_csd/ds_report_odfs/_report
250807-09:27:07,159 nipype.workflow DEBUG:
	 Removing files: 
250807-09:27:07,160 nipype.workflow DEBUG:
	 Saving results file: '/gscratch/scrubbed/mphagen/300719/qsirecon_1_1_wf/sub-300719_mrtrix_multishell_msmt_hsvs/sub_300719_space_T1w_desc_preproc_recon_wf/msmt_csd/ds_report_odfs/result_ds_report_odfs.pklz'
250807-09:27:07,161 nipype.workflow DEBUG:
	 [Node] Writing post-exec report to "/gscratch/scrubbed/mphagen/300719/qsirecon_1_1_wf/sub-300719_mrtrix_multishell_msmt_hsvs/sub_300719_space_T1w_desc_preproc_recon_wf/msmt_csd/ds_report_odfs/_report/report.rst"
250807-09:27:07,162 nipype.workflow INFO:
	 [Job 48] Completed (qsirecon_1_1_wf.sub-300719_mrtrix_multishell_msmt_hsvs.sub_300719_space_T1w_desc_preproc_recon_wf.msmt_csd.ds_report_odfs).
250807-09:27:07,268 nipype.workflow DEBUG:
	 Progress: 58 jobs, 48/1/0 (done/running/ready), 1/9 (pending_tasks/waiting).
250807-09:27:07,268 nipype.workflow DEBUG:
	 Tasks currently running: 1. Pending: 1.
250807-09:27:07,269 nipype.workflow INFO:
	 [MultiProc] Running 1 tasks, and 0 jobs ready. Free memory (GB): 169.50/169.70, Free processors: 8/16.
                     Currently running:
                       * qsirecon_1_1_wf.sub-300719_mrtrix_multishell_msmt_hsvs.sub_300719_space_T1w_desc_preproc_recon_wf.track_ifod2.tractography


Hi @mckenziephagen,

It should recognize working directories, but not every process can be resumed from where it left off, and track_ifod2.tractography is one that cannot. ingress2qsirecon also does not get cached, but if your temporary space contains a successful ingress2qsirecon directory, you can pass that in as your input directory with the default “qsiprep” input type (instead of hcpya), since the file organization has already been converted. Or is there another process you are worried is not being cached properly?
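
A minimal sketch of that suggestion for a requeued job follows. It assumes the ingressed data sits at `<work_dir>/bids/sub-<subject_id>`, which is what the `bids_base` path in the log above suggests, and it treats the mere existence of that directory as a rough proxy for a successful ingress (it does not verify that the conversion finished). The function name `select_qsirecon_input` and the two `QSIRECON_*` variables are hypothetical and exist only for this example.

```shell
#!/bin/bash
# Hypothetical helper: choose the input directory and --input-type for this attempt.
# Sets QSIRECON_INPUT_DIR and QSIRECON_INPUT_TYPE as shell variables.
select_qsirecon_input() {
    local work_dir="$1" subject_id="$2" raw_input_dir="$3"
    local ingressed_dir="${work_dir}/bids"

    if [ -d "${ingressed_dir}/sub-${subject_id}" ]; then
        # An earlier attempt already converted the HCP-YA layout: reuse it.
        QSIRECON_INPUT_DIR="${ingressed_dir}"
        QSIRECON_INPUT_TYPE="qsiprep"
    else
        # First attempt (or no ingressed data found): start from the raw data.
        QSIRECON_INPUT_DIR="${raw_input_dir}"
        QSIRECON_INPUT_TYPE="hcpya"
    fi
}

# Example call inside the sbatch script, using the variables defined there:
#   select_qsirecon_input "/gscratch/scrubbed/mphagen/${subject_id}" "$subject_id" "$input_dir"
#   apptainer run ... $qsirecon_container \
#       "$QSIRECON_INPUT_DIR" ${output_dir} participant \
#       --input-type "$QSIRECON_INPUT_TYPE" ...
```

The remaining arguments, including `--fs-subjects-dir` and the `-w` working directory, would stay as in the original command, and the working directory must still be bound into the container so the ingressed data is visible.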

Best,
Steven

Okay, after some troubleshooting, the tractography ended up being the bottleneck: it wasn’t completing within four hours, even with the intermediary BIDS directory set as the input for re-queued jobs, so my jobs just kept getting stuck in an endless loop. I tried adding --omp-nthreads 14, and now they’re speeding through in 2.5 hours and avoiding the whole issue.

FWIW, I also tried --n-cpus=32 with --omp-nthreads 30 on two subjects as a test, and those also finished in 2.5 hours, so at least on my cluster, with this data and this version of QSIRecon, it seems like n-cpus=16 might be the point of diminishing returns.
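
For anyone adapting this, here is a rough sketch of deriving the two thread settings from the Slurm allocation rather than hard-coding them. Both settings tried above leave two CPUs free (14 of 16, 30 of 32); generalizing that to "allocated CPUs minus two" is an assumption of this sketch, not a documented QSIRecon recommendation, and `omp_threads_for` is a hypothetical helper.

```shell
#!/bin/bash
# Hypothetical helper: threads per process = allocated CPUs minus a small reserve.
# The default reserve of 2 mirrors the 16/14 and 32/30 settings tried above.
omp_threads_for() {
    local n_cpus="$1" reserve="${2:-2}"
    local threads=$(( n_cpus - reserve ))
    if [ "$threads" -lt 1 ]; then
        threads=1    # never request fewer than one thread
    fi
    echo "$threads"
}

# SLURM_CPUS_PER_TASK is set by Slurm when --cpus-per-task is given;
# fall back to 16 (the value used in this thread) when running outside a job.
n_cpus="${SLURM_CPUS_PER_TASK:-16}"
omp_nthreads="$(omp_threads_for "$n_cpus")"

# These two values would replace the hard-coded flags in the apptainer call.
echo "--n-cpus ${n_cpus} --omp-nthreads ${omp_nthreads}"
```

Whether more threads keep helping depends on the data and the cluster, so it is worth timing one or two subjects before changing the allocation for a whole array.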