Summary of what happened:
Hi! I’m running QSIRecon on a cluster where jobs in a certain partition are pre-empted every four hours. They get canceled and requeued. I had thought that QSIRecon would find the work directory and pick up from where the previous job had left off, but that doesn’t seem to be the case. Is QSIRecon supposed to recognize old work directories, or am I doing something wrong?
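A quick check that could be run before the requeued job starts is to count the Nipype result files already in the work directory, which shows whether the earlier job's cached outputs survived the pre-emption. This is a minimal sketch, not a QSIRecon feature: `count_cached_results` is a hypothetical helper name, and it assumes finished nodes leave `result_*.pklz` files in their node directories (as in the `result_ds_report_odfs.pklz` path in the log further down).

```bash
#!/bin/bash
# Hypothetical helper: count Nipype result files under a work directory.
# Assumption: each finished node leaves a result_<node>.pklz file behind.
count_cached_results() {
    local work_dir="$1"
    # A missing directory means nothing is cached.
    [ -d "$work_dir" ] || { echo 0; return; }
    # tr strips the padding some wc implementations add.
    find "$work_dir" -type f -name 'result_*.pklz' | wc -l | tr -d ' '
}

# Example use in the sbatch script, before the apptainer call:
# work_dir="/gscratch/scrubbed/mphagen/${subject_id}"
# echo "Cached node results: $(count_cached_results "$work_dir")"
```

A count of 0 on a requeued run would suggest the work directory was removed or is not the one being passed with `-w`; a non-zero count would suggest the cache is there but is not being reused.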
Command used (and if a helper script was used, a link to the helper script or the command generated):
apptainer run -e -B ${input_dir} -B ${fs_license_path} -B /gscratch/scrubbed/mphagen $qsirecon_container \
$input_dir ${output_dir} participant \
--fs-subjects-dir "${input_dir}/${subject_id}/T1w" \
--participant-label $subject_id \
--input-type hcpya \
--atlases 4S156Parcels \
--stop-on-first-crash \
--resource-monitor \
--n-cpus 16 \
-w "/gscratch/scrubbed/mphagen/${subject_id}" \
--recon-spec mrtrix_multishell_msmt_ACT-hsvs \
--fs-license-file "${fs_license_path}/fs_license.txt" \
-vvv
Version:
qsirecon_1.1.0
Environment (Docker, Singularity / Apptainer, custom installation):
Apptainer
Screenshots / relevant information:
Full sbatch
#!/bin/bash
#SBATCH --job-name=qsirecon
#SBATCH --mail-type=END
#SBATCH --mail-user=mphagen@uw.edu
#SBATCH --mem=30G
#SBATCH --account=stf
#SBATCH --partition=ckpt
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --time=28:00:00 # Max runtime in DD-HH:MM:SS format.
#SBATCH --output=qsirecon_logs/%j_%a.out # where STDOUT goes
#SBATCH --error=qsirecon_logs/%j_%a.error # where STDERR goes
#SBATCH --export=NONE
#SBATCH --array=45-55
eval "$(/gscratch/escience/mphagen/miniforge/bin/conda shell.bash hook)"
# Your programs to run.
#Print current script for debugging
cat $0
#Activate conda environment
conda activate datalad_env
#Define paths and variables
fs_license_path=/gscratch/escience/mphagen/connectivity-processing/code
qsirecon_container="/gscratch/escience/gkolpin/qsirecon_1.1.0.sif"
input_dir="/gscratch/escience/mphagen/connectivity-processing/data/human-connectome-project-openaccess/HCP1200"
output_dir=${input_dir}/derivatives/qsirecon
subject_file="/gscratch/escience/mphagen/connectivity-processing/code/test_subjects.txt"
subject_id=$( sed -n ${SLURM_ARRAY_TASK_ID}p $subject_file )
echo $subject_id
#Get our data
bash datalad_get.sh $input_dir $subject_id
#Run QSIRecon
apptainer run -e -B ${input_dir} -B ${fs_license_path} -B /gscratch/scrubbed/mphagen $qsirecon_container \
$input_dir ${output_dir} participant \
--fs-subjects-dir "${input_dir}/${subject_id}/T1w" \
--participant-label $subject_id \
--input-type hcpya \
--atlases 4S156Parcels \
--stop-on-first-crash \
--resource-monitor \
--n-cpus 16 \
-w "/gscratch/scrubbed/mphagen/${subject_id}" \
--recon-spec mrtrix_multishell_msmt_ACT-hsvs \
--fs-license-file "${fs_license_path}/fs_license.txt" \
-vvv
The sbatch logs from the original jobs unfortunately have been getting overwritten by the re-queued jobs, so I only have the re-queued logs.
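To keep the logs of each attempt in future runs, one option is to redirect output to a file whose name includes the requeue count. The sketch below assumes Slurm sets `SLURM_RESTART_COUNT` for requeued jobs and leaves it unset on the first attempt; `attempt_log_path` and the `_attemptN.log` naming are hypothetical choices for illustration.

```bash
#!/bin/bash
# Hypothetical helper: build a log path that is unique per requeue attempt.
attempt_log_path() {
    local log_dir="$1"
    # Assumption: SLURM_RESTART_COUNT is only set after a requeue, so default to 0.
    local attempt="${SLURM_RESTART_COUNT:-0}"
    # SLURM_JOB_ID and SLURM_ARRAY_TASK_ID mirror the %j and %a filename patterns.
    echo "${log_dir}/${SLURM_JOB_ID:-nojob}_${SLURM_ARRAY_TASK_ID:-0}_attempt${attempt}.log"
}

# Example use near the top of the sbatch script:
# exec > "$(attempt_log_path qsirecon_logs)" 2>&1
```

Alternatively, adding `#SBATCH --open-mode=append` should make the requeued job append to the existing `%j_%a` files instead of truncating them, depending on how the cluster is configured.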
Log from requeued job
250807-08:19:08,48 cli INFO:
Telemetry system to collect crashes and errors is enabled - thanks for your feedback! Use option ``--notrack`` to opt out.
Subject(s) to run: ['300719']
250807-08:19:14,26 nipype.workflow DEBUG:
(ingress2qsirecon_single_subject_300719_wf.input_node, ingress2qsirecon_single_subject_300719_wf.parse_layout_node): No edge data
250807-08:19:14,26 nipype.workflow DEBUG:
(ingress2qsirecon_single_subject_300719_wf.input_node, ingress2qsirecon_single_subject_300719_wf.parse_layout_node): new edge data: {'connect': [('subject_layout', 'subject_layout')]}
250807-08:19:14,26 nipype.workflow DEBUG:
(ingress2qsirecon_single_subject_300719_wf.parse_layout_node, ingress2qsirecon_single_subject_300719_wf.conform_dwi): No edge data
250807-08:19:14,26 nipype.workflow DEBUG:
(ingress2qsirecon_single_subject_300719_wf.parse_layout_node, ingress2qsirecon_single_subject_300719_wf.conform_dwi): new edge data: {'connect': [('dwi', 'dwi_in_file'), ('bvals', 'bval_in_file'), ('bvecs', 'bvec_in_file'), ('bids_dwi', 'dwi_out_file'), ('bids_bvals', 'bval_out_file'), ('bids_bvecs', 'bvec_out_file')]}
250807-08:19:14,26 nipype.workflow DEBUG:
(ingress2qsirecon_single_subject_300719_wf.conform_dwi, ingress2qsirecon_single_subject_300719_wf.create_bmatrix): No edge data
250807-08:19:14,26 nipype.workflow DEBUG:
(ingress2qsirecon_single_subject_300719_wf.conform_dwi, ingress2qsirecon_single_subject_300719_wf.create_bmatrix): new edge data: {'connect': [('bval_out_file', 'bvals_file'), ('bvec_out_file', 'bvecs_file')]}
250807-08:19:14,26 nipype.workflow DEBUG:
(ingress2qsirecon_single_subject_300719_wf.parse_layout_node, ingress2qsirecon_single_subject_300719_wf.create_bmatrix): No edge data
250807-08:19:14,26 nipype.workflow DEBUG:
(ingress2qsirecon_single_subject_300719_wf.parse_layout_node, ingress2qsirecon_single_subject_300719_wf.create_bmatrix): new edge data: {'connect': [('bids_bmtxt', 'bmtxt_file')]}
250807-08:19:14,26 nipype.workflow DEBUG:
(ingress2qsirecon_single_subject_300719_wf.conform_dwi, ingress2qsirecon_single_subject_300719_wf.create_bfile): No edge data
250807-08:19:14,26 nipype.workflow DEBUG:
(ingress2qsirecon_single_subject_300719_wf.conform_dwi, ingress2qsirecon_single_subject_300719_wf.create_bfile): new edge data: {'connect': [('bval_out_file', 'bval_file'), ('bvec_out_file', 'bvec_file')]}
250807-08:19:14,26 nipype.workflow DEBUG:
(ingress2qsirecon_single_subject_300719_wf.parse_layout_node, ingress2qsirecon_single_subject_300719_wf.create_bfile): No edge data
250807-08:19:14,26 nipype.workflow DEBUG:
(ingress2qsirecon_single_subject_300719_wf.parse_layout_node, ingress2qsirecon_single_subject_300719_wf.create_bfile): new edge data: {'connect': [('bids_b', 'b_file_out')]}
250807-08:19:14,27 nipype.workflow DEBUG:
(ingress2qsirecon_single_subject_300719_wf.parse_layout_node, ingress2qsirecon_single_subject_300719_wf.template_dimensions): No edge data
250807-08:19:14,27 nipype.workflow DEBUG:
(ingress2qsirecon_single_subject_300719_wf.parse_layout_node, ingress2qsirecon_single_subject_300719_wf.template_dimensions): new edge data: {'connect': [('t1w_brain', 't1w_list')]}
250807-08:19:14,27 nipype.workflow DEBUG:
(ingress2qsirecon_single_subject_300719_wf.template_dimensions, ingress2qsirecon_single_subject_300719_wf.conform_t1w): No edge data
250807-08:19:14,27 nipype.workflow DEBUG:
(ingress2qsirecon_single_subject_300719_wf.template_dimensions, ingress2qsirecon_single_subject_300719_wf.conform_t1w): new edge data: {'connect': [('target_shape', 'target_shape'), ('target_zooms', 'target_zooms')]}
250807-08:19:14,27 nipype.workflow DEBUG:
(ingress2qsirecon_single_subject_300719_wf.parse_layout_node, ingress2qsirecon_single_subject_300719_wf.conform_t1w): No edge data
250807-08:19:14,27 nipype.workflow DEBUG:
(ingress2qsirecon_single_subject_300719_wf.parse_layout_node, ingress2qsirecon_single_subject_300719_wf.conform_t1w): new edge data: {'connect': [('t1w_brain', 'in_file'), ('bids_t1w_brain', 'out_file')]}
250807-08:19:14,27 nipype.workflow DEBUG:
(ingress2qsirecon_single_subject_300719_wf.parse_layout_node, ingress2qsirecon_single_subject_300719_wf.conform_mask): No edge data
250807-08:19:14,27 nipype.workflow DEBUG:
(ingress2qsirecon_single_subject_300719_wf.parse_layout_node, ingress2qsirecon_single_subject_300719_wf.conform_mask): new edge data: {'connect': [('brain_mask', 'in_file'), ('bids_brain_mask', 'out_file')]}
250807-08:19:14,27 nipype.workflow DEBUG:
(ingress2qsirecon_single_subject_300719_wf.template_dimensions, ingress2qsirecon_single_subject_300719_wf.conform_mask): No edge data
250807-08:19:14,27 nipype.workflow DEBUG:
(ingress2qsirecon_single_subject_300719_wf.template_dimensions, ingress2qsirecon_single_subject_300719_wf.conform_mask): new edge data: {'connect': [('target_shape', 'target_shape'), ('target_zooms', 'target_zooms')]}
250807-08:19:14,28 nipype.workflow DEBUG:
(ingress2qsirecon_single_subject_300719_wf.parse_layout_node, ingress2qsirecon_single_subject_300719_wf.create_dwiref): No edge data
250807-08:19:14,28 nipype.workflow DEBUG:
(ingress2qsirecon_single_subject_300719_wf.parse_layout_node, ingress2qsirecon_single_subject_300719_wf.create_dwiref): new edge data: {'connect': [('bvals', 'bval_file'), ('bids_dwi', 'dwi_series'), ('bids_dwiref', 'b0_average')]}
250807-08:19:14,50 nipype.workflow DEBUG:
(ingress2qsirecon_single_subject_300719_wf.conform_t1w, ingress2qsirecon_single_subject_300719_wf.anat_nlin_normalization): No edge data
250807-08:19:14,50 nipype.workflow DEBUG:
(ingress2qsirecon_single_subject_300719_wf.conform_t1w, ingress2qsirecon_single_subject_300719_wf.anat_nlin_normalization): new edge data: {'connect': [('out_file', 'moving_image')]}
250807-08:19:14,50 nipype.workflow DEBUG:
(ingress2qsirecon_single_subject_300719_wf.conform_mask, ingress2qsirecon_single_subject_300719_wf.anat_nlin_normalization): No edge data
250807-08:19:14,50 nipype.workflow DEBUG:
(ingress2qsirecon_single_subject_300719_wf.conform_mask, ingress2qsirecon_single_subject_300719_wf.anat_nlin_normalization): new edge data: {'connect': [('out_file', 'moving_mask')]}
250807-08:19:14,50 nipype.workflow DEBUG:
(ingress2qsirecon_single_subject_300719_wf.anat_nlin_normalization, ingress2qsirecon_single_subject_300719_wf.save_outputs_node): No edge data
250807-08:19:14,50 nipype.workflow DEBUG:
(ingress2qsirecon_single_subject_300719_wf.anat_nlin_normalization, ingress2qsirecon_single_subject_300719_wf.save_outputs_node): new edge data: {'connect': [('composite_transform', 'to_template_nonlinear_transform_in'), ('inverse_composite_transform', 'from_template_nonlinear_transform_in')]}
250807-08:19:14,74 nipype.workflow DEBUG:
Creating flat graph for workflow: ingress2qsirecon_wf
250807-08:19:14,76 nipype.workflow DEBUG:
expanding workflow: ingress2qsirecon_wf
250807-08:19:14,76 nipype.workflow DEBUG:
processing node: ingress2qsirecon_wf.ingress2qsirecon_single_subject_300719_wf
250807-08:19:14,76 nipype.workflow DEBUG:
expanding workflow: ingress2qsirecon_wf.ingress2qsirecon_single_subject_300719_wf
250807-08:19:14,76 nipype.workflow DEBUG:
processing node: ingress2qsirecon_single_subject_300719_wf.input_node
250807-08:19:14,77 nipype.workflow DEBUG:
processing node: ingress2qsirecon_single_subject_300719_wf.parse_layout_node
250807-08:19:14,77 nipype.workflow DEBUG:
processing node: ingress2qsirecon_single_subject_300719_wf.conform_dwi
250807-08:19:14,77 nipype.workflow DEBUG:
processing node: ingress2qsirecon_single_subject_300719_wf.create_bmatrix
250807-08:19:14,77 nipype.workflow DEBUG:
processing node: ingress2qsirecon_single_subject_300719_wf.create_bfile
250807-08:19:14,77 nipype.workflow DEBUG:
processing node: ingress2qsirecon_single_subject_300719_wf.template_dimensions
250807-08:19:14,77 nipype.workflow DEBUG:
processing node: ingress2qsirecon_single_subject_300719_wf.conform_t1w
250807-08:19:14,77 nipype.workflow DEBUG:
processing node: ingress2qsirecon_single_subject_300719_wf.conform_mask
250807-08:19:14,77 nipype.workflow DEBUG:
processing node: ingress2qsirecon_single_subject_300719_wf.create_dwiref
250807-08:19:14,77 nipype.workflow DEBUG:
processing node: ingress2qsirecon_single_subject_300719_wf.anat_nlin_normalization
250807-08:19:14,77 nipype.workflow DEBUG:
processing node: ingress2qsirecon_single_subject_300719_wf.save_outputs_node
250807-08:19:14,77 nipype.workflow DEBUG:
finished expanding workflow: ingress2qsirecon_wf.ingress2qsirecon_single_subject_300719_wf
250807-08:19:14,77 nipype.workflow DEBUG:
finished expanding workflow: ingress2qsirecon_wf
250807-08:19:14,77 nipype.workflow INFO:
Workflow ingress2qsirecon_wf settings: ['check', 'execution', 'logging', 'monitoring']
250807-08:19:14,79 nipype.workflow DEBUG:
PE: expanding iterables
250807-08:19:14,79 nipype.workflow DEBUG:
[Node] parse_layout_node - setting input subject_layout = {'original_name': '300719', 'subject': '300719', 'session': None, 'path': PosixPath('/gscratch/escience/mphagen/connectivity-processing/data/human-connectome-project-openaccess/HCP1200/300719'), 'bids_base': PosixPath('/gscratch/scrubbed/mphagen/300719/bids/sub-300719'), 'MNI_template':
Middle of log truncated because of character limits.
[Node] Finished "ds_report_odfs", elapsed time 0.432812s.
250807-09:27:07,159 nipype.workflow DEBUG:
Needed files: /gscratch/escience/mphagen/connectivity-processing/data/human-connectome-project-openaccess/HCP1200/derivatives/qsirecon/derivatives/qsirecon-MRtrix3_act-HSVS/sub-300719/figures/sub-300719_space-T1w_desc-wmFOD_odfs.png;/gscratch/scrubbed/mphagen/300719/qsirecon_1_1_wf/sub-300719_mrtrix_multishell_msmt_hsvs/sub_300719_space_T1w_desc_preproc_recon_wf/msmt_csd/ds_report_odfs/_0x2a20698d9cf3cc5d79d1f4dd257735c8_unfinished.json;/gscratch/scrubbed/mphagen/300719/qsirecon_1_1_wf/sub-300719_mrtrix_multishell_msmt_hsvs/sub_300719_space_T1w_desc_preproc_recon_wf/msmt_csd/ds_report_odfs/_inputs.pklz;/gscratch/scrubbed/mphagen/300719/qsirecon_1_1_wf/sub-300719_mrtrix_multishell_msmt_hsvs/sub_300719_space_T1w_desc_preproc_recon_wf/msmt_csd/ds_report_odfs/_node.pklz
250807-09:27:07,159 nipype.workflow DEBUG:
Needed dirs: /gscratch/scrubbed/mphagen/300719/qsirecon_1_1_wf/sub-300719_mrtrix_multishell_msmt_hsvs/sub_300719_space_T1w_desc_preproc_recon_wf/msmt_csd/ds_report_odfs/_report
250807-09:27:07,159 nipype.workflow DEBUG:
Removing files:
250807-09:27:07,160 nipype.workflow DEBUG:
Saving results file: '/gscratch/scrubbed/mphagen/300719/qsirecon_1_1_wf/sub-300719_mrtrix_multishell_msmt_hsvs/sub_300719_space_T1w_desc_preproc_recon_wf/msmt_csd/ds_report_odfs/result_ds_report_odfs.pklz'
250807-09:27:07,161 nipype.workflow DEBUG:
[Node] Writing post-exec report to "/gscratch/scrubbed/mphagen/300719/qsirecon_1_1_wf/sub-300719_mrtrix_multishell_msmt_hsvs/sub_300719_space_T1w_desc_preproc_recon_wf/msmt_csd/ds_report_odfs/_report/report.rst"
250807-09:27:07,162 nipype.workflow INFO:
[Job 48] Completed (qsirecon_1_1_wf.sub-300719_mrtrix_multishell_msmt_hsvs.sub_300719_space_T1w_desc_preproc_recon_wf.msmt_csd.ds_report_odfs).
250807-09:27:07,268 nipype.workflow DEBUG:
Progress: 58 jobs, 48/1/0 (done/running/ready), 1/9 (pending_tasks/waiting).
250807-09:27:07,268 nipype.workflow DEBUG:
Tasks currently running: 1. Pending: 1.
250807-09:27:07,269 nipype.workflow INFO:
[MultiProc] Running 1 tasks, and 0 jobs ready. Free memory (GB): 169.50/169.70, Free processors: 8/16.
Currently running:
* qsirecon_1_1_wf.sub-300719_mrtrix_multishell_msmt_hsvs.sub_300719_space_T1w_desc_preproc_recon_wf.track_ifod2.tractography