fMRIPrep hangs during recon-all

Summary of what happened:

Hi,
I am trying to run fMRIPrep on an HPC and everything seems to be working well, but it keeps hanging on the FreeSurfer processes. It doesn’t exit, it just hangs.

Command used (and if a helper script was used, a link to the helper script or the command generated):

Here is my setup…


#!/bin/bash
#
#-----------------------------------------------------------------------------
#
#SBATCH -J lastest		      						# Job name
#SBATCH -o 128_max_tempflow_fMRIprep_.%j 						    # Name of stdout output file (%j expands to jobId)
#SBATCH -p normal                           		# Queue name
#SBATCH -N 1                                		# Total number of nodes requested (68 cores/node)
#SBATCH -n 1                                		# Total number of mpi tasks requested
#SBATCH -t 48:00:00                           		# Run time (hh:mm:ss)
#SBATCH -A IBN22006      	                  		# allocation to run under

# created by Jennifer May, 2023
# -----------------------------------------------------------------------------------------
SLURM_CPUS_PER_TASK=64  #Cores = 128 (64 cores / socket)
SLURM_MEM_PER_NODE=128000 #total is 128 GB of RAM per node
TEMPLATEFLOW_HOST_HOME=/home1/06953/jes6785/.cache/templateflow



# location and user inputs - make sure this is up to date for your computer
#----------------------------------------------------------------------------------
module load tacc-apptainer
# DIRECTORY LOCATIONS 
BIDS_DIR=/scratch/06953/jes6785/NECTARY_DATA/
OUTPUT_DIR=${BIDS_DIR}derivatives/fmriprep-v23.0.2/



# delete any stale FreeSurfer IsRunning files (left behind by an interrupted job) so recon-all can resume
find ${OUTPUT_DIR}sourcedata/freesurfer/sub-B043/ -name "*IsRunning*" -type f -delete
unset PYTHONPATH;
apptainer run -B /scratch/06953/jes6785/NECTARY_DATA/:/scratch/06953/jes6785/NECTARY_DATA/ \
-B /scratch/06953/jes6785/working_dir/:/scratch/06953/jes6785/working_dir/ \
-B /scratch/06953/jes6785/NECTARY_DATA/derivatives/fmriprep-v23.0.2/code:/scratch/06953/jes6785/NECTARY_DATA/derivatives/fmriprep-v23.0.2/code \
-B /home1/06953/jes6785/.cache/templateflow:/opt/templateflow --cleanenv \
/work/06953/jes6785/Containers/fmriprep_23.0.2.sif \
/scratch/06953/jes6785/NECTARY_DATA/ \
/scratch/06953/jes6785/NECTARY_DATA/derivatives/fmriprep-v23.0.2/ \
participant --participant-label B043 -w /scratch/06953/jes6785/working_dir/ \
--fs-license-file /scratch/06953/jes6785/NECTARY_DATA/derivatives/fmriprep-v23.0.2/code/license_2.txt \
--skip_bids_validation -vvv --nprocs $SLURM_CPUS_PER_TASK \
--mem_mb $SLURM_MEM_PER_NODE \
--bids-filter-file /scratch/06953/jes6785/NECTARY_DATA/derivatives/fmriprep-v23.0.2/code/ses-01_bf.json

Version:

23.0.2

Environment (Docker, Singularity, custom installation):

Singularity/Apptainer

Data formatted according to a validatable standard? Please provide the output of the validator:

Relevant log outputs (up to 20 lines):

This is the output I keep getting over and over again…

[Node] Up-to-date cache found for "fmriprep_23_0_wf.single_subject_B043_wf.func_preproc_ses_01_task_cyb_dir_AP_wf.bold_confounds_wf.tcc_metadata_fmt".
230515-09:23:04,909 nipype.workflow DEBUG:
         Checking hash "fmriprep_23_0_wf.single_subject_B043_wf.func_preproc_ses_01_task_cyb_dir_AP_wf.bold_confounds_wf.tcc_metadata_fmt" locally: cached=True, updated=True.
230515-09:23:04,909 nipype.workflow DEBUG:
         Skipping cached node fmriprep_23_0_wf.single_subject_B043_wf.func_preproc_ses_01_task_cyb_dir_AP_wf.bold_confounds_wf.tcc_metadata_fmt with ID 195.
230515-09:23:04,910 nipype.workflow INFO:
         [Job 195] Cached (fmriprep_23_0_wf.single_subject_B043_wf.func_preproc_ses_01_task_cyb_dir_AP_wf.bold_confounds_wf.tcc_metadata_fmt).
230515-09:23:06,803 nipype.workflow DEBUG:
         Progress: 376 jobs, 199/1/0 (done/running/ready), 1/176 (pending_tasks/waiting).
230515-09:23:06,804 nipype.workflow DEBUG:
         Tasks currently running: 1. Pending: 1.
230515-09:23:06,806 nipype.workflow INFO:
         [MultiProc] Running 1 tasks, and 0 jobs ready. Free memory (GB): 123.00/128.00, Free processors: 56/64.
                     Currently running:
                       * fmriprep_23_0_wf.single_subject_B043_wf.anat_preproc_wf.surface_recon_wf.autorecon_resume_wf.autorecon2_vol

Screenshots / relevant information:

How can I fix this? Please help.

What I have tried…

I’ve tried adjusting --nprocs $SLURM_CPUS_PER_TASK and --mem_mb $SLURM_MEM_PER_NODE, but this doesn’t seem to resolve the problem. What am I missing here?

Hi @AustinBipolar,

I have relabeled your post as Software Support, formatted your text as code (using the </> button in the text editor) and added in the prepopulated template. For future software-related questions, please post under this category and format code/log outputs accordingly.

Looks like you are only letting your jobs run for 20 minutes.

This is a bit redundant - since you are not renaming the paths inside the container, you can simply use -B /scratch/06953/jes6785/NECTARY_DATA/ (and the same for your other mounts).
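
For example, something like this should be equivalent (untested sketch using the paths from your script; the TemplateFlow mount keeps the src:dest form because it is renamed inside the container, and the code directory is already covered by the NECTARY_DATA mount):

apptainer run --cleanenv \
  -B /scratch/06953/jes6785/NECTARY_DATA/ \
  -B /scratch/06953/jes6785/working_dir/ \
  -B /home1/06953/jes6785/.cache/templateflow:/opt/templateflow \
  /work/06953/jes6785/Containers/fmriprep_23.0.2.sif \
  /scratch/06953/jes6785/NECTARY_DATA/ \
  /scratch/06953/jes6785/NECTARY_DATA/derivatives/fmriprep-v23.0.2/ \
  participant --participant-label B043 -w /scratch/06953/jes6785/working_dir/ \
  --fs-license-file /scratch/06953/jes6785/NECTARY_DATA/derivatives/fmriprep-v23.0.2/code/license_2.txt \
  --skip_bids_validation -vvv --nprocs $SLURM_CPUS_PER_TASK --mem_mb $SLURM_MEM_PER_NODE \
  --bids-filter-file /scratch/06953/jes6785/NECTARY_DATA/derivatives/fmriprep-v23.0.2/code/ses-01_bf.json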

For what it’s worth, what has seemed to work well for us is the following:

#SBATCH --mem=45GB
#SBATCH --cpus-per-task=16

paired with these fMRIPrep settings: --mem_mb 40000 --nprocs 16 --omp-nthreads 8
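
Wired into the job script above, that would look something like this (sketch; OMP_NTHREADS is just a new helper variable for the extra flag):

# match the SBATCH request (--mem=45GB, --cpus-per-task=16) with slightly lower
# limits handed to fMRIPrep, and cap per-process threads with --omp-nthreads
SLURM_CPUS_PER_TASK=16
SLURM_MEM_PER_NODE=40000   # MB for --mem_mb, a bit under the 45 GB SLURM request
OMP_NTHREADS=8

# ...and in the apptainer call:
#   --nprocs $SLURM_CPUS_PER_TASK --mem_mb $SLURM_MEM_PER_NODE --omp-nthreads $OMP_NTHREADS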

Best,
Steven

I had 00:20:00 listed because the job hangs every time (it previously sat for the full 48 hours, and I was having trouble getting jobs to start so I could check changes). I have updated the script above to reflect the wall time I normally request.

Also, thank you for the suggestions. I tried the suggested changes to #SBATCH --mem and --cpus-per-task, and to --mem_mb, --nprocs, and --omp-nthreads, and it is still hanging. You can see that the free memory and free processors have changed accordingly in the log output from the hang…

Skipping cached node fmriprep_23_0_wf.single_subject_B043_wf.func_preproc_ses_01_task_cyb_dir_AP_wf.bold_confounds_wf.tcc_metadata_fmt with ID 195.
230515-11:11:39,740 nipype.workflow INFO:
         [Job 195] Cached (fmriprep_23_0_wf.single_subject_B043_wf.func_preproc_ses_01_task_cyb_dir_AP_wf.bold_confounds_wf.tcc_metadata_fmt).
230515-11:11:41,622 nipype.workflow DEBUG:
         Progress: 376 jobs, 199/1/0 (done/running/ready), 1/176 (pending_tasks/waiting).
230515-11:11:41,623 nipype.workflow DEBUG:
         Tasks currently running: 1. Pending: 1.
230515-11:11:41,624 nipype.workflow INFO:
         [MultiProc] Running 1 tasks, and 0 jobs ready. Free memory (GB): 35.00/40.00, Free processors: 8/16.
                     Currently running:
                       * fmriprep_23_0_wf.single_subject_B043_wf.anat_preproc_wf.surface_recon_wf.autorecon_resume_wf.autorecon2_vol

Is this error subject-specific, or does it happen for every subject? What if you try running recon-all separately from fMRIPrep and use those precomputed outputs? Just for testing purposes (I do not recommend analyzing any outputs from this), you can also try skipping recon-all (--fs-no-reconall).
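
If you do try the standalone route, the call would be roughly along these lines (sketch only; FreeSurfer must be available on the node, the T1w filename is a guess at your BIDS layout, and SUBJECTS_DIR points at the same FreeSurfer directory your job script already cleans up):

# run recon-all for one subject outside fMRIPrep
export SUBJECTS_DIR=/scratch/06953/jes6785/NECTARY_DATA/derivatives/fmriprep-v23.0.2/sourcedata/freesurfer
recon-all -subjid sub-B043 \
  -i /scratch/06953/jes6785/NECTARY_DATA/sub-B043/ses-01/anat/sub-B043_ses-01_T1w.nii.gz \
  -all -openmp 8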

It ran for 2 participants last week, but has been broken ever since for every participant I try to run. Here is the exact fMRIPrep command that I ran through apptainer/singularity (the run that worked):

/opt/conda/bin/fmriprep /data/ /data/derivatives/fmriprep-v23.0.2/ participant \
  --participant-label A073 \
  --fs-license-file /data/derivatives/fmriprep-v23.0.2/code/license_2.txt \
  --skip_bids_validation -w /workd/ -vvv \
  --mem-mb 50000 --n-cpus 8 \
  --bids-filter-file /data/derivatives/fmriprep-v23.0.2/code/session01_bf.json

When I run the same script with --fs-no-reconall, it does seem to progress, and it made it as far as…

230515-12:34:13,317 nipype.workflow INFO:
     [MultiProc] Running 2 tasks, and 0 jobs ready. Free memory (GB): 39.60/40.00, Free processors: 14/16.
                 Currently running:
                   * fmriprep_23_0_wf.single_subject_A087_wf.func_preproc_ses_01_task_cyb_dir_AP_wf.bold_std_trans_wf.bold_reference_wf.enhance_and_skullstrip_bold_wf.unifize
                   * fmriprep_23_0_wf.single_subject_A087_wf.func_preproc_ses_01_task_cyb_dir_AP_wf.bold_t1_trans_wf.bold_reference_wf.enhance_and_skullstrip_bold_wf.unifize.

I am currently running recon-all and will let you know how easy it is to pull into fMRIPrep. :frowning:

recon-all seems to work when I run it outside of fMRIPrep. Would you recommend doing the preprocessing that way? @Steven

Do whatever works for you! No idea why it’s not working in fMRIPrep, but if it works outside then that seems like a good alternative.
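
If you go that route and want fMRIPrep to pick up the external recon-all results rather than redoing them, --fs-subjects-dir is the flag for that. Roughly (sketch; same mounts and flags as your original command, with the subjects directory path as an example):

apptainer run --cleanenv \
  -B /scratch/06953/jes6785/NECTARY_DATA/ \
  -B /scratch/06953/jes6785/working_dir/ \
  -B /home1/06953/jes6785/.cache/templateflow:/opt/templateflow \
  /work/06953/jes6785/Containers/fmriprep_23.0.2.sif \
  /scratch/06953/jes6785/NECTARY_DATA/ \
  /scratch/06953/jes6785/NECTARY_DATA/derivatives/fmriprep-v23.0.2/ \
  participant --participant-label B043 -w /scratch/06953/jes6785/working_dir/ \
  --fs-license-file /scratch/06953/jes6785/NECTARY_DATA/derivatives/fmriprep-v23.0.2/code/license_2.txt \
  --fs-subjects-dir /scratch/06953/jes6785/NECTARY_DATA/derivatives/fmriprep-v23.0.2/sourcedata/freesurfer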
