fMRIPrep on HPC crashes on recon-all

Hi all,

I'm having some trouble running fMRIPrep 20.1.1 on an HPC cluster using Singularity (3.5.2): it crashes after ~25 min on autorecon1. I'm currently processing only one subject for testing purposes:

Crash Message:

Node:   fmriprep_wf.single_subject_aam063_wf.anat_preproc_wf.surface_recon_wf.autorecon1
Working directory: /working_dir/fmriprep_wf/single_subject_aam063_wf/anat_preproc_wf/surface_recon_wf/autorecon1

Node inputs:

FLAIR_file = <undefined>
T1_files = <undefined>
T2_file = <undefined>
args = <undefined>
big_ventricles = <undefined>
brainstem = <undefined>
directive = autorecon1
environ = {}
expert = <undefined>
flags = <undefined>
hemi = <undefined>
hippocampal_subfields_T1 = <undefined>
hippocampal_subfields_T2 = <undefined>
hires = <undefined>
mprage = <undefined>
mri_aparc2aseg = <undefined>
mri_ca_label = <undefined>
mri_ca_normalize = <undefined>
mri_ca_register = <undefined>
mri_edit_wm_with_aseg = <undefined>
mri_em_register = <undefined>
mri_fill = <undefined>
mri_mask = <undefined>
mri_normalize = <undefined>
mri_pretess = <undefined>
mri_remove_neck = <undefined>
mri_segment = <undefined>
mri_segstats = <undefined>
mri_tessellate = <undefined>
mri_watershed = <undefined>
mris_anatomical_stats = <undefined>
mris_ca_label = <undefined>
mris_fix_topology = <undefined>
mris_inflate = <undefined>
mris_make_surfaces = <undefined>
mris_register = <undefined>
mris_smooth = <undefined>
mris_sphere = <undefined>
mris_surf2vol = <undefined>
mrisp_paint = <undefined>
openmp = 8
parallel = <undefined>
steps = <undefined>
subject_id = recon_all
subjects_dir = <undefined>
talairach = <undefined>
use_FLAIR = <undefined>
use_T2 = <undefined>
xopts = <undefined>

Traceback (most recent call last):
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/plugins/multiproc.py", line 67, in run_node
    result["result"] = node.run(updatehash=updatehash)
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 516, in run
    result = self._run_interface(execute=True)
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 635, in _run_interface
    return self._run_command(execute)
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 741, in _run_command
    result = self._interface.run(cwd=outdir)
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/interfaces/base/core.py", line 397, in run
    runtime = self._run_interface(runtime)
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/interfaces/base/core.py", line 792, in _run_interface
    self.raise_exception(runtime)
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/interfaces/base/core.py", line 723, in raise_exception
    ).format(**runtime.dictcopy())
RuntimeError: Command:
recon-all -autorecon1 -i /data/sub-aam063/anat/sub-aam063_T1w.nii -noskullstrip -openmp 8 -subjid sub-aam063 -sd /output/freesurfer
Standard output:

Standard error:
/home/fmriprep/fmriprep_wf/single_subject_aam063_wf/anat_preproc_wf/surface_recon_wf/autorecon1: No such file or directory.
Return code: 1

The additional slurm_out file, generated with fMRIPrep's -vv flag:
slurm-381709_0.txt (239.8 KB)

And the #SBATCH header I used in my .sh script:

#!/bin/bash
#
#SBATCH --array 0
#SBATCH --partition=test
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=80
# 1 GB per CPU
#SBATCH --mem-per-cpu=1024
#SBATCH --time=02:00:00
#SBATCH --no-requeue
#SBATCH --mail-type=ALL
# ------------------------------------------

I have no problems running fMRIPrep on a different server; over there I usually use the mem and threads flags …

Any ideas about what might be causing this? Maybe @oesteban or @effigies?

Thank you very much in advance,

Best,
Dominik

Had you previously tried to run it and moved the working directory? I’m not sure why your working directory seems to be /working_dir/fmriprep_wf/single_subject_aam063_wf/anat_preproc_wf/surface_recon_wf/autorecon1 and then you get:

/home/fmriprep/fmriprep_wf/single_subject_aam063_wf/anat_preproc_wf/surface_recon_wf/autorecon1: No such file or directory.

No, this was pretty much the first shot after debugging all the Singularity setup. I tried a second time after deleting all output and running find ${FREESURFER_HOST_CACHE}/$subject/ -name "*IsRunning*" -type f -delete, but I did not move anything.

That’s a very strange behavior that I haven’t seen before. I would try deleting your working directory (or at least the surface_recon_wf portion of it) and the FreeSurfer subject, if it exists in your FreeSurfer subjects directory, and try again.
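In case it's unclear, the cleanup could look something like the sketch below. WORKDIR and SUBJECTS_DIR here are stand-ins (created with mktemp only so the snippet is self-contained); substitute your actual host-side work directory and FreeSurfer subjects directory before using it.

```shell
# Stand-ins for your real host-side paths -- adjust before use.
WORKDIR=$(mktemp -d)        # your fMRIPrep --work-dir on the host
SUBJECTS_DIR=$(mktemp -d)   # your FreeSurfer subjects directory on the host
subject=sub-aam063

# Recreate the stale state here only so the snippet is self-contained.
mkdir -p "$WORKDIR/fmriprep_wf/single_subject_aam063_wf/anat_preproc_wf/surface_recon_wf/autorecon1"
mkdir -p "$SUBJECTS_DIR/$subject"

# Delete only the surface_recon_wf portion of the work directory...
rm -rf "$WORKDIR"/fmriprep_wf/*/anat_preproc_wf/surface_recon_wf

# ...and the partially-processed FreeSurfer subject, if it exists.
rm -rf "${SUBJECTS_DIR:?}/$subject"
```

The `:?` guard on SUBJECTS_DIR makes the rm fail loudly if the variable is ever unset, rather than deleting from the filesystem root.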

If you’re able to reproduce it, could you share your Singularity image? That might be the easiest way for one of us to reproduce the issue.

I think I pretty much ended up with the same error:

Standard error:
/home/fmriprep/fmriprep_wf/single_subject_aam063_wf/anat_preproc_wf/surface_recon_wf/autorecon1: No such file or directory.
Return code: 1

Slurm output:

200821-22:25:55,917 nipype.workflow INFO:
	 [Node] Running "autorecon1" ("smriprep.interfaces.freesurfer.ReconAll"), a CommandLine Interface with command:
recon-all -autorecon1 -i /data/sub-aam063/anat/sub-aam063_T1w.nii -noskullstrip -openmp 8 -subjid sub-aam063 -sd /output/freesurfer
200821-22:25:56,343 nipype.workflow WARNING:
	 Storing result file without outputs
200821-22:25:56,344 nipype.workflow WARNING:
	 [Node] Error on "fmriprep_wf.single_subject_aam063_wf.anat_preproc_wf.surface_recon_wf.autorecon1" (/working_dir/fmriprep_wf/single_subject_aam063_wf/anat_preproc_wf/surface_recon_wf/autorecon1)
200821-22:25:57,129 nipype.workflow ERROR:
	 Node autorecon1 failed to run on host node45-004.cm.cluster.

I'm currently removing the working dir at the end of my .sh script. I'll attach my current .sh script below; maybe I'm missing something obvious (which would be nice, as it would quickly solve the issue :sweat_smile:). My colleague is using pretty much the same script, but without FreeSurfer, and it runs smoothly on our HPC cluster. I also haven't had any issues with version 20.1.1 when processing another subject on a different server (Singularity, but not Slurm/sbatch) …

#!/bin/bash
#
#SBATCH --array 0
#SBATCH --partition=test
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=80
#SBATCH --mem-per-cpu=1024
#SBATCH --time=02:00:00
#SBATCH --no-requeue
#SBATCH --mail-type=ALL
# ------------------------------------------

echo "Loading singularity..."
spack load singularity@3.5.2


for subject in sub-aam063
do


	# Remove IsRunning files from FreeSurfer
	find ${FREESURFER_HOST_CACHE}/$subject/ -name "*IsRunning*" -type f -delete

	# sleep 1-320 seconds
	#echo "sleeping a bit..."
	#sleep $[ ( $RANDOM / 100 )  + 100 ]s

	# Set up some more variables
	export SINGULARITY_TMPDIR=/scratch/brainimage/kraft/rest/fmriprep_temp/$subject
	export SINGULARITYENV_subject=$subject
	export SINGULARITYENV_TEMPLATEFLOW_HOME=/templateflow
	export SINGULARITY_CACHEDIR=/scratch/brainimage/kraft/rest/.cache
	export TEMP_DIR=/scratch/brainimage/kraft/rest/fmriprep_temp/$subject

	echo "creating temp directory in"
	echo $TEMP_DIR
	mkdir -p $TEMP_DIR

	# Setup done, run the command
	echo -e "\n"
	echo "Starting fmriprep."
	echo "subject: $subject"
	echo -e "\n"

	singularity run \
		-B /scratch/brainimage/kraft/rest:/home/fmriprep \
		-B /scratch/brainimage/kraft/rest/BIDS:/data:ro \
		-B /scratch/brainimage/kraft/rest/templateflow:/templateflow \
		-B /scratch/brainimage/kraft/rest/BIDS/derivatives:/output \
		-B /scratch/brainimage/kraft/rest/license:/lic \
		-B $TEMP_DIR:/working_dir \
		--home /home/fmriprep --cleanenv \
		/scratch/brainimage/kraft/rest/fmriprep-20.1.1.simg \
		/data /output participant \
		--notrack \
		-vv \
		--resource-monitor \
		--ignore slicetiming \
		--output-spaces T1w MNI152NLin2009cAsym:res-native fsnative fsaverage \
		--fs-license-file /lic/license.txt \
		--participant-label $subject \
		--work-dir /working_dir


	echo -e "\n"
	echo "Removing temp directory..."
	rm -r $TEMP_DIR

done

But I'm happy to share my files; what do you need, and what's the best way to send them to you?

Thanks,

Dominik

A few thoughts:

  • This binding, -B /scratch/brainimage/kraft/rest:/home/fmriprep, could be overshadowing the temporary folder. Since you are resetting TemplateFlow, you'd be fine doing something like -B /scratch/brainimage/kraft/rest/.fmriprep:/home/fmriprep/.cache/fmriprep
  • Does your HPC have a local scratch (some temporary, ephemeral space on the compute node) that you can use for the work directory (and thus avoid the mkdir/rm -r fiddling)?
  • Not sure this is relevant, but reading the Slurm log it seems that ${FREESURFER_HOST_CACHE} is unset; that find -delete is therefore pretty ineffective.
  • While debugging this, I would simplify the sbatch script as much as you can (i.e., get rid of the temporary-folder handling).
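For the first bullet, an untested sketch of what the narrower bind might look like, keeping your other binds as they are. The host-side .fmriprep directory is hypothetical and would need to exist beforehand:

```shell
# Untested sketch: replace the broad /home/fmriprep bind with a narrow one
# for fMRIPrep's cache only; the host-side .fmriprep dir is hypothetical.
mkdir -p /scratch/brainimage/kraft/rest/.fmriprep

singularity run \
	-B /scratch/brainimage/kraft/rest/.fmriprep:/home/fmriprep/.cache/fmriprep \
	-B /scratch/brainimage/kraft/rest/BIDS:/data:ro \
	-B /scratch/brainimage/kraft/rest/BIDS/derivatives:/output \
	-B /scratch/brainimage/kraft/rest/license:/lic \
	-B $TEMP_DIR:/working_dir \
	--cleanenv \
	/scratch/brainimage/kraft/rest/fmriprep-20.1.1.simg \
	/data /output participant \
	--fs-license-file /lic/license.txt \
	--participant-label $subject \
	--work-dir /working_dir
```

This way the container's home directory is left untouched, so nothing in the work directory can get shadowed by the host mount.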

thanks a lot @oesteban!

I changed the -B binding as you suggested and it seems to work now, i.e., no strange autorecon behavior. I had to stop the process as it was about to exceed the maximum time for this partition. But I'm not 100% sure why this helps; do you mind elaborating a bit?

Our HPC comes with

the non-shared local storage (i.e. only accessible from the compute node it’s connected to, max. 1.4 TB, slow) under /local/$SLURM_JOB_ID on each compute node

which I guess is what you are referring to. I'm not sure whether the HPC admins like us using it as a temporary directory, especially as I get numbers like this when checking my quota:

     used |     hard || used files |   hard
----------|----------||------------|--------
14.88 GiB | 5.00 TiB ||     118480 | 800000

I also attach my slurm output, which seems to be ok:

slurm-382551_0.txt (258.8 KB)

But following up on this, and me being new to HPC: would it be better to use the --nthreads and --mem flags in my fMRIPrep command, and if so, what would be a good choice? This matters especially as I'm planning to scale up to more participants, i.e., before reaching the max file limit on /scratch, which seems to be a boundary I have to deal with. Right now it seems that most resources go unused; e.g., from slurm.out: Free memory (GB): 160.07/168.84, Free processors: 62/80. I also used --resource-monitor, but I haven't yet found any output related to it.

Thanks again for helping me out!

Hi,

I am experiencing the same problem as posted here, also while testing fMRIPrep with Singularity on an HPC system. I am using the latest version (20.2.0) and built the .simg using docker2singularity, as suggested in the documentation.

I've used basically the same run script that is posted on the fMRIPrep Singularity pages, so I'm not quite sure I understand why the solution suggested above works. My understanding of the error is that recon-all somehow looks for its input in my $HOME folder instead of the work folder.

FYI, fMRIPrep runs perfectly with --fs-no-reconall.
I don't have any trouble with I/O errors in general, and my $HOME directory seems to be automatically bound by Singularity, so I don't set --home manually as described here.
I am creating a fresh work folder with every job, so deleting it beforehand is obviously not relevant.

I first tried setting --home as I understood the solution would be here (basically $HOME/.fmriprep:/home/fmriprep/.cache/fmriprep), but then it just complained that .fmriprep does not exist (should it have been $HOME/.cache/fmriprep?).
I also tried another solution, where I set --home in Singularity to /work, because I figured FreeSurfer probably thinks the work directory should be under $HOME… Interestingly, when I do that, recon-all tries to use a different input, which is in the original data folder (see below)… Adding to the confusion, I have no idea what the input really should be, as I've never run recon-all before.

Running my original script (without binding anything on $HOME), the input to recon-all is “/work/fmriprep_wf/single_subject_6000_wf/anat_preproc_wf/anat_template_wf/t1w_merge/sub-6103_ses-T2_T1w_template.nii.gz”. When I try to bind /work as the “home” directory, the input it tries to use is “/data/sub-6000/ses-T2/anat/sub-6000_ses-T2_T1w.nii.gz”… so the original T1. And in the end it then crashes on something else, namely mri_nu_correct.mni…

I am curious as to why this original error is happening in the first place. Obviously the attempt to bind /work as home is not optimal, so I don't think providing the full script/error for that attempt is really relevant. It was just interesting that the input to recon-all then changed to data instead of work…

Here is my script where autorecon crashes:

BIDS_DIR="/cluster/projects/[project]/hanne/BIDS/test3"
DERIVS_DIR="$BIDS_DIR/derivatives"
LOCAL_FREESURFER_DIR="$DERIVS_DIR/freesurfer"

# Prepare some writeable bind-mount points.
TEMPLATEFLOW_HOST_HOME=/cluster/projects/[project]/hanne/templateflow
FMRIPREP_HOST_CACHE=$HOME/.cache/fmriprep

# Prepare scratchdir
WORKDIR=$USERWORK/$SLURM_JOB_ID
mkdir -p ${WORKDIR}

# Prepare derivatives folder
mkdir -p ${LOCAL_FREESURFER_DIR}

# Make sure FS_LICENSE is defined in the container
export SINGULARITYENV_FS_LICENSE=$HOME/FSlicense.txt

# Designate a templateflow bind-mount point
export SINGULARITYENV_TEMPLATEFLOW_HOME="/templateflow"
export SINGULARITYENV_WORK="/work"

# SINGULARITY BINDS
SINGULARITY_CMD="singularity run --cleanenv \
 -B $BIDS_DIR:/data \
 -B $DERIVS_DIR:/out \
 -B ${TEMPLATEFLOW_HOST_HOME}:${SINGULARITYENV_TEMPLATEFLOW_HOME} \
 -B $WORKDIR:${SINGULARITYENV_WORK} \
 -B ${LOCAL_FREESURFER_DIR}:/fsdir $HOME/fmriprep.simg"

# Remove IsRunning files from FreeSurfer
find ${LOCAL_FREESURFER_DIR}/sub-$1/ -name "*IsRunning*" -type f -delete

# Compose the command line
cmd="${SINGULARITY_CMD} /data /out participant --participant_label $1 \
-w /work --verbose --output-spaces func --fs-subjects-dir /fsdir --notrack"

# Setup done, run the command
echo Running task $1
echo Commandline: $cmd
eval $cmd
exitcode=$?

And here is the relevant output (can post the whole thing if you are interested):

RuntimeError: Command:
recon-all -autorecon1 -i /work/fmriprep_wf/single_subject_6000_wf/anat_preproc_wf/anat_template_wf/t1w_merge/sub-6000_ses-T2_T1w_template.nii.gz -noskullstrip -openmp 8 -subjid sub-6000 -sd /fsdir
Standard output:
Standard error:
/cluster/home/hannesm/fmriprep_wf/single_subject_6000_wf/anat_preproc_wf/surface_recon_wf/autorecon1: No such file or directory.
Return code: 1


So, following up on my previous post: a friend pointed me to the possible solution of using --no-home with the singularity command. I also cd'd into the data directory before running the script, as I understood Singularity would bind whatever folder you executed the script in… And now it seems to work fine?
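For reference, the change boils down to something like this sketch against the script in my previous post (untested outside my own setup):

```shell
# Sketch: --no-home stops Singularity from auto-binding $HOME, so nothing in
# the container can resolve paths under /cluster/home/<user> by accident.
SINGULARITY_CMD="singularity run --cleanenv --no-home \
 -B $BIDS_DIR:/data \
 -B $DERIVS_DIR:/out \
 -B ${TEMPLATEFLOW_HOST_HOME}:${SINGULARITYENV_TEMPLATEFLOW_HOME} \
 -B $WORKDIR:${SINGULARITYENV_WORK} \
 -B ${LOCAL_FREESURFER_DIR}:/fsdir $HOME/fmriprep.simg"

# Run from the data directory, since Singularity also binds the current
# working directory by default.
cd "$BIDS_DIR"
```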

Oh, and about the different inputs to recon-all: it turns out it uses the original T1 data (input = data/T1) when there is only a single session, and the merged T1 template (work/merge) when there are several sessions. So that was a result of poor testing routines on my part, I guess.

@effigies @oesteban Can you shed some light on why this solution seems to work? I was not aware that the folder I executed the script in had an impact on the pipeline.
And can you confirm what I gathered about what the input to recon-all should be (single session vs. multiple sessions)?

From a quick read-through, I believe you’re running into the cases that are addressed in https://fmriprep.org/en/stable/singularity.html#templateflow-and-singularity. Could you check through that, and see if you still have questions?

Hi,

Thanks for the suggestion. Well, I’ve read it again but I’m not quite sure that’s it.
So I spent quite a few days reading, testing and fixing the set-up so that TemplateFlow and the rest would run smoothly. This error is specific to recon-all, i.e., fmriprep runs fine without it. Also as I wrote, Singularity binds the $HOME folder just fine so I kind of thought this part of the documentation was in regards to cases where that doesn’t happen. Also, unfortunately, sentences like " A particularly problematic case arises when the home directory is mounted in the container, but the $HOME environment variable is not correspondingly updated" doesn’t really tell me much as I a complete n00b wrt Singularity :slight_smile:

So again, the problem is that, specifically for recon-all, it suddenly looks for its input in the $HOME/user folder.
It's also not clear to me what the documentation means when it refers to binding $HOME to /home/fmriprep (as was also done in the original post here). What is /home/fmriprep supposed to refer to? My situation is that the image, data, and work directories are in completely different places.
