Nipype IOError when running on SLURM

I created a small nipype pipeline for preprocessing, and it works when run with the MultiProc plugin.

However, whenever I try to submit the job to our SLURM cluster, I get IOErrors at seemingly random nodes, with the following message:

IOError: Job id (xxxxx) finished or terminated, but results file does not exist after (20.0) seconds. Batch dir contains crashdump file if node raised an exception.

The strange thing is, if I look at the node that supposedly crashed, I can find the output files, and there is no mention of an error or crash in the SLURM out file in the batch folder. If I re-submit the workflow, it randomly crashes at some other nodes with the same message.

I’ve set ‘job_finished_timeout’ to 20 sec and ‘poll_sleep_duration’ to 5 sec in the workflow execution configuration, hoping that this would give the plugin enough time to see the output of each node, but it does not seem to change the frequency of these crashes.
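For reference, the options are set roughly like this (a minimal sketch of the documented nipype pattern; wf stands for the Workflow object):

# set execution options on the workflow before running it
wf.config['execution'] = {'job_finished_timeout': 20.0,
                          'poll_sleep_duration': 5}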

Can someone suggest what I can try in order to debug this issue?

I am using nipype 0.13.0 with Python 2.7.

Thank you for your help!

@atsuchida there are currently a couple of bugs in the SLURM plugin - I started a PR a while ago (https://github.com/nipy/nipype/pull/1853) that was meant to address them, but in the meantime I have resorted to submitting workflows with the MultiProc plugin through individual sbatch calls.

However, we are planning to fix this in a future release!

Thank you for your response! Could you elaborate a bit more on the temporary fix you suggested?

Right now I have a script that builds my workflow and runs it with the SLURM plugin (passing sbatch args), so simply running this script submits the jobs to the SLURM cluster.
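Concretely, that means the script ends with a call along these lines (a sketch; the sbatch arguments shown here are placeholders):

# submit each node as its own SLURM job
wf.run(plugin='SLURM',
       plugin_args={'sbatch_args': '--mem=8G --time=02:00:00'})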

When you say “submitting workflows with the MultiProc plugin through individual sbatch calls”, do you mean I should change the plugin to MultiProc and execute the script with something like “sbatch run.sh”, where run.sh contains a call to my Python script?

Yes - for example, I use two bash scripts to submit an individual job for each subject in a dataset (using SLURM’s job arrays):

submission.sh

#!/bin/bash

base=/project/base

# first go to data directory, grab all subjects,
# and assign to an array
data=$base/data/
pushd $data
subjs=($(ls sub-* -d -1))
popd

# take the number of subjects minus one;
# this gives the highest index for the job array
len=$(expr ${#subjs[@]} - 1)

echo Spawning ${#subjs[@]} sub-jobs.

sbatch --array=0-$len levelone.sh $base ${subjs[@]}

levelone.sh

#!/bin/bash

# run the level-one script for one subject
# this script takes the following args:
#   $1      - base directory
#   $2...$N - list of subject IDs (one per array task)

#SBATCH -J levelone
#SBATCH -t 22:00:00
#SBATCH --mem=35GB
#SBATCH --cpus-per-task=5

base=$1
args=($@)
subjs=(${args[@]:1}) # drop initial arg (base)

scratch=/path/to/workdir

# check if directory exists and if not, make it
if [[ ! -d ${scratch} ]]; then
  mkdir -p ${scratch}
fi


# pick this task's subject using the index set by sbatch --array
subject=${subjs[${SLURM_ARRAY_TASK_ID}]}

echo 'Submitted job for: '${subject}

python python_script_here.py \
-d ${base}/data -m 1 -s $subject \
-o ${base}/data/derivatives/levelone \
-w ${scratch} \
-p 'MultiProc'

where -p specifies the nipype plugin to use.
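In case it helps, the plugin-selection end of python_script_here.py looks roughly like this (a sketch: the option names mirror the call above, and build_workflow is a hypothetical stand-in for the actual workflow construction):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('-d', dest='data_dir')
parser.add_argument('-m', dest='model')
parser.add_argument('-s', dest='subject')
parser.add_argument('-o', dest='out_dir')
parser.add_argument('-w', dest='work_dir')
parser.add_argument('-p', dest='plugin', default='MultiProc')
args = parser.parse_args()

# build_workflow is a hypothetical helper that assembles the nipype workflow
wf = build_workflow(args.data_dir, args.model, args.subject,
                    args.out_dir, args.work_dir)

# run locally inside the allocated SLURM task;
# n_procs matches --cpus-per-task in levelone.sh
wf.run(plugin=args.plugin, plugin_args={'n_procs': 5})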

HTH

Thank you so much!

I’m not very fluent in bash, so this example is very helpful!

By the way, was this bug introduced in a newer version of nipype (0.13.0 and up)? In our lab we have a much more complicated pipeline, developed with 0.12.0, that gets submitted to our SLURM cluster without any problems. We are in the process of porting the pipeline to Python 3 and the latest nipype, but I believe it hasn’t been tested with SLURM yet, since it has another issue related to SPM standalone (which is why I was using 0.13.0).

Anyway, I hope these issues will be fixed in a future release!

@atsuchida yes, I believe this was introduced after 0.12.0 was released.

@atsuchida If you have time, could you try running your pipeline with the latest master branch of nipype and confirm whether you are still encountering the problem?

pip install -U https://github.com/nipy/nipype/archive/master.zip

Hi again,

Sorry for not getting back to you; I was working on something else and am now revisiting the issue. I was hoping that a newer version of nipype would have resolved it, but it seems I am still encountering similar problems. I am working on a different pipeline now, but it is similarly small, performing normalization of DWI data using SPM Coregister and Normalize12.
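For context, the pipeline has roughly this shape (a simplified sketch, not the actual code; node names, connections, and the sbatch args are illustrative):

from nipype.pipeline.engine import Workflow, Node
from nipype.interfaces import spm

coreg = Node(spm.Coregister(), name='coregister')
norm = Node(spm.Normalize12(), name='normalize12')

wf = Workflow(name='dwi_norm', base_dir='/path/to/workdir')
wf.connect(coreg, 'coregistered_source', norm, 'image_to_align')
wf.run(plugin='SLURM', plugin_args={'sbatch_args': '--mem=8G'})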

I am now using nipype 1.0.2 (and have also tried 1.0.1) with Python 2.7.

The problem is exactly as before: I get seemingly random crashes, generally with the same message, and I cannot find any error in the corresponding SLURM out file.

If there is anything I can do to help track down the problem, please let me know!