I created a small nipype pipeline for preprocessing, and it works when run with the MultiProc plugin.
However, whenever I submit the job to our SLURM cluster, I get IO errors at seemingly random nodes, with the following message:
IOError: Job id (xxxxx) finished or terminated, but results file does not exist after (20.0) seconds. Batch dir contains crashdump file if node raised an exception.
The strange thing is that if I look at the node that supposedly crashed, I can find its output files, and there is no mention of an error or crash in the SLURM output file in the batch folder. If I re-submit the workflow, it crashes at some other, seemingly random node with the same message.
I've set 'job_finished_timeout' to 20 seconds and 'poll_sleep_duration' to 5 seconds in the workflow execution configuration, hoping this would help nipype see the output of each node, but it does not seem to change the frequency of these crashes.
Can someone suggest what I can try to debug this issue?
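For context, here is a minimal sketch of how execution settings like these are typically applied before submitting through the SLURM plugin. The helper name `configure_and_run` and the `sbatch_args` value are placeholders for illustration, and `workflow` is assumed to be a `nipype.pipeline.engine.Workflow` whose `config` attribute is a dict.

```python
# Values mirror the settings described above.
EXECUTION_CONFIG = {
    "job_finished_timeout": 20,  # seconds to wait for a node's results file
    "poll_sleep_duration": 5,    # seconds between job-status polls
}


def configure_and_run(workflow, sbatch_args=""):
    """Apply the execution settings, then submit via the SLURM plugin.

    `workflow` is assumed to be a nipype Workflow (hypothetical helper;
    not part of nipype itself).
    """
    # Merge into any existing 'execution' section rather than replacing it.
    workflow.config.setdefault("execution", {}).update(EXECUTION_CONFIG)
    return workflow.run(plugin="SLURM",
                        plugin_args={"sbatch_args": sbatch_args})
```

Usage would look like `configure_and_run(wf, sbatch_args="--time=01:00:00")`, where `wf` is the preprocessing workflow and the `sbatch` option is only an example.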
I am using nipype 0.13.0 with Python 2.7.
Thank you for your help!