I’m wondering if there’s a way for nipype to restart stalled nodes. I’m using a preprocessing pipeline (code here) we wrote in-lab, and every so often one of the nodes (typically merge_epis, a MapNode that calls fsl.Merge) will stall out. The MapNode spawns a huge number of sub-processes (in the 100s), each of which takes only a minute or two at most. But sometimes the node stalls and freezes the entire pipeline. From the log, it looks like the pipeline is just waiting to hear back from the sub-process (it’s not that it ran out of memory or anything), but I haven’t checked whether the issue is with nipype not realizing the sub-process finished or with the sub-process itself. This doesn’t happen with the Linear plugin, only with MultiProc or SLURM, and I believe it only happens when run on our HPC cluster, which uses the SLURM job scheduler (I say “I believe” because no one else has run into this issue, and the others in the lab run the pipeline locally; it’s too memory-intensive for my machine, so I’ve never done a local run myself to compare).
The way I typically handle this is to kill the whole pipeline and restart it. This has worked every time. It doesn’t take as long as running from scratch, because nipype finds the cached outputs and continues from there, but it would obviously be more convenient if I could just restart the problem sub-process instead of the whole thing. Is there a way to do so? Something like specifying a maximum length of time to let a sub-process run for and, if it exceeds that, killing it and trying again?
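To make concrete the kind of behavior I mean: as far as I know nipype doesn’t expose a per-subprocess timeout-and-retry option, but it could be sketched outside of nipype with plain Python. The function name `run_with_retries` and all its parameters are made up for illustration; this is just a sketch of the idea, not an existing API:

```python
import subprocess

def run_with_retries(cmd, timeout_s, max_retries=3):
    """Run a command, killing and re-running it if it exceeds timeout_s.

    Hypothetical workaround sketch, not a nipype feature: this wraps the
    command outside of nipype entirely.
    """
    for attempt in range(1, max_retries + 1):
        try:
            # subprocess.run kills the child itself when the timeout expires
            return subprocess.run(cmd, timeout=timeout_s, check=True)
        except subprocess.TimeoutExpired:
            print(f"attempt {attempt} exceeded {timeout_s}s, retrying")
    raise RuntimeError(f"command still stalled after {max_retries} attempts")
```

Something with this shape, applied per sub-process inside the MapNode, is what I’m hoping already exists as a plugin or config option.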
Given that it only happens when parallelizing the job on the cluster and restarting fixes it every time, I suspect the output is actually being created and nipype is just not noticing it. But I feel like a fix for that would be more complicated and setup-specific, whereas the timeout-and-retry fix above might be more general.
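If the cause really is nipype failing to see an output that exists (e.g. filesystem lag on the cluster’s shared storage), one knob I’ve seen in nipype’s configuration docs is `job_finished_timeout` in the `execution` section, which controls how long nipype waits for a job’s result file to appear before concluding the job died. I haven’t verified this fixes my case; the value 60 below is an arbitrary example:

```python
from nipype import config

# Increase how long (in seconds) nipype waits for a node's result file
# to show up on disk before treating the job as crashed/stalled.
# 60 is an arbitrary example value, not a recommendation.
config.set('execution', 'job_finished_timeout', '60')
```

This would have to be set before the workflow runs (or equivalently in the `[execution]` section of a nipype config file), and it only addresses the "output exists but nipype missed it" theory, not a sub-process that genuinely hangs.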