# nipype workflow fails on a different node each re-run without generating a crashfile

Hello there,

I have developed a nipype workflow that runs first- and second-level analyses on task-based fMRI data. I tested it on a set of 5 subjects drawn from my different patient groups and it worked. When I run it on my full sample, however, it fails in a different spot each time, always complaining that it cannot find a pickled results file. The failing node and participant change from run to run, but the traceback is identical:

```
Traceback (most recent call last):
  File "/blue/vabfmc/data/working/robert727/virtual_environments/cda2_nipype_wf/lib64/python3.6/site-packages/nipype/pipeline/plugins/tools.py", line 27, in report_crash
    result = node.result
  File "/blue/vabfmc/data/working/robert727/virtual_environments/cda2_nipype_wf/lib64/python3.6/site-packages/nipype/pipeline/engine/nodes.py", line 224, in result
    op.join(self.output_dir(), "result_%s.pklz" % self.name)
  File "/blue/vabfmc/data/working/robert727/virtual_environments/cda2_nipype_wf/lib64/python3.6/site-packages/nipype/pipeline/engine/utils.py", line 291, in load_resultfile
    raise FileNotFoundError(results_file)
FileNotFoundError: /blue/vabfmc/data/working/robert727/cda2/data_processing/results/emotion_frontal_limbic_mask/l1_analyses/_subject_id_114/_fwhm_6/l1_contrast_estimate/result_l1_contrast_estimate.pklz

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "spm_l1_and_l2_wf.py", line 427, in <module>
    l1_analyses.run(plugin='SLURM', plugin_args=plugin_args)
  File "/blue/vabfmc/data/working/robert727/virtual_environments/cda2_nipype_wf/lib64/python3.6/site-packages/nipype/pipeline/engine/workflows.py", line 638, in run
    runner.run(execgraph, updatehash=updatehash, config=self.config)
  File "/blue/vabfmc/data/working/robert727/virtual_environments/cda2_nipype_wf/lib64/python3.6/site-packages/nipype/pipeline/plugins/base.py", line 166, in run
    self._clean_queue(jobid, graph, result=result)
  File "/blue/vabfmc/data/working/robert727/virtual_environments/cda2_nipype_wf/lib64/python3.6/site-packages/nipype/pipeline/plugins/base.py", line 242, in _clean_queue
    crashfile = self._report_crash(self.procs[jobid], result=result)
  File "/blue/vabfmc/data/working/robert727/virtual_environments/cda2_nipype_wf/lib64/python3.6/site-packages/nipype/pipeline/plugins/base.py", line 226, in _report_crash
    return report_crash(node, traceback=tb)
  File "/blue/vabfmc/data/working/robert727/virtual_environments/cda2_nipype_wf/lib64/python3.6/site-packages/nipype/pipeline/plugins/tools.py", line 33, in report_crash
    keepends=True
TypeError: must be str, not list
```
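For reference, the call that triggers this is at line 427 of my script; stripped down, the launch looks roughly like this (a paraphrased sketch with hypothetical sbatch_args values; the real arguments are in the attached scripts):

```python
# Paraphrased sketch of the launch in spm_l1_and_l2_wf.py; the sbatch_args
# values here are hypothetical, the real ones are in the attached scripts.
plugin_args = {'sbatch_args': '--mem=8gb --time=04:00:00'}
l1_analyses.run(plugin='SLURM', plugin_args=plugin_args)
```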

The traceback above says it cannot find the results file for the l1_contrast_estimate node, even though that node completed successfully for multiple subjects in the same submission. In an earlier attempt the failure was instead on the node that gathers scan data, and that node likewise succeeded for many other participants in that run. In the most recent run there was also this message at the very end of the error and log file:

```
slurmstepd: error: Unable to unlink domain socket `/tmp/slurmd/c0706a-s11_64650483.4294967291`: No such file or directory
```
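When a run dies like this I can at least check the node's working directory by hand. The result_*.pklz files are just gzip-compressed pickles, so a quick check like this (a sketch, using the path from the traceback above) shows whether the result actually exists and is readable:

```python
import gzip
import os.path as op
import pickle

# Path copied from the FileNotFoundError in the traceback above.
results_file = (
    "/blue/vabfmc/data/working/robert727/cda2/data_processing/results/"
    "emotion_frontal_limbic_mask/l1_analyses/_subject_id_114/_fwhm_6/"
    "l1_contrast_estimate/result_l1_contrast_estimate.pklz"
)

if op.exists(results_file):
    # result_*.pklz files are gzip-compressed pickles of the node's result.
    with gzip.open(results_file, "rb") as fp:
        result = pickle.load(fp)
    print(result.outputs)
else:
    print("missing:", results_file)
```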

I have uploaded my scripts and some log files as text files. My setup:

  • an HPC cluster running SLURM (Linux)
  • nipype version 1.7.1
  • SPM version 12.7771
  • FSL version 6.0.6
  • the workflow is launched via an sbatch script that activates a python virtual environment and runs the workflow python script inside it
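One thing I may try before the next batch: since the crash handler itself is raising (the TypeError at the bottom of the traceback), no crashfile ever gets written. Loosening the execution config near the top of the workflow script might at least keep one failed node from tearing down the whole submission; a sketch, with option names taken from the nipype configuration docs:

```python
from nipype import config

# Sketch of execution options I may try (names per the nipype config docs):
# keep going past a single failed node, and write plain-text crash reports.
config.update_config({
    'execution': {
        'stop_on_first_crash': False,
        'crashfile_format': 'txt',
    }
})
```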

Any help with this would be greatly appreciated; I feel like I have been running into the same wall for weeks. Please let me know if any other information would help. My scripts and log files are attached below (some of the logs are around 5,000 lines):
spm_l1_and_l2_wf_python_script.txt (27.5 KB)
spm_l1_and_l2_wf_sbatch_script.txt (1.3 KB)
spm_l1_and_l2_wf_yaml_config_file.txt (550 Bytes)
spm_l1_l2_wf_64583446.txt (188.1 KB)
spm_l1_l2_wf_64650483.txt (289.4 KB)

An update for whoever gets to this: I have found that if I keep re-running the workflow, the nodes that failed before run successfully, and eventually every node either completes or has its results read from the cache of the last successful run. This is obviously not a solution, but I figured it might be a useful clue. Also, in the script posted above, the level-2 analysis portion is commented out; I have just been running the level-1 portion to make sure it works before moving on to the level-2 analyses.
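In script form, the stopgap amounts to something like this (a sketch; it assumes l1_analyses and plugin_args are defined as in the attached script, and it leans on nipype reading cached results for nodes that already succeeded):

```python
# Stopgap sketch: resubmit until every node either runs successfully or is
# read from the cache of a previous run. Assumes l1_analyses and plugin_args
# are defined as in the attached spm_l1_and_l2_wf.py script.
for attempt in range(1, 6):
    try:
        l1_analyses.run(plugin='SLURM', plugin_args=plugin_args)
        print(f"workflow completed on attempt {attempt}")
        break
    except Exception as exc:
        print(f"attempt {attempt} failed: {exc}")
```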

I feel very silly, but I re-ran the workflow with nipype version 1.8.6 and the error I kept getting is no longer present, so upgrading from 1.7.1 appears to have been the fix.