Qsiprep: Node eddy failed to run on host sms

Hi everyone!

I am trying to run qsiprep==0.15.1 on my cluster, but qsiprep returns an error during processing.

I run qsiprep with the following command:

module load singularity
singularity run --cleanenv ~/qsiprep-0.15.1.sif ~/Entropy/raw_data/patients/ ~/Entropy/derivatives_qsiprep/patients/ participant --participant-label OAS30001 -w ~/Entropy/work_qsiprep/patients/ --fs-license-file ~/freesurfer_license.txt --skip_bids_validation --output-resolution 1.2

And at some point it shows an error in the console:

220323-19:12:17,180 nipype.workflow WARNING:
	 Storing result file without outputs
220323-19:12:17,182 nipype.workflow WARNING:
	 [Node] Error on "qsiprep_wf.single_subject_OAS30001_wf.dwi_preproc_ses_d3132_wf.hmc_sdc_wf.eddy" (/home/elevchenko/Entropy/work_qsiprep/controls/qsiprep_wf/single_subject_OAS30001_wf/dwi_preproc_ses_d3132_wf/hmc_sdc_wf/eddy)
220323-19:12:17,902 nipype.workflow ERROR:
	 Node eddy failed to run on host sms.
220323-19:12:17,908 nipype.workflow ERROR:
	 Saving crash info to /home/elevchenko/Entropy/derivatives_qsiprep/controls/qsiprep/sub-OAS30001/log/20220323-183543_7ee8194c-48a1-4cd1-8a21-a41a5560aede/crash-20220323-191217-elevchenko-eddy-eb687278-80d4-47f3-ae09-796d2fee515c.txt
Traceback (most recent call last):
  File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/pipeline/plugins/multiproc.py", line 67, in run_node
    result["result"] = node.run(updatehash=updatehash)
  File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/pipeline/engine/nodes.py", line 521, in run
    result = self._run_interface(execute=True)
  File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/pipeline/engine/nodes.py", line 639, in _run_interface
    return self._run_command(execute)
  File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/pipeline/engine/nodes.py", line 750, in _run_command
    raise NodeExecutionError(
nipype.pipeline.engine.nodes.NodeExecutionError: Exception raised while executing Node eddy.

Traceback (most recent call last):
  File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/interfaces/base/core.py", line 454, in aggregate_outputs
    setattr(outputs, key, val)
  File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/interfaces/base/traits_extension.py", line 330, in validate
    value = super(File, self).validate(objekt, name, value, return_pathlike=True)
  File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/interfaces/base/traits_extension.py", line 135, in validate
    self.error(objekt, name, str(value))
  File "/usr/local/miniconda/lib/python3.8/site-packages/traits/base_trait_handler.py", line 74, in error
    raise TraitError(
traits.trait_errors.TraitError: The 'out_parameter' trait of an ExtendedEddyOutputSpec instance must be a pathlike object or string representing an existing file, but a value of '/home/elevchenko/Entropy/work_qsiprep/controls/qsiprep_wf/single_subject_OAS30001_wf/dwi_preproc_ses_d3132_wf/hmc_sdc_wf/eddy/eddy_corrected.eddy_parameters' <class 'str'> was specified.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/interfaces/base/core.py", line 401, in run
    outputs = self.aggregate_outputs(runtime)
  File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/interfaces/base/core.py", line 461, in aggregate_outputs
    raise FileNotFoundError(msg)
FileNotFoundError: No such file or directory '/home/elevchenko/Entropy/work_qsiprep/controls/qsiprep_wf/single_subject_OAS30001_wf/dwi_preproc_ses_d3132_wf/hmc_sdc_wf/eddy/eddy_corrected.eddy_parameters' for output 'out_parameter' of a ExtendedEddy interface

I have attached two txt log files.

crash-20220323-190443-elevchenko-eddy-420e4866-4b76-4384-9fe1-48269ca61aa9.txt (3.8 KB)
crash-20220323-191217-elevchenko-eddy-eb687278-80d4-47f3-ae09-796d2fee515c.txt (3.8 KB)

Any ideas are helpful!

Hi Egor,

The error message isn’t very specific here, but it indicates that the eddy run failed and didn’t produce outputs.

Your error logs show that only 8 threads were being used by eddy, which should be fine, but how many subjects were being processed simultaneously? You may need to use the --nthreads flag to specify the maximum number of CPUs that can be used by qsiprep. You may also have run out of memory; 8-threaded eddy can use 32 GB+ of memory.
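To make that concrete, here is one way to cap the resources explicitly (a sketch based on the command from the original post; `--nthreads` and `--mem_mb` are standard qsiprep command-line arguments, and the specific values here are just examples, not recommendations):

```shell
# Same invocation as above, with explicit resource caps for qsiprep:
# --nthreads limits the total CPUs nipype may use across the workflow;
# --mem_mb caps the memory (in MB) that qsiprep plans around.
module load singularity
singularity run --cleanenv ~/qsiprep-0.15.1.sif \
    ~/Entropy/raw_data/patients/ \
    ~/Entropy/derivatives_qsiprep/patients/ \
    participant --participant-label OAS30001 \
    -w ~/Entropy/work_qsiprep/patients/ \
    --fs-license-file ~/freesurfer_license.txt \
    --skip_bids_validation --output-resolution 1.2 \
    --nthreads 8 --mem_mb 32000
```

Matching these values to whatever you request from the scheduler helps keep eddy from oversubscribing the node.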

A few additional questions:

  1. Is this error subject specific?
  2. Does this subject have BIDS valid data?
  3. Do you have ample storage and read/write permissions in your home directory? Some clusters have strict storage quotas on home directories.
  4. Related to the previous post, what kind of memory/cpu resources are you devoting?


Thank you for your answers!

@mattcieslak I run it only for one subject.


  1. I have tested on one more subject and the error looks exactly the same. I can try more subjects to be sure that this is not something subject specific.
  2. All subjects (including the tested one) have BIDS-valid data.
  3. Yes, overall I have around 0.8 TB of free space on the drive and all needed permissions.
  4. I specify only the number of CPUs (--cpus-per-task=8).


Good to know! Let’s focus on the last point. How are you submitting these jobs? Does your cluster have a job scheduler that you can use to specify job resources?

Also, have you been able to run other large nipype-style pipelines (e.g. fmriprep)?

On our cluster we use Slurm management system and I use pretty default way to run jobs:

I created a batch file:

#!/bin/bash
#SBATCH --time=3-0:0
#SBATCH --cpus-per-task=8
#SBATCH --nodes=1

module load singularity
singularity run --cleanenv ~/qsiprep-0.15.1.sif ~/Entropy/raw_data/patients/ ~/Entropy/derivatives_qsiprep/patients/ participant --participant-label OAS30001 -w ~/Entropy/work_qsiprep/patients/ --fs-license-file ~/freesurfer_license.txt --skip_bids_validation --output-resolution 1.2

And submit it using the sbatch command.

Yes, I have run fmriprep several times before and it works perfectly!

Can you go down to 4 CPUs and add

#SBATCH --mem=32GB

Changed CPUs to 4 and added --mem=32GB, but slurm returns an error:

sbatch: error: Memory specification can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available

Which is strange to me, because I can see a lot of free nodes.

hmm, have you been able to submit jobs with 32 GB before? Perhaps your cluster has a job memory limit. Try 20GB?
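One way to check what the scheduler actually allows (a sketch using standard Slurm commands; the exact output depends on your site's configuration):

```shell
# List each node with its configured real memory (MB) and CPU count.
sinfo -N -o "%N %m %c"

# Show partition settings, including any per-job memory ceiling
# (look for MaxMemPerNode / MaxMemPerCPU in the output).
scontrol show partition
```

If MaxMemPerNode is unset or very low, that would explain why even small --mem requests are rejected.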

I tried 20GB and 8GB; the same error.

This might be something to ask your cluster admin about… In the meantime, if you are open to sharing the data, I can try running it on my cluster to see if I run into the same problem. You could also upload the data to brainlife.io and run QSIPrep there; it is a cloud-based neuroimaging processing platform.

@Steven, thank you!

Just to let you know, my admin answered me about the memory issue (cHARISMa is the name of the cluster):

By default, all free memory of the compute node is available to the task. Since cHARISMa nodes have a very large amount of RAM, there are no problems with a lack of memory for user tasks, and there is no need to limit users further.

You can see the memory usage statistics of your tasks through the TaskMaster system. For example, the graph “Main memory usage (whole node)” shows how little memory the task takes relative to the total free memory of the compute node.

Basically, it is currently impossible to specify the amount of memory for a task on our cluster. That is why it returns an error every time there is a "--mem=" parameter in the batch file.

So it looks like the error during qsiprep is not related to memory issues.

Got it. There is a --resource-manager argument that might help see if qsiprep is somehow exceeding all available memory.

Thank you! I think you mean --resource-monitor, am I right?

I added this argument and ran qsiprep on the same subject. The same kind of error appears (I attach the log and some graphs of the HPC memory usage). Based on them, I think the problem is not memory usage.

output_qsiprep_ctrl1.txt (98.4 KB)

Do you have any ideas what it could be if it was not a memory issue?


I haven’t solved the issue with eddy, but I switched to another head motion correction algorithm using the --hmc_model 3dSHORE flag.

Hopefully it will be useful for somebody in the future!
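For reference, the full command with that workaround would look like this (a sketch based on the command earlier in the thread; `--hmc_model 3dSHORE` tells qsiprep to use its SHORELine head motion correction, so the FSL eddy node is not run at all):

```shell
# Same qsiprep call as before, but with SHORELine head motion correction
# (--hmc_model 3dSHORE) instead of eddy.
module load singularity
singularity run --cleanenv ~/qsiprep-0.15.1.sif \
    ~/Entropy/raw_data/patients/ \
    ~/Entropy/derivatives_qsiprep/patients/ \
    participant --participant-label OAS30001 \
    -w ~/Entropy/work_qsiprep/patients/ \
    --fs-license-file ~/freesurfer_license.txt \
    --skip_bids_validation --output-resolution 1.2 \
    --hmc_model 3dSHORE
```

Note that SHORELine and eddy are different algorithms with different assumptions, so this is a workaround rather than a fix for the underlying eddy failure.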