Eddy_cuda error: 'parallel_for failed: cudaErrorMemoryAllocation: out of memory'

Summary of what happened:

Hi, experts. I am using QSIPrep to preprocess DTI data from the HCP-Development dataset and want to use eddy_cuda to accelerate the SDC step. The HPC node I'm using has 512 GB of RAM and 8 GPUs, each with 16 GB of memory.

While eddy_cuda runs, I have seen several GPU memory errors, along with crash files being written. Interestingly, the official Docker container keeps running with five active eddy_cuda processes. My first question: are the eddy_cuda processes that fail automatically retried?

Additionally, how can I reduce the number of parallel eddy_cuda processes from 8 to a smaller number to avoid such errors?

Command used (and if a helper script was used, a link to the helper script or the command generated):

export HOME=/home/user7/Datapool/part3
docker run -ti --rm \
    --gpus all  -e NVIDIA_DRIVER_CAPABILITIES=compute,utility -e NVIDIA_VISIBLE_DEVICES=all \
    -v $HOME/data:/data \
    -v $HOME/output:/output \
    -v /home/data/user7//HCD_QSIPrep_working_dir/part3:/working_dir \
    -v ${FREESURFER_HOME}/license.txt:/usr/local/freesurfer/license.txt \
    pennbbl/qsiprep:0.20.0 \
    /data /output participant \
    --fs-license-file /usr/local/freesurfer/license.txt \
    --output-resolution 1.5 \
    --distortion-group-merge average \
    --skip-anat-based-spatial-normalization \
    --eddy-config /output/eddy_params.json \
    -w /working_dir -v -v

Version:

pennbbl/qsiprep:0.20.0

Environment (Docker, Singularity / Apptainer, custom installation):

Docker

Relevant log outputs (up to 20 lines):

240320-08:34:48,518 nipype.workflow INFO:
	 [MultiProc] Running 4 tasks, and 187 jobs ready. Free memory (GB): 451.35/452.15, Free processors: 8/40.
                     Currently running:
                       * qsiprep_wf.single_subject_HCD1197757_wf.dwi_preproc_wf.hmc_sdc_wf.eddy
                       * qsiprep_wf.single_subject_HCD1106728_wf.dwi_preproc_wf.hmc_sdc_wf.eddy
                       * qsiprep_wf.single_subject_HCD0969072_wf.dwi_preproc_wf.hmc_sdc_wf.eddy
                       * qsiprep_wf.single_subject_HCD0643345_wf.dwi_preproc_wf.hmc_sdc_wf.eddy
240320-08:34:49,757 nipype.workflow INFO:
	 [Node] Setting-up "qsiprep_wf.single_subject_HCD1227134_wf.dwi_preproc_wf.hmc_sdc_wf.eddy" in "/working_dir/qsiprep_wf/single_subject_HCD1227134_wf/dwi_preproc_wf/hmc_sdc_wf/eddy".
240320-08:34:49,784 nipype.workflow INFO:
	 [Node] Executing "eddy" <qsiprep.interfaces.eddy.ExtendedEddy>
240320-08:34:50,465 nipype.workflow INFO:
	 [MultiProc] Running 5 tasks, and 186 jobs ready. Free memory (GB): 451.15/452.15, Free processors: 0/40.
                     Currently running:
                       * qsiprep_wf.single_subject_HCD1227134_wf.dwi_preproc_wf.hmc_sdc_wf.eddy
                       * qsiprep_wf.single_subject_HCD1197757_wf.dwi_preproc_wf.hmc_sdc_wf.eddy
                       * qsiprep_wf.single_subject_HCD1106728_wf.dwi_preproc_wf.hmc_sdc_wf.eddy
                       * qsiprep_wf.single_subject_HCD0969072_wf.dwi_preproc_wf.hmc_sdc_wf.eddy
                       * qsiprep_wf.single_subject_HCD0643345_wf.dwi_preproc_wf.hmc_sdc_wf.eddy
240320-09:22:58,394 nipype.workflow INFO:
	 [Node] Finished "eddy", elapsed time 2888.60715s.
240320-09:22:58,395 nipype.workflow WARNING:
	 Storing result file without outputs
240320-09:22:58,396 nipype.workflow WARNING:
	 [Node] Error on "qsiprep_wf.single_subject_HCD1227134_wf.dwi_preproc_wf.hmc_sdc_wf.eddy" (/working_dir/qsiprep_wf/single_subject_HCD1227134_wf/dwi_preproc_wf/hmc_sdc_wf/eddy)
240320-09:22:58,523 nipype.workflow ERROR:
	 Node eddy failed to run on host 51d0bb00b888.
240320-09:22:58,536 nipype.workflow ERROR:
	 Saving crash info to /output/qsiprep/sub-HCD1227134/log/20240320-040316_c86b7929-7dfd-4fbc-aaf7-d548aad58a65/crash-20240320-092258-root-eddy-b86fccae-a7aa-455e-8c8b-c1f8684b2d73.txt
Traceback (most recent call last):
  File "/usr/local/miniconda/lib/python3.10/site-packages/nipype/pipeline/plugins/multiproc.py", line 67, in run_node
    result["result"] = node.run(updatehash=updatehash)
  File "/usr/local/miniconda/lib/python3.10/site-packages/nipype/pipeline/engine/nodes.py", line 527, in run
    result = self._run_interface(execute=True)
  File "/usr/local/miniconda/lib/python3.10/site-packages/nipype/pipeline/engine/nodes.py", line 645, in _run_interface
    return self._run_command(execute)
  File "/usr/local/miniconda/lib/python3.10/site-packages/nipype/pipeline/engine/nodes.py", line 771, in _run_command
    raise NodeExecutionError(msg)
nipype.pipeline.engine.nodes.NodeExecutionError: Exception raised while executing Node eddy.

Cmdline:
	eddy_cuda  --cnr_maps --field=/working_dir/qsiprep_wf/single_subject_HCD1227134_wf/dwi_preproc_wf/hmc_sdc_wf/topup/fieldmap_HZ --field_mat=/working_dir/qsiprep_wf/single_subject_HCD1227134_wf/dwi_preproc_wf/hmc_sdc_wf/topup_to_eddy_reg/topup_reg_image_flirt.mat --flm=linear --ff=10.0 --acqp=/working_dir/qsiprep_wf/single_subject_HCD1227134_wf/dwi_preproc_wf/hmc_sdc_wf/gather_inputs/eddy_acqp.txt --bvals=/working_dir/qsiprep_wf/single_subject_HCD1227134_wf/dwi_preproc_wf/pre_hmc_wf/rpe_concat/merge__merged.bval --bvecs=/working_dir/qsiprep_wf/single_subject_HCD1227134_wf/dwi_preproc_wf/pre_hmc_wf/rpe_concat/merge__merged.bvec --imain=/working_dir/qsiprep_wf/single_subject_HCD1227134_wf/dwi_preproc_wf/pre_hmc_wf/rpe_concat/merge__merged.nii.gz --index=/working_dir/qsiprep_wf/single_subject_HCD1227134_wf/dwi_preproc_wf/hmc_sdc_wf/gather_inputs/eddy_index.txt --mask=/working_dir/qsiprep_wf/single_subject_HCD1227134_wf/dwi_preproc_wf/hmc_sdc_wf/pre_eddy_b0_ref_wf/synthstrip_wf/mask_to_original_grid/topup_imain_corrected_avg_trans_mask_trans.nii.gz --interp=spline --data_is_shelled --resamp=jac --niter=10 --nvoxhp=1000 --out=/working_dir/qsiprep_wf/single_subject_HCD1227134_wf/dwi_preproc_wf/hmc_sdc_wf/eddy/eddy_corrected --repol --slm=linear
Stdout:

	...................Allocated GPU # 0...................
	parallel_for failed: cudaErrorMemoryAllocation: out of memory
	EDDY:::  cuda/CudaVolume.cu:::  void EDDY::CudaVolume::common_assignment_from_newimage_vol(const NEWIMAGE::volume<float>&, bool):  Exception thrown
	EDDY:::  cuda/CudaVolume.h:::  EDDY::CudaVolume::CudaVolume(const NEWIMAGE::volume<float>&, bool):  Exception thrown
	EDDY:::  cuda/EddyInternalGpuUtils.cu:::  static void EDDY::EddyInternalGpuUtils::detect_outliers(const EDDY::EddyCommandLineOptions&, EDDY::ScanType, std::shared_ptr<EDDY::DWIPredictionMaker>, const NEWIMAGE::volume<float>&, const EDDY::ECScanManager&, EDDY::ReplacementManager&, EDDY::DiffStatsVector&):  Exception thrown
	EDDY:::  cuda/EddyGpuUtils.cu:::  static EDDY::DiffStatsVector EDDY::EddyGpuUtils::DetectOutliers(const EDDY::EddyCommandLineOptions&, EDDY::ScanType, std::shared_ptr<EDDY::DWIPredictionMaker>, const NEWIMAGE::volume<float>&, const EDDY::ECScanManager&, EDDY::ReplacementManager&):  Exception thrown
	EDDY:::  eddy.cpp:::  EDDY::ReplacementManager* EDDY::Register(const EDDY::EddyCommandLineOptions&, EDDY::ScanType, unsigned int, const std::vector<float, std::allocator<float> >&, EDDY::SecondLevelECModel, bool, EDDY::ECScanManager&, EDDY::ReplacementManager*, NEWMAT::Matrix&, NEWMAT::Matrix&):  Exception thrown
	EDDY::: Eddy failed with message EDDY:::  eddy.cpp:::  EDDY::ReplacementManager* EDDY::DoVolumeToVolumeRegistration(const EDDY::EddyCommandLineOptions&, EDDY::ECScanManager&):  Exception thrown
Stderr:
	thrust::system_error thrown in CudaVolume::common_assignment_from_newimage_vol after resize() with message: parallel_for failed: cudaErrorMemoryAllocation: out of memory
Traceback:
	Traceback (most recent call last):
	  File "/usr/local/miniconda/lib/python3.10/site-packages/nipype/interfaces/base/core.py", line 453, in aggregate_outputs
	    setattr(outputs, key, val)
	  File "/usr/local/miniconda/lib/python3.10/site-packages/nipype/interfaces/base/traits_extension.py", line 330, in validate
	    value = super(File, self).validate(objekt, name, value, return_pathlike=True)
	  File "/usr/local/miniconda/lib/python3.10/site-packages/nipype/interfaces/base/traits_extension.py", line 135, in validate
	    self.error(objekt, name, str(value))
	  File "/usr/local/miniconda/lib/python3.10/site-packages/traits/base_trait_handler.py", line 74, in error
	    raise TraitError(
	traits.trait_errors.TraitError: The 'out_parameter' trait of an ExtendedEddyOutputSpec instance must be a pathlike object or string representing an existing file, but a value of '/working_dir/qsiprep_wf/single_subject_HCD1227134_wf/dwi_preproc_wf/hmc_sdc_wf/eddy/eddy_corrected.eddy_parameters' <class 'str'> was specified.

	During handling of the above exception, another exception occurred:

	Traceback (most recent call last):
	  File "/usr/local/miniconda/lib/python3.10/site-packages/nipype/interfaces/base/core.py", line 400, in run
	    outputs = self.aggregate_outputs(runtime)
	  File "/usr/local/miniconda/lib/python3.10/site-packages/nipype/interfaces/base/core.py", line 460, in aggregate_outputs
	    raise FileNotFoundError(msg)
	FileNotFoundError: No such file or directory '/working_dir/qsiprep_wf/single_subject_HCD1227134_wf/dwi_preproc_wf/hmc_sdc_wf/eddy/eddy_corrected.eddy_parameters' for output 'out_parameter' of a ExtendedEddy interface

Screenshots / relevant information:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.95.01    Driver Version: 440.95.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:1A:00.0 Off |                    0 |
| N/A   56C    P0    54W / 250W |   4511MiB / 16280MiB |     36%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  Off  | 00000000:1B:00.0 Off |                    0 |
| N/A   41C    P0    27W / 250W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-PCIE...  Off  | 00000000:3D:00.0 Off |                    0 |
| N/A   39C    P0    29W / 250W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-PCIE...  Off  | 00000000:3E:00.0 Off |                    0 |
| N/A   41C    P0    28W / 250W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla P100-PCIE...  Off  | 00000000:88:00.0 Off |                    0 |
| N/A   47C    P0    31W / 250W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla P100-PCIE...  Off  | 00000000:89:00.0 Off |                    0 |
| N/A   42C    P0    28W / 250W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla P100-PCIE...  Off  | 00000000:B1:00.0 Off |                    0 |
| N/A   49C    P0    32W / 250W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla P100-PCIE...  Off  | 00000000:B2:00.0 Off |                    0 |
| N/A   45C    P0    29W / 250W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    105944      C   python                                      1233MiB |
|    0    115362      C   python                                      1233MiB |
|    0    128182      C   eddy_cuda                                    375MiB |
|    0    137458      C   eddy_cuda                                    383MiB |
|    0    149039      C   eddy_cuda                                    431MiB |
|    0    149759      C   eddy_cuda                                    469MiB |
|    0    155268      C   eddy_cuda                                    477MiB |

eddy_params.json:

{
  "flm": "linear",
  "slm": "linear",
  "fep": false,
  "interp": "spline",
  "nvoxhp": 1000,
  "fudge_factor": 10,
  "dont_sep_offs_move": false,
  "dont_peas": false,
  "niter": 10,
  "method": "jac",
  "repol": true,
  "num_threads": 40,
  "is_shelled": true,
  "use_cuda": true,
  "cnr_maps": true,
  "residuals": false,
  "output_type": "NIFTI_GZ",
  "args": ""
}

Hi @ZitengHan,

You can set the --omp-nthreads argument, which sets the maximum number of CPUs any single task may use.

You can also reduce "num_threads": 40 in your eddy config.
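
For illustration, here is a sketch of the same config with a smaller thread count; the value 8 is just an example, and everything else is unchanged from your file:

# Sketch: rewrite the file you pass via --eddy-config (in your command, the one
# mounted at /output/eddy_params.json) with a lower "num_threads".
cat > eddy_params.json <<'EOF'
{
  "flm": "linear",
  "slm": "linear",
  "fep": false,
  "interp": "spline",
  "nvoxhp": 1000,
  "fudge_factor": 10,
  "dont_sep_offs_move": false,
  "dont_peas": false,
  "niter": 10,
  "method": "jac",
  "repol": true,
  "num_threads": 8,
  "is_shelled": true,
  "use_cuda": true,
  "cnr_maps": true,
  "residuals": false,
  "output_type": "NIFTI_GZ",
  "args": ""
}
EOF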

How are you allocating these resources? E.g., are you setting memory and CPUs somewhere in your job scheduler (such as with an #SBATCH header on Slurm systems)?

Best,
Steven

Our Linux server has 40 CPU cores. How should I set --omp-nthreads and num_threads?

Resources are probably allocated dynamically by Linux itself; our server isn't set up with a job scheduler like Slurm.

Hi @ZitengHan

Perhaps --nthreads 32 and --omp-nthreads 8?

Got it, Steven. So --nthreads divided by --omp-nthreads gives the number of eddy_cuda processes that Docker runs simultaneously. Thanks a lot!

Hi @ZitengHan,

Not quite. --omp-nthreads is the maximum number of CPUs for any single process in QSIPrep (such as eddy_cuda), whereas --nthreads is the total number of CPUs allowed across all concurrent tasks in QSIPrep.
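
Concretely, and only as a sketch built on the command you posted (the 32/8 values are the ones floated above; adjust them to your hardware), the call would look like:

export HOME=/home/user7/Datapool/part3
docker run -ti --rm \
    --gpus all -e NVIDIA_DRIVER_CAPABILITIES=compute,utility -e NVIDIA_VISIBLE_DEVICES=all \
    -v $HOME/data:/data \
    -v $HOME/output:/output \
    -v /home/data/user7/HCD_QSIPrep_working_dir/part3:/working_dir \
    -v ${FREESURFER_HOME}/license.txt:/usr/local/freesurfer/license.txt \
    pennbbl/qsiprep:0.20.0 \
    /data /output participant \
    --fs-license-file /usr/local/freesurfer/license.txt \
    --output-resolution 1.5 \
    --distortion-group-merge average \
    --skip-anat-based-spatial-normalization \
    --eddy-config /output/eddy_params.json \
    --nthreads 32 \
    --omp-nthreads 8 \
    -w /working_dir -v -v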

Best,
Steven
