Summary of what happened:
Hi, experts. I am using QSIPrep to preprocess DTI data from the HCP-Development database and would like to use eddy_cuda to accelerate the SDC step. The HPC node I am working on has 512 GB of RAM and 8 GPUs, each with 16 GB of memory. While eddy_cuda runs I see several GPU memory errors and crash files are written, yet the official Docker container keeps running with five active eddy_cuda processes. Judging from the "Allocated GPU # 0" message in the crash log and the nvidia-smi snapshot at the bottom of this post, all of the eddy_cuda processes appear to land on GPU 0 while the other seven GPUs stay idle.
My first question is whether the eddy_cuda processes that fail are retried automatically.
Second, how can I reduce the number of eddy_cuda processes running in parallel (from eight to something smaller) so that these out-of-memory errors are avoided? A sketch of one workaround I am considering is below.
Command used (and if a helper script was used, a link to the helper script or the command generated):
export HOME=/home/user7/Datapool/part3
docker run -ti --rm \
--gpus all -e NVIDIA_DRIVER_CAPABILITIES=compute,utility -e NVIDIA_VISIBLE_DEVICES=all \
-v $HOME/data:/data \
-v $HOME/output:/output \
-v /home/data/user7//HCD_QSIPrep_working_dir/part3:/working_dir \
-v ${FREESURFER_HOME}/license.txt:/usr/local/freesurfer/license.txt \
pennbbl/qsiprep:0.20.0 \
/data /output participant \
--fs-license-file /usr/local/freesurfer/license.txt \
--output-resolution 1.5 \
--distortion-group-merge average \
--skip-anat-based-spatial-normalization \
--eddy-config /output/eddy_params.json \
-w /working_dir -v -v
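Alternatively, is capping the CPU budget the intended way to limit concurrency? From the MultiProc log below ("Free processors: 0/40" once five eddy tasks are running), each eddy node appears to reserve 8 CPUs, so my unverified guess is that the performance options (--nthreads / --omp-nthreads, if I am reading the documentation correctly) would allow at most 16 / 8 = 2 eddy_cuda processes at a time with values like these:
# same command as above, but with a smaller CPU budget so MultiProc starts
# fewer eddy nodes concurrently (16 worker CPUs / 8 threads per process = 2);
# the numbers are only illustrative
docker run -ti --rm \
--gpus all -e NVIDIA_DRIVER_CAPABILITIES=compute,utility -e NVIDIA_VISIBLE_DEVICES=all \
-v $HOME/data:/data \
-v $HOME/output:/output \
-v /home/data/user7//HCD_QSIPrep_working_dir/part3:/working_dir \
-v ${FREESURFER_HOME}/license.txt:/usr/local/freesurfer/license.txt \
pennbbl/qsiprep:0.20.0 \
/data /output participant \
--fs-license-file /usr/local/freesurfer/license.txt \
--output-resolution 1.5 \
--distortion-group-merge average \
--skip-anat-based-spatial-normalization \
--eddy-config /output/eddy_params.json \
--nthreads 16 --omp-nthreads 8 \
-w /working_dir -v -v
Is that the recommended approach, or is there a dedicated option for limiting the number of simultaneous eddy_cuda runs?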
Version:
pennbbl/qsiprep:0.20.0
Environment (Docker, Singularity / Apptainer, custom installation):
Docker
Relevant log outputs (up to 20 lines):
240320-08:34:48,518 nipype.workflow INFO:
[MultiProc] Running 4 tasks, and 187 jobs ready. Free memory (GB): 451.35/452.15, Free processors: 8/40.
Currently running:
* qsiprep_wf.single_subject_HCD1197757_wf.dwi_preproc_wf.hmc_sdc_wf.eddy
* qsiprep_wf.single_subject_HCD1106728_wf.dwi_preproc_wf.hmc_sdc_wf.eddy
* qsiprep_wf.single_subject_HCD0969072_wf.dwi_preproc_wf.hmc_sdc_wf.eddy
* qsiprep_wf.single_subject_HCD0643345_wf.dwi_preproc_wf.hmc_sdc_wf.eddy
240320-08:34:49,757 nipype.workflow INFO:
[Node] Setting-up "qsiprep_wf.single_subject_HCD1227134_wf.dwi_preproc_wf.hmc_sdc_wf.eddy" in "/working_dir/qsiprep_wf/single_subject_HCD1227134_wf/dwi_preproc_wf/hmc_sdc_wf/eddy".
240320-08:34:49,784 nipype.workflow INFO:
[Node] Executing "eddy" <qsiprep.interfaces.eddy.ExtendedEddy>
240320-08:34:50,465 nipype.workflow INFO:
[MultiProc] Running 5 tasks, and 186 jobs ready. Free memory (GB): 451.15/452.15, Free processors: 0/40.
Currently running:
* qsiprep_wf.single_subject_HCD1227134_wf.dwi_preproc_wf.hmc_sdc_wf.eddy
* qsiprep_wf.single_subject_HCD1197757_wf.dwi_preproc_wf.hmc_sdc_wf.eddy
* qsiprep_wf.single_subject_HCD1106728_wf.dwi_preproc_wf.hmc_sdc_wf.eddy
* qsiprep_wf.single_subject_HCD0969072_wf.dwi_preproc_wf.hmc_sdc_wf.eddy
* qsiprep_wf.single_subject_HCD0643345_wf.dwi_preproc_wf.hmc_sdc_wf.eddy
240320-09:22:58,394 nipype.workflow INFO:
[Node] Finished "eddy", elapsed time 2888.60715s.
240320-09:22:58,395 nipype.workflow WARNING:
Storing result file without outputs
240320-09:22:58,396 nipype.workflow WARNING:
[Node] Error on "qsiprep_wf.single_subject_HCD1227134_wf.dwi_preproc_wf.hmc_sdc_wf.eddy" (/working_dir/qsiprep_wf/single_subject_HCD1227134_wf/dwi_preproc_wf/hmc_sdc_wf/eddy)
240320-09:22:58,523 nipype.workflow ERROR:
Node eddy failed to run on host 51d0bb00b888.
240320-09:22:58,536 nipype.workflow ERROR:
Saving crash info to /output/qsiprep/sub-HCD1227134/log/20240320-040316_c86b7929-7dfd-4fbc-aaf7-d548aad58a65/crash-20240320-092258-root-eddy-b86fccae-a7aa-455e-8c8b-c1f8684b2d73.txt
Traceback (most recent call last):
File "/usr/local/miniconda/lib/python3.10/site-packages/nipype/pipeline/plugins/multiproc.py", line 67, in run_node
result["result"] = node.run(updatehash=updatehash)
File "/usr/local/miniconda/lib/python3.10/site-packages/nipype/pipeline/engine/nodes.py", line 527, in run
result = self._run_interface(execute=True)
File "/usr/local/miniconda/lib/python3.10/site-packages/nipype/pipeline/engine/nodes.py", line 645, in _run_interface
return self._run_command(execute)
File "/usr/local/miniconda/lib/python3.10/site-packages/nipype/pipeline/engine/nodes.py", line 771, in _run_command
raise NodeExecutionError(msg)
nipype.pipeline.engine.nodes.NodeExecutionError: Exception raised while executing Node eddy.
Cmdline:
eddy_cuda --cnr_maps --field=/working_dir/qsiprep_wf/single_subject_HCD1227134_wf/dwi_preproc_wf/hmc_sdc_wf/topup/fieldmap_HZ --field_mat=/working_dir/qsiprep_wf/single_subject_HCD1227134_wf/dwi_preproc_wf/hmc_sdc_wf/topup_to_eddy_reg/topup_reg_image_flirt.mat --flm=linear --ff=10.0 --acqp=/working_dir/qsiprep_wf/single_subject_HCD1227134_wf/dwi_preproc_wf/hmc_sdc_wf/gather_inputs/eddy_acqp.txt --bvals=/working_dir/qsiprep_wf/single_subject_HCD1227134_wf/dwi_preproc_wf/pre_hmc_wf/rpe_concat/merge__merged.bval --bvecs=/working_dir/qsiprep_wf/single_subject_HCD1227134_wf/dwi_preproc_wf/pre_hmc_wf/rpe_concat/merge__merged.bvec --imain=/working_dir/qsiprep_wf/single_subject_HCD1227134_wf/dwi_preproc_wf/pre_hmc_wf/rpe_concat/merge__merged.nii.gz --index=/working_dir/qsiprep_wf/single_subject_HCD1227134_wf/dwi_preproc_wf/hmc_sdc_wf/gather_inputs/eddy_index.txt --mask=/working_dir/qsiprep_wf/single_subject_HCD1227134_wf/dwi_preproc_wf/hmc_sdc_wf/pre_eddy_b0_ref_wf/synthstrip_wf/mask_to_original_grid/topup_imain_corrected_avg_trans_mask_trans.nii.gz --interp=spline --data_is_shelled --resamp=jac --niter=10 --nvoxhp=1000 --out=/working_dir/qsiprep_wf/single_subject_HCD1227134_wf/dwi_preproc_wf/hmc_sdc_wf/eddy/eddy_corrected --repol --slm=linear
Stdout:
...................Allocated GPU # 0...................
parallel_for failed: cudaErrorMemoryAllocation: out of memory
EDDY::: cuda/CudaVolume.cu::: void EDDY::CudaVolume::common_assignment_from_newimage_vol(const NEWIMAGE::volume<float>&, bool): Exception thrown
EDDY::: cuda/CudaVolume.h::: EDDY::CudaVolume::CudaVolume(const NEWIMAGE::volume<float>&, bool): Exception thrown
EDDY::: cuda/EddyInternalGpuUtils.cu::: static void EDDY::EddyInternalGpuUtils::detect_outliers(const EDDY::EddyCommandLineOptions&, EDDY::ScanType, std::shared_ptr<EDDY::DWIPredictionMaker>, const NEWIMAGE::volume<float>&, const EDDY::ECScanManager&, EDDY::ReplacementManager&, EDDY::DiffStatsVector&): Exception thrown
EDDY::: cuda/EddyGpuUtils.cu::: static EDDY::DiffStatsVector EDDY::EddyGpuUtils::DetectOutliers(const EDDY::EddyCommandLineOptions&, EDDY::ScanType, std::shared_ptr<EDDY::DWIPredictionMaker>, const NEWIMAGE::volume<float>&, const EDDY::ECScanManager&, EDDY::ReplacementManager&): Exception thrown
EDDY::: eddy.cpp::: EDDY::ReplacementManager* EDDY::Register(const EDDY::EddyCommandLineOptions&, EDDY::ScanType, unsigned int, const std::vector<float, std::allocator<float> >&, EDDY::SecondLevelECModel, bool, EDDY::ECScanManager&, EDDY::ReplacementManager*, NEWMAT::Matrix&, NEWMAT::Matrix&): Exception thrown
EDDY::: Eddy failed with message EDDY::: eddy.cpp::: EDDY::ReplacementManager* EDDY::DoVolumeToVolumeRegistration(const EDDY::EddyCommandLineOptions&, EDDY::ECScanManager&): Exception thrown
Stderr:
thrust::system_error thrown in CudaVolume::common_assignment_from_newimage_vol after resize() with message: parallel_for failed: cudaErrorMemoryAllocation: out of memory
Traceback:
Traceback (most recent call last):
File "/usr/local/miniconda/lib/python3.10/site-packages/nipype/interfaces/base/core.py", line 453, in aggregate_outputs
setattr(outputs, key, val)
File "/usr/local/miniconda/lib/python3.10/site-packages/nipype/interfaces/base/traits_extension.py", line 330, in validate
value = super(File, self).validate(objekt, name, value, return_pathlike=True)
File "/usr/local/miniconda/lib/python3.10/site-packages/nipype/interfaces/base/traits_extension.py", line 135, in validate
self.error(objekt, name, str(value))
File "/usr/local/miniconda/lib/python3.10/site-packages/traits/base_trait_handler.py", line 74, in error
raise TraitError(
traits.trait_errors.TraitError: The 'out_parameter' trait of an ExtendedEddyOutputSpec instance must be a pathlike object or string representing an existing file, but a value of '/working_dir/qsiprep_wf/single_subject_HCD1227134_wf/dwi_preproc_wf/hmc_sdc_wf/eddy/eddy_corrected.eddy_parameters' <class 'str'> was specified.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/miniconda/lib/python3.10/site-packages/nipype/interfaces/base/core.py", line 400, in run
outputs = self.aggregate_outputs(runtime)
File "/usr/local/miniconda/lib/python3.10/site-packages/nipype/interfaces/base/core.py", line 460, in aggregate_outputs
raise FileNotFoundError(msg)
FileNotFoundError: No such file or directory '/working_dir/qsiprep_wf/single_subject_HCD1227134_wf/dwi_preproc_wf/hmc_sdc_wf/eddy/eddy_corrected.eddy_parameters' for output 'out_parameter' of a ExtendedEddy interface
Screenshots / relevant information:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.95.01 Driver Version: 440.95.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... Off | 00000000:1A:00.0 Off | 0 |
| N/A 56C P0 54W / 250W | 4511MiB / 16280MiB | 36% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P100-PCIE... Off | 00000000:1B:00.0 Off | 0 |
| N/A 41C P0 27W / 250W | 10MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla P100-PCIE... Off | 00000000:3D:00.0 Off | 0 |
| N/A 39C P0 29W / 250W | 10MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla P100-PCIE... Off | 00000000:3E:00.0 Off | 0 |
| N/A 41C P0 28W / 250W | 10MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla P100-PCIE... Off | 00000000:88:00.0 Off | 0 |
| N/A 47C P0 31W / 250W | 10MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla P100-PCIE... Off | 00000000:89:00.0 Off | 0 |
| N/A 42C P0 28W / 250W | 10MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla P100-PCIE... Off | 00000000:B1:00.0 Off | 0 |
| N/A 49C P0 32W / 250W | 10MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla P100-PCIE... Off | 00000000:B2:00.0 Off | 0 |
| N/A 45C P0 29W / 250W | 10MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 105944 C python 1233MiB |
| 0 115362 C python 1233MiB |
| 0 128182 C eddy_cuda 375MiB |
| 0 137458 C eddy_cuda 383MiB |
| 0 149039 C eddy_cuda 431MiB |
| 0 149759 C eddy_cuda 469MiB |
| 0 155268 C eddy_cuda 477MiB |