I am running fMRIPrep via Docker on multiple subjects on a Google Cloud instance, n1-standard-4 (4 vCPUs, 15 GB memory). After fMRIPrep has been running for a couple of hours or days, it crashes with the error "error waiting for container: unexpected EOF".
I am running fmriprep docker with the following command:
I have also tried the above command with the --low-mem flag, but I still get this error after a few hours of running. Some forums suggest it could be a memory issue. Has anyone experienced this and been able to resolve it?
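For reference, one way to make memory pressure less likely is to cap what both Docker and nipype think is available. This is only a sketch of such an invocation, not my actual command: the paths, participant label, and image tag are placeholders, and the resource flags (--nprocs, --omp-nthreads, --mem-mb, --low-mem) assume a recent fMRIPrep release.

```shell
# Sketch (placeholder paths/labels): cap the container at 12 GB and tell
# nipype to plan around 10 GB so it schedules fewer concurrent jobs.
docker run --rm -it \
  --memory=12g --memory-swap=12g \
  -v /data/bids:/data:ro \
  -v /data/derivatives:/out \
  nipreps/fmriprep:latest \
  /data /out participant \
  --participant-label 01 \
  --fs-license-file /out/license.txt \
  --nprocs 4 --omp-nthreads 2 \
  --mem-mb 10000 \
  --low-mem
```

Setting --mem-mb below the VM's physical 15 GB leaves headroom for the Docker daemon and the OS, which otherwise compete with the workflow for the same memory.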
Apologies for the late response; it seems my earlier reply never went through. Posting it again below.
For most subjects it crashes when the anat_preproc_wf.anat_norm_wf.registration workflow starts, after both the anatomical and functional workflows have finished successfully; for a few it crashes while running autorecon, and for one subject it crashed during the func_preproc_task_rest_wf.bold_std_trans_wf.bold_to_std_transform workflow. So it is not crashing at any one particular point. Also, right before starting these workflows the fMRIPrep pipeline reports free memory, and memory appears to be available; see a few of the instances below:
211003-04:02:07,311 nipype.workflow INFO:
[MultiProc] Running 1 tasks, and 142 jobs ready. Free memory (GB): 8.19/13.19, Free processors: 1/4.
Currently running:
* _autorecon_surfs1
ERRO[553279] error waiting for container: unexpected EOF
211006-06:12:18,366 nipype.workflow INFO:
[MultiProc] Running 1 tasks, and 92 jobs ready. Free memory (GB): 11.49/13.19, Free processors: 1/4.
Currently running:
* fmriprep_wf.single_subject_091004s20140805_wf.func_preproc_task_rest_wf.bold_std_trans_wf.bold_to_std_transform
ERRO[636774] error waiting for container: unexpected EOF
211001-10:54:22,130 nipype.workflow INFO:
[MultiProc] Running 1 tasks, and 145 jobs ready. Free memory (GB): 8.19/13.19, Free processors: 1/4.
Currently running:
* _autorecon_surfs0
ERRO[551680] error waiting for container: unexpected EOF
Since my last response above (which bounced with a delivery failure), I re-ran the same command with the --low-mem flag a few more times, and eventually they all completed with no further errors. It may be memory-related, but that is hard to confirm because I was also getting random disconnections from the cloud instances.
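For anyone hitting the same "unexpected EOF", it can help to check after a crash whether the kernel's OOM killer or a Docker daemon restart was responsible, rather than fMRIPrep itself. A sketch of what to look at (requires sudo on the VM; <container-id> is a placeholder for the crashed container):

```shell
# Did the kernel OOM killer fire?
sudo dmesg -T | grep -iE 'out of memory|oom-kill'

# Did the Docker daemon restart or panic around the crash time?
sudo journalctl -u docker --since "2 days ago" | grep -iE 'oom|restart|panic'

# Was the container itself OOM-killed, and with what exit code?
docker inspect <container-id> --format '{{.State.OOMKilled}} {{.State.ExitCode}}'
```

If none of these show anything, a dropped SSH session or a preempted/restarted instance is a plausible alternative explanation for the EOF.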