Fmriprep stalling

JarodRoland · August 9, 2018, 4:27pm

I am trying to get a single subject run through fmriprep (my first time using fmriprep). I am running it through Singularity on an HPC. I am tee’ing the output to a file to follow the progress, but it consistently stalls at this Merge step:

[Node] Setting-up "fmriprep_wf.single_subject_EP199_wf.func_preproc_ses_pre_task_rest_run_001_wf.bold_reg_wf.merge" in "/scratch/jarod.roland/fmriprep-scratch/fmriprep_wf/single_subject_EP199_wf/func_preproc_ses_pre_task_rest_run_001_wf/bold_reg_wf/merge".
180809-05:34:35,156 nipype.workflow INFO: [Node] Outdated cache found for "fmriprep_wf.single_subject_EP199_wf.func_preproc_ses_pre_task_rest_run_001_wf.bold_reg_wf.merge".
180809-05:34:35,200 nipype.workflow INFO: [Node] Running "merge" ("fmriprep.interfaces.nilearn.Merge")

Here is a link to the full output: https://www.dropbox.com/s/kdbkug4ux871r4e/EP199Sing.o4832686?dl=0

Any thoughts or suggestions?

effigies · August 9, 2018, 4:46pm

I’m not sure what’s causing the stalling, but you’re getting errors early on in your run because you’re placing your outputs inside your dataset. It looks like you must be doing the equivalent of:

fmriprep /home/jarod.roland/StudyEP199 /home/jarod.roland/StudyEP199/fmriprep-output participant [...]

This is causing your outputs to be indexed along with your inputs, and the aparc/aseg ROI files to be loaded as if they were lesion masks.

I would recommend either placing your outputs in a separate directory or using the BIDS convention of storing outputs in <bids-root>/derivatives/, like so:

fmriprep /home/jarod.roland/StudyEP199 /home/jarod.roland/StudyEP199/derivatives participant [...]

Can you try that and see if it fixes your issues? If it still stalls, we can look deeper.
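
For anyone who wants to sanity-check this before launching a long job, here is a minimal shell sketch. `check_output_dir` is a hypothetical helper, not part of fMRIPrep. It only compares the two path strings (it does not resolve symlinks or relative paths) and reports whether the output directory is outside the dataset, under derivatives/, or somewhere else inside it.

```shell
# Hypothetical helper (not part of fMRIPrep): classify where an output
# directory sits relative to a BIDS dataset root.
# Prints "outside", "derivatives", or "inside".
# Purely lexical: symlinks and relative paths are NOT resolved.
check_output_dir() {
    local bids_root="${1%/}"   # drop one trailing slash, if any
    local out_dir="${2%/}"
    case "$out_dir" in
        "$bids_root"/derivatives|"$bids_root"/derivatives/*)
            echo "derivatives" ;;
        "$bids_root"|"$bids_root"/*)
            echo "inside" ;;
        *)
            echo "outside" ;;
    esac
}
```

With the paths from this thread, `check_output_dir /home/jarod.roland/StudyEP199 /home/jarod.roland/StudyEP199/fmriprep-output` prints `inside`, which is the case to avoid.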

JarodRoland · August 9, 2018, 5:10pm

Thanks for the quick reply. I’ll give that a try and follow up with the results.

Here is my new command line:
singularity run --cleanenv -B /scratch:/scratch \
    /scratch/jarod.roland/poldracklab_fmriprep_latest-2018-07-31-c0cb918082c1.img --fs-license-file /home/jarod.roland/freesurfer-license.txt \
    /home/jarod.roland/StudyEP199 /home/jarod.roland/StudyEP199/derivatives \
    participant --participant-label EP199 --nthreads 1 -w /scratch/jarod.roland/fmriprep-scratch \
    --omp-nthreads 1 | tee /home/jarod.roland/StudyEP199/fmriprep-EP199-pre.log
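
One caveat when piping into tee: the pipeline reports tee's exit status rather than the command's, and only stdout goes through the pipe unless stderr is redirected as well. Below is a minimal sketch of a wrapper that keeps both. `run_logged` is a hypothetical name, not something provided by fMRIPrep or Singularity.

```shell
# Hypothetical helper: run a command, copy its stdout AND stderr to a log
# file via tee, and return the command's own exit status instead of tee's.
run_logged() {
    local log_file="$1"
    shift
    local status_file
    status_file=$(mktemp) || return 1
    {
        # This group runs in a subshell (left side of a pipe), so the
        # exit status is handed back through a temporary file.
        rc=0
        "$@" 2>&1 || rc=$?
        echo "$rc" > "$status_file"
    } | tee "$log_file"
    local cmd_status
    cmd_status=$(cat "$status_file")
    rm -f "$status_file"
    return "$cmd_status"
}
```

Usage would be `run_logged /path/to/fmriprep.log singularity run ...` with the same arguments as in the command above. In bash, `set -o pipefail` or `${PIPESTATUS[0]}` are the more common ways to get the exit status; the temporary file just keeps the sketch portable to plain sh.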

ChrisGorgolewski · August 9, 2018, 5:39pm

I wonder if the stalling is due to Docker (Windows or Mac) soft memory limits. Recent Docker installations have swap enabled, which might give an impression of stalling. Check the memory settings in the Docker UI.

effigies · August 9, 2018, 5:47pm

Shell piping could also be causing an impression of stalling because of buffering. You may not see some output until enough new output is produced or the program terminates. There are various ways to modify the buffering behavior, but none that a quick Googling turned up were ones I could be confident in recommending; you may want to experiment with them yourself.

If you’re able to log into the machine that’s running your job, I would recommend using htop, which will let you see if any cores, memory or swap are being heavily used. That will help distinguish a stalled main process from one that’s just waiting on a subprocess. Some fMRIPrep steps are pretty expensive.
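
If htop isn't installed on the node, a rough non-interactive alternative on Linux is to read /proc directly. This is only a sketch: `proc_summary` is a made-up helper name, and you would need the PID of the fMRIPrep process (for example from `pgrep -f fmriprep`, assuming pgrep is available on the node).

```shell
# Hypothetical helper (Linux only): print a few fields from
# /proc/<pid>/status for a given process ID.
#   State   R = running, S = sleeping, D = uninterruptible (usually I/O)
#   VmRSS   resident memory
#   VmSwap  swapped-out memory
proc_summary() {
    local pid="$1"
    local status_file="/proc/$pid/status"
    if [ ! -r "$status_file" ]; then
        echo "no readable process with PID $pid" >&2
        return 1
    fi
    grep -E '^(Name|State|VmRSS|VmSwap|Threads):' "$status_file"
}
```

Running it a few times a minute apart gives a crude picture: a process that stays in state S with flat memory is probably waiting on something, while R or growing VmRSS suggests it is still working.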

JarodRoland · August 10, 2018, 4:49pm

Thanks for the suggestions. I tried a few options but still end up stalling at the same “merge” step. Neither the files nor the output log have changed in well over 10 hours since I left it overnight, so I’m sure it’s not buffering.
I don’t think I can connect to the node running on the cluster to query the memory status. The cluster node has 128GB of RAM, so if it is out of memory then I have bigger issues. The Singularity image is just the defaults as built per the docs (http://fmriprep.readthedocs.io/en/stable/installation.html#preparing-a-singularity-image-singularity-version-2-5). But when I get back home this weekend I can run a local Docker on my desktop to investigate further if it’s a memory issue.
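
To put a number on how long a log or working-directory file has been silent, something like the following could be used. It is a sketch under assumptions: `log_age_seconds` is a hypothetical helper, and it tries the GNU form of stat first and the BSD form second.

```shell
# Hypothetical helper: print how many seconds ago a file was last modified.
# Tries GNU stat (-c %Y) first, then BSD/macOS stat (-f %m).
log_age_seconds() {
    local file="$1"
    local now
    local mtime
    now=$(date +%s)
    mtime=$(stat -c %Y "$file" 2>/dev/null || stat -f %m "$file" 2>/dev/null) || return 1
    echo $(( now - mtime ))
}
```

For example, `log_age_seconds /home/jarod.roland/fmriprep-EP199-pre.log` would print the number of seconds since the tee'd log last changed.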

Here is my last command:
singularity run --cleanenv -B /scratch:/scratch /scratch/jarod.roland/poldracklab_fmriprep_latest-2018-07-31-c0cb918082c1.img --fs-license-file /home/jarod.roland/freesurfer-license.txt /scratch/jarod.roland/StudyEP199 /scratch/jarod.roland/fmripre-output participant --participant-label EP199 --nthreads 1 -w /scratch/jarod.roland/fmriprep-scratch --omp-nthreads 1 | tee /home/jarod.roland/fmriprep-EP199-pre.log

And the output is here https://www.dropbox.com/s/8tdrtgffr0rh1jg/EP199Sing.o4844737.txt?dl=0

Thanks!

effigies · August 10, 2018, 6:07pm

> nor the output log have changed in well over 10 hours since I left it overnight, so I’m sure it’s not buffering.

My point was that there’s a possibility that the next node is not producing enough output to show that Merge has finished. Given that you have --omp-nthreads 1, if the next job is extremely resource-intensive, it might be taking a very long time.

In any event, if you add a -v flag (or two) to your command, you should be able to produce more output to rule this case out.

Finally, are you at any risk of hitting your filesystem quota?

> I don’t think I can connect to the node running on the cluster to query the memory status.

Perhaps you can talk to your sysadmin? It’s going to be very difficult to debug without knowing whether the job is actually running.

> The cluster node has 128GB of RAM, so if it is out of memory then I have bigger issues.

I’m seeing the following in your PBS epilogue:

Limits: nodes=1:ppn=1,walltime=168:00:00,neednodes=1:ppn=1,mem=4gb

4GB is going to cause problems, one way or another. Also, if you’re running ANTs or FreeSurfer, you’re going to suffer extremely long run times with only one core.
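
To pull a single value out of an epilogue line like the one above, a small parser along these lines might help. `pbs_limit` is a hypothetical helper; the format it expects is just the comma-separated key=value list shown in that Limits line, and other schedulers may print something different.

```shell
# Hypothetical helper: extract one key from a PBS-style "Limits:" line,
# e.g. "Limits: nodes=1:ppn=1,walltime=168:00:00,neednodes=1:ppn=1,mem=4gb".
# Usage: pbs_limit KEY "LINE"
pbs_limit() {
    local key="$1"
    local line="$2"
    # Put each key=value pair on its own line, then print the value of
    # the first pair whose key matches exactly.
    printf '%s\n' "$line" \
        | tr ' ,' '\n\n' \
        | awk -F= -v k="$key" '$1 == k { print substr($0, length(k) + 2); exit }'
}
```

Called with `mem` and the line quoted above, it prints `4gb`.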

JarodRoland · August 10, 2018, 6:24pm

> In any event, if you add a -v flag (or two) to your command, you should be able to produce more output to rule this case out.

I’ll add -vv to the command and re-run.

> Finally, are you at any risk of hitting your filesystem quota?

I was afraid of that too, so just in case I moved all the output to my scratch disk, where the limit is 1TB and I’m nowhere close to that.

> I’m seeing the following in your PBS epilogue:

Good catch, I didn’t realize the default memory for single-threaded jobs is 4GB. I was trying a single thread to limit complexity for debugging. I fixed the job to request 64GB of memory and switched back to 16 threads.
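
For reference, a request like that might be written roughly as follows, mirroring the key=value format of the Limits line in the epilogue. The helper name is hypothetical and the exact resource keywords are cluster-specific, so treat this as a sketch and check the site documentation.

```shell
# Hypothetical helper: build a single-node PBS resource string in the same
# format as the epilogue's Limits line (nodes=1:ppn=N,mem=Mgb).
# Usage: pbs_resource_request CORES MEM_GB
pbs_resource_request() {
    local cores="$1"
    local mem_gb="$2"
    printf 'nodes=1:ppn=%s,mem=%sgb\n' "$cores" "$mem_gb"
}
```

On a Torque/PBS cluster this string would typically be passed as `qsub -l "$(pbs_resource_request 16 64)" job.sh`, where job.sh is a placeholder for the job script, with the fMRIPrep --nthreads value kept in line with the number of cores requested.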

Thanks again for the awesome tips and quick reply. I’ll go back and see if I make progress.

Cheers!

JarodRoland · August 16, 2018, 1:04am

As an update, I never could get the data to run through fmriprep with the Singularity container, but I did successfully run it in Docker on my desktop without error. Not sure why it kept stalling on the “merge” step on the HPC in Singularity (it has more memory, disk space, and cores than my desktop). Oh well, moving on for now.

Thanks again for the help