Tips for running BIDS apps on very large data sets?

I am working on the HCP development data release, which includes 655 subjects with a lot of anat, func, and dwi data. I am looking to run mriqc, fmriprep, and qsiprep to process these data. For the other, much smaller datasets I work with, I have had no trouble submitting an sbatch job array, with each subject getting their own job via the --participant-label argument.

Typically, each process begins by building an SQL database of the BIDS directory in the work folder. This is not too time-consuming for my other datasets, so I don’t mind having it run for each subject, even though it may not be the most efficient approach in theory. However, for the HCP dataset, none of my jobs have gotten past this step after 15 hours at the time of writing, with 40 GB x 4 CPUs devoted. And this happens for every subject.

Does anyone have any tips for processing an unwieldily large dataset? Would it make sense to include multiple subjects in a job to reduce the redundancy of recreating the database every time? Am I just not devoting enough memory? Should I try a workaround where I make a BIDS-compliant directory for each individual subject (hopefully not this :sweat_smile:)?
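For reference, the per-subject workaround I have in mind would look roughly like the sketch below. It is untested on the real data and rests on several assumptions: the function name `make_single_subject_view` is made up, the dataset uses the usual `sub-<label>` folder layout, and symlinks are acceptable (inside a container, the original dataset path would also have to be mounted so the links resolve).

```python
import os
from pathlib import Path


def make_single_subject_view(bids_root, view_root, subject):
    """Build a small BIDS-style directory that exposes only one subject.

    Top-level files (dataset_description.json, participants.tsv, ...) and the
    subject folder are symlinked rather than copied, so no imaging data is
    duplicated. Hypothetical helper, not part of any BIDS app.
    """
    bids_root = Path(bids_root).resolve()
    view_root = Path(view_root)
    label = subject if subject.startswith("sub-") else f"sub-{subject}"

    src_subject = bids_root / label
    if not src_subject.is_dir():
        raise FileNotFoundError(f"No such subject directory: {src_subject}")

    view_root.mkdir(parents=True, exist_ok=True)

    # Link every top-level file so dataset-level metadata stays available.
    for entry in bids_root.iterdir():
        if entry.is_file():
            link = view_root / entry.name
            if not (link.is_symlink() or link.exists()):
                os.symlink(entry, link)

    # Link the one subject folder we care about.
    link = view_root / label
    if not (link.is_symlink() or link.exists()):
        os.symlink(src_subject, link, target_is_directory=True)

    return view_root
```

Each array job would then point the BIDS app at its own small view directory instead of the full dataset, so the indexing step only ever sees one subject.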

Thanks, and happy to provide more info as needed!
Steven

Not sure if you’ve checked out brainlife, but it’s a platform meant for large dataset processing.

I don’t believe it has qsiprep though.


Thanks for the quick reply! I have heard of brainlife.io but I’ve never uploaded data there (I typically just download their containers to run on my own). Good to know that it may be better suited for larger datasets.

They have access to quite a bit of processing resources, which is certainly convenient for large imaging studies. There’s even a BIDS upload command so that you can upload your BIDS data and hit the ground running with the BIDS-apps on the platform.


I know fmriprep has:

--bids-database-dir

Path to an existing PyBIDS database folder, for faster indexing (especially useful for large datasets).

It doesn’t appear as though mriqc has this option, and I’m not sure about qsiprep.
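A rough sketch of how that could be used: index the dataset once with PyBIDS, save the database, and pass the same folder to every fmriprep job. This assumes pybids is installed and that its version is compatible with the one inside your fmriprep container, which I have not verified. The helper names `build_bids_database` and `fmriprep_command` are made up for illustration.

```python
def build_bids_database(bids_dir, database_dir):
    """Index the dataset once and write the PyBIDS database to database_dir.

    Run this a single time, outside the job array. Requires pybids
    (`pip install pybids`); imported lazily so the rest of the script
    works without it.
    """
    from bids import BIDSLayout

    # reset_database=True rebuilds the index from scratch at database_path.
    return BIDSLayout(
        str(bids_dir),
        database_path=str(database_dir),
        reset_database=True,
    )


def fmriprep_command(bids_dir, output_dir, subject, database_dir, work_dir):
    """Assemble the per-subject fmriprep call that reuses the saved database."""
    return [
        "fmriprep", str(bids_dir), str(output_dir), "participant",
        "--participant-label", subject,
        "--bids-database-dir", str(database_dir),
        "-w", str(work_dir),
    ]


# Example (paths are placeholders):
#   build_bids_database("/data/hcpd", "/data/hcpd_bids_db")        # once
#   cmd = fmriprep_command("/data/hcpd", "/data/derivatives", "01",
#                          "/data/hcpd_bids_db", "/scratch/work")  # per job
#   subprocess.run(cmd, check=True)
```

The one-off indexing may still take a long time on a dataset this size, but it only has to happen once instead of once per subject.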


That is very helpful to know and sounds like the fix for my problems (if only qsiprep had such an option)! Preliminary testing with brainlife.io is also running well.
