I am working on the HCP Development data release, which includes 655 subjects with a lot of anat, func, and dwi data. I am looking to run MRIQC, fMRIPrep, and QSIPrep to process these data. For other, much smaller datasets, I have had no trouble submitting an sbatch job array, with each subject getting its own job via the --participant-label argument.
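For reference, my per-subject submission looks roughly like the sketch below. The paths, the subject list file, and the container invocation are placeholders, and the fMRIPrep options are simplified; the MRIQC and QSIPrep jobs follow the same pattern.

```bash
#!/bin/bash
#SBATCH --array=0-654            # one array task per subject
#SBATCH --cpus-per-task=4
#SBATCH --mem=40G
#SBATCH --time=48:00:00

# Placeholder paths -- adjust to your own setup
BIDS_DIR=/data/hcp_dev/bids
OUT_DIR=/data/hcp_dev/derivatives/fmriprep
WORK_DIR=/scratch/${USER}/fmriprep_work

# Pick this task's subject label from a pre-made list (one label per line)
SUBJECT=$(sed -n "$((SLURM_ARRAY_TASK_ID + 1))p" subject_list.txt)

# Simplified fMRIPrep call via a Singularity image (placeholder name)
singularity run --cleanenv fmriprep.sif \
    "$BIDS_DIR" "$OUT_DIR" participant \
    --participant-label "$SUBJECT" \
    -w "$WORK_DIR/$SUBJECT" \
    --nprocs 4 --mem-mb 40000
```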
Typically, each process begins by building an SQLite database of the BIDS directory (the PyBIDS layout index) in the work folder. This is not too time-consuming for my other datasets, so I don't mind it running once per subject, even if that may not be the most efficient approach in theory. For the HCP dataset, however, none of my jobs has made it past this step after 15 hours (at the time of writing), despite devoting 40 GB x 4 CPUs to each job, and this happens for every subject.
Does anyone have tips for processing a dataset this large? Would it make sense to include multiple subjects in each job to reduce the redundancy of recreating the database every time (one idea along those lines is sketched below)? Am I just not devoting enough memory? Should I resort to a workaround where I make a separate BIDS-compliant directory for each individual subject (hopefully not this)?
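On the redundancy question, one thing I was considering is pre-building the PyBIDS index once in a setup job and then pointing every per-subject job at it with --bids-database-dir (fMRIPrep and QSIPrep accept that flag; I would have to check whether my MRIQC version has an equivalent). A minimal sketch, with placeholder paths and container name:

```bash
# One-time setup job: build the PyBIDS index for the whole dataset
BIDS_DIR=/data/hcp_dev/bids          # placeholder path
BIDS_DB=/data/hcp_dev/bids_db        # shared location for the reusable index

python -c "
from bids import BIDSLayout
# database_path tells PyBIDS to persist the SQLite index on disk
BIDSLayout('$BIDS_DIR', database_path='$BIDS_DB')
"

# Then each per-subject job reuses the prebuilt index instead of
# re-indexing all 655 subjects:
singularity run --cleanenv fmriprep.sif \
    "$BIDS_DIR" /data/hcp_dev/derivatives/fmriprep participant \
    --participant-label "$SUBJECT" \
    --bids-database-dir "$BIDS_DB" \
    -w /scratch/${USER}/fmriprep_work/"$SUBJECT"
```

Would that be a sensible direction, or is there a better-established way to handle this?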
Thanks, and happy to provide more info as needed!
Steven