Creating a lot of subdatasets in parallel


I am trying to dataladify our cohort study data, adhering to the YODA principles. For that I want to add a subdataset per subject, both in the dataset containing the BIDSified raw data and in each output/derivative dataset (e.g. for qsiprep). Since this takes a really long time, doing it sequentially isn't feasible. Is there a way to add a large number of subdatasets in parallel? When running `datalad create -d^.` via GNU parallel, it errors and some subdatasets aren't registered in the parent dataset.

Grateful for any input.

Yes, parallel write access to a git repository's history is tricky. A suitable approach is to not create subdatasets right away, but merely new (disconnected) datasets inside the main superdataset. This can be done in parallel. They are then all registered by a single, final `datalad save` call (no recursion!) in the superdataset. This step will still be slow-ish, but only needs to be done once. We used this approach to build a superdataset for a BIDSified UKB dataset analogous to your setup; that one had 42k subdatasets.
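A minimal sketch of that two-step workflow, assuming GNU parallel is available and subject IDs like `sub-001` (the ID scheme and counts here are placeholders; adapt them to your cohort):

```shell
# Run from the root of the superdataset.
cd /path/to/superdataset  # placeholder path

# Step 1: create plain, *disconnected* datasets in parallel.
# Note: no -d option, so nothing touches the superdataset's
# git history here, and the runs cannot conflict.
printf 'sub-%03d\n' $(seq 1 100) | parallel 'datalad create {}'

# Step 2: register all new datasets as subdatasets with a single,
# non-recursive save in the superdataset.
datalad save -m "Register subject subdatasets"
```

The key point is that only step 2 writes to the superdataset's history, and it does so exactly once, so there is no concurrent write access to a single git repository.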


Hi @eknahm ,

thanks a lot for your help. Creating the independent subject datasets in parallel and then saving the superdataset worked nicely.