I am looking for ways to properly restructure a nested BIDS dataset and its derivatives for sharing, following the YODA principles (which conflict with the BIDS derivatives layout). There are not many existing examples on OpenNeuro or in the DataLad repositories that follow YODA, so I am looking to brainstorm the best structure.
Our raw BIDS super-dataset is a bit particular: a few participants scanned across different BIDS sub-datasets (DataLad datasets in bold):
**super-dataset**
├── **anat_ds**
├── **func_ds1**
├── **func_ds2**
├── ...
New functional datasets and new anatomical sessions will be added in yearly releases.
But my question applies equally to a more standard, single raw BIDS DataLad dataset.
Let's say I want to distribute (f|s)MRIPrep+DWI+QSM+… derivatives. Following BIDS, I would output these in a sub-dataset at (func|anat)_ds/derivatives/(fmriprep|freesurfer|qsm_prep|dwiprep|…).
But YODA now comes! Derivatives as a subdataset of the raw data is not compatible (and if we also set the raw BIDS dataset as sourcedata of the derivatives, we create a "cycle" in the dependency graph); it has to be the reverse: raw as a subdataset of the derivatives.
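A minimal sketch of this "reverse" layout with DataLad (the dataset name is illustrative, and `<raw-url>` is a placeholder for wherever the raw super-dataset lives):

```shell
# Create the derivatives dataset as the top-level (YODA-style) dataset
datalad create my_diff_pipeline-dataset
cd my_diff_pipeline-dataset

# Register the raw BIDS super-dataset as a subdataset under sourcedata/,
# so provenance points from derivatives down to raw, not the other way
datalad clone -d . <raw-url> sourcedata
```

The pipeline outputs then live next to `sourcedata/` inside the derivatives dataset.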
I also think it is better and easier if the users install only a super-dataset and then get the sub-datasets they need, rather than separately installing independent datasets, but that might not be achievable.
Then which organization would make the most sense:
- one super-dataset per derivative type:
**fmriprep-super-dataset**
├── sourcedata -> **super-dataset**
├── fmriprep/
├── **freesurfer/**

**my_diff_pipeline-dataset**
├── sourcedata -> **super-dataset**
├── my_diff_pipeline
And then maybe a super-dataset to aggregate these and ease crawling of the data (though with multiple duplicates of the raw dataset in the tree leaves).
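Such an aggregation super-dataset could be a thin wrapper that just registers each derivatives dataset as a subdataset (names and URLs below are placeholders):

```shell
# Plain aggregator: no data of its own, only subdataset registrations
datalad create derivatives-super-dataset
cd derivatives-super-dataset
datalad clone -d . <fmriprep-super-url> fmriprep-super-dataset
datalad clone -d . <diff-pipeline-url> my_diff_pipeline-dataset
```

Each leaf then carries its own `sourcedata/` copy of the raw super-dataset, which is where the duplication in the tree comes from.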
- an all-in-one preprocessed super-dataset:
**preprocessed-super-dataset**
├── sourcedata -> **super-dataset**
├── **fmriprep**
├── **freesurfer**
├── **my_diff_pipeline**
├── **my_qsm_pipeline**
Though in that case, the execution of the pipeline with `datalad run` has to be recorded in the top-level dataset to keep track of the commit of the raw data used as input.
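For instance, a `datalad run` call issued from the top-level preprocessed dataset would record the provenance (including the `sourcedata/` subdataset commit) in its history; the command and paths below are illustrative, not prescriptive:

```shell
# Run from the root of preprocessed-super-dataset so the run record,
# with the pinned sourcedata/ commit, lands in the top-level history
datalad run \
  -m "fMRIPrep on sub-01" \
  --input "sourcedata/func_ds1/sub-01" \
  --output "fmriprep/sub-01" \
  "fmriprep sourcedata/func_ds1 fmriprep participant --participant-label 01"
```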
Or do you have suggestions for another structure?
Then this organizational problem also arises if we want to provide higher-level derivatives, for instance beta maps generated by FitLins, parcellation-based time series, embeddings, …
We can create another super-dataset, with each sub-dataset setting the preprocessed super-dataset (or one of its subdatasets) as sourcedata, but it's getting really complex.
Looking for guidance and ideas!