I am learning to use DataLad, so first I would like to thank the developers for this amazing tool and the fantastic DataLad Handbook that helped me a lot to get started.
After playing around with DataLad for the last couple of days, I am still a bit confused about the ideal data organization for a project.
Following the YODA principles would mean having one /code directory plus /inputs and /outputs directories. But at the same time, directories should be kept modular so that they can be reused across projects.
Here’s an example:
Let’s say I collected MRI data. I create a DataLad dataset with the DICOMs to archive them.
Now I want to convert the data to BIDS: I create a new DataLad dataset (bids) and add the DICOM dataset as a subdataset. Following YODA, let’s clone it into a directory called /input.
Now I add code to the /code directory that converts the DICOMs inside /input to BIDS. Following YODA, the converted BIDS data would go into the /output directory.
Now the trouble starts: I want to run an analysis on the BIDS dataset. Therefore, I create a new DataLad dataset (let’s call it analysis) and again add the BIDS dataset as a subdataset called /input. But now, from the analysis root, the BIDS data (my actual input data) sit inside the input directory and then inside an output directory (input/output/). Confusing!
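In commands, the steps I have in mind look roughly like this (the dataset names and the code/convert.sh script are just placeholders I made up):

```sh
# create a dataset that archives the raw DICOMs
datalad create dicoms
# ... copy the DICOM files in and `datalad save` them ...

# create the conversion dataset and register the DICOMs as its input
datalad create bids
cd bids
datalad clone -d . ../dicoms input
# with a conversion script saved under code/, run it with provenance capture
datalad run -m "convert DICOMs to BIDS" --input input --output output "bash code/convert.sh"
cd ..

# create the analysis dataset and register the BIDS dataset as its input
datalad create analysis
cd analysis
datalad clone -d . ../bids input
# -> my actual input files now sit under input/output/, which is the confusing part
```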
An alternative would be to make the BIDS output an independent dataset, but then it would lose the provenance of how it was converted from the DICOMs.
All of this gets even more complicated if I want to keep one central code repository that contains all the code for the entire study and is cloned at different stages into various other datasets. Or should each dataset only contain the code that is relevant for manipulating the files in that dataset, which would mean that the code is distributed across many different datasets?
I hope the example is clear. To briefly summarize, my question is how to avoid getting lost in all the nested datasets while still maintaining the provenance of how each dataset came into existence.
Please excuse me if I am missing or overlooking something obvious; I am still figuring out a good workflow with DataLad, especially since I am trying to integrate it into an existing project structure.
@yarikoptic @adina could you help me out? Thank you very much for your help!
You point to an important issue regarding modularization. I strongly recommend not having dedicated “output” directories, for exactly the reason you give. Sometimes dedicated output directories are unavoidable, though. For example, fMRIPrep forces you to use a single output directory, although it produces two completely disjoint output structures in it (a freesurfer directory and the actual fmriprep outputs). In such cases, I’d recommend placing dedicated output subdatasets at those locations. This yields two datasets that are more flexibly reusable for further analyses. You are correctly pointing out that such an approach is detrimental to the comprehensiveness of the provenance capture. It can be maintained at a somewhat acceptable level by using a third output superdataset that tracks all output datasets. Here is a sketch of the layout:
/                (output superdataset)
/freesurfer      (subdataset for FreeSurfer output)
/fmriprep        (subdataset for fMRIPrep output)
/sourcedata      (BIDS input dataset)
/code/pipelines  (toolbox dataset with a `containers-add`ed fMRIPrep container)
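Assembling such a layout could look roughly like this (URLs and dataset names are placeholders, adjust to your setup):

```sh
# output superdataset that tracks everything
datalad create outputs
cd outputs
# dedicated output subdatasets
datalad create -d . freesurfer
datalad create -d . fmriprep
# register the BIDS input and the pipeline/toolbox dataset
datalad clone -d . <URL-of-BIDS-dataset> sourcedata
datalad clone -d . <URL-of-toolbox-dataset> code/pipelines
```

The processing itself could then be captured with `datalad containers-run`, pointing --input at sourcedata/ and --output at fmriprep/ and freesurfer/.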
Note that the input/ and output/ directories recommended by YODA are generic and just a recommendation. BIDS specifies sourcedata/ (instead of input/) as the expected location for the source data used to produce a BIDS (derivative included) dataset, so effectively . (the dataset root) is your output/.
Yes, it is a bit mind-bending, like Inception or even Predestination. But that is how it is: to get a clear YODA-style provenance record, you need to incorporate sourcedata/ with its git commit as input. Thanks to git, git-annex, and DataLad, you can actually make it quite efficient. See the --reckless option, which allows you to efficiently reuse the git-annex’ed content from the “input” dataset upstairs: you install it only for the duration of the processing and uninstall it whenever you are done with your “derived” dataset. All provenance is recorded, and you can later install THAT SPECIFIC VERSION of the input (sourcedata/) dataset if needed.
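A rough sketch of that pattern (names are placeholders; this assumes a DataLad version where clone accepts --reckless ephemeral):

```sh
cd analysis                         # the "derived" dataset
# install the input only for the duration of the processing,
# reusing the annexed content of the local BIDS dataset
datalad clone -d . --reckless ephemeral ../bids sourcedata
datalad run -m "run analysis" --input sourcedata --output results "bash code/analysis.sh"
# done with the input: uninstall it again (provenance, i.e. URL and commit, stays recorded)
datalad uninstall sourcedata
# if needed later, that exact version can be re-obtained, e.g. with: datalad get -n sourcedata
```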
Your explanations make a lot of sense to me and clarify some of the confusion.
I am still a bit unsure about where to place /code. Let’s say I have one repo with code (e.g., my heudiconv heuristic, code to run fMRIPrep, some GLM analysis code, etc.). Then I would add this code repo as a subdataset to (a) the dataset in which I’m creating the BIDS directory structure, (b) the fMRIPrep dataset (which will then also contain the BIDS dataset), and (c) the dataset in which I am running the analysis (which might also include the BIDS and fMRIPrep (sub)datasets), and so on. Or would you recommend that in each of these datasets the /code directory only contains the code that runs on the data in that dataset (e.g., the fMRIPrep dataset only contains the code to run fMRIPrep)?
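For concreteness, the first option would mean registering the same central code repository in each of these datasets, along the lines of (URL is just a placeholder):

```sh
cd fmriprep_dataset     # or the BIDS or analysis dataset, respectively
datalad clone -d . https://example.com/me/study-code.git code
```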
I think it’s amazing how DataLad makes this nesting work so seamlessly, but yes, I find it a bit mind-bending indeed. Thanks btw, @yarikoptic, for the movie recommendations (I haven’t seen Predestination yet!). I’ll also look into --reckless, thanks for the hint!