Sharing (nested) BIDS raw + derivatives in a datalad YODA way

I am looking for a way to properly restructure a nested BIDS dataset and its derivatives for sharing, following the YODA principles (which are not compatible with BIDS derivatives). There are not really any existing cases on OpenNeuro or in the DataLad repositories that follow YODA, so I am looking to brainstorm the best structure.
Our raw BIDS super-dataset is a bit particular: a few participants scanned across different BIDS sub-datasets (DataLad datasets in bold):

**super-dataset**
├── **anat_ds**
├── **func_ds1**
├── **func_ds2**
├── ...

New functional datasets and new anatomical sessions will be added in yearly releases.
But my question is the same for a more standard, single raw-BIDS DataLad dataset.

Let's say I want to distribute (f|s)MRIPrep+DWI+QSM+… derivatives. Following BIDS, I would output these into sub-datasets at (func|anat)_ds/derivatives/(fmriprep|freesurfer|qsm_prep|dwiprep|…).

But then YODA comes in! Derivatives as a subdataset of the raw data is not compatible (and if we also set the raw BIDS as sourcedata of the derivatives, we create a cycle in the dependency graph); it has to be the reverse: raw as a subdataset of the derivatives.
I also think it is better and easier if users install only a super-dataset and then get the sub-datasets they need, rather than separately installing independent datasets, but that might not be achievable.
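Concretely, the YODA-style inversion in DataLad terms would look something like this (placeholder dataset names; a local stand-in raw dataset is created so the commands actually run, and everything is skipped when DataLad or git-annex is not installed):

```shell
# Sketch with placeholder names: the derivative dataset is created
# first, and the raw data is cloned *into* it as a sourcedata/
# subdataset, which pins the exact raw commit in the derivative's history.
set -e
tmp=$(mktemp -d); cd "$tmp"
status=skipped
if command -v datalad >/dev/null 2>&1 && command -v git-annex >/dev/null 2>&1; then
    datalad create raw-super >/dev/null 2>&1          # stand-in for the raw BIDS super-dataset
    datalad create my_diff_pipeline >/dev/null 2>&1   # the derivative dataset owns the dependency
    datalad clone -d my_diff_pipeline "$tmp/raw-super" \
        my_diff_pipeline/sourcedata >/dev/null 2>&1
    status=done
fi
echo "$status"
```

This is the reverse of nesting derivatives under the raw dataset: the dependency arrow points from derivative to raw, so no cycle appears.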

Then which organization would make the most sense?

  1. a super-dataset per derivative type:

**fmriprep-super-dataset**
├── sourcedata -> **super-dataset**
├── fmriprep/
├── **freesurfer/**

**my_diff_pipeline-dataset**
├── sourcedata -> **super-dataset**
├── my_diff_pipeline

And then maybe a super-dataset to aggregate these to ease crawling of the data (though with multiple duplicates of the raw dataset in the tree leaves).

  2. an all-preprocessed super-dataset:

**preprocessed-super-dataset**
├── sourcedata -> **super-dataset**
├── **fmriprep**
├── **freesurfer**
├── **my_diff_pipeline**
├── **my_qsm_pipeline**

Though in that case the execution of the pipeline with datalad run has to be recorded in the top-dataset to keep the commit of the raw data used as input.
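To make that provenance concrete, here is a toy sketch (hypothetical names, a trivial `cp` standing in for the real pipeline) of recording a run at the top level with `datalad run`, so the commit of the raw `sourcedata/` subdataset is captured together with the outputs; it only executes when DataLad and git-annex are installed:

```shell
# Toy sketch: run the "pipeline" from the top-level derivative dataset,
# so the run record pins the exact commit of the raw input subdataset.
set -e
tmp=$(mktemp -d); cd "$tmp"
status=skipped
if command -v datalad >/dev/null 2>&1 && command -v git-annex >/dev/null 2>&1; then
    datalad create raw >/dev/null 2>&1                 # stand-in raw dataset
    echo "T1w" > raw/anat.txt
    datalad save -d raw -m "add raw data" >/dev/null 2>&1
    datalad create deriv >/dev/null 2>&1               # top-level derivative dataset
    datalad clone -d deriv "$tmp/raw" deriv/sourcedata >/dev/null 2>&1
    cd deriv
    # -i fetches the declared input content; the run commit records the
    # input/output state, including the sourcedata/ subdataset commit
    datalad run -m "toy pipeline" \
        -i sourcedata/anat.txt -o out.txt \
        "cp sourcedata/anat.txt out.txt" >/dev/null 2>&1
    status=done
fi
echo "$status"
```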

Or do you have suggestions for another structure?

Then this organizational problem also arises if we want to provide higher-level derivatives, for instance beta maps generated by FitLins, or parcellation-based or embedding time series, …
We could create another super-dataset, with each sub-dataset setting the preprocessed super-dataset (or one of its subdatasets) as sourcedata, but it’s getting really complex.

Looking for guidance and ideas!
Thanks!

I’m not sure if the below will feel like more manageable complexity, but it boils down to two rules:

  1. Release all datasets at a shallow level for user-friendliness.
  2. Each dataset should still have its deep YODA chain of sources.

You could do a hybrid of options 1 and 2. This is how I would think of it:

processed//
    raw-bids//           # latest version of raw data
        ...
    freesurfer//         # FreeSurfer derivatives, possibly YODA-fied
        ...
    fmriprep//           -> YODA-fied fMRIPrep
        sourcedata/       # Actual directory since we have multiple inputs
            raw-bids//
            freesurfer//  # fMRIPrep default if you use --output-layout bids
        code//           -> Maybe you've got a dataset of versioned images
            fmriprep-20.2.1.simg
        dataset_description.json
        ...
    my_diff_pipeline//
        sourcedata/
            raw-bids//
        dataset_description.json
        ...
    fitlins//
        sourcedata/
            fmriprep//
        dataset_description.json
        ...

This gives you a relatively flat and understandable structure for data consumers, such that they can find the raw data at raw-bids/ instead of following a chain like fitlins/sourcedata/fmriprep/sourcedata/raw-bids/. But each dataset would still be fully YODA-compliant and record the state of its inputs at the time of the run.
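A minimal sketch of that consumer flow (placeholder names; local stand-in datasets are created so the commands actually run, and everything is skipped when DataLad or git-annex is missing):

```shell
# Sketch: a consumer clones only the flat super-dataset, then installs
# just the subdataset they need, without downloading any file content.
set -e
tmp=$(mktemp -d); cd "$tmp"
status=skipped
if command -v datalad >/dev/null 2>&1 && command -v git-annex >/dev/null 2>&1; then
    # publisher side: flat super-dataset with derivative subdatasets
    datalad create processed >/dev/null 2>&1
    datalad create -d processed processed/fmriprep >/dev/null 2>&1
    datalad create -d processed processed/freesurfer >/dev/null 2>&1
    # consumer side: lightweight clone of the super-dataset only ...
    datalad clone "$tmp/processed" consumer >/dev/null 2>&1
    # ... then install the one subdataset of interest (-n: no file content)
    datalad get -n consumer/fmriprep >/dev/null 2>&1
    status=done
fi
echo "$status"
```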

As a user, I could choose to include the entire dataset as the sourcedata/ for my analysis:

my_analysis/
    sourcedata/
        cneuromod//           -> Your big dataset 
    code/
        my_analysis_script.sh 
    figures/

Or I could select just one subdataset and use the full hierarchy:

my_analysis/
    sourcedata/
        cneuromod-fitlins//
    code/
        my_analysis_script.sh 
    figures/

One thing that’s a bit fiddly is that it’s possible to update raw-bids// without updating all of the derivatives, when you probably want to ratchet them all together. I might take this one step further and have a release dataset:

release//
    raw-bids//   # Release 1
    processed//  # Empty

When derivatives are ready, bump to:

release//
    raw-bids//  # Release 1
    processed//
       raw-bids//                      # Release 1
       fmriprep//sourcedata/raw-bids// # Release 1
       dmriprep//sourcedata/raw-bids// # Release 1

Now when you curate the second release of the raw data, you bump only the base raw-bids:

release//
    raw-bids//  # Release 2
    processed//
       raw-bids//                      # Release 1
       fmriprep//sourcedata/raw-bids// # Release 1
       dmriprep//sourcedata/raw-bids// # Release 1

And only bump the processed//**/raw-bids// when all derivatives are ready.
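Under the hood, DataLad subdatasets are plain git submodules, so the ratchet is just which commit each gitlink records; a git-only sketch (hypothetical names) of bumping only the top-level raw-bids pin:

```shell
# Sketch with git only: the release dataset pins raw-bids at a commit
# (a submodule gitlink); bumping a release just moves that pin.
set -e
tmp=$(mktemp -d); cd "$tmp"
G="git -c user.email=demo@example.com -c user.name=demo -c protocol.file.allow=always"

$G init -q raw-bids
$G -C raw-bids commit -q --allow-empty -m "Release 1"
r1=$($G -C raw-bids rev-parse HEAD)

$G init -q release
$G -C release submodule --quiet add "$tmp/raw-bids" raw-bids
$G -C release commit -q -m "pin raw-bids at Release 1"

# curate Release 2 in the raw dataset
$G -C raw-bids commit -q --allow-empty -m "Release 2"
r2=$($G -C raw-bids rev-parse HEAD)

# bump only the top-level pin; any derivative subdatasets keep their
# own sourcedata/raw-bids pins until they are re-run
$G -C release/raw-bids fetch -q origin
$G -C release/raw-bids checkout -q "$r2"
$G -C release add raw-bids
$G -C release commit -q -m "bump raw-bids to Release 2"

pinned=$($G -C release rev-parse HEAD:raw-bids)
echo "raw-bids pinned at: $pinned"
```

`datalad update` and `datalad save` would perform the equivalent pin movement without dropping to raw git.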

Thanks, Chris, for the super-detailed answer!
What you suggest completely makes sense and is way clearer than what I had in mind.

A number of the derivatives would only need to source the anat subdataset (for sMRIPrep, QSM, or DWI), which would simplify the tree.

In CNeuroMod, as there are separate anatomical and functional subdatasets and the latter are fMRIPrepped with --anat-derivatives, it might be cleaner to have:

processed//
    raw-bids//           # latest version of raw data
        ...
    freesurfer//         # FreeSurfer derivatives, possibly YODA-fied
        ...
    smriprep//           -> YODA-fied sMRIPrep
        sourcedata/       # Actual directory since we have multiple inputs
            anat//
            freesurfer//
        code//
            fmriprep-20.2.1.simg

    fmriprep//           -> YODA-fied fMRIPrep
        func-ds1//
            sourcedata/   # Actual directory since we have multiple inputs
                func-ds1//
                smriprep//
            code//
                fmriprep-20.2.1.simg
            dataset_description.json
        func-ds2//
            sourcedata/   # Actual directory since we have multiple inputs
                func-ds2//
                smriprep//
            code//
                fmriprep-20.2.1.simg
            dataset_description.json

For the release mechanism I was thinking about gitflow-like branches, and then tags for the end users, so that users can update whenever they want.
The raw functional subdatasets, once their acquisition is complete, would not change much: maybe a few bug fixes, but these would likely be merged into the relevant releases, which would be tagged with a minor version.
Well, this is not completely true: the physiological data would be integrated into the raw BIDS in the next release once ready, and we might want to enrich the event files of naturalistic stimuli with automatic or manual annotations (though these could be considered derivatives and live in a separate sub-dataset).
The derivatives could be updated with newer software versions in newer releases, and would likely integrate bug fixes as well.

A thought: technically sMRIPrep is not really YODA-compliant, as sMRIPrep provides an ANTs-based mask to FreeSurfer during the recon-all pipeline.
So there is a cycle in the provenance graph at the level of the sMRIPrep and FreeSurfer pipelines.

Yes, the provenance of FreeSurfer/sMRIPrep is non-trivial. If you’re not calculating FreeSurfer separately and passing it to sMRIPrep, I would probably not expose it as a first-level object in your processed// super-dataset, but instead as part of the sMRIPrep dataset.

smriprep//
    sourcedata/
        raw-bids//
        freesurfer/

The slight speed bump should help remind people of the provenance.

Allow me to follow up with a question here, as it is pertinent to @bpinsard's question.

I started fMRIPrep processing yesterday on a 250-subject dataset with multiple timepoints. The latest Docker release was 20.2.0, so I went with that, but I see today that the latest should be 20.2.1, which adds the --output-layout flag. I am using DataLad and keeping studies completely separate. I am using this structure so far (D = DataLad subdataset):

STUDY1 (D)
  code
  README.md
  CHANGELOG.md
  data
    bids (D)                # master bids raw data folder
    processed (D)
      fmriprep-20.2.x (D)
        code                # code used to run fmriprep
        container-images    # singularity images of fmriprep (converted from docker image)
        bids                # copy of the master raw data, cleaned up from non-functional sessions, saved
        fmriprep (D)        # standard fmriprep output, subdataset of fmriprep-20.2.x
          sub-*             # sub-* folders, subdataset of fmriprep
        freesurfer (D)      # freesurfer output of fmriprep, subdataset of fmriprep-20.2.x
          fsaverage (D)     # subdataset of fmriprep-20.2.x
          fsaverage5 (D)    # subdataset of fmriprep-20.2.x
          sub-* (D)         # subdatasets of fmriprep-20.2.x
        logs                # custom logs to save cluster job outputs
        README.md
      freesurfer (D)        # main processed freesurfer derivative
        sub-* (D)           # subdatasets of freesurfer 
      templates (D)         # templates created from the dataset, templateflow format
        etc, etc.

The idea of nesting bids in fmriprep is to know which version of the raw data was processed. With DataLad, we do not need to keep a physical copy of bids; once I finish fMRIPrep processing, I will drop all the file content and leave only the git structure, which can tell me at which commit the bids folder was when fMRIPrep was run.
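A toy sketch of that drop-but-keep-provenance workflow (placeholder files standing in for the real data; skipped when DataLad or git-annex is not installed):

```shell
# Sketch: drop the annexed file content of the input copy after
# processing; git still records exactly which commit of bids/ was used.
set -e
tmp=$(mktemp -d); cd "$tmp"
status=skipped
if command -v datalad >/dev/null 2>&1 && command -v git-annex >/dev/null 2>&1; then
    datalad create bids >/dev/null 2>&1                 # stand-in raw dataset
    dd if=/dev/zero of=bids/bold.nii bs=1024 count=4 2>/dev/null
    datalad save -d bids -m "raw data" >/dev/null 2>&1
    datalad create study >/dev/null 2>&1
    datalad clone -d study "$tmp/bids" study/bids >/dev/null 2>&1
    datalad get study/bids/bold.nii >/dev/null 2>&1     # content present for processing
    datalad drop study/bids/bold.nii >/dev/null 2>&1    # content gone, history intact
    status=done
fi
echo "$status"
```

After the drop, `git -C study rev-parse HEAD:bids` still reports the exact commit of the raw subdataset that was processed, and the content can be re-fetched from its origin at any time.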

I do not fully understand what --output-layout is going to do differently; perhaps my YODA knowledge is not deep enough. We do not use sourcedata anywhere, but we certainly want to comply with the data organization principles of the community, if they fit our needs.

To release datasets, I simply make a copy (or datalad install) of the portion of the tree I need. Most often we distribute raw bids data, and that’s it. I usually tag the datalad versions that are distributed so we have an idea what files to keep if we decide to drop file content from versions that were never distributed.

FYI, the above structure was created initially for fMRIPrep 1.5.x; since not much has changed in v20, I thought to keep things the same.

Does this make sense? Any tips on what can be done better or differently?

And what about fMRIPrep v20.2.1 on Docker Hub, any pointer as to why it is not there? I am trying to understand whether it’s worth stopping the processing to get the latest release.

Last question: is there any major imminent release? I am a bit surprised that there have been no more releases since November. Releases used to be very frequent for v1.5.x.

Thank you.
Dorian

Hi Dorian,

Right now, fMRIPrep outputs:

<output_dir>/
    fmriprep/
        <fmriprep contents>
    freesurfer/

With --output-layout bids, it becomes:

<output_dir>/
    sourcedata/
        freesurfer/
    <fmriprep contents>

The reasoning is discussed in more depth in the PR that added the flag. The specific goal is to produce output datasets that are BIDS-Derivatives compliant with minimal post-processing by users.

That all sounds reasonable to me. If the structure works for you, keep it, but in v21+, --output-layout bids will become the default.

The Docker image is now nipreps/fmriprep, not poldracklab/fmriprep, as we shift to community management. To be clear, fMRIPrep has had significant community involvement for years, and this is a recognition of that (and an administrative matter around GitHub permissions), not an abandonment of the project by the Poldrack lab.

Right now development is slow for a few reasons. The first is that the major push we made for 20.2 LTS seems to have paid off in terms of a significant drop in bug reports, and most features that we were pushing at the time are now there. The second is that our paid developer time has been redirected to other projects, such as fMRIPrep variants for infants and rodents. Additional effort is going into improving SDC. Most of these should result in improvements for fMRIPrep, but it’s all a bit below the surface right now.

Many thanks @effigies. A follow-up question: is that sourcedata folder going to keep an entire copy of the BIDS dataset that was used as input for fMRIPrep? I am trying to save space using DataLad and its ability to track data versions without keeping the file content locally, rather than having a full copy of the raw BIDS data permanently in the fmriprep folder.

Dorian

@dorianps fMRIPrep won’t place the original dataset in sourcedata/; that’s a choice I make as a user, to include the raw dataset as a subdataset at sourcedata/raw-bids. It can definitely be dropped once done.

And you’re under no obligation to follow that convention. Even with --output-layout bids, you can still get your current organization with:

fmriprep STUDY1/data/bids \
         STUDY1/data/processed/fmriprep-20.2.x/fmriprep \
         participant \
         --output-layout bids \
         --fs-subjects-dir STUDY1/data/processed/fmriprep-20.2.x/freesurfer \
         ...