Datalad branches vs. subdatasets for experiments

Hi folks, new Datalad user here. I’d love to get a sense of how best to organize Datalad projects for our team.

Our setup is exceedingly common. We’ve got data, we’re building machine learning models on it, and we don’t know exactly where we’re going with it. We’re doing a lot of exploratory work, and trying lots of different kinds of models.

From the resources I've found online, I see several suggested alternatives for tracking all of the wild ideas we're going to implement.

Approach 1: Just use the basic sequential commit history. If you need to get back to an old experiment, check out that version. This works, but it's a little clunky because there isn't necessarily anything sequential about some of our work. I might have three students each working on their own variation of the problem to see which one sticks, so the commits would end up in an essentially arbitrary order.
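To make that concrete, here's roughly the workflow I have in mind; the `train.py` script and the commit hash are just placeholders:

```
# record each experiment as the next commit in a linear history
datalad run -m "experiment: random forest baseline" "python train.py"

# later, find an earlier experiment and check it out for inspection
git log --oneline
git checkout <commit-of-old-experiment>
```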

Approach 2: Use a new branch for each alternative approach to solving the problem, as recommended in the Handbook use case for reproducible ML analyses. This allows you to easily compare results across branches, and it solves the ordering problem from Approach 1, though you end up with lots of unmerged branches that never get folded back together. That's OK, but for someone used to Git from a software development perspective, it's a different mindset.
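For reference, here's a minimal sketch of how I picture that working (the branch names, script, and flags are all made up):

```
# one branch per alternative approach, all rooted at the same starting point
git checkout -b experiment/random-forest main
datalad run -m "train random forest" "python train.py --model rf"

git checkout -b experiment/neural-net main
datalad run -m "train neural net" "python train.py --model nn"

# compare the recorded results across branches without merging anything
datalad diff --from experiment/random-forest --to experiment/neural-net
```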

Approach 3: As described in the FAIRly big framework, use a new subdataset for each experiment. My understanding from the paper is that this helps automate running multiple portions of the experiment in parallel and merging the results back together, and the subdataset then lives on. This is loosely similar in concept to Approach 2, but uses subdatasets instead of branches. Again, speaking as someone who has used Git for software development, this seems strange: a subdataset is another repo (well, a submodule), and one wouldn't normally create a collection of modules for experimental branches of code. I can see how this might be better for data provenance, but it does mean more work in teaching students how to create the datasets, manage them, potentially publish them to GitHub, and so on. Branches come for free, in some sense, once you've got the dataset created.
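If I'm reading the paper right, the mechanics would be roughly this (paths and names invented by me):

```
# register a new subdataset for each experiment inside the superdataset
datalad create -d . experiments/random-forest
datalad create -d . experiments/neural-net

# work happens inside each subdataset...
cd experiments/random-forest
datalad run -m "train random forest" "python train.py"
cd ../..

# ...and the superdataset records which version of each subdataset was used
datalad save -m "update experiment subdataset pointers"
```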

To put this all together, my question seems like it must be an FAQ. We've got lots of variations on our analysis that we're trying: what are considered best practices for managing those in Datalad? Branches, subdatasets, or something else?

Thanks for the help, much appreciated.