datalad rerun thinks commit was run from a different dataset

timothy · April 5, 2021, 3:07pm

Using datalad made my analysis take much longer [1], but now I have a wonderful git history to show my PI, which captures all the provenance needed to reproduce any command and its modifications to data. To demonstrate its value to my PI, I’m trying to show how easy it is to reproduce parts of the analysis with datalad rerun, but datalad refuses, saying I ran the command from a different dataset. The structure of my data is roughly

.
├── results
│   ├── sub-001
│   │   ├── t1.nii.gz
│   ├── sub-002
│   │   ├── t1.nii.gz
│   ├── sub-n
│   │   ├── t1.nii.gz
[...]

where . is my top-level superdataset, results is a subdataset, and each subject is a subdataset of the results subdataset. I want to reproduce the results for the first subject, so I found the commit hash (1a7f721) generated by datalad run when the data were first processed. First I tried simply running datalad rerun --branch=verify-$(date +"%Y-%m-%d")-${sub} 1a7f721 from my top-level superdataset, but that failed with the error fatal: bad revision '1a7f721'. I realized the commit I’m referencing is two subdatasets down, even though the datalad run command was initially executed from the root of the project (top-level superdataset). So I tried cd’ing into the sub-subdataset and running

sub=001
cd results/sub-${sub}
datalad rerun --branch=verify-$(date +"%Y-%m-%d")-${sub} 1a7f721

which resulted in the error

[INFO ] checkout commit 1a7f721;
[INFO ] skip-or-pick commit 1a7f721; (singularity exec …) 1a7f721 was ran from a different dataset; skipping or cherry picking
run(impossible): /scratch/8275053/ds/results/sub-001 (dataset) [1a7f721 was ran from a different dataset; skipping]

When I git show 1a7f721, it prints the JSON for the command, including the dsid, which matches the id in .datalad/config of the top-level dataset. So I know the command was run from the top-level dataset, but I can only run datalad rerun from the subject’s sub-subdataset because that’s where the commit belongs. How can I tell datalad that the root of the project is the dataset 1a7f721 was run from, but that 1a7f721 itself belongs to a sub-subdataset?

[1] Using --explicit really helps, but I’ve run into performance limitations with a surprisingly small number of files and git submodules per repo so things as simple as git status take days to run.

kyleam · April 5, 2021, 4:37pm

If I understand the scenario correctly, it should be captured by this minimal example:

#!/bin/sh

set -eu

cd "$(mktemp -d "${TMPDIR:-/tmp}"/dl-XXXXXXX)"
datalad create
datalad create -d. sub
datalad run "echo foo >>sub/foo"
commit=$(git -C sub rev-parse @)
git commit -m'non-run commit' --allow-empty

datalad rerun "$commit" || :          # 1 | fails: bad revision
datalad -C sub rerun "$commit" || :   # 2 | fails: ran from different dataset
datalad rerun "$(git rev-parse @~)"   # 3 | works

As you described, the first case fails because the specified commit ID belongs to the subdataset. The second case fails because datalad rerun refuses to rerun from a different dataset than the one used for the initial datalad run.
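
The "different dataset" check in the second case can be reproduced by hand. The sketch below uses plain git only and compares the dsid stored in a run commit message with the id in .datalad/config. The helper names run_record_dsid and dataset_id are made up for illustration, the ID is a placeholder, and the trimmed run record is a stand-in for what git show prints (a real record carries more fields).

```shell
#!/bin/sh
set -eu

# Hypothetical helper: print the "dsid" value from the run record embedded
# in the message of commit $2 in repository $1.
run_record_dsid() {
    git -C "$1" log -1 --format=%B "$2" |
        sed -n 's/^ *"dsid": *"\([^"]*\)".*$/\1/p'
}

# Hypothetical helper: print the dataset ID stored in $1/.datalad/config.
dataset_id() {
    git config --file "$1/.datalad/config" datalad.dataset.id
}

# Demo in a throwaway repository (no DataLad needed).
demo=$(mktemp -d "${TMPDIR:-/tmp}"/dsid-XXXXXXX)
cd "$demo"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
git init -q .
mkdir .datalad
# Placeholder ID; a real dataset has a UUID here.
git config --file .datalad/config datalad.dataset.id 00000000-made-up-id

# Stand-in run commit with a trimmed, made-up run record.
git commit -q --allow-empty -F - <<'EOF'
[DATALAD RUNCMD] echo foo >>sub/foo

=== Do not change lines below ===
{
 "cmd": "echo foo >>sub/foo",
 "dsid": "00000000-made-up-id",
 "pwd": "."
}
^^^ Do not change lines above ^^^
EOF

recorded=$(run_record_dsid . HEAD)
current=$(dataset_id .)
if [ "$recorded" = "$current" ]; then
    echo "run record belongs to this dataset"
else
    echo "run record was made from a different dataset"
fi
```

Pointing both helpers at a subdataset instead (for example run_record_dsid sub "$commit" and dataset_id sub) should show the mismatch that makes the second invocation fail.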

The invocation that works uses the corresponding run commit ID from the dataset of the original datalad run call (i.e. the commit where the “Subproject commit” is updated).
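
If only the subdataset commit is known, the matching superdataset commit can be looked up with plain git. This is a minimal sketch, assuming the subdataset sits directly below the superdataset and that the run commit is exactly the state recorded in the superdataset; find_super_commit is a made-up helper name, not a DataLad command.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: print the superdataset commits whose recorded
# "Subproject commit" for subdataset path $1 equals subdataset commit $2.
# Run it from the superdataset root.
find_super_commit() {
    subpath=$1
    subcommit=$(git -C "$subpath" rev-parse --verify "$2^{commit}") || return 1
    # Only walk commits that changed the entry for $subpath.
    git log --format=%H -- "$subpath" | while read -r c; do
        # For a subdataset, ls-tree prints: 160000 commit <sha><TAB><path>
        recorded=$(git ls-tree "$c" -- "$subpath" | awk '{print $3}')
        if [ "$recorded" = "$subcommit" ]; then
            echo "$c"
        fi
    done
}

# Demo with plain git (no DataLad needed): a repository with one nested repo.
demo=$(mktemp -d "${TMPDIR:-/tmp}"/gitlink-XXXXXXX)
cd "$demo"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
git init -q .
git init -q sub
git -C sub commit -q --allow-empty -m 'stand-in for the run commit'
sub_commit=$(git -C sub rev-parse HEAD)
git add sub 2>/dev/null   # records sub's HEAD as a "Subproject commit" entry
git commit -q -m 'record sub state'
git commit -q --allow-empty -m 'non-run commit'

super_commit=$(find_super_commit sub "$sub_commit")
echo "$super_commit"
```

In the minimal example above, the printed ID would be the one to hand to datalad rerun from the top level. For a sub-subdataset the lookup would have to be repeated one level up, and it only helps if the commit it finds is itself a run commit.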

I’ve run into performance limitations with a surprisingly small number of files and git submodules per repo so things as simple as git status take days to run

Oy, running git status directly takes days? Is that just on one particular system? I’d recommend throwing a lot of effort at figuring out what’s going on there because in my view that’s pretty much an unusable state.

timothy · April 5, 2021, 5:42pm

I’m a bit confused because when I run your example the [DATALAD RUNCMD] commit appears in both the super- and subdataset, but for me the datalad run commit is only in the subdataset. I think this is because my workflow (following the datalad handbook’s advice) involves cloning to $TMPDIR, running processing there, then pushing (only) the results subdataset back to its non-temporary location, then merging all the commits at the end. So whereas your example produces the git history:

a9a96c9403b1c90213dda82505e9a32b74f92ca8 (HEAD -> master) non-run commit
fca50b92b4d8c06422fcf1c40c5b43e59ffbbd5a [DATALAD RUNCMD] echo foo >>sub/foo
54f188589d1628c0d1c0ccb4c5faa28411e7f842 [DATALAD] Recorded changes
336dc2b7cb0a3e7a7b9285f6d81830e1d7ff6e99 [DATALAD] new dataset

my git history is missing the DATALAD RUNCMD commits on the top level (they only exist within each subdataset), and instead the merge is represented in git history as a submodule-updating commit, e.g.

-Subproject commit 3cf19888defbd549d2c33e44edf14c266422d55d
+Subproject commit 7d58f935f6b7ee81db3687eeb71ceeed47acd4ef

When I ask datalad to rerun everything with datalad rerun --since=HEAD~74 --script - (there are 74 commits in my git history) it prints a few commands which I ran without the clone-to-$TMPDIR-then-push-results pattern, but none of the datalad run commands that followed the aforementioned pattern are saved (since those commits only exist in the subdatasets). Does this mean it’s impossible to use datalad rerun to reproduce my results?

The git status that takes so long happens in the subdataset that version controls the DICOMs, since there are so many small files. One of the biggest challenges of datalad for me has been to decide what level of tracking is actually necessary / useful w.r.t. reproducibility, because running absolutely everything through datalad has not proven feasible.

kyleam · April 5, 2021, 6:37pm

I think this is because my workflow (following the datalad handbook’s advice) involves cloning to $TMPDIR, running processing there, then pushing (only) the results subdataset back to its non-temporary location, then merging all the commits at the end

Yeah, that sounds like the spot where the top-level run commits are lost on your end.

Does this mean it’s impossible to use datalad rerun to reproduce my results?

Yes, to re-execute, datalad rerun requires that the top-level run commit exist. The details (including presumably a script like the one in the handbook link) are tracked, so you should have the information you need to re-execute, but it’s not something datalad rerun can handle.

That’s of course describing the current design and capabilities of datalad rerun. There might end up being a straightforward way to extend rerun to extract run commits from subdatasets as well, but from thinking about it briefly, my guess is that it’d be pretty tricky.

timothy · April 6, 2021, 9:34pm

Thanks kyleam, that helps a lot. I’m going to try adding each subdataset commit as an empty commit on the superdataset post-hoc, but if that doesn’t work I’m thinking I’ll just have to repeat my analysis from the start.
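
A rough sketch of that post-hoc idea with plain git: create an empty commit in the superdataset that reuses the full message (run record included) of a subdataset commit. The helper name copy_run_message, the paths, and the stand-in commit message are all made up for illustration, and whether datalad rerun accepts a record copied this way has not been verified here.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: create an empty commit in superdataset $1 that reuses
# the full message of commit $3 from subdataset $2.
copy_run_message() {
    git -C "$2" log -1 --format=%B "$3" |
        git -C "$1" commit -q --allow-empty -F -
}

# Demo with plain git (no DataLad needed).
demo=$(mktemp -d "${TMPDIR:-/tmp}"/posthoc-XXXXXXX)
cd "$demo"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
git init -q .
git init -q sub
# Stand-in for a run commit that exists only in the subdataset.
git -C sub commit -q --allow-empty -m '[DATALAD RUNCMD] stand-in command'
sub_commit=$(git -C sub rev-parse HEAD)

copy_run_message . sub "$sub_commit"
git log -1 --format=%s
```

The copied commit changes no files, so it only carries the record. Any re-execution from it would still depend on the inputs and subdataset states being available at that point in the superdataset history.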