Are the file versions of input and output files tracked when running "datalad run"?

jhpb7 · May 20, 2022, 8:17am

I hope it’s okay if I ask two questions right after each other.

In the datalad manual, when we learn about the inherent application of YODA principles, I stumbled across the following question, the answer to which I can’t seem to find either in the manual or in datalad itself.
Does datalad keep track of the versions of the input and output files? In other words, can I see which version of each input file was used to create the output files?

Again, thanks in advance
and keep up the great work!

ctr · May 21, 2022, 12:37pm

Hi,

novice DataLad user here

As I understand it, DataLad always keeps track of versions of files in a dataset using git/git-annex. Inputs and outputs specified in the datalad run command are no different.

So to recover the exact state of the inputs when datalad run was executed, you can use git commands to show the contents of the input file(s) at that particular commit:

# replace <commit-hash> with the commit ID of the '[DATALAD RUNCMD]' commit
$ git show <commit-hash>:<input-filename>
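If you do not have the commit ID at hand, the run commits can be searched for by their '[DATALAD RUNCMD]' marker. Below is a minimal sketch using plain git only; the helper name list_run_commits is made up for illustration, and the demo repository imitates a run commit by hand instead of calling datalad run, so it only shows the lookup, not DataLad itself.

```shell
# Hypothetical helper: list commits whose message contains the run marker.
# -F makes git treat the bracketed marker as a literal string, not a regex.
list_run_commits() {
    git -C "${1:-.}" log -F --grep='[DATALAD RUNCMD]' --format='%H %s'
}

# Throwaway demo repository (the "run" commit is imitated by hand).
demo=$(mktemp -d)
git -C "$demo" init -q
git -C "$demo" config user.email "demo@example.com"
git -C "$demo" config user.name "Demo"

echo "v1" > "$demo/input.txt"
git -C "$demo" add input.txt
git -C "$demo" commit -q -m "add input"

echo "result" > "$demo/output.txt"
git -C "$demo" add output.txt
git -C "$demo" commit -q -m "[DATALAD RUNCMD] make output"

# The input changes after the run ...
echo "v2" > "$demo/input.txt"
git -C "$demo" commit -q -am "edit input"

# ... but its state at the time of the run is still recoverable.
run_commit=$(list_run_commits "$demo" | cut -d' ' -f1)
input_at_run=$(git -C "$demo" show "${run_commit}:input.txt")
echo "$input_at_run"   # prints "v1"
```

Note that for files kept in git-annex, git show prints the annex pointer (symlink target) rather than the file content, but the pointer still identifies the exact version.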

The exception is if you run datalad run with the --explicit flag, which skips checks for unsaved changes (even for the inputs). In this case, if your inputs have uncommitted changes, datalad run will operate on the not-yet-saved version of the input files, which will not be tracked in the resulting commit. Then you will not be able to determine the state of the inputs at the time of the run.

Hope this helps!

jhpb7 · May 23, 2022, 9:06am

Hello,

thanks for your reply! That was quite helpful.
However, when I tried your answer, a problem arose that you couldn’t have foreseen: the datalad run command I use takes its input from subdatasets and saves the output to another subdataset. But subdatasets have their own commit hashes.
So I tried to use the commit hash of the superdataset inside the subdataset:

$ git show <commit-hash of superdataset>:<input-filename in subdataset>

but of course that does not work.

If I want to know the version/state of the subdatasets, I could manually look in the version history (git log) of the superdataset to see when the subdataset’s state was last saved before the datalad run command. However, this seems quite cumbersome to me.
Is there a better way to find out the subdatasets’ states?

ctr · May 23, 2022, 2:36pm

That’s a good question!

If I understand correctly, you are asking: “At the time of commit C in the parent dataset,
what did file F in subdataset S look like?”

Indeed, according to YODA it would be typical to provide the inputs from a subdataset. But since the parent dataset only stores the subdataset’s commit hash (and not the whole history of the subdataset’s files), we cannot directly query a previous version of a file in a subdataset from the parent.

So you would need to do it in 2 steps:

(1) Find out which revision of the subdataset was stored in the parent dataset at the time of the parent’s commit of interest. For example, when you called datalad run in the parent, what was the subdataset’s commit ID?

You can get this commit ID with the git command:

git rev-parse <parent_commit_hash>:<subds_rel_path>

where <subds_rel_path> is the relative path to the subdataset from the parent dataset.

Example:

$ git rev-parse 193d23d82b911:data/inputs/rawdata/bids
e4cda1cf6f5e9f9be9c3baf7697c81cca7661130

(2) Take the resulting commit ID and plug it in the git show command (same as in the simple case) executed in the subdataset. This will print out the contents of the file at the given revision.

Here is a snippet that glues together the two commands (you can copy it including the outer brackets, replace the variables and run it in the shell):

(
    dataset=.
    parent_commit="193d23d82b9116e5247c4ffb46f534e2e37ac7c5"
    subds_path="data/inputs/rawdata/bids"
    subds_file_path="./README.md"

    datalad -f disabled \
        foreach-dataset -d "${dataset}" --contains "${subds_path}" \
        git show "$(git rev-parse "${parent_commit}":"${subds_path}")":"${subds_file_path}" | cat
)

The datalad foreach-dataset command is really good for iterating through subdatasets, which helps in case you have inputs from several different subdatasets.
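For a single subdataset, the same two steps can also be written with plain git. This is only a sketch under a few assumptions: the helper name show_subds_file is invented here, the subdataset is installed at its recorded path inside the parent, and the demo builds a throwaway parent with a nested repository to stand in for a real superdataset/subdataset pair.

```shell
# Hypothetical helper: show_subds_file PARENT_DIR PARENT_COMMIT SUBDS_PATH FILE
show_subds_file() {
    parent_dir=$1; parent_commit=$2; subds_path=$3; file=$4
    # Step 1: which subdataset commit did the parent record at that commit?
    subds_commit=$(git -C "$parent_dir" rev-parse "${parent_commit}:${subds_path}") || return 1
    # Step 2: show the file as of that commit, from inside the subdataset.
    git -C "${parent_dir}/${subds_path}" show "${subds_commit}:${file}"
}

# Throwaway demo: a parent repository with a nested repository at "sub".
demo=$(mktemp -d)
git init -q "$demo/parent"
git init -q "$demo/parent/sub"
for repo in "$demo/parent" "$demo/parent/sub"; do
    git -C "$repo" config user.email "demo@example.com"
    git -C "$repo" config user.name "Demo"
done

echo "v1" > "$demo/parent/sub/README.md"
git -C "$demo/parent/sub" add README.md
git -C "$demo/parent/sub" commit -q -m "sub: v1"

# Record the current state of "sub" in the parent (stored as a gitlink).
git -C "$demo/parent" add sub 2>/dev/null
git -C "$demo/parent" commit -q -m "[DATALAD RUNCMD] imitated run"

# The subdataset moves on, but the parent still points at the old state.
echo "v2" > "$demo/parent/sub/README.md"
git -C "$demo/parent/sub" commit -q -am "sub: v2"

content=$(show_subds_file "$demo/parent" HEAD sub README.md)
echo "$content"   # prints "v1"
```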

Still, I wonder if there is a more DataLad-onic solution for this use case?

jhpb7 · May 25, 2022, 12:15pm

Oh, that’s really wonderful! Thank you so much!
I will use this for now.

I also wonder if there is a more native DataLad solution. I think it would be really helpful to have a command that shows the commit hashes of all the input and output files of a datalad run, possibly taking the commit hash of the run as input.
I’m not sure whether that’s only interesting to me or whether it would be helpful to more people… any ideas, anyone?
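Until something like that exists, a rough approximation with plain git might be to list every subdataset state that the superdataset recorded at the run commit. The sketch below assumes that subdatasets are registered as ordinary git submodules (gitlinks) in the superdataset; the helper name list_subds_states is invented, and the demo again uses a hand-made parent and nested repository rather than a real datalad run.

```shell
# Hypothetical helper: list_subds_states PARENT_DIR [COMMIT]
# Prints "<subdataset commit> <path>" for each gitlink recorded at COMMIT.
list_subds_states() {
    # ls-tree prints "<mode> <type> <object>TAB<path>"; gitlinks have type "commit".
    git -C "${1:-.}" ls-tree -r "${2:-HEAD}" |
        awk -F'\t' '{ split($1, meta, " "); if (meta[2] == "commit") print meta[3], $2 }'
}

# Throwaway demo: a parent repository with one nested repository at "sub".
demo=$(mktemp -d)
git init -q "$demo/parent"
git init -q "$demo/parent/sub"
for repo in "$demo/parent" "$demo/parent/sub"; do
    git -C "$repo" config user.email "demo@example.com"
    git -C "$repo" config user.name "Demo"
done

echo "data" > "$demo/parent/sub/values.csv"
git -C "$demo/parent/sub" add values.csv
git -C "$demo/parent/sub" commit -q -m "sub: add values"

git -C "$demo/parent" add sub 2>/dev/null
git -C "$demo/parent" commit -q -m "[DATALAD RUNCMD] imitated run"

states=$(list_subds_states "$demo/parent" HEAD)
echo "$states"
```

Each printed commit ID can then be passed to git show inside the corresponding subdataset to inspect a file as it was at the time of the run.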

yarikoptic · May 25, 2022, 8:58pm

FWIW there is the very core command datalad status, which can give you all kinds of information about the state of anything you point it to. Someone could marry it to datalad diff output to identify what has changed (outputs) and show that information… so some helper tooling is there, and someone would need to make such a magical command. Maybe by coding a new datalad extension (see “Customization and extension of functionality” in the DataLad 0.16.3+0.g00aff2c74.dirty documentation).

But besides being a “nice to have” convenience for getting to know the gory details about the state of things involved in a run, is there some specific use case or demand you are trying to address?

jhpb7 · May 31, 2022, 3:10pm

I’m probably missing something, but the question seems natural to me.
Let’s assume the case where I have a dataset that consists of several sub-datasets:
Some of the subdatasets are databases that I use to create certain outputs (e.g., plots). From time to time I edit my database, for example by updating certain values.
Now, if I want to use a plot I created in a publication, I want to find out which version of the database I used to create that plot, so I need to know the answer to this thread’s question.
The only way to answer this so far is to manually check which version of subdatasets was used when creating the plot with datalad run. This seems a bit cumbersome to me, especially since datalad run already lists all input and output files, so it would probably be easy to add the origin of the inputs ((sub)dataset, commit-hash, …?).
Do you have a better way to solve this issue?

Just to be sure: I’m really new to datalad, I’m just starting to use it for my work and so the “problem” I’m describing is currently just a problem I think would come up when working with datalad.