Datalad setup with orphan branches fails to recursively push data

peterg1t · March 28, 2023, 11:58pm

Please describe the problem.
We are trying to use Datalad with orphan branches and subdatasets. Taking a page from the FAIRly big framework (FAIRly big: A framework for computationally reproducible processing of large-scale data | Scientific Data), we are trying to do some processing, with the difference that one repo will contain multiple processing streams. For this we are using orphan branches as every branch should be as independent as possible from all others. In our approach we clone a mirror job branch and perform some computations in one or more subdatasets on this branch, which are then used to update the original orphan branch. Once the “job” branch on the remote location is pushed back, we hoped to merge it with the local “processing branch” (including subdataset contents).

We push back the content with:
datalad push -d source_branch --to origin
After this, git log on the branch shows that the calculations were done successfully; however, the annex data is not on the branch. We have also tried
datalad push -d {source_dataset} --to origin --data anything -f all -r
with the same results. The git log is updated showing the results of our datalad run commands but no data is ever pushed back.

What steps will reproduce the problem?
DataLad version: 0.18.0
Please provide any additional information below.

The anatomy of a sample working dir looks like this
├── INP
│   ├── F1 -> .git/annex/objects/65/Jw/MD5E-s5000--46e072a88fbca15cdeb70c455338e15b/MD5E-s5000--46e072a88fbca15cdeb70c455338e15b
│   └── F2 -> .git/annex/objects/fW/p7/MD5E-s5000--fdac567e99d4e414ba21aca1bdce2f51/MD5E-s5000--fdac567e99d4e414ba21aca1bdce2f51
├── OUT
│   ├── O1 -> .git/annex/objects/65/Jw/MD5E-s5000--46e072a88fbca15cdeb70c455338e15b/MD5E-s5000--46e072a88fbca15cdeb70c455338e15b
│   └── O2 -> .git/annex/objects/fW/p7/MD5E-s5000--fdac567e99d4e414ba21aca1bdce2f51/MD5E-s5000--fdac567e99d4e414ba21aca1bdce2f51
└── tf.csv

Our branches look like this
git-annex
main
proc_f1
proc_f2
where proc_f1 and proc_f2 are orphan branches. proc_f1 should contain the F1 and O1 objects, whereas proc_f2 contains F2 and O2.

Have you had any luck using DataLad before? Datalad is great! We are trying to push the limits of what we can do with it. Thank you very much.

eknahm · April 11, 2023, 8:21am

In order to understand what is happening, please allow me a few questions:

For this we are using orphan branches as every branch should be as independent as possible from all others

It is not clear to me how orphan branches address this (better than the setup used in the original publication). I am using the following definition of an orphan branch:

a Git branch that has no parents or git history, i.e. no shared history with any other branch.
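
To make that definition concrete, here is a minimal sketch in plain git, run in a throwaway repository. The branch name proc_f1 comes from the original post; the file names, commit messages, and identity settings are illustrative assumptions only. It creates an orphan branch and checks that its first commit has no parent and that it shares no merge base with main.

```shell
set -e
# Hypothetical identity so commits work in a clean environment.
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="Example User" GIT_COMMITTER_EMAIL="user@example.com"

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main   # name the unborn branch "main"

echo "spec" > spec.txt
git add spec.txt
git commit -q -m "main: initial state"

# Start a branch without a parent commit. git keeps the index and working
# tree of the previous branch, so clear them to begin from an empty tree.
git checkout -q --orphan proc_f1
git rm -rf -q .
echo "f1" > F1
git add F1
git commit -q -m "proc_f1: root commit"

# Parent hashes of the branch tip (empty for a root commit).
parents=$(git log -n 1 --format=%P proc_f1)

# git merge-base exits non-zero when two commits share no ancestor.
if git merge-base main proc_f1 >/dev/null 2>&1; then
    shared=yes
else
    shared=no
fi
echo "parents of proc_f1 tip: '$parents'; shared history with main: $shared"
```

A branch created this way cannot be merged into main with a default `git merge`; git refuses to merge unrelated histories unless `--allow-unrelated-histories` is given.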

On the surface, this is somewhat in conflict with the idea of FAIRly-big, which is:

bootstrap a computational workspace from a committed specification
execute a “job” with a particular parameter configuration
capture the outcomes (together with a definitive state of all inputs/dependencies of the job)
merge outcomes of multiple branches to form a new joint state of a dataset
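
The four steps above can be sketched roughly as follows. This is a plain-git stand-in, not the actual FAIRly-big tooling: an ordinary commit replaces the provenance capture that `datalad run` would perform, and the repository name source, the clone names, the job-* branch names, and the file names are all hypothetical. The point it illustrates is that job branches start from the same committed state, so they share a merge base and can be combined afterwards.

```shell
set -e
# Hypothetical identity so commits work in a clean environment.
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="Example User" GIT_COMMITTER_EMAIL="user@example.com"

work=$(mktemp -d)
cd "$work"

# 1. Committed specification that every job is bootstrapped from.
git init -q source
cd source
git symbolic-ref HEAD refs/heads/main   # name the unborn branch "main"
echo "threshold=0.5" > spec.txt
git add spec.txt
git commit -q -m "Committed specification"
spec_commit=$(git rev-parse HEAD)
cd ..

# 2. and 3. Each job works in its own clone and on its own branch,
# commits its outcome, and pushes only that branch back.
for job in f1 f2; do
    git clone -q source "clone-$job"
    cd "clone-$job"
    git checkout -q -b "job-$job"
    echo "result for $job" > "out-$job.txt"   # stand-in for the real computation
    git add "out-$job.txt"
    git commit -q -m "Outputs of job $job"
    git push -q origin "job-$job"
    cd ..
done

# 4. Merge the outcomes. Both job branches descend from the specification
# commit, so a single octopus merge can combine them.
cd source
base=$(git merge-base job-f1 job-f2)
git merge -q -m "Merge job results" job-f1 job-f2
ls
```

With orphan branches in place of the job-* branches, step 4 would have no common ancestor to work from, which is the conceptual tension described here.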

You write:

one repo will contain multiple processing streams

Applied to FAIRly-big, this would translate to multiple disjoint histories, but it is not clear to me how that would affect a branch that captures a computational output (which is the one that would need pushing back).

In FAIRly-big (even if a repo contains multiple non-shared histories), a compute job is bootstrapped from a defined/committed state. Based on this state, a new branch is created to receive a commit with the captured outputs. This branch has a common history with the branch/commit that defines the computational environment. Even when the job fails and no output could be captured, that branch exists (and is pushed back) to indicate that the job ran at all.

Please check whether there are any conceptual differences between what you are trying to achieve and this description.

Regarding the technical issue: it would be helpful to see a log of the full commands and their output, together with information on where (in which clone) they were executed (prior to the job, within the job, after the push).

StephanHeunis · May 10, 2023, 10:39am

Hi @peterg1t, just following up on this: have you been able to look into @eknahm's comments and questions, and can you provide answers to them?

peterg1t · May 10, 2023, 2:42pm

Hi Michael and Stephan,
My apologies for the late response. I got distracted with other tasks and missed the first message. So, diving into the questions:
Our goal was not quite to reproduce FAIRly big. In FAIRly big, to perform a different set of analyses you could, for example, clone the original repo and do so, but what if we want to have this in the same location? Multiple orphan branches of the superdataset would contain the disjoint, individual histories of the different processing streams, and a FAIRly big style processing would work for every branch as well.
I realized after a bit more tinkering that I was performing some of the operations in the wrong order. Namely, I was creating a subdataset to contain the output information before creating the orphan branches, so the output exists for all branches, as expected. To achieve the intended goal I would have to create the (different) subdatasets after creating the orphan branches. I have to say that we can definitely close this question, as Datalad is performing as expected!!
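
For anyone landing here later, the ordering effect can be reproduced with plain git. This is only a sketch of one plausible mechanism, assuming the orphan branches are made with `git checkout --orphan`, which keeps the index of the branch you came from. A plain directory stands in for the output subdataset, and the names early_f1, early_f2, and OUT_proc_f1 are made up for illustration; in DataLad the per-branch step would presumably be `datalad create -d . <path>`, which is not run here.

```shell
set -e
# Hypothetical identity so commits work in a clean environment.
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="Example User" GIT_COMMITTER_EMAIL="user@example.com"

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main   # name the unborn branch "main"

# Order that caused the confusion: the output location is committed first ...
mkdir OUT
echo "shared output" > OUT/placeholder
git add OUT
git commit -q -m "Add OUT before branching"

# ... so each orphan branch created afterwards picks it up from the kept index.
for b in early_f1 early_f2; do
    git checkout -q --orphan "$b"
    git commit -q -m "$b: root commit"
    git checkout -q main
done

# Intended order: create the orphan branch, empty it, then add the
# branch-specific output location.
for b in proc_f1 proc_f2; do
    git checkout -q --orphan "$b"
    git rm -rf -q .
    mkdir "OUT_$b"
    echo "$b" > "OUT_$b/result"
    git add "OUT_$b"
    git commit -q -m "$b: root commit with its own output"
    git checkout -q main
done

early=$(git ls-tree -r --name-only early_f1)
late=$(git ls-tree -r --name-only proc_f1)
echo "early_f1 tracks: $early"
echo "proc_f1 tracks:  $late"
```

The early_* branches both carry the same OUT content, while each proc_* branch only tracks the output location created on it.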
My apologies if this looks confusing. It kinda is, as we are pushing Datalad to do something a bit different, and maybe in the near future we can share our results with the community.
Thank you very much for your help with this,
Pedro

cmo · May 11, 2023, 3:34pm

Hi Pedro, you are welcome. If I understand you correctly, the issue is solved? Best, Christian (for the datalad team).

peterg1t · May 11, 2023, 3:47pm

Yes, I have marked Michael’s reply as the answer.
Cheers!

Pedro