DataLad on GitHub + cluster storage

Hi Lucas,

Here are a few quick answers, I hope they can help. :slight_smile:

It sounds like you are looking for a RIA store setup. We have some documentation on this here. If you have a DataLad dataset on C, push it to a RIA store on your storage server, and then create a GitHub sibling from C, anyone who clones the resulting repository from GitHub will have an automatically configured link to your RIA store for retrieving the data (assuming they have permission to access the data on your cluster).
The one thing that is not configured automatically in this setup is the link for pushing data back into the RIA store (clones could still push Git history changes and open PRs), but you could create a custom procedure that runs automatically when users clone your GitHub repository. We can help / provide examples, so maybe check the linked chapter to see whether this suits your use case and let us know :slight_smile:
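Roughly sketched, the setup above could look like this (the store URL, sibling names, and repository name are placeholders for your own infrastructure):

```shell
# On machine C, inside your DataLad dataset:

# Register the RIA store on the storage server as a sibling
# (ria+ssh URL and store path are hypothetical -- adjust to your cluster)
datalad create-sibling-ria -s ria-storage --new-store-ok \
    "ria+ssh://storage.example.org/path/to/ria-store"

# Push Git history and annexed file content into the RIA store
datalad push --to ria-storage

# Create a GitHub sibling; --publish-depends ensures data is pushed
# to the RIA store before the Git history goes to GitHub
datalad create-sibling-github -s github my-dataset \
    --publish-depends ria-storage
datalad push --to github
```

After cloning from GitHub, users with access to the storage server should then be able to retrieve file content with a plain `datalad get <file>`.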

If you go for the RIA store setup, then this isn’t necessary, but please follow up if that isn’t what you’re looking for.

DataLad datasets are joint Git/git-annex repositories, and - with rare exceptions - provide all features that both those tools provide. So, yes, branching is perfectly possible, as are PRs.

That’s my personal opinion, but I would advocate for sharing code alongside data and results, ideally with documentation on how it should be executed. To be even more transparent, you could use the datalad run or datalad containers-run commands to link data, code (+ software), and execution, as those commands produce re-executable records in the Git history of the dataset.
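As a small illustration (the script, input, and output paths here are made up), a provenance-tracked execution could look like:

```shell
# Run an analysis step; DataLad records command, inputs, and outputs
# in a structured commit message in the dataset's Git history
datalad run \
    -m "Extract summary statistics" \
    --input "data/raw/*.tsv" \
    --output "results/summary.tsv" \
    "python code/summarize.py"

# Anyone with the dataset can later re-execute that recorded command:
datalad rerun HEAD
```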

One thing first: datalad add is a very old command (and deprecated, I think). Please always use datalad save (it combines staging and committing) instead of datalad add.
The aim would definitely be to have your scripts stored in Git (and not git-annex). This makes the scripts available on GitHub and for CI, avoids the added complexity of file (un)locking when modifying them, and keeps the version control experience for those files exactly as it would be in a plain Git repository.
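For example (file path and commit message are hypothetical), assuming the dataset is configured so that code/ goes into Git (e.g., via the cfg_yoda procedure mentioned below):

```shell
# Stage and commit the script in one step; with the code/ rule in
# place, the file ends up in Git rather than git-annex
datalad save -m "Add preprocessing script" code/preprocess.py
```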

As for using git add + git commit versus datalad save to achieve that: both methods are possible; it depends a bit on how you and your users prefer to work. You can always use Git to commit files into the Git portion of your dataset. Alternatively, you can add a configuration (so-called “largefile rules” in a .gitattributes file) to the dataset that tells DataLad to store certain files/file types/folders/files of a certain size/… in Git instead of git-annex. There is documentation on how to do this by hand here, and there are also “dataset procedures” that can apply such a configuration automatically (e.g., cfg_yoda, which creates a code/ directory and saves all files placed in there into Git), or you could write a custom procedure for your use case and distribute it in your institute (docs on this here).
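To make the two options concrete, here is a sketch (the size threshold is an arbitrary example, pick whatever fits your data):

```shell
# Option 1: hand-written largefile rules in .gitattributes
cat >> .gitattributes <<'EOF'
# keep everything under code/ in Git, never git-annex
code/** annex.largefiles=nothing
# annex only files larger than 100 kB; smaller files go into Git
* annex.largefiles=(largerthan=100kb)
EOF
datalad save -m "Configure largefile rules" .gitattributes

# Option 2: apply the built-in yoda procedure, which creates code/
# and configures it to be stored in Git
datalad run-procedure cfg_yoda
```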

Hope that’s a start.
Cheers,
Adina