I was wondering what your experience has been with several users working on one datalad dataset together (not sibling datasets, but the same one). Have you had any problems in this regard that would lead you to advise against it?
Datasets are git repositories, so I think the same arguments as for a "multi-user" git repo would apply.
Git (unlike other version control systems) solves the problem of multiple people collaborating on a project by not having everyone work on the same copy. Instead, each user can retrieve a clone ("sibling" in datalad-world) of the "central" repo/dataset (e.g. hosted on GitHub, GIN, or on a shared drive), make local changes, and then push changes back to the "central" remote whenever they are ready. This way, you can make local changes freely, without worrying about interfering with other people working at the same time.
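To make the clone / change / push cycle described above concrete, here is a minimal sketch using plain git in a throwaway directory. The names central.git, alice, bob and notes.txt are made up for illustration; with datalad the corresponding steps would be datalad clone, datalad save and datalad push.

```shell
set -e
# Throwaway working area; all names below are hypothetical.
work=$(mktemp -d)

# The "central" repository that nobody works in directly.
git init --quiet --bare "$work/central.git"

# One user takes a private clone, makes a change, and pushes it back.
git clone --quiet "$work/central.git" "$work/alice"
git -C "$work/alice" config user.name "Alice Example"
git -C "$work/alice" config user.email "alice@example.org"
echo "first note" > "$work/alice/notes.txt"
git -C "$work/alice" add notes.txt
git -C "$work/alice" commit --quiet -m "Add notes"
git -C "$work/alice" push --quiet origin HEAD

# A second user who clones afterwards receives the pushed change.
git clone --quiet "$work/central.git" "$work/bob"
cat "$work/bob/notes.txt"
```

Neither user ever touches the other's working tree; the only shared state is what has been pushed to the central repository.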
On the other hand, this setup opens up the problem of merging conflicting changes. If two users try to modify the same content, git requires you to resolve the merge conflict, which can be tricky.
Is there a particular reason in your case for not using dataset siblings?
Thank you for your response. The datalad dataset is located on an HPC server. This lets multiple people work on the project without one person having to "own" the dataset. As you mentioned, if we worked with siblings, changes would have to be merged back into our original dataset (to make them accessible to other users). This would be annoying, since only one person (the "owner" of the dataset) would be able to, or should, do that. This is why we saved our project in a group-accessible directory where everybody has access and can run scripts (within the project we work on different tasks, so we are not working on the same files at the same time). This has worked quite well so far, apart from some occasional, obscure problems. Now we are wondering whether these problems could be traced back to the shared access to the directory.
Thanks for the additional context, now it's clearer.
I don't have experience with using datalad on an HPC system, but AFAIK there is an option to configure a dataset sibling so that members of a Unix group (and not just the owner) can push to it: datalad create-sibling --shared=group --group=<groupname>. This uses the git command git init --shared under the hood.
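For reference, a small sketch of what the underlying git step looks like on its own, assuming a throwaway bare repository (the path shared.git is made up). It only shows where git records the setting; it does not create a datalad sibling.

```shell
set -e
work=$(mktemp -d)   # throwaway location, hypothetical path names

# Create a bare repository that group members may write to.
git init --quiet --bare --shared=group "$work/shared.git"

# git records the choice in the repository configuration ...
git -C "$work/shared.git" config core.sharedRepository

# ... and adjusts the directory permissions, which can be inspected here.
ls -ld "$work/shared.git" "$work/shared.git/objects"
```

Whether the group members can actually write also depends on the group ownership of the directory, which is what the --group option of create-sibling is for.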
FWIW: after some initial exploration of the feasibility of such "shared" setups, and after discovering some problems, such as this one, which was fixed only recently in the 10.x series of git-annex, I must say that I have stayed away from any shared git-annex repository (or just a pure non-bare git one, for that matter). I would also expect such setups to be less tested in general. So please make sure that you are using a recent git-annex, and then please report back any issues you encounter:
where some of those "obscure" problems might even boil down to a trivial "incorrect" mode of user operation, such as a restrictive default umask or a missing setgid ("group sticky") bit on a folder, which would cause permission issues even before any interaction with git and git-annex comes into play.
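A quick way to see the effect of the umask and the group (setgid) bit mentioned above, sketched in a scratch directory. The directory name project and the umask value 002 are only examples; what is appropriate depends on the site's policy.

```shell
set -e
work=$(mktemp -d)          # scratch area, hypothetical names
mkdir "$work/project"

# Group-writable directory with the setgid bit, so that new entries
# inherit the directory's group instead of the creator's primary group.
chmod g+ws "$work/project"
ls -ld "$work/project"

# With umask 002, newly created files stay group-writable;
# a umask such as 077 would instead lock collaborators out.
( umask 002; touch "$work/project/new_file" )
ls -l "$work/project/new_file"
```

Running umask without arguments in each user's shell shows what they are currently using.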
But why, in this setup, didn't you make that "central" repo the "shared" one, in which nobody actually introduces any changes or runs anything directly? Then everyone would have their own clone (possibly with --reckless=ephemeral, thus a symlinked .git/annex) and could happily operate in the privacy of their clone while pushing changes back to the "shared" one.
That would avoid the problem of "running scripts" directly in the shared one, where nobody has any assurance that its state won't be changed by someone else in the meantime, causing errors or just irreproducible results.
Chipping in here, as I'm in Ana's lab and set up the repositories in question.
Sounds good! Let us know if it worked out, I'm curious too.
Because of @yarikoptic's suggestion, we didn't do any thorough testing, so I have no idea whether it really worked out, but at first glance it looked good. I haven't seen permission issues since I tried git init --shared.
As an aside, we instead followed @yarikoptic's advice and changed the workflow so that every user has a private clone, from which changes are pushed to the origin. To be sure we do it right this time, I have a few additional questions:
To undo the git init --shared, I manually deleted sharedRepository=1 from the git config. Is that all, or do I have to fix some more files? (datalad seems to run fine)
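A hedged sketch of how one could check what is left over after removing that setting, using a throwaway repository (the path repo is made up). As far as I know, git init --shared also enables receive.denyNonFastForwards, and permissions that were already applied to existing files and directories are not reverted by editing the config, so both are worth a look.

```shell
set -e
work=$(mktemp -d)    # throwaway repository, hypothetical path
git init --quiet --shared=group "$work/repo"

# Same effect as deleting the line by hand from .git/config.
git -C "$work/repo" config --unset core.sharedRepository

# A second setting that sharing switched on; prints nothing if absent.
git -C "$work/repo" config --get receive.denyNonFastForwards || true

# Directories that still carry the setgid bit from the shared setup.
find "$work/repo/.git" -maxdepth 1 -type d -perm -2000
```

Whether to reset those modes (for example with chmod -R g-s on .git) is a judgment call; leaving them in place should not interfere with normal operation.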
It seems the central repo has no say in whether a clone can push anything or not. Is there a way to protect a dataset, other than setting file permissions in the file system? At the moment, anyone could unlock files (either directly or through run) and make changes to files that should be immutable.
I'm not sure I understand --reckless=ephemeral correctly. Would it essentially mean that files which are in the origin annex but not in the clone don't need a get to be available in a script? Would all changes to existing files in the clone directly affect the origin annex? And would new files also be written directly to the origin annex, so that the only thing that needs to be pushed is the git files?
Slightly unrelated, but is it possible to unregister a subdataset from its parent? I want to retain the subdataset as a dataset, but let it forget that it has a parent (so the parent dataset would become a regular directory).
Maybe I am mistaken, but that seems to do the opposite of what I meant, no? The parent dataset is still a dataset and the child is now a regular directory, whereas I would like the parent to be a directory and the child a dataset.
Ah, then don't you just want rm -r .git .gitmodules in your super to convert it to just a directory? (But better make sure first that you have nothing under .git/annex/objects there.)
Edit: a little clarification here. With "pure git", the above might not have worked properly, since a submodule's .git/ directory might actually reside within the super's .git/modules/ (if I recall the path correctly), so removing the top-level .git could be detrimental. In datalad, we have to have a real .git/ directory (not just a git link file) for git-annex to operate correctly.
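A cautious sketch of that removal, assuming a scratch layout with a superdataset super and one subdataset sub (both names, and the stand-in .gitmodules, are made up). It only removes the super's .git if the subdataset has a real .git/ directory of its own; check .git/annex/objects in the super separately before doing this on real data.

```shell
set -e
work=$(mktemp -d)                      # scratch area, hypothetical names
git init --quiet "$work/super"
git init --quiet "$work/super/sub"     # sub keeps its own .git/ directory
printf '[submodule "sub"]\n\tpath = sub\n\turl = ./sub\n' > "$work/super/.gitmodules"

if [ -d "$work/super/sub/.git" ]; then
    # Real directory: the sub's history does not live in super/.git/modules/.
    rm -rf "$work/super/.git" "$work/super/.gitmodules"
else
    # A gitlink file would point into the super's .git; removing that loses data.
    echo "sub/.git is not a directory, leaving super untouched" >&2
fi

# The former subdataset is still a repository in its own right.
git -C "$work/super/sub" rev-parse --is-inside-work-tree
```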