Working on a datalad dataset together

Hi,

I was wondering what your experience has been with several users working together on one datalad dataset (the same dataset, not sibling datasets). Have you run into any problems that would make you advise against it?

Thank you in advance,

Ana

Hi Ana,

Datasets are git repositories, so I think the same arguments as for a ‘multi-user’ git repo would apply.

Git (unlike other version control systems) solves the problem of multiple people collaborating on a project by not having everyone work on the same copy. Instead, each user can retrieve a clone (“sibling” in datalad-world) of the “central” repo/dataset (e.g. hosted on GitHub, GIN, or on a shared drive), make local changes, and then push changes back to the “central” remote whenever they are ready. This way, you can make local changes freely, without worrying about interfering with other people working at the same time.
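
To make the workflow concrete, here is a minimal sketch using plain git only (datalad itself is not invoked). The scratch directory, the user names "alice" and "bob", and the file name are hypothetical and only serve the illustration:

```shell
set -eu
work=$(mktemp -d)                         # scratch area; all paths below are hypothetical
git init -q --bare "$work/central.git"    # the "central" repo that nobody works in directly

# One user works in their own clone and publishes when ready
git clone -q "$work/central.git" "$work/alice"
cd "$work/alice"
git config user.name "Alice"
git config user.email "alice@example.com"
echo "analysis v1" > notes.txt
git add notes.txt
git commit -q -m "Add notes"
git push -q origin HEAD                   # push the current branch back to "central"

# Another user obtains the published state through a clone of their own
git clone -q "$work/central.git" "$work/bob"
cat "$work/bob/notes.txt"
```

Neither user ever touches the other's working copy; the only shared state is what has been pushed to the central repository.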

On the other hand, this setup opens up the problem of merging conflicting changes. If two users try to modify the same content, git requires you to resolve the merge conflict, which can be tricky.

Is there a particular reason in your case for not using dataset siblings?

Hi ctr,

Thank you for your response. The datalad dataset is located on an HPC server. This lets multiple people work on the project without any one person having to “own” the dataset. As you mentioned, if we worked with siblings, changes would have to be merged back into our original dataset (to make them accessible to other users). This would be annoying, since only one person (the “owner” of the dataset) would be able to, or should, do that. This is why we have saved our project in a group-accessible directory where everybody has access and can run scripts (within the project we are working on different tasks, so we are not working on the same files at the same time). This has worked quite well so far, except for some occasional, obscure problems. Now we are wondering whether these problems could be traced back to the shared access to the directory.

Looking forward to your reply :slight_smile:

Ana

Hi Ana,

thanks for the additional context, now it’s clearer.

I don’t have experience with using datalad on an HPC system, but AFAIK there is an option to configure a dataset sibling so that members of a Unix group (and not just the owner) can push to it: datalad create-sibling --shared=group --group=<groupname>. This uses the git command git init --shared under the hood.

Here it is described for the RIA store remote type, but it should work the same for a regular sibling.
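
For reference, the underlying git mechanism can be tried out in isolation. This is only a sketch of what git init --shared=group does to a repository, not of the datalad command itself; the temporary location and the repository name are made up for the example:

```shell
set -eu
store=$(mktemp -d)                        # hypothetical location for the shared repository
git init -q --bare --shared=group "$store/shared.git"

# git records the sharing mode in the repository configuration
git -C "$store/shared.git" config core.sharedRepository

# inspect the permissions git set up on the object store
ls -ld "$store/shared.git/objects"
```

The group ownership itself would still have to be set on the directory (e.g. with chgrp), where <groupname> is whatever Unix group your project members belong to.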

Maybe others who have used this setting can help out :slight_smile:

Hi ctr,

That actually already helped us out a lot! We are implementing this configuration now and will report back if it worked.

Thank you very much for your help so far.

Best,
Ana

Sounds good! Let us know if it worked out, I’m curious too :slight_smile:

FWIW: after some initial exploration of the feasibility of such “shared” setups, and after discovering some problems such as this one, which was fixed only recently in the 10.x series of git-annex, I must say that I have stayed away from any shared git-annex repository (or even a plain non-bare git one, for that matter). In general I would expect such setups to be less tested. So please make sure that you are using a recent git-annex, and then please report back any issues you encounter.

Some of those “obscure” problems might even boil down to a trivially “incorrect” mode of user operation, such as a restrictive default umask or a missing setgid (“group sticky”) bit on a folder, which would cause permission issues even before any interaction with git and git-annex comes into play.
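
A quick way to rule out these two causes is to check them directly on the filesystem. The following is a minimal sketch on a throwaway directory (the path is hypothetical; on the real system you would inspect the shared project directory instead):

```shell
set -eu
dir=$(mktemp -d)/project                  # hypothetical stand-in for the shared project directory
mkdir -p "$dir"

umask 002                                 # new files become group-writable; with a restrictive umask such as 077 they would not
chmod g+rwxs "$dir"                       # setgid bit: new entries inherit the directory's group

touch "$dir/file"
ls -ld "$dir" "$dir/file"                 # look for "s" in the group triplet of the directory and "w" in that of the file
```

If files created by one user show up without group write permission, or with that user's primary group instead of the project group, other members will hit permission errors regardless of git or git-annex.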

But why, in this setup, didn’t you make that “central” repo the “shared” one, in which nobody introduces any changes or runs anything directly? Then everyone would have their own clone (possibly with --reckless=ephemeral, and thus a symlinked .git/annex) and could happily operate in the privacy of their clone while pushing changes back to the “shared” one.

That would avoid the problem of “running scripts” directly in the shared dataset, where nobody has any assurance that its state won’t be changed by someone else in the meantime, causing errors or just irreproducible results.