Working on a datalad dataset together

anadiasmaile · June 23, 2022, 4:01pm

Hi,

I was wondering what your experience has been with several users working on one datalad dataset together (not sibling datasets, but the same one). Have you had any problems in this regard that would lead you to advise against it?

Thank you in advance,

Ana

ctr · June 27, 2022, 1:34pm

Hi Ana,

datasets are git repositories, so I think the same arguments for a ‘multi-user’ git repo would apply.

Git (unlike centralized version control systems) solves the problem of multiple people collaborating on a project by not having everyone work on the same copy. Instead, each user can retrieve a clone (“sibling” in datalad-world) of the “central” repo/dataset (e.g. hosted on GitHub, GIN, or on a shared drive), make local changes, and then push changes back to the “central” remote whenever they are ready. This way, you can make local changes freely, without worrying about interfering with other people working at the same time.

On the other hand, this setup opens up the problem of merging conflicting changes. If two users try to modify the same content, git requires you to resolve the merge conflict, which can be tricky.
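The clone-and-push workflow described above can be sketched with plain git, since datasets are git repositories. This is only an illustration under assumptions: the directory names (central.git, alice, bob), the branch name main, and the throwaway identities are hypothetical, and a real dataset would typically use the datalad counterparts (datalad clone, datalad save, datalad push) instead.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
cd "$work"

# A bare "central" repository stands in for GitHub, GIN, or a shared drive.
git init -q --bare central.git
git --git-dir=central.git symbolic-ref HEAD refs/heads/main

# Helper: commit in a clone with a fixed, throwaway identity.
commit_as() {  # usage: commit_as <clone> <name> <message>
    git -C "$1" -c user.name="$2" -c user.email="$2@example.org" commit -q -m "$3"
}

# Alice clones, works locally, and pushes when she is ready.
git clone -q central.git alice 2>/dev/null
git -C alice symbolic-ref HEAD refs/heads/main
echo "analysis plan" > alice/plan.txt
git -C alice add plan.txt
commit_as alice alice "Add analysis plan"
git -C alice push -q origin main

# Bob clones later, works on a different file, and pushes his change.
git clone -q central.git bob
echo "preprocessing notes" > bob/notes.txt
git -C bob add notes.txt
commit_as bob bob "Add preprocessing notes"
git -C bob push -q origin main

# Alice picks up Bob's work from the central repository.
git -C alice pull -q --ff-only origin main

commits=$(git -C alice rev-list --count main)
echo "commits visible to alice: $commits"
```

Because Alice and Bob touch different files and pull before pushing again, no merge conflict arises here; a conflict would only appear if both changed the same content before synchronizing.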

Is there a particular reason in your case for not using dataset siblings?

anadiasmaile · June 27, 2022, 2:12pm

Hi ctr,

Thank you for your response. The datalad dataset is located on an HPC server. This lets multiple people work on the project without any one person having to “own” the dataset. As you mentioned, if we worked with siblings, changes would have to be merged back into our original dataset (to make them accessible to other users). This would be annoying, since only one person (the “owner” of the dataset) would be able to, or should, do that. This is why we have saved our project in a group-accessible directory where everybody has access and can run scripts (within the project we are working on different tasks, so we are not working on the same files at the same time). This has worked quite well so far, except for some occasional, obscure problems. Now we are wondering whether these problems could be traced back to the shared access to the directory.

Looking forward to your reply

Ana

ctr · June 28, 2022, 9:15pm

Hi Ana,

thanks for the additional context, now it’s clearer.

I don’t have experience with using datalad on an HPC system, but AFAIK there is an option to configure a dataset sibling so that members of a Unix group (and not just the owner) can push to it: datalad create-sibling --shared=group --group=<groupname>. This uses the git command git init --shared under the hood.

Here it is described for the RIA store remote type, but it should work the same for a regular sibling.
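To see what the underlying git init --shared does, here is a small sketch using plain git only (no datalad involved). The temporary path and the repository name central.git are hypothetical; the point is merely to inspect the configuration that git writes.

```shell
#!/bin/sh
set -eu
store=$(mktemp -d)

# Create a bare repository that members of the owning Unix group may write to.
git init -q --bare --shared=group "$store/central.git"

# git records the "group" mode in its config as the value 1.
shared=$(git --git-dir="$store/central.git" config core.sharedRepository)
echo "core.sharedRepository = $shared"

# git init --shared also refuses non-fast-forward pushes by default.
deny_nonff=$(git --git-dir="$store/central.git" config receive.denyNonFastforwards)
echo "receive.denyNonFastforwards = $deny_nonff"

# Directories inside the repository are made group-writable.
ls -ld "$store/central.git/objects"
```

Which Unix group ends up owning the files still depends on the file system and on who created the repository, so the group itself has to be set separately (which is what the --group option mentioned above is for).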

Maybe others who have used this setting can help out

anadiasmaile · June 30, 2022, 3:29pm

Hi ctr,

That actually already helped us out a lot! We are implementing this configuration now and will report back if it worked.

Thank you very much for your help so far.

Best,
Ana

ctr · July 1, 2022, 11:32am

Sounds good! Let us know if it worked out, I’m curious too

yarikoptic · July 1, 2022, 4:36pm

FWIW: ever since some initial exploration of the feasibility of such “shared” setups, and discovering some problems such as this one, which was fixed only recently in the 10.x series of git-annex, I must say that I have stayed away from any shared git-annex repository (or just pure non-bare git, for that matter). I would also expect such setups to be less tested in general. So please make sure that you are using a recent git-annex, and then please report back any issues you encounter.

Some of those “obscure” problems might even boil down to a trivially “incorrect” mode of user operation, such as a restrictive default umask or an absent group sticky (setgid) bit on a folder, which would cause permission issues even before any interaction with git and git-annex comes into play.
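A quick way to check for the two pitfalls just mentioned, sketched in plain shell. The directory and file names are hypothetical, and the exact ls output varies between systems; this only illustrates how the umask and the setgid bit affect newly created files.

```shell
#!/bin/sh
set -eu
project=$(mktemp -d)

# Group-writable directory with the setgid bit: on Linux, new entries
# inherit the directory's group instead of the creator's primary group.
chmod g+ws "$project"

# A umask of 002 keeps group write permission on newly created files;
# a restrictive default such as 022 or 077 would strip it.
umask 002
touch "$project/result.txt"
mkdir "$project/derivatives"

if [ -g "$project" ]; then setgid_state=present; else setgid_state=absent; fi
echo "setgid bit on project dir: $setgid_state"

# First ten characters of the ls -l line are the file type and mode bits.
perms=$(ls -l "$project/result.txt" | cut -c1-10)
echo "mode of new file: $perms"
ls -ld "$project" "$project/derivatives"
```

If a user's shell runs with a umask of 022 instead, the same touch produces a file that other group members cannot modify, regardless of how git or git-annex are configured.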

But why, in this setup, didn’t you make that “central” repo the “shared” one, in which nobody actually introduces any changes or runs anything directly? Then everyone would have their own clone (possibly with --reckless=ephemeral, thus with a symlinked .git/annex) and could happily operate in the privacy of their clone while pushing changes back to the “shared” one.

That would avoid the problem of “running scripts” directly in the shared dataset, where nobody has any assurance that its state won’t be changed by someone else at the same time, causing errors or just irreproducible results.

eort · July 12, 2022, 5:44pm

Hey!

Chipping in here, as I’m in Ana’s lab and set up the repositories in question.

Because of @yarikoptic’s suggestion, we didn’t do any thorough testing, so I have no idea whether it really worked out, but at first glance it looked good. I haven’t seen permission issues since I tried git init --shared.

As an aside, we instead followed @yarikoptic’s advice and changed the workflow such that every user has a private clone, from which changes are pushed to the origin. To be sure we do it right this time, I have a few additional questions:

To undo the git init --shared, I manually deleted the sharedRepository=1 from the git config. Is that all, or do I have to fix some more files? (datalad seems to run fine)
It seems the central repo has no say in whether a clone can push anything or not. Is there a way to protect a dataset, other than setting file permissions in the file system? At the moment, anyone could unlock files (either directly or through run) and make changes to files that should be immutable.
I’m not sure I understand --reckless=ephemeral correctly. Would this essentially mean that files that are in the origin annex, but not in the clone, don’t need to be retrieved with get to be available in a script? Would all changes to existing files in the clone directly affect the origin annex? Would new files also be written directly to the origin annex? And would the git files be the only thing that needs to be pushed?
Slightly unrelated, but is it possible to unregister a subdataset from its parent? I want to retain the subdataset as a dataset, but let it forget that it has a parent (so the parent dataset would become a regular directory).
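Regarding the first question, removing the setting can also be done with git config rather than by editing the file by hand. The sketch below uses a throwaway repository (the path is hypothetical) and is not an authoritative answer: it only shows that the config key is gone afterwards. It does not revert the group permissions that git init --shared already applied to existing files under .git, and a second setting written at the same time stays in place.

```shell
#!/bin/sh
set -eu
repo=$(mktemp -d)/dataset
git init -q --shared=group "$repo"

# Equivalent to deleting the sharedRepository line from .git/config by hand.
git -C "$repo" config --unset core.sharedRepository

# git config exits non-zero when the key no longer exists.
if git -C "$repo" config core.sharedRepository >/dev/null; then
    state=still-set
else
    state=unset
fi
echo "core.sharedRepository: $state"

# This related setting was written alongside it and is left untouched here.
leftover=$(git -C "$repo" config receive.denyNonFastforwards || true)
echo "receive.denyNonFastforwards: ${leftover:-not set}"
```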

Thanks for your help!
Eduard

yarikoptic · July 12, 2022, 6:57pm

There might be a better / more elegant way, but I would do something like

mv subds subds.aside; git submodule deinit subds; git rm subds; git commit -m 'unregistered subds' -- subds .gitmodules; mv subds.aside subds

which seems to do what you want

eort · July 13, 2022, 10:10am

Maybe I am mistaken, but that seems to do the opposite of what I meant, no? The parent dataset is still a dataset, and the child is now a regular directory, whereas I would like the parent to become a regular directory and the child to remain a dataset.

yarikoptic · July 13, 2022, 9:18pm

ah, then don’t you just want rm -r .git .gitmodules in your super to convert it to just a directory? (but better first make sure you have nothing under .git/annex/objects there)

edit: a little clarification here. With “pure git”, the above might not have worked properly, since a submodule’s .git/ directory might actually reside within the super’s .git/modules/ (if I recall the path correctly), so removing the top-level .git could be detrimental. In datalad, we have to have a real .git/ directory (not just a gitlink file) for git-annex to operate correctly.
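The two precautions above (nothing left under .git/annex/objects in the super, and a real .git/ directory in the subdataset) can be checked before running rm -r. The following is a minimal sketch in plain shell; the helper names git_dir_kind and annex_object_count and the demo layout are hypothetical and only mimic the directory structure, without involving git, git-annex, or datalad.

```shell
#!/bin/sh
set -eu

# Report whether <dataset>/.git is a real directory or a gitlink file.
git_dir_kind() {
    if [ -d "$1/.git" ]; then echo directory
    elif [ -f "$1/.git" ]; then echo gitlink-file
    else echo missing
    fi
}

# Count files stored under <dataset>/.git/annex/objects.
annex_object_count() {
    if [ -d "$1/.git/annex/objects" ]; then
        find "$1/.git/annex/objects" -type f | wc -l | tr -d ' '
    else
        echo 0
    fi
}

# Throwaway layout: a super with one annexed object, a datalad-style
# subdataset (real .git/ directory) and a pure-git-style submodule
# (a .git file pointing into the super's .git/modules/).
demo=$(mktemp -d)
mkdir -p "$demo/super/.git/annex/objects/ab/cd" "$demo/super/subds/.git"
echo data > "$demo/super/.git/annex/objects/ab/cd/key"
mkdir -p "$demo/super/puregit-sub"
echo "gitdir: ../.git/modules/puregit-sub" > "$demo/super/puregit-sub/.git"

echo "super annex objects: $(annex_object_count "$demo/super")"
echo "subds .git: $(git_dir_kind "$demo/super/subds")"
echo "puregit-sub .git: $(git_dir_kind "$demo/super/puregit-sub")"
```

Removing the super’s .git would only be reasonably safe when the object count is 0 and every subdataset reports directory; a gitlink-file result means the submodule’s repository lives inside the super’s .git and would be deleted along with it.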