Parallel universes and Datalad

dmoracze · August 2, 2023, 8:20pm

@yarikoptic, before we write our own tools and/or come up with our own solutions, I want to see if our needs could be covered by Datalad.

We manage datasets of all sizes (mostly rawdata, but occasionally derivatives) that multiple groups interact with on our HPC. Occasionally these datasets need to be updated as new data arrives. The scenario: Group1 begins a study on a dataset, they understandably want to access this dataset as it existed when they began their analysis. Group2 wants a newer version of this dataset, so we download the new data and integrate it into the dataset. We want to make both groups happy.

We are looking for a solution where:

Any user can access any version of the dataset
Multiple people can access different versions at the same time.
No file redundancy to reduce disk use. If a file is the same between versions, a symlink is created, but if the file’s contents are different, a copy is created and stored in each version. For example, participants.tsv will change if new subjects are added.

Is this something Datalad can do? Previously I’ve only used it where a dataset can only exist as one version at a time on the filesystem.

Thank you!

yarikoptic · August 2, 2023, 9:00pm

Short answer: this is exactly the kind of setup you can achieve with git/git-annex, facilitated by DataLad.

Keywords on implementation/HOWTO:

git provides versioning and branches, and git-annex provides versioning of large data files through symlinks (DataLad uses both).
Ephemeral reckless clones: run datalad clone --reckless ephemeral ORIGINAL_LOCATION TARGET_PERSONAL_LOCATION --branch MY_PERSONAL_BRANCH to immediately get the version pointed to by MY_PERSONAL_BRANCH (or simply git checkout the desired version after cloning). You are set as long as ORIGINAL_LOCATION has the annexed content for all the versions.
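A minimal sketch of that clone command wrapped in a small shell function. The paths, the branch name, and the function name clone_dataset_version are hypothetical placeholders, and (as comes up later in this thread) the trailing --branch needs DataLad 0.16 or newer. The script only runs the clone when datalad and the source dataset are actually present.

```shell
#!/bin/sh
# Hypothetical placeholders; substitute your own paths and branch/tag name.
ORIGINAL_LOCATION="${ORIGINAL_LOCATION:-/data/shared/study-dataset}"
TARGET_PERSONAL_LOCATION="${TARGET_PERSONAL_LOCATION:-$HOME/study-dataset-v1}"
MY_PERSONAL_BRANCH="${MY_PERSONAL_BRANCH:-v1}"

# clone_dataset_version SOURCE TARGET BRANCH
# Makes an ephemeral clone that reuses SOURCE's annex instead of copying it.
clone_dataset_version() {
    datalad clone --reckless ephemeral "$1" "$2" --branch "$3"
}

if command -v datalad >/dev/null 2>&1 && [ -d "$ORIGINAL_LOCATION" ]; then
    clone_dataset_version "$ORIGINAL_LOCATION" \
        "$TARGET_PERSONAL_LOCATION" "$MY_PERSONAL_BRANCH"
else
    echo "Skipping: datalad or $ORIGINAL_LOCATION is not available here." >&2
fi
```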

Explanation of operation:
Everyone gets their own version, while the .git/annex folder of each clone (TARGET_PERSONAL_LOCATION/.git/annex) points to ORIGINAL_LOCATION/.git/annex, which has all the annexed files available. As a result, you get local checkouts at whatever version you want. Files under git control (as participants.tsv typically would be) are placed directly on disk by git, in the version matching the checkout. Annexed files are symlinks into .git/annex, and following them leads to ORIGINAL_LOCATION/.git/annex, which holds all the large files in a single copy.
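To make the idea concrete, here is a toy illustration using plain symlinks in a temporary directory. It is not git-annex itself (real annexed content lives under .git/annex/objects, keyed by content hash), and all file and directory names are made up for the demo. It only shows the principle: one shared copy of the large file, and a separate real copy of the small file that differs between versions.

```shell
#!/bin/sh
set -eu
demo=$(mktemp -d)
mkdir -p "$demo/store" "$demo/group1" "$demo/group2"

# A single copy of a "large" file lives in the shared store.
printf 'pretend this is large imaging data\n' > "$demo/store/sub-01_T1w.nii.gz"

# Both checkouts reference that one copy through a symlink.
ln -s "$demo/store/sub-01_T1w.nii.gz" "$demo/group1/sub-01_T1w.nii.gz"
ln -s "$demo/store/sub-01_T1w.nii.gz" "$demo/group2/sub-01_T1w.nii.gz"

# The small, version-specific file is a regular file in each checkout.
printf 'participant_id\nsub-01\n' > "$demo/group1/participants.tsv"
printf 'participant_id\nsub-01\nsub-02\n' > "$demo/group2/participants.tsv"

# Both symlinks resolve to the same stored file.
readlink "$demo/group1/sub-01_T1w.nii.gz"
readlink "$demo/group2/sub-01_T1w.nii.gz"
```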

PS An alternative that avoids the need for --reckless ephemeral altogether is to use a CoW (copy-on-write) filesystem, such as BTRFS. Then git-annex get, which uses cp --reflink=auto, would create a CoW “copy” of the file if the original and target locations are on the same CoW filesystem. That makes for a super-fast copy, since the underlying data blocks do not need to be copied at all. This would be preferable e.g. if ORIGINAL_LOCATION could be removed or damaged.
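A small sketch of what cp --reflink=auto does on its own, outside of git-annex. It assumes GNU coreutils cp (the --reflink option is not available in every cp implementation), and the file names are made up. On a CoW filesystem the copy shares data blocks with the original; elsewhere cp quietly falls back to a regular copy, so the script runs either way.

```shell
#!/bin/sh
set -eu
tmp=$(mktemp -d)

# Create a 1 MiB test file (hypothetical stand-in for a large data file).
dd if=/dev/zero of="$tmp/original.dat" bs=1024 count=1024 2>/dev/null

# Reflink when the filesystem supports it, otherwise do a normal copy.
cp --reflink=auto "$tmp/original.dat" "$tmp/copy.dat"

# Either way, the result is an independent regular file with identical content.
cmp "$tmp/original.dat" "$tmp/copy.dat" && echo "contents identical"
```

Unlike a symlink, the copy stays intact if the original is later deleted, which is why this route is safer when ORIGINAL_LOCATION might go away.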

Remi-Gau · August 2, 2023, 10:36pm

I think the hardest part of this is making sure that users have “some” understanding of what is going on. If they are not familiar with git branches, confusion may follow.

dmoracze · August 2, 2023, 11:39pm

Thank you for this!! We’re going to toy around with your solution for a while.

And yes, @Remi-Gau I am concerned about that. I want to make the barrier to entry to use our datasets as low and fool-proof as possible.

earl · August 4, 2023, 8:33pm

@yarikoptic Related question: I’m trying your suggested datalad technique out right now and there’s no --branch option that I see.

I’m on version 0.14.7-1~nd18.04+1 on an Ubuntu 18.04.6 LTS machine.

Is the --branch MY_PERSONAL_BRANCH strictly necessary for this technique? Or are you maybe thinking of a separate second command to be enacted on the git-annex files specifically?

mszczepanik · August 7, 2023, 9:25am

Hi @earl, let me jump in with an answer.

--branch would be passed directly to Git as a git clone option (with git clone call being part of the datalad clone operation), but you need at least DataLad 0.16 (changelog) for that.

In general, I see DataLad 0.14 as very old, and there have been many improvements and changes in behavior since. While packaging for Ubuntu 18.04 apparently stopped there, I would strongly recommend using other installation methods (e.g. conda, pip) to get a more recent DataLad version.
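A quick way to check whether an installed DataLad is new enough for --branch could look like the sketch below. The helper version_ge is a made-up name, it relies on GNU sort -V, and it assumes that datalad --version prints the version number as the last word of its output.

```shell
#!/bin/sh
# version_ge A B: succeeds when version A is greater than or equal to B.
# Relies on version sorting from GNU sort (-V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

required=0.16

if command -v datalad >/dev/null 2>&1; then
    # Assumption: the version is the last field of the --version output.
    installed=$(datalad --version 2>&1 | awk '{print $NF}')
    if version_ge "$installed" "$required"; then
        echo "DataLad $installed supports passing --branch to clone"
    else
        echo "DataLad $installed is older than $required; consider upgrading"
    fi
else
    echo "datalad not found on PATH"
fi
```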

The --branch MY_PERSONAL_BRANCH is not strictly necessary. As @yarikoptic suggested, if you use Git branches (or tags) to mark dataset versions, you could use --branch to check out the desired branch (or tag, probably) immediately after cloning. The same can be achieved with git switch / git checkout later, but adds a step.
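The equivalence of the two routes can be tried with plain git on a throwaway repository. Everything here (the tags v1 and v2, the directory names, the demo identity) is made up for illustration, and no annexed content is involved.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)

# Build a tiny "dataset" with two tagged versions.
git init -q "$work/original"
cd "$work/original"
git config user.name "Demo User"          # identity for this demo repo only
git config user.email "demo@example.com"

printf 'participant_id\nsub-01\n' > participants.tsv
git add participants.tsv
git commit -q -m "Initial release"
git tag v1

printf 'sub-02\n' >> participants.tsv
git commit -q -am "Add sub-02"
git tag v2

cd "$work"
# Route 1: pick the version at clone time.
git -c advice.detachedHead=false clone -q --branch v1 original group1_a
# Route 2: clone first, then check out the version (one extra step).
git clone -q original group1_b
git -C group1_b checkout -q v1
# A plain clone without further steps sees the newest version.
git clone -q original group2

wc -l group1_a/participants.tsv group1_b/participants.tsv group2/participants.tsv
```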

@dmoracze wrote: “No file redundancy to reduce disk use. If a file is the same between versions, a symlink is created, but if the file’s contents are different, a copy is created and stored in each version. For example, participants.tsv will change if new subjects are added.”

By the way, one comment about ephemeral reckless clones: this mode is called “reckless” for a reason - clones made from a source symlink directly to the source’s storage location. Whether you need that depends on how strictly you mean “No file redundancy”. Within each clone (including the one which you may designate as a persistent store for both groups to use), what you describe is already the default behaviour - all “annexed” files (files handled by git annex) use this symlinking mechanism.

The non-reckless way would be that users agree on a persistent store, from which every user makes their own clone, and use get / drop to move files (copies) in and out of their clone’s storage. Something like this is discussed here in the Handbook. --reckless ephemeral means we pretend that the clone has its storage containing a copy of the data content, while in fact it uses one from the source (useful in some setups, but potentially less protective for the data).
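A rough sketch of that non-reckless workflow, assuming a hypothetical store location, clone directory, and subject folder (sub-01). The function name work_on_subject is made up too. Each user clones from the store, fetches only what they need with get, and frees the space again with drop.

```shell
#!/bin/sh
# Hypothetical placeholders; substitute your own locations.
STORE="${STORE:-/data/shared/study-dataset}"   # the agreed persistent store
CLONE="${CLONE:-$HOME/my-study-clone}"         # this user's own clone
SUBJECT_DIR="${SUBJECT_DIR:-sub-01}"           # content to work on

# work_on_subject STORE CLONE PATH
work_on_subject() {
    datalad clone "$1" "$2" &&
    (
        cd "$2" &&
        datalad get "$3" &&    # copy the file content into this clone
        # ... run the analysis here ...
        datalad drop "$3"      # remove the local copy once done
    )
}

if command -v datalad >/dev/null 2>&1 && [ -d "$STORE" ]; then
    work_on_subject "$STORE" "$CLONE" "$SUBJECT_DIR"
else
    echo "Skipping: datalad or $STORE is not available here." >&2
fi
```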

yarikoptic · August 7, 2023, 1:36pm

Did Earth already gain some curvature back when that version was out? I would recommend upgrading. Since there is no newer NeuroDebian version for 18.04, you would need to go the conda-forge route.

It is absolutely not necessary. You can datalad clone and then git checkout whatever commit/branch you like. If you already have that git-annex’ed content present locally in the original repository, there is nothing else to do.