@yarikoptic, before we write our own tools or come up with our own solution, I want to see whether our needs could be covered by Datalad.
We manage datasets of all sizes (mostly raw data, but occasionally derivatives) that multiple groups interact with on our HPC. Occasionally these datasets need to be updated as new data arrives. The scenario: Group1 begins a study on a dataset and understandably wants to keep accessing it as it existed when they began their analysis. Group2 wants a newer version of the same dataset, so we download the new data and integrate it into the dataset. We want to make both groups happy.
We are looking for a solution where:
- Any user can access any version of the dataset.
- Multiple people can access different versions at the same time.
- No file redundancy, to reduce disk use. If a file is identical between versions, a symlink is created instead of a second copy; if the file’s contents differ, a copy is created and stored in each version. For example, participants.tsv will change if new subjects are added.
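
To make the last point concrete, here is a rough Python sketch of the behaviour we have in mind if we had to build it ourselves. It is only an illustration of the requirement, not a description of how Datalad works; the function name `build_version` and the directory layout (`incoming`, `previous`, `target`) are hypothetical.

```python
import filecmp
import os
import shutil
from pathlib import Path


def build_version(incoming: Path, previous: Path, target: Path) -> None:
    """Populate `target` from `incoming`, reusing unchanged files from `previous`.

    Hypothetical helper: all three arguments are plain directories.
    """
    for src in sorted(incoming.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(incoming)
        dst = target / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        old = previous / rel
        # Byte-for-byte comparison (shallow=False), not just size and mtime
        if old.is_file() and filecmp.cmp(old, src, shallow=False):
            # Unchanged: link to the real file behind the previous version,
            # so links never chain through several versions
            os.symlink(old.resolve(), dst)
        else:
            # New or modified (e.g. participants.tsv): keep a separate copy
            shutil.copy2(src, dst)
```

Calling `build_version(Path("incoming"), Path("v1"), Path("v2"))` would leave `v1` untouched for Group1 while giving Group2 a `v2` directory in which only new or changed files take up additional disk space.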
Is this something Datalad can do? So far I’ve only used it in setups where a dataset exists as a single version at a time on the filesystem.
Thank you!