Data Versioning (git and git-annex)



I would like to use git, git-annex, and potentially datalad, but we keep our data on a server that we connect to over the CIFS/SMB protocol, and multiple operating systems (Linux, OS X, and Windows) all have access and write files to it. git-annex requires symbolic links to be understood, but I can’t think of a configuration that would allow those symbolic links to be understood across operating systems. The other option was to use git-annex direct mode, but that removes features from git that I would like to use. Does git (and not git-annex) work well enough for you guys? Do you only store locally or in the cloud? What sort of strategies are out there for data versioning when you’re actively collecting data?



I use datalad in a variety of scenarios. This includes collaboration on the same dataset on a shared network mount (NFS in my case). However, when not everybody involved is very disciplined, that can lead to some friction: things change that shouldn’t have, etc.

I personally find that going the Git path all the way, i.e. having a clone of the repos that I need with the data that I want, is often best. This even works in a “push” case, where I contribute to a data collection effort. I have a practically empty clone of a datalad dataset, I collect new data, add it to the dataset, and push it to a shared storage server. Such a server could be an SSH-accessible machine, or be as stupid as a WebDAV account. Git-annex special remotes offer quite a range of possibilities. Here is a demo for the WebDAV case:
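
What follows is only a rough sketch of that collect-and-push cycle, not the original demo. It is a small Python helper that builds the command lines and prints them as a dry run, so nothing is executed. The remote name `mywebdav`, the URL, and the path `sub-01/` are made-up placeholders, and I am assuming an unencrypted remote (`encryption=none`) to keep it short. git-annex reads the WebDAV credentials from the `WEBDAV_USERNAME` and `WEBDAV_PASSWORD` environment variables.

```python
import shlex


def webdav_push_commands(remote, url, paths, message="Add newly collected data"):
    """Build the command lines for a WebDAV-backed collect-and-push cycle.

    Returns a list of argv lists: one one-time setup command followed by
    the two commands that are repeated whenever new data comes in.
    """
    setup = [
        # One-time: register the WebDAV share as a git-annex special remote.
        # Assumption: no encryption, to keep the sketch short.
        ["git", "annex", "initremote", remote,
         "type=webdav", f"url={url}", "encryption=none"],
    ]
    cycle = [
        # Record the newly collected files in the dataset.
        ["datalad", "save", "-m", message, *paths],
        # Upload the annexed file content to the special remote.
        ["git", "annex", "copy", "--to", remote, *paths],
    ]
    return setup + cycle


if __name__ == "__main__":
    # Hypothetical values; replace them with your own remote name, URL and paths.
    for cmd in webdav_push_commands(
        "mywebdav", "https://example.com/dav/study", ["sub-01/"]
    ):
        print(shlex.join(cmd))  # dry run: print only, execute nothing
```

Run the printed commands by hand inside your clone of the dataset. Note that `git annex copy` only moves the annexed file content; the Git history itself would typically still be pushed to an ordinary Git remote so that collaborators can see which files exist.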