Data Versioning (git and git-annex)

jdkent · January 19, 2018, 3:46am

I would like to use git, git-annex, and potentially datalad, but we keep our data on a server that we connect to over the CIFS/SMB protocol, and multiple operating systems (Linux, OS X, and Windows) all have access and write files to it. git-annex requires symbolic links to be understood, but I can’t think of a configuration that would allow those symbolic links to be understood across operating systems. The other option was to use git-annex direct mode, but that removes features from git that I would like to use. Does git (and not git-annex) work well enough for you guys? Do you only store locally or in the cloud? What sort of strategies are out there for data versioning when you’re actively collecting data?

Thanks!
James

eknahm · March 3, 2018, 3:21pm

I use datalad in a variety of scenarios. This includes collaboration on the same dataset on a shared network mount (NFS in my case). However, when not everybody involved is very disciplined, that can lead to some friction: things change that shouldn’t have, etc.

I personally find that going the Git path all the way, i.e. having my own clone of the repos that I need, with the data that I want, is often best. This even works in a “push” case, where I contribute to a data collection effort: I have a practically empty clone of a datalad dataset, I collect new data, add it to the dataset, and push it to a shared storage server. Such a server could be an SSH-accessible machine, or be as stupid as a WebDAV account. Git-annex special remotes offer quite a range of possibilities. Here is a demo for the WebDAV case:
http://datalad.org/for/data-publication
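
For readers who want to see the shape of that “push” cycle without following the link, here is a rough sketch in Python. It only builds and prints the command lines and does not run anything. The remote name `mydav`, the URL, and the commit message are made-up placeholders, and the helper function `push_workflow_commands` is hypothetical, not part of datalad or git-annex. The git-annex subcommands themselves (`initremote` with `type=webdav`, `add`, `copy --to`) are standard ones, but treat the exact options as an assumption to check against your own setup; `encryption=none` in particular is just the simplest choice for the sketch.

```python
import shlex


def push_workflow_commands(remote_name, webdav_url, message):
    """Return the git / git-annex commands for one collect-add-push cycle.

    Hypothetical helper for illustration only: it builds argument lists
    and never executes them.
    """
    return [
        # One-time setup: register the WebDAV account as a special remote.
        # git-annex reads the credentials from the WEBDAV_USERNAME and
        # WEBDAV_PASSWORD environment variables.
        ["git", "annex", "initremote", remote_name,
         "type=webdav", f"url={webdav_url}", "encryption=none"],
        # Put the newly collected files under git-annex control.
        ["git", "annex", "add", "."],
        # Record the new state in the local clone.
        ["git", "commit", "-m", message],
        # Upload the file content to the shared storage.
        ["git", "annex", "copy", "--to", remote_name, "."],
    ]


if __name__ == "__main__":
    # Placeholder values; replace with your own remote name and URL.
    for cmd in push_workflow_commands(
        "mydav", "https://example.com/dav/data", "Add session 01"
    ):
        print(shlex.join(cmd))
```

A special remote like this typically holds only the annexed file content, so the git history itself still has to be pushed to an ordinary git remote (for example the SSH-accessible machine mentioned above) for collaborators to see the new files.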