Datalad underpinning model flawed?

robbie.morrison · February 14, 2020, 9:52am

During the FOSDEM’20 presentation on Datalad by Michael Hanke (available here), I mentioned that git (and GitHub) might not be a great strategy for maintaining versioned datasets. And indeed that remark extends to any source code version control system.

The reasoning is that code versioning is underpinned by one branch of graph theory, whereas data versioning is underpinned by category theory. Two different fields. I learned this from Martin Glauer, Otto von Guericke University Magdeburg; see also Glauer and Mossakowski (2017). Indeed, when I talked to Martin and one other computer scientist, both of whom have worked on the issue, they were skeptical that git would be suitable for managing volatile structured data.
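
To make the concern a little more concrete, here is a small illustration of my own (it is not taken from Glauer and Mossakowski, and it is only one possible reading of the mismatch). The two CSV snippets are made-up data holding the same records in a different row order. A line-oriented text diff, which is how git shows changes to text files by default, reports changes, while a comparison of the records themselves finds none. The names `VERSION_A`, `VERSION_B`, `line_changes` and `as_records` are hypothetical.

```python
import csv
import difflib
import io

# Hypothetical data: the same three records, stored in a different row order.
VERSION_A = "id,name\n1,alpha\n2,beta\n3,gamma\n"
VERSION_B = "id,name\n3,gamma\n1,alpha\n2,beta\n"


def line_changes(old: str, new: str) -> list[str]:
    """Return the added/removed lines that a line-oriented text diff reports."""
    diff = difflib.unified_diff(old.splitlines(), new.splitlines(), lineterm="", n=0)
    return [
        line
        for line in diff
        # Keep "+"/"-" change lines, drop the "+++"/"---" file headers.
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


def as_records(text: str) -> frozenset:
    """Read CSV text as an unordered set of records (column name, value pairs)."""
    reader = csv.DictReader(io.StringIO(text))
    return frozenset(tuple(sorted(row.items())) for row in reader)


textual = line_changes(VERSION_A, VERSION_B)
same_data = as_records(VERSION_A) == as_records(VERSION_B)

print("text diff reports changes:", len(textual) > 0)
print("records are identical:", same_data)
```

Treating a table as an unordered set of rows is itself an assumption about the data model. That is rather the point: whether two versions are "the same" depends on the structure the data is meant to have, which a text diff knows nothing about.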

I am neither a mathematician nor a theoretical computer scientist, so I cannot offer any advice, except to flag this issue as significant.

But I will say that I have coded up concepts (for nonconvex electricity market designs), only to find later that, although the code happily ran and produced answers, those answers were technically meaningless.

References

Glauer, Martin and Till Mossakowski (2017). Institutions for database schemas and datasets. Pages 18–20. (A collection of summaries, so you need to skip forward to page 18.)