How to permanently delete a commit in git (annex)?

I originally posted this on StackOverflow but figure the audience here would have more experience with datalad.

I’ve started using datalad, a wrapper for git annex, to version control data and experiments in my lab. It works great, except that the .git folder can silently grow enormous, especially when going back and forth in git history to repeat certain steps. For example, sometimes I make a commit, realize I need to fix something, roll it back with git reset HEAD~, and then make additional commits from there. This orphans the commit that was formerly the HEAD, so it doesn’t appear in git log, but all its associated files are still in the annex, and if you have the commit sha you can still git show it. How can I delete these orphaned commits permanently so that they and their associated files aren’t taking up disk space? I tried git gc --prune=now --aggressive, but that seemingly did nothing.

For example:

datalad create test
cd test
# create new branch
git branch tmp
git checkout tmp
# build up a git history to play with
echo a > f
datalad save -m a
datalad run -i . -o . bash -c "echo aa > f"
datalad run -i . -o . bash -c "echo aaa > f"
# cat all annexed files (where symlinks point)
find .git/annex/objects -type f | xargs -I{} cat {}
# prints out:
# a
# aaa
# aa
# remove last 2 commits
git reset --hard HEAD~2
# make another commit from 2 commits ago
datalad run -i . -o . bash -c "echo b > f"
# print out git annex'd files again
find .git/annex/objects -type f | xargs -I{} cat {}
# should print
# a
# aaa
# b
# aa
# everything is still there, despite the git reset --hard
git checkout master
git branch -D tmp
git gc --prune=now --aggressive
# check what's there again
find .git/annex/objects -type f | xargs -I{} cat {}
# everything is still in the annex, even after deleting the branch and running git gc!

The best solution would probably be to only use datalad once everything’s tested so the problem doesn’t arise, but once I’ve orphaned a commit, is starting over the only way to get a minimal .git folder? I figure I could iterate over all my symlinks, see which annex objects no symlink points to, and delete those, but manually messing with the .git folder seems like a bad idea.

git gc would take care of removing the commits from .git/objects, but annexed files under .git/annex/objects would indeed persist. For annexed files, you can use git annex unused to find annexed files which are no longer used in the refs you specify (so you could, e.g., drop data for intermediate steps between tagged “releases”) and then use git annex drop --unused.
Note that the git-annex branch would still keep a record of those files in its history. So if you do this thousands of times, it might not be a complete solution, and you may want to complement it with git annex forget to forget the history of the annex entirely.
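
For the example in the question, the steps above could be wrapped up roughly like this. It is only a sketch: prune_annex is a made-up helper name (not part of git-annex or datalad), and it assumes this repository holds the only copy of the unused content, which is why --force is passed to skip the copy-count check. Anything dropped this way is gone for good, so run git annex unused on its own first and check the list.

```shell
#!/bin/sh
# Sketch: drop annexed content that no branch or tag references any more.
# prune_annex is a hypothetical helper, not a git-annex or datalad command.
prune_annex() {
    repo="${1:-.}"
    # Refuse to run anywhere that does not look like a git-annex repository.
    if [ ! -d "$repo/.git/annex" ]; then
        echo "not a git-annex repository: $repo" >&2
        return 1
    fi
    (
        cd "$repo" || exit 1
        # List annexed keys that are not used by any branch or tag ...
        git annex unused &&
        # ... then drop them. --force is an assumption of this sketch: it is
        # needed when no other repository holds a copy of the content.
        git annex drop --unused --force
    )
}

# Usage, from inside the dataset after the branch has been deleted:
#   prune_annex .
# To also remove the orphaned commits from .git/objects, the reflog entries
# that still reference them have to expire first (this discards all reflogs):
#   git reflog expire --expire=now --all
#   git gc --prune=now
```

git annex forget is left out on purpose, since it rewrites the git-annex branch and is only worth it if that branch's history itself has become a problem.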
