I originally posted this on StackOverflow but figure the audience here would have more experience with datalad.
I’ve started using datalad, a wrapper for git annex, to version control data and experiments in my lab. It works great except the .git folder can silently grow enormous, especially when going back and forth in git history to repeat certain steps. For example, sometimes I make a commit, realize I need to fix something, and roll it back with git reset HEAD~, then make additional commits from there. This orphans the commit that was formerly the HEAD, so it doesn’t appear in git log, but all its associated files are still in the annex, and if you have the commit sha you can still git show it. How can I delete these orphaned commits permanently so they and their associated files aren’t taking up disk space? I tried git gc --prune=now --aggressive, but that seemingly did nothing.
For example:
datalad create test
cd test
# create new branch
git branch tmp
git checkout tmp
# build up a git history to play with
echo a > f
datalad save -m a
datalad run -i . -o . bash -c "echo aa > f"
datalad run -i . -o . bash -c "echo aaa > f"
# cat all annexed files (where symlinks point)
find .git/annex/objects -type f | xargs -I{} cat {}
# prints out:
# a
# aaa
# aa
# remove last 2 commits
git reset --hard HEAD~2
# make another commit from 2 commits ago
datalad run -i . -o . bash -c "echo b > f"
# print out git annex'd files again
find .git/annex/objects -type f | xargs -I{} cat {}
# should print
# a
# aaa
# b
# aa
# everything is still there, despite the git reset --hard
git checkout master
git branch -D tmp
git gc --prune=now --aggressive
# check what's there again
find .git/annex/objects -type f | xargs -I{} cat {}
# everything is still in the annex, even after deleting the branch and running git gc!
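One likely reason git gc appears to do nothing to the orphaned commits themselves: commits dropped by git reset are still referenced from the reflog, and git gc keeps anything the reflog points at. The sketch below uses plain git only, in a throwaway repository (the temporary directory and the user.name/user.email values are placeholders, not part of the setup above). It does not touch the annex: files under .git/annex/objects are managed by git-annex rather than stored as git objects, so git gc would not be expected to remove them either way.

```shell
#!/bin/sh
# Sketch: a commit dropped by `git reset` survives `git gc` until the
# reflog entries that point at it are expired. Plain git, no datalad.
set -e
repo=$(mktemp -d)                 # throwaway repository (placeholder path)
cd "$repo"
git init -q .
git config user.email "you@example.com"   # placeholder identity
git config user.name "Example"

echo a > f && git add f && git commit -q -m a
echo aa > f && git commit -q -am aa
orphan=$(git rev-parse HEAD)      # remember the commit we are about to drop
git reset -q --hard HEAD~1

git gc -q --prune=now
# The reflog still references the dropped commit, so gc keeps it.
git cat-file -e "$orphan" && echo "after gc: still present"

# Expire all reflog entries, then collect garbage again.
git reflog expire --expire=now --all
git gc -q --prune=now
if git cat-file -e "$orphan" 2>/dev/null; then
    echo "after reflog expire + gc: still present"
else
    echo "after reflog expire + gc: gone"
fi
```

Expiring the reflog throws away the safety net for every branch, so this is only worth doing once nothing in the reflog needs to be recovered.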
The best solution would probably be to only use datalad once everything’s tested, so the problem doesn’t arise. But once I’ve orphaned a commit, is starting over the only way to get a minimal .git folder? I figure I could iterate over all my symlinks, see which annex objects none of them point to, and delete those, but manually messing with the .git folder seems like a bad idea.
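The symlink idea can be sketched as a read-only listing, which at least shows what would be affected without deleting anything. The function name list_unlinked_annex_objects is made up for illustration, and the sketch assumes the usual layout where an annexed file is a symlink whose target ends in the key name. It only looks at the current checkout, so objects used by other branches, tags, or unlocked files would be listed as well. git-annex has its own commands for this (git annex unused and git annex dropunused), which are shown as comments and are probably the safer route.

```shell
# Sketch (hypothetical helper): print annex object files that no symlink in
# the current worktree points to. Lists only; never deletes anything.
list_unlinked_annex_objects() {
    repo=$1
    linked=$(mktemp)
    # Collect the keys that worktree symlinks point at (basename of target).
    find "$repo" -path "$repo/.git" -prune -o -type l -print |
        while IFS= read -r link; do
            basename "$(readlink "$link")"
        done | sort -u > "$linked"
    # Print every object file whose key is not in that list.
    find "$repo/.git/annex/objects" -type f |
        while IFS= read -r obj; do
            grep -qxF -e "$(basename "$obj")" "$linked" || printf '%s\n' "$obj"
        done
    rm -f "$linked"
}

# Built-in alternative, which checks branches and tags rather than
# only the current checkout:
#   git annex unused            # numbers the keys nothing refers to
#   git annex dropunused all    # drops them; may need --force if no other copy exists
```

Running list_unlinked_annex_objects . from the top of the dataset prints one path per unlinked object.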