Datalad push updated file back to 'addurl' URL

larc909a · August 16, 2021, 6:17pm

Hello there, I am currently using datalad to version my data stored on S3. I register the files with the datalad addurl command, so the S3 URLs end up in a GitHub repo. If I clone the repo on another machine, retrieve file contents with datalad get, and then update a file, how can I push it back to the same S3 URL it was retrieved from? If I use datalad push --to=s3.remote, the updated file is uploaded to S3, but under an MD5 hash name, so there are now two copies of the file: (1) the original file on S3 and (2) the newly updated file on S3 under the hash name.

Thank you for the assistance.

yarikoptic · August 16, 2021, 6:55pm

This is a tricky case, because you are “mixing up” two modes in which git-annex can be used: one just references remote URLs (which can point to arbitrary locations on S3), and the other uses S3 as a “git-annex special remote”.
In the latter case, as you have mentioned, git-annex just uploads annexed files (keys) and has no notion that the same file might also be available elsewhere in possibly the same bucket. If you used only that mode (S3 special remote), the remote would accumulate all content you ever pushed to it, not only the “most recent versions” (so you could still git checkout some prior commit and get its content). If you want to remove some or all older versions, git annex unused and git annex dropunused --from s3.remote would be your friends for keeping only the recent content (or content referenced by git tags) on the remote.
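A minimal sketch of that cleanup, assuming the special remote is named s3.remote as in the question. The run wrapper and the DRY_RUN variable are hypothetical helpers added here so that the commands are only printed by default; set DRY_RUN=0 to actually execute them.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"
REMOTE="${REMOTE:-s3.remote}"   # remote name taken from the question

# Hypothetical helper: print the command in dry-run mode, run it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# List keys on the remote that no branch or tag refers to any more.
run git annex unused --from "$REMOTE"

# Drop everything in that list from the remote
# ("all" can be replaced by numbers or ranges from the list).
run git annex dropunused --from "$REMOTE" all
```

Content still referenced by any branch or tag is not reported as unused, so tagging a version is one way to keep it on the remote.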
If you would like s3.remote to hold “readily usable” filenames (not just keys, which means content is duplicated when multiple files point to the same content), you could configure that remote with exporttree=yes and then use the git annex export command to upload whatever needs to be uploaded to it (unfortunately we do not yet support this operation at the datalad level; publish/push should do it automatically).
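A rough sketch of the exporttree route, under several assumptions: the remote name s3.export, the bucket name my-example-bucket, and the branch name master are placeholders, and the run wrapper with DRY_RUN is a hypothetical helper that only prints the commands by default. As far as I know, exporttree has to be set when the remote is created with initremote and needs encryption=none, so this sets up a new remote rather than reconfiguring s3.remote. S3 credentials are expected in AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"
REMOTE="${REMOTE:-s3.export}"            # hypothetical name for a new remote
BUCKET="${BUCKET:-my-example-bucket}"    # hypothetical bucket name
BRANCH="${BRANCH:-master}"               # assumed branch; adjust to yours

# Hypothetical helper: print the command in dry-run mode, run it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Create a special remote that stores files under their real names.
run git annex initremote "$REMOTE" type=S3 encryption=none \
    bucket="$BUCKET" exporttree=yes

# Upload the tree of the branch; rerun after new commits to update the bucket.
run git annex export "$BRANCH" --to "$REMOTE"
```

Note that an export remote mirrors one tree, so files changed or removed in the branch are changed or removed in the bucket on the next export.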
If you go this way, then instead of addurl you should have used git annex import (although I have not tried it!), which would have established the initial local hierarchy from S3. At this point, since you already have the content, I don’t know how it would behave if you tried, but since it is all git you can just create a throwaway clone and try it there, I guess.
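An untested sketch of what that experiment might look like, meant to be run from inside a throwaway clone. The remote name s3.import, the bucket name my-example-bucket, and the branch name master are placeholders, and the run wrapper with DRY_RUN is a hypothetical helper that only prints the commands by default. For S3, git-annex may also require versioning on the bucket or other settings; the S3 special remote documentation is the place to check.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
# Assumption: the current directory is a throwaway clone of the dataset.
DRY_RUN="${DRY_RUN:-1}"
REMOTE="${REMOTE:-s3.import}"            # hypothetical name for a new remote
BUCKET="${BUCKET:-my-example-bucket}"    # hypothetical bucket name
BRANCH="${BRANCH:-master}"               # assumed branch; adjust to yours

# Hypothetical helper: print the command in dry-run mode, run it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# A remote that can be both imported from and exported to.
run git annex initremote "$REMOTE" type=S3 encryption=none \
    bucket="$BUCKET" exporttree=yes importtree=yes

# Import the bucket's file tree into a remote tracking branch.
run git annex import "$BRANCH" --from "$REMOTE"

# Merge it; the first merge of an import may need --allow-unrelated-histories.
run git merge --allow-unrelated-histories "$REMOTE/$BRANCH"
```

How this interacts with files already registered through addurl is exactly the open question above, which is why a throwaway clone is the safe place to try it.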

Hope this helps, even if only somewhat.