Git annex whereis knows the url to a file, but datalad get cannot find it

watsone · April 2, 2024, 2:49pm

Summary of what happened:

I’ve been trying to set up a dataset that primarily lives on a web server, but needs to be clone-able by other people. The annex files are visible and downloadable from the server’s website. In particular, the files I’m concerned about here are in a subdataset.

I used datalad addurls to register the URL of each file on the server with the matching file in the annex. When I run git annex whereis filename, it reports two copies: one in the server’s local copy of the dataset, and one on the web, with a correct URL. In fact, if I open that URL in a browser, it downloads my file.
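For reference, here is a minimal sketch of how the per-file URLs can be derived from the paths inside the subdataset. It assumes every file is served under the same base URL as the one shown in the whereis output further down; the function names and the "filename"/"url" column names are hypothetical, not something datalad requires.

```python
from urllib.parse import quote

# Base URL copied from the whereis output in this post.
BASE_URL = "https://3dviewer.sites.carleton.edu/carcas/carcas-models/"


def file_url(relative_path, base_url=BASE_URL):
    """Return the web URL for a path relative to the subdataset root."""
    # quote() percent-encodes spaces as %20 and leaves "/" alone by default.
    return base_url + quote(relative_path)


def addurls_rows(relative_paths):
    """Build one row per file; the column names here are hypothetical."""
    return [{"filename": p, "url": file_url(p)} for p in relative_paths]


rows = addurls_rows(["models/Alpaca 3rd Carpal L.glb", "models/goat_mm.glb"])
for row in rows:
    print(row["filename"], "->", row["url"])
```

Rows like these could be written to a CSV file and passed to datalad addurls, with '{url}' and '{filename}' as the URL and filename format strings, if the table uses those column names.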

The dataset lives on GitHub, but the annex does not. When I make a clone of the superdataset on my personal computer, I get messages like this:

[INFO   ] Unable to parse git config from origin                                                                                       
[INFO   ] Remote origin does not have git-annex installed; setting annex-ignore                                                        
|   This could be a problem with the git-annex installation on the remote. Please make sure that git-annex-shell is available in PATH when you ssh into the remote. Once you have fixed the git-annex installation, run: git annex enableremote origin 
install(ok): /home/erin/Documents/DHA/carcas (dataset)

Then, when I run datalad get carcas-models/ (where carcas-models is the name of the subdataset that has my large files in the annex), I get this error message:

[INFO   ] Unable to parse git config from origin                                                                                       
[INFO   ] Remote origin does not have git-annex installed; setting annex-ignore                                                        
[INFO   ] This could be a problem with the git-annex installation on the remote. Please make sure that git-annex-shell is available in PATH when you ssh into the remote. Once you have fixed the git-annex installation, run: git annex enableremote origin 
[INFO   ] access to 1 dataset sibling serverweb not auto-enabled, enable with:
|               datalad siblings -d "/home/erin/Documents/DHA/carcas/carcas-models" enable -s serverweb 
install(ok): /home/erin/Documents/DHA/carcas/carcas-models (dataset) [Installed subdataset in order to get /home/erin/Documents/DHA/carcas/carcas-models]
get(error): carcas-models/models/Alpaca 3rd Carpal L.glb (file) [no known url                                                          
no known url
no known url]
get(error): carcas-models/models/Alpaca 4th Carpal L.glb (file) [no known url
no known url
no known url]
get(error): carcas-models/models/Alpaca Cranium.glb (file) [no known url
no known url
no known url]
get(error): carcas-models/models/Alpaca Mandible.glb (file) [no known url
no known url
no known url]
get(error): carcas-models/models/goat_mm.glb (file) [no known url
no known url
no known url]
action summary:
  get (error: 5)
  install (ok: 1)

I’m stuck on how to debug, because when I run git annex whereis models/Alpaca\ 3rd\ Carpal\ L.glb, everything looks correct:

whereis models/Alpaca 3rd Carpal L.glb (2 copies) 
        00000000-0000-0000-0000-000000000001 -- web
        095e299d-037e-4172-87e0-bbd7183a6613 -- CARCAS models on the 3dviewers server [here]

  web: https://3dviewer.sites.carleton.edu/carcas/carcas-models/models/Alpaca%203rd%20Carpal%20L.glb
ok
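To compare what the server's copy and a fresh clone each know about a file, the whereis output can be pulled apart with a few lines of Python. This is only a rough sketch that assumes the plain-text layout shown above; parse_whereis is a hypothetical helper, and the --json flag of git annex whereis would likely be a sturdier input for real use.

```python
def parse_whereis(text):
    """Split plain-text `git annex whereis` output into remotes and web URLs."""
    remotes = []
    web_urls = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("web: "):
            # URL lines look like "web: https://..."
            web_urls.append(stripped[len("web: "):])
        elif " -- " in stripped and not stripped.startswith("whereis"):
            # Remote lines look like "<uuid> -- <description>"
            uuid, description = stripped.split(" -- ", 1)
            remotes.append((uuid, description))
    return {"remotes": remotes, "web_urls": web_urls}


# Sample copied from the output in this post.
SAMPLE = """whereis models/Alpaca 3rd Carpal L.glb (2 copies)
        00000000-0000-0000-0000-000000000001 -- web
        095e299d-037e-4172-87e0-bbd7183a6613 -- CARCAS models on the 3dviewers server [here]

  web: https://3dviewer.sites.carleton.edu/carcas/carcas-models/models/Alpaca%203rd%20Carpal%20L.glb
ok
"""

info = parse_whereis(SAMPLE)
print(info["remotes"])
print(info["web_urls"])
```

Running the same check on the output from the clone would show whether the web remote and its URL made it across. An empty list of web URLs there would be consistent with the "no known url" errors.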

Why can’t datalad get find the models? How do I set things up properly so that people with clones from Github can download the models using datalad get, pulling from the URL?

Screenshots / relevant information:

My operating system is Fedora 39.
I’m using Python 3.11.8.
My DataLad version is 0.19.6.

watsone · May 7, 2024, 12:30am

For anyone with a similar question, the detailed debugging is happening over on DataLad’s GitHub instead of here:

https://github.com/datalad/datalad/issues/7582

Long story short, the problem is likely related to other configurations I tried out before settling on this one. The fastest solution, if you don’t care about the repository’s history, is to start over from scratch.