Update GitHub repository for dataset using DataLad

Pinging @yarikoptic @oesteban

Hello,

We are currently using datalad/git-annex to associate some GitHub repos with large neuroimaging datasets. These repos are tied to data currently hosted on OSF. After updating the datasets on OSF, is there a way that datalad can facilitate updating the git-annex pointers in the repos on GitHub?

Thanks in advance for any help!

Rastko Ciric


Please remind me (and disclose to the public) which GitHub repositories we are talking about here, and perhaps how the URLs to the files hosted on OSF were originally added.

Hi,

We are talking about:



The URLs were added using datalad addurls with a CSV file.

So if you have a CSV for the current state, you could try addurls again, I guess, with the additional option --ifexists overwrite. Or are you looking for an easier way? :wink:
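For illustration, here is a minimal sketch of that suggestion in Python. It only writes a CSV and assembles the `datalad addurls` argument list; the column names `url` and `filename`, the helper names, and the file name `files.csv` are assumptions made for this example and are not taken from the repositories discussed in this thread.

```python
import csv
from pathlib import Path


def write_url_table(rows, csv_path):
    """Write (url, filename) pairs to a CSV file with a header row.

    The column names 'url' and 'filename' are an assumption of this sketch.
    """
    with open(csv_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["url", "filename"])
        writer.writeheader()
        for url, filename in rows:
            writer.writerow({"url": url, "filename": filename})
    return Path(csv_path)


def build_addurls_command(csv_path, dataset="."):
    """Return the argument list for re-running addurls over existing files."""
    return [
        "datalad", "addurls",
        "-d", str(dataset),
        "--ifexists", "overwrite",  # replace files that are already present
        str(csv_path),
        "{url}",       # URL format, filled in from the 'url' column
        "{filename}",  # file name format, filled in from the 'filename' column
    ]
```

The resulting list could then be handed to `subprocess.run(build_addurls_command("files.csv"), check=True)` from inside the cloned dataset, assuming DataLad is installed and on the PATH.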

Thanks @yarikoptic !

We’ve now set up a simple script that builds a csv from OSF metadata and then calls datalad to update the dataset (see update_dataset and addurls_from_csv here). My workflow is:
(1) Clone the dataset that we want to update from GitHub. (At this point, it’s in a few cases necessary to re-run datalad create.)
(2) Remove any obsolete files using datalad remove.
(3) Import the utility functions (same link as above, can’t post multiple links) and run update_dataset (e.g., update_dataset(url, name='tpl-NKI')).
(4) Push the changes back to GitHub.

At this point, the local repo works fine if, for instance, I call datalad get. However, if I clone the remote on GitHub and then call the same datalad get command, I instead receive the error:

>>> datalad get tpl-NKI_res-01_label-brainNoCerebellum_probseg.nii.gz
[WARNING] Running get resulted in stderr output: git-annex: get: 1 failed
 
[ERROR  ] not available; No other repository is known to contain the file.; (Note that these git remotes have annex-ignore set: origin) [get(/Users/rastko/Downloads/tpl-NKI/tpl-NKI/tpl-NKI_res-01_label-brainNoCerebellum_probseg.nii.gz)] 
get(error): /Users/rastko/Downloads/tpl-NKI/tpl-NKI/tpl-NKI_res-01_label-brainNoCerebellum_probseg.nii.gz (file) [not available; No other repository is known to contain the file.; (Note that these git remotes have annex-ignore set: origin)]

I was hoping you might be able to advise as to where our workflow is incorrect – perhaps I’m losing the sibling somehow? I would be happy to provide any additional information if it would be helpful in any way (e.g., reproducing error). Here’s an example of what the resulting dataset looks like on GitHub.

Thanks in advance for any help!

Did you use datalad publish to push to GitHub?
It seems that the git-annex branch wasn’t pushed. You could do it manually if you like (git push mygithubremote git-annex), or datalad publish would do that for you.
The git-annex branch is where git-annex stores availability information. If it is not pushed, no one would know where to obtain those files from :wink: and that is what you observe.
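As a rough way to check for this situation before cloning, the sketch below inspects the text printed by `git ls-remote --heads <remote>` and reports whether a `git-annex` branch is listed. The helper names and the remote name `origin` are assumptions for illustration. Newer DataLad releases also provide `datalad push`, which may be preferable to `datalad publish` there.

```python
def remote_has_annex_branch(ls_remote_output):
    """Return True if `git ls-remote --heads` output lists a git-annex branch."""
    for line in ls_remote_output.splitlines():
        parts = line.split()
        # Each line has the form "<sha>\trefs/heads/<branch>".
        if len(parts) == 2 and parts[1] == "refs/heads/git-annex":
            return True
    return False


def build_annex_push_command(remote="origin"):
    """Return the argument list that pushes only the git-annex branch."""
    return ["git", "push", remote, "git-annex"]
```

In practice the input string would come from something like `subprocess.run(["git", "ls-remote", "--heads", "origin"], capture_output=True, text=True).stdout`. If the function returns False, running the command from `build_annex_push_command` in the local clone publishes the availability information.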


Thank you @yarikoptic – pushing the git-annex branch seems to have done the trick!