GitHub + Datalad RIA as common data source

Hi Datalad team,

I am trying to set up a GitHub repository with ria storage as a common data source.
Based on chapter 1.4 of the Datalad handbook and some information from this post, I thought that this would be a straightforward process.

Somehow I feel that I am missing some important information, since I cannot set up the ria storage as a common data source.

This is the output of datalad siblings in my Datalad dataset:

.: here(+) [git]
.: nrec-dcm-box-ria-storage(+) [ora]
.: ms3(-) [ssh://p158@ms3.local:/p158/Imaging_Data_Repository.git (git)]
.: nrec-dcm-box-ria(-) [ssh://p158@dcm-box/home/p158/Imaging_Repository.git/9e0/b611a-93ba-4c54-8d49-f65ad3ecaff3 (git)]
.: nor-github(-) [https://github.com/p158/Imaging_Repository.git (git)]
.: nrec-dcm-box(+) [ssh://p158@dcm-box/home/p158/backup/Imaging_Data_Repository.git (git)]

The attempt to configure the common data source

datalad siblings configure -s nrec-dcm-box-ria --as-common-datasrc nrec-dcm-box-storage

gives the following error message:

configure-sibling(impossible): . (sibling) [cannot configure as a common data source, URL protocol is not http or https]

So it appears that some URL is missing. I just can’t figure out on which sibling and how I would add the URL.

I hope that the supplied information is sufficient to give you an idea of the situation. Any help is very much appreciated.

Thanks in advance.

The problem was solved by recreating the whole dataset from scratch.

The issue might have been caused by my multiple attempts to reconfigure the dataset.

Generally speaking, do you have any advice to debug the underlying git and git-annex configuration in a Datalad dataset when such things happen?
I’ve checked .git/config, but it looked fine.
Thanks

Hi @landge

From what I can see in the error message, the direct cause of the error was that the sibling you wanted to use for configuring a common data source (nrec-dcm-box-ria) had an ssh url, not http(s). I suppose datalad siblings only accepts http(s) urls on the presumption that http(s) may be accessible (to those who clone the dataset later) without password.

Note that “common data source” means creating an autoenabled special remote of type git that shares a URL with the specified sibling.

There is a chance you don’t need that, if you only care about the ria store providing annexed contents. The create-sibling-ria command creates, by default, two siblings: nrec-dcm-box-ria (regular git remote) and nrec-dcm-box-ria-storage (git-annex special remote). The latter should be autoenabled (at least that happens for me when I try; if not, you can probably use git annex configremote ... autoenable=true), meaning that a clone made from GitHub should try to enable it automatically and hence have access (assuming, obviously, that you push to GitHub after creating the ria sibling).

If you do need the git part of the ria store (i.e. not the storage part) to be enabled in clones by default: there are ideas for making that easier, but no default method so far. I suppose (I never tried it in practice though) you could do what --as-common-datasrc would do, and create the type git special remote with git annex initremote nrec-dcm-box-ria-gitremote type=git location=ssh://... autoenable=true (note that I’m avoiding the “-storage” naming as we have that part already covered). Or, if you don’t need it to be automatic, you may find it easiest to run git remote add <name> ssh://... after cloning.
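
To make the initremote idea a bit more concrete, here is a minimal, untested sketch. The function name init_git_special_remote and the RUN switch are hypothetical helpers made up for illustration; only the git annex initremote call itself comes from the suggestion above. By default it just prints the command so you can review it first.

```shell
# Hypothetical helper: prints the initremote call, or runs it when RUN=1.
# $1 = name of the new special remote (e.g. nrec-dcm-box-ria-gitremote)
# $2 = ssh:// URL of the git part of the ria store (copy it from `datalad siblings`)
init_git_special_remote() {
  name=$1
  url=$2
  # Assemble the command as positional parameters so quoting stays intact.
  set -- git annex initremote "$name" type=git "location=$url" autoenable=true
  if [ "${RUN:-0}" = 1 ]; then
    "$@"                  # actually register the special remote
  else
    printf '%s\n' "$*"    # dry run: only show what would be executed
  fi
}
```

Calling it without RUN=1 inside the dataset only shows the command. After a real run you would still need to push the git-annex branch to GitHub so that new clones learn about the remote.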

You may find this recent thread about similar setups, but for nested datasets, interesting, too.

Generally speaking, do you have any advice to debug the underlying git and git-annex configuration in a Datalad dataset when such things happen?
I’ve checked .git/config, but it looked fine.

I think a “pro tip” would be to also check the files on the git-annex branch, especially remote.log and uuid.log (see git-annex internals). You can look them up with git cat-file -p git-annex:remote.log.
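
If you want to look at both files in one go, something like the following could work. It is only a sketch: show_annex_logs is a hypothetical name, and the only real command in it is the git cat-file call mentioned above.

```shell
# Hypothetical helper: print bookkeeping files from the git-annex branch.
# Run it inside the dataset. Without arguments it shows remote.log and uuid.log.
show_annex_logs() {
  [ "$#" -gt 0 ] || set -- remote.log uuid.log
  for f in "$@"; do
    printf '== %s ==\n' "$f"
    # Read the file straight from the git-annex branch without checking it out.
    git cat-file -p "git-annex:$f"
  done
}
```

Comparing the special remote entries in remote.log with what datalad siblings reports is usually a quick way to spot a remote that was initialized more than once or with unexpected parameters.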

As a side note: the ssh urls you are using have the form ssh://user@machine/path. This is perfectly fine if you use the autoenable on your own, but if you want the autoenable to work out of the box for other users, you may want to remove the user part from the urls and instead use ~/.ssh/config to configure the user name for the given machine.
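
A small sketch of what that could look like, assuming the host and user names from your siblings listing (dcm-box, p158). The strip_user function is a hypothetical helper that only rewrites a plain ssh:// URL string; it does not touch any git or git-annex configuration by itself.

```shell
# Each user would put their own user name into ~/.ssh/config, e.g.:
#
#   Host dcm-box
#       User p158
#
# Hypothetical helper: drop the "user@" part from a plain ssh:// URL.
strip_user() {
  printf '%s\n' "$1" | sed -E 's#^(ssh://)[^@/]+@#\1#'
}

# Possible usage for a regular git remote (run inside the dataset):
#   git remote set-url nrec-dcm-box "$(strip_user "$(git remote get-url nrec-dcm-box)")"
```

URLs without a user part are returned unchanged, so running it twice does no harm.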

Thank you for providing this background information. It makes it easier to understand what’s going on.

You are totally right.

When I recreated the dataset, I created the ria/ria-storage first with create-sibling-ria and the GitHub repository last.

Everything worked out of the box.

Thank you for the important comment concerning the user part in the ssh url.

I will change the URLs accordingly.