Configuring remotes on and pushing to Github & GIN

Hi Datalad team,

Here I am again with a second question!

I tried to configure GitHub and GIN as remotes for my superdataset and to push to both of them, carefully following the instructions in section 8.2 (particularly 8.2.5) of the handbook. Both steps worked fine.

However, GIN does not let me preview my annexed content (which should be possible according to the handbook); it gives me a 404 error when I click on any of my subdatasets.

GitHub will not even let me click on my subdatasets. That is expected for the annexed content, but they also contain subdirectories and files under Git control, which I would expect to be able to see and open on GitHub. Further, when I try to clone the repo from either GitHub or GIN, the clone itself works, but the subdatasets are completely empty; even the parts under Git control do not appear.

Here is my github repo: https://github.com/labgas/proj_discoverie

Thanks a lot in advance again!

Best wishes,

Lukas

Gin - most likely you didn’t transfer annexed content to it.
GitHub - check urls in .gitmodules . The repo you pointed to 404s for me (private?)

Thanks!

I followed the instructions in section 8.2, "Dataset hosting on GIN", of the DataLad Handbook, more specifically section 8.2.5.

I did notice in section 8.2.2 however that the output reads

    The authenticity of host 'gin.g-node.org (141.84.41.219)' can't be established.
    ECDSA key fingerprint is SHA256:E35RRG3bhoAm/WD+0dqKpFnxJ9+yi0uUiFLi+H/lkdU.
    Are you sure you want to continue connecting (yes/no)? yes
    [INFO   ] Failed to enable annex remote gin, could be a pure git or not accessible
    [WARNING] Failed to determine if gin carries annex.
    .: gin(-) [git@gin.g-node.org:/adswa/DataLad-101.git (git)]

I got the same output when executing my command, and wonder whether that could be the problem

Thanks,

Lukas

I removed both repos, but will push them back in a few days.

Any specific instructions as to how to transfer annexed content when pushing to GIN, in addition to the handbook instructions I followed (see my previous post)?

Thanks!

Just make sure that gin remote url doesn’t have .git at the end.

Thanks Yarik!

Here is my command and the output it generates (similar to the handbook)

u0027997@gbw-s-labgas01:/data/proj_discoverie$ datalad siblings add -d . --name gin-update --pushurl git@gin.g-node.org:/labgas/proj_discoverie.git --url https://gin.g-node.org/labgas/proj_discoverie --as-common-datasrc gin
Enter passphrase for key '/home/luna.kuleuven.be/u0027997/.ssh/id_ed25519':
[INFO   ] Could not enable annex remote gin-update. This is expected if gin-update is a pure Git remote, or happens if it is not accessible.
[WARNING] Could not detect whether gin-update carries an annex. If gin-update is a pure Git remote, this is expected. Remote was marked by annex as annex-ignore. Edit .git/config to reset if you think that was done by mistake due to absent connection etc
.: gin-update(-) [https://gin.g-node.org/labgas/proj_discoverie (git)]

Then, as per instructions, I ran the following command in my superdataset, which did not generate output

 git config --unset-all remote.gin-update.annex-ignore

Finally, I pushed using

datalad push --to gin-update

This works, but clicking on any of my subdatasets results in 404, even the code subdataset which should not have anything annexed!

Any help would be welcome again!

Best wishes,

Lukas

Contrary to earlier attempts, I do not manage to set up a sibling on Github anymore

u0027997@gbw-s-labgas01:/data/proj_discoverie$ datalad create-sibling-github -d . -s github --github-organization labgas proj_discoverie
[ERROR  ] InitError(Failed to create the collection: Prompt dismissed..) (InitError)

Any idea what could cause this error (which I did not have before with an identical command)?

It did work manually (creating an empty repo on Github), then using datalad siblings add …, and then datalad push.

This works (https://github.com/labgas/proj_discoverie), but my problem is the same as before: I cannot open any of the subdataset folders, not even code, which has everything stored in Git, so I can neither open nor link to any of the scripts in it.

Following your earlier suggestion, I checked the urls in .gitmodules, and they are basically all of the form ./<subdataset_name>. Is that expected?
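
For context: according to Git's documentation, a relative submodule URL such as ./BIDS is resolved against the URL of the superproject's own remote. The function below is a simplified sketch of that rule in plain shell, not Git's actual implementation; the name resolve_submodule_url is made up for illustration.

```shell
# Simplified illustration of how a relative submodule URL is combined
# with the remote URL of the superdataset. Hypothetical helper, not part
# of Git or DataLad.
# Usage: resolve_submodule_url SUPERDATASET_REMOTE_URL SUBMODULE_URL
resolve_submodule_url() {
    base=${1%/}   # remote URL of the superdataset, without a trailing slash
    rel=$2        # value of submodule.<name>.url from .gitmodules

    case $rel in
        ./*|../*) ;;                            # relative: resolve below
        *) printf '%s\n' "$rel"; return 0 ;;    # absolute: used unchanged
    esac

    while :; do
        case $rel in
            ../*) base=${base%/*}; rel=${rel#../} ;;  # go up one level
            ./*)  rel=${rel#./} ;;                    # stay at this level
            *)    break ;;
        esac
    done

    printf '%s\n' "$base/$rel"
}
```

For example, resolve_submodule_url https://github.com/labgas/proj_discoverie ./BIDS prints https://github.com/labgas/proj_discoverie/BIDS, a location that is only useful if the subdataset was actually published there. The URLs recorded in a dataset can be listed with git config -f .gitmodules --get-regexp '^submodule\..*\.url$'.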

Please note my default branch after pushing was git-annex, which I changed to master, but the problem was already present before this switch.

Thanks,

Lukas

Hi Yarik,

I tried to clone the dataset from both Github (which works, but all subdatasets carrying annex are empty after cloning) and from GIN (which generates an error).

Hence, it looks like something goes wrong with pushing my local dataset to Github and GIN (see also above posts).

I then deleted my GIN and Github repos and tried the new instructions in the handbook, first updating my datalad using

pip install git+git://github.com/datalad/datalad.git@master

which took me to datalad-0.15.1+62.g84804787c.
I am not sure how to get 0.16.0 at the moment, as

pip install --upgrade datalad

only upgraded to 0.15.1

However, all of the following attempts then yield authentication errors which I did not have before (when I provide no credentials, I am not even asked for them)

u0027997@gbw-s-labgas01:/data/proj_discoverie$ datalad create-sibling-github labgas/proj_discoverie -d . -s github --github-login <my_personal_access_token>
[WARNING] Cannot determine authorization token for githubloginarg
[ERROR  ] ValueError(Authorization required for GitHub, cannot find token for a credential githubloginarg.) (ValueError)
u0027997@gbw-s-labgas01:/data/proj_discoverie$ datalad create-sibling-github labgas/proj_discoverie -d . -s github
[WARNING] Cannot determine authorization token for api.github.com
[ERROR  ] ValueError(Authorization required for GitHub, cannot find token for a credential api.github.com.) (ValueError)
u0027997@gbw-s-labgas01:/data/proj_discoverie$ datalad create-sibling-github labgas/proj_discoverie -d . -s github --access-protocol ssh
[WARNING] Cannot determine authorization token for api.github.com
[ERROR  ] ValueError(Authorization required for GitHub, cannot find token for a credential api.github.com.) (ValueError)

I get the same error when I try a similar command with datalad create-sibling-gin.

When I try the manual option, using git remote add or datalad siblings add, it does work, but again my subdatasets are not accessible after pushing, not even the non-annexed content: https://github.com/labgas/proj_discoverie/tree/master

Interestingly, if I push my code subdataset to GitHub in the same manual way, I can access my non-annexed content: https://github.com/labgas/proj_discoverie_code/tree/master

Any help would be appreciated!

Thanks a lot,

Lukas

Finally, I tried to create siblings on osf in two ways.
First, I used the export mode, which allows me to push git content to osf using git push, and annexed content as well using git-annex export, but my subdatasets are not pushed when I push my superdataset.
Second, I used the annex mode, which works, but is not human-readable (as expected). However, its size is similar to that of the export mode repo, so I guess it also contains only the non-subdataset subdirectories of my superdataset. I then tried to set up a GitHub sibling with a publication dependency as in use case 3 of the datalad-osf documentation, but failed to create one using datalad create-sibling-github due to the authorization issues listed above.
I then fell back on the manual option, which works, but results in a GitHub repo identical to the one I created without the publication dependency (see above): https://github.com/labgas/proj_discoverie_osf/tree/master

Hence, it seems like in every scenario, the main problem is that content in subdatasets, whether annexed or not, does not get pushed, and I do not understand why.

I have now created separate siblings for a few of my subdatasets on GitHub, for example https://github.com/labgas/proj_discoverie_BIDS, and used their URLs in .gitmodules of the superdataset, which makes them browsable on GitHub and GIN. This seems like a complicated way of organizing things, though, so I guess there must be a way in which I do not need to create a GitHub repo for every subdataset?

Moreover, I cannot datalad get annexed contents since the github repo it tries to clone from does not support annex.

u0027997@gbw-s-labgas01:/data/test_datalad/proj_discoverie$ datalad get BIDS
[INFO   ] Remote origin not usable by git-annex; setting annex-ignore
[INFO   ] https://github.com/labgas/proj_discoverie_BIDS/config download failed: Not Found
install(ok): /data/test_datalad/proj_discoverie/BIDS (dataset) [Installed subdataset in order to get /data/test_datalad/proj_discoverie/BIDS]
get(error): BIDS/sub-KUL004/anat/sub-KUL004_T1w.nii.gz (file) [not available; (Note that these git remotes have annex-ignore set: origin)]
get(error): BIDS/sub-KUL004/fmap/sub-KUL004_run-01_magnitude.nii.gz (file) [not available; (Note that these git remotes have annex-ignore set: origin)]
get(error): BIDS/sub-KUL004/fmap/sub-KUL004_run-02_magnitude.nii.gz (file) [not available; (Note that these git remotes have annex-ignore set: origin)]
get(error): BIDS/sub-KUL004/func/sub-KUL004_task-MIST_run-01_bold.nii.gz (file) [not available; (Note that these git remotes have annex-ignore set: origin)]
get(error): BIDS/sub-KUL004/func/sub-KUL004_task-MIST_run-02_bold.nii.gz (file) [not available; (Note that these git remotes have annex-ignore set: origin)]
get(error): BIDS/sub-KUL004/func/sub-KUL004_task-MIST_run-03_bold.nii.gz (file) [not available; (Note that these git remotes have annex-ignore set: origin)]
get(error): BIDS/sub-KUL004/func/sub-KUL004_task-MIST_run-04_bold.nii.gz (file) [not available; (Note that these git remotes have annex-ignore set: origin)]
get(error): BIDS/sub-KUL004/func/sub-KUL004_task-rest_bold.nii.gz (file) [not available; (Note that these git remotes have annex-ignore set: origin)]
get(error): BIDS/sub-KUL005/anat/sub-KUL005_T1w.nii.gz (file) [not available; (Note that these git remotes have annex-ignore set: origin)]
get(error): BIDS/sub-KUL005/fmap/sub-KUL005_fieldmap.nii.gz (file) [not available; (Note that these git remotes have annex-ignore set: origin)]
  [31 similar messages have been suppressed; disable with datalad.ui.suppress-similar-results=off]
action summary:
  get (error: 41)
  install (ok: 1)

Thanks by the way for the nice new features including datalad-osf!

L

Just to make sure, you use -r, like in datalad push -r with annex mode setup, right?

AFAIK export mode isn’t supported by DataLad yet: https://github.com/datalad/datalad/issues/3127

Thanks Yarik!

Should that work without having to adapt the paths in .gitmodules, and without needing to configure separate siblings for each subdataset?

…and no -r option when creating the sibling for the superdataset?

If I do not have siblings in each subdataset, and try to push -r I get errors

u0027997@gbw-s-labgas01:/data/proj_discoverie$ datalad push --to gin -r
publish(error): BIDS (dataset) [Unknown target sibling 'gin'.]
publish(error): code (dataset) [Unknown target sibling 'gin'.]
publish(error): code_git (dataset) [Unknown target sibling 'gin'.]
publish(error): derivatives (dataset) [Unknown target sibling 'gin'.]
publish(error): mriqc (dataset) [Unknown target sibling 'gin'.]
publish(error): pipeline (dataset) [Unknown target sibling 'gin'.]
publish(error): sourcedata (dataset) [Unknown target sibling 'gin'.]

Another issue is that both git config --unset-all remote.gin.annex-ignore and manually editing .git/config do not seem to work to get rid of the annex-ignore = true - upon a new datalad siblings, the gin remote is again marked as not carrying annex.

L

Hi Yarik,

I found a workflow which is a bit convoluted with a repo on GIN for every subdataset and the superdatasets, which results in browsable repos on GIN, with downloadable files, but on Github

Moreover, cloning from either of them and then running datalad get does not work (except for non-annexed content), since the GIN remotes seem to be set to annex-ignore after cloning (unlike in my original dataset, which was pushed to GIN).

Could you please have a look at the errors and the workflow below?

I feel like I am close to making it work fully, but not there yet with the cloning!

L

u0027997@gbw-s-labgas01:/data/datalad_test$ datalad clone https://gin.g-node.org/labgas/proj_discoverie
Clone attempt:   0%| | 0.00/2.00 [00:00<?, ? Candidate locations/s]
Username for 'https://gin.g-node.org': lukasvo76
Password for 'https://lukasvo76@gin.g-node.org':
[INFO   ] Remote origin not usable by git-annex; setting annex-ignore
[INFO   ] https://gin.g-node.org/labgas/proj_discoverie/config download failed: Not Found
install(ok): /data/datalad_test/proj_discoverie (dataset)

Here are the siblings for the cloned dataset

u0027997@gbw-s-labgas01:/data/datalad_test/proj_discoverie$ datalad siblings
.: here(+) [git]
.: origin(-) [https://gin.g-node.org/labgas/proj_discoverie (git)]
.: osf-annex-storage(+) [osf]
.: osf-export-storage(+) [osf]
u0027997@gbw-s-labgas01:/data/datalad_test/proj_discoverie$ git config --unset-all remote.gin.annex-ignore
u0027997@gbw-s-labgas01:/data/datalad_test/proj_discoverie$ git config --unset-all remote.origin.annex-ignore
u0027997@gbw-s-labgas01:/data/datalad_test/proj_discoverie$ datalad siblings
.: here(+) [git]
.: osf-annex-storage(+) [osf]
[WARNING] Could not detect whether origin carries an annex. If origin is a pure Git remote, this is expected. Remote was marked by annex as annex-ignore. Edit .git/config to reset if you think that was done by mistake due to absent connection etc
.: origin(-) [https://gin.g-node.org/labgas/proj_discoverie (git)]
.: osf-export-storage(+) [osf]

compared to the original dataset

u0027997@gbw-s-labgas01:/data/proj_discoverie$ datalad siblings -r
.: here(+) [git]
.: github(-) [https://github.com/labgas/proj_discoverie.git (git)]
.: gin(+) [https://gin.g-node.org/labgas/proj_discoverie (git)]
BIDS: here(+) [git]
BIDS: gin(+) [https://gin.g-node.org/labgas/proj_discoverie_BIDS (git)]
code: here(+) [git]
code: github(-) [https://github.com/labgas/proj_discoverie_code.git (git)]
code: gin(+) [https://gin.g-node.org/labgas/proj_discoverie_code (git)]
derivatives: here(+) [git]
derivatives: gin(+) [https://gin.g-node.org/labgas/proj_discoverie_derivatives (git)]
mriqc: here(+) [git]
mriqc: gin(+) [https://gin.g-node.org/labgas/proj_discoverie_mriqc (git)]
pipeline: here(+) [git]
pipeline: gin(+) [https://gin.g-node.org/labgas/proj_discoverie_pipeline (git)]
pipeline: datalad(+) [datalad]
sourcedata: here(+) [git]

Hence datalad getting fails

u0027997@gbw-s-labgas01:/data/datalad_test/proj_discoverie$ datalad get BIDS
Clone attempt:   0%| | 0.00/4.00 [00:00<?, ? Candidate locations/s]
Username for 'https://gin.g-node.org': lukasvo76
Password for 'https://lukasvo76@gin.g-node.org':
[INFO   ] Remote origin not usable by git-annex; setting annex-ignore
[INFO   ] https://gin.g-node.org/labgas/proj_discoverie_BIDS/config download failed: Not Found
Username for 'https://gin.g-node.org': lukasvo76
Password for 'https://lukasvo76@gin.g-node.org':
install(ok): /data/datalad_test/proj_discoverie/BIDS (dataset) [Installed subdataset in order to get /data/datalad_test/proj_discoverie/BIDS]
get(error): BIDS/sub-KUL004/anat/sub-KUL004_T1w.nii.gz (file) [Remote gin-common-bids not usable by git-annex; setting annex-ignore
https://gin.g-node.org/labgas/proj_discoverie_BIDS/config download failed: Not Found]
get(error): BIDS/sub-KUL004/fmap/sub-KUL004_run-01_magnitude.nii.gz (file) [not available; (Note that these git remotes have annex-ignore set: gin-common-bids origin)]
get(error): BIDS/sub-KUL004/fmap/sub-KUL004_run-02_magnitude.nii.gz (file) [not available; (Note that these git remotes have annex-ignore set: gin-common-bids origin)]
get(error): BIDS/sub-KUL004/func/sub-KUL004_task-MIST_run-01_bold.nii.gz (file) [not available; (Note that these git remotes have annex-ignore set: gin-common-bids origin)]
get(error): BIDS/sub-KUL004/func/sub-KUL004_task-MIST_run-02_bold.nii.gz (file) [not available; (Note that these git remotes have annex-ignore set: gin-common-bids origin)]
get(error): BIDS/sub-KUL004/func/sub-KUL004_task-MIST_run-03_bold.nii.gz (file) [not available; (Note that these git remotes have annex-ignore set: gin-common-bids origin)]
get(error): BIDS/sub-KUL004/func/sub-KUL004_task-MIST_run-04_bold.nii.gz (file) [not available; (Note that these git remotes have annex-ignore set: gin-common-bids origin)]
get(error): BIDS/sub-KUL004/func/sub-KUL004_task-rest_bold.nii.gz (file) [not available; (Note that these git remotes have annex-ignore set: gin-common-bids origin)]
get(error): BIDS/sub-KUL005/anat/sub-KUL005_T1w.nii.gz (file) [not available; (Note that these git remotes have annex-ignore set: gin-common-bids origin)]
get(error): BIDS/sub-KUL005/fmap/sub-KUL005_fieldmap.nii.gz (file) [not available; (Note that these git remotes have annex-ignore set: gin-common-bids origin)]
  [31 similar messages have been suppressed; disable with datalad.ui.suppress-similar-results=off]
action summary:
  get (error: 41)
  install (ok: 1)
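
In the datalad siblings listings above, the (+) or (-) after a sibling name appears to indicate whether that sibling is considered to carry an annex. A small sketch for reducing such a listing to the (-) entries, which are the ones that cannot serve annexed content to datalad get; the function name is hypothetical and the parsing assumes the output format shown in this thread.

```shell
# Read `datalad siblings` output on stdin and print "<dataset> <sibling>"
# for every sibling flagged with (-). Hypothetical helper; it relies only
# on the line format "<dataset>: <sibling>(-) [<url> (git)]" seen above.
siblings_without_annex() {
    sed -n 's/^\(.*\): \([^ (]*\)(-) .*$/\1 \2/p'
}
```

It can be used as datalad siblings -r | siblings_without_annex, or fed a saved copy of the listing.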

L

Publish your dataset on GIN and/or Github

NOTE: this is work in progress - see this issue I opened on Neurostars, and this rapidly evolving section of the DataLad handbook, particularly this walkthrough

GIN

For now, this somewhat convoluted workflow works best

  1. Add a GIN “superrepo” as a sibling (and common data source) to your superdataset

For now, only the manual workflow works - at least I experienced authentication problems with the datalad create-sibling-gin command used in the automated workflow - working on this!

After creating your empty superrepo on GIN, you can run

datalad siblings add -d . --name gin --pushurl git@gin.g-node.org:/labgas/proj_discoverie.git --url https://gin.g-node.org/labgas/proj_discoverie --as-common-datasrc gin-common

Then make sure that annex is supported for this sibling by running (probably not needed, but does no harm)

git config --unset-all remote.gin.annex-ignore

  2. Add a GIN “subrepo” as a sibling (and common data source) for each subdataset

NOTE: we do NOT do this for the sourcedata subdataset, since we do not want it to be “datalad gettable”, not even for people with access to the private GIN repo!

Essentially, repeat the process above in a slightly simplified way for each subdataset - this is what I mean by convoluted above

After creating your empty subrepo on GIN, you can run from your subdataset

datalad siblings add -d . --name gin --pushurl git@gin.g-node.org:/labgas/proj_discoverie_code.git --url https://gin.g-node.org/labgas/proj_discoverie_code --as-common-datasrc gin-common-code

Then make sure that annex is supported for this sibling by running (probably not needed, but does no harm)

git config --unset-all remote.gin.annex-ignore

  3. Add the URL of the subrepo for each of the corresponding subdatasets in your superdataset

Run the following command from your superdataset

datalad subdatasets --contains code --set-property url https://gin.g-node.org/labgas/proj_discoverie_code

  4. Push recursively from your superdataset to GIN

datalad push --to gin -r

NOTE: no worries about the error about the sourcedata subdataset, we did not create a GIN sibling for it on purpose!

  5. Clone the entire superdataset wherever you like

datalad clone https://gin.g-node.org/labgas/proj_discoverie

If you want a subdataset with annexed files downloaded to your computer, you should run

datalad get BIDS
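
Steps 2 and 3 above are repeated by hand for every subdataset. The sketch below only prints the corresponding commands for review instead of running them. It assumes the naming scheme used above (one empty GIN repo per subdataset, named <project>_<subdataset> under a single organization); the function name and its arguments are hypothetical, and unlike the commands above it passes -d <subdataset> so that everything can be run from the superdataset root.

```shell
# Print (do not execute) the step 2 and step 3 commands for each
# subdataset. Hypothetical helper based on the commands in this workflow.
# Usage: print_gin_commands ORG PROJECT SUBDATASET...
print_gin_commands() {
    org=$1
    project=$2
    shift 2
    for sub in "$@"; do
        repo="${org}/${project}_${sub}"
        # Step 2: register the GIN sibling (and common data source)
        printf 'datalad siblings add -d %s --name gin --pushurl git@gin.g-node.org:/%s.git --url https://gin.g-node.org/%s --as-common-datasrc gin-common-%s\n' \
            "$sub" "$repo" "$repo" "$sub"
        # Step 3: record the subrepo URL in the superdataset
        printf 'datalad subdatasets --contains %s --set-property url https://gin.g-node.org/%s\n' \
            "$sub" "$repo"
    done
}
```

For example, print_gin_commands labgas proj_discoverie BIDS code derivatives mriqc pipeline lists the commands for all subdatasets except sourcedata, which is left out on purpose as noted above.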

Github

NOTE: Github does not support large files nor annexed content, so it is less convenient than GIN, but it is more widely known so we want our dataset and particularly the code subdataset available on Github as well, preferably in a clonable way (through a link to the common data source on GIN).

  1. Add a GitHub “superrepo” as a sibling to your superdataset

Like for GIN, currently only the manual approach works in my hands, so create an empty repo on GitHub first, and then run the following command from your superdataset

datalad siblings add -d . --name github --url https://github.com/labgas/proj_discoverie.git

  2. Add a GitHub “subrepo” as a sibling to your code subdataset

Create an empty repo on GitHub first, and then run the following command from your code subdataset

datalad siblings add -d . --name github --url https://github.com/labgas/proj_discoverie_code.git

  3. Push recursively from your superdataset to GitHub

datalad push --to github -r

NOTE: no worries about the errors for most subdatasets. We did not create a GitHub sibling for them on purpose, since they are all on GIN anyway and we do not want them to be public prior to publication - private repos on GitHub are not free, contrary to GIN (pocket money for Bill Gates)

FWIW https://github.com/datalad/datalad/pull/5949 merged recently (destined for 0.16.0 eventually) unified and extended create-sibling-github to create-sibling-{github,gogs,gitea,gin}. Would be great if you could give it a test run on your use cases.

Sure, how shall I install it?

Can you send the (pip) install command for this version?

Last thing I did was

pip install git+git://github.com/datalad/datalad.git@master

L

Another quick update: manually adapting .git/config in the cloned version of the dataset (copy/paste lines from the .git/config file in the local dataset which was pushed) seems to solve the datalad get problem.

However, there should be a better way by putting this info in other config files that stick to the dataset when pushed/cloned - suggestions welcome as I did not manage to figure this out from the documentation.

I tried osf as an alternative to GIN, but the problem is essentially the same: when cloning, a remote origin is created which has annex-ignore set to true, and in the case of osf remotes, I cannot even seem to change that by manually adapting .git/config, which prevents a successful datalad get of annexed content altogether.

that should be a correct one to get a “master” version installed. datalad --version should provide information to exactly identify the version used.

Thanks!

Now at datalad 0.15.1+66.gffe050383

datalad create-sibling-osf works fine provided I use an environment variable to store my osf credentials, but datalad osf-credentials fails with the errors I also get for datalad create-sibling-github etc. I think this has to do with keyring issues on our system; I will work on it with our admin and get back to you.

Any thoughts on my workflow above and the issue of making the annex config more sticky?

Or a workaround that avoids having to publish each subdataset separately?

Thanks!

L

Here is an alternative workaround for getting the subdataset’s annexed data using the gin cli, which does not have the problem of datalad clone mentioned above (i.e. it does not set configs to annex ignore), but it does require cloning each of the subdatasets separately into the superdatasets, so also not perfect.

Hence, better solutions are still welcome!

An alternative solution to datalad clone is to use gin commands, specifically gin get (which is the gin equivalent of datalad (or git) clone). gin get-content allows you to get all the (annexed) files in your local repo and hence is the equivalent of datalad get.

This works perfectly for the subdatasets, and contrary to datalad clone, the config about the annex is correctly preserved!

However, when gin getting the superdataset, gin get-content for the subdatasets does unfortunately not work immediately.

There is a fairly easy workaround using gin commands and minor edits in .git/config of the superdataset

cd … (superdataset root)

rm -r BIDS

gin get labgas/proj_discoverie_BIDS

mv proj_discoverie_BIDS BIDS

nano .git/config

add info on the BIDS submodule

[submodule "BIDS"]
active = true
url = https://gin.g-node.org/labgas/proj_discoverie_BIDS
path = ./BIDS

ctrl + O
ctrl + X

NOTE: not sure whether this step is really needed since the information on submodules/subdatasets is already correctly stored in the .gitmodules file under the root of the superdataset, but it definitely does not harm to make these two files consistent

cd BIDS

gin get-content .

NOTE: this last step is only needed for the subdatasets with annexed content, hence not for “code” or “mriqc” for example

Quick update: both datalad osf-credentials and datalad create-sibling-xxx work fine now, after having optimized credential storage on our system, so no problem from the datalad side when it comes to this issue anymore!

This is now the only remaining issue here - see my posts above for workarounds with both datalad and gin commands, none of them perfect, but working nevertheless.

I read here in the docs that the -r option of the new datalad create-sibling-xxx commands will increase the configuration options for publishing of dataset hierarchies, and that a walkthrough is planned. Do you think this could solve the problem?