Reliable way to add existing RIA as sibling after cloning from GitHub?

mshin.neuro · April 24, 2023, 8:39am

Dear Datalad community,

Hi, I am building a workflow around Datalad and RIA store.

Since I am dealing with a dataset that has lots of subdatasets, I reasoned that setting up a RIA store on the lab NAS would be a viable option.

To clone the dataset efficiently and use it across my personal laptop and workstation, I added the superdataset to GitHub, hoping that cloning would be easier.

The goal of this workflow is:

1. clone the superdataset from GitHub
2. download data from RIA as ORA
3. upload updates to RIA (& GitHub)

However, I am not sure how to reliably add RIA to all the subdatasets.

I guess manually adding RIA as a new sibling to the GitHub-cloned dataset is the way to go, as in the example command below:

datalad create-sibling-ria -s ria-store -r --existing reconfigure ria+ssh://internal-nas/path/to/ria-store

Without the option --existing reconfigure, an error occurs

a sibling 'ria-store-storage' is already configured in dataset

So, my first question is whether it is safe to ignore the error message and add RIA store as a sibling to the dataset cloned from github (where the data is already in RIA store).

Second, if I clone the GitHub repository somewhere I do not have access to our NAS (e.g., outside of the institution), the following messages appear:

[INFO   ] Remote origin not usable by git-annex; setting annex-ignore
[INFO   ] https://github.com/username/repository.git/config download failed: Not Found
[INFO   ] ssh: connect to host internal-nas: Operation timed out
[INFO   ] RIA store unavailable. -caused by- Failed to access ssh://internal-nas/archive/ria-store/ria-layout-version -caused by- ConnectionOpenFailedError: 'ssh -fN -o ControlMaster=auto -o ControlPersist=15m -o ControlPath=/home/mshin/.cache/datalad/sockets/28b3b1f3 internal-nas' failed with exitcode 255 [Failed to open SSH connection (could not start ControlMaster process)]
[INFO   ] Reset branch 'main' to 42b1e562 (from ccfea0e6) to avoid a detached HEAD
install(ok): /archive/project-internal/fmri (dataset) [Installed subdataset in order to get /archive/project-internal/fmri]

If I want to download data again when I have access to NAS, what should I do?

Sorry for the somewhat messy questions.
I am a bit confused right now, but hope to build a robust workflow soon!

Minho

mszczepanik · April 24, 2023, 11:05am

Hi, Minho

First of all, I think the general idea for the workflow is good.

However, I am not sure how to reliably add RIA to all the subdatasets.

I think the question can be understood in two ways.

I will assume that in the clone made from GitHub, you are able to datalad get --no-data the subdatasets, and your question is about setting them up for git + git-annex push.

But if the question is about how to get the subdatasets in the first place, please let me know - the tweaks involved are likely even simpler (hint: look at the .gitmodules file).

So, my first question is whether it is safe to ignore the error message and add RIA store as a sibling to the dataset cloned from github (where the data is already in RIA store).

In general, in its basic form, the create sibling command configures a git remote ria-store plus a git-annex special remote ria-store-storage in your dataset. It also sets things up in the location you point to.

I am not fully familiar with the behaviour of create-sibling-ria --existing reconfigure, but I think it is not needed here. Once you clone / get your (super/sub)dataset, it already has the git-annex special remote (ria-store-storage) configured and most likely enabled, hence the error message.

This happens because git-annex tracks information about its special remotes in the git-annex branch. So if you take a dataset, create RIA and GitHub siblings, and then push to GitHub, clones made from GitHub “know” about the RIA storage. You can check a dataset’s remotes and enabled special remotes with datalad siblings, and see all special remotes (enabled or not) with git annex info.

Now what is missing in the clone made from GitHub is the git remote (ria-store). You can add it with git remote add ria-store ssh://internal-nas/path/to/ria-store/dataset (if you’re using aliases, replace #~name with alias/name - Git doesn’t understand the #~ notation).
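The alias substitution described above can be sketched as a small shell helper. This is only an illustration: the function name ria_alias_to_git_url and the alias "fmri" are made up, and it assumes the usual store layout in which aliases live under alias/ in the store root.

```shell
#!/bin/sh
# Hypothetical helper (not part of DataLad): turn a RIA URL that uses an
# alias, e.g. ria+ssh://host/store#~name, into a plain URL that git accepts.
ria_alias_to_git_url() {
    url=${1#ria+}                      # drop the "ria+" prefix if present
    case $url in
        *'#~'*)
            store=${url%%'#~'*}        # everything before "#~"
            name=${url#*'#~'}          # the alias after "#~"
            printf '%s/alias/%s\n' "${store%/}" "$name"
            ;;
        *)
            echo "no #~alias in: $1" >&2
            return 1
            ;;
    esac
}

# Example, run inside the clone made from GitHub:
#   git remote add ria-store \
#       "$(ria_alias_to_git_url 'ria+ssh://internal-nas/path/to/ria-store#~fmri')"
```

The helper only rewrites the string; whether the resulting location exists in the store still has to be checked against the store itself.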

The context’s a bit different (RIA & other RIA, not GitHub & RIA), but you can see similar configuration in this Handbook section.

To do git remote add recursively, take a look at datalad foreach-dataset (manpage).

If I want to download data again when I have access to NAS, what should I do?

AFAIK, during datalad clone DataLad would try to a) enable RIA storage siblings, and b) do some reconfiguration (IIRC, that includes e.g. changing subdataset paths from ria+file to ria+ssh if you clone from ria+ssh), hence the INFO messages about the store not being available.

I am not sure about this, but I would expect that git annex enableremote ria-store-storage when you have the SSH access again should be sufficient.
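As a rough sketch, the enableremote step could be applied to the superdataset and all installed subdatasets in one go. The function names run and reenable_ria_storage and the DRY_RUN switch are made up for illustration, and it is an assumption that every subdataset uses the same special remote name (ria-store-storage).

```shell
#!/bin/sh
# Sketch: re-enable the RIA storage remote once the NAS is reachable again.
# With DRY_RUN=1 the commands are only printed, not executed.

run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# $1: name of the special remote (default: ria-store-storage)
reenable_ria_storage() {
    remote=${1:-ria-store-storage}
    # foreach-dataset runs the command in the superdataset and,
    # with --recursive, in every installed subdataset.
    run datalad foreach-dataset --recursive git annex enableremote "$remote"
}

# Example, from the root of the superdataset:
#   DRY_RUN=1; reenable_ria_storage    # preview the command
#   DRY_RUN=0; reenable_ria_storage    # run it, then e.g. datalad get <path>
```

If a subdataset does not know a special remote with that name, git annex enableremote is expected to fail there, so previewing first may be worthwhile.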

As a side note, if you wish to use GitHub as the main entry point for cloning, you could set up the RIA stores with --storage-sibling-only (so RIA / ORA would only hold the annex part, and GitHub would hold the git part). Or you may wish to cut out GitHub as the middleman, and use ria+ssh:// as the entry point for cloning. But keeping the Git part both in GitHub and RIA is perfectly fine, and I can see why you could prefer that (redundancy, flexibility of access).

Let me know if that worked, or feel free to ask follow-up questions.

mshin.neuro · April 24, 2023, 11:48am

Hi Michał,
Thank you so much for the detailed answer!
Your answer helped me a lot to understand some internals of datalad.

I will assume that in the clone made from GitHub, you are able to datalad get --no-data the subdatasets, and your question is about setting them up for git + git-annex push.

Yes, you are right.

But if the question is about how to get the subdatasets in the first place, please let me know - the tweaks involved are likely even simpler (hint: look at the .gitmodules file).

I know that I can populate the subdataset with something like

git config -f .datalad/config datalad.get.subdataset-source-candidate-000mypreferredRIAstore ria+http://store.datalad.org#{id}

and this is one reason why I want to use RIA as my remote.

I may be able to manually add urls in .gitmodules file, but I do not want to do that for every subject (i.e., sub-dataset).

I am not fully familiar with the behaviour of create-sibling-ria --existing reconfigure, but I think it is not needed here. Once you clone / get your (super/sub)dataset, it already has the git-annex special remote (ria-store-storage) configured and most likely enabled, hence the error message.

Based on your comment and my git history, I guess that the --existing reconfigure option re-writes remote.log

Now what is missing in the clone made from GitHub is the git remote (ria-store). You can add it with git remote add ria-store ssh://internal-nas/path/to/ria-store/dataset (if you’re using aliases, replace #~name with alias/name - Git doesn’t understand the #~ notation).

I think this would work. However, the problem is that, as I mentioned above, it gets tricky when I have a lot of subdatasets. I do not wish to alias all my subdatasets nor manually type the full dataset id. So, as long as --existing reconfigure only does rewrites, I think it is safe to use this option?

The ideal option would be automatically adding ria-store as remote for all subdatasets recursively using their dataset ids, but I do not know whether such an option exists.

I am not sure about this, but I would expect that git annex enableremote ria-store-storage when you have the SSH access again should be sufficient.

This indeed worked! Thanks!!!

As a side note, if you wish to use GitHub as the main entry point for cloning, you could set up the RIA stores with --storage-sibling-only (so RIA / ORA would only hold the annex part, and GitHub would hold the git part). Or you may wish to cut out GitHub as the middleman, and use ria+ssh:// as the entry point for cloning. But keeping the Git part both in GitHub and RIA is perfectly fine, and I can see why you could prefer that (redundancy, flexibility of access).

Yeah, as you mentioned, my preference is storing git things to both GitHub and RIA store (as a full backup option), and use the same RIA store as a special remote. I just want to find an elegant way to achieve that.

Many thanks,
Minho

mszczepanik · April 24, 2023, 7:37pm

I know that I can populate the subdataset with something like
git config -f .datalad/config datalad.get.subdataset-source-candidate-000mypreferredRIAstore ria+http://store.datalad.org#{id}

Right! I forgot about this option. That would be a good fit here (in case someone finds the thread in the future: Prioritizing subdataset clone locations).

I do not wish to alias all my subdatasets nor manually type the full dataset id. So, as long as --existing reconfigure only does rewrites, I think it is safe to use this option?

I took a cursory glance at the create_sibling_ria source and tried a minimal reproducer. It seems to me that create sibling ria --existing reconfigure would still go through the motions of opening an SSH connection, checking if RIA files are in place, maybe re-running some initialization commands – but in the end both the annex UUID of the remote and the files in the RIA store remain unchanged. So: yes, I think you are right.

The ideal option would be automatically adding ria-store as remote for all subdatasets recursively using their dataset ids, but I do not know whether such an option exists.

I agree. I’m afraid such an option does not currently exist. FTR, it has been mentioned as a possibility in an issue in the DataLad-Next extension, but I don’t think it is being actively worked on right now. And also FTR, adding the git remotes as a purely local operation with git remote add is probably achievable with a combination of datalad foreach-dataset, datalad configuration get datalad.dataset.id, and some string processing to turn the id into a RIA address… But not sure if it’s worth it.
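For the record, the string processing could look roughly like the sketch below. It assumes the RIA store layout in which a dataset lives under <store>/<first three characters of the ID>/<rest of the ID>; the function names ria_id_to_git_url and add_ria_remote are made up, and the ID is read with git config from .datalad/config rather than with datalad configuration.

```shell
#!/bin/sh
# Sketch: add a "ria-store" git remote to the dataset in the current directory,
# deriving its location in the store from the dataset ID.

# Hypothetical helper: map a store URL and a dataset ID to a git URL.
ria_id_to_git_url() {
    store=${1%/}            # store root, without a trailing slash
    id=$2
    rest=${id#???}          # the ID without its first three characters
    prefix=${id%"$rest"}    # the first three characters
    printf '%s/%s/%s\n' "$store" "$prefix" "$rest"
}

# Hypothetical helper: read the ID from .datalad/config and add the remote.
# $1: store URL, e.g. ssh://internal-nas/path/to/ria-store
add_ria_remote() {
    id=$(git config -f .datalad/config --get datalad.dataset.id) || return 1
    git remote add ria-store "$(ria_id_to_git_url "$1" "$id")"
}
```

Saved as a script that ends with a call to add_ria_remote "$1", this could then be run in every installed dataset with something like datalad foreach-dataset --recursive sh /path/to/add-ria-remote.sh ssh://internal-nas/path/to/ria-store (untested; the script path is a placeholder).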

mshin.neuro · April 25, 2023, 1:11am

Hi Michał,

Thank you again for your support!

I took a cursory glance at the create_sibling_ria source and tried a minimal reproducer. It seems to me that create sibling ria --existing reconfigure would still go through the motions of opening an SSH connection, checking if RIA files are in place, maybe re-running some initialization commands – but in the end both the annex UUID of the remote and the files in the RIA store remain unchanged. So: yes, I think you are right.

Wow… It seems a lot of work. Thank you for testing this out! I will apply this as my temporary solution until the issue in DataLad-Next that you mentioned is properly implemented.

Many thanks,
Minho

NathanHuneke · April 29, 2023, 5:48pm

Hi there,

I think what you are trying to achieve is the exact workflow I use. Because I often do projects with sensitive data, I need to be careful about who can access the datasets. I therefore store most data on a secure server as a RIA special remote, which I can then give collaborators access to as needed. But, to make life easy for said collaborators, I have a human-readable and easily accessible superdataset in GitLab.

The whole thing works as follows:

Create superdataset with gitlab sibling
Within superdataset, create subdatasets for each sensitive project or for each analysis unit (e.g. subject, experiment, etc.).
For each subdataset, create an RIA sibling/special remote on secure server accessible via SSH

The command you mentioned above

git config -f .datalad/config "datalad.get.subdataset-source-candidate-origin" "ria+ssh://my.awesome.server/datasets#{id}"

just needs to be run in the superdataset to make it possible to clone all subdatasets from their respective RIA store.

If you do all this correctly, it should be possible to clone the superdataset, then run a datalad get subdataset for it to automatically pull from the RIA store.

I created a simple script to allow collaborators to set their datasets up in the same way (among other set up procedures GitHub - nhuneke/dataset-setup-procedures: DataLad procedures to set up DataLad datasets) :

#!/bin/bash

# Procedure to create RIA-backup and GitLab siblings.
# If these siblings already exist then they are skipped.

# Abort on the first error and on use of an unset variable.
set -e -u

echo "Creating RIA sibling..."
echo
echo "Please enter the URL for the RIA backup sibling e.g."
echo "ssh://my.awesome.server:/research/datasets/"
read -r -p 'URL for RIA backup: ' riaurl

datalad create-sibling-ria -s ria-backup --new-store-ok --existing skip "ria+${riaurl}"

echo "Creating gitlab sibling..."
echo
echo "Please enter your GitLab project location. Should take the form of"
echo "<Project>/<dataset>"
read -r -p 'GitLab project location: ' gitlabproj

# Pushing to GitLab first pushes to the RIA backup (--publish-depends).
datalad create-sibling-gitlab -s gitlab --site mysite --project "$gitlabproj" \
        --publish-depends ria-backup --existing skip

# Let "datalad get" clone subdatasets from the RIA store by dataset ID.
git config -f .datalad/config "datalad.get.subdataset-source-candidate-origin" \
        "ria+${riaurl}#{id}"

datalad save -m "Configure backup siblings and subdataset retrieval"

mshin.neuro · May 2, 2023, 11:42am

Hi, Nathan!

Thank you for sharing details of your workflow!

The whole thing works as follows:

Create superdataset with gitlab sibling

Within superdataset, create subdatasets for each sensitive project or for each analysis unit (e.g. subject, experiment, etc.).

For each subdataset, create an RIA sibling/special remote on secure server accessible via SSH

Yeah, this is the workflow that I was hoping to implement!

I created a simple script to allow collaborators to set their datasets up in the same way (among other set up procedures GitHub - nhuneke/dataset-setup-procedures: DataLad procedures to set up DataLad datasets) :

It is an awesome resource!

I will read this carefully and test things out!

Thank you so much again for sharing your experience

Minho