Can DataLad publish annexed contents to GIN using the HTTPS protocol?

DataLad beginner's question here.

I’m trying to publish a dataset to GIN using the HTTPS protocol.
datalad save and datalad push both finished normally, but GIN’s web page shows “File content is not available” instead of the file’s contents.

Here are the repository and steps I tried.

  • The execution environment is defined by apt.txt and postBuild in the above repository.
  1. Build an execution environment using mybinder based on the above GIN repository
  2. Create a Jupyter notebook at test/1.ipynb
  3. datalad save test/1.ipynb
  4. datalad push --to origin

“Chapter 8.6. Walk-through: Dataset hosting on GIN” of the DataLad Handbook says that the SSH protocol is ideal. Why is SSH preferred over HTTPS, and can datasets also be published to GIN over HTTPS?
I would appreciate it if you could give me some advice.

Sincerely,

What did you see while running datalad push --to origin (assuming origin is the g-node)? It should have copied that test/1.ipynb to GIN, but it didn’t:

$> git annex whereis test/1.ipynb
whereis test/1.ipynb (1 copy) 
  	95b10147-3a76-402b-bb6c-e9c951341579 -- jovyan@jupyter-ivis-2dmizuguchi-2dtest-2dhttps-2ddnu20ylg:~/
ok

You can run an explicit datalad push --to origin test/1.ipynb, which should copy it, but AFAIK it should have worked without an explicit path specification.

As far as I know, GIN doesn’t allow pushing annexed contents through https, it requires you to use ssh for that purpose. I suppose it’s their design decision, although I don’t have a good explanation why.
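For an environment where outgoing SSH is possible, switching an existing remote from HTTPS to SSH could look roughly like the sketch below. It uses a scratch git repository so it only needs git, and both URLs are hypothetical placeholders (example-user/example-repo); the SSH URL form is my assumption of how GIN addresses repositories, so copy the exact one from your repository page on GIN.

```shell
# Scratch repository for illustration only; in practice run the
# set-url line from inside your dataset.
demo=$(mktemp -d)
git init -q "$demo"
cd "$demo"

# Hypothetical HTTPS remote, as it would look after cloning over HTTPS.
git remote add origin "https://gin.g-node.org/example-user/example-repo"

# Point the same remote at the (assumed) SSH address instead.
git remote set-url origin "git@gin.g-node.org:/example-user/example-repo.git"

# Confirm which URL the remote now uses.
remote_url=$(git remote get-url origin)
echo "$remote_url"
```

After that, datalad push --to origin would go over SSH, provided an SSH key is registered with your GIN account.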

If you are using GIN with https, DataLad’s push will publish only the “git” part (i.e. file identity information, but not file content) illustrated in the first figure of Walk-through: Dataset hosting on GIN. That’s why you see the message.

Unfortunately, Mybinder doesn’t allow outgoing SSH connections (see this discussion for reasons), so you won’t be able to fully utilise the GIN workflow when working from mybinder.

If what you care about are only the notebook files, you can configure your dataset to not annex them by adding a new line that says *.ipynb annex.largefiles=nothing to the .gitattributes file; see More on DIY configurations for details. This will work on files added afterwards, and you may need to unannex previously saved files; see Getting contents out of git-annex.
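A minimal sketch of the .gitattributes change, run in a scratch git repository so it needs nothing but git (the temporary directory is only for illustration; in a real dataset you would run the echo line from the dataset root):

```shell
# Scratch repository, only to demonstrate the attribute rule.
demo=$(mktemp -d)
git init -q "$demo"
cd "$demo"

# Keep notebooks in git rather than in the annex.
echo '*.ipynb annex.largefiles=nothing' >> .gitattributes

# Ask git which annex.largefiles value applies to the notebook path
# (the file itself does not have to exist for this check).
rule=$(git check-attr annex.largefiles -- test/1.ipynb)
echo "$rule"
```

In the real dataset you would then record the change with datalad save. For a notebook that was already annexed, git annex unannex test/1.ipynb followed by another datalad save should move it into git, but please check the handbook section mentioned above before doing this on data you care about.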

As a side note, https can be used to retrieve annexed contents from GIN, if they have been previously uploaded (possibly by another person) through SSH - this is a useful scenario for sharing data in public repositories. Through a quirk of GIN, the https url used by the person who clones such a repository needs to be given without the trailing .git.
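To illustrate the URL quirk, here is a small sketch that strips a trailing .git before cloning. The URL is a hypothetical placeholder (example-user/example-repo), and the clone commands are left as comments because they need network access and a real repository.

```shell
# Hypothetical repository URL, e.g. as copied from the GIN web page.
url="https://gin.g-node.org/example-user/example-repo.git"

# Remove a trailing ".git" if present; other URLs are left unchanged.
clone_url="${url%.git}"
echo "$clone_url"

# With a real repository one would then run something like:
#   datalad clone "$clone_url" my-clone
#   cd my-clone && datalad get test/1.ipynb
```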

Thanks!

  • git annex copy

As shown in the image below, the action summary says copy (not needed: 1).

  • data transmission between Mybinder and GIN
    Thanks for your advice. I understand that Mybinder doesn’t allow outgoing SSH connections. I will try it in other execution environments.