Datalad with S3 datasets

I have a use-case for datalad, and I wonder whether it makes sense and is currently supported: I have some datasets that are stored in S3. Each dataset is a bucket that is organized in a BIDS-compliant manner.

I would like to know what I need to do so that I can install these datasets and start working with them with datalad. Is there an automated way of doing this? I would like something like:

datalad fancy_command s3://my_bucket

Such that I can now do:

datalad install s3://my_bucket

In particular, I want to be able to authenticate in fancy_command, so that I can access privately-stored datasets in this manner, or inherit the S3 permissions of the machine on which this is run.

Is this currently possible? Or if not, what would it take?


If I’m understanding it correctly, the use case you are describing is not possible with DataLad.

In your description, it sounds as if your S3 buckets contain only a collection of files, not Git repositories/DataLad datasets, correct? In order to datalad/git clone anything from somewhere, what is to be cloned needs to be a Git repository, i.e., it needs to contain a .git directory with all the information required for version control. Without this, a datalad/git clone is not possible. (Please clarify if I’m misunderstanding.)

What could perhaps come close to your desired outcome is what we have done with the HCP S3 bucket, where we essentially use an existing S3 bucket as a data store that authenticated users can datalad get file contents from. We first queried the bucket for file names, versions, and file URLs, created a new DataLad dataset from this information, and published the resulting dataset to somewhere it can be cloned from, e.g., GitHub. After cloning the dataset, its file contents can be retrieved from the S3 bucket with datalad get, which will prompt for authentication (and store it encrypted, or retrieve the credentials from your system's keyring if they are already stored there). There is a write-up of how this was done in the first "findoutmore" in this handbook use case.
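
Once such a dataset has been created and published, working with it is a standard clone-and-get workflow. A minimal sketch, where the repository URL and file path are only placeholders:

datalad clone https://github.com/<org>/<my_bucket_dataset>.git my_bucket_ds
cd my_bucket_ds
# retrieves the file content from the S3 bucket; prompts for credentials on first use
datalad get sub-01/anat/sub-01_T1w.nii.gz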

Other DataLad people, please add if I have missed something :slight_smile:

@Ariel_Rokem Such a use case is exactly the purpose of the https://travis-ci.org/github/datalad/datalad-crawler/ extension (just pip install datalad-crawler); it is now somewhat elderly and needs more documentation and some overhaul, but it should work. With it you end up with a two-stage procedure (so your fancy_command == crawl-init + crawl, rather than a direct datalad install, although it could probably be turned into one for some use cases). Unfortunately, we do not seem to have a nice tutorial for it in the handbook yet, but my earlier response outlines a typical use case: How to crawl with datalad? . It is with datalad crawl that we populated, and keep updating (from time to time), many of the datasets on http://datasets.datalad.org/ that do not natively come as DataLad datasets.
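
A minimal sketch of that two-stage procedure, assuming the crawler's simple_s3 template and its bucket parameter (check datalad crawl-init --help for the actual templates and options):

pip install datalad-crawler

datalad create my_bucket_ds
cd my_bucket_ds
# stage 1: record the crawling configuration (template name and parameters are assumptions)
datalad crawl-init --save --template=simple_s3 bucket=my_bucket
# stage 2: crawl the bucket and populate the dataset with its files and their S3 URLs
datalad crawl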

There is also datalad addurls, which provides a powerful tool for populating (and, to some extent, updating) datasets, or even hierarchies of them, from structured (.csv, .json) data. Eventually someone should overhaul the crawler (or rather add an alternative pipeline within it) to use the datalad addurls functionality and simply feed it the structured records the crawler discovers.
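
For illustration, assuming you had a CSV listing of the bucket contents (the column names and URL below are placeholders), datalad addurls could populate a dataset from it:

# files.csv (illustrative):
#   name,url
#   sub-01/anat/sub-01_T1w.nii.gz,https://my_bucket.s3.amazonaws.com/sub-01/anat/sub-01_T1w.nii.gz
datalad addurls -d my_bucket_ds files.csv '{url}' '{name}'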
