Connecting DataLad to local S3 Object Store (MinIO)

Hello everybody,

We want to use DataLad with a locally run S3 object store, MinIO. The
latter is running and can be accessed with tools like "CloudBerry Explorer".

DataLad 0.12.4 is running on Ubuntu 20.04.4 LTS with Python 3.8.10.

The procedure of connecting DataLad looked quite easy after reading through
the handbook’s chapter “8.4. Walk-through: Amazon S3 as a special remote”.

In order to set up the special remote I followed the instructions of the
above walkthrough, setting:

export AWS_ACCESS_KEY_ID=testAK
export AWS_SECRET_ACCESS_KEY=testSK
BUCKET=sample-neurodata-public

… and then running it with the option "host=" added, as follows:

git annex initremote public-s3 type=S3 encryption=none bucket=$BUCKET \
public=yes datacenter=EU autoenable=true host=

That resulted in the message below, no matter whether the bucket already
existed or not.

Playing around with the additional options "port" and "protocol" did not
help.

I would be happy if anyone is willing to share their knowledge
and experience on this issue …

TIA
:slight_smile:
Peter

  • Error Message ----------------------

Hint: I've redacted the real IP address in the output.

initremote public-s3 (checking bucket...) (creating bucket in EU...)
git-annex: HttpExceptionRequest Request {
host = "sample-neurodata-public."
port = 80
secure = False
requestHeaders = [("Date","Sat, 09 Jul 2022 15:30:22 GMT"),("Authorization",""),("x-amz-acl","public-read")]
path = "/"
queryString = ""
method = "PUT"
proxy = Nothing
rawBody = False
redirectCount = 10
responseTimeout = ResponseTimeoutDefault
requestVersion = HTTP/1.1
}
(ConnectionFailure Network.Socket.getAddrInfo (called with preferred socket type/protocol: AddrInfo {addrFlags = [AI_ADDRCONFIG], addrFamily = AF_UNSPEC, addrSocketType = Stream, addrProtocol = 6, addrAddress = , addrCanonName = }, host name: Just "sample-neurodata-public.", service name: Just "80"): does not exist (Name or service not known))
failed
git-annex: initremote: 1 failed

Hi Peter

I have only set up an S3 special remote once, using Amazon Web Services, so I have no experience with locally run S3 object stores. Maybe @eknahm would know more?

However, the first thing I would look into is the DataLad version (and also the git-annex version, as ultimately it is git-annex that creates the special remote) - 0.12 is from early 2020 and a lot has changed since. I realize that there aren't recent apt releases of DataLad for Ubuntu 20.04*, so I would recommend installing with Python's package manager pip, or - maybe even better - using conda (conda can install more than just Python packages, so this way you'd also get more up-to-date versions of git-annex and git itself; that is probably the easiest way of setting up on Ubuntu 20.04); see the DataLad installation notes.

*) if you need to stick with apt, the Neurodebian repository has DataLad 0.15.5 packaged for Ubuntu 20.04 - but I’d go for the latest (0.17.1) with conda if possible
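For reference, a minimal conda-based setup could look like this (the environment name is just an example; DataLad, git-annex and git are all packaged on conda-forge):

# create a fresh environment with current DataLad, git-annex and git from conda-forge
conda create -n datalad -c conda-forge datalad git-annex git
conda activate datalad
datalad --version
git annex version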

I didn't try a setup with MinIO, but why is host empty and not localhost or the like? The "slow s3 transfer" thread suggests that @bpinsard might know more.

We use Minio deployed on a local server with docker-swarm and it has worked consistently for the past years / git-annex versions.
Each special remote is initialized with something like:
git annex -d initremote remote_name type=S3 encryption=none autoenable=true host=s3.mydomain.tld port=443 protocol=https chunk=1GiB bucket=my_new_bucket_name requeststyle=path
If you are on the same network (i.e. the IP of the host is in a local/private range, if I am correct), you will have to bypass git-annex's IP security check with:
git config --add annex.security.allowed-ip-addresses 192.168.0.xxx (using the local IP of the server).
Let me know if that works for you.


I forgot to say that git-annex creates the bucket and fails if it already exists, so the access keys must have the permission to create buckets (set in a policy on the MinIO server).
The following policy works in our case.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:DeleteObject",
                "s3:GetObject",
                "s3:ListAllMyBuckets",
                "s3:ListBucket",
                "s3:PutObject",
                "s3:CreateBucket"
            ],
            "Resource": [
                "arn:aws:s3:::*"
            ]
        }
    ]
}
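As a rough sketch, such a policy can be registered and attached to a user with the MinIO client mc (the alias, policy and user names below are placeholders):

# register the policy JSON on the server and attach it to the uploading user
# ("myminio" is an mc alias previously configured with `mc alias set`)
mc admin policy add myminio datalad-s3 /path/to/policy.json
mc admin policy set myminio datalad-s3 user=datalad-user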

That sounds odd… I always used an already existing bucket. Is that specific to MinIO? It might be worth filing an issue with Joey of git-annex if confirmed - it should work fine for an existing bucket.

I was wrong - it only fails if the annex-uuid file is already there, which makes more sense.

Hi Michal,

thanks for your suggestions. I have now set up
Ubuntu 22.04 to be as current as possible.

Using conda, I now have DataLad 0.17.1
and git-annex 10.20220526-gc6b112108.

That set, I will now try local S3 again …

:- )
Peter

Hi yarikoptic,

Yes, that's one of the things I don't understand -
and I have no idea how to change it yet.

:slight_smile:
Peter

Now I have an idea - it looks like the parameter "requeststyle" needs to be set to "path", as Basile showed in his example. That issue seems to be solved now. : )
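For reference, the two request styles differ only in where the bucket name ends up; with path style, git-annex no longer prepends the bucket to the host name (which is what caused the earlier DNS error):

# virtual-hosted style (default):   http://<bucket>.<host>:<port>/<key>
# path style (requeststyle=path):   http://<host>:<port>/<bucket>/<key>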

Basile,

your example took me a step further !

As all components run on the same subnet, I've added "git config --add annex.security.allowed-ip-addresses 134.95.232.22" as you suggested on the local machine.

Some initial communication now happens with the MinIO system, but then it comes to a halt with another error, as you can see below. The error occurs whether the bucket is pre-created or not, even with the bucket already existing and "All Users" set to full control.

On the S3 system there is no firewall or the like running at the moment.

:- )
Peter


[2022-07-17 15:22:12.338857344] (Utility.Process) process [5663] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","-c","annex.debug=true","show-ref","git-annex"]
[2022-07-17 15:22:12.341183195] (Utility.Process) process [5663] done ExitSuccess
[2022-07-17 15:22:12.341484337] (Utility.Process) process [5664] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","-c","annex.debug=true","show-ref","--hash","refs/heads/git-annex"]
[2022-07-17 15:22:12.343827859] (Utility.Process) process [5664] done ExitSuccess
[2022-07-17 15:22:12.344113103] (Utility.Process) process [5665] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","-c","annex.debug=true","log","refs/heads/git-annex..238a965cbd6f66267dd1fb4a4c1946535b45f661","--pretty=%H","-n1"]
[2022-07-17 15:22:12.346384327] (Utility.Process) process [5665] done ExitSuccess
[2022-07-17 15:22:12.346892037] (Utility.Process) process [5666] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","-c","annex.debug=true","cat-file","--batch"]
initremote public-s3 (checking bucket...) [2022-07-17 15:22:12.364436714] (Remote.S3) String to sign: "GET\n\n\nSun, 17 Jul 2022 15:22:12 GMT\n/sample-neurodata-public/annex-uuid"
[2022-07-17 15:22:12.3645716] (Remote.S3) Host: "134.95.232.22"
[2022-07-17 15:22:12.364623074] (Remote.S3) Path: "/sample-neurodata-public/annex-uuid"
[2022-07-17 15:22:12.364751668] (Remote.S3) Query string: ""
[2022-07-17 15:22:12.364862103] (Remote.S3) Header: [("Date","Sun, 17 Jul 2022 15:22:12 GMT"),("Authorization","AWS minioAK#:pDV4lpxz0aNYakxzu39ejBT/BbI=")]
[2022-07-17 15:22:12.365834581] (Remote.S3) String to sign: "GET\n\n\nSun, 17 Jul 2022 15:22:12 GMT\n/sample-neurodata-public/"
[2022-07-17 15:22:12.365966033] (Remote.S3) Host: "134.95.232.22"
[2022-07-17 15:22:12.366207127] (Remote.S3) Path: "/sample-neurodata-public/"
[2022-07-17 15:22:12.366318754] (Remote.S3) Query string: ""
[2022-07-17 15:22:12.366433335] (Remote.S3) Header: [("Date","Sun, 17 Jul 2022 15:22:12 GMT"),("Authorization","AWS minioAK#:L5zxLnXlPOwkZmne0QN8lYC7kF8=")]
(creating bucket in US...) [2022-07-17 15:22:12.366911857] (Remote.S3) String to sign: "PUT\n\n\nSun, 17 Jul 2022 15:22:12 GMT\n/sample-neurodata-public/"
[2022-07-17 15:22:12.367011679] (Remote.S3) Host: "134.95.232.22"
[2022-07-17 15:22:12.36712056] (Remote.S3) Path: "/sample-neurodata-public/"
[2022-07-17 15:22:12.367235103] (Remote.S3) Query string: ""
[2022-07-17 15:22:12.367380261] (Remote.S3) Header: [("Date","Sun, 17 Jul 2022 15:22:12 GMT"),("Authorization","AWS minioAK#:P1Pr5j/dCZic4tAy/kkgbI1IAic=")]

git-annex: HttpExceptionRequest Request {
host = "134.95.232.22"
port = 443
secure = True
requestHeaders = [("Date","Sun, 17 Jul 2022 15:22:12 GMT"),("Authorization","")]
path = "/sample-neurodata-public/"
queryString = ""
method = "PUT"
proxy = Nothing
rawBody = False
redirectCount = 10
responseTimeout = ResponseTimeoutDefault
requestVersion = HTTP/1.1
}
(ConnectionFailure Network.Socket.connect: <socket: 11>: does not exist (Connection refused))
failed
[2022-07-17 15:22:12.369519039] (Utility.Process) process [5666] done ExitSuccess
initremote: 1 failed

Ah right, I forgot to mention that we add https through a Traefik reverse proxy, so the initremote example I posted uses https. That might not be the case for you if you are connecting directly to the MinIO instance. You might need to use http and another port (9000 by default on MinIO) to make it work. That might explain the connection error.
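Adapted to a direct connection without a reverse proxy, the call would then look roughly like this (host, bucket and chunk size are taken from the examples above and should be adjusted):

git annex -d initremote remote_name type=S3 encryption=none autoenable=true \
  host=s3.mydomain.tld port=9000 protocol=http chunk=1GiB \
  bucket=my_new_bucket_name requeststyle=path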

Hi Basile,

you are getting me closer to a working connection! :slight_smile:

With your suggestions (http/9000) I now get the new
and thus "better" error message shown below.


initremote public-s3 (checking bucket…)
Set both AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to use S3

git-annex: No S3 credentials configured
failed
[2022-07-19 18:55:33.395226925] process [8217] done ExitSuccess
[2022-07-19 18:55:33.395597787] process [8218] done ExitSuccess
git-annex: initremote: 1 failed

Both keys are now recognized by DataLad's special remote, but
apparently not by git-annex itself. I cannot find any initremote
parameter that would handle the issue - "embedcreds" has
a different aim.

The closest (and pretty recent) comment I could find, though from a different perspective:

https://git-annex.branchable.com/todo/allow_for_annonymous_AWS_S3_access/

And that does not sound good …

:- )
Peter

Glad you made some progress.
If I understand correctly, you are trying to init the remote without providing S3 keys.
It would be unusual and dangerous to let anyone create buckets and upload data anonymously on the MinIO server, and MinIO's default policy certainly won't let you do that. The setup/upload of data to MinIO should be done with an account that has the specific permissions to do so.
For the enableremote (auto or not) and data download, it should work without credentials if the policies allow read-only access; see "mc policy set" in the MinIO Baremetal documentation.
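For example, anonymous download access for a single bucket can be granted roughly like this (the alias and bucket names are placeholders):

# allow anonymous read-only (download) access to one bucket
mc policy set download myminio/my_new_bucket_name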
Let me know if that works for you.

So if the only thing you want is read-only public access, you should be able to set up your remote with an account that has write access, using something like:
git annex -d initremote remote_name type=S3 encryption=none autoenable=true host=s3.mydomain.tld port=9000 protocol=http chunk=1GiB public=yes publicurl=http://s3.mydomain.tld:9000/my_new_bucket_name/ bucket=my_new_bucket_name
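On the consuming side, a clone could then fetch data anonymously, roughly like this (the dataset URL is a placeholder; with autoenable=true the special remote is enabled automatically):

datalad clone <dataset-url> my_clone
cd my_clone
datalad get path/to/some/file   # downloads via publicurl, no AWS_* credentials needed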


Uuh, there's some misunderstanding: I don't want public read-only
access. That impression is only due to my inexperienced attempts
to get it running.

The S3 should of course be accessible for authenticated users only!

I set up a user, put it into a new group, and assigned it the
policy below (allowing full access for now).

-------------------------
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Stmt1658430233833",
            "Effect": "Allow",
            "Action": [
                "*"
            ],
            "Resource": [
                "arn:aws:s3:::*"
            ]
        }
    ]
}
--------------------------

The resulting message running initremote shows:

-------------------------
initremote public-s3 (checking bucket...)
  Set both AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to use S3

git-annex: No S3 credentials configured
failed
-------------------------

Both are set prior to running initremote and can be seen
with $ env. Even putting the values of the access and secret keys
in double quotes - as found in some examples - doesn't help.

So my question now is: how and where can I provide
the S3 credentials to git-annex?

That's really a weird one: I always set the env vars and it worked, and that's the only way I know of to provide these to git-annex. Unless your keys contain characters (e.g. $) that make your shell substitute/expand them into something else (single quotes would solve that) - but that would show up in env and not give that error.
Which version of git-annex are you using, and where did you install it from (conda-forge or elsewhere)?

The username & password (i.e. access & secret key)
only contain lower- and uppercase letters while
testing. Here is the snipped output of $ env:


AWS_SECRET_ACCESS_KEY=HeWaHeWa

AWS_ACCESS_KEY_ID=HeWa

Following Michal's initial suggestions, I've updated the system
to Ubuntu 22.04 LTS and installed DataLad using Miniconda.
It looks pretty up-to-date right now, IMHO.

Find below the first part of $ datalad wtf

## datalad
  - version: 0.17.1
## dependencies
  - annexremote: 1.5.0
  - boto: 2.49.0
  - cmd:7z: 16.02
  - cmd:annex: 10.20220526-gc6b112108
  - cmd:bundled-git: 2.36.1
  - cmd:git: 2.36.1
  - cmd:ssh: 8.9p1
  - cmd:system-git: 2.34.1
  - cmd:system-ssh: 8.9p1
  - exifread: 3.0.0
  - humanize: 3.10.0
  - iso8601: 1.0.2
  - keyring: 23.4.0
  - keyrings.alt: 4.1.0
  - msgpack: 1.0.3
  - mutagen: 1.45.1
  - platformdirs: 2.4.0
  - requests: 2.28.1
## environment
  - LANG: en_US.UTF-8
  - PATH: /home/super/miniconda3/bin:/home/super/miniconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
## extensions

Glad to see that there’s progress - although the last error puzzles me as well. I also think you are doing everything by the book, and that these environment variables are the only way to supply login credentials to git annex initremote.

I tried replicating this with Amazon Web Services and the exact same version of git-annex (10.20220526). For me it worked when doing (note: some options are AWS-specific):

export AWS_ACCESS_KEY_ID=<redacted>
export AWS_SECRET_ACCESS_KEY=<redacted>
git annex initremote aws-s3 type=S3 encryption=none bucket=photo-one public=no datacenter=EU autoenable=true

I did see the “No S3 credentials configured” error when setting the variables without using export (i.e. plain AWS_ACCESS_KEY_ID=...). However, for you these are reported by env, which suggests that everything was set properly.

So my impression is that maybe there’s something wrong with the execution environment, maybe something that unsets these variables between export and git annex initremote?

To make sure these aren't going anywhere, you could also specify them on the same line as the git-annex command (not sure what the proper name for this kind of definition is - command-local variables?):

AWS_ACCESS_KEY_ID=<redacted> AWS_SECRET_ACCESS_KEY=<redacted> git annex initremote ...

(tested in zsh, some shells might need env at the beginning of the line).

I agree with @mszczepanik that there seems to be an issue with the environment/setup, even though you seem to have a recent, clean install.
You can check which git-annex and maybe alias git-annex to see which binary is used, or whether a bash alias overrides a direct call to git-annex.
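A few quick checks, run in the same shell in which the variables were exported:

which git-annex       # which binary is actually first on $PATH (conda env vs. system)
alias git-annex       # reports an alias if one shadows the command
env | grep '^AWS_'    # confirm both credentials are really exported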

Otherwise, you could try running it in a DataLad Docker container, binding the dataset into it. You would then need to set the environment variables inside the container.
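A very rough sketch of that approach, assuming a DataLad Docker image is available (the image name datalad/datalad and the paths are placeholders to adapt):

# pass the credentials from the host environment into the container and
# run initremote inside the bind-mounted dataset
docker run --rm -it \
  -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY \
  -v "$PWD":/dataset -w /dataset \
  datalad/datalad \
  git annex initremote public-s3 type=S3 encryption=none autoenable=true \
    host=134.95.232.22 port=9000 protocol=http requeststyle=path \
    bucket=sample-neurodata-public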