Datalad download from openneuro

Overview

I am having some trouble downloading data from openneuro using datalad. I am using python (when I can) to do so.
I am trying to download the Stroop dataset (ds000164) on openneuro,
and here is the respective github repository.

Datalad Version

import datalad
datalad.__version__

'0.11.1'

Git-Annex Version

git-annex version: 7.20181121+git58-gbc4aa3f0e-1~ndall+1
build flags: Assistant Webapp Pairing S3(multipartupload)(storageclasses) WebDAV Inotify DBus DesktopNotify TorrentParser MagicMime Feeds Testsuite
dependency versions: aws-0.20 bloomfilter-2.0.1.0 cryptonite-0.25 DAV-1.3.3 feed-1.0.0.0 ghc-8.4.3 http-client-0.5.13.1 persistent-sqlite-2.8.2 torrent-10000.1.1 uuid-1.3.13 yesod-1.6.0
key/value backends: SHA256E SHA256 SHA512E SHA512 SHA224E SHA224 SHA384E SHA384 SHA3_256E SHA3_256 SHA3_512E SHA3_512 SHA3_224E SHA3_224 SHA3_384E SHA3_384 SKEIN256E SKEIN256 SKEIN512E SKEIN512 BLAKE2B256E BLAKE2B256 BLAKE2B512E BLAKE2B512 BLAKE2B160E BLAKE2B160 BLAKE2B224E BLAKE2B224 BLAKE2B384E BLAKE2B384 BLAKE2S256E BLAKE2S256 BLAKE2S160E BLAKE2S160 BLAKE2S224E BLAKE2S224 BLAKE2SP256E BLAKE2SP256 BLAKE2SP224E BLAKE2SP224 SHA1E SHA1 MD5E MD5 WORM URL
remote types: git gcrypt p2p S3 bup directory rsync web bittorrent webdav adb tahoe glacier ddar hook external
operating system: linux x86_64
supported repository versions: 5 7
upgrade supported from repository versions: 0 1 2 3 4 5 6
local repository version: 5

git version

git version 2.7.4

Questions/Problems

  1. How should I properly “get” the git-annex tracked files?
  2. How should I download extra data (e.g. fmriprep results) to a datalad repository?

How should I properly “get” the git-annex tracked files?

Here is my code:

from datalad.api import install
import tempfile
import os
from subprocess import call  # os and call are used further below

# scratch directory to install the dataset into
data_dir = tempfile.mkdtemp()

# install the dataset skeleton (git/git-annex metadata, no file content yet)
dataset = install(data_dir, "///openneuro/ds000164")
# fetch the actual content of one subject's functional data
dataset.get("sub-001/func/")

Everything runs fine until the last line, where I get this stderr:

[INFO] access to dataset sibling "s3-PRIVATE" not auto-enabled, enable with:
| 		datalad siblings -d "/tmp/tmpyq6w5m21" enable -s s3-PRIVATE 
[WARNING] Running get resulted in stderr output:   Set both AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to use S3
git-annex: get: 1 failed
 
[ERROR] from s3-PUBLIC...; Unable to access these remotes: s3-PUBLIC; Try making some of these repositories available:; 	2f45a4ca-eba9-46da-98f2-5ca487a87a67 -- [s3-PUBLIC];  	c40c41af-4d97-418c-a33d-ffb7e596b0c7 -- root@cf3d3f9acfa2:/datalad/ds000164 [get(/tmp/tmpyq6w5m21/sub-001/func/sub-001_task-stroop_bold.nii.gz)] 
[WARNING] could not get some content in /tmp/tmpyq6w5m21/sub-001/func ['/tmp/tmpyq6w5m21/sub-001/func/sub-001_task-stroop_bold.nii.gz'] [get(/tmp/tmpyq6w5m21/sub-001/func)] 

and this python traceback:

---------------------------------------------------------------------------
IncompleteResultsError                    Traceback (most recent call last)
<ipython-input-13-22015468e7a7> in <module>
      1 data_dir = tempfile.mkdtemp()
      2 dataset = install(data_dir, "///openneuro/ds000164")
----> 3 dataset.get("sub-001/func/")

~/.conda/envs/nibetaseries/lib/python3.6/site-packages/datalad/distribution/dataset.py in apply_func(wrapped, instance, args, kwargs)
    492             elif i >= ds_index:
    493                 kwargs[orig_pos[i+1]] = args[i]
--> 494         return f(**kwargs)
    495 
    496     setattr(Dataset, name, apply_func(f))

~/.conda/envs/nibetaseries/lib/python3.6/site-packages/datalad/interface/utils.py in eval_func(wrapped, instance, args, kwargs)
    477                     return results
    478 
--> 479             return return_func(generator_func)(*args, **kwargs)
    480 
    481     return eval_func(func)

~/.conda/envs/nibetaseries/lib/python3.6/site-packages/datalad/interface/utils.py in return_func(wrapped_, instance_, args_, kwargs_)
    465                     # unwind generator if there is one, this actually runs
    466                     # any processing
--> 467                     results = list(results)
    468                 # render summaries
    469                 if not result_xfm and result_renderer == 'tailored':

~/.conda/envs/nibetaseries/lib/python3.6/site-packages/datalad/interface/utils.py in generator_func(*_args, **_kwargs)
    453                 raise IncompleteResultsError(
    454                     failed=incomplete_results,
--> 455                     msg="Command did not complete successfully")
    456 
    457         if return_type == 'generator':

IncompleteResultsError: Command did not complete successfully [{'type': 'file', 'refds': '/tmp/tmpyq6w5m21', 'status': 'error', 'path': '/tmp/tmpyq6w5m21/sub-001/func/sub-001_task-stroop_bold.nii.gz', 'action': 'get', 'annexkey': 'MD5E-s50382260--2c571457278c2fcd07016f50abc07f79.nii.gz', 'message': 'from s3-PUBLIC...; Unable to access these remotes: s3-PUBLIC; Try making some of these repositories available:; \t2f45a4ca-eba9-46da-98f2-5ca487a87a67 -- [s3-PUBLIC];  \tc40c41af-4d97-418c-a33d-ffb7e596b0c7 -- root@cf3d3f9acfa2:/datalad/ds000164'}, {'action': 'get', 'path': '/tmp/tmpyq6w5m21/sub-001/func', 'type': 'directory', 'refds': '/tmp/tmpyq6w5m21', 'status': 'impossible', 'message': ('could not get some content in %s %s', '/tmp/tmpyq6w5m21/sub-001/func', ['/tmp/tmpyq6w5m21/sub-001/func/sub-001_task-stroop_bold.nii.gz'])}]
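One way to keep going despite such failures is to catch the exception, which carries the failed results (note the failed=incomplete_results in the constructor above); a sketch, not from my original session:

from datalad.support.exceptions import IncompleteResultsError

# a sketch: retrieve whatever is reachable and report what failed
try:
    dataset.get("sub-001/func/")
except IncompleteResultsError as err:
    print("could not get:", err.failed)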

So I worked around that issue by downloading from openfmri instead:

from datalad.api import install
import tempfile
import os
from subprocess import call

data_dir = tempfile.mkdtemp()
# same as before, but installing from the openfmri mirror of the dataset
dataset = install(data_dir, "///openfmri/ds000164")
dataset.get("sub-001/func/")

which works (yay!), but I’m curious if there’s something I should change to make it work on openneuro as well.
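(A side note on the INFO line in the stderr above: enabling the s3-PRIVATE sibling from python would look roughly like the sketch below, but that sibling requires AWS credentials, so it does not address the s3-PUBLIC failure.)

# a sketch following the INFO hint; s3-PRIVATE needs AWS_ACCESS_KEY_ID and
# AWS_SECRET_ACCESS_KEY set, so this alone will not fix anonymous access
dataset.siblings(action='enable', name='s3-PRIVATE')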

How should I download extra data (e.g. fmriprep results) to a datalad repository?

From the docs it looks like I can download extra data using the download_url method attached to my dataset variable, but I had to fall back on the awscli to download the data.

fmriprep_res = "s3://openneuro.outputs/921294bd5b869b1852ab3ce886583795/4dd151e3-52d1-4fa2-9591-27c16520331c"
try:
    # currently not working (error message below)
    dataset.download_url(fmriprep_res)
except Exception:
    # fall back to the AWS CLI; depends on the user having awscli
    # installed: https://pypi.org/project/awscli/
    call(['aws',
          '--no-sign-request',
          's3',
          'sync',
          fmriprep_res,
          os.path.join(data_dir, 'derivatives')
          ])

I got the following error message for dataset.download_url(fmriprep_res):

[INFO] Downloading 's3://openneuro.outputs/921294bd5b869b1852ab3ce886583795/4dd151e3-52d1-4fa2-9591-27c16520331c' into '/tmp/tmp28sdodg0' 
[INFO] S3 session: Connecting to the bucket openneuro.outputs anonymously 
Anonymous access to s3://openneuro.outputs/921294bd5b869b1852ab3ce886583795/4dd151e3-52d1-4fa2-9591-27c16520331c has failed.
Do you want to enter other credentials in case they were updated? (choices: yes, no): no

Whichever I choose (yes or no), it fails; however, an anonymous download via the aws CLI appears to succeed.
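For completeness, that aws fallback can also be done in plain python; here is a sketch using boto3 (not part of my original attempt; pip install boto3), subject to the same anonymous-access permissions on the bucket:

import os
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# a sketch: anonymous (unsigned) S3 access, mirroring `aws --no-sign-request s3 sync`
bucket = 'openneuro.outputs'
prefix = '921294bd5b869b1852ab3ce886583795/4dd151e3-52d1-4fa2-9591-27c16520331c'

s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))
for page in s3.get_paginator('list_objects_v2').paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get('Contents', []):
        dest = os.path.join(data_dir, 'derivatives', os.path.relpath(obj['Key'], prefix))
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        s3.download_file(bucket, obj['Key'], dest)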

I’m curious whether I have an incorrect version of git-annex or some settings that are not correct on my end, but before testing in other environments I thought I would ask the community to see if anyone else has had these types of problems.

Thanks!
James

Thanks, James, for the detailed report, and please accept my sincere apologies for your struggles!

Unfortunately there are still some outstanding issues with some of the datalad datasets provided by openneuro, see e.g. https://github.com/OpenNeuroOrg/datalad-service/issues/79 (or my other issues filed there), which I hope will get resolved soon. Despite those known issues we decided to upload/provide them from http://datasets.datalad.org, hoping that since the data is not available we have not aggregated their metadata, and so those problematic ones would not be found :wink: You were not supposed to discover it! :wink:

For a proper fix in this particular case, the openneuro folks need to generate those .rmet files for git-annex with the versioning information they have in their DB, and then the files would become available via git-annex.

As for credentials etc. in this particular regard: there was a fix in annex very recently to provide more sensible error reporting (see e.g. https://github.com/OpenNeuroOrg/datalad-service/issues/79#issuecomment-444963064), so for me it now looks like:

(git-annex)hopa:/tmp/ds000164[git-annex]
$> git annex get T1w_group.html  
get T1w_group.html (from s3-PUBLIC...) 

  Remote is configured to use versioning, but no S3 version ID is recorded for this key

  Unable to access these remotes: s3-PUBLIC

  Try making some of these repositories available:
  	2f45a4ca-eba9-46da-98f2-5ca487a87a67 -- [s3-PUBLIC]
   	c40c41af-4d97-418c-a33d-ffb7e596b0c7 -- root@cf3d3f9acfa2:/datalad/ds000164
failed
git-annex: get: 1 failed

$> git annex version
git-annex version: 7.20181205+git27-g21eaaac6e-1~ndall+1

So the error message is right to the point. You could then “work around” it by disabling versioning for this special remote (locally):

$> git annex enableremote s3-PUBLIC versioning=no                                               
enableremote s3-PUBLIC ok                        
(recording state in git...)

$> git annex get T1w_group.html        
get T1w_group.html (from s3-PUBLIC...) (checksum...) ok
(recording state in git...)

which seems to work. But again, it is just a workaround for now, since the proper solution is to get those .rmet files populated so that even prior versions (whenever new ones get uploaded) remain available. I am not even sure whether to suggest to Joey (the git-annex guy) to provide a similar fallback to try the non-versioned URL if no .rmet is found; that might lead to a useless download that only fails checksumming against the current state of the repo.
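If you prefer to stay in python, a minimal sketch of the same workaround (assuming the dataset and data_dir objects from your snippet above) just shells out to git-annex inside the dataset:

from subprocess import check_call

# a sketch: apply the same versioning=no workaround from python by
# running git-annex inside the installed dataset (data_dir as above)
check_call(['git', 'annex', 'enableremote', 's3-PUBLIC', 'versioning=no'],
           cwd=data_dir)
# afterwards the get should succeed from s3-PUBLIC
dataset.get("sub-001/func/")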

Hm, I think you hit the bug, if your “experience” is similar to mine:

$> datalad ls s3://openneuro.outputs/
Connecting to bucket: openneuro.outputs
[INFO   ] S3 session: Connecting to the bucket openneuro.outputs anonymously 
Anonymous access to s3://openneuro.outputs/ has failed.
Do you want to enter other credentials in case they were updated? (choices: yes, no): yes

[ERROR  ] 'NoneType' object has no attribute 'enter_new' [base.py:_enter_credentials:290] (AttributeError)

(filed now in datalad as https://github.com/datalad/datalad/issues/3090), so let me know if that is what you see and I will address it. As a workaround, see the crawling approach further below.

But now to the 2nd aspect: that bucket "s3://openneuro.outputs", is it really fully public? I don’t think so; it seems to require authentication to access it. Don’t you have credentials under .aws/credentials, so that your aws invocation manages to access it? aws sync does indeed work (if I have credentials in .aws/credentials), so maybe it is just that the actions needed for listing were not permitted while some others (for get’ing) were…?

EDIT1 (apparently I can’t post more than 3 consecutive replies, heh):
the datalad issue was quickly fixed up (in master), thanks Kyle.
openneuro.outputs bucket: I’ve inquired with the openneuro people: https://github.com/OpenNeuroOrg/openneuro/issues/1017

But in principle, you can “datalad crawl” that “folder” on S3 and create the dataset from it yourself (after we figure out the access issues). Here is an example on the original openneuro bucket, for the fun of it. If the bucket is fully public (like openneuro.org), then you could also add to_http=True so it would be populated with http:// URLs instead of s3:// ones:

datalad create crawl-ds
cd crawl-ds
datalad crawl-init --save --template simple_s3 bucket=openneuro.org prefix=ds000164/derivatives/
datalad crawl

Sit back and relax while it does its thing. Note that you will need to install the datalad-crawler extension (pip install datalad-crawler).
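Since you are driving things from python where you can, the same steps can also be scripted by shelling out to the datalad CLI; a minimal sketch wrapping exactly the commands above:

import os
from subprocess import check_call

# a sketch: the same crawl recipe as above, driven from python;
# requires the datalad-crawler extension (pip install datalad-crawler)
check_call(['datalad', 'create', 'crawl-ds'])
os.chdir('crawl-ds')
check_call(['datalad', 'crawl-init', '--save',
            '--template', 'simple_s3',
            'bucket=openneuro.org', 'prefix=ds000164/derivatives/'])
check_call(['datalad', 'crawl'])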

Thank you so much for this detailed response! I am trying to improve my datalad-fu so that I will be able to combine open datasets with our own (and eventually help share our data).

For the examples I showed, I did not have an AWS credentials file, and yes, that was the error I was seeing: [ERROR] 'NoneType' object has no attribute 'enter_new' [base.py:_enter_credentials:290] (AttributeError). That fix was really fast (thanks, Kyle!)

I’ll report back once I’ve tried datalad-crawler.