Dear Remi-Gau,
Thank you for the interest/question(s).
OpenNeuro
OpenNeuro uses DataLad as a backend and publishes the resulting git repositories to GitHub, so there is no need to crawl (but read on): the datasets should be installable directly from GitHub, e.g.
smaug:/mnt/btrfs/scrap/tmp/openneuro
$> datalad install -g https://github.com/OpenNeuroDatasets/ds001547
[INFO ] Cloning https://github.com/OpenNeuroDatasets/ds001547 [1 other candidates] into '/mnt/btrfs/scrap/tmp/openneuro/ds001547'
[INFO ] access to dataset sibling "s3-PRIVATE" not auto-enabled, enable with:
| datalad siblings -d "/mnt/btrfs/scrap/tmp/openneuro/ds001547" enable -s s3-PRIVATE
install(ok): /mnt/btrfs/scrap/tmp/openneuro/ds001547 (dataset)
^CTotal (2 ok out of 42) 4%|██▏ | 550M/12.5G [00:19<09:44, 20.4MB/ERROR:
Interrupted by user while doing magic: KeyboardInterrupt() [cmd.py:_process_one_line:348]
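If you need several datasets, the same install call can be scripted. Below is a minimal sketch, assuming the accession number maps directly onto the repository name as in the example above. `openneuro_url`, `install_openneuro` and the `DRY_RUN` switch are hypothetical helpers for illustration, not part of datalad.

```shell
#!/bin/sh
# Build the GitHub URL for an OpenNeuro accession number (e.g. ds001547).
# Assumption: the repository name equals the accession number.
openneuro_url() {
    printf 'https://github.com/OpenNeuroDatasets/%s\n' "$1"
}

# Install each listed dataset; with DRY_RUN=1 only print the commands.
install_openneuro() {
    for ds in "$@"; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "datalad install $(openneuro_url "$ds")"
        else
            datalad install "$(openneuro_url "$ds")" || return 1
        fi
    done
}
```

Setting `DRY_RUN=1` and calling `install_openneuro ds001547 ds001499` only prints the two commands, which is handy for checking the list first. Add `-g` to the `datalad install` call if you also want the data fetched right away, as in the session above.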
Outstanding Issues
Some datasets might need publicurl fixup
Unfortunately there is an outstanding issue, so for some (older) datasets you would also need to fix up the publicurl field (git annex enableremote s3-PUBLIC publicurl=http://openneuro.org.s3.amazonaws.com/) before you are able to get data:
smaug:/mnt/btrfs/scrap/tmp/openneuro
$> datalad install https://github.com/OpenNeuroDatasets/ds001499
[INFO ] Cloning https://github.com/OpenNeuroDatasets/ds001499 [1 other candidates] into '/mnt/btrfs/scrap/tmp/openneuro/ds001499'
install(ok): /mnt/btrfs/scrap/tmp/openneuro/ds001499 (dataset)
1 14189.....................................:Thu 01 Nov 2018 12:22:57 PM EDT:.
smaug:/mnt/btrfs/scrap/tmp/openneuro
$> cd ds001499
CHANGES dataset_description.json sub-CSI1/ sub-CSI3/ task-5000scenes_bold.json
README derivatives/ sub-CSI2/ sub-CSI4/ task-localizer_bold.json
1 14190.....................................:Thu 01 Nov 2018 12:24:49 PM EDT:.
(git)smaug:/mnt/btrfs/scrap/tmp/openneuro/ds001499[master]
$> git annex enableremote s3-PUBLIC publicurl=http://openneuro.org.s3.amazonaws.com/
enableremote s3-PUBLIC ok
(recording state in git...)
1 14191.....................................:Thu 01 Nov 2018 12:24:53 PM EDT:.
(git)smaug:/mnt/btrfs/scrap/tmp/openneuro/ds001499[master]
$> datalad get sub-CSI1/ses-01/fmap
get(ok): /mnt/btrfs/scrap/tmp/openneuro/ds001499/sub-CSI1/ses-01/fmap/sub-CSI1_ses-01_acq-spinechopf68_dir-AP_epi.nii.gz (file) [from s3-PRIVATE...; from s3-PUBLIC...]
...
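The fix-up from this session can be wrapped in a small function so it is easy to apply to each affected clone. This is only a sketch: `fix_publicurl` and `DRY_RUN` are hypothetical names, the remote name and URL are copied from the session above, and it assumes the clone actually has an `s3-PUBLIC` special remote.

```shell
#!/bin/sh
# Public bucket URL, as used in the session above.
PUBLICURL=http://openneuro.org.s3.amazonaws.com/

# Point the s3-PUBLIC special remote of the dataset in "$1" at PUBLICURL.
# With DRY_RUN=1 the command is printed instead of executed.
fix_publicurl() {
    cmd="git annex enableremote s3-PUBLIC publicurl=$PUBLICURL"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "(cd $1 && $cmd)"
    else
        # Subshell, so the caller's working directory is left untouched.
        (cd "$1" && $cmd)
    fi
}
```

After `fix_publicurl ds001499`, a `datalad get` inside that dataset should behave as shown above. Which datasets need the fix-up is not something the function can tell you, so run it only on clones where `datalad get` fails.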
Information on previous versions is not yet populated
Another outstanding issue is that ATM only the most recent version of the files is available to you. It seems that many of the pieces needed to fix this have already been developed and are “in deployment”, so I guess that will come soon (also read on for a possible workaround if you really need access to previous versions).
But ATM you probably need just the most recent one anyway, and unfortunately there is
1 14177 ->1.....................................:Thu 01 Nov 2018 12:18:19 PM EDT:.
smaug:/mnt/btrfs/scrap/tmp/openneuro
$> cd ds001547
CHANGES dataset_description.json sub-180817ANDV1LGN/ sub-180921JOSV1LGN/ sub-181004LEEV1LGN/
README participants.tsv sub-180921DANV1LGN/ sub-180928CHEV1LGN/
1 14178.....................................:Thu 01 Nov 2018 12:18:27 PM EDT:.
(git)smaug:/mnt/btrfs/scrap/tmp/openneuro/ds001547[master]
$> git describe
fatal: No names found, cannot describe anything.
1 14179 ->128.....................................:Thu 01 Nov 2018 12:18:29 PM EDT:.
(git)smaug:/mnt/btrfs/scrap/tmp/openneuro/ds001547[master]
$> git describe --tags
fatal: No names found, cannot describe anything.
1 14180 ->128.....................................:Thu 01 Nov 2018 12:18:31 PM EDT:.
(git)smaug:/mnt/btrfs/scrap/tmp/openneuro/ds001547[master]
$> cat CHANGES
1.1.0 2018-10-12
- abastract is extended
...
NB filed a new issue about absent tags: https://github.com/OpenNeuroOrg/datalad-service/issues/72
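Until tags show up, the version of a clone can be read from CHANGES instead. Here is a rough sketch, assuming the newest entry is the first non-empty line of the file and starts with the version number, as in the CHANGES output above. `changes_version` and `dataset_version` are hypothetical helpers.

```shell
#!/bin/sh
# Print the first field of the first non-empty line of a CHANGES file.
# Assumption: the newest entry comes first and starts with its version.
changes_version() {
    awk 'NF { print $1; exit }' "$1"
}

# Print the version of the dataset in directory "$1": the git tag if
# `git describe --tags` finds one, otherwise whatever CHANGES reports.
dataset_version() {
    git -C "$1" describe --tags 2>/dev/null && return 0
    changes_version "$1/CHANGES"
}
```

For a clone without tags, such as ds001547 above, `dataset_version ds001547` falls back to the CHANGES file.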
DataLad superdataset
Once those outstanding issues are resolved, we will start crawling/providing those repos from http://datasets.datalad.org as well. We have yet to decide on which end metadata aggregation should happen.
Crawling
Sorry about the “light” docs on the crawler, which is indeed yet another outstanding issue, but you seem to have done it all correctly and it should have worked:
smaug:/mnt/btrfs/scrap/tmp
$> datalad create testds
[INFO ] Creating a new annex repo at /mnt/btrfs/scrap/tmp/testds
create(ok): /mnt/btrfs/scrap/tmp/testds (dataset)
$> cd testds
$> datalad crawl-init --save --template=simple_s3 bucket=openneuro to_http=1 prefix=ds000030 exclude=derivatives
[INFO ] Creating a pipeline for the openneuro bucket
[WARNING] ATM we assume prefixes to correspond only to directories, adding /
(dev) 1 14225.....................................:Thu 01 Nov 2018 12:55:28 PM EDT:.
(git)smaug:/mnt/btrfs/scrap/tmp/testds[master]
$> datalad crawl
[INFO ] Loading pipeline specification from ./.datalad/crawl/crawl.cfg
[INFO ] Creating a pipeline for the openneuro bucket
[WARNING] ATM we assume prefixes to correspond only to directories, adding /
[INFO ] Running pipeline [<datalad_crawler.nodes.s3.crawl_s3 object at 0x7f5a6165e290>, sub(ok_missing=True, subs=<<{'url': {'^s3://([^/]*...>>), switch(default=None, key='datalad_action', mapping=<<{'commit': <function _...>>, re=False)]
[INFO ] S3 session: Connecting to the bucket openneuro with authentication
... (no time to wait atm, hope it works ;-))
^CERROR:
Interrupted by user while doing magic: KeyboardInterrupt() [ssl.py:read:653]
So it might be either a bug, or maybe you somehow have too old a datalad? I have 0.10.3.1 with the datalad-crawler extension 0.2-16-ge7f192a (so we might need to release, but I do not think there were changes since then which could affect you):
$> git diff 0.2.. --stat
datalad_crawler/nodes/annex.py | 6 +++++-
datalad_crawler/pipelines/crcns.py | 10 +++++-----
datalad_crawler/pipelines/openfmri.py | 2 +-
datalad_crawler/pipelines/simple_with_archives.py | 6 +++---
datalad_crawler/pipelines/stanford_lib.py | 57 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
datalad_crawler/pipelines/tests/test_openfmri.py | 13 ++++++++-----