How to crawl with datalad?

Datalad beginners question here.

Say I happen to be interested in the following 2 datasets.

From what I understand, openfmri is transitioning to openneuro, and that is why those datasets do not show up in datalad’s superdataset.

From what @ChrisGorgolewski wrote it is possible to get the datasets (from openneuro at least) with

aws s3 sync --no-sign-request s3://openneuro.org/ds001499 <target_folder>

But say I wanted to use datalad instead. I figured that I could still use datalad to crawl openneuro and to “install” and “get” them. See for example: Updating #datalad datasets

But if I try to do the same I just tend to get this:

datalad crawl-init --save --template=simple_s3 bucket=openneuro to_http=1 prefix=ds000030 exclude=derivatives
[ERROR ] could not find pipeline for simple_s3 [pipeline.py:load_pipeline_from_template:467] (PipelineNotSpecifiedError)

I have had a look at the doc but it is a bit “light” when it comes to the different options for the arguments template, template_func, bucket …

I am not exactly sure how to go about this so any pointer would be good.

Thanks

Dear Remi-Gau,

Thank you for the interest/question(s).

OpenNeuro

OpenNeuro uses datalad as a backend, and publishes those git repositories to github, so there is no need to crawl (but read on) - they should be installable directly from github, e.g.

smaug:/mnt/btrfs/scrap/tmp/openneuro
$> datalad install -g https://github.com/OpenNeuroDatasets/ds001547
[INFO   ] Cloning https://github.com/OpenNeuroDatasets/ds001547 [1 other candidates] into '/mnt/btrfs/scrap/tmp/openneuro/ds001547' 
[INFO   ] access to dataset sibling "s3-PRIVATE" not auto-enabled, enable with:
| 		datalad siblings -d "/mnt/btrfs/scrap/tmp/openneuro/ds001547" enable -s s3-PRIVATE 
install(ok): /mnt/btrfs/scrap/tmp/openneuro/ds001547 (dataset)
^CTotal (2 ok out of 42)  4%|██▏                                              | 550M/12.5G [00:19<09:44, 20.4MB/ERROR:                                                                                                          
Interrupted by user while doing magic: KeyboardInterrupt() [cmd.py:_process_one_line:348]

Outstanding Issues

Some datasets might need publicurl fixup

Unfortunately there is an outstanding issue, so for some (older) datasets you would need to also fix up the publicurl field (git annex enableremote s3-PUBLIC publicurl=http://openneuro.org.s3.amazonaws.com/) before you would be able to get data:

smaug:/mnt/btrfs/scrap/tmp/openneuro
$> datalad install https://github.com/OpenNeuroDatasets/ds001499 
[INFO   ] Cloning https://github.com/OpenNeuroDatasets/ds001499 [1 other candidates] into '/mnt/btrfs/scrap/tmp/openneuro/ds001499' 
install(ok): /mnt/btrfs/scrap/tmp/openneuro/ds001499 (dataset)
1 14189.....................................:Thu 01 Nov 2018 12:22:57 PM EDT:.
smaug:/mnt/btrfs/scrap/tmp/openneuro
$> cd ds001499 
CHANGES  dataset_description.json  sub-CSI1/  sub-CSI3/  task-5000scenes_bold.json
README   derivatives/              sub-CSI2/  sub-CSI4/  task-localizer_bold.json
1 14190.....................................:Thu 01 Nov 2018 12:24:49 PM EDT:.
(git)smaug:/mnt/btrfs/scrap/tmp/openneuro/ds001499[master]
$> git annex enableremote s3-PUBLIC publicurl=http://openneuro.org.s3.amazonaws.com/
enableremote s3-PUBLIC ok
(recording state in git...)
1 14191.....................................:Thu 01 Nov 2018 12:24:53 PM EDT:.
(git)smaug:/mnt/btrfs/scrap/tmp/openneuro/ds001499[master]
$> datalad get sub-CSI1/ses-01/fmap
get(ok): /mnt/btrfs/scrap/tmp/openneuro/ds001499/sub-CSI1/ses-01/fmap/sub-CSI1_ses-01_acq-spinechopf68_dir-AP_epi.nii.gz (file) [from s3-PRIVATE...; from s3-PUBLIC...]
...
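(If in doubt whether a given dataset needs that fixup: the special remote configuration, publicurl included, is recorded in the git-annex branch, so a quick check from within the dataset could be)

# list any publicurl already recorded for the special remotes;
# if one shows up, the enableremote fixup above should not be needed
git show git-annex:remote.log | grep -o 'publicurl=[^ ]*'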

Information on previous versions is not yet populated

Another outstanding aspect/issue is that ATM only the most recent version of files would be available to you. It seems that many of the pieces to fix that up “in deployment” are already developed, so I guess that will come soon (also read on later for a possible workaround if you really need access to previous versions).

But probably ATM you need just the most recent one anyways, and unfortunately there are no version tags in those repositories yet either:

1 14177 ->1.....................................:Thu 01 Nov 2018 12:18:19 PM EDT:.
smaug:/mnt/btrfs/scrap/tmp/openneuro
$> cd ds001547 
CHANGES  dataset_description.json  sub-180817ANDV1LGN/  sub-180921JOSV1LGN/  sub-181004LEEV1LGN/
README   participants.tsv          sub-180921DANV1LGN/  sub-180928CHEV1LGN/
1 14178.....................................:Thu 01 Nov 2018 12:18:27 PM EDT:.
(git)smaug:/mnt/btrfs/scrap/tmp/openneuro/ds001547[master]
$> git describe
fatal: No names found, cannot describe anything.
1 14179 ->128.....................................:Thu 01 Nov 2018 12:18:29 PM EDT:.
(git)smaug:/mnt/btrfs/scrap/tmp/openneuro/ds001547[master]
$> git describe --tags
fatal: No names found, cannot describe anything.
1 14180 ->128.....................................:Thu 01 Nov 2018 12:18:31 PM EDT:.
(git)smaug:/mnt/btrfs/scrap/tmp/openneuro/ds001547[master]git
$> cat CHANGES 
1.1.0	2018-10-12

	- abastract is extended
...

NB filed a new issue about absent tags: https://github.com/OpenNeuroOrg/datalad-service/issues/72

DataLad superdataset

Whenever those outstanding issues are resolved, we will start crawling/providing those repos also from http://datasets.datalad.org . It is yet to be decided on which end the metadata aggregation should happen.
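Once that happens, they should also become installable via the /// shortcut, which resolves to http://datasets.datalad.org ; a minimal sketch (the openneuro/ subdirectory is just my guess at the eventual layout):

# install the top-level superdataset from http://datasets.datalad.org
datalad install ///
# and then an individual dataset by its (hypothetical) path underneath
datalad install ///openneuro/ds001499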

Crawling

Sorry about the “light” docs on the crawler, indeed yet another outstanding issue :wink: but you seemed to do it all correctly and it should have worked:


smaug:/mnt/btrfs/scrap/tmp
$> datalad create testds
[INFO   ] Creating a new annex repo at /mnt/btrfs/scrap/tmp/testds 
create(ok): /mnt/btrfs/scrap/tmp/testds (dataset)
$> cd testds
$> datalad crawl-init --save --template=simple_s3 bucket=openneuro to_http=1 prefix=ds000030 exclude=derivatives
[INFO   ] Creating a pipeline for the openneuro bucket 
[WARNING] ATM we assume prefixes to correspond only to directories, adding / 
(dev) 1 14225.....................................:Thu 01 Nov 2018 12:55:28 PM EDT:.
(git)smaug:/mnt/btrfs/scrap/tmp/testds[master]
$> datalad crawl
[INFO   ] Loading pipeline specification from ./.datalad/crawl/crawl.cfg 
[INFO   ] Creating a pipeline for the openneuro bucket 
[WARNING] ATM we assume prefixes to correspond only to directories, adding / 
[INFO   ] Running pipeline [<datalad_crawler.nodes.s3.crawl_s3 object at 0x7f5a6165e290>, sub(ok_missing=True, subs=<<{'url': {'^s3://([^/]*...>>), switch(default=None, key='datalad_action', mapping=<<{'commit': <function _...>>, re=False)] 
[INFO   ] S3 session: Connecting to the bucket openneuro with authentication 
... (no time to wait atm, hope it works ;-))
^CERROR: 
Interrupted by user while doing magic: KeyboardInterrupt() [ssl.py:read:653]

so it might be either a bug, or maybe you are somehow in possession of a too-old datalad? I have 0.10.3.1 with the datalad-crawler extension 0.2-16-ge7f192a (so we might need to release, but I do not think there were changes since then which could affect you):

$> git diff 0.2.. --stat
 datalad_crawler/nodes/annex.py                    |  6 +++++-
 datalad_crawler/pipelines/crcns.py                | 10 +++++-----
 datalad_crawler/pipelines/openfmri.py             |  2 +-
 datalad_crawler/pipelines/simple_with_archives.py |  6 +++---
 datalad_crawler/pipelines/stanford_lib.py         | 57 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 datalad_crawler/pipelines/tests/test_openfmri.py  | 13 ++++++++-----
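To double check which versions you have on your end, something along these lines should do (the pip line assumes the crawler was installed via pip):

# datalad core version
datalad --version
# crawler extension version (assuming a pip-based install)
pip show datalad-crawler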

Re BOLD5000: unless it is urgent (in which case it could be done manually via datalad addurls, or just datalad download-url + datalad add-archive-content), we should get a dedicated crawler pipeline to take advantage of the FigShare API. Initiated an issue
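For completeness, a rough sketch of that manual route (the figshare URL and the resulting file name are placeholders, not the actual BOLD5000 ones):

datalad create bold5000-manual && cd bold5000-manual
# download the archive and record its origin URL with git-annex
datalad download-url https://ndownloader.figshare.com/files/XXXXXXXX
# extract the archive content into the dataset, keeping provenance
datalad add-archive-content XXXXXXXX
# or, given a table with url/filename columns, register files one by one:
# datalad addurls files.csv '{url}' '{filename}'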


Wow! Thanks @yarikoptic for this long and detailed reply. :heart_eyes:

I will check all those options you describe and get back to you to let you know if all of them worked! :slight_smile:

OK, I took the time to get back to this and there was clearly some “Error: incompetent user” on my end. I had forgotten to add the NeuroDebian repo before installing datalad, so let’s just say that I was working with a fairly antiquated version of datalad.
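For anyone on a Debian/Ubuntu system tripping over the same thing, the setup is roughly the following; the exact sources line and mirror come from the selector on http://neuro.debian.net , the one below is just the Ubuntu 18.04 / us-nh example:

# add the NeuroDebian repository (example line for Ubuntu 18.04, us-nh mirror)
wget -O- http://neuro.debian.net/lists/bionic.us-nh.full | sudo tee /etc/apt/sources.list.d/neurodebian.sources.list
# import the NeuroDebian archive key
sudo apt-key adv --recv-keys --keyserver hkp://pool.sks-keyservers.net:80 0xA5D32F012649A5A9
# install a current datalad from there
sudo apt-get update && sudo apt-get install datalad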

I now have version 0.11.0, and installing datasets “via github” works:

remi-gau@DESKTOP-ETLFG7N:/mnt/d/BIDS$ datalad install https://github.com/OpenNeuroDatasets/ds001547
[INFO   ] Cloning https://github.com/OpenNeuroDatasets/ds001547 [1 other candidates] into '/mnt/d/BIDS/ds001547'
[INFO   ]   Detected a filesystem without fifo support.
[INFO   ]   Disabling ssh connection caching.
[INFO   ]   Remote origin not usable by git-annex; setting annex-ignore
[INFO   ] access to dataset sibling "s3-PRIVATE" not auto-enabled, enable with:
|               datalad siblings -d "/mnt/d/BIDS/ds001547" enable -s s3-PRIVATE
install(ok): /mnt/d/BIDS/ds001547 (dataset)

remi-gau@DESKTOP-ETLFG7N:/mnt/d/BIDS$ cd ds001547/

remi-gau@DESKTOP-ETLFG7N:/mnt/d/BIDS/ds001547$ datalad get sub-180817ANDV1LGN/anat/sub-180817ANDV1LGN_T1w.json

remi-gau@DESKTOP-ETLFG7N:/mnt/d/BIDS/ds001547$ ls -l sub-180817ANDV1LGN/anat/sub-180817ANDV1LGN_T1w.*
-rwxrwxrwx 1 remi-gau remi-gau 782 Nov 10 14:25 sub-180817ANDV1LGN/anat/sub-180817ANDV1LGN_T1w.json
lrwxrwxrwx 1 remi-gau remi-gau 136 Nov 10 14:25 sub-180817ANDV1LGN/anat/sub-180817ANDV1LGN_T1w.nii -> ../../.git/annex/objects/F1/9q/MD5E-s20796448--6b24b9fa26fecc0ed152f4c94404a426.nii/MD5E-s20796448--6b24b9fa26fecc0ed152f4c94404a426.nii

Strangely, for ds001499 I did not need to run git annex enableremote s3-PUBLIC publicurl=http://openneuro.org.s3.amazonaws.com/ so I am not sure what I did there.

remi-gau@DESKTOP-ETLFG7N:/mnt/d/BIDS$ datalad install https://github.com/OpenNeuroDatasets/ds001499
[INFO   ] Cloning https://github.com/OpenNeuroDatasets/ds001499 [1 other candidates] into '/mnt/d/BIDS/ds001499'
[INFO   ]   Detected a filesystem without fifo support.
[INFO   ]   Disabling ssh connection caching.
[INFO   ]   Remote origin not usable by git-annex; setting annex-ignore
install(ok): /mnt/d/BIDS/ds001499 (dataset)
remi-gau@DESKTOP-ETLFG7N:/mnt/d/BIDS$ cd ds001499/
remi-gau@DESKTOP-ETLFG7N:/mnt/d/BIDS/ds001499$ datalad get sub-CSI1/ses-01/fmap/sub-CSI1_ses-01_acq-PA_epi.json
remi-gau@DESKTOP-ETLFG7N:/mnt/d/BIDS/ds001499$ ls -l sub-CSI1/ses-01/fmap/sub-CSI1_ses-01_acq-PA_*
-rwxrwxrwx 1 remi-gau remi-gau 2296 Nov 10 14:55 sub-CSI1/ses-01/fmap/sub-CSI1_ses-01_acq-PA_epi.json
lrwxrwxrwx 1 remi-gau remi-gau  143 Nov 10 14:55 sub-CSI1/ses-01/fmap/sub-CSI1_ses-01_acq-PA_epi.nii.gz -> ../../../.git/annex/objects/m5/f1/MD5E-s2383850--bb099958e80759eb8dc3bb729baeebb2.nii.gz/MD5E-s2383850--bb099958e80759eb8dc3bb729baeebb2.nii.gz

Point of clarification: crawling could be done via the command line in older versions of datalad, but can now only be done via the Python API, correct?

Because after updating datalad, calling datalad crawl-init from the command line gives me

datalad: Unknown command 'crawl-init'.  See 'datalad --help'.
Hint: Command crawl-init is provided by (not installed) extension datalad-crawler.

And I am at a loss as to how to install the extension outside a Python env. Am I missing something?

Hey,

assuming you are using a virtualenv for datalad, make sure you activate that environment, and run

pip install datalad-crawler

That will give you the crawler extension, and with it also the cmdline commands. If the Python API is working for you, the cmdline API should have exactly the same capabilities. To double-check that, run

>>> import os
>>> os.system('datalad --help')

The list of commands in the output should include the crawler commands.
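Equivalently, straight from the shell, and (in case you do not have a virtualenv set up yet) a minimal sketch for creating one; the path is just an example:

# once datalad-crawler is installed, its commands show up in the main help
datalad --help | grep -i crawl

# setting up a dedicated virtualenv with both datalad and the crawler
python3 -m venv ~/venvs/datalad
source ~/venvs/datalad/bin/activate
pip install datalad datalad-crawler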

HTH
