Updating #datalad datasets

oesteban · December 22, 2017, 11:31pm

Hi,

I’m confused about how I should update a datalad subdataset.

Back in September I ran:

datalad install -r ///openfmri

Now, there are 4 new datasets and I can’t get them:

datalad update -r openfmri/
$ datalad install -r openfmri/ds000248
get(impossible): /oak/stanford/groups/russpold/data/openfmri/ds000248 [path does not exist]

Am I missing something before installing new datasets?

eknahm · December 27, 2017, 8:30am

Hey,

If you add the --merge option to the update call, you should be back in business. On its own, update only fetches updated information on file availability and remote branches. To get new datasets onto your filesystem, those updates need to be merged into the current local branch.
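
The two steps can be sketched as a small shell function. This is only an illustration under assumptions: refresh_and_install is a hypothetical helper name, and the path is placed before --merge because some later DataLad releases accept an optional value after that flag.

```shell
# refresh_and_install is a hypothetical helper name, not a DataLad command.
#   $1: path to the superdataset
#   $2: name of a subdataset published after the original install
refresh_and_install() {
    # Fetch and merge, so the new subdataset entries reach the local branch.
    datalad update -r "$1" --merge || return 1
    # The subdataset path is now known to the superdataset and can be installed.
    datalad install -r "$1/$2"
}

# Example call (commented out; it needs DataLad and network access):
# refresh_and_install openfmri ds000248
```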

HTH,

Michael

oesteban · December 28, 2017, 3:31am

Hi @eknahm, it worked indeed for one of my two datalad repos. Thanks very much.

However, I tried the same in a second datalad repo, where I had tinkered a bit with the underlying git out of desperation, and got this:

$ datalad update -r --merge .
[INFO   ] Updating dataset '/oak/stanford/groups/russpold/data/openfmri' ...
[INFO   ] Merging updates...
update(ok): /oak/stanford/groups/russpold/data/openfmri (dataset)
[INFO   ] Updating dataset '/oak/stanford/groups/russpold/data/openfmri/ds000001' ...
[INFO   ] Merging updates...
update(ok): /oak/stanford/groups/russpold/data/openfmri/ds000001 (dataset)
[INFO   ] Updating dataset '/oak/stanford/groups/russpold/data/openfmri/ds000002' ...
[ERROR  ] Cmd('/share/PI/russpold/software/git-annex-6.20171109-gf187a8db6/git') failed due to: exit code(128)
|   cmdline: /share/PI/russpold/software/git-annex-6.20171109-gf187a8db6/git -c receive.autogc=0 -c gc.auto=0 fetch --prune -v origin
|   stderr: 'fatal: repository 'http://datasets.datalad.org/openfmri/.git/ds000002/' not found' [cmd.py:wait:418] (GitCommandError)

Now, I don’t know how to get it back to normal without reinstalling all subdatasets. At the moment, the superdataset ///openfmri is following master at the latest commit, 738714cf9daf789d4ea47b46c071498d2144ba51, which also seems to be the commit of my clean, working datalad repo. I manually ran git submodule sync and git submodule update. Still, I’m getting those GitCommandError failures.

How can I revert my wrongdoing?

Thanks very much
Cheers,
Oscar

oesteban · December 28, 2017, 3:49am

Ok, I think I found it. Apparently, the remote URL has changed. Running:

git remote set-url origin http://datasets.datalad.org/openfmri/ds000002/.git

allowed me to update the dataset.
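
To apply the same change to many subdatasets, something like the following might work. It is a sketch, not a tested recipe: fix_url and fix_all_origins are hypothetical helper names, and it assumes every stale URL has the .git/<dataset>/ form shown in the error output and that each ds* directory is a git repository with a remote named origin.

```shell
# fix_url is a hypothetical helper. It rewrites the stale form
#   http://datasets.datalad.org/openfmri/.git/ds000002/
# into the form that worked above:
#   http://datasets.datalad.org/openfmri/ds000002/.git
# URLs that do not end in /.git/<name>/ are printed unchanged.
fix_url() {
    printf '%s\n' "$1" | sed -E 's#/\.git/([^/]+)/?$#/\1/.git#'
}

# fix_all_origins is a hypothetical helper. Run it from the superdataset root.
fix_all_origins() {
    for ds in ds*/; do
        # Skip anything that is not a repository with an origin remote.
        old=$(git -C "$ds" remote get-url origin 2>/dev/null) || continue
        new=$(fix_url "$old")
        # Only touch remotes whose URL actually changes.
        [ "$old" = "$new" ] || git -C "$ds" remote set-url origin "$new"
    done
}
```

Calling fix_url on a single URL first is a cheap way to check the rewrite before running fix_all_origins.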

oesteban · December 28, 2017, 7:18am

That solution worked for all datasets under openfmri, except for ds000030.

I deleted and reinstalled the dataset to make sure it was clean. Still getting:

$ datalad get -r -J 8 sub-10159/anat/sub-10159_T1w.json
[ERROR  ] Try making some of these repositories available:
| 	00000000-0000-0000-0000-000000000001 -- web
|  	09ede57e-5ec2-484b-b6fb-8a632e5c7a4e -- [datalad-archives]
|  	41f07c30-3cfc-4de3-9fbc-84383f5156e6 -- yoh@smaug:/mnt/btrfs/datasets/datalad/crawl/openfmri/ds000030
|  [get(/oak/stanford/groups/russpold/data/openfmri/ds000030/sub-10159/anat/sub-10159_T1w.json)]
get(error): /oak/stanford/groups/russpold/data/openfmri/ds000030/sub-10159/anat/sub-10159_T1w.json (file) [Try making some of these repositories available:
	00000000-0000-0000-0000-000000000001 -- web
 	09ede57e-5ec2-484b-b6fb-8a632e5c7a4e -- [datalad-archives]
 	41f07c30-3cfc-4de3-9fbc-84383f5156e6 -- yoh@smaug:/mnt/btrfs/datasets/datalad/crawl/openfmri/ds000030
]

This seems to replicate in all datalad repos I have.

Cheers,
Oscar

yarikoptic · January 3, 2018, 4:00am

Do you observe it in any other dataset as well, or only in ds000030?
This one is “tricky” since it got populated with a VERY heavy (in number of files) derivative, and we haven’t yet adjusted the pipeline to modularize those away into subdatasets. I feel that for this one we should just ignore the derivative(s) for now and at least update the main URLs to account for the recreated openneuro bucket. But first I would like to know whether any other dataset is also problematic.

oesteban · January 3, 2018, 5:24am

I’ve seen it with some other datasets, but after several retries, I think only ds000030 is still failing. How can I get rid of the derivatives folder (in datalad language)?

I’ll look through all the logs and confirm that no other dataset is still failing.

oesteban · August 18, 2018, 12:57am

Hi @yarikoptic, I’m still stuck with this.

For ds000030 I ran datalad update -r . --merge. Then I checked the remotes: they looked incorrect to me, so I manually set a remote following the example of ds000001 (which works well).

I then ran git fetch origin, git checkout master, git pull, git rm -r derivatives/ and finally datalad get sub-10159/anat/sub-10159_T1w.nii.gz, and I still get:

[ERROR  ] Try making some of these repositories available:
| 	00000000-0000-0000-0000-000000000001 -- web
|  	09ede57e-5ec2-484b-b6fb-8a632e5c7a4e -- [datalad-archives]
|  	41f07c30-3cfc-4de3-9fbc-84383f5156e6 -- yoh@smaug:/mnt/btrfs/datasets/datalad/crawl/openfmri/ds000030
|  [get(/oak/stanford/groups/russpold/data/openfmri/ds000030/sub-10159/anat/sub-10159_T1w.nii.gz)]
get(error): /oak/stanford/groups/russpold/data/openfmri/ds000030/sub-10159/anat/sub-10159_T1w.nii.gz (file) [Try making some of these repositories available:
	00000000-0000-0000-0000-000000000001 -- web
 	09ede57e-5ec2-484b-b6fb-8a632e5c7a4e -- [datalad-archives]
 	41f07c30-3cfc-4de3-9fbc-84383f5156e6 -- yoh@smaug:/mnt/btrfs/datasets/datalad/crawl/openfmri/ds000030
]

Any suggestions?

yarikoptic · August 18, 2018, 2:54am

eh… sorry about that.

For that repo/file we apparently have only “datalad-archives” as the source (not really sure why it lists “web” as an available remote for it; will check with Joey):

$> git annex whereis sub-10159/anat/sub-10159_T1w.nii.gz
whereis sub-10159/anat/sub-10159_T1w.nii.gz (3 copies) 
  	00000000-0000-0000-0000-000000000001 -- web
   	09ede57e-5ec2-484b-b6fb-8a632e5c7a4e -- [datalad-archives]
   	41f07c30-3cfc-4de3-9fbc-84383f5156e6 -- yoh@smaug:/mnt/btrfs/datasets/datalad/crawl/openfmri/ds000030

  datalad-archives: dl+archive:MD5E-s3920586194--f5ecaf1365ea031dd6c20d0f958ed69b.tgz#path=ds030_R1.0.0/sub-10159/anat/sub-10159_T1w.nii.gz&size=11637742
  datalad-archives: dl+archive:MD5E-s3920586194--f5ecaf1365ea031dd6c20d0f958ed69b.tgz/ds030_R1.0.0/sub-10159/anat/sub-10159_T1w.nii.gz#size=11637742
  datalad-archives: dl+archive:MD5E-s4347673658--836cb09310fa22f7d2112c7f81e6258b.tgz#path=ds000030/sub-10159/anat/sub-10159_T1w.nii.gz&size=11637742
  datalad-archives: dl+archive:MD5E-s4349211504--2fe25908e474d782e8963fd31d6fe4b5.zip#path=ds000030/sub-10159/anat/sub-10159_T1w.nii.gz&size=11637742
  datalad-archives: dl+archive:MD5E-s4802398120--ce2d215f336e6dfa282d69cc35beb80d.tgz#path=sub-10159/anat/sub-10159_T1w.nii.gz&size=11637742
ok

but then all those archives seem to be no longer available from the (versioned) URLs where they used to be:

$> git annex whereis --key MD5E-s4802398120--ce2d215f336e6dfa282d69cc35beb80d.tgz
whereis MD5E-s4802398120--ce2d215f336e6dfa282d69cc35beb80d.tgz (1 copy) 
  	00000000-0000-0000-0000-000000000001 -- web

  web: http://openfmri.s3.amazonaws.com/tarballs/ds000030_R1.0.1_sub10150-10299.tgz?versionId=X3sfPmNxugxTtoez935C.PteHH40Dbtc
ok

$> datalad ls -aL s3://openfmri/tarballs/ds030_R1.0.0_10150- 
Connecting to bucket: openfmri
[INFO   ] S3 session: Connecting to the bucket openfmri 
Bucket info:
  Versioning: S3ResponseError: 403 Forbidden
     Website: S3ResponseError: 403 Forbidden
         ACL: S3ResponseError: 403 Forbidden
tarballs/ds030_R1.0.0_10150-10274.tgz 2017-11-18T20:44:17.000Z 3920586194 ver:null                              acl:AccessDenied  http://openfmri.s3.amazonaws.com/tarballs/ds030_R1.0.0_10150-10274.tgz?versionId=null [OK]

So only a non-versioned one is now available :-/ I vaguely remember the openfmri bucket going through some migration, so I guess we still have only stale URLs for this one.
Let me now finally crawl the extracted version while excluding derivatives, so that we at least get direct links to those files… hopefully it will be done in a day or so.
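
The whereis check above can be repeated for any annex key. A minimal sketch, assuming curl is installed and that a HEAD request is a fair proxy for availability (which may not hold for every server); web_urls, url_alive and check_key_urls are hypothetical helper names:

```shell
# web_urls is a hypothetical helper: it reads `git annex whereis` output on
# stdin and prints only the URLs recorded for the web remote.
web_urls() {
    sed -n 's/^[[:space:]]*web:[[:space:]]*//p'
}

# url_alive is a hypothetical helper: it succeeds if a HEAD request for the
# URL returns a non-error HTTP status.
url_alive() {
    curl --silent --head --fail --output /dev/null "$1"
}

# check_key_urls is a hypothetical helper: it reports the state of every web
# URL known for one annex key. Run it inside the dataset.
check_key_urls() {
    git annex whereis --key "$1" | web_urls | while IFS= read -r url; do
        if url_alive "$url"; then
            echo "ok   $url"
        else
            echo "gone $url"
        fi
    done
}

# Example call (commented out; it needs git-annex and network access):
# check_key_urls MD5E-s4802398120--ce2d215f336e6dfa282d69cc35beb80d.tgz
```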

oesteban · August 18, 2018, 2:59am

Thanks a lot!

If I understand correctly, I could even crawl the whole openneuro bucket myself, is that right?

yarikoptic · August 18, 2018, 3:09am

Sure, no one forbids it.
What I did now (I don’t know yet whether it is “correct”) is

git checkout --orphan incoming-s3-openneuro-noderivatives
git reset --hard
datalad crawl-init --save --template=simple_s3 bucket=openneuro to_http=1 prefix=ds000030 exclude=derivatives
datalad crawl

and now it seems to be doing something (once again: this dataset is quite heavy in number of files, so it might just be fetching all the S3 keys; you could run it with -l debug instead to possibly get more feedback, but it might just stay silent for a while as boto interacts with S3. The process is already at 4GB of RAM). The idea is that I can crawl straight into this new branch, populate the git-annex branch with all the information about the availability of those files, and then possibly even remove the incoming-s3-openneuro-noderivatives branch entirely, so as not to keep those additional heavy tree object(s) around in .git… Here we go, it started to download the tarballs! Hopefully it will, as prescribed, ignore all the derivatives (time will tell).

P.S. Note that ATM:

- it would require S3 credentials; anonymous access is in PR https://github.com/datalad/datalad/pull/2708
- you would need the datalad-crawler extension if you use datalad >= 0.10 (prior versions include the crawler)

yarikoptic · August 20, 2018, 2:54am

Crawled, and pushed the updated state of git-annex. You should be able to either datalad update or just git fetch origin and then do your get.
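
The fetch-then-get variant could look like the sketch below. fetch_then_get is a hypothetical helper name, and the sketch assumes that the fetched git-annex branch is enough for git-annex to pick up the new availability information without a merge of master.

```shell
# fetch_then_get is a hypothetical helper name.
#   $1: dataset directory
#   $2: file path relative to that directory
fetch_then_get() {
    (
        # Work inside a subshell so the caller's directory is left unchanged.
        cd "$1" || exit 1
        # Fetch only; no merge of the checked-out branch is attempted here.
        git fetch origin || exit 1
        datalad get "$2"
    )
}

# Example call (commented out; it needs DataLad and network access):
# fetch_then_get ds000030 sub-10159/anat/sub-10159_T1w.nii.gz
```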

oesteban · August 20, 2018, 4:57pm

Sweet! Of course it works :).

Thanks a lot.