Datalad addurls errors downloading HBN data

I’m trying to create a datalad dataset of the HBN MRI data. I have a superdataset with 4 subdatasets, one for each site. Within each site subdataset I want to create a subdataset for each subject, but I’m running into some errors.
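The nesting was created along these lines (a sketch with simplified paths; the site names besides Site-CUNY are assumed from the S3 layout):

datalad create HBN_datalad
cd HBN_datalad
# register one subdataset per acquisition site in the superdataset
datalad create -d . Site-CUNY
datalad create -d . Site-RU
datalad create -d . Site-CBIC
datalad create -d . Site-SI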

I’m able to generate subject-specific csv files for datalad addurls like so:

original_url,subject,filename,version
s3://fcp-indi/data/Projects/HBN/MRI/Site-CUNY/sub-NDARVN020FRK/anat/sub-NDARVN020FRK_acq-HCP_T1w.json,sub-NDARVN020FRK,anat/sub-NDARVN020FRK_acq-HCP_T1w.json,7kmUPAJ15TTZ2SCrDlW0KLykAND.cJXM
s3://fcp-indi/data/Projects/HBN/MRI/Site-CUNY/sub-NDARVN020FRK/anat/sub-NDARVN020FRK_acq-HCP_T1w.nii.gz,sub-NDARVN020FRK,anat/sub-NDARVN020FRK_acq-HCP_T1w.nii.gz,ijRoYfW3FsUHm1B1yFovsdtYdJ2xzhjE
s3://fcp-indi/data/Projects/HBN/MRI/Site-CUNY/sub-NDARVN020FRK/anat/sub-NDARVN020FRK_acq-VNavNorm_T1w.json,sub-NDARVN020FRK,anat/sub-NDARVN020FRK_acq-VNavNorm_T1w.json,niTxTkFuTMo8EqFCIQ.dRQeGrlV1tb80
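In case it helps anyone else, here is roughly how such a table can be produced (a sketch assuming the AWS CLI and jq are installed and that the fcp-indi bucket allows anonymous listing):

prefix=data/Projects/HBN/MRI/Site-CUNY/sub-NDARVN020FRK/
{
  echo "original_url,subject,filename,version"
  # keep only the latest version of each object and emit one addurls row per file
  aws s3api list-object-versions --no-sign-request --bucket fcp-indi --prefix "$prefix" \
    | jq -r --arg p "$prefix" '.Versions[] | select(.IsLatest)
        | "s3://fcp-indi/\(.Key),sub-NDARVN020FRK,\(.Key | ltrimstr($p)),\(.VersionId)"'
} > sub-NDARVN020FRK_table.csv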

but when I run the command:

datalad addurls -d sub-NDARVN020FRK -t csv sub-NDARVN020FRK_table.csv '{original_url}?versionId={version}' '{filename}'

I get the following errors, including “Configuration does not allow accessing s3://…” and “dataset containing given paths is not underneath the reference dataset”. I’ve tried copying the cfg_hcp_dataset.sh procedure and specifying -c hcp_dataset, but that wasn’t working. I’m able to datalad download-url these files individually, so I don’t think it is a permission problem with AWS. A bit stuck here, so any pointers would be helpful – thanks!
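For example, a single-file check like this succeeds (URL taken from the table above):

datalad download-url 's3://fcp-indi/data/Projects/HBN/MRI/Site-CUNY/sub-NDARVN020FRK/anat/sub-NDARVN020FRK_acq-HCP_T1w.json?versionId=7kmUPAJ15TTZ2SCrDlW0KLykAND.cJXM'

Full errors from the addurls call: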

addurls(error): /om4/group/gablab/data/jsmentch/HBN_datalad/Site-CUNY/sub-NDARVN020FRK/anat/sub-NDARVN020FRK_acq-HCP_T1w.json (file) [AnnexBatchCommandError: 'addurl' [Error, annex reported failure for addurl (url='s3://fcp-indi/data/Projects/HBN/MRI/Site-CUNY/sub-NDARVN020FRK/anat/sub-NDARVN020FRK_acq-HCP_T1w.json?versionId=7kmUPAJ15TTZ2SCrDlW0KLykAND.cJXM'): {'command': 'addurl', 'success': False, 'input': ['s3://fcp-indi/data/Projects/HBN/MRI/Site-CUNY/sub-NDARVN020FRK/anat/sub-NDARVN020FRK_acq-HCP_T1w.json?versionId=7kmUPAJ15TTZ2SCrDlW0KLykAND.cJXM anat/sub-NDARVN020FRK_acq-HCP_T1w.json'], 'error-messages': ['  Configuration does not allow accessing s3://fcp-indi/data/Projects/HBN/MRI/Site-CUNY/sub-NDARVN020FRK/anat/sub-NDARVN020FRK_acq-HCP_T1w.json?versionId=7kmUPAJ15TTZ2SCrDlW0KLykAND.cJXM'], 'file': 'anat/sub-NDARVN020FRK_acq-HCP_T1w.json'}] [annexrepo.py:add_url_to_file:2114]]

[ERROR  ] dataset containing given paths is not underneath the reference dataset Dataset(/om4/group/gablab/data/jsmentch/HBN_datalad/Site-CUNY/sub-NDARVN020FRK): [PosixPath('/om4/group/gablab/data/jsmentch/HBN_datalad/Site-CUNY/sub-NDARVN020FRK')] [status(/om4/group/gablab/data/jsmentch/HBN_datalad/Site-CUNY)]
    [22 similar messages have been suppressed]
    status(error): .. [dataset containing given paths is not underneath the reference dataset Dataset(/om4/group/gablab/data/jsmentch/HBN_datalad/Site-CUNY/sub-NDARVN020FRK): [PosixPath('/om4/group/gablab/data/jsmentch/HBN_datalad/Site-CUNY/sub-NDARVN020FRK')]]

Could you please run

git annex initremote datalad externaltype=datalad type=external encryption=none autoenable=true

and try the addurls call again?

If that works – I would have thought that “download-url: Set up datalad special remote if needed” by kyleam (datalad/datalad PR #5648, which is around 0.14.5~17^2~2) should have addressed it. What version of datalad do you have?
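You can check with:

datalad --version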

Ok, I’ve tried that command but still received the same errors.

I was actually on version 0.14.0, so I’ve now updated to 0.14.6. I still see the addurls errors, though I’m no longer getting the “dataset containing given paths is not underneath the reference dataset” error.

Uff… it is not yet Friday, and I was chasing the tail of having -d sub-XXXX in your call… that is what led to that “reference dataset” error etc. What about making it all explicit in your addurls call, plus some tune-ups, e.g. do not bother with s3:// urls. Altogether it would look like:

$> datalad create -c text2git hbn; cd hbn
[INFO   ] Creating a new annex repo at /tmp/hbn 
[INFO   ] Running procedure cfg_text2git 
[INFO   ] == Command start (output follows) ===== 
[INFO   ] == Command exit (modification check follows) =====                                              
create(ok): /tmp/hbn (dataset)                                                                            
(dev3) 1 20518 [1].....................................:Wed 07 Jul 2021 04:49:38 PM EDT:.
(git-annex)lena:/tmp/hbn[master]
$> datalad run -m "added csv with urls from the post" 'xclip -o > urls.csv'
[INFO   ] == Command start (output follows) ===== 
[INFO   ] == Command exit (modification check follows) ===== 
add(ok): urls.csv (file)                                                                                  
save(ok): . (dataset)                                                                                     
action summary:                                                                                           
  add (ok: 1)
  save (ok: 1)
(dev3) 1 20519 [1].....................................:Wed 07 Jul 2021 04:49:43 PM EDT:.
(git-annex)lena:/tmp/hbn[master]
$> datalad run -m "produce https urls for s3 ones" "sed -e 's,s3://fcp-indi/,https://fcp-indi.s3.amazonaws.com/,g' urls.csv >| urls-https.csv"
[INFO   ] == Command start (output follows) ===== 
[INFO   ] == Command exit (modification check follows) ===== 
add(ok): urls-https.csv (file)                                                                            
save(ok): . (dataset)                                                                                     
action summary:                                                                                           
  add (ok: 1)
  save (ok: 1)
(dev3) 1 20520 [1].....................................:Wed 07 Jul 2021 04:49:46 PM EDT:.
(git-annex)lena:/tmp/hbn[master]
$> datalad addurls -t csv urls-https.csv '{original_url}?versionId={version}' '{subject}//{filename}'
[INFO   ] Creating a new annex repo at /tmp/hbn/sub-NDARVN020FRK 
create(ok): . (dataset)                                                                                   
addurl(ok): /tmp/hbn/sub-NDARVN020FRK/anat/sub-NDARVN020FRK_acq-HCP_T1w.json (file) [to anat/sub-NDARVN020FRK_acq-HCP_T1w.json]                                                                                     
addurl(ok): /tmp/hbn/sub-NDARVN020FRK/anat/sub-NDARVN020FRK_acq-HCP_T1w.nii.gz (file) [to anat/sub-NDARVN020FRK_acq-HCP_T1w.nii.gz]                                                                                 
addurl(ok): /tmp/hbn/sub-NDARVN020FRK/anat/sub-NDARVN020FRK_acq-VNavNorm_T1w.json (file) [to anat/sub-NDARVN020FRK_acq-VNavNorm_T1w.json]                                                                           
metadata(ok): /tmp/hbn/sub-NDARVN020FRK/anat/sub-NDARVN020FRK_acq-HCP_T1w.json (file)                     
metadata(ok): /tmp/hbn/sub-NDARVN020FRK/anat/sub-NDARVN020FRK_acq-HCP_T1w.nii.gz (file)                   
metadata(ok): /tmp/hbn/sub-NDARVN020FRK/anat/sub-NDARVN020FRK_acq-VNavNorm_T1w.json (file)                
save(ok): sub-NDARVN020FRK (dataset)                                                                      
add(ok): sub-NDARVN020FRK (file)                                                                          
add(ok): .gitmodules (file)                                                                               
save(ok): . (dataset)                                                                                     
action summary:                                                                                           
  add (ok: 2)
  addurl (ok: 3)
  create (ok: 1)
  metadata (ok: 3)
  save (ok: 2)
datalad addurls -t csv urls-https.csv '{original_url}?versionId={version}'   5.43s user 1.73s system 50% cpu 14.235 total

Note that with // in the filename template you are telling addurls to create a subdataset at that level. If your .csv has multiple subjects, add --jobs 10 or as many as you can afford to run in parallel across those subdatasets :wink:
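For example, keeping everything else from the call above the same (the job count is just a guess, tune it to your setup):

datalad addurls -t csv --jobs 10 urls-https.csv '{original_url}?versionId={version}' '{subject}//{filename}'

Afterwards a recursive datalad get -r . should fetch the actual file content across all of those subject subdatasets.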


That’s doing the trick, thank you!