Datalad subdatasets - How to include them in 1 repo on GitHub?

axiezai · March 12, 2021, 4:48pm

Hello,

I am trying to maintain a small DataLad dataset on GitHub at Raj-Lab-UCSF/Human_Brain_Atlases (commonly used human brain atlases as a DataLad dataset).

I followed the handbook and currently have the following subdatasets and siblings:

datalad subdatasets -r
subdataset(ok): /Users/xxie/lab/Human_Brain_Atlases/aal (dataset)
subdataset(ok): /Users/xxie/lab/Human_Brain_Atlases/brainnectome (dataset)
subdataset(ok): /Users/xxie/lab/Human_Brain_Atlases/desikan-killiany (dataset)
action summary:
  subdataset (ok: 3)

datalad siblings
.: here(+) [git]
.: github(-) [git@github.com:Raj-Lab-UCSF/Human_Brain_Atlases.git (git)]
.: github-lfs(+) [git@github.com:Raj-Lab-UCSF/Human_Brain_Atlases.git (git)]

By default, when I did datalad push --to=github, DataLad created 4 separate repos: one for the dataset itself and one for each of its 3 subdatasets. I didn’t like that, so I deleted the 3 subdataset repos, as the main dataset repo looked fine.

Then, on another machine, I tried to datalad clone and datalad get a subdataset, but received the following:

$ datalad get -n desikan-killiany
[ERROR  ] Failed to clone from any candidate source URL. Encountered errors per each url were:
| - https://github.com/Raj-Lab-UCSF/Human_Brain_Atlases.git/desikan-killiany
  CommandError: 'git clone --progress https://github.com/Raj-Lab-UCSF/Human_Brain_Atlases.git/desikan-killiany /data/rajlab1/shared_data/Human_Brain_Atlases/desikan-killiany' failed with exitcode 128 [err: 'Cloning into '/data/rajlab1/shared_data/Human_Brain_Atlases/desikan-killiany'...
remote: Not Found

This makes sense, since I also ran datalad siblings remove -s github in the subdatasets because I didn’t like that separate repos were created.

So my question is… is it possible to create this dataset with its subdatasets all in one repo? How should I go about it? Thanks in advance and let me know if I should provide any more info!

yarikoptic · March 12, 2021, 6:10pm

“is it possible to create this dataset with its subdatasets all in one repo?”

The short answer is: no. Every DataLad dataset is a Git repository, so you need a separate repository on GitHub for each dataset.
The longer answer: it is possible (e.g., by pushing each subdataset’s branches into prefixed branches of the same GitHub repo and setting up a matching mapping in the local .git/config), but it would just beg for trouble. Better not to even think about it, IMHO.

“How should I go about it?”

Change your mind on “I didn’t like that”: if you allow multiple repositories on GitHub, you are 99% there. Just:

1. Redo your create-sibling-github -r with --existing skip.
2. For datalad install etc. to work: unfortunately, we do not have “native” support for “unflattening” a tree of datasets that was published as a flattened hierarchy on GitHub (or elsewhere, such as GIN). But the workaround is quick (and I do not think you would run into side effects): adjust the url entries in your .gitmodules (Human_Brain_Atlases/.gitmodules on master) to contain the full URLs of the corresponding repositories on GitHub, then datalad save those changes.
3. datalad push -r, and you should be all set.
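
A minimal sketch of the three steps above, to be run from the root of the superdataset. It makes several assumptions that are not confirmed in this thread: that the subdataset repositories on GitHub are named `<superdataset>-<subdataset>` (check the actual repository names), that the sibling is called `github` as in the `datalad siblings` output earlier, and that the installed DataLad still accepts `--github-organization` (newer releases may expect the organization as part of the repository name instead). The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: run from the root of the superdataset (Human_Brain_Atlases).
# ORG, SUPER, and the "<SUPER>-<subdataset>" repository names are assumptions.
ORG="Raj-Lab-UCSF"
SUPER="Human_Brain_Atlases"

republish_with_full_urls() {
    # 1. Recreate the missing subdataset repositories; "--existing skip"
    #    leaves siblings that already exist untouched.
    datalad create-sibling-github -d . -r --existing skip \
        --github-organization "$ORG" "$SUPER" || return 1

    # 2. Point every listed .gitmodules entry at the full URL of its own repo.
    local sub
    for sub in "$@"; do
        git config --file .gitmodules "submodule.${sub}.url" \
            "https://github.com/${ORG}/${SUPER}-${sub}.git" || return 1
    done

    # 3. Record the edited .gitmodules, then push the whole hierarchy.
    datalad save -m "Use full GitHub URLs for subdatasets" .gitmodules || return 1
    datalad push -r --to github
}
```

It would be called as `republish_with_full_urls aal brainnectome desikan-killiany`. Using `git config --file .gitmodules` rather than editing the file by hand keeps the file syntactically valid and changes only the `url` key of each entry.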

axiezai · March 12, 2021, 6:39pm

Hi Yarik, thank you for the response.

Yes, after posting I realized I had to modify the URLs somehow. I followed your instructions and modified the .gitmodules URLs, re-installed the dataset on the remote machine, and then tried:

$ datalad get desikan-killiany
[ERROR  ] not available; (Note that these git remotes have annex-ignore set: origin) [get(/data/rajlab1/shared_data/Human_Brain_Atlases/desikan-killiany/DK_Atlas_86_1mm.nii.gz)]
get(error): desikan-killiany/DK_Atlas_86_1mm.nii.gz (file) [not available; (Note that these git remotes have annex-ignore set: origin)]

I found this post: Update GitHub repository for dataset using DataLad - #5 by rastko

I pushed to GitHub with datalad push -r --to=github. Should I have used datalad publish instead? And what about the default branch: should I keep it set to the git-annex branch for all repos?

yarikoptic · March 13, 2021, 12:22am

Where did you upload the actual data files? They can’t be uploaded to GitHub (the short answer ;)). I can only see the following:

(git-annex)lena:/tmp/Human_Brain_Atlases/desikan-killiany[master]
$> git annex whereis DK_Atlas_86_1mm.nii.gz
whereis DK_Atlas_86_1mm.nii.gz (2 copies) 
  	42090db8-8adc-4f59-8896-4db38d51ae16 -- xxie@RAD-4BUJGH6-LT:~/lab/Human_Brain_Atlases/desikan-killiany
   	b209260c-c041-49d1-a223-37537a55a1d4 -- axiezai@sachin:/media/rajlab/DATASETS/Human_Brain_Atlases/desikan-killiany
ok

I thought you were going to use OSF to store those (so you would need to create an OSF dataset for each subdataset).
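
One way the “OSF dataset per subdataset” idea could look, as a rough sketch rather than a confirmed recipe. It assumes the datalad-osf extension is installed and OSF credentials are configured, that each subdataset already has a sibling named `github`, and that the storage sibling gets the default name `osf-storage`. The project titles and the function name are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: requires the datalad-osf extension and OSF credentials.
# Sibling names "osf"/"osf-storage" and the project titles are assumptions.

publish_content_via_osf() {
    local sub
    for sub in "$@"; do
        # Create a public OSF project that will hold the annexed file content.
        datalad create-sibling-osf -d "$sub" -s osf \
            --title "Human_Brain_Atlases: ${sub}" --public || return 1

        # Make every push to GitHub upload file content to OSF first.
        datalad siblings -d "$sub" configure -s github \
            --publish-depends osf-storage || return 1

        # Push Git history to GitHub and annexed content to OSF.
        datalad push -d "$sub" --to github || return 1
    done
}
```

Called as `publish_content_via_osf aal brainnectome desikan-killiany`, this would leave GitHub holding only the Git history while the file content lives on OSF. Afterwards, `git annex whereis` on a file should list the OSF storage remote in addition to the local copies shown above.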

BTW, https://www.templateflow.org/ has established a nice (although elaborate) submission process for new templates (and atlases) for a collection similar in spirit: templateflow/templateflow on GitHub.

axiezai · March 15, 2021, 8:42pm

At first I didn’t realize I had to create the OSF dataset… then I realized my OSF datasets were PRIVATE. I switched them to public and it worked! Thank you!