Hello,
I am trying to maintain a small DataLad dataset on GitHub: Raj-Lab-UCSF/Human_Brain_Atlases (commonly used human brain atlases as a DataLad dataset).
Currently, I followed the handbook and have the following sub-datasets and siblings:
$ datalad subdatasets -r
subdataset(ok): /Users/xxie/lab/Human_Brain_Atlases/aal (dataset)
subdataset(ok): /Users/xxie/lab/Human_Brain_Atlases/brainnectome (dataset)
subdataset(ok): /Users/xxie/lab/Human_Brain_Atlases/desikan-killiany (dataset)
action summary:
subdataset (ok: 3)
$ datalad siblings
.: here(+) [git]
.: github(-) [git@github.com:Raj-Lab-UCSF/Human_Brain_Atlases.git (git)]
.: github-lfs(+) [git@github.com:Raj-Lab-UCSF/Human_Brain_Atlases.git (git)]
By default, when I did datalad push --to=github, DataLad created 4 separate repos: one for the dataset itself and one for each of its 3 subdatasets. I didn’t like that, so I deleted the 3 subdataset repos, as the main dataset repo looked fine.
Then, on another machine, I tried to datalad clone and datalad get a subdataset, but received the following:
$ datalad get -n desikan-killiany
[ERROR ] Failed to clone from any candidate source URL. Encountered errors per each url were:
| - https://github.com/Raj-Lab-UCSF/Human_Brain_Atlases.git/desikan-killiany
CommandError: 'git clone --progress https://github.com/Raj-Lab-UCSF/Human_Brain_Atlases.git/desikan-killiany /data/rajlab1/shared_data/Human_Brain_Atlases/desikan-killiany' failed with exitcode 128 [err: 'Cloning into '/data/rajlab1/shared_data/Human_Brain_Atlases/desikan-killiany'...
remote: Not Found
Which makes sense, since I also ran datalad siblings remove -s github in the subdatasets because I didn’t like that it created separate repos.
So my question is… is it possible to create this dataset with its subdatasets all in one repo? How should I go about it? Thanks in advance and let me know if I should provide any more info!
is it possible to create this dataset with its subdatasets all in one repo?
The short answer is: no. Every DataLad dataset is a Git repository, so each dataset should have its own repository on GitHub.
The longer answer: it is possible (e.g., by pushing each subdataset's branch into a prefixed branch in the same GitHub repo while setting up the proper mapping in the local .git/config), but it would just be begging for trouble. Thus better not to even think about it, IMHO.
How should I go about it?
Change your mind on "I didn’t like that": if you allow for multiple repositories on GitHub, you are 99% there. Just:
- redo your create-sibling-github -r with --existing skip.
- for datalad install etc. to work: unfortunately, we do not have “native” support for “unflattening” a tree of datasets that was published as a flattened hierarchy on GitHub (or elsewhere, like GIN). But the workaround is quick (and I do not think you would run into side effects): just adjust the url entries in your .gitmodules (Human_Brain_Atlases/.gitmodules at master in Raj-Lab-UCSF/Human_Brain_Atlases on GitHub) to contain full URLs to the corresponding repositories on GitHub, and datalad save those changes.
- datalad push -r, and you should be all set.
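For reference, the .gitmodules step above could be sketched roughly as follows. This is only an illustration, not an official DataLad recipe: the helper functions are made up for this post, the submodule names are assumed to match the subdataset paths, and the per-subdataset repository names (Human_Brain_Atlases-<subdataset>) are an assumption, so check what create-sibling-github -r actually created under your organization.

```shell
#!/bin/sh
# Sketch: point each subdataset entry in .gitmodules at its own GitHub repo.
# ASSUMPTION: both helpers below are hypothetical, written for illustration.

# Build a full HTTPS clone URL from an organization and a repository name.
github_url() {
    printf 'https://github.com/%s/%s.git\n' "$1" "$2"
}

# Rewrite the url of one submodule entry in a .gitmodules file.
# Usage: set_submodule_url <gitmodules-file> <submodule-name> <url>
set_submodule_url() {
    git config --file "$1" "submodule.$2.url" "$3"
}

# Example, to be run from the superdataset root. It is commented out so that
# sourcing this sketch has no side effects. The repository names are assumed.
#
# for sub in aal brainnectome desikan-killiany; do
#     set_submodule_url .gitmodules "$sub" \
#         "$(github_url Raj-Lab-UCSF "Human_Brain_Atlases-$sub")"
# done
# datalad save -m "Use full GitHub URLs for subdatasets" .gitmodules
# datalad push -r --to=github
```

Using git config --file rather than editing by hand keeps the .gitmodules syntax intact; editing the url lines in a text editor works just as well.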
Hi Yarik, thank you for the response.
Yes, after posting I realized I had to somehow modify the URLs. I followed your instructions and modified the .gitmodules URLs, re-installed the dataset remotely, then tried:
$ datalad get desikan-killiany
[ERROR ] not available; (Note that these git remotes have annex-ignore set: origin) [get(/data/rajlab1/shared_data/Human_Brain_Atlases/desikan-killiany/DK_Atlas_86_1mm.nii.gz)]
get(error): desikan-killiany/DK_Atlas_86_1mm.nii.gz (file) [not available; (Note that these git remotes have annex-ignore set: origin)]
I found this post: Update GitHub repository for dataset using DataLad - #5 by rastko
I pushed to GitHub with datalad push -r --to=github; should I have used datalad publish instead? And what about the default branch? Should I keep it set to the git-annex branch for all repos?
Where did you upload the actual data files? They can’t be uploaded to GitHub (the short answer ;)). I can only see the following:
(git-annex)lena:/tmp/Human_Brain_Atlases/desikan-killiany[master]
$> git annex whereis DK_Atlas_86_1mm.nii.gz
whereis DK_Atlas_86_1mm.nii.gz (2 copies)
42090db8-8adc-4f59-8896-4db38d51ae16 -- xxie@RAD-4BUJGH6-LT:~/lab/Human_Brain_Atlases/desikan-killiany
b209260c-c041-49d1-a223-37537a55a1d4 -- axiezai@sachin:/media/rajlab/DATASETS/Human_Brain_Atlases/desikan-killiany
ok
And I thought that you were going to use OSF to store those (in which case you would need to create an OSF dataset for each subdataset).
BTW, https://www.templateflow.org/ has established a nice (although elaborate) submission process for new templates (and atlases) for a collection similar in spirit: templateflow/templateflow on GitHub ("The Zone of templates").
At first I didn’t realize I had to create the OSF datasets… then I realized my OSF datasets were PRIVATE. I switched them to public and it worked! Thank you!