Hi @alexlicohen!
Sorry for not having content about this in the DataLad handbook yet - to be honest, it simply wasn’t a use case that had occurred to me until now, so thanks for bringing it up. I’ll work on adding information on it, maybe in one of the advanced chapters or in a new standalone chapter about transforming existing studies into datasets.
I would presume, since I am aiming to create large-ish Data sets, that I would want separate git-annex repos per subject, yet still have the study superdataset be aware that each subject is a datalad dataset of it’s own?
Yes, that’s correct.
I’ve only skimmed through the exact setup you have (sorry, I’m on a train and about to get off), but from what I understand: it’s a large directory with subject-specific sub-directories, where the subject directories should become subdatasets and the top-level directory should become a superdataset.
Here’s how I would do it:
# start with create --force in the lowest-level datasets
for dir in super/sub-*/; do datalad -C "$dir" create --force; done
# save the contents of each lowest-level dataset
for dir in super/sub-*/; do datalad -C "$dir" save . -m "add content"; done
[Repeat dataset creation and saving for the next higher level. At this point, you have unconnected datasets nested within each other, each with its own annex. Now for registering them as subdatasets…]
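If you want to convince yourself that the nested directories really are independent repositories at this stage, a quick check with plain git could look like the sketch below. It is only an illustration under assumptions: it works in a throwaway directory, the names `sub-01` and `sub-02` are made up, and `git init` stands in for `datalad create --force` so that it runs without DataLad installed.

```shell
#!/usr/bin/env bash
set -eu

# throwaway layout standing in for the real study directory (hypothetical names)
scratch=$(mktemp -d)
mkdir -p "$scratch/super/sub-01" "$scratch/super/sub-02"

# stand-in for `datalad create --force`: a plain `git init` in each directory
for dir in "$scratch"/super/sub-*/; do git init -q "$dir"; done
git init -q "$scratch/super"

# `git rev-parse --show-cdup` prints an empty string when run at the top level
# of a repository, so an empty result means the directory is its own repository
for dir in "$scratch"/super "$scratch"/super/sub-*/; do
    if [ -z "$(git -C "$dir" rev-parse --show-cdup)" ]; then
        echo "standalone repository: $(basename "$dir")"
    else
        echo "part of a parent repository: $(basename "$dir")"
    fi
done
```

On the real data you would drop the setup lines and only run the final loop over `super` and `super/sub-*/`.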
I don’t see a non-hacky way to retrospectively register an existing dataset as a subdataset with a datalad command off the top of my head, but I would simply create the .gitmodules file programmatically using the git config command:
# add sub-* dataset paths to .gitmodules (in super)
for sub in sub*; do git config -f .gitmodules "submodule.${sub}.path" "$sub"; done
# save
datalad save
datalad subdatasets
# should show all subdirectories
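To see what the git config loop above actually writes, here is a small sketch that runs it in a scratch directory and reads the entries back. The directory names `sub-01` to `sub-03` are made up for illustration, and the sketch only needs git, not DataLad. It does not check whether DataLad accepts the resulting file as is.

```shell
#!/usr/bin/env bash
set -eu

# scratch directory standing in for the superdataset (hypothetical layout)
scratch=$(mktemp -d)
cd "$scratch"
mkdir sub-01 sub-02 sub-03

# same loop as above: one submodule.<name>.path entry per subject directory
for sub in sub*; do
    git config -f .gitmodules "submodule.${sub}.path" "$sub"
done

# read the recorded paths back from the file
git config -f .gitmodules --get-regexp '^submodule\..*\.path$'
```

The last command should print one `submodule.<name>.path <name>` line per subject directory, which is an easy way to spot a directory that the glob missed before running `datalad save`.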
All of this is untested and quickly written down (@yarikoptic, please chime in if you see that this can do bad things) - I hope I can have content on this in the handbook soonish!