Dividing an existing DataLad dataset into subdatasets

Hi all,
I believe this is a simple question, but I cannot seem to find the answer in the datalad handbook or here:

I have an existing BIDS-organized dataset with 1000 subjects, all in one DataLad dataset, and I want to divide it into one subdataset per subject. How do I go about doing this?

Similarly, how do I split the derivatives folder out into its own DataLad dataset and then streamline the history to reduce disk usage?

Thanks in advance! (And if this is already documented in the handbook, please let me know where to find it.)

@yarikoptic: any thoughts?

-Alex

The HOWTO would depend on:

  • do you want to maintain history?
  • was that dataset/data already published somewhere else?
  • how large is your .git/objects already?

I ask since the simplest way forward could be to just redo it from scratch: unannex all data, wipe out .git, and initialize new datasets in the layout you would like. But if there is information to be preserved, other approaches would need to be exercised :wink:
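E.g., a rough (untested!) sketch of that from-scratch route, assuming BIDS-style sub-* subject directories and that another copy of the data exists somewhere:

# turn the annexed symlinks back into regular files (may need extra disk space)
git annex unannex .
# wipe the old dataset infrastructure entirely -- there is no undo after this!
rm -rf .git .gitattributes .datalad
# re-create the desired layout: a superdataset with one subdataset per subject
datalad create --force .
for sub in sub-*; do datalad create --force -d . "$sub"; done
for sub in sub-*; do datalad -C "$sub" save -m "add subject data"; done
datalad save -m "register subject subdatasets"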

In this particular case, I am fine with losing the history (publicly available data, MGH-GSP), but:

  1. What is the fastest process to unannex the data (datalad unlock?)
  2. Then, in what order do I create the superdataset and the subject-level datasets?

FYI: .git/objects is currently 16GB, but I am still in the process of a 24+ hour datalad save after running FreeSurfer…

@yarikoptic
I just wanted to follow up on this after reading in more detail:

  1. Assuming I’m OK with dropping the history, is it faster to run datalad unlock . or git annex unannex to start over?
  2. With a large hierarchical dataset (with no DataLad infrastructure in place yet), is it better to run many datalad create commands at the subject level and then create a superdataset with datalad create at the parent level? Or to first run datalad create in the parent dir and then create the subdatasets?

This seems basic, but I do not see any guidance on the correct order to do this in the handbook or in the section on the HCP use case, so I feel that I’m missing something here…

Sorry to keep bringing this up, but I am still trying to wrap my head around the logic of nested datasets when building them on top of existing raw data rather than installing or cloning existing DataLad datasets:

It would seem that, depending on the order in which I run the datalad create and datalad save commands, and on whether or not I specify -d .. within the subject dirs, the subject-level data ends up stored either in the subjects’ own git-annexes or in the study-dir git-annex.

Most confusing is that if I run either of the following commands within a BIDS subject directory (after creating the study-level DataLad dataset):

  1. datalad create -c text2git --force -d ..
  2. datalad create -c text2git --force

and then run datalad save in the subject dir, and then again in the study-level dataset dir, the result is:

  1. symlinks the subject files into the study-level git-annex and does not seem to create a .gitmodules entry
  2. symlinks the subject files into the subject-level git-annex and creates a .gitmodules entry

and yet, the handbook seems to imply that if you want to create a subdataset, Option 1 is the correct procedure… Am I missing something here?

I would presume, since I am aiming to create large-ish datasets, that I would want separate git-annex repos per subject, yet still have the study superdataset be aware that each subject is a DataLad dataset of its own? Is the linking between the superdataset and subdataset levels automatic during datalad save? This is not obvious in any of the documentation I’ve read so far, but I’m hoping I’m just missing some simple logic…

EDIT:
And there is Option 3, run from the study dir:
datalad create -c text2git --force
datalad create -c text2git --force -d . sub-0001
datalad create -c text2git --force -d . sub-0002
pushd sub-0001; datalad save; popd
pushd sub-0002; datalad save; popd
datalad save
This DOES seem to work ‘correctly’…

So the outlier seems to be that creating a subdataset from WITHIN a pre-existing sub-directory, AND specifying the superdataset via -d .., does not create .gitmodules entries?

Hi @alexlicohen!

Sorry for not having content about this in the DataLad handbook yet - to be honest, it simply wasn’t a use case that had occurred to me until now, so thanks for bringing it up. I’ll work on adding information about it, maybe in one of the advanced chapters or in a new standalone chapter about transforming existing studies into datasets.

I would presume, since I am aiming to create large-ish datasets, that I would want separate git-annex repos per subject, yet still have the study superdataset be aware that each subject is a DataLad dataset of its own?

Yes, that’s correct.

I’ve only skimmed through the exact setup you have (sorry, I’m on a train and about to get off), but from what I understand: it’s a large directory with subject-specific sub-directories; the subject directories should become subdatasets, and the top-level directory should become a superdataset.

Here’s how I would do it:

# start with create --force in the lowest-level dataset
for dir in super/sub-0{1..n}; do datalad -C $dir create --force; done
# save lowest-level dataset contents
for dir in super/sub-0{1..n}; do datalad -C $dir save . -m "add content"; done 

[repeat dataset creation and saving for the next higher level (see the sketch just below) - with this, at this point, you have unconnected datasets nested within each other that each have their own annex. Now for registering them as subdatasets…]
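That top-level create step would roughly be (again untested; ‘super’ here stands for the top-level study directory):

# create the superdataset in place over the existing top-level content
datalad -C super create --force
# its own save happens further below, once the subdatasets are registered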

Off the top of my head, I don’t see a non-hacky way to retrospectively register an existing dataset as a subdataset with a datalad command, but I would simply create the .gitmodules entries programmatically using the git config command:

# add sub-* dataset paths to .gitmodules (in super)
for sub in sub*; do git config -f .gitmodules "submodule.${sub}.path" "$sub"; done

# save
datalad save

datalad subdatasets
# should show all of the subject subdatasets

All of this is untested and quickly written down (@yarikoptic, please tune in if you see anything here that could do bad things :wink: ) - I hope to have content on this in the handbook soonish!

Thank you for the confirmation @adina!

I have now done this for 2-3 ~1000ish subject datasets using the following code:

This seems to work without having to manually edit the .gitmodules file, and it can be resumed if interrupted (which, for datasets of this size… happens…).
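For reference, a rough sketch of that kind of per-subject conversion loop, based on Option 3 above (the .datalad existence check for resumability is just an illustration, not my exact script):

# run from the study directory; sub-* are the BIDS subject dirs
[ -d .datalad ] || datalad create -c text2git --force
for sub in sub-*; do
    # skip subjects already converted in an earlier, interrupted run
    [ -d "$sub/.datalad" ] && continue
    datalad create -c text2git --force -d . "$sub"
    datalad -C "$sub" save -m "add $sub content"
done
datalad save -m "register subject subdatasets"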

I did not know about the -C option, that is neat! Fortunately the pushd/popd doesn’t take long.

My next challenge is to figure out how to back these multiple datasets up to a Google Team Drive special remote, to share them with, and keep them in sync with, the datasets on another cluster. (Not to hijack my own thread, but this: http://handbook.datalad.org/en/latest/basics/101-138-sharethirdparty.html#setting-up-3rd-party-services-to-host-your-data does not specify whether you need a new remote per repo, and/or whether multiple datasets can use the same rclone remote - does it represent a general fileserver, or a specific remote for a specific repo?)
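My guess from that section is that a single rclone config entry (the target=) can be shared across datasets, with a different prefix= per dataset - i.e., something like the following run once per dataset (names here are made up, and it needs the git-annex-remote-rclone helper installed) - but confirmation would be great:

# hypothetical example: 'gdrive' is the rclone remote already configured on this machine
git annex initremote gdrive-storage type=external externaltype=rclone \
    target=gdrive prefix=mgh-gsp/sub-0001 chunk=50MiB encryption=none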

I cannot thank you and the whole datalad team enough for these tools; they are quite an achievement!