Datalad, containers, data organization: multiple questions

First, thank you for creating DataLad, it's a fantastic tool that greatly simplifies version tracking and data sharing. I have several questions and will post them all here. If this is too much for a single thread, please let me know and I can split them up or open GitHub issues.

Context:
I am starting to use datalad on a dataset with 350 subjects and 1800 MRI sessions. The data are in BIDS format, and the raw (DICOM) source data are available too.

Plan:
Following the advice from the DataLad handbook, I am creating a structure of subdatasets to split up the raw BIDS data from the various derivatives folders.

Questions

  1. I would like to include the raw data as a subdataset. But altogether the DICOMs and other files amount to about 12 million files. @yarikoptic mentioned in another thread that git does not behave well with repositories of >10k files. Would it still be OK to include the source subdataset?

  2. Is there a way to “datalad run” with parallel jobs in SGE? Another thread seemed to show this is currently not possible. We would probably still use datalad, but would process the data outside of its tracking system.

  3. This is just to get some tips on data organization. My understanding of BIDS and YODA is that code sits in the root folder (./code), while the BIDS dataset itself should stay clean and contain only the unprocessed data (sub-*) and derivatives. Is it advisable to apply YODA only in the root superdataset, not inside BIDS subdatasets (e.g., derivatives/freesurfer)?

  4. If I have a superdataset in /data/studies/study1 and would like to integrate it into a larger superdataset with all studies in /data/studies, is it OK to just run datalad create --force in /data/studies?

  5. Is text2git advisable for BIDS datasets where there will be thousands of JSON files? Or do you usually annex everything inside a BIDS dataset?

  6. We currently share folders on a server using setfacl. How does that work with datalad? What is the typical workflow for two users to share their siblings while no other person (or group) has access to the data?

  7. Why are licenses not included in the ReproNim containers? It would save a lot of time if the licenses of the various neuroimaging pipelines were there together with the containers.

  8. bids2scidata: what does it do, and are there any examples? Can it be used to extract which participant/session has a T1w in a BIDS dataset?

  9. Is there some GitHub repo with scripts you use to process your data with datalad, containers, computing clusters, etc.? That might save us some time, too.

Here are the main software versions:

  • datalad 0.12.0rc5
  • git version 2.18.0 (have more recent installed if needed)
  • git-annex version: 7.20191114-g49d738f (upgrade supported from repository versions: 0 1 2 3 4 5 6)
  • singularity version 3.4.2-1.1.el7
  • System: CentOS Linux release 7.7.1908

Hi Dorian,

Welcome to NeuroStars!

That is the beauty of git submodules, which DataLad uses for its subdataset mechanism. If you place your DICOMs into a separate subdataset at sourcedata/, the number of files in that subdataset has no direct effect on its parent (BIDS) dataset. Moreover, you are typically not interested in sourcedata/ while working on an already-converted BIDS dataset, so this also lets you install the BIDS dataset without all the DICOMs etc. Then the millions of files never end up in the BIDS dataset itself.
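A minimal sketch of that arrangement (the URL and dataset names here are hypothetical):

# inside the BIDS dataset: keep the DICOMs in their own subdataset
datalad create -d . sourcedata
# consumers can later install the BIDS dataset alone; sourcedata/ stays uninstalled by default
datalad install https://example.com/study1-bids.git
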
For the sourcedata/ dataset itself: Git indeed becomes slower as the number of files grows. Depending on your workflow, you could .tar.gz each series of DICOMs into its own tarball, thus minimizing the number of files in the repo. E.g., we do that within the ReproIn heuristic of HeuDiConv, which can work directly on tarballs of DICOMs.
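For illustration, a rough sketch of tarring each series directory (the sourcedata/sub-*/ses-*/series layout here is hypothetical):

# pack every DICOM series directory into a single tarball and remove the originals
for series in sourcedata/sub-*/ses-*/*/; do
    tar -czf "${series%/}.tar.gz" -C "$(dirname "$series")" "$(basename "$series")"
    rm -rf "$series"
done
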
The 10k figure was a rather conservative bound; there are datasets with >100k files which are still “functional”. I do not expect you to work on the DICOM dataset much or often, so I think it will be just fine. You could also try (I haven't yet) a relatively new git feature, the split index (see https://git-scm.com/docs/git-update-index#_split_index), which should help in such cases.
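If you want to try the split index, a minimal sketch (run inside the sourcedata/ repository):

git config core.splitIndex true
git update-index --split-index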

Not now, and I am not sure if ever (for SGE specifically). There is https://github.com/datalad/datalad-htcondor for local HTCondor-powered execution, and we are also developing reproman run, which can be used locally or to schedule execution on a remote service. See https://reproman.readthedocs.io/en/latest/execute.html for the basic docs; there is also a PR with an example that I reference below. In ReproMan we currently provide support for PBS (Torque) and Condor, but not SGE. We hope to provide a submitter for SLURM some time soon (https://github.com/ReproNim/reproman/issues/484). I am not sure anyone on our team(s) will work on SGE though, since we used it only briefly in the past before quickly running away from it :wink: If you really need it, you could craft a submitter based on the existing ones, see e.g. the one for Condor.

Well, if you were to follow the YODA principles more closely, the BIDS dataset should not contain derivatives. You could instead organize your YODA study dataset to contain BIDS and derivatives subdatasets. Each derivative dataset should then “contain” the BIDS dataset(s) it used. Those can be installed “temporarily and efficiently” (e.g., via datalad install --reckless --source ../bids raw_bids, or by relying on the CoW functionality of filesystems such as BTRFS). Each derivative dataset would then have a raw_bids/ subdataset where the BIDS dataset is installed, and thus be self-sufficient and carry all tracking information. See e.g.

for a prototypical layout.
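A rough sketch of setting up such a layout (all dataset names are hypothetical):

datalad create study1
datalad create -d study1 study1/bids
datalad create -d study1 study1/derivatives/freesurfer
# register the BIDS input inside the derivative dataset, as suggested above
cd study1/derivatives/freesurfer
datalad install -d . --reckless --source ../../bids raw_bids
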
If that is “too much”, then at least making those subdatasets under derivatives/ of the BIDS dataset would be the least of the YODA sins :wink:

Sure, why not? :wink: It will initialize the dataset at the /data/studies level, so you can then add all the study subdatasets to be contained within it.

I think it should generally be fine, but you might want to have some text files under git-annex control if they contain potentially sensitive information; for instance, the _scans.tsv files could contain the exact scanning date. That is why in HeuDiConv we explicitly list those to go under git-annex: https://github.com/nipy/heudiconv/blob/master/heudiconv/external/dlad.py#L76 .
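As a sketch, the relevant .gitattributes lines could look like this (the first line is the rule the text2git configuration sets; the second forces _scans.tsv files into the annex even though they are text):

* annex.largefiles=((mimeencoding=binary)and(largerthan=0kb))
*_scans.tsv annex.largefiles=anything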

If you install the datalad-neuroimaging extension, it provides the cfg_bids procedure, so you can use datalad create -c bids, which establishes a somewhat less generous specification: only some top-level files (README, CHANGES) go to git, and the rest goes to git-annex.

Another point, with sensitive information in mind: you might want to start using the --fake-dates option of datalad create for your DICOM and BIDS datasets if the dates when you add data are close to the original scanning dates. It makes all git/git-annex commits start in the past and advance at 1-second intervals between commits, regardless of when you datalad save :wink:
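For example (the dataset name is hypothetical):

datalad create --fake-dates -c bids rawdata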

Oh hoh, I never found peace of mind with ACLs. git-annex needs additional work to provide a proper lock-down on ACL systems. The paranoid in me is also afraid of git-annex “shared” mode, since it would then allow changes to be introduced into the files it tracks. Having said all that, the easiest solution is just to have two clones (possibly from a “central” shared, possibly bare, repository), with each user having read-only access to the other user's repository. I think it should work with ACLs or just regular POSIX permissions, established at the level of the group having read permissions and user umasks not resetting group permissions (i.e., being 022).
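A minimal sketch of the plain-POSIX variant (the group name and path are hypothetical):

# both users are members of group "study"; give the group read access
chgrp -R study /data/studies/study1
chmod -R g+rX /data/studies/study1
# setgid on directories so new files inherit the group
find /data/studies/study1 -type d -exec chmod g+s {} +
# and keep a umask that does not strip group read bits
umask 022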

How would you like them to be shared/exposed? And should it point to the license of the entry-point project, or include licenses for all used/included components? That could be too difficult to assemble, except for components installed via apt, since those packages must have licensing information in the corresponding /usr/share/doc/<package>/copyright. Anyway, better to file an issue with ReproNim/containers and we will see what can be done.

That one is specifically for submitting ISATAB files for data descriptor papers to Nature “Scientific Data”. You don't need it to answer your question. You just need to extract (and possibly aggregate into super-datasets) metadata, after installing the datalad-neuroimaging extension and enabling the bids metadata extractor (git config -f .datalad/config --add datalad.metadata.nativetype bids). Then datalad aggregate-metadata should aggregate the metadata. Although you cannot then answer “which participant”, you can find an answer to “which T1w files do I have” across a collection of datasets:

(git-annex)lena:~/datalad[master]git
$> datalad -c datalad.search.index-egrep-documenttype=all search bids.type:T1w
search(ok): /home/yoh/datalad/dbic/QA/sourcedata/sub-amit/ses-20180508/anat/sub-amit_ses-20180508_acq-MPRAGE_T1w.dicom.tgz (file)
search(ok): /home/yoh/datalad/dbic/QA/sourcedata/sub-emmet/ses-20180508/anat/sub-emmet_ses-20180508_acq-MPRAGE_T1w.dicom.tgz (file)
search(ok): /home/yoh/datalad/dbic/QA/sourcedata/sub-emmet/ses-20180521/anat/sub-emmet_ses-20180521_acq-MPRAGE_T1w.dicom.tgz (file)
search(ok): /home/yoh/datalad/dbic/QA/sourcedata/sub-emmet/ses-20180531/anat/sub-emmet_ses-20180531_acq-MPRAGE_T1w.dicom.tgz (file)
search(ok): /home/yoh/datalad/dbic/QA/sourcedata/sub-qa/ses-20171030/anat/sub-qa_ses-20171030_acq-MPRAGE_T1w.dicom.tgz (file)
search(ok): /home/yoh/datalad/dbic/QA/sourcedata/sub-qa/ses-20171106/anat/sub-qa_ses-20171106_acq-MPRAGE_T1w.dicom.tgz (file)
search(ok): /home/yoh/datalad/dbic/QA/sourcedata/sub-qa/ses-20171113/anat/sub-qa_ses-20171113_acq-MPRAGE_T1w.dicom.tgz (file)
search(ok): /home/yoh/datalad/dbic/QA/sourcedata/sub-qa/ses-20171120/anat/sub-qa_ses-20171120_acq-MPRAGE_T1w.dicom.tgz (file)
search(ok): /home/yoh/datalad/dbic/QA/sourcedata/sub-qa/ses-20171127/anat/sub-qa_ses-20171127_acq-MPRAGE_T1w.dicom.tgz (file)
search(ok): /home/yoh/datalad/dbic/QA/sourcedata/sub-qa/ses-20171204/anat/sub-qa_ses-20171204_acq-MPRAGE_T1w.dicom.tgz (file)

With the ongoing development of https://github.com/datalad/datalad-metalad, which is intended to replace metadata handling in datalad, we hope to provide ways to answer “which subject” or “which dataset”.

Oh, there is no single location. Exemplar use cases should eventually appear in the documentation(s) and the handbook. E.g.

Overall – everything is still a moving target :wink:


Thanks @yarikoptic.

I have indeed installed all the add-ons you mentioned, and I parsed the metadata using metalad, which seems faster.

From what you say, it looks like the best approach is to have a tree like this:
.
├── derivatives
│   ├── antsCorticalThickness
│   ├── freesurfer
│   └── fmriprep
├── rawdata
└── sourcedata

all with -c bids except sourcedata. With regard to installing a temporary rawdata under each derivative dataset: why not use a direct reference to the rawdata folder during processing (../../rawdata/sub-01...)? Since derivatives and rawdata are part of a bigger dataset, their references should remain static across subdatasets, right?

About sharing access to the data: my idea is not to give write permissions but only read permissions to other users. Then, once the other user has finished working on their sibling, I go to the master location and pull the changes into the main shared repository. Your suggestion seems to be the same. Maybe I can rely on setgid to maintain the permissions of the entire folder structure of the master dataset, and have those permissions preserved on new data pulled from siblings.

License files:
From a user perspective, the best would be to have all license files directly accessible from a license folder in the dataset. But I realize this may be difficult to maintain if one of the pipelines changes its internal tools. I am not sure how you would achieve it, but ideally the license files would live within the containers dataset in some way. By license I don't mean just, for example, fmriprep's, but also the licenses of the tools it relies on in that specific version of the container (i.e., ANTs, FSL, AFNI, FreeSurfer, etc.). Maybe this is not something people worry much about in academia, but licenses are often scrutinized outside of it.

The data are retrospective from finished study(ies), so the dates are not a problem, but the --fake-dates option is clever.

And finally, thanks for the datalad search example, that might be quite useful.

Overall, DataLad packs in a lot of functionality. In principle, version tracking is already a great baseline for fixing data in time, and the rest are nice additions that make life easier. I think it is worth using DataLad even just as a dataset tracker.

IMHO all data we work with are derivatives, even DICOMs; we just never see/collect the true raw data. So why not

.
├── code
├── data
│   ├── antsCorticalThickness
│   ├── bids
│   ├── dicoms
│   ├── fmriprep
│   └── freesurfer
└── environments

?


For fmriprep output (here I used one dataset on top for both fmriprep and the freesurfer output it generates) you might want a non-BIDS configuration, since it is quite specific:

$> cat .gitattributes 
* annex.backend=MD5E
**/.git* annex.largefiles=nothing
*.md annex.largefiles=nothing
*.html annex.largefiles=nothing
*.json annex.largefiles=nothing
CITATION.* annex.largefiles=(not(mimetype=text/*))

I think it worked out fine… we need to add a custom cfg_fmriprep procedure to datalad-neuroimaging which would do that.
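Until such a procedure exists, a sketch of applying the same configuration by hand (the path is hypothetical):

cd derivatives/fmriprep
cat > .gitattributes <<'EOF'
* annex.backend=MD5E
**/.git* annex.largefiles=nothing
*.md annex.largefiles=nothing
*.html annex.largefiles=nothing
*.json annex.largefiles=nothing
CITATION.* annex.largefiles=(not(mimetype=text/*))
EOF
datalad save -m "Adjust .gitattributes for fmriprep outputs"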


I call it the “YODA impurity of the first kind”, or a “Russian YODA impurity” in contrast to the “German purity of YODA” :wink: Indeed, all information would be available if you have access to the super-dataset, but any derivative dataset wouldn’t be self-sufficient (would not include information about the version of its sourcedata/). That could complicate e.g. tracking of changes (“what has changed in source data since I have processed it”), which would be easier (pretty much a call to datalad diff -r) if sourcedata/ is a proper subdataset.
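For illustration, a minimal sketch of that check (assuming sourcedata/ is registered as a subdataset of the derivative dataset):

cd derivatives/freesurfer
# reports whether (and where) sourcedata/ deviates from the state recorded in this dataset
datalad diff -r sourcedata/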

NB: I had to modify this one since I cannot post more independent replies.

I am a bit confused here: if it is the dataset's license, then e.g. for BIDS we have a provision in dataset_description.json. Licenses of the software which produced the dataset are not really relevant to the dataset itself (AFAIK even commercial software licenses do not place restrictions on the artifacts the software produces), but rather to the operator who uses the software. Some exceptions could be JS code and similar components used/embedded in the .html reports etc.

But overall it sounds like a possibly valuable addition to https://github.com/duecredit/duecredit to also provide licenses of the used components.


Looks OK. I was just trying to follow the BIDS specification material without studying it in detail, hoping that a single run of bids-validator would take care of parsing everything, raw BIDS data and derivatives alike.

The DICOMs (source data) in our case may contain improperly anonymized information already in the filenames, which is why I plan to keep them outside the sharable subdatasets (BIDS + derivatives).

Thanks for this tip.

I didn't know fmriprep needs extra care. I will keep this in mind when we import the data we already ran into a subdataset. :+1:

OK, now I think I get the rationale. Your suggestion is therefore to install the BIDS (raw) subdataset, use it with datalad (containers-)run as needed (which will force the retrieval of the required data), and then drop all the data in that subdataset, keeping only the DataLad structure, which can still tell what was used to make the derived data. Thanks. :+1:

I think we are talking about two different things, both worth considering. My initial suggestion was only about using the software. Say a pharma company wants to analyze fMRI data; they find the ReproNim dataset with containers included, but before using it they need to know whether the software inside the container is legally compatible for them (in this case, commercial use!?). It would be quite good if the licensing information of the bids_fmriprep container were found somewhere and included in the dataset. This is not a license agreement you distribute with the dataset/containers; it is just a convenience for the user to quickly check the licenses involved in using the tools packed in the container.
The second point you brought up is very interesting: you say software does not place restrictions on the artifacts it produces. I know little about this, but I thought restrictions applied to the artifacts too, so I very much hope you are right. The main hiccup I have heard of outside academia is usually related to FSL and its unusual licensing terms. If, however, you are wrong and licensing does apply to software artifacts, that would mean the licensing info should be kept with the dataset (again, I hope you are right, not wrong).

You would validate each subdataset individually.

That is where submodules come in handy. You can share some subdatasets but skip others, while retaining a clear versioning association.


Correct. Or even uninstall it altogether, as long as you have its original dataset available and properly referenced within the .gitmodules of the derived dataset.
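A minimal sketch of that cycle (the dataset name, container name, and pipeline arguments are all hypothetical):

# from within a derivative dataset, e.g. derivatives/fmriprep
datalad install -d . --source ../../rawdata rawdata
datalad containers-run -n fmriprep --input rawdata --output . \
    '{inputs} {outputs} participant'
datalad drop -r rawdata          # drop the content again; the versioned reference remains
# or: datalad uninstall rawdata  # remove the working copy; .gitmodules still records it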

Gotcha, sorry, I lost track of repronim/containers in the picture here! Yes, it would be valuable information. Ideally I should look into metadata descriptors for containers. Vanessa Sochat was interested in that IIRC; I should ask her.


Not that it is really needed, but it would just make things prettier and more usable by default :wink:


@yarikoptic
I copied all FreeSurfer data for a study into a new DataLad dataset and am trying to save it. The new data are 1.4 TB in size, ~3.3 GB per subject for ~440 subjects, 1.3 million files. It looks like it will take 5-6 hours to finish saving! I am not entirely surprised, because another dataset of 330 GB took 30-60 minutes. But I wanted to check whether this sounds normal to you. Do you also save datasets this big with a similar experience, or do you chop them into nested datasets for each subject (as you suggested elsewhere)? I don't think it will make much difference saving all folders vs. individual subjects, but I am looking forward to hearing any tips you have in this regard.

Thanks again for your previous help.
Dorian

The number of subjects is borderline. I guess in some use cases you might want just a subset, so indeed having a subdataset per subject would be the way to go. Michael and his group are finalizing the HCP dataset, and I believe it will have a subdataset (or several) per subject. Also, creating a sample dataset from a single subject could give you an idea about the .git/objects size.
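A quick sketch of that sizing test (the paths and subject name are hypothetical):

# put a single subject into a throwaway dataset to gauge the .git overhead
datalad create /tmp/fs-sample
cp -rp freesurfer/sub-0001 /tmp/fs-sample/
datalad save -d /tmp/fs-sample -m "add one sample subject"
du -scm /tmp/fs-sample/.git/*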

And watch out: IIRC there are lots of XML files, which are text, so a pure text2git configuration of .gitattributes wouldn't be what you want, since too much would go directly to git.

Follow up:

How do I go about converting the big dataset into multiple subdatasets now?

How would someone create a sample parent dataset with just a few subjects?

I am trying to use the containers dataset from your ReproNim GitHub. It sits upstream in the root superdataset. I can't seem to make bids-validator work to check a downstream dataset, though. Can you give me a couple of examples of how to use the containers when they don't sit in the dataset they apply to (maybe for fmriprep and freesurfer too)?

Much appreciated, thank you.

I am not using text2git at all anymore, just yoda or bids.

@yarikoptic

I have made some progress, although through painful trial and error (questions below).

At first I interrupted the datalad save command, a big mistake which left 800 GB of data in the .git/annex directory. I cleaned up the .git folder with git annex unused; git annex dropunused all, but the .git folder was still 9.5 GB in size, probably coming from a huge list of commits. I even tried to delete the commit history to start fresh, but the 9 GB were still there and datalad status would just hang for a long time. I ended up deleting the whole freesurfer folder and starting over.

Now I finally have all subjects as subdatasets inside the freesurfer parent dataset. To my surprise, calling datalad status is still slow; it takes 9 minutes just to get a response. This is after setting the -e no flag. At this point I am not sure what DataLad is doing. Aren't -e no or -e commit supposed to check only whether the parent dataset references the current commit of each child dataset? That should be a matter of seconds, not minutes. Is this expected behavior that the user must just get used to (i.e., waiting minutes to check the status of a superdataset)?

Here is the log:

[dorian@mri FEDERATED_DATASETS]$  time datalad status -e no -r
  unknown: study1 (dataset)
  unknown: study1/data/bids (dataset)
  unknown: study1/data/processed (dataset)
  unknown: study1/data/processed/fmriprep (dataset)
 modified: study1/data/processed/freesurfer (dataset)
  unknown: study1/data/processed/freesurfer/sub-XXXXXXXXXXX (dataset)
  unknown: study1/data/processed/freesurfer/sub-XXXXXXXXXXX (dataset)
  unknown: study1/data/processed/freesurfer/sub-XXXXXXXXXXX (dataset)
  unknown: study1/data/processed/freesurfer/sub-XXXXXXXXXXX (dataset)
  unknown: study1/data/processed/freesurfer/sub-XXXXXXXXXXX (dataset)
  unknown: study1/data/processed/freesurfer/sub-XXXXXXXXXXX (dataset)
...
  unknown: study1/data/processed/freesurfer/sub-XXXXXXXXXXX (dataset)
  unknown: containers (dataset)
  unknown: containers/artwork (dataset)

real    8m32.053s
user    4m18.220s
sys     4m34.170s
[dorian@mri FEDERATED_DATASETS]$  time datalad status -e commit -r
 modified: study1/data/processed/freesurfer (dataset)

real    8m37.821s
user    4m23.359s
sys     4m34.849s

A couple of ideas to improve datalad functionality:

  • Add merge and split subdataset functions. I.e., if I have a huge dataset and want to split it into chunks of subdatasets, datalad should take care of it. It is quite complicated for the user to do that manually, with all the various .git folders, internal references, etc. The same logic goes for merging subdatasets into a larger dataset.
  • Add an option when creating a subdataset to point to an external folder and have datalad copy the entire content of that folder and save it in the newly created subdataset. That would save the user from having to copy the data, create the dataset with --force, and save the subdataset.
  • Add a command to thoroughly check the .git/annex folders for unused files and drop them all. git-annex can do that, but it is rather tricky for the user (I thought the clean command would do it, but it does not).

P.S. BTW, I have updated to datalad v0.12.0rc6.

Well, if you interrupted it, redoing the save AFAIK should just do the right thing; git-annex keeps a journal of actions which are done but not yet committed. Then at some point git gc would have kicked in and removed objects which had been created but were not referenced. I sometimes force it with git gc --prune=now if I know that everything I care about should be referenced by git at that point. Whenever you analyze the size of .git, it matters which directory in particular is large (I usually do du -scm .git/* and see whether it is annex/ or objects/).
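Putting those pieces together, a sketch to run inside the dataset in question:

git annex unused && git annex dropunused all   # drop annexed content that is no longer referenced
git gc --prune=now                             # remove unreferenced git objects right away
du -scm .git/*                                 # check whether annex/ or objects/ dominates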

How does it compare to plain git status timing? Some issues are known (e.g. https://github.com/datalad/datalad/issues/3766 etc.), some are yet to be discovered, but there is a cost to verifying that no changes exist in a large file tree.

I think this thread has grown too long already and we are getting into specific issues. Please file a new issue on https://github.com/datalad/datalad/issues/ for your observations and ideas (or maybe you will find one which matches and can just comment on it; we need to know that there is demand).

https://github.com/datalad/datalad/issues/3554 – added your comment

Do you mean you want a dedicated datalad command to do datalad create . && cp -rp --reflink=auto sourcedir/* . && datalad save? I am not sure it is worth a dedicated command. There is also git annex import if you would like to continuously add changes from a folder.

Hm… it might indeed be worth an additional mode for datalad clean or, even better, datalad drop. Please file an issue. It could be a tricky one since, as you saw, there can be a number of definitions of “unused”, and thus we might first need to identify common use cases. Otherwise, git annex unused ... && git annex dropunused all indeed provides the most flexible approach. Also please consider contributing to http://github.com/datalad-handbook/book on this.

@yarikoptic

git status takes just 5 seconds on the same superdataset that took 8 minutes with datalad status -r -e no.

$ time git status
On branch master
nothing to commit, working tree clean

real    0m4.951s
user    0m3.533s
sys     0m6.936s

datalad status without the recursive option takes 2.5 minutes on that same folder.

$ time datalad status

real    2m37.926s
user    1m8.018s
sys     1m30.672s

Thank you for the timings. For now let me just recommend using git status whenever you would like to check the overall status of large datasets. We will resolve this horrible performance issue sooner rather than later, e.g. maybe by caching the results similarly to git (that is why git was so fast here: datalad status had queried git right before).
