Datalad, containers, data organization: multiple questions

IMHO all the data we work with are derivatives, even DICOMs; we just never see or collect the true raw data. So why not

.
├── code
├── data
│ ├── antsCorticalThickness
│ ├── bids
│ ├── dicoms
│ ├── fmriprep
│ └── freesurfer
└── environments

?


For the fmriprep output (here I used a single dataset on top of both fmriprep and the freesurfer it generates) you might want a non-BIDS configuration, since it is quite specific:

$> cat .gitattributes 
* annex.backend=MD5E
**/.git* annex.largefiles=nothing
*.md annex.largefiles=nothing
*.html annex.largefiles=nothing
*.json annex.largefiles=nothing
CITATION.* annex.largefiles=(not(mimetype=text/*))

I think it worked out fine… we need to add a custom cfg_fmriprep to datalad-neuroimaging which would do that
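Until such a cfg_fmriprep procedure exists, a minimal sketch of setting this up by hand (the dataset path is just illustrative):

$> datalad create -d . data/fmriprep
$> cd data/fmriprep
$> cat > .gitattributes <<'EOF'
* annex.backend=MD5E
**/.git* annex.largefiles=nothing
*.md annex.largefiles=nothing
*.html annex.largefiles=nothing
*.json annex.largefiles=nothing
CITATION.* annex.largefiles=(not(mimetype=text/*))
EOF
$> datalad save -m "fmriprep-oriented .gitattributes" .gitattributes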


I call it the “YODA impurity of the first kind”, or a “Russian YODA impurity” in contrast to the “German purity of YODA” :wink: Indeed, all information would be available if you have access to the super-dataset, but any derivative dataset wouldn’t be self-sufficient (would not include information about the version of its sourcedata/). That could complicate e.g. tracking of changes (“what has changed in source data since I have processed it”), which would be easier (pretty much a call to datalad diff -r) if sourcedata/ is a proper subdataset.
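For example, with sourcedata/ registered as a proper subdataset (the layout and paths below are hypothetical), the derived dataset stays self-sufficient and that check is essentially a one-liner:

$> cd derived
$> datalad install -d . -s ../bids sourcedata   # register the raw data as a subdataset
# ... run the processing, datalad save ...
$> datalad diff -r sourcedata                   # later: what changed in the source data?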

NB: I had to modify this one since I cannot post more independent replies.

I am a bit confused here: if it is the dataset's license, then e.g. for BIDS we have a provision in dataset_description.json. Licenses of the software which produced the dataset are not really relevant to the dataset itself (AFAIK even commercial software licenses do not place restrictions on the artifacts the software produces), but rather to the operator who uses the software. Possible exclusions could be some JS code used/embedded in the .html reports, etc.

But overall it sounds like it could be a valuable addition to https://github.com/duecredit/duecredit to also report the licenses of the components used.


Looks OK; I was just trying to follow the BIDS specification material without studying it in detail, hoping that a single run of bids-validator would take care of validating everything, raw BIDS data and derivatives alike.

The DICOMs (source data) in our case may contain improperly anonymized information already in the filenames, which is why I plan to keep them outside the sharable subdataset (BIDS + derivatives).

Thanks for this tip.

Didn’t know fmriprep needs extra care. Will keep this in mind when we import the data we have already run into a subdataset. :+1:

OK, now I seem to get the rationale. Your suggestion is therefore to install the BIDS (raw) subdataset, use it with datalad (containers-)run as needed (which will force retrieval of the required data), and then drop all the data in that subdataset, keeping only the DataLad structure that records what was used to produce the derived data. Thanks. :+1:
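Something like this minimal sketch, assuming the container is registered under the (made-up) name bids-fmriprep and the raw data get installed under sourcedata/:

$> datalad install -d . -s <url-of-bids-dataset> sourcedata
$> datalad containers-run -n bids-fmriprep \
     --input sourcedata --output data/fmriprep \
     '<fmriprep invocation here>'
$> datalad drop -r sourcedata   # free the space; the exact version used stays recorded in the superdataset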

I think we are talking about two different things, both worth considering. My initial suggestion was only about using the software. Say a pharma company wants to analyze fMRI data; they find the ReproNim dataset with containers included, but before using it they need to know whether the software inside the container is legally compatible for them (in this case, commercial use!?). It would be quite good if the licensing information of the bids_fmriprep container were found somewhere and included in the dataset. This is not a license agreement you distribute with the dataset/containers; it is just a facilitation for the user to quickly check the licenses that are involved in using the tools packed in the container.
The second thought you brought up is very interesting: you say software does not place restrictions on the artifacts it produces. I know little about this, but I thought restrictions could apply to the artifacts, too. So I hope very much you are right. The main hiccup I have heard of outside of academia is usually related to FSL and its unusual licensing terms. If, however, you are wrong and licensing does apply to software artifacts, that would mean that licensing info should be kept with the dataset (again, I hope you are right, not wrong).

You would validate each subdataset individually

That is where submodules come in handy. You can share some subdatasets but skip the others, while retaining a clear versioning association.
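For example, a collaborator could clone the superdataset and obtain only the subdatasets you actually share (paths below are illustrative):

$> datalad install <superdataset-url>
$> cd <superdataset>
$> datalad get -n data/bids data/processed/fmriprep   # install just these subdatasets, no file content yet
$> datalad get data/processed/fmriprep                # then fetch content where needed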


Correct. Or even uninstall it altogether, as long as its original dataset is available and properly referenced within the .gitmodules of the derived dataset.

Gotcha, sorry, I have lost track of where repronim/containers fits into the picture here! Yes, it would be valuable information. Ideally I should look into metadata descriptors for containers. Vanessa Sochat was interested in that IIRC; I should ask her.


Not that it is really needed, but it would just make things prettier and more usable by default :wink:


@yarikoptic
I copied all FreeSurfer data for a study into a new DataLad dataset and I am trying to save it. The new data are 1.4 TB in size, ~3.3 GB per subject for ~440 subjects, 1.3 million files. Looks like it will take 5-6 hours to finish saving! I am not entirely surprised, because another dataset of 330 GB took 30-60 minutes, too. But I wanted to check whether this sounds normal to you. Do you also save datasets this big with a similar experience, or do you chop them into nested datasets for each subject (as you suggested elsewhere)? I don't think it will make much difference saving all folders vs. individual subjects, but I am looking forward to hearing any tips you may have in this regard.

Thanks again for your previous help.
Dorian

The number of subjects is on the border. I guess in some use cases you might want just a subset, so indeed having a subdataset per subject would be the way to go. Michael and his group are finalizing the HCP dataset and I believe it will have a subdataset per subject. Also, doing a sample single-subject dataset could give you an idea of the .git/objects size.
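A minimal sketch of trying that out on one subject (subject ID and paths are made up):

$> datalad create -d . freesurfer/sub-0001
$> cp -rp /orig/freesurfer/sub-0001/* freesurfer/sub-0001/
$> datalad save -d freesurfer/sub-0001 -m "FreeSurfer outputs for sub-0001"
$> du -scm freesurfer/sub-0001/.git/*   # gauge the per-subject .git overhead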

And watch out: IIRC there are lots of XML files, which are text, so a pure text2git configuration in .gitattributes wouldn't be what you want, since too much would go directly to git.
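A possible .gitattributes that keeps such text files in the annex instead (just a sketch; adjust to taste):

$> cat .gitattributes
* annex.backend=MD5E
* annex.largefiles=anything
**/.git* annex.largefiles=nothing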

Follow up:

How do I go about converting the big dataset into multiple subdatasets now?

How would someone create a sample parent dataset with just a few subjects?

I am trying to use the containers dataset from your ReproNim GitHub. It is sitting upstream in the root superdataset. I can't seem to make bids-validator work to check a downstream dataset, though. Can you give me a couple of examples of how to use the containers when they don't sit in the dataset they apply to (maybe for fmriprep and freesurfer too)?

Much appreciated, thank you.

I am not using text2git at all anymore, just the yoda or bids configurations.

@yarikoptic

I have made some progress, although through painful trial and error (questions in bold below).

At first, I interrupted the datalad save command, a big mistake which left 800 GB of data in the .git/annex directory. I cleaned up the .git folder with git annex unused; git annex dropunused all, and the .git folder was still 9.5 GB in size, probably coming from a huge list of commits. I even tried to delete the commit history to start fresh, but the 9 GB were still there and datalad status would just hang for a long time. I ended up deleting the whole freesurfer folder and starting over.

Now, I finally got all subjects as subdatasets inside the freesurfer parent dataset. To my surprise, calling datalad status is still slow; it takes 9 minutes just to get a response. This is after setting the -e no flag. At this point, I am not sure what datalad is doing. Aren't -e no or -e commit supposed to check just whether the parent dataset references the current commit of the child dataset? This should be a matter of seconds, not minutes. Is this the expected behavior that the user must just get used to (i.e., waiting minutes to check the status of a superdataset)?

Here is the log:

[dorian@mri FEDERATED_DATASETS]$  time datalad status -e no -r
  unknown: study1 (dataset)
  unknown: study1/data/bids (dataset)
  unknown: study1/data/processed (dataset)
  unknown: study1/data/processed/fmriprep (dataset)
 modified: study1/data/processed/freesurfer (dataset)
  unknown: study1/data/processed/freesurfer/sub-XXXXXXXXXXX (dataset)
  unknown: study1/data/processed/freesurfer/sub-XXXXXXXXXXX (dataset)
  unknown: study1/data/processed/freesurfer/sub-XXXXXXXXXXX (dataset)
  unknown: study1/data/processed/freesurfer/sub-XXXXXXXXXXX (dataset)
  unknown: study1/data/processed/freesurfer/sub-XXXXXXXXXXX (dataset)
  unknown: study1/data/processed/freesurfer/sub-XXXXXXXXXXX (dataset)
...
  unknown: study1/data/processed/freesurfer/sub-XXXXXXXXXXX (dataset)
  unknown: containers (dataset)
  unknown: containers/artwork (dataset)

real    8m32.053s
user    4m18.220s
sys     4m34.170s
[dorian@mri FEDERATED_DATASETS]$  time datalad status -e commit -r
 modified: study1/data/processed/freesurfer (dataset)

real    8m37.821s
user    4m23.359s
sys     4m34.849s

A couple of ideas to improve datalad functionality:

  • Add merge and split subdataset functions. I.e., if I have a huge dataset and want to split it into chunks of subdatasets, datalad should take care of it. It is quite complicated for the user to try to do that manually with all the various .git folders, internal references, etc. The same logic goes for merging subdatasets into a larger dataset.
  • Add an option when creating a subdataset to point to an external folder and have datalad copy the entire content of that folder and save it in the newly created subdataset. This would save the user from having to copy the data, create the dataset with --force, and save the subdataset.
  • Add a command to thoroughly check the .git/annex folders for unused files and drop them all. git-annex can do that, but it is rather tricky for the user (I thought the clean command would do it, but it does not).

P.S. BTW, I have updated to datalad v0.12.0rc6.

Well, if you interrupted it, redoing save AFAIK should just do the right thing; git-annex keeps a journal of actions which are done but not yet committed. Then at some point git gc would have kicked in, which would have removed objects which might have been created but not referenced. I sometimes force that by really doing git gc --prune=now if I know that everything I care about should be referenced by git at that point. Whenever you analyze the size of .git, it matters which directory in particular is large (I usually do du -scm .git/* and see whether it is annex/ or objects/).
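Concretely, that inspection might look like (run from the dataset root):

$> du -scm .git/*      # is it .git/annex or .git/objects that takes the space?
$> git gc --prune=now  # only once everything you care about is referenced by git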

How does it compare to plain git status timing? Some issues are known (e.g. https://github.com/datalad/datalad/issues/3766 etc.), some are yet to be discovered, but there is a cost to verifying that no changes exist in a large file tree.

I think this thread has grown too long already and we are getting into specific issues. Please file a new issue on https://github.com/datalad/datalad/issues/ for your observations and ideas (or maybe you will find one which matches and can just comment on it; we need to know that there is demand).

https://github.com/datalad/datalad/issues/3554 – added your comment

Do you mean you want a dedicated datalad command to do datalad create . && cp -rp --reflink=auto sourcedir/* . && datalad save? I am not sure it is worth a dedicated command. Also, there is git annex import if you would like to continuously add changes from a folder.
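Spelled out, the manual route would be roughly (sourcedir and the dataset name are placeholders):

$> datalad create newds && cd newds
$> cp -rp --reflink=auto /path/to/sourcedir/* .
$> datalad save -m "Add content of sourcedir"
# or, to keep pulling in new files from that folder over time
# (note: by default git annex import moves the files; see its --duplicate option to copy instead):
$> git annex import /path/to/sourcedir/*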

Hm… it might indeed be worth an additional mode for datalad clean, or even better datalad drop. Please file an issue. It could be a tricky one since, as you saw, there could be a number of definitions of “unused”, and thus we might first need to identify common use cases. Otherwise, indeed git annex unused ... && git annex dropunused all would provide the most flexible approach. Also please consider contributing to http://github.com/datalad-handbook/book on this.

@yarikoptic

git status timing is just 5 seconds on the same superdataset that took 8 minutes before with datalad status -r -e no.

$ time git status
On branch master
nothing to commit, working tree clean

real    0m4.951s
user    0m3.533s
sys     0m6.936s

datalad status without the recursive option takes 2.5 minutes on that same folder.

$ time datalad status

real    2m37.926s
user    1m8.018s
sys     1m30.672s

Thank you for the timings. For now, let me just recommend using git status whenever you would like to check the overall status of large datasets. We will resolve this horrible performance issue sooner rather than later, e.g. maybe by caching the results similarly to git (that is how it was so fast here: datalad status had queried git right before).


Thanks, looks like a reasonably quick status can be obtained from all submodules with:
git submodule foreach --recursive 'git status -s'

$ time git submodule foreach --recursive 'git status -s'
Entering 'study1'
 ? data/processed
Entering 'study1/data/bids'
Entering 'study1/data/processed'
 ? freesurfer
Entering 'study1/data/processed/fmriprep'
Entering 'study1/data/processed/freesurfer'
?? testme.test
Entering 'study1/data/processed/freesurfer/sub-XXXXXXXXXXX'
Entering 'study1/data/processed/freesurfer/sub-XXXXXXXXXXX'
Entering 'study1/data/processed/freesurfer/sub-XXXXXXXXXXX'
Entering 'study1/data/processed/freesurfer/sub-XXXXXXXXXXX'
Entering 'study1/data/processed/freesurfer/sub-XXXXXXXXXXX'
Entering 'study1/data/processed/freesurfer/sub-XXXXXXXXXXX'
...
Entering 'study1/data/processed/freesurfer/sub-XXXXXXXXXXX'
Entering 'study1/data/processed/freesurfer/sub-XXXXXXXXXXX'
Entering 'study1/data/processed/freesurfer/sub-XXXXXXXXXXX'
Entering 'containers'

real    0m39.430s
user    0m23.770s
sys     0m38.837s

Hey,

I have not been following this closely, so I might be wrong. But if you have lots of files in these datasets then a default datalad status will traverse the filesystem more intensely than a git status, because it will distinguish symlinks to annex keys from just symlinks in the type report. If you don’t care about this accuracy, use the -t raw option. With it, a datalad status should be rather close to the runtime of an uncached git status.
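E.g., mirroring the recursive call from above (the -t raw option is the one mentioned here, from the 0.12-era datalad status):

$> datalad status -r -t raw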

HTH

For more info on the behavior of the system on datasets with large N, check out https://github.com/datalad/datalad/issues/3869.