Datalad beginner: Compressed sibling, YODA and singularity

Hi!
I’m currently preparing a new project and I’d like to make it as reproducible as possible. Since my lab does not use Datalad yet, I’m trying to document the process (with regard to the platforms available), but I have a few questions about good practice.

  1. Following YODA, each step should be stored in a separate dataset, where inputs/ contains the input dataset(s) and outputs/ contains the results.
    1. Is it possible to store the data in outputs/ in a separate dataset (see the sketch after this list)? This would make further steps cleaner, since data would always be directly under outputs/ and not inputs/outputs/.
      (See 7314 and 5509)
    2. Would provenance still be recorded for this sub-dataset when datalad run is used in the super-dataset?
      (See 7314 and 5509)
    3. Some steps of the pipelines generate a bunch (300+) of files for each subject. Would it be better to create a sub-dataset for each subject and group them inside a super-dataset? And how can I keep track of the provenance when dealing with sub-datasets as outputs?
  2. I’m using Nextflow to define my pipelines, since it makes it straightforward to go from local execution to HPC. Everything I read about reproducibility recommends creating a containerized pipeline. Since some tools are required in multiple steps of the project and some aren’t, I’ve been creating separate singularity images that I use in my nextflow pipeline. This way, I can create a dataset with all the images I need and simply datalad get the ones required for a specific step (see the sketch after this list).
    1. Is this approach good, or is a fully containerized pipeline better for reproducibility (since with my approach both singularity and nextflow are required on the host to run the pipelines)?
    2. Can datalad run track provenance on HPC with SLURM and nextflow?
  3. The storage server we have access to at the lab is available through samba and ssh. I want to create a sibling of the datasets there, but the free space I have is limited (I can buy more, but I’m trying to reduce the data footprint), while still keeping the repository on our GitLab.
    1. Is it possible to compress the whole dataset into a .tar.xz archive on the sibling? I read that encryption has the side effect of compressing data, but I can’t find information about what gets compressed (each file separately, or the whole dataset).
    2. All the datasets involved in the project (except the sourcedata and the singularity ones) have nested datasets. Will it duplicate data on the sibling?
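
To make this concrete, here is a rough sketch of the layout I have in mind for 1.1 and 2 (the dataset names, paths and image names are placeholders I made up):

    # hypothetical sketch: a YODA step dataset with outputs/ as its own subdataset
    datalad create step-01 && cd step-01
    datalad clone -d . ../sourcedata inputs/raw     # input data as a subdataset
    datalad create -d . outputs                     # results live in a nested subdataset
    datalad clone -d . ../containers containers     # dataset holding the .sif images
    datalad get containers/images/tool-a.sif        # fetch only what this step needs

    # run from the superdataset; -o reaches into the outputs subdataset
    datalad run -m 'Step 01' -i inputs/raw -o outputs \
        'singularity exec containers/images/tool-a.sif bash code/step01.sh'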

Thanks for your responses, and sorry for all those questions in the same topic (I tried to read as much as possible on this forum, but I might have missed some topics). Since I read about Datalad, I feel like I’ve entered a rabbit hole. Working with it on my machine is not a problem, but I’ve never used it to collaborate on a project or publish data and results.


Sorry for the double post, I can’t edit my topic to add the information I found.

    2.1. Is this approach good, or is a fully containerized pipeline better for reproducibility (since with my approach both singularity and nextflow are required on the host to run the pipelines)?

Reading this and this, it looks like my solution is not that bad.

    2.2. Can datalad run track provenance on HPC with SLURM and nextflow?

Having a dataset containing all the small images for the tools allows me to keep the run script as a single .nf file that I can run with:

datalad run -m 'Run my workflow' 'nextflow run code/my_workflow.nf'

Since this is a single command that manages the jobs internally, no parallel execution of datalad run occurs, and thus I don’t have to handle throw-away clones.
Is that right?
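
If I understand datalad run correctly, I could also declare the inputs and outputs explicitly so they are fetched and unlocked automatically (the paths below are placeholders):

    datalad run -m 'Run my workflow' \
        -i inputs -i containers -o outputs \
        'nextflow run code/my_workflow.nf'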

  3. The storage server we have access to at the lab is available through samba and ssh […]

Generally, this topic is very useful for setting up a sibling over SSH and describes a configuration very similar to the one I’m trying to produce. Yet I can’t find information about the compression part of my question.

    3.2. All the datasets involved in the project (except the sourcedata and the singularity ones) have nested datasets. Will it duplicate data on the sibling?

Never mind – I understand now that, unless I’m using a RIA store, only the files tracked by git-annex are pushed to the SSH sibling. A super-dataset does not track the data of its sub-datasets itself, so there is no possible duplication of data.
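
So pushing recursively should simply make each nested dataset publish its own annexed files to its sibling (the sibling name below is made up):

    datalad push --to storage --recursive    # push super- and sub-datasets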

I continue to search the web for answers about questions 1.3, 2.2 (just a confirmation) and 3.1.
Have a nice day :slight_smile:

It depends on the number of subjects and how you expect this dataset to be reused later… If you end up with about 100,000 files or fewer, you could just keep them all in a single dataset. Otherwise you might indeed want to establish a subdataset per subject. A relevant recent wishlist issue on the datalad end, to automate the creation of such subdatasets (no work done on it yet though): RFC: datalad.save.subdataset config sections to define when/how to automagically establish subdatasets · Issue #5423 · datalad/datalad · GitHub
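
Until something like that is implemented, a simple loop could establish them manually, e.g. (the subject IDs are made up):

    # minimal sketch: one subdataset per subject under the results dataset
    for sub in sub-01 sub-02 sub-03; do
        datalad create -d . "outputs/${sub}"    # creates and registers a subdataset
    done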
Indeed, the sub-dataset itself would then not be fully aware of its parent, but the datalad run record will store the uuid of the parent dataset where run was executed, so at least it is not “all lost”.

Since unfortunately you cannot (typically: running singularity within singularity - ruhroh! · Issue #1245 · hpcng/singularity · GitHub) run singularity containers within singularity, you cannot have a top-level container with nextflow etc. that then launches individual containers for some steps. So I would say that indeed a single container would have been better, since then there is no reliance on the system having nextflow etc. installed. BUT I do appreciate the inflexibility/ivory-tower aspect of such an approach: you would need to rebuild the entire container to add a new tool, etc. So if the project is still in heavy flux and you are adding new tools frequently, it might be quite suboptimal.

Yes, confirming your answer. Here are more thoughts…

Only if every step used datalad run internally would you get individual datalad run records for each step, but IMHO that would be overkill. See Using datalad for provenance tracking in nipype - how to? - #2 by yarikoptic and the links there. Also, in relation to the previous question: if nextflow submits individual jobs via SLURM, then those would run outside of the container that nextflow is in (right?). In that case you could indeed have individual containers for each step, and there would be no need for a monolithic single container.
Related: in GitHub - ReproNim/reproman: ReproMan (AKA NICEMAN, AKA ReproNim TRD3), although still supporting/using datalad, we do not use/create datalad {containers-,}run records, but rather schedule jobs on PBS/SLURM/… and then do a single datalad save, which also stores the job specification, where the submitter (SLURM or condor) is just a parameter. Later on, you/someone should be able to rerun that job on another cluster with another submitter by modifying those settings (I am not familiar with nextflow, so I am not sure how easy it is there to switch from e.g. slurm to condor on another HPC… probably just as easy).

That is what the RIA store you found is for, AFAIK. With an encrypted git-annex special remote, you encrypt/compress each annexed file individually.
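
E.g. a minimal sketch with an rsync special remote (the remote name, URL and key id are placeholders):

    # each annexed file is GPG-encrypted (and thereby compressed) individually
    git annex initremote labstore type=rsync \
        rsyncurl=user@server:/data/annex \
        encryption=hybrid keyid=YOUR-GPG-KEYID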

If I got the question right, I think the conclusion above (that there is no possible duplication of data) is not right – if you have a file with checksum X present in multiple datasets (sub and/or super), you will duplicate it across all of them. (Well, there could be a shared-cache git annex which would be populated with all of them and then efficiently reused across all of them – via hardlinks, CoW on an appropriate file system, etc. – but that might be a logistic overkill.)


Hi,
Thanks for your reply! :slight_smile:

I’ll keep that in mind, maybe I’ll try to create .tar archives at runtime to reduce the number of files.

Having 100% provenance tracking is not my main focus. If at least the where and the when are stored somewhere, then, since YODA implies that only one task is performed per dataset, the what can be determined.

I’ll keep that in mind; maybe building a single full singularity image will be possible once everything is fixed. That would make publishing easier. But yeah, what a pity that singularity cannot run singularity…

OK, that’s what I was thinking. Indeed, changing the executor in nextflow is just a parameter, so going from SLURM to local to Condor or even cloud grids is very easy.
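
E.g. with profiles defined in nextflow.config, switching is just a command-line flag (the profile name here is only an example):

    # assuming nextflow.config contains something like:
    #   profiles { slurm { process.executor = 'slurm' } }
    datalad run -m 'Run my workflow on SLURM' \
        'nextflow run code/my_workflow.nf -profile slurm'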

Thus, as far as I understand, a RIA store also keeps the raw git history. That means it stores more than just the annexed objects and therefore requires more storage. I’ll take a look at how to set up such a repository to keep things clean and separate (annex on the server, git on GitLab).
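
From what I read in the handbook, I expect the setup to look roughly like this (the sibling names and URL are placeholders, and ‘gitlab’ assumes a sibling on our GitLab already exists):

    # creates a git sibling "labstore" plus a "labstore-storage" special
    # remote for the annexed data (newer datalad may also need --new-store-ok)
    datalad create-sibling-ria -s labstore ria+ssh://user@server/data/ria-store
    datalad push --to labstore    # pushes git history and annexed objects
    datalad push --to gitlab      # pushes only the git history to GitLab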

But if the file is only in one dataset A, and A is installed in another super-dataset B, the file only appears in A on the sibling, no?

Once again, thanks for all those answers!

quick partial one:

Note that .tar doesn’t allow for random access to the files within it – there is no index. Better to use .xz (as the ORA remote used by RIA stores does).
