Datalad beginner: Compressed sibling, YODA and singularity

Sorry for the double post, I can’t edit my topic to add the information I found.

    1. Is this good or is the full containerized pipeline better for reproducibility (since singularity and nextflow are thus required on the host to run the pipelines)?

Reading this and this, it looks like my solution is not that bad.

    1. Can datalad run track provenance on HPC with SLURM and nextflow?

Having a dataset containing all the small images for the tools allows me to keep the run script as a single .nf file that I can run with:

```
datalad run -m 'Run my workflow' 'nextflow run code/my_workflow.nf'
```

Since this is a single command that manages the jobs internally, no parallel executions of datalad run occur, so I don’t have to handle throw-away clones.
Is that right?
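For the record, a hedged sketch of how that invocation could also declare its inputs and outputs, so the provenance record captures them explicitly (the `--input`/`--output` options are real datalad run flags; the `sourcedata/` and `derivatives/` paths are just placeholders for this project's actual layout):

```shell
# Let datalad get the inputs before running and save the outputs after;
# the whole nextflow pipeline still runs as one tracked command.
datalad run -m 'Run my workflow' \
  --input 'sourcedata/*' \
  --output 'derivatives/' \
  'nextflow run code/my_workflow.nf'
```

Because nextflow (not datalad) schedules the SLURM jobs internally, only this single datalad run record is created, which matches the reasoning above.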

    1. The storage server we have access to at the lab is available through samba and ssh […]

Generally, this topic is very useful for setting up a sibling over SSH, and it describes a configuration very similar to the one I’m trying to produce. Yet I can’t find any information on the compression part of my question.
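One possible angle on the compression part, as a sketch only: instead of a plain SSH sibling, an encrypted git-annex special remote could be used, since gpg compresses data by default before encrypting it. The `type=rsync`, `rsyncurl` and `encryption` parameters are real git-annex initremote options; the remote name `labstore` and the server path are placeholders:

```shell
# Create an rsync-over-ssh special remote with shared encryption;
# gpg's default compression shrinks the annexed files in transit and at rest.
git annex initremote labstore type=rsync rsyncurl=server.example.org:/data/project encryption=shared

# Send annexed content to it
git annex copy --to labstore .
```

Whether the gpg-side compression is worthwhile depends on the data (already-compressed imaging formats like .nii.gz won't shrink further), so this is a workaround to evaluate, not a definitive answer to the question.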

    1. All the datasets involved in the project (except the sourcedata and the singularity ones) have nested datasets. Will it duplicate data on the sibling?

Never mind, I understand now: unless I’m using a RIA store, only the files tracked by git-annex in a given dataset are pushed to its SSH sibling. A superdataset does not track its subdatasets’ file content, only a reference to their state, so no duplication of data is possible.
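To make that concrete, a minimal sketch of pushing the whole hierarchy, assuming each (sub)dataset already has a sibling with the same name (`labserver` is a placeholder); `--recursive` is a real datalad push flag:

```shell
# Each subdataset pushes its own annexed content to its own sibling exactly once;
# the superdataset only records which subdataset commits it points at.
datalad push --to labserver --recursive
```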

I’ll keep searching the web for answers to questions 1.3, 2.2 (just a confirmation), and 3.1.
Have a nice day :slight_smile: