Using fmriprep with datalad containers-run

I’d like to use datalad containers-run to execute a singularity container for fmriprep-1.5.0 on my BIDS-formatted dataset, but have a couple of issues. Apologies if these would be better presented as separate posts…

I can execute a SLURM array of fmriprep jobs on my BIDS-formatted dataset (a la https://fmriprep.readthedocs.io/en/latest/singularity.html), but would like to use datalad containers-run to wrap the command. I created a virtual environment for datalad via conda-forge, updated datalad to version 0.12.0rc6, and installed the datalad-container extension on our HPC (CentOS 7), but am a little uncertain how/where to specify appropriate bind mounts.
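(For reference, my setup was roughly the following; the environment name is arbitrary, and the release candidate may need to come from PyPI rather than conda-forge:)

    # create a conda environment with datalad from conda-forge
    conda create -n datalad -c conda-forge datalad
    conda activate datalad
    # move to the release candidate and add the containers extension
    pip install --upgrade datalad==0.12.0rc6 datalad-container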

Do I need to include the --bind arguments with the --call-fmt flag when I add the fmriprep singularity image via datalad containers-add, or can I specify bind mounts in the datalad containers-run call?

I’m also having trouble using datalad containers-add to pull the fmriprep singularity image, although I can successfully datalad containers-add an fmriprep image that I built ahead of time by specifying its path rather than using the --url flag.

I can add a heudiconv container with no problem via datalad containers-add -d . heudiconv --url docker://nipy/heudiconv:0.5.4, but when I try to add an fmriprep container via datalad containers-add -d . fmriprep --url docker://poldracklab/fmriprep:latest, I receive the following error:

    ERROR:   build: failed to make environment files: open /tmp/sbuild-432867294/fs/etc/resolv.conf: permission denied
    FATAL:   While performing build: packer failed to pack: while inserting base environment: build: failed to make environment files: open /tmp/sbuild-432867294/fs/etc/resolv.conf: permission denied

The only hit I could find for a similar error is for a closed singularity issue (https://github.com/sylabs/singularity/issues/4532), but I’m not sure if it applies, since I have a more recent version of singularity installed.

I’ve attached a datalad_containers-add_fmriprep_errors.txt file with the error message in its entirety in case this isn’t the heart of the issue (but it’s pretty lengthy), and also a datalad_wtf_output.txt.

OS: CentOS 7
singularity version: 3.4.1-1.el7
datalad version: 0.12.0rc6

Thanks much!

paging @yarikoptic here

I will check in detail later. Meanwhile, you could try doing it my recommended way via Singularity, which works even on OSX (it would go via Docker): https://github.com/ReproNim/containers/blob/master/README.md#a-typical-workflow
If you are up for a challenge and have a Condor or PBS cluster, you could try my not-yet-finished recipe (https://github.com/ReproNim/reproman/pull/438), which uses reproman run to parallelize per-subject runs in one command. That recipe is a bit convoluted atm, though :wink:

Arguments like --bind that are intended for singularity rather than the underlying command should be specified with --call-fmt when calling containers-add. It’s also fine to edit the cmdexec value in .datalad/config after the fact.
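For example, something along these lines (the bind path is a placeholder for your site’s filesystem):

    # add the image with a custom call format; {img} and {cmd} are
    # placeholders that datalad-container expands at execution time
    datalad containers-add fmriprep \
        --url docker://poldracklab/fmriprep:1.5.0 \
        --call-fmt 'singularity run --cleanenv --bind /data:/data {img} {cmd}'

which should end up in .datalad/config as something like

    [datalad "containers.fmriprep"]
        image = .datalad/environments/fmriprep/image
        cmdexec = singularity run --cleanenv --bind /data:/data {img} {cmd}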

Wrapping batch submission with datalad containers-run is tricky. It implies containers-run is sitting on top, which means its call format would need to point to some sort of wrapper script that handles the batch submission. It’s also problematic at the datalad run level. (containers-run is a thin wrapper that constructs a command for datalad run.) The command that executes would be for job submission, but, as far as DataLad is aware, everything is done after the submission command exits, so it will wrap up and try to make a run commit.
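To make that concrete, a hypothetical configuration might look like the following, where code/submit_and_wait.sh is a made-up wrapper name:

    # point the call format at a wrapper that handles batch submission
    datalad containers-add fmriprep \
        --url docker://poldracklab/fmriprep:1.5.0 \
        --call-fmt 'code/submit_and_wait.sh {img} {cmd}'

That only behaves if the wrapper blocks until the batch job finishes; if it merely submits and exits, datalad run will record its commit before any outputs exist.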

datalad-htcondor (written by Michael Hanke) deals with condor batch submission by handling the submission outside of run and then, once everything is complete, faking a run commit in the dataset. The reproman run functionality that Yarik pointed to has some output processing that was inspired by (stolen from) datalad-htcondor’s approach, and it too injects those outputs back into the dataset by faking a run commit.

Regarding the containers-add failure, the issue you posted does look very similar. Are you sure you have a singularity version that includes the fix? You reported your singularity version as 3.4.1-1.el7, but as far as I can tell the first tag to contain the fix (278a4827f) is v3.5.0-rc.1.

Thanks so much for the detailed response, Kyle!

Gotcha! I thought that might be the case, but wasn’t quite sure if I could specify bind mounts with --call-fmt (or edit the cmdexec value in .datalad/config).

What if I call datalad containers-run from within the script that gets submitted as a SLURM array? I know it’s not ideal, as there’s code being executed outside of datalad run, but I couldn’t figure out how else to combine datalad run functionality with SLURM arrays.
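(Something like this hypothetical array script, say, where code/subjects.txt is a subject list I’d maintain and the fmriprep arguments are abbreviated:)

    #!/bin/bash
    #SBATCH --array=1-10
    # look up the subject corresponding to this array task
    subj=$(sed -n "${SLURM_ARRAY_TASK_ID}p" code/subjects.txt)
    # each job invokes containers-run for its own subject
    datalad containers-run -n fmriprep \
        --input "sub-${subj}" \
        --output "derivatives/fmriprep" \
        ". derivatives/fmriprep participant --participant-label ${subj}"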

I guess that also wouldn’t address the problem of race conditions introduced by datalad making potentially simultaneous run commits (or attempting to make run commits on a dataset modified by other ongoing jobs). Is it possible to introduce similar faked run commits on an HPC that uses SLURM rather than Condor?

I think I misread the GitHub issue, and was under the mistaken impression that a patch had been backported to singularity v3.4.1. I’ll try to get my cluster admins to make v3.5.0-rc.1 available on our HPC!

Thanks again for your help!

Actually, the datalad containers-add issue seems to have been resolved. I’m not sure why, as our HPC did not appear to update singularity, but I’m not gonna complain…

Right, I think you hit on the core issues: (a) if you move the datalad (containers-)run invocation to an inner layer, you’re not capturing information about the outer layers, and (b) datalad run isn’t designed to support concurrent operations in the same working tree. Unless datalad run is redesigned to address (b), I don’t see a way around (a).

So the current approaches I mentioned avoid (b) (and suffer from (a)) by handling the submission and execution themselves. Once all the jobs are complete, they create the run commit.

Speaking for the reproman approach, how to handle this is an unresolved issue. The single run commit that is created is for all the subjobs and is not really a usable run commit. (Should we be creating a run commit then? Probably not.) Along with the output files produced by the command, the commit includes files that have information about the submission, and the commit message has an ID that links it to a particular submission.

Conceptually I think an appealing way to deal with concurrent jobs would be to run each job in a separate working tree (see the sketch after the diagram below). DataLad doesn’t support git worktree-created working trees at the moment, but it can create “reckless” clones that are cheaper than a regular clone [*]. The wrapper script could then call datalad run in the dedicated working tree, and each run would get a commit (from the proper starting point). Once all the jobs are complete, the wrapper script could create a merge commit bringing in all the subjobs’ lines. Practically, that’s probably quite a bit of work to set up, with lots of issues to solve. Also, if there are lots of subjobs, that could quickly lead to a ridiculous number of lines being merged. And, from a visualization standpoint, even an octopus merge with just four lines starts to get unwieldy:

o---.
|\ \ \
| | | o
| | o |
| | |/
| o |
| |/
o |
|/
o
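
A rough sketch of that per-working-tree idea (the paths are illustrative, and --reckless semantics have shifted across DataLad versions):

    # hypothetical per-job wrapper: cheap clone, dedicated branch, then run
    job="job-${SLURM_ARRAY_TASK_ID}"
    datalad clone --reckless /path/to/super "$job"
    cd "$job"
    git checkout -b "$job"                  # one branch per subjob
    datalad containers-run -n fmriprep ...  # per-subject command
    # once all jobs are done, merge the subjob branches back, e.g.:
    # git merge job-1 job-2 job-3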

[*] If I recall correctly, Michael had an idea (in the context of datalad-htcondor) for using git annex export/import, but that was an offline discussion and I don’t recall the details now. I’m not aware of anything being done with that yet.

reproman doesn’t support SLURM submissions yet; any help on that end would of course be appreciated :). But, as I mentioned above, the run commits produced by reproman have outstanding issues.

I see. Thanks again for the comprehensive response!

I’d be happy to help work on reproman support for SLURM submissions! I’m still a little green with datalad, but I’m down to contribute in whatever way I can…

FTR, the “official” TODO issue within reproman for a SLURM submitter is https://github.com/ReproNim/reproman/issues/484 .
