Disable datalad in fmriprep? Or a templateflow-something issue?

winkler · February 14, 2019, 6:40pm

Dear all,

I’m trying to run fmriprep 1.3.0.post1. The error is below. It seems to be related to datalad or templateflow. Note that:

1) This is a cluster that has no access to the internet.
2) There are multiple instances of fmriprep running at the same time, so surely a path named as ~/.cache/templateflow will be used by more than one instance at the same time.

Not sure if either of these is the cause of the problem…

How to fix? Thanks!

All the best,

Anderson

[INFO] Cloning https://github.com/templateflow/templateflow.git [1 other candidates] into '/home/winkleram/.cache/templateflow'
[ERROR] could not create work tree dir '/home/winkleram/.cache/templateflow'.: File exists [install(/home/winkleram/.cache/templateflow)]
/gpfs/gsfs6/users/EDB/MErest/code/env-hpc/lib/python3.6/site-packages/datalad/distribution/dataset.py:474: DeprecationWarning: inspect.getargspec() is deprecated, use inspect.signature() or inspect.getfullargspec()
orig_pos = getargspec(f).args
/gpfs/gsfs6/users/EDB/MErest/code/env-hpc/lib/python3.6/site-packages/datalad/interface/base.py:682: DeprecationWarning: inspect.getargspec() is deprecated, use inspect.signature() or inspect.getfullargspec()
argspec = getargspec(call)
[WARNING] path not associated with any dataset [get(/home/winkleram/.cache/templateflow)]
Process Process-2:
Traceback (most recent call last):
  File "/usr/local/Anaconda/envs/py3.6/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/local/Anaconda/envs/py3.6/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/gpfs/gsfs6/users/EDB/MErest/code/env-hpc/lib/python3.6/site-packages/fmriprep/cli/run.py", line 560, in build_workflow
    from ..workflows.base import init_fmriprep_wf
  File "/gpfs/gsfs6/users/EDB/MErest/code/env-hpc/lib/python3.6/site-packages/fmriprep/workflows/base.py", line 24, in <module>
    from niworkflows.interfaces.bids import (
  File "/gpfs/gsfs6/users/EDB/MErest/code/env-hpc/lib/python3.6/site-packages/niworkflows/interfaces/bids.py", line 28, in <module>
    STANDARD_SPACES = _get_template_list()
  File "/gpfs/gsfs6/users/EDB/MErest/code/env-hpc/lib/python3.6/site-packages/templateflow/api.py", line 51, in templates
    api.install(path=str(TF_HOME), source=TF_GITHUB_SOURCE, recursive=True)
  File "/gpfs/gsfs6/users/EDB/MErest/code/env-hpc/lib/python3.6/site-packages/datalad/interface/utils.py", line 491, in eval_func
    return return_func(generator_func)(*args, **kwargs)
  File "/gpfs/gsfs6/users/EDB/MErest/code/env-hpc/lib/python3.6/site-packages/datalad/interface/utils.py", line 479, in return_func
    results = list(results)
  File "/gpfs/gsfs6/users/EDB/MErest/code/env-hpc/lib/python3.6/site-packages/datalad/interface/utils.py", line 467, in generator_func
    msg="Command did not complete successfully")
datalad.support.exceptions.IncompleteResultsError: Command did not complete successfully [{'action': 'install', 'path': '/home/winkleram/.cache/templateflow', 'type': 'dataset', 'status': 'error', 'message': "could not create work tree dir '/home/winkleram/.cache/templateflow'.: File exists", 'source_url': 'https://github.com/templateflow/templateflow.git'}, {'action': 'get', 'path': '/home/winkleram/.cache/templateflow', 'refds': '/home/winkleram/.cache/templateflow', 'raw_input': True, 'orig_request': '.', 'state': 'absent', 'status': 'impossible', 'message': 'path not associated with any dataset'}]

oesteban · February 14, 2019, 7:20pm

Hi @winkler,

It seems to me that you are working on a custom installation of fMRIPrep, is that correct?

Please confirm that 1) is true at runtime, but that you can have internet access while setting up the environment. If that is the case, then you just need to pull the whole templateflow down before running. That is not a crazy amount of data and will keep datalad/git-annex/templateflow quiet.

For reference, this is what we do when building container images:

https://github.com/poldracklab/fmriprep/blob/master/Dockerfile#L164-L173

RUN datalad install -r https://github.com/templateflow/templateflow.git
RUN datalad get $TEMPLATEFLOW_HOME/tpl-MNI152NLin2009cAsym/* \
            $TEMPLATEFLOW_HOME/tpl-MNI152Lin/* \
            $TEMPLATEFLOW_HOME/tpl-OASIS30ANTs/* \
            $TEMPLATEFLOW_HOME/tpl-NKI/*
RUN git -C $TEMPLATEFLOW_HOME config annex.merge-annex-branches false && \
    git -C $TEMPLATEFLOW_HOME/tpl-MNI152NLin2009cAsym config annex.merge-annex-branches false && \
    git -C $TEMPLATEFLOW_HOME/tpl-MNI152Lin config annex.merge-annex-branches false && \
    git -C $TEMPLATEFLOW_HOME/tpl-OASIS30ANTs config annex.merge-annex-branches false && \
    git -C $TEMPLATEFLOW_HOME/tpl-NKI config annex.merge-annex-branches false

In your case, I’d proceed as follows:

1) Make sure datalad and git-annex are installed and functional
2) cd ~/.cache
3) datalad install -r -g https://github.com/templateflow/templateflow.git

The -g flag tells datalad to download all contents, which in combination with -r (recursive) will get you the whole repo installed.

Since no more writes are necessary, that should fix your issue (and the concurrent access problem). However, if you keep having trouble, you may want to make TemplateFlow read-only with https://github.com/poldracklab/fmriprep/blob/master/Dockerfile#L169-L173.

Please let us know if this worked out for you.

winkler · February 14, 2019, 9:36pm

Hi Oscar,

Many thanks for the quick feedback. Not sure what you mean by a custom installation. It was installed in a Python virtual environment (virtualenv) with pip. It was easy and painless. We do not, and we will not, use Docker or Singularity again anytime soon.

I see on github that templateflow contains a number of .nii.gz templates. Why can’t we just clone a repository such as that and have the files available? Or download the files from somewhere else? Is it because of size? These files don’t seem to be that large…

This is a CentOS system that is used by hundreds of people and whose admins are extremely careful about installing unstable software (evidence for git-annex being unstable comes from the documentation itself: https://git-annex.branchable.com/install/fromsource/). Does it need to be like this? Does one need a whole universe of Haskell libraries just to download a bunch of NIFTI files?

Note that datalad requires git-annex newer than the version released in 13/September/2018 (that is, just 5 months ago), whereas the stable version that is available in EPEL is from 2014. Is this the best solution possible?

Thanks for the hard work on this, and looking forward to continuing to use fMRIprep.

All the best,

Anderson

oesteban · February 14, 2019, 10:01pm

Thanks, that is exactly what I meant. I should’ve used the “bare-metal” term for clarity here.

Yes, what you see there is a git-annex repository maintained and managed by DataLad. Those are, if you look at them, links to git-annex remote files. The actual files are hosted on OSF - https://osf.io/ue5gx/. They are not too large, either.

I understand where you come from here, and we are thinking about these problems too. @yarikoptic and @eknahm, can you think of a way of allowing users to download the whole datalad dataset ahead of time and prevent datalad from executing any git-annex command? Maybe after exporting to figshare, for example?

On the other hand, datalad is really effective at keeping TemplateFlow under version control, and it makes it easy to transparently report the exact templates that were used in the processing. Since we intend to greatly expand the set of available templates, we needed a tool, but we didn’t want to take that load on our shoulders. DataLad was just the tool we needed.

Since git-annex provides standalone distributions that you can set at the user level, in principle I don’t see a strong need for your admins to install git-annex.

Thank you for using it and for all this valuable feedback. We are aware of the heavy friction users need to get through and we hope we solve more problems than we are creating.

winkler · February 14, 2019, 10:13pm

Thanks again! We really appreciate the help and understand how complicated these various decisions and trade-offs can be.

I think for now we’ll just downgrade to fMRIprep 1.2.6-1, as that appears not to use templateflow yet, and it has the features we need (multi-echo processing). We’ll wait until you guys find the best solution for the templates: for example, setting the minimum required git-annex version to some older release for which it may be easier to find compiled packages for different distros, or having the management of these templates be entirely Python-based or done through some other interface…

Thank you again!

Anderson

eknahm · February 15, 2019, 10:40am

I understand where you come from here, and we are thinking about these problems too. @yarikoptic and @eknahm, can you think of a way of allowing users to download the whole datalad dataset ahead of time and prevent datalad from executing any git-annex command? Maybe after exporting to figshare, for example?

If I got it right, the issue is not whether or not to use datalad to download the needed files, but rather not needing to download anything at all. I am not familiar with the file structure that is needed, but in general nothing prevents tar’ing up an entire, populated dataset (with annex and all) and placing it where it needs to be on the target system.
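
A minimal sketch of that tarball route, using only the Python standard library. It assumes the dataset was fully populated on a machine with internet access, and that it is then unpacked once on the cluster before any fmriprep job starts. The function names and paths are illustrative, not part of datalad or templateflow. Symlinks are stored as symlinks, which matters because git-annex represents file content through relative links.

```python
import tarfile
from pathlib import Path


def pack_dataset(dataset_dir, archive_path):
    """Archive a populated dataset directory, keeping symlinks as symlinks."""
    dataset_dir = Path(dataset_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps only the top-level directory name inside the archive
        tar.add(dataset_dir, arcname=dataset_dir.name)
    return Path(archive_path)


def unpack_dataset(archive_path, target_parent):
    """Unpack the archive under target_parent and return the dataset path."""
    target_parent = Path(target_parent)
    target_parent.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        top_level = tar.getnames()[0].split("/")[0]
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions: use the safe extraction filter
            tar.extractall(target_parent, filter="data")
        else:
            tar.extractall(target_parent)
    return target_parent / top_level
```

For example, `pack_dataset("templateflow", "templateflow.tar.gz")` on the connected machine, then `unpack_dataset("templateflow.tar.gz", Path.home() / ".cache")` on the cluster.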

yarikoptic · February 15, 2019, 2:32pm

TL;DR summary:

I think the container should come with all needed templates pre-installed, not under ~/.cache but in some location within the image (e.g., /usr/local/share/templateflow or /opt/templateflow)
templateflow should be instructed to use that location instead of its default TF_DEFAULT_HOME
If the user really needs to override the shipped templates with new ones, they could deliberately datalad install a new version locally and bind mount it inside the container, overriding the bundled version
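
A minimal sketch of pointing each job at one shared, pre-populated template directory. It assumes the installed templateflow honors the TEMPLATEFLOW_HOME environment variable (the variable the Dockerfile quoted earlier refers to); that should be checked against the installed version. The function names are illustrative, and `resolve_home` only mirrors the assumed lookup; it is not templateflow's own code.

```python
import os
from pathlib import Path

# Default location mentioned in this thread
DEFAULT_HOME = Path.home() / ".cache" / "templateflow"


def templateflow_env(shared_templates, base_env=None):
    """Return a copy of the environment with TEMPLATEFLOW_HOME set."""
    env = dict(os.environ if base_env is None else base_env)
    env["TEMPLATEFLOW_HOME"] = str(Path(shared_templates))
    return env


def resolve_home(env):
    """Illustrative lookup: the environment variable wins over the default."""
    return Path(env.get("TEMPLATEFLOW_HOME", str(DEFAULT_HOME)))
```

The returned mapping can be passed as `env=` to `subprocess.run` when launching each fmriprep instance, so that no job falls back to a per-user cache.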

Re the related (not present here) issue of a “datalad get” invocation failing on a read-only filesystem for files which are already there:

datalad should not fail in such a scenario, so we are fixing it in https://github.com/datalad/datalad/pull/3164. It should be merged today, and I will release a quick bugfix over the weekend. It should then work “as a brand new”

I am a bit confused here, since I guess it is templateflow which executes the datalad commands, so it could work out whether it needs to do that or not. IMHO for fmriprep, all templates it could possibly use should be pre-downloaded (via datalad, or as @eknahm pointed out, could even be exported tarballs, but I do not see that being necessary) and available within the fmriprep image, and then no datalad commands would strictly be needed, correct?
But if e.g. a newer templateflow would provide newer templates etc than shipped within fmriprep bundle - I do not see why user could not bind some local directory with them over the one you have in the bundle.

Here I also observe singularity specific interaction with templateflow which might be undesired! templateflow seems to rely on storing the templates under ~/.cache/templateflow. Singularity by default bind mounts $HOME. So - now execution of fmriprep becomes heavily dependent on the status of the local filesystem:

reproducibility could get severely affected even if it runs
the original error suggests that in this case /home/winkleram/.cache/templateflow exists but is not a DataLad dataset. So at the least it should be removed and fmriprep rerun, hopefully cloning the correct one. But again, reproducibility is hindered, since depending on the state of the templateflow repository you might keep getting a new version. The solution would be the aforementioned bundling of templateflow templates inside the container and using that location.
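
A small pre-flight check along those lines, as a sketch only. It inspects the cache path before fmriprep starts and reports which of the situations discussed here applies. The state labels and the function name are made up for illustration, and the presence of a `.git` entry is used as a rough proxy for "is a git/DataLad dataset", not as a complete test.

```python
from pathlib import Path


def cache_state(path):
    """Classify a would-be TemplateFlow cache directory."""
    path = Path(path)
    if not path.exists():
        return "missing"
    if not path.is_dir():
        return "not-a-directory"
    if not any(path.iterdir()):
        return "empty"
    # .git may be a directory (clone) or a file (submodule-style checkout)
    if (path / ".git").exists():
        return "git-repository"
    return "not-a-repository"
```

Running `cache_state(Path.home() / ".cache" / "templateflow")` before and after a failed run would also answer the question of what the directory contains.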

oesteban · February 15, 2019, 3:02pm

Thanks Yarik. In this case Anderson is not running containers. That is why the directory is under ~/.cache

However, wrt my idea I think you are right: templateflow could bypass datalad if the folder is not a git/git-annex to.

oesteban · February 16, 2019, 12:39am

the figshare export would do this right?

winkler · February 16, 2019, 6:31pm

Hi Yaroslav,

Thanks for the feedback. I did delete the .cache/templateflow in one of the attempts to run, but that didn’t fix it. Even if it had, it would have fixed it for just 1 instance; there are multiple fMRIprep instances running, so they’d collide when trying to write to that path.

Thanks again!

All the best,

Anderson

yarikoptic · February 16, 2019, 8:15pm

I would suggest to (always) run singularity in isolation of the environment variables (-e), and no HOME being mounted/shared between instances (--no-home) with bind mounts specifically needed (e.g. of the $PWD if you are in the dataset directory). Otherwise the ghosts of irreproducibility, if not immediate errors due to custom PYTHONPATH etc, will haunt you down
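
As a sketch, the invocation described above could be assembled like this. The `-e` (clean environment), `--no-home`, and `-B` (bind) options are Singularity's; the image name, the `/data` and `/out` mount points, and the function name are placeholders chosen for the example.

```python
def singularity_argv(image, bids_dir, out_dir, extra_args=()):
    """Build an isolated `singularity run` command line for one fmriprep job."""
    return [
        "singularity", "run",
        "-e",                         # do not pass host environment variables
        "--no-home",                  # do not mount $HOME into the container
        "-B", f"{bids_dir}:/data:ro",  # input dataset, read-only
        "-B", f"{out_dir}:/out",       # output directory, writable
        image,
        "/data", "/out", "participant",
        *extra_args,
    ]
```

The resulting list can be handed to `subprocess.run`; only the explicitly bound directories are then visible inside the container.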

As for the fact that removal of .cache/templateflow not helping:

what is the content of that directory before and after unsuccessful run?
what is the error message (the same?)

yarikoptic · February 16, 2019, 8:19pm

also, according to https://github.com/poldracklab/fmriprep/commit/bca40d19d4c053cfbe546aa8a62c1d1b003675de#diff-3254677a7917c6c01f55212f86c57fbf I think 1.3.0.post2 should ship the full copy of templateflow under /opt/templateflow, so that image version might resolve your troubles

winkler · February 16, 2019, 10:06pm

Hi Yaroslav,

The directory was empty. I don’t know if the error message was the same or not. The reason is the following: there is 1 instance of fmriprep running per participant, each using just 1 thread. These are “swarmed” to a SLURM cluster in which each node has no display and no internet access, such that hundreds of participants run in parallel, but each one with their pipeline run serially (multi-threading disabled). Using it in this way, that is, one fmriprep per subject and each using just 1 thread, and crucially, not using any kind of container, was the way we were able to run with fewer frustrations. It takes about 26 hours per participant, which is fine.

So I don’t know if the error message is the same or not because, whichever was the 1st participant for whom ~/.cache/templateflow was created, all others would find that directory later already existing. Regardless, the directory is always empty. I could wade through the logs to find out if the error was different for 1 participant (the first), although I’ve now deleted these.
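
One way to keep many simultaneous jobs from colliding on a shared path is to let exactly one of them populate it while the others wait. The following is a minimal sketch of that idea, not something fmriprep or templateflow does: it uses an exclusively created lock file plus a "done" marker next to the cache directory. The function name and the `.lock`/`.done` suffixes are illustrative, and it assumes exclusive file creation is reliable on the filesystem in use, which should be verified on a networked filesystem.

```python
import os
import time
from pathlib import Path


def populate_once(cache_dir, populate, poll_seconds=1.0, timeout=3600.0):
    """Run populate(cache_dir) in exactly one process; others wait for it.

    Returns True if this call did the work, False if it was already done.
    """
    cache_dir = Path(cache_dir)
    done = cache_dir.with_name(cache_dir.name + ".done")
    lock = cache_dir.with_name(cache_dir.name + ".lock")
    cache_dir.parent.mkdir(parents=True, exist_ok=True)
    deadline = time.monotonic() + timeout
    while not done.exists():
        try:
            # O_EXCL makes creation fail if another process holds the lock
            fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError(f"gave up waiting for {lock}")
            time.sleep(poll_seconds)
            continue
        try:
            os.close(fd)
            if done.exists():
                return False
            populate(cache_dir)
            done.touch()
            return True
        finally:
            os.unlink(lock)
    return False
```

On a cluster without internet access, `populate` would be something local, such as unpacking a previously prepared archive of the templates.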

I can try next week with 1.3.0.post2. I’ve now further downgraded to 1.2.4 to see if the display issue (see the other thread) doesn’t happen there…

Thanks!

All the best,

Anderson

oesteban · February 27, 2019, 8:04pm

8 posts were split to a new topic: Singularity & fMRIPrep: PermissionError: [Errno 13] Permission denied: ‘/.cache’

oesteban · March 6, 2019, 8:48pm

Hi, the latest release 1.3.1 is out. Please let us know if that version resolves this problem (fmriprep should no longer depend on git-annex/datalad)!

oesteban · March 12, 2019, 8:48pm

A post was split to a new topic: fMRIPrep/templateflow: Unable to access these remotes

winkler · March 24, 2019, 7:23pm

Hi Oscar,
Updated now to 1.3.2 (and also tried with 1.3.1) and it fails for another reason. I’ll open a separate thread…
Thanks.
Anderson