Disable datalad in fmriprep? Or a templateflow-something issue?

Dear all,

I’m trying to run fmriprep 1.3.0.post1. The error is below. It seems to be related to datalad or templateflow. Note that:

  1. This is a cluster that has no access to the internet.

  2. There are multiple instances of fmriprep running at the same time, so a path such as ~/.cache/templateflow will surely be used by more than one instance at once.

Not sure if either of these is the cause of the problem…

How to fix? Thanks!

All the best,

Anderson

[INFO] Cloning https://github.com/templateflow/templateflow.git [1 other candidates] into ‘/home/winkleram/.cache/templateflow’
[ERROR] could not create work tree dir ‘/home/winkleram/.cache/templateflow’.: File exists [install(/home/winkleram/.cache/templateflow)]
/gpfs/gsfs6/users/EDB/MErest/code/env-hpc/lib/python3.6/site-packages/datalad/distribution/dataset.py:474: DeprecationWarning: inspect.getargspec() is deprecated, use inspect.signature() or inspect.getfullargspec()
orig_pos = getargspec(f).args
/gpfs/gsfs6/users/EDB/MErest/code/env-hpc/lib/python3.6/site-packages/datalad/interface/base.py:682: DeprecationWarning: inspect.getargspec() is deprecated, use inspect.signature() or inspect.getfullargspec()
argspec = getargspec(call)
[WARNING] path not associated with any dataset [get(/home/winkleram/.cache/templateflow)]
Process Process-2:
Traceback (most recent call last):
File “/usr/local/Anaconda/envs/py3.6/lib/python3.6/multiprocessing/process.py”, line 258, in _bootstrap
self.run()
File “/usr/local/Anaconda/envs/py3.6/lib/python3.6/multiprocessing/process.py”, line 93, in run
self._target(*self._args, **self._kwargs)
File “/gpfs/gsfs6/users/EDB/MErest/code/env-hpc/lib/python3.6/site-packages/fmriprep/cli/run.py”, line 560, in build_workflow
from …workflows.base import init_fmriprep_wf
File “/gpfs/gsfs6/users/EDB/MErest/code/env-hpc/lib/python3.6/site-packages/fmriprep/workflows/base.py”, line 24, in <module>
from niworkflows.interfaces.bids import (
File “/gpfs/gsfs6/users/EDB/MErest/code/env-hpc/lib/python3.6/site-packages/niworkflows/interfaces/bids.py”, line 28, in <module>
STANDARD_SPACES = _get_template_list()
File “/gpfs/gsfs6/users/EDB/MErest/code/env-hpc/lib/python3.6/site-packages/templateflow/api.py”, line 51, in templates
api.install(path=str(TF_HOME), source=TF_GITHUB_SOURCE, recursive=True)
File “/gpfs/gsfs6/users/EDB/MErest/code/env-hpc/lib/python3.6/site-packages/datalad/interface/utils.py”, line 491, in eval_func
return return_func(generator_func)(*args, **kwargs)
File “/gpfs/gsfs6/users/EDB/MErest/code/env-hpc/lib/python3.6/site-packages/datalad/interface/utils.py”, line 479, in return_func
results = list(results)
File “/gpfs/gsfs6/users/EDB/MErest/code/env-hpc/lib/python3.6/site-packages/datalad/interface/utils.py”, line 467, in generator_func
msg=“Command did not complete successfully”)
datalad.support.exceptions.IncompleteResultsError: Command did not complete successfully [{‘action’: ‘install’, ‘path’: ‘/home/winkleram/.cache/templateflow’, ‘type’: ‘dataset’, ‘status’: ‘error’, ‘message’: “could not create work tree dir ‘/home/winkleram/.cache/templateflow’.: File exists”, ‘source_url’: ‘https://github.com/templateflow/templateflow.git’}, {‘action’: ‘get’, ‘path’: ‘/home/winkleram/.cache/templateflow’, ‘refds’: ‘/home/winkleram/.cache/templateflow’, ‘raw_input’: True, ‘orig_request’: ‘.’, ‘state’: ‘absent’, ‘status’: ‘impossible’, ‘message’: ‘path not associated with any dataset’}]

Hi @winkler,

It seems to me that you are working on a custom installation of fMRIPrep, is that correct?

Please confirm that 1) is true at runtime, but that you can have internet access while setting up the environment. If that is the case, then you just need to pull the whole templateflow down before running. That is not a crazy amount of data and will keep datalad/git-annex/templateflow quiet.

For reference, this is what we do when building container images:

In your case, I’d proceed as follows:

  1. Make sure datalad and git-annex are installed and functional
  2. cd ~/.cache
  3. datalad install -r -g https://github.com/templateflow/templateflow.git

The -g flag tells datalad to download all contents, which in combination with -r (recursive) will get you the whole repo installed.
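
The three steps above could be wrapped in a small helper that is run once, on a machine with internet access, before any jobs are submitted. This is only a sketch: the function names `prefetch_command` and `prefetch_templateflow` are made up for illustration, and the only thing taken from the post is the `datalad install -r -g` invocation itself.

```python
import subprocess
from pathlib import Path

TEMPLATEFLOW_URL = "https://github.com/templateflow/templateflow.git"


def prefetch_command(url=TEMPLATEFLOW_URL):
    """Return the datalad invocation quoted in the post as an argv list."""
    # -r: recurse into subdatasets; -g: also download the file contents.
    return ["datalad", "install", "-r", "-g", url]


def prefetch_templateflow(cache_dir=None, dry_run=False):
    """Run the install from inside cache_dir (default: ~/.cache).

    Hypothetical helper: run it once, with internet access, before the
    compute jobs start.
    """
    cache_dir = Path(cache_dir) if cache_dir else Path.home() / ".cache"
    cmd = prefetch_command()
    if dry_run:
        # Only report what would be run, and where.
        return cmd, cache_dir
    cache_dir.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if datalad exits with an error.
    subprocess.run(cmd, cwd=str(cache_dir), check=True)
    return cmd, cache_dir
```

Calling `prefetch_templateflow(dry_run=True)` first is a cheap way to confirm the command and the working directory before anything is downloaded.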

Since no more writes are necessary, that should fix your issue (and the concurrent access problem). However, if you keep having trouble, you may want to make TemplateFlow read-only, as done in https://github.com/poldracklab/fmriprep/blob/master/Dockerfile#L169-L173.
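
For a bare-metal installation, one way to make the downloaded directory read-only is to clear the write bits on the whole tree. The sketch below is not a transcription of the linked Dockerfile lines; `make_read_only` is a hypothetical helper that only illustrates the idea.

```python
import os
import stat
from pathlib import Path

WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def _clear_write_bits(path):
    """Remove user, group and other write permission from one path."""
    mode = stat.S_IMODE(path.stat().st_mode)
    path.chmod(mode & ~WRITE_BITS)


def make_read_only(root):
    """Clear the write bits on root and everything below it.

    Symbolic links are skipped, because chmod would follow them and
    change the link target instead of the link. Returns the number of
    paths that were changed.
    """
    root = Path(root)
    changed = 0
    # topdown=False yields the contents of a directory before its parent.
    for dirpath, dirnames, filenames in os.walk(str(root), topdown=False):
        for name in filenames + dirnames:
            path = Path(dirpath) / name
            if path.is_symlink():
                continue
            _clear_write_bits(path)
            changed += 1
    _clear_write_bits(root)
    return changed + 1
```

To undo this later (for example, to update the templates), the write bit has to be restored for the owner first.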

Please let us know if this worked out for you.

Hi Oscar,

Many thanks for the quick feedback. Not sure what you mean by a custom installation. It was installed in a Python virtual environment (virtualenv) with pip. It was easy and painless. We do not, and we will not, use Docker or Singularity again anytime soon.

I see on github that templateflow contains a number of .nii.gz templates. Why can’t we just clone a repository such as that and have the files available? Or download the files from somewhere else? Is it because of size? These files don’t seem to be that large…

This is a CentOS system that is used by hundreds of people and whose admins are extremely careful about installing unstable software (evidence for git-annex being unstable comes from the documentation itself: https://git-annex.branchable.com/install/fromsource/). Does it need to be like this? Does one need a whole universe of Haskell libraries just to download a bunch of NIFTI files?

Note that datalad requires a git-annex newer than the version released on 13 September 2018 (that is, just 5 months ago), whereas the stable version available in EPEL is from 2014. Is this the best solution possible?

Thanks for the hard work on this, and looking forward to continuing to use fMRIprep.

All the best,

Anderson

Thanks, that is exactly what I meant. I should’ve used the “bare-metal” term for clarity here.

Yes, what you see there is a git-annex repository maintained and managed by DataLad. Those are, if you look at them, links to git annex remote files. The actual files are hosted in OSF - https://osf.io/ue5gx/. Yet, they are not too large either.
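
Because un-fetched files in a git-annex repository are symlinks whose targets do not exist yet, a populated copy can be checked by looking for dangling links. This is a minimal sketch assuming the default symlink-based layout of git-annex (not unlocked or adjusted branches); `missing_annex_content` is a hypothetical helper name.

```python
import os
from pathlib import Path


def missing_annex_content(root):
    """List symlinks under root whose target file does not exist.

    In a symlink-based git-annex repository, a file whose content has
    not been fetched is a link into .git/annex/objects that points at
    nothing yet.
    """
    missing = []
    for dirpath, dirnames, filenames in os.walk(str(root)):
        # Do not descend into git's own bookkeeping.
        if ".git" in dirnames:
            dirnames.remove(".git")
        for name in filenames:
            path = Path(dirpath) / name
            # exists() follows the link, so it is False for a dangling one.
            if path.is_symlink() and not path.exists():
                missing.append(path)
    return sorted(missing)
```

An empty list suggests that every template file has its content available locally, so no download should be needed at runtime.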

I understand where you come from here, and we are thinking about these problems too. @yarikoptic and @eknahm, can you think of a way of allowing users to download the whole datalad dataset ahead of time and prevent datalad from executing any git-annex command? Maybe after exporting to figshare, for example?

On the other hand, datalad is really effective at keeping TemplateFlow under version control, and it makes it easy to transparently report the exact templates that were used in the processing. Since we intend to greatly expand the set of available templates, we needed a tool for this, but we didn’t want to carry that load on our shoulders. DataLad was just the tool we needed.

Since git-annex provides standalone distributions that you can set at the user level, in principle I don’t see a strong need for your admins to install git-annex.

Thank you for using it and for all this valuable feedback. We are aware of the heavy friction users need to get through and we hope we solve more problems than we are creating.

Thanks again! We really appreciate the help and understand how complicated these various decisions and trade-offs can be.

I think for now we’ll just downgrade to fMRIprep 1.2.6-1, as that appears not to use templateflow yet and it has the features we need (multi-echo processing). We’ll wait until you guys find the best way forward for the templates. For example, the minimal required version of git-annex could be some older release for which it may be easier to find compiled packages for different distros, or the management of these templates could be entirely Python-based or use some other interface…

Thank you again!

Anderson

I understand where you come from here, and we are thinking about these problems too. @yarikoptic and @eknahm, can you think of a way of allowing users to download the whole datalad dataset ahead of time and prevent datalad from executing any git-annex command? Maybe after exporting to figshare, for example?

If I got it right, the issue is not whether or not to use datalad to download the needed files, but rather to not need to download anything at all. I am not familiar with the file structure that is needed, but in general nothing prevents tar’ing up an entire, populated dataset (with annex and all) and placing it where it needs to be on the target system.
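
The tarball route could look roughly like the following, using only Python's standard `tarfile` module. The helper names `pack_dataset` and `unpack_dataset` are hypothetical; by default `tarfile` stores symlinks as symlinks, which is what an annex-based dataset needs.

```python
import tarfile
from pathlib import Path


def pack_dataset(dataset_dir, archive_path):
    """Archive a populated dataset directory, .git and annex included."""
    dataset_dir = Path(dataset_dir)
    with tarfile.open(str(archive_path), "w:gz") as tar:
        # arcname keeps only the top-level folder name inside the archive.
        tar.add(str(dataset_dir), arcname=dataset_dir.name)
    return Path(archive_path)


def unpack_dataset(archive_path, parent_dir):
    """Extract the archive under parent_dir on the target system."""
    parent_dir = Path(parent_dir)
    parent_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(str(archive_path), "r:gz") as tar:
        # extractall trusts the paths stored in the archive, so only
        # unpack archives you created yourself.
        tar.extractall(str(parent_dir))
    return parent_dir
```

For example, packing `~/.cache/templateflow` on a connected machine and unpacking it under `~/.cache` on the cluster would recreate the same directory without any network access.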

TL;DR summary:

  • I think that container should come with all needed templates pre-installed and not under ~/.cache but some location within the image (e.g., /usr/local/share/templateflow or /opt/templateflow)
  • templateflow should be instructed to use that location instead of its default TF_DEFAULT_HOME
  • If a user really needs to override those shipped templates with new ones, they could deliberately datalad install a new version locally and bind mount it inside the container, overriding the bundled version
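
Pointing templateflow at a non-default location is usually done through an environment variable. The sketch below assumes the variable is called `TEMPLATEFLOW_HOME` and that the installed templateflow version honours it; that should be checked against the version in use. The function names are hypothetical.

```python
import os
from pathlib import Path

# Default location named in this thread: ~/.cache/templateflow
TF_DEFAULT_HOME = Path.home() / ".cache" / "templateflow"


def resolve_templateflow_home(environ=None):
    """Pick the template directory: an explicit override beats the default.

    ASSUMPTION: the variable name TEMPLATEFLOW_HOME; verify that your
    templateflow version reads it.
    """
    environ = os.environ if environ is None else environ
    return Path(environ.get("TEMPLATEFLOW_HOME", str(TF_DEFAULT_HOME)))


def job_environment(shared_home, base=None):
    """Return a copy of the environment that points jobs at shared_home."""
    env = dict(os.environ if base is None else base)
    env["TEMPLATEFLOW_HOME"] = str(shared_home)
    return env
```

The dictionary returned by `job_environment("/opt/templateflow")` can be passed as `env=` to `subprocess.run` when launching each job, so that all instances read the same pre-populated copy.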

Re related (not present here) issue of “datalad get” invocation on files which are already there failing on read-only filesystem:

  • datalad should not fail in such scenario - so we are fixing it https://github.com/datalad/datalad/pull/3164 - so should be merged today and I will release a quick bugfix over weekend. Should then work “as a brand new” :wink:

I am a bit confused here, since I guess it is templateflow that executes the datalad commands, so it could work out whether it needs to do that or not. IMHO for fmriprep, all templates it could possibly use should be pre-downloaded (via datalad or, as @eknahm pointed out, even as exported tarballs, though I do not see that being necessary) and available within the fmriprep image, and then no datalad commands would strictly be needed, correct?
But if e.g. a newer templateflow would provide newer templates etc than shipped within fmriprep bundle - I do not see why user could not bind some local directory with them over the one you have in the bundle.

Here I also observe singularity specific interaction with templateflow which might be undesired! templateflow seems to rely on storing the templates under ~/.cache/templateflow. Singularity by default bind mounts $HOME. So - now execution of fmriprep becomes heavily dependent on the status of the local filesystem:

  • reproducibility could get severely affected even if it runs
  • original error suggests that in this case /home/winkleram/.cache/templateflow actually exists but is not a DataLad dataset. So it should at least be removed and fmriprep rerun, hopefully cloning the correct one. But again, reproducibility is hindered, since depending on the state of the templateflow repository you might keep getting a new version. The solution would be the aforementioned bundling of templateflow templates inside the container and using that location.
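
The "exists, but is not a DataLad dataset" situation can be diagnosed before launching anything. A minimal sketch, relying on the fact that a DataLad dataset is a git repository containing a `.datalad` folder; `cache_state` is a hypothetical helper.

```python
from pathlib import Path


def cache_state(path):
    """Classify a would-be TemplateFlow directory.

    Returns one of: "absent", "not-a-directory", "empty",
    "dataset", "not-a-dataset".
    """
    path = Path(path)
    if not path.exists():
        return "absent"
    if not path.is_dir():
        return "not-a-directory"
    if not any(path.iterdir()):
        return "empty"
    # A DataLad dataset is a git repository with a .datalad folder in it.
    if (path / ".git").exists() and (path / ".datalad").is_dir():
        return "dataset"
    return "not-a-dataset"
```

Anything other than "dataset" (or "absent" on a machine that can reach the network) would be expected to lead to an error like the one in the traceback above.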

Thanks Yarik. In this case Anderson is not running containers. That is why the directory is under ~/.cache

However, wrt my idea, I think you are right: templateflow could bypass datalad if the folder is not a git/git-annex repository.

The figshare export would do this, right?

Hi Yaroslav,

Thanks for the feedback. I did delete the .cache/templateflow in one of the attempts to run, but that didn’t fix it. And even if it had, it would have fixed it for just 1 instance; there are multiple fMRIprep instances running, so they’d collide when trying to write to that path.

Thanks again!

All the best,

Anderson

I would suggest always running singularity isolated from the environment variables (-e), with no HOME mounted or shared between instances (--no-home), and with only the bind mounts that are specifically needed (e.g. of $PWD if you are in the dataset directory). Otherwise the ghosts of irreproducibility, if not immediate errors due to a custom PYTHONPATH etc., will haunt you.
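
As an illustration of that advice, the command line could be assembled like this. The image name and paths in the usage note are placeholders, and `singularity_command` is a hypothetical helper; only the `-e`, `--no-home` and `-B` options come from Singularity itself.

```python
def singularity_command(image, fmriprep_args, binds=()):
    """Build an isolated `singularity run` argv list.

    -e         do not pass the host environment into the container
    --no-home  do not mount the caller's home directory
    -B         bind only the host paths that are really needed
    """
    cmd = ["singularity", "run", "-e", "--no-home"]
    for host_path, container_path in binds:
        cmd += ["-B", "{}:{}".format(host_path, container_path)]
    cmd.append(str(image))
    cmd += [str(arg) for arg in fmriprep_args]
    return cmd
```

For example, `singularity_command("fmriprep.simg", ["/data", "/out", "participant"], binds=[("/scratch/bids", "/data"), ("/scratch/out", "/out")])` returns a list that can be handed to `subprocess.run`.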

As for the fact that removal of .cache/templateflow not helping:

  • what is the content of that directory before and after unsuccessful run?
  • what is the error message (the same?)

Also, according to https://github.com/poldracklab/fmriprep/commit/bca40d19d4c053cfbe546aa8a62c1d1b003675de#diff-3254677a7917c6c01f55212f86c57fbf, I think 1.3.0.post2 should ship the full copy of templateflow under /opt/templateflow, so that image version might resolve your troubles.

Hi Yaroslav,

The directory was empty. I don’t know if the error message was the same or not. The reason is the following: there is 1 instance of fmriprep running per participant, each using just 1 thread. These are “swarmed” to a SLURM cluster in which each node has no display and no internet access, such that hundreds of participants run in parallel, but each one with their pipeline run serially (multi-threading disabled). Using it in this way, that is, one fmriprep per subject and each using just 1 thread, and crucially, not using any kind of container, was the way we were able to run with fewer frustrations. It takes about 26 hours per participant, which is fine.
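
A setup like the one described, one single-threaded fmriprep per participant, might be generated as one command per line. This is a sketch with placeholder paths: the option spellings (`--participant-label`, `--nthreads`, `--omp-nthreads`, `-w`) should be checked against `fmriprep --help` for the installed version, and the helper names are hypothetical.

```python
import shlex


def fmriprep_job(bids_dir, out_dir, label, work_dir):
    """One single-threaded fmriprep command line for one participant."""
    args = [
        "fmriprep", str(bids_dir), str(out_dir), "participant",
        "--participant-label", str(label),
        "--nthreads", "1",       # cap on threads across all processes
        "--omp-nthreads", "1",   # cap on threads within one process
        "-w", "{}/sub-{}".format(work_dir, label),  # own scratch per job
    ]
    # Quote each argument so that odd characters survive the shell.
    return " ".join(shlex.quote(a) for a in args)


def swarm_lines(bids_dir, out_dir, labels, work_dir):
    """One command per line, one line per participant."""
    return [fmriprep_job(bids_dir, out_dir, lab, work_dir) for lab in labels]
```

Writing `"\n".join(swarm_lines(...))` to a file gives a list of independent commands that a batch submitter can distribute across nodes.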

So I don’t know if the error message is the same or not because, whichever was the 1st participant for whom ~/.cache/templateflow was created, all others would find that directory later already existing. Regardless, the directory is always empty. I could wade through the logs to find out if the error was different for 1 participant (the first), although I’ve now deleted these.
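
The race described here, where whichever job starts first creates the directory and all the others then find it already there, is a general pattern that can be avoided by letting exactly one job do the setup in a staging folder and publish it with a rename. The following is a generic sketch, not something fmriprep or templateflow provides; `populate_once` is hypothetical and assumes that `mkdir` and `rename` are atomic on the shared filesystem.

```python
import os
import time
from pathlib import Path


def populate_once(target, populate, poll=1.0, timeout=3600.0):
    """Let exactly one of many concurrent jobs fill a shared directory.

    populate(staging_dir) must create the content inside staging_dir.
    The winning job renames staging_dir to target; the others wait until
    target appears. Returns True only for the job that did the work.
    ASSUMPTION: mkdir and rename are atomic on the shared filesystem.
    """
    target = Path(target)
    if target.exists():
        return False
    staging = target.with_name(target.name + ".staging")
    try:
        os.mkdir(str(staging))            # atomic: only one job succeeds
    except FileExistsError:
        deadline = time.monotonic() + timeout
        while not target.exists():        # the losers wait for the winner
            if time.monotonic() > deadline:
                raise TimeoutError("gave up waiting for {}".format(target))
            time.sleep(poll)
        return False
    populate(staging)
    os.rename(str(staging), str(target))  # publish the finished directory
    return True
```

A stale empty `target` left over from an earlier failed run would have to be removed first, since the function treats any existing `target` as finished. Populating the directory once before submitting the jobs avoids the problem altogether.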

I can try next week with 1.3.0.post2. I’ve now further downgraded to 1.2.4 to see if the display issue (see the other thread) doesn’t happen there…

Thanks!

All the best,

Anderson

Hi, the latest release 1.3.1 is out. Please let us know if that version resolves this problem (fmriprep should no longer depend on git-annex/datalad)!

Hi Oscar,
Updated now to 1.3.2 (and also tried with 1.3.1) and it fails for another reason. I’ll open a separate thread…
Thanks.
Anderson