Hi Taylor,
Thanks for following up on this and for the clarification!
A couple of related questions below:
I am running into some roadblocks with the required walltime to fit the CorrelationDecoder on the full Neurosynth dataset. I have tried to submit the following script on the Niagara cluster requesting the max walltime (24h) on 1 node (~200 GB RAM, 80 logical CPUs):
import os
os.environ['NUMEXPR_MAX_THREADS'] = '64'  # must be set before numexpr is imported

import nimare
from nimare.decode import continuous

out_dir = os.path.abspath("nimare_ns_data/")

# Load the previously converted Neurosynth dataset
neurosynth_dset = nimare.dataset.Dataset.load(os.path.join(out_dir, "neurosynth_dataset.pkl.gz"))
neurosynth_dset.update_path(os.path.join(out_dir, "meta-analyses"))

# Fit the decoder on all features (the slow step)
decoder = continuous.CorrelationDecoder(feature_group=None, features=None, memory_limit='3gb')
decoder.fit(neurosynth_dset)

# Save decoder to avoid having to refit/regenerate MA maps for the Neurosynth dataset for future target images
decoder.save(os.path.join(out_dir, "CorrelationDecoder_neurosynth_dset_fitted.pkl.gz"))

decoded_df = decoder.transform("cortex-c1.nii.gz")
decoded_df.to_csv("nimare_results/neurosynth_cortex-c1_decoded.csv")
The issue is that after 24h, only ~45 of the 3228 features have been processed. Some things I've tried to speed this up include 1) adding os.environ['NUMEXPR_MAX_THREADS'] = '64', since I was initially getting the following warning when running the script:
# Previous warning
INFO:numexpr.utils:Note: detected 80 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
INFO:numexpr.utils:Note: NumExpr detected 80 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
INFO:numexpr.utils:NumExpr defaulting to 8 threads.
However, I'm not 100% sure whether the CorrelationDecoder currently supports parallelization.
I've also tried 2) setting memory_limit='3gb' when initializing the decoder (anything much larger seems to result in a MemoryError), but again I have not seen much reduction in runtime.
Based on this thread, this seems to be a known limitation of the current version. But since I am also limited by the max walltime per node, do you have any additional recommendations for speeding up the fitting stage (e.g., fitting the dataset in 'chunks' of features and then concatenating the resulting feature × r dataframes, or running it over multiple nodes)?
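To clarify the 'chunks' idea, here is a minimal sketch of what I had in mind. Everything here is hypothetical: chunk_list is a helper I wrote for illustration, and the stand-in dataframes mimic what each chunk's decoder.transform() output might look like; in practice each chunk would be fit in its own cluster job (e.g., via the features parameter) and the saved per-chunk results concatenated afterwards.

```python
import pandas as pd

def chunk_list(items, n_chunks):
    """Split items into n_chunks roughly equal, contiguous pieces."""
    k, m = divmod(len(items), n_chunks)
    return [items[i * k + min(i, m):(i + 1) * k + min(i + 1, m)]
            for i in range(n_chunks)]

# Hypothetical feature list split across 3 jobs
features = [f"term_{i}" for i in range(10)]
chunks = chunk_list(features, 3)

# Stand-in for each job's decoder.transform() result (one r value per feature)
partial_dfs = [pd.DataFrame({"r": [0.0] * len(c)}, index=c) for c in chunks]

# Concatenate the per-chunk dataframes into one feature-by-r table
combined = pd.concat(partial_dfs)
```

Is something along these lines reasonable, or would per-chunk fitting introduce problems I'm not seeing?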
I would also like to re-run the decoder using a subset of terms of interest (<200) rather than all features in the Neurosynth dataset. Based on the docs and the sample code you provided above, it seems that there are two potential ways I could do this, given a list of features myfeatures:
- Reduce the features in neurosynth_dset.annotations to those of interest, e.g.
  neurosynth_dset.annotations = neurosynth_dset.annotations[["id", "study_id", "contrast_id", myfeatures[0], myfeatures[1], myfeatures[2] ... ]]
- Use the features parameter in the decoder initialization, e.g.
  decoder = continuous.CorrelationDecoder(feature_group=None, features=myfeatures)
  (assuming all terms in myfeatures have the prefix "terms_abstract_tfidf")
Are 1) and 2) above essentially equivalent in their memory usage when it comes to fitting the decoder? Also, to further cut down on memory/time needed, would you recommend first reducing the studies in the Dataset by looping over myfeatures and using get_studies_by_label() to create a dataset including only studies with at least one of my labels of interest?
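In case it helps frame that last question, this is roughly the study-reduction step I had in mind: take the union of study IDs returned by get_studies_by_label() for each feature, then keep only those studies with Dataset.slice(). The ids_by_label dict below is a hypothetical stand-in for the per-label query results, just to show the union logic.

```python
# Hypothetical per-label query results (what get_studies_by_label might return)
ids_by_label = {
    "terms_abstract_tfidf__pain": {"study-01", "study-02"},
    "terms_abstract_tfidf__memory": {"study-02", "study-03"},
}

# Union: keep any study carrying at least one label of interest
keep_ids = sorted(set().union(*ids_by_label.values()))

# With the real Dataset, I imagine this would look something like:
# keep_ids = sorted({sid for feat in myfeatures
#                    for sid in neurosynth_dset.get_studies_by_label(labels=[feat])})
# reduced_dset = neurosynth_dset.slice(keep_ids)
```

Would fitting on reduced_dset like this meaningfully cut the per-feature cost, or does the fit time mostly scale with the number of features rather than studies?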
Let me know if any clarification is needed, and thank you again!
Best,
Alyssa