How to extract Neurosynth studies with specific features in NiMARE?

Hi!

I was trying to generate a meta-analytic dataset with specific features (e.g., aging) from Neurosynth, but this toolbox seems to be deprecated - even basic commands like “dataset.add_features” don’t work with recent versions of python on my computer. Can I do the same thing in NiMARE? Is there any example or tutorial on feature-based meta-analysis (especially, on generating datasets with specific features) in NiMARE? Thanks so much!

NiMARE’s Dataset class has some rudimentary search methods, including Dataset.get_studies_by_label(). If the Dataset has any labels in it (stored in the Dataset.annotations attribute), then that method can find the IDs of studies with that label. You can then use Dataset.slice() to create a new Dataset containing only studies with that label, and then feed that Dataset into an meta-analysis.

Here’s some code to accomplish that:

import nimare

# Download and convert Neurosynth data to NiMARE format
# See https://nimare.readthedocs.io/en/latest/auto_examples/01_datasets/download_neurosynth.html#sphx-glr-auto-examples-01-datasets-download-neurosynth-py
nimare.extract.fetch_neurosynth(path=".")
neurosynth_dataset = nimare.io.convert_neurosynth_to_dataset("database.txt", "features.txt")

# Labels are prefixed by the "label source".
# In this case that's Neurosynth_TFIDF
fear_ids = neurosynth_dataset.get_studies_by_label("Neurosynth_TFIDF__fear")
fear_dataset = neurosynth_dataset.slice(fear_ids)

# Now you can use the new fear_dataset in your meta-analysis

Please let me know if that doesn’t work for you.

Thanks so much!! Your example is very useful!

The “get_studies_by_label” command works, but only when I add “label_threshold” to the command. For instance, I can only get 26 studies related to fear without any thresholding, but 363 studies when the threshold is set to be 0.01. The number 363 seems to match the search results on the Neurosynth website. I am wondering what particular threshold value I should normally take to extract the studies? Do you have any recommendations?

Oh, sorry about that, but great job catching that issue! In Neurosynth, the default threshold is 0.001, which basically corresponds to the term appearing a least once in the abstract. However, the default threshold for get_studies_by_label() is 0.5. It may be worth it to change the default to 0.001 since most folks who search Datasets by labels will be using Neurosynth’s database. In any case, if you’re using NiMARE on Neurosynth, you should use 0.001.

Got it, thank you!!

I came across another question while dealing with the labels in NiMARE. Those labels seem to be different from the actual labels in Neurosynth. As an example, I attach a word cloud of the labels of all studies with activation in middle frontal gyrus: |442px;x225px;
These labels are derived by removing the “Neurosynth_TFIDF__” part of the original NiMARE labels. As you can see, many of them do not make sense, e.g., “ffec”, “al”, “p”. It seems that some characters in the label get lost when they are imported into NiMARE. Is there any way to solve this problem, and get the actual label of these studies?

I’m having trouble reproducing this (i.e., finding labels like “p” in a Neurosynth-based NiMARE dataset). I wonder if there’s a bug in NiMARE’s decoding function instead. Can you share the tsv/csv file with the MFG labels?

Nvm, I think I figured out the problem! It turns out that I removed “Neurosynth_TFIDF__” in the wrong way.

Originally, I did:
keys_new = [s.strip('Neurosynth_TFIDF__') for s in keys]
Where keys are dict keys of all the labels. This actually removes each character in “Neurosynth_TFIDF__” at the beginning and the end of the label (If I understand this correctly).

Then I modified my code:
keys_new = [s.replace('Neurosynth_TFIDF__','') for s in keys]
It now removes “Neurosynth_TFIDF__” as a whole and the labels become normal. Below are the updated MFG labels:
image

This seems to match the “feature.txt” file downloaded with the neurosynth data. However, I still have problems interpreting the labels in general. Some of them are straightforward, but others are really vague. For example, “effects”, “stimuli”, “evidence”, “greater”, and there are also numbers such as “001” - I am not exactly sure how to interpret them. Is there a codebook that describes these labels in detail, or any other things that could help me interpret the data?

Thanks so much!!
Linjing

I’m glad you were able to find the problem.

Regarding the terms themselves- they come from a very simple TF-IDF vectorization of the article abstracts. Some terms are removed automatically using a stop list, but that list is not very extensive. Not to mention that the labeling procedure comes down to basically “is this word used in the abstract?”. All of which is to say that you shouldn’t read too much into the labels, or really any automated annotation result.

On the bright side, I do recall that @62442katieb classified all of the labels in Neurosynth’s standard feature set into informative and uninformative groups at one point. Maybe she can provide that mapping?

One alternative that might be helpful if you’re concerned with interpretability might be to use a topic model instead. Topic models find underlying distributions among term uses in the abstracts, which generally groups terms that are used together into the same topics. The caveats regarding automated annotation still apply, of course, but at least the annotation procedure tends to produce more useful results.

Thanks so much for your clarification!

A follow-up question: If I were to use the topic model, how should I do that in NiMARE? Is there any specific NiMARE function or command for this, or should I use other packages?

NiMARE has a couple of topic model tools, including Latent Dirichlet Allocation (LDA) and Generalized Correspondence Latent Dirichlet Allocation (GCLDA). LDA just uses text, while GCLDA uses both text and coordinates. The docstrings for both classes include papers in which those algorithms (not the NiMARE implementations) were used. Be forewarned, though, that GCLDA takes a very long time to train.

Got it! I noticed that there are multiple GCLDA functions in NiMARE. One is under “annotate” , such as nimare.annotate.gclda and others are under “decode”, such as nimare.decode.discrete.gclda_decode_roi. Are these two types of functions the same?

nimare.annotate contains classes and functions for automated annotation (i.e., extraction of labels/terms from studies- generally using their abstracts), while nimare.decode contains tools for performing functional characterization analysis, or functional decoding. You’ll want to use nimare.annotate.gclda.GCLDAModel to train your GCLDA model, which builds distributions of p(term|topic), p(study|topic), and p(voxel|topic). Then, you can use nimare.decode.discrete.gclda_decode_roi to decode a mask/ROI based on the GCLDA model. All that function does is average topic weights from the p(voxel|topic) distributions across the ROI, as in Rubin et al. (2017).

And regarding term classification as “informative” or noninformative, I do have such a list and am happy to email it to you, if you’re interested @linjing_jiang !

1 Like

Thanks for your clarification! I will try these out and keep you posted on how it goes!

If you could send me the list that would be great! Here is my email: linjing.jiang@stonybrook.edu

Thanks so much!!

Hi! I’m doing the same procedure to generate a meta-analytic dataset with specific features. But when I try to insert this piece of code

neurosynth_dataset = nimare.io.convert_neurosynth_to_dataset("database.txt", "features.txt")

The following error happens:

Traceback (most recent call last):
File “”, line 1, in
File “C:\pythonProject3\lib\site-packages\nimare\io.py”, line 123, in convert_neurosynth_to_dataset
dict_ = convert_neurosynth_to_dict(text_file, annotations_file)
File “C:\pythonProject3\lib\site-packages\nimare\io.py”, line 45, in convert_neurosynth_to_dict
dset_df = pd.read_csv(text_file, sep="\t")
File “C:\pythonProject3\lib\site-packages\pandas\io\parsers.py”, line 610, in read_csv
return _read(filepath_or_buffer, kwds)
File “C:\pythonProject3\lib\site-packages\pandas\io\parsers.py”, line 462, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File “C:\pythonProject3\lib\site-packages\pandas\io\parsers.py”, line 819, in init
self._engine = self._make_engine(self.engine)
File “C:\pythonProject3\lib\site-packages\pandas\io\parsers.py”, line 1050, in _make_engine
return mapping[engine](self.f, **self.options) # type: ignore[call-arg]
File “C:\pythonProject3\lib\site-packages\pandas\io\parsers.py”, line 1867, in init
self._open_handles(src, kwds)
File “C:\pythonProject3\lib\site-packages\pandas\io\parsers.py”, line 1362, in _open_handles
self.handles = get_handle(
File “C:\pythonProject3\lib\site-packages\pandas\io\common.py”, line 642, in get_handle
handle = open(
FileNotFoundError: [Errno 2] No such file or directory: ‘database.txt’

Do you have any idea what it could be? Thanks!

nimare.io.convert_neurosynth_to_dataset requires two files- one containing the study coordinates and one containing the term frequencies from article abstracts. Those files are provided by Neurosynth in the neurosynth-data repository. Have you downloaded those files, and do they exist in the folder you’re running nimare.io.convert_neurosynth_to_dataset from?

hi - Could i also get in on that?
foldes.andrei@gmail.com

Hello,

There a certain Neurosynth topic that I would like to work with within NIMARE: Neurosynth: topic 23
It includes 140 studies on neurosynth.org; however when I do the same using NIMARE
relational_cluster_ids = neurosynth_dataset.get_studies_by_label("Neurosynth_TFIDF__23", label_threshold= 0.001) it results in 254 studies.

What am I doing wrong?

It’s a little buried in the website, but for the topic meta-analyses a threshold of 0.05 is used instead of 0.001 (check the FAQs tab in the meta-analysis page you linked):

How do you determine which studies to include in an analysis?

We use a predefined binary cut-off. For all topic-based meta-analyses, we treat all studies with a loading > 0.05 as “active” for a given topic, and all other studies as inactive. Although the choice of threshold is relatively arbitrary, in practice, varying it within a fairly broad range of values has minimal influence on the results. Adopting a continuous approach instead of dichotomizing the dataset also has a negligible effect.

I also noticed that the label you have is “Neurosynth_TFIDF__23”. Is that just because you used nimare.io.convert_neurosynth_to_dataset to create your Dataset, or did you use the “features.txt” file in neurosynth-data?