How to extract Neurosynth studies with specific features in NiMARE?

Hi!

I was trying to generate a meta-analytic dataset with specific features (e.g., aging) from Neurosynth, but that toolbox seems to be deprecated: even basic commands like “dataset.add_features” no longer work with recent versions of Python on my computer. Can I do the same thing in NiMARE? Is there any example or tutorial on feature-based meta-analysis (especially on generating datasets with specific features) in NiMARE? Thanks so much!

NiMARE’s Dataset class has some rudimentary search methods, including Dataset.get_studies_by_label(). If the Dataset has any labels (stored in the Dataset.annotations attribute), then that method can find the IDs of studies with a given label. You can then use Dataset.slice() to create a new Dataset containing only studies with that label, and then feed that Dataset into a meta-analysis.

Here’s some code to accomplish that:

import nimare

# Download and convert Neurosynth data to NiMARE format
# See https://nimare.readthedocs.io/en/latest/auto_examples/01_datasets/download_neurosynth.html#sphx-glr-auto-examples-01-datasets-download-neurosynth-py
nimare.extract.fetch_neurosynth(path=".")
neurosynth_dataset = nimare.io.convert_neurosynth_to_dataset("database.txt", "features.txt")

# Labels are prefixed by the "label source".
# In this case that's Neurosynth_TFIDF
fear_ids = neurosynth_dataset.get_studies_by_label("Neurosynth_TFIDF__fear")
fear_dataset = neurosynth_dataset.slice(fear_ids)

# Now you can use the new fear_dataset in your meta-analysis

Please let me know if that doesn’t work for you.

Thanks so much!! Your example is very useful!

The “get_studies_by_label” command works, but only when I add “label_threshold” to the command. For instance, I get only 26 studies related to fear without any thresholding, but 363 studies when the threshold is set to 0.01. The number 363 seems to match the search results on the Neurosynth website. What threshold value should I normally use to extract the studies? Do you have any recommendations?

Oh, sorry about that, but great job catching that issue! In Neurosynth, the default threshold is 0.001, which basically corresponds to the term appearing at least once in the abstract. However, the default threshold for get_studies_by_label() is 0.5. It may be worth changing the default to 0.001, since most folks who search Datasets by labels will be using Neurosynth’s database. In any case, if you’re using NiMARE on Neurosynth data, you should use 0.001.
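To make the threshold’s effect concrete, here is a toy sketch of what label thresholding does (this is not NiMARE’s implementation; the study IDs and TF-IDF weights below are made up):

```python
# Toy sketch of label thresholding -- not NiMARE's implementation.
# Hypothetical TF-IDF weights for the label "fear" across three studies.
tfidf = {
    "study-001": 0.0,    # term never appears in the abstract
    "study-002": 0.004,  # term appears once or twice
    "study-003": 0.62,   # abstract is largely about fear
}

def studies_by_label(weights, label_threshold=0.001):
    """Keep studies whose label weight exceeds the threshold,
    which is conceptually what Dataset.get_studies_by_label() does."""
    return [sid for sid, w in weights.items() if w > label_threshold]

# Neurosynth-style loose threshold: any mention of the term counts.
print(studies_by_label(tfidf, label_threshold=0.001))  # ['study-002', 'study-003']
# The stricter 0.5 default only keeps studies where the term dominates.
print(studies_by_label(tfidf, label_threshold=0.5))  # ['study-003']
```

With 0.001, almost any abstract that mentions the term at all passes, which is why the study count jumps so much between the two thresholds.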

Got it, thank you!!

I came across another question while dealing with the labels in NiMARE. Those labels seem to be different from the actual labels in Neurosynth. As an example, I attach a word cloud of the labels of all studies with activation in middle frontal gyrus:
[word cloud image]
These labels are derived by removing the “Neurosynth_TFIDF__” part of the original NiMARE labels. As you can see, many of them do not make sense, e.g., “ffec”, “al”, “p”. It seems that some characters in the label get lost when they are imported into NiMARE. Is there any way to solve this problem, and get the actual label of these studies?

I’m having trouble reproducing this (i.e., finding labels like “p” in a Neurosynth-based NiMARE dataset). I wonder if there’s a bug in NiMARE’s decoding function instead. Can you share the tsv/csv file with the MFG labels?

Nvm, I think I figured out the problem! It turns out that I removed “Neurosynth_TFIDF__” in the wrong way.

Originally, I did:
keys_new = [s.strip('Neurosynth_TFIDF__') for s in keys]
where keys holds the dict keys of all the labels. This actually strips any of the characters in “Neurosynth_TFIDF__” from the beginning and the end of each label (if I understand it correctly).
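A quick demonstration of the difference, using one of the labels that came out garbled in the word cloud (standard Python string methods):

```python
# str.strip() treats its argument as a SET of characters to trim from both
# ends, not as a literal prefix -- which is what mangled the labels.
label = "Neurosynth_TFIDF__effects"

print(label.strip("Neurosynth_TFIDF__"))    # 'ffec' (the garbled word-cloud label)
print(label.replace("Neurosynth_TFIDF__", ""))  # 'effects'

# On Python 3.9+, removeprefix() is the most precise option, since it
# only removes the exact prefix and only from the start of the string.
print(label.removeprefix("Neurosynth_TFIDF__"))  # 'effects'
```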

Then I modified my code:
keys_new = [s.replace('Neurosynth_TFIDF__','') for s in keys]
It now removes “Neurosynth_TFIDF__” as a whole and the labels become normal. Below are the updated MFG labels:
[updated word cloud image]

This seems to match the “features.txt” file downloaded with the Neurosynth data. However, I still have problems interpreting the labels in general. Some of them are straightforward, but others are really vague: for example, “effects”, “stimuli”, “evidence”, “greater”, and there are also numbers such as “001”. I am not exactly sure how to interpret them. Is there a codebook that describes these labels in detail, or anything else that could help me interpret the data?

Thanks so much!!
Linjing

I’m glad you were able to find the problem.

Regarding the terms themselves: they come from a very simple TF-IDF vectorization of the article abstracts. Some terms are removed automatically using a stop list, but that list is not very extensive. Not to mention that the labeling procedure basically comes down to “is this word used in the abstract?”. All of which is to say that you shouldn’t read too much into the labels, or really into any automated annotation result.
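To make that concrete, here is a toy TF-IDF sketch (this is not Neurosynth’s actual pipeline; the abstracts, stop list, and scores below are all made up):

```python
import math

# Toy TF-IDF sketch: score each term in each abstract by its term
# frequency (TF) times its inverse document frequency (IDF).
abstracts = {
    "study-001": "fear responses in the amygdala during fear conditioning",
    "study-002": "working memory effects in prefrontal cortex",
}
stoplist = {"the", "in", "of", "during"}  # a minimal stop list for illustration

# Tokenize the abstracts, dropping stop words.
docs = {
    sid: [w for w in text.lower().split() if w not in stoplist]
    for sid, text in abstracts.items()
}
n_docs = len(docs)

def tfidf(term, sid):
    """TF-IDF weight of a term in one document."""
    tf = docs[sid].count(term) / len(docs[sid])
    df = sum(term in words for words in docs.values())
    return tf * math.log(n_docs / df)

print(round(tfidf("fear", "study-001"), 3))  # 0.277: term repeats and is rare
print(tfidf("fear", "study-002"))            # 0.0: term never appears
```

Because any word surviving the stop list can become a label, vague terms like “effects” or tokens like “001” (from p-value strings in abstracts) end up in the feature set alongside meaningful ones.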

On the bright side, I do recall that @62442katieb classified all of the labels in Neurosynth’s standard feature set into informative and uninformative groups at one point. Maybe she can provide that mapping?

One alternative that might help if you’re concerned with interpretability is to use a topic model instead. Topic models find underlying distributions among term uses in the abstracts, which generally groups terms that are used together into the same topics. The caveats regarding automated annotation still apply, of course, but at least the annotation procedure tends to produce more useful results.

Thanks so much for your clarification!

A follow-up question: If I were to use the topic model, how should I do that in NiMARE? Is there any specific NiMARE function or command for this, or should I use other packages?

NiMARE has a couple of topic model tools, including Latent Dirichlet Allocation (LDA) and Generalized Correspondence Latent Dirichlet Allocation (GCLDA). LDA just uses text, while GCLDA uses both text and coordinates. The docstrings for both classes include papers in which those algorithms (not the NiMARE implementations) were used. Be forewarned, though, that GCLDA takes a very long time to train.

Got it! I noticed that there are multiple GCLDA functions in NiMARE. One is under “annotate” , such as nimare.annotate.gclda and others are under “decode”, such as nimare.decode.discrete.gclda_decode_roi. Are these two types of functions the same?

nimare.annotate contains classes and functions for automated annotation (i.e., extraction of labels/terms from studies, generally using their abstracts), while nimare.decode contains tools for performing functional characterization analysis, or functional decoding. You’ll want to use nimare.annotate.gclda.GCLDAModel to train your GCLDA model, which builds distributions of p(term|topic), p(study|topic), and p(voxel|topic). Then you can use nimare.decode.discrete.gclda_decode_roi to decode a mask/ROI based on the GCLDA model. All that function does is average the topic weights from the p(voxel|topic) distributions across the ROI, as in Rubin et al. (2017).
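For intuition, that averaging step can be sketched with toy numbers (this is not NiMARE’s actual code; the p(voxel|topic) matrix below is made up):

```python
# Conceptual sketch of gclda_decode_roi's averaging step -- not NiMARE's code.
# Toy p(voxel|topic) matrix: 4 voxels x 2 topics (made-up probabilities).
p_voxel_given_topic = [
    [0.10, 0.40],  # voxel 0
    [0.20, 0.30],  # voxel 1
    [0.30, 0.20],  # voxel 2
    [0.40, 0.10],  # voxel 3
]
roi_voxels = [0, 1]  # indices of the voxels inside the ROI mask

# Average each topic's voxel weights over the ROI to get one weight per topic.
n_topics = len(p_voxel_given_topic[0])
topic_weights = [
    sum(p_voxel_given_topic[v][t] for v in roi_voxels) / len(roi_voxels)
    for t in range(n_topics)
]
print([round(w, 2) for w in topic_weights])  # [0.15, 0.35] -- topic 1 dominates this ROI
```

Topics with high average weight in the ROI are the ones whose associated terms best characterize that region.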

And regarding term classification as “informative” or “uninformative”, I do have such a list and am happy to email it to you, if you’re interested @linjing_jiang !


Thanks for your clarification! I will try these out and keep you posted on how it goes!

If you could send me the list that would be great! Here is my email: linjing.jiang@stonybrook.edu

Thanks so much!!
