Neurosynth topic-based decoding via NiMARE

Hi again Taylor and/or others,

Our group has used the Neurosynth decoding script in the past to run topic-based decoding. We generally use the 50-topic file (https://github.com/neurosynth/neurosynth-web/blob/master/data/topics/analyses/v4-topics-50.txt). However, I am running into issues with this code, and after a little searching I found a reply from Tal Yarkoni saying this package is no longer maintained (https://github.com/neurosynth/neurosynth/issues/96).

Does NiMARE allow reproducing this topic-based analysis?

Thanks!

Have you tried downloading the topic file using nimare.extract.fetch_neurosynth? Specifically, you should be able to use vocab="LDA50" and version="7" to download the most recent version of Neurosynth with the 50-topic model included as the features.

In order to faithfully reproduce the original meta-analyses, you would need to use a threshold of 0.05 instead of the default 0.001 in Dataset.get_studies_by_label(). I found that out in How to replicate Neurosynth meta-analysis in NiMARE? - #7 by tsalo.
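As a toy illustration of what that thresholding does (this is *not* NiMARE's internal code; the column name, study IDs, and weights below are all invented), selecting studies by label amounts to filtering the annotations table on that label's weight:

```python
import pandas as pd

# Toy annotations table standing in for Dataset.annotations
# (the label name and all values are invented for illustration).
annotations = pd.DataFrame(
    {
        "id": ["pmid-1", "pmid-2", "pmid-3"],
        "LDA50__topic_7": [0.12, 0.001, 0.06],
    }
)

# Equivalent in spirit to Dataset.get_studies_by_label with a 0.05 threshold:
# keep only studies whose topic weight meets the cutoff.
label = "LDA50__topic_7"
selected_ids = annotations.loc[annotations[label] >= 0.05, "id"].tolist()
print(selected_ids)  # only pmid-1 and pmid-3 pass the 0.05 cutoff
```

With the TFIDF default of 0.001, nearly every study with any mention of the term would pass; 0.05 is the stricter cutoff appropriate for topic weights.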

Hi Taylor,

Thanks for this info. I have successfully downloaded the v7 50-topic LDA file, but it only contains 50 lines with the topics, whereas the older version we used contains an "id" header in addition to the values associated with the topics (https://raw.githubusercontent.com/neurosynth/neurosynth-web/master/data/topics/analyses/v4-topics-50.txt). The code you refer to in that other thread therefore works with the v4 file, but not the v7. Is there a v7 file I missed that is organized similarly?

Using the code you provided in that other thread, I was able to merge the 50 v4 topics into the Neurosynth dataset. However, I am not exactly sure what the next steps should be. I apologize in advance: I am not a seasoned coder, especially not with Python, so it is a bit difficult for me to follow. Here is my code, in case it helps:

Creation of the neurosynth dataset, as indicated on the NiMARE website:

```python
import os
from pprint import pprint

import nimare

out_dir = os.path.abspath("/Users/m246120/Desktop/dAD_BPR/Decoding/neurosynth/")
os.makedirs(out_dir, exist_ok=True)

files = nimare.extract.fetch_neurosynth(
    path=out_dir,
    version="7",
    overwrite=False,
    source="abstract",
    vocab="terms",
)
pprint(files)
neurosynth_db = files[0]

neurosynth_dset = nimare.io.convert_neurosynth_to_dataset(
    coordinates_file=neurosynth_db["coordinates"],
    metadata_file=neurosynth_db["metadata"],
    annotations_files=neurosynth_db["features"],
)
neurosynth_dset.save(os.path.join(out_dir, "neurosynth_dataset.pkl.gz"))
print(neurosynth_dset)

neurosynth_dset = nimare.extract.download_abstracts(neurosynth_dset, "corriveau-lecavalier.nick@mayo.edu")
neurosynth_dset.save(os.path.join(out_dir, "neurosynth_dataset_with_abstracts.pkl.gz"))
```

```python
# Reproduction of the LDA50 Neurosynth library.
# First, load the dataset as dset.
import gzip
import pickle

with gzip.open("/Users/m246120/Desktop/dAD_BPR/Decoding/neurosynth/neurosynth_dataset_with_abstracts.pkl.gz") as f:
    dset = pickle.load(f)
```

```python
import pandas as pd

# Read in the topic file, rename the ID column, and
# prepend a prefix to the topic names.
df = pd.read_table("/Users/m246120/Desktop/dAD_BPR/Decoding/neurosynth/data-neurosynth_version-4_vocab-LDA50_vocabulary.txt")
topic_names = [c for c in df.columns if c.startswith("topic")]
topics_renamed = {t: "Neurosynth_LDA__" + t for t in topic_names}
topics_renamed["id"] = "study_id"
df = df.rename(columns=topics_renamed)
```

```python
# Change the data type of the study_id column so it can be merged.
df["study_id"] = df["study_id"].astype(str)

# Merge the topic dataframe into the annotations dataframe.
new_annotations = dset.annotations.merge(
    df,
    how="inner",
    left_on="study_id",
    right_on="study_id",
)
dset.annotations = new_annotations
```

```python
# The topic file only contains ~10k studies,
# so we must reduce the dataset to match.
new_ids = new_annotations["id"].tolist()
dset = dset.slice(new_ids)
```
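(For what it's worth, here is a toy version of that merge-then-slice step with invented study IDs, just to show the effect on the annotations:)

```python
import pandas as pd

# Toy stand-ins for dset.annotations and the topic file; IDs are invented.
annotations = pd.DataFrame(
    {"id": ["s1-1", "s2-1", "s3-1"], "study_id": ["s1", "s2", "s3"]}
)
topics = pd.DataFrame(
    {"study_id": ["s1", "s3"], "Neurosynth_LDA__topic_1": [0.2, 0.7]}
)

# An inner merge keeps only studies present in BOTH tables, which is why
# the dataset shrinks to the ~10k studies covered by the topic file.
merged = annotations.merge(topics, how="inner", on="study_id")
new_ids = merged["id"].tolist()
print(new_ids)  # "s2-1" drops out because "s2" is absent from the topic file
```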

What would come next? I am not sure what you mean by the Dataset.get_studies_by_label() function, as I have never included that in my previous models. Do I need to run a new model and then the decoder? Would decoding be similar to the GCLDA model? Such as:

```python
# Run the decoder
decoded_df, _ = decode.continuous.gclda_decode_map(model, img_eb1)
decoded_df.sort_values(by="Weight", ascending=True).head(50)
```

Thanks for your help again.

You don’t need all of those steps. You can just fetch the LDA50 features directly and feed them into the conversion function, just like the default TFIDF weights.

Yes, the format of the Neurosynth and NeuroQuery files was changed in order to (1) minimize the space used by the database and (2) use the same convention for both. However, NiMARE can work with these new files just fine.

Take a look at this:

```python
import os
from pprint import pprint

import nimare

out_dir = os.path.abspath("/Users/m246120/Desktop/dAD_BPR/Decoding/")
os.makedirs(out_dir, exist_ok=True)

# Fetch Neurosynth with *just* the LDA50 features
files = nimare.extract.fetch_neurosynth(
    data_dir=out_dir,  # version 0.0.10 switched to a data directory
    version="7",
    overwrite=False,
    source="abstract",
    vocab="LDA50",  # Note the difference here
)
neurosynth_db = files[0]
pprint(neurosynth_db)
# Note the "keys" file. That has the top 30 words for each topic.
# It *doesn't* go in the Dataset at all, though.

# Get the Dataset object
neurosynth_dset = nimare.io.convert_neurosynth_to_dataset(
    coordinates_file=neurosynth_db["coordinates"],
    metadata_file=neurosynth_db["metadata"],
    annotations_files=neurosynth_db["features"],
)
```

From there you can use whichever decoder best fits your analysis. No need to run a model.
The GCLDA method (gclda_decode_map) won't work, since it relies on probability distributions that are available in a GCLDA model but not an LDA one; the other decoders should work, though.

If you want to decode an unthresholded map, you can use the CorrelationDecoder, but you should be aware that it’s currently very slow, and it’s generally easier to just loop through the features and run the meta-analyses with your own script, as discussed in Issue: fitting nimare.decode.continuous.CorrelationDecoder to Neurosynth dataset. You can find the template for such a script in that topic.
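To give a sense of what the CorrelationDecoder is doing under the hood, here is a minimal sketch with fake data (the "maps" are random 1-D arrays standing in for flattened, masked brain images; this is an illustration of the idea, not NiMARE's implementation):

```python
import numpy as np

# Fake data: a target map and three feature "meta-analytic maps",
# all random arrays rather than real brain images.
rng = np.random.default_rng(0)
target_map = rng.normal(size=1000)
feature_maps = {f"topic_{i}": rng.normal(size=1000) for i in range(3)}

# Correlate the target map's voxel values with each feature's map.
results = {
    name: float(np.corrcoef(target_map, fmap)[0, 1])
    for name, fmap in feature_maps.items()
}

# Features are then ranked by correlation with the input map.
ranked = sorted(results, key=results.get, reverse=True)
```

The slow part in practice is generating the per-feature meta-analytic maps, which is why looping through features with your own script (caching each map) can be more convenient than refitting the decoder.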

The CorrelationDecoder has a parameter, frequency_threshold, that determines how studies are divided into those that are "about" a feature and those that are not. When you run a meta-analysis on a larger Dataset like Neurosynth, you will want to separate the studies into those that are "positive" for (about) the feature and those that are "negative". For Neurosynth's standard TFIDF values, the default threshold is 0.001, based on the original Neurosynth code. However, for the LDA topic-model weights, the default in Neurosynth's code is 0.05 (i.e., frequency_threshold=0.05).
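In other words, the threshold simply splits the studies into two groups before the meta-analysis. A toy example (study IDs and weights invented):

```python
# Toy split of studies by topic weight; all values are invented.
weights = {"s1": 0.30, "s2": 0.01, "s3": 0.07, "s4": 0.00}
threshold = 0.05  # Neurosynth's default for LDA topic weights

# Studies at or above the threshold are treated as "about" the topic;
# the rest form the comparison group.
positive = sorted(k for k, v in weights.items() if v >= threshold)
negative = sorted(k for k, v in weights.items() if v < threshold)
print(positive, negative)  # ['s1', 's3'] ['s2', 's4']
```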

I cannot thank you enough, Taylor. The code is presently running, fingers crossed it works.
