Issue with NiMARE GCLDA model when n_topics=1

dlevitas · October 25, 2022, 12:35am

I’m attempting to perform a meta-analysis with NiMARE (version 0.0.12), where I’d like to find the voxels most often associated with the term “threat”, so p(voxel|threat). I have the following code to download the NeuroSynth dataset and subset it by the threat term:

if not os.path.isfile("{}/neurosynth_dataset_with_abstracts.pkl.gz".format(output_dir)):
    print("Downloading Neurosynth dataset with abstracts included")
    neurosynth_dset = nimare.extract.download_abstracts(neurosynth_dset, "myemail@edu")
    neurosynth_dset.save(os.path.join(output_dir, "neurosynth_dataset_with_abstracts.pkl.gz"))

print("Splitting NeuroSynth dataset by appropriate term (threat)")
threat_ids = neurosynth_dset.get_studies_by_label("terms_abstract_tfidf__threat", label_threshold=0.001)
threat_neurosynth_dset = neurosynth_dset.slice(threat_ids)

I then attempted to build and fit my GCLDA model:

if not os.path.isfile("{}/gclda_threat_model.pkl.gz".format(output_dir)):
    
    print("Building and training GCLDA model(s)")
    print("")
    
    counts_df = nimare.annotate.text.generate_counts(
        threat_neurosynth_dset.texts,
        text_column="abstract",
        tfidf=False,
        max_df=0.99,
        min_df=0.01
        )
    
    # only select columns with "threat" in name
    counts_df = counts_df[[x for x in counts_df.columns if "threat" in x]]

    # treat all columns as "threat", single topic
    counts_df = pd.DataFrame(counts_df.sum(axis=1))
    counts_df.columns = ["threat"]
    
    coordinates_df = threat_neurosynth_dset.coordinates
    coordinates_df.index = coordinates_df["id"]

    # model
    threat_model = nimare.annotate.gclda.GCLDAModel(counts_df,
                                                    coordinates_df,
                                                    n_topics=1,
                                                    n_regions=2,
                                                    symmetric=True
                                                    )    
    # fit model
    threat_model.fit(n_iters=1000, loglikely_freq=10)

This produces the error: TypeError: object of type 'numpy.float64' has no len(). If, however, I don’t sum the counts_df dataframe across columns and I specify n_topics > 2, the model runs to completion. My issue, though, is that I’m essentially trying to use NiMARE to produce plots akin to NeuroSynth, like this. Is there a way for me to treat the “threat” term as a single topic and produce a map similar to NeuroSynth?

Thanks for the assistance.

JulioAPeraza · October 25, 2022, 11:50pm

Hi @dlevitas,

I’m not sure a GCLDA model will work with a single term. Generally, the model learns a set of topics from a collection of unigrams and bigrams extracted from the articles’ abstracts. If your goal is to use p(voxel|topic_threat), I think you would need to run GCLDA with all the terms and a large number of topics (e.g., 100-200). Your topic_threat would then be the topic in which “threat” is one of the top words, which you can find by sorting the p(word|topic) distribution for each topic.
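As a rough sketch of that last step, here is how the search for a “threat” topic could look on a toy p(word|topic) table. The vocabulary, the probabilities, and the helper topics_with_top_word are all made up for illustration; with a real fitted model you would substitute its own word-by-topic distribution.

```python
# Toy p(word|topic) table: hypothetical numbers, NOT output from a fitted GCLDA model.
vocabulary = ["threat", "fear", "amygdala", "memory", "reward"]
p_word_g_topic = [
    [0.04, 0.12, 0.06, 0.70, 0.08],  # topic 0
    [0.40, 0.30, 0.20, 0.05, 0.05],  # topic 1
    [0.02, 0.03, 0.05, 0.10, 0.80],  # topic 2
]


def topics_with_top_word(p_word_g_topic, vocabulary, word, n_top=3):
    """Return (topic index, rank of `word`) for topics where `word` is in the top n_top words."""
    hits = []
    for topic_idx, probs in enumerate(p_word_g_topic):
        # Rank vocabulary indices by descending probability within this topic.
        ranked = sorted(range(len(vocabulary)), key=lambda i: probs[i], reverse=True)
        top_words = [vocabulary[i] for i in ranked[:n_top]]
        if word in top_words:
            hits.append((topic_idx, top_words.index(word)))
    return hits


threat_topics = topics_with_top_word(p_word_g_topic, vocabulary, "threat", n_top=3)
print(threat_topics)
```

With real data, several topics may contain “threat” among their top words, so it is worth inspecting the other top words of each candidate before picking one.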

Alternatively, to reproduce the plots from the Neurosynth website you could perform a term-based meta-analysis using MKDAChi2():

from nimare.meta.cbma.mkda import MKDAChi2

frequency_threshold = 0.001
feature = "terms_abstract_tfidf__threat"
meta_estimator = MKDAChi2()

# Create dset 1
feature_ids = neurosynth_dset.get_studies_by_label(labels=feature, label_threshold=frequency_threshold)
feature_ids = sorted(feature_ids)
feature_dset = neurosynth_dset.slice(feature_ids)

# Create dset 2
nonfeature_ids = sorted(list(set(neurosynth_dset.annotations.id.to_list()) - set(feature_ids)))
nonfeature_dset = neurosynth_dset.slice(nonfeature_ids)

meta_results = meta_estimator.fit(feature_dset, nonfeature_dset)

and select the uniformity test map:

meta_results.get_map("z_desc-consistency")

where z_desc-consistency is the voxel-level z-values from the consistency/forward inference analysis.
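To make the feature / non-feature split more concrete, here is a toy sketch of the quantities being contrasted: the fraction of studies reporting activation at each voxel, with and without the term. The study IDs, weights, and voxel indices are invented, and the strict greater-than threshold is an assumption. MKDAChi2 itself convolves foci with a kernel and runs chi-square tests, none of which is reproduced here.

```python
# Hypothetical toy data: study id -> (tf-idf weight for "threat", set of active voxel indices).
studies = {
    "s1": (0.120, {0, 1}),
    "s2": (0.000, {2}),
    "s3": (0.045, {1}),
    "s4": (0.000, {1, 2}),
    "s5": (0.0005, {0}),
}
label_threshold = 0.001
n_voxels = 3

# Split studies into "term" and "no term" groups.
# Assumption: the threshold is applied as a strict greater-than comparison.
feature_ids = sorted(sid for sid, (weight, _) in studies.items() if weight > label_threshold)
nonfeature_ids = sorted(set(studies) - set(feature_ids))


def activation_rate(ids):
    """Fraction of the given studies reporting activation at each voxel."""
    return [sum(v in studies[sid][1] for sid in ids) / len(ids) for v in range(n_voxels)]


p_act_given_term = activation_rate(feature_ids)
p_act_given_no_term = activation_rate(nonfeature_ids)
print(feature_ids, p_act_given_term)
print(nonfeature_ids, p_act_given_no_term)
```

Voxels where the first rate is high are the ones the consistency map tends to highlight. The comparison between the two rates is closer in spirit to the specificity (reverse inference) analysis.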

Similarly, if you would like to generate a map associated not just with the term “threat” but with a group of similar terms (e.g., “threat”, “threatening”, “threats”), you can perform a topic-based meta-analysis. Neurosynth has some LDA models that were trained on the latest version of the database. See, for example, Neurosynth: topic 180, which has “threat” as one of its top words. You can download the topics data:

from nimare.extract import fetch_neurosynth
from nimare.io import convert_neurosynth_to_dataset

files = fetch_neurosynth(
    data_dir=out_dir,
    version="7",
    overwrite=False,
    source="abstract",
    vocab="LDA200",
)
neurosynth_db = files[0]

neurosynth_dset = convert_neurosynth_to_dataset(
    coordinates_file=neurosynth_db["coordinates"],
    metadata_file=neurosynth_db["metadata"],
    annotations_files=neurosynth_db["features"],
)

and for the meta-analysis, you can follow the previous steps with MKDAChi2(), but now use:

frequency_threshold = 0.05
feature = "LDA200_abstract_weight__180_amygdala_threat_fear"

These two maps can also be downloaded from the Neurosynth website.

Best,
Julio A

dlevitas · October 26, 2022, 1:19am

Thanks @JulioAPeraza, I’ll look into these two options!

I was also playing around with the ALE coordinate-based analysis. I’ve been taking the coordinates from my threat_neurosynth_dset and feeding them into nimare.meta.cbma.ale.ALE, followed by multiple comparisons correction. This cursory analysis produces somewhat similar maps, but with greater whole-brain activation. I realize this isn’t quite the same as assessing p(voxel|topic_threat), but I assume this is also a valid meta-analysis?