How to extract Neurosynth studies with specific features in NiMARE?

I’m glad you were able to find the problem.

Regarding the terms themselves: they come from a very simple TF-IDF vectorization of the article abstracts. Some terms are removed automatically using a stop-word list, but that list is not very extensive, and the labeling procedure basically comes down to “is this word used in the abstract?” All of which is to say that you shouldn’t read too much into the labels, or really into any automated annotation result.

On the bright side, I do recall that @62442katieb classified all of the labels in Neurosynth’s standard feature set into informative and uninformative groups at one point. Maybe she can provide that mapping?

One alternative that might be helpful if you’re concerned with interpretability is to use a topic model instead. Topic models learn underlying distributions over the terms used in the abstracts, which generally groups terms that are used together into the same topics. The caveats regarding automated annotation still apply, of course, but at least the annotation procedure tends to produce more useful results.

Thanks so much for your clarification!

A follow-up question: If I were to use the topic model, how should I do that in NiMARE? Is there any specific NiMARE function or command for this, or should I use other packages?

NiMARE has a couple of topic model tools, including Latent Dirichlet Allocation (LDA) and Generalized Correspondence Latent Dirichlet Allocation (GCLDA). LDA just uses text, while GCLDA uses both text and coordinates. The docstrings for both classes include papers in which those algorithms (not the NiMARE implementations) were used. Be forewarned, though, that GCLDA takes a very long time to train.
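
If you go the LDA route, a minimal sketch might look like this (hedged: class names and parameters have moved between NiMARE releases, so check the API docs for your installed version; dset is assumed to be a NiMARE Dataset built from the Neurosynth files, and the topic/iteration counts are arbitrary placeholders):

from nimare import annotate, extract

# Neurosynth Datasets don't include abstracts by default; download them first
# (an email address is required for the PubMed API).
dset = extract.download_abstracts(dset, "your.email@example.com")

# Fit an LDA topic model to the abstracts.
lda_model = annotate.lda.LDAModel(n_topics=50, max_iter=1000, text_column="abstract")
new_dset = lda_model.fit(dset)

# The returned Dataset's annotations now include per-study topic loadings.
print(new_dset.annotations.head())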

Got it! I noticed that there are multiple GCLDA functions in NiMARE. One is under “annotate”, such as nimare.annotate.gclda, and others are under “decode”, such as nimare.decode.discrete.gclda_decode_roi. Are these two types of functions the same?

nimare.annotate contains classes and functions for automated annotation (i.e., extraction of labels/terms from studies, generally using their abstracts), while nimare.decode contains tools for performing functional characterization analysis, or functional decoding. You’ll want to use nimare.annotate.gclda.GCLDAModel to train your GCLDA model, which builds distributions of p(term|topic), p(study|topic), and p(voxel|topic). Then, you can use nimare.decode.discrete.gclda_decode_roi to decode a mask/ROI based on the GCLDA model. All that function does is average the topic weights from the p(voxel|topic) distributions across the ROI, as in Rubin et al. (2017).
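
Concretely, a rough sketch of that workflow, based on the GCLDA example in the NiMARE documentation (hedged: keyword arguments and defaults may differ across versions, the iteration count here is far too small for real use, and "roi.nii.gz" is a hypothetical mask file):

from nimare import annotate, decode, extract

# dset is a NiMARE Dataset built from the Neurosynth files; abstracts must be
# downloaded before any text-based annotation.
dset = extract.download_abstracts(dset, "your.email@example.com")

# GCLDA works on raw term counts rather than TF-IDF values.
counts_df = annotate.text.generate_counts(
    dset.texts, text_column="abstract", tfidf=False, max_df=0.99, min_df=0
)

# Train the GCLDA model on term counts plus peak coordinates. This is the slow
# part; n_iters=100 is a toy value for illustration only.
model = annotate.gclda.GCLDAModel(
    counts_df,
    dset.coordinates,
    mask=dset.masker.mask_img,
    n_topics=50,
    n_regions=2,
    symmetric=True,
)
model.fit(n_iters=100, loglikely_freq=20)

# Decode an ROI by averaging p(voxel|topic) across the mask, per Rubin et al. (2017).
decoded_df, topic_weights = decode.discrete.gclda_decode_roi(model, "roi.nii.gz")
print(decoded_df.sort_values(by="Weight", ascending=False).head(10))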

And regarding the classification of terms as “informative” or “uninformative”, I do have such a list and am happy to email it to you if you’re interested, @linjing_jiang!

Thanks for your clarification! I will try these out and keep you posted on how it goes!

If you could send me the list that would be great! Here is my email: linjing.jiang@stonybrook.edu

Thanks so much!!

Hi! I’m following the same procedure to generate a meta-analytic dataset with specific features. But when I run this piece of code:

neurosynth_dataset = nimare.io.convert_neurosynth_to_dataset("database.txt", "features.txt")

The following error happens:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\pythonProject3\lib\site-packages\nimare\io.py", line 123, in convert_neurosynth_to_dataset
    dict_ = convert_neurosynth_to_dict(text_file, annotations_file)
  File "C:\pythonProject3\lib\site-packages\nimare\io.py", line 45, in convert_neurosynth_to_dict
    dset_df = pd.read_csv(text_file, sep="\t")
  File "C:\pythonProject3\lib\site-packages\pandas\io\parsers.py", line 610, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "C:\pythonProject3\lib\site-packages\pandas\io\parsers.py", line 462, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "C:\pythonProject3\lib\site-packages\pandas\io\parsers.py", line 819, in __init__
    self._engine = self._make_engine(self.engine)
  File "C:\pythonProject3\lib\site-packages\pandas\io\parsers.py", line 1050, in _make_engine
    return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
  File "C:\pythonProject3\lib\site-packages\pandas\io\parsers.py", line 1867, in __init__
    self._open_handles(src, kwds)
  File "C:\pythonProject3\lib\site-packages\pandas\io\parsers.py", line 1362, in _open_handles
    self.handles = get_handle(
  File "C:\pythonProject3\lib\site-packages\pandas\io\common.py", line 642, in get_handle
    handle = open(
FileNotFoundError: [Errno 2] No such file or directory: 'database.txt'

Do you have any idea what it could be? Thanks!

nimare.io.convert_neurosynth_to_dataset requires two files: one containing the study coordinates and one containing the term frequencies from the article abstracts. Those files are provided by Neurosynth in the neurosynth-data repository. Have you downloaded those files, and do they exist in the folder you’re running nimare.io.convert_neurosynth_to_dataset from?
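
As a quick sanity check, something like this (a minimal sketch; the filenames are just the ones used earlier in this thread) will confirm that both files are visible from your working directory before converting:

import os

import nimare

# Verify both Neurosynth files are present before attempting the conversion.
for fname in ("database.txt", "features.txt"):
    if not os.path.isfile(fname):
        raise FileNotFoundError(
            f"{fname} not found in {os.getcwd()}; "
            "download it from the neurosynth-data repository first."
        )

neurosynth_dataset = nimare.io.convert_neurosynth_to_dataset("database.txt", "features.txt")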

Hi! Could I also get in on that?
foldes.andrei@gmail.com

Hello,

There is a certain Neurosynth topic that I would like to work with in NiMARE: Neurosynth: topic 23
It includes 140 studies on neurosynth.org; however, when I do the same using NiMARE,
relational_cluster_ids = neurosynth_dataset.get_studies_by_label("Neurosynth_TFIDF__23", label_threshold=0.001)
it results in 254 studies.

What am I doing wrong?

It’s a little buried on the website, but for the topic meta-analyses a threshold of 0.05 is used instead of 0.001 (check the FAQs tab on the meta-analysis page you linked):

How do you determine which studies to include in an analysis?

We use a predefined binary cut-off. For all topic-based meta-analyses, we treat all studies with a loading > 0.05 as “active” for a given topic, and all other studies as inactive. Although the choice of threshold is relatively arbitrary, in practice, varying it within a fairly broad range of values has minimal influence on the results. Adopting a continuous approach instead of dichotomizing the dataset also has a negligible effect.

I also noticed that the label you have is “Neurosynth_TFIDF__23”. Is that just because you used nimare.io.convert_neurosynth_to_dataset to create your Dataset, or did you use the “features.txt” file in neurosynth-data?

Thanks for the quick response;

relational_cluster_ids = neurosynth_dataset.get_studies_by_label("Neurosynth_TFIDF__23", label_threshold=0.05) lands me 248 studies… so there still seems to be some trickery afoot.

I used the “features.txt” file and created neurosynth_dataset as follows:
neurosynth_dataset = nimare.io.convert_neurosynth_to_dataset("database.txt", "features.txt")

Ah, okay. So in that case “Neurosynth_TFIDF__23” really just refers to whether the number “23” appears in an abstract. You’ll want to download the topic files from here. You can extract v5-topics.tar.gz and replace features.txt with v5-topics/analyses/v5-topics-200.txt in your nimare.io.convert_neurosynth_to_dataset call. Then it should be good.
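
In code, that swap would look something like this (a sketch: the label prefix NiMARE assigns to topic columns depends on the version and file, so inspect the annotations columns rather than trusting the placeholder name below):

import nimare

# Rebuild the Dataset from the topic loadings instead of the term frequencies.
neurosynth_dataset = nimare.io.convert_neurosynth_to_dataset(
    "database.txt", "v5-topics/analyses/v5-topics-200.txt"
)

# Check what the topic labels are actually called in this NiMARE version.
print(neurosynth_dataset.annotations.columns[:5])

# Select studies loading on topic 23 with Neurosynth's 0.05 topic threshold.
# The label below is a placeholder; substitute one of the printed column names.
topic_ids = neurosynth_dataset.get_studies_by_label(
    "Neurosynth_TFIDF__topic23", label_threshold=0.05
)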

Thanks, that was it… weird how frequent the number 23 is ^^

Thanks so much for your help! I got some more questions about how I can modify the dataset object in NiMARE:

  1. The Neurosynth dataset does not seem to contain the sample size of each study, which is needed for an accurate ALE analysis. Can I manually add sample size as a column to the dataset?

  2. I am also wondering how Neurosynth retrieves the peak coordinates from each paper. Does it simply retrieve all the reported coordinates (in the tables, for example), regardless of which contrast the study performed (if the study examined more than one type of contrast)? If so, to what extent can I modify the existing peak coordinates, such as removing unwanted contrasts from the dataset?

  3. Finally, is it possible to add new studies to the Neurosynth dataset locally?

Thanks!
Linjing

That’s correct. Neurosynth uses web-scraping to get metadata about each study, but sample size is not part of that. At one point, I did play around with some regular expressions that could possibly identify sample sizes in abstracts (see the NBCLab/samplesize repository on GitHub), but I don’t think I ended up using it for anything, and we didn’t validate it much.

If you know the sample sizes from some other source, you could add them to the Dataset.metadata DataFrame as sample_sizes, with each cell containing a list of integers.
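
For instance, a minimal sketch (here sample_size_map is a hypothetical dict mapping study IDs to known sample sizes; nothing like it ships with NiMARE):

# Attach known sample sizes to the Dataset's metadata. Each cell holds a list
# of integers because a study can report multiple groups.
metadata = neurosynth_dataset.metadata.copy()
metadata["sample_sizes"] = metadata["id"].map(lambda study_id: [sample_size_map[study_id]])
neurosynth_dataset.metadata = metadata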

What I generally do when running an ALE on Neurosynth data is just set a single sample size for the KernelTransformer that applies to all studies. There are two ways to do that (a fuller sketch with imports follows the list):

  1. Provide the ALE with an initialized ALEKernel:
kernel = ALEKernel(sample_size=20)
ale = ALE(kernel_transformer=kernel)
  2. Include the sample size as a parameter to the ALE:
ale = ALE(kernel__sample_size=20)
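
Putting that together (a sketch: the import paths below are from recent NiMARE releases and have moved between versions, and sample_size=20 is an arbitrary placeholder):

from nimare.meta.cbma import ALE
from nimare.meta.kernel import ALEKernel

# Option 1: pass a pre-configured kernel transformer.
kernel = ALEKernel(sample_size=20)
ale = ALE(kernel_transformer=kernel)

# Option 2: forward the kernel argument through the ALE constructor.
ale = ALE(kernel__sample_size=20)

# Either way, the fixed sample size is applied to every study in the Dataset.
results = ale.fit(neurosynth_dataset)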

That’s correct; Neurosynth cannot identify individual contrasts within papers, so it just grabs all coordinates for a given paper. There is a column in the Dataset.coordinates attribute that should reflect the table each coordinate comes from, but I wouldn’t lean on that too heavily: many papers report results from multiple contrasts in the same table, and you still wouldn’t be able to identify which contrast corresponds to which table in an automated way.

Regarding manually changing the coordinates… I think that it’s an all-or-nothing proposition. Neurosynth is full of noise, but we can assume that that noise is fairly consistent across the studies we care about and those we don’t. If you correct things in only part of the dataset, then you will have a mix of very-noisy and less-noisy studies, with a bias working in favor of the studies you care about. I believe that’s one reason why Tal never tried to directly incorporate manual corrections into the database (see this section of the Neurosynth FAQ: Neurosynth: FAQs).

It’s definitely possible, but may not be a good idea for the reason I specified above. That said, you could do one of two things:

  1. Add the data directly to the Neurosynth files you download to your machine.
  2. Create a NiMARE Dataset object for the Neurosynth database and another one for your manual dataset, and then merge them with the new Dataset.merge() method, added in NiMARE version 0.0.9 (sketched below).
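
A minimal sketch of option 2 (hedged: "my_studies.json" is a hypothetical file in NiMARE's Dataset JSON format, and merge() requires NiMARE >= 0.0.9):

from nimare.dataset import Dataset

# Load your manually curated studies from a NiMARE-format JSON file.
manual_dset = Dataset("my_studies.json")

# Merge them with the Neurosynth-derived Dataset built earlier in the thread.
combined_dset = neurosynth_dataset.merge(manual_dset)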

Best,
Taylor

Got it! Thanks a lot for your quick and detailed reply!!

Can you send me a copy of the same document? Thank you very much!
zzdl33@vip.qq.com