relational_cluster_ids = neurosynth_dataset.get_studies_by_label("Neurosynth_TFIDF__23", label_threshold=0.05) lands me 248 studies … so there still seems to be some trickery afoot.
I used the “features.txt” file; I created neurosynth_dataset as follows: nimare.io.convert_neurosynth_to_dataset("database.txt", "features.txt")
Ah, okay. So in that case “Neurosynth_TFIDF__23” really just refers to abstracts that contain the number “23”. You’ll want to download the topic files from here. You can extract v5-topics.tar.gz and replace features.txt with v5-topics/analyses/v5-topics-200.txt in your nimare.io.convert_neurosynth_to_dataset call. Then it should be good.
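For reference, a minimal sketch of that call, assuming you extracted v5-topics.tar.gz next to database.txt (the paths are placeholders for wherever the files live on your machine):

import nimare

neurosynth_dataset = nimare.io.convert_neurosynth_to_dataset(
    "database.txt",
    "v5-topics/analyses/v5-topics-200.txt",  # topic file replaces features.txt
)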
Thanks so much for your help! I have a few more questions about how I can modify the Dataset object in NiMARE:
The Neurosynth dataset does not seem to contain the sample size of each study, which is needed for an accurate ALE analysis. Can I manually add sample sizes as a column to the dataset?
I am also wondering how Neurosynth retrieves the peak coordinates from each paper. Does it simply retrieve all the reported coordinates (in the tables, for example) regardless of which contrast the study performed (if the study examined more than one type of contrast)? If so, to what extent can I modify the existing peak coordinates, for example by removing unwanted contrasts from the dataset?
Finally, is it possible to add new studies to the Neurosynth dataset locally?
That’s correct. Neurosynth uses web-scraping to get metadata about each study, but sample size is not part of that. At one point, I did play around with some regular expressions that could possibly identify sample sizes in abstracts (GitHub - NBCLab/samplesize: Sample size extraction.), but I don’t think I ended up using it for anything, and we didn’t validate it much.
If you know the sample sizes from some other source, you could add them to the Dataset.metadata DataFrame as a sample_sizes column, with each cell containing a list of integers.
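For instance, a rough sketch of what that could look like, assuming you have a mapping from study IDs to known sample sizes (the IDs and numbers below are made-up placeholders):

known_sample_sizes = {"12345678-1": 25, "23456789-1": 31}  # placeholder study IDs

neurosynth_dataset.metadata["sample_sizes"] = [
    [known_sample_sizes[study_id]] if study_id in known_sample_sizes else None  # each cell is a list of ints
    for study_id in neurosynth_dataset.metadata["id"]
]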
What I generally do when running an ALE on Neurosynth data is set a single sample size on the KernelTransformer, which then applies to all studies. There are two ways to do that:
1. Provide the ALE with an initialized ALEKernel:

from nimare.meta.cbma import ALE
from nimare.meta.kernel import ALEKernel  # in some older NiMARE versions: nimare.meta.cbma.kernel

kernel = ALEKernel(sample_size=20)  # assume N=20 for every study
ale = ALE(kernel_transformer=kernel)
2. Pass the sample size directly to the ALE, using the kernel__ prefix to route it to the kernel transformer:

ale = ALE(kernel__sample_size=20)  # equivalent to the two-step version above
That’s correct: Neurosynth cannot identify individual contrasts within papers, so it just grabs all coordinates for a given paper. There is a column in the Dataset.coordinates attribute that should reflect the table each coordinate comes from, but I wouldn’t lean on that too heavily; many papers report results from multiple contrasts in the same table, and you still wouldn’t be able to identify which contrast corresponds to which table in an automated way.
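If you want to poke at that yourself, here is a minimal sketch for inspecting the coordinates of a single study (the study ID is a made-up placeholder, and the exact name of the table-related column depends on the NiMARE version and conversion):

coords = neurosynth_dataset.coordinates
print(coords.columns)  # look for a table-identifier column here
print(coords.loc[coords["id"] == "12345678-1"])  # placeholder study ID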
Regarding manually changing the coordinates… I think it’s an all-or-nothing proposition. Neurosynth is full of noise, but we can assume that noise is fairly consistent across the studies we care about and those we don’t. If you correct things in only part of the dataset, you will end up with a mix of very noisy and less noisy studies, with a bias working in favor of the studies you care about. I believe that’s one reason why Tal never tried to directly incorporate manual corrections into the database (see this section of the Neurosynth FAQ: Neurosynth: FAQs).
It’s definitely possible, but it may not be a good idea, for the reason I gave above. That said, you could do one of two things:
Add the data directly to the Neurosynth files you download to your machine.
Create a NiMARE Dataset object for the Neurosynth database and another one for your manual dataset, and then merge them with the new Dataset.merge() method, added in NiMARE version 0.0.9.
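A minimal sketch of that second option, assuming your manual studies are already in a NiMARE-format JSON file (the filename is a placeholder):

from nimare.dataset import Dataset
from nimare.io import convert_neurosynth_to_dataset

neurosynth_dataset = convert_neurosynth_to_dataset("database.txt", "features.txt")
manual_dataset = Dataset("my_manual_studies.json")  # placeholder path to your own studies
combined_dataset = neurosynth_dataset.merge(manual_dataset)  # requires NiMARE >= 0.0.9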