NiMARE gclda model using the Neuroquery database - Code error

nickcorriveaul · October 8, 2021, 3:27pm

Hi Taylor,

I have tried to run the gclda model using the Neurosynth and Neuroquery databases. Everything worked well for the former (thanks for your previous advice), but I am running into an issue while creating the files object using the nimare.extract.fetch_neuroquery function. Although all the necessary neuroquery files are in the appropriate folder, it says the index list is empty. It looks like it is looking for a single file containing “combined”, “neuroquery7547”, and “tfidf” in its name. Changing “7547” to “6308” worked, as there is a file named “data-neuroquery_version-1_vocab-neuroquery6308_source-combined_type-tfidf_features.npz” in the folder.

However, I am not sure if this is correct. I initially thought the command should fetch three different files, no? I thought this might be due to a typo in the NiMARE instructions, but I have no idea. I thought your help would be beneficial at this point.

Let me know!

Many thanks.

```
files = nimare.extract.fetch_neuroquery(
    dir_data=out_dir,
    version="1",
    overwrite=False,
    source="combined",
    vocab="neuroquery7547",
    type="tfidf",
)
pprint(files)
neuroquery_db = files[0]
```

tsalo · October 8, 2021, 4:32pm

Hi Nick,

The first issue is that tfidf values are only available for the neuroquery6308 vocabulary, as you noted. When you run fetch_neuroquery with an invalid combination of source, vocabulary, etc., it should just return an empty list.
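Because an invalid combination comes back as an empty list, indexing `files[0]` directly fails with an unhelpful error. A minimal sketch of a guard is shown here; `first_database` is a hypothetical helper, not part of NiMARE, and the empty list below simply stands in for the result of a fetch that matched nothing.

```python
def first_database(files):
    """Return the first database entry, or raise if the fetch matched nothing."""
    if not files:
        raise ValueError(
            "fetch returned no databases; check the source/vocab/type combination"
        )
    return files[0]


# An empty list stands in for the result of an invalid combination.
try:
    first_database([])
except ValueError as err:
    print(f"Caught: {err}")
```

With a real fetch, `neuroquery_db = first_database(files)` would replace `neuroquery_db = files[0]`.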

The second issue is that GCLDA must be trained on term counts, rather than tfidf values, so you’ll want to use the neuroquery7547 vocabulary (or maybe the larger neuroquery156521 vocabulary, if you’re willing to commit serious resources to fitting the model). The neuroquery6308 vocabulary only has tfidf values available.
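One way to double-check which kind of values a features file holds before training is to read the key-value pairs in its filename. The sketch below assumes the filenames follow the `key-value_key-value` pattern seen in this thread; `parse_entities` is a hypothetical helper, not a NiMARE function.

```python
from pathlib import Path


def parse_entities(path):
    """Split a 'key-value_key-value_suffix.ext' filename into a dict of entities."""
    stem = Path(path).name.split(".")[0]  # drop the extension(s)
    entities = {}
    for part in stem.split("_"):
        if "-" in part:  # the trailing suffix (e.g. 'features') has no key
            key, value = part.split("-", 1)
            entities[key] = value
    return entities


# Filename taken from the thread above.
features_file = (
    "data-neuroquery_version-1_vocab-neuroquery6308_"
    "source-combined_type-tfidf_features.npz"
)
entities = parse_entities(features_file)
print(entities["vocab"], entities["source"], entities["type"])
```

Here the `type` entity is `tfidf`, so this particular file would not be suitable for GCLDA.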

One minor issue with your code is that you have dir_data instead of data_dir, although I’m guessing that was just a mistake in the pasted code rather than what you ran since you didn’t report a relevant exception.

There should technically be four files: (1) a coordinates file with the study-wise x, y, and z coordinates for the database; (2) a metadata file with the study-wise metadata (e.g., PubMed IDs); (3) a features file with the label values in a sparse array; and (4) a corresponding vocabulary file with the terms associated with each column in the features file.

When I run

```
import nimare
files = nimare.extract.fetch_neuroquery(
    version="1",
    overwrite=False,
    source="combined",
    vocab="neuroquery6308",
    type="tfidf",
)
```

Here is what files ends up being:

```
[{'coordinates': '/Users/taylor/.nimare/neuroquery/data-neuroquery_version-1_coordinates.tsv.gz',
  'features': [{'features': '/Users/taylor/.nimare/neuroquery/data-neuroquery_version-1_vocab-neuroquery6308_source-combined_type-tfidf_features.npz',
                'vocabulary': '/Users/taylor/.nimare/neuroquery/data-neuroquery_version-1_vocab-neuroquery6308_vocabulary.txt'}],
  'metadata': '/Users/taylor/.nimare/neuroquery/data-neuroquery_version-1_metadata.tsv.gz'}]
```

This is what I’d expect for a valid combination of parameters. You could call the function with multiple values for each parameter to download multiple feature sets, which would give you multiple dictionaries in the “features” field.
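To make the nesting concrete, here is a small sketch that walks one database entry and pairs each features file with its vocabulary file. The `files` list is hand-written to mirror the structure printed above, with shortened stand-in paths rather than real ones, and `list_feature_sets` is a hypothetical helper, not part of NiMARE.

```python
def list_feature_sets(database):
    """Yield (features_path, vocabulary_path) pairs from one database entry."""
    for feature_set in database["features"]:
        yield feature_set["features"], feature_set["vocabulary"]


# Hand-written stand-in for the return value; the paths are shortened placeholders.
files = [
    {
        "coordinates": "neuroquery/data-neuroquery_version-1_coordinates.tsv.gz",
        "metadata": "neuroquery/data-neuroquery_version-1_metadata.tsv.gz",
        "features": [
            {
                "features": "neuroquery/data-neuroquery_version-1_vocab-neuroquery6308_source-combined_type-tfidf_features.npz",
                "vocabulary": "neuroquery/data-neuroquery_version-1_vocab-neuroquery6308_vocabulary.txt",
            }
        ],
    }
]

database = files[0]
print("coordinates:", database["coordinates"])
print("metadata:", database["metadata"])
for features_path, vocabulary_path in list_feature_sets(database):
    print("features:", features_path)
    print("vocabulary:", vocabulary_path)
```

If several feature sets were requested, the loop would print one features/vocabulary pair per set, while the coordinates and metadata files stay shared.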

Does that make sense?

nickcorriveaul · October 8, 2021, 6:14pm

Hi Taylor,

Thanks for this valuable info. So if I understand correctly, to train the GCLDA model, I should use the neuroquery7547 vocabulary and drop the tfidf line? Would it look like this?

```
files = nimare.extract.fetch_neuroquery(
    data_dir=out_dir,
    version="1",
    overwrite=False,
    source="abstract",
    vocab="neuroquery7547",
)
pprint(files)
neuroquery_db = files[0]
```

tsalo · October 8, 2021, 7:44pm

That looks good to me. You could use another “source” (e.g., the body of the papers) if you want, though the “combined” text isn’t an available source for that vocabulary.

nickcorriveaul · October 8, 2021, 7:59pm

Excellent. Thank you so much for your quick replies!

Nick