NeuroQuery: train custom model

Hi, I am trying to re-train the NeuroQuery model and am running into issues. Here is the link to my repo: https://github.com/albud187/Neuroquery-Work

Basically, I want to be able to retrain the NeuroQuery model using a subset of the original studies in the corpus metadata. For now, I am only interested in the 30 studies I have identified in “autism_data.csv”, which is in the linked GitHub repo.

I was able to run the original training_neuroquery.ipynb just fine.

I tweaked the code so that it re-trains the model using only the studies in “autism_data.csv”.
This code is in “Train Custom Model.ipynb”, linked here from my GitHub repo: https://github.com/albud187/Neuroquery-Work/blob/main/Train%20Custom%20Model.ipynb

I have an issue when I call the encoder. It says:

“Size of label ‘j’ for operand 1 (6289) does not match previous terms (6308)”

I am not sure what is causing it. I’m aware that tfidf is of shape (n, V), where n = the number of studies and V = 6308, from the vocabulary.
Is there a way to change the shape of tfidf so that V = 6289?

Hello, with a few small modifications your script should work:

import pathlib

from scipy import sparse
import numpy as np
import pandas as pd
from joblib import Memory
from nilearn import plotting

from neuroquery import datasets
from neuroquery.img_utils import coordinates_to_maps
from neuroquery.smoothed_regression import SmoothedRegression
from neuroquery.tokenization import TextVectorizer
from neuroquery.encoding import NeuroQueryModel

cache_directory = "cache"


CORPUS_FILE = "autism_data.csv"  # 30-study subset to train on
CORPUS_FILE_MASTER = "corpus_metadata.csv"  # metadata for the full corpus

data_dir = pathlib.Path(datasets.fetch_neuroquery_model())

corpus_metadata = pd.read_csv(CORPUS_FILE)
corpus_masterdata = pd.read_csv(CORPUS_FILE_MASTER)
# loading the vocabulary from the model directory also picks up the
# accompanying vocabulary.csv_voc_mapping_identity.json sitting next to it
vectorizer = TextVectorizer.from_vocabulary_file(
    str(data_dir / "vocabulary.csv")
)
tfidf = sparse.load_npz(str(data_dir / "corpus_tfidf.npz"))
coordinates = pd.read_csv(datasets.fetch_peak_coordinates())
# keep only coordinates from the studies in autism_data.csv, so that
# only those are transformed to brain maps (saves time)
coordinates = coordinates[
    coordinates["pmid"].isin(corpus_metadata["pmid"].values)
]

coord_to_maps = Memory(cache_directory).cache(coordinates_to_maps)
brain_maps, masker = coord_to_maps(
    coordinates, target_affine=(6, 6, 6), fwhm=9.0
)
# drop studies whose coordinates produced an empty map
brain_maps = brain_maps[(brain_maps.values != 0).any(axis=1)]

# align brain maps and tfidf on the same pmids (those in autism_data.csv);
# the rows of corpus_tfidf.npz follow the order of the full corpus_metadata.csv,
# so the pmid -> row lookup must be built from the master file, not the subset
pmids = brain_maps.index.intersection(corpus_metadata["pmid"])
brain_maps = brain_maps.loc[pmids, :]
rindex = pd.Series(
    np.arange(corpus_masterdata.shape[0]), index=corpus_masterdata["pmid"].values
)
tfidf = tfidf.toarray()[rindex.loc[pmids].values, :]

regressor = SmoothedRegression(alphas=[1.0, 10.0, 100.0])

print(
    "Fitting smoothed regression model on {} samples...".format(tfidf.shape[0])
)
regressor.fit(tfidf, brain_maps.values)

output_directory = "autism_model"

# reorder the subset metadata so its rows match the brain maps / tfidf rows
corpus_metadata = corpus_metadata.set_index("pmid").loc[pmids, :].reset_index()
encoder = NeuroQueryModel(
    vectorizer,
    regressor,
    masker.mask_img_,
    corpus_info={
        "tfidf": sparse.csr_matrix(tfidf),
        "metadata": corpus_metadata,
    },
)
encoder.to_data_dir(output_directory)

query = "Autism"
print('Encoding "{}"'.format(query))

result = encoder(query)

plotting.view_img(result["brain_map"], threshold=3.0).open_in_browser()

print("Similar words:")
print(result["similar_words"].head())
print("\nSimilar documents:")
print(result["similar_documents"].head())

print("\nmodel saved in {}".format(output_directory))
The main changes:
  • the brain maps and the tfidf are indexed so that their rows correspond to the same pmids (those in autism_data.csv)
  • only coordinates corresponding to these pmids are transformed to brain maps, to save time
  • I believe the tfidf shape issue was due to loading the vocabulary from a file downloaded separately, i.e. without the accompanying vocabulary.csv_voc_mapping_identity.json (see the quick check below)
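
As a quick check of that last point, you can compare the two vocabulary sizes directly. A minimal sketch, assuming vocabulary.csv has no header row and that each entry in the mapping JSON merges one term into another (so the effective vocabulary shrinks by one per entry):

import json
import pathlib

import pandas as pd

from neuroquery import datasets

data_dir = pathlib.Path(datasets.fetch_neuroquery_model())

# raw vocabulary entries in the csv
voc = pd.read_csv(data_dir / "vocabulary.csv", header=None)
# synonym mapping shipped next to the csv; neuroquery looks for this
# file when it loads the vocabulary
mapping = json.loads(
    (data_dir / "vocabulary.csv_voc_mapping_identity.json").read_text()
)

print(len(voc))                 # 6308 terms in the csv alone
print(len(voc) - len(mapping))  # expected 6289 after merging mapped terms
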

In any case, this new version should run.
But note that 30 studies is definitely not enough to fit this model, so the results will not be meaningful.
We have not computed learning curves, so I cannot really say how many studies you need to start getting meaningful results, but probably several thousand (a rough way to check is sketched below).
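
A minimal learning-curve sketch, reusing tfidf and brain_maps from the script above (assuming a much larger study set than the 30 here) and substituting a plain scikit-learn Ridge for SmoothedRegression just to keep the scoring simple:

from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    tfidf, brain_maps.values, test_size=0.2, random_state=0
)

# fit on growing subsets and score held-out maps with voxel-averaged R^2;
# the curve flattening out suggests the sample size is becoming adequate
for n in (100, 500, 1000, 5000):
    if n > X_train.shape[0]:
        break
    ridge = Ridge(alpha=10.0).fit(X_train[:n], y_train[:n])
    print(n, r2_score(y_test, ridge.predict(X_test)))
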
You can also try reducing the number of parameters in the model, to help it learn from fewer (but still far more than 30) samples: reduce the vocabulary, and perhaps represent the studies’ maps by their loadings on an atlas or dictionary components to reduce their dimension (sketched below).
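
Here is a minimal sketch of the atlas-loadings idea, reusing masker and brain_maps from the script above. The Harvard-Oxford atlas is just an example choice, and note that after this reduction the spatial smoothing in SmoothedRegression no longer applies, so a plain ridge regression may be a better fit:

from nilearn import datasets as nl_datasets
from nilearn.input_data import NiftiLabelsMasker  # nilearn.maskers in newer releases

# project each study's voxel map onto atlas regions for low-dimensional loadings
atlas = nl_datasets.fetch_atlas_harvard_oxford("cort-maxprob-thr25-2mm")
labels_masker = NiftiLabelsMasker(labels_img=atlas.maps).fit()

maps_img = masker.inverse_transform(brain_maps.values)  # rows back to a 4D image
loadings = labels_masker.transform(maps_img)            # (n_studies, n_regions)
print(loadings.shape)
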
Otherwise, to get results from very few studies, you may want to consider more traditional meta-analysis techniques such as MKDA or ALE rather than a multivariate model like NeuroQuery.
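
For example, the NiMARE package implements ALE. A minimal sketch, assuming the coordinates have already been converted to NiMARE’s json dataset format (the file name here is hypothetical):

from nimare.dataset import Dataset
from nimare.meta.cbma.ale import ALE

# load a coordinate dataset in NiMARE's json format (hypothetical file name)
dset = Dataset("autism_coordinates.json")

# run the ALE meta-analysis and save the resulting statistical maps
results = ALE().fit(dset)
results.save_maps(output_dir="ale_results")
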

Hope this helps!

Thanks so much. It helps.

Honestly, I’m not too familiar with neuroscience; I’m just an engineering grad student who knows Python and is helping out a neuroscience student.

It’s up to my research partner to decide which studies to include in the re-training; I just used those 30 studies to start. I figure if I can modify the code to train on any subset of the 13k studies, I’ll be good to go with whichever studies my research partner wants.

I’ll discuss those other meta-analysis techniques with my research partner.