We are trying to figure out the intrinsic dimensionality of the activity of a population of spiking neurons, and to avoid having to use arbitrary smoothing kernels to infer firing rates, we decided to use Gaussian process factor analysis (GPFA). But unlike PCA, it doesn't give us a measure of explained variance. What would be the best way to infer the latent dimensionality?

One method we explored is cross-validation across different numbers of latent dimensions, picking the one with maximum likelihood. But since our initial space is ~100-dimensional, checking every intermediate dimensionality is too expensive. Is there a better way? E.g., does log-likelihood vs. dimensionality necessarily have structure that would let us use other search algorithms?

Finally, if we want to compare the separation of decisions in the activity space across populations using a classification method after running GPFA, is it necessary, or better, to use the same number of latent dimensions?

I think you have a few possible options. You could test only a fraction of the ~100 possible latent dimensionalities, e.g., [2, 12, 22, …], using the cross-validation scheme you describe. You could also consider reframing the question: rather than finding the optimal latent dimensionality, what is the minimum latent dimensionality that lets you solve the downstream problem (separating decisions)? Or, at a fixed small dimensionality, how well can you solve the downstream problem?
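The coarse grid search could be sketched as below. To keep the sketch self-contained, scikit-learn's FactorAnalysis stands in for GPFA (GPFA is essentially factor analysis with temporal smoothing, and both expose a cross-validatable log-likelihood); the data, grid, and variable names are all illustrative, not your setup:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Synthetic "binned activity": 200 time bins x 100 neurons,
# generated from a 5-dimensional latent space plus noise.
latents = rng.normal(size=(200, 5))
loadings = rng.normal(size=(5, 100))
data = latents @ loadings + 0.5 * rng.normal(size=(200, 100))

candidate_dims = range(2, 50, 10)  # coarse grid: 2, 12, 22, ...
scores = {}
for d in candidate_dims:
    fa = FactorAnalysis(n_components=d)
    # mean held-out log-likelihood across 5 folds
    scores[d] = cross_val_score(fa, data, cv=5).mean()

best_dim = max(scores, key=scores.get)
```

If the curve looks flat past some dimensionality, you can then refine the grid locally around the knee instead of scanning all ~100 values.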

Depending on which library you're using for GPFA, you might also consider breaking long time series into batches of shorter chunks. That will make training (EM) faster.

We are using the implementation in the Elephant (Electrophysiology Analysis Toolkit) package in Python. Is there an alternative that allows batching?

There is a small issue with the subsampling method: when I tried it, the results were not unimodal. Maybe it was due to estimation noise, but there were many local extrema. If that's the case, subsampling will give us uncertain estimates of the best latent dimensionality.

For the dataset you’re looking at, how long are the trials and what’s the bin size you’re using?

If you have more than ~50 bins per trial, it might be worthwhile to split the trials into chunks (yielding more, shorter trials) before calling model.fit(data). You can still call model.transform(data) on the full-trial data.
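The splitting step could look something like this; `chunk_trials` is a hypothetical helper, not part of Elephant, and it simply drops any trailing remainder shorter than the chunk length:

```python
import numpy as np

def chunk_trials(trials, chunk_len):
    """Split each (time x neurons) trial array into consecutive
    chunks of chunk_len bins, discarding any short remainder."""
    chunks = []
    for trial in trials:
        n_full = trial.shape[0] // chunk_len
        for i in range(n_full):
            chunks.append(trial[i * chunk_len:(i + 1) * chunk_len])
    return chunks

# Two toy trials: 105 and 60 bins of 100 neurons each.
trials = [np.zeros((105, 100)), np.zeros((60, 100))]
chunks = chunk_trials(trials, chunk_len=20)
# 105 // 20 = 5 chunks and 60 // 20 = 3 chunks -> 8 chunks total
```

You would fit on `chunks` (more, shorter trials) and still transform the original full-length trials.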

A quick skim of the Elephant code suggests they already cut the trials into length-20 segments internally.

EM itself can get stuck in local maxima of the log-likelihood, which you might be running into! Often, people restart with a few random initializations and keep the best fit, but this increases the computational cost.
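The restart strategy is generic to EM. Here is a sketch using scikit-learn's GaussianMixture, whose EM has the same local-maxima issue (this is an analogy, not Elephant's API, which may not expose a seed the same way):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy two-cluster data standing in for whatever EM is fitting.
data = np.concatenate([rng.normal(-3, 1, size=(100, 2)),
                       rng.normal(3, 1, size=(100, 2))])

# Run EM from several random initializations; keep the best model.
best_ll, best_model = -np.inf, None
for seed in range(5):
    gm = GaussianMixture(n_components=2, init_params="random",
                         random_state=seed).fit(data)
    ll = gm.score(data)  # mean log-likelihood per sample
    if ll > best_ll:
        best_ll, best_model = ll, gm
```

Comparing the per-seed log-likelihoods also tells you how bad the local-maxima problem is: if they all agree, one initialization is probably fine.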

I think for the NMA project, simpler is better. It's fine to choose one dimensionality that has a fairly high log-likelihood. You can always follow up with a more thorough exploration later.