W2D1 Tutorial3 Bayes question

TQi · August 2, 2020, 4:16am

This tutorial confuses me a lot…basically I understand most of the formulas, but for some steps, I would do it in a different way. Can someone point out where I am wrong? Thanks!

In exercise 1, we generated the p(\tilde x | x) matrix by looping over the values of \tilde x and, for each one, setting x to be normally distributed around \tilde x. To me, it makes more sense to loop over x and set \tilde x to be normally distributed around x, since the logic is “the perceived location is around the real location” rather than “the real location is around the perceived location”. I know both cases will result in the same 2D normal distribution, but this is more logical to me.
In exercise 7, for each sound location x_0, the code generates an x-independent input matrix p(\tilde x | x=x_0) to calculate the response function p(\hat x | x=x_0). But why can’t we just directly generate the input matrix p(\tilde x | x) as a function of both x and \tilde x, which will be x-dependent, and directly calculate p(\hat x | x)? This way we don’t have to loop through the possible values of x.

Thanks in advance!

TQi · August 5, 2020, 3:58am

Another question is…when fitting the model, shouldn’t negative log-likelihood be

-LL = - \sum_i \log p(\hat{x}_i \mid x_i, p_{\text{independent}})

rather than just

-LL = - \sum_i \log p(\hat{x}_i \mid x_i)

as we are looking for the best p_{\text{independent}}?

I would be grateful if anyone could help me understand these!

aep · August 5, 2020, 8:48pm

What I understand:

the observer has a prior about the position of the stimulus, p(x), and access to measurements
the observer doesn’t ‘decode’ the result of the measurement perfectly, but represents it as a probability distribution (centered on the true position): this is p(\tilde x|x) (likelihood)
after the measurement, the observer combines prior and measurement into the posterior: p(x|\tilde x)
the observer reads its representation \tilde x, and decides that the true position is the mean of p(x|\tilde x) for that value of \tilde x.

This is what we assume to be the decision process.

So my answer to your first question is: p(x|\tilde x) has to contain both the prior (previous knowledge) and the likelihood (noisy measurement, imperfect representation etc). We can model these separately and combine them, but we couldn’t as easily define a function that includes both.

Now: how can we guess what the priors of the observer are? We can only see its choices. We do not know its internal representations. So we probe the observer with several true positions x, we record its estimates \hat x, then we calculate the likelihood of those responses, given several possible priors the observer might have (a range of values p_{independent}), and we infer that the real prior of the observer is p_{independent} that minimises the negative log likelihood.
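A minimal sketch of this fitting step, under assumptions that are mine and not the tutorial's: the prior is a mixture of a narrow and a broad Gaussian with weight p_independent on the broad one, the widths and grid are arbitrary, and the helper names (`gaussian`, `response_matrix`, `negative_log_likelihood`) are hypothetical. The response matrix is rebuilt for every candidate value, so the conditioning on p_independent is explicit in the code even when the formula omits it.

```python
import numpy as np

def gaussian(grid, mu, sigma):
    """Gaussian evaluated on `grid`, normalised to sum to 1."""
    p = np.exp(-0.5 * ((grid - mu) / sigma) ** 2)
    return p / p.sum()

def response_matrix(grid, p_independent, sigma_like=1.0):
    """Return R with R[j, i] = p(x_hat = grid[j] | x = grid[i], p_independent)."""
    n = grid.size
    # ASSUMED mixture prior: a narrow and a broad component centred on 0,
    # with weight p_independent on the broad one (widths are illustrative).
    prior = ((1 - p_independent) * gaussian(grid, 0.0, 0.5)
             + p_independent * gaussian(grid, 0.0, 10.0))
    # like[k, i] = p(x_tilde = grid[k] | x = grid[i])
    like = np.stack([gaussian(grid, x, sigma_like) for x in grid], axis=1)
    # post[k, i] = p(x = grid[i] | x_tilde = grid[k]) via Bayes' rule
    post = like * prior[None, :]
    post /= post.sum(axis=1, keepdims=True)
    # Posterior-mean estimate for each x_tilde, snapped to the nearest grid point
    x_hat = post @ grid
    nearest = np.abs(grid[None, :] - x_hat[:, None]).argmin(axis=1)
    # decision[j, k] = p(x_hat = grid[j] | x_tilde = grid[k]), deterministic here
    decision = np.zeros((n, n))
    decision[nearest, np.arange(n)] = 1.0
    # Marginalise over the unobserved x_tilde
    return decision @ like

def negative_log_likelihood(R, x_idx, x_hat_idx, eps=1e-12):
    """-sum_i log p(x_hat_i | x_i, p_independent); eps guards against log(0)."""
    return -np.sum(np.log(R[x_hat_idx, x_idx] + eps))

grid = np.linspace(-8, 8, 81)
rng = np.random.default_rng(0)

# Simulated observer with a known mixing weight (made-up data, for illustration only)
true_p = 0.3
R_true = response_matrix(grid, true_p)
x_idx = rng.integers(0, grid.size, size=300)
x_hat_idx = np.array([rng.choice(grid.size, p=R_true[:, i]) for i in x_idx])

# Grid search: evaluate the NLL for each candidate p_independent
candidates = np.linspace(0.1, 0.9, 9)
nll = np.array([
    negative_log_likelihood(response_matrix(grid, p), x_idx, x_hat_idx)
    for p in candidates
])
best_p = candidates[nll.argmin()]
print("candidate with the lowest NLL:", best_p)
```

A grid search is used only because it is easy to read; an optimiser would work on the same `negative_log_likelihood` function.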

So we calculate, for each fixed x=x_0:
p(\hat x|x=x_0)=\sum_{\tilde x} p(\hat x|\tilde x)p(\tilde x|x=x_0)

Here p(\tilde x|x=x_0) plays the role of a ‘prior’ over \tilde x, and it depends on the value of x_0. So yes, you can have a matrix p(\tilde x|x), but what you multiply p(\hat x|\tilde x) by is not that whole matrix; it is the distribution p(\tilde x|x=x_0) that corresponds to each true value x_0. Each true value gives a different p(\tilde x|x=x_0), and this is a column vector (repeated for all values of \hat x), not a matrix. It makes no difference to my decision what my representation would be for a true position other than the one I am observing at that moment.
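For what it is worth, the sum over \tilde x for one fixed x_0 is one column of a matrix product, so the loop over x_0 and a single product of the two full matrices give the same p(\hat x|x). A small sketch with random stand-in matrices (the array layout and the names `decision` and `likelihood` are my assumptions, not the tutorial's variables):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6  # number of grid points (tiny, for illustration)

# Stand-in matrices whose columns are probability distributions:
# decision[j, k]   = p(x_hat = j   | x_tilde = k)
# likelihood[k, i] = p(x_tilde = k | x = i)
decision = rng.random((n, n))
decision /= decision.sum(axis=0, keepdims=True)
likelihood = rng.random((n, n))
likelihood /= likelihood.sum(axis=0, keepdims=True)

# Loop version: one fixed true position x_0 at a time
looped = np.zeros((n, n))
for i in range(n):
    p_xtilde = likelihood[:, i]              # p(x_tilde | x = x_0), a vector
    tiled = np.tile(p_xtilde, (n, 1))        # same vector repeated for every x_hat
    looped[:, i] = (decision * tiled).sum(axis=1)  # sum over x_tilde

# Vectorised version: all true positions at once
vectorised = decision @ likelihood           # [j, i] = p(x_hat = j | x = i)

print(np.allclose(looped, vectorised))
```

The elementwise product only makes sense with the single vector for the current x_0; the matrix product does the same sum for every x_0 in one call.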

This is probably more detail than you wanted, but I hope it answers your questions (does it?)

As for your next post, I think you’re right; they probably didn’t write it explicitly for brevity.

TQi · August 7, 2020, 4:13am

Thanks! I think I get your explanations to the second question.
For the first question, I’m not sure how your answer is related to my question…I understand what information is in the p(x | \tilde x) matrix, but what I don’t understand is the way to build this matrix. I mean shouldn’t we build \tilde x based on x rather than the other way around?

aep · August 7, 2020, 11:24am

This is what we do, we build p(\tilde x|x) (the likelihood), assuming that the observer’s internal representation of the true stimulus is a normal distribution centered on the true stimulus. Then we consider the bias of the observer, which is p(x) (the prior). Say true position is 1.2, we assume that the neurons tuned on 1.2 will fire the most, but those tuned on 1.15 will also fire, and those tuned on 1 will fire too, but with a smaller rate (the tuning curves themselves have some spread). Say that, from previous experience or for other reasons, the observer has a strong belief that the stimulus will be at position 1; this means that, likely, it will give more weight to the (smaller) firing of neurons tuned on 1 than to the (larger) firing of neurons tuned on 1.2. So we combine the two by Bayes’ rule, and we find p(x|\tilde x) (posterior). This is the probability that the true stimulus is at position x, given that the internal representation of the observer (the part of its internal representation based only on current sensory evidence) places it at \tilde x. This is the belief of the observer about the stimulus, using both its past hunches and the current sensory evidence. Then we use this belief to build the decision matrix, p(\hat x|\tilde x). In the example above, it may be that the observer decides the stimulus must be at 1.05 - the position it experiences currently, ‘adjusted’ by its previous belief.
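The numbers in this example can be reproduced on a grid. This is only a sketch: I assume a single Gaussian prior centred on 1 (the tutorial's prior may well be different), and the two widths are chosen by hand so that the prior gets three times the weight of the measurement.

```python
import numpy as np

def normal_pdf(v, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at v."""
    return np.exp(-0.5 * ((v - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = np.linspace(-8, 8, 3201)   # candidate true positions, step 0.005

x_tilde = 1.2                  # where current sensory evidence places the stimulus
sigma_like = 0.1 * np.sqrt(3)  # ASSUMED measurement noise (hand-picked)
mu_prior, sigma_prior = 1.0, 0.1  # ASSUMED strong belief that the stimulus is at 1

prior = normal_pdf(x, mu_prior, sigma_prior)      # p(x)
likelihood = normal_pdf(x_tilde, x, sigma_like)   # p(x_tilde | x), as a function of x

# Bayes' rule on the grid: posterior is proportional to likelihood times prior
posterior = likelihood * prior
posterior /= posterior.sum()

x_hat = np.sum(x * posterior)  # decision = posterior mean

# Closed form for two Gaussians: precision-weighted average of the two centres
w_prior = 1 / sigma_prior**2
w_like = 1 / sigma_like**2
x_hat_closed_form = (w_prior * mu_prior + w_like * x_tilde) / (w_prior + w_like)

print(x_hat, x_hat_closed_form)
```

With these widths the estimate lands between the prior centre and the measurement, closer to the prior because the prior is narrower.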

All this forms our model of how the observer makes decisions. This model has certain parameters, and we infer the parameters from the data (actual decisions of the observer, for different values of the stimulus). When we do this, we assume that our model of how the observer makes decisions is correct.

The observer does not have direct access to x, only through its internal representations, both stimulus-driven and its own biases. And we do not have access to the internal representation of the observer, only to its decisions. This is why we take all these steps: first build a model of how the observer generates responses, then use the true responses to tune this model.

I hope this is correct, and that it makes sense, ask again if not.

TQi · August 8, 2020, 3:47am

Thanks for explaining the whole process! This matches my understanding.
But still the way we built p(\tilde x | x) is not according to this logic. As you said, we build p(\tilde x | x) (the likelihood) assuming that the observer’s internal representation of the true stimulus (\tilde x) is a normal distribution centered on the true stimulus (x). But in the tutorial, the code used a for loop to go through different values of \tilde x, and for each given \tilde x, it set x as \mathcal N(\tilde x, \sigma^2_\mathrm{likelihood}). This is not “assuming that the observer’s internal representation is a normal distribution centered on the true stimulus”, but “assuming that the true stimulus is a normal distribution centered on the observer’s internal representation”. In order to follow the former assumption, I believe what we should do is to loop through different x values, and set \tilde x as \mathcal N(x, \sigma^2_\mathrm{likelihood}).
Hope I made myself clear. And thanks again!

aep · August 8, 2020, 12:02pm

I see your question - it was clear in your first post, but then, in the post before your last, I lost sight of the fact that you were asking about the code, not about the method.

I don’t know why they did it this way. Perhaps, mathematically, they wanted to take the mean over the full gaussian for every \tilde x. Perhaps they wanted to say that the observer doesn’t consider the possibility that x might be outside the -8 to 8 interval. I might have let both x and \tilde x vary from -10 to 10, then only shown the decision for the -8 to 8 range. I think you’re right (unless someone explains why not), but, as you said in your first post, the result is the same whichever way you loop (at least if the observer doesn’t have a strong prior that the stimulus is on the edge), it’s just the end points that have to be taken care of.
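To check the point about the end points, here is a sketch that builds the matrix both ways and compares them. It assumes each Gaussian is normalised to sum to 1 along the looped direction on a grid from -8 to 8; I have not checked whether the tutorial normalises this way, and the variable names are mine. Without any normalisation the two matrices would be identical, since the Gaussian is symmetric in x and \tilde x.

```python
import numpy as np

def normal_pdf(v, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at v."""
    return np.exp(-0.5 * ((v - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

grid = np.linspace(-8, 8, 161)  # shared grid for x and x_tilde, step 0.1
sigma = 1.0
n = grid.size

# Way A: loop over x_tilde, put a Gaussian over x centred on x_tilde.
# Rows index x_tilde, columns index x; each ROW is normalised.
by_x_tilde = np.zeros((n, n))
for k, xt in enumerate(grid):
    row = normal_pdf(grid, xt, sigma)
    by_x_tilde[k, :] = row / row.sum()

# Way B: loop over x, put a Gaussian over x_tilde centred on x.
# Same layout, but each COLUMN is normalised.
by_x = np.zeros((n, n))
for i, x in enumerate(grid):
    col = normal_pdf(grid, x, sigma)
    by_x[:, i] = col / col.sum()

diff = np.abs(by_x_tilde - by_x)
interior = slice(50, 111)  # grid points between -3 and 3, far from both edges

print("largest difference away from the edges:", diff[interior, interior].max())
print("largest difference overall:", diff.max())
```

Away from the edges the two constructions agree to within rounding, because almost none of the Gaussian is cut off. Near -8 and 8 part of the Gaussian falls off the grid, so it matters which direction is renormalised.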

I saw your other question about W2D3 Tutorial 3. My take is that, once we model the state as being the position of the gaze, we should more or less expect F to be close to the identity; I don’t see how it would work, logically and mathematically, if it were not. It doesn’t mean that the gaze is fixed, just that the state tracks the data that we put into the model, adjusted for estimated variability, as they say in the tutorial. But as to why this is a useful model beyond, as they say, providing a smoothing of the data, I don’t have an intuition for that. They say this in the tutorial too: they would need to consider more information in order to identify underlying causes. It is a useful example of how to use Kalman filters in Python, and of how to think about this sort of problem. This is my take, but I don’t understand this very well; perhaps someone will give a better answer.

TQi · August 8, 2020, 7:17pm

Hi, thanks for the reply! Wanting to take the mean over the full Gaussian for every \tilde x looks like a good reason, and I agree with you that changing the range while still building \tilde x from x would be easier to understand.
And thanks also for answering the W2D3 T3 question. I get that if we are to model it as a linear system, it is only reasonable to estimate F as the identity. So does that mean we are modeling the eye movement as pure Gaussian noise, even though it has structure in it (say, movement between sites of attraction)?

aep · August 8, 2020, 8:11pm

My best understanding is that we model the hidden states as being the same as the measured data, and learn the variability in the data, which results in the states being a smoothed version of the observations.
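A one-dimensional sketch of what I mean, with F = 1 and H = 1. The noise variances and the simulated random walk are made up for illustration, and `kalman_filter_1d` is a hypothetical helper, not the tutorial's code (the tutorial works with 2D gaze positions and learns the parameters).

```python
import numpy as np

def kalman_filter_1d(obs, F=1.0, H=1.0, Q=0.01, R=1.0, mu0=0.0, P0=1.0):
    """Scalar Kalman filter; returns the filtered state mean at each time step."""
    mu, P = mu0, P0
    filtered = np.empty(len(obs))
    for t, y in enumerate(obs):
        # Predict: with F = 1 the predicted state is just the previous estimate
        mu_pred = F * mu
        P_pred = F * P * F + Q
        # Update: move towards the new observation by the Kalman gain K
        K = P_pred * H / (H * P_pred * H + R)
        mu = mu_pred + K * (y - H * mu_pred)
        P = (1 - K * H) * P_pred
        filtered[t] = mu
    return filtered

rng = np.random.default_rng(0)
# ASSUMED toy data: a slow random walk observed with large measurement noise
true_state = np.cumsum(rng.normal(0.0, 0.1, size=500))
obs = true_state + rng.normal(0.0, 1.0, size=500)

filtered = kalman_filter_1d(obs)

print("step-to-step variance, observations:", np.var(np.diff(obs)))
print("step-to-step variance, filtered:    ", np.var(np.diff(filtered)))
```

With F = 1 each estimate is the previous estimate nudged towards the new observation, which is why the output looks like a smoothed copy of the data. How strong the smoothing is depends on the ratio of Q to R.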
To model the structure, we would have to know what is special about the sites of attraction - perhaps a visual or semantic feature, not captured by position alone.
This is my best guess, but it could be wrong, in which case I would like to know the correct answer.

TQi · August 8, 2020, 8:41pm

Hi, it seems that my understanding is different from yours…mine is that we model the observed state as being the same as the measured data, and try to find out the hidden state. Hopefully there will be someone who can help us out!

aep · August 8, 2020, 8:58pm

You’re right, of course, what you describe is what we set out to do. I was trying to understand what we achieve by using this particular model.