When does the prior have the strongest influence over the posterior, and when is it weakest? What exactly are the answers to these questions, which the tutorial asks after Exercise 2A?
From my understanding, the influence of the prior is strongest when the sigma of the likelihood is large, i.e. when the measured information has high uncertainty. In those cases the posterior distribution mostly resembles the prior (our expectations), since the measurements are so uncertain/noisy that they have a negligible effect on the posterior.
Conversely, when the sigma of the likelihood function is small (we are very certain about the measured information), the prior has less influence.
Exactly - if you play around a bit with the widget, you should be able to make the following observations:
- if the sigmas of the prior (visual) and the likelihood (auditory) are equal, the posterior distribution sits exactly in the middle between them
- if the sigma of the prior is bigger than that of the likelihood, the posterior distribution is closer to the likelihood
- conversely, if the sigma of the prior is smaller than that of the likelihood, the posterior distribution is closer to the prior
The overall take-away here is that probability distributions (likelihood or prior) that are narrow (i.e. that encode a variable with little uncertainty) have a big impact on the posterior distribution.
Some examples below
Below, the prior and likelihood widths are the same (sigma_auditory = sigma_visual = 0.5).
Below, the prior is wider (sigma_visual = 1.0) than the likelihood (sigma_auditory = 0.5), and the posterior is much closer to the auditory distribution.
And lastly, the prior here is narrower (sigma_visual = 0.5) than the likelihood (sigma_auditory = 1.0), which results in the posterior sitting much closer to the prior.
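If you want to check these three cases without the widget, here is a minimal grid-based sketch. The means (prior at -1, likelihood at +1) and the grid range are my own hypothetical choices; only the sigmas come from the examples above.

```python
import numpy as np

def gaussian(x, mu, sigma):
    # Unnormalized Gaussian; the constant term drops out after normalization
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2)

def posterior_mean(mu_vis, sigma_vis, mu_aud, sigma_aud):
    """Multiply prior (visual) by likelihood (auditory) on a grid,
    normalize by summation, and return the posterior mean."""
    x = np.linspace(-8, 8, 1601)
    post = gaussian(x, mu_vis, sigma_vis) * gaussian(x, mu_aud, sigma_aud)
    post /= post.sum()                 # numerical normalization
    return float((x * post).sum())     # expected value under the posterior

# Hypothetical means: prior (visual) at -1, likelihood (auditory) at +1
print(posterior_mean(-1, 0.5, 1, 0.5))  # equal sigmas: mean sits at 0
print(posterior_mean(-1, 1.0, 1, 0.5))  # broad prior: pulled toward +1
print(posterior_mean(-1, 0.5, 1, 1.0))  # narrow prior: pulled toward -1
```

The narrower distribution always wins the tug-of-war, exactly as the screenshots show.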
That's a great answer!
Bayesian methods are often used to characterize beliefs, and thinking about it that way can be a good way to develop some intuition. The prior captures your… prior beliefs. The likelihood provides some sensory evidence, which you merge into the prior to produce the posterior. (Prior means before; posterior means behind, or after, seeing the evidence.)
For the localization example, we can think of the mean μ as "what we believe" and σ as inversely proportional to how sure we are in that belief. (In fact, some people use τ = 1/σ² instead, which they call "precision", but let's not dive down that particular rabbit hole just yet.) A distribution like N(0, 5) therefore suggests that you think the location is somewhere around 0, but you haven't located it precisely. Something like N(5, 0.01) means you're very sure it's located at 5.
So… when should your beliefs (= posterior) change a lot? Well, if you start with a vague suspicion that something is true (= large sigma on the prior) and someone gives you strong evidence that it is (= narrow sigma for the likelihood), your resulting belief should be strengthened (= posterior with narrow sigma). On the other hand, if you believe something strongly (= narrow prior) and someone presents you with unconvincing evidence (= broad likelihood), it shouldn't change your beliefs much, and so the posterior barely moves. Finally, suppose you are moderately convinced that the stimulus is at μ1, but see some equally convincing evidence (i.e., the same sigma) that it's actually at μ2. The rational thing to do would be to assume that both of your measurements are a bit imprecise, and it's actually in the middle.
In all three cases, thatâs how the math works out!
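To make "how the math works out" concrete: for a Gaussian prior N(μ_p, σ_p²) and a Gaussian likelihood N(μ_l, σ_l²), the standard conjugate result is

$$
\frac{1}{\sigma_{\text{post}}^2} = \frac{1}{\sigma_p^2} + \frac{1}{\sigma_l^2},
\qquad
\mu_{\text{post}} = \frac{\mu_p/\sigma_p^2 + \mu_l/\sigma_l^2}{1/\sigma_p^2 + 1/\sigma_l^2}.
$$

The posterior mean is a precision-weighted average, so the narrower (higher-precision) distribution pulls harder, and equal sigmas put it exactly in the middle; the posterior precision is the sum of the two precisions, which is why incorporating evidence always sharpens the belief at least a little.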
In our group, we were wondering why, in Section 1 / Exercise 1, the example code implemented a different formula from the Gaussian formula given directly above (specifically, it lacked the constant term) - is it just because that term is irrelevant when you normalise?
Yes. The coefficient would serve to normalize that posterior if you were integrating from -inf to +inf. However, in this case we are only looking at a certain range and are integrating by summation rather than analytically. If you include the coefficient and don't normalize, you might notice that the summed posterior is off by a factor of dx = 0.1, the spacing between x's in the original distribution. If you multiply that dx in when you calculate the posterior, the normalization is again done automatically.
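A quick numerical sketch of that point. The grid spacing mimics the tutorial's dx = 0.1; the mean, sigma, and range here are illustrative, not the tutorial's exact values.

```python
import numpy as np

x = np.linspace(-8, 8, 161)   # grid with spacing dx = 0.1, like the tutorial
dx = x[1] - x[0]
mu, sigma = 0.0, 1.0

# Full Gaussian pdf, including the 1/(sigma*sqrt(2*pi)) coefficient
pdf = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Summing pdf values only approximates the integral after multiplying by dx
print(pdf.sum())          # about 10, i.e. 1/dx
print((pdf * dx).sum())   # about 1 -- a proper probability mass

# Dividing by the sum makes both the coefficient and dx irrelevant
unnorm = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
print((unnorm / unnorm.sum()).sum())   # exactly 1 by construction
```

So you can either carry the coefficient and multiply by dx, or drop both and divide by the sum; the result is the same normalized posterior.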