W1D3 tutorial content discussion

amazing post! and that thing they say it’s not is PCA :wink:

It’s an application of Bayes’ rule; see the Wikipedia section “Relation to Bayesian inference”: https://en.wikipedia.org/wiki/Maximum_likelihood_estimation

argmax_theta p(theta | y, x) = argmax_theta [ p(y | x,theta) * p(theta) / p(y) ]

The denominator p(y) is independent of theta, so

argmax_theta p(theta | y, x) = argmax_theta [ p(y | x,theta) * p(theta) ]

If we assume p(theta) is uniform, then

argmax_theta p(theta | y, x) = argmax_theta p(y | x,theta)

(all these argmax’s are over theta; the maximised values differ, but the maximising theta is the same)
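
Here is a tiny numerical sketch of that last step (my own toy example, not notebook code; the linear model y = theta * x with Gaussian noise is just an assumption for illustration). With a flat prior, the unnormalised posterior and the likelihood peak at the same theta even though their values differ:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true, sigma = 1.5, 0.3
x = np.linspace(0, 1, 50)
y = theta_true * x + rng.normal(0, sigma, size=x.shape)   # y = theta * x + noise

theta_grid = np.linspace(0, 3, 301)
# log p(y | x, theta) for Gaussian noise, up to a theta-independent constant
log_lik = np.array([-0.5 * np.sum((y - t * x) ** 2) / sigma**2 for t in theta_grid])
log_prior = np.zeros_like(theta_grid)        # uniform prior over the grid
log_post = log_lik + log_prior               # posterior up to the constant log p(y)

print("MLE:", theta_grid[np.argmax(log_lik)])
print("MAP:", theta_grid[np.argmax(log_post)])   # same theta under a flat prior
```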

and @NingMei, to clarify your original question: it’s not that they are the same quantity, but rather that maximizing both of them with respect to theta gives the same optimal theta

but the notebook does set them equal to each other for some reason?

1 Like

W1D3-Tuto6 / Bonus exercise AIC:

  • Which variance are we considering here?
  • Any insight/demonstration on why the first and last terms are considered constants / end up being cancelled out?

Thanks!

1 Like

L is equal to p if we treat both as functions of x, y, and theta, i.e., L(theta | y,x) is the same function as p(y | x,theta). Of course, as you nicely explained, L(theta | y,x) is not equal to p(theta | y,x). This is why the MLE is in general different from the MAP estimator.

This is a nice derivation of how we end up maximising the likelihood, given the specific condition on the prior. But do you disagree that it is a matter of notation? They point to the same pdf, right? Or do you mean that only the optimal theta is the same after maximising both?

2 Likes

Yeah, sorry, I didn’t address that they also wrote L vs p; in this case I believe they are using them interchangeably.

No, they are not the same thing:
p(theta | y,x) = p(y | x,theta) * p(theta) / p(y)

but for maximization purposes (i.e., finding the best theta) they give the same answer

I was also really confused by the statement that the last term is a constant! But I think this is how it goes:

I believe here sigma^2 is also estimated via MLE, so sigma^2 = \sum [(y - y_tilde)^2] / N (we have it immediately in the next paragraph). If we plug this into the log-likelihood equation above, the two \sum [(y - y_tilde)^2] terms (in the numerator and denominator, respectively) cancel each other out, so the last term of the log-likelihood becomes a constant.

The first term is also not technically a constant, but since the number of data points N is the same across all models, and we only care about the relative difference, we can indeed drop it.
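
To spell out the cancellation (my own working, assuming the standard Gaussian log-likelihood that the tutorial maximises):

```latex
\begin{align*}
\log L(\theta,\sigma^2)
  &= -\frac{N}{2}\log(2\pi)
     -\frac{N}{2}\log\sigma^2
     -\frac{\sum_i (y_i-\tilde{y}_i)^2}{2\sigma^2} \\
\intertext{plugging in the MLE of the variance, $\hat{\sigma}^2 = \tfrac{1}{N}\sum_i (y_i-\tilde{y}_i)^2$, the last term collapses to $-N/2$:}
\log L(\theta,\hat{\sigma}^2)
  &= -\frac{N}{2}\log(2\pi)
     -\frac{N}{2}\log\hat{\sigma}^2
     -\frac{N}{2}
\end{align*}
```

The first and last terms depend only on N, which is the same for every model we compare, so only the -\frac{N}{2}\log\hat{\sigma}^2 term matters for the AIC comparison.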

1 Like

I think the order in which the variables appear inside the parentheses is the confusing bit.

L(θ|x,y), a likelihood function, is not equal to p(θ|x,y), a posterior. They are related by Bayes’ rule as you presented it, but I think that L(θ|x,y) is equal to, i.e., the very same pdf as, p(y|θ,x).

I am assuming people wanted to write L() with θ as the first argument, as this is the parameter that you would usually optimise for, as we did today, but I still find it a bit confusing.

1 Like

oooh that is the convention! I see here: https://en.wikipedia.org/wiki/Likelihood_function

1 Like

This question came up in my pod today, and this thread is very helpful for understanding! Is there any way I can link my students to this, or is it restricted to TAs?

1 Like

I don’t think there will be any objections to opening up this thread to students as well. @carsen.stringer Maybe Carsen can do it?

1 Like

I don’t see a way to :frowning: Maybe next time the discussion can happen in the main forum? Also, then maybe you’ll get help from the people who’ve actually done the content :smiley: Sorry I haven’t been more helpful.

2 Likes

perhaps @kevinwli, as the original poster, can change the category? :octopus:

1 Like

I changed the sub-category; you can link to it now :)

2 Likes

In tutorial 2 appendix, what is the stimulus likelihood function? How can x be decoded given y?

I found the answers here: http://pillowlab.princeton.edu/teaching/statneuro2018/slides/slides07_encodingmodels.pdf

For the T2 Exercise 1 plot, the likelihood image won’t follow changes you make to sigma in your likelihood function: the plotting code uses its own likelihood calculation with sigma=1. You can, however, set sigma in the plotting function.

We were confused about why the plots stayed the same with different sigma!
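
A self-contained sketch of the same idea (my own toy version, not the notebook’s plotting helper; the grids and the Gaussian encoding model are assumptions for illustration), where a single sigma variable drives both the likelihood computation and the plot, so the image actually changes when you change sigma:

```python
import numpy as np
import matplotlib.pyplot as plt

def gaussian_likelihood(y_grid, x_grid, sigma):
    """p(y | x) for a Gaussian encoding model, y ~ N(x, sigma^2)."""
    z = (y_grid[:, None] - x_grid[None, :]) / sigma
    return np.exp(-0.5 * z**2) / (sigma * np.sqrt(2 * np.pi))

sigma = 0.5                          # change this and the image follows
x_grid = np.linspace(-3, 3, 100)     # hypothesised stimulus values
y_grid = np.linspace(-3, 3, 100)     # measured responses

lik = gaussian_likelihood(y_grid, x_grid, sigma)   # shape (len(y_grid), len(x_grid))

plt.imshow(lik, origin="lower",
           extent=[x_grid[0], x_grid[-1], y_grid[0], y_grid[-1]])
plt.xlabel("x (stimulus)")
plt.ylabel("y (response)")
plt.title(f"p(y | x), sigma = {sigma}")
plt.colorbar()
plt.show()
```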

What are the Frequentist and Bayesian frameworks mentioned in the W1D3 outro video? Dr. Wei mentions that MLE comes from the frequentist framework. Maximum a posteriori (MAP) estimation is MLE with the addition of a prior probability term (it maximises the likelihood times the prior, p(data | params) * p(params)). Does this mean that the Frequentist framework is subsumed by the Bayesian framework? Are there good resources for understanding these “competing” frameworks?

In a frequentist framework, the unknown parameter (let’s call it theta) is treated as a fixed but unknown quantity. Since it’s fixed (deterministic), a statement such as “theta falls in the confidence interval 95% of the time” doesn’t make any sense, because theta is either in the interval or not. The probability in a frequentist framework comes from sampling: imagine we repeat the experiment a few times (e.g., draw samples from a Gaussian); the observations we get will be different each time, so the MLE estimate and confidence interval will also jump around.

In a Bayesian framework, theta is a random variable (that’s why we can assign a prior distribution to it!), and you can freely use expressions such as “the posterior distribution of theta” or “theta has a 95% probability of being within the interval…”.

I wouldn’t say one is subsumed by, or better/worse than, the other. There are contexts where one is more applicable (e.g., base rate neglect), though.
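
A tiny toy simulation of that distinction (my own sketch, assuming a Gaussian likelihood and a conjugate Gaussian prior, not tutorial code): the frequentist spread comes from repeating the experiment, whereas the Bayesian posterior describes uncertainty about theta given one fixed dataset.

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true, sigma, n = 2.0, 1.0, 20

# Frequentist view: theta is fixed; the randomness is in the data.
# Repeating the experiment makes the MLE (here, the sample mean) jump around.
mles = [rng.normal(theta_true, sigma, n).mean() for _ in range(1000)]
print("std of the MLE across repeated experiments:", np.std(mles))

# Bayesian view: one fixed dataset; the uncertainty is a distribution over theta.
# With a Gaussian prior N(mu0, tau0^2), the posterior over theta is also Gaussian.
y = rng.normal(theta_true, sigma, n)
mu0, tau0 = 0.0, 10.0
post_var = 1.0 / (1.0 / tau0**2 + n / sigma**2)
post_mean = post_var * (mu0 / tau0**2 + y.sum() / sigma**2)
print(f"posterior over theta: mean {post_mean:.3f}, std {np.sqrt(post_var):.3f}")
```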

Hope this answers your question (at least a little bit) :smiley:

1 Like

That was helpful! Thank you for clarifying.