W2D3 content discussion

Just starting the thread here.
I feel tomorrow’s tutorials (I am in zone 2) will be quite a tough :cookie:.

Nevertheless, I want to give a shout-out to @ch2880 and @bgalbraith for their great explanations of how Kalman filters work and how they are implemented :raised_hands:


W2D3 T3: why are the limegreen trajectories (and the scattered purple points) on the 3rd plot in the widget different from the ones in the plot below (salience, default sub1 im2)? Is it because of the data sample size? I can’t find where they reduce or increase the sample… (or maybe I am too tired after trying to get through the EM steps in T2)

If I understood correctly which plot you are referring to (the image with the statues and the baby): in the three plots of the widget, we show how well the Kalman filter smooths the visual data. It was trained on a given image for a given subject and tested on all three images and all four subjects. In the plot below, we try to use the Kalman filter as a kind of generative model (we want to generate data with it).

As I understood it, the idea is to show that the Kalman filter is good at smoothing but not very good at modelling visual saliency.


I see! Thanks heaps!

I’ve got another, more general question. What is the idea behind adding a latent state on top of the observations? In the example of firing neurons using an HMM, this is easier to understand because we assume the neurons have 3 states, and each state produces different spikes as our observations (aka state-produced data + noise). But in the eye-tracking example using the Kalman filter, can’t the observations be considered as pure “state + noise”? Or is it simply because this algorithm produces better filtering/smoothing in this case?

In this third tutorial we use the Kalman filter to smooth the observations of the latent variable we want to observe. There are, in a sense, three layers of data corresponding to the same information:

  1. The latent variable s_t, which is the true variable we want to observe. The latent variable itself is not observable (i.e., you cannot know the exact position of your arm while it moves), and the sensory information available (y_t) gives you only a noisy observation of it, because noise is added to the measurement.
  2. The observation of the latent variable (y_t), which is noisy as said before and depends on the H matrix. Through H, the observation also carries the system noise, which is the noise added to s.
  3. The output of the Kalman filter, which filters the observations by integrating the prior with them in order to get a better estimate of the latent variable.

In summary, we try to understand the system by looking at the latent state s_t. However, s_t is not directly observable, so we have to rely on the observation y_t (a proxy of the latent state s_t, influenced by H and noise) to compute the best estimate (the output of the Kalman filter) of the true latent variable.
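To make the first two layers concrete, here is a minimal sketch of a scalar linear-Gaussian model, assuming dynamics s_{t+1} = f * s_t + system noise and observations y_t = h * s_t + measurement noise. The function name `simulate_lds` and all parameter names and values are hypothetical and are not taken from the tutorial code.

```python
import random


def simulate_lds(n_steps, f=0.95, h=1.0, q_std=0.3, r_std=1.0,
                 mu_0=0.0, s0_std=1.0, seed=0):
    """Sample latent states s_t and noisy observations y_t (scalar toy model)."""
    rng = random.Random(seed)
    s = rng.gauss(mu_0, s0_std)  # initial latent state, drawn around mu_0
    states, observations = [], []
    for _ in range(n_steps):
        states.append(s)
        # Layer 2: what we actually measure, the latent state seen through h plus noise.
        observations.append(h * s + rng.gauss(0.0, r_std))
        # Layer 1: the latent state evolves with its own (system) noise.
        s = f * s + rng.gauss(0.0, q_std)
    return states, observations


states, observations = simulate_lds(5)
for s, y in zip(states, observations):
    print(f"s_t = {s:+.3f}   y_t = {y:+.3f}")
```

Setting `r_std=0.0` with `h=1.0` makes the observations identical to the latent states, which is the degenerate case where no filtering would be needed.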

Hmmm… so what you meant is, unlike in the example of HMM, in this case they are not interested in the system at all? What is the system s? And why did they assume there was such a system then? (The brain / central control system?)

But the one with eye tracking using Kalman filter, can’t they be considered as pure “state + noise”?

I think that’s a good way to think about it. In this case, our hidden state and the observed data live in the same “space”, but the observed data may have additional noise. In practice this would typically mean you constrain the hidden-to-observed transformation matrix to be the identity.
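As a rough illustration of the identity case, here is a sketch of a scalar Kalman filter with the observation matrix fixed to 1, so each observation is simply "state + noise". The function name `kalman_filter_1d`, the parameter names, and the toy values are all assumptions made for this example, not the tutorial's implementation.

```python
import random


def kalman_filter_1d(observations, f=1.0, q_var=0.01, r_var=1.0,
                     mu_0=0.0, p_0=1.0):
    """Scalar Kalman filter with the observation matrix fixed to H = 1."""
    mean, var = mu_0, p_0
    estimates = []
    for y in observations:
        # Predict: push the previous estimate through the dynamics.
        mean_pred = f * mean
        var_pred = f * f * var + q_var
        # Update: with H = 1 the gain is the relative trust in the observation.
        gain = var_pred / (var_pred + r_var)
        mean = mean_pred + gain * (y - mean_pred)
        var = (1.0 - gain) * var_pred
        estimates.append(mean)
    return estimates


rng = random.Random(1)
true_state = 2.0  # hypothetical constant hidden state
observations = [true_state + rng.gauss(0.0, 1.0) for _ in range(200)]
estimates = kalman_filter_1d(observations, q_var=0.0)  # toy case without system noise

raw_error = sum((y - true_state) ** 2 for y in observations) / len(observations)
filtered_error = sum((e - true_state) ** 2 for e in estimates) / len(estimates)
print(f"mean squared error: raw = {raw_error:.3f}, filtered = {filtered_error:.3f}")
```

Even though the state and the observations live in the same space, the filtered estimate ends up closer to the hidden state than the raw observations, because the filter accumulates evidence over time instead of trusting each sample on its own.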


but doesn’t that mean there may not be such an intermediate system, but only noise?

but doesn’t that mean there may not be such an intermediate system, but only noise?

I’m not 100% sure what you mean. You could generate noise and fit a KF to it and it could learn that there are no dynamics, just noise.

A few more questions/issues from TAs preparing for the tutorial, for reference:


(report potential bugs / errors here (ideally along with a solution))

Tutorial 1, section 1.1 equation (5) – mu_1 to be changed to mu_R

Tutorial 2, equation 1: the sum should be over i, not over j.

Clarifications / Inconsistencies:

(report when something needs clarification (+suggest a clarification))

Tutorial 1, Exercise 2: It would be nice to have a definition of ‘accuracy’ somewhere, given that this needs to be calculated in the exercise.

Tutorial 1, Exercise 4: Is there a reason why the average accuracy is plotted against the decision speed here, while it was plotted against the decision time in Exercise 2? I find it quite confusing. Moreover, I found that the way the decision speed is computed (as the inverse of decision time) makes it very hard to interpret.

Tutorial 3, exercise 1.

In the solutions, state[0] is set to mu_0, whereas in the equations it is sampled from a multivariate Gaussian N(mu_0, sigma_0). A solution consistent with the equations:

state[0] = stats.multivariate_normal(mean=params['mu_0'], cov=params['sigma_0']).rvs(1)
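The difference between the two initialisations can be sketched as follows. This example swaps in NumPy's `Generator.multivariate_normal` for the SciPy call above, and the parameter values are made up; only the dictionary keys `mu_0` and `sigma_0` come from the report.

```python
import numpy as np

# Hypothetical parameter values; only the keys 'mu_0' and 'sigma_0' come from the thread.
params = {
    "mu_0": np.array([0.0, 1.0]),
    "sigma_0": np.array([[0.5, 0.0],
                         [0.0, 0.2]]),
}

rng = np.random.default_rng(seed=0)

# What the solution reportedly does: start exactly at the mean.
state_fixed = params["mu_0"].copy()

# What the equation describes: draw the initial state from N(mu_0, sigma_0).
state_sampled = rng.multivariate_normal(params["mu_0"], params["sigma_0"])

# Over many draws, the sampled initial states average back to mu_0.
draws = rng.multivariate_normal(params["mu_0"], params["sigma_0"], size=20000)
print("fixed:  ", state_fixed)
print("sampled:", state_sampled)
print("mean of draws:", draws.mean(axis=0))
```

A single trajectory looks much the same either way, but only the sampled version reproduces the spread in starting points that sigma_0 describes.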

Tutorial 3, exercise 3.

Sigma and other variable names are inconsistent with the equations. It would be easier to follow with consistent notation, e.g. Sigma_hat = sigma_filt.


(report here when something is unclear and you don’t have a fix)

Tutorial 2, Exercise 1: without looking at the solution, I’m completely lost as to what should be done there. We need to explain to students what we want to achieve here!
This is especially true for the covariance matrix. It is never mentioned above, and it has an unintuitive shape.

Tutorial 2, Exercise 1: should it be model.covars_ = noise_level**2 instead of just noise_level? It does not really change the result, though.

For T2, I am extremely confused by the fact that we sometimes use A.T and sometimes A for the transition matrix. What is the reason for this inconsistency? (NOTE: I understand that the matrix can be constructed so that A_ij represents a switch from j to i, or vice versa, but I wonder why the notebook does not stick to one representation.)
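For reference, the two conventions can be sketched in a few lines. Which one the notebook uses where is not stated in this thread, so the example below only shows the general relationship; the helper names and the two-state chain are hypothetical.

```python
def propagate_columns(A, p):
    """Column convention: A[i][j] = P(next = i | current = j), so next = A @ p."""
    n = len(p)
    return [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]


def propagate_rows(A, p):
    """Row convention: A[j][i] = P(next = i | current = j), so next = A.T @ p."""
    n = len(p)
    return [sum(A[j][i] * p[j] for j in range(n)) for i in range(n)]


def transpose(A):
    return [list(row) for row in zip(*A)]


# Hypothetical two-state chain in the row convention (each row sums to 1).
A_rows = [[0.9, 0.1],
          [0.3, 0.7]]
A_cols = transpose(A_rows)  # the same chain in the column convention

p = [0.5, 0.5]  # current belief over the two states
print(propagate_rows(A_rows, p))
print(propagate_columns(A_cols, p))
```

Both calls describe the same one-step prediction. A transpose is needed exactly when the matrix is stored in one convention and the update is written in the other.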

In T1, section 1.1, it would be nice to show how equations (5) and (6) were derived and why that derivation is needed.


I derived equations (5) and (6) myself, but my results looked different from the two equations in the notebook.
I’m not sure whether equation (6) is missing a sigma_R.

I find the content is really confusing too. The video lectures didn’t help with understanding the materials


I was wondering about slide 54 of the intro lecture: From the definitions it looks like we’re counting switches from state j to state i. But the explanation says ‘joint time in states i & j’. Can the network be in some kind of superposition between two states?

In T2E2 students are asked to fill in the function markov_forward.
markov_forward is then used in the helper function simulate_prediction_only.

Aside from not being the best coding practice to teach the students (the helper function is defined using a function that doesn’t exist until much later), I think this will make things very confusing. Unless students look at the helper functions, which they generally don’t, they won’t know if or where their function is used and what for. This makes it harder to understand why markov_forward is defined to do what it does…


For Hidden Markov Models (HMMs), and especially the Viterbi Algorithm, I found Mathematicalmonk’s youtube channel to be incredibly clear.

Here’s a link to the first video on HMMs:


In the first section of tutorial 1, what is the intuition behind calling alpha “the error rate”?


For T1, would the drift diffusion model work with a Poisson distribution or any other type of distribution?

Why is the noise multiplied by the standard deviation rather than added in Tutorial 1 of W2D3?


Nm… I think it’s okay to interpret it as a fancier version of filtering that is based on the (potential) distribution of the data instead of the data itself (like other filters) :joy:

In which part of the tutorial is it exactly?