General question about week 2

What is the relationship between vanilla HMM and the model described by Xaq in the tutorial about fish?
Also, how do these relate to RL agents?
Thank you!

The thread that connects Week 2 is states and actions.

Call Days 1 and 2 background.

Day 3 covered the estimation of state. In the first tutorial, you learned how to identify which of two states you were in, based on a series of observations. Next, we moved from inferring one fixed state to inferring a sequence of them: a discrete sequence (in HMMs) or a continuously varying state (the Kalman filter).
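As a concrete sketch of that first tutorial's idea, here is a minimal Bayesian update for deciding between two states that emit Gaussian observations with different means. All the numbers and the function name `posterior_two_states` are illustrative, not taken from the tutorial:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Gaussian likelihood of observation x under mean mu, std sigma."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def posterior_two_states(observations, mus=(-1.0, 1.0), sigma=1.0, prior=(0.5, 0.5)):
    """Bayes' rule, one observation at a time: multiply the prior by each
    observation's likelihood under each state, then renormalize.
    The fixed state never changes; only our belief about it does."""
    p = np.array(prior, dtype=float)
    for x in observations:
        likelihood = np.array([gaussian_pdf(x, mu, sigma) for mu in mus])
        p *= likelihood
        p /= p.sum()
    return p

# Observations clustered near +1 should favor the second state
print(posterior_two_states([0.9, 1.2, 0.8]))
```

With more observations, the posterior sharpens toward the true state; that is the whole "identify which state you are in" problem in miniature.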

In Day 4 T2, you learned how to control a system to find and stay in a target state. These systems need to be explicitly told which states are the goal.

In Day 5 (and Day 4 T1), you’ll learn about reinforcement learning. RL lets you avoid specifying the goal state ahead of time. Instead, the agent explores the environment and learns the value of states, so that it can choose the ones that yield the most reward.


@mrkrause thank you. So essentially, in the fish example, the states generated by the environment form an HMM, and the fisherman acts as an RL agent, is that right? However, in this case the decisions are guided by beliefs about the latent states, unlike regular RL agents, which update action values without inferring the underlying state. Is that correct?

Yes–but also no!

You’re absolutely correct that one way to solve a POMDP (fishing) would be to identify all of the states from the observations (i.e., fit an HMM) and then learn/choose the best actions given your inferred state sequence.

However, this turns out to be computationally painful, and you don’t actually have to do all of that work. In the fishing example, you might notice that we never totally commit to a state sequence. You maintain a continuous belief about the fish location, but you never make a hard decision that the fish were on the left, left, left, right, and then left side.
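To make this concrete, here is a minimal recursive belief update for a two-state fishing problem: the fish switch sides occasionally, and catching (or not catching) a fish while fishing on the left is evidence about where they are. All numbers here are made up for illustration; this is not the tutorial's exact model:

```python
import numpy as np

# Hypothetical two-state HMM: fish are on the Left (index 0) or Right
# (index 1) and switch sides with probability 0.1 per step.
T = np.array([[0.9, 0.1],
              [0.1, 0.9]])          # transition matrix P(next | current)
p_catch = np.array([0.7, 0.2])      # P(catch while fishing left | fish left/right)

def update_belief(belief, caught):
    """One forward-filter step: predict where the fish moved, then
    weight by the likelihood of what we observed. We never commit to
    a hard Left/Right sequence; the belief stays a probability."""
    belief = T.T @ belief                             # predict step
    likelihood = p_catch if caught else 1 - p_catch   # observation step
    belief = likelihood * belief
    return belief / belief.sum()

belief = np.array([0.5, 0.5])
for caught in [True, True, False, True]:
    belief = update_belief(belief, caught)
    print(belief)
```

Notice that even after a missed catch, the belief just shifts smoothly; no step ever declares "the fish were definitely on the right."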


@mrkrause so then why do we need to have any belief about the underlying state at all? Why use the threshold to figure out where fish probably are? Couldn’t the fisherman just choose the action based on the average expected reward from the two states (which then would be a Q-learning agent, right?)

In D2, we learned about Markov processes, which describe a dynamic environment that evolves through time. If we now think about an agent that interacts with the environment by taking actions and receiving rewards and punishments from it, we have a Markov Decision Process (MDP), where the state transition matrix also depends on the agent's actions, and each state-action pair is associated with some reward value.

If the goal of the agent is to choose actions that maximize its expected long-term cumulative reward, we can call this agent a Reinforcement Learning (RL) agent. I personally view this as the broader definition of RL, since we are not talking about any learning at all yet. As long as the goal of the agent is to maximize E[r_1 + \gamma r_2 + \gamma^2 r_3 + ... + \gamma^{t-1} r_t], we can call it an RL agent.
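As a quick numeric illustration of that objective, here is the discounted sum computed directly (the reward values and discount are arbitrary examples):

```python
def discounted_return(rewards, gamma):
    """Compute r_1 + gamma*r_2 + gamma^2*r_3 + ... for a reward sequence."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1 + 0.5 + 0.25 = 1.75
```

The discount gamma < 1 is what makes "long-term" well-defined: rewards far in the future count, but less.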

If you have complete knowledge about the environment (i.e., the state transition probabilities and the reward function), you can solve the RL problem through planning. This is what we did in D4T2: We know our system is a linear dynamical system and how it reacts to input, and we define our cost function in terms of state and action. Thus, we can directly solve the RL problem through dynamic programming, which gives us LQR. A more relatable example could be looking at a map and planning how you are going to get to your destination through a combination of bus, metro, and walking.
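For LQR, that dynamic programming solution is the backward Riccati recursion. Here is a minimal scalar sketch, with illustrative dynamics and costs rather than the tutorial's exact system:

```python
def lqr_gains(a, b, q, r, horizon):
    """Finite-horizon scalar LQR via backward dynamic programming.
    Dynamics: x_{t+1} = a*x + b*u.  Cost: sum of q*x^2 + r*u^2.
    Returns the feedback gains K_t for the policy u = -K_t * x."""
    P = q                                    # terminal cost-to-go
    gains = []
    for _ in range(horizon):
        K = (b * P * a) / (r + b * P * b)    # optimal gain at this step
        P = q + a * P * a - a * P * b * K    # Riccati update of cost-to-go
        gains.append(K)
    return gains[::-1]                       # reorder as t = 0 .. horizon-1

gains = lqr_gains(a=1.0, b=1.0, q=1.0, r=1.0, horizon=20)

# Simulate the closed loop u = -K*x from x0 = 1: the state decays to 0
x = 1.0
for K in gains:
    x = 1.0 * x + 1.0 * (-K * x)
print(abs(x) < 0.01)  # True: the controller drives the state to the goal
```

The point of the sketch: everything is computed offline from the known model, which is exactly what "solving by planning" means here.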

However, what if we don’t know what the environment looks like? In that case, we have to learn the optimal policy on the fly by interacting with the environment and learning from trial and error. This is the more typical setting of reinforcement learning (and you can see where the learning comes in), which is the topic for D5. The two broad classes of RL algorithms are model-free learning, which learns an action policy (or action values) directly, and model-based learning, which learns the MDP first and then plans based on the learned model.
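A minimal model-free example is tabular Q-learning on a made-up two-state, two-action world; the environment and all parameters below are hypothetical, purely to show the update rule:

```python
import random

random.seed(0)
n_states, n_actions = 2, 2
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, eps = 0.1, 0.9, 0.1       # learning rate, discount, exploration

def env_step(s, a):
    """Hypothetical environment: action 1 always pays reward 1 and
    leads to state 1; action 0 pays nothing and leads to state 0."""
    if a == 1:
        return 1, 1.0
    return 0, 0.0

s = 0
for _ in range(2000):
    # epsilon-greedy action selection: mostly exploit, sometimes explore
    if random.random() < eps:
        a = random.randrange(n_actions)
    else:
        a = max(range(n_actions), key=lambda act: Q[s][act])
    s_next, r = env_step(s, a)
    # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
    Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
    s = s_next

print(Q)
```

The agent never sees the transition matrix or the reward function; it discovers that action 1 is better purely from sampled experience, which is the "trial and error" part.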

Now you can probably see that the fishing agent and the LQG agent in D4 are both RL agents in the broader sense: they aim to maximize long-term expected reward. However, it’s slightly more complicated because the states are not directly observable; such problems are usually called Partially Observable MDPs (POMDPs). In D4’s tutorials, since we assume complete knowledge of the POMDP, we solve the RL problem by planning, with the additional step of Bayesian inference (maintaining a belief state). You can, of course, also apply one of the learning algorithms we talked about today.

Hope this helps!