
CS234 Lecture Notes: Foundations of RL and MDPs (Lectures 1 & 2)

I have been working through Stanford's CS234 course on Reinforcement Learning, taught by Professor Emma Brunskill, as part of building a solid theoretical foundation in RL. These are my notes from the first two lectures. Lecture 1 covers the framing of RL and builds up to Markov Reward Processes. Lecture 2 introduces Markov Decision Processes and the two core planning algorithms: policy iteration and value iteration. I have included active recall questions after each section, which I have found useful for making the material actually stick rather than just reading passively.


Lecture 1: What is Reinforcement Learning?

The Core Idea

Reinforcement learning is about learning through experience to make good decisions under uncertainty. Unlike supervised learning, there is no labeled dataset. The agent learns by interacting with an environment and receiving reward signals.

Four properties define most RL problems:

Optimization. The agent has an explicit objective: find the best possible policy, not just a good enough one.

Delayed consequences. Actions taken now affect rewards far in the future. This creates two sub-challenges: during planning, you have to reason about long-term ramifications, not just immediate effects. During learning, the credit assignment problem asks which past action actually caused a good or bad outcome.

Exploration. The agent learns about the world by acting in it. You only observe the reward for the action you took. You never see the counterfactual.

Generalization. A policy is a mapping from past experience to action. Effective RL requires generalizing from a limited number of observations to unseen states.

Active Recall Questions

Q1. What are the four defining properties of most RL problems? Give a one-sentence description of each.

Show Answer

Optimization (find the best policy, not just any policy), delayed consequences (actions now affect future rewards, causing credit assignment problems), exploration (you only observe the reward for the action taken, never the counterfactual), and generalization (the policy must extend from seen to unseen states).

Q2. What is the credit assignment problem and why is it hard?

Show Answer

When a reward is received, it is unclear which earlier action caused it. In long sequences of decisions, many actions have already been taken before any reward arrives, so attributing the reward to the right action is non-trivial.

Q3. How is RL fundamentally different from supervised learning?

Show Answer

In supervised learning, labeled input-output pairs are provided. In RL, the agent must discover what actions lead to good outcomes through interaction. It receives a reward signal rather than a correct label, and it only observes the reward for the action it actually chose.


Sequential Decision Making

At each discrete timestep t, the agent-environment loop proceeds as follows:

  1. Agent takes action a_t
  2. World transitions, emits observation o_t and reward r_t
  3. Agent receives o_t and r_t, updates its history

The full history up to time t is:

h_t = (a_1, o_1, r_1, \ldots, a_t, o_t, r_t)

The agent selects its next action based on this history. In practice we compress the history into a state s_t = f(h_t).
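The loop above can be sketched in a few lines of Python. Everything here is a toy stand-in, not from the lecture: `policy` maps the history to an action, and `step_fn` plays the role of the world, returning an observation and a reward.

```python
def run_episode(policy, step_fn, horizon=5):
    """Roll out the agent-environment loop, accumulating the history h_t."""
    history = []  # h_t = (a_1, o_1, r_1, ..., a_t, o_t, r_t)
    for _ in range(horizon):
        a = policy(history)        # 1. agent picks an action from its history
        o, r = step_fn(a)          # 2. world emits observation and reward
        history += [a, o, r]       # 3. agent appends (a_t, o_t, r_t)
    return history

# Toy world: the observation echoes the action; only action "b" pays reward.
hist = run_episode(policy=lambda h: "b" if (len(h) // 3) % 2 else "a",
                   step_fn=lambda a: (a, 1.0 if a == "b" else 0.0))
```

The flat list is deliberate: it mirrors the tuple definition of h_t, three entries per timestep.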

Active Recall Questions

Q1. Write out the history h_t. What does it contain?

Show Answer

h_t = (a_1, o_1, r_1, \ldots, a_t, o_t, r_t). It is the full sequence of past actions, observations, and rewards up to and including timestep t.

Q2. Why do we compress history into a state rather than conditioning on the full history?

Show Answer

The full history grows without bound and becomes computationally intractable to condition on. Compressing it into a state s_t = f(h_t) makes the problem tractable, and under the Markov assumption no information is lost.


The Markov Assumption

A state s_t is Markov if and only if:

p(s_{t+1} \mid s_t, a_t) = p(s_{t+1} \mid h_t, a_t)

The future is independent of the past, given the present. Instead of conditioning on the entire history, you only need the current state. In practice, we often assume s_t = o_t (the most recent observation is a sufficient statistic). The state representation has big implications for computational complexity, data requirements, and resulting performance.

Active Recall Questions

Q1. State the Markov condition in words and in math.

Show Answer

The future is independent of the past given the present. Formally: p(s_{t+1} \mid s_t, a_t) = p(s_{t+1} \mid h_t, a_t).

Q2. Why is the Markov assumption popular even when it is not strictly satisfied?

Show Answer

It is simple, and it can often be satisfied approximately by including a small window of recent history in the state. It dramatically reduces the complexity of the problem and is a useful working assumption in practice.

Q3. What are the practical consequences of choosing a poor state representation?

Show Answer

A poor state representation increases computational complexity, requires more data to learn from, and can hurt the performance of the resulting policy. If relevant history is excluded, the Markov property is violated and the agent may make systematically suboptimal decisions.


Markov Process (Markov Chain)

A Markov Process is the simplest building block: a memoryless random process over states with no rewards and no actions. It is a tuple (\mathcal{S}, P) where:

  • \mathcal{S} is a finite set of states
  • P is a transition model: P(s_{t+1} = s' \mid s_t = s)

For N states, P is an N \times N matrix. Given a Markov chain, you can sample episodes: sequences of states drawn according to the transition probabilities.
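Sampling episodes takes only a few lines once the transition matrix is written down. The 3-state chain below is made up for illustration (state 2 is absorbing); only the sampling logic follows the definition above.

```python
import random

# Hypothetical 3-state chain; each row of P is a distribution over next states.
P = [[0.6, 0.4, 0.0],
     [0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0]]   # state 2 is absorbing

def sample_episode(P, s0, steps, rng):
    """Draw a state sequence, choosing each next state from row P[s]."""
    states = [s0]
    for _ in range(steps):
        s = states[-1]
        states.append(rng.choices(range(len(P)), weights=P[s])[0])
    return states

episode = sample_episode(P, s0=0, steps=10, rng=random.Random(0))
```

Every sampled transition has nonzero probability under P, whatever the seed.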

Active Recall Questions

Q1. What is a Markov Process? What does it have, and what does it lack compared to a full MDP?

Show Answer

A Markov Process (Markov Chain) is a tuple (\mathcal{S}, P): a set of states and a stochastic transition model. It has no actions and no rewards. It is the simplest building block on the way to an MDP.

Q2. If a Markov chain has N states, what is the shape of the transition matrix P? What does entry (i, j) represent?

Show Answer

P is N \times N. Entry (i, j) gives P(s_j \mid s_i), the probability of transitioning from state s_i to state s_j in one step.


Markov Reward Process (MRP)

A Markov Reward Process adds a reward signal to a Markov Chain. It is a tuple (\mathcal{S}, P, R, \gamma) where:

  • R(s) = \mathbb{E}[r_t \mid s_t = s] is the expected reward in state s
  • \gamma \in [0, 1] is the discount factor

There are still no actions.

The Return G_t. The foundational quantity in RL. It is the actual total discounted reward received from a specific trajectory starting at time t:

G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}

All value functions are simply the expected value of this return G_t.

State Value Function V(s). The expected return from starting in state s:

V(s) = \mathbb{E}[G_t \mid s_t = s]

Discount factor intuition:

  • \gamma = 0: agent only cares about immediate reward
  • \gamma \to 1: future rewards are weighted almost as heavily as immediate rewards
  • If the horizon H < \infty (finite episode), it is safe to use \gamma = 1
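The discount intuition above is easy to verify numerically. A minimal sketch with a made-up reward sequence:

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * r_{t+k+1} for a finite reward sequence."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

rewards = [1.0, 2.0, 3.0]
g_myopic = discounted_return(rewards, gamma=0.0)        # only the first reward: 1.0
g_half = discounted_return(rewards, gamma=0.5)          # 1 + 0.5*2 + 0.25*3 = 2.75
g_undiscounted = discounted_return(rewards, gamma=1.0)  # plain sum: 6.0
```
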
Active Recall Questions

Q1. Write the definition of the return G_t as a summation. What does it represent?

Show Answer

G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}. It is the total discounted reward collected from timestep t onwards along a specific trajectory.

Q2. What is the relationship between G_t and any value function?

Show Answer

Every value function is simply \mathbb{E}[G_t] conditioned on different information: V(s) = \mathbb{E}[G_t \mid s_t = s], Q(s, a) = \mathbb{E}[G_t \mid s_t = s, a_t = a], etc.

Q3. What does \gamma = 0 imply for agent behavior? What about \gamma \to 1?

Show Answer

\gamma = 0 means the agent is fully myopic and only maximizes the next immediate reward. \gamma \to 1 means future rewards are valued almost as much as present ones, so the agent takes a long-term view.

Q4. An MRP has no actions. So what is the point of studying it?

Show Answer

An MRP is the foundation for understanding value functions and the Bellman equation before introducing the complexity of actions. It is also exactly what you get when you fix a policy in an MDP (MDP + policy = MRP).


Two Views of the Value Function

The value function can be written in two mathematically equivalent ways.

Expectation form (global view). Defines value as the infinite sum of all future discounted rewards:

V^\pi(s) = \mathbb{E}_\pi \left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s \right]

Bellman form (local / recursive view). Decomposes value into an immediate reward plus the discounted expected value of the next state:

V(s) = \underbrace{R(s)}_{\text{immediate reward}} + \underbrace{\gamma \sum_{s' \in \mathcal{S}} P(s' \mid s) V(s')}_{\text{discounted future value}}

These two forms are equivalent. The Bellman form is what makes computation tractable.

Active Recall Questions

Q1. What are the two equivalent forms of the value function? Name them and write both equations.

Show Answer

The expectation form (global): V^\pi(s) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s\right]. The Bellman form (local/recursive): V(s) = R(s) + \gamma \sum_{s'} P(s' \mid s) V(s').

Q2. Why is the Bellman form more useful for computation?

Show Answer

The Bellman form is recursive: it expresses V(s) in terms of V(s') for neighboring states. This allows iterative algorithms that update one state at a time, converging to the correct values without summing over infinite trajectories.

Q3. Derive the Bellman form from the expectation form.

Show Answer

Split G_t = r_{t+1} + \gamma G_{t+1}. Then V(s) = \mathbb{E}[G_t \mid s_t = s] = \mathbb{E}[r_{t+1} + \gamma G_{t+1} \mid s_t = s] = R(s) + \gamma \sum_{s'} P(s' \mid s)\, \mathbb{E}[G_{t+1} \mid s_{t+1} = s'] = R(s) + \gamma \sum_{s'} P(s' \mid s) V(s').


Computing the Value of an MRP

Analytic solution. In matrix form, V = R + \gamma P V, which rearranges to:

V - \gamma P V = R \implies (I - \gamma P)V = R \implies V = (I - \gamma P)^{-1} R

This requires a matrix inverse: O(N^3) complexity. Only practical for small state spaces.

Iterative solution (dynamic programming). Initialize V_0(s) = 0 for all s. Then repeat until convergence:

V_k(s) = R(s) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s) V_{k-1}(s')

Each iteration costs O(|\mathcal{S}|^2), so the iterative approach is generally preferred for larger state spaces.
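Both solution methods can be compared side by side on a made-up 2-state MRP (the numbers are invented for illustration). NumPy's `linalg.solve` stands in for computing the inverse explicitly:

```python
import numpy as np

# Hypothetical 2-state MRP: state 1 is absorbing with zero reward.
R = np.array([1.0, 0.0])
P = np.array([[0.5, 0.5],
              [0.0, 1.0]])
gamma = 0.9

# Analytic: solve (I - gamma P) V = R directly (O(N^3), fine for tiny N).
V_exact = np.linalg.solve(np.eye(2) - gamma * P, R)

# Iterative: V_k = R + gamma P V_{k-1}, starting from V_0 = 0.
V = np.zeros(2)
for _ in range(1000):
    V_new = R + gamma * (P @ V)
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new
```

The two answers agree to numerical precision, which is a useful sanity check when implementing either method.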

Active Recall Questions

Q1. Derive the analytic solution V = (I - \gamma P)^{-1} R from the Bellman equation.

Show Answer

Start from V = R + \gamma P V. Rearrange: V - \gamma P V = R, so (I - \gamma P)V = R, giving V = (I - \gamma P)^{-1} R.

Q2. What is the computational complexity of the analytic solution versus the iterative one? When would you prefer each?

Show Answer

Analytic: O(N^3) due to matrix inversion. Iterative: O(|\mathcal{S}|^2) per iteration. Prefer the analytic solution for small, fixed state spaces. Prefer iterative for larger state spaces or when you only need an approximate solution.

Q3. What is the stopping criterion for the iterative algorithm?

Show Answer

Stop when \max_s |V_k(s) - V_{k-1}(s)| < \theta for some small threshold \theta > 0, meaning the value estimates have effectively stopped changing.


Agent Components

An RL agent may include any combination of:

  • Model: the agent's representation of how the world transitions (P) and what rewards look like (R)
  • Policy \pi: a mapping from states to actions. Can be deterministic, \pi(s) = a, or stochastic, \pi(a \mid s) = P(a_t = a \mid s_t = s)
  • Value function: quantifies the expected future return from a state or state-action pair

Model-based agents have an explicit model. Model-free agents learn a policy and/or value function directly without building one.

Active Recall Questions

Q1. What are the three components an RL agent may have? Does an agent need all three?

Show Answer

Model (P and R), policy (\pi), and value function (V or Q). No, an agent can operate with any subset of these. For example, a model-free agent has no model but may have both a policy and a value function.

Q2. What is the difference between a deterministic and a stochastic policy? Write both mathematically.

Show Answer

A deterministic policy maps each state to a single action: \pi(s) = a. A stochastic policy maps each state to a distribution over actions: \pi(a \mid s) = P(a_t = a \mid s_t = s).

Q3. What is the key distinction between a model-based and a model-free agent?

Show Answer

A model-based agent has an explicit representation of the transition dynamics P and reward function R, which it uses for planning. A model-free agent does not learn or use a model; it learns a policy and/or value function directly from experience.


Lecture 2: MDPs, Policy Iteration, and Value Iteration

Markov Decision Process (MDP)

An MDP is an MRP with actions. It is a tuple (\mathcal{S}, \mathcal{A}, P, R, \gamma) where:

  • \mathcal{A} is a finite set of actions
  • P(s_{t+1} = s' \mid s_t = s, a_t = a) is the transition model conditioned on the action
  • R(s, a) = \mathbb{E}[r_t \mid s_t = s, a_t = a] is the expected reward for taking action a in state s

MDP + policy = MRP. Given a policy \pi(a \mid s), the MDP reduces to an MRP with:

R^\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) R(s, a), \qquad P^\pi(s' \mid s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) P(s' \mid s, a)

This means all MRP techniques carry over directly for policy evaluation.
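The reduction is just two weighted sums. A minimal sketch on a made-up 2-state, 2-action MDP (all numbers and the helper name are hypothetical):

```python
# P_mdp[s][a] is a distribution over next states; R_mdp[s][a] a scalar reward.
P_mdp = [[[1.0, 0.0], [0.2, 0.8]],
         [[0.0, 1.0], [0.6, 0.4]]]
R_mdp = [[0.0, 1.0],
         [0.5, 0.0]]
pi = [[0.5, 0.5],   # stochastic policy pi(a|s)
      [1.0, 0.0]]

def induced_mrp(P_mdp, R_mdp, pi):
    """Average out the action: R^pi(s) = sum_a pi(a|s) R(s,a), same for P^pi."""
    n = len(P_mdp)
    m = len(pi[0])
    R_pi = [sum(pi[s][a] * R_mdp[s][a] for a in range(m)) for s in range(n)]
    P_pi = [[sum(pi[s][a] * P_mdp[s][a][s2] for a in range(m))
             for s2 in range(n)] for s in range(n)]
    return R_pi, P_pi

R_pi, P_pi = induced_mrp(P_mdp, R_mdp, pi)
```

The result (R_pi, P_pi) is an ordinary MRP, so the analytic or iterative evaluation from the previous section applies unchanged.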

Active Recall Questions

Q1. Write the full tuple definition of an MDP. What does each element represent?

Show Answer

(\mathcal{S}, \mathcal{A}, P, R, \gamma): state space, action space, transition dynamics P(s' \mid s, a), reward function R(s, a), discount factor \gamma.

Q2. How does fixing a policy \pi transform an MDP into an MRP? Write the induced R^\pi and P^\pi.

Show Answer

R^\pi(s) = \sum_a \pi(a \mid s) R(s, a) and P^\pi(s' \mid s) = \sum_a \pi(a \mid s) P(s' \mid s, a). The action is averaged out according to the policy, leaving a pure Markov chain with rewards.

Q3. Why is the MDP-to-MRP reduction useful?

Show Answer

It means that once we have a fixed policy, we can evaluate it using all the MRP machinery (Bellman equation, iterative DP, analytic solution), without needing new algorithms.


Value Functions in an MDP

With actions in the picture, there are two distinct value functions.

State-value function V^\pi(s). The expected return starting from state s and following policy \pi:

V^\pi(s) = \mathbb{E}_\pi [G_t \mid s_t = s]

Action-value function Q^\pi(s, a). The expected return starting from state s, taking action a, then following policy \pi thereafter:

Q^\pi(s, a) = \mathbb{E}_\pi [G_t \mid s_t = s, a_t = a]


Interdependence of V and Q

V^\pi and Q^\pi are intrinsically linked. Each can be written in terms of the other:

  • V^\pi(s) is defined via the policy and Q^\pi: V^\pi(s) = \sum_{a} \pi(a \mid s)\, Q^\pi(s, a). This is the expected Q-value, weighted by the policy's action probabilities.
  • Q^\pi(s, a) is defined via the dynamics and V^\pi: Q^\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma V^\pi(s')\right]. This is the immediate reward plus the discounted expected value of the next state.

Active Recall Questions

Q1. What is the difference between V^\pi(s) and Q^\pi(s, a)?

Show Answer

V^\pi(s) is the expected return from state s when following \pi from the very start. Q^\pi(s, a) is the expected return when you take a specific action a first, then follow \pi. V^\pi averages over actions; Q^\pi conditions on a specific action.

Q2. Express V^\pi(s) in terms of Q^\pi(s, a).

Show Answer

V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a). The state value is the policy-weighted average of the action values.

Q3. Express Q^\pi(s, a) in terms of V^\pi.

Show Answer

Q^\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma V^\pi(s')\right]. Take action a, collect the immediate reward, then add the discounted value of the resulting state.

Q4. Why is the V/Q interdependence important in practice?

Show Answer

It lets you move fluidly between the two representations. Policy improvement uses Q to identify better actions. Policy evaluation computes V. Being able to convert between them in both directions is essential for implementing both algorithms.


Bellman Equations for V^\pi and Q^\pi

Substituting the interdependence relations into each other gives the full Bellman expectation equations:

V^\pi(s) = \mathbb{E}_\pi \left[ r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s \right]

Q^\pi(s, a) = \mathbb{E}_\pi \left[ r_{t+1} + \gamma Q^\pi(s_{t+1}, a_{t+1}) \mid s_t = s, a_t = a \right]

Active Recall Questions

Q1. Write the Bellman expectation equation for V^\pi and explain each term.

Show Answer

V^\pi(s) = \mathbb{E}_\pi[r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s]. The first term r_{t+1} is the immediate reward. The second term \gamma V^\pi(s_{t+1}) is the discounted value of the state we land in, averaged over transitions and the policy.

Q2. What is the key structural insight that the Bellman equations encode?

Show Answer

The value of a state (or state-action pair) right now can be broken into two parts: what you get immediately, and what you expect to get from here onwards. This recursive structure is what makes dynamic programming possible.


Policy Evaluation (Iterative)

Initialize V_0^\pi(s) = 0 for all s, then apply the Bellman expectation backup repeatedly:

V_k^\pi(s) = \sum_a \pi(a \mid s) \left[ R(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, V_{k-1}^\pi(s') \right]

For a deterministic policy this simplifies to:

V_k^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s, \pi(s))\, V_{k-1}^\pi(s')

Continue until \Delta = \max_s |V_k(s) - V_{k-1}(s)| < \theta for some small threshold \theta.
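The backup above translates almost line for line into code. This is a tabular sketch on a made-up 2-state, 2-action MDP (all numbers invented):

```python
def policy_evaluation(P, R, pi, gamma, theta=1e-8):
    """Apply the Bellman expectation backup until the max change is < theta.

    P[s][a]: list of next-state probabilities; R[s][a]: expected reward;
    pi[s][a]: probability of action a in state s (tabular, toy-sized).
    """
    n = len(P)
    V = [0.0] * n
    while True:
        V_new = [sum(pi[s][a] * (R[s][a]
                                 + gamma * sum(P[s][a][s2] * V[s2]
                                               for s2 in range(n)))
                     for a in range(len(pi[s])))
                 for s in range(n)]
        delta = max(abs(x - y) for x, y in zip(V_new, V))
        V = V_new
        if delta < theta:
            return V

# Deterministic policy "always take action 0" on the toy MDP.
P = [[[0.0, 1.0], [1.0, 0.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
R = [[1.0, 0.0],
     [0.0, 2.0]]
pi = [[1.0, 0.0],
      [1.0, 0.0]]
V = policy_evaluation(P, R, pi, gamma=0.5)   # V = [1.0, 0.0]
```

Here action 0 from state 0 pays 1 and leads to state 1, where action 0 pays nothing forever, so V^\pi = [1, 0] by hand as well.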

Active Recall Questions

Q1. Write the iterative policy evaluation update. What does each term do?

Show Answer

V_k^\pi(s) = \sum_a \pi(a \mid s)\left[R(s, a) + \gamma \sum_{s'} p(s' \mid s, a) V_{k-1}^\pi(s')\right]. The outer sum averages over actions according to the policy. The inner bracket adds the immediate reward to the discounted estimated value of the next state.

Q2. How does the update simplify for a deterministic policy?

Show Answer

Since \pi picks a single action, the sum over a disappears: V_k^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s'} p(s' \mid s, \pi(s)) V_{k-1}^\pi(s').

Q3. What is policy evaluation computing, and why isn't it the same as finding the optimal policy?

Show Answer

Policy evaluation computes V^\pi, the value of a specific fixed policy \pi. It tells you how good \pi is, but it does not improve it. Finding the optimal policy requires additionally doing policy improvement (as in policy iteration) or directly optimizing (as in value iteration).


MDP Control

The goal of control is to find the optimal policy:

\pi^*(s) = \arg\max_\pi V^\pi(s)

Key facts about the optimal policy in an infinite-horizon MDP:

  • It is deterministic: randomness is never needed at optimality
  • It is stationary: the same action is prescribed for a given state regardless of the timestep
  • The optimal value function V^* is unique, but the optimal policy is not necessarily (ties are possible)

The number of deterministic policies is |\mathcal{A}|^{|\mathcal{S}|}, so exhaustive search is infeasible for all but the smallest problems.

Active Recall Questions

Q1. State three properties of the optimal policy in an infinite-horizon MDP.

Show Answer

It is deterministic, stationary (time-independent), and there exists a unique optimal value function V^* (though there may be multiple optimal policies achieving it).

Q2. Why is the optimal policy stationary in an infinite-horizon MDP but not in a finite-horizon one?

Show Answer

In the infinite-horizon case, the remaining time is always the same (infinite), so the same decision rule applies at every step. In the finite-horizon case, the number of steps remaining decreases over time, so the optimal action in a state depends on how much time is left.

Q3. Why is exhaustive policy search not feasible?

Show Answer

The number of deterministic policies is |\mathcal{A}|^{|\mathcal{S}|}, which grows exponentially in the state space. Even for modest problem sizes this is astronomically large.


Policy Iteration

Policy Iteration (PI) discovers the optimal policy by alternating between exact policy evaluation and greedy policy improvement.

Step 1: Initialization. Initialize V(s) \in \mathbb{R} and \pi(s) arbitrarily for all states. Set a small convergence threshold \theta > 0.

Step 2: Policy Evaluation (inner loop). Repeat until \Delta < \theta. Set \Delta \leftarrow 0; then for each state s:

v \leftarrow V(s), \quad V(s) \leftarrow \sum_{s', r} p(s', r \mid s, \pi(s))\left[r + \gamma V(s')\right], \quad \Delta \leftarrow \max(\Delta, |v - V(s)|)

Step 3: Policy Improvement. For each state s:

\pi(s) \leftarrow \arg\max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma V(s')\right]

If the policy changed for any state, return to Step 2. Otherwise, stop and return V^* and \pi^*.
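The three steps above can be sketched end to end on a toy tabular MDP (all numbers and helper names invented). Evaluation runs to convergence, then improvement greedily reassigns actions:

```python
def q_value(P, R, V, s, a, gamma):
    """One-step lookahead: Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')."""
    return R[s][a] + gamma * sum(p * V[s2] for s2, p in enumerate(P[s][a]))

def policy_iteration(P, R, gamma, theta=1e-8):
    n, m = len(P), len(P[0])
    pi = [0] * n                      # arbitrary deterministic initial policy
    while True:
        # Step 2: policy evaluation (inner loop)
        V = [0.0] * n
        while True:
            V_new = [q_value(P, R, V, s, pi[s], gamma) for s in range(n)]
            delta = max(abs(x - y) for x, y in zip(V_new, V))
            V = V_new
            if delta < theta:
                break
        # Step 3: policy improvement (greedy w.r.t. Q)
        new_pi = [max(range(m), key=lambda a: q_value(P, R, V, s, a, gamma))
                  for s in range(n)]
        if new_pi == pi:
            return pi, V
        pi = new_pi

# Made-up 2-state, 2-action MDP: action 1 in state 1 pays 2 and returns to state 0.
P = [[[0.0, 1.0], [1.0, 0.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
R = [[1.0, 0.0],
     [0.0, 2.0]]
pi_star, V_star = policy_iteration(P, R, gamma=0.5)
```

On this example the loop terminates after one improvement, settling on the cycle that alternates rewards 1 and 2.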


The Policy Improvement Theorem

Theorem. Given two deterministic policies \pi and \pi', if for all s \in \mathcal{S}:

Q^\pi(s, \pi'(s)) \geq V^\pi(s)

then \pi' is globally at least as good:

V^{\pi'}(s) \geq V^\pi(s) \quad \forall s \in \mathcal{S}

Proof by unrolling the Bellman equation:

  1. Assumption: V^\pi(s) \leq Q^\pi(s, \pi'(s))

  2. Expand Q^\pi one step:

V^\pi(s) \leq \mathbb{E}_{\pi'}\left[r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s\right]

  3. Apply the inequality again at s_{t+1}:

V^\pi(s) \leq \mathbb{E}_{\pi'}\left[r_{t+1} + \gamma r_{t+2} + \gamma^2 V^\pi(s_{t+2}) \mid s_t = s\right]

  4. Continue unrolling for n steps:

V^\pi(s) \leq \mathbb{E}_{\pi'}\left[r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V^\pi(s_{t+n}) \mid s_t = s\right]

  5. Take n \to \infty. Because \gamma < 1, the remainder \gamma^n V^\pi(s_{t+n}) \to 0:

V^\pi(s) \leq \mathbb{E}_{\pi'}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s\right] = V^{\pi'}(s)

\therefore V^{\pi'}(s) \geq V^\pi(s) \quad \blacksquare

Convergence of PI. If the policy does not change after an improvement step, it can never change again (since Q^{\pi_{i+1}} = Q^{\pi_i} implies \pi_{i+2} = \pi_{i+1}). Because there are at most |\mathcal{A}|^{|\mathcal{S}|} distinct policies and every policy change strictly improves the value, policy iteration always terminates.

Active Recall Questions

Q1. Describe the two phases of policy iteration in plain words.

Show Answer

First, policy evaluation: run the Bellman expectation update iteratively until V^\pi converges for the current policy. Second, policy improvement: for each state, greedily pick the action that maximizes Q^\pi(s, a) under the current value estimates. Repeat until the policy stops changing.

Q2. State the Policy Improvement Theorem formally.

Show Answer

If Q^\pi(s, \pi'(s)) \geq V^\pi(s) for all s \in \mathcal{S}, then V^{\pi'}(s) \geq V^\pi(s) for all s \in \mathcal{S}.

Q3. What is the key step in the proof? Why does the \gamma^n term vanish?

Show Answer

The key step is repeatedly substituting the inequality V^\pi(s) \leq Q^\pi(s, \pi'(s)) at each successive state, unrolling the Bellman equation. The \gamma^n V^\pi(s_{t+n}) term vanishes as n \to \infty because \gamma < 1 causes it to decay to zero exponentially.

Q4. Why does policy iteration always converge in finite time?

Show Answer

Each iteration either strictly improves the policy or leaves it unchanged. Once unchanged, it can never change again. Since the total number of distinct deterministic policies is finite (|\mathcal{A}|^{|\mathcal{S}|}) and the algorithm never revisits a suboptimal policy, it must terminate.

Q5. If policy iteration stops because the policy did not change, what can you conclude?

Show Answer

The current policy is optimal. The Bellman optimality condition holds: \pi(s) = \arg\max_a Q^\pi(s, a) for all s, which means V^\pi = V^*.


Value Iteration

Value Iteration (VI) merges evaluation and improvement into a single step by applying the Bellman optimality operator B directly at each iteration.

Bellman Optimality Equations. These define the maximum expected return achievable by acting optimally:

V^*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma V^*(s')\right]

Q^*(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma \max_{a'} Q^*(s', a')\right]

Algorithm:

  1. Initialize V_0(s) = 0 for all states. Set threshold \theta > 0.
  2. For each state s, apply the Bellman optimality backup:

V_{k+1}(s) = \max_{a \in \mathcal{A}} \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma V_k(s')\right]

  3. Compute \Delta = \max_s |V_{k+1}(s) - V_k(s)|. If \Delta < \theta, stop; otherwise return to step 2.
  4. Once converged, extract the optimal policy greedily:

\pi^*(s) = \arg\max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma V^*(s')\right]

Equivalently, in operator notation: V_{k+1} = B V_k.
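The algorithm fits in one function. A sketch on a made-up 2-state, 2-action tabular MDP (same toy style as before, numbers invented); note there is no inner evaluation loop, because the max over actions happens inside every backup:

```python
def value_iteration(P, R, gamma, theta=1e-8):
    """Bellman optimality backup V_{k+1}(s) = max_a [R(s,a) + gamma E V_k]."""
    n, m = len(P), len(P[0])
    V = [0.0] * n
    while True:
        V_new = [max(R[s][a] + gamma * sum(p * V[s2]
                                           for s2, p in enumerate(P[s][a]))
                     for a in range(m))
                 for s in range(n)]
        converged = max(abs(x - y) for x, y in zip(V_new, V)) < theta
        V = V_new
        if converged:
            break
    # Greedy policy extraction from the converged values
    pi = [max(range(m),
              key=lambda a: R[s][a] + gamma * sum(p * V[s2]
                                                  for s2, p in enumerate(P[s][a])))
          for s in range(n)]
    return V, pi

P = [[[0.0, 1.0], [1.0, 0.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
R = [[1.0, 0.0],
     [0.0, 2.0]]
V_star, pi_star = value_iteration(P, R, gamma=0.5)
```

On this MDP, value iteration reaches the same V^* and greedy policy as policy iteration, as the theory predicts.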


Convergence of Value Iteration (Contraction Mapping)

VI converges because B is a \gamma-contraction mapping under the infinity norm \|V\|_\infty = \max_s |V(s)|.

Proof. For any two value functions V and U:

|(BV)(s) - (BU)(s)| = \left|\max_a \mathbb{E}\left[r + \gamma V(s')\right] - \max_a \mathbb{E}\left[r + \gamma U(s')\right]\right|

Apply |\max_a f(a) - \max_a g(a)| \leq \max_a |f(a) - g(a)|:

\leq \max_a \left|\mathbb{E}\left[r + \gamma V(s')\right] - \mathbb{E}\left[r + \gamma U(s')\right]\right|

Cancel the reward r (identical in both):

= \max_a \left|\gamma\, \mathbb{E}\left[V(s') - U(s')\right]\right|

Bound by the worst-case difference across all states:

\leq \gamma \max_{s'} |V(s') - U(s')| = \gamma \|V - U\|_\infty

Since this holds for all s:

\|BV - BU\|_\infty \leq \gamma \|V - U\|_\infty

Because \gamma < 1, the distance to V^* shrinks by a factor of \gamma at every iteration. Value iteration converges to the unique optimal value function V^*. \blacksquare

Active Recall Questions

Q1. Write the Bellman optimality equation for V^*. How does it differ from the Bellman expectation equation?

Show Answer

V^*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma V^*(s')\right]. The difference is the \max_a instead of \sum_a \pi(a \mid s). Rather than averaging over the policy's action distribution, we take the best possible action.

Q2. What is a contraction mapping, and why does it guarantee convergence?

Show Answer

An operator B is a contraction if \|BV - BU\| \leq \gamma \|V - U\| with \gamma < 1. It shrinks the distance between any two inputs by a fixed factor each time it is applied. By the Banach fixed-point theorem, repeated application converges to a unique fixed point.

Q3. Sketch the key steps of the contraction proof for the Bellman operator.

Show Answer

(1) Expand \|BV - BU\| using the definition of B. (2) Apply |\max f - \max g| \leq \max|f - g| to move the max inside. (3) Cancel the reward r (same in both). (4) Factor out \gamma and bound by \gamma \|V - U\|_\infty.

Q4. Does the initialization of V_0 in value iteration affect what it converges to?

Show Answer

No. Because B is a contraction, any initialization converges to the unique fixed point V^*. Different initializations may take different numbers of iterations, but the final answer is the same.


Finite Horizon Value Iteration

For a finite horizon H, the algorithm runs for exactly H steps rather than until convergence:

  • V_k: optimal value with k decisions remaining
  • \pi_k: optimal policy with k decisions remaining
  • Initialize V_0(s) = 0, iterate k = 1 to H

The optimal policy in a finite-horizon problem is non-stationary: \pi_k depends on how many steps remain, not just the current state. This is a fundamental difference from the infinite-horizon case.
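A sketch that makes the non-stationarity concrete. In the made-up MDP below (all numbers invented), action 1 in state 0 pays 3 once but strands the agent in a zero-reward absorbing state, so it is only worth taking on the very last step:

```python
def finite_horizon_vi(P, R, H):
    """V[k][s]: best return with k decisions left; pi[k][s]: matching action.

    No discounting here (gamma = 1 is safe with a finite horizon).
    """
    n, m = len(P), len(P[0])
    V = [[0.0] * n]                  # V_0: no decisions left, value 0
    pi = [None]
    for k in range(1, H + 1):
        q = [[R[s][a] + sum(p * V[k - 1][s2] for s2, p in enumerate(P[s][a]))
              for a in range(m)] for s in range(n)]
        V.append([max(q[s]) for s in range(n)])
        pi.append([q[s].index(max(q[s])) for s in range(n)])
    return V, pi

# State 0: action 0 pays 1 and stays; action 1 pays 3 but leads to state 1.
# State 1: absorbing, zero reward under both actions.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [0.0, 1.0]]]
R = [[1.0, 3.0],
     [0.0, 0.0]]
V, pi = finite_horizon_vi(P, R, H=3)
```

With one decision left the agent grabs the one-off reward of 3, but with two or three decisions left it keeps collecting 1 per step instead: the same state gets different optimal actions at different depths.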

Active Recall Questions

Q1. What does V_k(s) represent in finite-horizon value iteration?

Show Answer

It is the optimal expected return from state s when exactly k decisions remain.

Q2. Why is the optimal policy non-stationary in the finite-horizon case?

Show Answer

With a finite horizon, the right action depends on how many steps are left. Near the end of an episode, the agent may prefer to exploit immediately; earlier in the episode, it may be worth taking actions that sacrifice short-term reward for longer-term gain. The remaining time is part of the decision context.


Policy Iteration vs Value Iteration

  • Focus. PI: improving a specific policy \pi. VI: finding the optimal value V^* directly.
  • Update rule. PI: Bellman expectation backup, then \arg\max_a Q. VI: Bellman optimality backup (\max_a) at every step.
  • Phases. PI: two distinct steps, evaluate then improve. VI: one merged step using the \max operator.
  • Convergence. PI: monotonically improving; terminates in at most |\mathcal{A}|^{|\mathcal{S}|} iterations. VI: contraction mapping; converges geometrically to V^*.
  • Cost per step. PI: higher (a full policy evaluation per iteration). VI: lower (one Bellman backup per state).

Active Recall Questions

Q1. What is the fundamental difference in the update rule between PI and VI?

Show Answer

PI applies the Bellman expectation operator (a weighted sum over \pi's action distribution) for evaluation, then a separate \arg\max for improvement. VI applies the Bellman optimality operator (\max_a) in a single step, merging both.

Q2. Which algorithm typically requires fewer outer iterations and why?

Show Answer

Policy iteration, because it fully evaluates the current policy before improving it. Each improvement step makes a globally informed update. Value iteration mixes partial evaluation and improvement, converging more slowly in terms of outer iterations but with cheaper per-step cost.

Q3. Both algorithms converge to \pi^*. What property of policy iteration makes this obvious? What property of value iteration makes this obvious?

Show Answer

For PI: monotonic improvement guarantees we never go backwards, and termination is finite because the policy space is finite. For VI: the Bellman operator is a contraction, so V_k \to V^* and the greedy policy extracted from V^* is optimal.


I will keep adding notes as I work through the rest of the course, along with my solutions to each assignment. More to come.