
CS234 Lecture Notes: Foundations of RL and MDPs (Lectures 1 & 2)

I have been working through Stanford's CS234 course on Reinforcement Learning, taught by Professor Emma Brunskill, as part of building a solid theoretical foundation in RL. These are my notes from the first two lectures. Lecture 1 covers the framing of RL and builds up to Markov Reward Processes. Lecture 2 introduces Markov Decision Processes and the two core planning algorithms: policy iteration and value iteration. I have included active recall questions after each section, which I have found useful for making the material actually stick rather than just reading passively.


Lecture 1: What is Reinforcement Learning?

The Core Idea

Reinforcement learning is about learning through experience to make good decisions under uncertainty. Unlike supervised learning, there is no labeled dataset. The agent learns by interacting with an environment and receiving reward signals.

Four properties define most RL problems:

Optimization. The agent has an explicit objective: find the best possible policy, not just a good enough one.

Delayed consequences. Actions taken now affect rewards far in the future. This creates two sub-challenges: during planning, you have to reason about long-term ramifications, not just immediate effects. During learning, the credit assignment problem asks which past action actually caused a good or bad outcome.

Exploration. The agent learns about the world by acting in it. You only observe the reward for the action you took. You never see the counterfactual.

Generalization. A policy is a mapping from past experience to action. Effective RL requires generalizing from a limited number of observations to unseen states.

Active Recall Questions

Q1. What are the four defining properties of most RL problems? Give a one-sentence description of each.

Show Answer

Optimization (find the best policy, not just any policy), delayed consequences (actions now affect future rewards, causing credit assignment problems), exploration (you only observe the reward for the action taken, never the counterfactual), and generalization (the policy must extend from seen to unseen states).

Q2. What is the credit assignment problem and why is it hard?

Show Answer

When a reward is received, it is unclear which earlier action caused it. In long sequences of decisions, many actions have already been taken before any reward arrives, so attributing the reward to the right action is non-trivial.

Q3. How is RL fundamentally different from supervised learning?

Show Answer

In supervised learning, labeled input-output pairs are provided. In RL, the agent must discover what actions lead to good outcomes through interaction. It receives a reward signal rather than a correct label, and it only observes the reward for the action it actually chose.


Sequential Decision Making

At each discrete timestep t, the agent-environment loop proceeds as follows:

  1. Agent takes action a_t
  2. World transitions, emits observation o_t and reward r_t
  3. Agent receives o_t and r_t, updates its history

The full history up to time t is:

h_t = (a_1, o_1, r_1, \ldots, a_t, o_t, r_t)

The agent selects its next action based on this history. In practice we compress the history into a state s_t = f(h_t).
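The loop above can be sketched in a few lines of Python. Everything here is a toy stand-in, not from the lecture: `policy` maps the history to an action, and `step_fn` plays the role of the world, returning an observation and a reward.

```python
def run_episode(policy, step_fn, horizon=5):
    """Roll out the agent-environment loop, accumulating the history h_t."""
    history = []  # h_t = (a_1, o_1, r_1, ..., a_t, o_t, r_t)
    for _ in range(horizon):
        a = policy(history)        # 1. agent picks an action from its history
        o, r = step_fn(a)          # 2. world emits observation and reward
        history += [a, o, r]       # 3. agent appends (a_t, o_t, r_t)
    return history

# Toy world: the observation echoes the action; only action "b" pays reward.
hist = run_episode(policy=lambda h: "b" if (len(h) // 3) % 2 else "a",
                   step_fn=lambda a: (a, 1.0 if a == "b" else 0.0))
```

The flat list is deliberate: it mirrors the tuple definition of h_t, three entries per timestep.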

Active Recall Questions

Q1. Write out the history h_t. What does it contain?

Show Answer

h_t = (a_1, o_1, r_1, \ldots, a_t, o_t, r_t). It is the full sequence of past actions, observations, and rewards up to and including timestep t.

Q2. Why do we compress history into a state rather than conditioning on the full history?

Show Answer

The full history grows without bound and becomes computationally intractable to condition on. Compressing it into a state s_t = f(h_t) makes the problem tractable, and under the Markov assumption no information is lost.


The Markov Assumption

A state s_t is Markov if and only if:

p(s_{t+1} \mid s_t, a_t) = p(s_{t+1} \mid h_t, a_t)

The future is independent of the past, given the present. Instead of conditioning on the entire history, you only need the current state. In practice, we often assume s_t = o_t (the most recent observation is a sufficient statistic). The state representation has big implications for computational complexity, data requirements, and resulting performance.

Active Recall Questions

Q1. State the Markov condition in words and in math.

Show Answer

The future is independent of the past given the present. Formally: p(s_{t+1} \mid s_t, a_t) = p(s_{t+1} \mid h_t, a_t).

Q2. Why is the Markov assumption popular even when it is not strictly satisfied?

Show Answer

It is simple, and it can often be satisfied approximately by including a small window of recent history in the state. It dramatically reduces the complexity of the problem and is a useful working assumption in practice.

Q3. What are the practical consequences of choosing a poor state representation?

Show Answer

A poor state representation increases computational complexity, requires more data to learn from, and can hurt the performance of the resulting policy. If relevant history is excluded, the Markov property is violated and the agent may make systematically suboptimal decisions.


Markov Process (Markov Chain)

A Markov Process is the simplest building block: a memoryless random process over states with no rewards and no actions. It is a tuple (\mathcal{S}, P) where:

  • \mathcal{S} is a finite set of states
  • P is a transition model: P(s_{t+1} = s' \mid s_t = s)

For N states, P is an N \times N matrix. Given a Markov chain, you can sample episodes: sequences of states drawn according to the transition probabilities.
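Sampling episodes takes only a few lines once the transition matrix is written down. The 3-state chain below is made up for illustration (state 2 is absorbing); only the sampling logic follows the definition above.

```python
import random

# Hypothetical 3-state chain; each row of P is a distribution over next states.
P = [[0.6, 0.4, 0.0],
     [0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0]]   # state 2 is absorbing

def sample_episode(P, s0, steps, rng):
    """Draw a state sequence, choosing each next state from row P[s]."""
    states = [s0]
    for _ in range(steps):
        s = states[-1]
        states.append(rng.choices(range(len(P)), weights=P[s])[0])
    return states

episode = sample_episode(P, s0=0, steps=10, rng=random.Random(0))
```

Every sampled transition has nonzero probability under P, whatever the seed.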

Active Recall Questions

Q1. What is a Markov Process? What does it have, and what does it lack compared to a full MDP?

Show Answer

A Markov Process (Markov Chain) is a tuple (\mathcal{S}, P): a set of states and a stochastic transition model. It has no actions and no rewards. It is the simplest building block on the way to an MDP.

Q2. If a Markov chain has N states, what is the shape of the transition matrix P? What does entry (i, j) represent?

Show Answer

P is N \times N. Entry (i, j) gives P(s_j \mid s_i), the probability of transitioning from state s_i to state s_j in one step.


Markov Reward Process (MRP)

A Markov Reward Process adds a reward signal to a Markov Chain. It is a tuple (\mathcal{S}, P, R, \gamma) where:

  • R(s) = \mathbb{E}[r_t \mid s_t = s] is the expected reward in state s
  • \gamma \in [0, 1] is the discount factor

There are still no actions.

The Return G_t. The foundational quantity in RL. It is the actual total discounted reward received from a specific trajectory starting at time t:

G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}

All value functions are simply the expected value of this return G_t.

State Value Function V(s). The expected return from starting in state s:

V(s) = \mathbb{E}[G_t \mid s_t = s]

Discount factor intuition:

  • \gamma = 0: agent only cares about immediate reward
  • \gamma \to 1: future rewards are weighted almost as heavily as immediate rewards
  • If the horizon H < \infty (finite episode), it is safe to use \gamma = 1
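The discount intuition above is easy to verify numerically. A minimal sketch with a made-up reward sequence:

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * r_{t+k+1} for a finite reward sequence."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

rewards = [1.0, 2.0, 3.0]
g_myopic = discounted_return(rewards, gamma=0.0)        # only the first reward: 1.0
g_half = discounted_return(rewards, gamma=0.5)          # 1 + 0.5*2 + 0.25*3 = 2.75
g_undiscounted = discounted_return(rewards, gamma=1.0)  # plain sum: 6.0
```
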
Active Recall Questions

Q1. Write the definition of the return G_t as a summation. What does it represent?

Show Answer

G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}. It is the total discounted reward collected from timestep t onwards along a specific trajectory.

Q2. What is the relationship between G_t and any value function?

Show Answer

Every value function is simply \mathbb{E}[G_t] conditioned on different information: V(s) = \mathbb{E}[G_t \mid s_t = s], Q(s, a) = \mathbb{E}[G_t \mid s_t = s, a_t = a], etc.

Q3. What does \gamma = 0 imply for agent behavior? What about \gamma \to 1?

Show Answer

\gamma = 0 means the agent is fully myopic and only maximizes the next immediate reward. \gamma \to 1 means future rewards are valued almost as much as present ones, so the agent takes a long-term view.

Q4. An MRP has no actions. So what is the point of studying it?

Show Answer

An MRP is the foundation for understanding value functions and the Bellman equation before introducing the complexity of actions. It is also exactly what you get when you fix a policy in an MDP (MDP + policy = MRP).


Two Views of the Value Function

The value function can be written in two mathematically equivalent ways.

Expectation form (global view). Defines value as the infinite sum of all future discounted rewards:

V^\pi(s) = \mathbb{E}_\pi \left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s \right]

Bellman form (local / recursive view). Decomposes value into an immediate reward plus the discounted expected value of the next state:

V(s) = \underbrace{R(s)}_{\text{immediate reward}} + \underbrace{\gamma \sum_{s' \in \mathcal{S}} P(s' \mid s) V(s')}_{\text{discounted future value}}

These two forms are equivalent. The Bellman form is what makes computation tractable.

Active Recall Questions

Q1. What are the two equivalent forms of the value function? Name them and write both equations.

Show Answer

The expectation form (global): V^\pi(s) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s\right]. The Bellman form (local/recursive): V(s) = R(s) + \gamma \sum_{s'} P(s' \mid s) V(s').

Q2. Why is the Bellman form more useful for computation?

Show Answer

The Bellman form is recursive: it expresses V(s) in terms of V(s') for neighboring states. This allows iterative algorithms that update one state at a time, converging to the correct values without summing over infinite trajectories.

Q3. Derive the Bellman form from the expectation form.

Show Answer

Split G_t = r_{t+1} + \gamma G_{t+1}. Then V(s) = \mathbb{E}[G_t \mid s_t = s] = \mathbb{E}[r_{t+1} + \gamma G_{t+1} \mid s_t = s] = R(s) + \gamma \sum_{s'} P(s' \mid s)\, \mathbb{E}[G_{t+1} \mid s_{t+1} = s'] = R(s) + \gamma \sum_{s'} P(s' \mid s) V(s').


Computing the Value of an MRP

Analytic solution. In matrix form, V = R + \gamma P V, which rearranges to:

V - \gamma P V = R \implies (I - \gamma P)V = R \implies V = (I - \gamma P)^{-1} R

This requires a matrix inverse: O(N^3) complexity. Only practical for small state spaces.

Iterative solution (dynamic programming). Initialize V_0(s) = 0 for all s. Then repeat until convergence:

V_k(s) = R(s) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s) V_{k-1}(s')

Each iteration costs O(|\mathcal{S}|^2), so the iterative approach is generally preferred for larger state spaces.
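Both solution methods can be compared side by side on a made-up 2-state MRP (the numbers are invented for illustration). NumPy's `linalg.solve` stands in for computing the inverse explicitly:

```python
import numpy as np

# Hypothetical 2-state MRP: state 1 is absorbing with zero reward.
R = np.array([1.0, 0.0])
P = np.array([[0.5, 0.5],
              [0.0, 1.0]])
gamma = 0.9

# Analytic: solve (I - gamma P) V = R directly (O(N^3), fine for tiny N).
V_exact = np.linalg.solve(np.eye(2) - gamma * P, R)

# Iterative: V_k = R + gamma P V_{k-1}, starting from V_0 = 0.
V = np.zeros(2)
for _ in range(1000):
    V_new = R + gamma * (P @ V)
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new
```

The two answers agree to numerical precision, which is a useful sanity check when implementing either method.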

Active Recall Questions

Q1. Derive the analytic solution V = (I - \gamma P)^{-1} R from the Bellman equation.

Show Answer

Start from V = R + \gamma P V. Rearrange: V - \gamma P V = R, so (I - \gamma P)V = R, giving V = (I - \gamma P)^{-1} R.

Q2. What is the computational complexity of the analytic solution versus the iterative one? When would you prefer each?

Show Answer

Analytic: O(N^3) due to matrix inversion. Iterative: O(|\mathcal{S}|^2) per iteration. Prefer the analytic solution for small, fixed state spaces. Prefer iterative for larger state spaces or when you only need an approximate solution.

Q3. What is the stopping criterion for the iterative algorithm?

Show Answer

Stop when \max_s |V_k(s) - V_{k-1}(s)| < \theta for some small threshold \theta > 0, meaning the value estimates have effectively stopped changing.


Agent Components

An RL agent may include any combination of:

  • Model: the agent's representation of how the world transitions (P) and what rewards look like (R)
  • Policy \pi: a mapping from states to actions. Can be deterministic, \pi(s) = a, or stochastic, \pi(a \mid s) = P(a_t = a \mid s_t = s)
  • Value function: quantifies the expected future return from a state or state-action pair

Model-based agents have an explicit model. Model-free agents learn a policy and/or value function directly without building one.

Active Recall Questions

Q1. What are the three components an RL agent may have? Does an agent need all three?

Show Answer

Model (P and R), policy (\pi), and value function (V or Q). No, an agent can operate with any subset of these. For example, a model-free agent has no model but may have both a policy and a value function.

Q2. What is the difference between a deterministic and a stochastic policy? Write both mathematically.

Show Answer

A deterministic policy maps each state to a single action: \pi(s) = a. A stochastic policy maps each state to a distribution over actions: \pi(a \mid s) = P(a_t = a \mid s_t = s).

Q3. What is the key distinction between a model-based and a model-free agent?

Show Answer

A model-based agent has an explicit representation of the transition dynamics P and reward function R, which it uses for planning. A model-free agent does not learn or use a model; it learns a policy and/or value function directly from experience.


Lecture 2: MDPs, Policy Iteration, and Value Iteration

Markov Decision Process (MDP)

An MDP is an MRP with actions. It is a tuple (\mathcal{S}, \mathcal{A}, P, R, \gamma) where:

  • \mathcal{A} is a finite set of actions
  • P(s_{t+1} = s' \mid s_t = s, a_t = a) is the transition model conditioned on the action
  • R(s, a) = \mathbb{E}[r_t \mid s_t = s, a_t = a] is the expected reward for taking action a in state s

MDP + policy = MRP. Given a policy \pi(a \mid s), the MDP reduces to an MRP with:

R^\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) R(s, a), \qquad P^\pi(s' \mid s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) P(s' \mid s, a)

This means all MRP techniques carry over directly for policy evaluation.
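The reduction is just two weighted sums. A minimal sketch on a made-up 2-state, 2-action MDP (all numbers and the helper name are hypothetical):

```python
# P_mdp[s][a] is a distribution over next states; R_mdp[s][a] a scalar reward.
P_mdp = [[[1.0, 0.0], [0.2, 0.8]],
         [[0.0, 1.0], [0.6, 0.4]]]
R_mdp = [[0.0, 1.0],
         [0.5, 0.0]]
pi = [[0.5, 0.5],   # stochastic policy pi(a|s)
      [1.0, 0.0]]

def induced_mrp(P_mdp, R_mdp, pi):
    """Average out the action: R^pi(s) = sum_a pi(a|s) R(s,a), same for P^pi."""
    n = len(P_mdp)
    m = len(pi[0])
    R_pi = [sum(pi[s][a] * R_mdp[s][a] for a in range(m)) for s in range(n)]
    P_pi = [[sum(pi[s][a] * P_mdp[s][a][s2] for a in range(m))
             for s2 in range(n)] for s in range(n)]
    return R_pi, P_pi

R_pi, P_pi = induced_mrp(P_mdp, R_mdp, pi)
```

The result (R_pi, P_pi) is an ordinary MRP, so the analytic or iterative evaluation from the previous section applies unchanged.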

Active Recall Questions

Q1. Write the full tuple definition of an MDP. What does each element represent?

Show Answer

(\mathcal{S}, \mathcal{A}, P, R, \gamma): state space, action space, transition dynamics P(s' \mid s, a), reward function R(s, a), discount factor \gamma.

Q2. How does fixing a policy \pi transform an MDP into an MRP? Write the induced R^\pi and P^\pi.

Show Answer

R^\pi(s) = \sum_a \pi(a \mid s) R(s, a) and P^\pi(s' \mid s) = \sum_a \pi(a \mid s) P(s' \mid s, a). The action is averaged out according to the policy, leaving a pure Markov chain with rewards.

Q3. Why is the MDP-to-MRP reduction useful?

Show Answer

It means that once we have a fixed policy, we can evaluate it using all the MRP machinery (Bellman equation, iterative DP, analytic solution), without needing new algorithms.


Value Functions in an MDP

With actions in the picture, there are two distinct value functions.

State-value function V^\pi(s). The expected return starting from state s and following policy \pi:

V^\pi(s) = \mathbb{E}_\pi [G_t \mid s_t = s]

Action-value function Q^\pi(s, a). The expected return starting from state s, taking action a, then following policy \pi thereafter:

Q^\pi(s, a) = \mathbb{E}_\pi [G_t \mid s_t = s, a_t = a]


Interdependence of V and Q

V^\pi and Q^\pi are intrinsically linked. Each can be written in terms of the other:

  • V^\pi(s) is defined via the policy and Q^\pi: V^\pi(s) = \sum_{a} \pi(a \mid s)\, Q^\pi(s, a). This is the expected Q-value, weighted by the policy's action probabilities.
  • Q^\pi(s, a) is defined via the dynamics and V^\pi: Q^\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma V^\pi(s')\right]. This is the immediate reward plus the discounted expected value of the next state.

Active Recall Questions

Q1. What is the difference between V^\pi(s) and Q^\pi(s, a)?

Show Answer

V^\pi(s) is the expected return from state s when following \pi from the very start. Q^\pi(s, a) is the expected return when you take a specific action a first, then follow \pi. V^\pi averages over actions; Q^\pi conditions on a specific action.

Q2. Express V^\pi(s) in terms of Q^\pi(s, a).

Show Answer

V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a). The state value is the policy-weighted average of the action values.

Q3. Express Q^\pi(s, a) in terms of V^\pi.

Show Answer

Q^\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma V^\pi(s')\right]. Take action a, collect the immediate reward, then add the discounted value of the resulting state.

Q4. Why is the V/Q interdependence important in practice?

Show Answer

It lets you move fluidly between the two representations. Policy improvement uses Q to identify better actions. Policy evaluation computes V. Being able to convert between them in both directions is essential for implementing both algorithms.


Bellman Equations for V^\pi and Q^\pi

Substituting the interdependence relations into each other gives the full Bellman expectation equations:

V^\pi(s) = \mathbb{E}_\pi \left[ r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s \right]

Q^\pi(s, a) = \mathbb{E}_\pi \left[ r_{t+1} + \gamma Q^\pi(s_{t+1}, a_{t+1}) \mid s_t = s, a_t = a \right]

Active Recall Questions

Q1. Write the Bellman expectation equation for V^\pi and explain each term.

Show Answer

V^\pi(s) = \mathbb{E}_\pi[r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s]. The first term r_{t+1} is the immediate reward. The second term \gamma V^\pi(s_{t+1}) is the discounted value of the state we land in, averaged over transitions and the policy.

Q2. What is the key structural insight that the Bellman equations encode?

Show Answer

The value of a state (or state-action pair) right now can be broken into two parts: what you get immediately, and what you expect to get from here onwards. This recursive structure is what makes dynamic programming possible.


Policy Evaluation (Iterative)

Initialize V_0^\pi(s) = 0 for all s, then apply the Bellman expectation backup repeatedly:

V_k^\pi(s) = \sum_a \pi(a \mid s) \left[ R(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, V_{k-1}^\pi(s') \right]

For a deterministic policy this simplifies to:

V_k^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s, \pi(s))\, V_{k-1}^\pi(s')

Continue until \Delta = \max_s |V_k(s) - V_{k-1}(s)| < \theta for some small threshold \theta.
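The backup above translates almost line for line into code. This is a tabular sketch on a made-up 2-state, 2-action MDP (all numbers invented):

```python
def policy_evaluation(P, R, pi, gamma, theta=1e-8):
    """Apply the Bellman expectation backup until the max change is < theta.

    P[s][a]: list of next-state probabilities; R[s][a]: expected reward;
    pi[s][a]: probability of action a in state s (tabular, toy-sized).
    """
    n = len(P)
    V = [0.0] * n
    while True:
        V_new = [sum(pi[s][a] * (R[s][a]
                                 + gamma * sum(P[s][a][s2] * V[s2]
                                               for s2 in range(n)))
                     for a in range(len(pi[s])))
                 for s in range(n)]
        delta = max(abs(x - y) for x, y in zip(V_new, V))
        V = V_new
        if delta < theta:
            return V

# Deterministic policy "always take action 0" on the toy MDP.
P = [[[0.0, 1.0], [1.0, 0.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
R = [[1.0, 0.0],
     [0.0, 2.0]]
pi = [[1.0, 0.0],
      [1.0, 0.0]]
V = policy_evaluation(P, R, pi, gamma=0.5)   # V = [1.0, 0.0]
```

Here action 0 from state 0 pays 1 and leads to state 1, where action 0 pays nothing forever, so V^\pi = [1, 0] by hand as well.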

Active Recall Questions

Q1. Write the iterative policy evaluation update. What does each term do?

Show Answer

V_k^\pi(s) = \sum_a \pi(a \mid s)\left[R(s, a) + \gamma \sum_{s'} p(s' \mid s, a) V_{k-1}^\pi(s')\right]. The outer sum averages over actions according to the policy. The inner bracket adds the immediate reward to the discounted estimated value of the next state.

Q2. How does the update simplify for a deterministic policy?

Show Answer

Since \pi picks a single action, the sum over a disappears: V_k^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s'} p(s' \mid s, \pi(s)) V_{k-1}^\pi(s').

Q3. What is policy evaluation computing, and why isn't it the same as finding the optimal policy?

Show Answer

Policy evaluation computes V^\pi, the value of a specific fixed policy \pi. It tells you how good \pi is, but it does not improve it. Finding the optimal policy requires additionally doing policy improvement (as in policy iteration) or directly optimizing (as in value iteration).


MDP Control

The goal of control is to find the optimal policy:

\pi^*(s) = \arg\max_\pi V^\pi(s)

Key facts about the optimal policy in an infinite-horizon MDP:

  • It is deterministic: randomness is never needed at optimality
  • It is stationary: the same action is prescribed for a given state regardless of the timestep
  • The optimal value function V^* is unique, but the optimal policy is not necessarily (ties are possible)

The number of deterministic policies is |\mathcal{A}|^{|\mathcal{S}|}, so exhaustive search is infeasible for all but the smallest problems.

Active Recall Questions

Q1. State three properties of the optimal policy in an infinite-horizon MDP.

Show Answer

It is deterministic, stationary (time-independent), and there exists a unique optimal value function V^* (though there may be multiple optimal policies achieving it).

Q2. Why is the optimal policy stationary in an infinite-horizon MDP but not in a finite-horizon one?

Show Answer

In the infinite-horizon case, the remaining time is always the same (infinite), so the same decision rule applies at every step. In the finite-horizon case, the number of steps remaining decreases over time, so the optimal action in a state depends on how much time is left.

Q3. Why is exhaustive policy search not feasible?

Show Answer

The number of deterministic policies is |\mathcal{A}|^{|\mathcal{S}|}, which grows exponentially in the state space. Even for modest problem sizes this is astronomically large.


Policy Iteration

Policy Iteration (PI) discovers the optimal policy by alternating between exact policy evaluation and greedy policy improvement.

Step 1: Initialization. Initialize V(s) \in \mathbb{R} and \pi(s) arbitrarily for all states. Set a small convergence threshold \theta > 0.

Step 2: Policy Evaluation (inner loop). Repeat until \Delta < \theta. Set \Delta \leftarrow 0; then for each state s:

v \leftarrow V(s), \quad V(s) \leftarrow \sum_{s', r} p(s', r \mid s, \pi(s))\left[r + \gamma V(s')\right], \quad \Delta \leftarrow \max(\Delta, |v - V(s)|)

Step 3: Policy Improvement. For each state s:

\pi(s) \leftarrow \arg\max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma V(s')\right]

If the policy changed for any state, return to Step 2. Otherwise, stop and return V^* and \pi^*.
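The three steps above can be sketched end to end on a toy tabular MDP (all numbers and helper names invented). Evaluation runs to convergence, then improvement greedily reassigns actions:

```python
def q_value(P, R, V, s, a, gamma):
    """One-step lookahead: Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')."""
    return R[s][a] + gamma * sum(p * V[s2] for s2, p in enumerate(P[s][a]))

def policy_iteration(P, R, gamma, theta=1e-8):
    n, m = len(P), len(P[0])
    pi = [0] * n                      # arbitrary deterministic initial policy
    while True:
        # Step 2: policy evaluation (inner loop)
        V = [0.0] * n
        while True:
            V_new = [q_value(P, R, V, s, pi[s], gamma) for s in range(n)]
            delta = max(abs(x - y) for x, y in zip(V_new, V))
            V = V_new
            if delta < theta:
                break
        # Step 3: policy improvement (greedy w.r.t. Q)
        new_pi = [max(range(m), key=lambda a: q_value(P, R, V, s, a, gamma))
                  for s in range(n)]
        if new_pi == pi:
            return pi, V
        pi = new_pi

# Made-up 2-state, 2-action MDP: action 1 in state 1 pays 2 and returns to state 0.
P = [[[0.0, 1.0], [1.0, 0.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
R = [[1.0, 0.0],
     [0.0, 2.0]]
pi_star, V_star = policy_iteration(P, R, gamma=0.5)
```

On this example the loop terminates after one improvement, settling on the cycle that alternates rewards 1 and 2.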


The Policy Improvement Theorem

Theorem. Given two deterministic policies \pi and \pi', if for all s \in \mathcal{S}:

Q^\pi(s, \pi'(s)) \geq V^\pi(s)

then \pi' is globally at least as good:

V^{\pi'}(s) \geq V^\pi(s) \quad \forall s \in \mathcal{S}

Proof by unrolling the Bellman equation:

  1. Assumption: V^\pi(s) \leq Q^\pi(s, \pi'(s))

  2. Expand Q^\pi one step:

V^\pi(s) \leq \mathbb{E}_{\pi'}\left[r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s\right]

  3. Apply the inequality again at s_{t+1}:

V^\pi(s) \leq \mathbb{E}_{\pi'}\left[r_{t+1} + \gamma r_{t+2} + \gamma^2 V^\pi(s_{t+2}) \mid s_t = s\right]

  4. Continue unrolling for n steps:

V^\pi(s) \leq \mathbb{E}_{\pi'}\left[r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V^\pi(s_{t+n}) \mid s_t = s\right]

  5. Take n \to \infty. Because \gamma < 1, the remainder \gamma^n V^\pi(s_{t+n}) \to 0:

V^\pi(s) \leq \mathbb{E}_{\pi'}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s\right] = V^{\pi'}(s)

\therefore V^{\pi'}(s) \geq V^\pi(s) \quad \blacksquare

Convergence of PI. If the policy does not change after an improvement step, it can never change again (since Q^{\pi_{i+1}} = Q^{\pi_i} implies \pi_{i+2} = \pi_{i+1}). Because there are at most |\mathcal{A}|^{|\mathcal{S}|} distinct policies and every policy change strictly improves the value, policy iteration always terminates.

Active Recall Questions

Q1. Describe the two phases of policy iteration in plain words.

Show Answer

First, policy evaluation: run the Bellman expectation update iteratively until V^\pi converges for the current policy. Second, policy improvement: for each state, greedily pick the action that maximizes Q^\pi(s, a) under the current value estimates. Repeat until the policy stops changing.

Q2. State the Policy Improvement Theorem formally.

Show Answer

If Q^\pi(s, \pi'(s)) \geq V^\pi(s) for all s \in \mathcal{S}, then V^{\pi'}(s) \geq V^\pi(s) for all s \in \mathcal{S}.

Q3. What is the key step in the proof? Why does the \gamma^n term vanish?

Show Answer

The key step is repeatedly substituting the inequality V^\pi(s) \leq Q^\pi(s, \pi'(s)) at each successive state, unrolling the Bellman equation. The \gamma^n V^\pi(s_{t+n}) term vanishes as n \to \infty because \gamma < 1 causes it to decay to zero exponentially.

Q4. Why does policy iteration always converge in finite time?

Show Answer

Each iteration either strictly improves the policy or leaves it unchanged. Once unchanged, it can never change again. Since the total number of distinct deterministic policies is finite (|\mathcal{A}|^{|\mathcal{S}|}) and the algorithm never revisits a suboptimal policy, it must terminate.

Q5. If policy iteration stops because the policy did not change, what can you conclude?

Show Answer

The current policy is optimal. The Bellman optimality condition holds: \pi(s) = \arg\max_a Q^\pi(s, a) for all s, which means V^\pi = V^*.


Value Iteration

Value Iteration (VI) merges evaluation and improvement into a single step by applying the Bellman optimality operator B directly at each iteration.

Bellman Optimality Equations. These define the maximum expected return achievable by acting optimally:

V^*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma V^*(s')\right]

Q^*(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma \max_{a'} Q^*(s', a')\right]

Algorithm:

  1. Initialize V_0(s) = 0 for all states. Set threshold \theta > 0.
  2. For each state s, apply the Bellman optimality backup:

V_{k+1}(s) = \max_{a \in \mathcal{A}} \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma V_k(s')\right]

  3. Compute \Delta = \max_s |V_{k+1}(s) - V_k(s)|. If \Delta < \theta, stop; otherwise return to step 2.
  4. Once converged, extract the optimal policy greedily:

\pi^*(s) = \arg\max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma V^*(s')\right]

Equivalently, in operator notation: V_{k+1} = B V_k.
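The algorithm fits in one function. A sketch on a made-up 2-state, 2-action tabular MDP (same toy style as before, numbers invented); note there is no inner evaluation loop, because the max over actions happens inside every backup:

```python
def value_iteration(P, R, gamma, theta=1e-8):
    """Bellman optimality backup V_{k+1}(s) = max_a [R(s,a) + gamma E V_k]."""
    n, m = len(P), len(P[0])
    V = [0.0] * n
    while True:
        V_new = [max(R[s][a] + gamma * sum(p * V[s2]
                                           for s2, p in enumerate(P[s][a]))
                     for a in range(m))
                 for s in range(n)]
        converged = max(abs(x - y) for x, y in zip(V_new, V)) < theta
        V = V_new
        if converged:
            break
    # Greedy policy extraction from the converged values
    pi = [max(range(m),
              key=lambda a: R[s][a] + gamma * sum(p * V[s2]
                                                  for s2, p in enumerate(P[s][a])))
          for s in range(n)]
    return V, pi

P = [[[0.0, 1.0], [1.0, 0.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
R = [[1.0, 0.0],
     [0.0, 2.0]]
V_star, pi_star = value_iteration(P, R, gamma=0.5)
```

On this MDP, value iteration reaches the same V^* and greedy policy as policy iteration, as the theory predicts.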


Convergence of Value Iteration (Contraction Mapping)

VI converges because B is a \gamma-contraction mapping under the infinity norm \|V\|_\infty = \max_s |V(s)|.

Proof. For any two value functions V and U:

|(BV)(s) - (BU)(s)| = \left|\max_a \mathbb{E}\left[r + \gamma V(s')\right] - \max_a \mathbb{E}\left[r + \gamma U(s')\right]\right|

Apply |\max_a f(a) - \max_a g(a)| \leq \max_a |f(a) - g(a)|:

\leq \max_a \left|\mathbb{E}\left[r + \gamma V(s')\right] - \mathbb{E}\left[r + \gamma U(s')\right]\right|

Cancel the reward r (identical in both):

= \max_a \left|\gamma\, \mathbb{E}\left[V(s') - U(s')\right]\right|

Bound by the worst-case difference across all states:

\leq \gamma \max_{s'} |V(s') - U(s')| = \gamma \|V - U\|_\infty

Since this holds for all s:

\|BV - BU\|_\infty \leq \gamma \|V - U\|_\infty

Because \gamma < 1, the distance to V^* shrinks by a factor of \gamma at every iteration. Value iteration converges to the unique optimal value function V^*. \blacksquare

Active Recall Questions

Q1. Write the Bellman optimality equation for V^*. How does it differ from the Bellman expectation equation?

Show Answer

V^*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma V^*(s')\right]. The difference is the \max_a instead of \sum_a \pi(a \mid s). Rather than averaging over the policy's action distribution, we take the best possible action.

Q2. What is a contraction mapping, and why does it guarantee convergence?

Show Answer

An operator B is a contraction if \|BV - BU\| \leq \gamma \|V - U\| with \gamma < 1. It shrinks the distance between any two inputs by a fixed factor each time it is applied. By the Banach fixed-point theorem, repeated application converges to a unique fixed point.

Q3. Sketch the key steps of the contraction proof for the Bellman operator.

Show Answer

(1) Expand \|BV - BU\| using the definition of B. (2) Apply |\max f - \max g| \leq \max|f - g| to move the max inside. (3) Cancel the reward r (same in both). (4) Factor out \gamma and bound by \gamma \|V - U\|_\infty.

Q4. Does the initialization of V_0 in value iteration affect what it converges to?

Show Answer

No. Because B is a contraction, any initialization converges to the unique fixed point V^*. Different initializations may take different numbers of iterations, but the final answer is the same.


Finite Horizon Value Iteration

For a finite horizon H, the algorithm runs for exactly H steps rather than until convergence:

  • V_k: optimal value with k decisions remaining
  • \pi_k: optimal policy with k decisions remaining
  • Initialize V_0(s) = 0, iterate k = 1 to H

The optimal policy in a finite-horizon problem is non-stationary: \pi_k depends on how many steps remain, not just the current state. This is a fundamental difference from the infinite-horizon case.
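A sketch that makes the non-stationarity concrete. In the made-up MDP below (all numbers invented), action 1 in state 0 pays 3 once but strands the agent in a zero-reward absorbing state, so it is only worth taking on the very last step:

```python
def finite_horizon_vi(P, R, H):
    """V[k][s]: best return with k decisions left; pi[k][s]: matching action.

    No discounting here (gamma = 1 is safe with a finite horizon).
    """
    n, m = len(P), len(P[0])
    V = [[0.0] * n]                  # V_0: no decisions left, value 0
    pi = [None]
    for k in range(1, H + 1):
        q = [[R[s][a] + sum(p * V[k - 1][s2] for s2, p in enumerate(P[s][a]))
              for a in range(m)] for s in range(n)]
        V.append([max(q[s]) for s in range(n)])
        pi.append([q[s].index(max(q[s])) for s in range(n)])
    return V, pi

# State 0: action 0 pays 1 and stays; action 1 pays 3 but leads to state 1.
# State 1: absorbing, zero reward under both actions.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [0.0, 1.0]]]
R = [[1.0, 3.0],
     [0.0, 0.0]]
V, pi = finite_horizon_vi(P, R, H=3)
```

With one decision left the agent grabs the one-off reward of 3, but with two or three decisions left it keeps collecting 1 per step instead: the same state gets different optimal actions at different depths.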

Active Recall Questions

Q1. What does V_k(s) represent in finite-horizon value iteration?

Show Answer

It is the optimal expected return from state s when exactly k decisions remain.

Q2. Why is the optimal policy non-stationary in the finite-horizon case?

Show Answer

With a finite horizon, the right action depends on how many steps are left. Near the end of an episode, the agent may prefer to exploit immediately; earlier in the episode, it may be worth taking actions that sacrifice short-term reward for longer-term gain. The remaining time is part of the decision context.


Policy Iteration vs Value Iteration

  • Focus. PI: improving a specific policy \pi. VI: finding the optimal value V^* directly.
  • Update rule. PI: Bellman expectation backup, then \arg\max_a Q. VI: Bellman optimality backup (\max_a) at every step.
  • Phases. PI: two distinct steps, evaluate then improve. VI: one merged step using the \max operator.
  • Convergence. PI: monotonically improving; terminates in at most |\mathcal{A}|^{|\mathcal{S}|} iterations. VI: contraction mapping; converges geometrically to V^*.
  • Cost per step. PI: higher (a full policy evaluation per iteration). VI: lower (one Bellman backup per state).

Active Recall Questions

Q1. What is the fundamental difference in the update rule between PI and VI?

Show Answer

PI applies the Bellman expectation operator (a weighted sum over \pi's action distribution) for evaluation, then a separate \arg\max for improvement. VI applies the Bellman optimality operator (\max_a) in a single step, merging both.

Q2. Which algorithm typically requires fewer outer iterations and why?

Show Answer

Policy iteration, because it fully evaluates the current policy before improving it. Each improvement step makes a globally informed update. Value iteration mixes partial evaluation and improvement, converging more slowly in terms of outer iterations but with cheaper per-step cost.

Q3. Both algorithms converge to \pi^*. What property of policy iteration makes this obvious? What property of value iteration makes this obvious?

Show Answer

For PI: monotonic improvement guarantees we never go backwards, and termination is finite because the policy space is finite. For VI: the Bellman operator is a contraction, so V_k \to V^* and the greedy policy extracted from V^* is optimal.


I will keep adding notes as I work through the rest of the course, along with my solutions to each assignment. More to come.