CS234 Assignment 1, Part 1: Horizon, Discounting, and Reward Hacking
A few weeks ago I started working through Stanford's CS234 course on Reinforcement Learning. I have been meaning to get into RL properly for a while and decided that working through a real course with real problem sets was the best way to do it. This post covers my solutions to Problems 1 and 2 from Assignment 1. A follow-up post will cover Problems 3 and 4.
The full problem set is available on the CS234 course website.
Problem 1: How Horizon Shapes the Optimal Policy
This problem builds intuition for how the time horizon of an MDP shapes the optimal policy. The setup is a simple inventory management environment where the agent controls a stock level $s \in \{0, 1, \dots, n\}$ with $n = 8$. Each sell action yields reward $+1$ and transitions $s \to s - 1$ (selling at $s = 0$ does nothing). Buying transitions $s \to s + 1$ with no reward, except for the final step $s = n - 1 \to s = n$, which yields a large bonus $B$. The state $s = n$ is terminal. The stock always starts at $s = 1$.
(a) Can the optimal policy both buy and sell?
Yes. With horizon $H = 9$, the agent can sell once ($s = 1 \to 0$, earning $+1$), then buy eight consecutive times to reach $s = 8$ and collect the bonus. The total reward is $1 + B$.
Compare this to buying directly from $s = 1$, which takes only 7 steps and yields $B$ total. The sell-first strategy is strictly better whenever the time budget allows it, and it is a concrete example where the optimal policy uses both actions.
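The comparison can be verified with a few lines of finite-horizon backward induction. This is my own sketch of one concrete instantiation of the setup (full stock $n = 8$, $+1$ per sale, start $s = 1$, horizon $H = 9$, and an arbitrary bonus value $B = 10$; the argument only needs $B$ to be large):

```python
# Backward induction on an assumed instantiation of the inventory MDP:
# n = 8, +1 per sale, bonus B = 10 on the final buy, horizon H = 9, start s = 1.
N, H, B = 8, 9, 10.0

def step(s, a):
    """Return (next_state, reward) for action a in {'sell', 'buy'}."""
    if a == "sell":
        return (s - 1, 1.0) if s > 0 else (s, 0.0)
    nxt = min(s + 1, N)
    return nxt, (B if nxt == N and s < N else 0.0)

# V[t][s] = best total (undiscounted) reward from state s with t steps remaining.
V = [[0.0] * (N + 1) for _ in range(H + 1)]
for t in range(1, H + 1):
    for s in range(N):  # s = N is terminal, so its value stays 0
        V[t][s] = max(r + V[t - 1][s2]
                      for s2, r in (step(s, a) for a in ("sell", "buy")))

print(V[H][1])  # 11.0 = 1 (the sale) + 10 (the bonus), beating 10.0 for buy-only
```

The DP confirms the sell-first plan is worth exactly one more unit of reward than buying straight up to full stock.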
(b) Can we choose $\gamma$ so the agent never fully stocks?
Yes. Setting $\gamma = 0$ forces the agent to optimize only for immediate reward. From any state $0 < s < n - 1$, selling yields $+1$ now while buying yields $0$. So the agent sells until $s = 0$ and then stays there, since neither action produces immediate reward at $s = 0$. Starting from $s = 1$, the inventory never reaches $n$.
More formally, with $\gamma = 0$ the optimal value function collapses to the best immediate reward:

$$V^*(s) = \max_a \left( R(s, a) + \gamma \, \mathbb{E}\!\left[ V^*(s') \right] \right) = \max_a R(s, a),$$

which makes the terminal bonus invisible from every state except $s = n - 1$, and the sell-greedy agent never gets that far.
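The collapse is easy to see in code. A minimal sketch under one assumed instantiation of the setup ($n = 8$, $+1$ per sale, bonus $B = 10$ paid on the buy out of $s = n - 1$): with $\gamma = 0$, $Q(s, a) = R(s, a)$, so the optimal policy is just greedy in the immediate reward.

```python
# With gamma = 0, Q(s, a) = R(s, a): the optimal policy reads straight off R.
# Assumed instantiation: n = 8, +1 per sale, bonus B = 10.
N, B = 8, 10.0

def reward(s, a):
    if a == "sell":
        return 1.0 if s > 0 else 0.0
    return B if s == N - 1 else 0.0  # the bonus is immediate only at s = n - 1

policy = {s: max(("sell", "buy"), key=lambda a: reward(s, a)) for s in range(N)}

print(policy[1])  # 'sell': +1 now beats 0
print(policy[7])  # 'buy': the bonus is one step away here, but an agent that
                  # starts at s = 1 and sells greedily never climbs this high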
(c) Do the two formulations ever agree?
This part asks whether there exists a horizon $H$ and a discount $\gamma$ such that the optimal policy for the infinite-horizon discounted MDP matches the optimal policy for the finite-horizon undiscounted MDP.
Yes. Take $H = 1$ and $\gamma = 0$, starting from $s = 1$.
With $H = 1$, the agent has one step. Selling gives $+1$, buying gives $0$. The optimal action is sell.
With $\gamma = 0$, the agent only sees immediate reward. The optimal action is also sell.
Both formulations agree: $\pi^*(1) = \text{sell}$.
(d) Does this always hold for any $H$?
No. A matching $\gamma$ is not guaranteed to exist for every $H$.
The core issue is that infinite-horizon MDPs produce stationary optimal policies: $\pi^* : \mathcal{S} \to \mathcal{A}$ maps states to actions with no dependence on time. Finite-horizon MDPs produce non-stationary policies that depend on how many steps remain.
For large $H$, the finite-horizon optimal policy is explicitly time-dependent. Early in the episode, with many steps remaining, the agent should buy to eventually collect the bonus. Late in the episode, with few steps left, it should sell to collect rewards before the horizon expires. No fixed $\gamma$ can replicate this time-varying behavior via a stationary policy for a general $H$.
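The time-dependence shows up directly if backward induction also records the argmax. Another sketch under my assumed instantiation ($n = 8$, $+1$ per sale, bonus $B = 10$), tracking the optimal action at the same state $s = 1$ as a function of steps remaining:

```python
# Backward induction that records the maximizing action, showing the
# finite-horizon optimal policy is non-stationary.
# Assumed instantiation: n = 8, +1 per sale, bonus B = 10.
N, B, H = 8, 10.0, 8

def step(s, a):
    if a == "sell":
        return (s - 1, 1.0) if s > 0 else (s, 0.0)
    nxt = min(s + 1, N)
    return nxt, (B if nxt == N and s < N else 0.0)

V = [[0.0] * (N + 1) for _ in range(H + 1)]
act = [[None] * (N + 1) for _ in range(H + 1)]
for t in range(1, H + 1):
    for s in range(N):  # s = N is terminal
        q = {}
        for a in ("sell", "buy"):
            s2, r = step(s, a)
            q[a] = r + V[t - 1][s2]
        act[t][s] = max(q, key=q.get)
        V[t][s] = q[act[t][s]]

# Same state, different time-to-go, different optimal action:
print(act[1][1], act[8][1])  # 'sell' with 1 step left, 'buy' with 8 left
```

No stationary policy can emit both answers at $s = 1$, which is exactly why a single discount factor cannot reproduce the finite-horizon solution once $H$ is large.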
Problem 2: When Proxy Rewards Go Wrong
This problem explores how a well-intentioned proxy reward can lead to an agent that is technically optimal but completely misaligned with human intent.
The setup: one AI-controlled car (red) navigates alongside many human-driven cars on a highway. The true goal is to minimize mean commute time. The proxy reward used instead is to maximize mean velocity across all cars.
(a) Why does the optimal policy park and not merge?
Merging requires nearby human-driven cars to slow down to create a gap. This reduction in velocity directly lowers the proxy reward. The AI car discovers that staying parked is the better strategy: the human-driven cars maintain their speed unimpeded, and mean velocity stays high.
Let $\bar{v}$ denote the mean velocity across all cars, including the AI car, so the proxy reward is $R_{\text{proxy}} = \bar{v}$. At the moment of merging, the temporary slowdown caused by surrounding cars braking gives:

$$\bar{v}_{\text{merging}} < \bar{v}_{\text{parked}}.$$

So the agent is not behaving irrationally. It is just that $\bar{v}$ does not capture what we actually want.
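A back-of-the-envelope calculation with toy numbers (my own, not the assignment's) makes the incentive concrete:

```python
# Toy numbers of my own: 10 human cars cruising at 30 m/s. Merging briefly
# forces two of them down to 20 m/s while the AI car enters at 15 m/s;
# parking leaves the AI car at 0 m/s but everyone else at full speed.
human_cruise = [30.0] * 10

parked = sum(human_cruise + [0.0]) / 11               # AI car contributes v = 0
merging = sum([30.0] * 8 + [20.0, 20.0, 15.0]) / 11   # transient slowdown

print(f"{parked:.2f} {merging:.2f}")  # parked beats merging under the proxy
```

Even with the AI car contributing a velocity of zero forever, the instantaneous mean is higher than during the merge, so a proxy-maximizing agent parks.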
(b) A better reward function
One alternative is to reward the AI car directly for merging promptly. Define:

$$R = -t_{\text{merge}},$$

where $t_{\text{merge}}$ is the number of timesteps before the car successfully enters the highway. If the car never merges, $t_{\text{merge}} \to \infty$ and $R \to -\infty$, which forces the agent to merge eventually.
To address safety, a crash state $s_{\text{crash}}$ can be introduced with a large negative reward $-C$ for some large constant $C > 0$. The combined per-step reward:

$$R(s) = \begin{cases} -C & \text{if } s = s_{\text{crash}}, \\ -1 & \text{if the car has not yet merged}, \\ 0 & \text{otherwise}, \end{cases}$$

captures both the goal (merge) and the safety constraint (do it without causing a collision), while sidestepping the mean-velocity proxy entirely.
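As a sketch, here is my per-step formulation of that combined reward ($C = 1000$ and the episode length are arbitrary choices of mine). Summing $-1$ over the pre-merge steps recovers $R = -t_{\text{merge}}$ on any crash-free trajectory:

```python
# Per-step version of the combined reward: -1 each timestep until merged,
# -C on a crash, 0 afterwards. C is an arbitrary large constant.
C = 1000.0

def reward(merged: bool, crashed: bool) -> float:
    if crashed:
        return -C
    return 0.0 if merged else -1.0

# A crash-free trajectory that merges after t_merge = 5 of 20 steps:
t_merge = 5
ret = sum(reward(merged=(t >= t_merge), crashed=False) for t in range(20))
print(ret)  # -5.0, i.e. exactly -t_merge; never merging drives this to -infinity
```

The agent maximizes return by merging as early as possible, and as long as $C$ exceeds any achievable gain from risky maneuvers, crashing is never worth it.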
What's Next
Problems 3 and 4 go considerably deeper. Problem 3 covers Bellman residuals and performance bounds, proving properties of the Bellman backup operator and deriving guarantees on policy quality from an arbitrary value function $V$. Problem 4 is an implementation task: building value iteration and policy iteration from scratch on the RiverSwim MDP.
I will cover those in the next post. I also have a separate post in draft on the RL theory background covering value functions, Bellman equations, and the convergence proofs for both algorithms. That one is a study reference rather than a problem walkthrough.