Basics of Reinforcement Learning
From Gridworld to Policy Gradient Methods
What is Reinforcement Learning?
An agent interacts with an environment.
At each time step \(t\), the agent observes a state \(s\_t\) and takes an action \(a\_t\).
The environment returns a reward \(r\_t\) and a new state \(s\_{t+1}\).
The objective is to maximize cumulative reward over time.
Gridworld Toy Example
States: positions in a 2D grid (e.g., 3x3).
Actions: Up / Down / Left / Right.
Rewards: \(-10\), \(0\), \(+10\).
Goal: learn a policy that maximizes expected return.
S = start, G = goal (+10), X = pit (-10)
+---+---+---+
| | | |
+---+---+---+
| | | |
+---+---+---+
| S | X | G |
+---+---+---+
Gridworld Toy Example
An example trajectory: S marks the agent's current position as it moves up and around the pit toward the goal.
+---+---+---+
| | | |
+---+---+---+
| S | | |
+---+---+---+
| | X | G |
+---+---+---+
+---+---+---+
| | | |
+---+---+---+
| | S | |
+---+---+---+
| | X | G |
+---+---+---+
+---+---+---+
| | | |
+---+---+---+
| | | S |
+---+---+---+
| | X | G |
+---+---+---+
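As a concrete reference, here is a minimal sketch of the 3x3 gridworld above as a Python environment. The class and method names (`GridWorld`, `reset`, `step`) and the random rollout are illustrative assumptions, not part of any particular RL library.

```python
# Minimal 3x3 gridworld sketch (illustrative; names are not from any RL library).
# Layout (row, col): start S=(2,0), pit X=(2,1) with reward -10, goal G=(2,2) with +10.
import random

class GridWorld:
    ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

    def __init__(self, size=3, start=(2, 0), pit=(2, 1), goal=(2, 2)):
        self.size, self.start, self.pit, self.goal = size, start, pit, goal
        self.state = start

    def reset(self):
        self.state = self.start
        return self.state

    def step(self, action):
        dr, dc = self.ACTIONS[action]
        r = min(max(self.state[0] + dr, 0), self.size - 1)  # clip to the grid
        c = min(max(self.state[1] + dc, 0), self.size - 1)
        self.state = (r, c)
        if self.state == self.goal:
            return self.state, +10, True   # reached G
        if self.state == self.pit:
            return self.state, -10, True   # fell into X
        return self.state, 0, False        # ordinary move

# A random-policy rollout, in the spirit of the trajectory frames above.
env = GridWorld()
s, done, total = env.reset(), False, 0
while not done:
    s, reward, done = env.step(random.choice(list(GridWorld.ACTIONS)))
    total += reward
print("return:", total)
```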
RL for Language Models (LMs)
State: current token sequence \(x\_{1:t}\).
Action: predict the next token \(x\_{t+1}\) given \(x\_{1:t}\).
State transition: appending the token updates the state to \(x\_{1:t+1}\).
Reward: at sequence end, assign a quality score \(R(x\_{1:T})\) via
Human evaluation, or
Automated reward models (trained on preferences).
Learn a policy \(\pi\_\theta\) maximizing \(\mathbb{E}[R(x\_{1:T})]\) (a toy sketch follows below).
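To make the mapping concrete, here is a toy sketch in which the "policy" samples tokens uniformly and a stand-in reward scores the finished sequence. The vocabulary, the reward function, and all names here are illustrative assumptions, not a real LM setup.

```python
# Toy illustration of the RL view of decoding (vocabulary and reward are stand-ins).
import math
import random

VOCAB = ["the", "cat", "sat", "mat", "<eos>"]

def sample_next_token(prefix):
    """Action: pick the next token. A real policy pi_theta would condition on the
    prefix; here we sample uniformly just to show the interface."""
    token = random.choice(VOCAB)
    logprob = math.log(1.0 / len(VOCAB))
    return token, logprob

def sequence_reward(tokens):
    """Stand-in for R(x_{1:T}): +1 per occurrence of 'cat' (purely illustrative)."""
    return float(tokens.count("cat"))

# Roll out one episode: state = prefix, action = next token, reward only at the end.
prefix, logprobs = [], []
while len(prefix) < 10:
    token, lp = sample_next_token(prefix)   # a_t ~ pi(. | x_{1:t})
    prefix.append(token)                    # state transition: x_{1:t} -> x_{1:t+1}
    logprobs.append(lp)
    if token == "<eos>":
        break

R = sequence_reward(prefix)                 # terminal reward R(x_{1:T})
# REINFORCE-style loss for this single sample: -R * sum_t log pi(a_t | s_t)
loss = -R * sum(logprobs)
print(prefix, "reward:", R, "loss:", round(loss, 3))
```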
Return, Value, and Policy
Return: \(G_t = \sum_{k=0}^{\infty} \gamma^k r\_{t+k}\)
Value: \(V^\pi(s) = \mathbb{E}\_\pi [ G_t \mid s_t = s ]\)
Action-Value: \(Q^\pi(s,a) = \mathbb{E}\_\pi [ G_t \mid s_t = s, a_t = a ]\)
Policy: \(\pi\_\theta(a \mid s)\) outputs action distribution
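A quick numeric check of the return definition, with an invented reward list and \(\gamma = 0.9\); the value functions are then just expectations of this quantity over trajectories under \(\pi\).

```python
# Discounted return G_t for a finite reward sequence (gamma and rewards are made up).
gamma = 0.9
rewards = [0, 0, -1, 0, 10]           # r_t, r_{t+1}, ...

def discounted_return(rewards, gamma):
    G = 0.0
    for k, r in enumerate(rewards):   # G_t = sum_k gamma^k * r_{t+k}
        G += (gamma ** k) * r
    return G

# V^pi(s) and Q^pi(s,a) are expectations of this return over trajectories
# starting from s (or from s after taking a) and following pi.
print(discounted_return(rewards, gamma))  # 0 + 0 - 0.81 + 0 + 6.561 = 5.751
```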
Advantage Function
The advantage compares an action to the state baseline:
\[ A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s) \]
Positive \(A^\pi(s,a)\): better than average for state \(s\); negative: worse than average.
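A tabular sketch under an assumed uniform policy: given action-value estimates \(Q^\pi(s,\cdot)\), \(V^\pi(s)\) is their expectation under \(\pi\) and the advantage is the difference. The numbers below are invented for illustration.

```python
# Tabular advantage for one state (numbers and the policy are invented).
q_values   = {"up": 4.0, "down": 1.0, "left": 2.0, "right": 5.0}      # Q^pi(s, a)
policy_pis = {"up": 0.25, "down": 0.25, "left": 0.25, "right": 0.25}  # pi(a | s)

# V^pi(s) = E_{a ~ pi}[Q^pi(s, a)]
v = sum(policy_pis[a] * q_values[a] for a in q_values)

# A^pi(s, a) = Q^pi(s, a) - V^pi(s): positive => better than average for this state.
advantages = {a: q_values[a] - v for a in q_values}
print(v, advantages)   # v = 3.0; 'right' and 'up' have positive advantage
```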
Policy Gradient (REINFORCE)
Climb the performance hill by gradient ascent on the expected return \(J(\theta)\):
\[
\nabla_\theta J(\theta) =
\mathbb{E}_{\tau \sim \pi_\theta}\!\left[
\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t
\right]
\]
where \(\hat{A}\_t\) is an advantage estimator (e.g., \(G_t - b_t\)).
Baselines \(b_t\) reduce variance (often \(b_t = V_\phi(s_t)\)).
See: Spinning Up “Deriving the Simplest Policy Gradient”
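A minimal single-state sketch of the estimator, assuming a two-action softmax policy; the gradient of \(\log \pi_\theta\) is written out by hand so the whole thing runs without an autodiff library, and the advantage value is made up.

```python
# One-sample REINFORCE gradient estimate for a 2-action softmax policy (toy sketch).
import math
import random

theta = [0.2, -0.1]                     # one logit per action

def policy_probs(theta):
    z = [math.exp(t) for t in theta]
    s = sum(z)
    return [p / s for p in z]           # pi_theta(a)

def grad_log_pi(theta, a):
    """d/d theta_i of log pi(a) = 1[i == a] - pi(i)  (softmax identity)."""
    probs = policy_probs(theta)
    return [(1.0 if i == a else 0.0) - probs[i] for i in range(len(theta))]

# Sample an action, pair it with a stand-in advantage estimate, and form the
# REINFORCE gradient estimate:  grad_theta log pi(a|s) * A_hat.
probs = policy_probs(theta)
a = random.choices(range(2), weights=probs)[0]
A_hat = 1.5 if a == 0 else -0.5         # stand-in advantage estimate
grad_estimate = [g * A_hat for g in grad_log_pi(theta, a)]

# One gradient-ascent step on theta.
lr = 0.1
theta = [t + lr * g for t, g in zip(theta, grad_estimate)]
print(probs, a, grad_estimate, theta)
```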
Computing Policy Gradient in Practice
Monte Carlo (REINFORCE): unbiased, high variance
Actor–Critic: learn \(V\_\phi(s)\) to form \(\hat{A}\_t\)
Generalized Advantage Estimation (GAE): \(\hat{A}^{\text{GAE}}\_t = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta\_{t+l}\), where \(\delta\_t = r\_t + \gamma V\_\phi(s\_{t+1}) - V\_\phi(s\_t)\) (sketch after this list)
Entropy bonuses encourage exploration
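A finite-horizon sketch of the GAE recursion, computed backwards over one rollout; the rewards and value estimates below are made up for illustration.

```python
# Generalized Advantage Estimation over a finite rollout (rewards/values are made up).
gamma, lam = 0.99, 0.95
rewards = [0.0, 0.0, 1.0, 0.0, 10.0]          # r_t from one episode
values  = [0.5, 0.6, 1.2, 2.0, 5.0, 0.0]      # V_phi(s_t), with V(s_T)=0 at the end

# delta_t = r_t + gamma * V(s_{t+1}) - V(s_t);  A_t = delta_t + gamma*lam*A_{t+1}
advantages = [0.0] * len(rewards)
gae = 0.0
for t in reversed(range(len(rewards))):
    delta = rewards[t] + gamma * values[t + 1] - values[t]
    gae = delta + gamma * lam * gae
    advantages[t] = gae

print([round(a, 3) for a in advantages])
```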
Trust Region Policy Optimization (TRPO)
Constrained policy updates:
\[\begin{aligned}
&\max_{\theta} \mathbb{E}\!\left[
\frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)} \hat{A}
\right] \\
&\text{s.t.}\ \ \mathbb{E}\left[
\mathrm{KL}\big(\pi_{\theta_{\text{old}}} \parallel \pi_{\theta}\big)
\right] \le \delta
\end{aligned}\]
Uses conjugate gradient + line search
Comes with a theoretical monotonic-improvement guarantee (approximate in practice)
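A toy sketch of the two quantities TRPO trades off, for a single-state discrete policy: the importance-ratio surrogate and the KL between old and new policies. All numbers are invented, and the conjugate-gradient and line-search machinery is omitted entirely.

```python
# Surrogate objective and KL-constraint quantities for a discrete policy (toy numbers).
import math

pi_old = [0.25, 0.25, 0.25, 0.25]     # pi_theta_old(a | s)
pi_new = [0.40, 0.20, 0.20, 0.20]     # candidate pi_theta(a | s)
A_hat  = [2.0, -1.0, -0.5, -0.5]      # advantage estimates per action

# Surrogate: E_{a ~ pi_old}[ (pi_new(a)/pi_old(a)) * A_hat(a) ]
surrogate = sum(p_old * (p_new / p_old) * a
                for p_old, p_new, a in zip(pi_old, pi_new, A_hat))

# Constraint: KL(pi_old || pi_new) <= delta
kl = sum(p_old * math.log(p_old / p_new) for p_old, p_new in zip(pi_old, pi_new))

delta = 0.01
print(round(surrogate, 3), round(kl, 4),
      "feasible" if kl <= delta else "violates trust region")
```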
Proximal Policy Optimization (PPO)
Clipped surrogate objective:
\[
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[
\min\!\big( r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t \big)
\right]
\]
where \(r\_t(\theta)=\frac{\pi\_\theta(a\_t\mid s\_t)}{\pi\_{\theta\_\text{old}}(a\_t\mid s\_t)}\)
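A minimal numeric sketch of the clipped term for a handful of samples; the ratios and advantages are made up, with \(\epsilon = 0.2\) as in the PPO paper.

```python
# PPO clipped surrogate for a few samples (ratios/advantages are made up; eps=0.2).
eps = 0.2
ratios     = [0.7, 1.0, 1.5]     # r_t(theta) = pi_new / pi_old, per sample
advantages = [1.0, -2.0, 0.5]    # A_hat_t, per sample

def clip(x, lo, hi):
    return max(lo, min(x, hi))

terms = [min(r * a, clip(r, 1 - eps, 1 + eps) * a)
         for r, a in zip(ratios, advantages)]
L_clip = sum(terms) / len(terms)          # sample mean approximates E_t[...]
print(terms, round(L_clip, 3))
```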
Appendix: Notation
\(\gamma\): discount factor, \(\pi\_\theta\): policy, \(V^\pi\), \(Q^\pi\), \(A^\pi\): value functions, KL: Kullback-Leibler divergence