Basics of Reinforcement Learning
From Gridworld to Policy Gradient Methods
What is Reinforcement Learning?
An agent interacts with an environment.
At each time step \(t\), the agent observes a state \(s\_t\) and takes an action \(a\_t\).
The environment returns a reward \(r\_t\) and a new state \(s\_{t+1}\).
The objective is to maximize cumulative reward over time.
Gridworld Toy Example
States: positions in a 2D grid (e.g., 3x3).
Actions: Up / Down / Left / Right.
Rewards: \(-10\), \(0\), \(+10\).
Goal: learn a policy that maximizes expected return.
S = start, G = goal (+10), X = pit (-10)
+---+---+---+
| | | |
+---+---+---+
| | | |
+---+---+---+
| S | X | G |
+---+---+---+
Gridworld Toy Example
An example trajectory: S marks the agent's current position as it moves up and around the pit toward the goal.
+---+---+---+
| | | |
+---+---+---+
| S | | |
+---+---+---+
| | X | G |
+---+---+---+
+---+---+---+
| | | |
+---+---+---+
| | S | |
+---+---+---+
| | X | G |
+---+---+---+
+---+---+---+
| | | |
+---+---+---+
| | | S |
+---+---+---+
| | X | G |
+---+---+---+
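As a concrete reference, here is a minimal sketch of the 3x3 gridworld above as a Python environment. The class and method names (`GridWorld`, `reset`, `step`) and the random rollout are illustrative assumptions, not part of any particular RL library.

```python
# Minimal 3x3 gridworld sketch (illustrative; names are not from any RL library).
# Layout (row, col): start S=(2,0), pit X=(2,1) with reward -10, goal G=(2,2) with +10.
import random

class GridWorld:
    ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

    def __init__(self, size=3, start=(2, 0), pit=(2, 1), goal=(2, 2)):
        self.size, self.start, self.pit, self.goal = size, start, pit, goal
        self.state = start

    def reset(self):
        self.state = self.start
        return self.state

    def step(self, action):
        dr, dc = self.ACTIONS[action]
        r = min(max(self.state[0] + dr, 0), self.size - 1)  # clip to the grid
        c = min(max(self.state[1] + dc, 0), self.size - 1)
        self.state = (r, c)
        if self.state == self.goal:
            return self.state, +10, True   # reached G
        if self.state == self.pit:
            return self.state, -10, True   # fell into X
        return self.state, 0, False        # ordinary move

# A random-policy rollout, in the spirit of the trajectory frames above.
env = GridWorld()
s, done, total = env.reset(), False, 0
while not done:
    s, reward, done = env.step(random.choice(list(GridWorld.ACTIONS)))
    total += reward
print("return:", total)
```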
RL for Language Models (LMs)
State: current token sequence \(x\_{1:t}\).
Action: predict the next token \(x\_{t+1}\) given \(x\_{1:t}\).
State transition: appending the token updates the state to \(x\_{1:t+1}\).
Reward: at sequence end, assign a quality score \(R(x\_{1:T})\) via
Human evaluation, or
Automated reward models (trained on preferences).
Learn a policy \(\pi\_\theta\) maximizing \(\mathbb{E}[R(x\_{1:T})]\) (a toy sketch follows below).
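To make the mapping concrete, here is a toy sketch in which the "policy" samples tokens uniformly and a stand-in reward scores the finished sequence. The vocabulary, the reward function, and all names here are illustrative assumptions, not a real LM setup.

```python
# Toy illustration of the RL view of decoding (vocabulary and reward are stand-ins).
import math
import random

VOCAB = ["the", "cat", "sat", "mat", "<eos>"]

def sample_next_token(prefix):
    """Action: pick the next token. A real policy pi_theta would condition on the
    prefix; here we sample uniformly just to show the interface."""
    token = random.choice(VOCAB)
    logprob = math.log(1.0 / len(VOCAB))
    return token, logprob

def sequence_reward(tokens):
    """Stand-in for R(x_{1:T}): +1 per occurrence of 'cat' (purely illustrative)."""
    return float(tokens.count("cat"))

# Roll out one episode: state = prefix, action = next token, reward only at the end.
prefix, logprobs = [], []
while len(prefix) < 10:
    token, lp = sample_next_token(prefix)   # a_t ~ pi(. | x_{1:t})
    prefix.append(token)                    # state transition: x_{1:t} -> x_{1:t+1}
    logprobs.append(lp)
    if token == "<eos>":
        break

R = sequence_reward(prefix)                 # terminal reward R(x_{1:T})
# REINFORCE-style loss for this single sample: -R * sum_t log pi(a_t | s_t)
loss = -R * sum(logprobs)
print(prefix, "reward:", R, "loss:", round(loss, 3))
```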
Return, Value, and Policy
Return: \(G_t = \sum_{k=0}^{\infty} \gamma^k r\_{t+k}\)
Value: \(V^\pi(s) = \mathbb{E}\_\pi [ G_t \mid s_t = s ]\)
Action-Value: \(Q^\pi(s,a) = \mathbb{E}\_\pi [ G_t \mid s_t = s, a_t = a ]\)
Policy: \(\pi\_\theta(a \mid s)\) outputs action distribution
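A quick numeric check of the return definition, with an invented reward list and \(\gamma = 0.9\); the value functions are then just expectations of this quantity over trajectories under \(\pi\).

```python
# Discounted return G_t for a finite reward sequence (gamma and rewards are made up).
gamma = 0.9
rewards = [0, 0, -1, 0, 10]           # r_t, r_{t+1}, ...

def discounted_return(rewards, gamma):
    G = 0.0
    for k, r in enumerate(rewards):   # G_t = sum_k gamma^k * r_{t+k}
        G += (gamma ** k) * r
    return G

# V^pi(s) and Q^pi(s,a) are expectations of this return over trajectories
# starting from s (or from s after taking a) and following pi.
print(discounted_return(rewards, gamma))  # 0 + 0 - 0.81 + 0 + 6.561 = 5.751
```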
Advantage Function
The advantage compares an action to the state baseline:
\[ A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s) \]
Positive \(A^\pi(s,a)\): better than average for state \(s\); negative: worse than average.
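A tabular sketch under an assumed uniform policy: given action-value estimates \(Q^\pi(s,\cdot)\), \(V^\pi(s)\) is their expectation under \(\pi\) and the advantage is the difference. The numbers below are invented for illustration.

```python
# Tabular advantage for one state (numbers and the policy are invented).
q_values   = {"up": 4.0, "down": 1.0, "left": 2.0, "right": 5.0}      # Q^pi(s, a)
policy_pis = {"up": 0.25, "down": 0.25, "left": 0.25, "right": 0.25}  # pi(a | s)

# V^pi(s) = E_{a ~ pi}[Q^pi(s, a)]
v = sum(policy_pis[a] * q_values[a] for a in q_values)

# A^pi(s, a) = Q^pi(s, a) - V^pi(s): positive => better than average for this state.
advantages = {a: q_values[a] - v for a in q_values}
print(v, advantages)   # v = 3.0; 'right' and 'up' have positive advantage
```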
Policy Gradient (REINFORCE)
Climb the performance hill by gradient ascent on the expected return \(J(\theta)\):
\[
\nabla_\theta J(\theta) =
\mathbb{E}_{\tau \sim \pi_\theta}\!\left[
\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t
\right]
\]
where \(\hat{A}\_t\) is an advantage estimator (e.g., \(G_t - b_t\)).
Baselines \(b_t\) reduce variance (often \(b_t = V_\phi(s_t)\)).
See: Spinning Up “Deriving the Simplest Policy Gradient”
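A minimal single-state sketch of the estimator, assuming a two-action softmax policy; the gradient of \(\log \pi_\theta\) is written out by hand so the whole thing runs without an autodiff library, and the advantage value is made up.

```python
# One-sample REINFORCE gradient estimate for a 2-action softmax policy (toy sketch).
import math
import random

theta = [0.2, -0.1]                     # one logit per action

def policy_probs(theta):
    z = [math.exp(t) for t in theta]
    s = sum(z)
    return [p / s for p in z]           # pi_theta(a)

def grad_log_pi(theta, a):
    """d/d theta_i of log pi(a) = 1[i == a] - pi(i)  (softmax identity)."""
    probs = policy_probs(theta)
    return [(1.0 if i == a else 0.0) - probs[i] for i in range(len(theta))]

# Sample an action, pair it with a stand-in advantage estimate, and form the
# REINFORCE gradient estimate:  grad_theta log pi(a|s) * A_hat.
probs = policy_probs(theta)
a = random.choices(range(2), weights=probs)[0]
A_hat = 1.5 if a == 0 else -0.5         # stand-in advantage estimate
grad_estimate = [g * A_hat for g in grad_log_pi(theta, a)]

# One gradient-ascent step on theta.
lr = 0.1
theta = [t + lr * g for t, g in zip(theta, grad_estimate)]
print(probs, a, grad_estimate, theta)
```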
Computing Policy Gradient in Practice
Monte Carlo (REINFORCE): unbiased, high variance
Actor–Critic: learn \(V\_\phi(s)\) to form \(\hat{A}\_t\)
Generalized Advantage Estimation (GAE): \(\hat{A}^{\text{GAE}}\_t = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta\_{t+l}\), where \(\delta\_t = r\_t + \gamma V\_\phi(s\_{t+1}) - V\_\phi(s\_t)\) (sketch after this list)
Entropy bonuses encourage exploration
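A finite-horizon sketch of the GAE recursion, computed backwards over one rollout; the rewards and value estimates below are made up for illustration.

```python
# Generalized Advantage Estimation over a finite rollout (rewards/values are made up).
gamma, lam = 0.99, 0.95
rewards = [0.0, 0.0, 1.0, 0.0, 10.0]          # r_t from one episode
values  = [0.5, 0.6, 1.2, 2.0, 5.0, 0.0]      # V_phi(s_t), with V(s_T)=0 at the end

# delta_t = r_t + gamma * V(s_{t+1}) - V(s_t);  A_t = delta_t + gamma*lam*A_{t+1}
advantages = [0.0] * len(rewards)
gae = 0.0
for t in reversed(range(len(rewards))):
    delta = rewards[t] + gamma * values[t + 1] - values[t]
    gae = delta + gamma * lam * gae
    advantages[t] = gae

print([round(a, 3) for a in advantages])
```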
Trust Region Policy Optimization (TRPO)
Constrained policy updates:
\[\begin{aligned}
&\max_{\theta} \mathbb{E}\!\left[
\frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)} \hat{A}
\right] \\
&\text{s.t.}\ \ \mathbb{E}\left[
\mathrm{KL}\big(\pi_{\theta_{\text{old}}} \parallel \pi_{\theta}\big)
\right] \le \delta
\end{aligned}\]
Uses conjugate gradient + line search
Comes with a theoretical monotonic-improvement guarantee (approximate in practice)
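A toy sketch of the two quantities TRPO trades off, for a single-state discrete policy: the importance-ratio surrogate and the KL between old and new policies. All numbers are invented, and the conjugate-gradient and line-search machinery is omitted entirely.

```python
# Surrogate objective and KL-constraint quantities for a discrete policy (toy numbers).
import math

pi_old = [0.25, 0.25, 0.25, 0.25]     # pi_theta_old(a | s)
pi_new = [0.40, 0.20, 0.20, 0.20]     # candidate pi_theta(a | s)
A_hat  = [2.0, -1.0, -0.5, -0.5]      # advantage estimates per action

# Surrogate: E_{a ~ pi_old}[ (pi_new(a)/pi_old(a)) * A_hat(a) ]
surrogate = sum(p_old * (p_new / p_old) * a
                for p_old, p_new, a in zip(pi_old, pi_new, A_hat))

# Constraint: KL(pi_old || pi_new) <= delta
kl = sum(p_old * math.log(p_old / p_new) for p_old, p_new in zip(pi_old, pi_new))

delta = 0.01
print(round(surrogate, 3), round(kl, 4),
      "feasible" if kl <= delta else "violates trust region")
```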
Proximal Policy Optimization (PPO)
Clipped surrogate objective:
\[
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[
\min\!\big( r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t \big)
\right]
\]
where \(r\_t(\theta)=\frac{\pi\_\theta(a\_t\mid s\_t)}{\pi\_{\theta\_\text{old}}(a\_t\mid s\_t)}\)
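A minimal numeric sketch of the clipped term for a handful of samples; the ratios and advantages are made up, with \(\epsilon = 0.2\) as in the PPO paper.

```python
# PPO clipped surrogate for a few samples (ratios/advantages are made up; eps=0.2).
eps = 0.2
ratios     = [0.7, 1.0, 1.5]     # r_t(theta) = pi_new / pi_old, per sample
advantages = [1.0, -2.0, 0.5]    # A_hat_t, per sample

def clip(x, lo, hi):
    return max(lo, min(x, hi))

terms = [min(r * a, clip(r, 1 - eps, 1 + eps) * a)
         for r, a in zip(ratios, advantages)]
L_clip = sum(terms) / len(terms)          # sample mean approximates E_t[...]
print(terms, round(L_clip, 3))
```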
Appendix: Notation
\(\gamma\): discount factor, \(\pi\_\theta\): policy, \(V^\pi\), \(Q^\pi\), \(A^\pi\): value functions, KL: Kullback-Leibler divergence