Basics of Reinforcement Learning

From Gridworld to Policy Gradient Methods

What is Reinforcement Learning?

  • An agent interacts with an environment.
  • At each time step \(t\), the agent observes a state \(s\_t\) and takes an action \(a\_t\).
  • The environment returns a reward \(r\_t\) and a new state \(s\_{t+1}\).
  • The objective is to maximize cumulative reward over time.
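
A minimal sketch of this interaction loop in Python, assuming a hypothetical `env` object whose `step(action)` returns `(next_state, reward, done)` and a `policy` callable; the names are illustrative, not a specific library's API:

```python
# One episode of agent-environment interaction (hypothetical env/policy interface).
def run_episode(env, policy):
    state = env.reset()                          # initial state s_0
    total_reward, done = 0.0, False
    while not done:
        action = policy(state)                   # a_t sampled from pi(. | s_t)
        state, reward, done = env.step(action)   # environment returns r_t and s_{t+1}
        total_reward += reward                   # accumulate cumulative reward
    return total_reward
```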

Gridworld Toy Example

  • States: positions in a 2D grid (e.g., 3x3).
  • Actions: Up / Down / Left / Right.
  • Rewards: \(+10\) at the goal, \(-10\) at the pit, \(0\) otherwise.
  • Goal: learn a policy that maximizes expected return.

S = start, G = goal (+10), X = pit (-10)

+---+---+---+
|   |   |   |
+---+---+---+
|   |   |   |
+---+---+---+
| S | X | G |
+---+---+---+

Gridworld Toy Example

Successive frames show the agent (marked S) moving up and then right along the middle row, avoiding the pit X on its way to the goal G.

+---+---+---+
|   |   |   |
+---+---+---+
| S |   |   |
+---+---+---+
|   | X | G |
+---+---+---+

+---+---+---+
|   |   |   |
+---+---+---+
|   | S |   |
+---+---+---+
|   | X | G |
+---+---+---+

+---+---+---+
|   |   |   |
+---+---+---+
|   |   | S |
+---+---+---+
|   | X | G |
+---+---+---+
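
A minimal sketch of this 3x3 gridworld as code, with cell coordinates and the \(+10\)/\(-10\)/\(0\) rewards taken from the figure; everything else (such as the wall-bumping rule) is an illustrative assumption:

```python
# 3x3 gridworld sketch: states are (row, col); the bottom row holds S, X, G.
GOAL, PIT, START = (2, 2), (2, 1), (2, 0)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Apply an action; bumping into a wall leaves the state unchanged."""
    dr, dc = ACTIONS[action]
    r, c = state[0] + dr, state[1] + dc
    if not (0 <= r < 3 and 0 <= c < 3):
        r, c = state                      # hit a wall: stay put
    if (r, c) == GOAL:
        return (r, c), +10, True          # reward +10, episode ends
    if (r, c) == PIT:
        return (r, c), -10, True          # reward -10, episode ends
    return (r, c), 0, False               # otherwise zero reward
```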

RL for Language Models (LMs)

  • State: current token sequence \(x\_{1:t}\).
  • Action: predict next token \(x\_{t+1}\) given \(x\_{1:t}\).
  • State transition: appending token updates state to \(x\_{1:t+1}\).
  • Reward: at sequence end, assign quality score \(R(x\_{1:T})\) via
    • Human evaluation, or
    • Automated reward models (trained on preferences).
  • Learn policy \(\pi\_\theta\) maximizing \(\mathbb{E}[R(x\_{1:T})]\).
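
Under this view, one sampled completion is one episode. Below is a minimal sketch of that episode structure in PyTorch, where `lm_policy` (returns next-token probabilities) and `reward_model` (scores the finished sequence) are hypothetical stand-ins, not a particular library's API:

```python
import torch

def sample_episode(lm_policy, reward_model, prompt_ids, max_new_tokens=64, eos_id=2):
    """One RL 'episode' for a language model: states are prefixes, actions are tokens."""
    tokens = list(prompt_ids)                     # state: current token sequence x_{1:t}
    log_probs = []
    for _ in range(max_new_tokens):
        probs = lm_policy(torch.tensor(tokens))   # action distribution pi_theta(. | x_{1:t})
        dist = torch.distributions.Categorical(probs)
        next_token = dist.sample()                # action: next token x_{t+1}
        log_probs.append(dist.log_prob(next_token))
        tokens.append(int(next_token))            # transition: append token to the state
        if int(next_token) == eos_id:
            break
    reward = reward_model(tokens)                 # terminal reward R(x_{1:T})
    return tokens, torch.stack(log_probs), reward
```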

Return, Value, and Policy

  • Return: \(G_t = \sum_{k=0}^{\infty} \gamma^k r\_{t+k}\)
  • Value: \(V^\pi(s) = \mathbb{E}\_\pi [ G_t \mid s_t = s ]\)
  • Action-Value: \(Q^\pi(s,a) = \mathbb{E}\_\pi [ G_t \mid s_t = s, a_t = a ]\)
  • Policy: \(\pi\_\theta(a \mid s)\) outputs action distribution
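
As a concrete illustration of the return, discounted returns for a finite episode can be computed by a backward pass over the recorded rewards (a minimal sketch using the definitions above):

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * G_{t+1} for every step of a finite episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

# Example: three steps of zero reward, then the +10 goal reward.
print(discounted_returns([0, 0, 0, 10]))  # [9.70..., 9.80..., 9.9, 10.0]
```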

Advantage Function

The advantage compares an action to the state's baseline value:

\[ A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s) \]

  • Positive \(A^\pi(s,a)\): action \(a\) is better than average for state \(s\)
  • Negative \(A^\pi(s,a)\): worse than average
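
A minimal sketch of the simplest empirical advantage estimate, \(\hat{A}_t = G_t - V_\phi(s_t)\): Monte Carlo returns minus a value baseline. The baseline numbers below are placeholders standing in for a learned critic:

```python
def advantage_estimates(rewards, values, gamma=0.99):
    """A_hat_t = G_t - V(s_t): Monte Carlo return minus a value baseline."""
    g, returns = 0.0, []
    for r in reversed(rewards):                # recompute discounted returns G_t backwards
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return [g - v for g, v in zip(returns, values)]

# 'values' would come from a learned critic V_phi(s_t); these numbers are placeholders.
print(advantage_estimates([0, 0, 0, 10], values=[9.0, 9.5, 9.8, 9.9]))
```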

Policy Gradient (REINFORCE)

Climb the performance hill by ascending the gradient of expected return:

\[
\nabla_\theta J(\theta) =
\mathbb{E}_{\tau \sim \pi_\theta}\!\left[
\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t
\right]
\]

where \(\hat{A}_t\) is an advantage estimator (e.g., \(G_t - b_t\)).

  • Baselines \(b_t\) reduce variance (often \(V_\phi(s_t)\))
  • See: Spinning Up “Deriving the Simplest Policy Gradient”
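
A minimal sketch of the corresponding surrogate loss in PyTorch, assuming per-step log-probabilities (with gradients attached) and advantage estimates have already been collected; variable names are illustrative:

```python
import torch

def reinforce_loss(log_probs, advantages):
    """Policy-gradient surrogate: minimizing this ascends E[sum_t log pi(a_t|s_t) * A_hat_t]."""
    advantages = advantages.detach()           # gradients flow only through log pi
    return -(log_probs * advantages).sum()

# log_probs: tensor of log pi_theta(a_t | s_t); advantages: A_hat_t per step.
# loss = reinforce_loss(log_probs, advantages); loss.backward(); optimizer.step()
```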

Computing Policy Gradient in Practice

  • Monte Carlo (REINFORCE): unbiased, high variance
  • Actor–Critic: learn \(V\_\phi(s)\) to form \(\hat{A}\_t\)
  • Generalized Advantage Estimation (GAE): \(\hat{A}^{\text{GAE}}\_t = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta\_{t+l}\), where \(\delta\_t = r\_t + \gamma V(s\_{t+1}) - V(s\_t)\)
  • Entropy bonuses encourage exploration
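
A sketch of the GAE backward recursion \(\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}\), assuming per-step rewards and value predictions and a terminal bootstrap value of zero:

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """GAE via the backward recursion A_t = delta_t + gamma * lam * A_{t+1}."""
    advantages, next_value, next_adv = [], 0.0, 0.0   # episode assumed to terminate
    for r, v in zip(reversed(rewards), reversed(values)):
        delta = r + gamma * next_value - v            # TD residual delta_t
        next_adv = delta + gamma * lam * next_adv
        advantages.append(next_adv)
        next_value = v
    return list(reversed(advantages))
```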

Trust Region Policy Optimization (TRPO)

Constrained policy updates:

\[\begin{aligned}
&\max_{\theta}\ \mathbb{E}\!\left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)} \hat{A} \right] \\
&\text{s.t.}\ \ \mathbb{E}\left[ \mathrm{KL}\big(\pi_{\theta_{\text{old}}} \parallel \pi_{\theta}\big) \right] \le \delta
\end{aligned}\]

  • Uses conjugate gradient + line search
  • Comes with a theoretical monotonic-improvement guarantee (approximate in practice)
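
A sketch of the two quantities the TRPO update is built around, the probability ratio and the KL divergence, for categorical policies; the conjugate-gradient step and line search themselves are omitted:

```python
import torch

def surrogate_and_kl(new_probs, old_probs, actions, advantages):
    """Surrogate objective E[ratio * A_hat] and KL(pi_old || pi_new) for categorical policies."""
    idx = torch.arange(len(actions))
    ratio = new_probs[idx, actions] / old_probs[idx, actions]   # pi_theta / pi_theta_old at taken actions
    surrogate = (ratio * advantages).mean()
    kl = (old_probs * (old_probs / new_probs).log()).sum(dim=-1).mean()
    return surrogate, kl
```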

Proximal Policy Optimization (PPO)

Clipped surrogate objective:

\[ L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\Big( r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \Big) \right] \]

where \(r\_t(\theta)=\frac{\pi\_\theta(a\_t\mid s\_t)}{\pi\_{\theta\_\text{old}}(a\_t\mid s\_t)}\)
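
A minimal sketch of that clipped objective as a PyTorch loss, assuming stored old log-probabilities, current log-probabilities, and advantage estimates; \(\epsilon = 0.2\) is just a common illustrative choice:

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """PPO clipped surrogate: negate it to get a loss suitable for gradient descent."""
    ratio = torch.exp(new_log_probs - old_log_probs)           # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages
    return -torch.min(unclipped, clipped).mean()               # maximize L^CLIP == minimize its negative
```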

Appendix: Notation

\(\gamma\): discount factor, \(\pi\_\theta\): policy, \(V^\pi\), \(Q^\pi\), \(A^\pi\): value functions, KL: Kullback-Leibler divergence