From RL Basics to GRPO → GSPO for LLMs

A practical, math-first tour

RL Foundations

TRPO: Trust-Region Policy Optimization

PPO: Proximal Policy Optimization

DPO: Direct Preference Optimization

GRPO: Group Relative Policy Optimization

GSPO: Group Sequence Policy Optimization

Practical Notes

References (selected)

  • TRPO (Schulman et al., 2015)
  • PPO (Schulman et al., 2017)
  • DPO (Rafailov et al., 2023; NeurIPS 2023)
  • GRPO (DeepSeekMath; Shao et al., 2024)
  • GSPO (Qwen Team, 2025)