From RL Basics to GRPO → GSPO for LLMs
A practical, math-first tour
TRPO: Trust-Region Policy Optimization
PPO: Proximal Policy Optimization
DPO: Direct Preference Optimization
GRPO: Group Relative Policy Optimization
GSPO: Group Sequence Policy Optimization
References (selected)
- TRPO (Schulman et al., 2015)
- PPO (Schulman et al., 2017)
- DPO (Rafailov et al., 2023; NeurIPS 2023)
- GRPO (DeepSeekMath; Shao et al., 2024)
- GSPO (Qwen Team, 2025)