Instruction booklets = proofs
- Step-by-step guides that show how to snap bricks together to build bigger structures (e.g., Pythagoras’ theorem).
Lean
- Lean = a super-precise robot that reads the instruction booklets and shouts “ERROR!” the instant you try to snap together two bricks that don’t fit.
- Lean is a proof assistant:
- A compiler for math.
- Input: theorem and a proof
- Output: “✅ Proof accepted” or “❌ Line 7: type mismatch.”
- Example toy theorem:
- theorem silly (n : ℕ) : n + 0 = n := by rfl
- Lean checks that rfl (“reflexivity”) is a legal tactic here and that the types line up.
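The accept/reject behavior can be sketched in Lean itself. The second theorem, silly', is an illustrative addition (not from the slides): rfl alone would be rejected on it, so a stronger tactic is needed.

```lean
import Mathlib

-- ✅ Accepted: `n + 0` reduces to `n` by definition, so `rfl` type-checks.
theorem silly (n : ℕ) : n + 0 = n := by rfl

-- ❌ `rfl` alone would be rejected here: `0 + n` does not reduce to `n`
-- definitionally, so a real proof step (e.g. `simp`, which knows
-- `zero_add`) is needed.
theorem silly' (n : ℕ) : 0 + n = n := by simp
```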
Mathematicians
- Mathematicians = creative types who invent new bricks (conjectures) and write new booklets (proofs).
Prover LLMs
- Prover LLMs = apprentices who watch the mathematicians and try to invent their own bricks and booklets.
- They learn by imitation.
Data Scarcity
- The problem: the apprentices run out of Lego pieces to practice with.
- Existing formal math datasets (e.g., Mathlib) are tiny compared to the trillions of tokens used to train general LLMs.
- Not enough humans are formalizing theorems.
Scaling up
- How do we scale up new Lego production?
The Pipeline - Generation
- Generation: ask an LLM for new Lego bricks.
- Seed: pick 40 random files from Mathlib.
- Prompt: “Given the theorems in this file, create as many new theorems as possible that are similar but not identical. Do NOT write proofs; just the statements.”
- Postprocess
- Strip prefixes like @[simp] or protected.
- Auto-add imports (import Mathlib, import Aesop).
- Supply necessary context (variables, open namespaces).
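The postprocessing step above can be sketched as follows. This is a hypothetical illustration, not the paper's actual code: the regexes and the fixed header are assumptions about what "strip prefixes" and "auto-add imports" look like in practice.

```python
import re

# Assumed standard header prepended so each generated statement
# compiles on its own (per the slide: import Mathlib, import Aesop).
HEADER = "import Mathlib\nimport Aesop\n\n"

def postprocess(statement: str) -> str:
    """Strip attribute/modifier prefixes and prepend imports (sketch)."""
    # Remove attribute annotations such as @[simp].
    s = re.sub(r"@\[[^\]]*\]\s*", "", statement)
    # Remove visibility modifiers such as `protected` or `private`.
    s = re.sub(r"^\s*(protected|private)\s+", "", s)
    return HEADER + s

print(postprocess("@[simp] protected theorem foo : 1 + 1 = 2 := sorry"))
```

Supplying variables and open namespaces (the third bullet) would similarly be text prepended per seed file.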
The Pipeline - Evaluation
- Filter each generated statement on three criteria:
- Novelty: use Lean’s exact? tactic to check whether Mathlib or a previous run already contains it.
- Syntactic validity: use the Lean parser to check whether the statement even compiles.
- Non-triviality: use Lean’s aesop tactic to check whether it is too easy (auto-provable in under 1 s).
- Outcome: novel, non-trivial conjectures.
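The three filters combine into a single keep/drop decision. In this hypothetical sketch the actual Lean calls (parser, exact?, aesop) are stubbed as booleans, since invoking Lean is a subprocess detail the slides don't specify.

```python
def keep(statement: str, compiles: bool,
         already_known: bool, aesop_proves_it: bool) -> bool:
    """Keep a conjecture only if it passes all three filters (sketch)."""
    if not compiles:        # syntactic validity: Lean parser rejects it
        return False
    if already_known:       # novelty: exact? finds it in Mathlib / prior runs
        return False
    if aesop_proves_it:     # non-triviality: auto-provable in < 1 s
        return False
    return True

# A statement survives only when it compiles, is new, and resists aesop.
print(keep("theorem t : ...", True, False, False))  # survives
```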
The Pipeline - Iteration
- Concatenate the filtered conjectures into a new file and use it as the seed for another round of generation and filtering.
- Repeat 15 times.
- Outcome: 100+ novel, non-trivial conjectures per run
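The iteration loop can be sketched like this. The generate and filter_conjectures callables are hypothetical stand-ins for the LLM generation step and the three-filter evaluation above.

```python
def run_pipeline(seed, generate, filter_conjectures, rounds=15):
    """Seed -> generate -> filter -> reseed, repeated (sketch).

    Each round's survivors both accumulate in the output pool and
    become the seed file for the next round.
    """
    pool = []
    for _ in range(rounds):
        candidates = generate(seed)          # LLM proposes statements
        kept = filter_conjectures(candidates)  # novelty/validity/triviality
        pool.extend(kept)
        seed = kept                          # filtered output seeds next round
    return pool
```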
The Pipeline - Verification
- Use DeepSeek-Prover-V2 (88.9 % success rate on MiniF2F) to attempt full proofs.
- Conjectures that pass → verified conjectures (true theorems).
- Conjectures that fail → still useful as training data (they are plausible puzzles).
- Funnel: 12k+ raw conjectures → 11k syntactically valid → 4k novel → 3.6k non-trivial (aesop fails).
The Pipeline - Reinforcement Learning
- GRPO (Group Relative Policy Optimization):
- Reward = 1 if a verified proof is found, 0 otherwise.
- Fine-tune DeepSeek-Prover-V2 on the generated conjectures.
- Train for 10 epochs.
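The "group relative" part of GRPO can be sketched with the binary reward from the slide: several proofs are sampled per conjecture, and each proof's advantage is its reward normalized against the group. This is only the advantage computation; the full GRPO objective (policy-ratio clipping, KL penalty) is omitted.

```python
def grpo_advantages(rewards):
    """Group-relative advantages for one conjecture's sampled proofs.

    rewards: one 0/1 verification reward per sampled proof.
    Each advantage = (reward - group mean) / group std.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # if all rewards are equal, avoid dividing by 0
    return [(r - mean) / std for r in rewards]

# Proofs that succeed where siblings failed get a positive advantage.
print(grpo_advantages([1, 0, 0, 1]))
```

With an all-0 (or all-1) group the advantages are all zero, i.e., the conjecture provides no gradient signal that round; this is why non-trivial but provable conjectures are the valuable ones.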
Past works
- STP (Dong & Ma): self-play loop between a prover and a conjecturer, each training the other. Eliminates the human-data bottleneck; proves 26.3 % of LeanWorkbook vs. a 13.2 % baseline.
- Safe (Liu et al.): uses Lean to verify every single reasoning step in a natural-language chain of thought, not just the final answer. Step-level formal verification reduces hallucinations; introduces the 30k-problem “FormalStep” benchmark.
- DeepSeek-Prover-V2 (Ren et al.): 7 B model + RL + Monte Carlo tree search; SOTA on MiniF2F (61.7 %). Supplies the prover side that LeanConjecturer later exploits.
- PutnamBench (Tsoukalas et al.): 644 undergraduate competition problems formalized in Lean. A stress test for new provers and conjecturers.
- LeanNavigator (Yin & Gao): generates 4.7 M new theorems by state-graph exploration (breadth-first search through tactic states). Massive synthetic data; complementary to LLM-based conjecturing.
- Comparison of data-generation approaches (approach / data source / strength / weakness):
- Autoformalization (Ying et al.): natural-language Olympiad problems; cheap and huge; limited to textbook-style problems.
- State-graph exploration (LeanNavigator): existing Lean proofs; scales to millions; mostly shallow variants.
- Self-play (STP): self-generated data; the curriculum adapts; risk of mode collapse.
Issues
- Reliance on DeepSeek-Prover to verify conjectures; a false positive occasionally slips through.
- Early iterations cluster around the topics of the seed files.
- 16 GPU-hours per GRPO epoch; this can probably be sped up using vLLM rollouts.
Citations & Further Reading