LeanConjecturer

Laith Zumot • 2025

Mathematics is Lego city

  • Lego bricks = tiny facts

Instruction booklets = proofs

  • step-by-step guides that show how to snap bricks together to build bigger things (e.g., Pythagoras’ theorem).

LEAN

  • LEAN = a super-precise robot that reads the instruction booklets and shouts “ERROR!” the instant you try to snap two bricks that don’t fit.
  • LEAN is a proof assistant:
    • A compiler for math.
    • Input: a theorem and a proof
    • Output: “✅ Proof accepted” or “❌ Line 7: type mismatch.”
  • Example toy theorem:
    • theorem silly (n : ℕ) : n + 0 = n := by rfl
    • LEAN checks that rfl (“reflexivity”) is a legal tactic and that the types line up.
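  • For contrast, a minimal illustrative snippet (not from the paper) showing LEAN accept one proof and reject another; the failing line is commented out so the file still compiles:

    import Mathlib

    -- Accepted: `n + 0` reduces to `n` by definition, so `rfl` closes the goal.
    theorem silly (n : ℕ) : n + 0 = n := by rfl

    -- Rejected: `0 + n` does NOT reduce to `n` definitionally, so `rfl` fails
    -- and LEAN reports an error on this line (uncomment to see it).
    -- theorem silly' (n : ℕ) : 0 + n = n := by rfl

    -- A correct proof appeals to a library lemma instead of pure computation.
    theorem silly'' (n : ℕ) : 0 + n = n := by simp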

Mathematicians

  • Mathematicians = creative types who invent new bricks (conjectures) and booklets (proofs).

Prover LLMs

  • Prover LLMs = an apprentice that watches the human builders (the mathematicians) and tries to invent its own bricks and booklets.
  • Learning by imitation

Data Scarcity

  • The problem: apprentices run out of Lego pieces to practice with.
  • Existing formal math datasets (e.g., Mathlib) are tiny compared with the trillions of tokens used to train general LLMs.
  • Not enough humans formalizing theorems

Scaling up

  • How do we scale up new Lego production?

The Pipeline - Generation

  • Generation – Ask an LLM for New Lego Bricks
    • Seed: pick 40 random files from Mathlib.
    • Prompt: “Given the theorems in this file, create as many new theorems as possible that are similar but not identical. Do NOT write proofs; just the statements.”
  • Postprocess
    • Strip prefixes like @[simp] or protected.
    • Auto-add imports (import Mathlib, import Aesop).
    • Supply necessary context (variables, open namespaces).
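  • A rough Python sketch of this generation-and-postprocessing step. The helper names (sample_seed_files, generate_conjectures) and the llm.complete call are hypothetical stand-ins, not the authors’ code:

    import random
    import re
    from pathlib import Path

    PROMPT = (
        "Given the theorems in this file, create as many new theorems as possible "
        "that are similar but not identical. Do NOT write proofs; just the statements."
    )

    def sample_seed_files(mathlib_dir: str, k: int = 40) -> list[Path]:
        """Pick k random .lean files from a local Mathlib checkout as seeds."""
        files = list(Path(mathlib_dir).rglob("*.lean"))
        return random.sample(files, k)

    def postprocess(statement: str) -> str:
        """Strip attributes/modifiers and prepend the imports every conjecture needs."""
        statement = re.sub(r"@\[[^\]]*\]\s*", "", statement)          # drop @[simp], @[ext], ...
        statement = re.sub(r"\b(protected|private)\s+", "", statement)
        # (context such as `variable`/`open` lines would be carried over similarly)
        return "import Mathlib\nimport Aesop\n\n" + statement

    def generate_conjectures(llm, seed_file: Path) -> list[str]:
        """Ask the LLM for new statements modeled on one seed file."""
        raw = llm.complete(prompt=PROMPT + "\n\n" + seed_file.read_text())
        # Assume one statement per line, each beginning with `theorem`.
        return [postprocess(line) for line in raw.splitlines()
                if line.strip().startswith("theorem")]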

The Pipeline - Evaluation

  • Filter each generated statement on three criteria (the tactic-based checks are sketched below):
    • Novelty:
      • using LEAN’s exact?
      • check if Mathlib or previous runs already contain it
    • Syntactic validity:
      • using LEAN parser
      • check if the statement even compiles
    • Non-triviality
      • using LEAN’s aesop tactic.
      • check if it is too easy (auto-provable in <1s)
  • Outcome: novel, non-trivial conjectures.
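  • A minimal Lean illustration of the two tactic-based checks (the example statements are placeholders, not generated conjectures):

    import Mathlib
    import Aesop

    -- Novelty check: `exact?` searches Mathlib for an existing proof term;
    -- here it finds `Nat.add_comm`, so the statement would be discarded as already known.
    example (a b : ℕ) : a + b = b + a := by exact?

    -- Non-triviality check: `aesop` (which calls `simp`) closes this goal instantly,
    -- so the statement would be discarded as too easy.
    example (a : ℕ) : a + 0 = a := by aesop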

The Pipeline - Iteration

  • Concatenate the filtered conjectures into a new file and use it as the seed for another round of generation and filtering (loop sketched below)
  • Do this 15 times
  • Outcome: 100+ novel, non-trivial conjectures per run
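  • Continuing the hypothetical Python module from the generation slide, a sketch of the iteration loop; filter_conjectures stands in for the parser/exact?/aesop checks above:

    from pathlib import Path

    def run_pipeline(llm, mathlib_dir: str, rounds: int = 15) -> list[str]:
        """Iteratively generate, filter, and re-seed for a fixed number of rounds."""
        seeds = sample_seed_files(mathlib_dir, k=40)      # from the generation sketch
        kept: list[str] = []
        for _ in range(rounds):
            candidates = [c for f in seeds for c in generate_conjectures(llm, f)]
            survivors = filter_conjectures(candidates)    # novelty + syntax + non-triviality
            kept.extend(survivors)
            # Concatenate this round's survivors into one file; it becomes the next seed.
            seed_path = Path("round_seed.lean")
            seed_path.write_text("\n\n".join(survivors))
            seeds = [seed_path]
        return kept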

The Pipeline - Verification

  • Use DeepSeek-Prover-V2 (88.9% success rate on MiniF2F) to attempt full proofs (this pass is sketched below).
    • Those that pass → verified conjectures (true theorems).
    • Those that fail → still useful for training (they’re plausible puzzles).
    • Funnel: 12k+ raw conjectures → 11k syntactically valid → 4k novel → 3.6k non-trivial (aesop fails)
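  • A sketch of the verification pass, assuming the prover is reachable through a generic text-generation API (prover.complete is a stand-in) and proofs are checked by compiling inside a Mathlib-enabled Lean project:

    import subprocess
    from pathlib import Path

    def lean_accepts(source: str, project_dir: str = ".") -> bool:
        """Return True if the Lean toolchain compiles the file without errors."""
        path = (Path(project_dir) / "Candidate.lean").resolve()
        path.write_text(source)
        result = subprocess.run(["lake", "env", "lean", str(path)],
                                cwd=project_dir, capture_output=True)
        return result.returncode == 0

    def verify(prover, conjectures: list[str]) -> tuple[list[str], list[str]]:
        """Split conjectures into verified theorems and unproved-but-plausible puzzles."""
        verified, unproved = [], []
        for statement in conjectures:
            proof_body = prover.complete(prompt=f"{statement} := by\n")  # stand-in prover call
            if lean_accepts(f"{statement} := by\n{proof_body}"):
                verified.append(statement)
            else:
                unproved.append(statement)   # still useful as RL training puzzles
        return verified, unproved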

The Pipeline - Reinforcement Learning

  • GRPO (Group Relative Policy Optimization):
    • Reward = 1 if proof found, 0 otherwise.
    • Fine-tune DeepSeek-Prover-V2 on the conjectures.
    • 10 epochs
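  • A toy sketch of the group-relative advantage at the heart of GRPO, using the binary reward above (generic formula, not the authors’ training code):

    import numpy as np

    def grpo_advantages(rewards: list[float]) -> np.ndarray:
        """Normalize each rollout's reward against the mean/std of its group
        (all proof attempts sampled for the same conjecture)."""
        r = np.asarray(rewards, dtype=float)
        return (r - r.mean()) / (r.std() + 1e-6)

    # Example: 8 proof attempts for one conjecture; Lean accepted 2 of them.
    rewards = [1, 0, 0, 1, 0, 0, 0, 0]        # reward = 1 if proof found, else 0
    print(grpo_advantages(rewards))           # positive for successes, negative otherwise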

Past works

  • STP (Dong & Ma): self-play loop in which prover and conjecturer train each other. Eliminates the human-data bottleneck; proves 26.3% of LeanWorkbook vs. 13.2% baseline.
  • Safe (Liu et al.): uses Lean to verify every single reasoning step in natural-language CoT, not just the final answer. Step-level formal verification reduces hallucinations; introduces the 30k “FormalStep” benchmark.
  • DeepSeek-Prover-V2 (Ren et al.): 7B model + RL + Monte-Carlo tree search; SOTA on MiniF2F (61.7%). Shows the prover side that LeanConjecturer later exploits.
  • PutnamBench (Tsoukalas et al.): 644 undergraduate competition problems formalized in Lean; a stress test for new provers and conjecturers.
  • LeanNavigator (Yin & Gao): generates 4.7M new theorems by state-graph exploration (breadth-first search through tactic states). Massive synthetic data; complementary to LLM-based conjecturing.
  • Comparison of approaches (data source / strength / weakness):
    • Autoformalization (Ying et al.): natural-language Olympiad problems / cheap, huge / limited to textbook-style problems
    • State-graph exploration (LeanNavigator): existing Lean proofs / scales to millions / mostly shallow variants
    • Self-play (STP): self-generated / curriculum adapts / risk of mode collapse

Issues

  • Reliance on DeepSeek-Prover to verify conjectures; a false positive slips through occasionally.
  • Early iterations cluster around the topics of the seed files
  • 16 GPU-hours per GRPO epoch; could probably be sped up using vLLM rollouts

Citations & Further Reading

  • Images created with Gemini
  • [1] “LeanConjecturer: Automatic Generation of Mathematical Conjectures for Theorem Proving,” arXiv:2506.22005, 2025. https://arxiv.org/abs/2506.22005v1

Thank You for Reading!