logo
Loading...
Solving olympiad geometry without human demonstrations

Mathematics

Solving olympiad geometry without human demonstrations

T. H. Trinh, Y. Wu, et al.

Discover how AlphaGeometry, developed by Trieu H. Trinh, Yuhuai Wu, Quoc V. Le, He He, and Thang Luong, is revolutionizing the challenge of proving olympiad-level geometry theorems by utilizing millions of synthesized theorems and proofs, pushing the boundaries of mathematical intelligence.... show more
Introduction

The paper addresses the challenge of automated theorem proving at the olympiad level, focusing on Euclidean plane geometry. The central research question is how to achieve human-level (or near human-level) performance in geometry theorem proving without relying on scarce human-annotated, machine-verifiable proofs. The context highlights that translating human proofs into formal languages is costly and particularly difficult for geometry, creating a data bottleneck that has limited learning-based approaches. Existing geometry-specific formal languages are narrow and cannot capture many human proof techniques (e.g., complex numbers), while general-purpose proof assistants have limited geometry content due to translation challenges. As a result, most prior geometry provers rely on symbolic methods and human-crafted heuristics. The purpose of this study is to overcome data scarcity by synthesizing large-scale training data (theorems and proofs) and to combine learned guidance with a strong symbolic engine to navigate the infinite branching introduced by auxiliary constructions. The importance lies in demonstrating a scalable, neuro-symbolic approach that approaches IMO gold-medallist performance, outputs human-readable proofs, and generalizes problem statements, marking a milestone in automated reasoning.

Literature Review

The paper situates its contribution within several strands of prior work: (1) Formal mathematics with neural theorem proving has advanced using human proof corpora in general-purpose systems (e.g., Lean/Isabelle), but geometry lags due to domain-specific translation difficulty and scarcity of formal proofs. (2) Geometry provers historically include computer algebra methods (Wu’s method, Gröbner bases) that can decide truth but often lack human-readable proofs and can be computationally expensive on IMO-scale problems. (3) Synthetic or axiomatic search methods use geometric axioms and human-designed heuristics (e.g., Chou et al.), producing interpretable proofs but lacking guarantees and struggling with exogenous term generation (auxiliary constructions). (4) Large language models (e.g., GPT-4) show general reasoning ability but perform poorly on generating correct, full natural-language geometry proofs and have issues interfacing reliably with symbolic engines. The paper leverages insights from these areas—combining symbolic deduction with learned guidance—while avoiding reliance on human demonstrations by generating synthetic theorems and proofs at scale.

Methodology

The approach, AlphaGeometry, is a neuro-symbolic theorem prover for Euclidean plane geometry composed of: (1) Large-scale synthetic data generation; (2) A combined symbolic engine for deduction; (3) A language model trained to propose auxiliary constructions; and (4) An interleaved proof-search loop.

  1. Synthetic data generation: The system samples nearly one billion random sets of theorem premises using a diverse set of geometric construction actions (including composite actions like taking centroids or excentres). For each sampled set, a symbolic deduction engine computes a forward-deduction closure, producing a directed acyclic graph (DAG) of reachable conclusions. A traceback algorithm identifies, for each conclusion node N, the minimal dependency subgraph G(N) and the minimal set of necessary premises P, yielding a synthetic example (premises P, conclusion N, proof G(N)). This process discovers over 100 million unique theorems and proofs without using human-designed problem sets.

  2. Symbolic engine (DD + AR): The core deduction engine is a deductive database (DD) of Horn-clause geometric rules (e.g., equal segments, collinearity) extended with algebraic rules (AR) for angle, ratio, and distance chasing. The forward deduction plus traceback (for both DD and AR) constitutes a state-of-the-art symbolic reasoning backbone, enabling efficient pure-deduction proofs and extraction of minimal proof dependencies.

  3. Dependency difference and auxiliary constructions: To move beyond what pure symbolic deduction can reach, the method identifies auxiliary construction steps as the dependency difference—objects and constructions appearing in the premises but independent of the conclusion statement’s core objects. These exogenous terms are reclassified as proof steps to be generated by a model. This targets geometry’s key challenge: infinite branching from auxiliary point constructions, traditionally handled via hand-crafted templates and heuristics.

  4. Language model training: The triples (P, N, G(N)) are serialized as text strings in the format “.” A Transformer language model is trained via next-token prediction to generate proofs conditioned on premises and conclusions. Pretraining uses all ~100M synthetic proofs (including pure-deduction ones). Fine-tuning focuses on the ~9% subset (~9M proofs) requiring auxiliary constructions, sharpening the model’s ability to propose effective exogenous constructions during search.

  5. Proof-search loop and decoding: During solving, the symbolic engine runs first to exhaust pure-deduction consequences. If the conclusion is not derived, the language model proposes one auxiliary construction per turn (e.g., “construct point X so that ABCX is a parallelogram”), after which the symbolic engine recomputes the closure. This loop repeats until the conclusion is found or a maximum number of iterations is reached. Beam search explores the top-k model proposals, and the procedure is parallelized across beams/workers to increase search coverage.

Benchmarks and setup: The team adapted 30 IMO geometry problems since 2000 that fit the classical geometry environment (excluding geometric inequalities and combinatorial geometry), forming the IMO-AG-30 benchmark. Comparisons include computer algebra methods (Wu’s method, Gröbner basis), symbolic search baselines (DD; DD + human heuristics; full-angle method), LLM-based prompting (GPT-4), and ablations (without pretraining, without fine-tuning).

Key Findings
  • Performance on IMO-AG-30: AlphaGeometry solves 25/30 problems, surpassing the previous state of the art Wu’s method (10/30) and the strongest symbolic baseline DD + AR + human-designed heuristics (18/30). Ablations show 21/30 without pretraining and 23/30 without fine-tuning.
  • Human-level comparison: Approaches the performance of an average IMO gold medallist; solved all geometry problems in the IMO 2000 and 2015 sets. Expert evaluation (USA IMO team coach) recommended full scores for AlphaGeometry’s 2000 and 2015 geometry solutions, exceeding the medal threshold (14/42) for those years.
  • Generalization discovery: Identified an unused premise in translated IMO 2004 P1 via traceback, yielding a more general theorem in which the midpoint condition on point O is unnecessary for collinearity of P, B, C.
  • Synthetic data scale and properties: Generated over 100M unique theorems and proofs; 9% of synthetic proofs require auxiliary constructions; some proofs exceed the length of the hardest IMO test-set problems (max synthetic proof length 247 with two auxiliary constructions). Many synthetic premises are not aesthetically symmetric, suggesting broader coverage than human-discovered theorems.
  • Contribution of components: Adding AR to DD increases solved problems from 7 to 14; adding the language model’s auxiliary constructions lifts performance by 11 more to 25. Pretraining on pure-deduction proofs improves the LM’s auxiliary construction success; fine-tuning on auxiliary-construction proofs further improves performance.
  • Efficiency/robustness: With only 20% of training data, the system still solves 21 problems; with <2% of the search budget (beam size 8 vs 512), it also solves 21. On a larger mixed benchmark of 231 geometry problems (textbook, regional olympiads, famous theorems), AlphaGeometry solves ~98.7%, Wu’s method 75%, and DD + AR + human heuristics 92.2% (per extended data).
Discussion

The study demonstrates that large-scale synthetic data combined with a strong symbolic engine and a trained language model can overcome the data bottleneck that has limited learning-based geometry theorem proving. By identifying and learning the dependency difference, AlphaGeometry tackles the core challenge of exogenous term generation (auxiliary constructions), turning an infinite-branching search into a guided process. The pretraining on pure symbolic deductions allows the language model to align its proposals with the symbolic engine’s capabilities, improving the relevance and efficacy of auxiliary constructions during search. Empirically, the system not only outperforms strong computer algebra and heuristic-driven symbolic baselines but also produces human-readable proofs and can generalize problem statements by pruning unnecessary premises via traceback. The correlation between harder human problems and longer AlphaGeometry proofs in select cases suggests that proof length partially reflects intrinsic difficulty. Collectively, these results validate a neuro-symbolic paradigm that reduces reliance on human demonstrations, provides interpretability, and offers a framework potentially transferable to other mathematical domains with similar data-scarcity and exogenous-term challenges.

Conclusion

AlphaGeometry introduces a neuro-symbolic framework for olympiad-level Euclidean plane geometry that sidesteps human-annotated proof corpora by generating over 100 million synthetic theorems and proofs, training a language model to propose auxiliary constructions, and coupling it with an advanced symbolic engine (DD + AR). It achieves near gold-medallist performance on a curated IMO-derived benchmark (25/30), produces human-readable proofs, solves the 2000 and 2015 geometry IMO sets under expert evaluation, and even discovers a generalization of a translated IMO problem. The work provides a general guiding framework for other mathematical domains facing data scarcity and exogenous-term generation challenges. Future directions suggested by the paper include expanding applicability to broader domains of mathematics, improving the breadth and aesthetic realism of synthetic data, enhancing proof readability, and studying time- and compute-constrained settings for more human-comparable evaluations.

Limitations
  • Domain scope: Focused on classical Euclidean plane geometry; excludes geometric inequalities and combinatorial geometry, which cannot be represented in the chosen environment.
  • Translation and benchmark constraints: Problems are adapted to a specialized geometry environment rather than solved directly from natural-language statements; comparisons to human performance are approximate.
  • Proof toolkit granularity: AlphaGeometry operates with a lower-level proving toolkit than humans, which can limit coverage of synthetic data, test-time performance, and proof readability.
  • No formal guarantees: As a search-based method, it lacks the universal decision guarantees of computer algebra methods (though it outperforms them empirically on the benchmark).
  • Data bias and coverage: Despite the vast synthetic set, it remains a subset of possible geometry theorems and may lack human-style symmetry or higher-level constructs used in some human proofs.
Listen, Learn & Level Up
Over 10,000 hours of research content in 25+ fields, available in 12+ languages.
No more digging through PDFs, just hit play and absorb the world's latest research in your language, on your time.
listen to research audio papers with researchbunny