AI systems may now be able to contribute novel insights on real unsolved mathematical problems, but measuring this requires better benchmarks. HorizonMath provides one by focusing on problems where verification is cheap but discovery is genuinely hard.
HorizonMath is a benchmark of 100+ unsolved math problems across 8 domains, designed to test whether AI can make genuine mathematical discoveries. Unlike existing benchmarks, it selects problems that are hard to solve but easy to verify automatically; and because the problems are unsolved, their solutions cannot already appear in training data, which sidesteps data contamination. In early results, GPT-5.4 Pro found solutions to two problems that may improve on published results.
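To make the "hard to discover, cheap to verify" asymmetry concrete, here is a minimal sketch in Python. The example problem (finding a large set of grid points with no three collinear, beating a known record) and all names in it (`no_three_collinear`, `verify`, `KNOWN_RECORD`) are illustrative assumptions, not HorizonMath's actual problems or interface.

```python
from itertools import combinations

# Hypothetical record size for the example problem; not a real benchmark value.
KNOWN_RECORD = 10

def no_three_collinear(points):
    """Check that no three points in the candidate set are collinear."""
    for (x1, y1), (x2, y2), (x3, y3) in combinations(points, 3):
        # Three points are collinear iff the triangle they span has zero area
        # (cross product of the two edge vectors is zero).
        if (x2 - x1) * (y3 - y1) == (x3 - x1) * (y2 - y1):
            return False
    return True

def verify(candidate):
    """A candidate 'improves on published results' if it is a valid
    configuration strictly larger than the known record."""
    return no_three_collinear(candidate) and len(candidate) > KNOWN_RECORD
```

The point is the asymmetry: `verify` runs in polynomial time on any submitted candidate, while producing a record-beating candidate is the open problem, so a correct submission is strong evidence of genuine discovery rather than memorization.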