AI systems may now be able to contribute novel insights on real unsolved mathematical problems, but measuring this requires better benchmarks. HorizonMath provides one by focusing on problems where verification is cheap but discovery is genuinely hard.
HorizonMath is a benchmark of 100+ unsolved math problems across 8 domains, designed to test whether AI can make genuine mathematical discoveries. Unlike existing benchmarks, it selects problems that are hard to solve but easy to verify automatically; and because the problems are unsolved, their solutions cannot already appear in training data, which sidesteps data contamination. In early results, GPT-5.4 Pro found solutions to two problems that may improve on published results.
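To make the "hard to discover, cheap to verify" asymmetry concrete, here is a minimal sketch in Python. The example problem (finding a large set of grid points with no three collinear, beating a known record) and all names in it (`no_three_collinear`, `verify`, `KNOWN_RECORD`) are illustrative assumptions, not HorizonMath's actual problems or interface.

```python
from itertools import combinations

# Hypothetical record size for the example problem; not a real benchmark value.
KNOWN_RECORD = 10

def no_three_collinear(points):
    """Check that no three points in the candidate set are collinear."""
    for (x1, y1), (x2, y2), (x3, y3) in combinations(points, 3):
        # Three points are collinear iff the triangle they span has zero area
        # (cross product of the two edge vectors is zero).
        if (x2 - x1) * (y3 - y1) == (x3 - x1) * (y2 - y1):
            return False
    return True

def verify(candidate):
    """A candidate 'improves on published results' if it is a valid
    configuration strictly larger than the known record."""
    return no_three_collinear(candidate) and len(candidate) > KNOWN_RECORD
```

The point is the asymmetry: `verify` runs in polynomial time on any submitted candidate, while producing a record-beating candidate is the open problem, so a correct submission is strong evidence of genuine discovery rather than memorization.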