Epoch AI, in collaboration with over 60 mathematicians from leading institutions worldwide, has introduced FrontierMath, a new benchmark designed to evaluate AI systems' capabilities in advanced mathematical reasoning.
Epoch AI's benchmark development team includes distinguished mathematicians, among them 14 International Mathematical Olympiad (IMO) gold medalists and a Fields Medal recipient. The new benchmark reveals a significant gap between current AI capabilities and expert-level mathematical problem-solving: even the most advanced models solve less than 2% of the problems.
FrontierMath features hundreds of original, exceptionally challenging mathematics problems that span most major branches of modern mathematics—from computationally intensive problems in number theory and real analysis to abstract questions in algebraic geometry and category theory.
Source: Network visualization showing mathematical subjects in combination with individual FrontierMath problems
The release comes at a time when existing mathematical benchmarks such as the MATH dataset and GSM8K are approaching saturation, with top AI models achieving near-perfect scores. These earlier benchmarks, focused on high-school and early undergraduate mathematics, no longer provide meaningful differentiation between advanced AI systems.
FrontierMath addresses two critical challenges in AI evaluation: the saturation of existing mathematics benchmarks and the risk of data contamination. By using entirely new, unpublished problems with automated verification, the benchmark aims to ensure that performance metrics genuinely reflect an AI system's mathematical reasoning capabilities rather than pattern matching against training data.
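As a rough illustration of what automated verification of this kind can look like, the Python sketch below checks a model's final answer against a reference value by exact symbolic comparison. The verify_answer helper and the sample answer are hypothetical and are not taken from Epoch AI's actual harness.

```python
# Minimal sketch of automated answer checking for a FrontierMath-style problem.
# The verify_answer helper and the reference value below are hypothetical
# illustrations, not Epoch AI's actual verification code.
import sympy as sp

def verify_answer(submitted: str, reference: sp.Expr) -> bool:
    """Return True only if the submitted answer equals the reference exactly.

    Answers are parsed symbolically, so algebraically equivalent forms
    (e.g. "2**61 - 1" and "2305843009213693951") pass, while numerical
    approximations do not.
    """
    try:
        candidate = sp.sympify(submitted)
    except (sp.SympifyError, SyntaxError):
        return False  # unparseable model output counts as incorrect
    return sp.simplify(candidate - reference) == 0

# Toy example: a problem whose expected answer is a single large integer.
reference_answer = sp.Integer(2) ** 61 - 1

print(verify_answer("2305843009213693951", reference_answer))  # True
print(verify_answer("2.31e18", reference_answer))              # False
```

Because the comparison is exact rather than approximate, a rounded estimate or lucky numerical guess does not count as a solution.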
The benchmark's development process emphasizes rigorous quality control through a multi-stage review system that verifies problem correctness, checks for ambiguities, assesses guessproofness, and validates difficulty ratings.
Each problem in the benchmark requires multiple hours of effort from researchers in the relevant branch of mathematics, with some problems demanding several days to solve.
Recent evaluations of leading AI models on FrontierMath have yielded revealing results. Tests included major models such as OpenAI's o1-preview, o1-mini, and GPT-4o, Anthropic's Claude 3.5 Sonnet, xAI's Grok 2 Beta, and Google DeepMind's Gemini 1.5 Pro 002. No model achieved even a 2% success rate on the full benchmark, highlighting the substantial gap between current AI capabilities and expert-level mathematical problem-solving.
Source: Performance of leading language models on FrontierMath
Epoch AI claims the benchmark addresses several key challenges in AI evaluation through automated verification systems that enable efficient, reproducible assessment of both open and closed-source AI systems. However, it also has limitations. The focus on automatically verifiable numerical answers excludes proof-writing and open-ended exploration, which are significant aspects of modern mathematical research.
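For a sense of how such an automated, reproducible evaluation might be wired up against a closed-source model, the sketch below asks a model for a final integer answer and scores it by exact comparison. The prompt wording, model name, and toy problem are illustrative assumptions, not Epoch AI's published evaluation code.

```python
# Illustrative evaluation loop: query a model for a final integer answer and
# score it by exact symbolic comparison. The prompt, model name, and problem
# are placeholders; real FrontierMath problems are kept private.
from openai import OpenAI
import sympy as sp

client = OpenAI()  # reads OPENAI_API_KEY from the environment

problems = [
    # (statement, reference answer) -- a toy stand-in, not a benchmark problem
    ("What is the smallest Mersenne prime greater than 1000?", sp.Integer(8191)),
]

def is_correct(submitted: str, reference: sp.Expr) -> bool:
    try:
        return sp.simplify(sp.sympify(submitted) - reference) == 0
    except (sp.SympifyError, SyntaxError):
        return False

def ask_model(statement: str, model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "Solve the problem and reply with only the final integer."},
            {"role": "user", "content": statement},
        ],
    )
    return response.choices[0].message.content.strip()

solved = sum(is_correct(ask_model(stmt), ref) for stmt, ref in problems)
print(f"Solved {solved} of {len(problems)} problems")
```

Because scoring reduces to an exact answer check, the same loop works for any model exposed over an API, which is what makes the evaluation reproducible; it is also why proof-writing tasks, which have no single machine-checkable answer, fall outside the benchmark's scope.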
Andrej Karpathy, Eureka Labs founder, former senior director of AI at Tesla, and a founding member of OpenAI, contextualizes this development through the lens of historical AI challenges:
This is Moravec's paradox in disguise, who observed 30+ years ago that what is easy/hard for humans can be non-intuitively very different to what is easy/hard for computers.
He supports the creation of such benchmarks while noting the importance of evaluating AI systems on seemingly "easy" tasks that prove challenging for machines.
Jack Clark, co-founder of Anthropic, suggests that LLM skeptics might be surprised by AI capabilities:
I think if people who are true LLM skeptics spent 10 hours trying to get modern AI systems to do tasks that the skeptics are experts in they'd be genuinely shocked by how capable these things are.
A more cautionary perspective comes from the developer community. As one Hacker News user points out:
The problem with all benchmarks, one that we just don't know how to solve, is leakage. Systematically, LLMs are much better at benchmarks created before they were trained than after.
Mathematician Evan Chen highlights FrontierMath's unique approach:
The FrontierMath benchmark does something different from the IMO and Putnam... a problem in the FrontierMath benchmark should test for insight rather than just standard techniques or knowledge.
Researchers and organizations interested in evaluating their models against the FrontierMath benchmark can contact math_evals@epochai.org for access. For more on Hugging Face Open LLM Leaderboard v2 benchmarking, check out this InfoQ article.