AI Just Cracked Math Problems Experts Thought Were Decades Away – Here’s Why It Matters

Remember when calculators were controversial in schools because people worried we’d forget how to do basic math? Well, hold onto your protractors, because Google DeepMind just dropped something that makes that debate look quaint.

Their new AI co-mathematician just scored 48% on FrontierMath Tier 4, a benchmark so brutally difficult that when it launched in late 2024, the best AI models could barely scratch 2% of the problems. We’re not talking about high school algebra here. These are research-level mathematics problems that would stump most PhD mathematicians and take experts hours or even days to solve.

To put this in perspective: some of these problems were expected to remain unsolvable by AI for decades. Not years. Decades. And yet, here we are in May 2026, watching an AI system casually solve nearly half of them.

But here’s what makes this truly wild: this isn’t just about getting the right answer. One mathematician used this system to solve a 60-year-old open problem that’s been sitting in a famous collection of unsolved questions since 1965. That’s not incremental progress. That’s a leap.

What Exactly Is FrontierMath, and Why Should You Care?

Before we dive into what this AI can do, let’s talk about what it’s being tested on, because FrontierMath is not your average benchmark.

Most AI math benchmarks are essentially glorified standardized tests. GSM8K tests grade-school math. The MATH dataset covers high school and early undergraduate material. These were useful for a while, but by 2024, top AI models were scoring near-perfect on them. Benchmarks were saturating faster than a sponge in a rainstorm.

So Epoch AI, a nonprofit research organization, decided to build something different. Something that wouldn’t become obsolete six months after release. They gathered expert mathematicians (we’re talking IMO gold medalists and Fields Medal recipients) and asked them to create original, unpublished problems across all major branches of modern mathematics.

The result? FrontierMath.

Here’s what makes it genuinely hard:

Original problems only. Every question is brand new and unpublished. This eliminates the risk of “data contamination”—the AI having seen the answer during training. When a model solves a FrontierMath problem, it’s genuinely solving it, not regurgitating a memorized solution.

Research-level difficulty. These aren’t textbook exercises with worked examples. A typical problem requires multiple hours of concentrated effort from an expert researcher in that specific field of mathematics. The hardest ones can take days.

Automated verification. Answers can be checked programmatically, which means evaluation is objective and scalable. No subjective grading, no ambiguity. (A toy sketch of this kind of checking follows this list.)

Covers the full spectrum of modern math. Number theory, algebraic geometry, category theory, real analysis, computational problems: if it’s in the 2020 Mathematics Subject Classification, it’s probably in FrontierMath.
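
To make “checked programmatically” concrete, here’s the toy sketch promised above. The real FrontierMath grading harness isn’t public, so the problem, numbers, and function names below are purely illustrative. The point is that each problem is crafted so its answer is a single exact object, and grading is a strict equality check:

```python
# Illustrative sketch only -- the actual FrontierMath harness is not public.
# Each problem has one exact, machine-comparable answer, so grading is a
# strict equality check rather than human judgment.
from sympy import divisor_sigma

def ground_truth() -> int:
    # Toy stand-in for a problem whose answer is a single exact integer,
    # e.g. "compute the sum of divisors of a specific N".
    return int(divisor_sigma(123456))

def grade(submission: str) -> bool:
    # Strict equality: an answer that is merely "close" still scores zero.
    try:
        return int(submission) == ground_truth()
    except ValueError:
        return False

print(grade(str(ground_truth())))  # True: exact match
print(grade("42"))                 # False: anything else fails
```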

The benchmark has four tiers of difficulty, with Tier 4 being the absolute pinnacle. To give you a sense of the progression: Tiers 1-3 problems are roughly advanced undergraduate through graduate level. Tier 4? That’s early postdoc-level mathematics and beyond.

When FrontierMath launched, state-of-the-art AI models were solving under 2% of problems. By late 2025, the best models had climbed to around 30-40% on Tiers 1-3. But Tier 4 remained stubbornly difficult.

Until now.

The 48% That Changed Everything

Google DeepMind’s AI co-mathematician just scored 48% on FrontierMath Tier 4. That’s 23 problems correctly solved out of 48 non-public questions.

To understand why this is significant, consider what happened before this:

  • The underlying Gemini 3.1 Pro base model (the foundation this system is built on) scored just 19% on the same test
  • GPT-5.5 Pro, the previous leader, scored 39.6%
  • GPT-5.4 Pro managed 37.5%
  • Claude Opus 4.6 and 4.7 scored 22.9%

The AI co-mathematician didn’t just edge past the competition. It blew past them.

But here’s the really interesting part: this isn’t about building a bigger, more powerful base model. The jump from 19% to 48% isn’t because Google threw more computing power at the problem or trained on more data. It’s about architecture.

It’s Not a Chatbot, It’s a Research Workspace

The fundamental insight behind the AI co-mathematician is that mathematics research isn’t a one-shot process. You don’t sit down, stare at a problem, and instantly produce a perfect proof. Real mathematical research is messy, iterative, and collaborative.

You explore dead ends. You try different approaches in parallel. You have colleagues review your work and spot errors. You search through literature for relevant theorems. You write code to test conjectures computationally. You refine ideas over days or weeks, building up a persistent context of what you’ve tried and what you’ve learned.

Traditional AI chatbots don’t work like this. You ask a question, it gives an answer, and the context resets. Even with extended context windows, the interaction model is fundamentally reactive rather than proactive.

The AI co-mathematician is different. It’s designed as a research workbench: a persistent workspace where mathematicians can collaborate with AI agents over extended periods.

Here’s how it actually works (a rough code sketch of the loop follows this list):

Parallel investigation branches. Instead of pursuing one approach and hoping it works, the system explores multiple solution strategies simultaneously. This mirrors how researchers often sketch out several potential proof approaches before committing to one.

Enforced review cycles. The system includes reviewer agents that critique intermediate work. These aren’t just rubber stamps; they actually catch errors and force revisions. In one documented case, a reviewer agent spotted a flaw in an AI-generated proof attempt, and the human mathematician realized he knew how to fix the gap. The collaboration was the point.

Literature access tools. The system can search through mathematical literature, retrieve relevant theorems, and incorporate them into proofs. This is crucial because higher-level mathematics is cumulative: you build on existing results rather than proving everything from scratch.

Persistent code execution infrastructure. Many modern math problems require computational exploration. The system can write code, run experiments, analyze results, and use those findings to inform theoretical work. And crucially, this computational context persists across the research session.

Long-horizon stateful workspace. The system maintains research context over days or weeks. It tracks failed hypotheses so you don’t repeat dead ends. It builds up a knowledge base of what approaches have been tried and what’s been learned. It’s not starting fresh with every interaction.
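
Here’s the rough sketch promised above. To be clear, this is my simplification, not DeepMind’s code (which hasn’t been released): call_model is a hypothetical stand-in for whatever LLM endpoint you’d use, and the reviewer sign-off convention is invented for illustration. But it shows how parallel branches, enforced review, and a persistent workspace fit together:

```python
# A deliberately simplified sketch of the workflow described above -- NOT
# DeepMind's implementation. `call_model` is a hypothetical stand-in for an
# LLM API; the "NO FLAWS FOUND" sign-off convention is likewise invented.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field

def call_model(role: str, prompt: str) -> str:
    """Hypothetical stand-in for an LLM call (prover, reviewer, ...)."""
    raise NotImplementedError

@dataclass
class Workspace:
    """Long-horizon state that persists instead of resetting each turn."""
    problem: str
    notes: list[str] = field(default_factory=list)      # lemmas, partial results
    dead_ends: list[str] = field(default_factory=list)  # approaches that failed

def explore_branch(ws: Workspace, strategy: str, max_rounds: int = 3) -> str | None:
    """One investigation branch: draft, get reviewed, revise, repeat."""
    context = f"Problem: {ws.problem}\nKnown dead ends: {ws.dead_ends}\nTry: {strategy}"
    draft = call_model("prover", context)
    for _ in range(max_rounds):
        critique = call_model("reviewer", f"Find flaws in this proof:\n{draft}")
        if "NO FLAWS FOUND" in critique:      # reviewer sign-off
            return draft
        draft = call_model("prover", f"Revise to address:\n{critique}\n\n{draft}")
    ws.dead_ends.append(strategy)  # remember what didn't work
    return None                    # bounded retries are one guard against "death spirals"

def research_step(ws: Workspace, strategies: list[str]) -> str | None:
    """Explore several strategies in parallel; keep the first that survives review."""
    with ThreadPoolExecutor() as pool:
        for proof in pool.map(lambda s: explore_branch(ws, s), strategies):
            if proof is not None:
                ws.notes.append(proof)
                return proof
    return None  # this is where a human collaborator would redirect the search
```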

According to the research paper, this architecture is explicitly modeled on what coding agents like Claude Code and Google Antigravity have done for software development: providing the scaffolding that lets AI work autonomously over long timescales while remaining steerable by humans.

The authors argue that mathematics has lacked an equivalent collaborative agent framework, and the AI co-mathematician is an attempt to fill that gap.

The 60-Year-Old Problem That Finally Got Solved

Numbers on a leaderboard are one thing. Solving actual open problems is another.

Marc Lackenby, a mathematician at Oxford University, used the AI co-mathematician to resolve Problem 21.10 from the Kourovka Notebook.

If you’re not familiar with the Kourovka Notebook, here’s the background: It’s a collection of unsolved problems in group theory that’s been continuously published since 1965. The first edition was proposed at a symposium in Kourovka, a small village near Sverdlovsk (now Yekaterinburg) in Russia. Every 2-4 years, a new edition comes out with fresh problems and updates on which ones have been solved.

The notebook has become legendary in the mathematics community. More than three-quarters of the problems from the first two editions have now been solved, often leading to significant advances in group theory.

Problem 21.10 had been sitting unsolved for 60 years. And then Lackenby decided to try the AI co-mathematician on it.

Here’s what happened: The AI produced a proof attempt. The reviewer agent spotted a flaw. Instead of the system grinding to a halt, this triggered something interesting: Lackenby realized he knew how to fill the gap the AI couldn’t bridge.

The back-and-forth collaboration led to a complete solution.

This is the vision the developers are going for. Not an “oracle that delivers a solution in one shot,” but a tool that holds research context, works through problems systematically, and collaborates with human mathematicians over extended periods.

Other mathematicians who tested the system reported similar experiences:

Gergely Bérczi used it to obtain claimed proofs for conjectures about Stirling coefficients for symmetric power representations.

Semon Rezchikov posed a technical subproblem in Hamiltonian systems and received a key lemma that “withstood careful checking.” Notably, Rezchikov mentioned that other AI systems had failed to produce anything useful on the same problem.

These aren’t just benchmark victories. They’re real research outcomes.

Why This Isn’t Just About Mathematics

If you’re reading this and thinking, “Okay, cool, but I’m not a mathematician, so why should I care?”, that’s a fair question. Let me tell you why this matters beyond pure mathematics.

Mathematics is a uniquely good testbed for AI reasoning. Unlike natural language or image recognition, math requires precise, logical thinking over extended chains of reasoning. A single error anywhere in a multi-step proof renders the entire solution incorrect. There’s no room for “close enough” or plausible-sounding nonsense.

This makes math an ideal domain for evaluating whether AI systems are actually reasoning or just pattern-matching their way to superficially correct answers.
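
Proof assistants make this all-or-nothing property tangible. The toy Lean 4 proof below is my own illustration (nothing from the paper): every step is checked mechanically, and swapping any step for a plausible-but-wrong one doesn’t lose points, it fails to compile.

```lean
-- Every step is machine-checked: if any case were wrong or missing,
-- Lean would reject the whole proof rather than award partial credit.
theorem zero_add' : ∀ n : Nat, 0 + n = n
  | 0     => rfl                              -- base case: holds by computation
  | n + 1 => congrArg Nat.succ (zero_add' n)  -- inductive step
```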

When an AI can solve research-level math problems, it’s demonstrating capabilities that transfer to other domains requiring rigorous logical reasoning: formal verification of software, scientific hypothesis generation, complex system design, strategic planning.

The architecture patterns matter more than the math. The real innovation here isn’t specifically about mathematics; it’s about building AI systems that can work on hard problems over long time horizons while remaining collaborative and steerable.

The same architectural principles that make the AI co-mathematician effective (parallel exploration, enforced review cycles, persistent context, tool use) are already being applied to software development, scientific research, and business strategy.

If you’re in any field that requires sustained cognitive effort on complex problems, the patterns demonstrated here are coming for your domain soon.

It changes the bottleneck in knowledge work. For decades, the constraint on mathematical progress has been human bandwidth. There are only so many mathematicians, they can only work on so many problems at once, and each problem takes substantial time to solve.

If AI systems can serve as force multipliers, not replacing mathematicians but enabling each one to explore more approaches, test more conjectures, and pursue more research directions simultaneously, then the pace of mathematical discovery could accelerate dramatically.

And mathematics is the foundation for physics, computer science, engineering, cryptography, economics, and dozens of other fields. Faster mathematical progress means faster progress everywhere.

The Limitations Nobody’s Talking About (But Should)

Now, before we get carried away imagining AI-powered mathematical utopias, let’s talk about what this system still can’t do.

It’s not autonomous. Despite the “co-mathematician” branding, this system still requires human direction and intervention. It’s not going out and independently discovering new theorems or formulating novel research questions. A mathematician still needs to pose the problem, guide the investigation, and validate the results.

Reviewer agents can be wrong. The enforced review cycles are powerful, but they’re not infallible. Sometimes reviewer agents converge on plausible but incorrect reasoning. In other cases, they fall into what the developers call a “death spiral”: endless revision with no exit strategy.

Access is extremely limited. This isn’t something you can sign up for and start using. As of now, access is restricted to a small group of testers. There’s no public release timeline.

Elegance is subjective. Mathematicians care about more than just correct answers. They care about elegant proofs, insights that generalize to other problems, and solutions that illuminate the underlying structure of mathematical objects. AI-generated proofs, while correct, don’t always exhibit these qualities. As one tester noted, the “elegance” of AI-generated proofs remains subjective and often lacking.

Cost and efficiency. Each problem attempt uses a “broadly comparable number of model and tool calls to a long AI-assisted software engineering session.” The system was designed to be efficient enough to serve multiple external users, but it’s still computationally intensive. This isn’t something you can casually run on consumer hardware.

What Comes Next?

The pace of progress here is genuinely startling. Let me give you some context:

  • 2021: GPT-3 couldn’t solve more than 35% of grade school math problems
  • Early 2024: State-of-the-art models were solving over 50% of high school math problems
  • November 2024: FrontierMath launches; best models solve under 2% of problems
  • Late 2025: Best models reach 30-40% on FrontierMath Tiers 1-3
  • May 2026: AI co-mathematician scores 48% on Tier 4

We went from grade school math to research-level mathematics in five years. And the trajectory shows no signs of flattening.

But what happens when these benchmarks saturate? Epoch AI already had to add Tier 4 problems because AI capabilities were advancing faster than expected. They also maintain a collection of genuinely open problems that remain unsolved by human mathematicians.

The next frontier isn’t just solving harder problems; it’s AI systems that can formulate novel research questions, identify promising areas of investigation, and make genuine mathematical discoveries without human prompting.

We’re not there yet. But the gap is closing faster than most experts predicted.

The Philosophical Question Lurking in the Background

Here’s something that keeps me up at night: What does it mean when AI systems can solve problems that most human experts can’t?

Mathematics has always been considered one of the highest expressions of human intelligence. It requires creativity, intuition, the ability to recognize deep patterns, and the capacity for rigorous logical reasoning.

If AI systems can do this (and not just do it, but do it better and faster than most humans), what does that say about the nature of intelligence? About the uniqueness of human cognition?

I don’t have answers. But I think we need to be asking these questions.

Some mathematicians are embracing AI as a powerful collaborative tool. Others are concerned about what happens to the field if AI becomes the dominant mode of mathematical discovery. Will human mathematicians become curators and interpreters of AI-generated results rather than primary discoverers?

Will students still learn to prove theorems by hand if AI can generate proofs more quickly and correctly? Or will we shift to teaching mathematical intuition and problem formulation, leaving the formal verification to machines?

These aren’t just abstract philosophical questions. They have real implications for education, research funding, career paths, and what it means to be a mathematician in the 21st century.

Why Businesses Should Pay Attention

If you’re a business leader reading this, you might be thinking, “This is fascinating, but how does it affect my quarterly earnings?”

Let me connect the dots.

The same AI reasoning capabilities demonstrated on FrontierMath apply to:

Complex optimization problems. Supply chain optimization, resource allocation, scheduling: these are mathematically rigorous domains where better AI reasoning translates directly to cost savings and efficiency gains. (A toy example follows this list.)

Strategic planning. Business strategy requires reasoning through complex scenarios with multiple variables and constraints. AI systems that can handle research-level mathematical complexity can absolutely help with strategic planning at scale.

Risk modeling. Financial risk, operational risk, cybersecurity risk: accurate modeling requires sophisticated mathematical reasoning. Better AI math capabilities mean better risk models.

R&D acceleration. If you’re in pharmaceuticals, materials science, or any domain that relies on computational modeling and simulation, AI systems that can handle complex mathematical reasoning can dramatically accelerate discovery.
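
To ground the first item, here’s the toy example promised above: a miniature resource-allocation problem solved with SciPy. The numbers are invented for illustration; the point is that “rigorous reasoning under constraints” is exactly the shape of many everyday business problems.

```python
# Toy product-mix problem: choose quantities of two products to maximize
# profit under machine-hour and labor-hour limits. SciPy's linprog
# minimizes, so the objective is negated.
from scipy.optimize import linprog

res = linprog(
    c=[-40, -30],                   # profit per unit of each product (negated)
    A_ub=[[2, 1],                   # machine hours consumed per unit
          [1, 3]],                  # labor hours consumed per unit
    b_ub=[100, 90],                 # machine / labor hours available
    bounds=[(0, None), (0, None)],  # production can't be negative
)
print(res.x, -res.fun)              # optimal plan (42, 16) with profit 2160
```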

The tools might not be publicly available today, but the capabilities are being demonstrated. And in technology, the gap between “demonstrated in research” and “available as a commercial product” is shrinking rapidly.

Companies that understand this trajectory and position themselves accordingly will have a significant advantage over those that dismiss it as academic curiosity.

The Timeline Nobody Expected

Let’s revisit those expert expectations for a moment.

When FrontierMath was designed, problems were deliberately crafted to be so difficult that some were expected to remain unsolvable by AI for decades. Not a few years. Decades.

The AI co-mathematician shattered that timeline in about 18 months.

What other timelines are we getting wrong?

The pattern we’re seeing across AI capabilities (in language, vision, reasoning, coding, and now mathematics) is that expert predictions about what AI can’t do tend to have a shelf life measured in months, not years.

This doesn’t mean AI will achieve artificial general intelligence tomorrow. Progress is uneven, and there are still plenty of tasks where AI struggles or fails completely.

But it does mean that anchoring your strategic planning to assumptions about what AI can’t do is increasingly risky. The ground is shifting faster than most planning cycles can adapt.

The Real Takeaway

Here’s what I think the 48% on FrontierMath Tier 4 actually means:

It’s not primarily a story about mathematics. It’s a story about AI systems that can tackle genuinely hard problems through sustained, systematic effort rather than one-shot pattern matching.

It’s about moving from AI as a tool for automating routine tasks to AI as a collaborator on frontier problems that push the boundaries of human knowledge.

It’s about architectural patterns (parallel exploration, review cycles, persistent context, tool use) that make AI systems more capable without necessarily making the underlying models bigger or more powerful.

And it’s about timelines accelerating in ways that consistently surprise even the experts building these systems.

Whether you’re a mathematician, a business leader, a researcher, or just someone trying to make sense of where technology is headed, this development matters.

Not because AI solved some hard math problems. But because of what it reveals about the trajectory we’re on and how quickly we’re moving along it.

The 60-year-old problem from the Kourovka Notebook got solved. How many other decades-old problems in science, engineering, medicine, and every other domain are about to fall?

We’re about to find out.

And honestly? I’m both excited and slightly terrified to see what happens next.

