The International Mathematical Olympiad (IMO) is arguably the leading mathematical problem-solving competition. Every year, high school students from around the world attempt six problems over two days, with four and a half hours for each set of three problems. Students whose scores cross a threshold, roughly corresponding to solving five of the six problems, obtain Gold medals, with Silver and Bronze medals for those crossing lower thresholds. The problems do not require advanced mathematical knowledge, but instead test mathematical creativity. They are always new, and care is taken to ensure that no similar problems exist online or in the literature.
The AI gold medallist
IMO 2025 had some unusual participants. Even before the Olympiad closed, OpenAI, the maker of ChatGPT, announced that an experimental reasoning model of theirs had solved the problems at the Gold medal level, following the same time limits as the human participants. Remarkably, this was not a model specifically trained or designed for the IMO, but a general-purpose model whose reasoning powers were good enough for an IMO Gold.
The OpenAI announcement raised some issues. Many felt that announcing an AI result while the IMO had not concluded overshadowed the achievements of the human participants. Also, the Gold medal score was assessed and awarded by former IMO medallists hired by OpenAI, and some disputed whether the grading was correct. However, a couple of days later came another announcement. Google-DeepMind had attempted the IMO officially, with an advanced version of Gemini Deep Think. Three days after the Olympiad, with the permission of the IMO organisers, they announced that they had obtained a score at the level of a Gold medal. The IMO president Prof. Gregor Dolinar stated, “We can confirm that Google DeepMind has reached the much-desired milestone, earning 35 out of a possible 42 points — a gold medal score. Their solutions were astonishing in many respects. IMO graders found them to be clear, precise and most of them easy to follow.”
Stages of development
Even as it became a popular sensation, ChatGPT was infamous both for hallucinations (making up facts) and for simple arithmetic mistakes. Both of these made solving even modest mathematical problems largely impossible.
The first advance that greatly reduced these errors, arriving a few months after the launch of ChatGPT, was the use of so-called agents. Specifically, models could now use web searches to gather accurate information, and Python interpreters to run programs that perform calculations and check reasoning through numerical experiments. These made the models dramatically more accurate, and good enough to solve moderately hard mathematical problems. However, since a single error invalidates a mathematical solution, they were still not accurate enough to reach IMO (or research) level.
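To give a flavour of the kind of numerical experiment such an agent might run (the check below is a made-up illustration, not taken from any of the systems mentioned), consider the classical claim that n² + n + 41 is prime for every whole number n. A few lines of Python let a model test the claim before trusting it:

```python
# Hypothetical example of an agent's quick numerical check:
# is n*n + n + 41 prime for all small n, or are there counterexamples?

def is_prime(m: int) -> bool:
    if m < 2:
        return False
    d = 2
    while d * d <= m:
        if m % d == 0:
            return False
        d += 1
    return True

failures = [n for n in range(100) if not is_prime(n * n + n + 41)]
print(failures[:3])  # first counterexamples found: [40, 41, 44]
```

Running this reports that the claim fails at n = 40, which is exactly the kind of quick sanity check that keeps a chain of reasoning from going astray.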
Greater accuracy can be obtained by pairing language models with formal proof systems such as the Lean prover, a piece of software that can understand and check proofs. Indeed, at IMO 2024 such a system from Google-DeepMind, called AlphaProof, obtained a Silver medal score (though it ran for two days, far beyond the human time limit).
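To give a sense of what “checking a proof” means here, the toy Lean snippet below (an illustrative example, unrelated to any IMO problem) states a simple fact and proves it by appealing to a library lemma; Lean accepts the theorem only if every step is verified.

```lean
-- Toy illustration: Lean refuses the proof unless every step checks.
-- Nat.add_comm is a lemma from Lean's standard library (Lean 4 syntax).
theorem toy_add_comm (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```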
Finally, a breakthrough came with the so-called reasoning models, such as o3 from OpenAI and Google-DeepMind’s Gemini-2.5-pro. These models are perhaps better described as internal monologue models. Before answering a complex question, they generate a monologue in which they consider approaches, carry them out, revisit their proposed solutions, and sometimes dither and start all over again, before finally giving a solution with which they are satisfied. It was such models, with some additional advances, that achieved Olympiad Gold medal scores.
Analogical reasoning and the combining of ingredients from different sources give language models some originality, but probably not enough for hard and novel problems. However, verification, whether through the internal consistency of reasoning models or, better still, through checking by the Lean prover, allows training by trying a large number of things and seeing what works, much as AI systems became chess champions starting from just the rules.
Such reinforcement learning has allowed recent models to go beyond their training data by creating their own synthetic data.
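A rough sketch of this idea, in deliberately simplified Python (the toy “model” and “verifier” below are stand-ins, not anything resembling the actual training pipelines of these labs): propose many candidate answers, keep only those that a checker certifies, and reuse the verified attempts as synthetic training data.

```python
# Highly simplified sketch of verification-filtered self-training.
# The "model" guesses answers to toy addition problems; the "verifier"
# plays the role of a strict checker such as a proof assistant.
import random

def propose(problem, n_attempts=50):
    """Stand-in for a model sampling many candidate answers."""
    a, b = problem
    return [a + b + random.randint(-3, 3) for _ in range(n_attempts)]

def verify(problem, answer):
    """Stand-in for a formal checker: accepts only correct answers."""
    a, b = problem
    return answer == a + b

problems = [(random.randint(0, 9), random.randint(0, 9)) for _ in range(20)]
synthetic_data = []
for p in problems:
    for ans in propose(p):
        if verify(p, ans):
            synthetic_data.append((p, ans))  # only verified attempts are kept
            break

print(f"collected {len(synthetic_data)} verified examples for further training")
```

The point of the sketch is only the loop structure: generation is cheap and fallible, verification is strict, and only verified attempts feed back into training.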
The implications
Olympiad problems, for both humans and AIs, are not ends in themselves but tests of mathematical problem-solving ability. There are other aspects of research besides problem-solving.
Growing anecdotal evidence suggests that AI systems have excellent capabilities in many of these too, such as suggesting approaches and related problems.
However, the crucial difference between problem-solving and research/development is scale. Research involves working for months or years without errors creeping in, and without wandering off in fruitless directions. As mentioned earlier, coupling models with the Lean prover can prevent errors, and indications are that it is only a matter of time before such coupling works at the scale of research.
In the meantime, these models can act as powerful collaborators with human researchers, greatly accelerating research and development in all areas involving mathematics. The era of the super-scientist is here.
Siddhartha Gadgil is a professor in the Department of Mathematics, IISc
Published – August 11, 2025 08:30 am IST