Since 2024, one word has been everywhere in product announcements: the new models “reason”. OpenAI names its line o1 and o3. Anthropic talks about “extended thinking”. Google has a Thinking mode for Gemini. DeepSeek has published R1. The idea: models that think before they respond.
Measurable performance really does improve on certain tasks. That’s the honest part of the pitch. The rest is worth unpacking.
Technical changes
The underlying technique is called chain-of-thought prompting. It was described in a 2022 Google Brain paper: instead of asking directly for the answer, you ask the model to write out the intermediate steps. Performance on mathematical and logical reasoning tasks increases significantly.
Providing step-by-step reasoning examples in the prompt improves performance on mathematical and common sense reasoning benchmarks dramatically, for models of a certain size and above.
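To make the difference concrete, here is a minimal sketch of the two prompting styles. The worked example, the function names and the "Q:/A:" layout are assumptions chosen for illustration; they are not taken from the paper or from any vendor's API.

```python
# Illustrative worked example; the wording is an assumption made for this sketch.
WORKED_EXAMPLE = {
    "question": "A shelf holds 3 boxes with 4 books each. 5 books are removed. How many books remain?",
    "steps": "3 boxes of 4 books is 12 books. 12 - 5 = 7.",
    "answer": "7",
}


def direct_prompt(question: str) -> str:
    """Ask for the answer only, with no intermediate steps."""
    return f"Q: {question}\nA:"


def chain_of_thought_prompt(question: str, example: dict = WORKED_EXAMPLE) -> str:
    """Prepend one worked example whose answer spells out its steps."""
    demo = (
        f"Q: {example['question']}\n"
        f"A: {example['steps']} The answer is {example['answer']}.\n\n"
    )
    return demo + f"Q: {question}\nA:"


if __name__ == "__main__":
    question = "A train has 6 cars with 20 seats each. 15 seats are empty. How many are taken?"
    print(direct_prompt(question))
    print("---")
    print(chain_of_thought_prompt(question))
```

The only difference between the two prompts is the demonstration: the second one shows the model what a step-by-step answer looks like before asking the real question.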
What o1, R1 and their equivalents do is internalize this chain of thought. Instead of you asking for it in your prompt, the model automatically generates it as intermediate tokens (often hidden from the user) before arriving at the final answer. These “thinking” tokens condition the context more effectively and help the model arrive at a more precise final answer.
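When the intermediate tokens are exposed, an application usually has to separate them from the final answer. The sketch below assumes the raw output wraps its reasoning in `<think>...</think>` tags, as some open-weights models do; other providers return the reasoning in a separate field or not at all, so treat the tag name and the `split_reasoning` helper as hypothetical.

```python
import re

# Assumed delimiter for the reasoning segment; adjust to whatever your model emits.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(raw: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a raw completion.

    Every <think> block is collected as reasoning; what remains is the answer.
    """
    reasoning = "\n".join(part.strip() for part in THINK_BLOCK.findall(raw))
    answer = THINK_BLOCK.sub("", raw).strip()
    return reasoning, answer


if __name__ == "__main__":
    raw_output = "<think>17 * 3 = 51, plus 4 is 55.</think>\nThe result is 55."
    reasoning, answer = split_reasoning(raw_output)
    print("hidden from the user:", reasoning)
    print("shown to the user:", answer)
```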
The improvement is real. On formal mathematical problems, code problems, structured logic puzzles, these models do better than their predecessors. MATH, AIME and other benchmarks bear witness to this.
What doesn’t change
The model still generates tokens, one after the other, based on probabilities. The chain of thought is a sequence of predicted tokens, not a trace of a cognitive process. The model has no internal representation of the problem. It has no hypotheses that it tests and abandons. It generates text that looks like reasoning because it has been trained on traces of human reasoning.
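A toy version of that generation loop makes the point visible. The probability table below is entirely made up, and it conditions on the previous token only, whereas a real model conditions on the whole context over a very large vocabulary. What the sketch does preserve is the mechanism: the "reasoning" tokens and the answer token come out of the same sampling loop.

```python
import random

# Hypothetical next-token probabilities, keyed by the previous token only.
NEXT_TOKEN_PROBS = {
    "<start>": {"First,": 0.7, "So": 0.3},
    "First,": {"add": 0.6, "check": 0.4},
    "So": {"add": 1.0},
    "add": {"2+2.": 1.0},
    "check": {"2+2.": 1.0},
    "2+2.": {"Answer:": 1.0},
    "Answer:": {"4": 0.9, "5": 0.1},
    "4": {"<end>": 1.0},
    "5": {"<end>": 1.0},
}


def generate(rng: random.Random, max_tokens: int = 20) -> list[str]:
    """Sample one token at a time until <end> or the token budget is reached."""
    tokens: list[str] = []
    current = "<start>"
    for _ in range(max_tokens):
        choices = NEXT_TOKEN_PROBS[current]
        current = rng.choices(list(choices), weights=list(choices.values()), k=1)[0]
        if current == "<end>":
            break
        tokens.append(current)
    return tokens


if __name__ == "__main__":
    rng = random.Random()
    for _ in range(5):
        print(" ".join(generate(rng)))
```

Run it a few times: some samples will end in "5", produced by exactly the same loop and with exactly the same fluency as the samples that end in "4".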
This distinction matters in contexts where out-of-distribution robustness is required. Reasoning models excel on problem types that are well represented in their training data. On structurally different problems, performance drops sharply. A true formal reasoning system (a theorem prover, for example) does not exhibit this behavior: it either proves the statement or fails, without hallucinating an incorrect proof.
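For contrast, here is what "proves or fails" looks like in a proof assistant. This is a minimal Lean 4 sketch chosen for illustration; it is not meant to represent how any particular theorem prover is used in practice.

```lean
-- Accepted: the checker verifies that both sides reduce to the same value.
example : 2 + 2 = 4 := rfl

-- Rejected if uncommented: no valid proof exists, so the checker reports an
-- error instead of producing a plausible-looking but invalid proof.
-- example : 2 + 2 = 5 := rfl
```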
DeepSeek-R1: a crack in the myth of opacity
In January 2025, DeepSeek released R1, an open-weights reasoning model that achieves performance comparable to OpenAI’s o1 on several benchmarks, at a fraction of the stated training cost.
DeepSeek-R1 achieves performance comparable to OpenAI-o1 on mathematical reasoning and code benchmarks, relying largely on reinforcement learning (without intensive human supervision) to train its reasoning behavior.
This moment is revealing on two levels. First, it shows that reasoning techniques are not exclusive to players with multi-billion budgets. Second, it shows that competition is global, and that claims of a durable technological lead are fragile.
What it changes (and doesn’t change) for your projects
If you’re evaluating models for analysis or problem-solving tasks, reasoning models are worth testing. On structured tasks (extraction with conditional logic, rule validation, code generation with tests), they often outperform standard models.
Limitations persist. These models cost more (more tokens are generated for the chain of thought). They are slower. They are no more reliable on facts (reasoning improves the structure of the answer, not the veracity of the facts used). And they can “reason” towards a false conclusion with as much confidence as towards a true one.
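The cost point is simple arithmetic. The sketch below assumes that reasoning tokens are billed at the output-token rate, and every price and token count in it is a made-up placeholder; check your provider's pricing page for the real figures.

```python
def estimate_cost_usd(
    input_tokens: int,
    answer_tokens: int,
    reasoning_tokens: int,
    price_in_per_million: float,
    price_out_per_million: float,
) -> float:
    """Estimate the cost of one request.

    Assumption: reasoning tokens are billed like ordinary output tokens.
    """
    billed_output = answer_tokens + reasoning_tokens
    total = input_tokens * price_in_per_million + billed_output * price_out_per_million
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical prices (USD per million tokens) and token counts.
    standard = estimate_cost_usd(1_000, 200, 0, 1.0, 4.0)
    with_reasoning = estimate_cost_usd(1_000, 200, 3_000, 1.0, 4.0)
    print(f"standard model:  ${standard:.4f} per request")
    print(f"reasoning model: ${with_reasoning:.4f} per request")
    print(f"ratio: {with_reasoning / standard:.1f}x")
```

Same prompt, same visible answer: under these assumptions the whole difference comes from tokens the user may never see.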
The rule remains the same: test on your own data, not on benchmarks published by vendors.
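A test like that does not need much infrastructure. Here is a minimal sketch of a comparison harness: `accuracy`, `normalize` and the stub model are hypothetical names, the cases are placeholders for your own data, and exact-match scoring is a simplifying assumption that will not suit every task.

```python
from typing import Callable, Iterable, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count as errors."""
    return " ".join(text.lower().split())


def accuracy(model: Callable[[str], str], cases: Iterable[Tuple[str, str]]) -> float:
    """Fraction of cases where the model's output matches the expected answer."""
    case_list = list(cases)
    if not case_list:
        raise ValueError("at least one test case is required")
    correct = sum(
        normalize(model(prompt)) == normalize(expected)
        for prompt, expected in case_list
    )
    return correct / len(case_list)


# Placeholder cases: replace with prompts and expected answers from your own data.
CASES = [
    ("Is invoice 1042 overdue? Answer yes or no.", "yes"),
    ("Is invoice 1043 overdue? Answer yes or no.", "no"),
    ("Is invoice 1044 overdue? Answer yes or no.", "no"),
]


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call; it always answers 'No'."""
    return "No"


if __name__ == "__main__":
    print(f"stub model accuracy: {accuracy(stub_model, CASES):.2f}")
```

In practice you would wrap each candidate model in a function with the same signature as `stub_model` and run them all over the same cases, recording latency and cost alongside accuracy.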