The scene is familiar. A conference room, a software vendor, a polished demo. The tool answers questions about your industry, rewrites a contract, summarises a twenty-page report in thirty seconds. And the vendor says, with the ease of someone who has said it a hundred times: “Our AI understands your business.”

A question follows that sentence, one rarely asked aloud: does it understand, or does it complete?

This is not a technical distinction reserved for engineers. It is the dividing line between an informed decision and a purchase made on an unexamined promise. Everything in this article rests on the answer.

The word doing all the selling

“Artificial intelligence” is a label coined in 1955 by John McCarthy, in the proposal for the 1956 Dartmouth workshop, to name a research programme. It sounded right, it covered a wide field, and it survived several decades in which reality failed to follow. Today, this label does something precise in the mind of whoever reads it: it installs an implicit equivalence between what these systems do and what a human mind does.

Look at the second word: artificial. Not in the sense of “fake” as opposed to “real”. In the etymological sense: made, built, produced by human craft. What is artificial is not born, not conscious, not thinking. An artefact, in Aristotle’s sense, is what does not exist in nature but comes from the hand of man. A large language model is an artefact. A statistically sophisticated prosthesis, not a mind.

The word “intelligence” does the selling. The word “artificial” contains the accurate description. We have learned to read the noun and forget the adjective.

What a model actually does

When you submit text to an LLM, here is what actually happens.

The text is cut into tokens, chunks that average roughly three-quarters of an English word each. For every token in its vocabulary, the model calculates the probability that this token comes next, given the current context. It then picks among the most probable tokens, with a small random variation to avoid mechanical repetition. It moves forward one token. It does this again. Until the response ends.
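The loop described above can be sketched in a few lines. This is a toy under stated assumptions, not how any real model is implemented: `TOY_LOGITS` is a hypothetical hand-written table that stands in for the neural network, and it only looks at the last token, where a real LLM conditions on the whole context. The names `softmax` and `generate` are illustrative.

```python
import math
import random

# Hypothetical toy "model": raw scores (logits) for the next token, keyed by
# the last token only. A real LLM computes such scores with a neural network
# conditioned on the entire context, not with a lookup table.
TOY_LOGITS = {
    "the": {"patient": 2.0, "cough": 1.0, "fever": 0.5, "<end>": -1.0},
    "patient": {"presents": 2.5, "has": 1.5, "<end>": -1.0},
    "presents": {"with": 3.0, "<end>": -1.0},
    "has": {"a": 2.0, "fever": 1.0, "<end>": -1.0},
    "with": {"a": 2.0, "fever": 1.5, "<end>": -1.0},
    "a": {"cough": 2.0, "fever": 1.8, "<end>": -1.0},
    "cough": {"<end>": 2.0, "with": 0.5},
    "fever": {"<end>": 2.0, "with": 0.5},
}


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens them."""
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def generate(prompt, temperature=0.8, max_tokens=12, seed=None):
    """Autoregressive loop: score, sample one token, append, repeat."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_tokens):
        probs = softmax(TOY_LOGITS[tokens[-1]], temperature)
        choices, weights = zip(*probs.items())
        next_token = rng.choices(choices, weights=weights, k=1)[0]
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return tokens


if __name__ == "__main__":
    print(" ".join(generate(["the"], temperature=0.8, seed=1)))
    print(" ".join(generate(["the"], temperature=0.01, seed=1)))
```

At a very low temperature the loop almost always follows the single most probable path; at higher temperatures the same prompt can yield different continuations from one run to the next. Nothing in the loop consults anything other than the scores.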

There is no understanding in this process. No model of the world. No intention. No verification against external reality. The model did not “learn” your contracts the way a lawyer would, by internalising legal logic. It learned the statistical regularities of billions of texts, the co-occurrence patterns between sequences of tokens. It knows what follows what, with a statistical precision that impresses.

An analogy that holds without jargon: imagine a system that has absorbed twenty years of medical press in full. On any half-written medical sentence, it can predict the continuation with a precision that would impress a trainee doctor. Does this system know what a disease is? No. It knows what statistically follows “the patient presents with a persistent cough accompanied by fever since”. That is radically different. And the difference is not a detail, it is the heart of the problem.

The Transformer architecture, which underpins nearly all modern large language models, is built around an attention mechanism between tokens. It produces contextual representations that are remarkably effective for sequence prediction.

Attention Is All You Need, Vaswani et al. (2017)

The foundational 2017 paper that set off the current wave is not titled “Towards Artificial General Intelligence”. It is called “Attention Is All You Need”. The attention in question is a mathematical weighting mechanism between tokens. Brilliant, decisive, the basis for everything that followed. But not intelligence.

Reasoning, staged

Since 2024, a new generation of models has been marketed on their capacity to “reason”. The model “thinks” before answering. It “shows its work”. It generates what is called a chain of thought.

Here is what happens: the model generates a sequence of intermediate tokens before the final answer. This chain looks like reasoning because it takes reasoning’s form: steps, checks, apparent corrections. And it improves the final result on structured tasks, particularly in mathematics, formal logic, and code. This improvement is real, documented, significant.

What is not real: that the model thinks while generating that chain. It predicts the tokens of the “reasoning” chain exactly as it predicts the tokens of the answer, by finding the most probable continuation in context. The intermediate steps improve the final response because they better condition the inference context, not because they correspond to any internal deliberation.

Calling this “thinking”, “reflection”, or “cognition” is a marketing choice, not a description of the mechanism. The performance gain is documented. The cognitive interpretation attached to it is a narrative. Distinguishing the two is already a more solid basis for decisions.

The proof by contradiction

If the model truly understood, hallucinations would be impossible.

A system that knows a thing is false does not assert it. A system that “understands” its domain does not invent non-existent case law, fabricated bibliographic references, or statistics from nowhere. And yet it does. Not rarely. Regularly, with confidence, in impeccable style.

Hallucination is not an accidental bug to be corrected in the next update. It is the direct and predictable consequence of prediction-based operation. The model generates what is statistically probable in its token space, not what is true in the world. When the probable answer resembles the true one without being it, the model writes it anyway. The same mechanism that produces correct answers produces incorrect ones, in exactly the same way.
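A deliberately tiny illustration of that last point, assuming a bigram counter as a stand-in for a language model (real models are vastly larger and condition on far more context, but the principle of "most probable continuation" is the same). The corpus and the function names `train_bigrams` and `complete` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical four-sentence training corpus. Spain never appears in it.
CORPUS = [
    "paris is the capital of france",
    "paris is the largest city of france",
    "rome is the capital of italy",
    "berlin is the capital of germany",
]


def train_bigrams(sentences):
    """Count, for each word, which words follow it and how often."""
    follows = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for current, following in zip(words, words[1:]):
            follows[current][following] += 1
    return follows


def complete(follows, prompt):
    """Append the most frequent follower of the prompt's last word.

    Returns the completed text and the probability the toy model assigned
    to the word it chose. Truth is never consulted, only counts.
    """
    last_word = prompt.split()[-1]
    counts = follows[last_word]
    word, count = counts.most_common(1)[0]
    return f"{prompt} {word}", count / sum(counts.values())


if __name__ == "__main__":
    model = train_bigrams(CORPUS)
    print(complete(model, "paris is the capital of"))
    print(complete(model, "madrid is the capital of"))
```

Both prompts are completed with "france", each with a probability of 0.5. The first completion happens to be true and the second is false, yet the code path and the score are identical. The toy model has no way to tell the two cases apart.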

Large language models generate statistically plausible text without reference to meaning or truth. The fluency of the output creates the illusion of comprehension that is not there.

On the Dangers of Stochastic Parrots, Bender et al. (2021)

Models hallucinate less than they did three years ago. The improvement is real. But a reduction is not an elimination. With currently known architectures, hallucination is a structural property, not a marginal malfunction on its way out.

Confidence without reliability

There is a behaviour in models that deserves particular attention: they assert with equal confidence a verifiable truth and a fabricated error.

The model does not say “I’m not sure” when it is not. It does not flag its outputs with a confidence indicator you can read. The surface certainty is a property of the prediction: if the most probable token is an assured assertion, that is what the model produces. The form of certainty is not a signal of reliability.

An aggravating factor: the model tends to agree with you. If your question contains a false premise, a significant fraction of answers will validate that premise rather than correct it. This is not sycophancy, it is completion. In the training data, human texts confirm far more often than they contradict. That pattern is learned, and reproduced.

For an executive who wants to test their judgement on an AI topic, this creates a precise trap. The question “is my analysis sound?” has statistically good odds of getting a positive answer. That is not validation, it is a mirror. The distinction is essential in a decision-making environment.

What the misunderstanding costs

The misunderstanding regularly distorts three decisions.

Over-delegation without validation. If the tool understands, you can delegate judgement to it. If the tool predicts, you must validate. This distinction changes the organisational architecture around the tool: the volume of human review, accountability for errors, control procedures. Many organisations deployed on the first assumption and discovered the reality of the second in production, sometimes at significant cost.

Buying on the demo. An LLM demo is, by design, optimised for cases where the model performs well. The use cases shown are selected. The full performance distribution, including the error tails, is not visible in the demo. “It works in the demo” is a true and insufficient observation. The question is not whether it works: it is what the error rate is on your specific use case, in your real data, under your operational conditions.

Deploying to production as if it were a sandbox. In a sandbox, an error costs nothing. In production, an error in a client quote, a contract summary, or a patient response has a cost. Errors do not disappear when you move from sandbox to production. They gain weight.
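The question raised under “Buying on the demo” can be turned into arithmetic. A minimal sketch, assuming you have had people review a random sample of the tool's outputs on your own data and count the errors: the Wilson score interval gives a plausible range for the true error rate, and a simple multiplication turns that rate into an expected cost. All figures below are hypothetical placeholders, and the function names are illustrative.

```python
import math


def wilson_interval(errors, n, z=1.96):
    """Wilson score interval for an error proportion.

    errors: number of outputs judged wrong by human reviewers
    n:      number of outputs reviewed
    z:      normal quantile (1.96 corresponds to roughly 95% confidence)
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= errors <= n:
        raise ValueError("errors must be between 0 and n")
    p = errors / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


def expected_monthly_cost(error_rate, outputs_per_month, cost_per_error):
    """Expected cost of unreviewed errors: rate x volume x unit cost."""
    return error_rate * outputs_per_month * cost_per_error


if __name__ == "__main__":
    # Hypothetical review: 12 errors found in 200 sampled outputs.
    low, high = wilson_interval(errors=12, n=200)
    print(f"observed error rate: {12 / 200:.1%}")
    print(f"plausible range:     {low:.1%} to {high:.1%}")

    # Hypothetical volume and unit cost, evaluated at both ends of the range.
    for rate in (low, high):
        cost = expected_monthly_cost(rate, outputs_per_month=5000, cost_per_error=40.0)
        print(f"at {rate:.1%}: about {cost:,.0f} per month")
```

The interval matters as much as the point estimate: a small review sample leaves a wide range, and the decision about how much human validation to keep should be made against the upper end of that range, not against the demo.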

What these tools genuinely do well

It would be inaccurate, and counterproductive, to deny the real performance.

These systems accomplish tasks that no one knew how to automate ten years ago: high-quality translation on languages well represented in training data, summarisation of long documents, generation of correct code for common patterns, large-scale text classification, semantic search that retrieves a document by meaning rather than exact words.

Large language models can perform tasks for which they were not explicitly trained, given only a few examples in the prompt, and this ability improves with model size.

Language Models are Few-Shot Learners, Brown et al. (2020)

This performance is measurable and, on targeted tasks, comparable to that of human experts. Within their established domain of competence, these systems deliver. The point is not to deny what the tools do. It is to avoid extrapolating from their strengths a universal capability they do not have.

The structural limit remains the same: performance depends on the quality of the training data and on how close the task is to what the model has learned. On a topic poorly represented in the corpora, or on reasoning that departs from learned patterns, performance drops. And the model does not signal the drop.

A prediction machine, not an intelligence to buy

Linus Torvalds, the creator of the Linux kernel, put it in four words in October 2024: “90% marketing, 10% reality.” A general verdict on AI discourse, not on the tools themselves. But it names something precise: in this sector, the gap between what is sold and what is delivered is structurally wide, because the very name of the technology carries a promise the technology cannot fulfil.

You are renting a prediction machine. It predicts remarkably well, within its domain of competence, on tasks where performance is measurable and established. It does not understand, does not know, does not verify. Keeping that picture in mind does not prevent you from using it, and it does not diminish what it does. It changes what you delegate, how you frame it, and what you validate.

That is not a perceptual handicap. It is a decision advantage.