The prototype works. Everyone’s impressed. The CTO says it’s remarkable. The vendor shows a polished demo on clean data, with pre-selected questions.
Nobody mentions tokens. Nobody shows what happens six months after launch.
What the demo doesn’t show
An AI demo is, by design, a favorable staging. Data is curated, questions prepared, edge cases absent. That’s a legitimate evaluation step, as long as you know what you’re watching: not a product, but a proof of concept under favorable conditions.
The problem starts when the investment decision rests on that staging as if it were production reality.
Three gaps separate demo from production.
Volume. A demo handles ten documents, a hundred queries. Production handles ten thousand per day. LLMs charge per token: every word in and every word out has a price. Pricing changes quarterly and varies by a factor of a hundred depending on the model. The orders of magnitude to hold onto in 2026: output costs several times input, and capable models range from a few dollars to several dozen dollars per million tokens. Run on real volume, that calculation produces a monthly bill the vendor doesn’t mention in the pitch.
Actual quality. In the demo, the model answers correctly on the shown examples. In production, it meets cases nobody anticipated: malformed documents, out-of-scope questions, users typing something other than what was expected. The real error rate is never the demo’s error rate.
Oversight. In the demo, an expert watches each output. In production, nobody reviews 10,000 responses a day. Errors pass through. Some have consequences.
Klarna: the case study you can no longer ignore
For two years, the Klarna example featured in every AI agency deck: in February 2024, the Swedish fintech announced that its AI assistant handled two-thirds of customer service chats in its first month (the equivalent of 700 full-time agents, across 23 markets). CEO Sebastian Siemiatkowski called it one of the biggest productivity gains in the company’s history. The announcement spread through every vendor pitch.
What those decks didn’t show: what came next.
In May 2025, Siemiatkowski publicly admitted the pivot had gone too far. The AI handled the volume, but quality had degraded. Customers complained. Responses were generic, repetitive, inadequate on complex cases.
Klarna began rehiring human agents, targeting students and rural populations in freelance arrangements. The model they landed on: AI on high-volume routine cases, humans on escalations and judgment-heavy situations. A hybrid. Not the replacement that was announced.
This reversal is the thesis of this article made concrete. The demo metrics shone: volume, handling rate, agent equivalents. The quality metrics (customer satisfaction, re-escalation rate, complexity of remaining cases) forced the correction. The launch announcement was a marketing snapshot. The full trajectory is something else.
The token is an invisible unit of cost
Nobody thinks in tokens naturally. We think in words, pages, documents. But the model counts tokens, and the platform bills in tokens.
One token is roughly three-quarters of an English word. A two-page document of about 1,000 words: around 1,300 input tokens. A generated response of about 150 words: around 200 output tokens. Output costs several times input across all capable models on the market.
Pricing moves fast. Prices for reference models have dropped by a factor of 10 over two years; advanced reasoning models still cost several dozen dollars per million output tokens. Naming GPT-4o or Claude-3 in an evergreen article means citing stale prices within six months. What stays stable is the method: multiply the real production volume by the tokens consumed per document, apply the model’s current rate, and compare the result to the budget. That calculation, done under real conditions rather than extrapolated from the demo, reveals the gap.
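The calculation described above fits in a few lines. What follows is a minimal sketch, not a pricing tool: the rates, the per-request token counts, and the budget are all hypothetical placeholders chosen for illustration. Replace them with the provider’s current price list and with token counts measured on real traffic. Only the 10,000 requests per day comes from the production scenario used in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPricing:
    """Rates in dollars per million tokens (hypothetical values, not a real price list)."""
    input_per_million: float
    output_per_million: float


def monthly_token_cost(
    requests_per_day: int,
    input_tokens_per_request: int,
    output_tokens_per_request: int,
    pricing: TokenPricing,
    days_per_month: int = 30,
) -> float:
    """Return the estimated monthly API bill in dollars."""
    requests = requests_per_day * days_per_month
    input_cost = requests * input_tokens_per_request / 1_000_000 * pricing.input_per_million
    output_cost = requests * output_tokens_per_request / 1_000_000 * pricing.output_per_million
    return input_cost + output_cost


# Assumed rates: output priced at five times input, purely illustrative.
pricing = TokenPricing(input_per_million=3.0, output_per_million=15.0)

# Assumed token counts per request (document plus instructions in, answer out).
production_cost = monthly_token_cost(
    requests_per_day=10_000,
    input_tokens_per_request=1_500,
    output_tokens_per_request=300,
    pricing=pricing,
)

monthly_budget = 2_000.0  # hypothetical budget line
over_budget = production_cost > monthly_budget

print(f"Estimated monthly cost: ${production_cost:,.2f}")
print(f"Over budget: {over_budget}")
```

The useful part is not the result but the inputs: each one is a number the vendor can be asked to confirm in writing, and each one can be measured during a pilot on real documents.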
When hallucinations reach a judge
The demo doesn’t show hallucinations. In production, they reach real users.
This case was presented as an isolated accident. It was just the beginning.
In 2025, US courts sanctioned attorneys in dozens of similar cases: Dubinin v. Papazian (S.D. Florida, November 2025, fictitious citations, case dismissed), In re Loletha Hale (N.D. Georgia, October 2025, “the overwhelming majority of cases cited either did not exist, did not support the proposition, or misquoted authority”), Idehen v. Stoute-Phillip (Civil Court New York, July 2025, seven fictitious cases in an 88-page appendix). As of mid-2026, the database maintained by jurist Damien Charlotin has catalogued over 1,500 documented cases worldwide (approximately 1,600 as of June 1, 2026) where AI produced hallucinated content submitted to courts. The pace: five to six new cases per day.
In legal, medical, or regulatory production, a hallucination isn’t a bug fixed in the next version. It’s professional misconduct with immediate consequences. The article on hallucinations covers why this property is structural and permanent.
Oversight in production
A frequent argument for AI in production: it reduces headcount. Sometimes true. But the math assumes AI errors are negligible or easily detected. Neither is guaranteed.
In a demo, an expert watches each output. In production, oversight shifts to escalated cases: exceptions, complaints, disputes. Lower volume, but significantly higher cost per case, because these are precisely the situations the AI failed to handle.
For high-stakes domains (legal, medical, financial, compliance), precision requirements impose human review on every case that falls below a confidence threshold. In critical production, AI filters, humans validate the risky cases. The real economics look like the hybrid model Klarna eventually adopted, not the launch announcement.
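One way to picture that split is a routing rule: outputs the system is confident about go out automatically, the rest go to a person. The sketch below is an illustration under stated assumptions. The `ModelOutput` type, the `confidence` score, the sample cases, and the 0.90 threshold are all hypothetical, and how a trustworthy confidence score is obtained is deliberately left out, since a model’s own score is not necessarily well calibrated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOutput:
    case_id: str
    answer: str
    confidence: float  # hypothetical score in [0, 1]; its source is outside this sketch


def route(output: ModelOutput, threshold: float) -> str:
    """Send low-confidence outputs to a human, let the rest through."""
    return "auto" if output.confidence >= threshold else "human_review"


def review_share(outputs: list[ModelOutput], threshold: float) -> float:
    """Fraction of outputs that end up in the human review queue."""
    if not outputs:
        return 0.0
    flagged = sum(1 for o in outputs if route(o, threshold) == "human_review")
    return flagged / len(outputs)


# Invented sample cases, for illustration only.
batch = [
    ModelOutput("case-001", "Refund approved under standard policy.", 0.97),
    ModelOutput("case-002", "Contract clause appears enforceable.", 0.62),
    ModelOutput("case-003", "Invoice matches purchase order.", 0.91),
    ModelOutput("case-004", "Dosage is within the usual range.", 0.74),
]

THRESHOLD = 0.90  # assumed value; set it from the acceptable error rate, not by feel

for item in batch:
    print(item.case_id, route(item, THRESHOLD))
print(f"Share sent to human review: {review_share(batch, THRESHOLD):.0%}")
```

Raising the threshold lowers the risk of an unreviewed error and raises the review workload. That trade-off is the real staffing question, and it only has an answer once the threshold has been tested on production data.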
Before signing off on a prototype
The prototype convinces. The reflex is to move to production fast. Three points are typically missing from that decision.
The real cost per unit processed. Not the demo cost: the production cost, with real volume, real contexts, real tokens. A calculation, not a vague extrapolation. For a framework on costing an AI project end-to-end, see The real price of an AI project.
The acceptable error rate. An LLM makes errors. Define the threshold above which operational or legal consequences become unacceptable. That threshold must be set before deployment, and tested on real data, not on the vendor’s examples.
The oversight plan. List cases requiring human review, estimate their frequency, calculate the cost of that oversight. If this calculation hasn’t been done, the project isn’t ready.
These three points mirror the method in Three questions before investing in AI. They apply to every deployment, in every sector, without exception.
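The second and third points can be put into numbers before any contract is signed. The sketch below is one possible approach, not a prescribed method. It checks an error rate measured on a hand-labeled sample against the acceptable threshold, using the upper bound of a Wilson score interval so that a small sample does not look better than it is, then estimates the monthly cost of human review. Every figure (sample size, error count, threshold, review share, minutes per review, hourly cost) is a hypothetical placeholder.

```python
import math


def wilson_upper_bound(errors: int, sample_size: int, z: float = 1.96) -> float:
    """Upper bound of the Wilson score interval for an error rate (z=1.96 is about 95%)."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    p_hat = errors / sample_size
    denom = 1 + z**2 / sample_size
    center = (p_hat + z**2 / (2 * sample_size)) / denom
    margin = z * math.sqrt(
        p_hat * (1 - p_hat) / sample_size + z**2 / (4 * sample_size**2)
    ) / denom
    return center + margin


def monthly_review_cost(
    requests_per_day: int,
    review_share: float,
    minutes_per_review: float,
    hourly_cost: float,
    days_per_month: int = 30,
) -> float:
    """Estimated monthly cost in dollars of reviewing the flagged share of requests."""
    reviews_per_day = requests_per_day * review_share
    hours_per_day = reviews_per_day * minutes_per_review / 60
    return hours_per_day * hourly_cost * days_per_month


# Hypothetical test on real data: 12 errors found in 400 hand-checked outputs.
errors, sample_size = 12, 400
acceptable_error_rate = 0.05  # assumed threshold, set before deployment

observed = errors / sample_size
upper = wilson_upper_bound(errors, sample_size)
passes = upper <= acceptable_error_rate

print(f"Observed error rate: {observed:.1%}, upper bound: {upper:.1%}, passes: {passes}")

# Hypothetical oversight plan: 5% of 10,000 daily requests, 6 minutes each, $40 per hour.
oversight = monthly_review_cost(10_000, 0.05, 6, 40.0)
print(f"Estimated monthly oversight cost: ${oversight:,.0f}")
```

In this invented example the observed rate of 3% sits under the 5% threshold, but the upper bound does not: the sample is too small to support the claim. Either the test set grows or the threshold is not yet met.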
What the vendor doesn’t say unprompted
They don’t lie. They show what’s favorable. That’s rational on their part.
What they typically don’t mention unless asked: the monthly API cost at target production volume, the model’s behavior on edge cases specific to your sector, the oversight plan if the error rate exceeds the acceptable threshold, and the data retention policy on their platform.
On that last point, see Your data in AI: what leaves, what stays. API calls to an externally hosted LLM send the request data outside your organization, sensitive or not.
Deploy, but with eyes open
AI in production works, in dozens of documented use cases. The Klarna trajectory (high-profile launch, quiet correction, hybrid model in the end) isn’t a failure. It’s the normal path of an honest deployment.
What causes problems: deciding on demo metrics, citing launch announcements without their sequel, and not calculating real costs before signing. These three shortcuts explain most AI projects that return to square one after six months.
Testing on vendor data, extrapolating cost, assuming supervision disappears: the short path to expensive disappointment.