You’ve received twenty AI pitches in the past year. Maybe more. They all promised pretty much the same thing: 40% lower costs, 30% less processing time, more autonomy for teams. A few were true. The majority repackaged a public API with a system prompt layer and a blue logo in the top right-hand corner. Here are the phrases that should put you on alert, and the precise question to ask behind each one.

The demo that works the first time

A salesperson shows you a demo. The model produces impeccable output. “Look, it’s perfect.” You’re impressed.

What you don’t see: the ten iterations of prompt engineering that preceded this demo. The test cases carefully chosen to make the model perform. The absence of your real data in the demo. The fact that on your own data, the “perfect first time” rate will be different, and you have no way of knowing this before you pay.

Ask for a demo on your data. Not on generic data, not on ideal cases. On your documents, your conventions, your formats. If the vendor refuses or can’t, the product probably can’t be deployed at scale in your context.

The model that knows your industry

Common expressions: “it knows your sector”, “it understands the regulations”, “it has mastered your trade jargon”.

What an LLM “knows” is what was in its training data: public data, up to a certain cutoff date. It doesn’t know your internal regulations. It doesn’t know your standard contracts. It doesn’t know about recent case law that hasn’t been publicly documented. When it answers questions on these topics, it extrapolates. And an extrapolation can be wrong while sounding entirely confident.

Ask for test cases on points specific to your sector, with verification by an in-house expert. Document failures, not just successes.
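As an illustration, such a test log can be as simple as the sketch below. Everything in it is hypothetical: the `ReviewedCase` structure, the field names, and the three sample entries are invented for the example and do not come from any vendor or tool. The point is only that each case carries the expert's verdict, and that failures are listed explicitly instead of being averaged away.

```python
from dataclasses import dataclass


@dataclass
class ReviewedCase:
    """One sector-specific question, the model's answer, and the expert's verdict."""
    question: str
    model_answer: str
    expert_accepts: bool  # True if the in-house expert judged the answer correct
    note: str = ""        # the expert's explanation, most useful for failures


def summarize(cases):
    """Return the failure rate and the list of failed cases."""
    if not cases:
        raise ValueError("no reviewed cases")
    failures = [c for c in cases if not c.expert_accepts]
    return len(failures) / len(cases), failures


# Invented sample entries, for illustration only.
cases = [
    ReviewedCase("Notice period in our standard supplier contract?", "30 days", True),
    ReviewedCase("Does internal rule 12 apply to subcontractors?", "No", False,
                 note="Rule 12 was extended to subcontractors; the model guessed."),
    ReviewedCase("Which form is required for an export declaration?", "Form A", True),
]

failure_rate, failures = summarize(cases)
print(f"Failure rate: {failure_rate:.0%} ({len(failures)}/{len(cases)})")
for case in failures:
    print(f"FAILED: {case.question} -> {case.model_answer!r} ({case.note})")
```

A few dozen cases reviewed this way, on your own documents, already say more than any demo.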

The tool that solves your three main problems

The most grandiose version of the AI pitch. Your productivity problem: AI solves it. Your quality problem: AI solves it. Your recruitment problem: AI solves it.

A tool that promises to solve everything in a given field is, in effect, promising to solve nothing well. AI performs well on well-defined tasks, with well-structured inputs, in the domains covered by its training data. It is mediocre on ambiguous tasks, noisy inputs, and under-represented domains.

Ask which specific task the model solves, how performance on that task is measured, and what the procedure is when the model is wrong.

Try it, we’ll see

In a development or prototyping context, this is a legitimate phrase. Experimentation is the right way to evaluate an LLM.

When deploying on real processes with real stakes, it is a dangerous phrase. “We’ll see” is acceptable when an error costs a failed prompt. It is not when an error costs an incorrect legal decision, a missed medical diagnosis, or a fraudulent transfer that goes unnoticed.

Tolerance for undetected errors depends on context. Where it is high, “try and see” is appropriate. Where it is low, a validation and human review plan is mandatory before any deployment.

What you must contractually require

Demand four things from the vendor in the contract. First, the underlying model: which LLM is running under the hood, and what happens when its version changes. Second, the error rate measured on test cases representative of your context, not on demo cases. Third, the destination of the data you submit: inside or outside the EU, and whether it is used for training. Fourth, the SLA that applies if performance drops after an update to the underlying model. The rest (retention policy, AI Act scope, etc.) is covered in the appendices, but these four points are the entry requirement.

If a vendor can’t answer these four questions, you’re not dealing with a production-grade solution. You’re dealing with a demo dressed up as a product.
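An SLA on performance after a model update only means something if the drop can be measured. One possible way to do that, sketched here under stated assumptions, is to rerun the same representative test set before and after the update and compare error rates. The function names, the sample results, and the two-point tolerance are hypothetical; the actual threshold is whatever the contract specifies.

```python
def error_rate(results):
    """results: list of booleans, True when an output was judged correct."""
    if not results:
        raise ValueError("no test results")
    return results.count(False) / len(results)


def breaches_threshold(baseline, candidate, max_increase=0.02):
    """True if the error rate rose by more than max_increase (assumed tolerance)."""
    return error_rate(candidate) - error_rate(baseline) > max_increase


# Invented results on the same 100 test cases, before and after a model update.
baseline = [True] * 95 + [False] * 5       # 5% errors before the update
after_update = [True] * 90 + [False] * 10  # 10% errors after the update

print(f"Before: {error_rate(baseline):.0%}, after: {error_rate(after_update):.0%}")
print("SLA threshold breached:", breaches_threshold(baseline, after_update))
```

The comparison is only valid if both runs use the same test cases and the same judging criteria, which is one more reason to keep the test set under your control and not the vendor's.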