80,000 euros on the estimate. 240,000 eighteen months later. The gap is not fraud. The integrator priced what it was asked to price: development, integration, production start-up. Nobody asked for a figure on the rest. And the rest is precisely what consumes the production budget. Six line items are systematically missing from these quotes, listed here by how much they surprise the CFO when the invoice arrives.

The cost of tokens in production

LLM usage is billed per token. During the development phase, the volume of tokens is low: you test and iterate. In production, the volume explodes with use.

Realistic example: a document assistance application for 100 users, each making 20 requests per day, with an average of 2,000 tokens per request (document + question + answer). That’s 4 million tokens per day. At the current rate for a capable model (verify on the provider’s pricing page at the time of evaluation, because rates change frequently), this volume can cost anywhere from a few dozen to several hundred euros per day depending on which model you choose. The calculation method doesn’t change; the rate itself may have halved since you last looked.

If your usage is ten times greater, the API line item can easily reach €100,000 to €200,000 per year. This cost was not in the initial quote.
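The arithmetic above can be sketched in a few lines. This is a minimal illustration, not a pricing tool: the function name, the single blended rate of €10 per million tokens, and the 365 days of use per year are all assumptions made for the example. Real providers price input and output tokens separately, so replace the rate with the one on the provider’s pricing page.

```python
def annual_token_cost(users, requests_per_user_per_day, tokens_per_request,
                      price_per_million_tokens, days_per_year=365):
    """Return (tokens per day, cost per day, cost per year)."""
    tokens_per_day = users * requests_per_user_per_day * tokens_per_request
    cost_per_day = tokens_per_day / 1_000_000 * price_per_million_tokens
    return tokens_per_day, cost_per_day, cost_per_day * days_per_year


# ASSUMPTION: hypothetical blended rate in euros per million tokens.
ASSUMED_PRICE_PER_MILLION = 10.0

# The example from the text: 100 users, 20 requests per day, 2,000 tokens each.
base = annual_token_cost(100, 20, 2_000, ASSUMED_PRICE_PER_MILLION)

# The same application with ten times the usage.
scaled = annual_token_cost(1_000, 20, 2_000, ASSUMED_PRICE_PER_MILLION)

for label, (tokens, per_day, per_year) in (("base", base), ("x10", scaled)):
    print(f"{label}: {tokens:,} tokens/day, {per_day:,.0f} EUR/day, {per_year:,.0f} EUR/year")
```

Under this assumed rate, the ten-times scenario lands inside the €100,000 to €200,000 range quoted above. The useful part is the method: rerun it whenever the rate or the usage forecast changes.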

The cost of data preparation

For your LLM to answer correctly in your domain, it must have access to your data. In practice: cleaning up documents, converting them into usable formats (PDF → clean text, not trivial), building and maintaining a vector database, managing updates.

This work is systematically underestimated. A company’s real data is contained in poorly structured PDFs, Word documents with tables, Excel exports and e-mails. Extracting clean text from these sources is a project in itself.

A serious RAG project often allocates 30 to 40% of the total budget to data preparation and maintenance. This budget is rarely included in the initial quote.

The cost of prompt maintenance

An LLM responds to instructions (prompts). These instructions must be adjusted and maintained when the model changes version, when test cases reveal unexpected behavior, when use cases evolve.

Prompt engineering is not a one-shot task. It’s a continuous process. Models change version (GPT-4 → GPT-4o → GPT-4.5 → GPT-5…) and their behavior evolves. A prompt that was working well may degrade after an update of the underlying model.

This maintenance cost is virtually absent from quotes. It often represents 15 to 25% of the total cost over 3 years.

The cost of human proofreading

An LLM makes mistakes. These errors must be detected. In high-stakes use cases, human proofreading is built into the workflow.

This proofreading has a cost. If the model processes 1,000 documents per week, and an operator has to proofread 10% of the output (the uncertain cases), that’s 100 documents to proofread per week. If it takes 5 minutes per document, that’s 500 minutes (a little over 8 hours) of human operator time per week, to be budgeted as a recurring line item.

This cost is systematically absent from the ROI calculations presented in the demos. The demo shows the time saved. It does not subtract the cost of verification.
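The proofreading arithmetic above can be turned into an annual budget line. This is a rough sketch: the function name, the hourly cost of €40 and the 47 working weeks per year are hypothetical placeholders, not figures from any quote.

```python
def weekly_review_minutes(documents_per_week, review_percent, minutes_per_document):
    """Minutes of human proofreading needed per week."""
    reviewed_documents = documents_per_week * review_percent / 100
    return reviewed_documents * minutes_per_document


# Figures from the text: 1,000 documents per week, 10% reviewed, 5 minutes each.
minutes = weekly_review_minutes(1_000, 10, 5)
hours_per_week = minutes / 60

# ASSUMPTIONS: hypothetical loaded hourly cost and working weeks per year.
ASSUMED_HOURLY_COST_EUR = 40.0
ASSUMED_WORKING_WEEKS = 47

annual_review_cost = hours_per_week * ASSUMED_WORKING_WEEKS * ASSUMED_HOURLY_COST_EUR

print(f"{minutes:.0f} minutes/week, about {hours_per_week:.1f} hours/week")
print(f"Annual proofreading budget under these assumptions: {annual_review_cost:,.0f} EUR")
```

Subtracting this figure from the time savings shown in the demo gives a more honest ROI.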

Infrastructure costs

Beyond API costs, deploying a model locally (on-premises) for reasons of confidentiality or latency adds its own line items: GPUs, servers, storage, network, system maintenance. A single H100 GPU costs between €25,000 and €35,000 to purchase. Add electricity, cooling and maintenance. A cluster of 4 GPUs dedicated to inference represents a capex of €100,000 to €150,000 and an annual opex of €15,000 to €30,000.

If you rent GPUs in the cloud (AWS, GCP, Azure), the cost shifts to opex, but it can be high for continuous workloads.
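To compare the two options over the same horizon, a 36-month total can be computed as follows. The on-prem inputs are the capex and opex ranges given above. The cloud hourly rate of €3 per GPU is a hypothetical placeholder, and both function names are invented for this sketch. Check the actual rate for the instance type you would use.

```python
def on_prem_tco(capex, annual_opex, months=36):
    """Purchase cost plus running costs over the period."""
    return capex + annual_opex * months / 12


def cloud_tco(gpu_count, hourly_rate_per_gpu, utilization=1.0, months=36):
    """Rental cost over the period; utilization=1.0 means running around the clock."""
    hours = months / 12 * 365 * 24 * utilization
    return gpu_count * hourly_rate_per_gpu * hours


# On-prem range from the text: 4-GPU cluster, capex 100k-150k, opex 15k-30k per year.
on_prem_low = on_prem_tco(100_000, 15_000)
on_prem_high = on_prem_tco(150_000, 30_000)

# ASSUMPTION: hypothetical cloud price in euros per GPU-hour.
ASSUMED_HOURLY_RATE_EUR = 3.0
cloud_continuous = cloud_tco(4, ASSUMED_HOURLY_RATE_EUR)
cloud_quarter_time = cloud_tco(4, ASSUMED_HOURLY_RATE_EUR, utilization=0.25)

print(f"On-prem, 36 months: {on_prem_low:,.0f} to {on_prem_high:,.0f} EUR")
print(f"Cloud, continuous:  {cloud_continuous:,.0f} EUR")
print(f"Cloud, 25% usage:   {cloud_quarter_time:,.0f} EUR")
```

The `utilization` parameter is the lever that matters: a workload that runs around the clock is where the cloud option becomes expensive.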

How to build a realistic TCO

Before you sign, ask your supplier for six figures: the estimated volume of tokens in production with a 50% safety margin, the budget for data preparation and maintenance over three years, the cost of human proofreading as a recurring line item, the cloud or on-prem infrastructure cost with its 36-month TCO. If your supplier can’t produce them, it’s not a question of transparency. It’s because they’ve never calculated them.
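Once the supplier has answered, the figures can be added up in one place. The sketch below covers only the items the checklist above names explicitly. The class, its field names and every amount in the usage example are hypothetical placeholders to be replaced with the supplier’s numbers.

```python
from dataclasses import dataclass


@dataclass
class TcoInputs:
    annual_token_cost: float          # supplier's estimate, before safety margin
    data_budget_3_years: float        # data preparation and maintenance
    annual_proofreading_cost: float   # recurring human proofreading
    infrastructure_tco_36m: float     # cloud or on-prem, already over 36 months
    token_safety_margin: float = 0.50  # the 50% margin named in the checklist


def tco_36_months(inputs: TcoInputs) -> float:
    """Sum the listed items over a three-year horizon."""
    years = 3
    tokens = inputs.annual_token_cost * (1 + inputs.token_safety_margin) * years
    proofreading = inputs.annual_proofreading_cost * years
    return (tokens + inputs.data_budget_3_years
            + proofreading + inputs.infrastructure_tco_36m)


# Placeholder amounts for illustration only, not benchmarks.
example = TcoInputs(
    annual_token_cost=20_000,
    data_budget_3_years=60_000,
    annual_proofreading_cost=15_000,
    infrastructure_tco_36m=0,  # e.g. API-only deployment
)
total = tco_36_months(example)
print(f"36-month TCO for the listed items: {total:,.0f} EUR")
```

Comparing this total with the figure on the estimate shows how much of the production budget the quote leaves out.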