ai-strategy · scoping · llm · ai-development

How to Think About AI Model Costs When You're Scoping a Project

Stephen Martin · March 30, 2026

One of the things that surprises people when they start scoping an AI project is how quickly model costs can add up — and how hard they are to estimate accurately without understanding the specifics of what you're building.

This post isn't about optimizing model costs. It's about developing the right mental model for them early in a project, so you're not surprised later.

The cost drivers that matter

LLM API pricing is typically quoted per million tokens. A token is roughly three-quarters of a word. Pricing varies significantly across providers and models, and there are often different rates for input tokens (what you send to the model) and output tokens (what the model returns).

The variables that most affect your production costs:

Context window size. For every request, you're paying for everything in the context window — your system prompt, any retrieved documents, the conversation history, and the user's input. A system that includes 2,000-word documents in every request will cost dramatically more per query than one that's passing short structured inputs. Context size is usually the biggest lever on per-query cost.

Query volume. How many requests per day, per hour, at peak? A system handling 1,000 queries per day looks very different from one handling 100,000. Volume multiplies everything.

Output length. Longer generated outputs cost more than shorter ones. For extraction and classification tasks, outputs are usually short. For generation tasks (drafting, summarization), outputs can be long and are a significant cost component.

Model choice. The gap between the most capable frontier models and smaller, faster models can be 10x to 100x in price per token. For many well-defined tasks, a smaller model performs nearly as well at a fraction of the cost. For complex reasoning or highly variable tasks, the frontier model is often worth it.

Caching and retrieval efficiency. If the same content appears in context across many requests (a static system prompt, a product catalog), prompt caching can reduce costs significantly. The efficiency of your retrieval layer — fetching exactly the right context rather than including large irrelevant chunks — also directly affects cost.
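To make the context-size lever concrete, here's a rough per-query comparison. The prices are illustrative placeholders, not current rates from any provider:

```python
# Rough per-query cost comparison: document-heavy vs. short structured requests.
# The rates below are illustrative assumptions, not real provider pricing.
INPUT_PRICE_PER_M = 3.00    # $ per million input tokens (assumed)
OUTPUT_PRICE_PER_M = 15.00  # $ per million output tokens (assumed)

def query_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request at the assumed rates."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# A 2,000-word document is roughly 2,700 tokens at ~0.75 words per token;
# round up to 3,000 to account for the system prompt and instructions.
doc_heavy = query_cost(input_tokens=3_000, output_tokens=100)
structured = query_cost(input_tokens=300, output_tokens=100)

print(f"document-heavy query:   ${doc_heavy:.4f}")
print(f"short structured query: ${structured:.4f}")
print(f"ratio: {doc_heavy / structured:.1f}x")
```

Fractions of a cent per query look harmless until you multiply by volume, which is exactly why context size and query volume have to be estimated together.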

The estimate most people make wrong

The most common scoping mistake is to estimate costs based on demo usage and extrapolate linearly to production volume.

Demo usage is almost never representative. It uses cleaner inputs, simpler queries, and doesn't include all the context that a production system needs to handle real cases well.

Production systems typically have higher per-query costs than demos because:

  • They include more context (retrieved documents, conversation history, system instructions)
  • They handle more edge cases, which often require longer or more complex processing
  • They have error handling and retry logic that adds overhead

When you're estimating production costs from a demo, add a meaningful buffer. What's expensive in a demo will be more expensive in production.

How to build a useful cost model

Before you finalize a project scope, build a back-of-envelope cost model with these inputs:

  1. Estimated daily query volume (and peak multiplier)
  2. Average context window size per request (system prompt + retrieval + history)
  3. Average output length
  4. Model selection (and unit price per million tokens for input and output)
  5. Expected retry rate (what percentage of requests fail and get retried)

Monthly cost ≈ (daily queries × 30) × [(average input tokens × input price + average output tokens × output price) ÷ 1,000,000] × (1 + retry rate), where prices are quoted per million tokens.

This won't be precise, but it'll tell you whether you're looking at a $200/month system or a $20,000/month system. That order-of-magnitude range is what matters for scoping and architecture decisions.
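The formula above translates directly into a few lines of Python. Every number in the scenario below is an assumption you'd replace with your own estimates:

```python
def monthly_cost(
    daily_queries: int,
    avg_input_tokens: int,
    avg_output_tokens: int,
    input_price_per_m: float,   # $ per million input tokens
    output_price_per_m: float,  # $ per million output tokens
    retry_rate: float = 0.05,   # fraction of requests that fail and get retried
) -> float:
    """Back-of-envelope monthly cost, per the scoping formula above."""
    per_query = (avg_input_tokens * input_price_per_m
                 + avg_output_tokens * output_price_per_m) / 1_000_000
    return daily_queries * 30 * per_query * (1 + retry_rate)

# Illustrative scenario: 10,000 queries/day, 2,000 input tokens and 400 output
# tokens per request, $3/$15 per million tokens (placeholder rates), 5% retries.
estimate = monthly_cost(10_000, 2_000, 400, 3.00, 15.00, retry_rate=0.05)
print(f"≈ ${estimate:,.0f}/month")
```

Running a few scenarios through a function like this (double the volume, halve the context, swap the model prices) is usually more informative than any single estimate.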

The architecture decisions that most affect cost

Model costs are often an architecture problem more than a purchasing problem. The decisions that most affect cost per query:

Context management. How much do you include in each request? Can you use shorter, more targeted retrieval rather than including whole documents? Can you summarize conversation history rather than passing the full thread?

Model routing. Can you use a cheaper model for simpler requests and reserve the expensive model for complex ones? This requires building routing logic, but the cost reduction can be significant at scale.
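A minimal sketch of what routing logic can look like. The heuristic and model names here are deliberately crude placeholders; production routers often use a small classifier or the cheap model's own confidence signal instead:

```python
def pick_model(query: str) -> str:
    """Route simple requests to a cheap model, complex ones to a frontier model.

    The length-plus-keyword heuristic below is a placeholder to show the shape
    of the logic, not a recommended production heuristic.
    """
    complex_markers = ("why", "compare", "explain", "analyze")
    is_long = len(query.split()) > 50
    is_complex = any(marker in query.lower() for marker in complex_markers)
    return "frontier-model" if (is_long or is_complex) else "small-model"

print(pick_model("Extract the invoice number from this text"))            # small-model
print(pick_model("Compare these contracts and explain the differences"))  # frontier-model
```

The routing function itself is cheap; the work is in validating that the cheap path actually holds quality on the traffic you send it.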

Caching. What's the same across requests? System prompts, static reference content, and common query patterns can often be cached, reducing the tokens processed per request.
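Provider-side prompt caching is handled by the API, but repeated identical requests can also be short-circuited entirely on your side. A toy sketch, assuming deterministic queries where an identical prompt should always yield the same answer:

```python
import hashlib

# Toy application-side response cache keyed on the exact prompt text.
# Assumes deterministic generation (e.g. temperature 0), so a repeated
# prompt can safely reuse the stored answer and skip the API call.
_cache: dict[str, str] = {}

def cached_call(prompt: str, call_model) -> str:
    """Return a cached response if we've seen this exact prompt before."""
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_model(prompt)  # only pay for the first occurrence
    return _cache[key]

# Demo with a stand-in for the real model call:
def fake_model(prompt: str) -> str:
    print("model called")  # prints once, not twice
    return "answer"

cached_call("What are your support hours?", fake_model)
cached_call("What are your support hours?", fake_model)
```

Real systems add an eviction policy and a TTL, and often normalize prompts before hashing so trivial whitespace differences don't defeat the cache.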

Batching. For asynchronous workloads, batching requests reduces per-request overhead and often qualifies for discounted batch pricing from providers.

Task decomposition. For complex tasks, a chain of smaller, cheaper model calls often produces better results at lower cost than a single large, expensive model call.

When to optimize and when not to

At prototype scale, model costs almost never matter. Don't optimize prematurely.

At production scale with meaningful volume, cost becomes a real consideration that shapes architecture. The time to think about it is during scoping, not after you've committed to an architecture that's expensive to change.

The specific threshold varies, but if your back-of-envelope model shows monthly costs in the thousands of dollars at your projected volume, it's worth spending some time during architecture design on cost optimization. If it shows costs in the hundreds, ship first and optimize later.

One practical note: model prices have generally trended down over time as the market matures. The costs you're estimating today are likely higher than what you'll pay in 12 months. Don't over-optimize for a cost environment that may change significantly.

If you want help building a cost model for a specific AI project you're scoping, book a discovery call — it's one of the things we work through in early project conversations.

Ready to talk through your AI project?

Book a free 30-minute discovery call. No pitch, no commitment — just a direct conversation about what you're building and whether we can help.

Book a Discovery Call