The token economics of scale: keeping AI costs flat as usage 10×s

Token cost grows linearly with usage. Five well-known levers — model routing, prompt caching, response budgets, batch APIs, eval-driven downgrades — compound to flatten that curve. Most teams pull them out of order. The eval suite is the prerequisite for the biggest savings.

Admin · Founder & Engineering Lead · May 19, 2026 · 6 min read

Token cost grows linearly with usage. Most enterprises don’t optimize until the monthly bill becomes uncomfortable — usually in the third or fourth quarter of a successful AI rollout. The good news: by then there are five well-known levers, and they compound. The bad news: most teams pull them out of order, and a few of them require eval infrastructure that should have been built sooner.

Where cost actually comes from

Input tokens (what you send in: the prompt, the retrieved context, the conversation history) are 3–5× cheaper than output tokens (what the model generates) at most providers. Then there’s the model tier: the frontier model costs roughly 10× its small-model sibling, sometimes more. Workload shape matters too. Chatbot-style traffic with short responses is dominated by fixed per-call overhead; RAG with long retrieved context is dominated by input tokens; agent workflows pay both sides several times per task. Map your actual cost distribution before applying any lever.
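One way to map that distribution is to split each workload's bill into its input and output sides. The sketch below is illustrative only: the prices, the `Workload` fields, and the traffic figures are assumptions, not provider quotes or measured data.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens; substitute your provider's rates.
INPUT_PRICE_PER_M = 3.00
OUTPUT_PRICE_PER_M = 15.00


@dataclass
class Workload:
    name: str
    calls: int          # calls per month
    input_tokens: int   # average input tokens per call
    output_tokens: int  # average output tokens per call


def monthly_cost(w: Workload) -> dict:
    """Split one workload's monthly cost into its input and output sides."""
    input_cost = w.calls * w.input_tokens * INPUT_PRICE_PER_M / 1_000_000
    output_cost = w.calls * w.output_tokens * OUTPUT_PRICE_PER_M / 1_000_000
    total = input_cost + output_cost
    return {
        "input": input_cost,
        "output": output_cost,
        "total": total,
        "input_share": input_cost / total if total else 0.0,
    }


# Made-up traffic shapes for a chatbot and a RAG system.
workloads = [
    Workload("chatbot", calls=100_000, input_tokens=400, output_tokens=150),
    Workload("rag", calls=50_000, input_tokens=6_000, output_tokens=300),
]

for w in workloads:
    c = monthly_cost(w)
    print(f"{w.name}: total ${c['total']:.2f}, input share {c['input_share']:.0%}")
```

With these assumed numbers the RAG workload spends most of its budget on input tokens while the chatbot spends most on output, which is why the same lever pays off differently across workloads.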

Lever 1 — Model routing

The highest-leverage cost lever. Not every query needs the frontier model. A small model handles 60–80% of routine questions at 5–10× lower cost. The pattern: a fast cheap router scores incoming queries on complexity, sends the simple ones to the small model, escalates only when needed. Run an eval suite to confirm the small model meets your quality bar on routed traffic; the eval is the prerequisite, not an afterthought.
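The routing pattern can be sketched as follows. The scoring heuristic, the threshold, and the model names are hypothetical placeholders; a production router would more likely be a small trained classifier whose threshold is tuned against the eval suite.

```python
def complexity_score(query: str) -> float:
    """Toy complexity heuristic in [0, 1]; purely illustrative."""
    hard_markers = ("why", "compare", "analyze", "step by step", "trade-off")
    words = query.lower().split()
    text = " ".join(words)
    length_part = min(len(words) / 100, 1.0)
    # Substring match, so a marker can also match inside a longer word.
    marker_part = sum(marker in text for marker in hard_markers) / len(hard_markers)
    return 0.5 * length_part + 0.5 * marker_part


def route(query: str, threshold: float = 0.15) -> str:
    """Send simple queries to the small model and escalate the rest."""
    if complexity_score(query) >= threshold:
        return "frontier-model"  # hypothetical model name
    return "small-model"         # hypothetical model name


def blended_cost_ratio(small_share: float, small_cost_ratio: float) -> float:
    """Cost relative to sending every query to the frontier model."""
    return small_share * small_cost_ratio + (1 - small_share)


print(route("What are your opening hours?"))
print(route("Compare these two contracts and analyze the trade-off in liability terms"))
# 70% of traffic routed to a model that costs one tenth as much:
print(round(blended_cost_ratio(0.7, 0.1), 2))
```

The last line is the arithmetic behind the lever: if 70% of traffic goes to a model costing a tenth as much, the blended bill is roughly 37% of the all-frontier bill, before counting the router's own cost.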

Lever 2 — Prompt caching

Provider-level prompt caching (Anthropic and OpenAI both support it now) discounts repeated prefix tokens by 50–90%. RAG systems with stable system prompts and shared retrieved context win the most. Your own caching at the retrieval layer, the answer layer, and the embedding layer adds another 10–30% in savings on top, especially for FAQ-shaped workloads.
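The effect of a prefix discount on the input bill can be estimated with simple arithmetic. The function below is a sketch under stated assumptions: the price, hit rate, and token counts are made up, and it ignores any premium a provider may charge for writing to the cache.

```python
def input_cost_with_prefix_cache(
    prefix_tokens: int,
    dynamic_tokens: int,
    calls: int,
    price_per_m: float,
    cached_discount: float,
    hit_rate: float,
) -> float:
    """Input cost when a shared prefix is billed at a discount on cache hits."""
    per_token = price_per_m / 1_000_000
    prefix_on_hit = prefix_tokens * per_token * (1 - cached_discount)
    prefix_on_miss = prefix_tokens * per_token
    per_call = (
        hit_rate * prefix_on_hit
        + (1 - hit_rate) * prefix_on_miss
        + dynamic_tokens * per_token  # the non-shared part is always full price
    )
    return calls * per_call


# Assumed shape: 4,000-token stable prefix, 1,000 dynamic tokens, 1,000 calls.
baseline = input_cost_with_prefix_cache(4_000, 1_000, 1_000, 3.00, 0.9, hit_rate=0.0)
cached = input_cost_with_prefix_cache(4_000, 1_000, 1_000, 3.00, 0.9, hit_rate=0.95)
print(f"baseline ${baseline:.2f}, with caching ${cached:.2f}")
```

The saving depends on how much of each prompt is a stable prefix and how often calls arrive while the cache is still warm, so both are worth measuring before forecasting a number.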

Lever 3 — Response budgets

Set max_tokens. Most outputs don’t need 4,000-token responses; most production responses sit comfortably under 500. Cap them. Keep in mind that max_tokens is a hard cutoff: the model does not see the limit and will not write more concisely because of it, so ask for brevity in the prompt and treat the cap as a backstop. A 1,500-token cap on responses that average 600 keeps the long tail from running away while preserving normal behavior.
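Before choosing a cap, it helps to check it against logged response lengths, since a cap cuts responses off rather than shortening them. This sketch uses a made-up length distribution; in practice `lengths` would come from your own usage logs.

```python
def capped_output_tokens(lengths: list, cap: int) -> int:
    """Output tokens billed if every response is cut off at `cap` tokens."""
    return sum(min(n, cap) for n in lengths)


def truncated_share(lengths: list, cap: int) -> float:
    """Fraction of responses that would hit the cap and be cut off."""
    return sum(n > cap for n in lengths) / len(lengths)


# Hypothetical log: most responses near 600 tokens, a long tail at 4,000.
lengths = [600] * 95 + [4_000] * 5

uncapped = sum(lengths)
capped = capped_output_tokens(lengths, cap=1_500)
print(f"billed output tokens: {uncapped} uncapped, {capped} capped")
print(f"responses truncated: {truncated_share(lengths, cap=1_500):.0%}")
```

If the truncated share is more than a sliver of traffic, the cap is too tight or the prompt needs an explicit length instruction.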

Lever 4 — Batch APIs

For non-interactive workloads — overnight document processing, weekly report generation, nightly content tagging — provider batch APIs offer ~50% discounts at the cost of latency (results returned in hours, not seconds). If your workload tolerates delayed delivery, this is free money.
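A quick way to size this lever is to tag each job with how long its consumer can wait and move the tolerant ones to batch. The job list, the 24-hour turnaround, and the costs below are assumptions for illustration; the 50% discount is the approximate figure quoted above.

```python
from dataclasses import dataclass

BATCH_DISCOUNT = 0.5         # roughly 50%, per the figure quoted above
BATCH_TURNAROUND_HOURS = 24  # assumed worst-case turnaround; check your provider


@dataclass
class Job:
    name: str
    monthly_cost: float       # USD at interactive prices
    max_latency_hours: float  # how long the consumer can wait for results


def plan_batching(jobs: list) -> tuple:
    """Return the names of batch-eligible jobs and the resulting monthly total."""
    batched = []
    total = 0.0
    for job in jobs:
        if job.max_latency_hours >= BATCH_TURNAROUND_HOURS:
            batched.append(job.name)
            total += job.monthly_cost * (1 - BATCH_DISCOUNT)
        else:
            total += job.monthly_cost
    return batched, total


jobs = [
    Job("support-chat", monthly_cost=1_000.0, max_latency_hours=0.01),
    Job("nightly-tagging", monthly_cost=400.0, max_latency_hours=24),
    Job("weekly-reports", monthly_cost=200.0, max_latency_hours=168),
]
batched, total = plan_batching(jobs)
print(batched, total)
```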

Lever 5 — Eval-driven downgrades

As models improve, last year’s frontier model becomes this year’s mid-tier — often at half the price. Maintain an eval suite so you can confidently downgrade when a cheaper model passes your bar. Without the eval, you stay on the expensive model out of risk aversion. With the eval, you ride the price curve down.
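The downgrade decision reduces to a gate: run the same eval suite against each candidate and pick the cheapest one that clears the bar. The sketch below assumes each eval case yields a pass/fail result; the model names, prices, and results are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    name: str
    price_per_m_output: float  # hypothetical USD per million output tokens
    eval_results: List[bool]   # one pass/fail entry per eval case


def pass_rate(results: List[bool]) -> float:
    """Share of eval cases passed; an empty suite counts as failing."""
    return sum(results) / len(results) if results else 0.0


def cheapest_passing(candidates: List[Candidate], quality_bar: float) -> Optional[Candidate]:
    """Cheapest candidate whose pass rate meets the bar, or None if none does."""
    passing = [c for c in candidates if pass_rate(c.eval_results) >= quality_bar]
    return min(passing, key=lambda c: c.price_per_m_output, default=None)


candidates = [
    Candidate("frontier", 15.0, [True] * 97 + [False] * 3),
    Candidate("mid-tier", 7.5, [True] * 95 + [False] * 5),
    Candidate("small", 1.5, [True] * 88 + [False] * 12),
]

choice = cheapest_passing(candidates, quality_bar=0.95)
print(choice.name if choice else "stay on current model")
```

Rerunning this gate whenever a provider ships a new model or changes prices is what turns the eval suite into a recurring saving rather than a one-off check.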

10× growth, flat cost

When the five levers stack, 10× usage growth without 10× cost growth is a tractable goal. Most enterprises see 40–70% cost reduction on existing workloads after a focused optimization pass — and the savings recur every month going forward.
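To see how the levers combine against growth, the arithmetic can be sketched as below. It makes a simplifying assumption that each lever's saving applies to the whole remaining bill, which overstates the result when levers target different slices of spend; the individual savings figures are illustrative.

```python
from math import prod


def combined_reduction(savings: list) -> float:
    """Overall reduction when each saving applies to what the previous ones left."""
    return 1 - prod(1 - s for s in savings)


def cost_multiple(usage_growth: float, reduction: float) -> float:
    """New bill as a multiple of the old one after growth and optimization."""
    return usage_growth * (1 - reduction)


def reduction_for_flat_cost(usage_growth: float) -> float:
    """Per-unit reduction needed for the bill to stay exactly flat."""
    return 1 - 1 / usage_growth


print(f"three levers at 30%, 20%, 10%: {combined_reduction([0.3, 0.2, 0.1]):.1%}")
print(f"10x usage at 40% reduction: {cost_multiple(10, 0.4):.1f}x the old bill")
print(f"10x usage at 70% reduction: {cost_multiple(10, 0.7):.1f}x the old bill")
print(f"reduction needed for a flat bill at 10x: {reduction_for_flat_cost(10):.0%}")
```

Under these assumptions, the 40–70% range quoted above turns 10× usage into roughly 3–6× cost; a fully flat bill at 10× would need about a 90% drop in cost per unit of usage.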

IDS AI Solutions runs the FinOps for AI assessment as part of the AI Audit — current cost distribution, top three savings opportunities, evaluation infrastructure status. Talk to our team.

Frequently asked questions

Which lever should we pull first?

Model routing, but only after you have an evaluation suite. A small model handles 60–80% of routine traffic at a fraction of the cost, but you need eval data to confirm it meets your quality bar on routed queries. Without the eval, routing is risky and most teams retreat to the expensive model.

Does prompt caching actually save money in practice?

Yes, dramatically for RAG-style workloads with stable system prompts and shared retrieved context. Provider-level prompt caching discounts repeated prefix tokens by 50–90%. Your own caching at the retrieval and answer layers adds another 10–30% in savings. Combined, RAG systems typically see 40–60% input-token cost reduction.

When does the batch API trade-off make sense?

For workloads that tolerate latency measured in hours: overnight document processing, weekly report generation, nightly content tagging, monthly compliance reviews. Batch APIs typically offer ~50% discounts. The trade is cost vs. latency; if no human is waiting, the discount is free money.