Token cost grows linearly with usage. Most enterprises don’t optimize until the monthly bill becomes uncomfortable, usually in the third or fourth quarter of a successful AI rollout. The good news: there are five well-known levers, and their savings compound. The bad news: most teams pull them out of order, and a few of the levers depend on eval infrastructure that should have been built sooner.
Where cost actually comes from
Output tokens (what the model generates) cost 3–5× more than input tokens (what you send in: the prompt, the retrieved context, the conversation history) at most providers. Model tier matters as well: the frontier model costs roughly 10× its small-model sibling, sometimes more. Workload shape determines which side dominates: chatbot-style traffic with short responses is dominated by the fixed per-call overhead of the system prompt and conversation history re-sent on every turn; RAG with long retrieved context is dominated by input tokens; agent workflows pay both sides several times per task. Map your actual cost distribution before applying any lever.
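As a starting point for that mapping, here is a minimal sketch that splits monthly spend into input and output cost per workload. The prices and token volumes are placeholder assumptions, not any provider's actual rates; substitute the figures from your own invoices and usage logs.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens; replace with your provider's rates.
INPUT_PRICE_PER_M = 3.00
OUTPUT_PRICE_PER_M = 15.00


@dataclass
class Workload:
    name: str
    input_tokens: int   # total input tokens per month
    output_tokens: int  # total output tokens per month


def cost_breakdown(w: Workload) -> dict:
    """Return input cost, output cost, total, and input share for one workload."""
    input_cost = w.input_tokens / 1_000_000 * INPUT_PRICE_PER_M
    output_cost = w.output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M
    total = input_cost + output_cost
    return {
        "name": w.name,
        "input_cost": input_cost,
        "output_cost": output_cost,
        "total": total,
        "input_share": input_cost / total if total else 0.0,
    }


# Illustrative monthly volumes only.
workloads = [
    Workload("support_chatbot", input_tokens=200_000_000, output_tokens=40_000_000),
    Workload("rag_search", input_tokens=2_000_000_000, output_tokens=50_000_000),
]

for w in workloads:
    b = cost_breakdown(w)
    print(f"{b['name']}: ${b['total']:,.0f} total, {b['input_share']:.0%} from input tokens")
```

With these made-up volumes the chatbot splits its spend evenly between input and output, while the RAG workload is almost entirely input cost. That difference is what decides which lever to pull first.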
Lever 1 — Model routing
The highest-leverage cost lever. Not every query needs the frontier model. A small model handles 60–80% of routine questions at 5–10× lower cost. The pattern: a fast cheap router scores incoming queries on complexity, sends the simple ones to the small model, escalates only when needed. Run an eval suite to confirm the small model meets your quality bar on routed traffic; the eval is the prerequisite, not an afterthought.
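The routing pattern can be sketched in a few lines. This is an illustration under stated assumptions, not a production router: the keyword heuristic, the threshold, and the tier names `small-model` and `frontier-model` are all hypothetical, and a real router would more likely use a trained classifier or a small model as the scorer.

```python
def complexity_score(query: str) -> float:
    """Toy heuristic score in [0, 1]; higher means the query looks harder."""
    text = query.lower()
    signals = [
        len(query.split()) > 40,  # long queries tend to be harder
        any(k in text for k in ("compare", "analyze", "step by step", "why")),
        query.count("?") > 1,     # multi-part questions
    ]
    return sum(signals) / len(signals)


def route(query: str, threshold: float = 0.3) -> str:
    """Return the (hypothetical) model tier that should serve this query."""
    if complexity_score(query) >= threshold:
        return "frontier-model"
    return "small-model"


for q in (
    "What are your opening hours?",
    "Compare plan A and plan B and analyze the trade-offs.",
):
    print(f"{route(q):>14}  <-  {q}")
```

Whatever scorer you use, log the routing decision alongside the eval result for each query, so the threshold can be tuned against measured quality rather than intuition.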
Lever 2 — Prompt caching
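Provider-level prompt caching (Anthropic and OpenAI both support it) discounts repeated prefix tokens by 50–90%. The discount applies only to input that is identical from the start of the prompt, so put stable content (system prompt, shared context) first and variable content last. RAG systems with stable system prompts and shared retrieved context win the most. Your own caching at the retrieval layer, the answer layer, and the embedding layer can save another 10–30% on top, especially for FAQ-shaped workloads.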
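For the caching you run yourself, an answer-layer cache is the simplest place to start. The sketch below is a minimal exact-match cache keyed on a normalized query. The class name `AnswerCache` and the stand-in `fake_model` function are illustrative assumptions; a real deployment would add expiry and invalidation, and might match on embeddings rather than exact text.

```python
import hashlib
from typing import Callable, Dict


class AnswerCache:
    """Exact-match answer cache keyed on a normalized query string."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(query: str) -> str:
        # Collapse whitespace and case so trivially different phrasings share a key.
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get_or_compute(self, query: str, compute: Callable[[str], str]) -> str:
        key = self._key(query)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        answer = compute(query)  # the expensive model call happens only on a miss
        self._store[key] = answer
        return answer


def fake_model(query: str) -> str:
    """Stand-in for a real model call."""
    return f"answer to: {query.strip()}"


cache = AnswerCache()
cache.get_or_compute("How do I reset my password?", fake_model)
cache.get_or_compute("  how do I reset my   PASSWORD? ", fake_model)
print(f"hits={cache.hits} misses={cache.misses}")
```

The hit rate this cache reports is the number to watch: FAQ-shaped traffic will show a high rate quickly, while open-ended traffic may not justify the extra layer.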
Lever 3 — Response budgets
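Set max_tokens. Most outputs don’t need 4,000-token responses; most production responses sit comfortably under 500. Cap them. Note that max_tokens is a hard cutoff, not a style instruction: the model does not write more concisely because of the cap, it simply stops when it reaches the limit. Ask for brevity in the prompt and treat the cap as a safety net. A 1,500-token cap on responses that average 600 keeps the long tail from running away while preserving normal behavior.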
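One way to pick the cap is from logged output lengths rather than by guesswork. The sketch below suggests a cap from a percentile of observed lengths and reports how many past responses would have been cut off at that cap. The function names, the percentile, the headroom factor, and the sample lengths are all illustrative assumptions.

```python
import math


def suggest_cap(output_lengths, percentile=0.9, headroom=1.25):
    """Suggest a max_tokens cap covering `percentile` of observed outputs, plus headroom."""
    if not output_lengths:
        raise ValueError("need at least one observed output length")
    ordered = sorted(output_lengths)
    # Nearest-rank percentile: smallest value with at least `percentile` of samples at or below it.
    rank = max(1, math.ceil(percentile * len(ordered)))
    return math.ceil(ordered[rank - 1] * headroom)


def truncation_rate(output_lengths, cap):
    """Fraction of observed outputs that would have exceeded the cap."""
    return sum(1 for n in output_lengths if n > cap) / len(output_lengths)


# Illustrative output lengths (in tokens) from a hypothetical log sample.
observed = [300, 450, 500, 600, 620, 700, 800, 900, 1200, 3800]

cap = suggest_cap(observed)
print(f"suggested cap: {cap} tokens")
print(f"responses that would be cut off: {truncation_rate(observed, cap):.0%}")
```

If the truncation rate is higher than you can accept, raise the percentile or shorten responses through the prompt. Lowering the cap alone only moves where the text is cut.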
Lever 4 — Batch APIs
For non-interactive workloads — overnight document processing, weekly report generation, nightly content tagging — provider batch APIs offer ~50% discounts at the cost of latency (results returned in hours, not seconds). If your workload tolerates delayed delivery, this is free money.
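To size the opportunity, sort workloads by how long their consumers can wait. The sketch below assumes a 50% batch discount and a 24-hour worst-case turnaround. Both are assumptions to check against your provider's current terms, and the job list is made up.

```python
from dataclasses import dataclass

BATCH_DISCOUNT = 0.50        # assumed discount; check your provider's current pricing
BATCH_TURNAROUND_HOURS = 24  # assumed worst-case turnaround for a batch job


@dataclass
class Job:
    name: str
    monthly_cost_usd: float  # cost at standard (synchronous) pricing
    max_delay_hours: float   # how long the consumer can wait for results


def batch_savings(jobs):
    """Return (eligible job names, monthly savings) if eligible jobs move to batch."""
    eligible = [j for j in jobs if j.max_delay_hours >= BATCH_TURNAROUND_HOURS]
    savings = sum(j.monthly_cost_usd for j in eligible) * BATCH_DISCOUNT
    return [j.name for j in eligible], savings


# Illustrative jobs only.
jobs = [
    Job("live_chat", monthly_cost_usd=8000, max_delay_hours=0),
    Job("nightly_tagging", monthly_cost_usd=3000, max_delay_hours=24),
    Job("weekly_reports", monthly_cost_usd=1000, max_delay_hours=72),
]

names, savings = batch_savings(jobs)
print(f"move to batch: {names}, estimated savings: ${savings:,.0f}/month")
```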
Lever 5 — Eval-driven downgrades
As models improve, last year’s frontier model becomes this year’s mid-tier — often at half the price. Maintain an eval suite so you can confidently downgrade when a cheaper model passes your bar. Without the eval, you stay on the expensive model out of risk aversion. With the eval, you ride the price curve down.
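The downgrade decision reduces to a simple gate: among the models that pass your eval bar, pick the cheapest. The sketch below assumes you already have a per-model eval score in [0, 1]; the model names, prices, and scores are hypothetical.

```python
def cheapest_passing_model(prices, eval_scores, quality_bar):
    """Return the lowest-priced model whose eval score meets the bar, or None.

    prices: model name -> relative price (any consistent unit)
    eval_scores: model name -> score on your eval suite
    Models with no recorded score are treated as failing.
    """
    passing = [
        (price, name)
        for name, price in prices.items()
        if eval_scores.get(name, 0.0) >= quality_bar
    ]
    return min(passing)[1] if passing else None


# Hypothetical relative prices and eval scores.
prices = {"frontier-model": 10.0, "mid-model": 3.0, "small-model": 1.0}
scores = {"frontier-model": 0.95, "mid-model": 0.91, "small-model": 0.82}

print(cheapest_passing_model(prices, scores, quality_bar=0.90))
```

Rerunning this gate whenever a provider ships a new model or changes prices turns the downgrade into a routine check instead of a judgment call.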
10× growth, flat cost
When the five levers stack, 10× usage growth without 10× cost growth is a tractable goal. Most enterprises see 40–70% cost reduction on existing workloads after a focused optimization pass — and the savings recur every month going forward.
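The stacking arithmetic is worth seeing once. If each lever removes a fraction of whatever bill remains after the previous one, the remaining cost is the product of the individual remainders. The per-lever percentages below are illustrative assumptions, not measurements, and treating the levers as independent is a simplification, since in practice they overlap (for example, routing shrinks the spend that caching can act on).

```python
from math import prod


def combined_reduction(lever_reductions):
    """Total fractional reduction when each lever applies to the remaining bill."""
    residual = prod(1.0 - r for r in lever_reductions.values())
    return 1.0 - residual


# Illustrative per-lever reductions on the remaining bill.
levers = {
    "model_routing": 0.30,
    "prompt_caching": 0.20,
    "response_budgets": 0.05,
    "batch_apis": 0.10,
    "eval_driven_downgrades": 0.10,
}

reduction = combined_reduction(levers)
usage_growth = 10
cost_multiplier = usage_growth * (1.0 - reduction)

print(f"combined reduction: {reduction:.1%}")
print(f"cost after {usage_growth}x usage growth: {cost_multiplier:.1f}x the original bill")
```

With these assumed figures, the five levers together cut unit cost by a little over half, so 10× usage lands at roughly 4× the original bill rather than 10×.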
IDS AI Solutions runs the FinOps for AI assessment as part of the AI Audit — current cost distribution, top three savings opportunities, evaluation infrastructure status. Talk to our team.
