Why this matters now
LLM costs were not a line item most businesses cared about two years ago. Today, companies are spending tens of thousands of dollars per month on Anthropic, OpenAI, and Google API fees, and that spend is growing faster than most CFOs realize.
The good news: LLM costs are far more controllable than they seem. Most businesses are spending 3-5x what they need to because they have never invested in cost discipline. This guide walks through where the waste is and how to cut it.
Where LLM spend actually goes
You're paying for three things: input tokens (the context and prompt you send), output tokens (what the model generates), and in some cases, extra features like extended thinking, vision, or cached context.
For most business workloads, input tokens dominate. Long system prompts, retrieved documents, and conversation histories stack up. Output is usually a small fraction of cost. That means trimming inputs gives the biggest immediate savings.
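That split is easy to make concrete with a per-call cost estimator. A minimal sketch follows; the tier names and per-million-token prices are illustrative placeholders, not real rates, so substitute your provider's current price sheet.

```python
# Illustrative placeholder prices in dollars per million tokens --
# NOT real provider rates; replace with your current price sheet.
PRICE_PER_MTOK = {
    "small":    {"input": 0.25,  "output": 1.25},
    "mid":      {"input": 3.00,  "output": 15.00},
    "frontier": {"input": 15.00, "output": 75.00},
}

def call_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call; tokens are billed per million."""
    p = PRICE_PER_MTOK[tier]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A typical RAG call: 6,000 input tokens (system prompt + retrieved docs),
# 300 output tokens. Even though output is 5x pricier per token here,
# input still accounts for 80% of the call's cost.
cost = call_cost("mid", 6_000, 300)
```

Running the numbers on your own traffic this way usually confirms the pattern: the input side of the ledger is where the money goes.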
Lever 1: Model selection per task
Don't route every request to the frontier model. At 2026 list pricing, Claude Opus costs roughly 15x Claude Haiku per token, GPT-4 Turbo roughly 20x GPT-4 Mini, and Gemini Pro roughly 10x Gemini Flash.
Most workflows have a mix of complexity:
- Classification, routing, simple extractions → small model (Haiku, Flash, Mini). 70-90% of most workloads.
- Medium-complexity reasoning, drafting, summarization → mid-tier (Claude Sonnet, GPT-4 Turbo, Gemini Pro).
- Complex reasoning, long-form writing, hard coding → frontier (Claude Opus, GPT-4, Gemini Ultra).
Build a router that picks the right model for each task. The infrastructure investment pays back within weeks on most production systems.
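In its simplest form the router is just a lookup from task category to model. The categories and model identifiers below are assumptions for illustration; a production router would map them to your own task taxonomy and model catalog.

```python
# Hypothetical task-to-model routing table -- the task names and model
# identifiers are illustrative, not a recommendation for your workload.
ROUTES = {
    "classification":    "claude-haiku",
    "extraction":        "claude-haiku",
    "summarization":     "claude-sonnet",
    "drafting":          "claude-sonnet",
    "complex_reasoning": "claude-opus",
}

def route(task_type: str) -> str:
    """Pick the cheapest model tier known to handle the task.
    Unknown task types default to the mid-tier, not the frontier model,
    so new workflows don't silently land on the most expensive option."""
    return ROUTES.get(task_type, "claude-sonnet")
```

The defaulting choice matters: if unrecognized tasks fall through to the frontier model, every new workflow starts at maximum cost until someone notices.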
Real numbers
A client running 50K AI operations per month was spending $8,400/mo on Claude Opus for everything. After routing, spend dropped to $1,900/mo (a 77% reduction) with no measurable quality loss — because 80% of their operations were classification and didn't need Opus in the first place.
Lever 2: Prompt and context optimization
Long system prompts are a hidden cost. Every API call sends the entire prompt again — even if it's identical to the last 10,000 calls.
- Use prompt caching (Anthropic and Gemini support it natively) to avoid paying full price for identical repeated system prompts. It can cut input costs 60-90% for chat-style workloads.
- Trim system prompts ruthlessly. A 4,000-token system prompt is almost always doing 1,000 tokens of real work.
- Use retrieval instead of dumping everything into context. Retrieve the 3 most relevant documents, not all 50.
- Summarize long conversation histories. At the 50-message mark, compress older turns into a summary.
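The last bullet — compressing long histories — can be sketched in a few lines. This is a hedged sketch: `summarize` is a hypothetical stand-in for a call to a cheap small model, and the 10-turn threshold is an assumed default, not a recommendation.

```python
KEEP_RECENT = 10  # assumed threshold: keep the last 10 turns verbatim

def summarize(messages: list[dict]) -> str:
    # Placeholder: in production this would be a small-model API call
    # that condenses the older turns into a short summary.
    return f"[summary of {len(messages)} earlier messages]"

def compress_history(messages: list[dict]) -> list[dict]:
    """Collapse everything older than the last KEEP_RECENT turns into a
    single summary message, so input tokens stop growing linearly with
    conversation length."""
    if len(messages) <= KEEP_RECENT:
        return messages
    older, recent = messages[:-KEEP_RECENT], messages[-KEEP_RECENT:]
    summary = {"role": "user", "content": summarize(older)}
    return [summary] + recent
```

Note the interaction with prompt caching: keep the summary stable between calls where you can, or each re-summarization invalidates the cached prefix.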
Lever 3: Caching
Anything that repeats can be cached. Most production systems repeat far more than their engineers realize:
- System prompts — use provider-native prompt caching
- Retrieved context — cache RAG results for identical queries
- Output — cache responses for frequently-asked questions or deterministic transformations
- Classifications — cache classification results per input (e.g., spam / not spam)
- Embeddings — never recompute the same embedding twice; cache aggressively
Lever 4: Output discipline
Output costs can balloon when models over-explain. Control them:
- Set max_tokens appropriately — if you need a 1-sentence answer, don't let the model write 3 paragraphs
- Ask for structured output (JSON, enum values) where possible — shorter and more reliable
- Use "be concise" instructions when verbosity isn't the point
- Avoid chain-of-thought unless you actually need the reasoning exposed — it adds 3-10x output tokens
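One way to enforce this is to set output limits per task type at the request-building layer rather than per call site. The task names, limits, and instruction wording below are illustrative assumptions, not tuned values.

```python
# Assumed per-task output caps -- tune these against your own traffic.
OUTPUT_LIMITS = {
    "classification": 16,    # a single label needs almost no tokens
    "summary":        300,
    "draft":          1500,
}

def request_params(task: str, prompt: str) -> dict:
    """Build call parameters with a task-appropriate max_tokens cap.
    Classification additionally gets a terse structured-answer
    instruction, since verbosity is pure cost there."""
    params = {"max_tokens": OUTPUT_LIMITS.get(task, 500), "prompt": prompt}
    if task == "classification":
        params["prompt"] += "\nAnswer with exactly one label, no explanation."
    return params
```

Centralizing the caps also gives you one place to audit when output spend drifts upward.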
Observability: you can't optimize what you don't measure
Every LLM call should be logged with model, input tokens, output tokens, cost, latency, and the workflow/user it belongs to. Pipe that to a dashboard.
- Weekly cost-per-workflow reports — where is money going?
- Anomaly alerts — if a workflow spikes 3x in a day, you want to know
- User-level cost tracking — is one power user driving 40% of your bill?
- Model mix tracking — what fraction of calls goes to each model? If 90% is going to Opus, you have optimization to do.
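The log record and the first of those reports can be sketched as follows. The field names are assumptions chosen to match the list above; adapt them to your logging schema.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class CallLog:
    """One row per LLM call -- the minimum fields worth capturing."""
    model: str
    workflow: str
    user: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    latency_ms: float

def cost_by_workflow(logs: list[CallLog]) -> dict[str, float]:
    """Cost-per-workflow rollup: the basis of the weekly report."""
    totals: dict[str, float] = defaultdict(float)
    for log in logs:
        totals[log.workflow] += log.cost_usd
    return dict(totals)
```

The same fold, grouped by `user` or `model` instead of `workflow`, yields the user-level and model-mix reports.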
Budget caps and circuit breakers
- Per-user rate limits — a single runaway user (or bot) can spike your bill 10x overnight
- Per-workflow budget caps — if a workflow exceeds its budget, it's paused automatically
- Monthly spend alerts at 50%, 75%, 90% of budget
- Hard kill switch per workflow — ability to disable a workflow that's misbehaving without touching others
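A per-workflow cap plus kill switch fits in one small class. This is a sketch under assumed semantics: spend is tracked in-process, a tripped workflow refuses calls until manually reset, and the reset doubles as the manual kill-switch override. A production version would persist state and reset spend on a billing-cycle boundary.

```python
class BudgetBreaker:
    """Per-workflow budget circuit breaker (in-memory sketch)."""

    def __init__(self, caps: dict[str, float]):
        self.caps = caps                              # workflow -> budget cap ($)
        self.spend = {w: 0.0 for w in caps}           # running spend ($)
        self.tripped: set[str] = set()                # paused workflows

    def record(self, workflow: str, cost_usd: float) -> None:
        """Record a call's cost; trip the breaker if the cap is hit."""
        self.spend[workflow] += cost_usd
        if self.spend[workflow] >= self.caps[workflow]:
            self.tripped.add(workflow)

    def allow(self, workflow: str) -> bool:
        """Gate checked before every call for this workflow."""
        return workflow not in self.tripped

    def reset(self, workflow: str) -> None:
        """Manual override -- also serves as the kill-switch release."""
        self.tripped.discard(workflow)
```

The key property is isolation: one workflow blowing its budget pauses only itself, never its neighbors.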
Contract tier selection
At scale, enterprise tiers from Anthropic, OpenAI, and Google offer meaningful discounts. The break-even is usually around $5K-$10K/mo of spend. Below that, stay on standard tiers. Above that, negotiate.
Volume commitments get 20-50% off typical list pricing. Multi-year commitments get more. Dedicated capacity (reserved throughput) costs more but gives SLA guarantees — valuable if you're building customer-facing products.
The 30-day cost audit
- Week 1: Instrument everything. Log every call with tokens, model, workflow, user.
- Week 2: Review the data. Where is spend concentrated? Which workflows are most expensive?
- Week 3: Pick the top 3 cost centers and optimize — trim prompts, route to smaller models, add caching.
- Week 4: Measure impact, document learnings, implement ongoing monitoring.
Typical audit outcome
A focused 30-day audit cuts LLM spend 40-70% for most teams without reducing quality. If you haven't done one yet, this is the single highest-ROI engineering investment on your AI roadmap.