
The LLM Cost Playbook

A practical guide to managing LLM API spend as it scales — model selection, prompt optimization, caching, routing, and the observability that keeps costs predictable.

Adam Smith · April 16, 2026 · 11 min read
TL;DR
  • LLM costs compound fast. A usage pattern that costs $200/mo at 10 users can hit $30K/mo at 1,000 users without discipline.
  • Three biggest levers: model selection per task, prompt + context optimization, and aggressive caching of anything reusable.
  • Routing simple queries to cheap models (Haiku, Gemini Flash) and complex ones to frontier models can cut spend 70% with no quality loss.
  • Without observability, you cannot optimize. Instrument first, optimize second.

Why this matters now

LLM costs were not a line item most businesses cared about two years ago. Today, companies are spending tens of thousands per month on Anthropic, OpenAI, and Google API fees — and that spend is growing faster than most CFOs realize.

The good news: LLM costs are far more controllable than they seem. Most businesses spend 3-5x what they need to, simply because they have never applied any cost discipline to their AI usage. This guide walks through where the waste is and how to cut it.

Where LLM spend actually goes

You're paying for three things: input tokens (the context and prompt you send), output tokens (what the model generates), and in some cases, extra features like extended thinking, vision, or cached context.

For most business workloads, input tokens dominate. Long system prompts, retrieved documents, and conversation histories stack up. Output is usually a small fraction of cost. That means trimming inputs gives the biggest immediate savings.
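To see why inputs dominate, it helps to run the arithmetic on a single call. The sketch below uses hypothetical per-million-token prices (substitute your provider's current rates) for a chat turn that sends a long system prompt plus retrieved context and gets back a short reply:

```python
# Rough cost split for one chat-style API call. The per-million-token
# prices below are ASSUMPTIONS for illustration, not real rates.
INPUT_PRICE = 3.00    # $ per 1M input tokens (assumed)
OUTPUT_PRICE = 15.00  # $ per 1M output tokens (assumed)

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single API call."""
    return (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE) / 1_000_000

# A typical turn: 2,000-token system prompt + 6,000 tokens of history
# and retrieved documents, producing a 300-token reply.
cost = call_cost(input_tokens=8_000, output_tokens=300)
print(f"${cost:.4f} per call")  # $0.0285 per call
```

Even at a 5x higher per-token rate for output, the input side accounts for roughly 84% of this call's cost, which is why trimming context pays off first.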

Lever 1: Model selection per task

Don't route every request to the frontier model. In 2026 pricing, Claude Opus costs roughly 15x Claude Haiku. GPT-4 Turbo costs 20x GPT-4 Mini. Gemini Pro costs 10x Gemini Flash.

Most workflows have a mix of complexity:

  • Classification, routing, simple extractions → small model (Haiku, Flash, Mini). 70-90% of most workloads.
  • Medium-complexity reasoning, drafting, summarization → mid-tier (Claude Sonnet, GPT-4 Turbo, Gemini Pro).
  • Complex reasoning, long-form writing, hard coding → frontier (Claude Opus, GPT-4, Gemini Ultra).

Build a router that picks the right model for each task. The infrastructure investment pays back within weeks on most production systems.
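A router can start as nothing more than a lookup table from task type to model tier. The sketch below is a minimal version; the task labels, model names, and tier assignments are illustrative, not a prescribed setup:

```python
# Minimal task-based model router: map each task type to the cheapest
# model tier that handles it well. Labels and model names are examples.
ROUTES = {
    "classify":  "claude-haiku",    # cheap tier: classification, extraction
    "extract":   "claude-haiku",
    "summarize": "claude-sonnet",   # mid tier: drafting, summarization
    "draft":     "claude-sonnet",
    "code":      "claude-opus",     # frontier tier: hard reasoning
    "reason":    "claude-opus",
}
DEFAULT_MODEL = "claude-sonnet"     # unknown tasks fall back to mid tier

def pick_model(task_type: str) -> str:
    return ROUTES.get(task_type, DEFAULT_MODEL)

print(pick_model("classify"))   # claude-haiku
print(pick_model("new-task"))   # claude-sonnet
```

Defaulting unknown tasks to the mid tier (rather than the frontier model) keeps new workflows from silently landing on your most expensive pricing; you can always promote a task after measuring quality.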

Real numbers

A client running 50K AI operations per month was spending $8,400/mo on Claude Opus for everything. After routing, spend dropped to $1,900/mo with no measurable quality loss — because 80% of their operations were classification and didn't need Opus in the first place.

Lever 2: Prompt and context optimization

Long system prompts are a hidden cost. Every API call sends the entire prompt again — even if it's identical to the last 10,000 calls.

  • Use prompt caching (Anthropic and Gemini support it natively) to avoid re-charging for repeated system prompts. This can cut input costs 60-90% for chat-style workloads.
  • Trim system prompts ruthlessly. A 4,000-token system prompt is almost always doing 1,000 tokens of real work.
  • Use retrieval instead of dumping everything into context. Retrieve the 3 most relevant documents, not all 50.
  • Summarize long conversation histories. At the 50-message mark, compress older turns into a summary.
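The history-compression step can be sketched in a few lines. This is one possible shape, with assumed thresholds; the `summarize()` call is a placeholder for a cheap-model API call:

```python
# Sketch of conversation-history compression: once a conversation
# passes a turn threshold, replace older turns with one summary message.
KEEP_RECENT = 10   # turns kept verbatim (assumed threshold)
TRIGGER = 50       # compress once history exceeds this many turns

def summarize(turns):
    # Placeholder: in practice, send `turns` to a small, cheap model
    # and ask for a short summary. Here we just count them.
    return {"role": "system", "content": f"[summary of {len(turns)} earlier turns]"}

def compress_history(history):
    if len(history) <= TRIGGER:
        return history
    older, recent = history[:-KEEP_RECENT], history[-KEEP_RECENT:]
    return [summarize(older)] + recent

history = [{"role": "user", "content": f"msg {i}"} for i in range(60)]
compact = compress_history(history)
print(len(compact))  # 11: one summary message + 10 recent turns
```

Note that compressing the history also plays well with prompt caching: the stable prefix (system prompt plus summary) changes rarely, so more of it stays cache-eligible.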

Lever 3: Caching

Anything that repeats can be cached. Most production systems repeat far more than their engineers realize:

  • System prompts — use provider-native prompt caching
  • Retrieved context — cache RAG results for identical queries
  • Output — cache responses for frequently-asked questions or deterministic transformations
  • Classifications — cache classification results per input (e.g., spam / not spam)
  • Embeddings — never recompute the same embedding twice; cache aggressively

Lever 4: Output discipline

Output costs can balloon when models over-explain. Control them:

  • Set max_tokens appropriately — if you need a 1-sentence answer, don't let the model write 3 paragraphs
  • Ask for structured output (JSON, enum values) where possible — shorter and more reliable
  • Use "be concise" instructions when verbosity isn't the point
  • Avoid chain-of-thought unless you actually need the reasoning exposed — it adds 3-10x output tokens
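Combining a tight `max_tokens` with an enum-style answer looks roughly like this. The client here is a stub standing in for a real SDK call, and the parameter names mirror common chat APIs but are assumptions:

```python
# Sketch of output discipline: cap max_tokens and constrain the answer
# to a fixed set of labels instead of free-form prose.
VALID = {"billing", "bug", "other"}

def classify_ticket(call_model, text: str) -> str:
    raw = call_model(
        system="Reply with exactly one word: billing, bug, or other.",
        user=text,
        max_tokens=5,        # an enum label never needs more
    )
    answer = raw.strip().lower()
    return answer if answer in VALID else "other"  # never trust free text

# Stub model for demonstration: always answers "Billing ".
print(classify_ticket(lambda **kw: "Billing ", "Card was charged twice"))
# billing
```

The validation step matters as much as the token cap: normalizing and falling back to a known label means downstream code never sees an unexpected string.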

Observability: you can't optimize what you don't measure

Every LLM call should be logged with model, input tokens, output tokens, cost, latency, and the workflow/user it belongs to. Pipe that to a dashboard.

  • Weekly cost-per-workflow reports — where is money going?
  • Anomaly alerts — if a workflow spikes 3x in a day, you want to know
  • User-level cost tracking — is one power user driving 40% of your bill?
  • Model mix tracking — what fraction of calls goes to each model? If 90% is going to Opus, you have optimization to do.
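A thin logging wrapper around every model call is enough to feed all four reports above. This is a sketch with hypothetical prices and a stubbed model call; in production the log rows would go to your warehouse or metrics pipeline:

```python
# Sketch of per-call cost logging: record model, tokens, cost, latency,
# and workflow/user for every call. Prices are ASSUMED rates.
import time

PRICE_PER_M = {  # $ per 1M tokens (input, output) -- hypothetical
    "small-model": (0.25, 1.25),
    "big-model":   (3.00, 15.00),
}

LOG: list[dict] = []  # in production, ship these rows to a warehouse

def logged_call(model, workflow, user, fn, *args, **kwargs):
    start = time.monotonic()
    result, in_tok, out_tok = fn(*args, **kwargs)  # fn returns (text, in, out)
    in_price, out_price = PRICE_PER_M[model]
    LOG.append({
        "model": model,
        "workflow": workflow,
        "user": user,
        "input_tokens": in_tok,
        "output_tokens": out_tok,
        "cost": (in_tok * in_price + out_tok * out_price) / 1_000_000,
        "latency_s": time.monotonic() - start,
    })
    return result

# Demo with a stub call that "used" 8,000 input / 300 output tokens.
logged_call("big-model", "support-triage", "user-42",
            lambda: ("ok", 8_000, 300))
print(f"${LOG[0]['cost']:.4f}")  # $0.0285
```

Once every call flows through a wrapper like this, the weekly cost-per-workflow report is a single GROUP BY away.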

Budget caps and circuit breakers

  • Per-user rate limits — a single runaway user (or bot) can spike your bill 10x overnight
  • Per-workflow budget caps — if a workflow exceeds its budget, it's paused automatically
  • Monthly spend alerts at 50%, 75%, 90% of budget
  • Hard kill switch per workflow — ability to disable a workflow that's misbehaving without touching others
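The per-workflow cap and kill switch above can share one small mechanism: track spend per workflow, refuse calls once the cap is hit, and expose a manual disable flag. A minimal sketch (cap values and workflow names are illustrative):

```python
# Sketch of a per-workflow budget cap with a manual kill switch.
class BudgetBreaker:
    def __init__(self, monthly_cap: float):
        self.cap = monthly_cap
        self.spent = 0.0
        self.disabled = False  # manual kill switch

    def allow(self) -> bool:
        # Check this before every model call for the workflow.
        return not self.disabled and self.spent < self.cap

    def record(self, cost: float):
        self.spent += cost
        if self.spent >= self.cap:
            self.disabled = True  # trip the breaker; fire an alert here

breakers = {"support-triage": BudgetBreaker(monthly_cap=500.0)}

b = breakers["support-triage"]
b.record(499.0)
print(b.allow())  # True: under budget
b.record(2.0)
print(b.allow())  # False: cap exceeded, workflow paused
```

Resetting `spent` on a monthly schedule and wiring `disabled` to an admin toggle gives you both the automatic pause and the per-workflow kill switch without touching other workflows.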

Contract tier selection

At scale, enterprise tiers from Anthropic, OpenAI, and Google offer meaningful discounts. The break-even is usually around $5K-$10K/mo of spend. Below that, stay on standard tiers. Above that, negotiate.

Volume commitments get 20-50% off typical list pricing. Multi-year commitments get more. Dedicated capacity (reserved throughput) costs more but gives SLA guarantees — valuable if you're building customer-facing products.

The 30-day cost audit

  • Week 1: Instrument everything. Log every call with tokens, model, workflow, user.
  • Week 2: Review the data. Where is spend concentrated? Which workflows are most expensive?
  • Week 3: Pick the top 3 cost centers and optimize — trim prompts, route to smaller models, add caching.
  • Week 4: Measure impact, document learnings, implement ongoing monitoring.

Typical audit outcome

A focused 30-day audit cuts LLM spend 40-70% for most teams without reducing quality. If you haven't done one yet, this is the single highest-ROI engineering investment on your AI roadmap.

Frequently asked questions

What's a reasonable monthly LLM spend for a small business?


For a typical small business with one or two AI workflows in production, $200-$1,500/month is normal. Scale-ups running customer-facing AI features often reach $5K-$50K/month. Anyone above that without a cost audit is almost certainly overspending.

Is Claude or GPT cheaper?


Depends entirely on the model tier and workload. Claude Haiku is cheaper per token than GPT-4. Claude Opus is more expensive than GPT-4 Turbo. The right comparison is quality-per-dollar for your specific task — test both.

Can I self-host open models to save money?


For high-volume, specific workloads — yes. Llama 3 and Mistral Large can run for cents per thousand requests on your own GPU infrastructure. But there's a setup cost, operational burden, and quality gap vs. frontier models. Break-even is usually at $5K+/month of API spend on a single workflow.

Do you help optimize existing AI systems?


Yes — LLM cost audits are a specific service we offer as part of AI consulting engagements. We instrument your existing system, identify waste, and implement optimizations. Typical outcome is 40-70% cost reduction within 30 days.

Want us to do this for you?

Book a conversation — we'll scope the work and send you a proposal within one business day.