Your GPT-4o API bill hit $2,400 last month on 50 million tokens. The product works great — but at that burn rate, you're looking at $28,800/year before you've even scaled. Here's how to cut that bill by 60% without sacrificing output quality.
The Actual Numbers First
Before optimizing, you need a baseline. GPT-4o costs $2.50 per million input tokens and $10.00 per million output tokens (as of early 2026). At 50M tokens/month with a 4:1 input-to-output ratio, you're spending roughly:
- 40M input tokens × $2.50 = $100
- 10M output tokens × $10.00 = $100
Wait, that's only $200 — not $2,400. The gap usually comes from one of three places: you're generating far more output tokens than you realize, your context windows are bloated, or you're calling the wrong model for the task.
Run your actual numbers: input tokens per request, output tokens per request, requests per day. That's your real starting point.
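A few lines of arithmetic make that baseline concrete. This sketch uses the GPT-4o rates quoted above; the workload numbers in the example are purely illustrative assumptions.

```python
# Rough monthly-cost model using the GPT-4o rates quoted above.
GPT4O_INPUT_PER_M = 2.50    # $ per 1M input tokens
GPT4O_OUTPUT_PER_M = 10.00  # $ per 1M output tokens

def monthly_cost(input_tokens_per_req, output_tokens_per_req,
                 requests_per_day, days=30,
                 in_rate=GPT4O_INPUT_PER_M, out_rate=GPT4O_OUTPUT_PER_M):
    """Estimate monthly API spend in dollars."""
    reqs = requests_per_day * days
    input_cost = reqs * input_tokens_per_req / 1e6 * in_rate
    output_cost = reqs * output_tokens_per_req / 1e6 * out_rate
    return input_cost + output_cost

# Example: 8,000 input / 500 output tokens per request, 2,000 requests/day
print(monthly_cost(8_000, 500, 2_000))  # → 1500.0
```

If the number this prints doesn't match your invoice, one of your three inputs is wrong, and finding which one usually points straight at the leak.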
Strategy 1: Model Routing Saves 40-80%
The single biggest cost lever is not using GPT-4o for everything. Most production workloads have two distinct task types:
Tasks that need GPT-4o: complex reasoning, multi-step planning, nuanced writing, code generation with unclear requirements.
Tasks that don't: classification, extraction, summarization of structured data, filling templates, answering factual questions with retrieved context.
For the second category, switch to a smaller model:
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| GPT-4o-mini | $0.15 | $0.60 |
| Claude 3.5 Haiku | $0.80 | $4.00 |
| Gemini 2.0 Flash | $0.10 | $0.40 |
| Claude 3.7 Sonnet | $3.00 | $15.00 |
If 60% of your requests are classification or extraction tasks, routing those to GPT-4o-mini (about 94% cheaper per token) cuts your total bill by roughly 56%, while GPT-4o handles the 40% that actually needs it.
Build a simple router: check if the task type is in your "simple" bucket, call the cheap model, otherwise escalate to GPT-4o. Track quality on a sample of outputs for 2 weeks. Most teams find the cheap model matches quality on 80%+ of the routed tasks.
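The router described above can be sketched in a few lines. The model names come from the pricing table; the task-type labels are assumptions you'd replace with your own taxonomy.

```python
# Minimal model router: cheap model for "simple" task types, GPT-4o otherwise.
CHEAP_TASKS = {"classification", "extraction", "summarization", "template_fill"}

def pick_model(task_type: str) -> str:
    return "gpt-4o-mini" if task_type in CHEAP_TASKS else "gpt-4o"

# Expected blended savings if 60% of traffic routes to the cheap model,
# which is ~94% cheaper per token than GPT-4o:
savings = 0.60 * (1 - 0.15 / 2.50)  # input-rate ratio as a proxy
print(f"{savings:.0%}")             # → 56%
```

In production the task type would come from your request metadata or a lightweight classifier, but the escalation logic stays this simple.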
Strategy 2: Prompt Caching Gives You 50% Off Repeated Prefixes
If you have a long system prompt that doesn't change between requests — say, a 2,000-token product specification or retrieval context — prompt caching saves 50-90% on those tokens, depending on the provider.
Both Anthropic and OpenAI support prompt caching. The mechanics differ slightly:
Anthropic (Claude): Add "cache_control": {"type": "ephemeral"} to the last content block of the static prefix. The cache lasts 5 minutes; reading a cached prefix costs $0.30 per million tokens instead of $3.00 for Claude 3.7 Sonnet.
OpenAI: Automatic for prompts over 1,024 tokens. Cached input tokens are billed at 50% of the standard rate. No code changes needed.
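Here is a sketch of what the Anthropic payload looks like with caching enabled. The model alias and the STATIC_SPEC placeholder are assumptions; the key points are that cache_control sits on the last static content block and that per-request data goes in the user message, so the cached prefix stays byte-identical across requests.

```python
# Anthropic request payload with prompt caching on the static system prefix.
STATIC_SPEC = "…your long, unchanging product specification…"  # placeholder

def build_request(user_query: str) -> dict:
    return {
        "model": "claude-3-7-sonnet-latest",
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": STATIC_SPEC,
                # Marks everything up to and including this block as cacheable.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Dynamic, per-request content stays out of the cached prefix.
        "messages": [{"role": "user", "content": user_query}],
    }
```

You would pass this dict's fields to the Anthropic SDK's messages.create call; the structure is the same either way.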
If your system prompt is 3,000 tokens and you run 100,000 requests per day, prompt caching on Claude 3.7 Sonnet saves: 3,000 tokens × 100,000 requests × $2.70 saved per million tokens = $810/day in input token costs.
The catch: caching only helps if the prefix is actually identical across requests. If you're injecting per-user or per-request data at the top of your system prompt, restructure it so the static portion comes first and the dynamic portion comes at the end.
Strategy 3: Context Window Hygiene
A typical production RAG pipeline sends 8,000-12,000 tokens of context per request, and much of that context contributes nothing to the answer.
Audit your context by running 100 requests and checking which retrieved chunks actually appear in the model's output or reasoning. Empirically, you'll find 3-4 chunks doing 90% of the work — the rest is noise that costs tokens and can confuse the model.
Reducing from 12 retrieved chunks to 4 cuts context tokens by 67% on retrieval-augmented requests. At GPT-4o pricing, that's a 67% reduction in your input token costs for that call type.
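The audit described above can be approximated with a crude heuristic: flag retrieved chunks whose wording overlaps the model's answer. This is a word-overlap sketch, not a rigorous attribution method, and the threshold is an assumption — but it's enough to spot chunks that never contribute.

```python
# Naive chunk-utilization audit: does each retrieved chunk's vocabulary
# overlap the model's answer? Crude, but flags obviously dead chunks.
def chunk_utilization(chunks: list[str], answer: str,
                      threshold: float = 0.2) -> list[bool]:
    answer_words = set(answer.lower().split())
    used = []
    for chunk in chunks:
        words = set(chunk.lower().split())
        overlap = len(words & answer_words) / max(len(words), 1)
        used.append(overlap >= threshold)
    return used

chunks = ["refund window is 30 days",
          "company founded in 2001",
          "about us page text"]
answer = "You can request a refund within the 30 day window"
print(chunk_utilization(chunks, answer))  # → [True, False, False]
```

Run this over a sample of 100 real requests and count how often each retrieval rank position flags as used; the tail positions are your candidates for cutting.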
Other context hygiene wins:
- Remove duplicate instructions (if you have the same instruction in 3 places in your system prompt, once is enough)
- Summarize conversation history after 6 turns instead of carrying raw transcripts
- Strip boilerplate from retrieved documents (navigation text, disclaimers, headers) before inserting into context
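The history-summarization item above can be sketched as a compaction step that runs before each request. The summarize function here is a placeholder; in practice you'd call a cheap model (GPT-4o-mini, say) to write the summary.

```python
# Compact conversation history: keep the last MAX_RAW_TURNS turns verbatim,
# replace everything older with a single summary message.
MAX_RAW_TURNS = 6

def summarize(turns: list[dict]) -> str:
    # Placeholder: real code would call a cheap model on these turns.
    return f"Summary of earlier conversation ({len(turns)} turns)."

def compact_history(history: list[dict]) -> list[dict]:
    if len(history) <= MAX_RAW_TURNS:
        return history
    old, recent = history[:-MAX_RAW_TURNS], history[-MAX_RAW_TURNS:]
    return [{"role": "system", "content": summarize(old)}] + recent
```

A 20-turn conversation carried raw can easily be 10,000+ tokens; compacted, it's the summary plus the last six turns, and the summary itself was generated at mini-model prices.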
Strategy 4: Batching vs. Streaming
If your use case doesn't require real-time responses — reports, nightly processing, batch enrichment — use the Batch API.
OpenAI's Batch API is 50% cheaper than the standard API for the same models. You submit a JSONL file of requests, wait up to 24 hours for results, and pay half price. No streaming, no real-time — but for offline workflows, there's no reason not to.
OpenAI Batch API pricing for GPT-4o: $1.25 input / $5.00 output (per 1M tokens). For a batch job processing 1M input tokens and 250K output tokens, that's $1.25 + $1.25 = $2.50 vs. $5.00 standard pricing.
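A batch submission looks roughly like this with the official openai Python SDK. The JSONL request format (custom_id, method, url, body) is the Batch API's documented shape; the prompts and file name here are illustrative.

```python
import json

def build_batch_lines(prompts, model="gpt-4o"):
    """One JSONL line per request, in the Batch API's request format."""
    return [
        json.dumps({
            "custom_id": f"req-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {"model": model,
                     "messages": [{"role": "user", "content": p}]},
        })
        for i, p in enumerate(prompts)
    ]

def submit_batch(path="batch_input.jsonl"):
    from openai import OpenAI  # needs the openai package and OPENAI_API_KEY
    client = OpenAI()
    batch_file = client.files.create(file=open(path, "rb"), purpose="batch")
    return client.batches.create(
        input_file_id=batch_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h",
    )

with open("batch_input.jsonl", "w") as f:
    f.write("\n".join(build_batch_lines(["Summarize report A",
                                         "Summarize report B"])) + "\n")
```

Results come back as a JSONL output file keyed by custom_id, so make those ids meaningful (a database row id, for example) before you submit.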
Putting It Together
For a $2,400/month GPT-4o-only bill, a realistic optimization stack:
- Route 60% of requests to GPT-4o-mini → saves ~$1,350/month
- Enable prompt caching on the static system prompt prefix → saves ~$300/month
- Trim retrieval context from 12 chunks to 4 → saves ~$180/month (on the remaining GPT-4o calls)
- Move offline processing to the Batch API → saves ~$80/month
Projected monthly total: ~$490 — roughly an 80% reduction from $2,400.
The order matters: fix model routing first (highest ROI), then caching, then context, then batching. Measure after each change so you know exactly what's working.
Use the API Cost Calculator to model your specific numbers before and after each change.