LLM inference -- the cost of sending prompts and receiving responses from AI models -- accounts for 50-65% of most organizations' AI budgets. This is a key FinOps challenge for AI-adopting organizations. Unlike traditional compute where you optimize instance sizes, inference optimization requires a completely different playbook: model selection, token engineering, caching strategies, and architectural decisions about when to use APIs versus self-hosted models.
The cost spread across models is staggering. Sending the same prompt to Claude 3 Opus can cost roughly 100x more than sending it to GPT-4o-mini. For most tasks, the cheaper model produces perfectly adequate results. The savings opportunity is massive — if you know where to look.
TL;DR: Five strategies that cut inference costs 40-70%: (1) Multi-model routing — send simple tasks to cheap models, complex tasks to expensive ones (40-60% savings). (2) Token optimization — reduce prompt sizes 30-50% with concise prompting. (3) Batch processing — 50% off for non-real-time workloads. (4) Semantic caching — avoid paying for duplicate queries. (5) Prompt caching — reuse system prompts across requests for reduced input costs.
Understanding Inference Cost Structure
Every inference call has two cost components:
- Input tokens — Your prompt, system instructions, and context. Typically $0.15-$15 per million tokens.
- Output tokens — The model's response. Typically 3-5x more expensive than input tokens, ranging from $0.60-$75 per million.
Output tokens dominate costs because: (1) they're more expensive per token, and (2) models often generate more text than necessary when not constrained.
Strategy 1: Multi-Model Routing (40-60% Savings)
The highest-impact optimization. Route each request to the cheapest model that can handle it reliably.
Building a Model Router
Classify incoming tasks by complexity, then route to the appropriate tier:
| Complexity | Example Tasks | Recommended Model | Cost per 1M Output |
|---|---|---|---|
| Simple | Classification, entity extraction, formatting | GPT-4o-mini | $0.60 |
| Standard | Summarization, translation, Q&A | Claude 3.5 Haiku or Llama 3.1 70B | $0.72-5.00 |
| Complex | Content generation, analysis, code | Claude 3.5 Sonnet or GPT-4o | $10-15 |
| Expert | Research, legal analysis, complex reasoning | Claude 3 Opus or o1 | $60-75 |
Implementation Approaches
Keyword-based routing: Simple rules based on task type headers or API endpoints. Fast, deterministic, easy to debug. Works well when task types are clearly separated.
Classifier-based routing: Train a lightweight classifier (or use GPT-4o-mini itself) to classify incoming requests by complexity. Adds a small cost per request but makes more nuanced routing decisions.
Cascade routing: Start with the cheapest model. If the response doesn't meet quality thresholds (confidence score, output validation), escalate to a more expensive model. Most requests never escalate.
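The keyword-based approach above can be sketched in a few lines. This is a minimal illustration, not a production router; the tier names, model identifiers, and keyword sets are assumptions chosen for the example:

```python
# Minimal keyword-based model router (a sketch; tiers, model names, and
# keyword sets are illustrative assumptions).

TIERS = {
    "simple":   "gpt-4o-mini",
    "standard": "claude-3-5-haiku",
    "expert":   "claude-3-opus",
}

SIMPLE_KEYWORDS = {"classify", "extract", "format", "label"}
EXPERT_KEYWORDS = {"legal", "research", "prove", "audit"}

def route(task: str) -> str:
    """Return the cheapest model tier whose keywords match the task."""
    words = set(task.lower().split())
    if words & SIMPLE_KEYWORDS:
        return TIERS["simple"]
    if words & EXPERT_KEYWORDS:
        return TIERS["expert"]
    # Default to the mid tier; a production router would also inspect
    # prompt length and structure, or call a lightweight classifier model.
    return TIERS["standard"]
```

A classifier-based router replaces the keyword sets with a cheap model call that returns the tier label; the routing table stays the same.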
Real-World Impact
A customer support system processing 500,000 queries/month:
| Approach | Model Used | Monthly Cost |
|---|---|---|
| All flagship | Claude 3.5 Sonnet for everything | $4,200 |
| Multi-model routed | 70% Haiku, 25% Sonnet, 5% Opus | $1,680 |
| Savings | | $2,520/month (60%) |
Strategy 2: Token Optimization (30-50% Savings)
Reduce Input Tokens
Compress system prompts. Most system prompts contain redundant instructions, excessive examples, and verbose formatting rules. Audit and compress them:
- Remove obvious instructions ("Please respond in English")
- Use shorthand for formatting rules
- Include only the minimum few-shot examples needed
- Move static context to a preprocessed reference format
Manage conversation context. Sending full chat history with every request is the most common token waste:
- Sliding window: Keep only the last 5-10 messages
- Summarization: Periodically summarize older context into a compact summary
- RAG replacement: Replace conversation history with relevant retrieved context
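The sliding-window approach is the simplest of the three to implement. A minimal sketch, assuming an OpenAI-style message list with `role`/`content` dicts:

```python
# Sliding-window context management: keep the system prompt plus only
# the most recent conversation turns. Window size is an assumption to tune.

def sliding_window(messages: list[dict], max_messages: int = 8) -> list[dict]:
    """Return the system prompt(s) plus the last `max_messages` turns."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    return system + turns[-max_messages:]
```

For long sessions, combine this with summarization: when a turn falls out of the window, fold it into a running summary message instead of dropping it outright.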
Optimize RAG chunks. If using retrieval-augmented generation, smaller and more relevant chunks reduce input tokens without sacrificing answer quality. Typical optimization: reduce chunk size from 1,000 to 300-500 tokens and increase relevance threshold.
Control Output Tokens
Output tokens cost 3-5x more than input. Control them aggressively:
- Set `max_tokens` explicitly for every API call. Match it to the expected response length.
- Request structured output (JSON) instead of prose. JSON responses are typically 40-60% shorter.
- Use stop sequences to terminate generation when the useful part is complete.
- Specify output format in the prompt — "Respond in 2-3 sentences" prevents verbose answers.
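The four controls above can all be applied when building the request. A sketch of a request payload in the common OpenAI-style shape; treat the field names as assumptions to verify against your provider's API reference:

```python
# Output-token controls applied to a chat completion request payload.
# Field names follow the OpenAI-style chat API; model name is illustrative.

def build_request(prompt: str, expected_len: int = 150) -> dict:
    return {
        "model": "gpt-4o-mini",
        "messages": [
            {"role": "system",
             "content": "Respond in 2-3 sentences as compact JSON."},
            {"role": "user", "content": prompt},
        ],
        "max_tokens": expected_len,    # hard cap on output spend
        "stop": ["\n\n"],              # stop at the first blank line
        "response_format": {"type": "json_object"},  # shorter than prose
    }
```

The `max_tokens` cap is the backstop; the system-prompt length instruction and JSON format do most of the work of keeping responses short.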
Strategy 3: Batch Processing (50% Savings)
Any workload that doesn't require real-time responses should use batch processing.
Eligible Workloads
- Document classification and extraction
- Content generation pipelines
- Data enrichment and annotation
- Embedding generation for vector databases
- Email analysis and categorization
- Compliance and moderation checks
Platform-Specific Batch Pricing
| Platform | Standard | Batch | Savings |
|---|---|---|---|
| OpenAI Batch API | Full price | 50% off | 50% |
| Bedrock Batch Inference | Full price | 50% off | 50% |
Implementation Pattern
- Collect requests into a batch file (JSONL format)
- Submit the batch job via API
- Poll for completion (typically 1-24 hours)
- Process results asynchronously
For OpenAI, batch jobs complete within 24 hours. For Bedrock, batch inference processes at 50% cost with variable completion times depending on queue depth.
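Step 1 of the pattern — collecting requests into a JSONL batch file — can be sketched as follows. The record shape follows the OpenAI Batch API's documented format (`custom_id`, `method`, `url`, `body`); the model name and `max_tokens` value are illustrative:

```python
import json

def to_batch_jsonl(prompts: list[str], model: str = "gpt-4o-mini") -> str:
    """Serialize prompts into the JSONL format the OpenAI Batch API expects."""
    lines = []
    for i, prompt in enumerate(prompts):
        lines.append(json.dumps({
            "custom_id": f"req-{i}",          # used to match results later
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": 200,
            },
        }))
    return "\n".join(lines)
```

The resulting file is uploaded, a batch job is submitted referencing it, and results are matched back to inputs via `custom_id` once the job completes.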
Strategy 4: Caching (20-40% Savings)
Semantic Caching
Many applications see repeated or similar queries. A semantic cache returns stored responses for queries that are semantically equivalent to previously answered ones.
How it works:
- Generate an embedding for each incoming query
- Search the cache for similar queries (cosine similarity over 0.95)
- If a cache hit is found, return the stored response immediately
- If no hit, send to the model, store the response in cache
Cache hit rates by application type:
| Application | Typical Cache Hit Rate | Cost Reduction |
|---|---|---|
| Customer support chatbot | 30-50% | 25-40% |
| FAQ / knowledge base Q&A | 50-70% | 40-55% |
| Code completion | 15-25% | 12-20% |
| Document processing | 5-10% | 4-8% |
Prompt Caching (Provider-Level)
Both Anthropic and OpenAI offer prompt caching features that reduce input token costs for repeated system prompts:
- Anthropic prompt caching: Cache system prompts and large context blocks. Cached tokens cost 90% less on reads. Ideal for applications that send the same system prompt with every request.
- OpenAI automatic caching: OpenAI automatically caches prompt prefixes shared across requests within a short time window.
For a system prompt of 2,000 tokens sent with every request across 100,000 requests/month, prompt caching saves approximately $500-600/month on a flagship model.
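With Anthropic's API, caching is opted into by marking the static system block with `cache_control`. A sketch of the request payload; the shape follows Anthropic's Messages API, but the model name and token limit are illustrative and worth checking against current documentation:

```python
# Anthropic prompt caching: mark the large, static system prompt as
# cacheable so repeat requests pay the reduced cached-read rate.

def build_cached_request(system_prompt: str, user_msg: str) -> dict:
    return {
        "model": "claude-3-5-sonnet-20241022",
        "max_tokens": 300,
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},  # cache this block
            }
        ],
        "messages": [{"role": "user", "content": user_msg}],
    }
```

OpenAI's equivalent requires no payload change: prefix caching is applied automatically when requests share a sufficiently long common prefix.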
Strategy 5: Self-Hosting Economics
Self-hosting open models (Llama, Mistral) on GPU instances eliminates per-token pricing. The trade-off: you pay fixed infrastructure costs regardless of usage, and you take on operational complexity.
Break-Even Analysis
| Monthly Token Volume | API Cost (Llama 70B on Bedrock) | Self-Hosted (g5.12xlarge) | Cheaper Option |
|---|---|---|---|
| 10M tokens | $14 | $700+ (instance cost) | API |
| 100M tokens | $144 | $700+ | API |
| 1B tokens | $1,440 | $700+ | Self-hosted |
| 5B tokens | $7,200 | $1,400 (2 instances) | Self-hosted |
Break-even point: Approximately 500M-1B tokens/month for a 70B model. Below that, API pricing is cheaper. Above that, self-hosting wins — but only if you have ML engineering capacity.
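The break-even arithmetic is a one-liner: divide the fixed monthly instance cost by the API's per-million-token price. Using the figures from the table above (roughly $1.44 per million tokens on the API versus roughly $700/month for a g5.12xlarge, both illustrative):

```python
def breakeven_tokens(api_price_per_m: float, instance_monthly: float) -> float:
    """Monthly token volume (in millions) where self-hosting matches API cost."""
    return instance_monthly / api_price_per_m

# With the table's figures, break-even lands just under 500M tokens/month.
tokens_m = breakeven_tokens(1.44, 700)  # ~486M tokens/month
```

Remember this compares raw compute only; add an allowance for the engineering time self-hosting consumes before treating the crossover as real.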
When to Self-Host
- Token volume exceeds 1B/month on a single model
- You need customized model behavior (fine-tuned weights)
- Data residency requirements prevent using external APIs
- You have ML/DevOps engineering capacity
- Latency requirements demand local inference
When to Stay on APIs
- Variable or unpredictable traffic patterns
- Multiple models needed (routing strategy)
- No ML engineering team
- Rapid iteration on model choice
- Volume under 500M tokens/month
Putting It All Together: Optimization Roadmap
Week 1: Measure
- Instrument all API calls with token counts and costs
- Calculate cost per inference, cost per conversation, cost per business outcome
- Identify your top 5 cost drivers by model and use case
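The instrumentation step can start as a thin wrapper that tallies token counts into dollars per use case. A sketch, with an illustrative price table (values here are assumptions; pull real rates from your provider's pricing page):

```python
# Per-call cost instrumentation: tally token usage into dollar cost,
# grouped by use case. Prices are illustrative $ per million tokens.

PRICES = {  # model -> (input price, output price)
    "gpt-4o-mini": (0.15, 0.60),
    "claude-3-5-sonnet": (3.00, 15.00),
}

class CostTracker:
    def __init__(self):
        self.totals: dict[str, float] = {}

    def record(self, use_case: str, model: str,
               input_tokens: int, output_tokens: int) -> float:
        """Record one call; return its dollar cost."""
        in_price, out_price = PRICES[model]
        cost = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
        self.totals[use_case] = self.totals.get(use_case, 0.0) + cost
        return cost
```

Emitting `self.totals` to a dashboard daily is enough to surface the top cost drivers by use case, which is all Week 1 requires.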
Week 2-3: Quick Wins
- Implement multi-model routing for clear task tiers
- Set `max_tokens` on all API calls
- Enable batch processing for non-real-time workloads
- Compress system prompts by 30-50%
Month 2: Architecture
- Deploy semantic caching for high-repeat query patterns
- Implement prompt caching for shared system prompts
- Optimize RAG pipeline chunk sizes and relevance thresholds
- Set up sliding window context management
Month 3: Infrastructure
- Evaluate self-hosting for high-volume, single-model workloads
- Consider Bedrock Provisioned Throughput for consistent API usage
- Implement cost monitoring dashboards
- Set budget alerts per model and use case
Related Guides
- AWS Bedrock Cost Optimization Guide
- AI Cost Optimization Guide
- GPU Cost Optimization Playbook
- AWS Bedrock Batch Inference Guide
- AWS Bedrock Pricing Guide
Frequently Asked Questions
What's the cheapest way to run LLM inference?
For low volume, GPT-4o-mini at $0.15/$0.60 per million tokens is the cheapest API option for simple tasks. For high volume (over 1B tokens/month), self-hosting open models like Llama 3.1 on GPU instances is cheapest. For mid-range volume, batch processing (50% off) and multi-model routing provide the best cost/quality balance.
How much can prompt optimization save?
Compressing prompts and managing context typically reduces input token consumption by 30-50%. Combined with output length control (max_tokens, structured output), total token savings reach 40-60%. On a $5,000/month inference bill, that's $2,000-$3,000/month.
Is semantic caching worth implementing?
Yes, for any application with repeated or similar queries. Customer support chatbots see 30-50% cache hit rates, and FAQ systems see 50-70%. The implementation cost (embedding generation + vector store) is minimal compared to the inference savings.
Should I use Bedrock or OpenAI for cost optimization?
It depends on your stack and workload. Bedrock offers multi-model access and Provisioned Throughput (30-40% off at scale). OpenAI offers batch API (50% off) and GPT-4o-mini for ultra-cheap simple tasks. For AWS-native organizations, Bedrock keeps data in your VPC and uses existing IAM authentication.
Start Optimizing Inference Costs Today
LLM inference costs are the most compressible line item in your cloud bill. The combination of multi-model routing, token optimization, batch processing, and caching typically reduces inference spend by 40-70%:
- Route smartly — Match every task to the cheapest capable model
- Optimize tokens — Compress prompts, control output length, manage context
- Batch everything possible — 50% savings for async workloads
- Cache aggressively — Don't pay twice for the same answer
- Monitor per-outcome — Track cost per resolved query, not just total spend
Lower Your Cloud Costs with Wring
Wring helps you access AWS credits and volume discounts to reduce your cloud bill. Through group buying power, Wring negotiates better per-unit rates across all AWS services.
