
LLM Inference Cost Optimization: Cut AI API Spend 40-70%

Cut LLM inference costs 40-70% with model routing, token optimization, prompt caching, and batch processing on AWS Bedrock and OpenAI.

Wring Team
March 13, 2026
11 min read
LLM inference costs · AI API optimization · token optimization · Bedrock costs · OpenAI costs · inference cost reduction

LLM inference (the cost of sending prompts and receiving responses from AI models) accounts for 50-65% of most organizations' AI budgets. This is a key FinOps challenge for AI-adopting organizations. Unlike traditional compute, where you optimize instance sizes, inference optimization requires a completely different playbook: model selection, token engineering, caching strategies, and architectural decisions about when to use APIs versus self-hosted models.

The cost spread across models is staggering. Sending the same prompt to Claude 3 Opus costs 100x more than sending it to GPT-4o-mini. For most tasks, the cheaper model produces perfectly adequate results. The savings opportunity is massive — if you know where to look.

TL;DR: Five strategies that cut inference costs 40-70%: (1) Multi-model routing — send simple tasks to cheap models, complex tasks to expensive ones (40-60% savings). (2) Token optimization — reduce prompt sizes 30-50% with concise prompting. (3) Batch processing — 50% off for non-real-time workloads. (4) Semantic caching — avoid paying for duplicate queries. (5) Prompt caching — reuse system prompts across requests for reduced input costs.


Understanding Inference Cost Structure

Output token cost per million tokens (March 2026 pricing; output tokens cost 3-5x more than input tokens, making output the expensive direction):

| Model | Output cost per 1M tokens |
|---|---|
| Claude 3 Opus | $75.00 |
| o1 | $60.00 |
| Claude 3.5 Sonnet | $15.00 |
| GPT-4o | $10.00 |
| Claude 3.5 Haiku | $5.00 |
| GPT-4o-mini | $0.60 |

Every inference call has two cost components:

  • Input tokens — Your prompt, system instructions, and context. Typically $0.15-$15 per million tokens.
  • Output tokens — The model's response. Typically 3-5x more expensive than input tokens, ranging from $0.60-$75 per million.

Output tokens dominate costs because: (1) they're more expensive per token, and (2) models often generate more text than necessary when not constrained.
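The two-part cost structure is easy to model. The sketch below uses the pricing figures from the table above; the `PRICES` dictionary and function name are illustrative, not a real library API:

```python
# Per-million-token prices (USD) from the March 2026 table above.
PRICES = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "claude-3-opus": {"input": 15.00, "output": 75.00},
}

def inference_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one call: each side is billed per million tokens."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A 1,500-token prompt with a 500-token response:
cheap = inference_cost("gpt-4o-mini", 1500, 500)       # ≈ $0.000525
flagship = inference_cost("claude-3-opus", 1500, 500)  # ≈ $0.06
```

Even though the response is a third the length of the prompt, output tokens contribute the majority of the cost at both price tiers.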


Strategy 1: Multi-Model Routing (40-60% Savings)

The highest-impact optimization. Route each request to the cheapest model that can handle it reliably.

Building a Model Router

Classify incoming tasks by complexity, then route to the appropriate tier:

| Complexity | Example Tasks | Recommended Model | Cost per 1M Output |
|---|---|---|---|
| Simple | Classification, entity extraction, formatting | GPT-4o-mini | $0.60 |
| Standard | Summarization, translation, Q&A | Claude 3.5 Haiku or Llama 3.1 70B | $0.72-5.00 |
| Complex | Content generation, analysis, code | Claude 3.5 Sonnet or GPT-4o | $10-15 |
| Expert | Research, legal analysis, complex reasoning | Claude 3 Opus or o1 | $60-75 |

Implementation Approaches

Keyword-based routing: Simple rules based on task type headers or API endpoints. Fast, deterministic, easy to debug. Works well when task types are clearly separated.

Classifier-based routing: Train a lightweight classifier (or use GPT-4o-mini itself) to classify incoming requests by complexity. Adds a small cost per request but makes more nuanced routing decisions.

Cascade routing: Start with the cheapest model. If the response doesn't meet quality thresholds (confidence score, output validation), escalate to a more expensive model. Most requests never escalate.
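The cascade pattern can be sketched in a few lines. Everything here is hypothetical scaffolding: `call_model` stands in for your API client, and the JSON validator is one example of a quality gate — confidence scores or schema checks work the same way:

```python
import json

def cascade(prompt, tiers, call_model, is_good_enough):
    """Try models cheapest-first (tiers is ordered by cost).
    Escalate only when the validator rejects the cheaper response."""
    response = None
    for model in tiers:
        response = call_model(model, prompt)
        if is_good_enough(response):
            return model, response
    return tiers[-1], response  # keep the last tier's answer even if imperfect

# Demo with stubbed models: the cheap model fails JSON validation, so we escalate.
def fake_call(model, prompt):
    return "label: billing" if model == "haiku" else '{"label": "billing"}'

def valid_json(text):
    try:
        json.loads(text)
        return True
    except ValueError:
        return False

model, answer = cascade("Classify this ticket", ["haiku", "sonnet"], fake_call, valid_json)
```

Because most real requests pass the quality gate on the first try, the expensive tier is only billed for the minority that escalate.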

Real-World Impact

A customer support system processing 500,000 queries/month:

| Approach | Model Used | Monthly Cost |
|---|---|---|
| All flagship | Claude 3.5 Sonnet for everything | $4,200 |
| Multi-model routed | 70% Haiku, 25% Sonnet, 5% Opus | $1,680 |
| Savings | | $2,520/month (60%) |

Strategy 2: Token Optimization (30-50% Savings)

Reduce Input Tokens

Compress system prompts. Most system prompts contain redundant instructions, excessive examples, and verbose formatting rules. Audit and compress them:

  • Remove obvious instructions ("Please respond in English")
  • Use shorthand for formatting rules
  • Include only the minimum few-shot examples needed
  • Move static context to a preprocessed reference format

Manage conversation context. Sending full chat history with every request is the most common token waste:

  • Sliding window: Keep only the last 5-10 messages
  • Summarization: Periodically summarize older context into a compact summary
  • RAG replacement: Replace conversation history with relevant retrieved context
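The sliding-window approach is the simplest of the three to implement. A minimal sketch (the message-dict shape follows the common chat-completions convention; the window size is an assumption to tune per application):

```python
def sliding_window(messages, max_messages=8):
    """Keep the system prompt plus only the most recent conversation turns."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_messages:]

# 20-turn history trimmed to the system prompt plus the last 5 messages:
history = [{"role": "system", "content": "You are a support agent."}]
history += [{"role": "user", "content": f"msg {i}"} for i in range(20)]
trimmed = sliding_window(history, max_messages=5)
```

At typical per-turn lengths, trimming a long conversation this way cuts the input tokens of every subsequent request, and the savings compound over the life of the session.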

Optimize RAG chunks. If using retrieval-augmented generation, smaller and more relevant chunks reduce input tokens without sacrificing answer quality. Typical optimization: reduce chunk size from 1,000 to 300-500 tokens and increase relevance threshold.

Control Output Tokens

Output tokens cost 3-5x more than input. Control them aggressively:

  • Set max_tokens explicitly for every API call. Match it to the expected response length.
  • Request structured output (JSON) instead of prose. JSON responses are typically 40-60% shorter.
  • Use stop sequences to terminate generation when the useful part is complete.
  • Specify output format in the prompt — "Respond in 2-3 sentences" prevents verbose answers.
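The four controls above can all be set at request-build time. A sketch that assembles call parameters before handing them to a client — the token budgets, model name, and `build_request` helper are illustrative assumptions, not tuned values:

```python
def build_request(prompt: str, expected_len: str = "short") -> dict:
    """Assemble chat-completion parameters that cap output cost up front."""
    budgets = {"short": 150, "medium": 500, "long": 1500}  # illustrative caps
    return {
        "model": "gpt-4o-mini",
        "messages": [
            # Asking for a bounded, structured answer shortens the response.
            {"role": "system",
             "content": "Answer in 2-3 sentences. Respond as JSON."},
            {"role": "user", "content": prompt},
        ],
        "max_tokens": budgets[expected_len],  # hard ceiling on output spend
        "stop": ["\n\n"],  # terminate once the useful part is complete
    }

params = build_request("Summarize the refund policy.")
# then e.g.: client.chat.completions.create(**params)
```

Setting `max_tokens` per request type, rather than one global default, keeps the cap tight without truncating legitimate long answers.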

Strategy 3: Batch Processing (50% Savings)

Any workload that doesn't require real-time responses should use batch processing.

Eligible Workloads

  • Document classification and extraction
  • Content generation pipelines
  • Data enrichment and annotation
  • Embedding generation for vector databases
  • Email analysis and categorization
  • Compliance and moderation checks

Platform-Specific Batch Pricing

| Platform | Standard | Batch | Savings |
|---|---|---|---|
| OpenAI Batch API | Full price | 50% off | 50% |
| Bedrock Batch Inference | Full price | 50% off | 50% |

Implementation Pattern

  1. Collect requests into a batch file (JSONL format)
  2. Submit the batch job via API
  3. Poll for completion (typically 1-24 hours)
  4. Process results asynchronously

For OpenAI, batch jobs complete within 24 hours. For Bedrock, batch inference processes at 50% cost with variable completion times depending on queue depth.
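The collect-and-submit pattern can be sketched for OpenAI's Batch API. The JSONL record shape below follows that API's documented format; the upload and polling steps are left as comments since they require credentials, and the `custom_id` scheme is an assumption:

```python
import json

def batch_line(custom_id: str, prompt: str, model: str = "gpt-4o-mini") -> str:
    """One JSONL record in the shape the OpenAI Batch API expects."""
    return json.dumps({
        "custom_id": custom_id,            # your key for matching results later
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 200,
        },
    })

lines = [batch_line(f"doc-{i}", f"Classify document {i}") for i in range(3)]
jsonl = "\n".join(lines)
# Then (sketch): upload the file with purpose="batch", create a batch with
# endpoint="/v1/chat/completions" and completion_window="24h", poll its
# status, and download the output file when the job completes.
```

Results arrive keyed by `custom_id`, so the processing step can match responses back to source documents regardless of completion order.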


Strategy 4: Caching (20-40% Savings)

Semantic Caching

Many applications see repeated or similar queries. A semantic cache returns stored responses for queries that are semantically equivalent to previously answered ones.

How it works:

  1. Generate an embedding for each incoming query
  2. Search the cache for similar past queries (cosine similarity above a threshold, e.g. 0.95)
  3. If a cache hit is found, return the stored response immediately
  4. If no hit, send to the model, store the response in cache
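The four steps above can be sketched as an in-memory cache. Real deployments back this with a vector store (FAISS, pgvector, etc.) and a real embedding model; here the store is a plain list and `embed` is an injected callable, both simplifying assumptions:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class SemanticCache:
    """Minimal semantic cache: linear scan over stored (vector, response) pairs."""
    def __init__(self, embed, threshold=0.95):
        self.embed = embed          # callable: text -> embedding vector
        self.threshold = threshold
        self.entries = []

    def lookup(self, query):
        v = self.embed(query)
        for vec, response in self.entries:
            if cosine(v, vec) >= self.threshold:
                return response     # cache hit: no model call, no token cost
        return None

    def store(self, query, response):
        self.entries.append((self.embed(query), response))

# Demo with a toy embedding that separates refund questions from everything else:
embed = lambda text: [1.0, 0.0] if "refund" in text else [0.0, 1.0]
cache = SemanticCache(embed)
cache.store("What is your refund policy?", "Refunds within 30 days.")
hit = cache.lookup("How long is the refund window?")
miss = cache.lookup("Do you ship internationally?")
```

The threshold is the key tuning knob: too low and users get stale or mismatched answers, too high and the hit rate collapses.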

Cache hit rates by application type:

| Application | Typical Cache Hit Rate | Cost Reduction |
|---|---|---|
| Customer support chatbot | 30-50% | 25-40% |
| FAQ / knowledge base Q&A | 50-70% | 40-55% |
| Code completion | 15-25% | 12-20% |
| Document processing | 5-10% | 4-8% |

Prompt Caching (Provider-Level)

Both Anthropic and OpenAI offer prompt caching features that reduce input token costs for repeated system prompts:

  • Anthropic prompt caching: Cache system prompts and large context blocks. Cached tokens cost 90% less on reads. Ideal for applications that send the same system prompt with every request.
  • OpenAI automatic caching: OpenAI automatically caches prompt prefixes shared across requests within a short time window.

For a system prompt of 2,000 tokens sent with every request across 100,000 requests/month, prompt caching saves approximately $500-600/month on a flagship model.
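Two sketches to make this concrete. The content-block shape with `cache_control` follows Anthropic's prompt caching API; the savings helper uses the article's figures (2,000 tokens, 100k requests, a flagship-class $3 per million input tokens, 90% off cached reads) and its name is an assumption:

```python
def cached_system_block(system_text: str) -> list:
    """System prompt as an Anthropic-style content block marked cacheable."""
    return [{
        "type": "text",
        "text": system_text,
        "cache_control": {"type": "ephemeral"},  # subsequent reads are discounted
    }]

def monthly_cache_savings(prompt_tokens, requests, input_price_per_m,
                          read_discount=0.90):
    """Rough saving from re-reading a cached prompt instead of re-billing it."""
    full = prompt_tokens * requests / 1_000_000 * input_price_per_m
    return full * read_discount

# 2,000-token system prompt, 100,000 requests/month, $3/M input:
saving = monthly_cache_savings(2000, 100_000, 3.00)  # ≈ $540/month
```

That $540 figure ignores the small write surcharge on the first cache fill, which is why the article quotes a $500-600 range rather than a point estimate.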


Strategy 5: Self-Hosting Economics

Self-hosting open models (Llama, Mistral) on GPU instances eliminates per-token pricing. The trade-off: you pay fixed infrastructure costs regardless of usage, and you take on operational complexity.

Break-Even Analysis

| Monthly Token Volume | API Cost (Llama 70B on Bedrock) | Self-Hosted (g5.12xlarge) | Cheaper Option |
|---|---|---|---|
| 10M tokens | $14.40 | $700+ (instance cost) | API |
| 100M tokens | $144 | $700+ | API |
| 1B tokens | $1,440 | $700+ | Self-hosted |
| 5B tokens | $7,200 | $1,400 (2 instances) | Self-hosted |

Break-even point: Approximately 500M-1B tokens/month for a 70B model. Below that, API pricing is cheaper. Above that, self-hosting wins — but only if you have ML engineering capacity.
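The break-even logic is simple enough to encode. The defaults below are the article's illustrative figures (~$1.44 per million blended API tokens, ~$700/month per instance), plus one extra assumption — a per-instance throughput ceiling of roughly 3B tokens/month — to reproduce the two-instance row:

```python
import math

def cheaper_option(monthly_tokens: int,
                   api_price_per_m: float = 1.44,
                   instance_monthly: float = 700.0,
                   tokens_per_instance: int = 3_000_000_000):
    """Return the cheaper deployment and its monthly cost at a given volume."""
    api_cost = monthly_tokens / 1_000_000 * api_price_per_m
    # Scale out instances once volume exceeds one box's throughput ceiling.
    instances = max(1, math.ceil(monthly_tokens / tokens_per_instance))
    hosted_cost = instances * instance_monthly
    if hosted_cost < api_cost:
        return "self-hosted", hosted_cost
    return "api", api_cost
```

At these defaults the crossover sits near 490M tokens/month, consistent with the 500M-1B break-even range above; the hidden cost the formula omits is the ML engineering time to keep self-hosted inference running.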

When to Self-Host

  • Token volume exceeds 1B/month on a single model
  • You need customized model behavior (fine-tuned weights)
  • Data residency requirements prevent using external APIs
  • You have ML/DevOps engineering capacity
  • Latency requirements demand local inference

When to Stay on APIs

  • Variable or unpredictable traffic patterns
  • Multiple models needed (routing strategy)
  • No ML engineering team
  • Rapid iteration on model choice
  • Volume under 500M tokens/month

Putting It All Together: Optimization Roadmap

Week 1: Measure

  • Instrument all API calls with token counts and costs
  • Calculate cost per inference, cost per conversation, cost per business outcome
  • Identify your top 5 cost drivers by model and use case

Week 2-3: Quick Wins

  • Implement multi-model routing for clear task tiers
  • Set max_tokens on all API calls
  • Enable batch processing for non-real-time workloads
  • Compress system prompts by 30-50%

Month 2: Architecture

  • Deploy semantic caching for high-repeat query patterns
  • Implement prompt caching for shared system prompts
  • Optimize RAG pipeline chunk sizes and relevance thresholds
  • Set up sliding window context management

Month 3: Infrastructure

  • Evaluate self-hosting for high-volume, single-model workloads
  • Consider Bedrock Provisioned Throughput for consistent API usage
  • Implement cost monitoring dashboards
  • Set budget alerts per model and use case


Frequently Asked Questions

What's the cheapest way to run LLM inference?

For low volume, GPT-4o-mini at $0.15/$0.60 per million tokens is the cheapest API option for simple tasks. For high volume (over 1B tokens/month), self-hosting open models like Llama 3.1 on GPU instances is cheapest. For mid-range volume, batch processing (50% off) and multi-model routing provide the best cost/quality balance.

How much can prompt optimization save?

Compressing prompts and managing context typically reduces input token consumption by 30-50%. Combined with output length control (max_tokens, structured output), total token savings reach 40-60%. On a $5,000/month inference bill, that's $2,000-$3,000/month.

Is semantic caching worth implementing?

Yes, for any application with repeated or similar queries. Customer support chatbots see 30-50% cache hit rates, and FAQ systems see 50-70%. The implementation cost (embedding generation + vector store) is minimal compared to the inference savings.

Should I use Bedrock or OpenAI for cost optimization?

It depends on your stack and workload. Bedrock offers multi-model access and Provisioned Throughput (30-40% off at scale). OpenAI offers batch API (50% off) and GPT-4o-mini for ultra-cheap simple tasks. For AWS-native organizations, Bedrock keeps data in your VPC and uses existing IAM authentication.


Start Optimizing Inference Costs Today

LLM inference costs are the most compressible line item in your cloud bill. The combination of multi-model routing, token optimization, batch processing, and caching typically reduces inference spend by 40-70%:

  1. Route smartly — Match every task to the cheapest capable model
  2. Optimize tokens — Compress prompts, control output length, manage context
  3. Batch everything possible — 50% savings for async workloads
  4. Cache aggressively — Don't pay twice for the same answer
  5. Monitor per-outcome — Track cost per resolved query, not just total spend

Lower Your Cloud Costs with Wring

Wring helps you access AWS credits and volume discounts to reduce your cloud bill. Through group buying power, Wring negotiates better per-unit rates across all AWS services.

Start saving on AWS →