AWS Bedrock usage is growing faster than almost any other AWS service, and so are the bills. A single Claude or Llama model powering a production application can easily cost $10,000-50,000/month in token charges. The model you choose, how you structure prompts, and whether you use features like batch inference or prompt caching determine whether your AI costs are sustainable or spiraling.
TL;DR: The three biggest Bedrock savings: (1) Use the smallest model that meets quality requirements — Claude Haiku costs 95% less than Claude Opus for many tasks. (2) Enable batch inference for non-real-time workloads — 50% discount on token prices. (3) Implement prompt caching for repeated system prompts — up to 90% reduction on cached input tokens. These three strategies combined can reduce Bedrock costs by 60-80%.
Bedrock Pricing Quick Reference
Per-Token Pricing (Popular Models)
| Model | Input (per 1K tokens) | Output (per 1K tokens) |
|---|---|---|
| Claude 4 Opus | $0.015 | $0.075 |
| Claude 4 Sonnet | $0.003 | $0.015 |
| Claude 4 Haiku | $0.0008 | $0.004 |
| Llama 3.1 70B | $0.00099 | $0.00099 |
| Llama 3.1 8B | $0.00022 | $0.00022 |
| Titan Text Express | $0.0002 | $0.0006 |
| Mistral Large | $0.004 | $0.012 |
Strategy 1: Choose the Right Model Size
The most impactful cost decision is model selection. Many tasks don't require the most capable model.
| Task Type | Recommended Model | vs Opus Savings |
|---|---|---|
| Classification, routing | Haiku or Llama 8B | 95% |
| Summarization, extraction | Sonnet or Llama 70B | 80% |
| Simple Q&A, formatting | Haiku or Titan | 95% |
| Complex reasoning, coding | Sonnet | 80% |
| Research, creative writing | Opus | Baseline |
Implementation: Build a model router that classifies incoming requests and sends them to the appropriate model. Simple classification requests should never hit Opus.
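A model router can start as a keyword heuristic in front of the Bedrock call. The sketch below is a minimal illustration; the tier names and keyword lists are assumptions, and a production router would typically use a cheap classifier model (Haiku itself works well for this) rather than string matching:

```python
# Hypothetical keyword-based router; tier names and hint keywords are illustrative.
MODEL_TIERS = {
    "small": "haiku",    # classification, routing, formatting
    "medium": "sonnet",  # summarization, extraction, coding
    "large": "opus",     # research, creative writing
}

SIMPLE_HINTS = ("classify", "label", "route", "format")

def route(prompt: str) -> str:
    """Return the cheapest model tier expected to handle this prompt."""
    text = prompt.lower()
    if any(hint in text for hint in SIMPLE_HINTS):
        return MODEL_TIERS["small"]
    # Default to the mid-tier model; escalate to "large" only when a
    # quality check on the response fails.
    return MODEL_TIERS["medium"]
```

In production, the returned tier would map to a concrete Bedrock model ID passed to the `bedrock-runtime` client's `converse` call.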
Strategy 2: Use Batch Inference
Bedrock Batch Inference processes large volumes of requests asynchronously at a 50% discount on token prices.
| Use Case | On-Demand Cost | Batch Cost | Savings |
|---|---|---|---|
| 1M documents classification | $800 | $400 | 50% |
| 100K email summaries | $300 | $150 | 50% |
| 500K content generation | $7,500 | $3,750 | 50% |
Batch is ideal for: document processing pipelines, content generation queues, data extraction jobs, and any workload that doesn't need sub-second response times.
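As a sketch of how a batch job is submitted: boto3's `create_model_invocation_job` takes a JSONL input file and an output prefix in S3. The bucket paths, role ARN, model ID, and job name below are placeholders:

```python
def build_batch_job_request(job_name: str, model_id: str, role_arn: str,
                            input_s3: str, output_s3: str) -> dict:
    """Build kwargs for bedrock.create_model_invocation_job (batch inference)."""
    return {
        "jobName": job_name,
        "modelId": model_id,
        "roleArn": role_arn,  # IAM role with read/write access to both S3 locations
        "inputDataConfig": {"s3InputDataConfig": {"s3Uri": input_s3}},
        "outputDataConfig": {"s3OutputDataConfig": {"s3Uri": output_s3}},
    }

# import boto3
# bedrock = boto3.client("bedrock")
# bedrock.create_model_invocation_job(**build_batch_job_request(
#     "nightly-classification", "<model-id>",
#     "arn:aws:iam::111122223333:role/BedrockBatchRole",
#     "s3://my-bucket/input/records.jsonl", "s3://my-bucket/output/"))
```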
Strategy 3: Implement Prompt Caching
Bedrock supports prompt caching for Claude models. Cached input tokens cost up to 90% less than standard input tokens.
| Component | Cost |
|---|---|
| Standard input tokens | Full price |
| Cache write (first use) | 25% premium |
| Cache read (subsequent uses) | 90% discount |
| Cache TTL | 5 minutes (auto-extended on hit) |
Best for: System prompts, few-shot examples, and context documents that repeat across many requests. A 10,000-token system prompt used 1,000 times costs roughly $3 with caching versus $30 without (at Claude Sonnet's $0.003 per 1K input tokens).
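With the Converse API, caching is enabled by inserting a `cachePoint` block after the stable content. A minimal request-builder sketch (the model ID is a placeholder, and cache points are only honored by supported Claude models):

```python
def build_cached_request(model_id: str, system_prompt: str, user_text: str) -> dict:
    """Converse API request that marks a long system prompt as cacheable."""
    return {
        "modelId": model_id,
        "system": [
            {"text": system_prompt},              # large, stable instructions
            {"cachePoint": {"type": "default"}},  # cache everything above this point
        ],
        "messages": [
            {"role": "user", "content": [{"text": user_text}]},  # varies per request
        ],
    }

# import boto3
# client = boto3.client("bedrock-runtime")
# response = client.converse(**build_cached_request("<model-id>",
#                                                   SYSTEM_PROMPT, "user question"))
```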
Strategy 4: Optimize Token Usage
Every token costs money. Reduce token consumption without sacrificing output quality.
Input token reduction:
- Trim system prompts to essential instructions only
- Use concise few-shot examples instead of lengthy descriptions
- Summarize long context documents before injection
- Remove redundant instructions
Output token reduction:
- Set `max_tokens` to the minimum needed
- Use structured output formats (JSON) with constrained schemas
- Request concise responses explicitly in the prompt
- Use `stop_sequences` to prevent unnecessary continuation
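The output-side knobs map directly onto the Converse API's `inferenceConfig`. A minimal sketch; the specific values here are illustrative, not recommendations:

```python
def build_inference_config(max_tokens: int, stop_sequences: list[str]) -> dict:
    """inferenceConfig for the Converse API: cap output length and cut
    off unwanted continuations before they bill more output tokens."""
    return {
        "maxTokens": max_tokens,          # hard cap on billed output tokens
        "stopSequences": stop_sequences,  # generation halts at any of these strings
    }

# import boto3
# client = boto3.client("bedrock-runtime")
# client.converse(modelId="<model-id>", messages=messages,
#                 inferenceConfig=build_inference_config(256, ["</answer>"]))
```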
Strategy 5: Use Provisioned Throughput
For predictable, high-volume workloads, Provisioned Throughput offers reserved capacity at lower per-token cost.
| Commitment | Discount |
|---|---|
| 1-month | ~20% |
| 6-month | ~35% |
Break-even: Provisioned Throughput is cheaper when you consistently use more than 60% of the provisioned capacity. Monitor actual usage for 2-4 weeks before committing.
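The break-even logic can be sanity-checked with quick arithmetic in normalized units: on-demand you pay only for the tokens you use, while provisioned capacity is billed in full whether used or not. This is illustrative math, not quoted AWS prices:

```python
def effective_discount(list_discount: float, utilization: float) -> float:
    """Real savings vs. on-demand after accounting for idle provisioned capacity.

    list_discount: headline discount, e.g. 0.35 for a 6-month commitment
    utilization:   fraction of provisioned capacity actually consumed
    Returns a fraction; negative means provisioned costs MORE than on-demand.
    """
    on_demand_cost = utilization             # pay only for tokens you use
    provisioned_cost = 1.0 - list_discount   # pay for full capacity regardless
    return 1.0 - provisioned_cost / on_demand_cost
```

With the ~35% six-month discount, break-even falls at 65% utilization, close to the 60% rule of thumb above; at 50% utilization even the discounted capacity costs more than on-demand.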
Strategy 6: Implement Semantic Caching
Cache LLM responses for semantically similar queries. When a new query is similar enough to a previously answered one, return the cached response instead of calling Bedrock.
Tools: Use vector databases (OpenSearch, ElastiCache for Valkey with vector search) to store embeddings of previous queries and responses. Set a similarity threshold (e.g., 0.95 cosine similarity) for cache hits.
Savings: 30-70% reduction in API calls depending on query diversity.
Strategy 7: Implement Request Deduplication
Track in-flight requests and return a single shared response for duplicate or near-duplicate concurrent requests. Duplicates are common in applications where many users ask the same question at the same time.
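One way to sketch this in a threaded Python service: the first caller for a given key becomes the leader and performs the Bedrock call, while concurrent callers with the same key wait on a shared future. The class and method names are illustrative:

```python
import threading
from concurrent.futures import Future

class InFlightDeduplicator:
    """Coalesce concurrent identical requests into a single backend call."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending: dict[str, Future] = {}

    def fetch(self, key: str, compute):
        """Run compute() once per key; concurrent duplicates share the result."""
        with self._lock:
            fut = self._pending.get(key)
            leader = fut is None
            if leader:
                fut = Future()
                self._pending[key] = fut
        if leader:
            try:
                fut.set_result(compute())   # e.g. the actual Bedrock invocation
            except Exception as exc:
                fut.set_exception(exc)
            finally:
                with self._lock:
                    self._pending.pop(key, None)
        return fut.result()
```

The key would typically be a hash of the normalized prompt; only requests that overlap in time are coalesced, so sequential repeats still recompute (pair this with semantic caching for those).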
Strategy 8: Monitor and Set Budgets
Use CloudWatch metrics to track:
- `Invocations` — total API calls
- `InputTokenCount` and `OutputTokenCount` — token usage per model
- `InvocationLatency` — helps identify model sizing opportunities
Set AWS Budgets alerts at 50%, 80%, and 100% of your monthly AI budget to catch runaway costs early.
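As an illustration of pulling these metrics programmatically, the sketch below builds parameters for CloudWatch's `get_metric_statistics` against the `AWS/Bedrock` namespace; the model ID is a placeholder:

```python
from datetime import datetime, timedelta, timezone

def token_usage_query(model_id: str, days: int = 7) -> dict:
    """Parameters for cloudwatch.get_metric_statistics: daily input-token totals."""
    now = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/Bedrock",
        "MetricName": "InputTokenCount",
        "Dimensions": [{"Name": "ModelId", "Value": model_id}],
        "StartTime": now - timedelta(days=days),
        "EndTime": now,
        "Period": 86400,          # one datapoint per day
        "Statistics": ["Sum"],
    }

# import boto3
# cloudwatch = boto3.client("cloudwatch")
# datapoints = cloudwatch.get_metric_statistics(**token_usage_query("<model-id>"))
```

Multiply the daily sums by the per-token prices in the table above to get a rough daily spend per model.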
Related Guides
- AWS Bedrock Pricing Guide
- AWS Bedrock Batch Inference Guide
- AI Cost Optimization Guide
- AWS Bedrock vs OpenAI Pricing
FAQ
How do I choose between Claude Haiku and Sonnet?
Test both on your actual use cases with a quality evaluation framework. If Haiku achieves above 90% of Sonnet's quality for a given task, use Haiku (roughly 4x cheaper at the prices above). Common Haiku-appropriate tasks: classification, entity extraction, simple summarization, formatting.
Is self-hosting open-source models on SageMaker cheaper?
For high-volume workloads (over $5,000/month in Bedrock costs), self-hosting Llama models on SageMaker Inference endpoints can be 40-60% cheaper. Below that volume, Bedrock's per-token pricing is usually more cost-effective due to zero infrastructure management.
How much does prompt caching actually save?
For applications with stable system prompts over 1,000 tokens, prompt caching saves 80-90% on input token costs. The cache write premium (25%) is recovered by the first cache hit, since each cached read saves 90%. For a system prompt used 10,000 times daily, the savings are substantial.
Lower Your Bedrock Costs with Wring
Wring helps you access AWS credits and volume discounts to lower your Bedrock costs. Through group buying power, Wring negotiates better rates so you pay less per model inference.
