
AWS Bedrock Cost Optimization: Cut LLM API Costs

Reduce AWS Bedrock costs by 40-70% with model routing, batch inference, and prompt caching. Proven strategies for cutting LLM API spend.

Wring Team
March 14, 2026
6 min read
AWS Bedrock, LLM cost optimization, AI costs, Bedrock pricing, model selection, token optimization
AI model optimization and cost management

AWS Bedrock usage is growing faster than almost any other AWS service, and so are the bills. A single Claude or Llama model powering a production application can easily cost $10,000-50,000/month in token charges. The model you choose, how you structure prompts, and whether you use features like batch inference or prompt caching determine whether your AI costs are sustainable or spiraling.

TL;DR: The three biggest Bedrock savings: (1) Use the smallest model that meets quality requirements — Claude Haiku costs 95% less than Claude Opus for many tasks. (2) Enable batch inference for non-real-time workloads — 50% discount on token prices. (3) Implement prompt caching for repeated system prompts — up to 90% reduction on cached input tokens. These three strategies combined can reduce Bedrock costs by 60-80%.


Bedrock Pricing Quick Reference

Per-Token Pricing (Popular Models)

Model                Input (per 1K tokens)   Output (per 1K tokens)
Claude 4 Opus        $0.015                  $0.075
Claude 4 Sonnet      $0.003                  $0.015
Claude 4 Haiku       $0.0008                 $0.004
Llama 3.1 70B        $0.00099                $0.00099
Llama 3.1 8B         $0.00022                $0.00022
Titan Text Express   $0.0002                 $0.0006
Mistral Large        $0.004                  $0.012

Strategy 1: Choose the Right Model Size

The most impactful cost decision is model selection. Many tasks don't require the most capable model.

Task Type                    Recommended Model     vs Opus Savings
Classification, routing      Haiku or Llama 8B     95%
Summarization, extraction    Sonnet or Llama 70B   80%
Simple Q&A, formatting       Haiku or Titan        95%
Complex reasoning, coding    Sonnet                80%
Research, creative writing   Opus                  Baseline

Implementation: Build a model router that classifies incoming requests and sends them to the appropriate model. Simple classification requests should never hit Opus.
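
A minimal router sketch in Python, assuming the Converse API via boto3. The task labels and model IDs below are illustrative placeholders, not official recommendations; check the Bedrock model catalog for the exact IDs available in your region.

```python
# Illustrative model IDs -- verify against the Bedrock model catalog
# for your region before using.
MODEL_BY_TASK = {
    "classification": "anthropic.claude-3-haiku-20240307-v1:0",
    "summarization": "anthropic.claude-3-5-sonnet-20240620-v1:0",
    "reasoning": "anthropic.claude-3-5-sonnet-20240620-v1:0",
    "creative": "anthropic.claude-3-opus-20240229-v1:0",
}
# Unknown tasks fall through to the cheapest model, never to Opus.
DEFAULT_MODEL = MODEL_BY_TASK["classification"]


def choose_model(task_type: str) -> str:
    """Route a request to the cheapest model adequate for the task."""
    return MODEL_BY_TASK.get(task_type, DEFAULT_MODEL)


def invoke(task_type: str, prompt: str) -> str:
    """Send the prompt to the routed model via the Converse API."""
    import boto3  # deferred so the routing logic above imports without AWS

    client = boto3.client("bedrock-runtime")
    resp = client.converse(
        modelId=choose_model(task_type),
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return resp["output"]["message"]["content"][0]["text"]
```

In practice the task type itself can come from a cheap classifier call (Haiku classifying for Haiku), so routing overhead stays negligible next to the savings.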

Strategy 2: Use Batch Inference

Bedrock Batch Inference processes large volumes of requests asynchronously at a 50% discount on token prices.

Use Case                      On-Demand Cost   Batch Cost   Savings
1M document classifications   $800             $400         50%
100K email summaries          $300             $150         50%
500K content generations      $7,500           $3,750       50%

Batch is ideal for: document processing pipelines, content generation queues, data extraction jobs, and any workload that doesn't need sub-second response times.
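
A sketch of submitting a batch job with boto3, assuming the input records are already staged as a JSONL file in S3 and an IAM role with access to both buckets exists. The bucket URIs, role ARN, and model ID here are placeholders.

```python
def batch_job_request(job_name: str, model_id: str, role_arn: str,
                      input_s3: str, output_s3: str) -> dict:
    """Build the request for bedrock.create_model_invocation_job.

    The input must be a JSONL file of invocation records in S3;
    results land asynchronously in the output prefix.
    """
    return {
        "jobName": job_name,
        "modelId": model_id,
        "roleArn": role_arn,
        "inputDataConfig": {"s3InputDataConfig": {"s3Uri": input_s3}},
        "outputDataConfig": {"s3OutputDataConfig": {"s3Uri": output_s3}},
    }


def start_batch_job(**kwargs) -> str:
    import boto3  # deferred so the builder above imports without AWS

    # Batch jobs go through the "bedrock" control-plane client,
    # not "bedrock-runtime".
    bedrock = boto3.client("bedrock")
    return bedrock.create_model_invocation_job(**batch_job_request(**kwargs))["jobArn"]
```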

Strategy 3: Implement Prompt Caching

Bedrock supports prompt caching for Claude models. Cached input tokens cost up to 90% less than standard input tokens.

Component                      Cost
Standard input tokens          Full price
Cache write (first use)        25% premium
Cache read (subsequent uses)   90% discount
Cache TTL                      5 minutes (auto-extended on hit)

Best for: System prompts, few-shot examples, and context documents that repeat across many requests. With Claude Sonnet, a 10,000-token system prompt used 1,000 times costs about $3 cached (one cache write at the 25% premium plus 999 reads at the 90% discount) vs $30 uncached.
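
In the Converse API, a cachePoint block marks everything before it in the request as cacheable. A minimal request-builder sketch (the model ID and prompts are placeholders):

```python
def cached_converse_request(model_id: str, system_prompt: str,
                            user_text: str) -> dict:
    """Converse request that marks the system prompt as cacheable.

    The cachePoint block tells Bedrock to cache everything that
    precedes it; only the user message varies per request.
    """
    return {
        "modelId": model_id,
        "system": [
            {"text": system_prompt},
            {"cachePoint": {"type": "default"}},  # cache boundary
        ],
        "messages": [{"role": "user", "content": [{"text": user_text}]}],
    }


# Pass the dict to boto3's bedrock-runtime client:
#   client.converse(**cached_converse_request(model_id, system_prompt, question))
```

The first request writes the cache (25% premium on those tokens); every request within the TTL reads it at the 90% discount.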

Strategy 4: Optimize Token Usage

Every token costs money. Reduce token consumption without sacrificing output quality.

Input token reduction:

  • Trim system prompts to essential instructions only
  • Use concise few-shot examples instead of lengthy descriptions
  • Summarize long context documents before injection
  • Remove redundant instructions

Output token reduction:

  • Set max_tokens to the minimum needed
  • Use structured output formats (JSON) with constrained schemas
  • Request concise responses explicitly in the prompt
  • Use stop_sequences to prevent unnecessary continuation
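
Several of these knobs map directly onto the Converse API's inferenceConfig. A sketch of a request builder that caps output spend; the default values here are illustrative, not recommendations:

```python
def concise_request(model_id: str, prompt: str, max_tokens: int = 256) -> dict:
    """Converse request tuned to minimize billed output tokens."""
    return {
        "modelId": model_id,
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": {
            "maxTokens": max_tokens,    # hard cap on billed output tokens
            "stopSequences": ["\n\n"],  # cut generation before it rambles on
            "temperature": 0.0,         # deterministic output for extraction tasks
        },
    }
```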

Strategy 5: Use Provisioned Throughput

For predictable, high-volume workloads, Provisioned Throughput offers reserved capacity at lower per-token cost.

Commitment   Discount
1-month      ~20%
6-month      ~35%

Break-even: Provisioned Throughput is cheaper when you consistently use more than 60% of the provisioned capacity. Monitor actual usage for 2-4 weeks before committing.
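
The break-even point is simple arithmetic, sketched here with hypothetical numbers:

```python
def breakeven_utilization(provisioned_monthly_cost: float,
                          on_demand_cost_at_full_load: float) -> float:
    """Average fraction of provisioned capacity you must use for
    Provisioned Throughput to beat on-demand token pricing.

    on_demand_cost_at_full_load: what the same month's tokens would
    cost at on-demand rates if the unit ran at 100% capacity.
    """
    return provisioned_monthly_cost / on_demand_cost_at_full_load


# Hypothetical: a $20,000/month model unit that could serve $33,333/month
# worth of tokens at on-demand rates breaks even around 60% utilization.
print(round(breakeven_utilization(20_000, 33_333), 2))  # prints 0.6
```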

Strategy 6: Implement Semantic Caching

Cache LLM responses for semantically similar queries. When a new query is similar enough to a previously answered one, return the cached response instead of calling Bedrock.

Tools: Use vector databases (OpenSearch, ElastiCache for Valkey with vector search) to store embeddings of previous queries and responses. Set a similarity threshold (e.g., 0.95 cosine similarity) for cache hits.

Savings: 30-70% reduction in API calls depending on query diversity.
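
A toy in-memory sketch of the idea. A production system would store embeddings in a vector database and compute them with a Bedrock embedding model, but the cache logic is the same; the embed callable here is supplied by you.

```python
import math


class SemanticCache:
    """Tiny in-memory semantic cache keyed on embedding similarity."""

    def __init__(self, embed, threshold: float = 0.95):
        self.embed = embed          # callable: text -> list[float]
        self.threshold = threshold  # cosine similarity for a cache hit
        self.entries = []           # list of (embedding, response)

    @staticmethod
    def _cosine(a, b) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def get(self, query: str):
        """Return a cached response if a prior query is similar enough."""
        qv = self.embed(query)
        best = max(self.entries, key=lambda e: self._cosine(qv, e[0]), default=None)
        if best and self._cosine(qv, best[0]) >= self.threshold:
            return best[1]  # cache hit: skip the Bedrock call entirely
        return None

    def put(self, query: str, response: str):
        self.entries.append((self.embed(query), response))
```

On a miss, call Bedrock, then put() the result so future similar queries hit the cache.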

Strategy 7: Implement Request Deduplication

Track in-flight requests and return the same response for duplicate or near-duplicate concurrent requests. This is common in applications where multiple users ask similar questions simultaneously.
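
One way to sketch this in Python: an event per request key, so concurrent callers with the same key share the leader's result instead of each hitting Bedrock. Error propagation to waiters is omitted for brevity.

```python
import threading


class InFlightDeduper:
    """Coalesce concurrent identical requests so only one hits Bedrock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # request key -> (done event, shared result box)

    def call(self, key, fn):
        with self._lock:
            entry = self._inflight.get(key)
            leader = entry is None
            if leader:
                entry = (threading.Event(), {})
                self._inflight[key] = entry
        done, box = entry
        if leader:
            try:
                box["result"] = fn()  # only the first caller invokes the model
            finally:
                with self._lock:
                    del self._inflight[key]
                done.set()  # wake every waiter that joined this request
            return box["result"]
        done.wait()
        return box["result"]
```

The key would typically be a hash of the normalized prompt; combine this with the semantic cache above for both concurrent and sequential duplicates.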

Strategy 8: Monitor and Set Budgets

Use CloudWatch metrics to track:

  • Invocations — total API calls
  • InputTokenCount and OutputTokenCount — token usage per model
  • InvocationLatency — helps identify model sizing opportunities

Set AWS Budgets alerts at 50%, 80%, and 100% of your monthly AI budget to catch runaway costs early.
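
A sketch of pulling hourly token usage with boto3, assuming the AWS/Bedrock CloudWatch namespace and its ModelId dimension; the model ID is a placeholder.

```python
from datetime import datetime, timedelta, timezone


def token_usage_query(model_id: str, hours: int = 24) -> dict:
    """Parameters for cloudwatch.get_metric_statistics on Bedrock token usage."""
    now = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/Bedrock",
        "MetricName": "InputTokenCount",  # swap for OutputTokenCount as needed
        "Dimensions": [{"Name": "ModelId", "Value": model_id}],
        "StartTime": now - timedelta(hours=hours),
        "EndTime": now,
        "Period": 3600,          # one datapoint per hour
        "Statistics": ["Sum"],
    }


def fetch_token_usage(model_id: str):
    import boto3  # deferred so the query builder imports without AWS

    cw = boto3.client("cloudwatch")
    return cw.get_metric_statistics(**token_usage_query(model_id))["Datapoints"]
```

Multiply the summed token counts by the per-token prices above to turn this into a daily cost-per-model report.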


FAQ

How do I choose between Claude Haiku and Sonnet?

Test both on your actual use cases with a quality evaluation framework. If Haiku reaches at least 90% of Sonnet's quality for a given task, use Haiku (roughly 4x cheaper per token, per the pricing table above). Common Haiku-appropriate tasks: classification, entity extraction, simple summarization, formatting.

Is self-hosting open-source models on SageMaker cheaper?

For high-volume workloads (over $5,000/month in Bedrock costs), self-hosting Llama models on SageMaker Inference endpoints can be 40-60% cheaper. Below that volume, Bedrock's per-token pricing is usually more cost-effective due to zero infrastructure management.

How much does prompt caching actually save?

For applications with stable system prompts over 1,000 tokens, prompt caching saves 80-90% on input token costs. The 25% cache write premium is recovered by the very first cache read. For a system prompt used 10,000 times daily, the savings are substantial.


Lower Your Bedrock Costs with Wring

Wring helps you access AWS credits and volume discounts to lower your Bedrock costs. Through group buying power, Wring negotiates better rates so you pay less per model inference.

Start saving on Bedrock →