AWS Bedrock usage is growing faster than almost any other AWS service, and so are the bills. A single Claude or Llama model powering a production application can easily cost $10,000-50,000/month in token charges. The model you choose, how you structure prompts, and whether you use features like batch inference or prompt caching determine whether your AI costs are sustainable or spiraling.
TL;DR: The three biggest Bedrock savings: (1) Use the smallest model that meets quality requirements — Claude Haiku costs 95% less than Claude Opus for many tasks. (2) Enable batch inference for non-real-time workloads — 50% discount on token prices. (3) Implement prompt caching for repeated system prompts — up to 90% reduction on cached input tokens. These three strategies combined can reduce Bedrock costs by 60-80%.
Bedrock Pricing Quick Reference
Per-Token Pricing (Popular Models)
| Model | Input (per 1K tokens) | Output (per 1K tokens) |
|---|---|---|
| Claude 4 Opus | $0.015 | $0.075 |
| Claude 4 Sonnet | $0.003 | $0.015 |
| Claude 4 Haiku | $0.0008 | $0.004 |
| Llama 3.1 70B | $0.00099 | $0.00099 |
| Llama 3.1 8B | $0.00022 | $0.00022 |
| Titan Text Express | $0.0002 | $0.0006 |
| Mistral Large | $0.004 | $0.012 |
Strategy 1: Choose the Right Model Size
The most impactful cost decision is model selection. Many tasks don't require the most capable model.
| Task Type | Recommended Model | vs Opus Savings |
|---|---|---|
| Classification, routing | Haiku or Llama 8B | 95% |
| Summarization, extraction | Sonnet or Llama 70B | 80% |
| Simple Q&A, formatting | Haiku or Titan | 95% |
| Complex reasoning, coding | Sonnet | 80% |
| Research, creative writing | Opus | Baseline |
Implementation: Build a model router that classifies incoming requests and sends them to the appropriate model. Simple classification requests should never hit Opus.
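A model router can start as a keyword heuristic in front of the Bedrock call. The sketch below is a minimal illustration; the tier names and keyword lists are assumptions, and a production router would typically use a cheap classifier model (Haiku itself works well for this) rather than string matching:

```python
# Hypothetical keyword-based router; tier names and hint keywords are illustrative.
MODEL_TIERS = {
    "small": "haiku",    # classification, routing, formatting
    "medium": "sonnet",  # summarization, extraction, coding
    "large": "opus",     # research, creative writing
}

SIMPLE_HINTS = ("classify", "label", "route", "format")

def route(prompt: str) -> str:
    """Return the cheapest model tier expected to handle this prompt."""
    text = prompt.lower()
    if any(hint in text for hint in SIMPLE_HINTS):
        return MODEL_TIERS["small"]
    # Default to the mid-tier model; escalate to "large" only when a
    # quality check on the response fails.
    return MODEL_TIERS["medium"]
```

In production, the returned tier would map to a concrete Bedrock model ID passed to the `bedrock-runtime` client's `converse` call.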
Strategy 2: Use Batch Inference
Bedrock Batch Inference processes large volumes of requests asynchronously at a 50% discount on token prices.
| Use Case | On-Demand Cost | Batch Cost | Savings |
|---|---|---|---|
| 1M documents classification | $800 | $400 | 50% |
| 100K email summaries | $300 | $150 | 50% |
| 500K content generation | $7,500 | $3,750 | 50% |
Batch is ideal for: document processing pipelines, content generation queues, data extraction jobs, and any workload that doesn't need sub-second response times.
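As a sketch of how a batch job is submitted: boto3's `create_model_invocation_job` takes a JSONL input file and an output prefix in S3. The bucket paths, role ARN, model ID, and job name below are placeholders:

```python
def build_batch_job_request(job_name: str, model_id: str, role_arn: str,
                            input_s3: str, output_s3: str) -> dict:
    """Build kwargs for bedrock.create_model_invocation_job (batch inference)."""
    return {
        "jobName": job_name,
        "modelId": model_id,
        "roleArn": role_arn,  # IAM role with read/write access to both S3 locations
        "inputDataConfig": {"s3InputDataConfig": {"s3Uri": input_s3}},
        "outputDataConfig": {"s3OutputDataConfig": {"s3Uri": output_s3}},
    }

# import boto3
# bedrock = boto3.client("bedrock")
# bedrock.create_model_invocation_job(**build_batch_job_request(
#     "nightly-classification", "<model-id>",
#     "arn:aws:iam::111122223333:role/BedrockBatchRole",
#     "s3://my-bucket/input/records.jsonl", "s3://my-bucket/output/"))
```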
Strategy 3: Implement Prompt Caching
Bedrock supports prompt caching for Claude models. Cached input tokens cost up to 90% less than standard input tokens.
| Component | Cost |
|---|---|
| Standard input tokens | Full price |
| Cache write (first use) | 25% premium |
| Cache read (subsequent uses) | 90% discount |
| Cache TTL | 5 minutes (auto-extended on hit) |
Best for: System prompts, few-shot examples, and context documents that repeat across many requests. A 10,000-token system prompt used 1,000 times costs roughly $3 with caching versus $30 without (at Claude Sonnet's $0.003 per 1K input tokens).
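With the Converse API, caching is enabled by inserting a `cachePoint` block after the stable content. A minimal request-builder sketch (the model ID is a placeholder, and cache points are only honored by supported Claude models):

```python
def build_cached_request(model_id: str, system_prompt: str, user_text: str) -> dict:
    """Converse API request that marks a long system prompt as cacheable."""
    return {
        "modelId": model_id,
        "system": [
            {"text": system_prompt},              # large, stable instructions
            {"cachePoint": {"type": "default"}},  # cache everything above this point
        ],
        "messages": [
            {"role": "user", "content": [{"text": user_text}]},  # varies per request
        ],
    }

# import boto3
# client = boto3.client("bedrock-runtime")
# response = client.converse(**build_cached_request("<model-id>",
#                                                   SYSTEM_PROMPT, "user question"))
```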
Strategy 4: Optimize Token Usage
Every token costs money. Reduce token consumption without sacrificing output quality.
Input token reduction:
- Trim system prompts to essential instructions only
- Use concise few-shot examples instead of lengthy descriptions
- Summarize long context documents before injection
- Remove redundant instructions
Output token reduction:
- Set `max_tokens` to the minimum needed
- Use structured output formats (JSON) with constrained schemas
- Request concise responses explicitly in the prompt
- Use `stop_sequences` to prevent unnecessary continuation
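The output-side knobs map directly onto the Converse API's `inferenceConfig`. A minimal sketch; the specific values here are illustrative, not recommendations:

```python
def build_inference_config(max_tokens: int, stop_sequences: list[str]) -> dict:
    """inferenceConfig for the Converse API: cap output length and cut
    off unwanted continuations before they bill more output tokens."""
    return {
        "maxTokens": max_tokens,          # hard cap on billed output tokens
        "stopSequences": stop_sequences,  # generation halts at any of these strings
    }

# import boto3
# client = boto3.client("bedrock-runtime")
# client.converse(modelId="<model-id>", messages=messages,
#                 inferenceConfig=build_inference_config(256, ["</answer>"]))
```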
Strategy 5: Use Provisioned Throughput
For predictable, high-volume workloads, Provisioned Throughput offers reserved capacity at lower per-token cost.
| Commitment | Discount |
|---|---|
| 1-month | ~20% |
| 6-month | ~35% |
Break-even: Provisioned Throughput is cheaper when you consistently use more than 60% of the provisioned capacity. Monitor actual usage for 2-4 weeks before committing.
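The break-even logic can be sanity-checked with quick arithmetic in normalized units: on-demand you pay only for the tokens you use, while provisioned capacity is billed in full whether used or not. This is illustrative math, not quoted AWS prices:

```python
def effective_discount(list_discount: float, utilization: float) -> float:
    """Real savings vs. on-demand after accounting for idle provisioned capacity.

    list_discount: headline discount, e.g. 0.35 for a 6-month commitment
    utilization:   fraction of provisioned capacity actually consumed
    Returns a fraction; negative means provisioned costs MORE than on-demand.
    """
    on_demand_cost = utilization             # pay only for tokens you use
    provisioned_cost = 1.0 - list_discount   # pay for full capacity regardless
    return 1.0 - provisioned_cost / on_demand_cost
```

With the ~35% six-month discount, break-even falls at 65% utilization, close to the 60% rule of thumb above; at 50% utilization even the discounted capacity costs more than on-demand.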
Strategy 6: Implement Semantic Caching
Cache LLM responses for semantically similar queries. When a new query is similar enough to a previously answered one, return the cached response instead of calling Bedrock.
Tools: Use vector databases (OpenSearch, ElastiCache for Valkey with vector search) to store embeddings of previous queries and responses. Set a similarity threshold (e.g., 0.95 cosine similarity) for cache hits.
Savings: 30-70% reduction in API calls depending on query diversity.
Strategy 7: Implement Request Deduplication
Track in-flight requests and return a single shared response for duplicate or near-duplicate concurrent requests. Duplicates are common in applications where many users ask the same question at the same time.
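One way to sketch this in a threaded Python service: the first caller for a given key becomes the leader and performs the Bedrock call, while concurrent callers with the same key wait on a shared future. The class and method names are illustrative:

```python
import threading
from concurrent.futures import Future

class InFlightDeduplicator:
    """Coalesce concurrent identical requests into a single backend call."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending: dict[str, Future] = {}

    def fetch(self, key: str, compute):
        """Run compute() once per key; concurrent duplicates share the result."""
        with self._lock:
            fut = self._pending.get(key)
            leader = fut is None
            if leader:
                fut = Future()
                self._pending[key] = fut
        if leader:
            try:
                fut.set_result(compute())   # e.g. the actual Bedrock invocation
            except Exception as exc:
                fut.set_exception(exc)
            finally:
                with self._lock:
                    self._pending.pop(key, None)
        return fut.result()
```

The key would typically be a hash of the normalized prompt; only requests that overlap in time are coalesced, so sequential repeats still recompute (pair this with semantic caching for those).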
Strategy 8: Monitor and Set Budgets
Use CloudWatch metrics to track:
- `Invocations` — total API calls
- `InputTokenCount` and `OutputTokenCount` — token usage per model
- `InvocationLatency` — helps identify model sizing opportunities
Set AWS Budgets alerts at 50%, 80%, and 100% of your monthly AI budget to catch runaway costs early.
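As an illustration of pulling these metrics programmatically, the sketch below builds parameters for CloudWatch's `get_metric_statistics` against the `AWS/Bedrock` namespace; the model ID is a placeholder:

```python
from datetime import datetime, timedelta, timezone

def token_usage_query(model_id: str, days: int = 7) -> dict:
    """Parameters for cloudwatch.get_metric_statistics: daily input-token totals."""
    now = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/Bedrock",
        "MetricName": "InputTokenCount",
        "Dimensions": [{"Name": "ModelId", "Value": model_id}],
        "StartTime": now - timedelta(days=days),
        "EndTime": now,
        "Period": 86400,          # one datapoint per day
        "Statistics": ["Sum"],
    }

# import boto3
# cloudwatch = boto3.client("cloudwatch")
# datapoints = cloudwatch.get_metric_statistics(**token_usage_query("<model-id>"))
```

Multiply the daily sums by the per-token prices in the table above to get a rough daily spend per model.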
Related Guides
- AWS Bedrock Pricing Guide
- AWS Bedrock Batch Inference Guide
- AI Cost Optimization Guide
- AWS Bedrock vs OpenAI Pricing
FAQ
How do I choose between Claude Haiku and Sonnet?
Test both on your actual use cases with a quality evaluation framework. If Haiku achieves above 90% of Sonnet's quality for a given task, use Haiku (roughly 4x cheaper at the prices above). Common Haiku-appropriate tasks: classification, entity extraction, simple summarization, formatting.
Is self-hosting open-source models on SageMaker cheaper?
For high-volume workloads (over $5,000/month in Bedrock costs), self-hosting Llama models on SageMaker Inference endpoints can be 40-60% cheaper. Below that volume, Bedrock's per-token pricing is usually more cost-effective due to zero infrastructure management.
How much does prompt caching actually save?
For applications with stable system prompts over 1,000 tokens, prompt caching saves 80-90% on input token costs. The cache write premium (25%) is recovered by the first cache hit, since each cached read saves 90%. For a system prompt used 10,000 times daily, the savings are substantial.
Lower Your Bedrock Costs with Wring
Wring helps you access AWS credits and volume discounts to lower your Bedrock costs. Through group buying power, Wring negotiates better rates so you pay less per model inference.
