There are three ways to run LLM inference on AWS: call Bedrock's API and pay per token, deploy a model on a SageMaker endpoint and pay per hour, or run the model on bare EC2 instances with full control. Each approach has a different cost structure, and the cheapest option depends almost entirely on your monthly token volume. For a 7B-class model, Bedrock's pay-per-token pricing stays cheapest until roughly 2-5B tokens/month; beyond that, self-hosted EC2 with Reserved Instances wins, provided you have the engineering team to support it. SageMaker sits in between: it rarely wins on raw cost, but it is the managed middle ground when you need custom models that Bedrock does not offer.
TL;DR: For a 7B model, the Bedrock API costs $0.22 per 1M tokens (Llama 3.1 8B) and is the cheapest option at low and moderate volume. A SageMaker endpoint on ml.g5.xlarge costs ~$1,028/month (break-even vs Bedrock at ~4.7B tokens/month). Self-hosted EC2 g5.xlarge costs ~$724/month on-demand or ~$290/month with a 3-year Reserved Instance, becoming cheaper than Bedrock only above roughly 1-3B tokens/month.
Three Approaches Compared
| Feature | Bedrock API | SageMaker Endpoint | Self-Hosted EC2 |
|---|---|---|---|
| Pricing model | Per token | Per hour | Per hour |
| Minimum cost | $0 (pay-per-use) | ~$1,028/month (24/7 endpoint) | ~$724/month (on-demand 24/7) |
| Scaling | Automatic | Auto-scaling policies | Manual or custom |
| Model selection | Bedrock catalog only | Any model | Any model |
| Infrastructure management | None | Minimal | Full responsibility |
| Latency control | Limited | Full | Full |
| GPU utilization | Shared (AWS-managed) | Dedicated | Dedicated |
Bedrock API Pricing (Pay-Per-Token)
Bedrock charges per input and output token with no minimum commitment.
Comparable Model Pricing on Bedrock
| Model | Input/1K Tokens | Output/1K Tokens | Effective Cost per 1M Mixed Tokens |
|---|---|---|---|
| Llama 3.1 8B | $0.00022 | $0.00022 | $0.22 |
| Llama 3.1 70B | $0.00099 | $0.00099 | $0.99 |
| Mistral Small | $0.001 | $0.003 | $2.00 |
| Claude Haiku | $0.0008 | $0.004 | $2.40 |
| Claude Sonnet | $0.003 | $0.015 | $9.00 |
"Mixed tokens" assumes a 50/50 input/output split.
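The "effective cost per 1M mixed tokens" column is a simple blend of the input and output rates. The helper below is an illustrative sketch (the function name and configurable ratio are ours, not an AWS API):

```python
def mixed_cost_per_1m(input_per_1k: float, output_per_1k: float,
                      input_ratio: float = 0.5) -> float:
    """Blended cost per 1M tokens, given per-1K input/output rates."""
    per_1k = input_per_1k * input_ratio + output_per_1k * (1 - input_ratio)
    return per_1k * 1000  # 1M tokens = 1,000 blocks of 1K tokens

# Llama 3.1 8B: symmetric pricing, so the blend equals either rate
print(round(mixed_cost_per_1m(0.00022, 0.00022), 2))  # 0.22
# Claude Sonnet: the 5x-more-expensive output side dominates the blend
print(round(mixed_cost_per_1m(0.003, 0.015), 2))      # 9.0
```

Shifting `input_ratio` toward input-heavy workloads (e.g. RAG with long contexts) lowers the effective rate for models with asymmetric pricing.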
Bedrock Provisioned Throughput
For sustained high-volume workloads, Bedrock Provisioned Throughput offers dedicated capacity:
| Commitment | Discount vs On-Demand |
|---|---|
| No commitment | ~15% off |
| 1-month | ~20% off |
| 6-month | ~35% off |
Provisioned Throughput is priced per model unit, not per token. It makes sense when your token volume is high enough that the per-unit cost beats pay-per-token pricing.
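Whether Provisioned Throughput pays off reduces to a monthly comparison. The sketch below uses a hypothetical per-model-unit hourly rate — actual unit pricing varies by model and region and must be taken from the AWS console:

```python
def provisioned_beats_on_demand(tokens_per_month: float,
                                on_demand_per_1m: float,
                                unit_price_per_hour: float,
                                units: int = 1,
                                hours: float = 730) -> bool:
    """True when dedicated capacity is cheaper than pay-per-token."""
    on_demand_cost = tokens_per_month / 1e6 * on_demand_per_1m
    provisioned_cost = unit_price_per_hour * units * hours
    return provisioned_cost < on_demand_cost

# Hypothetical: $20/hr per model unit vs Llama 70B at $0.99 per 1M tokens.
# $14,600/month of dedicated capacity needs ~14.75B tokens/month to pay off.
print(provisioned_beats_on_demand(20e9, 0.99, 20.0))  # True
print(provisioned_beats_on_demand(5e9, 0.99, 20.0))   # False
```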
SageMaker Endpoint Pricing (Per-Hour)
SageMaker endpoints bill for the ML instance running your model, regardless of how many (or few) queries it serves.
Common SageMaker Inference Instances
| Instance | GPU | GPU Memory | SageMaker Price/hr | Monthly (24/7) |
|---|---|---|---|---|
| ml.g5.xlarge | 1x A10G | 24 GB | $1.408 | $1,028 |
| ml.g5.2xlarge | 1x A10G | 24 GB | $1.694 | $1,237 |
| ml.g5.12xlarge | 4x A10G | 96 GB | $7.941 | $5,797 |
| ml.inf2.xlarge | 1x Inferentia2 | 32 GB | $1.109 | $810 |
| ml.p4d.24xlarge | 8x A100 | 320 GB | $37.688 | $27,512 |
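The monthly column above is just the hourly rate multiplied by a 730-hour month (AWS's usual 24/7 approximation). A minimal sketch:

```python
HOURS_PER_MONTH = 730  # AWS's standard 24/7 monthly approximation

def monthly_cost(hourly_rate: float, hours: float = HOURS_PER_MONTH) -> int:
    """Convert an hourly instance rate to an approximate monthly figure."""
    return round(hourly_rate * hours)

print(monthly_cost(1.408))   # 1028  (ml.g5.xlarge)
print(monthly_cost(37.688))  # 27512 (ml.p4d.24xlarge)
```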
SageMaker Cost Advantages
- Auto-scaling: Serverless Inference scales to zero when idle; real-time endpoints scale up and down via auto-scaling policies
- Multi-model endpoints: Share one instance across multiple models
- Managed infrastructure: No patching, no CUDA driver management
Self-Hosted EC2 Pricing (Full Control)
Running your own inference server on EC2 gives you the lowest per-hour cost but the highest operational overhead.
EC2 Inference Instance Pricing
| Instance | GPU | On-Demand/hr | 1-Year Reserved/hr | 3-Year Reserved/hr | Monthly (On-Demand) |
|---|---|---|---|---|---|
| g5.xlarge | 1x A10G | $1.006 | $0.636 | $0.399 | $724 |
| g5.2xlarge | 1x A10G | $1.212 | $0.768 | $0.482 | $873 |
| g5.12xlarge | 4x A10G | $5.672 | $3.590 | $2.253 | $4,084 |
| inf2.xlarge | 1x Inferentia2 | $0.758 | $0.479 | $0.301 | $546 |
| p4d.24xlarge | 8x A100 | $32.77 | $20.37 | $12.58 | $23,594 |
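The Reserved Instance discounts in the table can be expressed as a percentage saved vs on-demand; this is where the "37-62%" range below comes from:

```python
def reserved_savings_pct(on_demand_hr: float, reserved_hr: float) -> float:
    """Percentage saved by a Reserved Instance vs the on-demand rate."""
    return round((1 - reserved_hr / on_demand_hr) * 100, 1)

# g5.xlarge: 1-year and 3-year Reserved vs on-demand
print(reserved_savings_pct(1.006, 0.636))  # 36.8
print(reserved_savings_pct(1.006, 0.399))  # 60.3
```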
Self-Hosted Cost Advantages
- No SageMaker surcharge: Save 15-40% vs SageMaker pricing
- Reserved Instances: 37-62% savings with commitments
- Spot Instances: 60-70% savings for fault-tolerant inference
- Full CUDA control: Custom kernels, vLLM, TensorRT-LLM
Break-Even Analysis
7B Model (Llama 3.1 8B equivalent)
| Monthly Tokens | Bedrock API Cost | SageMaker (ml.g5.xlarge) | EC2 On-Demand | EC2 Reserved (1-yr) |
|---|---|---|---|---|
| 1M | $0.22 | $1,028 | $724 | $464 |
| 5M | $1.10 | $1,028 | $724 | $464 |
| 10M | $2.20 | $1,028 | $724 | $464 |
| 50M | $11.00 | $1,028 | $724 | $464 |
| 100M | $22.00 | $1,028 | $724 | $464 |
| 500M | $110.00 | $1,028 | $724 | $464 |
| 5B | $1,100.00 | $1,028 | $724 | $464 |
| 10B | $2,200.00 | $1,028 | $724 | $464 |
Break-even points for 7B model:
- Bedrock vs SageMaker: ~4.7B tokens/month
- Bedrock vs EC2 On-Demand: ~3.3B tokens/month
- Bedrock vs EC2 Reserved: ~2.1B tokens/month
For small models like Llama 8B, Bedrock's per-token pricing is extremely competitive. You would need billions of tokens per month before self-hosting becomes cheaper.
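The break-even points above follow from dividing the fixed monthly cost by the per-token rate; a short sketch (function name is ours):

```python
def break_even_tokens(monthly_fixed_cost: float,
                      api_cost_per_1m: float) -> float:
    """Monthly token volume (in billions) where fixed-cost hosting
    matches pay-per-token API pricing."""
    return monthly_fixed_cost / api_cost_per_1m * 1e6 / 1e9

# 7B model: Llama 3.1 8B on Bedrock at $0.22 per 1M tokens
print(round(break_even_tokens(1028, 0.22), 1))  # 4.7  (SageMaker ml.g5.xlarge)
print(round(break_even_tokens(724, 0.22), 1))   # 3.3  (EC2 on-demand)
print(round(break_even_tokens(464, 0.22), 1))   # 2.1  (EC2 1-yr Reserved)
```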
70B Model (Llama 3.1 70B equivalent)
| Monthly Tokens | Bedrock API Cost | SageMaker (ml.g5.48xlarge) | EC2 On-Demand (g5.48xlarge) | EC2 Reserved (1-yr) |
|---|---|---|---|---|
| 1M | $0.99 | $22,770 | $11,730 | $7,416 |
| 10M | $9.90 | $22,770 | $11,730 | $7,416 |
| 100M | $99.00 | $22,770 | $11,730 | $7,416 |
| 1B | $990.00 | $22,770 | $11,730 | $7,416 |
| 10B | $9,900.00 | $22,770 | $11,730 | $7,416 |
| 50B | $49,500.00 | $22,770 | $11,730 | $7,416 |
Break-even points for 70B model:
- Bedrock vs SageMaker: ~23B tokens/month
- Bedrock vs EC2 On-Demand: ~11.8B tokens/month
- Bedrock vs EC2 Reserved: ~7.5B tokens/month
Operational Overhead Comparison
Cost is not just compute — engineering time matters too.
| Task | Bedrock | SageMaker | Self-Hosted EC2 |
|---|---|---|---|
| Initial setup | Minutes | Hours | Days |
| Model updates | Automatic | Container rebuild | Full redeployment |
| Scaling | Automatic | Auto-scaling config | Custom solution |
| Monitoring | CloudWatch built-in | CloudWatch + SageMaker metrics | Custom dashboards |
| GPU driver management | None | None | Manual |
| Security patching | None | Minimal | Full responsibility |
| Estimated DevOps hours/month | 0-2 | 2-8 | 10-40 |
At $150/hr for ML engineering time, 20 hours/month of additional self-hosted operations adds $3,000/month to your effective cost.
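That arithmetic generalizes to a simple effective-cost check (the $150/hr rate and hours are the document's illustrative assumptions):

```python
def effective_monthly_cost(infra_cost: float, devops_hours: float,
                           hourly_rate: float = 150.0) -> float:
    """Infrastructure cost plus the loaded cost of operations time."""
    return infra_cost + devops_hours * hourly_rate

# Self-hosted g5.xlarge plus 20 hrs/month of ML engineering time
print(effective_monthly_cost(724, 20))  # 3724.0
```

At that effective cost, the self-hosted break-even vs Bedrock moves out by several times — operations time often dominates the instance bill for small deployments.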
When Each Option Wins
Choose Bedrock API When:
- Token volume is under 5B/month for small models or under 10B/month for large models
- You need access to multiple model families (Claude, Llama, Mistral) through one API
- You want zero infrastructure management
- You need Guardrails, Knowledge Bases, or Agents built in
Choose SageMaker Endpoints When:
- You need to deploy custom or fine-tuned models not available on Bedrock
- You want managed infrastructure with more control than Bedrock
- You need multi-model endpoints to serve several models from one instance
- Token volume is moderate and predictable
Choose Self-Hosted EC2 When:
- Token volume exceeds 10B/month consistently
- You have an ML platform team to manage infrastructure
- You need custom inference optimizations (vLLM, TensorRT-LLM, custom kernels)
- You can commit to 1-3 year Reserved Instances for the deepest discounts
Cost Optimization Tips
- Start with Bedrock, graduate to self-hosted — Begin with Bedrock API for rapid iteration and to establish your actual token volume. Migrate to SageMaker or EC2 only when you have predictable high volume that justifies the fixed cost.
- Use SageMaker Serverless Inference for bursty workloads — Serverless endpoints scale to zero when idle, eliminating the 24/7 cost. You pay only for the compute time during active inference.
- Deploy with vLLM or TensorRT-LLM on self-hosted instances — These inference engines provide 2-4x throughput improvement over naive model serving, effectively cutting your per-token cost by 50-75%.
- Combine Bedrock with self-hosted — Route high-volume, cost-sensitive workloads to self-hosted instances while using Bedrock for low-volume, multi-model, or experimental workloads.
- Use Inf2 instances for supported models — Whether on SageMaker or EC2, Inferentia2 instances offer 25-40% lower cost per inference than equivalent GPU instances.
- Implement request batching — Batch multiple inference requests together to maximize GPU utilization. This is especially impactful for self-hosted deployments where you are paying per hour regardless of utilization.
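The throughput tips above translate directly into per-token cost: a fixed hourly price divided by sustained throughput. The token rates below are illustrative assumptions, not benchmarks:

```python
def self_hosted_cost_per_1m(hourly_rate: float,
                            tokens_per_sec: float) -> float:
    """Per-1M-token cost of a fixed-price instance at sustained throughput."""
    return round(hourly_rate / (tokens_per_sec * 3600) * 1e6, 2)

# g5.xlarge at an assumed 300 tok/s (naive serving) vs 1,000 tok/s
# (vLLM-class batched serving) -- a ~3.3x throughput gain
print(self_hosted_cost_per_1m(1.006, 300))   # 0.93
print(self_hosted_cost_per_1m(1.006, 1000))  # 0.28
```

Under these assumptions, an optimized self-hosted g5.xlarge approaches Bedrock's $0.22/1M rate — which is why the break-even sits in the billions of tokens, not millions.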
Related Guides
- AWS Bedrock Pricing Guide
- AWS SageMaker Pricing Guide
- AWS GPU Instance Pricing Guide
- AWS Inferentia vs GPU Pricing
FAQ
At what token volume should I switch from Bedrock to self-hosting?
For 7B-class models (Llama 8B), Bedrock's per-token pricing is so low ($0.22/1M tokens) that self-hosting rarely makes sense unless you are processing billions of tokens monthly. For 70B-class models, the break-even point is roughly 7.5-23B tokens/month depending on whether you use Reserved Instances.
Can I use Bedrock Provisioned Throughput as a middle ground?
Yes. Bedrock Provisioned Throughput gives you dedicated model capacity billed per model unit per hour — similar to the SageMaker pricing model but without managing infrastructure. It is 20-35% cheaper than on-demand Bedrock for sustained workloads and eliminates the throttling risk of on-demand.
How do I calculate total cost of ownership for self-hosted LLMs?
Add compute costs (instance hours), storage costs (EBS for model weights, S3 for logs), networking costs (load balancer, data transfer), and engineering costs (DevOps time for monitoring, scaling, patching, and model updates). Most teams underestimate the engineering component, which can add $3,000-6,000/month for a dedicated ML platform engineer.
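The components listed above can be rolled into one figure; the storage and networking values below are hypothetical placeholders, not AWS quotes:

```python
def self_hosted_tco(compute: float, storage: float, network: float,
                    eng_hours: float, eng_rate: float = 150.0) -> float:
    """Monthly total cost of ownership for a self-hosted endpoint:
    compute + storage + networking + loaded engineering time."""
    return compute + storage + network + eng_hours * eng_rate

# Hypothetical g5.xlarge deployment: ~$50 EBS/S3, ~$60 ALB + data transfer
print(self_hosted_tco(compute=724, storage=50, network=60, eng_hours=20))
# 3834.0
```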
Lower Your LLM Hosting Costs with Wring
Wring helps you access AWS credits and volume discounts to lower your LLM inference costs. Through group buying power, Wring negotiates better rates so you pay less per token across Bedrock, SageMaker, and EC2.
