SageMaker offers four distinct inference options, and choosing the wrong one can cost you 5-10x more than necessary. The difference between a real-time endpoint running 24/7 on a GPU and a serverless endpoint that scales to zero is hundreds of dollars a month per model. Understanding when to use each option is the single most impactful SageMaker cost decision most teams face.
AWS SageMaker bills each option differently: per instance-second for real-time endpoints, per request duration and provisioned memory for serverless, per instance-second of job runtime for batch, and per instance-second for async (whose instances can scale to zero). Each model has its own sweet spot.
TL;DR: SageMaker has 4 inference modes. Real-Time Endpoints start at $0.05/hr (ml.t3.medium) but run 24/7. Serverless Inference scales to zero with 1-5s cold starts. Batch Transform handles offline workloads at training-instance rates. Async Inference queues large payloads. For models under 1,000 requests/hour, Serverless saves 40-80% over always-on endpoints.
The Four Inference Options
Real-Time Endpoints
Always-on instances that serve predictions with low latency. You pay for the instance from the moment it's created until it's deleted, regardless of traffic.
| Instance | GPU | On-Demand/hr | Monthly (24/7) | Best For |
|---|---|---|---|---|
| ml.t3.medium | None | $0.05 | $37 | Simple models, low latency CPU |
| ml.c7g.large (Graviton) | None | $0.10 | $73 | CPU inference, cost-efficient |
| ml.g5.xlarge (A10G) | 1x 24GB | $1.21 | $883 | GPU models, LLMs under 7B |
| ml.g5.2xlarge (A10G) | 1x 24GB | $1.52 | $1,110 | Larger GPU models |
| ml.inf2.xlarge (Inferentia2) | 2 NeuronCores | $0.76 | $555 | Optimized transformer inference |
| ml.p4d.24xlarge (A100) | 8x A100 (40 GB each, 320 GB total) | $37.69 | $27,514 | Large LLMs, 70B+ parameter models |
Real-time endpoints support auto-scaling, which adjusts instance count based on traffic metrics. Without auto-scaling, you pay full price even at zero traffic.
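As a concrete sketch, here is the shape of the request you would pass to boto3's SageMaker client (`client.create_endpoint_config(**endpoint_config)`) to stand up a real-time endpoint, plus the always-on cost math. The endpoint and model names are placeholders, not anything from AWS.

```python
# Sketch: request payload for a real-time endpoint config, as passed to
# boto3's create_endpoint_config. Names like "my-model" are placeholders.

endpoint_config = {
    "EndpointConfigName": "my-model-realtime",   # hypothetical name
    "ProductionVariants": [
        {
            "VariantName": "AllTraffic",
            "ModelName": "my-model",             # must already exist
            "InstanceType": "ml.g5.xlarge",      # $1.21/hr from the table above
            "InitialInstanceCount": 1,
        }
    ],
}

# Billing starts when the endpoint reaches InService and stops only when
# the endpoint is deleted -- traffic level does not matter.
monthly_cost = 1.21 * 730   # hourly rate * ~730 hours/month
print(f"Always-on monthly cost: ${monthly_cost:.0f}")
```

This is what makes an idle real-time endpoint expensive: the cost line above is the same whether the endpoint serves a million requests or zero.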
Serverless Inference
Pay-per-request pricing with automatic scaling, including scaling to zero.
| Resource | Price |
|---|---|
| Duration | $0.0000667/second per GB of memory provisioned |
| Requests | $0.20 per 1 million requests |
| Cold start | 1-5 seconds (varies by model size) |
| Max memory | 6 GB |
| Max concurrency | 200 concurrent requests |
Serverless inference is ideal for models that receive sporadic or unpredictable traffic. You provision memory (1-6 GB) and SageMaker handles everything else. The catch: cold starts add 1-5 seconds when the endpoint has been idle.
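The production-variant shape for serverless, and the cost arithmetic implied by the rate table, can be sketched as follows. The variant dict matches what boto3's `create_endpoint_config` expects inside `ProductionVariants`; the model name is a placeholder.

```python
# Sketch: a serverless production variant plus a cost estimator built from
# the published rates above ($0.0000667/GB-second, $0.20 per 1M requests).

serverless_variant = {
    "VariantName": "AllTraffic",
    "ModelName": "my-model",            # placeholder
    "ServerlessConfig": {
        "MemorySizeInMB": 4096,         # 1-6 GB provisioned memory
        "MaxConcurrency": 20,           # up to 200
    },
}

def serverless_cost(requests: int, avg_seconds: float, memory_gb: float) -> float:
    """Estimated monthly cost: duration charge + per-request charge."""
    duration = 0.0000667 * avg_seconds * memory_gb * requests
    per_request = 0.20 * requests / 1_000_000
    return duration + per_request

# 10K requests/month at 0.5s each with 4 GB memory
print(f"${serverless_cost(10_000, 0.5, 4):.2f}")
```

Note that you pay for provisioned memory for the full request duration, so right-sizing memory matters as much as request volume.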
Batch Transform
Process entire datasets offline. You pay for the compute instances only while the batch job runs.
| Instance | On-Demand/hr | 1M Records (est. time) | Total Cost |
|---|---|---|---|
| ml.m5.xlarge (CPU) | $0.23 | ~10 hours | $2.30 |
| ml.g5.xlarge (GPU) | $1.21 | ~2 hours | $2.42 |
| ml.c7g.xlarge (Graviton) | $0.20 | ~12 hours | $2.40 |
Batch Transform is billed per second of instance usage. There is no charge for idle time because the instances terminate when the job completes.
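A batch job is a one-shot request rather than a standing endpoint. The sketch below shows the shape of boto3's `create_transform_job` payload and the per-job cost; bucket paths and job names are placeholders.

```python
# Sketch: a batch transform job request (shape of boto3's
# create_transform_job) and its cost. Paths and names are placeholders.

transform_job = {
    "TransformJobName": "nightly-scoring",              # hypothetical
    "ModelName": "my-model",
    "TransformInput": {
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://my-bucket/input/",           # placeholder
        }},
        "ContentType": "text/csv",
    },
    "TransformOutput": {"S3OutputPath": "s3://my-bucket/output/"},
    "TransformResources": {
        "InstanceType": "ml.m5.xlarge",
        "InstanceCount": 1,
    },
}

# Billed per second only while the job runs: ~10 hours on ml.m5.xlarge
job_cost = 0.23 * 10
print(f"1M-record job: ${job_cost:.2f}")
```

Instances spin up when the job starts and terminate when it completes, so the cost line is the entire bill for the run.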
Async Inference
Designed for requests with large payloads (up to 1 GB) or long processing times (up to 15 minutes). Requests are queued in an internal buffer and processed asynchronously.
| Feature | Detail |
|---|---|
| Pricing | Same as real-time endpoint instances |
| Scale to zero | Yes, min instances can be 0 |
| Max payload | 1 GB |
| Max processing time | 15 minutes per request |
| Notification | SNS notifications on completion |
Async inference combines the managed infrastructure of real-time endpoints with the ability to scale to zero. It's the best option for workloads with large payloads that don't need immediate responses.
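The async-specific piece is an `AsyncInferenceConfig` block passed alongside the production variants. A minimal sketch, with placeholder S3 paths and SNS topic ARNs:

```python
# Sketch: async inference config as passed to create_endpoint_config.
# S3 paths and SNS topic ARNs are placeholders.

async_config = {
    "OutputConfig": {
        "S3OutputPath": "s3://my-bucket/async-results/",   # placeholder
        "NotificationConfig": {
            "SuccessTopic": "arn:aws:sns:us-east-1:123456789012:success",
            "ErrorTopic": "arn:aws:sns:us-east-1:123456789012:error",
        },
    },
    "ClientConfig": {
        # How many queued requests each instance processes at once
        "MaxConcurrentInvocationsPerInstance": 4,
    },
}

# Passed alongside ProductionVariants:
#   client.create_endpoint_config(
#       EndpointConfigName="my-model-async",
#       ProductionVariants=[...],
#       AsyncInferenceConfig=async_config,
#   )
# Scale-to-zero comes from an auto-scaling policy with MinCapacity=0.
print(sorted(async_config))
```

Results land in the S3 output path and the SNS topics fire on completion, which is what lets callers disconnect instead of holding a request open.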
Cost Comparison by Traffic Pattern
The right inference option depends entirely on your traffic pattern. Here's a comparison for a model running on ml.g5.xlarge-equivalent compute:
| Monthly Requests | Real-Time (24/7) | Serverless | Batch (daily) | Async (scale-to-zero) |
|---|---|---|---|---|
| 1,000 | $883 | $3 | $2 | $15 |
| 10,000 | $883 | $18 | $2 | $25 |
| 100,000 | $883 | $142 | $5 | $120 |
| 500,000 | $883 | $680 | $15 | $500 |
| 1,000,000 | $883 | $1,320 | $30 | $850 |
| 5,000,000 | $883 | $6,500 | $60 | $883 |
Key takeaways from this table:
- Under 100K requests/month: Serverless or Async saves 80-99% vs real-time
- 100K-500K requests/month: Serverless is competitive, real-time becomes cost-effective above 500K
- Over 1M requests/month: Real-time endpoints are the cheapest option
- Batch is always cheapest for offline workloads where latency does not matter
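The real-time vs serverless crossover in the table can be derived directly. The table does not state its serverless assumptions; roughly 6 GB of memory and 3.3 s per request reproduces its numbers, so those values are assumed in this sketch.

```python
# Sketch: break-even between an always-on ml.g5.xlarge endpoint ($883/month)
# and serverless. The 6 GB / 3.3 s workload profile is an assumption chosen
# to approximate the serverless column in the table above.

REALTIME_MONTHLY = 883.0
DURATION_RATE = 0.0000667        # $/GB-second
REQUEST_RATE = 0.20 / 1_000_000  # $/request

def serverless_monthly(requests: int, seconds: float = 3.3, gb: float = 6.0) -> float:
    return (DURATION_RATE * seconds * gb + REQUEST_RATE) * requests

break_even = REALTIME_MONTHLY / (DURATION_RATE * 3.3 * 6.0 + REQUEST_RATE)
print(f"Serverless @ 1M req: ${serverless_monthly(1_000_000):,.0f}")
print(f"Break-even: ~{break_even:,.0f} requests/month")
```

The break-even lands between 500K and 1M requests/month for this profile, which is why the table flips in real-time's favor in that band; a lighter workload (less memory, shorter requests) pushes the crossover higher.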
Auto-Scaling Real-Time Endpoints
If you must use real-time endpoints, auto-scaling is essential to control costs. Configure scaling based on CloudWatch metrics:
Target tracking scaling:
- Metric: `InvocationsPerInstance`
- Target: 70-80% of your model's max throughput
- Min instances: 1 (or 0 if using async)
- Max instances: based on peak traffic
- Scale-in cooldown: 300 seconds
- Scale-out cooldown: 60 seconds
Scheduled scaling for predictable patterns:
- Scale up before business hours (e.g., 8 AM)
- Scale down after hours (e.g., 8 PM)
- Reduce capacity on weekends
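Target tracking is configured through Application Auto Scaling as two requests: register the variant as a scalable target, then attach the policy. A sketch with placeholder endpoint and variant names, matching the shape boto3's `application-autoscaling` client expects:

```python
# Sketch: the two Application Auto Scaling requests behind target tracking
# for an endpoint variant. Endpoint/variant names are placeholders.
# client = boto3.client("application-autoscaling")

resource_id = "endpoint/my-endpoint/variant/AllTraffic"   # placeholder

scalable_target = {
    "ServiceNamespace": "sagemaker",
    "ResourceId": resource_id,
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "MinCapacity": 1,
    "MaxCapacity": 4,          # size for peak traffic
}

scaling_policy = {
    "PolicyName": "invocations-target-tracking",
    "ServiceNamespace": "sagemaker",
    "ResourceId": resource_id,
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingScalingPolicyConfiguration": {
        # e.g. 70% of a measured max of 100 invocations/instance/minute
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
}
# client.register_scalable_target(**scalable_target)
# client.put_scaling_policy(**scaling_policy)
```

The asymmetric cooldowns (slow scale-in, fast scale-out) bias toward availability: capacity is added quickly on a traffic spike but released cautiously to avoid thrashing.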
Auto-scaling reduces real-time endpoint costs by 40-65% for workloads with variable traffic.
Multi-Model Endpoints
Multi-model endpoints host multiple models on a single instance. SageMaker dynamically loads and unloads models from memory based on request traffic.
| Scenario | Dedicated Endpoints | Multi-Model Endpoint |
|---|---|---|
| 5 low-traffic models on ml.g5.xlarge | $4,415/month | $883/month |
| 10 low-traffic models on ml.g5.xlarge | $8,830/month | $883-$1,766/month |
| 20 low-traffic models on ml.c7g.large | $1,460/month | $146/month |
Trade-offs:
- Slightly higher latency when a model must be loaded from S3 (first request or after eviction)
- All models share the same instance resources
- Models must use the same framework (e.g., all PyTorch or all TensorFlow)
Multi-model endpoints can reduce inference costs by 80-90% for organizations hosting many models with individually low traffic.
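Mechanically, a multi-model endpoint is a container created in `MultiModel` mode pointing at an S3 prefix of artifacts, and each invocation names its model via `TargetModel`. A sketch with placeholder names, paths, and image URI:

```python
# Sketch: multi-model hosting. The container definition and invoke request
# below use placeholder names; the image URI would be a real framework
# serving image in practice.

container = {
    "Image": "<inference-image-uri>",             # framework serving image
    "Mode": "MultiModel",
    "ModelDataUrl": "s3://my-bucket/models/",     # prefix of .tar.gz artifacts
}

invoke_request = {
    "EndpointName": "shared-endpoint",            # placeholder
    "TargetModel": "model-7.tar.gz",              # which artifact to serve
    "ContentType": "application/json",
    "Body": b'{"inputs": [1, 2, 3]}',
}
# runtime.invoke_endpoint(**invoke_request)
# The first request for a model pays an S3 load penalty; later requests
# hit the in-memory copy until it is evicted to make room for another model.
print(invoke_request["TargetModel"])
```

This load-on-demand behavior is the source of both the savings and the latency trade-off listed above.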
Cost Optimization Tips
- Start with Serverless Inference for new models. It costs almost nothing at low traffic. Migrate to real-time only when traffic exceeds 500K requests/month or cold starts become unacceptable.
- Enable auto-scaling on every real-time endpoint. An unscaled endpoint paying for 24/7 GPU capacity with variable traffic wastes 40-65% of spend.
- Use multi-model endpoints for model portfolios. If you have 5 or more models each receiving fewer than 10K requests/day, consolidating onto shared instances saves 80%+.
- Use Batch Transform for anything not time-sensitive. Nightly scoring, weekly retraining, bulk predictions: batch is 10-50x cheaper than keeping an endpoint running.
- Choose Async Inference for large payloads. Video, audio, or document processing models that accept large inputs benefit from async's queue-based architecture and scale-to-zero capability.
- Consider Inferentia2 instances for transformer models. ml.inf2.xlarge ($0.76/hr) delivers comparable throughput to ml.g5.xlarge ($1.21/hr) for supported model architectures, saving 37%.
- Monitor endpoint utilization with CloudWatch. Track `CPUUtilization`, `GPUUtilization`, and `InvocationsPerInstance`. If GPU utilization is consistently under 30%, downsize the instance.
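The utilization check in the last tip can be automated. A sketch, assuming the `/aws/sagemaker/Endpoints` namespace for per-endpoint hardware metrics (endpoint and variant names are placeholders):

```python
# Sketch: pulling average GPU utilization via CloudWatch's
# get_metric_statistics, then applying the "under 30%" downsizing rule.
# The namespace and dimension names are assumptions; verify against your
# endpoint's published metrics.

from datetime import datetime, timedelta, timezone

metric_request = {
    "Namespace": "/aws/sagemaker/Endpoints",
    "MetricName": "GPUUtilization",
    "Dimensions": [
        {"Name": "EndpointName", "Value": "my-endpoint"},   # placeholder
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    "StartTime": datetime.now(timezone.utc) - timedelta(days=7),
    "EndTime": datetime.now(timezone.utc),
    "Period": 3600,
    "Statistics": ["Average"],
}
# points = cloudwatch.get_metric_statistics(**metric_request)["Datapoints"]

def should_downsize(avg_utilizations: list[float]) -> bool:
    """Flag an endpoint whose GPU sits under 30% on average."""
    return sum(avg_utilizations) / len(avg_utilizations) < 30.0

print(should_downsize([12.0, 18.0, 25.0]))   # True
```

A week of hourly averages smooths out daily traffic cycles, so a sustained low reading here is a genuine downsizing signal rather than an off-peak artifact.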
Related Guides
- AWS SageMaker Pricing: Training, Inference, Studio
- AWS SageMaker Cost Optimization: Cut ML Costs
- AWS Inferentia vs GPU Pricing
- AWS LLM Hosting vs API Costs
FAQ
When should I use Serverless Inference vs Real-Time Endpoints?
Use Serverless Inference when your model receives fewer than 500,000 requests per month, when traffic is unpredictable or bursty, or when you can tolerate 1-5 second cold starts. Use Real-Time Endpoints when you need consistently low latency (under 100ms), when traffic is steady and high-volume, or when your model requires more than 6 GB of memory.
How do I reduce cold start times for Serverless Inference?
Keep your model artifact small by using model compression and quantization. Choose the minimum memory configuration that fits your model. Use provisioned concurrency (if available) to keep instances warm. Cold starts scale with model size — a 500 MB model loads in roughly 1-2 seconds, while a 5 GB model may take 4-5 seconds.
Can I use Spot Instances for SageMaker inference?
No. Managed Spot is only available for SageMaker Training jobs, not inference endpoints. For inference cost savings, use auto-scaling, Serverless Inference, multi-model endpoints, or Inferentia2 instances instead.
Lower Your SageMaker Inference Costs with Wring
Wring helps you access AWS credits and volume discounts to lower your SageMaker inference costs. Through group buying power, Wring negotiates better rates so you pay less per inference hour.
