
SageMaker Inference: Real-Time vs Serverless

Compare SageMaker inference options: Real-Time, Serverless, Batch, and Async. Real-Time from $0.05/hr. Serverless scales to zero. Pick the right one.

Wring Team
March 15, 2026
8 min read
SageMaker inference · ML inference costs · serverless inference · real-time endpoints
Server infrastructure with neural network visualization for ML inference workloads

SageMaker offers four distinct inference options, and choosing the wrong one can cost you 5-10x more than necessary. The difference between a real-time endpoint running 24/7 on a GPU and a serverless endpoint that scales to zero is hundreds of dollars a month per model. Understanding when to use each option is the single most impactful SageMaker cost decision most teams face.

AWS SageMaker inference is billed per second of instance uptime for real-time endpoints, per request duration (in GB-seconds) for serverless, per second of processing time for batch, and at real-time instance rates (with scale-to-zero) for async. Each mode has its own sweet spot.

TL;DR: SageMaker has 4 inference modes. Real-Time Endpoints start at $0.05/hr (ml.t3.medium) but run 24/7. Serverless Inference scales to zero with 1-5s cold starts. Batch Transform handles offline workloads at training-instance rates. Async Inference queues large payloads. For models under 1,000 requests/hour, Serverless saves 40-80% over always-on endpoints.


The Four Inference Options

Real-Time Endpoints

Always-on instances that serve predictions with low latency. You pay for the instance from the moment it's created until it's deleted, regardless of traffic.

| Instance | GPU | On-Demand/hr | Monthly (24/7) | Best For |
|---|---|---|---|---|
| ml.t3.medium | None | $0.05 | $37 | Simple models, low-latency CPU |
| ml.c7g.large (Graviton) | None | $0.10 | $73 | CPU inference, cost-efficient |
| ml.g5.xlarge (A10G) | 1x 24 GB | $1.21 | $883 | GPU models, LLMs under 7B |
| ml.g5.2xlarge (A10G) | 1x 24 GB | $1.52 | $1,110 | Larger GPU models |
| ml.inf2.xlarge (Inferentia2) | 2 NeuronCores | $0.76 | $555 | Optimized transformer inference |
| ml.p4d.24xlarge (A100) | 8x 40 GB (320 GB total) | $37.69 | $27,514 | Large LLMs, 70B+ parameter models |
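The monthly figures in the table are simply the on-demand hourly rate multiplied by roughly 730 hours in a month. A quick sanity check:

```python
# Monthly 24/7 cost of a real-time endpoint: hourly on-demand rate
# times ~730 hours per month (rates taken from the table above).
HOURS_PER_MONTH = 730

rates = {
    "ml.g5.xlarge": 1.21,
    "ml.g5.2xlarge": 1.52,
    "ml.inf2.xlarge": 0.76,
}

for instance, rate in rates.items():
    print(f"{instance}: ${rate * HOURS_PER_MONTH:,.0f}/month")
```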

Real-time endpoints support auto-scaling, which adjusts instance count based on traffic metrics. Without auto-scaling, you pay full price even at zero traffic.

Serverless Inference

Pay-per-request pricing with automatic scaling, including scaling to zero.

| Resource | Price / Limit |
|---|---|
| Duration | $0.0000667 per GB-second of provisioned memory |
| Requests | $0.20 per 1 million requests |
| Cold start | 1-5 seconds (varies by model size) |
| Max memory | 6 GB |
| Max concurrency | 200 concurrent requests |

Serverless inference is ideal for models that receive sporadic or unpredictable traffic. You provision memory (1-6 GB) and SageMaker handles everything else. The catch: cold starts add 1-5 seconds when the endpoint has been idle.
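Putting the per-GB-second and per-request rates together, a quick cost model (the workload numbers below are illustrative, not from AWS):

```python
# Estimate monthly Serverless Inference cost from the pricing above.
DURATION_RATE = 0.0000667        # $ per GB-second of provisioned memory
REQUEST_RATE = 0.20 / 1_000_000  # $ per request

def serverless_monthly_cost(requests, avg_duration_s, memory_gb):
    duration_cost = requests * avg_duration_s * memory_gb * DURATION_RATE
    request_cost = requests * REQUEST_RATE
    return duration_cost + request_cost

# Example: 100K requests/month, 0.5 s average inference, 4 GB memory
print(f"${serverless_monthly_cost(100_000, 0.5, 4):.2f}")  # ≈ $13.36
```

At this volume the request charge ($0.02) is negligible; duration dominates, so smaller memory configurations and faster models translate almost linearly into savings.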

Batch Transform

Process entire datasets offline. You pay for the compute instances only while the batch job runs.

| Instance | On-Demand/hr | 1M Records (est. time) | Total Cost |
|---|---|---|---|
| ml.m5.xlarge (CPU) | $0.23 | ~10 hours | $2.30 |
| ml.g5.xlarge (GPU) | $1.01 | ~2 hours | $2.02 |
| ml.c7g.xlarge (Graviton) | $0.20 | ~12 hours | $2.40 |

Batch Transform is billed per second of instance usage. There is no charge for idle time because the instances terminate when the job completes.
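A minimal Batch Transform job request looks like the sketch below, passed to boto3's `sagemaker.create_transform_job(**job)`. The job name, model name, and S3 paths are hypothetical placeholders:

```python
# Sketch of a Batch Transform job request (hypothetical names/paths).
# Instances are billed per second and terminate when the job completes.
job = {
    "TransformJobName": "nightly-scoring-2026-03-15",
    "ModelName": "my-model",               # an existing SageMaker model
    "TransformInput": {
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://my-bucket/input/",
        }},
        "ContentType": "text/csv",
        "SplitType": "Line",               # one record per line
    },
    "TransformOutput": {"S3OutputPath": "s3://my-bucket/output/"},
    "TransformResources": {
        "InstanceType": "ml.m5.xlarge",
        "InstanceCount": 1,
    },
}
# boto3.client("sagemaker").create_transform_job(**job)
```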

Async Inference

Designed for requests with large payloads (up to 1 GB) or long processing times (up to 15 minutes). Requests are queued in an internal buffer and processed asynchronously.

| Feature | Detail |
|---|---|
| Pricing | Same as real-time endpoint instances |
| Scale to zero | Yes, min instances can be 0 |
| Max payload | 1 GB |
| Max processing time | 15 minutes per request |
| Notification | SNS notifications on completion |

Async inference combines the managed infrastructure of real-time endpoints with the ability to scale to zero. It's the best option for workloads with large payloads that don't need immediate responses.
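Async inference is configured at the endpoint-config level by adding an `AsyncInferenceConfig` block, as in this sketch for boto3's `sagemaker.create_endpoint_config(**config)` (names, bucket, and topic ARNs are hypothetical):

```python
# Sketch of an Async Inference endpoint configuration.
config = {
    "EndpointConfigName": "my-async-config",
    "ProductionVariants": [{
        "VariantName": "AllTraffic",
        "ModelName": "my-model",
        "InstanceType": "ml.g5.xlarge",
        "InitialInstanceCount": 1,   # scale to 0 via Application Auto Scaling
    }],
    "AsyncInferenceConfig": {
        "OutputConfig": {
            "S3OutputPath": "s3://my-bucket/async-results/",
            "NotificationConfig": {  # SNS notifications on completion
                "SuccessTopic": "arn:aws:sns:us-east-1:123456789012:success",
                "ErrorTopic": "arn:aws:sns:us-east-1:123456789012:errors",
            },
        },
    },
}
# boto3.client("sagemaker").create_endpoint_config(**config)
```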


Cost Comparison by Traffic Pattern

The right inference option depends entirely on your traffic pattern. Here's a comparison for a model running on ml.g5.xlarge-equivalent compute:

| Monthly Requests | Real-Time (24/7) | Serverless | Batch (daily) | Async (scale-to-zero) |
|---|---|---|---|---|
| 1,000 | $883 | $3 | $2 | $15 |
| 10,000 | $883 | $18 | $2 | $25 |
| 100,000 | $883 | $142 | $5 | $120 |
| 500,000 | $883 | $680 | $15 | $500 |
| 1,000,000 | $883 | $1,320 | $30 | $850 |
| 5,000,000 | $883 | $6,500 | $60 | $883 |

Key takeaways from this table:

  • Under 100K requests/month: Serverless or Async saves 80-99% vs real-time
  • 100K-500K requests/month: Serverless is competitive; real-time becomes cost-effective above 500K
  • Over 1M requests/month: Real-time endpoints are the cheapest option
  • Batch is always cheapest for offline workloads where latency does not matter
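The crossover can be computed directly. Assuming a per-request serverless profile of roughly 3.3 s at 6 GB, which reproduces the table's serverless figures (about $1.32 per 1,000 requests), the break-even against an always-on ml.g5.xlarge works out like this:

```python
# Break-even volume where Serverless cost equals an always-on
# ml.g5.xlarge ($883/month). The 3.3 s / 6 GB workload profile is an
# assumption chosen to match the table's serverless column.
REALTIME_MONTHLY = 883.0
per_request = 3.3 * 6 * 0.0000667 + 0.20 / 1_000_000  # duration + request fee

breakeven = REALTIME_MONTHLY / per_request
print(f"Break-even: ~{breakeven:,.0f} requests/month")  # roughly 670K
```

Below that volume, serverless wins; above it, the flat real-time rate is cheaper, which is consistent with the table's crossover between 500K and 1M requests.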

Auto-Scaling Real-Time Endpoints

If you must use real-time endpoints, auto-scaling is essential to control costs. Configure scaling based on CloudWatch metrics:

Target tracking scaling:

  • Metric: InvocationsPerInstance
  • Target: 70-80% of your model's max throughput
  • Min instances: 1 (or 0 if using async)
  • Max instances: based on peak traffic
  • Scale-in cooldown: 300 seconds
  • Scale-out cooldown: 60 seconds
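The target-tracking setup above maps onto the Application Auto Scaling API. A boto3-style sketch (endpoint and variant names are hypothetical):

```python
# Sketch: register a real-time endpoint variant as a scalable target
# and attach a target-tracking policy on invocations per instance.
resource_id = "endpoint/my-endpoint/variant/AllTraffic"

target = {
    "ServiceNamespace": "sagemaker",
    "ResourceId": resource_id,
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "MinCapacity": 1,
    "MaxCapacity": 4,
}
policy = {
    "PolicyName": "invocations-target-tracking",
    "ServiceNamespace": "sagemaker",
    "ResourceId": resource_id,
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingScalingPolicyConfiguration": {
        "TargetValue": 70.0,   # invocations/instance, ~70-80% of max throughput
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
}
# client = boto3.client("application-autoscaling")
# client.register_scalable_target(**target)
# client.put_scaling_policy(**policy)
```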

Scheduled scaling for predictable patterns:

  • Scale up before business hours (e.g., 8 AM)
  • Scale down after hours (e.g., 8 PM)
  • Reduce capacity on weekends
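Scheduled scaling uses the same Application Auto Scaling service via `put_scheduled_action`. A sketch of the business-hours pattern above (names and capacities are hypothetical):

```python
# Sketch: scheduled scale-up before business hours and scale-down after.
common = {
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/my-endpoint/variant/AllTraffic",
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
}
scale_up = dict(common,
    ScheduledActionName="weekday-morning-scale-up",
    Schedule="cron(0 8 ? * MON-FRI *)",    # 8 AM, Mon-Fri
    ScalableTargetAction={"MinCapacity": 2, "MaxCapacity": 6},
)
scale_down = dict(common,
    ScheduledActionName="weekday-evening-scale-down",
    Schedule="cron(0 20 ? * MON-FRI *)",   # 8 PM, Mon-Fri
    ScalableTargetAction={"MinCapacity": 1, "MaxCapacity": 2},
)
# client = boto3.client("application-autoscaling")
# client.put_scheduled_action(**scale_up)
# client.put_scheduled_action(**scale_down)
```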

Auto-scaling reduces real-time endpoint costs by 40-65% for workloads with variable traffic.


Multi-Model Endpoints

Multi-model endpoints host multiple models on a single instance. SageMaker dynamically loads and unloads models from memory based on request traffic.

| Scenario | Dedicated Endpoints | Multi-Model Endpoint |
|---|---|---|
| 5 low-traffic models on ml.g5.xlarge | $4,415/month | $883/month |
| 10 low-traffic models on ml.g5.xlarge | $8,830/month | $883-$1,766/month |
| 20 low-traffic models on ml.c7g.large | $1,460/month | $146/month |

Trade-offs:

  • Slightly higher latency when a model must be loaded from S3 (first request or after eviction)
  • All models share the same instance resources
  • Models must use the same framework (e.g., all PyTorch or all TensorFlow)

Multi-model endpoints can reduce inference costs by 80-90% for organizations hosting many models with individually low traffic.
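Mechanically, a multi-model endpoint is created by pointing the container at an S3 prefix holding many model artifacts and setting `Mode` to `MultiModel`; at invocation time, `TargetModel` selects which artifact to serve. A boto3-style sketch (names, image URI, and paths are hypothetical):

```python
# Sketch: one model definition serving many artifacts from an S3 prefix.
model = {
    "ModelName": "shared-mme",
    "ExecutionRoleArn": "arn:aws:iam::123456789012:role/SageMakerRole",
    "PrimaryContainer": {
        "Image": "<inference-container-image-uri>",  # framework container
        "Mode": "MultiModel",
        "ModelDataUrl": "s3://my-bucket/models/",    # prefix of .tar.gz artifacts
    },
}
# boto3.client("sagemaker").create_model(**model)

# At invocation time, TargetModel picks the artifact to load and serve:
invoke = {
    "EndpointName": "shared-mme-endpoint",
    "TargetModel": "model-a.tar.gz",
    "ContentType": "application/json",
    "Body": b'{"inputs": [1, 2, 3]}',
}
# boto3.client("sagemaker-runtime").invoke_endpoint(**invoke)
```

The first request for a given `TargetModel` pays the S3 load latency mentioned in the trade-offs; subsequent requests hit the in-memory copy until it is evicted.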


Cost Optimization Tips

  1. Start with Serverless Inference for new models. It costs almost nothing at low traffic. Migrate to real-time only when traffic exceeds 500K requests/month or cold starts become unacceptable.

  2. Enable auto-scaling on every real-time endpoint. An unscaled endpoint paying for 24/7 GPU capacity with variable traffic wastes 40-65% of spend.

  3. Use multi-model endpoints for model portfolios. If you have 5 or more models each receiving fewer than 10K requests/day, consolidating onto shared instances saves 80%+.

  4. Use Batch Transform for anything not time-sensitive. Nightly scoring, weekly retraining, bulk predictions — batch is 10-50x cheaper than keeping an endpoint running.

  5. Choose Async Inference for large payloads. Video, audio, or document processing models that accept large inputs benefit from async's queue-based architecture and scale-to-zero capability.

  6. Consider Inferentia2 instances for transformer models. ml.inf2.xlarge ($0.76/hr) delivers comparable throughput to ml.g5.xlarge ($1.21/hr) for supported model architectures, saving 37%.

  7. Monitor endpoint utilization with CloudWatch. Track CPUUtilization, GPUUtilization, and InvocationsPerInstance. If GPU utilization is consistently under 30%, downsize the instance.
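For tip 7, a CloudWatch query for endpoint GPU utilization can be sketched as follows (endpoint and variant names are hypothetical):

```python
# Sketch: fetch hourly average GPU utilization for an endpoint over the
# last week via CloudWatch get_metric_statistics.
from datetime import datetime, timedelta, timezone

now = datetime.now(timezone.utc)
query = {
    "Namespace": "/aws/sagemaker/Endpoints",   # instance-level endpoint metrics
    "MetricName": "GPUUtilization",
    "Dimensions": [
        {"Name": "EndpointName", "Value": "my-endpoint"},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    "StartTime": now - timedelta(days=7),
    "EndTime": now,
    "Period": 3600,            # hourly datapoints
    "Statistics": ["Average"],
}
# datapoints = boto3.client("cloudwatch").get_metric_statistics(**query)["Datapoints"]
# If the average sits consistently under 30%, downsize the instance.
```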


FAQ

When should I use Serverless Inference vs Real-Time Endpoints?

Use Serverless Inference when your model receives fewer than 500,000 requests per month, when traffic is unpredictable or bursty, or when you can tolerate 1-5 second cold starts. Use Real-Time Endpoints when you need consistently low latency (under 100ms), when traffic is steady and high-volume, or when your model requires more than 6 GB of memory.

How do I reduce cold start times for Serverless Inference?

Keep your model artifact small by using model compression and quantization. Choose the minimum memory configuration that fits your model. Use provisioned concurrency (if available) to keep instances warm. Cold starts scale with model size — a 500 MB model loads in roughly 1-2 seconds, while a 5 GB model may take 4-5 seconds.

Can I use Spot Instances for SageMaker inference?

No. Managed Spot is only available for SageMaker Training jobs, not inference endpoints. For inference cost savings, use auto-scaling, Serverless Inference, multi-model endpoints, or Inferentia2 instances instead.


Lower Your SageMaker Inference Costs with Wring

Wring helps you access AWS credits and volume discounts to lower your SageMaker inference costs. Through group buying power, Wring negotiates better rates so you pay less per inference hour.

Start saving on SageMaker inference →