
AWS SageMaker Cost Optimization: Cut ML Costs

Save 40-70% on AWS SageMaker with Spot training, serverless inference, Graviton instances, and endpoint auto-scaling. 12 proven strategies.

Wring Team
March 14, 2026
9 min read
Tags: AWS SageMaker, SageMaker optimization, ML costs, inference optimization, training costs, AI infrastructure
Machine learning infrastructure optimization and cost reduction

SageMaker is one of the highest-spend AWS services for ML teams, and most of that spend is waste. Inference endpoints running 24/7 at under 10% utilization, training jobs on oversized GPU instances, notebooks left running overnight, and endpoints provisioned for peak traffic that arrives once a day. The platform has more cost optimization levers than almost any other AWS service — you just need to know where to pull.

TL;DR: The three biggest SageMaker savings: (1) Use Managed Spot Training for 60-90% off training costs — most training jobs are fault-tolerant and complete with zero interruptions. (2) Switch low-traffic inference endpoints to Serverless Inference or scale to zero with auto-scaling — idle GPU endpoints are the number one waste. (3) Use Graviton (ml.c7g, ml.m7g) instances for CPU-based inference — 20% cheaper with 20% better performance. These three alone save most teams 40-60%.


Where SageMaker Costs Hide

Before optimizing, understand where the money goes. For a typical ML team:

| Component | Typical Share | Common Waste |
|---|---|---|
| Inference endpoints | 50-70% | Idle endpoints, oversized instances |
| Training jobs | 15-25% | Full on-demand pricing, oversized GPUs |
| Notebooks/Studio | 5-15% | Running overnight, weekends |
| Storage (S3, EFS) | 3-5% | Old model artifacts, training data copies |
| Data processing | 2-5% | Unoptimized Spark jobs |

Inference endpoints dominate — a single ml.g5.xlarge endpoint costs $737/month running 24/7. Teams with 5-10 endpoints can spend $3,000-7,000/month on inference alone.


Training Cost Optimization

Strategy 1: Use Managed Spot Training

Spot instances provide up to 90% savings for training jobs. SageMaker Managed Spot handles interruptions automatically with checkpointing.

| Instance | On-Demand/hr | Spot/hr | Savings |
|---|---|---|---|
| ml.g5.xlarge (1 GPU) | $1.01 | $0.30 | 70% |
| ml.g5.2xlarge (1 GPU) | $1.52 | $0.46 | 70% |
| ml.g5.12xlarge (4 GPU) | $7.09 | $2.13 | 70% |
| ml.p4d.24xlarge (8 A100) | $32.77 | $9.83 | 70% |

How to enable: Set use_spot_instances=True and max_wait (maximum time including interruptions) in your training job configuration. Set max_wait to 2x your expected training time.

Checkpointing: Enable checkpointing to S3 so training resumes from the last checkpoint after an interruption. Most frameworks (PyTorch, TensorFlow) support this natively.

Real-world results: Most Spot training jobs complete without any interruption. When interruptions occur, checkpointing means you only lose minutes, not hours.
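At the API level, `use_spot_instances` and `max_wait` from the Python SDK map to `EnableManagedSpotTraining` and `MaxWaitTimeInSeconds` in the `CreateTrainingJob` request. Here is a minimal sketch of such a request; the job name, role ARN, image URI, and bucket paths are placeholders you would replace with your own:

```python
# Sketch: a CreateTrainingJob request with Managed Spot enabled.
# Job name, role ARN, image URI, and S3 paths are placeholders.
def spot_training_request(expected_seconds: int) -> dict:
    return {
        "TrainingJobName": "spot-demo-job",
        "AlgorithmSpecification": {
            "TrainingImage": "<your-training-image-uri>",
            "TrainingInputMode": "File",
        },
        "RoleArn": "arn:aws:iam::123456789012:role/SageMakerRole",
        "ResourceConfig": {
            "InstanceType": "ml.g5.xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 50,
        },
        "OutputDataConfig": {"S3OutputPath": "s3://my-ml-bucket/output/"},
        # Spot settings: pay Spot rates, and wait up to 2x the expected
        # runtime so interruptions don't fail the job.
        "EnableManagedSpotTraining": True,
        "StoppingCondition": {
            "MaxRuntimeInSeconds": expected_seconds,
            "MaxWaitTimeInSeconds": expected_seconds * 2,
        },
        # Checkpoints let the job resume after a Spot interruption.
        "CheckpointConfig": {"S3Uri": "s3://my-ml-bucket/checkpoints/"},
    }

request = spot_training_request(expected_seconds=3600)
# boto3.client("sagemaker").create_training_job(**request)
```

The same three settings (Spot flag, max wait, checkpoint URI) appear as `use_spot_instances`, `max_wait`, and `checkpoint_s3_uri` on an `Estimator` if you use the SageMaker Python SDK instead of raw boto3.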

Strategy 2: Right-Size Training Instances

Teams frequently use the largest GPU available "to be safe." But many training jobs are bottlenecked by data loading, not GPU compute.

How to right-size:

  1. Run a short training job (10 minutes) and monitor GPU utilization via CloudWatch
  2. If GPU utilization is under 50%, try a smaller instance
  3. If GPU memory utilization is under 30%, you're overpaying for unused VRAM
| Scenario | Oversized | Right-Sized | Savings |
|---|---|---|---|
| Fine-tuning a 7B model | ml.p4d.24xlarge ($32.77/hr) | ml.g5.2xlarge ($1.52/hr) | 95% |
| Training a tabular model | ml.g5.xlarge ($1.01/hr) | ml.c7g.2xlarge ($0.39/hr) | 61% |
| Image classification | ml.g5.12xlarge ($7.09/hr) | ml.g5.2xlarge ($1.52/hr) | 79% |
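The two utilization thresholds above can be turned into a simple decision rule. This sketch applies them to metric samples you would pull from CloudWatch (e.g. `GPUUtilization` and `GPUMemoryUtilization` datapoints via `boto3.client("cloudwatch")`); the function and thresholds mirror the steps above rather than any official AWS tool:

```python
# Sketch: apply the right-sizing thresholds from the steps above to
# CloudWatch utilization samples (percent values, 0-100).
def right_size_advice(gpu_util: list[float], gpu_mem_util: list[float]) -> str:
    avg_util = sum(gpu_util) / len(gpu_util)
    avg_mem = sum(gpu_mem_util) / len(gpu_mem_util)
    if avg_mem < 30:
        return "downsize: paying for unused VRAM"
    if avg_util < 50:
        return "downsize: GPU is idle, likely data-loading bound"
    return "keep: instance is well utilized"

# Example: low memory use during a 10-minute probe run
print(right_size_advice([35, 40, 42], [20, 25, 22]))
```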

Strategy 3: Use SageMaker Training Compiler

SageMaker Training Compiler optimizes deep learning model training, reducing training time by up to 50% for supported frameworks (PyTorch, TensorFlow with Hugging Face). Less training time = less cost, with no code changes required.

Strategy 4: Optimize Data Loading

Slow data loading leaves GPUs idle. Use these patterns:

| Technique | Impact |
|---|---|
| SageMaker Pipe Mode | Stream data from S3 instead of downloading — eliminates startup delay |
| FSx for Lustre | High-throughput shared filesystem for large datasets |
| ShardedByS3Key | Distribute data across training instances automatically |
| Prefetching | DataLoader workers load next batch while GPU processes current |

Inference Cost Optimization

Strategy 5: Use Serverless Inference for Low-Traffic Models

Serverless Inference scales to zero when idle and automatically handles traffic spikes.

| Component | Cost |
|---|---|
| Compute | Based on memory and processing time |
| Memory | From 1 GB to 6 GB |
| Cold start | 1-2 seconds (first request after idle) |
| Idle cost | $0 |

| Scenario | Real-Time Endpoint | Serverless | Savings |
|---|---|---|---|
| 100 requests/day, ml.c5.large | $63/month (24/7) | ~$5/month | 92% |
| 1K requests/day, ml.m5.xlarge | $139/month (24/7) | ~$20/month | 86% |

Use Serverless when: Traffic is sporadic, latency of 1-2 seconds is acceptable, and you don't need GPU inference.
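Switching an endpoint to Serverless is an endpoint-config change, not a model change: you swap the instance settings in the production variant for a `ServerlessConfig`. A minimal sketch (config and model names are placeholders):

```python
# Sketch: a CreateEndpointConfig request using Serverless Inference
# instead of a provisioned instance. Names are placeholders.
serverless_config = {
    "EndpointConfigName": "my-serverless-config",
    "ProductionVariants": [
        {
            "VariantName": "AllTraffic",
            "ModelName": "my-sklearn-model",
            "ServerlessConfig": {
                "MemorySizeInMB": 2048,  # 1 GB to 6 GB, in 1 GB steps
                "MaxConcurrency": 5,     # cap on concurrent invocations
            },
        }
    ],
}
# boto3.client("sagemaker").create_endpoint_config(**serverless_config)
```

Billing then follows memory size and processing time per request, with nothing charged while the endpoint sits idle.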

Strategy 6: Auto-Scale Real-Time Endpoints

Configure auto-scaling to match capacity to demand:

| Setting | Recommendation |
|---|---|
| InvocationsPerInstance | Scale based on requests per instance |
| CPUUtilization | Scale based on CPU load |
| Minimum instances | 0 (scale to zero) or 1 (always-on) |
| Cooldown | 300 seconds (scale in), 60 seconds (scale out) |

Scale to zero: Set minimum instances to 0 for development and internal models. The endpoint scales up on the first request (with cold start latency) and scales down after the cooldown period.
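Endpoint auto-scaling is configured through Application Auto Scaling: register the variant as a scalable target, then attach a target-tracking policy. This sketch uses the settings from the table above; the endpoint and variant names, capacity bounds, and target value are placeholder assumptions:

```python
# Sketch: auto-scaling a SageMaker endpoint variant via Application
# Auto Scaling. Endpoint/variant names and limits are placeholders.
resource_id = "endpoint/my-endpoint/variant/AllTraffic"

scalable_target = {
    "ServiceNamespace": "sagemaker",
    "ResourceId": resource_id,
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "MinCapacity": 0,  # 0 = scale to zero for dev/internal models
    "MaxCapacity": 4,
}

scaling_policy = {
    "PolicyName": "invocations-target-tracking",
    "ServiceNamespace": "sagemaker",
    "ResourceId": resource_id,
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingScalingPolicyConfiguration": {
        # Add an instance when average invocations/instance exceed 100.
        "TargetValue": 100.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
}
# client = boto3.client("application-autoscaling")
# client.register_scalable_target(**scalable_target)
# client.put_scaling_policy(**scaling_policy)
```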

Strategy 7: Use Inference Components (Multi-Model Endpoints)

Host multiple models on a single endpoint instance:

| Approach | 3 Models Separately | 3 Models on 1 Endpoint |
|---|---|---|
| Instances | 3x ml.g5.xlarge | 1x ml.g5.4xlarge |
| Monthly cost | $2,211 | $1,474 |
| Savings | — | 33% |

Inference Components dynamically load models into memory based on traffic, maximizing GPU utilization across models.
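Each model becomes its own inference component on the shared endpoint, created with `CreateInferenceComponent`. The sketch below registers three models; the endpoint, variant, and model names plus the resource reservations are placeholder assumptions for illustration:

```python
# Sketch: registering models as inference components on one shared
# endpoint. Names and resource numbers are placeholders.
def inference_component(name: str, model: str) -> dict:
    return {
        "InferenceComponentName": name,
        "EndpointName": "shared-g5-endpoint",
        "VariantName": "AllTraffic",
        "Specification": {
            "ModelName": model,
            "ComputeResourceRequirements": {
                # Each component reserves a slice of the shared instance.
                "NumberOfAcceleratorDevicesRequired": 1,
                "MinMemoryRequiredInMb": 4096,
            },
        },
        "RuntimeConfig": {"CopyCount": 1},
    }

components = [
    inference_component(f"component-{i}", f"my-model-{i}") for i in range(3)
]
# for c in components:
#     boto3.client("sagemaker").create_inference_component(**c)
```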

Strategy 8: Use Graviton for CPU Inference

For models that don't require a GPU (scikit-learn, XGBoost, some ONNX models), Graviton instances are roughly 20% cheaper than comparable x86 instances and deliver better price-performance:

| Instance | On-Demand/hr | Use Case |
|---|---|---|
| ml.c7g.medium (Graviton) | $0.05 | Lightweight models |
| ml.c7g.xlarge (Graviton) | $0.19 | Medium CPU models |
| ml.c6i.xlarge (Intel) | $0.24 | Same workload, 26% more expensive |

Strategy 9: Use Batch Transform for Offline Predictions

For predictions that don't need real-time responses, Batch Transform processes data in S3 and shuts down when done:

| Approach | 1M Predictions/Day | Monthly Cost |
|---|---|---|
| Real-time endpoint (24/7) | ml.c5.xlarge always on | $126/month |
| Batch Transform (2 hours/day) | ml.c5.xlarge for 2 hrs | $8/month |
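A batch job is a single `CreateTransformJob` request: it reads input from an S3 prefix, writes predictions back to S3, and releases the instance when finished. A minimal sketch (job, model, and bucket names are placeholders):

```python
# Sketch: a CreateTransformJob request for offline predictions.
# Job, model, and bucket names are placeholders.
transform_request = {
    "TransformJobName": "nightly-batch-predictions",
    "ModelName": "my-model",
    "TransformInput": {
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://my-ml-bucket/batch-input/",
            }
        },
        "ContentType": "text/csv",
        "SplitType": "Line",  # one prediction per CSV line
    },
    "TransformOutput": {"S3OutputPath": "s3://my-ml-bucket/batch-output/"},
    "TransformResources": {
        "InstanceType": "ml.c5.xlarge",
        "InstanceCount": 1,
    },
}
# boto3.client("sagemaker").create_transform_job(**transform_request)
```

Scheduling this nightly (e.g. from EventBridge) replaces a 24/7 endpoint with a couple of billed hours per day.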

Notebook and Studio Optimization

Strategy 10: Auto-Stop Idle Notebooks

SageMaker Studio notebooks and classic notebook instances run until manually stopped. Use lifecycle configurations to auto-stop after idle periods:

| Configuration | Impact |
|---|---|
| Auto-stop after 1 hour idle | Eliminates overnight/weekend costs |
| Typical savings | 60-70% of notebook costs |

An ml.t3.medium notebook running 24/7 costs $31/month. With 8-hour workday auto-stop, it costs $10/month.
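For classic notebook instances, the auto-stop hook is a lifecycle configuration whose on-start script is uploaded base64-encoded. This sketch wires a placeholder idle-check script (the actual idle detection is typically a community-pattern script polling Jupyter's API) into a `CreateNotebookInstanceLifecycleConfig` request:

```python
import base64

# Sketch: a lifecycle config that schedules an hourly idle check.
# The idle-check script path and its flags are placeholders for a
# real idle-detection script, not an AWS-provided tool.
on_start_script = """#!/bin/bash
set -e
# Placeholder: run an idle check every hour; stop the instance after
# 60 minutes without kernel activity.
echo "0 * * * * /home/ec2-user/auto-stop-idle.sh --idle-minutes 60" | crontab -
"""

lifecycle_request = {
    "NotebookInstanceLifecycleConfigName": "auto-stop-idle",
    "OnStart": [
        {"Content": base64.b64encode(on_start_script.encode()).decode()}
    ],
}
# boto3.client("sagemaker").create_notebook_instance_lifecycle_config(
#     **lifecycle_request
# )
```

SageMaker Studio has an equivalent mechanism via Studio lifecycle configurations attached to the JupyterServer or kernel apps.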

Strategy 11: Use the Right Notebook Instance

| Task | Recommended Instance | Cost/hr |
|---|---|---|
| Code editing, small data | ml.t3.medium | $0.042 |
| Data exploration, pandas | ml.m5.xlarge | $0.190 |
| GPU prototyping | ml.g4dn.xlarge | $0.526 |
| Large-scale training dev | Use training jobs instead | — |

Common waste: Developers using ml.g5.xlarge ($1.01/hr) notebooks for writing code and reviewing results, when ml.t3.medium ($0.042/hr) suffices.


Storage and Infrastructure

Strategy 12: Clean Up Model Artifacts

SageMaker stores model artifacts, training outputs, and checkpoints in S3. These accumulate:

| Artifact | Typical Size | Cleanup Strategy |
|---|---|---|
| Training checkpoints | 1-50 GB per job | Delete after final model is selected |
| Model artifacts | 0.5-20 GB per version | Keep only production and rollback versions |
| Processing job outputs | 1-10 GB per job | Delete after validation |

Set S3 lifecycle policies on your SageMaker output bucket:

  • Delete training checkpoints after 30 days
  • Move old model artifacts to Glacier after 90 days
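Those two rules translate directly into an S3 lifecycle configuration. A minimal sketch, assuming checkpoints and model artifacts live under `checkpoints/` and `models/` prefixes (the bucket name and prefixes are placeholders):

```python
# Sketch: S3 lifecycle rules for a SageMaker output bucket.
# Bucket name and prefixes are placeholders.
lifecycle_rules = {
    "Rules": [
        {
            "ID": "expire-training-checkpoints",
            "Status": "Enabled",
            "Filter": {"Prefix": "checkpoints/"},
            "Expiration": {"Days": 30},  # delete checkpoints after 30 days
        },
        {
            "ID": "archive-old-model-artifacts",
            "Status": "Enabled",
            "Filter": {"Prefix": "models/"},
            # Move old model versions to Glacier after 90 days.
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
        },
    ]
}
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-sagemaker-outputs", LifecycleConfiguration=lifecycle_rules
# )
```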

Cost Monitoring

Track these CloudWatch metrics across all SageMaker components:

| Metric | What to Watch |
|---|---|
| CPUUtilization (endpoints) | Under 20% = over-provisioned |
| GPUUtilization (endpoints) | Under 30% = consider smaller instance |
| InvocationsPerInstance | Near zero = candidate for Serverless |
| ModelLatency | Increasing = may need larger instance |
| OverheadLatency | High = networking issue, not instance size |

Use AWS Cost Explorer filtered to SageMaker with "Usage Type" grouping to see training vs inference vs notebook costs separately.
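The same breakdown is available programmatically through the Cost Explorer API. This sketch builds a `get_cost_and_usage` query filtered to SageMaker and grouped by usage type; the date range is a placeholder:

```python
# Sketch: Cost Explorer query splitting SageMaker spend by usage type
# (training vs inference vs notebook hours). Dates are placeholders.
ce_request = {
    "TimePeriod": {"Start": "2026-02-01", "End": "2026-03-01"},
    "Granularity": "MONTHLY",
    "Metrics": ["UnblendedCost"],
    "Filter": {
        "Dimensions": {"Key": "SERVICE", "Values": ["Amazon SageMaker"]}
    },
    "GroupBy": [{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
}
# boto3.client("ce").get_cost_and_usage(**ce_request)
```

Each group in the response maps to a usage type such as training hours on a given instance family, which makes it easy to chart where the optimizations above are paying off.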



FAQ

What's the single biggest SageMaker cost savings?

Switching idle inference endpoints to Serverless Inference or scaling to zero. Most teams have 2-5 endpoints running 24/7 that serve fewer than 100 requests per hour — these can be 90%+ cheaper on Serverless.

Is Managed Spot Training reliable?

Yes. In practice, most Spot training jobs complete without interruption. When interruptions occur (approximately 5-10% of jobs), checkpointing means you only lose the time since the last checkpoint (typically minutes). The 70% cost savings far outweigh the occasional restart.

How do I calculate my SageMaker ROI?

Compare the cost of SageMaker infrastructure (training + inference + notebooks) against the business value of your ML models. Then optimize: most teams can cut SageMaker costs 40-60% without affecting model performance, directly improving ROI.


Lower Your SageMaker Costs with Wring

Wring helps you access AWS credits and volume discounts to lower your SageMaker costs. Through group buying power, Wring negotiates better rates so you pay less per training hour.

Start saving on SageMaker →