Bedrock Batch Inference lets you submit large volumes of inference requests as a single job and receive results asynchronously — at 50% off standard on-demand pricing. Instead of paying $0.003/1K input tokens for Claude Sonnet, you pay $0.0015/1K. For document processing pipelines, content generation, and data extraction at scale, this is the single largest cost savings available.
TL;DR: Batch inference processes thousands of requests asynchronously at 50% off on-demand token prices. Input is a JSONL file in S3, output is written to S3. Turnaround time is up to 24 hours (often faster). Use batch for any workload that doesn't need real-time responses: document summarization, content classification, data extraction, email generation, translation pipelines.
How Batch Inference Works
1. Prepare: Create JSONL file with all requests → Upload to S3
2. Submit: Create batch inference job via API/console
3. Process: Bedrock processes all requests (up to 24 hours)
4. Retrieve: Results written to S3 as JSONL
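The submit step above maps to a single API call, `create_model_invocation_job` on the `bedrock` client. A minimal sketch, assuming placeholder job name, role ARN, and bucket paths (replace with your own):

```python
def build_batch_job_config(job_name, model_id, role_arn, input_s3_uri, output_s3_uri):
    """Assemble the request for bedrock.create_model_invocation_job."""
    return {
        "jobName": job_name,
        "modelId": model_id,
        "roleArn": role_arn,  # IAM role Bedrock assumes to read input and write output in S3
        "inputDataConfig": {"s3InputDataConfig": {"s3Uri": input_s3_uri}},
        "outputDataConfig": {"s3OutputDataConfig": {"s3Uri": output_s3_uri}},
    }

config = build_batch_job_config(
    "nightly-summaries",
    "anthropic.claude-3-5-sonnet-20240620-v1:0",
    "arn:aws:iam::123456789012:role/BedrockBatchRole",
    "s3://my-bucket/batch-input/requests.jsonl",
    "s3://my-bucket/batch-output/",
)

# With boto3 installed and AWS credentials configured:
# import boto3
# bedrock = boto3.client("bedrock")
# response = bedrock.create_model_invocation_job(**config)
# job_arn = response["jobArn"]  # poll with get_model_invocation_job(jobIdentifier=job_arn)
```

The IAM role must grant Bedrock read access to the input prefix and write access to the output prefix.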
Pricing Comparison
| Model | On-Demand Input/1K | Batch Input/1K | Savings |
|---|---|---|---|
| Claude Opus | $0.015 | $0.0075 | 50% |
| Claude Sonnet | $0.003 | $0.0015 | 50% |
| Claude Haiku | $0.0008 | $0.0004 | 50% |
| Llama 3.1 70B | $0.00099 | $0.000495 | 50% |
| Mistral Large | $0.004 | $0.002 | 50% |
Output tokens receive the same 50% discount.
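Since the discount is a flat 50% on both input and output tokens, estimating a batch bill is simple arithmetic over the on-demand rates in the table above:

```python
def batch_cost(tokens, on_demand_price_per_1k, discount=0.5):
    """Cost of a token volume at the batch rate (50% off the on-demand price)."""
    return tokens / 1000 * on_demand_price_per_1k * (1 - discount)

# 100M input tokens through Claude Sonnet ($0.003/1K on-demand):
print(round(batch_cost(100_000_000, 0.003), 2))  # 150.0
```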
Input Format
Each line of the JSONL file is an independent inference request:
{"recordId": "001", "modelInput": {"anthropic_version": "bedrock-2023-05-31", "max_tokens": 500, "messages": [{"role": "user", "content": "Summarize this document: ..."}]}}
{"recordId": "002", "modelInput": {"anthropic_version": "bedrock-2023-05-31", "max_tokens": 500, "messages": [{"role": "user", "content": "Summarize this document: ..."}]}}
| Field | Required | Details |
|---|---|---|
| recordId | Yes | Unique ID to match input records to output records |
| modelInput | Yes | Same format as a real-time InvokeModel request body |
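Generating the input file is just serializing one JSON object per line. A sketch using the Anthropic Messages body shown above (the zero-padded recordId scheme is an arbitrary choice):

```python
import json

def build_input_jsonl(prompts, max_tokens=500):
    """One JSONL line per request: a unique recordId plus the same body
    you would send to a real-time InvokeModel call."""
    lines = []
    for i, prompt in enumerate(prompts, start=1):
        record = {
            "recordId": f"{i:03d}",
            "modelInput": {
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": max_tokens,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        lines.append(json.dumps(record))
    return "\n".join(lines)

jsonl = build_input_jsonl(["Summarize document A", "Summarize document B"])
```

Upload the resulting file to the S3 prefix referenced in the job's input data config.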
Input Limits
| Limit | Value |
|---|---|
| Max file size | 200 MB per JSONL file |
| Max records per job | 50,000 |
| Max concurrent jobs | 5 per account per model |
| Max input tokens per record | Model-specific context window |
Output Format
Results are written to S3 as JSONL:
{"recordId": "001", "modelOutput": {"content": [{"text": "The document discusses...", "type": "text"}], "usage": {"input_tokens": 1500, "output_tokens": 200}}}
{"recordId": "002", "modelOutput": {"content": [{"text": "This report covers...", "type": "text"}], "usage": {"input_tokens": 800, "output_tokens": 150}}}
Failed records include error details instead of model output, allowing you to retry only the failures.
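A small parser can split the output file into successes and failures so only the failures are resubmitted. The exact shape of the error field below is an assumption; check your actual output records:

```python
import json

def split_results(output_jsonl):
    """Separate successful records from failed ones by recordId."""
    successes, failures = {}, {}
    for line in output_jsonl.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        if "modelOutput" in record:
            successes[record["recordId"]] = record["modelOutput"]
        else:
            # Assumed failure shape: an "error" field instead of modelOutput
            failures[record["recordId"]] = record.get("error")
    return successes, failures

sample = (
    '{"recordId": "001", "modelOutput": {"content": [{"text": "OK", "type": "text"}]}}\n'
    '{"recordId": "002", "error": {"errorMessage": "validation error"}}'
)
successes, failures = split_results(sample)
```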
Real-World Use Cases and Cost Savings
Document Summarization Pipeline
| Component | On-Demand | Batch | Savings |
|---|---|---|---|
| 50K documents, 2K tokens avg input | $300 input | $150 input | $150 |
| 500 tokens avg output | $375 output | $188 output | $187 |
| Total | $675/run | $338/run | $337 (50%) |
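The table above can be reproduced from the Claude Sonnet rates and token counts directly, which makes it easy to re-run the estimate for your own volumes:

```python
SONNET_INPUT_PER_1K = 0.003    # on-demand input rate, $/1K tokens
SONNET_OUTPUT_PER_1K = 0.015   # on-demand output rate, $/1K tokens

docs = 50_000
input_tokens = docs * 2_000    # 100M input tokens
output_tokens = docs * 500     # 25M output tokens

on_demand = (input_tokens / 1000) * SONNET_INPUT_PER_1K \
          + (output_tokens / 1000) * SONNET_OUTPUT_PER_1K
batch = on_demand / 2          # flat 50% discount
print(round(on_demand, 2), round(batch, 2))  # 675.0 337.5
```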
Content Classification
| Component | On-Demand | Batch | Savings |
|---|---|---|---|
| 500K items, 200 tokens input | $30 input | $15 input | $15 |
| 50 tokens output | $19 output | $9.50 output | $9.50 |
| Total | $49/run | $24.50/run | $24.50 (50%) |
Weekly Report Generation
| Component | On-Demand | Batch | Savings |
|---|---|---|---|
| 10K reports, 5K tokens input | $150 input | $75 input | $75 |
| 2K tokens output | $300 output | $150 output | $150 |
| Total | $450/week | $225/week | $225/week (~$900/month) |
When to Use Batch vs Real-Time
| Factor | Batch Inference | Real-Time Inference |
|---|---|---|
| Latency requirement | Up to 24 hours acceptable | Sub-second to seconds |
| Cost | 50% cheaper | Full price |
| Volume | 100s to 50,000 requests | Any volume |
| Use cases | Document processing, ETL, reports | Chatbots, APIs, interactive |
| Error handling | Per-record failures in output | Immediate retry |
| Streaming | Not supported | Supported |
Hybrid Approach
Many production systems use both:
- Real-time: User-facing features (chat, search, recommendations)
- Batch: Background processing (nightly reports, content generation, data enrichment)
Best Practices
1. Right-Size Max Tokens
Set max_tokens to the minimum each task needs. A generous cap lets the model generate longer responses than necessary, and every extra output token is billed on every record; even at the discounted rate, the waste compounds across tens of thousands of requests.
2. Use the Cheapest Adequate Model
The 50% discount applies to all models. Claude Haiku at batch pricing ($0.0004/1K input) is extremely cost-effective for classification and extraction tasks.
3. Split Large Jobs
Instead of one 50,000-record job, submit five 10,000-record jobs. This:
- Provides partial results faster
- Reduces blast radius if a job fails
- Allows different models for different batches
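Splitting is a one-liner over the record list; the 10,000-record chunk size below is the example figure from above, not a service requirement:

```python
def chunk_records(records, chunk_size=10_000):
    """Split one large record list into several smaller batch jobs."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]

jobs = chunk_records(list(range(50_000)))
print(len(jobs), len(jobs[0]))  # 5 10000
```

Each chunk then gets its own JSONL file, S3 upload, and job submission.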
4. Implement Record-Level Error Handling
Check every output record for errors. Collect failed record IDs and retry them in a follow-up batch job rather than reprocessing the entire batch.
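Because input records carry a recordId, building the retry file is a filter over the original input JSONL, assuming you kept it (or can regenerate it):

```python
import json

def build_retry_jsonl(original_input_jsonl, failed_ids):
    """Re-emit only the failed records from the original input file."""
    retry = []
    for line in original_input_jsonl.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        if record["recordId"] in failed_ids:
            retry.append(json.dumps(record))
    return "\n".join(retry)

original = (
    '{"recordId": "001", "modelInput": {"messages": []}}\n'
    '{"recordId": "002", "modelInput": {"messages": []}}'
)
retry = build_retry_jsonl(original, {"002"})
```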
5. Compress Input Data
Remove unnecessary whitespace, metadata, and formatting from input documents before batching. Every token counts at scale — a 10% reduction in input tokens across 50K documents saves significantly.
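A minimal sketch of the whitespace pass; collapsing runs of spaces and newlines shortens the text, which generally reduces token count:

```python
import re

def compact(text):
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

doc = "  Quarterly   report:\n\n  revenue   grew  "
print(compact(doc))  # "Quarterly report: revenue grew"
```

More aggressive steps (stripping boilerplate headers, HTML tags, or repeated metadata) depend on your document format but follow the same pattern.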
Related Guides
- AWS Bedrock Cost Optimization Guide
- AWS Bedrock Pricing Guide
- AWS Bedrock Embeddings Guide
- LLM Inference Cost Optimization
FAQ
How long do batch jobs take?
Most batch jobs complete within 1-6 hours; AWS targets completion within 24 hours. Smaller jobs (under 1,000 records) often finish within an hour. Processing time scales with total token volume, not record count.
Can I cancel a batch job?
Yes. You can stop a running batch job. Records already processed are available in the output file. Unprocessed records are not billed.
Is batch inference available for all Bedrock models?
Batch is available for most major models including Claude (all versions), Llama, Mistral, and Titan. Check the Bedrock API reference for the current supported model list, as availability expands regularly.
Lower Your Bedrock Batch Inference Costs with Wring
Wring helps you access AWS credits and volume discounts to lower your Bedrock batch inference costs. Through group buying power, Wring negotiates better rates so you pay less per batch inference job.
