Bedrock Batch Inference lets you submit large volumes of inference requests as a single job and receive results asynchronously — at 50% off standard on-demand pricing. Instead of paying $0.003/1K input tokens for Claude Sonnet, you pay $0.0015/1K. For document processing pipelines, content generation, and data extraction at scale, this is the single largest cost savings available.
TL;DR: Batch inference processes thousands of requests asynchronously at 50% off on-demand token prices. Input is a JSONL file in S3, output is written to S3. Turnaround time is up to 24 hours (often faster). Use batch for any workload that doesn't need real-time responses: document summarization, content classification, data extraction, email generation, translation pipelines.
How Batch Inference Works
1. Prepare: Create JSONL file with all requests → Upload to S3
2. Submit: Create batch inference job via API/console
3. Process: Bedrock processes all requests (up to 24 hours)
4. Retrieve: Results written to S3 as JSONL
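The submit step above maps to a single API call, `create_model_invocation_job` on the `bedrock` client. A minimal sketch, assuming placeholder job name, role ARN, and bucket paths (replace with your own):

```python
def build_batch_job_config(job_name, model_id, role_arn, input_s3_uri, output_s3_uri):
    """Assemble the request for bedrock.create_model_invocation_job."""
    return {
        "jobName": job_name,
        "modelId": model_id,
        "roleArn": role_arn,  # IAM role Bedrock assumes to read input and write output in S3
        "inputDataConfig": {"s3InputDataConfig": {"s3Uri": input_s3_uri}},
        "outputDataConfig": {"s3OutputDataConfig": {"s3Uri": output_s3_uri}},
    }

config = build_batch_job_config(
    "nightly-summaries",
    "anthropic.claude-3-5-sonnet-20240620-v1:0",
    "arn:aws:iam::123456789012:role/BedrockBatchRole",
    "s3://my-bucket/batch-input/requests.jsonl",
    "s3://my-bucket/batch-output/",
)

# With boto3 installed and AWS credentials configured:
# import boto3
# bedrock = boto3.client("bedrock")
# response = bedrock.create_model_invocation_job(**config)
# job_arn = response["jobArn"]  # poll with get_model_invocation_job(jobIdentifier=job_arn)
```

The IAM role must grant Bedrock read access to the input prefix and write access to the output prefix.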
Pricing Comparison
| Model | On-Demand Input/1K | Batch Input/1K | Savings |
|---|---|---|---|
| Claude Opus | $0.015 | $0.0075 | 50% |
| Claude Sonnet | $0.003 | $0.0015 | 50% |
| Claude Haiku | $0.0008 | $0.0004 | 50% |
| Llama 3.1 70B | $0.00099 | $0.000495 | 50% |
| Mistral Large | $0.004 | $0.002 | 50% |
Output tokens receive the same 50% discount.
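Since the discount is a flat 50% on both input and output tokens, estimating a batch bill is simple arithmetic over the on-demand rates in the table above:

```python
def batch_cost(tokens, on_demand_price_per_1k, discount=0.5):
    """Cost of a token volume at the batch rate (50% off the on-demand price)."""
    return tokens / 1000 * on_demand_price_per_1k * (1 - discount)

# 100M input tokens through Claude Sonnet ($0.003/1K on-demand):
print(round(batch_cost(100_000_000, 0.003), 2))  # 150.0
```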
Input Format
Each line of the JSONL file is an independent inference request:
{"recordId": "001", "modelInput": {"anthropic_version": "bedrock-2023-05-31", "max_tokens": 500, "messages": [{"role": "user", "content": "Summarize this document: ..."}]}}
{"recordId": "002", "modelInput": {"anthropic_version": "bedrock-2023-05-31", "max_tokens": 500, "messages": [{"role": "user", "content": "Summarize this document: ..."}]}}
| Field | Required | Details |
|---|---|---|
| recordId | Yes | Unique ID to match input records to output records |
| modelInput | Yes | Same format as a real-time InvokeModel request body |
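Generating the input file is just serializing one JSON object per line. A sketch using the Anthropic Messages body shown above (the zero-padded recordId scheme is an arbitrary choice):

```python
import json

def build_input_jsonl(prompts, max_tokens=500):
    """One JSONL line per request: a unique recordId plus the same body
    you would send to a real-time InvokeModel call."""
    lines = []
    for i, prompt in enumerate(prompts, start=1):
        record = {
            "recordId": f"{i:03d}",
            "modelInput": {
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": max_tokens,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        lines.append(json.dumps(record))
    return "\n".join(lines)

jsonl = build_input_jsonl(["Summarize document A", "Summarize document B"])
```

Upload the resulting file to the S3 prefix referenced in the job's input data config.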
Input Limits
| Limit | Value |
|---|---|
| Max file size | 200 MB per JSONL file |
| Max records per job | 50,000 |
| Max concurrent jobs | 5 per account per model |
| Max input tokens per record | Model-specific context window |
Output Format
Results are written to S3 as JSONL:
{"recordId": "001", "modelOutput": {"content": [{"text": "The document discusses...", "type": "text"}], "usage": {"input_tokens": 1500, "output_tokens": 200}}}
{"recordId": "002", "modelOutput": {"content": [{"text": "This report covers...", "type": "text"}], "usage": {"input_tokens": 800, "output_tokens": 150}}}
Failed records include error details instead of model output, allowing you to retry only the failures.
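A small parser can split the output file into successes and failures so only the failures are resubmitted. The exact shape of the error field below is an assumption; check your actual output records:

```python
import json

def split_results(output_jsonl):
    """Separate successful records from failed ones by recordId."""
    successes, failures = {}, {}
    for line in output_jsonl.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        if "modelOutput" in record:
            successes[record["recordId"]] = record["modelOutput"]
        else:
            # Assumed failure shape: an "error" field instead of modelOutput
            failures[record["recordId"]] = record.get("error")
    return successes, failures

sample = (
    '{"recordId": "001", "modelOutput": {"content": [{"text": "OK", "type": "text"}]}}\n'
    '{"recordId": "002", "error": {"errorMessage": "validation error"}}'
)
successes, failures = split_results(sample)
```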
Real-World Use Cases and Cost Savings
Document Summarization Pipeline
| Component | On-Demand | Batch | Savings |
|---|---|---|---|
| 50K documents, 2K tokens avg input | $300 input | $150 input | $150 |
| 500 tokens avg output | $375 output | $188 output | $187 |
| Total | $675/run | $338/run | $337 (50%) |
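The table above can be reproduced from the Claude Sonnet rates and token counts directly, which makes it easy to re-run the estimate for your own volumes:

```python
SONNET_INPUT_PER_1K = 0.003    # on-demand input rate, $/1K tokens
SONNET_OUTPUT_PER_1K = 0.015   # on-demand output rate, $/1K tokens

docs = 50_000
input_tokens = docs * 2_000    # 100M input tokens
output_tokens = docs * 500     # 25M output tokens

on_demand = (input_tokens / 1000) * SONNET_INPUT_PER_1K \
          + (output_tokens / 1000) * SONNET_OUTPUT_PER_1K
batch = on_demand / 2          # flat 50% discount
print(round(on_demand, 2), round(batch, 2))  # 675.0 337.5
```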
Content Classification
| Component | On-Demand | Batch | Savings |
|---|---|---|---|
| 500K items, 200 tokens input | $30 input | $15 input | $15 |
| 50 tokens output | $19 output | $9.50 output | $9.50 |
| Total | $49/run | $24.50/run | $24.50 (50%) |
Weekly Report Generation
| Component | On-Demand | Batch | Savings |
|---|---|---|---|
| 10K reports, 5K tokens input | $150 input | $75 input | $75 |
| 2K tokens output | $300 output | $150 output | $150 |
| Total | $450/week | $225/week | $225/week (~$900/month) |
When to Use Batch vs Real-Time
| Factor | Batch Inference | Real-Time Inference |
|---|---|---|
| Latency requirement | Up to 24 hours acceptable | Sub-second to seconds |
| Cost | 50% cheaper | Full price |
| Volume | 100s to 50,000 requests | Any volume |
| Use cases | Document processing, ETL, reports | Chatbots, APIs, interactive |
| Error handling | Per-record failures in output | Immediate retry |
| Streaming | Not supported | Supported |
Hybrid Approach
Many production systems use both:
- Real-time: User-facing features (chat, search, recommendations)
- Batch: Background processing (nightly reports, content generation, data enrichment)
Best Practices
1. Right-Size Max Tokens
Set max_tokens to the minimum each task needs. A generous cap lets the model generate longer responses than necessary, and every extra output token is billed on every record; even at the discounted rate, the waste compounds across tens of thousands of requests.
2. Use the Cheapest Adequate Model
The 50% discount applies to all models. Claude Haiku at batch pricing ($0.0004/1K input) is extremely cost-effective for classification and extraction tasks.
3. Split Large Jobs
Instead of one 50,000-record job, submit five 10,000-record jobs. This:
- Provides partial results faster
- Reduces blast radius if a job fails
- Allows different models for different batches
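Splitting is a one-liner over the record list; the 10,000-record chunk size below is the example figure from above, not a service requirement:

```python
def chunk_records(records, chunk_size=10_000):
    """Split one large record list into several smaller batch jobs."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]

jobs = chunk_records(list(range(50_000)))
print(len(jobs), len(jobs[0]))  # 5 10000
```

Each chunk then gets its own JSONL file, S3 upload, and job submission.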
4. Implement Record-Level Error Handling
Check every output record for errors. Collect failed record IDs and retry them in a follow-up batch job rather than reprocessing the entire batch.
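Because input records carry a recordId, building the retry file is a filter over the original input JSONL, assuming you kept it (or can regenerate it):

```python
import json

def build_retry_jsonl(original_input_jsonl, failed_ids):
    """Re-emit only the failed records from the original input file."""
    retry = []
    for line in original_input_jsonl.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        if record["recordId"] in failed_ids:
            retry.append(json.dumps(record))
    return "\n".join(retry)

original = (
    '{"recordId": "001", "modelInput": {"messages": []}}\n'
    '{"recordId": "002", "modelInput": {"messages": []}}'
)
retry = build_retry_jsonl(original, {"002"})
```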
5. Compress Input Data
Remove unnecessary whitespace, metadata, and formatting from input documents before batching. Every token counts at scale — a 10% reduction in input tokens across 50K documents saves significantly.
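A minimal sketch of the whitespace pass; collapsing runs of spaces and newlines shortens the text, which generally reduces token count:

```python
import re

def compact(text):
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

doc = "  Quarterly   report:\n\n  revenue   grew  "
print(compact(doc))  # "Quarterly report: revenue grew"
```

More aggressive steps (stripping boilerplate headers, HTML tags, or repeated metadata) depend on your document format but follow the same pattern.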
Related Guides
- AWS Bedrock Cost Optimization Guide
- AWS Bedrock Pricing Guide
- AWS Bedrock Embeddings Guide
- LLM Inference Cost Optimization
FAQ
How long do batch jobs take?
Most batch jobs complete within 1-6 hours; AWS targets completion within 24 hours. Smaller jobs (under 1,000 records) often finish within an hour. Processing time scales with total token volume, not record count.
Can I cancel a batch job?
Yes. You can stop a running batch job. Records already processed are available in the output file. Unprocessed records are not billed.
Is batch inference available for all Bedrock models?
Batch is available for most major models including Claude (all versions), Llama, Mistral, and Titan. Check the Bedrock API reference for the current supported model list, as availability expands regularly.
Lower Your Bedrock Batch Inference Costs with Wring
Wring helps you access AWS credits and volume discounts to lower your Bedrock batch inference costs. Through group buying power, Wring negotiates better rates so you pay less per batch inference job.
