
AWS Bedrock Batch Inference: 50% Lower Cost

AWS Bedrock Batch Inference processes thousands of documents at 50% off on-demand token prices. Learn setup, formatting, and when batch beats real-time.

Wring Team
March 14, 2026
6 min read
AWS Bedrock, batch inference, document processing, async AI, batch API, cost optimization
Batch processing and document automation system

Bedrock Batch Inference lets you submit large volumes of inference requests as a single job and receive results asynchronously — at 50% off standard on-demand pricing. Instead of paying $0.003/1K input tokens for Claude Sonnet, you pay $0.0015/1K. For document processing pipelines, content generation, and data extraction at scale, this is the largest single cost reduction Bedrock offers.

TL;DR: Batch inference processes thousands of requests asynchronously at 50% off on-demand token prices. Input is a JSONL file in S3, output is written to S3. Turnaround time is up to 24 hours (often faster). Use batch for any workload that doesn't need real-time responses: document summarization, content classification, data extraction, email generation, translation pipelines.


How Batch Inference Works

1. Prepare: Create JSONL file with all requests → Upload to S3
2. Submit: Create batch inference job via API/console
3. Process: Bedrock processes all requests (up to 24 hours)
4. Retrieve: Results written to S3 as JSONL
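The submit step above can be sketched with boto3's `create_model_invocation_job` call. The bucket paths, role ARN, job name, and model ID below are placeholder assumptions, not values from this guide; the actual API call is left commented out since it requires AWS credentials and an IAM role with S3 access:

```python
def build_batch_job_request(job_name, model_id, input_s3_uri, output_s3_uri, role_arn):
    """Build the kwargs for bedrock.create_model_invocation_job."""
    return {
        "jobName": job_name,
        "modelId": model_id,
        "roleArn": role_arn,
        "inputDataConfig": {"s3InputDataConfig": {"s3Uri": input_s3_uri}},
        "outputDataConfig": {"s3OutputDataConfig": {"s3Uri": output_s3_uri}},
    }

request = build_batch_job_request(
    job_name="nightly-summaries",                          # placeholder
    model_id="anthropic.claude-3-5-sonnet-20240620-v1:0",  # placeholder
    input_s3_uri="s3://my-bucket/batch/input/requests.jsonl",
    output_s3_uri="s3://my-bucket/batch/output/",
    role_arn="arn:aws:iam::123456789012:role/BedrockBatchRole",
)

# import boto3
# bedrock = boto3.client("bedrock")
# job = bedrock.create_model_invocation_job(**request)
# print(job["jobArn"])  # poll get_model_invocation_job with this ARN
```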

Pricing Comparison

| Model | On-Demand Input/1K | Batch Input/1K | Savings |
|---|---|---|---|
| Claude Opus | $0.015 | $0.0075 | 50% |
| Claude Sonnet | $0.003 | $0.0015 | 50% |
| Claude Haiku | $0.0008 | $0.0004 | 50% |
| Llama 3.1 70B | $0.00099 | $0.000495 | 50% |
| Mistral Large | $0.004 | $0.002 | 50% |

Output tokens receive the same 50% discount.
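The discount is easy to model. This sketch uses the Claude Sonnet input price from the table above and assumes $0.015/1K on-demand output (the figure implied by the document-summarization example later in this guide), halved for batch:

```python
def job_cost(num_requests, input_tokens_each, output_tokens_each,
             input_price_per_1k, output_price_per_1k):
    """Total cost of a job: (tokens / 1000) * price per 1K tokens."""
    input_total = num_requests * input_tokens_each / 1000 * input_price_per_1k
    output_total = num_requests * output_tokens_each / 1000 * output_price_per_1k
    return input_total + output_total

# 50K documents, 2K input tokens and 500 output tokens each, Claude Sonnet
on_demand = job_cost(50_000, 2_000, 500, 0.003, 0.015)    # $675.00
batch     = job_cost(50_000, 2_000, 500, 0.0015, 0.0075)  # $337.50
print(f"on-demand ${on_demand:.2f}, batch ${batch:.2f}, saved ${on_demand - batch:.2f}")
```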


Input Format

Each line of the JSONL file is an independent inference request:

{"recordId": "001", "modelInput": {"anthropic_version": "bedrock-2023-05-31", "max_tokens": 500, "messages": [{"role": "user", "content": "Summarize this document: ..."}]}}
{"recordId": "002", "modelInput": {"anthropic_version": "bedrock-2023-05-31", "max_tokens": 500, "messages": [{"role": "user", "content": "Summarize this document: ..."}]}}
| Field | Required | Details |
|---|---|---|
| recordId | Yes | Unique ID to match input to output |
| modelInput | Yes | Same format as real-time InvokeModel request body |

Input Limits

| Limit | Value |
|---|---|
| Max file size | 200 MB per JSONL file |
| Max records per job | 50,000 |
| Max concurrent jobs | 5 per account per model |
| Max input tokens per record | Model-specific context window |

Output Format

Results are written to S3 as JSONL:

{"recordId": "001", "modelOutput": {"content": [{"text": "The document discusses...", "type": "text"}], "usage": {"input_tokens": 1500, "output_tokens": 200}}}
{"recordId": "002", "modelOutput": {"content": [{"text": "This report covers...", "type": "text"}], "usage": {"input_tokens": 800, "output_tokens": 150}}}

Failed records include error details instead of model output, allowing you to retry only the failures.
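A retry loop only needs the failed record IDs. This sketch keys on the absence of `modelOutput` to detect failures — the exact error fields in a failed record may vary, so treat this as an assumption to verify against your actual output files:

```python
import json

def split_results(jsonl_text):
    """Separate successful records from failed record IDs for a retry batch."""
    successes, failed_ids = [], []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        if "modelOutput" in record:
            successes.append(record)
        else:
            # Failed records carry error details instead of modelOutput.
            failed_ids.append(record["recordId"])
    return successes, failed_ids
```

Feed `failed_ids` back through your JSONL builder to produce a small follow-up job instead of re-running the whole batch.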


Real-World Use Cases and Cost Savings

Document Summarization Pipeline

| Component | On-Demand | Batch | Savings |
|---|---|---|---|
| 50K documents, 2K tokens avg input | $300 input | $150 input | $150 |
| 500 tokens avg output | $375 output | $188 output | $187 |
| Total | $675/run | $338/run | $337 (50%) |

Content Classification

| Component | On-Demand | Batch | Savings |
|---|---|---|---|
| 500K items, 200 tokens input | $30 input | $15 input | $15 |
| 50 tokens output | $19 output | $9.50 output | $9.50 |
| Total | $49/run | $24.50/run | $24.50 (50%) |

Weekly Report Generation

| Component | On-Demand | Batch | Savings |
|---|---|---|---|
| 10K reports, 5K tokens input | $150 input | $75 input | $75 |
| 2K tokens output | $300 output | $150 output | $150 |
| Total | $450/week | $225/week | $225/week (~$900/month) |

When to Use Batch vs Real-Time

| Factor | Batch Inference | Real-Time Inference |
|---|---|---|
| Latency requirement | Up to 24 hours acceptable | Sub-second to seconds |
| Cost | 50% cheaper | Full price |
| Volume | 100s to 50,000 requests | Any volume |
| Use cases | Document processing, ETL, reports | Chatbots, APIs, interactive |
| Error handling | Per-record failures in output | Immediate retry |
| Streaming | Not supported | Supported |

Hybrid Approach

Many production systems use both:

  • Real-time: User-facing features (chat, search, recommendations)
  • Batch: Background processing (nightly reports, content generation, data enrichment)

Best Practices

1. Right-Size Max Tokens

Set max_tokens to the minimum needed for each request. Batch jobs don't cache tokens between records, so every unnecessary output token still costs money, even at the 50% discount.

2. Use the Cheapest Adequate Model

The 50% discount applies to all models. Claude Haiku at batch pricing ($0.0004/1K input) is extremely cost-effective for classification and extraction tasks.

3. Split Large Jobs

Instead of one 50,000-record job, submit five 10,000-record jobs. This:

  • Provides partial results faster
  • Reduces blast radius if a job fails
  • Allows different models for different batches
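Splitting is a one-liner over the record list; the 10,000-record chunk size here is just the example from above, not a recommended constant:

```python
def split_into_jobs(records, job_size=10_000):
    """Split one large record list into several smaller batch-job payloads."""
    return [records[i:i + job_size] for i in range(0, len(records), job_size)]

jobs = split_into_jobs(list(range(50_000)))
# → five jobs of 10,000 records each; submit each as its own batch job
```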

4. Implement Record-Level Error Handling

Check every output record for errors. Collect failed record IDs and retry them in a follow-up batch job rather than reprocessing the entire batch.

5. Compress Input Data

Remove unnecessary whitespace, metadata, and formatting from input documents before batching. Every token counts at scale — a 10% reduction in input tokens across 50K documents saves significantly.
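A crude but effective first pass is collapsing whitespace runs. Actual token savings depend on the model's tokenizer, so measure before and after rather than assuming a fixed percentage:

```python
import re

def compress_text(text):
    """Collapse runs of whitespace into single spaces and trim the ends.

    Newlines, tabs, and indentation from scraped or converted documents
    often tokenize as extra tokens that add nothing to the prompt.
    """
    return re.sub(r"\s+", " ", text).strip()

print(compress_text("Line one.\n\n   Line   two.\t"))  # → Line one. Line two.
```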



FAQ

How long do batch jobs take?

Most batch jobs complete within 1-6 hours. The SLA is 24 hours. Smaller jobs (under 1,000 records) often complete within an hour. Processing time scales with total token volume, not record count.

Can I cancel a batch job?

Yes. You can stop a running batch job. Records already processed are available in the output file. Unprocessed records are not billed.

Is batch inference available for all Bedrock models?

Batch is available for most major models including Claude (all versions), Llama, Mistral, and Titan. Check the Bedrock API reference for the current supported model list, as availability expands regularly.


Lower Your Bedrock Batch Inference Costs with Wring

Wring helps you access AWS credits and volume discounts to lower your Bedrock batch inference costs. Through group buying power, Wring negotiates better rates so you pay less per batch inference job.

Start saving on Bedrock batch inference →