A Retrieval-Augmented Generation (RAG) pipeline on AWS has four cost components: embedding generation, vector storage, retrieval queries, and LLM generation. Most teams focus on the LLM cost, but the vector store often dominates — OpenSearch Serverless has a minimum of 4 OCUs costing $701/month even at zero query volume. Understanding where your RAG budget actually goes is essential before you can optimize it. This guide breaks down every cost component and provides a full model for estimating monthly spend at scale.
TL;DR: A production RAG system serving 1M queries/month costs $2,500-8,000 depending on your vector store and LLM choices. The vector store is typically 30-50% of total cost. Use Aurora pgvector instead of OpenSearch Serverless for smaller workloads (under 500K queries/month) to save $500+/month on minimum fees. Cache repeated queries to cut LLM costs by 20-40%.
RAG Pipeline Cost Breakdown
Every RAG query incurs costs at four stages:
| Stage | What Happens | Service | Cost Driver |
|---|---|---|---|
| 1. Embedding | Convert query to vector | Bedrock (Titan Embed) | Tokens processed |
| 2. Storage | Store and index document vectors | OpenSearch or Aurora | Storage volume and compute |
| 3. Retrieval | Find similar vectors | OpenSearch or Aurora | Query volume and latency |
| 4. Generation | LLM generates answer with context | Bedrock (Claude, Llama) | Input + output tokens |
Embedding Costs
Embedding Model Pricing
| Model | Cost per 1K Tokens | Dimensions | Max Tokens |
|---|---|---|---|
| Titan Embeddings V2 | $0.00002 | 256-1,024 | 8,192 |
| Cohere Embed English | $0.0001 | 1,024 | 512 |
| Cohere Embed Multilingual | $0.0001 | 1,024 | 512 |
Titan Embeddings V2 is 5x cheaper than Cohere for English-only workloads. At $0.00002 per 1K tokens, embedding costs are typically the smallest component of a RAG pipeline.
Embedding Cost at Scale
| Document Volume | Avg Tokens per Doc | Titan V2 Cost | Cohere Cost |
|---|---|---|---|
| 10,000 documents | 1,000 | $0.20 | $1.00 |
| 100,000 documents | 1,000 | $2.00 | $10.00 |
| 1,000,000 documents | 1,000 | $20.00 | $100.00 |
| 10,000,000 documents | 1,000 | $200.00 | $1,000.00 |
Embedding is a one-time cost per document (unless you re-embed). Query embedding costs are minimal — 1M queries at 50 tokens each costs $1.00 with Titan V2.
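The table above reduces to a one-line formula. Here is a small sketch that reproduces it, using the per-1K-token prices quoted in this guide (the helper name and price dictionary are illustrative, not an AWS API):

```python
# One-time corpus embedding cost, using the per-1K-token rates from the
# pricing table above. Verify current Bedrock pricing before relying on these.
PRICE_PER_1K = {"titan_v2": 0.00002, "cohere": 0.0001}

def embedding_cost(num_docs: int, avg_tokens: int, model: str) -> float:
    """Total one-time cost in USD to embed a corpus."""
    total_tokens = num_docs * avg_tokens
    return total_tokens / 1000 * PRICE_PER_1K[model]

print(round(embedding_cost(100_000, 1_000, "titan_v2"), 2))  # 2.0
print(round(embedding_cost(100_000, 1_000, "cohere"), 2))    # 10.0
```

The same function also covers query-side embedding: 1M queries at 50 tokens each is `embedding_cost(1_000_000, 50, "titan_v2")`, which comes to $1.00.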
Vector Store Costs
The vector store is where RAG cost optimization matters most. Your choice here determines your minimum monthly spend.
OpenSearch Serverless
OpenSearch Serverless bills by OpenSearch Compute Units (OCUs).
| Component | Cost | Minimum |
|---|---|---|
| Indexing OCU | $0.24/OCU-hour | 2 OCUs minimum |
| Search OCU | $0.24/OCU-hour | 2 OCUs minimum |
| Storage | $0.024/GB-month | Based on data volume |
Minimum monthly cost: 4 OCUs x $0.24/hr x 730 hours = $701/month
This minimum applies even with zero queries. OpenSearch Serverless scales automatically but never below 4 OCUs (2 indexing + 2 search).
Aurora PostgreSQL with pgvector
| Component | Cost | Notes |
|---|---|---|
| Aurora Serverless v2 | $0.12/ACU-hour | Scales 0.5-128 ACUs |
| Storage | $0.10/GB-month | Auto-scales |
| I/O | $0.20 per 1M requests | Read and write operations |
Minimum monthly cost: 0.5 ACU x $0.12/hr x 730 hours = $43.80/month
Aurora with pgvector is dramatically cheaper at low scale. The tradeoff is lower query throughput for very large vector collections (10M+ vectors).
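The two minimum-spend figures above come from the same calculation, differing only in the unit count and hourly rate. A quick sketch (730 is the average number of hours in a month, i.e. 8,760 / 12):

```python
# Minimum monthly floor for a capacity-unit-billed service:
# units held at minimum x hourly rate x average hours per month.
HOURS_PER_MONTH = 730

def min_monthly(units: float, rate_per_unit_hour: float) -> float:
    return units * rate_per_unit_hour * HOURS_PER_MONTH

opensearch_floor = min_monthly(4, 0.24)   # 4 OCUs at $0.24/OCU-hour
aurora_floor = min_monthly(0.5, 0.12)     # 0.5 ACU at $0.12/ACU-hour
print(round(opensearch_floor, 2))  # 700.8
print(round(aurora_floor, 2))      # 43.8
```

The gap between the two floors, roughly $657/month, is the idle-cost penalty you pay for OpenSearch Serverless before serving a single query.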
RDS PostgreSQL with pgvector
| Component | Cost | Notes |
|---|---|---|
| db.r6g.large | $0.26/hr ($189.80/month) | 2 vCPUs, 16 GB RAM |
| db.r6g.xlarge | $0.52/hr ($379.60/month) | 4 vCPUs, 32 GB RAM |
| Storage (gp3) | $0.08/GB-month | Provisioned |
RDS is a good middle ground — more predictable pricing than Serverless, lower minimum than OpenSearch.
Vector Store Comparison
| Vector Store | Min Monthly Cost | Best For | Max Vectors (practical) |
|---|---|---|---|
| OpenSearch Serverless | $701 | High-throughput, large collections | 100M+ |
| Aurora Serverless v2 | $44 | Variable workloads, cost-sensitive | 10M |
| RDS PostgreSQL | $190 | Predictable workloads | 5M |
| Pinecone (external) | $70 (Starter) | Managed, serverless | Varies by plan |
LLM Generation Costs
The LLM generation step is the most expensive per-query component. Cost depends on the retrieved context size and output length.
Typical RAG Query Token Breakdown
| Component | Tokens | Notes |
|---|---|---|
| System prompt | 200-500 | Instructions, formatting rules |
| Retrieved context | 1,000-4,000 | Top-k retrieved chunks |
| User query | 20-100 | The actual question |
| Generated answer | 200-800 | LLM response |
| Total input | 1,220-4,600 | System + context + query |
| Total output | 200-800 | Generated answer |
Generation Cost per Query
| Model | Input Cost (3K tokens) | Output Cost (500 tokens) | Total per Query |
|---|---|---|---|
| Claude Haiku | $0.0024 | $0.002 | $0.0044 |
| Claude Sonnet | $0.009 | $0.0075 | $0.0165 |
| Llama 3.1 8B | $0.00066 | $0.00011 | $0.00077 |
| Llama 3.1 70B | $0.00297 | $0.000495 | $0.003465 |
| Titan Text Express | $0.0006 | $0.0003 | $0.0009 |
Full Cost Model: 1M Queries per Month
Using OpenSearch Serverless and Claude Haiku
| Component | Calculation | Monthly Cost |
|---|---|---|
| Query embeddings | 1M queries x 50 tokens x $0.00002/1K | $1.00 |
| OpenSearch Serverless | 4 OCUs minimum + scaling | $701-$1,200 |
| OpenSearch storage | 50 GB vectors | $1.20 |
| LLM generation (Haiku) | 1M queries x $0.0044/query | $4,400 |
| S3 (source documents) | 100 GB | $2.30 |
| Total | | $5,105-$5,604 |
Using Aurora pgvector and Llama 3.1 8B
| Component | Calculation | Monthly Cost |
|---|---|---|
| Query embeddings | 1M queries x 50 tokens x $0.00002/1K | $1.00 |
| Aurora Serverless v2 | ~2 ACUs average | $175 |
| Aurora storage | 50 GB | $5.00 |
| LLM generation (Llama 8B) | 1M queries x $0.00077/query | $770 |
| S3 (source documents) | 100 GB | $2.30 |
| Total | | $953 |
The difference is striking — $953 vs $5,105 for the same query volume, driven entirely by vector store and LLM choices.
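Both scenario tables follow the same template: query embeddings plus vector store plus storage plus per-query LLM cost plus S3. A sketch that reproduces the two totals (the function and parameter names are illustrative):

```python
# End-to-end monthly RAG cost estimate, mirroring the two scenario tables.
def monthly_rag_cost(queries: int, per_query_llm: float,
                     vector_store: float, storage: float, s3: float = 2.30,
                     embed_tokens_per_query: int = 50,
                     embed_rate_1k: float = 0.00002) -> float:
    embeddings = queries * embed_tokens_per_query / 1000 * embed_rate_1k
    return embeddings + vector_store + storage + queries * per_query_llm + s3

# OpenSearch Serverless + Claude Haiku (floor of the OCU range)
print(round(monthly_rag_cost(1_000_000, 0.0044, 701, 1.20), 2))   # 5105.5
# Aurora pgvector + Llama 3.1 8B
print(round(monthly_rag_cost(1_000_000, 0.00077, 175, 5.00), 2))  # 953.3
```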
Chunking Strategy Impact on Costs
How you chunk documents directly affects both storage costs and LLM generation costs.
| Chunk Size | Chunks per 10K-word Doc | Storage per 1M Docs | LLM Context per Query |
|---|---|---|---|
| 128 tokens | ~100 | ~12.8 GB vectors | ~640 tokens (top-5) |
| 256 tokens | ~50 | ~6.4 GB vectors | ~1,280 tokens (top-5) |
| 512 tokens | ~25 | ~3.2 GB vectors | ~2,560 tokens (top-5) |
| 1,024 tokens | ~13 | ~1.7 GB vectors | ~5,120 tokens (top-5) |
Smaller chunks produce more vectors (higher storage cost) but reduce LLM input tokens per query (lower generation cost). Larger chunks reduce vector count but increase generation cost. A 256-512 token chunk size is typically the best balance.
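The tradeoff in the table is mechanical: vector count scales inversely with chunk size, while LLM context scales linearly with it at a fixed top-k. A sketch, assuming a 10K-word document is roughly 13K tokens (the word-to-token ratio is an assumption):

```python
# Chunking tradeoff: more chunks (storage) vs. larger retrieved context (LLM cost).
def chunk_tradeoff(doc_tokens: int, chunk_size: int, top_k: int = 5):
    chunks_per_doc = -(-doc_tokens // chunk_size)  # ceiling division
    context_tokens = chunk_size * top_k            # tokens sent to the LLM
    return chunks_per_doc, context_tokens

print(chunk_tradeoff(13_000, 128))  # (102, 640)
print(chunk_tradeoff(13_000, 512))  # (26, 2560)
```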
Cost Optimization Tips
- Use Aurora pgvector instead of OpenSearch Serverless for workloads under 500K queries/month: save $500+/month by eliminating the OpenSearch 4-OCU minimum. Aurora Serverless v2 scales down to $44/month during low-traffic periods.
- Cache query results for repeated questions: implement a semantic cache using a small embedding model and a similarity threshold. If a new query is 95%+ similar to a cached query, return the cached response. This typically eliminates 20-40% of LLM calls.
- Use the cheapest embedding model that meets quality needs: Titan Embeddings V2 at $0.00002/1K tokens is 5x cheaper than Cohere and produces competitive results for English-language retrieval.
- Optimize chunk size for cost, not just quality: test 256- and 512-token chunks. Larger chunks reduce vector storage volume but increase LLM input cost. Find the chunk size where retrieval quality plateaus.
- Reduce top-k retrieval count: retrieving 10 chunks instead of 5 doubles your LLM input context cost. Test whether reducing from top-10 to top-3 maintains answer quality.
- Route simple queries to cheaper models: use a small classifier to detect simple factual queries (route to Llama 3.1 8B at $0.00077/query) versus complex reasoning queries (route to Claude Sonnet at $0.0165/query).
- Use Bedrock prompt caching: if your system prompt is large and consistent, prompt caching reduces input token costs by up to 90% on cached content.
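The semantic-cache tip can be sketched in a few lines. This is a minimal in-memory version, assuming query embeddings arrive as plain float lists; the class and method names are illustrative, not an AWS or library API, and a production version would use a persistent store and an approximate-nearest-neighbor index:

```python
import math

class SemanticCache:
    """Return a cached answer when a new query embedding is similar enough
    to a previously seen one (cosine similarity >= threshold)."""

    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    def get(self, embedding):
        """Best-matching cached answer, or None if nothing clears the bar."""
        best = max(self.entries,
                   key=lambda e: self._cosine(embedding, e[0]), default=None)
        if best and self._cosine(embedding, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, embedding, answer):
        self.entries.append((embedding, answer))
```

On a hit, the pipeline skips both retrieval and generation, so each cache hit saves the full per-query LLM cost (for example $0.0044 with Claude Haiku) at the price of one cheap query embedding.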
Related Guides
- AWS Bedrock Knowledge Bases Guide
- AWS Bedrock Embeddings Guide
- AWS OpenSearch Pricing Guide
- AWS Bedrock Pricing Guide
FAQ
What is the minimum cost to run a RAG system on AWS?
The absolute minimum is roughly $44/month using Aurora Serverless v2 with pgvector for vector storage and Bedrock pay-per-token for generation. At very low query volumes (under 1,000/month), your total cost would be $45-50. OpenSearch Serverless has a $701/month minimum regardless of usage.
Should I use Bedrock Knowledge Bases or build my own RAG pipeline?
Bedrock Knowledge Bases simplifies the entire pipeline — automatic chunking, embedding, and retrieval — but uses OpenSearch Serverless as the default vector store ($701/month minimum). Building your own pipeline with Aurora pgvector costs less but requires more engineering. For teams without ML infrastructure experience, Knowledge Bases is worth the premium.
How do I estimate the right number of OpenSearch OCUs?
Start with the 4-OCU minimum and monitor the CloudWatch metrics for OCU utilization. OpenSearch Serverless auto-scales, but you can set max OCU limits. For 1M queries/month with sub-500ms latency, 4-6 search OCUs are typically sufficient. Indexing OCUs depend on how frequently you add new documents.
Lower Your RAG Pipeline Costs with Wring
Wring helps you access AWS credits and volume discounts to lower your RAG pipeline costs. Through group buying power, Wring negotiates better rates so you pay less per query across Bedrock, OpenSearch, and Aurora.
