A Retrieval-Augmented Generation (RAG) pipeline on AWS has four cost components: embedding generation, vector storage, retrieval queries, and LLM generation. Most teams focus on the LLM cost, but the vector store often dominates — OpenSearch Serverless has a minimum of 4 OCUs costing $701/month even at zero query volume. Understanding where your RAG budget actually goes is essential before you can optimize it. This guide breaks down every cost component and provides a full model for estimating monthly spend at scale.
TL;DR: A production RAG system serving 1M queries/month costs $2,500-8,000 depending on your vector store and LLM choices. The vector store is typically 30-50% of total cost. Use Aurora pgvector instead of OpenSearch Serverless for smaller workloads (under 500K queries/month) to save $500+/month on minimum fees. Cache repeated queries to cut LLM costs by 20-40%.
RAG Pipeline Cost Breakdown
Every RAG query incurs costs at four stages:
| Stage | What Happens | Service | Cost Driver |
|---|---|---|---|
| 1. Embedding | Convert query to vector | Bedrock (Titan Embed) | Tokens processed |
| 2. Storage | Store and index document vectors | OpenSearch or Aurora | Storage volume and compute |
| 3. Retrieval | Find similar vectors | OpenSearch or Aurora | Query volume and latency |
| 4. Generation | LLM generates answer with context | Bedrock (Claude, Llama) | Input + output tokens |
Embedding Costs
Embedding Model Pricing
| Model | Cost per 1K Tokens | Dimensions | Max Tokens |
|---|---|---|---|
| Titan Embeddings V2 | $0.00002 | 256-1,024 | 8,192 |
| Cohere Embed English | $0.0001 | 1,024 | 512 |
| Cohere Embed Multilingual | $0.0001 | 1,024 | 512 |
Titan Embeddings V2 is 5x cheaper than Cohere for English-only workloads. At $0.00002 per 1K tokens, embedding costs are typically the smallest component of a RAG pipeline.
Embedding Cost at Scale
| Document Volume | Avg Tokens per Doc | Titan V2 Cost | Cohere Cost |
|---|---|---|---|
| 10,000 documents | 1,000 | $0.20 | $1.00 |
| 100,000 documents | 1,000 | $2.00 | $10.00 |
| 1,000,000 documents | 1,000 | $20.00 | $100.00 |
| 10,000,000 documents | 1,000 | $200.00 | $1,000.00 |
Embedding is a one-time cost per document (unless you re-embed). Query embedding costs are minimal — 1M queries at 50 tokens each costs $1.00 with Titan V2.
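The table above reduces to a one-line formula. Here is a small sketch that reproduces it, using the per-1K-token prices quoted in this guide (the helper name and price dictionary are illustrative, not an AWS API):

```python
# One-time corpus embedding cost, using the per-1K-token rates from the
# pricing table above. Verify current Bedrock pricing before relying on these.
PRICE_PER_1K = {"titan_v2": 0.00002, "cohere": 0.0001}

def embedding_cost(num_docs: int, avg_tokens: int, model: str) -> float:
    """Total one-time cost in USD to embed a corpus."""
    total_tokens = num_docs * avg_tokens
    return total_tokens / 1000 * PRICE_PER_1K[model]

print(round(embedding_cost(100_000, 1_000, "titan_v2"), 2))  # 2.0
print(round(embedding_cost(100_000, 1_000, "cohere"), 2))    # 10.0
```

The same function also covers query-side embedding: 1M queries at 50 tokens each is `embedding_cost(1_000_000, 50, "titan_v2")`, which comes to $1.00.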
Vector Store Costs
The vector store is where RAG cost optimization matters most. Your choice here determines your minimum monthly spend.
OpenSearch Serverless
OpenSearch Serverless bills by OpenSearch Compute Units (OCUs).
| Component | Cost | Minimum |
|---|---|---|
| Indexing OCU | $0.24/OCU-hour | 2 OCUs minimum |
| Search OCU | $0.24/OCU-hour | 2 OCUs minimum |
| Storage | $0.024/GB-month | Based on data volume |
Minimum monthly cost: 4 OCUs x $0.24/hr x 730 hours = $701/month
This minimum applies even with zero queries. OpenSearch Serverless scales automatically but never below 4 OCUs (2 indexing + 2 search).
Aurora PostgreSQL with pgvector
| Component | Cost | Notes |
|---|---|---|
| Aurora Serverless v2 | $0.12/ACU-hour | Scales 0.5-128 ACUs |
| Storage | $0.10/GB-month | Auto-scales |
| I/O | $0.20 per 1M requests | Read and write operations |
Minimum monthly cost: 0.5 ACU x $0.12/hr x 730 hours = $43.80/month
Aurora with pgvector is dramatically cheaper at low scale. The tradeoff is lower query throughput for very large vector collections (10M+ vectors).
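The two minimum-spend figures above come from the same calculation, differing only in the unit count and hourly rate. A quick sketch (730 is the average number of hours in a month, i.e. 8,760 / 12):

```python
# Minimum monthly floor for a capacity-unit-billed service:
# units held at minimum x hourly rate x average hours per month.
HOURS_PER_MONTH = 730

def min_monthly(units: float, rate_per_unit_hour: float) -> float:
    return units * rate_per_unit_hour * HOURS_PER_MONTH

opensearch_floor = min_monthly(4, 0.24)   # 4 OCUs at $0.24/OCU-hour
aurora_floor = min_monthly(0.5, 0.12)     # 0.5 ACU at $0.12/ACU-hour
print(round(opensearch_floor, 2))  # 700.8
print(round(aurora_floor, 2))      # 43.8
```

The gap between the two floors, roughly $657/month, is the idle-cost penalty you pay for OpenSearch Serverless before serving a single query.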
RDS PostgreSQL with pgvector
| Component | Cost | Notes |
|---|---|---|
| db.r6g.large | $0.26/hr ($189.80/month) | 2 vCPUs, 16 GB RAM |
| db.r6g.xlarge | $0.52/hr ($379.60/month) | 4 vCPUs, 32 GB RAM |
| Storage (gp3) | $0.08/GB-month | Provisioned |
RDS is a good middle ground — more predictable pricing than Serverless, lower minimum than OpenSearch.
Vector Store Comparison
| Vector Store | Min Monthly Cost | Best For | Max Vectors (practical) |
|---|---|---|---|
| OpenSearch Serverless | $701 | High-throughput, large collections | 100M+ |
| Aurora Serverless v2 | $44 | Variable workloads, cost-sensitive | 10M |
| RDS PostgreSQL | $190 | Predictable workloads | 5M |
| Pinecone (external) | $70 (Starter) | Managed, serverless | Varies by plan |
LLM Generation Costs
The LLM generation step is the most expensive per-query component. Cost depends on the retrieved context size and output length.
Typical RAG Query Token Breakdown
| Component | Tokens | Notes |
|---|---|---|
| System prompt | 200-500 | Instructions, formatting rules |
| Retrieved context | 1,000-4,000 | Top-k retrieved chunks |
| User query | 20-100 | The actual question |
| Generated answer | 200-800 | LLM response |
| Total input | 1,220-4,600 | System + context + query |
| Total output | 200-800 | Generated answer |
Generation Cost per Query
| Model | Input Cost (3K tokens) | Output Cost (500 tokens) | Total per Query |
|---|---|---|---|
| Claude Haiku | $0.0024 | $0.002 | $0.0044 |
| Claude Sonnet | $0.009 | $0.0075 | $0.0165 |
| Llama 3.1 8B | $0.00066 | $0.00011 | $0.00077 |
| Llama 3.1 70B | $0.00297 | $0.000495 | $0.003465 |
| Titan Text Express | $0.0006 | $0.0003 | $0.0009 |
Full Cost Model: 1M Queries per Month
Using OpenSearch Serverless and Claude Haiku
| Component | Calculation | Monthly Cost |
|---|---|---|
| Query embeddings | 1M queries x 50 tokens x $0.00002/1K | $1.00 |
| OpenSearch Serverless | 4 OCUs minimum + scaling | $701-$1,200 |
| OpenSearch storage | 50 GB vectors | $1.20 |
| LLM generation (Haiku) | 1M queries x $0.0044/query | $4,400 |
| S3 (source documents) | 100 GB | $2.30 |
| Total | | $5,105-$5,604 |
Using Aurora pgvector and Llama 3.1 8B
| Component | Calculation | Monthly Cost |
|---|---|---|
| Query embeddings | 1M queries x 50 tokens x $0.00002/1K | $1.00 |
| Aurora Serverless v2 | ~2 ACUs average | $175 |
| Aurora storage | 50 GB | $5.00 |
| LLM generation (Llama 8B) | 1M queries x $0.00077/query | $770 |
| S3 (source documents) | 100 GB | $2.30 |
| Total | | $953 |
The difference is striking — $953 vs $5,105 for the same query volume, driven entirely by vector store and LLM choices.
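Both scenario tables follow the same template: query embeddings plus vector store plus storage plus per-query LLM cost plus S3. A sketch that reproduces the two totals (the function and parameter names are illustrative):

```python
# End-to-end monthly RAG cost estimate, mirroring the two scenario tables.
def monthly_rag_cost(queries: int, per_query_llm: float,
                     vector_store: float, storage: float, s3: float = 2.30,
                     embed_tokens_per_query: int = 50,
                     embed_rate_1k: float = 0.00002) -> float:
    embeddings = queries * embed_tokens_per_query / 1000 * embed_rate_1k
    return embeddings + vector_store + storage + queries * per_query_llm + s3

# OpenSearch Serverless + Claude Haiku (floor of the OCU range)
print(round(monthly_rag_cost(1_000_000, 0.0044, 701, 1.20), 2))   # 5105.5
# Aurora pgvector + Llama 3.1 8B
print(round(monthly_rag_cost(1_000_000, 0.00077, 175, 5.00), 2))  # 953.3
```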
Chunking Strategy Impact on Costs
How you chunk documents directly affects both storage costs and LLM generation costs.
| Chunk Size | Chunks per 10K-word Doc | Storage per 1M Docs | LLM Context per Query |
|---|---|---|---|
| 128 tokens | ~100 | ~12.8 GB vectors | ~640 tokens (top-5) |
| 256 tokens | ~50 | ~6.4 GB vectors | ~1,280 tokens (top-5) |
| 512 tokens | ~25 | ~3.2 GB vectors | ~2,560 tokens (top-5) |
| 1,024 tokens | ~13 | ~1.7 GB vectors | ~5,120 tokens (top-5) |
Smaller chunks produce more vectors (higher storage cost) but reduce LLM input tokens per query (lower generation cost). Larger chunks reduce vector count but increase generation cost. A 256-512 token chunk size is typically the best balance.
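The tradeoff in the table is mechanical: vector count scales inversely with chunk size, while LLM context scales linearly with it at a fixed top-k. A sketch, assuming a 10K-word document is roughly 13K tokens (the word-to-token ratio is an assumption):

```python
# Chunking tradeoff: more chunks (storage) vs. larger retrieved context (LLM cost).
def chunk_tradeoff(doc_tokens: int, chunk_size: int, top_k: int = 5):
    chunks_per_doc = -(-doc_tokens // chunk_size)  # ceiling division
    context_tokens = chunk_size * top_k            # tokens sent to the LLM
    return chunks_per_doc, context_tokens

print(chunk_tradeoff(13_000, 128))  # (102, 640)
print(chunk_tradeoff(13_000, 512))  # (26, 2560)
```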
Cost Optimization Tips
- Use Aurora pgvector instead of OpenSearch Serverless for workloads under 500K queries/month: save $500+/month by eliminating the OpenSearch 4-OCU minimum. Aurora Serverless v2 scales down to $44/month during low-traffic periods.
- Cache query results for repeated questions: implement a semantic cache using a small embedding model and a similarity threshold. If a new query is 95%+ similar to a cached query, return the cached response. This typically eliminates 20-40% of LLM calls.
- Use the cheapest embedding model that meets quality needs: Titan Embeddings V2 at $0.00002/1K tokens is 5x cheaper than Cohere and produces competitive results for English-language retrieval.
- Optimize chunk size for cost, not just quality: test 256- and 512-token chunks. Larger chunks reduce vector storage volume but increase LLM input cost. Find the chunk size where retrieval quality plateaus.
- Reduce top-k retrieval count: retrieving 10 chunks instead of 5 doubles your LLM input context cost. Test whether reducing from top-10 to top-3 maintains answer quality.
- Route simple queries to cheaper models: use a small classifier to detect simple factual queries (route to Llama 3.1 8B at $0.00077/query) versus complex reasoning queries (route to Claude Sonnet at $0.0165/query).
- Use Bedrock prompt caching: if your system prompt is large and consistent, prompt caching reduces input token costs by up to 90% on cached content.
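The semantic-cache tip can be sketched in a few lines. This is a minimal in-memory version, assuming query embeddings arrive as plain float lists; the class and method names are illustrative, not an AWS or library API, and a production version would use a persistent store and an approximate-nearest-neighbor index:

```python
import math

class SemanticCache:
    """Return a cached answer when a new query embedding is similar enough
    to a previously seen one (cosine similarity >= threshold)."""

    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    def get(self, embedding):
        """Best-matching cached answer, or None if nothing clears the bar."""
        best = max(self.entries,
                   key=lambda e: self._cosine(embedding, e[0]), default=None)
        if best and self._cosine(embedding, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, embedding, answer):
        self.entries.append((embedding, answer))
```

On a hit, the pipeline skips both retrieval and generation, so each cache hit saves the full per-query LLM cost (for example $0.0044 with Claude Haiku) at the price of one cheap query embedding.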
Related Guides
- AWS Bedrock Knowledge Bases Guide
- AWS Bedrock Embeddings Guide
- AWS OpenSearch Pricing Guide
- AWS Bedrock Pricing Guide
FAQ
What is the minimum cost to run a RAG system on AWS?
The absolute minimum is roughly $44/month using Aurora Serverless v2 with pgvector for vector storage and Bedrock pay-per-token for generation. At very low query volumes (under 1,000/month), your total cost would be $45-50. OpenSearch Serverless has a $701/month minimum regardless of usage.
Should I use Bedrock Knowledge Bases or build my own RAG pipeline?
Bedrock Knowledge Bases simplifies the entire pipeline — automatic chunking, embedding, and retrieval — but uses OpenSearch Serverless as the default vector store ($701/month minimum). Building your own pipeline with Aurora pgvector costs less but requires more engineering. For teams without ML infrastructure experience, Knowledge Bases is worth the premium.
How do I estimate the right number of OpenSearch OCUs?
Start with the 4-OCU minimum and monitor the CloudWatch metrics for OCU utilization. OpenSearch Serverless auto-scales, but you can set max OCU limits. For 1M queries/month with sub-500ms latency, 4-6 search OCUs are typically sufficient. Indexing OCUs depend on how frequently you add new documents.
Lower Your RAG Pipeline Costs with Wring
Wring helps you access AWS credits and volume discounts to lower your RAG pipeline costs. Through group buying power, Wring negotiates better rates so you pay less per query across Bedrock, OpenSearch, and Aurora.
