
RAG System Costs on AWS: Optimization Guide

RAG pipeline costs on AWS broken down: embeddings at $0.0001/1K tokens, OpenSearch at $0.24/OCU-hr, and LLM generation. Full cost model included.

Wring Team
March 15, 2026
9 min read
Tags: RAG costs, retrieval-augmented generation, vector database costs, RAG optimization
Data retrieval and search architecture visualization for RAG systems

A Retrieval-Augmented Generation (RAG) pipeline on AWS has four cost components: embedding generation, vector storage, retrieval queries, and LLM generation. Most teams focus on the LLM cost, but the vector store often dominates — OpenSearch Serverless has a minimum of 4 OCUs costing $701/month even at zero query volume. Understanding where your RAG budget actually goes is essential before you can optimize it. This guide breaks down every cost component and provides a full model for estimating monthly spend at scale.

TL;DR: A production RAG system serving 1M queries/month costs $2,500-8,000 depending on your vector store and LLM choices. The vector store is typically 30-50% of total cost. Use Aurora pgvector instead of OpenSearch Serverless for smaller workloads (under 500K queries/month) to save $400+/month on minimum fees. Cache repeated queries to cut LLM costs by 20-40%.


RAG Pipeline Cost Breakdown

Every RAG query incurs costs at four stages:

| Stage | What Happens | Service | Cost Driver |
|---|---|---|---|
| 1. Embedding | Convert query to vector | Bedrock (Titan Embed) | Tokens processed |
| 2. Storage | Store and index document vectors | OpenSearch or Aurora | Storage volume and compute |
| 3. Retrieval | Find similar vectors | OpenSearch or Aurora | Query volume and latency |
| 4. Generation | LLM generates answer with context | Bedrock (Claude, Llama) | Input + output tokens |

Embedding Costs

Embedding Model Pricing

| Model | Cost per 1K Tokens | Dimensions | Max Tokens |
|---|---|---|---|
| Titan Embeddings V2 | $0.00002 | 256-1,024 | 8,192 |
| Cohere Embed English | $0.0001 | 1,024 | 512 |
| Cohere Embed Multilingual | $0.0001 | 1,024 | 512 |

Titan Embeddings V2 is 5x cheaper than Cohere for English-only workloads. At $0.00002 per 1K tokens, embedding costs are typically the smallest component of a RAG pipeline.

Embedding Cost at Scale

| Document Volume | Avg Tokens per Doc | Titan V2 Cost | Cohere Cost |
|---|---|---|---|
| 10,000 documents | 1,000 | $0.20 | $1.00 |
| 100,000 documents | 1,000 | $2.00 | $10.00 |
| 1,000,000 documents | 1,000 | $20.00 | $100.00 |
| 10,000,000 documents | 1,000 | $200.00 | $1,000.00 |

Embedding is a one-time cost per document (unless you re-embed). Query embedding costs are minimal — 1M queries at 50 tokens each costs $1.00 with Titan V2.
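The arithmetic behind these tables can be captured in a small helper. This is a sketch, not an AWS API call — `embedding_cost` is a hypothetical function, and the per-1K-token rates are the illustrative figures quoted above, not live pricing.

```python
def embedding_cost(num_docs: int, avg_tokens: int, price_per_1k: float) -> float:
    """One-time cost to embed a corpus: total tokens / 1,000 x rate per 1K tokens."""
    total_tokens = num_docs * avg_tokens
    return total_tokens / 1_000 * price_per_1k

# 1M documents at ~1,000 tokens each with Titan Embeddings V2 ($0.00002/1K)
corpus_cost = embedding_cost(1_000_000, 1_000, 0.00002)  # ≈ $20

# Query-side embedding: 1M queries at ~50 tokens each
query_cost = embedding_cost(1_000_000, 50, 0.00002)      # ≈ $1
```

Swap in $0.0001 for the Cohere models to reproduce the right-hand column of the table.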


Vector Store Costs

The vector store is where RAG cost optimization matters most. Your choice here determines your minimum monthly spend.

OpenSearch Serverless

OpenSearch Serverless bills by OpenSearch Compute Units (OCUs).

| Component | Cost | Minimum |
|---|---|---|
| Indexing OCU | $0.24/OCU-hour | 2 OCUs minimum |
| Search OCU | $0.24/OCU-hour | 2 OCUs minimum |
| Storage | $0.024/GB-month | Based on data volume |

Minimum monthly cost: 4 OCUs x $0.24/hr x 730 hours = $701/month

This minimum applies even with zero queries. OpenSearch Serverless scales automatically but never below 4 OCUs (2 indexing + 2 search).

Aurora PostgreSQL with pgvector

| Component | Cost | Notes |
|---|---|---|
| Aurora Serverless v2 | $0.12/ACU-hour | Scales 0.5-128 ACUs |
| Storage | $0.10/GB-month | Auto-scales |
| I/O | $0.20 per 1M requests | Read and write operations |

Minimum monthly cost: 0.5 ACU x $0.12/hr x 730 hours = $43.80/month

Aurora with pgvector is dramatically cheaper at low scale. The tradeoff is lower query throughput for very large vector collections (10M+ vectors).

RDS PostgreSQL with pgvector

| Component | Cost | Notes |
|---|---|---|
| db.r6g.large | $0.26/hr ($189.80/month) | 2 vCPUs, 16 GB RAM |
| db.r6g.xlarge | $0.52/hr ($379.60/month) | 4 vCPUs, 32 GB RAM |
| Storage (gp3) | $0.08/GB-month | Provisioned |

RDS is a good middle ground — more predictable pricing than Serverless, lower minimum than OpenSearch.

Vector Store Comparison

| Vector Store | Min Monthly Cost | Best For | Max Vectors (practical) |
|---|---|---|---|
| OpenSearch Serverless | $701 | High-throughput, large collections | 100M+ |
| Aurora Serverless v2 | $44 | Variable workloads, cost-sensitive | 10M |
| RDS PostgreSQL | $190 | Predictable workloads | 5M |
| Pinecone (external) | $70 (Starter) | Managed, serverless | Varies by plan |
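The monthly floors above follow directly from the hourly rates. A quick sketch, using 730 hours as the average month (all rates are this guide's illustrative figures):

```python
HOURS_PER_MONTH = 730

# Minimum monthly cost for each AWS vector store option
floors = {
    "opensearch_serverless": 4 * 0.24 * HOURS_PER_MONTH,   # 4-OCU floor, $0.24/OCU-hr
    "aurora_serverless_v2": 0.5 * 0.12 * HOURS_PER_MONTH,  # 0.5-ACU floor, $0.12/ACU-hr
    "rds_r6g_large": 0.26 * HOURS_PER_MONTH,               # always-on db.r6g.large
}

for store, cost in floors.items():
    print(f"{store}: ${cost:,.2f}/month minimum")
```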

LLM Generation Costs

The LLM generation step is the most expensive per-query component. Cost depends on the retrieved context size and output length.

Typical RAG Query Token Breakdown

| Component | Tokens | Notes |
|---|---|---|
| System prompt | 200-500 | Instructions, formatting rules |
| Retrieved context | 1,000-4,000 | Top-k retrieved chunks |
| User query | 20-100 | The actual question |
| Generated answer | 200-800 | LLM response |
| Total input | 1,220-4,600 | System + context + query |
| Total output | 200-800 | Generated answer |

Generation Cost per Query

| Model | Input Cost (3K tokens) | Output Cost (500 tokens) | Total per Query |
|---|---|---|---|
| Claude Haiku | $0.0024 | $0.002 | $0.0044 |
| Claude Sonnet | $0.009 | $0.0075 | $0.0165 |
| Llama 3.1 8B | $0.00066 | $0.00011 | $0.00077 |
| Llama 3.1 70B | $0.00297 | $0.000495 | $0.003465 |
| Titan Text Express | $0.0006 | $0.0003 | $0.0009 |
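Per-query cost is just input and output tokens priced at their per-1K rates. A minimal sketch — `query_cost` is a hypothetical helper, and the rates are the ones implied by the table, not live Bedrock pricing:

```python
def query_cost(input_tokens: int, output_tokens: int,
               in_per_1k: float, out_per_1k: float) -> float:
    """LLM cost for one RAG query: input and output tokens at per-1K rates."""
    return input_tokens / 1_000 * in_per_1k + output_tokens / 1_000 * out_per_1k

# Claude Haiku at $0.0008/1K input, $0.004/1K output:
haiku = query_cost(3_000, 500, 0.0008, 0.004)      # $0.0024 + $0.002 = $0.0044

# Llama 3.1 8B at $0.00022/1K for both input and output:
llama_8b = query_cost(3_000, 500, 0.00022, 0.00022)  # $0.00077
```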

Full Cost Model: 1M Queries per Month

Using OpenSearch Serverless and Claude Haiku

| Component | Calculation | Monthly Cost |
|---|---|---|
| Query embeddings | 1M queries x 50 tokens x $0.00002/1K | $1.00 |
| OpenSearch Serverless | 4 OCUs minimum + scaling | $701-$1,200 |
| OpenSearch storage | 50 GB vectors | $1.20 |
| LLM generation (Haiku) | 1M queries x $0.0044/query | $4,400 |
| S3 (source documents) | 100 GB | $2.30 |
| Total | | $5,105-$5,604 |

Using Aurora pgvector and Llama 3.1 8B

| Component | Calculation | Monthly Cost |
|---|---|---|
| Query embeddings | 1M queries x 50 tokens x $0.00002/1K | $1.00 |
| Aurora Serverless v2 | ~2 ACUs average | $175 |
| Aurora storage | 50 GB | $5.00 |
| LLM generation (Llama 8B) | 1M queries x $0.00077/query | $770 |
| S3 (source documents) | 100 GB | $2.30 |
| Total | | $953 |

The difference is striking — $953 vs $5,105 for the same query volume, driven entirely by vector store and LLM choices.
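Both scenarios can be reproduced with one function. A sketch under this guide's assumptions — `monthly_rag_cost` and all default rates are illustrative, not an AWS pricing API:

```python
def monthly_rag_cost(queries: int, llm_per_query: float, vector_store_monthly: float,
                     storage_gb: float = 50, storage_rate: float = 0.024,
                     embed_tokens_per_query: int = 50, embed_per_1k: float = 0.00002,
                     s3_gb: float = 100, s3_rate: float = 0.023) -> float:
    """Sum the four RAG cost components plus S3 for source documents."""
    embeddings = queries * embed_tokens_per_query / 1_000 * embed_per_1k
    generation = queries * llm_per_query
    vector_storage = storage_gb * storage_rate
    s3 = s3_gb * s3_rate
    return embeddings + generation + vector_store_monthly + vector_storage + s3

# Scenario A: OpenSearch Serverless (at its $701 floor) + Claude Haiku
scenario_a = monthly_rag_cost(1_000_000, 0.0044, 701)               # ≈ $5,105

# Scenario B: Aurora pgvector (~2 ACUs avg) + Llama 3.1 8B
scenario_b = monthly_rag_cost(1_000_000, 0.00077, 175,
                              storage_rate=0.10)                    # ≈ $953
```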


Chunking Strategy Impact on Costs

How you chunk documents directly affects both storage costs and LLM generation costs.

| Chunk Size | Chunks per 10K-word Doc | Storage per 1M Docs | LLM Context per Query |
|---|---|---|---|
| 128 tokens | ~100 | ~12.8 GB vectors | ~640 tokens (top-5) |
| 256 tokens | ~50 | ~6.4 GB vectors | ~1,280 tokens (top-5) |
| 512 tokens | ~25 | ~3.2 GB vectors | ~2,560 tokens (top-5) |
| 1,024 tokens | ~13 | ~1.7 GB vectors | ~5,120 tokens (top-5) |

Smaller chunks produce more vectors (higher storage cost) but reduce LLM input tokens per query (lower generation cost). Larger chunks reduce vector count but increase generation cost. A 256-512 token chunk size is typically the best balance.
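The tradeoff is easy to tabulate for your own corpus. A rough sketch assuming ~13,300 tokens per 10K-word document and top-5 retrieval (`chunking_tradeoff` is a hypothetical helper):

```python
import math

def chunking_tradeoff(doc_tokens: int, chunk_size: int, top_k: int = 5):
    """Chunks per document (drives vector count) vs. retrieved context per query."""
    chunks_per_doc = math.ceil(doc_tokens / chunk_size)
    context_tokens = chunk_size * top_k  # LLM input from retrieval alone
    return chunks_per_doc, context_tokens

for size in (128, 256, 512, 1024):
    chunks, ctx = chunking_tradeoff(13_300, size)
    print(f"{size}-token chunks: {chunks} per doc, ~{ctx} context tokens/query")
```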


Cost Optimization Tips

  1. Use Aurora pgvector instead of OpenSearch Serverless for workloads under 500K queries/month — Save $500+/month by eliminating the OpenSearch 4-OCU minimum. Aurora Serverless v2 scales down to $44/month during low-traffic periods.

  2. Cache query results for repeated questions — Implement a semantic cache using a small embedding model and similarity threshold. If a new query is 95%+ similar to a cached query, return the cached response. This typically eliminates 20-40% of LLM calls.

  3. Use the cheapest embedding model that meets quality needs — Titan Embeddings V2 at $0.00002/1K tokens is 5x cheaper than Cohere and produces competitive results for English-language retrieval.

  4. Optimize chunk size for cost, not just quality — Test 256 and 512 token chunks. Larger chunks reduce vector storage volume but increase LLM input cost. Find the chunk size where retrieval quality plateaus.

  5. Reduce top-k retrieval count — Retrieving 10 chunks instead of 5 doubles your LLM input context cost. Test whether reducing from top-10 to top-3 maintains answer quality.

  6. Route simple queries to cheaper models — Use a small classifier to detect simple factual queries (route to Llama 8B at $0.00077/query) vs complex reasoning queries (route to Claude Sonnet at $0.0165/query).

  7. Use Bedrock prompt caching — If your system prompt is large and consistent, prompt caching reduces input token costs by up to 90% on cached content.



FAQ

What is the minimum cost to run a RAG system on AWS?

The absolute minimum is roughly $44/month using Aurora Serverless v2 with pgvector for vector storage and Bedrock pay-per-token for generation. At very low query volumes (under 1,000/month), your total cost would be $45-50. OpenSearch Serverless has a $701/month minimum regardless of usage.

Should I use Bedrock Knowledge Bases or build my own RAG pipeline?

Bedrock Knowledge Bases simplifies the entire pipeline — automatic chunking, embedding, and retrieval — but uses OpenSearch Serverless as the default vector store ($701/month minimum). Building your own pipeline with Aurora pgvector costs less but requires more engineering. For teams without ML infrastructure experience, Knowledge Bases is worth the premium.

How do I estimate the right number of OpenSearch OCUs?

Start with the 4-OCU minimum and monitor CloudWatch OCU-utilization metrics. OpenSearch Serverless auto-scales, but you can set max OCU limits. For 1M queries/month with sub-500ms latency, 4-6 search OCUs are typically sufficient. Indexing OCUs depend on how frequently you add new documents.


Lower Your RAG Pipeline Costs with Wring

Wring helps you access AWS credits and volume discounts to lower your RAG pipeline costs. Through group buying power, Wring negotiates better rates so you pay less per query across Bedrock, OpenSearch, and Aurora.

Start saving on AWS →