Building Production-Ready RAG Systems: Zero to Scale
A practical guide to building retrieval-augmented generation systems that handle real-world traffic, with patterns for chunking, embedding, hybrid search, and cache optimization.
TL;DR
Retrieval-Augmented Generation (RAG) has become the standard approach for giving LLMs access to proprietary knowledge without fine-tuning. But most RAG implementations fail in production: they return irrelevant chunks, hit latency budgets, or cost too much to scale.
This guide walks through building a production-ready RAG system that handles real traffic, drawing from our experience at Athenic where our knowledge base serves 15,000+ queries daily with 92% retrieval relevance and sub-200ms p95 latency.
Key takeaways
- Naive chunking (fixed 512 tokens) produces 40% more irrelevant retrievals than semantic chunking.
- Hybrid search (vector + keyword) outperforms pure vector search by 23% on domain-specific queries.
- Caching common queries at the embedding and result level cuts costs by 65%.
- Monitor retrieval precision and context utilization to detect quality degradation early.
RAG solves the knowledge cutoff problem: LLMs only know what they saw during training. When users ask about your product, recent events, or proprietary data, base LLMs hallucinate or admit ignorance.
Indexing phase (offline): split source documents into chunks, embed each chunk, and store the vectors with their metadata in a vector database.
Retrieval phase (runtime): embed the user's query and retrieve the most similar chunks.
Generation phase (runtime): inject the retrieved chunks into the prompt so the LLM grounds its answer in them.
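A minimal sketch of how the three phases connect, assuming the chunkDocument, generateEmbedding, and hybridSearch helpers and the knowledge_embeddings table built later in this guide; the db client and the model name are illustrative, not prescriptive:

```typescript
import OpenAI from 'openai';

const openai = new OpenAI();

// Indexing (offline): chunk, embed, store
async function indexDocument(markdown: string) {
  for (const chunk of chunkDocument(markdown)) {
    const embedding = await generateEmbedding(chunk.text);
    await db.query(
      `INSERT INTO knowledge_embeddings (chunk_text, embedding, metadata)
       VALUES ($1, $2::vector, $3)`,
      [chunk.text, JSON.stringify(embedding), chunk.metadata],
    );
  }
}

// Retrieval + generation (runtime): search, inject context, answer
async function answer(query: string): Promise<string> {
  const chunks = await hybridSearch(query, 5);
  const context = chunks.map(c => c.chunk_text).join('\n---\n');

  const completion = await openai.chat.completions.create({
    model: 'gpt-4o', // illustrative
    messages: [
      { role: 'system', content: `Answer using only this context:\n\n${context}` },
      { role: 'user', content: query },
    ],
  });
  return completion.choices[0].message.content ?? '';
}
```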
According to a 2024 study by Stanford's AI Lab, RAG systems achieve 87% factual accuracy on domain-specific questions versus 54% for base LLMs without retrieval (Stanford HAI, 2024).
Use RAG when: users ask about proprietary data, recent events, or product-specific details the base model never saw in training.
Skip RAG when: the base model's general training already answers your users' questions well.
At Athenic, we use RAG for organizational knowledge bases (customer data, integrations, past analyses) but not for general business advice where GPT-4's training suffices.
Chunking is the most underrated part of RAG. Bad chunks produce bad retrievals no matter how sophisticated your search is.
Most tutorials chunk documents into fixed 512-token blocks with 50-token overlap. This is simple but produces poor chunks: boundaries fall wherever the token counter says, splitting sentences, tables, and code blocks in half.
Example: A 512-token chunk might contain:
```
...the API endpoint. The response includes the following fields:

| Field | Type | Description |
|-------|------|-------------|
| id | string | Unique identifier |
| created_at | timestamp | When the record was...
```
The chunk cuts off mid-table. When retrieved, it's useless because the user can't see the full field list.
Chunk on semantic boundaries: paragraphs, sections, or logical units.
| Strategy | When to use | Avg chunk size | Pros | Cons |
|---|---|---|---|---|
| By paragraph | General content, blogs, docs | 200-400 tokens | Preserves context | Small chunks may lack broader context |
| By section | Technical docs, API references | 400-800 tokens | Keeps related info together | Large chunks dilute relevance |
| By topic | Books, research papers | 600-1200 tokens | Highest coherence | Requires NLP models to detect topics |
| Sliding window | Code, logs | 300-600 tokens | Captures transitions | High overlap increases storage |
Our approach at Athenic:
```typescript
interface Chunk {
  id?: string;
  text: string;
  metadata: { heading: string; level: number };
}

// parseMarkdownSections and countTokens are helpers defined elsewhere in our codebase
function chunkDocument(markdown: string): Chunk[] {
  const sections = parseMarkdownSections(markdown); // Split on ## and ###
  const chunks: Chunk[] = [];

  for (const section of sections) {
    if (section.tokens <= 800) {
      chunks.push({
        text: `${section.heading}\n\n${section.content}`,
        metadata: { heading: section.heading, level: section.level },
      });
    } else {
      // Split large sections by paragraphs, repeating the heading in each chunk
      const paragraphs = section.content.split('\n\n');
      let currentChunk = `${section.heading}\n\n`;

      for (const para of paragraphs) {
        if (countTokens(currentChunk + para) > 800) {
          chunks.push({
            text: currentChunk,
            metadata: { heading: section.heading, level: section.level },
          });
          currentChunk = `${section.heading}\n\n${para}`;
        } else {
          currentChunk += para + '\n\n';
        }
      }

      if (currentChunk) {
        chunks.push({
          text: currentChunk,
          metadata: { heading: section.heading, level: section.level },
        });
      }
    }
  }

  return chunks;
}
```
This increased our retrieval precision from 61% to 84% compared to fixed chunking.
Add overlap between adjacent chunks to prevent context loss at boundaries.
Overlap strategies:
We use 75-token overlap with sentence boundary snapping (never split mid-sentence).
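A rough sketch of that overlap step, reusing the countTokens helper from the chunking code above; the sentence splitter here is a naive regex, which is enough to illustrate the boundary snapping:

```typescript
// Take the trailing ~75 tokens of the previous chunk, snapped to sentence
// boundaries, to prepend to the next chunk.
function overlapPrefix(previousChunk: string, overlapTokens = 75): string {
  const sentences = previousChunk.match(/[^.!?]+[.!?]+(\s+|$)/g) ?? [previousChunk];
  const tail: string[] = [];
  let tokens = 0;

  for (const sentence of sentences.reverse()) {
    tokens += countTokens(sentence);
    if (tokens > overlapTokens) break; // never split mid-sentence
    tail.unshift(sentence);
  }
  return tail.join('');
}
```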
OpenAI's text-embedding-3-small (1536 dimensions) offers the best price/performance for English text as of late 2024.
| Model | Dimensions | Cost per 1M tokens | Performance (MTEB benchmark) |
|---|---|---|---|
| text-embedding-3-small | 1536 | $0.02 | 62.3% |
| text-embedding-3-large | 3072 | $0.13 | 64.6% |
| Cohere embed-v3 | 1024 | $0.10 | 64.5% |
| Voyage-2 | 1024 | $0.12 | 68.8% |
We use text-embedding-3-small for cost reasons. The 6.5-point MTEB gap vs. Voyage-2 doesn't justify 6× higher embedding costs for our use case.
Critical rule: Use the same embedding model for indexing and querying. Mixing models breaks vector similarity.
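A minimal generateEmbedding helper of the kind the code samples below assume, with the model name pinned in one place so indexing and querying can never drift apart (the client setup is illustrative):

```typescript
import OpenAI from 'openai';

const openai = new OpenAI();
const EMBEDDING_MODEL = 'text-embedding-3-small'; // same model for indexing and querying

async function generateEmbedding(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: EMBEDDING_MODEL,
    input: text,
  });
  return response.data[0].embedding;
}
```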
| Database | Best for | Latency (p95) | Max scale |
|---|---|---|---|
| pgvector (Postgres) | Existing Postgres shops, <1M vectors | 40ms | 10M vectors |
| Pinecone | Serverless, fast setup | 25ms | Unlimited |
| Weaviate | Open source, self-hosted | 35ms | 100M+ vectors |
| Qdrant | High-performance, Rust-based | 20ms | 100M+ vectors |
At Athenic, we use pgvector because:
pgvector setup:
```sql
-- Enable extension
CREATE EXTENSION vector;

-- Create embeddings table
CREATE TABLE knowledge_embeddings (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  chunk_text TEXT NOT NULL,
  embedding vector(1536),
  metadata JSONB,
  created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Create index for vector similarity (HNSW for speed)
CREATE INDEX ON knowledge_embeddings
USING hnsw (embedding vector_cosine_ops);
```
HNSW (Hierarchical Navigable Small World) indexing provides fast approximate nearest neighbor search. Build time increases but query latency drops from 200ms to 30ms.
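If you need to trade recall against build time, pgvector also lets you tune the HNSW index and the per-session search breadth; the values below are pgvector's defaults, shown as a starting point rather than a recommendation:

```sql
-- Higher m / ef_construction: better recall, slower builds, more memory
CREATE INDEX ON knowledge_embeddings
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- Higher ef_search: better recall, higher query latency
SET hnsw.ef_search = 40;
```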
Pure vector search works well for conceptual queries but fails on specific terminology. Hybrid search combines vector similarity with keyword matching.
User query: "What's the rate limit for the partners API?"
1. Vector search (70% weight):
   - Embed the query
   - Find the top-20 chunks by cosine similarity
2. Keyword search (30% weight):
   - Postgres full-text search on "rate limit" + "partners API"
   - Find the top-20 chunks by keyword match
3. Reciprocal Rank Fusion:
   - Merge both result lists using the RRF algorithm
   - Return the top-5 deduplicated chunks
Why this works: Vector search finds semantically similar content ("throttling limits", "API quotas") while keyword search ensures exact terms ("partners API") appear.
```typescript
async function hybridSearch(
  query: string,
  topK: number = 5,
  precomputedEmbedding?: number[], // lets callers reuse a cached embedding
) {
  const embedding = precomputedEmbedding ?? (await generateEmbedding(query)); // OpenAI embedding

  const results = await db.query(`
    WITH vector_search AS (
      SELECT
        id,
        chunk_text,
        1 - (embedding <=> $1::vector) AS vector_score,
        ROW_NUMBER() OVER (ORDER BY embedding <=> $1::vector) AS vector_rank
      FROM knowledge_embeddings
      ORDER BY embedding <=> $1::vector
      LIMIT 20
    ),
    keyword_search AS (
      SELECT
        id,
        chunk_text,
        ts_rank(to_tsvector('english', chunk_text), plainto_tsquery('english', $2)) AS keyword_score,
        ROW_NUMBER() OVER (
          ORDER BY ts_rank(to_tsvector('english', chunk_text), plainto_tsquery('english', $2)) DESC
        ) AS keyword_rank
      FROM knowledge_embeddings
      WHERE to_tsvector('english', chunk_text) @@ plainto_tsquery('english', $2)
      ORDER BY keyword_score DESC
      LIMIT 20
    )
    SELECT
      COALESCE(v.id, k.id) AS id,
      COALESCE(v.chunk_text, k.chunk_text) AS chunk_text,
      (COALESCE(1.0 / (60 + v.vector_rank), 0.0) * 0.7 +
       COALESCE(1.0 / (60 + k.keyword_rank), 0.0) * 0.3) AS combined_score
    FROM vector_search v
    FULL OUTER JOIN keyword_search k ON v.id = k.id
    ORDER BY combined_score DESC
    LIMIT $3;
  `, [JSON.stringify(embedding), query, topK]);

  return results.rows;
}
```
Reciprocal Rank Fusion (RRF): Instead of averaging scores, RRF uses rank positions. A chunk ranked #1 in vector search and #3 in keyword search gets 1/(60+1) * 0.7 + 1/(60+3) * 0.3. The constant 60 prevents top ranks from dominating.
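The same arithmetic as a standalone function, purely to make the SQL above easier to read (rrfScore is an illustration, not part of our pipeline):

```typescript
// Reciprocal Rank Fusion with k = 60 and the 70/30 weighting used above.
// Ranks are 1-based; a chunk missing from one result list contributes 0.
function rrfScore(vectorRank?: number, keywordRank?: number, k = 60): number {
  const vectorPart = vectorRank ? 1 / (k + vectorRank) : 0;
  const keywordPart = keywordRank ? 1 / (k + keywordRank) : 0;
  return vectorPart * 0.7 + keywordPart * 0.3;
}

rrfScore(1, 3); // 0.7/61 + 0.3/63 ≈ 0.0162
```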
This approach improved our retrieval precision from 76% (vector only) to 92% (hybrid).
After hybrid search, optionally re-rank top-20 results using a cross-encoder model.
Cross-encoders score query-chunk pairs directly rather than computing separate embeddings. They're slower (20-40ms per pair) but more accurate.
```typescript
import { CrossEncoder } from '@xenova/transformers';

// Load the cross-encoder once, not on every request
const rerankerPromise = CrossEncoder.from_pretrained('cross-encoder/ms-marco-MiniLM-L-6-v2');

async function rerank(query: string, chunks: Chunk[], topK: number = 5) {
  const model = await rerankerPromise;

  const scores = await Promise.all(
    chunks.map(chunk => model.rank(query, chunk.text))
  );

  return chunks
    .map((chunk, i) => ({ chunk, score: scores[i] }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map(x => x.chunk);
}
```
We only re-rank for high-value queries (enterprise customers, compliance requests) due to latency cost.
Getting RAG systems production-ready requires caching, monitoring, and cost controls.
Cache at three levels to minimize redundant computation:
```typescript
import { LRUCache as LRU } from 'lru-cache'; // lru-cache v10+

class RAGCache {
  private embeddingCache = new LRU<string, number[]>({ max: 10000 });
  private resultCache = new LRU<string, Chunk[]>({ max: 1000, ttl: 3600000 });   // 1 hour
  private responseCache = new LRU<string, string>({ max: 500, ttl: 7200000 });   // 2 hours

  async search(query: string): Promise<Chunk[]> {
    // Check result cache first
    const cached = this.resultCache.get(query);
    if (cached) return cached;

    // Check embedding cache
    let embedding = this.embeddingCache.get(query);
    if (!embedding) {
      embedding = await generateEmbedding(query);
      this.embeddingCache.set(query, embedding);
    }

    // Perform search, reusing the cached embedding
    const results = await hybridSearch(query, 5, embedding);
    this.resultCache.set(query, results);

    return results;
  }
}
```
Impact: Caching reduced our per-query cost from $0.0042 to $0.0015 (65% savings) and p95 latency from 185ms to 68ms.
Track these metrics to detect quality regressions:
| Metric | Description | Target | Alert threshold |
|---|---|---|---|
| Retrieval precision | % of retrieved chunks actually used by LLM | >85% | <75% |
| Context utilization | % of injected tokens referenced in response | >60% | <40% |
| Latency (p95) | 95th percentile retrieval time | <200ms | >300ms |
| Cache hit rate | % of queries served from cache | >40% | <25% |
| Embedding cost | $ per 1M queries | <$20 | >$35 |
Retrieval precision measurement:
```typescript
async function measurePrecision(query: string, response: string, chunks: Chunk[]) {
  const usedChunks = chunks.filter(chunk =>
    response.includes(chunk.text.slice(0, 50)) // Check if response references the chunk
  );

  const precision = usedChunks.length / chunks.length;

  metrics.gauge('rag.precision', precision, { query_type: classifyQuery(query) });

  if (precision < 0.75) {
    logger.warn(`Low precision (${precision}) for query: ${query}`);
  }

  return precision;
}
```
We log all sub-threshold queries to a review queue where our team manually audits retrieval quality weekly.
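A sketch of that review-queue write, assuming a hypothetical rag_review_queue table; the storage matters less than capturing the query, the retrieved chunk IDs, and the precision score so a reviewer can replay the retrieval:

```typescript
// Hypothetical review-queue table; adapt the schema to your stack.
async function enqueueForReview(query: string, chunks: Chunk[], precision: number) {
  await db.query(
    `INSERT INTO rag_review_queue (query, chunk_ids, precision, created_at)
     VALUES ($1, $2, $3, NOW())`,
    [query, chunks.map(c => c.id), precision],
  );
}
```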
RAG costs come from embeddings, vector storage, and LLM generation.
Embedding cost reduction:
- Use the smaller model where quality allows (text-embedding-3-small vs. large)

Vector storage reduction:
Generation cost reduction:
Our optimization pipeline reduced RAG costs from $1,240/month to $420/month while handling 3× more queries.
Our internal knowledge base indexes 12,400 documents (product docs, customer analyses, integration guides, past research).
Architecture:
- Embeddings: text-embedding-3-small (1536 dimensions)

Workflow:
Performance:
Before RAG: Agents frequently hallucinated partner names and dates. Now they cite specific documents with timestamps.
Mistake: Injecting 10-15 chunks into prompts "just to be safe."
Impact: Dilutes context, increases costs, confuses LLM.
Fix: Start with 3-5 chunks. Measure context utilization. Only increase if utilization >80%.
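Context utilization can be approximated with the same substring heuristic as the precision measurement above, weighted by tokens; a rough sketch reusing countTokens:

```typescript
// Fraction of injected tokens belonging to chunks the response actually drew on
function contextUtilization(response: string, chunks: Chunk[]): number {
  const injected = chunks.reduce((sum, c) => sum + countTokens(c.text), 0);
  const used = chunks
    .filter(c => response.includes(c.text.slice(0, 50)))
    .reduce((sum, c) => sum + countTokens(c.text), 0);
  return injected === 0 ? 0 : used / injected;
}
```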
Mistake: Searching entire knowledge base when query implies filters (date, author, category).
Fix: Extract filters from query and apply before vector search.
```typescript
const filters = extractFilters(query); // { date_after: '2024-08-01', category: 'partnerships' }

const results = await db.query(`
  SELECT * FROM knowledge_embeddings
  WHERE
    metadata->>'category' = $1
    AND created_at > $2
  ORDER BY embedding <=> $3::vector
  LIMIT 5
`, [filters.category, filters.date_after, JSON.stringify(embedding)]);
```
Mistake: Updating documents without re-embedding.
Fix: Trigger re-embedding on document updates. Use webhooks or cron jobs.
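A sketch of the webhook path, assuming the chunking and embedding helpers above and a document_id key in chunk metadata (both the handler name and the metadata key are illustrative):

```typescript
// On document update: drop stale chunks, then re-chunk and re-embed
async function onDocumentUpdated(documentId: string, markdown: string) {
  await db.query(
    `DELETE FROM knowledge_embeddings WHERE metadata->>'document_id' = $1`,
    [documentId],
  );

  for (const chunk of chunkDocument(markdown)) {
    const embedding = await generateEmbedding(chunk.text);
    await db.query(
      `INSERT INTO knowledge_embeddings (chunk_text, embedding, metadata)
       VALUES ($1, $2::vector, $3)`,
      [chunk.text, JSON.stringify(embedding), { ...chunk.metadata, document_id: documentId }],
    );
  }
}
```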
Mistake: Returning answers without showing which chunks were used.
Fix: Always include source metadata in responses so users can verify.
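One way to carry that metadata through to the caller, using the Chunk shape from the chunking code; the response fields are illustrative, not a fixed schema:

```typescript
interface CitedAnswer {
  answer: string;
  sources: { heading: string; retrievedAt: string }[];
}

function withCitations(answer: string, chunks: Chunk[]): CitedAnswer {
  return {
    answer,
    sources: chunks.map(c => ({
      heading: c.metadata.heading,
      retrievedAt: new Date().toISOString(),
    })),
  };
}
```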
Clone our production RAG starter template with pgvector, hybrid search, and caching pre-configured.
Should we use pgvector or a dedicated vector database?
If you already use Postgres and have <5M vectors, pgvector is cheaper and simpler. For >10M vectors or sub-20ms latency needs, use Pinecone or Qdrant.

How often should we re-embed?
Re-embed when documents change or when you switch embedding models. For static content, embed once. For dynamic content (user-generated content, news), re-embed on update.

Can RAG be combined with fine-tuning?
Yes. Fine-tuning teaches models how to respond (tone, format); RAG teaches what to respond (facts, data). Combine them for best results.

What chunk size should we use?
300-600 tokens for most content. Smaller for Q&A pairs, larger for technical docs. Test on your data and measure retrieval precision.

How do we handle images, tables, and code?
For images: use OCR or multimodal embeddings (CLIP). For tables: convert to markdown or embed as text. For code: embed with language-specific models.
Production RAG systems require semantic chunking, hybrid search, multi-layer caching, and quality monitoring. Avoid common pitfalls like over-retrieval, stale embeddings, and missing citations.
Next steps: