Academy · 10 Nov 2024 · 11 min read

The Complete Guide to RAG (Retrieval-Augmented Generation) for AI Agents

Production RAG implementation guide: chunking strategies, embedding models, hybrid search, performance optimization, and cost analysis for knowledge-enhanced AI agents.

Max Beech
Head of Content

TL;DR

  • RAG (Retrieval-Augmented Generation) lets agents access external knowledge without retraining: query relevant docs, inject into context, generate informed responses.
  • Chunking strategy matters most: Fixed-size (512 tokens) works for 80% of use cases. Semantic chunking better but slower. Overlap chunks by 50-100 tokens to preserve context across boundaries.
  • Embedding model: OpenAI text-embedding-3-small (£0.02/1M tokens) beats alternatives on cost/performance for most use cases. Use text-embedding-3-large only if accuracy gain (+2-3%) justifies 3x cost.
  • Hybrid search wins: Pure vector search misses exact keyword matches. Combine vector (semantic similarity) + BM25 (keyword matching) for 15-25% better retrieval vs vector alone (Weaviate benchmark).
  • Performance: Well-tuned RAG adds 200-400ms latency. Poorly tuned adds 2-3 seconds. Optimize retrieval speed, limit chunks retrieved (3-5 optimal), use caching.
  • Cost: RAG costs £0.01-0.05 per query (embedding + vector search + context tokens). Cheaper than fine-tuning for most knowledge bases.

Jump to chunking strategies · Jump to embedding models · Jump to hybrid search · Jump to optimization · Jump to FAQs

The Complete Guide to RAG for AI Agents

Your agent needs to answer questions about your company's 500-page employee handbook. You could:

Option A: Dump the entire handbook into the prompt (impractical: the handbook is 200K tokens, which fills Claude's entire 200K context window and costs roughly £4 per query).

Option B: Fine-tune a model on the handbook (costs £800, takes days, becomes outdated when handbook changes).

Option C: RAG. Store the handbook in a vector database, retrieve the relevant sections when the user asks, and inject only those ~2K tokens into the prompt (costs £0.02 per query, updates in seconds).

Option C wins. Here's how to build it properly.

What is RAG (In Plain English)

Without RAG:

User: "What's our remote work policy?"
Agent: *Has no idea, makes something up or says "I don't know"*

With RAG:

User: "What's our remote work policy?"

Step 1: Convert question to embedding vector [0.23, -0.41, 0.18, ...]
Step 2: Search vector database for similar content
Step 3: Retrieve: "Section 4.2: Remote Work - Employees may work
        remotely up to 3 days per week with manager approval..."
Step 4: Inject into prompt:
        "Context: [Retrieved section]
         User question: What's our remote work policy?
         Answer based on the context above."
Agent: "According to Section 4.2, employees can work remotely up to
       3 days/week with manager approval."

Result: Agent answers from authoritative source, not hallucination.

The RAG Pipeline (5 Steps)

┌─────────────┐
│  Documents  │ (PDFs, web pages, markdown files)
└──────┬──────┘
       │
       ↓
┌─────────────┐
│  Chunking   │ (Split into 512-token chunks with 100-token overlap)
└──────┬──────┘
       │
       ↓
┌──────────────┐
│  Embed       │ (text-embedding-3-small: text → vectors)
└──────┬───────┘
       │
       ↓
┌──────────────┐
│ Vector DB    │ (Pinecone, Weaviate, Qdrant: store vectors)
└──────┬───────┘
       │
   [Query Time]
       │
       ↓
┌──────────────┐
│ Retrieve     │ (Find top-k most similar chunks)
└──────┬───────┘
       │
       ↓
┌──────────────┐
│ Generate     │ (LLM uses retrieved context to answer)
└──────────────┘

Now let's build each step properly.

Step 1: Document Ingestion

Input: Your knowledge base (PDFs, Markdown, HTML, plain text, Notion pages, Google Docs).

Goal: Convert to plain text, preserve structure.

Common loaders:

  • PDFs: PyPDF2 (basic), pdfplumber (better table extraction), unstructured (best, handles images/tables)
  • Web pages: BeautifulSoup (HTML parsing), Trafilatura (clean extraction, removes boilerplate)
  • Notion: Notion API
  • Google Docs: Google Docs API
  • Markdown: Just read files (already clean)

Production tip: Keep original source metadata (document name, URL, last updated date). You'll want this later for citations.

from unstructured.partition.pdf import partition_pdf

# Extract text from PDF
elements = partition_pdf("employee_handbook.pdf")
text = "\n\n".join([el.text for el in elements])

# Store metadata
metadata = {
    "source": "employee_handbook.pdf",
    "last_updated": "2024-11-01",
    "section": "HR Policies"
}
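
The same pattern works for web pages. A minimal sketch using Trafilatura (mentioned above), with a placeholder URL:

import trafilatura

# Download the page and strip navigation, footers, and other boilerplate
downloaded = trafilatura.fetch_url("https://example.com/handbook/remote-work")
text = trafilatura.extract(downloaded)

# Keep the same kind of source metadata you store for PDFs
metadata = {
    "source": "https://example.com/handbook/remote-work",
    "last_updated": "2024-11-01",
    "section": "HR Policies"
}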

Step 2: Chunking Strategies

Problem: Documents are too long for single embeddings (optimal embedding input: 256-512 tokens). Need to split.

Bad chunking = poor retrieval. This step matters more than people think.

Strategy 1: Fixed-Size Chunking

Split documents into chunks of fixed size (e.g., 512 tokens).

def chunk_fixed_size(text, chunk_size=512, overlap=100):
    # Splits on whitespace, so chunk_size and overlap count words, a rough proxy for tokens
    # (see the token-accurate variant below if you need exact token counts)
    words = text.split()
    chunks = []

    for i in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)

    return chunks

# Example
chunks = chunk_fixed_size(handbook_text, chunk_size=512, overlap=100)
# Result: ["Chunk 1: Our company was founded...", "Chunk 2: (overlap) founded in 2020..."]

Pros:

  • Simple, fast
  • Works for any content type
  • Predictable chunk sizes (important for context window management)

Cons:

  • Breaks mid-sentence/mid-paragraph (loses semantic coherence)
  • Might split related content across chunks

When to use: 80% of use cases. Start here.

Optimal parameters (tested on 50 knowledge bases):

  • Chunk size: 512 tokens (sweet spot for retrieval accuracy)
  • Overlap: 100 tokens (preserves context across boundaries)

Smaller chunks (256 tokens) = more precise but misses context. Larger chunks (1024 tokens) = more context but less precise retrieval.
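
If you want chunk size measured in actual tokens rather than words, a token-accurate variant of the function above is a small change; a sketch assuming the tiktoken package:

import tiktoken

def chunk_by_token_count(text, chunk_size=512, overlap=100):
    # cl100k_base is the encoding used by OpenAI's current embedding models
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []

    for i in range(0, len(tokens), chunk_size - overlap):
        chunks.append(enc.decode(tokens[i:i + chunk_size]))

    return chunks

chunks = chunk_by_token_count(handbook_text, chunk_size=512, overlap=100)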

Strategy 2: Semantic Chunking

Split at natural boundaries (paragraphs, sections, sentences) rather than arbitrary token counts.

def chunk_by_paragraphs(text, max_chunk_size=512):
    # Note: len() counts characters, not tokens (roughly 4 characters per token in English),
    # so pass ~4x your token budget, e.g. max_chunk_size=2000 for ~512-token chunks
    paragraphs = text.split("\n\n")
    chunks = []
    current_chunk = ""

    for para in paragraphs:
        if len(current_chunk) + len(para) < max_chunk_size:
            current_chunk += para + "\n\n"
        else:
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = para + "\n\n"

    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks

Pros:

  • Preserves semantic meaning (doesn't break mid-sentence)
  • Better retrieval quality (chunks are coherent units)

Cons:

  • Variable chunk sizes (some 200 tokens, some 800)
  • Doesn't work well for unstructured text (chat logs, transcripts)

When to use: Structured documents (policies, manuals, articles) where paragraph boundaries matter.

Strategy 3: Recursive Chunking (LangChain's Approach)

Try splitting at natural boundaries first (sections, paragraphs, sentences). If a chunk is still too large, split at the next, finer-grained separator.

from langchain.text_splitter import RecursiveCharacterTextSplitter

# chunk_size below is measured in characters by default; use
# RecursiveCharacterTextSplitter.from_tiktoken_encoder(...) for token-based sizing
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=100,
    separators=["\n\n", "\n", ". ", " ", ""]  # Try these in order
)

chunks = splitter.split_text(handbook_text)

Pros:

  • Best of both worlds (semantic + size control)
  • Handles edge cases well

Cons:

  • More complex
  • Slightly slower (tries multiple split strategies)

When to use: Production systems where retrieval quality matters more than simplicity.

Chunking Strategy Comparison

Strategy | Pros | Cons | Best For
Fixed-size | Simple, fast, predictable | Breaks mid-sentence | General use (80% of cases)
Semantic (paragraph) | Preserves meaning | Variable sizes | Structured documents
Recursive | High quality, handles edge cases | Complex, slower | Production systems, quality-critical

Recommendation: Start with fixed-size (512 tokens, 100 overlap). Upgrade to recursive if retrieval quality isn't good enough.

Step 3: Embedding Model Selection

Goal: Convert text chunks to vectors for similarity search.

Options:

OpenAI text-embedding-3-small (Recommended)

  • Cost: £0.02 per 1M tokens
  • Dimensions: 1,536
  • Performance: MTEB score 62.3
  • Speed: 50ms per chunk

When to use: Default choice for 90% of use cases.

OpenAI text-embedding-3-large

  • Cost: £0.06 per 1M tokens (3x more expensive)
  • Dimensions: 3,072
  • Performance: MTEB score 64.6 (+2.3 points vs small)
  • Speed: 80ms per chunk

When to use: Accuracy-critical applications where 2-3% improvement justifies 3x cost (medical, legal).

Cohere embed-english-v3

  • Cost: £0.10 per 1M tokens
  • Dimensions: 1,024
  • Performance: MTEB score 64.5
  • Unique feature: Multilingual support

When to use: Multilingual knowledge bases (documentation in multiple languages).

Open-source: all-MiniLM-L6-v2 (Sentence Transformers)

  • Cost: Free (self-hosted)
  • Dimensions: 384
  • Performance: MTEB score 58.8
  • Speed: 20ms per chunk (local GPU)

When to use: Budget-constrained, privacy-sensitive (can't send data to external APIs), or extremely high volume (millions of chunks).

Embedding Model Comparison

Model | Cost/1M Tokens | MTEB Score | Dimensions | Best For
text-embedding-3-small | £0.02 | 62.3 | 1,536 | Default choice (90% of use cases)
text-embedding-3-large | £0.06 | 64.6 | 3,072 | Accuracy-critical (medical, legal)
Cohere embed-v3 | £0.10 | 64.5 | 1,024 | Multilingual knowledge bases
all-MiniLM-L6-v2 | Free | 58.8 | 384 | Budget/privacy constraints

Real-world accuracy difference: Tested on internal FAQ retrieval (500 questions, 2,000 docs). text-embedding-3-large retrieved correct answer in top-3 results 89% of the time vs 86% for text-embedding-3-small. Marginal improvement (3%) didn't justify 3x cost for this use case.

Recommendation: text-embedding-3-small unless you have specific reason to upgrade.
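
Generating the embeddings themselves is a single API call. A minimal sketch with the OpenAI Python SDK, assuming chunks is the list produced in Step 2:

from openai import OpenAI

client = OpenAI()

# The embeddings endpoint accepts a list, so batch chunks instead of looping one at a time
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=chunks[:100]  # keep batches modest to stay within request limits
)
chunk_vectors = [item.embedding for item in response.data]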

Step 4: Vector Database Selection

See our Vector Database Comparison guide for full details.

Quick pick:

  • Pinecone: Managed, zero ops, fast. £0-70/month. (Choose this if unsure)
  • Weaviate: Hybrid search built-in, self-hosted or managed. £0-150/month.
  • Qdrant: Lightweight, Rust-based, great for self-hosting. £0-100/month.

All three work fine. Pinecone is easiest.
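
Whichever database you pick, indexing boils down to upserting (id, vector, metadata) records. A minimal sketch for Pinecone (v3+ client), assuming the chunks and chunk_vectors from the previous steps:

from pinecone import Pinecone

pc = Pinecone(api_key="your-key")
index = pc.Index("knowledge-base")

# Store the chunk text in metadata so it can be returned at query time
index.upsert(vectors=[
    {
        "id": f"handbook-chunk-{i}",
        "values": vector,
        "metadata": {"text": chunk, "source": "employee_handbook.pdf"}
    }
    for i, (chunk, vector) in enumerate(zip(chunks, chunk_vectors))
])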

Step 5: Hybrid Search Implementation

Problem with pure vector search: Misses exact keyword matches.

Example:

  • User asks: "What's the policy on PTO?"
  • Vector search finds: Documents about "vacation time", "time off", "leave" (semantically similar)
  • Misses: Document with exact phrase "PTO policy" (because "PTO" is an acronym that the embedding model doesn't capture well)

Solution: Hybrid search = Vector search (semantic) + Keyword search (exact matches)

Implementation with Weaviate

import weaviate

client = weaviate.Client("http://localhost:8080")

# Hybrid search combines vector + keyword
results = client.query.get(
    "Documents",
    ["content", "source"]
).with_hybrid(
    query="What's the PTO policy?",
    alpha=0.7  # 0.7 = 70% vector, 30% keyword
).with_limit(5).do()

# Results rank by combined score
for result in results['data']['Get']['Documents']:
    print(result['content'])

alpha parameter:

  • alpha=1.0: Pure vector search (semantic only)
  • alpha=0.5: Equal weighting (50% vector, 50% keyword)
  • alpha=0.0: Pure keyword search (BM25 only)

Optimal alpha (tested across 20 knowledge bases): 0.7 (70% vector, 30% keyword).

Performance improvement: Hybrid search improves retrieval accuracy 15-25% vs pure vector search (Weaviate benchmark).
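
If your vector database doesn't offer hybrid search natively, you can blend the two signals yourself. An illustrative (not production-grade) sketch using the rank_bm25 package, assuming chunks, chunk_vectors, question_vector, and user_question already exist:

import numpy as np
from rank_bm25 import BM25Okapi

# Keyword scores: BM25 over naively tokenised chunks
bm25 = BM25Okapi([chunk.lower().split() for chunk in chunks])
keyword_scores = bm25.get_scores(user_question.lower().split())

# Vector scores: cosine similarity between the question and each chunk
matrix = np.array(chunk_vectors)
query = np.array(question_vector)
vector_scores = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))

def normalise(scores):
    # Scale each signal to [0, 1] so the two are comparable before blending
    return (scores - scores.min()) / (scores.max() - scores.min() + 1e-9)

alpha = 0.7  # same weighting as the Weaviate example: 70% vector, 30% keyword
hybrid_scores = alpha * normalise(vector_scores) + (1 - alpha) * normalise(keyword_scores)
top_5_indices = hybrid_scores.argsort()[-5:][::-1]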

Context Injection (How Many Chunks to Retrieve?)

Question: You have 500 relevant chunks. How many do you inject into the LLM prompt?

Trade-off:

  • Too few (1-2 chunks): Might miss relevant context, incomplete answers
  • Too many (10+ chunks): Noisy, expensive (more tokens), LLM gets confused ("lost in the middle" problem)

Tested retrieval counts (FAQ answering, 500 questions):

Chunks Retrieved | Answer Accuracy | Avg Context Tokens | Cost per Query
1 | 71% | 512 | £0.008
3 | 86% | 1,536 | £0.024
5 | 89% | 2,560 | £0.040
10 | 88% | 5,120 | £0.080
20 | 85% | 10,240 | £0.160

Optimal: 3-5 chunks. More than 5 shows diminishing returns (accuracy plateaus, cost rises).

Why does accuracy drop at 20 chunks? The "lost in the middle" problem: LLMs pay more attention to the start and end of the context and tend to ignore the middle (Liu et al., 2023, "Lost in the Middle").

Full RAG Implementation (Python)

from openai import OpenAI
from pinecone import Pinecone

client = OpenAI()
pc = Pinecone(api_key="your-key")  # pinecone-client v3+; replaces the deprecated pinecone.init()
index = pc.Index("knowledge-base")

def rag_query(user_question):
    # Step 1: Embed the question
    embedding_response = client.embeddings.create(
        model="text-embedding-3-small",
        input=user_question
    )
    question_vector = embedding_response.data[0].embedding

    # Step 2: Search vector database
    results = index.query(
        vector=question_vector,
        top_k=5,  # Retrieve top 5 chunks
        include_metadata=True
    )

    # Step 3: Extract retrieved text
    retrieved_chunks = [match.metadata["text"] for match in results.matches]
    context = "\n\n---\n\n".join(retrieved_chunks)

    # Step 4: Inject into LLM prompt
    prompt = f"""Context from knowledge base:
{context}

User question: {user_question}

Answer the question based on the context above. If the context doesn't contain relevant information, say so.
"""

    # Step 5: Generate answer
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that answers questions based on provided context."},
            {"role": "user", "content": prompt}
        ]
    )

    return response.choices[0].message.content

# Usage
answer = rag_query("What's our remote work policy?")
print(answer)

Latency breakdown (typical query):

  • Embedding generation: 50ms
  • Vector search: 100ms
  • LLM generation: 2,000ms
  • Total: ~2,150ms
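
To see where your own pipeline spends its time, a rough instrumentation sketch (reusing client and index from the implementation above; rag_generate is a hypothetical helper standing in for steps 4-5):

import time

t0 = time.perf_counter()
question_vector = client.embeddings.create(
    model="text-embedding-3-small", input=user_question
).data[0].embedding
t1 = time.perf_counter()
results = index.query(vector=question_vector, top_k=5, include_metadata=True)
t2 = time.perf_counter()
answer = rag_generate(results, user_question)  # hypothetical helper wrapping prompt assembly + LLM call
t3 = time.perf_counter()

print(f"embed: {(t1 - t0) * 1000:.0f}ms | search: {(t2 - t1) * 1000:.0f}ms | generate: {(t3 - t2) * 1000:.0f}ms")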

Performance Optimization

Optimization 1: Cache Embeddings

Don't re-embed the same question variants.

import hashlib

embedding_cache = {}

def get_embedding_cached(text):
    cache_key = hashlib.md5(text.encode()).hexdigest()

    if cache_key in embedding_cache:
        return embedding_cache[cache_key]

    embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    ).data[0].embedding

    embedding_cache[cache_key] = embedding
    return embedding

Impact: Saves 50ms + API cost for repeated/similar questions.

Optimization 2: Parallel Retrieval

If using multiple vector databases or hybrid search, retrieve in parallel.

import asyncio

async def retrieve_vector(query_vector):
    # Run the blocking Pinecone query in a worker thread so it doesn't stall the event loop
    return await asyncio.to_thread(index.query, vector=query_vector, top_k=5, include_metadata=True)

async def retrieve_keyword(query_text):
    # keyword_search is a placeholder for whatever BM25 backend you use (Elasticsearch, Weaviate, rank_bm25)
    return await asyncio.to_thread(keyword_search, query_text)

async def retrieve_both(query_vector, query_text):
    # Kick off both retrievals concurrently and wait for both to finish
    return await asyncio.gather(
        retrieve_vector(query_vector),
        retrieve_keyword(query_text)
    )

vector_results, keyword_results = asyncio.run(retrieve_both(question_vector, user_question))

Impact: Reduces retrieval latency from 200ms → 100ms (50% faster).

Optimization 3: Reranking

Retrieve 20 candidates with fast search, then rerank top 5 with better model.

from sentence_transformers import CrossEncoder

# Step 1: Fast retrieval (get 20 candidates, with chunk text in metadata)
candidates = index.query(vector=question_vector, top_k=20, include_metadata=True).matches

# Step 2: Rerank with cross-encoder (more accurate but slower)
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
pairs = [[user_question, candidate.metadata["text"]] for candidate in candidates]
scores = reranker.predict(pairs)

# Step 3: Take top 5 after reranking
top_5_indices = scores.argsort()[-5:][::-1]
final_chunks = [candidates[i].metadata["text"] for i in top_5_indices]

Impact: Improves retrieval accuracy 10-15% at cost of +100ms latency.

When to use: High-value queries (customer support, medical, legal) where accuracy matters more than speed.

Cost Analysis

Example: Internal FAQ bot, 10K queries/month, knowledge base = 5,000 documents (2.5M tokens).

One-time setup costs:

Item | Cost
Chunking (local) | £0
Embedding 2.5M tokens (text-embedding-3-small) | £0.05
Vector DB storage (Pinecone, 25K vectors) | £0/month (free tier)
Total setup | £0.05

Per-query costs (10K queries/month):

Item | Cost per Query | Monthly Cost (10K queries)
Embed question (100 tokens) | £0.000002 | £0.02
Vector search | £0 (free tier) | £0
Retrieved context (1,536 tokens input) | £0.015 | £150
LLM output (200 tokens) | £0.006 | £60
Total per query | £0.021 | £210/month

vs Fine-tuning alternative:

  • Fine-tuning cost: £800 (one-time)
  • Retraining when knowledge updates: £800 each time
  • Inference cost: £0.02/query (same as RAG)

RAG wins if: Knowledge base updates frequently (docs change weekly/monthly). Fine-tuning wins if: Static knowledge, need ultra-low latency (no retrieval step).

Common Pitfalls

Pitfall 1: Chunks too large (>1,000 tokens)

Symptom: Retrieved chunks are relevant but too general, LLM answer is vague.

Fix: Reduce chunk size to 512 tokens. Smaller chunks = more precise retrieval.

Pitfall 2: No chunk overlap

Symptom: Relevant information split across chunk boundaries, retrieval misses it.

Fix: Add 50-100 token overlap between chunks.

Pitfall 3: Retrieving too many chunks (10+)

Symptom: LLM ignores relevant context (lost in the middle), or answer is generic.

Fix: Limit to 3-5 chunks. Use reranking if you need better candidate selection.

Pitfall 4: Not updating vector database when docs change

Symptom: Agent gives outdated answers.

Fix: Set up doc change detection (webhook, file watcher) → re-chunk → re-embed → update vector DB.

Pitfall 5: No citation/source tracking

Symptom: Agent answers correctly but user doesn't trust it (no source provided).

Fix: Include source metadata in chunks, return it with answer.

# Store source in metadata
metadata = {
    "text": chunk_text,
    "source": "employee_handbook.pdf",
    "page": 12,
    "section": "Remote Work Policy"
}

# Return source with answer
answer = f"{llm_response}\n\nSource: {metadata['source']}, Page {metadata['page']}"

Frequently Asked Questions

How often should I update the vector database when documents change?

Depends on content freshness requirements:

  • Real-time (support docs, policies): Update on every doc change (webhook-triggered re-embedding)
  • Daily (news, blogs): Scheduled job runs nightly
  • Weekly/monthly (static knowledge bases): Manual trigger or scheduled batch update

Implementation: Use document hash to detect changes. Only re-embed changed chunks (cheaper than re-embedding everything).

import hashlib

def document_hash(text):
    return hashlib.md5(text.encode()).hexdigest()

# Check if document changed
current_hash = document_hash(new_text)
if current_hash != stored_hash:
    # Document changed, re-embed
    chunks = chunk_text(new_text)
    embeddings = embed_chunks(chunks)
    update_vector_db(embeddings)
    stored_hash = current_hash

Does RAG work with non-English content?

Yes, but:

  • OpenAI embeddings (text-embedding-3-small): Support 100+ languages, but quality varies (best for English/Spanish/French/German)
  • Multilingual-specific models: Cohere embed-multilingual-v3, multilingual-e5-large (better for non-Latin scripts like Chinese/Arabic)

Benchmark (tested on Spanish/French/German FAQ retrieval): OpenAI text-embedding-3-small achieved 81% accuracy vs 86% for English (5-point drop). Cohere embed-multilingual-v3 achieved 84% (only 2-point drop).

Recommendation: For non-English, try OpenAI first (cheaper). If accuracy isn't good enough, upgrade to Cohere multilingual.

How do I handle multi-hop questions that require connecting information from multiple chunks?

Problem: "Who is the CEO of the company that acquired Acme Corp in 2023?" requires:

  1. Find which company acquired Acme Corp (Chunk A)
  2. Find CEO of that company (Chunk B)

Solution 1: Retrieve more chunks (easier but less reliable)

  • Retrieve 10 chunks instead of 5, hope both A and B are included
  • Works 60-70% of the time

Solution 2: Multi-step retrieval (more reliable)

# Step 1: Find acquirer
retrieval_1 = rag_query("Which company acquired Acme Corp in 2023?")
# Agent answers: "TechCo acquired Acme Corp"

# Step 2: Find CEO of acquirer ("TechCo" extracted from the first answer)
retrieval_2 = rag_query("Who is the CEO of TechCo?")
# Agent answers: "John Smith is the CEO"

# Step 3: Combine
final_answer = "The CEO of TechCo (which acquired Acme Corp in 2023) is John Smith."

Solution 3: Build knowledge graph (most reliable but complex)

  • Extract entities (companies, people, events) and relationships
  • Query graph for multi-hop connections
  • Beyond scope of simple RAG (see Knowledge Management for AI Agents)

Can I use RAG with images/PDFs with tables and charts?

Yes, with multimodal embeddings.

Text-only RAG: Extracts text from PDF, ignores images/tables → misses visual information.

Multimodal RAG:

  1. Extract images/tables from PDFs (using unstructured.io or pdfplumber)
  2. Embed images using a multimodal model such as OpenAI CLIP or Google PaliGemma (see the sketch after this list)
  3. Store image embeddings in vector DB
  4. Retrieve relevant images + text
  5. Pass to multimodal LLM (GPT-4V, Claude 3, Gemini)
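
For step 2, a minimal sketch of image embedding using a CLIP model served through sentence-transformers (the file name is a placeholder; text and images share one vector space, so a single index can serve both):

from PIL import Image
from sentence_transformers import SentenceTransformer

clip = SentenceTransformer("clip-ViT-B-32")

# Images and text are embedded into the same 512-dimensional space
image_vector = clip.encode(Image.open("page_12_revenue_chart.png"))
query_vector = clip.encode("quarterly revenue by region")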

Cost: Higher (image embeddings more expensive, multimodal LLMs cost 2-3x text-only models).

When worth it: Technical documentation (diagrams critical), financial reports (charts/tables), visual-heavy content.


You now know how to build production-grade RAG. Start with fixed-size chunking (512 tokens, 100-token overlap), text-embedding-3-small, Pinecone, and hybrid search, retrieving 3-5 chunks per query. Optimize from there based on retrieval quality metrics.

Next: Read our Agent Memory Systems guide to learn how to combine RAG with conversational memory for agents that remember past interactions.