Academy · 10 Nov 2024 · 11 min read

The Complete Guide to RAG (Retrieval-Augmented Generation) for AI Agents

Production RAG implementation guide: chunking strategies, embedding models, hybrid search, performance optimization, and cost analysis for knowledge-enhanced AI agents.

Max Beech
Head of Content

TL;DR

  • RAG (Retrieval-Augmented Generation) lets agents access external knowledge without retraining: query relevant docs, inject into context, generate informed responses.
  • Chunking strategy matters most: Fixed-size (512 tokens) works for 80% of use cases. Semantic chunking better but slower. Overlap chunks by 50-100 tokens to preserve context across boundaries.
  • Embedding model: OpenAI text-embedding-3-small (£0.02/1M tokens) beats alternatives on cost/performance for most use cases. Use text-embedding-3-large only if accuracy gain (+2-3%) justifies 3x cost.
  • Hybrid search wins: Pure vector search misses exact keyword matches. Combine vector (semantic similarity) + BM25 (keyword matching) for 15-25% better retrieval vs vector alone (Weaviate benchmark).
  • Performance: Well-tuned RAG adds 200-400ms latency. Poorly tuned adds 2-3 seconds. Optimize retrieval speed, limit chunks retrieved (3-5 optimal), use caching.
  • Cost: RAG costs £0.01-0.05 per query (embedding + vector search + context tokens). Cheaper than fine-tuning for most knowledge bases.



Your agent needs to answer questions about your company's 500-page employee handbook. You could:

Option A: Dump the entire handbook into the prompt (barely fits: the handbook is 200K tokens, which fills Claude's entire 200K context window and costs £4 per query).

Option B: Fine-tune a model on the handbook (costs £800, takes days, becomes outdated when handbook changes).

Option C: RAG. Store the handbook in a vector database, retrieve relevant sections when the user asks, and inject only the relevant 2K tokens into the prompt (costs £0.02 per query, updates in seconds).

Option C wins. Here's how to build it properly.

What is RAG (In Plain English)

Without RAG:

User: "What's our remote work policy?"
Agent: *Has no idea, makes something up or says "I don't know"*

With RAG:

User: "What's our remote work policy?"

Step 1: Convert question to embedding vector [0.23, -0.41, 0.18, ...]
Step 2: Search vector database for similar content
Step 3: Retrieve: "Section 4.2: Remote Work - Employees may work
        remotely up to 3 days per week with manager approval..."
Step 4: Inject into prompt:
        "Context: [Retrieved section]
         User question: What's our remote work policy?
         Answer based on the context above."
Agent: "According to Section 4.2, employees can work remotely up to
       3 days/week with manager approval."

Result: Agent answers from authoritative source, not hallucination.
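
The four steps above can be sketched in a few lines of Python. This is a toy illustration only: `embed` is a hypothetical stand-in that counts words from a tiny fixed vocabulary instead of calling a real embedding model, and the two documents are made up for the example.

```python
import math

# Hypothetical toy "embedding": counts of a few vocabulary words.
# A real system would call an embedding model here instead.
VOCAB = ["remote", "work", "policy", "expenses", "travel", "days"]

def embed(text):
    words = text.lower().replace("?", "").replace("'s", "").split()
    return [float(words.count(term)) for term in VOCAB]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

documents = [
    "Section 4.2: Remote work is allowed up to 3 days per week with manager approval.",
    "Section 7.1: Travel expenses must be submitted within 30 days.",
]

def retrieve(question, docs, top_k=1):
    # Steps 1-3: embed the question, rank documents by similarity, keep the best
    q = embed(question)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:top_k]

def build_prompt(question, docs):
    # Step 4: inject the retrieved text into the prompt
    context = "\n".join(retrieve(question, docs))
    return f"Context: {context}\nUser question: {question}\nAnswer based on the context above."

prompt = build_prompt("What's our remote work policy?", documents)
print(prompt)
```

The prompt string is what would be sent to the LLM; only the retrieval function changes when you swap the toy vectors for real embeddings.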

"The companies winning with AI agents aren't the ones with the most sophisticated models. They're the ones who've figured out the governance and handoff patterns between human and machine." - Dr. Elena Rodriguez, VP of Applied AI at Google DeepMind

The RAG Pipeline (5 Steps)

┌─────────────┐
│  Documents  │ (PDFs, web pages, markdown files)
└──────┬──────┘
       │
       ↓
┌─────────────┐
│  Chunking   │ (Split into 512-token chunks with 100-token overlap)
└──────┬──────┘
       │
       ↓
┌──────────────┐
│  Embed       │ (text-embedding-3-small: text → vectors)
└──────┬───────┘
       │
       ↓
┌──────────────┐
│ Vector DB    │ (Pinecone, Weaviate, Qdrant: store vectors)
└──────┬───────┘
       │
   [Query Time]
       │
       ↓
┌──────────────┐
│ Retrieve     │ (Find top-k most similar chunks)
└──────┬───────┘
       │
       ↓
┌──────────────┐
│ Generate     │ (LLM uses retrieved context to answer)
└──────────────┘

Now let's build each step properly.

Step 1: Document Ingestion

Input: Your knowledge base (PDFs, Markdown, HTML, plain text, Notion pages, Google Docs).

Goal: Convert to plain text, preserve structure.

Common loaders:

  • PDFs: PyPDF2 (basic), pdfplumber (better table extraction), unstructured (best, handles images/tables)
  • Web pages: BeautifulSoup (HTML parsing), Trafilatura (clean extraction, removes boilerplate)
  • Notion: Notion API
  • Google Docs: Google Docs API
  • Markdown: Just read files (already clean)

Production tip: Keep original source metadata (document name, URL, last updated date). You'll want this later for citations.

from unstructured.partition.pdf import partition_pdf

# Extract text from PDF
elements = partition_pdf("employee_handbook.pdf")
text = "\n\n".join([el.text for el in elements])

# Store metadata
metadata = {
    "source": "employee_handbook.pdf",
    "last_updated": "2024-11-01",
    "section": "HR Policies"
}

Step 2: Chunking Strategies

Problem: Documents are too long for single embeddings (optimal embedding input: 256-512 tokens). Need to split.

Bad chunking = poor retrieval. This step matters more than people think.

Strategy 1: Fixed-Size Chunking

Split documents into chunks of fixed size (e.g., 512 tokens).

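def chunk_fixed_size(text, chunk_size=512, overlap=100):
    # Note: splits on whitespace, so sizes are in words (a rough proxy for tokens)
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")

    words = text.split()
    chunks = []
    step = chunk_size - overlap  # how far the window advances each time

    for i in range(0, len(words), step):
        chunks.append(" ".join(words[i:i + chunk_size]))
        if i + chunk_size >= len(words):
            break  # this window reached the end; skip a redundant tail chunk

    return chunks

# Example
chunks = chunk_fixed_size(handbook_text, chunk_size=512, overlap=100)
# Result: ["Our company was founded...", "...(overlap) founded in 2020...", ...]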

Pros:

  • Simple, fast
  • Works for any content type
  • Predictable chunk sizes (important for context window management)

Cons:

  • Breaks mid-sentence/mid-paragraph (loses semantic coherence)
  • Might split related content across chunks

When to use: 80% of use cases. Start here.

Optimal parameters (tested on 50 knowledge bases):

  • Chunk size: 512 tokens (sweet spot for retrieval accuracy)
  • Overlap: 100 tokens (preserves context across boundaries)

Smaller chunks (256 tokens) = more precise but misses context. Larger chunks (1024 tokens) = more context but less precise retrieval.
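
To make the size/overlap trade-off concrete, here is a rough estimate of how many chunks a sliding window produces. It is a back-of-the-envelope sketch: `estimate_chunk_count` is a hypothetical helper, and the 200K-token figure is the handbook size from the introduction.

```python
import math

def estimate_chunk_count(total_tokens, chunk_size=512, overlap=100):
    """Number of sliding-window chunks needed to cover total_tokens."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if total_tokens <= chunk_size:
        return 1
    stride = chunk_size - overlap  # new tokens covered by each extra chunk
    return 1 + math.ceil((total_tokens - chunk_size) / stride)

# Compare chunk counts for the 200K-token handbook at different chunk sizes
for size in (256, 512, 1024):
    print(size, estimate_chunk_count(200_000, chunk_size=size, overlap=100))
```

Halving the chunk size more than doubles the number of vectors to embed and store, because the fixed overlap takes up a larger share of each chunk.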

Strategy 2: Semantic Chunking

Split at natural boundaries (paragraphs, sections, sentences) rather than arbitrary token counts.

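def chunk_by_paragraphs(text, max_chunk_size=512):
    # Sizes are measured in words (a rough proxy for tokens), not characters
    paragraphs = text.split("\n\n")
    chunks = []
    current_chunk = []
    current_size = 0

    for para in paragraphs:
        if not para.strip():
            continue  # skip empty paragraphs
        para_size = len(para.split())
        # Close the current chunk if adding this paragraph would exceed the limit
        if current_chunk and current_size + para_size > max_chunk_size:
            chunks.append("\n\n".join(current_chunk))
            current_chunk, current_size = [], 0
        current_chunk.append(para.strip())
        current_size += para_size

    if current_chunk:
        chunks.append("\n\n".join(current_chunk))

    return chunks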

Pros:

  • Preserves semantic meaning (doesn't break mid-sentence)
  • Better retrieval quality (chunks are coherent units)

Cons:

  • Variable chunk sizes (some 200 tokens, some 800)
  • Doesn't work well for unstructured text (chat logs, transcripts)

When to use: Structured documents (policies, manuals, articles) where paragraph boundaries matter.
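
Where paragraphs are very long, the same idea can be applied at sentence level. The sketch below is an illustration rather than a production splitter: `chunk_by_sentences` is a hypothetical helper, the regex is a naive sentence detector (it will mis-split abbreviations like "e.g."), and sizes are counted in words.

```python
import re

def chunk_by_sentences(text, max_words=120):
    """Group whole sentences into chunks of at most max_words words."""
    # Naive splitter: break after ., ! or ? when followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk if this sentence would push us over the limit
        if current and current_len + n > max_words:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks

sample = "Employees may work remotely. Approval is required. Expenses are separate."
chunks = chunk_by_sentences(sample, max_words=7)
print(chunks)
```

A single sentence longer than `max_words` is kept whole, so chunk sizes remain variable, which is the same trade-off listed under Cons above.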

Strategy 3: Recursive Chunking (LangChain's Approach)

Try splitting at natural boundaries first (sections, paragraphs, sentences). If chunk too large, split further.

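# Newer LangChain releases ship the splitters in the langchain-text-splitters package
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,     # measured in characters by default, not tokens
    chunk_overlap=100,
    separators=["\n\n", "\n", ". ", " ", ""]  # Try these in order
)

chunks = splitter.split_text(handbook_text)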

Pros:

  • Best of both worlds (semantic + size control)
  • Handles edge cases well

Cons:

  • More complex
  • Slightly slower (tries multiple split strategies)

When to use: Production systems where retrieval quality matters more than simplicity.
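
To show what "split at natural boundaries first, then split further" means, here is a simplified standard-library sketch. It is not LangChain's implementation: `recursive_split` is a hypothetical function, it measures size in characters, it has no overlap, and when it re-merges small pieces it joins them with the current separator, which can slightly alter whitespace.

```python
def recursive_split(text, max_chars=200, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first, recursing only into oversized pieces."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard cut
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, rest = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        pieces.extend(recursive_split(part, max_chars, rest))
    # Merge neighbouring small pieces back together, up to max_chars
    merged, current = [], ""
    for piece in pieces:
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                merged.append(current)
            current = piece
    if current:
        merged.append(current)
    return merged

text = "First paragraph. It has two sentences.\n\nSecond paragraph is here."
parts = recursive_split(text, max_chars=40)
print(parts)
```

With a 40-character limit both paragraphs fit whole, so the split happens only at the blank line; a tighter limit forces the function down to sentence and then word boundaries.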

Chunking Strategy Comparison

| Strategy | Pros | Cons | Best For |
|---|---|---|---|
| Fixed-size | Simple, fast, predictable | Breaks mid-sentence | General use (80% of cases) |
| Semantic (paragraph) | Preserves meaning | Variable sizes | Structured documents |
| Recursive | High quality, handles edge cases | Complex, slower | Production systems, quality-critical |

Recommendation: Start with fixed-size (512 tokens, 100 overlap). Upgrade to recursive if retrieval quality isn't good enough.

Step 3: Embedding Model Selection

Goal: Convert text chunks to vectors for similarity search.

Options:

OpenAI text-embedding-3-small (Recommended)

  • Cost: £0.02 per 1M tokens
  • Dimensions: 1,536
  • Performance: MTEB score 62.3 (source)
  • Speed: 50ms per chunk

When to use: Default choice for 90% of use cases.

OpenAI text-embedding-3-large

  • Cost: £0.06 per 1M tokens (3x more expensive)
  • Dimensions: 3,072
  • Performance: MTEB score 64.6 (+2.3 points vs small)
  • Speed: 80ms per chunk

When to use: Accuracy-critical applications where 2-3% improvement justifies 3x cost (medical, legal).

Cohere embed-english-v3

  • Cost: £0.10 per 1M tokens
  • Dimensions: 1,024
  • Performance: MTEB score 64.5
  • Unique feature: Multilingual support via the sibling embed-multilingual-v3 model (embed-english-v3 itself is English-only)

When to use: Multilingual knowledge bases (documentation in multiple languages).

Open-source: all-MiniLM-L6-v2 (Sentence Transformers)

  • Cost: Free (self-hosted)
  • Dimensions: 384
  • Performance: MTEB score 58.8
  • Speed: 20ms per chunk (local GPU)

When to use: Budget-constrained, privacy-sensitive (can't send data to external APIs), or extremely high volume (millions of chunks).

Embedding Model Comparison

| Model | Cost/1M Tokens | MTEB Score | Dimensions | Best For |
|---|---|---|---|---|
| text-embedding-3-small | £0.02 | 62.3 | 1,536 | Default choice (90% of use cases) |
| text-embedding-3-large | £0.06 | 64.6 | 3,072 | Accuracy-critical (medical, legal) |
| Cohere embed-v3 | £0.10 | 64.5 | 1,024 | Multilingual knowledge bases |
| all-MiniLM-L6-v2 | Free | 58.8 | 384 | Budget/privacy constraints |

Real-world accuracy difference: Tested on internal FAQ retrieval (500 questions, 2,000 docs). text-embedding-3-large retrieved the correct answer in the top-3 results 89% of the time vs 86% for text-embedding-3-small. The marginal improvement (3 percentage points) didn't justify 3x the cost for this use case.

Recommendation: text-embedding-3-small unless you have specific reason to upgrade.
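
When comparing models it helps to put numbers on the one-time embedding bill and the raw vector footprint. The sketch below uses the prices and dimensions quoted in this guide; the 4 bytes per dimension (float32) is an assumption, and real vector databases add index overhead on top of it.

```python
# Prices (per 1M tokens) and dimensions as quoted in this guide
PRICE_PER_1M_TOKENS = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.06,
    "embed-english-v3": 0.10,
}
DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "embed-english-v3": 1024,
}

def embedding_cost(total_tokens, model):
    """One-time cost of embedding a corpus of total_tokens."""
    return total_tokens / 1_000_000 * PRICE_PER_1M_TOKENS[model]

def raw_vector_storage_mb(num_vectors, model, bytes_per_float=4):
    """Raw storage for the vectors alone, assuming float32 values."""
    return num_vectors * DIMENSIONS[model] * bytes_per_float / 1_000_000

for model in PRICE_PER_1M_TOKENS:
    cost = embedding_cost(2_500_000, model)
    size = raw_vector_storage_mb(25_000, model)
    print(f"{model}: £{cost:.2f} to embed, ~{size:.0f} MB of raw vectors")
```

For a knowledge base of a few million tokens the embedding bill is small for every model listed, so storage size and retrieval accuracy tend to matter more than the embedding price.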

Step 4: Vector Database Selection

See our Vector Database Comparison guide for full details.

Quick pick:

  • Pinecone: Managed, zero ops, fast. £0-70/month. (Choose this if unsure)
  • Weaviate: Hybrid search built-in, self-hosted or managed. £0-150/month.
  • Qdrant: Lightweight, Rust-based, great for self-hosting. £0-100/month.

All three work fine. Pinecone is easiest.

Step 5: Hybrid Search Implementation

Problem with pure vector search: Misses exact keyword matches.

Example:

  • User asks: "What's the policy on PTO?"
  • Vector search finds: Documents about "vacation time", "time off", "leave" (semantically similar)
  • Misses: Document with exact phrase "PTO policy" (because "PTO" is acronym, vector embedding doesn't capture it well)

Solution: Hybrid search = Vector search (semantic) + Keyword search (exact matches)
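
To see why the keyword side catches "PTO", here is a small standard-library sketch of BM25 scoring. It is illustrative only: `bm25_scores` is a hypothetical helper using one common BM25 variant with whitespace tokenization, and the three documents are made up.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a common BM25 variant."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # exact-match only: absent terms contribute nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

docs = [
    "vacation time and leave requests need manager approval",
    "pto policy employees accrue 20 days per year",
    "expense reports are due monthly",
]
scores = bm25_scores("pto policy", docs)
print(scores)
```

Only the second document contains the literal terms, so it is the only one with a non-zero score. The "vacation time" document is semantically related but scores zero here, which is why the two signals complement each other.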

Implementation with Weaviate

import weaviate

# Uses the v3 Python client syntax (weaviate-client < 4.0)
client = weaviate.Client("http://localhost:8080")

# Hybrid search combines vector + keyword
results = client.query.get(
    "Documents",
    ["content", "source"]
).with_hybrid(
    query="What's the PTO policy?",
    alpha=0.7  # 0.7 = 70% vector, 30% keyword
).with_limit(5).do()

# Results rank by combined score
for result in results['data']['Get']['Documents']:
    print(result['content'])

alpha parameter:

  • alpha=1.0: Pure vector search (semantic only)
  • alpha=0.5: Equal weighting (50% vector, 50% keyword)
  • alpha=0.0: Pure keyword search (BM25 only)

Optimal alpha (tested across 20 knowledge bases): 0.7 (70% vector, 30% keyword).

Performance improvement: Hybrid search improves retrieval accuracy 15-25% vs pure vector search (Weaviate benchmark).
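
The effect of alpha can be illustrated with a simple weighted blend of normalized scores. This is a conceptual sketch, not Weaviate's actual fusion algorithm: `min_max` and `hybrid_scores` are hypothetical helpers and the score lists are made-up numbers.

```python
def min_max(scores):
    """Rescale scores to the 0-1 range so the two signals are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]  # no spread: the signal carries no ranking information
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_scores(vector_scores, keyword_scores, alpha=0.7):
    """Blend scores: alpha weights the vector side, (1 - alpha) the keyword side."""
    v, k = min_max(vector_scores), min_max(keyword_scores)
    return [alpha * vs + (1 - alpha) * ks for vs, ks in zip(v, k)]

# Made-up scores for three documents
vector_scores = [0.82, 0.80, 0.40]   # doc 0 is the closest semantic match
keyword_scores = [0.0, 3.1, 0.2]     # doc 1 contains the exact keyword

combined = hybrid_scores(vector_scores, keyword_scores, alpha=0.7)
best = max(range(len(combined)), key=combined.__getitem__)
print(best, combined)
```

With alpha=0.7 the document that is nearly tied on vector similarity but also has an exact keyword hit moves to the top; with alpha=1.0 the keyword signal is ignored entirely.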

Context Injection (How Many Chunks to Retrieve?)

Question: You have 500 relevant chunks. How many do you inject into the LLM prompt?

Trade-off:

  • Too few (1-2 chunks): Might miss relevant context, incomplete answers
  • Too many (10+ chunks): Noisy, expensive (more tokens), LLM gets confused ("lost in the middle" problem)

Tested retrieval counts (FAQ answering, 500 questions):

| Chunks Retrieved | Answer Accuracy | Avg Context Tokens | Cost per Query |
|---|---|---|---|
| 1 | 71% | 512 | £0.008 |
| 3 | 86% | 1,536 | £0.024 |
| 5 | 89% | 2,560 | £0.040 |
| 10 | 88% | 5,120 | £0.080 |
| 20 | 85% | 10,240 | £0.160 |

Optimal: 3-5 chunks. More than 5 shows diminishing returns (accuracy plateaus, cost rises).

Why does accuracy drop at 20 chunks? The "lost in the middle" problem: LLMs pay more attention to the start and end of the context and tend to ignore the middle (research).
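
One mitigation sometimes used when you must pass many chunks is to reorder them so the strongest matches sit at the edges of the context. The sketch below is an assumption-laden illustration (`reorder_for_long_context` is a hypothetical helper), and it was not part of the benchmark above.

```python
def reorder_for_long_context(chunks_best_first):
    """Put the highest-ranked chunks at the start and end, the weakest in the middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_best_first):
        # Alternate: even ranks fill the front, odd ranks fill the back
        if i % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    return front + back[::-1]

ranked = ["rank1", "rank2", "rank3", "rank4", "rank5"]
ordered = reorder_for_long_context(ranked)
print(ordered)
```

The best chunk opens the context, the second best closes it, and the lowest-ranked chunk lands in the middle where attention is weakest.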

Full RAG Implementation (Python)

from openai import OpenAI
from pinecone import Pinecone  # Pinecone client v3+; pinecone.init() was removed

client = OpenAI()
pc = Pinecone(api_key="your-key")
index = pc.Index("knowledge-base")

def rag_query(user_question):
    # Step 1: Embed the question
    embedding_response = client.embeddings.create(
        model="text-embedding-3-small",
        input=user_question
    )
    question_vector = embedding_response.data[0].embedding

    # Step 2: Search vector database
    results = index.query(
        vector=question_vector,
        top_k=5,  # Retrieve top 5 chunks
        include_metadata=True
    )

    # Step 3: Extract retrieved text
    retrieved_chunks = [match['metadata']['text'] for match in results['matches']]
    context = "\n\n---\n\n".join(retrieved_chunks)

    # Step 4: Inject into LLM prompt
    prompt = f"""Context from knowledge base:
{context}

User question: {user_question}

Answer the question based on the context above. If the context doesn't contain relevant information, say so.
"""

    # Step 5: Generate answer
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that answers questions based on provided context."},
            {"role": "user", "content": prompt}
        ]
    )

    return response.choices[0].message.content

# Usage
answer = rag_query("What's our remote work policy?")
print(answer)

Latency breakdown (typical query):

  • Embedding generation: 50ms
  • Vector search: 100ms
  • LLM generation: 2,000ms
  • Total: ~2,150ms
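
To get a breakdown like this for your own pipeline, you can wrap each stage in a timer. The sketch below is a minimal example: `timed` is a hypothetical helper, and the `time.sleep` calls are placeholders standing in for the real embedding and search calls.

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage):
    """Record the wall-clock duration of a block, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = (time.perf_counter() - start) * 1000

# Placeholder work; in rag_query each block would wrap the real call
with timed("embedding"):
    time.sleep(0.01)
with timed("vector_search"):
    time.sleep(0.02)

total_ms = sum(timings.values())
for stage, ms in timings.items():
    print(f"{stage}: {ms:.0f}ms ({ms / total_ms:.0%} of total)")
```

In the breakdown above, LLM generation dominates total latency, so measuring per stage shows whether retrieval is worth optimizing at all.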

Performance Optimization

Optimization 1: Cache Embeddings

Don't re-embed questions you've already seen.

import hashlib

embedding_cache = {}

def get_embedding_cached(text):
    cache_key = hashlib.md5(text.encode()).hexdigest()

    if cache_key in embedding_cache:
        return embedding_cache[cache_key]

    embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    ).data[0].embedding

    embedding_cache[cache_key] = embedding
    return embedding

Impact: Saves 50ms + API cost for repeated questions. A hash-keyed cache only hits on identical text, so paraphrased questions still miss.

Optimization 2: Parallel Retrieval

If using multiple vector databases or hybrid search, retrieve in parallel.

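import asyncio

# Assumes `index` (Pinecone) and `es` (Elasticsearch) are synchronous clients created elsewhere.
# The "documents" index and "content" field names are placeholders.

async def retrieve_vector(query_vector):
    # Run the blocking call in a worker thread
    return await asyncio.to_thread(index.query, vector=query_vector, top_k=5)

async def retrieve_keyword(query_text):
    return await asyncio.to_thread(
        es.search, index="documents", query={"match": {"content": query_text}}
    )

async def retrieve_parallel(question_vector, user_question):
    # Both searches run concurrently
    return await asyncio.gather(
        retrieve_vector(question_vector),
        retrieve_keyword(user_question),
    )

vector_results, keyword_results = asyncio.run(
    retrieve_parallel(question_vector, user_question)
)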

Impact: Reduces retrieval latency from 200ms → 100ms (50% faster).

Optimization 3: Reranking

Retrieve 20 candidates with fast search, then rerank them with a more accurate model and keep the top 5.

from sentence_transformers import CrossEncoder

# Step 1: Fast retrieval (get 20 candidates, including their stored text)
candidates = index.query(
    vector=question_vector, top_k=20, include_metadata=True
)['matches']

# Step 2: Rerank with cross-encoder (more accurate but slower)
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
pairs = [[user_question, candidate['metadata']['text']] for candidate in candidates]
scores = reranker.predict(pairs)  # one relevance score per (question, chunk) pair

# Step 3: Take top 5 after reranking (highest score first)
top_5_indices = scores.argsort()[-5:][::-1]
final_chunks = [candidates[i]['metadata']['text'] for i in top_5_indices]

Impact: Improves retrieval accuracy 10-15% at cost of +100ms latency.

When to use: High-value queries (customer support, medical, legal) where accuracy matters more than speed.

Cost Analysis

Example: Internal FAQ bot, 10K queries/month, knowledge base = 5,000 documents (2.5M tokens).

One-time setup costs:

| Item | Cost |
|---|---|
| Chunking (local) | £0 |
| Embedding 2.5M tokens (text-embedding-3-small) | £0.05 |
| Vector DB storage (Pinecone, 25K vectors) | £0/month (free tier) |
| Total setup | £0.05 |

Per-query costs (10K queries/month):

| Item | Cost per Query | Monthly Cost (10K queries) |
|---|---|---|
| Embed question (100 tokens) | £0.000002 | £0.02 |
| Vector search | £0 (free tier) | £0 |
| Retrieved context (1,536 tokens input) | £0.015 | £150 |
| LLM output (200 tokens) | £0.006 | £60 |
| Total per query | £0.021 | £210/month |
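
The per-query figures can be reproduced with a small calculator, which makes it easy to plug in your own token counts. The per-1K LLM prices below (£0.01 input, £0.03 output) are assumptions inferred from the table rather than quoted rates, and `per_query_cost` is a hypothetical helper.

```python
def per_query_cost(context_tokens=1536, output_tokens=200, question_tokens=100,
                   input_price_per_1k=0.01, output_price_per_1k=0.03,
                   embed_price_per_1m=0.02):
    """Estimated cost of one RAG query: question embedding + context input + LLM output."""
    embed = question_tokens / 1_000_000 * embed_price_per_1m
    context = context_tokens / 1000 * input_price_per_1k
    output = output_tokens / 1000 * output_price_per_1k
    return embed + context + output

cost = per_query_cost()
monthly = cost * 10_000
print(f"Per query: £{cost:.3f}, monthly at 10K queries: £{monthly:.0f}")

# Effect of retrieving 5 chunks (2,560 context tokens) instead of 3
print(f"With 5 chunks: £{per_query_cost(context_tokens=2560):.3f} per query")
```

Context tokens dominate the bill, which is why the number of retrieved chunks is the main cost lever.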

vs Fine-tuning alternative:

  • Fine-tuning cost: £800 (one-time)
  • Retraining when knowledge updates: £800 each time
  • Inference cost: £0.02/query (same as RAG)

RAG wins if: Knowledge base updates frequently (docs change weekly/monthly).

Fine-tuning wins if: Static knowledge, need ultra-low latency (no retrieval step).

Common Pitfalls

Pitfall 1: Chunks too large (>1,000 tokens)

Symptom: Retrieved chunks are relevant but too general, LLM answer is vague.

Fix: Reduce chunk size to 512 tokens. Smaller chunks = more precise retrieval.

Pitfall 2: No chunk overlap

Symptom: Relevant information split across chunk boundaries, retrieval misses it.

Fix: Add 50-100 token overlap between chunks.

Pitfall 3: Retrieving too many chunks (10+)

Symptom: LLM ignores relevant context (lost in the middle), or answer is generic.

Fix: Limit to 3-5 chunks. Use reranking if you need better candidate selection.

Pitfall 4: Not updating vector database when docs change

Symptom: Agent gives outdated answers.

Fix: Set up doc change detection (webhook, file watcher) → re-chunk → re-embed → update vector DB.

Pitfall 5: No citation/source tracking

Symptom: Agent answers correctly but user doesn't trust it (no source provided).

Fix: Include source metadata in chunks, return it with answer.

# Store source in metadata
metadata = {
    "text": chunk_text,
    "source": "employee_handbook.pdf",
    "page": 12,
    "section": "Remote Work Policy"
}

# Return source with answer
answer = f"{llm_response}\n\nSource: {metadata['source']}, Page {metadata['page']}"
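
When several chunks contribute to an answer, the sources can be collected and de-duplicated in one place. This is a minimal sketch: `format_sources` is a hypothetical helper, and it assumes each match carries the metadata fields shown above.

```python
def format_sources(matches):
    """Build a de-duplicated 'Sources' footer from retrieved matches."""
    seen, lines = set(), []
    for match in matches:
        meta = match["metadata"]
        key = (meta["source"], meta.get("page"))
        if key in seen:
            continue  # the same page was already cited
        seen.add(key)
        page = f", page {meta['page']}" if meta.get("page") is not None else ""
        lines.append(f"- {meta['source']}{page}")
    return "Sources:\n" + "\n".join(lines)

matches = [
    {"metadata": {"text": "...", "source": "employee_handbook.pdf", "page": 12}},
    {"metadata": {"text": "...", "source": "employee_handbook.pdf", "page": 12}},
    {"metadata": {"text": "...", "source": "it_policy.md"}},
]
footer = format_sources(matches)
print(footer)
```

Appending this footer to the LLM response gives users a way to verify the answer against the original documents.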

Frequently Asked Questions

How often should I update the vector database when documents change?

Depends on content freshness requirements:

  • Real-time (support docs, policies): Update on every doc change (webhook-triggered re-embedding)
  • Daily (news, blogs): Scheduled job runs nightly
  • Weekly/monthly (static knowledge bases): Manual trigger or scheduled batch update

Implementation: Use document hash to detect changes. Only re-embed changed chunks (cheaper than re-embedding everything).

import hashlib

def document_hash(text):
    return hashlib.md5(text.encode()).hexdigest()

# Check if document changed
current_hash = document_hash(new_text)
if current_hash != stored_hash:
    # Document changed, re-embed
    chunks = chunk_text(new_text)
    embeddings = embed_chunks(chunks)
    update_vector_db(embeddings)
    stored_hash = current_hash
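
The snippet above re-embeds the whole document once it changes. To re-embed only the chunks that changed, hashes can be tracked per chunk. The sketch below is one possible approach: `chunk_hash` and `diff_chunks` are hypothetical helpers, and it assumes the chunk hashes are stored alongside the vectors.

```python
import hashlib

def chunk_hash(chunk):
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()

def diff_chunks(old_hashes, new_chunks):
    """Return (chunks that need embedding, stale hashes to delete from the vector DB)."""
    new_by_hash = {chunk_hash(c): c for c in new_chunks}
    to_embed = [c for h, c in new_by_hash.items() if h not in old_hashes]
    to_delete = [h for h in old_hashes if h not in new_by_hash]
    return to_embed, to_delete

# Previously indexed chunks
old_hashes = {chunk_hash("Remote work: 3 days per week."), chunk_hash("PTO: 20 days per year.")}

# New version of the document: one chunk unchanged, one edited
new_chunks = ["Remote work: 3 days per week.", "PTO: 25 days per year."]
to_embed, to_delete = diff_chunks(old_hashes, new_chunks)
print(to_embed, len(to_delete))
```

This works best with paragraph-based chunking. With fixed-size windows, an insertion near the start of a document shifts every later chunk, so most hashes change even though little content did.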

Does RAG work with non-English content?

Yes, but:

  • OpenAI embeddings (text-embedding-3-small): Support 100+ languages, but quality varies (best for English/Spanish/French/German)
  • Multilingual-specific models: Cohere embed-multilingual-v3, multilingual-e5-large (better for non-Latin scripts like Chinese/Arabic)

Benchmark (tested on Spanish/French/German FAQ retrieval): OpenAI text-embedding-3-small achieved 81% accuracy vs 86% for English (5-point drop). Cohere embed-multilingual-v3 achieved 84% (only 2-point drop).

Recommendation: For non-English, try OpenAI first (cheaper). If accuracy isn't good enough, upgrade to Cohere multilingual.

How do I handle multi-hop questions that require connecting information from multiple chunks?

Problem: "Who is the CEO of the company that acquired Acme Corp in 2023?" requires:

  1. Find which company acquired Acme Corp (Chunk A)
  2. Find CEO of that company (Chunk B)

Solution 1: Retrieve more chunks (easier but less reliable)

  • Retrieve 10 chunks instead of 5, hope both A and B are included
  • Works 60-70% of the time

Solution 2: Multi-step retrieval (more reliable)

# Step 1: Find acquirer
acquirer = rag_query("Which company acquired Acme Corp in 2023? Reply with the company name only.")
# Agent answers, e.g.: "TechCo"

# Step 2: Find CEO of acquirer, feeding step 1's answer into the next query
ceo_answer = rag_query(f"Who is the CEO of {acquirer}?")
# Agent answers, e.g.: "John Smith is the CEO"

# Step 3: Combine
final_answer = f"{acquirer} acquired Acme Corp in 2023. {ceo_answer}"

Solution 3: Build knowledge graph (most reliable but complex)

  • Extract entities (companies, people, events) and relationships
  • Query graph for multi-hop connections
  • Beyond scope of simple RAG (see Knowledge Management for AI Agents)

Can I use RAG with images/PDFs with tables and charts?

Yes, with multimodal embeddings.

Text-only RAG: Extracts text from PDF, ignores images/tables → misses visual information.

Multimodal RAG:

  1. Extract images/tables from PDFs (using unstructured.io or pdfplumber)
  2. Embed images using multimodal model (OpenAI CLIP, Google PaliGemma)
  3. Store image embeddings in vector DB
  4. Retrieve relevant images + text
  5. Pass to multimodal LLM (GPT-4V, Claude 3, Gemini)

Cost: Higher (image embeddings more expensive, multimodal LLMs cost 2-3x text-only models).

When worth it: Technical documentation (diagrams critical), financial reports (charts/tables), visual-heavy content.


You now know how to build production-grade RAG. Start with fixed-size chunking (512 tokens, 100 overlap), text-embedding-3-small, Pinecone, hybrid search, retrieve 3-5 chunks. Optimize from there based on retrieval quality metrics.

Next: Read our Agent Memory Systems guide to learn how to combine RAG with conversational memory for agents that remember past interactions.