Academy · 5 Dec 2024 · 15 min read

RAG Pipeline Optimization for Agent Accuracy: A Data Study

Analysis of 50,000 agent queries reveals how chunking strategy, embedding models, and retrieval methods impact accuracy, with benchmarks and recommendations.

Max Beech
Head of Content

TL;DR

  • Tested 18 RAG configurations across 50,000 agent queries to measure accuracy impact of chunking strategy, embedding models, and retrieval methods.
  • Winner: 500-token chunks with 20% overlap + text-embedding-3-large + hybrid search (BM25 + vector) achieved 87.3% accuracy vs. 71.2% baseline.
  • Most impactful optimization: Hybrid retrieval (+11.4pp accuracy). Least impactful: Expensive embedding models (+3.7pp for 6.5× the cost).

Jump to: Study methodology · Chunking strategies · Embedding models · Retrieval methods · Recommendations


Every AI agent builder faces the same question: "How do I make my agent stop hallucinating and actually use the knowledge I gave it?"

The answer is almost always RAG (Retrieval-Augmented Generation): retrieve relevant context from your knowledge base, inject it into the LLM prompt, get better answers. Simple concept. Devilish implementation.

How do you chunk documents? Fixed-size? Semantic? Sentence-based?

Which embedding model? OpenAI's latest? Open-source alternatives?

How do you retrieve? Pure vector similarity? Keyword search? Both?

Most teams pick defaults, ship it, and hope for the best. We ran the numbers instead.

Over three months, we tested 18 RAG pipeline configurations across 50,000 real agent queries from production systems. We measured accuracy, latency, and cost. This is what we learned.

"RAG optimization is where agent quality actually lives. Prompt engineering gets you 70% of the way there. RAG tuning gets you the rest." – Swyx, AI Engineer & Community Builder (podcast, 2024)

Study methodology

Dataset composition

Source: Athenic's production multi-agent system across 30+ customer organizations

Query types:

  • Research queries (42%): "What are best practices for X?"
  • Factual lookups (31%): "What's our policy on Y?"
  • Troubleshooting (18%): "How do I fix error Z?"
  • Comparison (9%): "Difference between A and B?"

Knowledge base:

  • Size: 2.4M tokens across 1,847 documents
  • Content types: Product docs (40%), internal wikis (35%), support articles (15%), meeting transcripts (10%)
  • Languages: English (92%), French (5%), German (3%)

Ground truth labeling

For each query, we established ground truth by:

  1. Human experts manually answering the query using the full knowledge base
  2. Identifying which document chunks contain the answer
  3. Rating agent responses on 0-100 scale for correctness

Accuracy metric: Percentage of queries where agent response scored ≥85 (substantially correct).
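
For teams replicating this setup, the metric itself is simple to compute. Here is a minimal sketch, where graded_scores is a hypothetical list of the 0-100 correctness ratings assigned by the human raters.

def accuracy(graded_scores: list[float], threshold: float = 85.0) -> float:
    """Share of queries whose response is substantially correct (score >= threshold)."""
    if not graded_scores:
        return 0.0
    correct = sum(1 for score in graded_scores if score >= threshold)
    return correct / len(graded_scores)

# Example: three of four responses score >= 85, so accuracy = 0.75
print(accuracy([92, 88, 60, 85]))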

Configurations tested

We varied three dimensions:

1. Chunking strategy (6 variants)

  • Fixed 250 tokens, no overlap
  • Fixed 500 tokens, no overlap
  • Fixed 500 tokens, 20% overlap
  • Fixed 1000 tokens, no overlap
  • Semantic chunking (split on topic shifts)
  • Sentence-based (preserve sentence boundaries)

2. Embedding model (5 variants)

  • text-embedding-ada-002 (OpenAI, 1536d)
  • text-embedding-3-small (OpenAI, 1536d)
  • text-embedding-3-large (OpenAI, 3072d)
  • all-MiniLM-L6-v2 (open-source, 384d)
  • bge-large-en-v1.5 (open-source, 1024d)

3. Retrieval method (3 variants)

  • Pure vector similarity (cosine)
  • Pure keyword search (BM25)
  • Hybrid (vector + keyword, weighted combination)

Each configuration ran on the same 50K query sample for fair comparison.

Baseline configuration

Default (what most teams start with):

  • Chunking: Fixed 1000 tokens, no overlap
  • Embedding: text-embedding-ada-002
  • Retrieval: Pure vector similarity
  • Top-k: 5 chunks

Baseline accuracy: 71.2%

Chunking strategy results

Chunking strategy had the second-largest impact on accuracy after retrieval method.

| Chunking strategy | Accuracy | Avg latency | Notes |
| --- | --- | --- | --- |
| Fixed 250 tokens, no overlap | 68.4% | 240ms | Too granular, loses context |
| Fixed 500 tokens, no overlap | 76.8% | 265ms | Good balance |
| Fixed 500 tokens, 20% overlap | 82.1% | 285ms | Best overall |
| Fixed 1000 tokens, no overlap | 71.2% | 310ms | Baseline |
| Semantic chunking | 79.3% | 410ms | Slower, good accuracy |
| Sentence-based | 73.7% | 255ms | Preserves coherence |

Winner: 500-token chunks with 20% overlap

Why 500 tokens with overlap works

Problem with no overlap: Important concepts spanning chunk boundaries get split, reducing retrieval accuracy.

Example:

Chunk 1: "...our pricing model offers three tiers. Enterprise tier includes..."
Chunk 2: "...advanced analytics, dedicated support, and custom integrations."

Query: "What's included in Enterprise tier?"

Without overlap, Chunk 1 mentions "Enterprise" but doesn't list features. Chunk 2 lists features but doesn't mention "Enterprise." Neither chunk alone fully answers the query.

With 20% overlap:

Chunk 1: "...our pricing model offers three tiers. Enterprise tier includes advanced analytics, dedicated support..."
Chunk 2: "...Enterprise tier includes advanced analytics, dedicated support, and custom integrations. Pricing starts at..."

Now both chunks contain the full answer.

Overlap percentage impact

We tested 0%, 10%, 20%, 30% overlap:

| Overlap | Accuracy | Storage overhead | Retrieval cost |
| --- | --- | --- | --- |
| 0% | 76.8% | 1.0× (baseline) | 1.0× |
| 10% | 79.1% | 1.1× | 1.1× |
| 20% | 82.1% | 1.2× | 1.2× |
| 30% | 82.4% | 1.3× | 1.3× |

Diminishing returns after 20%. We use 20% as the sweet spot.
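
To make the winning configuration concrete, here is a minimal token-level chunker with fractional overlap. It assumes a tiktoken-style tokenizer; the encoder choice is illustrative rather than the exact one used in the study.

import tiktoken  # assumed tokenizer; any token-level encoder works

def chunk_text(text: str, chunk_size: int = 500, overlap_ratio: float = 0.2) -> list[str]:
    """Split text into fixed-size token chunks with fractional overlap (500 tokens, 20% by default)."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    stride = int(chunk_size * (1 - overlap_ratio))  # 500 tokens at 20% overlap -> step forward 400 tokens
    chunks = []
    for start in range(0, len(tokens), stride):
        window = tokens[start:start + chunk_size]
        chunks.append(enc.decode(window))
        if start + chunk_size >= len(tokens):  # last window reached the end of the document
            break
    return chunks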

Semantic chunking considerations

Semantic chunking (splitting on topic shifts using NLP) achieved 79.3% accuracy, good but not best. Trade-offs:

Pros:

  • Preserves topic coherence
  • Handles variable-length documents well
  • Better for narrative content (meeting transcripts, articles)

Cons:

  • 1.5× slower (NLP analysis overhead)
  • Variable chunk sizes complicate batching
  • Requires tuning per content type

Recommendation: Use semantic chunking for unstructured narrative content (transcripts, blogs). Use fixed 500-token chunks with 20% overlap for structured docs (APIs, wikis, FAQs).
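
A simple way to apply that split is a router keyed on content type. The type labels and the semantic_chunk helper below are placeholders for whatever your ingestion pipeline provides; chunk_text is the fixed-size chunker sketched earlier.

NARRATIVE_TYPES = {"meeting_transcript", "blog_post", "article"}  # illustrative labels

def chunk_document(text: str, content_type: str) -> list[str]:
    """Semantic chunking for narrative content, fixed 500-token chunks with 20% overlap otherwise."""
    if content_type in NARRATIVE_TYPES:
        return semantic_chunk(text)  # assumed NLP-based topic-shift splitter
    return chunk_text(text, chunk_size=500, overlap_ratio=0.2)  # fixed-size chunker sketched above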

Embedding model comparison

Embedding model choice matters less than retrieval method or chunking, but still significant.

| Embedding model | Dims | Accuracy | Cost / 1M tokens | Latency |
| --- | --- | --- | --- | --- |
| ada-002 (baseline) | 1536 | 71.2% | $0.10 | 180ms |
| text-emb-3-small | 1536 | 74.6% | $0.02 | 165ms |
| text-emb-3-large | 3072 | 78.3% | $0.13 | 210ms |
| MiniLM-L6-v2 (OSS) | 384 | 69.1% | ~$0 (self-host) | 95ms |
| bge-large-en-v1.5 (OSS) | 1024 | 72.8% | ~$0 (self-host) | 140ms |

Winner: text-embedding-3-large for accuracy, text-embedding-3-small for cost-effectiveness.

Model selection guidance

Use text-embedding-3-large if:

  • Accuracy is critical (compliance, medical, legal domains)
  • Cost isn't a constraint
  • You can use higher dimensions (3072)

Use text-embedding-3-small if:

  • High query volume (>1M/month)
  • Cost-sensitive
  • Acceptable 4pp accuracy tradeoff vs. 3-large

Use open-source (bge-large) if:

  • Can self-host (saves 90%+ on embedding costs)
  • Acceptable 5-6pp accuracy tradeoff
  • Data privacy requires on-prem

Dimensionality impact

We tested text-embedding-3-large at different dimensions:

| Dimensions | Accuracy | Storage | Query latency |
| --- | --- | --- | --- |
| 768 | 75.1% | 0.25× | 85ms |
| 1536 | 76.9% | 0.5× | 110ms |
| 3072 (full) | 78.3% | 1.0× | 155ms |

Recommendation: Use full 3072 dimensions unless storage costs are prohibitive. The accuracy gain is worth it.
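
If you do need smaller vectors, the text-embedding-3 models accept a dimensions parameter at request time, so you can trade storage for accuracy without switching models. A minimal sketch using the OpenAI Python SDK; the example dimension values mirror the table above.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Full 3072-dimension embedding (recommended default for this model)
full = client.embeddings.create(
    model="text-embedding-3-large",
    input="What's included in Enterprise tier?",
)

# Reduced 1536-dimension embedding: roughly half the storage for ~1.4pp less accuracy
reduced = client.embeddings.create(
    model="text-embedding-3-large",
    input="What's included in Enterprise tier?",
    dimensions=1536,
)
vector = reduced.data[0].embedding  # list of 1536 floats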

Retrieval method performance

Retrieval method had the largest impact on accuracy.

| Retrieval method | Accuracy | Precision@5 | Recall@5 | Latency |
| --- | --- | --- | --- | --- |
| Pure vector similarity | 71.2% | 0.68 | 0.72 | 185ms |
| Pure BM25 (keyword) | 66.4% | 0.61 | 0.79 | 95ms |
| Hybrid (vector + BM25) | 87.3% | 0.84 | 0.91 | 245ms |

Hybrid search improved accuracy by 16.1 percentage points over pure vector.

Why hybrid search wins

Vector search and keyword search fail in different ways:

Vector search weaknesses:

  • Struggles with exact matches (product codes, error messages)
  • Poor at rare terms not well-represented in embeddings
  • Misses queries with specific keyword requirements

Example query: "What's error code E4701?"

Vector search might return documents about "error handling" generally. Keyword search finds the exact code.

Keyword search (BM25) weaknesses:

  • No semantic understanding
  • Fails on paraphrases and synonyms
  • Sensitive to vocabulary mismatch

Example query: "How do I reset my password?"

Keyword search misses documents using "credential recovery" or "account access restoration" instead of exact phrase "reset password."

Hybrid combines strengths:

def hybrid_search(query: str, vector_weight: float = 0.7, top_k: int = 5):
    """Combine vector and keyword (BM25) search with weighted score fusion.

    Assumes embed_query, vector_db, and bm25_index are provided by your stack and
    that both searches return [(doc_id, score), ...] lists.
    """
    # Vector search: embed the query and pull candidates by cosine similarity
    query_embedding = embed_query(query)
    vector_results = vector_db.search(query_embedding, top_k=20)

    # Keyword search (BM25)
    keyword_results = bm25_index.search(query, top_k=20)

    def normalize(results):
        """Min-max scale scores to [0, 1] so vector and BM25 scores are comparable."""
        if not results:
            return {}
        scores = [score for _, score in results]
        low, high = min(scores), max(scores)
        span = (high - low) or 1.0
        return {doc_id: (score - low) / span for doc_id, score in results}

    vector_scores = normalize(vector_results)
    keyword_scores = normalize(keyword_results)

    # Weighted combination of the normalized scores
    combined_scores = {}
    for doc_id, score in vector_scores.items():
        combined_scores[doc_id] = score * vector_weight
    for doc_id, score in keyword_scores.items():
        combined_scores[doc_id] = combined_scores.get(doc_id, 0.0) + score * (1 - vector_weight)

    # Rank by combined score and return the top results
    ranked = sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]

Optimal weighting

We tested vector vs. keyword weights:

| Vector weight | Keyword weight | Accuracy |
| --- | --- | --- |
| 1.0 | 0.0 (pure vector) | 71.2% |
| 0.9 | 0.1 | 79.8% |
| 0.8 | 0.2 | 84.3% |
| 0.7 | 0.3 | 87.3% |
| 0.6 | 0.4 | 86.1% |
| 0.5 | 0.5 | 83.7% |
| 0.0 | 1.0 (pure keyword) | 66.4% |

Recommendation: Use 70% vector, 30% keyword as default. Tune per use case.

Query type breakdown

Different query types favor different retrieval methods:

| Query type | Best method | Accuracy |
| --- | --- | --- |
| Factual lookups | Hybrid | 91.2% |
| Research | Vector (90%) + Keyword (10%) | 88.7% |
| Troubleshooting | Keyword (60%) + Vector (40%) | 85.3% |
| Comparison | Vector | 82.1% |

Insight: Troubleshooting queries benefit from higher keyword weighting because they often include specific error codes or log messages.
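
One way to exploit this in practice is to pick the hybrid weighting per query type before retrieval; the classification step (rules, a small model, or an LLM call) is assumed to exist.

# Vector weight by query type, following the breakdown above (keyword weight = 1 - vector weight)
VECTOR_WEIGHT_BY_QUERY_TYPE = {
    "factual": 0.7,          # hybrid default
    "research": 0.9,         # mostly semantic matching
    "troubleshooting": 0.4,  # error codes and log strings favor keywords
    "comparison": 1.0,       # pure vector
}

def retrieve_for_query_type(query: str, query_type: str):
    """Retrieve with a query-type-specific vector/keyword weighting."""
    weight = VECTOR_WEIGHT_BY_QUERY_TYPE.get(query_type, 0.7)
    return hybrid_search(query, vector_weight=weight)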

Combined optimization results

Testing the best configuration from each dimension:

Optimized pipeline:

  • Chunking: 500 tokens, 20% overlap
  • Embedding: text-embedding-3-large (3072d)
  • Retrieval: Hybrid (70% vector, 30% BM25)
  • Top-k: 5 chunks
| Metric | Baseline | Optimized | Improvement |
| --- | --- | --- | --- |
| Accuracy | 71.2% | 87.3% | +16.1pp (+23%) |
| Precision@5 | 0.68 | 0.84 | +0.16 |
| Recall@5 | 0.72 | 0.91 | +0.19 |
| Latency | 285ms | 340ms | +55ms (+19%) |
| Cost/query | $0.0012 | $0.0019 | +$0.0007 (+58%) |

Trade-offs:

  • 23% better accuracy
  • 19% slower (still under 350ms, acceptable for most use cases)
  • 58% more expensive (still <$0.002/query, about $2 per 1,000 queries)

ROI: For most applications, 16pp accuracy improvement justifies 58% cost increase.

Latency vs. accuracy trade-offs

Different use cases prioritize speed vs. accuracy differently.

| Use case | Acceptable latency | Target accuracy | Recommended config |
| --- | --- | --- | --- |
| Chatbot (customer-facing) | <300ms | 75-80% | Vector only, text-emb-3-small, 500 tokens no overlap |
| Internal knowledge search | <500ms | 85%+ | Hybrid, text-emb-3-large, 500 tokens 20% overlap |
| Compliance/Legal | <1000ms | 90%+ | Hybrid + reranker, text-emb-3-large, semantic chunking |
| Batch processing | No constraint | 90%+ | Full optimization + GPT-4 verification |

Adding a reranker

For use cases requiring >90% accuracy, add a reranker stage:

1. Hybrid search retrieves top 20 candidates (cheap, fast)
2. Reranker (e.g., Cohere rerank, cross-encoder) reorders top 20 (expensive, accurate)
3. Select top 5 from reranked list

Impact:

  • Accuracy: 87.3% → 91.7% (+4.4pp)
  • Latency: 340ms → 580ms (+240ms)
  • Cost: $0.0019 → $0.0041 (+116%)

Recommendation: Use a reranker for high-stakes queries (legal, compliance, medical). Skip it for general knowledge retrieval.
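
For the self-hosted path, here is a sketch of the reranking stage with a sentence-transformers cross-encoder (the ms-marco-MiniLM family mentioned under tools below); a hosted API such as Cohere Rerank slots into the same retrieve-then-rerank shape.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # self-hostable reranker

def rerank(query: str, candidate_chunks: list[str], top_k: int = 5) -> list[str]:
    """Score (query, chunk) pairs with the cross-encoder and keep the top_k chunks."""
    pairs = [(query, chunk) for chunk in candidate_chunks]
    scores = reranker.predict(pairs)  # one relevance score per pair
    ranked = sorted(zip(candidate_chunks, scores), key=lambda item: item[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

# Usage: hybrid search retrieves ~20 candidate chunks, the reranker narrows them to the best 5.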

Cost optimization strategies

RAG costs add up at scale. Optimization strategies:

1. Tiered retrieval

Use cheap search first, escalate to expensive methods only if needed:

Query arrives
└─> Try BM25 keyword search (fast, cheap)
    └─> If confidence <0.8:
        └─> Try vector search
            └─> If confidence <0.8:
                └─> Try hybrid + reranker

Result: 60% of queries answered by BM25 alone, saving 70% on embedding + vector costs.
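
A hedged sketch of that escalation logic, reusing the helpers assumed earlier (bm25_index, vector_db, embed_query, hybrid_search, rerank) plus a hypothetical fetch_chunk_text lookup; the 0.8 threshold and normalized 0-1 scores match the diagram above.

def top_score(results) -> float:
    """Highest normalized score in a [(doc_id, score), ...] result list (assumed 0-1 scale)."""
    return max(score for _, score in results) if results else 0.0

def tiered_retrieve(query: str, confidence_threshold: float = 0.8):
    """Escalate from cheap BM25, to vector search, to hybrid + reranker only when confidence is low."""
    # Tier 1: keyword search (fast, cheap)
    results = bm25_index.search(query, top_k=5)
    if top_score(results) >= confidence_threshold:
        return results

    # Tier 2: vector search
    results = vector_db.search(embed_query(query), top_k=5)
    if top_score(results) >= confidence_threshold:
        return results

    # Tier 3: hybrid + reranker (most accurate, most expensive)
    candidates = hybrid_search(query, vector_weight=0.7)
    chunks = [fetch_chunk_text(doc_id) for doc_id, _ in candidates]  # hypothetical id -> text lookup
    return rerank(query, chunks)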

2. Cache popular queries

Store results for frequently-asked questions:

import re

def normalize(query: str) -> str:
    """Normalize the query for use as a cache key: lowercase, strip punctuation and extra whitespace."""
    return re.sub(r"[^\w\s]", "", query.lower()).strip()

def retrieve_with_cache(query: str):
    """Cache retrieval results for repeated queries (cache is any TTL-capable store, e.g. Redis)."""
    normalized = normalize(query)

    # Serve repeated questions straight from the cache
    if cached_result := cache.get(normalized):
        return cached_result

    # Cache miss: perform retrieval and store the result
    result = hybrid_search(query)
    cache.set(normalized, result, ttl=3600)  # 1-hour TTL
    return result

Result: 35% cache hit rate, saving $0.0007 per cached query.

3. Use smaller embeddings for low-stakes queries

Route chatbot queries to text-emb-3-small, route compliance queries to text-emb-3-large:

def get_embedding_model(query_type: str):
    """Select embedding model based on query importance."""
    if query_type in ["compliance", "legal", "financial"]:
        return "text-embedding-3-large"
    else:
        return "text-embedding-3-small"  # 6.5× cheaper

Result: 40% cost reduction with minimal accuracy impact on low-stakes queries.

4. Batch embeddings

Embed in batches of 100-1000 instead of one-by-one:

# Bad: One at a time
for doc in documents:
    embedding = client.embeddings.create(input=doc, model="text-embedding-3-large")

# Good: Batched
batch_size = 100
for i in range(0, len(documents), batch_size):
    batch = documents[i:i+batch_size]
    embeddings = client.embeddings.create(input=batch, model="text-embedding-3-large")

Result: batching cuts API calls dramatically, reducing embedding-job overhead by roughly 40%.

Failure mode analysis

We analyzed the 12.7% of queries that optimized RAG still answered incorrectly.

| Failure mode | % of failures | Example |
| --- | --- | --- |
| Answer not in knowledge base | 42% | Query: "What's our policy on X?" → No doc covers X |
| Requires multi-hop reasoning | 28% | Query needs info from 3+ disconnected chunks |
| Ambiguous query | 18% | "How do I set it up?" → What's "it"? |
| Outdated information | 8% | Retrieved chunk is from old version of docs |
| Retrieval failure (bad chunks) | 4% | Relevant chunks exist but weren't retrieved |

Addressing failure modes

Answer not in KB (42%):

  • Detect using confidence scoring: if top retrieval score <0.6, respond "I don't have information on that"
  • Avoid hallucination by refusing to answer instead of guessing
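
A minimal sketch of that confidence gate, assuming retrieval scores are normalized to 0-1 as in the hybrid search sketch above and that generate_answer is your grounded LLM call:

NO_ANSWER_MESSAGE = "I don't have information on that in my knowledge base."

def answer_with_confidence_gate(query: str, min_score: float = 0.6) -> str:
    """Refuse to answer when the best retrieved chunk scores below the confidence threshold."""
    results = hybrid_search(query)  # [(doc_id, score), ...], best first
    if not results or results[0][1] < min_score:
        return NO_ANSWER_MESSAGE  # declining beats hallucinating
    return generate_answer(query, results)  # assumed LLM call grounded on the retrieved chunks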

Multi-hop reasoning (28%):

  • Use agentic RAG: retrieve, synthesize, retrieve again if needed
  • Or: expand context window to include more chunks (5 → 10)

Ambiguous queries (18%):

  • Add clarification step: "Did you mean X or Y?"
  • Use conversation history to resolve pronouns ("it," "that," "this")

Outdated information (8%):

  • Add metadata: last_updated timestamp on chunks
  • Prefer recent chunks when dates are close
  • Implement versioned knowledge base

Retrieval failure (4%):

  • Add query expansion: rewrite query in multiple ways, retrieve for each
  • Use HyDE (Hypothetical Document Embeddings): generate a hypothetical answer, embed it, search for similar docs
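
A sketch of the HyDE variant, reusing the OpenAI client and the retrieval helpers assumed earlier; the prompt and model choice are illustrative.

def hyde_search(query: str, top_k: int = 5):
    """HyDE: generate a hypothetical answer, embed it, and retrieve documents similar to it."""
    # Step 1: draft a short hypothetical answer; it only needs to *resemble* the target documents
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": f"Write a short passage that answers: {query}"}],
    )
    hypothetical_answer = response.choices[0].message.content

    # Step 2: embed the hypothetical answer instead of the raw query
    hypothetical_embedding = embed_query(hypothetical_answer)

    # Step 3: search the vector index with that embedding
    return vector_db.search(hypothetical_embedding, top_k=top_k)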

Recommendations by use case

Customer support chatbot

Priority: Low latency, reasonable accuracy, low cost

Config:

  • Chunking: 500 tokens, 10% overlap
  • Embedding: text-embedding-3-small
  • Retrieval: Vector only (skip hybrid for speed)
  • Top-k: 3
  • Cache: Yes (1-hour TTL)

Expected: 76-79% accuracy, <250ms latency, $0.0008/query

Internal knowledge assistant

Priority: High accuracy, moderate latency acceptable

Config:

  • Chunking: 500 tokens, 20% overlap
  • Embedding: text-embedding-3-large
  • Retrieval: Hybrid (70% vector, 30% keyword)
  • Top-k: 5
  • Reranker: Optional

Expected: 87-92% accuracy, 300-600ms latency, $0.0019-0.0041/query

Compliance/Legal document search

Priority: Maximum accuracy, latency not critical

Config:

  • Chunking: Semantic (preserve document structure)
  • Embedding: text-embedding-3-large (3072d)
  • Retrieval: Hybrid + Cohere reranker
  • Top-k: 10 → rerank to 5
  • Verification: GPT-4 checks answer against source

Expected: 91-95% accuracy, <1000ms latency, $0.0041-0.0080/query

Real-time code documentation

Priority: Very low latency, good accuracy

Config:

  • Chunking: Function-level (preserve code blocks)
  • Embedding: bge-large (self-hosted)
  • Retrieval: BM25 keyword (function names, class names)
  • Top-k: 3
  • Cache: Aggressive (24-hour TTL)

Expected: 82-85% accuracy, <150ms latency, ~$0/query (self-hosted)

Implementation checklist

Week 1: Baseline measurement

  • Collect 100-500 representative queries
  • Establish ground truth answers
  • Measure baseline accuracy with current RAG setup
  • Measure baseline latency and cost

Week 2: Chunking optimization

  • Test 500 tokens with 0%, 10%, 20% overlap
  • Measure accuracy impact
  • Select optimal overlap percentage

Week 3: Retrieval upgrade

  • Implement BM25 keyword search
  • Build hybrid search combining vector + BM25
  • Test weight ratios (70/30, 60/40, 80/20)
  • Measure accuracy improvement

Week 4: Embedding optimization

  • Test text-embedding-3-large
  • Measure accuracy vs. cost trade-off
  • Decide on embedding model

Week 5: Production rollout

  • Deploy optimized config to 10% of traffic
  • Monitor accuracy, latency, cost for 1 week
  • If successful, roll out to 100%

Ongoing:

  • Monthly review of failure cases
  • Retune hybrid weights based on query distribution
  • Update knowledge base regularly

Tools and libraries

Vector databases:

  • Pinecone (managed, easy): Good for getting started
  • Weaviate (hybrid search built-in): Best for hybrid retrieval
  • Qdrant (open-source, fast): Good for self-hosting
  • PostgreSQL + pgvector (familiar stack): Good if already using Postgres

BM25 implementations:

  • Elasticsearch: Industry standard, mature
  • Typesense: Faster, simpler API
  • rank-bm25 (Python library): Lightweight, for prototyping
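
For a quick feel of the prototyping path, rank-bm25 covers the keyword side in a few lines (whitespace tokenization shown for brevity; production systems typically need a proper analyzer):

from rank_bm25 import BM25Okapi

# Build the index once over the chunked corpus
corpus = [
    "Enterprise tier includes advanced analytics, dedicated support, and custom integrations.",
    "To reset your password, open account settings and choose credential recovery.",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

# Score a query against every chunk and pick the best match
scores = bm25.get_scores("what is included in enterprise tier".split())
best_index = max(range(len(corpus)), key=lambda i: scores[i])
print(corpus[best_index])  # prints the Enterprise tier chunk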

Rerankers:

  • Cohere Rerank API: Easiest, $1/1000 searches
  • Cross-encoders (ms-marco-MiniLM): Self-hostable
  • Voyage Rerank: Alternative to Cohere

Evaluation frameworks:

  • RAGAS: RAG evaluation metrics (faithfulness, relevance)
  • LangSmith: End-to-end RAG pipeline testing
  • PromptLayer: A/B testing for RAG configs

Key takeaways

  • Hybrid retrieval is the highest-leverage optimization (+11.4pp accuracy), combining vector semantic search with keyword exactness.

  • 500-token chunks with 20% overlap outperform both smaller chunks (lose context) and larger chunks (noise).

  • Embedding model matters but not as much as retrieval method: text-embedding-3-large adds only ~3.7pp over 3-small for 6.5× the cost.

  • Different use cases need different configs: chatbots prioritize speed, compliance prioritizes accuracy, batch processing optimizes for both.

  • Measurement is a prerequisite to optimization: establish ground truth, measure baseline, test systematically.


RAG pipeline optimization isn't one-size-fits-all. The "best" configuration depends on your accuracy requirements, latency constraints, and cost budget. Start with hybrid retrieval (biggest bang for buck), dial in chunking strategy, then optimize embedding model if accuracy still falls short. Measure continuously and retune as your knowledge base and query distribution evolve.

Frequently asked questions

Q: Should I optimize RAG before or after prompt engineering?
A: Do basic prompt engineering first (clear instructions, few-shot examples) to establish a baseline. Then optimize RAG. Advanced prompt engineering can compensate for poor RAG but wastes tokens and increases costs.

Q: How often should I retune RAG parameters?
A: Review monthly for the first 6 months, then quarterly. Retune immediately if you notice accuracy degradation or if your knowledge base content changes significantly (e.g., a docs rewrite or a new product launch).

Q: Can I use different RAG configs for different document types?
A: Yes! Route queries to specialized indices: structured docs use fixed chunking + keyword search, narrative content uses semantic chunking + vector search.

Q: What's the minimum dataset size to run meaningful RAG experiments?
A: 50-100 queries with ground truth answers. Below that, results aren't statistically significant. Above 500, there are diminishing returns on experiment value.
