Academy · 18 Nov 2025 · 12 min read

Semantic Caching for AI Agents: Reduce Costs and Latency by 40%

Implement semantic caching that serves similar queries from cache instead of calling LLMs - using embeddings to match query intent, not just exact strings, for significant cost and latency savings.

Max Beech
Head of Content

TL;DR

  • Traditional caching only matches exact queries. Semantic caching matches by meaning - "What's the weather?" hits the same cache as "How's the weather today?"
  • Use embedding similarity to find cached responses for semantically similar queries.
  • Set similarity thresholds carefully: too low = wrong answers from cache; too high = poor hit rate.
  • Cache invalidation based on time, context changes, and feedback ensures freshness.

Jump to: Why semantic caching · Architecture · Implementation · Threshold tuning

Your users ask the same questions in different ways. "What's our refund policy?" and "How do I get a refund?" and "Can I return this?" all need the same answer. Traditional caching misses all of these because the strings don't match exactly. You pay for an LLM call every time.

Semantic caching solves this by matching query intent, not query text. When a new query arrives, you check if you've answered a semantically similar query recently. If so, return the cached response instantly. No LLM call needed.

At Athenic, semantic caching reduced our support agent costs by 38% and improved p50 latency from 1,200ms to 180ms for cached queries. This guide shows you how to build it.

Key takeaways

  • Semantic caching uses embeddings to find similar queries, enabling cache hits for paraphrased questions.
  • Similarity threshold is critical: 0.92+ is typically safe for factual queries; lower thresholds need human review.
  • Context matters: "What's your refund policy?" may need a different answer for user A than for user B.
  • Cache invalidation is as important as cache hits - stale cached answers damage trust.

Why semantic caching

Consider a customer support agent fielding these queries in one hour:

"What's your refund policy?"
"How do I get my money back?"
"Can I return this product?"
"What's the process for refunds?"
"I want a refund"
"Refund policy please"
"How does returns work?"

All seven queries want the same information. Without semantic caching, you make seven LLM calls. With it, you make one call and serve six from cache.

The economics

Metric          | Without cache | With semantic cache | Improvement
LLM calls/hour  | 1,000         | 620                 | -38%
Avg latency     | 1,200ms       | 580ms               | -52%
Cost/hour       | $2.50         | $1.55               | -38%

The cost and latency savings grow with volume. High-volume support scenarios might see 50-60% cache hit rates for common questions.
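To sanity-check the economics for your own traffic, a quick model helps. This is a hypothetical sketch; the per-call cost and latency figures are illustrative, not measured values.

// Rough per-hour savings for a given cache hit rate (illustrative numbers only)
function estimateSavings(params: {
  callsPerHour: number;
  costPerCall: number;     // USD per uncached LLM call
  llmLatencyMs: number;    // Average uncached latency
  cacheLatencyMs: number;  // Average cached latency
  hitRate: number;         // 0.0 to 1.0
}) {
  const { callsPerHour, costPerCall, llmLatencyMs, cacheLatencyMs, hitRate } = params;
  const llmCalls = callsPerHour * (1 - hitRate);
  return {
    llmCallsPerHour: llmCalls,
    costPerHour: llmCalls * costPerCall,
    avgLatencyMs: hitRate * cacheLatencyMs + (1 - hitRate) * llmLatencyMs
  };
}

// A 38% hit rate on 1,000 calls/hour at $0.0025/call: 620 LLM calls, $1.55/hour
console.log(estimateSavings({
  callsPerHour: 1000,
  costPerCall: 0.0025,
  llmLatencyMs: 1200,
  cacheLatencyMs: 180,
  hitRate: 0.38
}));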

When semantic caching works

Good candidates:

  • FAQ-style queries with stable answers
  • Knowledge base lookups
  • Standard explanations and definitions
  • Status checks with predictable outputs

Poor candidates:

  • Personalised recommendations (user-specific context)
  • Time-sensitive information (stock prices, current events)
  • Creative tasks (writing, brainstorming)
  • Multi-turn conversations (context changes each turn)
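One way to enforce this split is a routing guard that only consults the cache for query types it handles well. A minimal sketch, assuming you already classify incoming queries upstream (the QueryKind labels are hypothetical):

type QueryKind =
  | 'faq' | 'kb_lookup' | 'definition' | 'status'
  | 'recommendation' | 'realtime' | 'creative';

// Only FAQ-style, knowledge base, definition, and status queries go through the cache
const CACHEABLE = new Set<QueryKind>(['faq', 'kb_lookup', 'definition', 'status']);

function shouldUseCache(kind: QueryKind, turnCount: number): boolean {
  if (turnCount > 1) return false;  // Multi-turn: context shifts every turn
  return CACHEABLE.has(kind);
}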

Architecture

Semantic caching adds a lookup layer before LLM calls.

User query
    ↓
[Generate embedding]
    ↓
[Search cache by similarity]
    ↓
┌─ Cache hit (similarity > threshold)
│      ↓
│  [Return cached response]
│
└─ Cache miss
       ↓
   [Call LLM]
       ↓
   [Store in cache with embedding]
       ↓
   [Return response]

Components

Embedding service: Converts queries to vectors. Use the same model for storage and lookup.

Vector store: Holds cached responses with their query embeddings. Needs fast similarity search.

Cache manager: Handles lookups, storage, invalidation, and threshold decisions.

Schema design

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE semantic_cache (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),

  -- Query representation
  query_text TEXT NOT NULL,
  query_embedding vector(1536),

  -- Response
  response_text TEXT NOT NULL,

  -- Context keys (for scoped caching)
  org_id TEXT,
  user_id TEXT,
  context_hash TEXT,  -- Hash of relevant context

  -- Metadata
  model_id TEXT NOT NULL,
  token_count INTEGER,
  created_at TIMESTAMPTZ DEFAULT NOW(),
  last_accessed_at TIMESTAMPTZ DEFAULT NOW(),
  access_count INTEGER DEFAULT 1,
  expires_at TIMESTAMPTZ,

  -- Quality tracking
  feedback_positive INTEGER DEFAULT 0,
  feedback_negative INTEGER DEFAULT 0
);

-- Vector similarity index
CREATE INDEX ON semantic_cache
USING ivfflat (query_embedding vector_cosine_ops)
WITH (lists = 100);

-- Scoped lookup index
CREATE INDEX ON semantic_cache (org_id, context_hash);

-- Expiry cleanup index
CREATE INDEX ON semantic_cache (expires_at) WHERE expires_at IS NOT NULL;

Implementation

Step 1: Cache manager

import OpenAI from 'openai';
import crypto from 'crypto';

// db is assumed to be your Postgres client wrapper exposing query(text, params),
// with pgvector parameters handled (number[] serialised to the '[0.1, ...]' text format)

interface CacheConfig {
  similarityThreshold: number;  // 0.0 to 1.0
  maxAge: number;               // Milliseconds
  contextKeys: string[];        // Which context fields affect cache scope
  embeddingModel: string;
}

interface CacheEntry {
  id: string;
  queryText: string;
  responseText: string;
  similarity: number;
  createdAt: Date;
  accessCount: number;
}

class SemanticCache {
  private config: CacheConfig;
  private openai: OpenAI;

  constructor(config: Partial<CacheConfig> = {}) {
    this.config = {
      similarityThreshold: 0.92,
      maxAge: 3600000,  // 1 hour
      contextKeys: ['org_id'],
      embeddingModel: 'text-embedding-3-small',
      ...config
    };
    this.openai = new OpenAI();
  }

  async lookup(
    query: string,
    context: Record<string, string>
  ): Promise<CacheEntry | null> {
    // Generate embedding for query
    const embedding = await this.embed(query);

    // Build context hash for scoped lookup
    const contextHash = this.hashContext(context);

    // Search for similar cached queries
    const results = await db.query(`
      SELECT
        id,
        query_text,
        response_text,
        1 - (query_embedding <=> $1) AS similarity,
        created_at,
        access_count
      FROM semantic_cache
      WHERE org_id = $2
        AND (context_hash = $3 OR context_hash IS NULL)
        AND (expires_at IS NULL OR expires_at > NOW())
        AND 1 - (query_embedding <=> $1) > $4
      ORDER BY query_embedding <=> $1
      LIMIT 1
    `, [
      embedding,
      context.org_id,
      contextHash,
      this.config.similarityThreshold
    ]);

    if (results.length === 0) {
      return null;
    }

    const hit = results[0];

    // Update access tracking
    await this.recordAccess(hit.id);

    return {
      id: hit.id,
      queryText: hit.query_text,
      responseText: hit.response_text,
      similarity: hit.similarity,
      createdAt: hit.created_at,
      accessCount: hit.access_count
    };
  }

  async store(
    query: string,
    response: string,
    context: Record<string, string>,
    metadata: { modelId: string; tokenCount: number }
  ): Promise<string> {
    const embedding = await this.embed(query);
    const contextHash = this.hashContext(context);

    const result = await db.query(`
      INSERT INTO semantic_cache (
        query_text,
        query_embedding,
        response_text,
        org_id,
        context_hash,
        model_id,
        token_count,
        expires_at
      ) VALUES ($1, $2, $3, $4, $5, $6, $7, $8)
      ON CONFLICT DO NOTHING
      RETURNING id
    `, [
      query,
      embedding,
      response,
      context.org_id,
      contextHash,
      metadata.modelId,
      metadata.tokenCount,
      new Date(Date.now() + this.config.maxAge)
    ]);

    return result[0]?.id;
  }

  private async embed(text: string): Promise<number[]> {
    const response = await this.openai.embeddings.create({
      model: this.config.embeddingModel,
      input: text
    });
    return response.data[0].embedding;
  }

  private hashContext(context: Record<string, string>): string {
    const relevant = this.config.contextKeys
      .map(key => context[key])
      .filter(Boolean)
      .join(':');

    return crypto.createHash('md5').update(relevant).digest('hex');
  }

  private async recordAccess(cacheId: string): Promise<void> {
    await db.query(`
      UPDATE semantic_cache
      SET
        last_accessed_at = NOW(),
        access_count = access_count + 1
      WHERE id = $1
    `, [cacheId]);
  }
}

Step 2: Integration with agent

class CachedAgent {
  private cache: SemanticCache;
  private llm: OpenAI;
  private systemPrompt: string;

  async respond(
    query: string,
    context: ExecutionContext
  ): Promise<{ response: string; fromCache: boolean }> {
    // Check cache first
    const cached = await this.cache.lookup(query, {
      org_id: context.orgId,
      user_id: context.userId
    });

    if (cached) {
      console.log(`Cache hit (similarity: ${cached.similarity.toFixed(3)})`);
      return { response: cached.responseText, fromCache: true };
    }

    // Cache miss - call LLM
    const response = await this.llm.chat.completions.create({
      model: 'gpt-4o',
      messages: [
        { role: 'system', content: this.systemPrompt },
        { role: 'user', content: query }
      ]
    });

    const responseText = response.choices[0].message.content ?? '';

    // Store in cache
    await this.cache.store(query, responseText, {
      org_id: context.orgId,
      user_id: context.userId
    }, {
      modelId: 'gpt-4o',
      tokenCount: response.usage.total_tokens
    });

    return { response: responseText, fromCache: false };
  }
}

Step 3: Cache invalidation

class CacheInvalidator {
  private openai = new OpenAI();

  // Time-based expiry
  async cleanupExpired(): Promise<number> {
    const result = await db.query(`
      DELETE FROM semantic_cache
      WHERE expires_at < NOW()
      RETURNING id
    `);
    return result.length;
  }

  // Content-based invalidation
  async invalidateByContent(pattern: string): Promise<number> {
    // When source content changes, invalidate related cache entries
    const result = await db.query(`
      DELETE FROM semantic_cache
      WHERE response_text ILIKE $1
      RETURNING id
    `, [`%${pattern}%`]);
    return result.length;
  }

  // Semantic invalidation - invalidate entries similar to a query
  async invalidateSimilar(
    query: string,
    orgId: string,
    threshold: number = 0.85
  ): Promise<number> {
    const embedding = await this.embed(query);

    const result = await db.query(`
      DELETE FROM semantic_cache
      WHERE org_id = $1
        AND 1 - (query_embedding <=> $2) > $3
      RETURNING id
    `, [orgId, embedding, threshold]);

    return result.length;
  }

  // Feedback-based invalidation
  async invalidatePoorQuality(): Promise<number> {
    // Remove entries with bad feedback ratio
    const result = await db.query(`
      DELETE FROM semantic_cache
      WHERE feedback_positive + feedback_negative > 5
        AND feedback_negative::float / (feedback_positive + feedback_negative) > 0.3
      RETURNING id
    `);
    return result.length;
  }

  // Org-wide invalidation (e.g., when knowledge base updates)
  async invalidateOrg(orgId: string): Promise<number> {
    const result = await db.query(`
      DELETE FROM semantic_cache
      WHERE org_id = $1
      RETURNING id
    `, [orgId]);
    return result.length;
  }

  // Uses the same embedding model as cache writes so similarity scores are comparable
  private async embed(text: string): Promise<number[]> {
    const response = await this.openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: text
    });
    return response.data[0].embedding;
  }
}

Threshold tuning

The similarity threshold determines cache behaviour. Too high = few hits. Too low = wrong answers.

Guidelines by use case

Use case         | Recommended threshold | Rationale
FAQ lookups      | 0.90-0.92             | Stable answers, high hit rate valuable
Knowledge base   | 0.92-0.94             | Need accuracy, some flexibility okay
Technical docs   | 0.94-0.96             | Precision matters, minor variations change meaning
Legal/compliance | 0.96-0.98             | Exactness critical, cache hits less important

Empirical tuning

Run experiments with labelled query pairs to find the optimal threshold:

async function findOptimalThreshold(
  testPairs: { query1: string; query2: string; shouldMatch: boolean }[]
): Promise<{ threshold: number; accuracy: number }> {
  // Embed each pair once, then sweep thresholds over the precomputed similarities
  // (embed() and cosineSimilarity() are the same helpers the cache uses)
  const scored: { similarity: number; shouldMatch: boolean }[] = [];

  for (const pair of testPairs) {
    const embedding1 = await embed(pair.query1);
    const embedding2 = await embed(pair.query2);

    scored.push({
      similarity: cosineSimilarity(embedding1, embedding2),
      shouldMatch: pair.shouldMatch
    });
  }

  const results: { threshold: number; accuracy: number }[] = [];

  for (let i = 80; i <= 98; i++) {
    const threshold = i / 100;  // Integer steps avoid floating-point drift
    const correct = scored.filter(
      s => (s.similarity >= threshold) === s.shouldMatch
    ).length;
    results.push({ threshold, accuracy: correct / testPairs.length });
  }

  // Find threshold with best accuracy
  const best = results.reduce((a, b) => a.accuracy > b.accuracy ? a : b);

  console.log('Threshold analysis:');
  results.forEach(r => {
    console.log(`  ${r.threshold.toFixed(2)}: ${(r.accuracy * 100).toFixed(1)}%`);
  });

  return best;
}

// Test data
const testPairs = [
  { query1: "What's your refund policy?", query2: "How do I get a refund?", shouldMatch: true },
  { query1: "What's your refund policy?", query2: "What's your pricing?", shouldMatch: false },
  { query1: "How do I reset password?", query2: "Reset my password", shouldMatch: true },
  { query1: "How do I reset password?", query2: "Change my email address", shouldMatch: false },
  // Add more pairs...
];

Dynamic thresholds

Adjust thresholds based on query characteristics:

function getDynamicThreshold(query: string, context: any): number {
  const baseThreshold = 0.92;

  // Lower threshold for short, common queries
  if (query.length < 30) {
    return baseThreshold - 0.02;
  }

  // Higher threshold for queries with numbers/specifics
  if (/\d+/.test(query)) {
    return baseThreshold + 0.03;
  }

  // Higher threshold for technical queries
  if (/api|config|error|code/.test(query.toLowerCase())) {
    return baseThreshold + 0.02;
  }

  return baseThreshold;
}

Monitoring and quality

Key metrics

Metric              | Target        | Action if off target
Cache hit rate      | >30%          | Lower threshold, longer TTL
False positive rate | <2%           | Raise threshold
Cache latency       | <50ms         | Optimise vector index
Feedback ratio      | >90% positive | Review threshold, improve invalidation

Monitoring implementation

const cacheMetrics = {
  recordLookup(hit: boolean, latencyMs: number, similarity?: number) {
    metrics.increment('cache.lookups', { hit: String(hit) });
    metrics.histogram('cache.lookup_latency_ms', latencyMs);

    if (hit && similarity) {
      metrics.histogram('cache.hit_similarity', similarity);
    }
  },

  recordFeedback(cacheId: string, positive: boolean) {
    metrics.increment('cache.feedback', { positive: String(positive) });

    // Update cache entry
    db.query(`
      UPDATE semantic_cache
      SET ${positive ? 'feedback_positive' : 'feedback_negative'} =
          ${positive ? 'feedback_positive' : 'feedback_negative'} + 1
      WHERE id = $1
    `, [cacheId]);
  },

  async getStats(hours: number = 24) {
    const stats = await db.query(`
      SELECT
        COUNT(*) as total_entries,
        AVG(access_count) as avg_access_count,
        SUM(feedback_positive) as total_positive,
        SUM(feedback_negative) as total_negative,
        COUNT(*) FILTER (WHERE access_count > 1) as entries_with_hits
      FROM semantic_cache
      WHERE created_at > NOW() - INTERVAL '${hours} hours'
    `);

    return stats[0];
  }
};
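Wiring the lookup metric into the agent is then a small timing wrapper around the cache call; a sketch of how the respond flow above might record it:

// Inside CachedAgent.respond: time the lookup and record hit/miss plus similarity
const start = Date.now();
const cached = await this.cache.lookup(query, { org_id: context.orgId });
cacheMetrics.recordLookup(cached !== null, Date.now() - start, cached?.similarity);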

FAQs

What embedding model should I use?

text-embedding-3-small offers the best balance of quality and cost for most cases. Use text-embedding-3-large if you need finer semantic distinctions.

How do I handle multi-turn conversations?

Include conversation context in the cache key (context hash). Or disable caching for multi-turn - the context changes too much.
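If you do cache multi-turn traffic, one option is to fold a short window of recent turns into the context hash so a hit only counts when the conversation state matches. A sketch along the lines of hashContext above (the recentTurns argument is hypothetical):

import { createHash } from 'crypto';

// Scope the cache entry to the last few turns as well as the org
function conversationContextHash(orgId: string, recentTurns: string[]): string {
  const window = recentTurns.slice(-3).join('|');  // Only the last three turns
  return createHash('md5').update(`${orgId}:${window}`).digest('hex');
}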

Can I cache streaming responses?

Yes, but store the complete response. On cache hit, stream from cache to maintain UX consistency.
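A minimal way to do that is to replay the cached text in small chunks so the cached and uncached paths feel the same to the user; a sketch:

// Replay a cached response as a stream of word-sized chunks
async function* streamCached(responseText: string, delayMs = 15) {
  for (const word of responseText.split(' ')) {
    yield word + ' ';
    await new Promise(resolve => setTimeout(resolve, delayMs));
  }
}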

What about different users needing different answers?

Use context-scoped caching. Include user segments, roles, or other differentiating factors in the context hash.
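With the cache manager above, that means widening contextKeys and passing those fields on every lookup() and store() call; for example (the plan and role fields are hypothetical):

// Entries are now scoped per org, plan, and role via the context hash
// (pass plan and role in the context argument to lookup() and store())
const cache = new SemanticCache({
  similarityThreshold: 0.92,
  contextKeys: ['org_id', 'plan', 'role'],
  embeddingModel: 'text-embedding-3-small'
});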

How big should my cache be?

Start with 100,000 entries per org. Monitor hit rates - if they plateau, you have enough coverage. Clean up low-access entries regularly.
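Cleanup of low-access entries can reuse the invalidator pattern from earlier; a sketch against the schema above (the one-hit, seven-day cutoffs are arbitrary starting points):

// Drop entries that were only ever hit once and have sat unused for a week
async function cleanupLowAccess(): Promise<number> {
  const result = await db.query(`
    DELETE FROM semantic_cache
    WHERE access_count <= 1
      AND last_accessed_at < NOW() - INTERVAL '7 days'
    RETURNING id
  `);
  return result.length;
}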

Summary and next steps

Semantic caching delivers immediate cost and latency improvements for query patterns that repeat semantically. The key is tuning thresholds for your specific accuracy requirements and building robust invalidation.

Implementation checklist:

  1. Set up vector store with embedding index
  2. Build cache manager with lookup and storage
  3. Configure similarity threshold for your use case
  4. Implement invalidation strategies
  5. Add monitoring for hit rate and quality
  6. Collect feedback to improve thresholds

Quick wins:

  • Start with high threshold (0.94) and lower as you gain confidence
  • Focus caching on high-volume, stable-answer query patterns
  • Implement time-based expiry before more complex invalidation
