TL;DR
- Traditional caching only matches exact queries. Semantic caching matches by meaning - "What's the weather?" hits the same cache as "How's the weather today?"
- Use embedding similarity to find cached responses for semantically similar queries.
- Set similarity thresholds carefully: too low = wrong answers from cache; too high = poor hit rate.
- Cache invalidation based on time, context changes, and feedback ensures freshness.
Jump to: Why semantic caching · Architecture · Implementation · Threshold tuning
Semantic Caching for AI Agents: Reduce Costs and Latency by 40%
Your users ask the same questions in different ways. "What's our refund policy?" and "How do I get a refund?" and "Can I return this?" all need the same answer. Traditional caching misses all of these because the strings don't match exactly. You pay for an LLM call every time.
Semantic caching solves this by matching query intent, not query text. When a new query arrives, you check if you've answered a semantically similar query recently. If so, return the cached response instantly. No LLM call needed.
At Athenic, semantic caching reduced our support agent costs by 38% and improved p50 latency from 1,200ms to 180ms for cached queries. This guide shows you how to build it.
Key takeaways
- Semantic caching uses embeddings to find similar queries, enabling cache hits for paraphrased questions.
- Similarity threshold is critical: 0.92+ is typically safe for factual queries; lower thresholds need human review.
- Context matters: "What's your refund policy?" for user A might need a different answer than it does for user B.
- Cache invalidation is as important as cache hits - stale cached answers damage trust.
Why semantic caching
Consider a customer support agent fielding these queries in one hour:
"What's your refund policy?"
"How do I get my money back?"
"Can I return this product?"
"What's the process for refunds?"
"I want a refund"
"Refund policy please"
"How does returns work?"
All seven queries want the same information. Without semantic caching, you make seven LLM calls. With it, you make one call and serve six from cache.
The economics
| Metric | Without cache | With semantic cache | Improvement |
|---|---|---|---|
| LLM calls/hour | 1,000 | 620 | -38% |
| Avg latency | 1,200ms | 580ms | -52% |
| Cost/hour | $2.50 | $1.55 | -38% |
The improvements scale with hit rate: high-volume support scenarios might see 50-60% cache hit rates for their most common questions.
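The cost column follows directly from the hit rate: every cache hit replaces an LLM call with a comparatively negligible embedding lookup. A minimal sketch of that arithmetic, using the per-call cost implied by the table above:
// Expected hourly LLM spend as a function of cache hit rate.
// $2.50 for 1,000 uncached calls implies roughly $0.0025 per call (from the table above).
function expectedHourlyCost(
  queriesPerHour: number,
  hitRate: number,
  costPerCall = 0.0025
): number {
  const llmCalls = queriesPerHour * (1 - hitRate); // only misses reach the LLM
  return llmCalls * costPerCall;                   // embedding cost is negligible by comparison
}

// A 38% hit rate reproduces the table: 620 LLM calls, $1.55/hour.
console.log(expectedHourlyCost(1000, 0.38)); // 1.55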
When semantic caching works
Good candidates:
- FAQ-style queries with stable answers
- Knowledge base lookups
- Standard explanations and definitions
- Status checks with predictable outputs
Poor candidates:
- Personalised recommendations (user-specific context)
- Time-sensitive information (stock prices, current events)
- Creative tasks (writing, brainstorming)
- Multi-turn conversations (context changes each turn; see the gating sketch below)
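One way to apply these rules in practice is a cacheability gate that runs before any cache lookup. A minimal sketch, assuming you already tag incoming queries with a category upstream (the category names here are hypothetical):
// Hypothetical categories assigned by an upstream router or classifier.
type QueryCategory =
  | 'faq'
  | 'knowledge_base'
  | 'status'
  | 'personalised'
  | 'time_sensitive'
  | 'creative'
  | 'multi_turn';

// Only stable-answer categories go through the semantic cache; everything else always hits the LLM.
function isCacheable(category: QueryCategory): boolean {
  return category === 'faq' || category === 'knowledge_base' || category === 'status';
}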
Architecture
Semantic caching adds a lookup layer before LLM calls.
User query
↓
[Generate embedding]
↓
[Search cache by similarity]
↓
┌─ Cache hit (similarity > threshold)
│ ↓
│ [Return cached response]
│
└─ Cache miss
↓
[Call LLM]
↓
[Store in cache with embedding]
↓
[Return response]
Components
Embedding service: Converts queries to vectors. Use the same model for storage and lookup - embeddings from different models are not comparable.
Vector store: Holds cached responses with their query embeddings. Needs fast similarity search.
Cache manager: Handles lookups, storage, invalidation, and threshold decisions.
Schema design
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE semantic_cache (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
-- Query representation
query_text TEXT NOT NULL,
query_embedding vector(1536),
-- Response
response_text TEXT NOT NULL,
-- Context keys (for scoped caching)
org_id TEXT,
user_id TEXT,
context_hash TEXT, -- Hash of relevant context
-- Metadata
model_id TEXT NOT NULL,
token_count INTEGER,
created_at TIMESTAMPTZ DEFAULT NOW(),
last_accessed_at TIMESTAMPTZ DEFAULT NOW(),
access_count INTEGER DEFAULT 1,
expires_at TIMESTAMPTZ,
-- Quality tracking
feedback_positive INTEGER DEFAULT 0,
feedback_negative INTEGER DEFAULT 0
);
-- Vector similarity index
CREATE INDEX ON semantic_cache
USING ivfflat (query_embedding vector_cosine_ops)
WITH (lists = 100);
-- Scoped lookup index
CREATE INDEX ON semantic_cache (org_id, context_hash);
-- Expiry cleanup index
CREATE INDEX ON semantic_cache (expires_at) WHERE expires_at IS NOT NULL;
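One tuning note on the ivfflat index: recall at query time is governed by the ivfflat.probes setting, which defaults to 1. Raising it improves hit quality at some latency cost. A sketch, assuming the db.query helper used in the implementation below (the value 10 is an assumption to tune against your own measurements):
// Probe more lists per lookup for better recall; measure the latency impact before settling on a value.
await db.query('SET ivfflat.probes = 10');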
Implementation
Step 1: Cache manager
import crypto from 'node:crypto';
import OpenAI from 'openai';
// `db` is assumed to be your SQL helper (e.g. a thin wrapper over pg's Pool.query)
interface CacheConfig {
similarityThreshold: number; // 0.0 to 1.0
maxAge: number; // Milliseconds
contextKeys: string[]; // Which context fields affect cache scope
embeddingModel: string;
}
interface CacheEntry {
id: string;
queryText: string;
responseText: string;
similarity: number;
createdAt: Date;
accessCount: number;
}
class SemanticCache {
private config: CacheConfig;
private openai: OpenAI;
constructor(config: Partial<CacheConfig> = {}) {
this.config = {
similarityThreshold: 0.92,
maxAge: 3600000, // 1 hour
contextKeys: ['org_id'],
embeddingModel: 'text-embedding-3-small',
...config
};
this.openai = new OpenAI();
}
async lookup(
query: string,
context: Record<string, string>
): Promise<CacheEntry | null> {
// Generate embedding for query
const embedding = await this.embed(query);
// Build context hash for scoped lookup
const contextHash = this.hashContext(context);
// Search for similar cached queries
const results = await db.query(`
SELECT
id,
query_text,
response_text,
1 - (query_embedding <=> $1) AS similarity,
created_at,
access_count
FROM semantic_cache
WHERE org_id = $2
AND (context_hash = $3 OR context_hash IS NULL)
AND (expires_at IS NULL OR expires_at > NOW())
AND 1 - (query_embedding <=> $1) > $4
ORDER BY query_embedding <=> $1
LIMIT 1
`, [
JSON.stringify(embedding), // pgvector accepts the '[x,y,...]' text format
context.org_id,
contextHash,
this.config.similarityThreshold
]);
if (results.length === 0) {
return null;
}
const hit = results[0];
// Update access tracking
await this.recordAccess(hit.id);
return {
id: hit.id,
queryText: hit.query_text,
responseText: hit.response_text,
similarity: hit.similarity,
createdAt: hit.created_at,
accessCount: hit.access_count
};
}
async store(
query: string,
response: string,
context: Record<string, string>,
metadata: { modelId: string; tokenCount: number }
): Promise<string> {
const embedding = await this.embed(query);
const contextHash = this.hashContext(context);
const result = await db.query(`
INSERT INTO semantic_cache (
query_text,
query_embedding,
response_text,
org_id,
context_hash,
model_id,
token_count,
expires_at
) VALUES ($1, $2, $3, $4, $5, $6, $7, $8)
ON CONFLICT DO NOTHING
RETURNING id
`, [
query,
JSON.stringify(embedding), // serialise as '[x,y,...]' text for the vector column
response,
context.org_id,
contextHash,
metadata.modelId,
metadata.tokenCount,
new Date(Date.now() + this.config.maxAge)
]);
return result[0]?.id;
}
private async embed(text: string): Promise<number[]> {
const response = await this.openai.embeddings.create({
model: this.config.embeddingModel,
input: text
});
return response.data[0].embedding;
}
private hashContext(context: Record<string, string>): string {
const relevant = this.config.contextKeys
.map(key => context[key])
.filter(Boolean)
.join(':');
return crypto.createHash('md5').update(relevant).digest('hex');
}
private async recordAccess(cacheId: string): Promise<void> {
await db.query(`
UPDATE semantic_cache
SET
last_accessed_at = NOW(),
access_count = access_count + 1
WHERE id = $1
`, [cacheId]);
}
}
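A quick usage sketch of the class above, inside an async context (the org id, query, and canned response are illustrative):
const cache = new SemanticCache({ similarityThreshold: 0.93 });
const context = { org_id: 'org_123' };

const hit = await cache.lookup("What's your refund policy?", context);
if (hit) {
  console.log(`cache hit (similarity ${hit.similarity.toFixed(3)})`);
} else {
  // ...call the LLM, then store its answer for next time
  await cache.store(
    "What's your refund policy?",
    'Refunds are available within 30 days of purchase.',
    context,
    { modelId: 'gpt-4o', tokenCount: 120 }
  );
}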
Step 2: Integration with agent
class CachedAgent {
private cache: SemanticCache;
private llm: OpenAI;
private systemPrompt: string;
// ExecutionContext is assumed to expose at least orgId and userId
constructor(cache: SemanticCache, llm: OpenAI, systemPrompt: string) {
this.cache = cache;
this.llm = llm;
this.systemPrompt = systemPrompt;
}
async respond(
query: string,
context: ExecutionContext
): Promise<{ response: string; fromCache: boolean }> {
// Check cache first
const cached = await this.cache.lookup(query, {
org_id: context.orgId,
user_id: context.userId
});
if (cached) {
console.log(`Cache hit (similarity: ${cached.similarity.toFixed(3)})`);
return { response: cached.responseText, fromCache: true };
}
// Cache miss - call LLM
const response = await this.llm.chat.completions.create({
model: 'gpt-4o',
messages: [
{ role: 'system', content: this.systemPrompt },
{ role: 'user', content: query }
]
});
const responseText = response.choices[0].message.content ?? '';
// Store in cache
await this.cache.store(query, responseText, {
org_id: context.orgId,
user_id: context.userId
}, {
modelId: 'gpt-4o',
tokenCount: response.usage?.total_tokens ?? 0
});
return { response: responseText, fromCache: false };
}
}
Step 3: Cache invalidation
class CacheInvalidator {
// Time-based expiry
async cleanupExpired(): Promise<number> {
const result = await db.query(`
DELETE FROM semantic_cache
WHERE expires_at < NOW()
RETURNING id
`);
return result.length;
}
// Content-based invalidation
async invalidateByContent(pattern: string): Promise<number> {
// When source content changes, invalidate related cache entries
const result = await db.query(`
DELETE FROM semantic_cache
WHERE response_text ILIKE $1
RETURNING id
`, [`%${pattern}%`]);
return result.length;
}
// Semantic invalidation - invalidate entries similar to a query
async invalidateSimilar(
query: string,
orgId: string,
threshold: number = 0.85
): Promise<number> {
// this.embed is assumed to be the same embedding helper used by SemanticCache
const embedding = await this.embed(query);
const result = await db.query(`
DELETE FROM semantic_cache
WHERE org_id = $1
AND 1 - (query_embedding <=> $2) > $3
RETURNING id
`, [orgId, JSON.stringify(embedding), threshold]);
return result.length;
}
// Feedback-based invalidation
async invalidatePoorQuality(): Promise<number> {
// Remove entries with bad feedback ratio
const result = await db.query(`
DELETE FROM semantic_cache
WHERE feedback_positive + feedback_negative > 5
AND feedback_negative::float / (feedback_positive + feedback_negative) > 0.3
RETURNING id
`);
return result.length;
}
// Org-wide invalidation (e.g., when knowledge base updates)
async invalidateOrg(orgId: string): Promise<number> {
const result = await db.query(`
DELETE FROM semantic_cache
WHERE org_id = $1
RETURNING id
`, [orgId]);
return result.length;
}
}
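Invalidation only helps if it actually runs. A minimal scheduling sketch for a long-lived worker process (the intervals are assumptions to tune for your traffic):
const invalidator = new CacheInvalidator();

// Expire old entries hourly
setInterval(async () => {
  const removed = await invalidator.cleanupExpired();
  console.log(`cache maintenance: removed ${removed} expired entries`);
}, 60 * 60 * 1000);

// Prune poorly rated entries daily
setInterval(async () => {
  const removed = await invalidator.invalidatePoorQuality();
  console.log(`cache maintenance: removed ${removed} poorly rated entries`);
}, 24 * 60 * 60 * 1000);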
Threshold tuning
The similarity threshold determines cache behaviour. Too high = few hits. Too low = wrong answers.
Guidelines by use case
| Use case | Recommended threshold | Rationale |
|---|---|---|
| FAQ lookups | 0.90-0.92 | Stable answers, high hit rate valuable |
| Knowledge base | 0.92-0.94 | Need accuracy, some flexibility okay |
| Technical docs | 0.94-0.96 | Precision matters, minor variations change meaning |
| Legal/compliance | 0.96-0.98 | Exactness critical, cache hits less important |
Empirical tuning
Run experiments with labelled query pairs to find the optimal threshold:
async function findOptimalThreshold(
testPairs: { query1: string; query2: string; shouldMatch: boolean }[]
): Promise<{ threshold: number; accuracy: number }> {
// Embed each pair once up front - re-embedding inside the threshold loop would
// multiply embedding API calls by the number of thresholds tested
const scored = await Promise.all(testPairs.map(async pair => {
const [e1, e2] = await Promise.all([embed(pair.query1), embed(pair.query2)]);
return { similarity: cosineSimilarity(e1, e2), shouldMatch: pair.shouldMatch };
}));
const results: { threshold: number; accuracy: number }[] = [];
// Integer loop avoids floating-point drift in the threshold values
for (let t = 80; t <= 98; t++) {
const threshold = t / 100;
const correct = scored.filter(s => (s.similarity >= threshold) === s.shouldMatch).length;
results.push({ threshold, accuracy: correct / scored.length });
}
console.log('Threshold analysis:');
results.forEach(r => {
console.log(` ${r.threshold.toFixed(2)}: ${(r.accuracy * 100).toFixed(1)}%`);
});
// Return the threshold with the best accuracy
return results.reduce((a, b) => a.accuracy > b.accuracy ? a : b);
}
// Test data
const testPairs = [
{ query1: "What's your refund policy?", query2: "How do I get a refund?", shouldMatch: true },
{ query1: "What's your refund policy?", query2: "What's your pricing?", shouldMatch: false },
{ query1: "How do I reset password?", query2: "Reset my password", shouldMatch: true },
{ query1: "How do I reset password?", query2: "Change my email address", shouldMatch: false },
// Add more pairs...
];
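findOptimalThreshold relies on two helpers that are not shown above: embed and cosineSimilarity. A minimal sketch of both, assuming the same OpenAI embedding model used elsewhere in this guide:
import OpenAI from 'openai';

const openai = new OpenAI();

async function embed(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: text
  });
  return response.data[0].embedding;
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}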
Dynamic thresholds
Adjust thresholds based on query characteristics:
function getDynamicThreshold(query: string): number {
const baseThreshold = 0.92;
// Lower threshold for short, common queries
if (query.length < 30) {
return baseThreshold - 0.02;
}
// Higher threshold for queries with numbers/specifics
if (/\d+/.test(query)) {
return baseThreshold + 0.03;
}
// Higher threshold for technical queries
if (/api|config|error|code/.test(query.toLowerCase())) {
return baseThreshold + 0.02;
}
return baseThreshold;
}
Monitoring and quality
Key metrics
| Metric | Target | Action if below |
|---|---|---|
| Cache hit rate | >30% | Lower threshold, longer TTL |
| False positive rate | <2% | Raise threshold |
| Cache latency | <50ms | Optimise vector index |
| Feedback ratio | >90% positive | Review threshold, improve invalidation |
Monitoring implementation
// `metrics` is assumed to be a StatsD/Datadog-style client exposing increment() and histogram()
const cacheMetrics = {
recordLookup(hit: boolean, latencyMs: number, similarity?: number) {
metrics.increment('cache.lookups', { hit: String(hit) });
metrics.histogram('cache.lookup_latency_ms', latencyMs);
if (hit && similarity !== undefined) {
metrics.histogram('cache.hit_similarity', similarity);
}
},
recordFeedback(cacheId: string, positive: boolean) {
metrics.increment('cache.feedback', { positive: String(positive) });
// Update cache entry
db.query(`
UPDATE semantic_cache
SET ${positive ? 'feedback_positive' : 'feedback_negative'} =
${positive ? 'feedback_positive' : 'feedback_negative'} + 1
WHERE id = $1
`, [cacheId]);
},
async getStats(hours: number = 24) {
const stats = await db.query(`
SELECT
COUNT(*) as total_entries,
AVG(access_count) as avg_access_count,
SUM(feedback_positive) as total_positive,
SUM(feedback_negative) as total_negative,
COUNT(*) FILTER (WHERE access_count > 1) as entries_with_hits
FROM semantic_cache
WHERE created_at > NOW() - make_interval(hours => $1)
`, [hours]);
return stats[0];
}
};
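To populate these metrics, time the cache lookup in the agent. A sketch of how the respond method from Step 2 might call recordLookup (assuming cacheMetrics is importable where the agent lives):
// Inside CachedAgent.respond, wrapping the cache lookup:
const start = Date.now();
const cached = await this.cache.lookup(query, {
  org_id: context.orgId,
  user_id: context.userId
});
cacheMetrics.recordLookup(Boolean(cached), Date.now() - start, cached?.similarity);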
FAQs
What embedding model should I use?
text-embedding-3-small offers the best balance of quality and cost for most cases. Use text-embedding-3-large if you need finer semantic distinctions.
How do I handle multi-turn conversations?
Include conversation context in the cache key (context hash). Or disable caching for multi-turn - the context changes too much.
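A sketch of the first option: hash a rolling window of recent turns into the context, so a cache entry only matches when the conversation state is the same (the three-turn window is an assumption):
import crypto from 'node:crypto';

// Hash the last few turns so cached answers are scoped to the same conversation state.
function conversationContextHash(
  turns: { role: string; content: string }[],
  window = 3
): string {
  const recent = turns.slice(-window).map(t => `${t.role}:${t.content}`).join('|');
  return crypto.createHash('md5').update(recent).digest('hex');
}

// Add 'conversation' to contextKeys so hashContext() picks it up:
// cache.lookup(query, { org_id: orgId, conversation: conversationContextHash(turns) })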
Can I cache streaming responses?
Yes, but store the complete response. On cache hit, stream from cache to maintain UX consistency.
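A minimal sketch of replaying a cached response as a stream so hits and misses look the same to the client (the chunk size is arbitrary):
// Yield the cached text in small chunks to mimic a streamed LLM response.
async function* streamFromCache(responseText: string, chunkSize = 20) {
  for (let i = 0; i < responseText.length; i += chunkSize) {
    yield responseText.slice(i, i + chunkSize);
  }
}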
What about different users needing different answers?
Use context-scoped caching. Include user segments, roles, or other differentiating factors in the context hash.
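With the implementation above, that just means widening contextKeys when you construct the cache. For example (the role field is an assumed part of your user context):
// Scope entries by organisation and role so admins and end users can get different cached answers.
const cache = new SemanticCache({
  contextKeys: ['org_id', 'role']
});

await cache.lookup("What's your refund policy?", { org_id: 'org_123', role: 'admin' });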
How big should my cache be?
Start with 100,000 entries per org. Monitor hit rates - if they plateau, you have enough coverage. Clean up low-access entries regularly.
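A sketch of that regular clean-up, assuming db.query as elsewhere (the one-week age and single-access cut-offs are assumptions to tune):
// Prune entries that are over a week old and were never hit after being stored.
async function pruneLowAccessEntries(): Promise<number> {
  const result = await db.query(`
    DELETE FROM semantic_cache
    WHERE created_at < NOW() - INTERVAL '7 days'
      AND access_count <= 1
    RETURNING id
  `);
  return result.length;
}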
Summary and next steps
Semantic caching delivers immediate cost and latency improvements for query patterns that repeat semantically. The key is tuning thresholds for your specific accuracy requirements and building robust invalidation.
Implementation checklist:
- Set up vector store with embedding index
- Build cache manager with lookup and storage
- Configure similarity threshold for your use case
- Implement invalidation strategies
- Add monitoring for hit rate and quality
- Collect feedback to improve thresholds
Quick wins:
- Start with high threshold (0.94) and lower as you gain confidence
- Focus caching on high-volume, stable-answer query patterns
- Implement time-based expiry before more complex invalidation