Academy · 22 Sept 2025 · 14 min read

Prompt Caching for AI Agents: Cut LLM Costs by 60% Without Sacrificing Quality

Implement prompt caching strategies that reduce AI agent operating costs by 60% or more - with production patterns for cache invalidation, prefix optimisation, and provider-specific implementations.

Max Beech
Head of Content

TL;DR

  • Prompt caching reduces costs by reusing processed prompt prefixes across requests - providers charge less for cached tokens.
  • Structure prompts with stable content first (system prompt, context docs) and variable content last (user query).
  • Anthropic offers a 90% discount on cached tokens; OpenAI offers 50% - both require a minimum prefix length of 1,024 tokens.
  • Cache invalidation matters: stale context causes worse outputs than no caching at all.

Jump to: Why prompt caching matters · How caching works · Implementation patterns · Provider comparison


AI agent costs compound fast. A single agent call might cost £0.02, but run that agent 10,000 times daily and you're burning £200/day - £6,000/month - on LLM inference alone. Scale to multiple agents and enterprise workloads, and costs become the primary constraint on what you can build.

Prompt caching offers the largest single cost reduction available to AI engineers. By structuring prompts so providers can reuse processed prefixes, you can cut per-request costs by 50-90% depending on provider and prompt structure.

This isn't theoretical optimisation. At Athenic, prompt caching reduced our research agent costs from £4,200/month to £1,680/month - 60% savings - with zero impact on output quality.

Key takeaways

  • Caching works by reusing KV (key-value) cache from previous requests with identical prefixes.
  • Savings scale with prefix length: longer stable prefixes = more cached tokens = bigger discounts.
  • All major providers now support caching, but implementation details differ significantly.
  • Getting cache invalidation wrong is worse than no caching - stale context produces wrong answers confidently.

Why prompt caching matters

To understand caching value, look at typical agent prompt composition:

Research agent prompt breakdown:

  • System prompt: 1,200 tokens (stable)
  • Knowledge context: 8,000 tokens (semi-stable, updates daily)
  • Conversation history: 2,000 tokens (changes each turn)
  • User query: 50 tokens (unique each request)

Total: 11,250 tokens per request. But 9,200 tokens (82%) are identical or nearly identical across requests. Without caching, you pay full price for all 11,250 tokens every time. With caching, you pay full price once, then discounted rates for subsequent requests.

Cost comparison (GPT-4o pricing):

| Scenario | Input tokens | Input cost | Requests/day | Daily cost |
| --- | --- | --- | --- | --- |
| No caching | 11,250 | $0.028 | 1,000 | $28.00 |
| 50% cached | 11,250 (9,200 cached) | $0.016 | 1,000 | $16.00 |
| 90% cached (Anthropic) | 11,250 (9,200 cached) | $0.007 | 1,000 | $7.00 |

The maths is straightforward: if you're running agents at scale and not using prompt caching, you're overpaying by 40-75%.
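
As a sanity check, you can reproduce these numbers by computing a blended per-request cost from the cached fraction and the discount. A minimal TypeScript sketch (the price and discount values are illustrative, not live pricing):

// Rough per-request input cost under prompt caching.
// Assumes a flat input price and a single cached-prefix discount;
// real pricing varies by provider and model.
function estimateRequestCost(
  totalTokens: number,
  cachedTokens: number,
  pricePerMTokens: number,  // e.g. 2.50 (USD per 1M input tokens for GPT-4o)
  cacheDiscount: number     // e.g. 0.5 for 50% off, 0.9 for 90% off
): number {
  const uncachedTokens = totalTokens - cachedTokens;
  const cachedCost = (cachedTokens / 1_000_000) * pricePerMTokens * (1 - cacheDiscount);
  const uncachedCost = (uncachedTokens / 1_000_000) * pricePerMTokens;
  return cachedCost + uncachedCost;
}

// 11,250-token prompt with a 9,200-token cached prefix at a 50% discount
console.log(estimateRequestCost(11_250, 9_200, 2.5, 0.5).toFixed(4)); // ≈ 0.0166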

How prompt caching works

LLM providers process prompts through transformer attention layers, building internal representations (the KV cache) for every token. This prefill computation is expensive and scales with prompt length.

When you send the same prompt prefix repeatedly, providers can skip recomputation if they've cached the KV values. They pass savings to you through reduced pricing.

The prefix matching requirement

Caching only works for exact prefix matches. The cached portion must be:

  1. Byte-for-byte identical to a previous request
  2. At the start of the prompt (not middle or end)
  3. Above the provider's minimum length threshold

This works (cache hit):

Request 1: [System prompt] [Context docs] [Query A]
Request 2: [System prompt] [Context docs] [Query B]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           Identical prefix - CACHED

This doesn't work (cache miss):

Request 1: [System prompt] [Context docs v1] [Query]
Request 2: [System prompt] [Context docs v2] [Query]
           ^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^
           Identical         Different - breaks cache

Even a single whitespace change breaks the cache. Prompt hygiene matters.
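
One way to enforce that hygiene is to normalise every prefix segment before assembling the prompt, so incidental whitespace differences never reach the provider. A minimal sketch, reusing the systemPrompt, contextDocuments, and userQuery variables used throughout this article's examples (the normalisation rules are an assumption - adapt them to your own templates):

// Normalise prompt segments so the assembled prefix is byte-for-byte stable:
// strip trailing whitespace on each line, trim each segment, join consistently.
function buildStablePrefix(segments: string[]): string {
  return segments
    .map(s => s.replace(/[ \t]+$/gm, '').trim())
    .filter(s => s.length > 0)
    .join('\n\n');
}

const prefix = buildStablePrefix([systemPrompt, contextDocuments]);
const prompt = `${prefix}\n\nQuery: ${userQuery}`;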

Minimum cache thresholds

Providers require minimum prefix lengths before caching activates:

| Provider | Min tokens for caching | Cache discount |
| --- | --- | --- |
| Anthropic Claude | 1,024 tokens | 90% off input tokens |
| OpenAI | 1,024 tokens | 50% off input tokens |
| Google Gemini | 32,000 tokens | 75% off (context caching) |

Anthropic and OpenAI's thresholds are reasonable for most agent prompts. Google's 32K minimum makes caching impractical for shorter interactions but valuable for massive context windows.
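
Before adding cache markup, check that your stable prefix actually clears the provider minimum. A rough sketch using the common ~4-characters-per-token heuristic (an approximation - use the provider's tokeniser for exact counts):

// Very rough token estimate (~4 characters per token for English prose).
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function meetsCacheMinimum(prefix: string, minTokens: number): boolean {
  return estimateTokens(prefix) >= minTokens;
}

// Only worth marking as cacheable if it clears the 1,024-token floor
const cacheable = meetsCacheMinimum(`${systemPrompt}\n\n${contextDocuments}`, 1024);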

Implementation patterns

Moving from "caching exists" to "caching works in production" requires intentional prompt design.

Pattern 1: Static prefix ordering

Restructure prompts so stable content comes first, variable content comes last.

Before (not cache-friendly):

const prompt = `
User query: ${userQuery}

You are a research assistant. Use these sources:
${contextDocuments}

System rules:
${systemPrompt}
`;

After (cache-friendly):

const prompt = `
${systemPrompt}

Reference documents:
${contextDocuments}

User query: ${userQuery}
`;

Same information, different ordering. The second version caches the system prompt and documents, only paying full price for the user query.

Pattern 2: Context layering

When context varies in stability, layer it appropriately:

function buildCacheOptimisedPrompt(
  systemPrompt: string,        // Never changes
  orgContext: string,          // Changes monthly
  sessionContext: string,      // Changes hourly
  conversationHistory: string, // Changes each turn
  userQuery: string            // Unique each request
): string {
  return `${systemPrompt}

## Organisation Context
${orgContext}

## Session Context
${sessionContext}

## Conversation
${conversationHistory}

## Current Query
${userQuery}`;
}

This structure maximises cache hits:

  • Users from the same org share system prompt + org context cache
  • Users in the same session additionally share session context cache
  • Only conversation history and query vary per request
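
For example, two users in the same organisation and session differ only in conversation history and query, so everything up to the session context is served from cache on the second request (historyA and historyB are hypothetical per-user histories):

const promptA = buildCacheOptimisedPrompt(
  systemPrompt, orgContext, sessionContext, historyA, 'Summarise the Q3 pipeline'
);
const promptB = buildCacheOptimisedPrompt(
  systemPrompt, orgContext, sessionContext, historyB, 'List open renewal risks'
);
// Identical prefix through sessionContext -> cache hit on the shared portion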

Pattern 3: Document chunk ordering

For RAG systems, order retrieved documents by staleness:

// Assumes systemPrompt is defined in the surrounding scope.
// Minimal document shape and staleness helper for this example:
interface Document {
  content: string;
  updatedAt: Date;
}

function daysSince(date: Date): number {
  return (Date.now() - date.getTime()) / (1000 * 60 * 60 * 24);
}

async function buildRAGPrompt(
  query: string,
  retrievedDocs: Document[]
): Promise<string> {
  // Sort documents oldest first (most likely to already be cached)
  const sortedDocs = [...retrievedDocs].sort(
    (a, b) => a.updatedAt.getTime() - b.updatedAt.getTime()
  );

  // Stable documents first, recently updated ones last
  const stableDocs = sortedDocs.filter(
    d => daysSince(d.updatedAt) > 7
  );
  const recentDocs = sortedDocs.filter(
    d => daysSince(d.updatedAt) <= 7
  );

  return `${systemPrompt}

## Reference Documents (Stable)
${stableDocs.map(d => d.content).join('\n\n')}

## Reference Documents (Recent)
${recentDocs.map(d => d.content).join('\n\n')}

Query: ${query}`;
}

Pattern 4: Cache warming

For predictable workloads, pre-warm caches before peak usage:

// Assumes an AgentConfig type, an llm client, a buildPrompt helper, an agents
// list, and a cron-style schedule() function (e.g. node-cron) are in scope.
async function warmCache(agentConfig: AgentConfig) {
  const warmupPrompt = buildPrompt({
    systemPrompt: agentConfig.systemPrompt,
    context: agentConfig.defaultContext,
    query: "This is a cache warming request. Respond briefly."
  });

  await llm.complete({
    prompt: warmupPrompt,
    maxTokens: 10  // Minimal output to keep the warm-up cheap
  });

  console.log('Cache warmed for agent:', agentConfig.name);
}

// Run before business hours (08:00 daily)
schedule('0 8 * * *', async () => {
  for (const agent of agents) {
    await warmCache(agent);  // sequential, so each warm-up completes
  }
});

Provider comparison

Each major provider implements caching differently. Here's what you need to know for each.

Anthropic Claude

Implementation: Prompt caching with explicit cache breakpoints (cache_control markers on the blocks you want cached).

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

const response = await client.messages.create({
  model: 'claude-sonnet-4-20250514',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: systemPrompt,
      cache_control: { type: 'ephemeral' }  // Mark as cacheable
    }
  ],
  messages: [
    {
      role: 'user',
      content: [
        {
          type: 'text',
          text: contextDocuments,
          cache_control: { type: 'ephemeral' }
        },
        {
          type: 'text',
          text: userQuery
        }
      ]
    }
  ]
});

Key details:

  • 90% discount on cached input tokens
  • 25% surcharge on cache write (first request)
  • Cache TTL: 5 minutes (ephemeral)
  • Minimum: 1,024 tokens for caching to activate
  • Originally gated behind a beta header (anthropic-beta: prompt-caching-2024-07-31); now generally available without it

Best for: High-volume, repetitive workloads where 90% savings outweigh 25% write penalty.
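
To confirm the breakpoints are working, inspect the usage block on the response - Anthropic reports cache writes and cache reads as separate token counts:

// Cache reads should dominate after the first request; repeated cache
// creations mean the prefix is changing between calls.
console.log({
  cacheWriteTokens: response.usage.cache_creation_input_tokens ?? 0,
  cacheReadTokens: response.usage.cache_read_input_tokens ?? 0,
  uncachedInputTokens: response.usage.input_tokens
});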

OpenAI

Implementation: Automatic caching with no explicit API changes needed.

import OpenAI from 'openai';

const client = new OpenAI();

// Structure prompt for caching (no special API)
const response = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [
    { role: 'system', content: systemPrompt },
    { role: 'user', content: `${contextDocuments}\n\nQuery: ${userQuery}` }
  ]
});

// Check cache usage in response
console.log('Cached tokens:', response.usage?.prompt_tokens_details?.cached_tokens);

Key details:

  • 50% discount on cached input tokens
  • No cache write penalty
  • Automatic caching, no opt-in required
  • Minimum: 1,024 tokens
  • Cache typically persists for 5-10 minutes of inactivity, and is cleared within about an hour

Best for: General use - free to enable, solid 50% savings, no write penalty.

Google Gemini

Implementation: Explicit context caching with longer TTL.

import { GoogleGenerativeAI } from '@google/generative-ai';
import { GoogleAICacheManager } from '@google/generative-ai/server';

const genAI = new GoogleGenerativeAI(process.env.GOOGLE_API_KEY);
const cacheManager = new GoogleAICacheManager(process.env.GOOGLE_API_KEY);

// Create cached context (context caching requires a pinned model version)
const cachedContent = await cacheManager.create({
  model: 'models/gemini-1.5-pro-001',
  displayName: 'research-context',
  systemInstruction: systemPrompt,
  contents: [{ role: 'user', parts: [{ text: contextDocuments }] }],
  ttlSeconds: 3600  // 1 hour TTL
});

// Use the cached context for subsequent requests
const model = genAI.getGenerativeModelFromCachedContent(cachedContent);
const result = await model.generateContent(userQuery);

Key details:

  • 75% discount on cached tokens
  • Storage cost: $1.00 per million tokens per hour
  • Minimum: 32,000 tokens
  • Explicit TTL control (1 hour by default, configurable)

Best for: Very long context (100K+ tokens) with predictable reuse patterns. Not practical for short prompts.

Cache invalidation strategy

Stale caches cause subtle bugs. When your knowledge base updates but cached prompts contain old information, agents confidently provide outdated answers.

Version-based invalidation

Include version identifiers in prompts to force cache refresh when content updates:

interface ContextConfig {
  version: string;
  content: string;
  updatedAt: Date;
}

function buildVersionedPrompt(
  systemPrompt: string,
  contextConfig: ContextConfig,
  query: string
): string {
  return `${systemPrompt}

## Context (v${contextConfig.version}, updated ${contextConfig.updatedAt.toISOString()})
${contextConfig.content}

Query: ${query}`;
}

// When context updates, increment version
async function updateContext(newContent: string) {
  const currentVersion = await getContextVersion();
  const newVersion = incrementVersion(currentVersion);

  await saveContext({
    version: newVersion,
    content: newContent,
    updatedAt: new Date()
  });

  // Version change breaks cache automatically
}

TTL-based invalidation

For time-sensitive content, enforce maximum cache age:

class CacheAwarePromptBuilder {
  private lastCacheRefresh: Date = new Date(0);
  private maxCacheAgeMinutes: number = 30;

  constructor(
    private systemPrompt: string,
    private context: string
  ) {}

  buildPrompt(query: string): string {
    const now = new Date();
    const cacheAgeMinutes = (now.getTime() - this.lastCacheRefresh.getTime()) / 60_000;

    // Append a cache-buster once the cached prefix is too old, forcing the
    // provider to recompute (and re-cache) everything after the system prompt
    const cacheBuster = cacheAgeMinutes > this.maxCacheAgeMinutes
      ? `\n<!-- cache-refresh: ${now.toISOString()} -->`
      : '';

    if (cacheBuster) {
      this.lastCacheRefresh = now;
    }

    return `${this.systemPrompt}${cacheBuster}

${this.context}

Query: ${query}`;
  }
}

Event-driven invalidation

For critical updates, invalidate immediately:

// When product pricing changes, invalidate sales agent cache
eventBus.on('pricing.updated', async () => {
  // Force next request to miss cache by updating context
  await refreshSalesAgentContext();

  // Optionally pre-warm new cache
  await warmCache(salesAgentConfig);
});

Monitoring and optimisation

Track cache performance to identify optimisation opportunities.

Key metrics

| Metric | Target | Action if below target |
| --- | --- | --- |
| Cache hit rate | >70% | Reorganise prompt structure |
| Cache savings rate | >50% | Increase stable prefix length |
| Average cached tokens | >5,000 | Add more stable context |
| Cache write frequency | <10% of requests | Improve prefix stability |

Implementation

interface CacheMetrics {
  totalRequests: number;
  cacheHits: number;
  cachedTokens: number;
  uncachedTokens: number;
  estimatedSavings: number;
}

// Minimal provider-agnostic response shape assumed for this example
interface LLMResponse {
  usage?: { inputTokens: number; cachedTokens?: number };
}

class CacheMonitor {
  private metrics: CacheMetrics = {
    totalRequests: 0,
    cacheHits: 0,
    cachedTokens: 0,
    uncachedTokens: 0,
    estimatedSavings: 0
  };

  recordRequest(response: LLMResponse) {
    this.metrics.totalRequests++;

    const cached = response.usage?.cachedTokens ?? 0;
    const uncached = (response.usage?.inputTokens ?? 0) - cached;

    if (cached > 0) {
      this.metrics.cacheHits++;
      this.metrics.cachedTokens += cached;
    }
    this.metrics.uncachedTokens += uncached;

    // Estimate savings at a 50% discount on cached input tokens
    const savingsPerToken = 0.00000125; // half of GPT-4o's $2.50 per 1M input tokens
    this.metrics.estimatedSavings += cached * savingsPerToken;
  }

  getHitRate(): number {
    if (this.metrics.totalRequests === 0) return 0;
    return this.metrics.cacheHits / this.metrics.totalRequests;
  }

  getSavingsRate(): number {
    const totalTokens = this.metrics.cachedTokens + this.metrics.uncachedTokens;
    return totalTokens === 0 ? 0 : this.metrics.cachedTokens / totalTokens;
  }

  report() {
    console.log({
      hitRate: `${(this.getHitRate() * 100).toFixed(1)}%`,
      savingsRate: `${(this.getSavingsRate() * 100).toFixed(1)}%`,
      estimatedSavings: `$${this.metrics.estimatedSavings.toFixed(2)}`,
      cachedTokens: this.metrics.cachedTokens.toLocaleString(),
      uncachedTokens: this.metrics.uncachedTokens.toLocaleString()
    });
  }
}

Real-world results: Athenic research agent

We applied these patterns to our research agent, which processes 3,000-4,000 requests daily.

Before optimisation:

  • Average prompt: 12,400 tokens
  • No caching (prompts not structured for it)
  • Monthly cost: £4,200

After optimisation:

  • Same average prompt size
  • 78% cache hit rate
  • 82% of tokens cached on hits
  • Monthly cost: £1,680

Changes made:

  1. Moved system prompt (2,100 tokens) to start
  2. Sorted knowledge context by update frequency
  3. Added cache breakpoints with Anthropic API
  4. Implemented version-based invalidation

Unexpected benefit: Response latency improved 15% because cached prompts process faster (KV cache reuse reduces compute).

Common pitfalls

Pitfall 1: Dynamic content in prefix

Problem: Including timestamps, request IDs, or user names early in prompts.

// BAD: Timestamp in prefix breaks cache
const prompt = `[${new Date().toISOString()}] System: ${systemPrompt}...`;

// GOOD: Timestamp at end
const prompt = `${systemPrompt}...\n\nTimestamp: ${new Date().toISOString()}`;

Pitfall 2: Inconsistent formatting

Problem: Whitespace or formatting variations between requests.

// BAD: Template literal preserves inconsistent whitespace
const prompt = `
  ${systemPrompt}
  ${context}
`;

// GOOD: Explicit formatting
const prompt = [systemPrompt, context].join('\n\n').trim();

Pitfall 3: Over-caching stale content

Problem: Maximising cache hit rate at the expense of accuracy.

Don't cache content that changes frequently just to save money. Stale answers cost more in user trust than cache misses cost in tokens.

Pitfall 4: Ignoring provider differences

Problem: Using same caching strategy across providers.

Anthropic's 90% discount with 25% write penalty favours high-frequency, repetitive workloads. OpenAI's 50% discount with no write penalty favours variable workloads. Choose provider based on your access patterns.
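
You can make that choice concrete by modelling the expected cost of a cached prefix against how many times it is reused before the cache expires. A rough sketch using the discount and surcharge figures above (real costs depend on per-model pricing):

// Average cost multiplier per request for the cached prefix, relative to
// paying full price every time: one cache write plus `reuses` cache reads.
function avgPrefixCostMultiplier(
  reuses: number,         // cache hits after the initial write
  writeSurcharge: number, // e.g. 0.25 for Anthropic, 0 for OpenAI
  readDiscount: number    // e.g. 0.9 for Anthropic, 0.5 for OpenAI
): number {
  const totalCost = (1 + writeSurcharge) + reuses * (1 - readDiscount);
  return totalCost / (1 + reuses);
}

console.log(avgPrefixCostMultiplier(10, 0.25, 0.9).toFixed(2)); // Anthropic: ~0.20
console.log(avgPrefixCostMultiplier(10, 0, 0.5).toFixed(2));    // OpenAI:    ~0.55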

FAQs

Does caching affect output quality?

No. Caching only affects cost and latency, not the model's reasoning. A cached prefix reproduces exactly the internal state the model would have computed from scratch, so answers are unaffected.

How long do caches persist?

Varies by provider: Anthropic ~5 minutes (refreshed each time the cached prefix is read), OpenAI typically 5-10 minutes of inactivity, Google configurable (1 hour by default). Design for short TTLs; don't rely on caches persisting.

Can I cache across different users?

Yes, if the cached prefix is identical. System prompts and static context can be shared. User-specific content should come after the cached prefix.

Is caching worth it for low-volume applications?

Generally no. If you're making fewer than 100 requests/day with the same prefix, savings are minimal. Focus on caching for high-volume agents.

How do I know if caching is working?

Check response metadata. OpenAI returns prompt_tokens_details.cached_tokens. Anthropic reports cache_creation_input_tokens and cache_read_input_tokens in the usage block. Monitor these metrics.

Summary and next steps

Prompt caching offers 50-90% cost reduction with zero quality impact - the highest-leverage optimisation available for AI agents at scale.

Implementation checklist:

  1. Audit current prompts for stable vs variable content
  2. Restructure prompts with stable prefixes
  3. Implement provider-specific caching mechanisms
  4. Add version-based invalidation for dynamic content
  5. Monitor hit rates and savings

Quick wins:

  • Move system prompts to the start (immediate savings)
  • Sort RAG documents by staleness (improves hit rate)
  • Remove timestamps and request IDs from prefixes (fixes cache misses)
