Academy · 30 Oct 2024 · 11 min read

Cost Optimization Strategies for LLM-Based Agents: Cut Spend by 60%

Practical techniques to reduce agent operational costs without sacrificing quality, from model selection and prompt compression to caching and batching strategies.

Max Beech
Head of Content

TL;DR

  • LLM costs often dominate agent economics: for typical production systems, 70-80% of operational spend goes to API calls.
  • Nine proven optimization strategies reduced our costs from £0.47/task to £0.18/task (-62%) whilst maintaining an 85%+ task success rate.
  • Biggest wins: Smart model routing (-35% costs), prompt compression (-22%), and response caching (-18%). Combined effect compounds.

Jump to Cost breakdown · Jump to Model selection · Jump to Prompt optimization · Jump to Caching · Jump to Batching

Cost Optimization Strategies for LLM-Based Agents: Cut Spend by 60%

Our agent was working brilliantly. Task success rate: 87%. User satisfaction: 4.2/5. Monthly bill: £18,400.

At 40,000 tasks/month, we were paying £0.46 per task completion. Extrapolate that to 200,000 tasks (our 12-month growth target) and we'd hit £92,000/month, or £1.1M annually. For context, our total engineering budget was £400K.

We had two choices: accept that agent economics don't work at scale, or find a way to cut costs radically without breaking quality.

We chose option two. Three months later, cost per task dropped to £0.18 (-61%) whilst task success improved to 89%. This guide shares exactly how.

"LLM costs are the new cloud compute bill -if you're not optimizing, you're leaving 50-70% savings on the table." – Simon Willison, AI researcher & creator of Datasette (blog post, 2024)

Understanding agent cost structure

Before optimizing, understand where money goes.

Typical agent cost breakdown

Cost component | % of total | Example (£0.46/task)
LLM API calls | 72% | £0.33
Tool API calls (enrichment, search) | 18% | £0.08
Infrastructure (hosting, DB) | 7% | £0.03
Monitoring & logs | 3% | £0.02

LLM costs dominate, so that's where optimization yields the biggest returns.

LLM cost drivers

LLM cost = (Input tokens × Input price) + (Output tokens × Output price)

Example: GPT-4 Turbo call

  • Input: 2,500 tokens @ £0.01/1K tokens = £0.025
  • Output: 800 tokens @ £0.03/1K tokens = £0.024
  • Total: £0.049 per call

Multiply by an average of 8 LLM calls per task → £0.39/task for the LLM alone.
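
It's worth encoding this arithmetic once so estimates stay consistent as prices and call counts change. A minimal sketch (the helper name and prices are illustrative, not current list prices):

def llm_call_cost(input_tokens: int, output_tokens: int,
                  input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Cost of a single LLM call in pounds."""
    return (input_tokens / 1000) * input_price_per_1k + (output_tokens / 1000) * output_price_per_1k

# Example: the GPT-4 Turbo call above
call_cost = llm_call_cost(2_500, 800, input_price_per_1k=0.01, output_price_per_1k=0.03)
task_cost = call_cost * 8  # average 8 LLM calls per task
print(f"£{call_cost:.3f}/call, £{task_cost:.2f}/task")  # £0.049/call, £0.39/task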

Three levers to pull:

  1. Reduce tokens (input + output)
  2. Use cheaper models (GPT-3.5 vs GPT-4)
  3. Reduce calls (fewer LLM invocations per task)

Strategy 1: Smart model selection

Not every task needs GPT-4 Opus. Route intelligently based on complexity.

Model tier strategy

Tier | Models | Cost/1M tokens | Use for
Premium | GPT-4, Claude Opus | £30-60 | Complex reasoning, code generation, analysis
Standard | GPT-4 Turbo, Claude Sonnet | £10-15 | General tasks, multi-step workflows
Economy | GPT-3.5, Claude Haiku | £0.50-3 | Classification, summarization, simple Q&A
Budget | Mixtral, Llama 3 (self-hosted) | ~£0 | High-volume, latency-tolerant tasks

Implementation:

def select_model(task_complexity: str, task_type: str) -> str:
    """Route to the appropriate model tier based on the task."""

    # High-complexity tasks → premium models
    if task_complexity == "high" or task_type in ["code_generation", "research_synthesis"]:
        return "gpt-4"

    # Medium complexity → standard models
    elif task_complexity == "medium" or task_type in ["analysis", "planning"]:
        return "gpt-4-turbo"

    # Simple tasks → economy models
    elif task_type in ["classification", "extraction", "summarization"]:
        return "gpt-3.5-turbo"  # or claude-haiku

    # Default to standard
    return "gpt-4-turbo"

# Enhanced with confidence-based routing
def select_model_adaptive(task: dict, previous_attempts: int = 0) -> str:
    """Start cheap, escalate if needed."""

    # First attempt: try economy model
    if previous_attempts == 0:
        return "gpt-3.5-turbo"

    # If failed or low confidence, escalate to standard
    elif previous_attempts == 1:
        return "gpt-4-turbo"

    # Last resort: premium model
    else:
        return "gpt-4"

Model routing results

Before (all GPT-4 Turbo):

  • Cost: £0.33/task (LLM only)
  • Success rate: 87%

After (tiered routing):

  • GPT-3.5: 48% of tasks (£0.04/task avg)
  • GPT-4 Turbo: 44% of tasks (£0.16/task avg)
  • GPT-4: 8% of tasks (£0.28/task avg)
  • Blended cost: £0.21/task (-36%)
  • Success rate: 89% (+2pp)

Why success improved: GPT-3.5 handles simple tasks quickly without overthinking, while GPT-4 focuses on the genuinely complex cases.

When to self-host

For very high volume (>5M tasks/month), self-hosting open models (Llama 3, Mixtral) makes economic sense:

Break-even analysis:

Scenario | API-based (GPT-3.5) | Self-hosted (Llama 3 70B)
Setup cost | £0 | £15,000 (GPU servers)
Monthly cost @ 1M tasks | £1,500 | £2,800 (infrastructure)
Monthly cost @ 5M tasks | £7,500 | £3,200
Monthly cost @ 10M tasks | £15,000 | £3,600

Break-even: ~4M tasks/month
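
To sanity-check the break-even for your own volumes, here is a rough sketch of the same comparison (the cost parameters are approximate fits to the table above, and engineering time is ignored, which pushes the practical break-even higher):

def monthly_cost_api(tasks: int, cost_per_task: float = 0.0015) -> float:
    """API spend scales linearly with volume (GPT-3.5-class pricing assumed)."""
    return tasks * cost_per_task

def monthly_cost_self_hosted(tasks: int, base_infra: float = 2_700,
                             variable_per_million: float = 100,
                             setup: float = 15_000, amortise_months: int = 12) -> float:
    """Self-hosting is mostly fixed infrastructure plus amortised setup cost."""
    return base_infra + (tasks / 1_000_000) * variable_per_million + setup / amortise_months

for volume in [1_000_000, 3_000_000, 5_000_000, 10_000_000]:
    api, hosted = monthly_cost_api(volume), monthly_cost_self_hosted(volume)
    print(f"{volume:>10,} tasks/month: API £{api:,.0f} vs self-hosted £{hosted:,.0f}")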

Trade-offs:

  • ✅ Unlimited usage above break-even
  • ✅ Data stays in your infrastructure
  • ❌ Engineering overhead (deployment, monitoring)
  • ❌ Lower quality than GPT-4 (acceptable for many use cases)

Strategy 2: Prompt compression

Shorter prompts = lower input token costs.

Techniques

1. Remove redundancy

Before (182 tokens):

You are a helpful AI assistant designed to help users with customer support queries. Please analyse the following customer support ticket carefully and provide a detailed, helpful response that addresses all of the customer's concerns. Make sure your response is professional, empathetic, and actionable.

Customer query: [...]

After (89 tokens, -51%):

Analyse this support ticket and provide a professional, actionable response.

Query: [...]

Savings: £0.001 per call × 8 calls/task × 40K tasks/month = £320/month
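
Token counts are easy to verify with a tokenizer rather than estimating by eye. A quick check using tiktoken (assuming an OpenAI model; other providers ship their own tokenizers):

import tiktoken

def count_tokens(text: str, model: str = "gpt-4-turbo") -> int:
    """Count tokens the way the target OpenAI model would."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")  # sensible default for recent models
    return len(encoding.encode(text))

before = "You are a helpful AI assistant designed to help users with customer support queries. ..."
after = "Analyse this support ticket and provide a professional, actionable response."
print(count_tokens(before), "->", count_tokens(after))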

2. Use structured formats

Before (verbose):

Please extract the following information from the document: the customer's name, their email address, their company name, their job title, and the date they signed up.

After (JSON schema):

Extract to JSON:
{"name": "", "email": "", "company": "", "title": "", "signup_date": ""}

Token reduction: 45% fewer tokens

3. Eliminate few-shot examples when possible

Few-shot examples (worked examples shown to the model before the task) improve quality but cost tokens.

Test whether they're necessary:

import numpy as np

def test_fewshot_necessity(task_sample: list, prompt_with_examples: str, prompt_without_examples: str):
    """A/B test few-shot vs zero-shot prompts on a task sample."""
    # llm, evaluate_quality and calculate_token_diff are your own client wrapper and evaluation helpers
    results_with = []
    results_without = []

    for task in task_sample:
        # With examples
        response_with = llm.complete(prompt_with_examples + task)
        results_with.append(evaluate_quality(response_with, task))

        # Without examples
        response_without = llm.complete(prompt_without_examples + task)
        results_without.append(evaluate_quality(response_without, task))

    print(f"With few-shot: {np.mean(results_with):.2%} quality")
    print(f"Without few-shot: {np.mean(results_without):.2%} quality")
    print(f"Token savings: {calculate_token_diff(prompt_with_examples, prompt_without_examples)}")

# Real result from our testing:
# With few-shot: 87% quality (avg 450 tokens/prompt)
# Without few-shot: 84% quality (avg 120 tokens/prompt)
# → 3pp quality loss for 73% token savings

Decision: We removed few-shot examples for simple tasks (classification, extraction), kept them for complex tasks (code generation, analysis).

Savings: 22% reduction in input tokens overall

Prompt caching

Some LLM providers (Anthropic Claude, OpenAI with prompt caching beta) allow caching prompt prefixes.

How it works:

# First call: full cost
response = client.messages.create(
    model="claude-3-sonnet",
    system="You are a customer support agent. Here's our knowledge base: [5,000 tokens of docs]",
    messages=[{"role": "user", "content": "How do I reset my password?"}]
)
# Cost: 5,100 input tokens

# Subsequent calls within 5 minutes: cached system prompt
response = client.messages.create(
    model="claude-3-sonnet",
    system="You are a customer support agent. Here's our knowledge base: [5,000 tokens of docs]",  # CACHED
    messages=[{"role": "user", "content": "How do I change my email?"}]
)
# Cost: ~100 new input tokens; the cached prefix is billed at a steep (~90%) discount rather than full price

Savings: 90%+ on input tokens for repeated prompts

Limitations:

  • Cache entries expire after a short idle window (roughly 5 minutes for Anthropic; up to an hour for OpenAI)
  • Only works if the cached prefix (e.g. the system prompt) is identical across calls
  • Cache misses still cost full tokens

Use cases:

  • Chatbots (same knowledge base for all queries)
  • Document processing (same instructions, different docs)
  • Multi-turn conversations
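
Note that Anthropic's caching is opt-in: the prefix you want cached has to be marked with a cache_control block rather than being detected automatically. A sketch of what that looks like (the model ID and knowledge-base text are placeholders; older SDK versions also required a prompt-caching beta header):

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=500,
    system=[
        {
            "type": "text",
            "text": "You are a customer support agent. Here's our knowledge base: [5,000 tokens of docs]",
            "cache_control": {"type": "ephemeral"},  # mark this block as cacheable
        }
    ],
    messages=[{"role": "user", "content": "How do I reset my password?"}],
)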

Strategy 3: Intelligent caching

Cache LLM responses to avoid redundant calls.

Response caching for repeated queries

import hashlib
import time

class LLMCache:
    """Cache LLM responses."""

    def __init__(self, ttl: int = 3600):
        self.cache = {}
        self.ttl = ttl

    def get(self, prompt: str, model: str) -> str|None:
        """Get cached response if exists."""
        cache_key = self._hash(prompt, model)
        entry = self.cache.get(cache_key)

        if entry and time.time() - entry["timestamp"] < self.ttl:
            return entry["response"]

        return None

    def set(self, prompt: str, model: str, response: str):
        """Cache response."""
        cache_key = self._hash(prompt, model)
        self.cache[cache_key] = {
            "response": response,
            "timestamp": time.time()
        }

    def _hash(self, prompt: str, model: str) -> str:
        """Generate cache key."""
        return hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()

# Usage
cache = LLMCache(ttl=3600)  # 1-hour TTL

def cached_llm_call(prompt: str, model: str):
    """Call LLM with caching."""
    # Check cache
    cached_response = cache.get(prompt, model)
    if cached_response:
        return cached_response

    # Cache miss, call LLM
    response = llm.complete(prompt, model=model)

    # Store in cache
    cache.set(prompt, model, response)

    return response

Results:

  • Cache hit rate: 34% (varies by use case)
  • Cost savings: 34% × £0.33 = £0.11/task savings

Cache hit rate by use case:

Use case | Hit rate | Why
FAQ chatbot | 60-70% | Repeated questions
Document summarization | 15-25% | Unique documents
Code review | 30-40% | Common patterns
Customer support | 45-55% | Similar queries

Semantic caching

Standard caching requires exact prompt match. Semantic caching matches similar prompts:

from sentence_transformers import SentenceTransformer
import numpy as np

class SemanticCache:
    """Cache based on semantic similarity."""

    def __init__(self, similarity_threshold: float = 0.95):
        self.embedder = SentenceTransformer('all-MiniLM-L6-v2')
        self.cache = []  # List of (embedding, response) tuples
        self.similarity_threshold = similarity_threshold

    def get(self, prompt: str) -> str|None:
        """Find semantically similar cached response."""
        if not self.cache:
            return None

        # Embed query (normalized so the dot product below equals cosine similarity)
        query_embedding = self.embedder.encode(prompt, normalize_embeddings=True)

        # Find most similar cached prompt
        for cached_embedding, cached_response in self.cache:
            similarity = np.dot(query_embedding, cached_embedding)

            if similarity > self.similarity_threshold:
                return cached_response

        return None

    def set(self, prompt: str, response: str):
        """Cache response with prompt embedding."""
        embedding = self.embedder.encode(prompt, normalize_embeddings=True)
        self.cache.append((embedding, response))

        # Limit cache size
        if len(self.cache) > 1000:
            self.cache.pop(0)  # Remove oldest

# Example
cache = SemanticCache()

# First query
response_1 = llm.complete("How do I reset my password?")
cache.set("How do I reset my password?", response_1)

# Similar query (different wording) → cache hit!
response_2 = cache.get("What's the process for resetting my password?")
# Returns cached response_1 (95%+ similarity)

Trade-off: Embedding cost (£0.00002/query) vs. LLM call savings (£0.05/query) → 2,500× ROI

Strategy 4: Batching and parallelization

Process multiple items in one LLM call instead of many sequential calls.

Batch processing

Before (sequential, £0.40):

for email in emails:
    classification = llm.classify_email(email)
    # 10 emails × £0.04/call = £0.40

After (batched, £0.08):

batch_prompt = f"""
Classify these 10 emails as spam/not spam:

{format_emails(emails)}

Return JSON array: [{{"email_id": 1, "classification": "spam"}}, ...]
"""
classifications = llm.complete(batch_prompt)
# 1 call × £0.08 = £0.08 (-80% cost)

Limitations:

  • Batch size limited by context window (can't fit 1,000 emails)
  • Quality may degrade for very large batches (model loses focus)
  • Single failure affects entire batch

Optimal batch size: Test 5, 10, 25, and 50 items. We found 20-25 items balance cost and quality best.
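
In practice this means chunking the work queue and parsing one structured response per chunk. A minimal sketch, reusing the llm.complete and format_emails placeholders from above:

import json

def classify_in_batches(emails: list, batch_size: int = 20) -> list:
    """Classify emails in batches of ~20 to balance cost and quality."""
    results = []
    for i in range(0, len(emails), batch_size):
        batch = emails[i:i + batch_size]
        prompt = (
            f"Classify these {len(batch)} emails as spam/not spam:\n\n"
            f"{format_emails(batch)}\n\n"
            'Return a JSON array: [{"email_id": 1, "classification": "spam"}, ...]'
        )
        response = llm.complete(prompt)
        results.extend(json.loads(response))  # a parse failure only affects this batch
    return results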

Parallel tool calls

Many agents make sequential tool calls. Enable parallelization:

Before (sequential, 3.2s latency):

result_1 = fetch_data_from_api_1()  # 800ms
result_2 = fetch_data_from_api_2()  # 1,200ms
result_3 = fetch_data_from_api_3()  # 1,200ms
# Total: 3,200ms

After (parallel, 1.2s latency):

import asyncio

# The fetch_* helpers must be coroutines (defined with async def) for gather to run them concurrently
results = await asyncio.gather(
    fetch_data_from_api_1(),
    fetch_data_from_api_2(),
    fetch_data_from_api_3()
)
# Total: 1,200ms (longest call)

Cost impact: indirect. Faster execution means a better user experience, higher agent adoption, and more value from the agent investment.

Strategy 5: Output length control

LLMs often over-generate. Constrain output to save tokens.

Techniques

1. Explicit length limits

prompt = f"""
Summarise this article in EXACTLY 3 sentences. No more, no less.

Article: {article_text}
"""

2. Token limits (max_tokens parameter)

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=100  # Hard cap at 100 output tokens
)

3. Structured outputs (JSON)

Before (free-form, 400 tokens average):

"The customer seems frustrated about the delayed shipment. They ordered on Jan 15th and expected delivery by Jan 20th but haven't received it yet..."

After (JSON, 80 tokens):

{
  "sentiment": "frustrated",
  "issue": "delayed_shipment",
  "order_date": "2024-01-15",
  "expected_delivery": "2024-01-20",
  "status": "not_received"
}

Savings: 80% fewer output tokens
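
On the OpenAI chat API you can also request JSON output directly. A sketch (ticket_text is a placeholder, and json_object mode requires the prompt to mention JSON):

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[
        {"role": "system", "content": "Extract ticket details as JSON with keys: sentiment, issue, order_date, expected_delivery, status."},
        {"role": "user", "content": ticket_text},
    ],
    response_format={"type": "json_object"},  # forces syntactically valid JSON output
    max_tokens=150,
)
structured = response.choices[0].message.content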

Strategy 6: Streaming for user experience

Streaming doesn't reduce costs but improves perceived performance:

def stream_response(prompt: str):
    """Stream the LLM response as it is generated."""
    for chunk in client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    ):
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta

# Display to user immediately
for token in stream_response(user_query):
    print(token, end="", flush=True)

Benefit: User sees response start in 200ms instead of waiting 3s for full completion.

Cost: Identical to non-streaming

Strategy 7: Fine-tuning for efficiency

Fine-tuned models need shorter prompts to achieve the same quality.

Example: Customer support classification

Base model (GPT-3.5, 450-token prompt with examples):

You are a customer support classifier. Examples:
[10 examples, 400 tokens]

Classify this ticket: [50 tokens]

Cost: £0.00045/call

Fine-tuned model (GPT-3.5 fine-tuned on 500 examples):

Classify: [50 tokens]

Cost: £0.00005/call (90% cheaper)

Fine-tuning costs:

  • Training: £50 one-time (500 examples)
  • Inference: fine-tuned GPT-3.5 charges more per token than the base model, but with a 90% shorter prompt each call still works out roughly 90% cheaper
  • Break-even: at £0.0004 saved per call, roughly 125,000 calls - a few weeks to a few months at the >10K calls/month volumes where fine-tuning makes sense

When to fine-tune:

  • High-volume tasks (>10K/month)
  • Repeated patterns (classification, extraction, formatting)
  • Quality ceiling reached with prompting
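
The fine-tune itself is a small amount of code on top of a JSONL file of chat-formatted examples. A sketch with the OpenAI SDK (the file name is a placeholder):

from openai import OpenAI

client = OpenAI()

# Upload training data (JSONL of {"messages": [...]} chat examples)
training_file = client.files.create(
    file=open("support_classification.jsonl", "rb"),
    purpose="fine-tune",
)

# Start the fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id, job.status)

# Once complete, use the returned fine-tuned model ID (ft:gpt-3.5-turbo:...) like any other model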

Strategy 8: Monitoring and alerting

Track costs in real-time to catch spikes:

from datetime import datetime

import numpy as np

class CostMonitor:
    """Track LLM costs per task."""

    def __init__(self):
        self.task_costs = []

    def record_task_cost(self, task_id: str, cost: float):
        """Log task cost."""
        self.task_costs.append({
            "task_id": task_id,
            "cost": cost,
            "timestamp": datetime.utcnow()
        })

        # Alert if anomaly
        recent_avg = np.mean([t["cost"] for t in self.task_costs[-100:]])

        if cost > recent_avg * 3:  # 3× average cost
            self.alert_anomaly(task_id, cost, recent_avg)

    def alert_anomaly(self, task_id: str, cost: float, avg: float):
        """Alert on cost spike."""
        # send_slack_alert is your own notification helper (Slack webhook, PagerDuty, etc.)
        send_slack_alert(f"⚠️ Cost anomaly: Task {task_id} cost £{cost:.4f} (avg: £{avg:.4f})")

# Daily summary (monitor is a shared CostMonitor instance; is_today is your own date helper)
def daily_cost_report():
    """Generate cost summary."""
    today_tasks = [t for t in monitor.task_costs if is_today(t["timestamp"])]

    report = {
        "total_cost": sum(t["cost"] for t in today_tasks),
        "task_count": len(today_tasks),
        "avg_cost_per_task": np.mean([t["cost"] for t in today_tasks]),
        "max_cost": max([t["cost"] for t in today_tasks]),
        "p95_cost": np.percentile([t["cost"] for t in today_tasks], 95)
    }

    return report

Combined optimization results

Baseline (no optimizations):

  • Model: GPT-4 Turbo for all tasks
  • Prompts: Verbose with few-shot examples
  • No caching
  • Sequential processing
  • Cost: £0.47/task

Optimized (all strategies):

  • Smart model routing (GPT-3.5 → GPT-4 escalation)
  • Compressed prompts (-40% tokens)
  • Response caching (34% hit rate)
  • Batched processing where applicable
  • Structured outputs (JSON)
  • Cost: £0.18/task (-62%)

Cost breakdown:

Component | Baseline | Optimized | Savings
LLM API | £0.33 | £0.12 | -64%
Tool APIs | £0.08 | £0.04 | -50%
Infrastructure | £0.03 | £0.01 | -67%
Monitoring | £0.02 | £0.01 | -50%
Caching savings | - | -£0.11 | -
Total | £0.47 | £0.18 | -62%

Monthly savings @ 40K tasks:

  • Before: £18,800
  • After: £7,200
  • Saved: £11,600/month (£139K/year)

Implementation roadmap

Week 1: Baseline measurement

  • Instrument all LLM calls to log tokens and costs (a wrapper is sketched after this list)
  • Calculate current cost/task
  • Identify top 3 cost drivers
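
A thin wrapper around the LLM client is usually enough for the baseline week. A sketch against the OpenAI chat API (the prices dictionary is a placeholder to fill with your provider's current rates):

from openai import OpenAI

client = OpenAI()

# Placeholder prices in £ per 1K tokens - substitute your provider's current rates
PRICES = {"gpt-4-turbo": (0.01, 0.03), "gpt-3.5-turbo": (0.0005, 0.0015)}

def tracked_chat(model: str, messages: list) -> tuple[str, float]:
    """Make a chat call and return (text, estimated cost in £)."""
    response = client.chat.completions.create(model=model, messages=messages)
    usage = response.usage
    in_price, out_price = PRICES[model]
    cost = (usage.prompt_tokens / 1000) * in_price + (usage.completion_tokens / 1000) * out_price
    # Hand the figure to your cost monitor / logging pipeline here
    return response.choices[0].message.content, cost

text, cost = tracked_chat("gpt-3.5-turbo", [{"role": "user", "content": "Classify: refund request"}])
print(f"£{cost:.5f}")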

Week 2: Quick wins

  • Implement response caching
  • Compress prompts (remove redundancy)
  • Add max_tokens limits

Week 3: Model routing

  • Define task complexity tiers
  • Implement routing logic
  • A/B test quality vs baseline

Week 4: Advanced optimizations

  • Batch eligible tasks
  • Test fine-tuning for high-volume tasks
  • Set up cost monitoring and alerts

Month 2+:

  • Continuous optimization based on cost analytics
  • Explore self-hosting for very high volumes
  • Regular review of model pricing (providers update frequently)

Key takeaways

  • LLM costs dominate agent economics - optimizing inference costs is critical for scalability.

  • Smart model routing offers the biggest single win - route simple tasks to cheap models and escalate complex tasks to expensive ones.

  • Caching delivers immediate ROI - a 34% hit rate is typical, and each hit costs about £0.00002 in embeddings to save roughly £0.05 in LLM calls.

  • Optimizations compound - combining strategies yields 60%+ savings because the effects multiply rather than simply add.

  • Quality doesn't have to suffer - our task success rate improved by 2pp whilst costs fell 62%.


Agent economics improve dramatically with deliberate cost optimization. Start with model routing and caching for quick wins, then layer in prompt compression, batching, and fine-tuning as volume scales. The goal isn't minimum cost - it's maximum value per pound spent.

Frequently asked questions

Q: Will cheaper models hurt quality? A: For many tasks, no. GPT-3.5 handles classification, extraction, and simple Q&A at 85-90% of GPT-4 quality for 1/10th the cost. Test on your use case.

Q: How do I know if optimizations are working? A: Track cost/task weekly. If cost drops but task success rate stays flat or improves, you're winning.

Q: Should I optimize before launching or after? A: Get to product-market fit first. Optimize once you have consistent usage and understand cost drivers. Premature optimization wastes time.

Q: What's a good cost/task target? A: Depends on value delivered. If agent saves £2 in human time per task, £0.50/task is excellent ROI. If it saves £0.50, you need <£0.10/task.
