Cost Optimization Strategies for LLM-Based Agents: Cut Spend by 60%
Practical techniques to reduce agent operational costs without sacrificing quality, from model selection and prompt compression to caching and batching strategies.
TL;DR: Tiered model routing, prompt compression, caching, batching, and output constraints cut our cost per task from £0.46 to £0.18 (-61%), whilst task success improved from 87% to 89%.
Our agent was working brilliantly. Task success rate: 87%. User satisfaction: 4.2/5. Monthly bill: £18,400.
At 40,000 tasks/month, we were paying £0.46 per task completion. Extrapolate that to 200,000 tasks (our 12-month growth target) and we'd hit £92,000/month, or £1.1M annually. For context, our total engineering budget was £400K.
We had two choices: accept that agent economics don't work at scale, or find a way to cut costs radically without breaking quality.
We chose option two. Three months later, cost per task dropped to £0.18 (-61%) whilst task success improved to 89%. This guide shares exactly how.
"LLM costs are the new cloud compute bill -if you're not optimizing, you're leaving 50-70% savings on the table." – Simon Willison, AI researcher & creator of Datasette (blog post, 2024)
Before optimizing, understand where money goes.
| Cost component | % of total | Example (£0.46/task) |
|---|---|---|
| LLM API calls | 72% | £0.33 |
| Tool API calls (enrichment, search) | 18% | £0.08 |
| Infrastructure (hosting, DB) | 7% | £0.03 |
| Monitoring & logs | 3% | £0.02 |
LLM costs dominate. That's where optimization yields the biggest returns.
LLM cost = (Input tokens × Input price) + (Output tokens × Output price)
Example: a typical GPT-4 Turbo call in our pipeline cost around £0.05 in tokens. Multiply by an average of 8 LLM calls/task → roughly £0.39/task just for LLM.
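As a sanity check on that number, here's a minimal sketch of the same arithmetic; the price table and token counts below are illustrative assumptions chosen to land near our observed £0.39/task, not actual provider rates:

```python
# Illustrative prices (£ per 1M tokens); assumed figures, not actual provider rates
PRICES_PER_1M = {
    "gpt-4-turbo": {"input": 8.0, "output": 24.0},
    "gpt-3.5-turbo": {"input": 0.4, "output": 1.2},
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """LLM cost = input tokens × input price + output tokens × output price."""
    prices = PRICES_PER_1M[model]
    return (input_tokens * prices["input"] + output_tokens * prices["output"]) / 1_000_000

# e.g. ~4,000 input / 700 output tokens per call, 8 calls per task
per_call = call_cost("gpt-4-turbo", 4_000, 700)
print(f"£{per_call:.3f}/call → £{per_call * 8:.2f}/task")  # ≈ £0.049/call → £0.39/task
```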
Three levers to pull: pay less per token (smarter model selection), send and generate fewer tokens (prompt compression, caching, output control), and make fewer LLM calls (batching).
Not every task needs GPT-4 or Claude Opus. Route intelligently based on complexity.
| Tier | Models | Cost/1M tokens | Use for |
|---|---|---|---|
| Premium | GPT-4, Claude Opus | £30-60 | Complex reasoning, code generation, analysis |
| Standard | GPT-4 Turbo, Claude Sonnet | £10-15 | General tasks, multi-step workflows |
| Economy | GPT-3.5, Claude Haiku | £0.50-3 | Classification, summarization, simple Q&A |
| Budget | Mixtral, Llama 3 (self-hosted) | ~£0 marginal (infrastructure only) | High-volume, latency-tolerant tasks |
Implementation:
```python
def select_model(task_complexity: str, task_type: str) -> str:
    """Route to an appropriate model based on task complexity and type."""
    # High-complexity tasks and heavyweight task types → GPT-4 Turbo
    if task_complexity == "high" or task_type in ["code_generation", "research_synthesis"]:
        return "gpt-4-turbo"
    # Medium complexity → cheaper general-purpose model
    elif task_complexity == "medium" or task_type in ["analysis", "planning"]:
        return "gpt-3.5-turbo"
    # Simple, well-bounded tasks → economy models
    elif task_type in ["classification", "extraction", "summarization"]:
        return "gpt-3.5-turbo"  # or claude-haiku
    # Default to standard
    return "gpt-4-turbo"


# Enhanced with attempt-based escalation: start cheap, escalate if needed
def select_model_adaptive(task: dict, previous_attempts: int = 0) -> str:
    """Start with the cheapest model and escalate on failure."""
    # First attempt: try an economy model
    if previous_attempts == 0:
        return "gpt-3.5-turbo"
    # If it failed or confidence was low, escalate to the standard tier
    elif previous_attempts == 1:
        return "gpt-4-turbo"
    # Last resort: premium model
    else:
        return "gpt-4"
```
Before (all GPT-4 Turbo) vs. after (tiered routing): LLM spend per task fell sharply whilst task success improved slightly.
Why success improved: GPT-3.5 handles simple tasks faster with less overthinking, while GPT-4 focuses on the genuinely complex cases.
For very high volume (>5M tasks/month), self-hosting open models (Llama 3, Mixtral) makes economic sense:
Break-even analysis:
| Scenario | API-based (GPT-3.5) | Self-hosted (Llama 3 70B) |
|---|---|---|
| Setup cost | £0 | £15,000 (GPU servers) |
| Monthly cost @ 1M tasks | £1,500 | £2,800 (infrastructure) |
| Monthly cost @ 5M tasks | £7,500 | £3,200 |
| Monthly cost @ 10M tasks | £15,000 | £3,600 |
Break-even: ~4M tasks/month
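As a rough sketch of that break-even point (the inputs below are assumptions: API cost per task from the GPT-3.5 column, a small marginal cost when self-hosted, and a fixed monthly figure covering GPUs plus amortised setup and operations time):

```python
def breakeven_tasks_per_month(api_cost_per_task: float,
                              selfhost_fixed_monthly: float,
                              selfhost_marginal_per_task: float) -> float:
    """Monthly volume at which self-hosting becomes cheaper than per-call API pricing."""
    return selfhost_fixed_monthly / (api_cost_per_task - selfhost_marginal_per_task)

# Assumed inputs: £0.0015/task via API, £0.0001/task marginal self-hosted,
# £5,500/month fixed (GPUs + amortised setup + ops time)
print(f"{breakeven_tasks_per_month(0.0015, 5_500, 0.0001):,.0f} tasks/month")  # ≈ 3.9M with these inputs
```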
Trade-offs: self-hosting adds operational burden (provisioning GPUs, serving infrastructure, model upgrades, on-call), and open models can trail frontier models on complex reasoning, so keep an API fallback for the hardest tasks.
Shorter prompts = lower input token costs.
1. Remove redundancy
Before (182 tokens):
You are a helpful AI assistant designed to help users with customer support queries. Please analyse the following customer support ticket carefully and provide a detailed, helpful response that addresses all of the customer's concerns. Make sure your response is professional, empathetic, and actionable.
Customer query: [...]
After (89 tokens, -51%):
Analyse this support ticket and provide a professional, actionable response.
Query: [...]
Savings: £0.001 per call × 8 calls/task × 40K tasks/month = £320/month
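To verify compression wins before shipping them, count tokens with a tokeniser; this sketch assumes tiktoken (OpenAI-style tokenisation) and falls back to cl100k_base for unrecognised model names:

```python
import tiktoken

def count_tokens(text: str, model: str = "gpt-4-turbo") -> int:
    """Count tokens the way the target model's tokeniser would."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")  # sensible default for recent models
    return len(encoding.encode(text))

verbose = "You are a helpful AI assistant designed to help users with customer support queries. ..."
compressed = "Analyse this support ticket and provide a professional, actionable response."
print(count_tokens(verbose), "vs", count_tokens(compressed))
```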
2. Use structured formats
Before (verbose):
Please extract the following information from the document: the customer's name, their email address, their company name, their job title, and the date they signed up.
After (JSON schema):
Extract to JSON:
{"name": "", "email": "", "company": "", "title": "", "signup_date": ""}
Token reduction: 45% fewer tokens
3. Eliminate few-shot examples when possible
Few-shot examples (showing the model examples before the task) improve quality but cost tokens.
Test whether they're necessary:
```python
import numpy as np

def test_fewshot_necessity(task_sample: list, prompt_with_examples: str, prompt_without_examples: str):
    """A/B test few-shot vs zero-shot prompts on a sample of tasks."""
    # llm, evaluate_quality and calculate_token_diff are project helpers, not defined here
    results_with = []
    results_without = []
    for task in task_sample:
        # With examples
        response_with = llm.complete(prompt_with_examples + task)
        results_with.append(evaluate_quality(response_with, task))
        # Without examples
        response_without = llm.complete(prompt_without_examples + task)
        results_without.append(evaluate_quality(response_without, task))
    print(f"With few-shot: {np.mean(results_with):.2%} quality")
    print(f"Without few-shot: {np.mean(results_without):.2%} quality")
    print(f"Token savings: {calculate_token_diff(prompt_with_examples, prompt_without_examples)}")

# Real result from our testing:
# With few-shot: 87% quality (avg 450 tokens/prompt)
# Without few-shot: 84% quality (avg 120 tokens/prompt)
# → 3pp quality loss for 73% token savings
```
Decision: We removed few-shot examples for simple tasks (classification, extraction), kept them for complex tasks (code generation, analysis).
Savings: 22% reduction in input tokens overall
Some LLM providers (Anthropic Claude, OpenAI with prompt caching beta) allow caching prompt prefixes.
How it works:
```python
import anthropic

client = anthropic.Anthropic()

# First call: full price; the cache_control block marks the prefix as cacheable
response = client.messages.create(
    model="claude-3-sonnet",
    max_tokens=500,
    system=[{"type": "text",
             "text": "You are a customer support agent. Here's our knowledge base: [5,000 tokens of docs]",
             "cache_control": {"type": "ephemeral"}}],
    messages=[{"role": "user", "content": "How do I reset my password?"}],
)
# Cost: ~5,100 input tokens at full price (plus a small cache-write premium)

# Calls within the cache TTL (~5 minutes) that reuse the same prefix hit the cache
response = client.messages.create(
    model="claude-3-sonnet",
    max_tokens=500,
    system=[{"type": "text",
             "text": "You are a customer support agent. Here's our knowledge base: [5,000 tokens of docs]",  # CACHED
             "cache_control": {"type": "ephemeral"}}],
    messages=[{"role": "user", "content": "How do I change my email?"}],
)
# Cost: ~100 full-price input tokens; the 5,000 cached tokens are billed at a steep discount
```
Savings: 90%+ on input tokens for repeated prompts
Limitations: only a prompt prefix can be cached, cache entries expire after a few minutes of inactivity, and providers impose minimum cacheable prompt lengths, so it only pays off for long, stable prefixes.
Use cases: long system prompts, shared knowledge-base context, and reusable few-shot example blocks that precede many different user messages.
Cache LLM responses to avoid redundant calls.
```python
import hashlib
import time

class LLMCache:
    """Cache LLM responses keyed on (model, prompt)."""
    def __init__(self, ttl: int = 3600):
        self.cache = {}
        self.ttl = ttl

    def get(self, prompt: str, model: str) -> str | None:
        """Return the cached response if present and not expired."""
        cache_key = self._hash(prompt, model)
        entry = self.cache.get(cache_key)
        if entry and time.time() - entry["timestamp"] < self.ttl:
            return entry["response"]
        return None

    def set(self, prompt: str, model: str, response: str):
        """Cache a response with the current timestamp."""
        cache_key = self._hash(prompt, model)
        self.cache[cache_key] = {
            "response": response,
            "timestamp": time.time(),
        }

    def _hash(self, prompt: str, model: str) -> str:
        """Generate a deterministic cache key."""
        return hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()

# Usage
cache = LLMCache(ttl=3600)  # 1-hour TTL

def cached_llm_call(prompt: str, model: str):
    """Call the LLM, serving from cache when possible."""
    cached_response = cache.get(prompt, model)
    if cached_response:
        return cached_response
    # Cache miss: call the LLM and store the result
    response = llm.complete(prompt, model=model)  # llm: project client, not defined here
    cache.set(prompt, model, response)
    return response
```
Results: across our workloads the cache served roughly a third of requests (about a 34% hit rate), each hit avoiding a full LLM call.
Cache hit rate by use case:
| Use case | Hit rate | Why |
|---|---|---|
| FAQ chatbot | 60-70% | Repeated questions |
| Document summarization | 15-25% | Unique documents |
| Code review | 30-40% | Common patterns |
| Customer support | 45-55% | Similar queries |
Standard caching requires exact prompt match. Semantic caching matches similar prompts:
```python
from sentence_transformers import SentenceTransformer
import numpy as np

class SemanticCache:
    """Cache keyed on semantic similarity rather than exact prompt match."""
    def __init__(self, similarity_threshold: float = 0.95):
        self.embedder = SentenceTransformer('all-MiniLM-L6-v2')
        self.cache = []  # list of (embedding, response) tuples
        self.similarity_threshold = similarity_threshold

    def get(self, prompt: str) -> str | None:
        """Return a cached response whose prompt is semantically similar enough."""
        if not self.cache:
            return None
        # Normalised embeddings → dot product equals cosine similarity
        query_embedding = self.embedder.encode(prompt, normalize_embeddings=True)
        for cached_embedding, cached_response in self.cache:
            similarity = np.dot(query_embedding, cached_embedding)
            if similarity > self.similarity_threshold:
                return cached_response
        return None

    def set(self, prompt: str, response: str):
        """Cache a response together with its prompt embedding."""
        embedding = self.embedder.encode(prompt, normalize_embeddings=True)
        self.cache.append((embedding, response))
        # Bound memory: evict the oldest entry beyond 1,000 items
        if len(self.cache) > 1000:
            self.cache.pop(0)

# Example
cache = SemanticCache()

# First query goes to the LLM and is cached
response_1 = llm.complete("How do I reset my password?")  # llm: project client
cache.set("How do I reset my password?", response_1)

# Similar query (different wording) → cache hit
response_2 = cache.get("What's the process for resetting my password?")
# Returns cached response_1 (similarity above the 0.95 threshold)
```
Trade-off: Embedding cost (£0.00002/query) vs. LLM call savings (£0.05/query) → 2,500× ROI
Process multiple items in one LLM call instead of many sequential calls.
Before (sequential, £0.40):
```python
for email in emails:
    classification = llm.classify_email(email)  # one LLM call per email
# 10 emails × £0.04/call = £0.40
```
After (batched, £0.08):
```python
batch_prompt = f"""
Classify these 10 emails as spam/not spam:
{format_emails(emails)}
Return JSON array: [{{"email_id": 1, "classification": "spam"}}, ...]
"""
classifications = llm.complete(batch_prompt)
# 1 call × £0.08 = £0.08 (-80% cost)
```
Limitations: very large batches dilute per-item quality and make failures harder to attribute, and the response has to be parsed and matched back to individual items.
Optimal batch size: Test 5, 10, 25, 50 items. We found 20-25 items balances cost and quality.
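A sketch of how we'd wrap that in practice; `llm.complete` is an assumed client, and the prompt asks for a JSON array keyed by id so results can be matched back to their emails:

```python
import json

def classify_in_batches(emails: list[str], batch_size: int = 20) -> list[dict]:
    """Classify emails in batches of ~20 to trade off cost against per-item quality."""
    results = []
    for start in range(0, len(emails), batch_size):
        batch = emails[start:start + batch_size]
        numbered = "\n".join(f"{i + 1}. {email}" for i, email in enumerate(batch))
        prompt = (
            "Classify each email as spam or not_spam.\n"
            f"{numbered}\n"
            'Return a JSON array: [{"email_id": 1, "classification": "spam"}, ...]'
        )
        response = llm.complete(prompt)        # assumed LLM client
        results.extend(json.loads(response))   # parse and keep per-item labels
    return results
```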
Many agents make sequential tool calls. Enable parallelization:
Before (sequential, 3.2s latency):
```python
result_1 = fetch_data_from_api_1()  # 800ms
result_2 = fetch_data_from_api_2()  # 1,200ms
result_3 = fetch_data_from_api_3()  # 1,200ms
# Total: 3,200ms
```
After (parallel, 1.2s latency):
```python
import asyncio

# Assumes the fetch_* helpers are async coroutines
async def fetch_all():
    return await asyncio.gather(
        fetch_data_from_api_1(),
        fetch_data_from_api_2(),
        fetch_data_from_api_3(),
    )

results = asyncio.run(fetch_all())
# Total: ~1,200ms (the longest individual call)
```
Cost impact: indirect. Faster execution means a better user experience, higher agent adoption, and more value from the agent investment.
LLMs often over-generate. Constrain output to save tokens.
1. Explicit length limits
```python
prompt = f"""
Summarise this article in EXACTLY 3 sentences. No more, no less.
Article: {article_text}
"""
```
2. Token limits (max_tokens parameter)
```python
response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=100,  # hard cap at 100 output tokens
)
```
3. Structured outputs (JSON)
Before (free-form, 400 tokens average):
"The customer seems frustrated about the delayed shipment. They ordered on Jan 15th and expected delivery by Jan 20th but haven't received it yet..."
After (JSON, 80 tokens):
```json
{
  "sentiment": "frustrated",
  "issue": "delayed_shipment",
  "order_date": "2024-01-15",
  "expected_delivery": "2024-01-20",
  "status": "not_received"
}
```
Savings: 80% fewer output tokens
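One way to enforce the JSON shape and cap length at the API level; this sketch uses the OpenAI chat endpoint's `response_format={"type": "json_object"}` option (supported models only), with `ticket_text` standing in for the raw customer message:

```python
from openai import OpenAI

client = OpenAI()
ticket_text = "Customer message about a delayed shipment..."  # placeholder input

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[
        {"role": "system", "content": "Reply only with a JSON object using the agreed keys."},
        {"role": "user", "content": f"Extract sentiment, issue, order_date, expected_delivery, status:\n{ticket_text}"},
    ],
    response_format={"type": "json_object"},  # model must emit valid JSON
    max_tokens=120,                           # hard cap on output tokens
)
print(response.choices[0].message.content)
```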
Streaming doesn't reduce costs but improves perceived performance:
```python
def stream_response(prompt: str):
    """Stream the LLM response chunk by chunk."""
    stream = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:  # some chunks carry no text (e.g. the final one)
            yield delta

# Display to the user immediately
for token in stream_response(user_query):
    print(token, end="", flush=True)
```
Benefit: User sees response start in 200ms instead of waiting 3s for full completion.
Cost: Identical to non-streaming
Fine-tuned models need shorter prompts to achieve the same quality.
Example: Customer support classification
Base model (GPT-3.5, 450-token prompt with examples):
You are a customer support classifier. Examples:
[10 examples, 400 tokens]
Classify this ticket: [50 tokens]
Cost: £0.00045/call
Fine-tuned model (GPT-3.5 fine-tuned on 500 examples):
Classify: [50 tokens]
Cost: £0.00005/call (90% cheaper)
Fine-tuning costs: a one-off training charge (priced per training token) plus, on most providers, a higher per-token inference rate than the base model, so the prompt savings have to outweigh both.
When to fine-tune: high-volume, narrow, stable tasks (classification, extraction, routing) where you already have hundreds of labelled examples and prompt overhead dominates cost.
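For reference, kicking off a fine-tune via the OpenAI API looks roughly like this; the file name is a placeholder and the training data is JSONL with one chat-formatted example per line:

```python
from openai import OpenAI

client = OpenAI()

# Upload labelled examples (JSONL: one {"messages": [...]} record per line)
training_file = client.files.create(
    file=open("support_classification.jsonl", "rb"),
    purpose="fine-tune",
)

# Create the fine-tuning job against the base model
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id, job.status)
```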
Track costs in real-time to catch spikes:
```python
from datetime import datetime
import numpy as np

class CostMonitor:
    """Track LLM costs per task and flag anomalies."""
    def __init__(self):
        self.task_costs = []

    def record_task_cost(self, task_id: str, cost: float):
        """Log a task's cost and alert if it's an outlier."""
        self.task_costs.append({
            "task_id": task_id,
            "cost": cost,
            "timestamp": datetime.utcnow(),
        })
        # Alert if this task costs 3× the recent average
        recent_avg = np.mean([t["cost"] for t in self.task_costs[-100:]])
        if cost > recent_avg * 3:
            self.alert_anomaly(task_id, cost, recent_avg)

    def alert_anomaly(self, task_id: str, cost: float, avg: float):
        """Alert on a cost spike."""
        # send_slack_alert: project helper, not defined here
        send_slack_alert(f"⚠️ Cost anomaly: Task {task_id} cost £{cost:.4f} (avg: £{avg:.4f})")

# Daily summary
def daily_cost_report():
    """Summarise today's task costs."""
    # is_today and the shared monitor instance are project helpers, not defined here
    today_tasks = [t for t in monitor.task_costs if is_today(t["timestamp"])]
    costs = [t["cost"] for t in today_tasks]
    return {
        "total_cost": sum(costs),
        "task_count": len(today_tasks),
        "avg_cost_per_task": np.mean(costs),
        "max_cost": max(costs),
        "p95_cost": np.percentile(costs, 95),
    }
```
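To produce the cost figure the monitor records, derive it from the token usage the API returns with each response; the price table here is an assumption you'd keep in sync with your provider's rate card:

```python
# Assumed prices (£ per 1M tokens); keep in sync with your provider's rate card
PRICES_PER_1M = {
    "gpt-4-turbo": {"input": 8.0, "output": 24.0},
    "gpt-3.5-turbo": {"input": 0.4, "output": 1.2},
}

def cost_from_usage(model: str, usage) -> float:
    """Convert an API usage object (prompt/completion token counts) into pounds."""
    prices = PRICES_PER_1M[model]
    return (usage.prompt_tokens * prices["input"] +
            usage.completion_tokens * prices["output"]) / 1_000_000

# e.g. response = client.chat.completions.create(...)
# monitor.record_task_cost(task_id, cost_from_usage("gpt-4-turbo", response.usage))
```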
Baseline (no optimizations): £0.46/task at an 87% task success rate.
Optimized (all strategies combined): £0.18/task at an 89% task success rate.
Cost breakdown:
| Component | Baseline | Optimized | Savings |
|---|---|---|---|
| LLM API | £0.33 | £0.12 | -64% |
| Tool APIs | £0.08 | £0.04 | -50% |
| Infrastructure | £0.03 | £0.01 | -67% |
| Monitoring | £0.02 | £0.01 | -50% |
| Caching savings (included above) | - | £0.11 | - |
| Total | £0.46 | £0.18 | -61% |
Monthly savings @ 40K tasks: roughly £11,200 (£0.28 saved per task × 40,000 tasks), about £134K annualised.
Week 1: Baseline measurement (instrument cost per task, log tokens per call, break spend down by component).
Week 2: Quick wins (response caching, prompt compression, output constraints).
Week 3: Model routing (tiered routing with cheap-first escalation).
Week 4: Advanced optimizations (batching, semantic caching, prompt-prefix caching).
Month 2+: fine-tune high-volume tasks, evaluate self-hosting at scale, and keep cost monitoring running continuously.
LLM costs dominate agent economics: optimizing inference costs is critical for scalability.
Smart model routing offers the biggest single win: route simple tasks to cheap models, escalate complex tasks to expensive models.
Caching delivers immediate ROI: a 34% hit rate was typical for us, and a cache lookup costs £0.00002 to save a £0.05 LLM call.
Optimizations compound: combining strategies yields 60%+ savings; the effects are multiplicative, not merely additive.
Quality doesn't have to suffer: our task success rate improved by 2pp whilst cost per task fell 61%.
Agent economics improve dramatically with deliberate cost optimization. Start with model routing and caching for quick wins, then layer in prompt compression, batching, and fine-tuning as volume scales. The goal isn't minimum cost: it's maximum value per pound spent.
Q: Will cheaper models hurt quality? A: For many tasks, no. GPT-3.5 handles classification, extraction, and simple Q&A at 85-90% of GPT-4 quality for 1/10th the cost. Test on your use case.
Q: How do I know if optimizations are working? A: Track cost/task weekly. If cost drops but task success rate stays flat or improves, you're winning.
Q: Should I optimize before launching or after? A: Get to product-market fit first. Optimize once you have consistent usage and understand cost drivers. Premature optimization wastes time.
Q: What's a good cost/task target? A: Depends on value delivered. If agent saves £2 in human time per task, £0.50/task is excellent ROI. If it saves £0.50, you need <£0.10/task.