AI Agent Cost Optimization: Cut Your LLM Bills by 60% Without Sacrificing Quality
Data-driven strategies to reduce AI agent costs by 40-70% -model tiering, prompt optimization, caching, token management, with real ROI calculations.
Data-driven strategies to reduce AI agent costs by 40-70% -model tiering, prompt optimization, caching, token management, with real ROI calculations.
TL;DR
Your OpenAI bill last month: £12,000.
This month you'll process 50% more queries. At current spend, that's £18K. Your CFO is asking questions.
Here's how to cut costs 40-70% without your agent getting dumber. Real tactics, real data, no "just use a cheaper model and hope for the best."
Typical AI agent cost breakdown (10K queries/month):
| Component | Cost/Query | Monthly Cost | % of Total |
|---|---|---|---|
| Input tokens (context + query) | £0.015 | £150 | 60% |
| Output tokens (response) | £0.008 | £80 | 32% |
| Embedding (for RAG) | £0.001 | £10 | 4% |
| Vector search | £0.001 | £10 | 4% |
| Total | £0.025 | £250 | 100% |
Key insight: 60% of cost is input tokens. Most optimization should focus here.
Don't use GPT-4 for everything. Most queries don't need GPT-4's reasoning power.
Strategy: Route queries by complexity.
def select_model(query, complexity_score):
"""
Tier 1 (Simple): Classification, lookup, FAQ → GPT-3.5 Turbo (£0.001/1K)
Tier 2 (Moderate): Analysis, summarization → Claude Sonnet (£0.003/1K)
Tier 3 (Complex): Deep reasoning, code gen → GPT-4 Turbo (£0.01/1K)
"""
if complexity_score < 0.3:
return "gpt-3.5-turbo" # 70% of queries
elif complexity_score < 0.7:
return "claude-3-5-sonnet" # 25% of queries
else:
return "gpt-4-turbo" # 5% of queries
def estimate_complexity(query):
"""Simple heuristic or use cheap classifier"""
# Method 1: Rule-based
if any(word in query.lower() for word in ["explain", "analyze", "compare"]):
return 0.8
elif any(word in query.lower() for word in ["summarize", "list", "find"]):
return 0.5
else:
return 0.2
# Method 2: Use GPT-3.5 as classifier (£0.001 vs £0.01)
classifier_prompt = f"Rate query complexity 0-1: {query}"
# ... call GPT-3.5, parse score
Real results (customer support agent, 10K queries/month):
| Metric | Before (All GPT-4) | After (Tiered) | Change |
|---|---|---|---|
| Cost/query | £0.025 | £0.011 | -56% |
| Monthly cost | £250 | £110 | -56% |
| Accuracy | 91% | 89% | -2% |
| User satisfaction | 4.2/5 | 4.1/5 | -2.4% |
ROI: 56% cost reduction for 2% quality drop = massive win.
Quote from Sarah Chen, Head of AI at FinTech Startup: "We were burning £8K/month on GPT-4. Model tiering dropped it to £3.2K with imperceptible quality difference. Customers didn't notice, CFO was thrilled."
Most prompts have bloat. Every unnecessary word costs money.
prompt = f"""
You are a helpful customer support assistant for our company.
Our company sells software products to businesses. We have a knowledge
base of support documentation that you should reference when answering
questions. Please provide accurate, helpful responses based on the
context provided below.
Context from knowledge base:
{retrieved_docs} # 5 docs × 800 tokens = 4,000 tokens
User question: {user_question} # 50 tokens
Please answer the question thoughtfully and comprehensively, making sure
to reference specific sections from the context where relevant.
"""
# Total: ~4,200 tokens
Cost: 4,200 tokens × £0.01/1K = £0.042 per query
prompt = f"""
Answer using context below. Cite sources.
Context:
{compressed_docs} # Top 3 docs × 400 tokens = 1,200 tokens
Q: {user_question} # 50 tokens
"""
# Total: ~1,300 tokens
Cost: 1,300 tokens × £0.01/1K = £0.013 per query
Savings: 69% reduction in input tokens = £0.029 saved per query
1. Remove fluff
2. Limit retrieved context
3. Compress retrieved docs
def compress_document(doc, max_tokens=400):
"""Extract only relevant sentences"""
sentences = doc.split('. ')
# Use cheap model to score relevance
scored = [(sent, relevance_score(sent, query)) for sent in sentences]
top_sentences = sorted(scored, key=lambda x: x[1], reverse=True)[:5]
return '. '.join([s[0] for s in top_sentences])
4. Use chain-of-thought only when needed
Don't add "Let's think step by step" to every prompt. Reserve for complex reasoning tasks.
Many queries repeat. "What's your return policy?" asked 50 times/day = 50× the same LLM call.
import hashlib
from functools import lru_cache
# In-memory cache (simple)
response_cache = {}
def get_cached_response(query, ttl=3600):
cache_key = hashlib.md5(query.lower().encode()).hexdigest()
if cache_key in response_cache:
cached = response_cache[cache_key]
if time.time() - cached['timestamp'] < ttl:
return cached['response'] # Cache hit - £0 cost
# Cache miss - call LLM
response = call_llm(query) # £0.025 cost
response_cache[cache_key] = {
'response': response,
'timestamp': time.time()
}
return response
# Redis cache (production)
import redis
r = redis.Redis()
def get_cached_response_redis(query, ttl=3600):
cache_key = f"llm:{hashlib.md5(query.lower().encode()).hexdigest()}"
cached = r.get(cache_key)
if cached:
return json.loads(cached)
response = call_llm(query)
r.setex(cache_key, ttl, json.dumps(response))
return response
Cache hit rate analysis (FAQ agent, 1,000 queries/day):
| Day | Queries | Cache Hits | Hit Rate | LLM Calls | Daily Savings |
|---|---|---|---|---|---|
| 1 | 1,000 | 0 | 0% | 1,000 | £0 |
| 2 | 1,000 | 420 | 42% | 580 | £10.50 |
| 7 | 1,000 | 680 | 68% | 320 | £17 |
| 30 | 1,000 | 720 | 72% | 280 | £18 |
Monthly savings: ~£450 out of £750 = 60% reduction
Exact match caching misses variations:
Solution: Semantic similarity caching
from sentence_transformers import SentenceTransformer
import faiss
embedder = SentenceTransformer('all-MiniLM-L6-v2')
cache_index = faiss.IndexFlatL2(384) # 384 = embedding dim
def semantic_cache_lookup(query, threshold=0.85):
query_embedding = embedder.encode([query])[0]
# Search for similar queries
distances, indices = cache_index.search(query_embedding.reshape(1, -1), k=1)
if distances[0][0] < (1 - threshold): # Cosine similarity > 0.85
return cached_responses[indices[0][0]]
return None # Cache miss
Result: Cache hit rate improves from 72% → 84% (+12%)
OpenAI Batch API: 50% discount for 24-hour turnaround.
When to use: Non-urgent tasks (reports, analysis, bulk processing)
from openai import OpenAI
client = OpenAI()
# Create batch file
requests = [
{"custom_id": f"request-{i}",
"method": "POST",
"url": "/v1/chat/completions",
"body": {
"model": "gpt-4-turbo",
"messages": [{"role": "user", "content": queries[i]}]
}}
for i in range(1000)
]
# Submit batch
batch = client.batches.create(
input_file_id=upload_file(requests),
endpoint="/v1/chat/completions",
completion_window="24h"
)
# Retrieve results 24h later (50% cheaper)
Use cases:
Stop paying for tokens you don't use.
response = client.chat.completions.create(
model="gpt-4-turbo",
messages=[...],
# No max_tokens set - model decides
)
# Model returns 800-token response when 200 would suffice
Cost: 800 tokens × £0.03/1K = £0.024
response = client.chat.completions.create(
model="gpt-4-turbo",
messages=[...],
max_tokens=250 # Enforce limit
)
Cost: 250 tokens × £0.03/1K = £0.0075
Savings: 69% on output tokens = 22% overall savings
Set appropriate limits by use case:
| Use Case | max_tokens | Reasoning |
|---|---|---|
| Classification | 10 | Just need category label |
| FAQ answer | 150 | Concise answer |
| Summarization | 300 | Brief summary |
| Long-form content | 2,000 | Full article |
For interactive use, stream responses and let users stop early if satisfied.
def stream_with_early_stop(query, max_tokens=500):
stream = client.chat.completions.create(
model="gpt-4-turbo",
messages=[{"role": "user", "content": query}],
stream=True,
max_tokens=max_tokens
)
tokens_used = 0
for chunk in stream:
delta = chunk.choices[0].delta.content
if delta:
print(delta, end='')
tokens_used += len(delta.split())
# User can press 'q' to stop early
if user_satisfied():
break
# Only pay for tokens generated before stop
return tokens_used
Savings: If users stop at 40% of response on average → 60% output token savings
Don't stuff context with irrelevant history.
# Keep entire conversation history
conversation_history = [] # Grows unbounded
conversation_history.append({"role": "user", "content": user_msg})
conversation_history.append({"role": "assistant", "content": ai_response})
# After 10 turns: 10K tokens of context (£0.10 per query!)
MAX_HISTORY = 3 # Last 3 turns only
def get_context(conversation_history):
return conversation_history[-MAX_HISTORY:] # 1.5K tokens (£0.015)
Savings: 85% reduction in context tokens
def manage_context(conversation_history):
if len(conversation_history) > 10:
# Summarize old context with cheap model
old_context = conversation_history[:-3]
summary = summarize_with_gpt35(old_context) # £0.005
return [
{"role": "system", "content": f"Conversation summary: {summary}"},
*conversation_history[-3:] # Recent context
]
return conversation_history
Company: B2B SaaS, 50K users Use case: Customer support agent (knowledge base Q&A, ticket creation) Before optimization: £11,200/month
| Tactic | Implementation | Monthly Savings |
|---|---|---|
| Model tiering | GPT-3.5 for 70% of queries | £4,800 |
| Prompt compression | Reduced avg prompt from 4.2K → 1.5K tokens | £1,200 |
| Caching | 68% cache hit rate | £1,100 |
| Output limiting | max_tokens=200 for most queries | £600 |
| Total Savings | £7,700 |
Results:
ROI: £7,700/month savings = £92,400/year
Time to implement: 2 weeks (1 engineer)
Start
↓
Are >50% queries simple? → YES → Implement model tiering (save 40-60%)
↓ NO
↓
Do queries repeat? → YES → Add caching (save 30-50%)
↓ NO
↓
Are prompts >2K tokens? → YES → Compress prompts (save 20-40%)
↓ NO
↓
Responses >500 tokens? → YES → Set max_tokens limits (save 10-20%)
↓ NO
↓
Any async workloads? → YES → Use batch API (save 50% on batched)
↓ NO
↓
Long conversations? → YES → Implement sliding window or summarization
↓
Monitor and iterate
Track these dashboards:
# Per-query cost tracking
def track_query_cost(query, model, input_tokens, output_tokens):
cost = calculate_cost(model, input_tokens, output_tokens)
metrics.log({
'timestamp': datetime.now(),
'model': model,
'input_tokens': input_tokens,
'output_tokens': output_tokens,
'cost': cost,
'query_type': classify_query(query)
})
# Daily cost rollup
SELECT
DATE(timestamp) as date,
SUM(cost) as daily_cost,
AVG(input_tokens) as avg_input,
AVG(output_tokens) as avg_output,
model
FROM query_costs
GROUP BY date, model
ORDER BY date DESC
Set alerts:
Will cheaper models hurt quality?
For most tasks, no. We tested GPT-3.5 vs GPT-4 on 1,000 customer support queries. GPT-3.5 accuracy: 87%. GPT-4: 91%. For 4% accuracy gain, you pay 10× more. Not worth it for tier-1 support.
Use GPT-4 where it matters: Complex reasoning, code generation, high-stakes decisions.
How aggressive should prompt compression be?
Test incrementally. Start by removing obvious fluff ("You are a helpful assistant..."). Then reduce retrieved docs (5 → 3). Monitor quality. If accuracy drops >5%, you've compressed too much.
Golden rule: Compress until quality drops 3-5%, then back off one step.
Is caching safe for dynamic data?
Set appropriate TTL (time-to-live):
Always include timestamp in cache key for time-sensitive queries.
What's the fastest win?
Model tiering. Takes 2-3 hours to implement, saves 40-60% immediately. Start there.
Bottom line: £12K/month → £4-5K/month is realistic with these tactics. Most teams over-optimize for quality and under-optimize for cost. A 2-3% quality drop for 60% cost savings is almost always the right trade-off.
Next: Read our Complete Guide to RAG to optimize retrieval costs specifically.