Complete Guide to Agent Evaluation: Metrics, Benchmarks, and Testing Strategies
How to evaluate AI agent performance: success metrics, benchmark datasets, A/B testing strategies, and production monitoring for reliable agent deployments.
TL;DR
Common scenario:
Engineer: "I built an agent!"
Manager: "Does it work?"
Engineer: "...it seems to work?"
Manager: "How well?"
Engineer: "...I tested it on 3 examples?"
Problem: No systematic evaluation = no confidence in deployment.
Solution: Rigorous evaluation framework.
Task success rate. Definition: the percentage of tasks completed correctly.
How to measure:
```python
def evaluate_task_success(agent_output, expected_output, task_type):
    """
    Determine whether the agent successfully completed the task.
    """
    if task_type == "data_extraction":
        # Check if all required fields were extracted
        return all(field in agent_output for field in expected_output.keys())
    elif task_type == "classification":
        # Check if the classification matches
        return agent_output["category"] == expected_output["category"]
    elif task_type == "generation":
        # Use LLM-as-judge to evaluate quality
        judge_prompt = f"""
        Task: {expected_output['task_description']}
        Agent output: {agent_output}
        Expected criteria: {expected_output['criteria']}
        Does the output meet all criteria? (yes/no)
        """
        judgment = call_llm(judge_prompt, model="gpt-4-turbo")
        return "yes" in judgment.lower()
    return False

# Evaluate on the test set
test_cases = load_evaluation_dataset()
successes = 0
for test in test_cases:
    agent_output = agent.execute(test['input'])
    if evaluate_task_success(agent_output, test['expected_output'], test['task_type']):
        successes += 1

success_rate = successes / len(test_cases)
print(f"Success rate: {success_rate:.1%}")
```
| Metric | What It Measures | Target | How to Calculate |
|---|---|---|---|
| Accuracy | Correctness of outputs | >95% | Correct outputs / Total outputs |
| Latency | Response time | <5s (p95) | Time from input to final output |
| Cost | LLM API costs | <$0.10/task | Sum of all API calls per task |
| User satisfaction | End-user happiness | >4/5 | Survey ratings or thumbs up/down |
| Error rate | Unhandled exceptions | <2% | Errors / Total requests |
Example Metrics Dashboard:
```python
class AgentMetrics:
    def __init__(self):
        self.total_tasks = 0
        self.successful_tasks = 0
        self.total_latency = 0
        self.total_cost = 0
        self.errors = 0

    def record_task(self, success, latency_ms, cost_usd, error=None):
        self.total_tasks += 1
        if success:
            self.successful_tasks += 1
        self.total_latency += latency_ms
        self.total_cost += cost_usd
        if error:
            self.errors += 1

    def get_summary(self):
        return {
            "success_rate": self.successful_tasks / self.total_tasks,
            "avg_latency_ms": self.total_latency / self.total_tasks,
            "avg_cost_per_task": self.total_cost / self.total_tasks,
            "error_rate": self.errors / self.total_tasks
        }
```
Size: 50-200 examples minimum (more is better).
Coverage: Representative of real-world distribution.
```python
import random

def create_evaluation_dataset(production_logs, sample_size=200):
    """
    Sample diverse, representative test cases from production logs.
    """
    dataset = []

    # Stratified sampling by task type
    task_types = ["simple", "medium_complexity", "complex"]
    samples_per_type = sample_size // len(task_types)

    for task_type in task_types:
        # Get examples of this type
        examples = [
            log for log in production_logs
            if log['complexity'] == task_type
        ]
        # Random sample
        sampled = random.sample(examples, samples_per_type)

        for example in sampled:
            dataset.append({
                "input": example['user_input'],
                "expected_output": example['correct_output'],
                "task_type": task_type,
                "difficulty": example.get('difficulty', 'medium')
            })

    # Add manually curated edge cases
    dataset.extend(load_edge_cases())

    return dataset
```
Include a mix of difficulty levels and edge cases. Example test cases:
```json
[
  {
    "id": "test_001",
    "input": {
      "task": "Extract invoice data",
      "document": "invoice_sample_1.pdf"
    },
    "expected_output": {
      "invoice_number": "INV-12345",
      "date": "2024-06-15",
      "total": 1250.00,
      "vendor": "Acme Corp"
    },
    "task_type": "data_extraction",
    "difficulty": "easy"
  },
  {
    "id": "test_002",
    "input": {
      "task": "Classify customer support ticket",
      "text": "My payment failed but I was still charged."
    },
    "expected_output": {
      "category": "billing_issue",
      "priority": "high",
      "department": "finance"
    },
    "task_type": "classification",
    "difficulty": "medium"
  }
]
```
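`load_evaluation_dataset()` and `load_edge_cases()` are referenced earlier but not shown. A minimal sketch, assuming the test cases are stored as JSON files in the format above (file names are placeholders):

```python
import json

def load_evaluation_dataset(path: str = "eval_dataset.json") -> list:
    """Load test cases saved in the JSON format shown above."""
    with open(path) as f:
        return json.load(f)

def load_edge_cases(path: str = "edge_cases.json") -> list:
    """Manually curated tricky inputs, kept in a separate file so they are never sampled away."""
    with open(path) as f:
        return json.load(f)
```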
```python
import time

def run_benchmark(agent, evaluation_dataset):
    """
    Evaluate the agent on the full dataset and return aggregate metrics.
    """
    results = []

    for test_case in evaluation_dataset:
        start_time = time.time()
        try:
            # Run the agent
            output = agent.execute(test_case['input'])

            # Evaluate success
            success = evaluate_task_success(
                output,
                test_case['expected_output'],
                test_case['task_type']
            )

            latency = (time.time() - start_time) * 1000  # ms
            results.append({
                "test_id": test_case['id'],
                "success": success,
                "latency_ms": latency,
                "cost_usd": calculate_cost(output),
                "output": output
            })
        except Exception as e:
            results.append({
                "test_id": test_case['id'],
                "success": False,
                "error": str(e)
            })

    # Calculate aggregate metrics
    total = len(results)
    successful = sum(1 for r in results if r['success'])
    avg_latency = sum(r.get('latency_ms', 0) for r in results) / total
    total_cost = sum(r.get('cost_usd', 0) for r in results)

    return {
        "success_rate": successful / total,
        "avg_latency_ms": avg_latency,
        "total_cost_usd": total_cost,
        "avg_cost_per_task": total_cost / total,
        "detailed_results": results
    }

# Run the benchmark
benchmark_results = run_benchmark(my_agent, eval_dataset)
print(f"Success rate: {benchmark_results['success_rate']:.1%}")
print(f"Avg latency: {benchmark_results['avg_latency_ms']:.0f}ms")
print(f"Avg cost: ${benchmark_results['avg_cost_per_task']:.4f}/task")
```
Baseline 1: Direct LLM call (no agent framework)
```python
baseline_gpt4 = SimpleAgent(model="gpt-4-turbo", system_prompt="You are a helpful assistant.")
baseline_results = run_benchmark(baseline_gpt4, eval_dataset)

print(f"Your agent: {benchmark_results['success_rate']:.1%}")
print(f"GPT-4 baseline: {baseline_results['success_rate']:.1%}")
```
Baseline 2: Previous version of your agent
```python
previous_version_results = load_benchmark("agent_v1.2_results.json")
current_version_results = run_benchmark(agent_v1_3, eval_dataset)

improvement = current_version_results['success_rate'] - previous_version_results['success_rate']
print(f"Improvement: {improvement:+.1%}")
```
| Model | Success Rate | Avg Latency | Cost/Task | Best For |
|---|---|---|---|---|
| GPT-4 Turbo | 89% | 3.2s | $0.042 | Complex reasoning |
| Claude 3.5 Sonnet | 91% | 2.8s | $0.038 | Balanced quality/speed |
| GPT-3.5 Turbo | 78% | 1.1s | $0.008 | Simple tasks |
| Claude 3 Haiku | 81% | 0.9s | $0.005 | High-volume, simple |
(Benchmarked on mixed task dataset, June 2024)
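To produce a comparison like the table above, run the same benchmark with your agent pointed at each candidate model. The `build_agent()` factory below is hypothetical; substitute however your agent is constructed, and note that model names and prices change over time.

```python
models = ["gpt-4-turbo", "claude-3-5-sonnet", "gpt-3.5-turbo", "claude-3-haiku"]

for model in models:
    agent = build_agent(model=model)  # hypothetical factory for your agent
    result = run_benchmark(agent, eval_dataset)
    print(
        f"{model:20s} "
        f"success={result['success_rate']:.1%} "
        f"latency={result['avg_latency_ms']:.0f}ms "
        f"cost=${result['avg_cost_per_task']:.4f}/task"
    )
```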
For open-ended tasks (content generation, summarization), use another LLM to evaluate quality.
```python
import json

def llm_as_judge(task, agent_output, criteria):
    """
    Use GPT-4 to evaluate the quality of an agent's output.
    """
    judge_prompt = f"""
    You are evaluating an AI agent's performance.

    Task: {task}

    Agent output:
    {agent_output}

    Evaluation criteria:
    {criteria}

    Rate the output on each criterion (1-5 scale):
    - Accuracy: Is the information correct?
    - Completeness: Does it address all parts of the task?
    - Clarity: Is it easy to understand?
    - Relevance: Is it on-topic?

    Respond in JSON format:
    {{
        "accuracy": <1-5>,
        "completeness": <1-5>,
        "clarity": <1-5>,
        "relevance": <1-5>,
        "overall_score": <average>,
        "reasoning": "<brief explanation>"
    }}
    """
    judgment = call_llm(judge_prompt, model="gpt-4-turbo", temperature=0)
    return json.loads(judgment)

# Evaluate an agent output
judgment = llm_as_judge(
    task="Summarize this 10-page document",
    agent_output=agent_summary,
    criteria="Summary should be 3-5 sentences, capture key points, and be accurate."
)

print(f"Overall score: {judgment['overall_score']}/5")
print(f"Reasoning: {judgment['reasoning']}")
```
Reliability: research on LLM-as-judge reports agreement with human evaluators in the roughly 85-90% range, but it is worth spot-checking agreement on your own workload (see the sketch below).
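A simple way to run that spot check: have a human mark a sample of outputs pass/fail, run the same outputs through the judge, and measure agreement. The field names and the 4-out-of-5 pass threshold below are assumptions; adapt them to your judging rubric.

```python
def judge_human_agreement(labeled_cases, pass_threshold=4):
    """
    labeled_cases: list of dicts with 'task', 'agent_output', 'criteria',
    and a human-provided boolean 'human_pass'.
    Returns the fraction of cases where the LLM judge and the human agree.
    """
    agreements = 0
    for case in labeled_cases:
        judgment = llm_as_judge(case["task"], case["agent_output"], case["criteria"])
        judge_pass = judgment["overall_score"] >= pass_threshold
        agreements += int(judge_pass == case["human_pass"])
    return agreements / len(labeled_cases)
```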
Goal: Compare two agent versions with real users.
Setup:
```python
import random
import time

class ABTestRouter:
    def __init__(self, version_a_agent, version_b_agent, b_traffic_percent=5):
        self.version_a = version_a_agent
        self.version_b = version_b_agent
        self.b_traffic_percent = b_traffic_percent
        self.metrics_a = AgentMetrics()
        self.metrics_b = AgentMetrics()

    async def route_request(self, user_input):
        # Randomly assign the request to version A or B
        use_version_b = random.random() < (self.b_traffic_percent / 100)

        if use_version_b:
            agent = self.version_b
            metrics = self.metrics_b
            version = "B"
        else:
            agent = self.version_a
            metrics = self.metrics_a
            version = "A"

        # Execute and track
        start_time = time.time()
        try:
            result = await agent.execute(user_input)
            latency = (time.time() - start_time) * 1000
            cost = calculate_cost(result)

            metrics.record_task(
                success=True,
                latency_ms=latency,
                cost_usd=cost
            )

            # Log for analysis
            log_ab_test_result(version, user_input, result, latency, cost)

            return result
        except Exception as e:
            metrics.record_task(
                success=False,
                latency_ms=0,
                cost_usd=0,
                error=str(e)
            )
            raise

    def get_comparison(self):
        """Compare A vs B performance."""
        a_stats = self.metrics_a.get_summary()
        b_stats = self.metrics_b.get_summary()

        return {
            "version_a": a_stats,
            "version_b": b_stats,
            "improvement": {
                "success_rate": b_stats['success_rate'] - a_stats['success_rate'],
                "latency": b_stats['avg_latency_ms'] - a_stats['avg_latency_ms'],
                "cost": b_stats['avg_cost_per_task'] - a_stats['avg_cost_per_task']
            }
        }
```
```python
from statsmodels.stats.proportion import proportions_ztest

def is_statistically_significant(metrics_a, metrics_b, min_samples=100):
    """
    Check whether the difference in success rate between A and B is statistically significant.
    """
    if metrics_a.total_tasks < min_samples or metrics_b.total_tasks < min_samples:
        return False, "Insufficient sample size"

    # Two-proportion z-test on success counts
    successes_a = metrics_a.successful_tasks
    successes_b = metrics_b.successful_tasks
    total_a = metrics_a.total_tasks
    total_b = metrics_b.total_tasks

    # Calculate the p-value (proportions_ztest lives in statsmodels, not scipy)
    stat, p_value = proportions_ztest(
        [successes_a, successes_b],
        [total_a, total_b]
    )

    # Significant if p < 0.05
    is_significant = p_value < 0.05
    return is_significant, f"p-value: {p_value:.4f}"
```
Track metrics in real-time to catch regressions.
```python
import time
from prometheus_client import Counter, Histogram, Gauge

# Define metrics
tasks_total = Counter('agent_tasks_total', 'Total tasks', ['agent_name', 'status'])
task_duration = Histogram('agent_task_duration_seconds', 'Task duration', ['agent_name'])
task_cost = Histogram('agent_task_cost_usd', 'Task cost', ['agent_name'])
success_rate = Gauge('agent_success_rate', 'Current success rate', ['agent_name'])

def track_agent_execution(agent, agent_name, task_input):
    start_time = time.time()
    try:
        result = agent.execute(task_input)

        # Record success
        tasks_total.labels(agent_name=agent_name, status='success').inc()
        duration = time.time() - start_time
        task_duration.labels(agent_name=agent_name).observe(duration)
        cost = calculate_cost(result)
        task_cost.labels(agent_name=agent_name).observe(cost)

        # Update success rate (rolling window)
        update_success_rate(agent_name, success=True)

        return result
    except Exception:
        # Record failure
        tasks_total.labels(agent_name=agent_name, status='failure').inc()
        update_success_rate(agent_name, success=False)
        raise
```
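`update_success_rate()` is referenced above but not defined. A minimal sketch using a fixed-size rolling window (the 500-task window is an arbitrary choice) that feeds the `success_rate` gauge declared earlier:

```python
from collections import defaultdict, deque

# Most recent outcomes per agent; the window size is an arbitrary choice.
_recent_outcomes = defaultdict(lambda: deque(maxlen=500))

def update_success_rate(agent_name: str, success: bool) -> None:
    """Recompute the rolling success rate and publish it to the Prometheus gauge."""
    window = _recent_outcomes[agent_name]
    window.append(success)
    success_rate.labels(agent_name=agent_name).set(sum(window) / len(window))
```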
Example Prometheus alerting rules:

```yaml
# Alert if the success rate drops below 85%
- alert: LowSuccessRate
  expr: agent_success_rate < 0.85
  for: 5m
  annotations:
    summary: "Agent success rate dropped to {{ $value }}"

# Alert if latency spikes
- alert: HighLatency
  expr: histogram_quantile(0.95, rate(agent_task_duration_seconds_bucket[5m])) > 10
  for: 5m
  annotations:
    summary: "Agent p95 latency is {{ $value }}s"
```
Company: E-commerce customer support
Agent: Automated ticket routing and responses
Evaluation process:
1. Defined success metrics
2. Created an evaluation dataset
3. Benchmarked against baselines
4. Improved the agent with prompt engineering
5. Ran an A/B test
6. Set up production monitoring
Results: Agent handles 67% of tickets autonomously (up from 0%). Customer satisfaction: 4.2/5 for agent responses.
How often should I re-evaluate my agent?
Recommendation: re-run your benchmark whenever you change prompts, models, or tools, and keep continuous production monitoring in place between benchmark runs.
What if my agent has no "correct" output?
For open-ended tasks (creative writing, brainstorming), use LLM-as-judge with explicit criteria, human review of a sample of outputs, or end-user feedback signals such as thumbs up/down.
How many test cases do I need?
Minimum: 50 cases. Good: 200+ cases. Ideal: 1,000+ cases.
More test cases = higher confidence in metrics.
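As a rough intuition for why more cases help, the normal-approximation margin of error on a measured success rate shrinks with the square root of the number of test cases. A small sketch (assumes independent, representative test cases):

```python
import math

def success_rate_margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% confidence half-width for a success rate p measured on n cases."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (50, 200, 1000):
    print(f"n={n}: 85% ± {success_rate_margin_of_error(0.85, n):.1%}")
```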
Should I use human or LLM evaluation?
| Method | Cost | Speed | Reliability | Best For |
|---|---|---|---|---|
| Human | High | Slow | Highest | Final validation, edge cases |
| LLM-as-judge | Low | Fast | 85-90% | Iteration, bulk evaluation |
| Automated metrics | Lowest | Fastest | Varies | Objective tasks (extraction, classification) |
Use LLM-as-judge for iteration, human evaluation for final validation.
Bottom line: Rigorous evaluation is essential for reliable AI agents. Define success metrics, create diverse test datasets (50-200 examples), benchmark against baselines, A/B test in production, and monitor continuously. Teams with systematic evaluation deploy 3× faster with 40% fewer production issues.
Next: Read our Agent Testing Strategies guide for comprehensive testing approaches.