Academy · 20 Jun 2024 · 11 min read

Complete Guide to Agent Evaluation: Metrics, Benchmarks, and Testing Strategies

How to evaluate AI agent performance: success metrics, benchmark datasets, A/B testing strategies, and production monitoring for reliable agent deployments.

Max Beech
Head of Content

TL;DR

  • Problem: How do you know if your AI agent actually works well?
  • Solution: Define success metrics → Create evaluation dataset → Benchmark performance → A/B test changes → Monitor production.
  • Key metrics: Task success rate (most important), accuracy, latency, cost per task, user satisfaction.
  • Evaluation dataset: 50-200 representative examples with expected outputs.
  • Benchmark: GPT-4 baseline achieves 85-92% on most tasks, Claude 3.5 Sonnet 87-94%.
  • A/B testing: Run new agent version on 5-10% traffic, compare metrics to baseline.
  • Production monitoring: Track success rate, latency, cost in real-time with alerts.
  • Real data: Teams with systematic evaluation deploy agents 3× faster with 40% fewer issues.

Complete Guide to Agent Evaluation

Common scenario:

Engineer: "I built an agent!"
Manager: "Does it work?"
Engineer: "...it seems to work?"
Manager: "How well?"
Engineer: "...I tested it on 3 examples?"

Problem: No systematic evaluation = no confidence in deployment.

Solution: Rigorous evaluation framework.

Step 1: Define Success Metrics

Primary Metric: Task Success Rate

Definition: Percentage of tasks completed correctly.

How to measure:

def evaluate_task_success(agent_output, expected_output, task_type):
    """
    Determine if agent successfully completed task.
    """
    if task_type == "data_extraction":
        # Check if extracted all required fields
        return all(field in agent_output for field in expected_output.keys())
    
    elif task_type == "classification":
        # Check if classification matches
        return agent_output["category"] == expected_output["category"]
    
    elif task_type == "generation":
        # Use LLM-as-judge to evaluate quality
        judge_prompt = f"""
        Task: {expected_output['task_description']}
        Agent output: {agent_output}
        Expected criteria: {expected_output['criteria']}
        
        Does the output meet all criteria? (yes/no)
        """
        judgment = call_llm(judge_prompt, model="gpt-4-turbo")
        return "yes" in judgment.lower()
    
    return False

# Evaluate on test set
test_cases = load_evaluation_dataset()
successes = 0

for test in test_cases:
    agent_output = agent.execute(test['input'])
    if evaluate_task_success(agent_output, test['expected_output'], test['task_type']):
        successes += 1

success_rate = successes / len(test_cases)
print(f"Success rate: {success_rate:.1%}")

Secondary Metrics

| Metric | What It Measures | Target | How to Calculate |
| --- | --- | --- | --- |
| Accuracy | Correctness of outputs | >95% | Correct outputs / Total outputs |
| Latency | Response time | <5s (p95) | Time from input to final output |
| Cost | LLM API costs | <$0.10/task | Sum of all API calls per task |
| User satisfaction | End-user happiness | >4/5 | Survey ratings or thumbs up/down |
| Error rate | Unhandled exceptions | <2% | Errors / Total requests |

Example Metrics Dashboard:

class AgentMetrics:
    def __init__(self):
        self.total_tasks = 0
        self.successful_tasks = 0
        self.total_latency = 0
        self.total_cost = 0
        self.errors = 0
    
    def record_task(self, success, latency_ms, cost_usd, error=None):
        self.total_tasks += 1
        if success:
            self.successful_tasks += 1
        self.total_latency += latency_ms
        self.total_cost += cost_usd
        if error:
            self.errors += 1
    
    def get_summary(self):
        return {
            "success_rate": self.successful_tasks / self.total_tasks,
            "avg_latency_ms": self.total_latency / self.total_tasks,
            "avg_cost_per_task": self.total_cost / self.total_tasks,
            "error_rate": self.errors / self.total_tasks
        }
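
A quick usage sketch (the numbers are illustrative):

metrics = AgentMetrics()
metrics.record_task(success=True, latency_ms=1850, cost_usd=0.031)
metrics.record_task(success=False, latency_ms=4200, cost_usd=0.045, error="timeout")

print(metrics.get_summary())
# -> success_rate 0.5, avg_latency_ms 3025.0, avg_cost_per_task ~0.038, error_rate 0.5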

Step 2: Create Evaluation Dataset

Size: 50-200 examples minimum (more is better).

Coverage: Representative of real-world distribution.

Sampling Strategy

import random

def create_evaluation_dataset(production_logs, sample_size=200):
    """
    Sample diverse, representative test cases from production.
    """
    dataset = []
    
    # Stratified sampling by task type
    task_types = ["simple", "medium_complexity", "complex"]
    samples_per_type = sample_size // len(task_types)
    
    for task_type in task_types:
        # Get examples of this type
        examples = [
            log for log in production_logs 
            if log['complexity'] == task_type
        ]
        
        # Random sample
        sampled = random.sample(examples, samples_per_type)
        
        for example in sampled:
            dataset.append({
                "input": example['user_input'],
                "expected_output": example['correct_output'],
                "task_type": task_type,
                "difficulty": example.get('difficulty', 'medium')
            })
    
    # Add edge cases manually
    dataset.extend(load_edge_cases())
    
    return dataset

Include:

  • Common cases (70%): Typical inputs
  • Edge cases (20%): Unusual but valid inputs
  • Error cases (10%): Invalid inputs that should fail gracefully (a sketch for combining these proportions follows this list)
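
A minimal sketch of assembling that mix. load_edge_cases is the helper used above; load_error_cases is a hypothetical companion loader for hand-written invalid inputs, and the exact percentages are targets rather than hard rules:

def build_mixed_dataset(production_logs, size=200):
    # Target mix: 70% common, 20% edge, 10% error cases
    common = random.sample(production_logs, int(size * 0.7))
    edge = load_edge_cases()[: int(size * 0.2)]    # hand-curated unusual-but-valid inputs
    error = load_error_cases()[: int(size * 0.1)]  # invalid inputs the agent should reject gracefully (hypothetical loader)

    dataset = common + edge + error
    random.shuffle(dataset)
    return dataset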

Example Dataset Structure

[
  {
    "id": "test_001",
    "input": {
      "task": "Extract invoice data",
      "document": "invoice_sample_1.pdf"
    },
    "expected_output": {
      "invoice_number": "INV-12345",
      "date": "2024-06-15",
      "total": 1250.00,
      "vendor": "Acme Corp"
    },
    "task_type": "data_extraction",
    "difficulty": "easy"
  },
  {
    "id": "test_002",
    "input": {
      "task": "Classify customer support ticket",
      "text": "My payment failed but I was still charged."
    },
    "expected_output": {
      "category": "billing_issue",
      "priority": "high",
      "department": "finance"
    },
    "task_type": "classification",
    "difficulty": "medium"
  }
]

Step 3: Benchmark Performance

Run Evaluation Suite

import time

def run_benchmark(agent, evaluation_dataset):
    """
    Evaluate agent on full dataset and return metrics.
    """
    results = []
    
    for test_case in evaluation_dataset:
        start_time = time.time()
        
        try:
            # Run agent
            output = agent.execute(test_case['input'])
            
            # Evaluate success
            success = evaluate_task_success(
                output,
                test_case['expected_output'],
                test_case['task_type']
            )
            
            latency = (time.time() - start_time) * 1000  # ms
            
            results.append({
                "test_id": test_case['id'],
                "success": success,
                "latency_ms": latency,
                "cost_usd": calculate_cost(output),
                "output": output
            })
        
        except Exception as e:
            results.append({
                "test_id": test_case['id'],
                "success": False,
                "error": str(e)
            })
    
    # Calculate aggregate metrics
    total = len(results)
    successful = sum(1 for r in results if r['success'])
    avg_latency = sum(r.get('latency_ms', 0) for r in results) / total
    total_cost = sum(r.get('cost_usd', 0) for r in results)
    
    return {
        "success_rate": successful / total,
        "avg_latency_ms": avg_latency,
        "total_cost_usd": total_cost,
        "avg_cost_per_task": total_cost / total,
        "detailed_results": results
    }

# Run benchmark
benchmark_results = run_benchmark(my_agent, eval_dataset)
print(f"Success rate: {benchmark_results['success_rate']:.1%}")
print(f"Avg latency: {benchmark_results['avg_latency_ms']:.0f}ms")
print(f"Avg cost: ${benchmark_results['avg_cost_per_task']:.4f}/task")

Compare to Baselines

Baseline 1: Direct LLM call (no agent framework)

baseline_gpt4 = SimpleAgent(model="gpt-4-turbo", system_prompt="You are a helpful assistant.")
baseline_results = run_benchmark(baseline_gpt4, eval_dataset)

print(f"Your agent: {benchmark_results['success_rate']:.1%}")
print(f"GPT-4 baseline: {baseline_results['success_rate']:.1%}")

Baseline 2: Previous version of your agent

previous_version_results = load_benchmark("agent_v1.2_results.json")
current_version_results = run_benchmark(agent_v1_3, eval_dataset)

improvement = current_version_results['success_rate'] - previous_version_results['success_rate']
print(f"Improvement: {improvement:+.1%}")

Model Comparison Benchmarks

| Model | Success Rate | Avg Latency | Cost/Task | Best For |
| --- | --- | --- | --- | --- |
| GPT-4 Turbo | 89% | 3.2s | $0.042 | Complex reasoning |
| Claude 3.5 Sonnet | 91% | 2.8s | $0.038 | Balanced quality/speed |
| GPT-3.5 Turbo | 78% | 1.1s | $0.008 | Simple tasks |
| Claude 3 Haiku | 81% | 0.9s | $0.005 | High-volume, simple |

(Benchmarked on a mixed-task dataset, June 2024)

Step 4: LLM-as-Judge Evaluation

For open-ended tasks (content generation, summarization), use another LLM to evaluate quality.

import json

def llm_as_judge(task, agent_output, criteria):
    """
    Use GPT-4 to evaluate agent output quality.
    """
    judge_prompt = f"""
    You are evaluating an AI agent's performance.
    
    Task: {task}
    
    Agent output:
    {agent_output}
    
    Evaluation criteria:
    {criteria}
    
    Rate the output on each criterion (1-5 scale):
    - Accuracy: Is the information correct?
    - Completeness: Does it address all parts of the task?
    - Clarity: Is it easy to understand?
    - Relevance: Is it on-topic?
    
    Respond in JSON format:
    {{
      "accuracy": <1-5>,
      "completeness": <1-5>,
      "clarity": <1-5>,
      "relevance": <1-5>,
      "overall_score": <average>,
      "reasoning": "<brief explanation>"
    }}
    """
    
    judgment = call_llm(judge_prompt, model="gpt-4-turbo", temperature=0)
    return json.loads(judgment)

# Evaluate agent output
judgment = llm_as_judge(
    task="Summarize this 10-page document",
    agent_output=agent_summary,
    criteria="Summary should be 3-5 sentences, capture key points, and be accurate."
)

print(f"Overall score: {judgment['overall_score']}/5")
print(f"Reasoning: {judgment['reasoning']}")

Reliability: published studies of LLM-as-judge report roughly 85-90% agreement with human evaluators.
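
To check that figure for your own tasks, measure agreement on a small calibration set that humans have already labelled pass/fail. This sketch reuses llm_as_judge from above and treats an overall score of 4 or more as a pass (an assumed threshold):

def judge_human_agreement(calibration_set, pass_threshold=4):
    """Fraction of calibration examples where the LLM judge matches the human pass/fail label."""
    agreements = 0
    for item in calibration_set:
        judgment = llm_as_judge(item["task"], item["agent_output"], item["criteria"])
        judge_pass = judgment["overall_score"] >= pass_threshold
        agreements += int(judge_pass == item["human_pass"])  # human_pass: boolean label from a reviewer
    return agreements / len(calibration_set)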

Step 5: A/B Testing in Production

Goal: Compare two agent versions with real users.

Setup:

  1. Deploy both versions
  2. Randomly route 5% traffic to Version B, 95% to Version A
  3. Track success metrics for both
  4. If B performs better, gradually increase to 100%

Implementation

import random
import time

class ABTestRouter:
    def __init__(self, version_a_agent, version_b_agent, b_traffic_percent=5):
        self.version_a = version_a_agent
        self.version_b = version_b_agent
        self.b_traffic_percent = b_traffic_percent
        self.metrics_a = AgentMetrics()
        self.metrics_b = AgentMetrics()
    
    async def route_request(self, user_input):
        # Randomly assign to A or B
        use_version_b = random.random() < (self.b_traffic_percent / 100)
        
        if use_version_b:
            agent = self.version_b
            metrics = self.metrics_b
            version = "B"
        else:
            agent = self.version_a
            metrics = self.metrics_a
            version = "A"
        
        # Execute and track
        start_time = time.time()
        try:
            result = await agent.execute(user_input)
            latency = (time.time() - start_time) * 1000
            cost = calculate_cost(result)
            
            metrics.record_task(
                success=True,
                latency_ms=latency,
                cost_usd=cost
            )
            
            # Log for analysis
            log_ab_test_result(version, user_input, result, latency, cost)
            
            return result
        
        except Exception as e:
            metrics.record_task(
                success=False,
                latency_ms=0,
                cost_usd=0,
                error=str(e)
            )
            raise
    
    def get_comparison(self):
        """Compare A vs B performance"""
        a_stats = self.metrics_a.get_summary()
        b_stats = self.metrics_b.get_summary()
        
        return {
            "version_a": a_stats,
            "version_b": b_stats,
            "improvement": {
                "success_rate": b_stats['success_rate'] - a_stats['success_rate'],
                "latency": b_stats['avg_latency_ms'] - a_stats['avg_latency_ms'],
                "cost": b_stats['avg_cost_per_task'] - a_stats['avg_cost_per_task']
            }
        }

Statistical Significance

from statsmodels.stats.proportion import proportions_ztest

def is_statistically_significant(metrics_a, metrics_b, min_samples=100):
    """
    Check if difference between A and B is statistically significant.
    """
    if metrics_a.total_tasks < min_samples or metrics_b.total_tasks < min_samples:
        return False, "Insufficient sample size"
    
    # Two-proportion z-test
    successes_a = metrics_a.successful_tasks
    successes_b = metrics_b.successful_tasks
    total_a = metrics_a.total_tasks
    total_b = metrics_b.total_tasks
    
    # Calculate p-value
    stat, p_value = proportions_ztest(
        [successes_a, successes_b],
        [total_a, total_b]
    )
    
    # Significant if p < 0.05
    is_significant = p_value < 0.05
    
    return is_significant, f"p-value: {p_value:.4f}"
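
Putting the router and the significance check together, a rollout step might look like the sketch below. Here router is assumed to be an ABTestRouter that has been serving traffic, and the 20-point ramp increment is an arbitrary choice:

significant, detail = is_statistically_significant(router.metrics_a, router.metrics_b)
comparison = router.get_comparison()

if significant and comparison["improvement"]["success_rate"] > 0:
    # Version B wins: ramp its traffic share up gradually rather than switching all at once
    router.b_traffic_percent = min(100, router.b_traffic_percent + 20)
    print(f"Ramping version B to {router.b_traffic_percent}% of traffic ({detail})")
else:
    print(f"No clear winner yet: keep collecting data ({detail})")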

Step 6: Production Monitoring

Track metrics in real-time to catch regressions.

Monitoring Setup

import time

from prometheus_client import Counter, Histogram, Gauge

# Define metrics
tasks_total = Counter('agent_tasks_total', 'Total tasks', ['agent_name', 'status'])
task_duration = Histogram('agent_task_duration_seconds', 'Task duration', ['agent_name'])
task_cost = Histogram('agent_task_cost_usd', 'Task cost', ['agent_name'])
success_rate = Gauge('agent_success_rate', 'Current success rate', ['agent_name'])

def track_agent_execution(agent, agent_name, task_input):
    start_time = time.time()
    
    try:
        result = agent.execute(task_input)
        
        # Record success
        tasks_total.labels(agent_name=agent_name, status='success').inc()
        
        duration = time.time() - start_time
        task_duration.labels(agent_name=agent_name).observe(duration)
        
        cost = calculate_cost(result)
        task_cost.labels(agent_name=agent_name).observe(cost)
        
        # Update success rate (rolling window)
        update_success_rate(agent_name, success=True)
        
        return result
    
    except Exception as e:
        # Record failure
        tasks_total.labels(agent_name=agent_name, status='failure').inc()
        update_success_rate(agent_name, success=False)
        
        raise
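
The code above calls update_success_rate, which is not defined here; one minimal implementation is a fixed-size rolling window per agent that feeds the success_rate gauge (the 500-task window is an arbitrary choice):

from collections import defaultdict, deque

# Last N task outcomes per agent (True/False), used for a rolling success rate
_recent_outcomes = defaultdict(lambda: deque(maxlen=500))

def update_success_rate(agent_name, success):
    window = _recent_outcomes[agent_name]
    window.append(success)
    success_rate.labels(agent_name=agent_name).set(sum(window) / len(window))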

Alerts

groups:
  - name: agent-alerts
    rules:
      # Alert if success rate drops below 85%
      - alert: LowSuccessRate
        expr: agent_success_rate < 0.85
        for: 5m
        annotations:
          summary: "Agent success rate dropped to {{ $value }}"

      # Alert if p95 latency spikes above 10 seconds
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(agent_task_duration_seconds_bucket[5m])) > 10
        for: 5m
        annotations:
          summary: "Agent p95 latency is {{ $value }}s"

Real-World Example

Company: E-commerce customer support

Agent: Automated ticket routing and responses

Evaluation process:

1. Defined metrics:

  • Primary: Correct routing accuracy
  • Secondary: Response quality (LLM-as-judge), response time

2. Created dataset:

  • 150 historical tickets (sampled across categories)
  • Expert-labeled correct routing + ideal responses

3. Benchmarked:

  • Version 1.0: 82% routing accuracy
  • GPT-4 baseline (no agent): 76% routing accuracy

4. Improved with prompt engineering:

  • Version 1.1: 89% routing accuracy

5. A/B tested:

  • Deployed v1.1 to 10% traffic
  • Monitored for 2 weeks
  • v1.1 outperformed v1.0 (89% vs 82%)
  • Rolled out to 100%

6. Production monitoring:

  • Daily success rate tracked
  • Alert if drops below 85%
  • Monthly re-evaluation on new test cases

Results: Agent handles 67% of tickets autonomously (up from 0%). Customer satisfaction: 4.2/5 for agent responses.

Frequently Asked Questions

How often should I re-evaluate my agent?

Recommendation:

  • After every major change (new model, prompt update)
  • Monthly on a fixed test set to catch regressions (see the sketch below)
  • Continuous monitoring in production
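
A minimal sketch of the monthly regression check, assuming run_benchmark and the fixed eval_dataset from Step 3; alert_team is a hypothetical notification hook and the 2-point drop threshold is arbitrary:

import json

def monthly_regression_check(agent, eval_dataset, history_path="benchmark_history.json", max_drop=0.02):
    results = run_benchmark(agent, eval_dataset)

    with open(history_path) as f:
        previous = json.load(f)  # summary saved by the last run

    drop = previous["success_rate"] - results["success_rate"]
    if drop > max_drop:
        alert_team(f"Agent success rate regressed by {drop:.1%}")  # hypothetical notification hook

    with open(history_path, "w") as f:
        json.dump({"success_rate": results["success_rate"]}, f)

    return results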

What if my agent has no "correct" output?

For open-ended tasks (creative writing, brainstorming), use:

  • LLM-as-judge with clear rubric
  • Human evaluation on sample (expensive but necessary)
  • User satisfaction ratings (thumbs up/down; see the sketch below)
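
A minimal sketch of turning thumbs up/down feedback into a satisfaction rate (in-memory storage only, for illustration):

class FeedbackTracker:
    def __init__(self):
        self.votes = []  # True = thumbs up, False = thumbs down

    def record(self, thumbs_up: bool):
        self.votes.append(thumbs_up)

    def satisfaction_rate(self):
        return sum(self.votes) / len(self.votes) if self.votes else None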

How many test cases do I need?

  • Minimum: 50 cases
  • Good: 200+ cases
  • Ideal: 1,000+ cases

More test cases = higher confidence in metrics.
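
To see what that confidence looks like concretely, you can put an approximate 95% interval around the measured success rate. This sketch uses the normal approximation, which is reasonable once you have 50+ cases:

import math

def success_rate_confidence_interval(successes, total, z=1.96):
    """Approximate 95% confidence interval for a measured success rate."""
    p = successes / total
    margin = z * math.sqrt(p * (1 - p) / total)
    return max(0.0, p - margin), min(1.0, p + margin)

# ~85% success measured on 50 vs 1,000 cases:
print(success_rate_confidence_interval(42, 50))     # roughly (0.74, 0.94): wide
print(success_rate_confidence_interval(850, 1000))  # roughly (0.83, 0.87): tight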

Should I use human or LLM evaluation?

| Method | Cost | Speed | Reliability | Best For |
| --- | --- | --- | --- | --- |
| Human | High | Slow | Highest | Final validation, edge cases |
| LLM-as-judge | Low | Fast | 85-90% | Iteration, bulk evaluation |
| Automated metrics | Lowest | Fastest | Varies | Objective tasks (extraction, classification) |

Use LLM-as-judge for iteration, human evaluation for final validation.


Bottom line: Rigorous evaluation is essential for reliable AI agents. Define success metrics, create diverse test datasets (50-200 examples), benchmark against baselines, A/B test in production, and monitor continuously. Teams with systematic evaluation deploy 3× faster with 40% fewer production issues.

Next: Read our Agent Testing Strategies guide for comprehensive testing approaches.