Evaluating AI Agent Performance: 12 Metrics That Actually Matter
Move beyond accuracy and latency: discover the operational metrics that predict agent success in production, from task completion rate to user satisfaction.
Your agent has 94% accuracy. Great! But does it actually work?
Last month, a team showed me their customer support agent. "94% accuracy on our test set," they said proudly. Then they showed production logs: 40% of user conversations ended in frustration, customers were requesting human transfer twice as often, and support tickets mentioning "AI unhelpful" tripled.
The agent was technically accurate (it correctly classified queries and retrieved relevant knowledge) but failed at its job: helping customers solve problems.
The gap: Traditional ML metrics measure model performance. Agent metrics must measure task completion, user satisfaction, and business outcomes.
This guide covers the 12 metrics that predict whether your agent succeeds in the real world, with benchmarks from 50+ production systems and red flags that indicate problems before users complain.
"We optimised for accuracy and got an agent nobody wanted to use. We optimised for task completion and got an agent that actually helped people." – Sarah Guo, Conviction VC (podcast, 2024)
Take the most common one. Accuracy: the % of predictions that match ground-truth labels.
Sounds reasonable. But consider this customer support query:
User: "My payment failed, can you help?"
Agent A response: "Payment processing errors can occur due to insufficient funds, expired cards, or bank restrictions." (Accurate information)
Agent B response: "I've checked your account. Your card expired last month. Would you like to update your payment method now?" (Actionable solution)
Both are "accurate" but only B actually helps the user. Accuracy doesn't measure helpfulness.
These metrics also assume a fixed classification task with labeled ground truth, while agents perform open-ended, multi-step work.
How do you calculate precision or recall for a creative, multi-step workflow? You can't, at least not cleanly.
Agents in production also face messy, shifting conditions that a static test set never captures.
What's needed instead are metrics that measure task-level success, operational efficiency, and business outcomes.
Metric 1: Task success rate. Definition: % of tasks completed successfully without human intervention
Calculation:
Task success rate = (Successfully completed tasks / Total tasks attempted) × 100
What counts as success: the user's goal was achieved, and that outcome is verified rather than assumed.
Benchmarks by use case:
| Use case | Target success rate | Production median |
|---|---|---|
| Customer support (tier-1) | 80-85% | 73% |
| Lead qualification | 85-90% | 82% |
| Document summarization | 90-95% | 88% |
| Code generation | 70-80% | 68% |
| Research synthesis | 75-85% | 71% |
Red flags:
How to measure:
def calculate_task_success_rate(tasks: list) -> dict:
"""Calculate task success rate from execution logs."""
total_tasks = len(tasks)
successful_tasks = 0
for task in tasks:
# Success criteria
if task.get("status") == "completed" and \
task.get("user_satisfied") == True and \
task.get("human_intervention") == False:
successful_tasks += 1
success_rate = (successful_tasks / total_tasks) * 100 if total_tasks > 0 else 0
return {
"success_rate": round(success_rate, 2),
"successful": successful_tasks,
"total": total_tasks,
"failed": total_tasks - successful_tasks
}
Metric 2: Autonomy rate. Definition: % of workflow handled autonomously (no human in the loop)
Calculation:
Autonomy = (Fully autonomous tasks / Total tasks) × 100
Autonomy tiers:
| Level | Description | Example |
|---|---|---|
| 0-20% | Human-driven, agent assists | Agent drafts email, human edits and sends |
| 20-50% | Collaborative | Agent qualifies lead, human approves score |
| 50-80% | Agent-driven, human approves | Agent books meeting, awaits confirmation |
| 80-100% | Fully autonomous | Agent handles end-to-end without human |
Target: Aim for 70-90% autonomy for production ROI. Below 50% autonomy means agents save minimal time.
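As a rough sketch of how this could be measured, assuming each task log carries a hypothetical human_touchpoints count (edits, approvals, or takeovers by a person):
def calculate_autonomy_rate(tasks: list) -> dict:
    """Calculate the share of tasks completed with zero human touchpoints."""
    total_tasks = len(tasks)
    # A task counts as fully autonomous only if no human edited, approved, or took over
    autonomous_tasks = sum(1 for task in tasks if task.get("human_touchpoints", 0) == 0)
    autonomy_rate = (autonomous_tasks / total_tasks) * 100 if total_tasks > 0 else 0
    return {
        "autonomy_rate": round(autonomy_rate, 2),
        "autonomous": autonomous_tasks,
        "assisted": total_tasks - autonomous_tasks
    }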
Metric 3: Coverage. Definition: % of potential use cases the agent can handle
Calculation:
Coverage = (Use cases agent handles / Total use cases in domain) × 100
Example: Customer support
Total use cases: 50 (password reset, billing questions, feature requests, bug reports, etc.)
Agent handles: 32 use cases
Coverage: 64%
Why it matters: Agents with low coverage (<50%) require frequent human hand-offs, frustrating users.
Improvement strategy: mine production tickets for the categories the agent can't yet handle, then add the highest-volume ones first. The helper below surfaces those gaps:
def identify_coverage_gaps(support_tickets: list, agent_capabilities: set) -> dict:
"""Find which ticket types agent can't handle."""
ticket_types = {}
for ticket in support_tickets:
ticket_type = ticket["category"]
ticket_types[ticket_type] = ticket_types.get(ticket_type, 0) + 1
# Find uncovered categories
total_tickets = len(support_tickets)
uncovered = {}
for ticket_type, count in sorted(ticket_types.items(), key=lambda x: x[1], reverse=True):
if ticket_type not in agent_capabilities:
uncovered[ticket_type] = {
"count": count,
"percentage": round(count / total_tickets * 100, 2)
}
return uncovered
Metric 4: Hallucination rate. Definition: % of agent responses containing fabricated information
Calculation:
Hallucination rate = (Responses with hallucinations / Total responses) × 100
Detection methods:
1. Citation checking:
def detect_hallucination(response: str, source_documents: list) -> bool:
    """Check if response claims are supported by sources (naive token-overlap check)."""
    # Treat each sentence as a claim; a production system would use an LLM or NLI model here
    claims = [sentence.strip() for sentence in response.split(".") if sentence.strip()]
    for claim in claims:
        claim_tokens = set(claim.lower().split())
        # A claim is "supported" if most of its tokens appear in at least one source document
        if not any(len(claim_tokens & set(doc.lower().split())) >= 0.6 * len(claim_tokens)
                   for doc in source_documents):
            return True  # Hallucination detected
    return False
2. Consistency checking:
from difflib import SequenceMatcher

def _text_similarity(a: str, b: str) -> float:
    """Rough string similarity in [0, 1]; a simple stand-in for an embedding-based score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def check_consistency(agent_responses: list, question_threshold: float = 0.8) -> float:
    """Check if agent gives consistent answers to similar questions."""
    similarity_scores = []
    for i, response_a in enumerate(agent_responses):
        for response_b in agent_responses[i+1:]:
            # Only compare answers when the underlying questions look alike
            if _text_similarity(response_a["question"], response_b["question"]) >= question_threshold:
                similarity_scores.append(
                    _text_similarity(response_a["answer"], response_b["answer"]))
    # High inconsistency across similar questions suggests hallucination
    avg_consistency = sum(similarity_scores) / len(similarity_scores) if similarity_scores else 1.0
    return 1.0 - avg_consistency  # Inconsistency rate
Benchmarks:
Metric 5: Tool call accuracy. Definition: % of tool/function calls that succeed and produce expected results
Calculation:
Tool accuracy = (Successful tool calls / Total tool calls) × 100
Tool call failures:
Example tracking:
def track_tool_performance(tool_calls: list) -> dict:
"""Analyse tool use effectiveness."""
tool_stats = {}
for call in tool_calls:
tool_name = call["tool"]
if tool_name not in tool_stats:
tool_stats[tool_name] = {
"total_calls": 0,
"successful": 0,
"failed": 0,
"avg_latency": 0,
"error_types": {}
}
tool_stats[tool_name]["total_calls"] += 1
if call["status"] == "success":
tool_stats[tool_name]["successful"] += 1
else:
tool_stats[tool_name]["failed"] += 1
error_type = call.get("error_type", "unknown")
tool_stats[tool_name]["error_types"][error_type] = \
tool_stats[tool_name]["error_types"].get(error_type, 0) + 1
tool_stats[tool_name]["avg_latency"] += call.get("latency_ms", 0)
# Calculate accuracy per tool
for tool_name, stats in tool_stats.items():
stats["accuracy"] = (stats["successful"] / stats["total_calls"] * 100) if stats["total_calls"] > 0 else 0
stats["avg_latency"] /= stats["total_calls"] if stats["total_calls"] > 0 else 1
return tool_stats
Target: >92% tool accuracy. Below 85% indicates poor tool selection or parameter handling.
Metric 6: User satisfaction. Definition: How satisfied users are with agent interactions
Measurement methods:
1. Explicit feedback:
"Was this helpful?" → Yes (satisfied) / No (unsatisfied)
"Rate this response 1-5" → ≥4 = satisfied
2. Implicit signals:
3. Net Promoter Score: "How likely are you to use this agent again?" (0-10 scale)
NPS = % Promoters - % Detractors
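A minimal sketch of rolling these signals up, assuming hypothetical feedback records with an optional rating (1-5) and nps_score (0-10):
def summarize_satisfaction(feedback: list) -> dict:
    """Roll up explicit ratings and NPS responses into summary metrics."""
    ratings = [f["rating"] for f in feedback if f.get("rating") is not None]
    nps_scores = [f["nps_score"] for f in feedback if f.get("nps_score") is not None]
    # Explicit satisfaction: share of ratings >= 4 on a 1-5 scale
    satisfied = sum(1 for r in ratings if r >= 4)
    satisfaction_rate = (satisfied / len(ratings)) * 100 if ratings else None
    # NPS: % promoters (9-10) minus % detractors (0-6)
    promoters = sum(1 for s in nps_scores if s >= 9)
    detractors = sum(1 for s in nps_scores if s <= 6)
    nps = ((promoters - detractors) / len(nps_scores)) * 100 if nps_scores else None
    return {"satisfaction_rate": satisfaction_rate, "nps": nps}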
Benchmarks:
Satisfaction by response type:
| Interaction type | Target satisfaction |
|---|---|
| Direct answer | 90%+ |
| Multi-step guidance | 80-85% |
| Escalated to human | 70-75% |
| Unable to help | 40-50% |
Metric 7: Cost per task. Definition: Average cost to complete one task
Calculation:
Cost per task = Total costs / Tasks completed
Total costs = (LLM API costs + Infrastructure + Tool API costs)
Cost breakdown example:
def calculate_task_cost(task_log: dict) -> dict:
"""Calculate cost components for a task."""
llm_cost = 0
tool_cost = 0
# LLM costs
for llm_call in task_log.get("llm_calls", []):
input_tokens = llm_call["input_tokens"]
output_tokens = llm_call["output_tokens"]
model = llm_call["model"]
# Price per million tokens
prices = {
"gpt-4-turbo": {"input": 10, "output": 30},
"gpt-3.5-turbo": {"input": 0.5, "output": 1.5},
"claude-3-sonnet": {"input": 3, "output": 15}
}
model_price = prices.get(model, prices["gpt-4-turbo"])
llm_cost += (input_tokens / 1_000_000 * model_price["input"]) + \
(output_tokens / 1_000_000 * model_price["output"])
# Tool API costs (example: Clearbit enrichment)
for tool_call in task_log.get("tool_calls", []):
if tool_call["tool"] == "clearbit_enrich":
tool_cost += 0.02 # $0.02 per enrichment
infrastructure_cost = 0.01 # Amortized server/database costs
total_cost = llm_cost + tool_cost + infrastructure_cost
return {
"total_cost": round(total_cost, 4),
"llm_cost": round(llm_cost, 4),
"tool_cost": round(tool_cost, 4),
"infrastructure_cost": infrastructure_cost
}
Benchmarks by use case:
| Use case | Target cost/task | Acceptable range |
|---|---|---|
| Customer support | $0.08-0.15 | Up to $0.30 |
| Lead qualification | $0.20-0.40 | Up to $0.80 |
| Content generation | $0.30-0.60 | Up to $1.20 |
| Research synthesis | $0.50-1.00 | Up to $2.50 |
| Code review | $0.15-0.35 | Up to $0.70 |
Red flags:
Metric 8: Latency (P50/P95/P99). Definition: Time from task start to completion
Why percentiles matter: averages hide the tail. A handful of very slow tasks can dominate the experience of the users who hit them, so track P50, P95, and P99 rather than the mean.
Example distribution:
| Percentile | Latency | Interpretation |
|---|---|---|
| P50 | 2.3s | Half of tasks complete in <2.3s |
| P95 | 8.1s | 95% complete in <8.1s |
| P99 | 24.7s | 1% take >24.7s (investigation needed) |
Latency targets by use case:
| Use case | P50 target | P95 target |
|---|---|---|
| Chatbot | <2s | <5s |
| Search | <500ms | <2s |
| Document processing | <10s | <30s |
| Research | <30s | <90s |
Tracking code:
import time
import numpy as np
class LatencyTracker:
"""Track latency percentiles."""
def __init__(self):
self.latencies = []
def record(self, latency_ms: float):
"""Record task latency."""
self.latencies.append(latency_ms)
def get_percentiles(self) -> dict:
"""Calculate latency percentiles."""
if not self.latencies:
return {}
return {
"p50": np.percentile(self.latencies, 50),
"p95": np.percentile(self.latencies, 95),
"p99": np.percentile(self.latencies, 99),
"max": max(self.latencies),
"mean": np.mean(self.latencies)
}
# Usage
tracker = LatencyTracker()
for task in tasks:
start = time.time()
result = agent.run(task)
latency_ms = (time.time() - start) * 1000
tracker.record(latency_ms)
print(tracker.get_percentiles())
Metric 9: Escalation rate. Definition: % of tasks requiring human intervention
Calculation:
Escalation rate = (Tasks escalated to humans / Total tasks) × 100
Types of escalations:
| Escalation type | Cause | Target % |
|---|---|---|
| Ambiguity | Agent can't determine intent | <2% |
| Low confidence | Agent unsure of answer | <3% |
| User request | User asks for human | <5% |
| Error/failure | Agent encounters technical issue | <1% |
| Policy | Task requires human judgment | <2% |
Overall target: <8% total escalation rate
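A small sketch of tracking this breakdown, assuming task logs carry hypothetical escalated and escalation_type fields:
def escalation_breakdown(tasks: list) -> dict:
    """Compute overall escalation rate and the share contributed by each escalation type."""
    total_tasks = len(tasks)
    escalations = [task for task in tasks if task.get("escalated")]
    by_type = {}
    for task in escalations:
        esc_type = task.get("escalation_type", "unknown")
        by_type[esc_type] = by_type.get(esc_type, 0) + 1
    return {
        "escalation_rate": round(len(escalations) / total_tasks * 100, 2) if total_tasks else 0,
        "by_type": {
            esc_type: round(count / total_tasks * 100, 2)
            for esc_type, count in by_type.items()
        }
    }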
When high escalation is acceptable:
Metric 10: Average retries. Definition: Average number of attempts before task completion
Calculation:
Avg retries = Total retry attempts / Total tasks
Interpretation:
Retry reasons to track:
def analyze_retries(tasks: list) -> dict:
"""Understand why tasks require retries."""
retry_reasons = {}
for task in tasks:
if task.get("retry_count", 0) > 0:
reason = task.get("retry_reason", "unknown")
retry_reasons[reason] = retry_reasons.get(reason, 0) + 1
return retry_reasons
# Example output:
# {
# "tool_timeout": 12,
# "invalid_output_format": 8,
# "hallucination_detected": 5,
# "user_clarification_needed": 3
# }
Metric 11: Time saved. Definition: Human hours saved by agent automation
Calculation:
Time saved = (Tasks completed × Avg manual time per task) - (Agent failures × Avg recovery time)
Example:
Time saved = (1,000 × 15 min) - (120 × 20 min)
= 15,000 min - 2,400 min
= 12,600 minutes (210 hours) saved
ROI calculation:
Value = Time saved × Hourly rate
= 210 hours × $50/hr
= $10,500/month
Cost = Agent API + infrastructure
= $1,200/month
ROI = (Value - Cost) / Cost × 100
= ($10,500 - $1,200) / $1,200 × 100
= 775% monthly ROI
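The same arithmetic as a reusable sketch; the manual time, recovery time, hourly rate, and monthly cost are inputs you would set for your own workflow:
def calculate_roi(tasks_completed: int, failures: int, avg_manual_min: float,
                  avg_recovery_min: float, hourly_rate: float, monthly_cost: float) -> dict:
    """Estimate hours saved, dollar value, and ROI from agent automation."""
    minutes_saved = (tasks_completed * avg_manual_min) - (failures * avg_recovery_min)
    hours_saved = minutes_saved / 60
    value = hours_saved * hourly_rate
    roi_pct = ((value - monthly_cost) / monthly_cost) * 100 if monthly_cost else 0
    return {
        "hours_saved": round(hours_saved, 1),
        "value": round(value, 2),
        "roi_pct": round(roi_pct, 1)
    }

# Reproduces the example above: 1,000 tasks, 120 failures, $50/hr, $1,200/month
print(calculate_roi(1000, 120, 15, 20, 50, 1200))
# {'hours_saved': 210.0, 'value': 10500.0, 'roi_pct': 775.0}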
Metric 12: Performance drift. Definition: Degradation in agent performance over time
Tracking:
def detect_drift(current_metrics: dict, baseline_metrics: dict) -> dict:
"""Compare current performance to baseline."""
drift_detected = {}
for metric, current_value in current_metrics.items():
baseline_value = baseline_metrics.get(metric)
if baseline_value:
change_pct = ((current_value - baseline_value) / baseline_value) * 100
# Thresholds for drift
if abs(change_pct) > 10:
drift_detected[metric] = {
"baseline": baseline_value,
"current": current_value,
"change_pct": round(change_pct, 2),
"severity": "high" if abs(change_pct) > 20 else "medium"
}
return drift_detected
# Example usage
baseline = {
"task_success_rate": 85.2,
"user_satisfaction": 4.3,
"cost_per_task": 0.28
}
current = {
"task_success_rate": 76.8, # -9.9% (borderline drift)
"user_satisfaction": 3.9, # -9.3%
"cost_per_task": 0.41 # +46% (HIGH drift)
}
drift = detect_drift(current, baseline)
# Output: {"cost_per_task": {"baseline": 0.28, "current": 0.41, "change_pct": 46.43, "severity": "high"}}
Common drift causes:
Mitigation:
Track all metrics in a unified dashboard:
| Metric | Current | Target | Status |
|---|---|---|---|
| Task success rate | 82.4% | 85%+ | ⚠ Yellow |
| User satisfaction | 4.2/5 | 4.0+ | ✅ Green |
| Autonomy | 78% | 70%+ | ✅ Green |
| Coverage | 68% | 75%+ | ⚠ Yellow |
| Hallucination rate | 1.8% | <2% | ✅ Green |
| Tool accuracy | 89.3% | 92%+ | ⚠ Yellow |
| Cost/task | $0.34 | <$0.50 | ✅ Green |
| P95 latency | 6.2s | <8s | ✅ Green |
| Escalation rate | 6.1% | <8% | ✅ Green |
| Avg retries | 0.21 | <0.3 | ✅ Green |
| Time saved | 187 hrs/mo | - | ✅ Green |
| Drift severity | Low | Low | ✅ Green |
Overall health: 9/12 green, 3/12 yellow, 0/12 red → Deploy-ready
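One way to generate the status column automatically; a sketch in which the "within 10% of target" yellow band is an illustrative threshold, not a standard:
def metric_status(current: float, target: float, higher_is_better: bool = True,
                  yellow_margin: float = 0.10) -> str:
    """Map a metric value to green/yellow/red relative to its target."""
    if higher_is_better:
        if current >= target:
            return "green"
        return "yellow" if current >= target * (1 - yellow_margin) else "red"
    # Lower-is-better metrics (cost per task, latency, escalation rate, retries)
    if current <= target:
        return "green"
    return "yellow" if current <= target * (1 + yellow_margin) else "red"

# Example: task success rate 82.4% against an 85% target
print(metric_status(82.4, 85.0))  # "yellow" (within 10% of target)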
Week 1: Baseline measurement
Week 2: Instrumentation
Week 3: Optimization
Week 4: Production rollout
Agent evaluation isn't about achieving 100% on any single metric; it's about balanced performance across task completion, quality, efficiency, and business impact. Track the 12 metrics that predict real-world success, set realistic targets for your use case, and improve systematically based on data.
Q: How many metrics should I track simultaneously? A: Start with 5 core metrics: task success rate, user satisfaction, cost/task, latency, and escalation rate. Add others as your system matures.
Q: What's the minimum sample size for reliable metrics? A: 100+ tasks for directional insights, 500+ for statistical significance, 1,000+ for confident optimization decisions.
Q: Should I use the same metrics for all agents? A: Core metrics (task success, cost) apply universally. Customize the rest: chatbots prioritize latency, research agents prioritize quality.
Q: How often should I review metrics? A: Daily for first 2 weeks post-launch, weekly for months 1-3, monthly thereafter (unless drift alerts fire).