Benchmarking AI Agents: Measuring Speed, Quality, and Cost
Build comprehensive performance benchmarks for AI agents tracking latency, success rates, output quality, and cost efficiency with automated testing and regression detection.
TL;DR
Agent performance degrades silently: prompt changes reduce output quality, model updates increase latency, tool additions inflate costs. Without systematic benchmarking, you discover issues only after users complain. Performance benchmarks provide objective, reproducible measurements of agent behavior across deployments.
This guide covers building comprehensive benchmarks for AI agents, based on Athenic's test suite that runs 240+ test cases on every deployment, catching regressions before they reach production.
Key takeaways
- Measure four dimensions: latency, success rate, quality, and cost.
- Design test cases covering 70% common scenarios, 20% edge cases, 10% known failures.
- Run benchmarks in the CI/CD pipeline; block deployments that fail quality or latency thresholds.
- Track metrics over time to identify gradual degradation (model drift, cost creep).
| Metric | Definition | Target | Alert threshold |
|---|---|---|---|
| Latency (p50) | Median execution time | <5s for simple tasks | >10s |
| Latency (p95) | 95th percentile | <15s | >25s |
| Success rate | % of tasks completing without errors | >95% | <90% |
| Quality score | Human/automated evaluation of output correctness | >85% | <75% |
| Cost per task | Average $ spent (API calls, compute) | <$0.50 | >$0.75 |
| Tool call efficiency | Avg tool calls per task | <4 | >7 |
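These targets work best as a single source of truth that both the CI gate and the dashboards read from. A minimal sketch of such a config, assuming a hypothetical shared module (the names here are illustrative, not part of any library):

```typescript
interface MetricThresholds {
  target: number;
  alert: number;
}

// Illustrative config mirroring the table above (latency in ms, cost in USD)
export const AGENT_SLOS: Record<string, MetricThresholds> = {
  latency_p50_ms: { target: 5_000, alert: 10_000 },
  latency_p95_ms: { target: 15_000, alert: 25_000 },
  success_rate: { target: 0.95, alert: 0.9 },
  quality_score: { target: 0.85, alert: 0.75 },
  cost_per_task_usd: { target: 0.5, alert: 0.75 },
  tool_calls_per_task: { target: 4, alert: 7 },
};
```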
Create a representative suite covering normal and edge cases.
interface BenchmarkCase {
id: string;
category: 'common' | 'edge' | 'failure';
agent: string;
input: string;
expected_output?: string; // For quality scoring
max_latency_ms: number;
max_cost_usd: number;
}
const benchmarkSuite: BenchmarkCase[] = [
// Common cases (70%)
{
id: 'research-basic',
category: 'common',
agent: 'research',
input: 'Find 5 companies in fintech using Stripe',
expected_output: undefined, // Will validate count and industry
max_latency_ms: 8000,
max_cost_usd: 0.15,
},
{
id: 'developer-simple-function',
category: 'common',
agent: 'developer',
input: 'Write a TypeScript function that checks if a string is a valid email',
expected_output: undefined, // Will validate syntax and functionality
max_latency_ms: 5000,
max_cost_usd: 0.08,
},
// Edge cases (20%)
{
id: 'research-zero-results',
category: 'edge',
agent: 'research',
input: 'Find companies in a non-existent industry',
expected_output: 'No results found',
max_latency_ms: 6000,
max_cost_usd: 0.12,
},
{
id: 'developer-malformed-request',
category: 'edge',
agent: 'developer',
input: 'Write code for [nonsensical gibberish]',
expected_output: undefined, // Should ask for clarification
max_latency_ms: 4000,
max_cost_usd: 0.05,
},
// Known failure scenarios (10%)
{
id: 'research-rate-limit',
category: 'failure',
agent: 'research',
input: 'Trigger API rate limit by requesting 1000 searches',
expected_output: 'Rate limit exceeded',
max_latency_ms: 3000,
max_cost_usd: 0.10,
},
];
Coverage distribution: roughly 70% common scenarios, 20% edge cases, and 10% known failure modes, mirroring the split in the suite above.
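A cheap guard against the suite drifting from that split is to compute the distribution before running anything. A sketch against the `benchmarkSuite` array above (the 10-point tolerance is an assumption):

```typescript
function checkCoverage(suite: BenchmarkCase[]): Record<string, number> {
  const counts: Record<string, number> = { common: 0, edge: 0, failure: 0 };
  for (const testCase of suite) counts[testCase.category]++;
  // Convert raw counts into percentages of the whole suite
  const percentages: Record<string, number> = {};
  for (const [category, count] of Object.entries(counts)) {
    percentages[category] = Math.round((count / suite.length) * 100);
  }
  return percentages;
}

// Example: warn when the suite drifts away from the intended 70/20/10 split
const coverage = checkCoverage(benchmarkSuite);
if (Math.abs(coverage.common - 70) > 10) {
  console.warn(`Common-case coverage is ${coverage.common}%, expected roughly 70%`);
}
```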
Track end-to-end execution time and component breakdowns.
interface BenchmarkResult {
case_id: string;
run_id: string;
timestamp: Date;
success: boolean;
latency_ms: number;
latency_breakdown: {
initialization: number;
tool_calls: number;
llm_inference: number;
post_processing: number;
};
cost_usd: number;
quality_score?: number;
error_message?: string;
}
async function runBenchmark(testCase: BenchmarkCase): Promise<BenchmarkResult> {
const runId = uuidv4();
const startTime = Date.now();
const breakdown = { initialization: 0, tool_calls: 0, llm_inference: 0, post_processing: 0 };
try {
// Initialize agent
const initStart = Date.now();
const agent = await getAgent(testCase.agent);
breakdown.initialization = Date.now() - initStart;
// Execute with instrumentation
const result = await agent.run({
messages: [{ role: 'user', content: testCase.input }],
onToolCall: (tool, duration) => {
breakdown.tool_calls += duration;
},
onLLMCall: (duration) => {
breakdown.llm_inference += duration;
},
});
// Post-process
const postStart = Date.now();
const qualityScore = await evaluateQuality(testCase, result);
breakdown.post_processing = Date.now() - postStart;
const totalLatency = Date.now() - startTime;
const cost = calculateCost(result.usage);
return {
case_id: testCase.id,
run_id: runId,
timestamp: new Date(),
success: true,
latency_ms: totalLatency,
latency_breakdown: breakdown,
cost_usd: cost,
quality_score: qualityScore,
};
} catch (error) {
return {
case_id: testCase.id,
run_id: runId,
timestamp: new Date(),
success: false,
latency_ms: Date.now() - startTime,
latency_breakdown: breakdown,
cost_usd: 0,
error_message: error.message,
};
}
}
Automated quality scoring using LLM-as-judge or rule-based validators.
async function evaluateQuality(testCase: BenchmarkCase, result: AgentResult): Promise<number> {
// Rule-based validation for specific cases
if (testCase.id === 'research-basic') {
const companies = extractCompanies(result.output);
const correctIndustry = companies.every(c => c.industry === 'fintech');
const correctTech = companies.every(c => c.technologies.includes('Stripe'));
const correctCount = companies.length === 5;
return (correctIndustry ? 0.4 : 0) + (correctTech ? 0.4 : 0) + (correctCount ? 0.2 : 0);
}
// LLM-as-judge for general cases
const judgmentPrompt = `
Rate the quality of this agent output on a scale of 0-1.
Input: ${testCase.input}
Output: ${result.output}
Expected: ${testCase.expected_output || 'N/A'}
Criteria:
- Correctness: Does it answer the question accurately?
- Completeness: Is all required information included?
- Clarity: Is the response well-formatted and understandable?
Return only a number between 0 and 1.
`;
const response = await openai.chat.completions.create({
model: 'gpt-4o-mini',
messages: [{ role: 'user', content: judgmentPrompt }],
});
return parseFloat(response.choices[0].message.content ?? '0');
}
Break down costs by component (LLM calls, tool invocations, compute).
function calculateCost(usage: AgentUsage): number {
const costs = {
'gpt-4o': { input: 0.005 / 1000, output: 0.015 / 1000 },
'gpt-4o-mini': { input: 0.00015 / 1000, output: 0.0006 / 1000 },
'text-embedding-3-small': 0.00002 / 1000,
};
let total = 0;
for (const call of usage.llm_calls) {
const model = costs[call.model];
total += call.input_tokens * model.input + call.output_tokens * model.output;
}
for (const embedding of usage.embeddings) {
total += embedding.tokens * costs['text-embedding-3-small'];
}
// Add tool costs (API calls, compute time)
total += usage.tool_invocations * 0.001; // $0.001 per tool call avg
return total;
}
Run benchmarks on every pull request and deployment.
# .github/workflows/benchmark.yml
name: Agent Benchmarks
on: [pull_request, push]
jobs:
benchmark:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Setup Node.js
uses: actions/setup-node@v3
with:
node-version: '20'
- name: Install dependencies
run: npm install
- name: Run benchmark suite
run: npm run benchmark
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
SUPABASE_URL: ${{ secrets.SUPABASE_URL }}
SUPABASE_SERVICE_KEY: ${{ secrets.SUPABASE_SERVICE_KEY }}
- name: Upload results
uses: actions/upload-artifact@v3
with:
name: benchmark-results
path: benchmark-results.json
- name: Compare to baseline
run: npm run benchmark:compare
- name: Fail if regression detected
run: npm run benchmark:check-thresholds
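The `npm run benchmark` step assumes a small runner script that executes the suite and writes the `benchmark-results.json` artifact referenced above. A minimal sketch (the script path and sequential execution are assumptions):

```typescript
// scripts/run-benchmarks.ts -- hypothetical entry point behind `npm run benchmark`
import { writeFileSync } from 'fs';

async function main() {
  const results: BenchmarkResult[] = [];
  // Run cases sequentially so latency measurements are not skewed by contention
  for (const testCase of benchmarkSuite) {
    const result = await runBenchmark(testCase);
    results.push(result);
    console.log(`${result.case_id}: ${result.success ? 'ok' : 'FAIL'} in ${result.latency_ms}ms`);
  }
  writeFileSync('benchmark-results.json', JSON.stringify(results, null, 2));
}

main().catch((error) => {
  console.error(error);
  process.exit(1);
});
```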
Compare current run against baseline (main branch or previous release).
interface BenchmarkComparison {
metric: string;
baseline: number;
current: number;
change_percent: number;
regression: boolean;
}
async function compareToBenchmark(currentResults: BenchmarkResult[], baselineResults: BenchmarkResult[]) {
const comparisons: BenchmarkComparison[] = [];
// Latency comparison
const currentP95 = percentile(currentResults.map(r => r.latency_ms), 0.95);
const baselineP95 = percentile(baselineResults.map(r => r.latency_ms), 0.95);
const latencyChange = ((currentP95 - baselineP95) / baselineP95) * 100;
comparisons.push({
metric: 'latency_p95',
baseline: baselineP95,
current: currentP95,
change_percent: latencyChange,
regression: latencyChange > 15, // Alert if >15% slower
});
// Success rate comparison
const currentSuccess = currentResults.filter(r => r.success).length / currentResults.length;
const baselineSuccess = baselineResults.filter(r => r.success).length / baselineResults.length;
const successChange = ((currentSuccess - baselineSuccess) / baselineSuccess) * 100;
comparisons.push({
metric: 'success_rate',
baseline: baselineSuccess,
current: currentSuccess,
change_percent: successChange,
regression: successChange < -5, // Alert if >5% drop
});
// Quality comparison
const currentQuality = avg(currentResults.map(r => r.quality_score || 0));
const baselineQuality = avg(baselineResults.map(r => r.quality_score || 0));
const qualityChange = ((currentQuality - baselineQuality) / baselineQuality) * 100;
comparisons.push({
metric: 'quality_score',
baseline: baselineQuality,
current: currentQuality,
change_percent: qualityChange,
regression: qualityChange < -10, // Alert if >10% drop
});
return comparisons;
}
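The `npm run benchmark:compare` step needs baseline results to feed into `compareToBenchmark`. One simple option is a JSON file produced by the last main-branch run; a sketch under that assumption (the file names are illustrative):

```typescript
// scripts/compare-benchmarks.ts -- hypothetical entry point behind `npm run benchmark:compare`
import { readFileSync } from 'fs';

async function main() {
  const current: BenchmarkResult[] = JSON.parse(readFileSync('benchmark-results.json', 'utf8'));
  // Baseline from the last main-branch run, e.g. a CI artifact downloaded in a previous step
  const baseline: BenchmarkResult[] = JSON.parse(readFileSync('baseline-results.json', 'utf8'));

  const comparisons = await compareToBenchmark(current, baseline);
  for (const c of comparisons) {
    const flag = c.regression ? 'REGRESSION' : 'ok';
    console.log(
      `${c.metric}: ${c.baseline.toFixed(3)} -> ${c.current.toFixed(3)} (${c.change_percent.toFixed(1)}%) ${flag}`
    );
  }
  // Non-zero exit fails the CI step when any metric regressed
  if (comparisons.some((c) => c.regression)) process.exit(1);
}

main().catch((error) => {
  console.error(error);
  process.exit(1);
});
```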
Block deployments if metrics exceed thresholds.
async function checkThresholds(results: BenchmarkResult[]): Promise<boolean> {
const failures = [];
const p95Latency = percentile(results.map(r => r.latency_ms), 0.95);
if (p95Latency > 25000) {
failures.push(`P95 latency (${p95Latency}ms) exceeds threshold (25000ms)`);
}
const successRate = results.filter(r => r.success).length / results.length;
if (successRate < 0.90) {
failures.push(`Success rate (${successRate * 100}%) below threshold (90%)`);
}
const avgQuality = avg(results.map(r => r.quality_score || 0));
if (avgQuality < 0.75) {
failures.push(`Quality score (${avgQuality}) below threshold (0.75)`);
}
const avgCost = avg(results.map(r => r.cost_usd));
if (avgCost > 0.75) {
failures.push(`Average cost ($${avgCost}) exceeds threshold ($0.75)`);
}
if (failures.length > 0) {
console.error('Benchmark failures:', failures);
return false;
}
return true;
}
Track metrics over time to detect gradual degradation.
CREATE TABLE benchmark_runs (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
commit_sha TEXT NOT NULL,
branch TEXT NOT NULL,
timestamp TIMESTAMPTZ NOT NULL DEFAULT NOW(),
p50_latency_ms INT,
p95_latency_ms INT,
success_rate NUMERIC(5,4),
avg_quality_score NUMERIC(3,2),
avg_cost_usd NUMERIC(6,4),
total_cases INT,
passed_cases INT
);
-- Query for trend analysis
SELECT
DATE_TRUNC('day', timestamp) AS day,
AVG(p95_latency_ms) AS avg_p95_latency,
AVG(success_rate) AS avg_success_rate,
AVG(avg_quality_score) AS avg_quality,
AVG(avg_cost_usd) AS avg_cost
FROM benchmark_runs
WHERE branch = 'main'
AND timestamp > NOW() - INTERVAL '30 days'
GROUP BY day
ORDER BY day;
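Nothing above actually writes to `benchmark_runs`. A sketch of that aggregation step using supabase-js, reusing the `percentile` and `avg` helpers from earlier (the `GITHUB_SHA` and `GITHUB_REF_NAME` variables come from GitHub Actions; the local fallbacks are assumptions):

```typescript
import { createClient } from '@supabase/supabase-js';

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_SERVICE_KEY!);

async function recordRun(results: BenchmarkResult[]): Promise<void> {
  const latencies = results.map((r) => r.latency_ms);
  const { error } = await supabase.from('benchmark_runs').insert({
    commit_sha: process.env.GITHUB_SHA ?? 'local',
    branch: process.env.GITHUB_REF_NAME ?? 'local',
    p50_latency_ms: Math.round(percentile(latencies, 0.5)),
    p95_latency_ms: Math.round(percentile(latencies, 0.95)),
    success_rate: results.filter((r) => r.success).length / results.length,
    avg_quality_score: avg(results.map((r) => r.quality_score ?? 0)),
    avg_cost_usd: avg(results.map((r) => r.cost_usd)),
    total_cases: results.length,
    passed_cases: results.filter((r) => r.success).length,
  });
  if (error) throw error;
}
```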
Detect sudden spikes or drops using statistical anomaly detection.
async function detectAnomalies() {
const recent = await db.benchmarkRuns.findAll({
branch: 'main',
timestamp: { $gte: new Date(Date.now() - 30 * 24 * 60 * 60 * 1000) },
order_by: 'timestamp DESC',
});
const latencies = recent.map(r => r.p95_latency_ms);
const mean = avg(latencies);
const stdDev = standardDeviation(latencies);
const latest = recent[0];
const zScore = (latest.p95_latency_ms - mean) / stdDev;
if (Math.abs(zScore) > 2) {
await sendAlert('benchmark_anomaly', {
metric: 'p95_latency',
value: latest.p95_latency_ms,
mean,
stdDev,
zScore,
message: `P95 latency is ${zScore.toFixed(2)} standard deviations from 30-day mean`,
});
}
}
Our benchmark suite runs 240 test cases across 6 agent types on every deployment.
Benchmark execution:
Historical impact:
Example regression caught:
- Date: July 15, 2025
- Change: Updated research agent prompt to improve output formatting
- Impact:
Without benchmarks, this would have reached production and affected 400+ daily users.
Download our agent benchmark starter kit with test cases, CI/CD configs, and trend dashboards.
How many test cases do you need?
Start with 20-30 cases covering your most common scenarios. Add edge cases as you discover them in production. Full coverage typically requires 100-200+ cases for complex agents.
Should benchmarks run against staging or production?
Both. Staging benchmarks run on every deployment; production benchmarks run nightly to catch environment-specific issues (API changes, data drift).
How do you handle non-deterministic agent output?
Run each test case 3-5 times and aggregate the results (median latency, average quality score). Track variance; high variance indicates unstable behavior.
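A sketch of that repeat-and-aggregate loop, reusing `runBenchmark` and assuming a simple `median` helper alongside the `avg` and `standardDeviation` functions used earlier:

```typescript
async function runBenchmarkRepeated(testCase: BenchmarkCase, repeats = 3) {
  const runs: BenchmarkResult[] = [];
  for (let i = 0; i < repeats; i++) {
    runs.push(await runBenchmark(testCase));
  }
  const latencies = runs.map((r) => r.latency_ms);
  return {
    case_id: testCase.id,
    median_latency_ms: median(latencies),
    avg_quality_score: avg(runs.map((r) => r.quality_score ?? 0)),
    success_rate: runs.filter((r) => r.success).length / runs.length,
    // High spread across repeats is a signal of unstable agent behavior
    latency_std_dev_ms: standardDeviation(latencies),
  };
}
```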
What timeout should each benchmark case use?
2-3× your p95 latency target. If the target is 10s, time out at 25s. This catches failures without waiting indefinitely.
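One way to enforce that inside the runner is to race the agent call against a timer. A sketch (the 2.5x multiplier applied to each case's latency budget is an assumption):

```typescript
function withTimeout<T>(promise: Promise<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`Timed out after ${timeoutMs}ms`)),
      timeoutMs
    );
    promise
      .then((value) => { clearTimeout(timer); resolve(value); })
      .catch((error) => { clearTimeout(timer); reject(error); });
  });
}

// Inside runBenchmark: allow roughly 2.5x each case's latency budget before giving up
const result = await withTimeout(
  agent.run({ messages: [{ role: 'user', content: testCase.input }] }),
  testCase.max_latency_ms * 2.5
);
```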
How do you handle flaky benchmark cases?
Mark flaky tests and track the flake rate. If more than 10% of cases are flaky, fix the test or the agent. Don't ignore them; flaky tests hide real regressions.
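Flake rate falls out of the repeated runs above: a case that both passes and fails within the same set of repeats is flaky. A sketch, assuming results have already been grouped by case id (the `resultsByCase` map is hypothetical):

```typescript
function flakeRate(resultsByCase: Map<string, BenchmarkResult[]>): number {
  let flaky = 0;
  for (const runs of resultsByCase.values()) {
    const outcomes = new Set(runs.map((r) => r.success));
    if (outcomes.size > 1) flaky++; // Saw both passes and failures for the same case
  }
  return flaky / resultsByCase.size;
}

// Example: fail the suite outright when more than 10% of cases are flaky
if (flakeRate(resultsByCase) > 0.1) {
  throw new Error('Flake rate above 10%: fix the tests or the agent before trusting this suite');
}
```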
Comprehensive agent benchmarking measures latency, success rates, quality scores, and costs across representative test suites. Run benchmarks in CI/CD to catch regressions, track trends to detect gradual degradation, and set threshold gates to block problematic deployments.