AI Agent Testing Frameworks: Complete 2025 Guide
Comprehensive guide to testing AI agents in production -unit testing prompts, integration testing workflows, evaluation frameworks, and regression detection strategies.
Comprehensive guide to testing AI agents in production -unit testing prompts, integration testing workflows, evaluation frameworks, and regression detection strategies.
TL;DR
Tested 5 different approaches across 20 production agents. Here's what works.
Why traditional testing fails:
def test_support_agent():
response = agent.invoke("I want a refund")
assert response == "I've processed your refund of $50"
# ❌ FAILS: LLM outputs vary ("I've issued..." vs "I've processed...")
Non-determinism problem: Same input → different outputs (but semantically equivalent).
Solutions:
Goal: Test individual prompts for correctness.
Framework: Promptfoo (open-source)
Setup:
# promptfooconfig.yaml
prompts:
- "Classify this support ticket: {{ticket}}\nCategories: billing, technical, sales"
providers:
- openai:gpt-4-turbo
tests:
- vars:
ticket: "I want a refund"
assert:
- type: contains
value: "billing"
- vars:
ticket: "App keeps crashing"
assert:
- type: contains
value: "technical"
- vars:
ticket: "Do you offer enterprise plans?"
assert:
- type: llm-rubric
value: "Output should be 'sales' category"
Run tests:
promptfoo eval
# Tests 100 examples, shows pass/fail rate
Assertion types:
| Type | Use Case | Example |
|---|---|---|
contains | Check substring | "must contain 'billing'" |
regex | Pattern matching | "must match ^\d{4}-\d{2}-\d{2}$" |
is-json | Valid JSON | Structured output validation |
llm-rubric | Semantic check | "Should apologize to customer" |
similarity | Embedding distance | >0.9 cosine similarity to expected |
Cost: Free (open-source), plus LLM costs (£0.50 for 100 examples with GPT-4)
Rating: 4.5/5 (best for prompt-level testing)
Goal: Test multi-step agent workflows end-to-end.
Framework: LangSmith Evaluations
Setup:
from langsmith import Client
from langsmith.evaluation import evaluate
client = Client()
# Create dataset (once)
dataset = client.create_dataset("support-workflows")
client.create_examples(
dataset_id=dataset.id,
inputs=[
{"ticket": "I want a refund, order #1234"},
{"ticket": "App won't load on iPhone 15"},
],
outputs=[
{"category": "billing", "order_id": "1234", "action": "refund_processed"},
{"category": "technical", "device": "iPhone 15", "action": "escalated"},
]
)
# Define evaluator
def workflow_correctness(run, example):
output = run.outputs
expected = example.outputs
# Check: correct category
if output.get("category") != expected.get("category"):
return {"score": 0, "reason": "Wrong category"}
# Check: correct action
if output.get("action") != expected.get("action"):
return {"score": 0, "reason": "Wrong action"}
return {"score": 1}
# Run evaluation
evaluate(
lambda inputs: agent.invoke(inputs["ticket"]),
data="support-workflows",
evaluators=[workflow_correctness],
experiment_prefix="support-agent-v2.1"
)
Output:
support-agent-v2.1
├── Accuracy: 92% (46/50)
├── Avg latency: 2.3s
├── Total cost: £1.20
└── Failures:
├── Example 12: Wrong category (billing vs technical)
└── Example 34: Missing order_id extraction
Advantages:
Cost: LangSmith Plus £40/month + LLM inference costs
Rating: 4.6/5 (best for LangChain workflows)
Goal: Catch regressions in production (without disrupting users).
Framework: Braintrust Monitoring
Setup:
from braintrust import init_logger
logger = init_logger(project="support-agent-prod")
def agent_with_monitoring(ticket):
span = logger.start_span(name="agent-invocation", input={"ticket": ticket})
try:
output = agent.invoke(ticket)
# Log output
span.log(output=output)
# Evaluate (async, doesn't block response)
span.score(
name="category-confidence",
score=output.get("confidence", 0)
)
return output
finally:
span.end()
Dashboard alerts:
Cost: Braintrust free tier (10K logs/month), then $50/month
Rating: 4.3/5 (best for production monitoring)
Concept: Use GPT-4 to grade GPT-4 outputs (sounds circular, but works).
Example (LangSmith):
from langsmith.evaluation import LangChainStringEvaluator
# Built-in evaluators
evaluators = [
LangChainStringEvaluator("cot_qa"), # Chain-of-thought Q&A
LangChainStringEvaluator("criteria", config={
"criteria": {
"helpfulness": "Is the response helpful to the user?",
"harmlessness": "Does it avoid harmful content?"
}
})
]
evaluate(
lambda inputs: agent.invoke(inputs),
data="support-tickets",
evaluators=evaluators
)
How it works:
Prompt to GPT-4:
Is this response helpful? Answer yes or no.
Question: "I want a refund"
Response: "I've processed your refund of $50, you'll see it in 3-5 days"
GPT-4: "Yes, the response directly addresses the refund request with specifics."
Score: 1 (helpful)
Accuracy: Model-graded evaluations correlate 0.85-0.90 with human judgments (OpenAI research).
Cost: ~£0.01 per evaluation (GPT-4 Turbo as judge)
When to use:
When not to use:
Problem: How do you know v2 of your agent is better than v1?
Solution: Benchmark dataset + version comparison
Setup:
# 1. Create golden dataset (100 real examples from production)
dataset = create_golden_dataset()
# 2. Run v1 baseline
results_v1 = evaluate(agent_v1, data=dataset, experiment_prefix="agent-v1")
# 3. Make changes (new prompt, different model, etc.)
# 4. Run v2
results_v2 = evaluate(agent_v2, data=dataset, experiment_prefix="agent-v2")
# 5. Compare
comparison = client.compare_experiments(["agent-v1", "agent-v2"])
print(comparison.summary())
Output:
Metric | v1 | v2 | Δ
----------------|-------|-------|-------
Accuracy | 89% | 92% | +3%
Avg latency | 2.1s | 1.8s | -14%
Cost per query | £0.02 | £0.015| -25%
Harmfulness | 2 incidents | 0 | -100%
Decision rule: Ship v2 if:
Real example: Ramp tested 47 prompt variations, found 3% accuracy gain with 30% cost reduction by switching from GPT-4 to Claude 3.5 Sonnet.
| Tool | Best For | Pricing | Pros | Cons |
|---|---|---|---|---|
| Promptfoo | Prompt unit tests | Free | Open-source, CLI | No UI |
| LangSmith | Workflow integration tests | $40/mo | Best debugging, versioning | LangChain lock-in |
| Braintrust | Production monitoring | $50/mo | Excellent dashboards | Smaller ecosystem |
| Patronus AI | Enterprise compliance | $500+/mo | Hallucination detection | Expensive |
| Manual (Pytest) | Custom needs | Free | Full control | High maintenance |
Recommendation: Start with Promptfoo (prompt tests) + LangSmith (workflow tests).
At Athenic, we test agents in 4 stages:
Tool: Promptfoo
Process:
promptfoo eval (100 examples, 30 seconds)Cost: ~£0.50 per test run
Tool: LangSmith Evaluations (via GitHub Actions)
Process:
Example CI config:
# .github/workflows/test-agents.yml
name: Test Agents
on: pull_request
jobs:
evaluate:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- run: pip install langsmith
- run: python scripts/evaluate_agents.py
- name: Check regression
run: |
if [ $ACCURACY_DROP -gt 2 ]; then
echo "Accuracy dropped by ${ACCURACY_DROP}%, blocking merge"
exit 1
fi
Cost: ~£5 per PR (500 examples × £0.01)
Tool: Braintrust Monitoring
Process:
Cost: ~£10 per staging deployment
Tool: Braintrust + Custom Metrics
Process:
Cost: ~£50 per canary (7 days × monitoring costs)
Total testing cost per release: £0.50 (local) + £5 (CI) + £10 (staging) + £50 (canary) = £65.50
Value: Prevents shipping broken agents that cost £1000s in lost revenue or support costs.
Problem: GPT-4 grading GPT-4 creates bias (model favors its own outputs).
Solution: Mix model grading (80%) with human review (20%).
Problem: Testing only happy path (miss edge cases).
Solution: Include adversarial examples:
Problem: assert "billing" in output breaks when output format changes.
Solution: Use semantic checks:
# ❌ Brittle
assert "billing" in output
# ✅ Robust
embedding_similarity(output, "This is a billing question") > 0.8
Problem: Can't detect production regressions.
Solution: Log 1% of production traffic as golden dataset, re-test monthly.
Minimum viable testing stack:
Total cost: £40/month + 7 hours/month
ROI: Prevents 1 major incident (£5K cost) every 2-3 months → £25K/year savings.
Advanced stack (for teams with >10 agents in production):
Sources: