TL;DR
- Three test levels: Unit (individual components), Integration (component interactions), E2E (full agent workflows).
- Unit tests: Fast, deterministic, test logic without LLM calls. Mock LLM responses.
- Integration tests: Test agent + external services (APIs, databases). Use staging environment.
- E2E tests: Test complete user workflow. Slow, expensive, but catches real issues.
- Mocking: Replace LLM calls with fixed responses for fast, cheap tests. 95% of tests should use mocks.
- LLM-as-judge: For E2E tests, use another LLM to evaluate output quality (can't use exact string matching).
- CI/CD: Run unit tests on every commit (seconds), integration tests on PR (minutes), E2E tests nightly (hours).
- Real data: Teams with comprehensive testing deploy 2.5× more frequently with 60% fewer production bugs.
Agent Testing Strategies
Without testing:
Engineer: "I updated the prompt"
Deploy to production
User: "Agent is broken!"
Engineer: "Oops, didn't test that"
With testing:
Engineer: "I updated the prompt"
Run tests → 12/15 tests fail
Engineer: "Found the issue, fixing..."
Run tests → 15/15 pass
Deploy with confidence
Testing Pyramid for AI Agents
Traditional software testing pyramid:
E2E (few)
Integration (some)
Unit (many)
AI agent testing pyramid (same structure):
E2E: Full workflow (10 tests, run nightly)
Integration: Agent + services (50 tests, run on PR)
Unit: Components (200 tests, run every commit)
Why the pyramid shape: unit tests are fast and cheap, so run them constantly; E2E tests are slow and expensive, so run them sparingly.
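The level names used throughout this guide (integration, e2e, slow) work best when registered as pytest markers, so -m selection works and pytest does not warn about unknown marks. A minimal sketch of the registration hook in a shared conftest.py (the marker descriptions are illustrative):
# conftest.py (project root) - register custom markers so `pytest -m integration` works
def pytest_configure(config):
    config.addinivalue_line("markers", "integration: tests that hit staging services")
    config.addinivalue_line("markers", "e2e: full end-to-end workflow tests")
    config.addinivalue_line("markers", "slow: tests that take minutes to run")
Each level can then be selected explicitly, for example: pytest tests/unit, pytest -m integration, or pytest -m "e2e and slow".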
Level 1: Unit Tests
What: Test individual components in isolation (parsers, validators, formatters, tool functions).
Goal: Fast, deterministic, no external dependencies.
Example: Test Tool Function
# tool: search_database.py
def search_database(query: str, limit: int = 10) -> list:
    """Search the database for records matching query."""
    if not query or len(query) < 3:
        raise ValueError("Query must be at least 3 characters")
    if limit < 1 or limit > 100:
        raise ValueError("Limit must be between 1 and 100")
    # Execute the database search with a parameterized query
    # (never interpolate user input into SQL; placeholder style depends on the driver)
    results = db.execute(
        "SELECT * FROM records WHERE content LIKE ? LIMIT ?",
        (f"%{query}%", limit),
    )
    return results
# test_search_database.py
import pytest
from unittest.mock import patch

from search_database import search_database

def test_search_database_valid_query():
    """Test search with a valid query returns results"""
    with patch("search_database.db") as mock_db:
        mock_db.execute.return_value = [{"id": 1, "content": "test result"}]
        results = search_database("test", limit=10)
        assert len(results) == 1
        assert results[0]["content"] == "test result"
        mock_db.execute.assert_called_once()

def test_search_database_short_query():
    """Test search rejects query < 3 chars"""
    with pytest.raises(ValueError, match="at least 3 characters"):
        search_database("ab")

def test_search_database_invalid_limit():
    """Test search rejects invalid limit"""
    with pytest.raises(ValueError, match="between 1 and 100"):
        search_database("test", limit=200)
Run time: <1 second for 100 unit tests.
Mock LLM Responses
Problem: LLM calls slow, expensive, non-deterministic.
Solution: Mock responses in unit tests.
# agent.py
import json

# call_llm is assumed to be the project's async helper that wraps the LLM API
class CustomerSupportAgent:
    async def classify_ticket(self, ticket_text):
        """Classify a support ticket into a category"""
        response = await call_llm(
            f"Classify this ticket: {ticket_text}\nCategories: billing, technical, account",
            model="gpt-3.5-turbo",
        )
        return json.loads(response)
# test_agent.py
import pytest
from unittest.mock import patch

from agent import CustomerSupportAgent

@pytest.mark.asyncio
async def test_classify_ticket_billing():
    """Test classification of a billing ticket"""
    agent = CustomerSupportAgent()
    # Mock the LLM response
    with patch("agent.call_llm") as mock_llm:
        mock_llm.return_value = '{"category": "billing", "confidence": 0.95}'
        result = await agent.classify_ticket("My payment failed")
        assert result["category"] == "billing"
        assert result["confidence"] == 0.95
        mock_llm.assert_called_once()

@pytest.mark.asyncio
async def test_classify_ticket_technical():
    """Test classification of a technical ticket"""
    agent = CustomerSupportAgent()
    with patch("agent.call_llm") as mock_llm:
        mock_llm.return_value = '{"category": "technical", "confidence": 0.89}'
        result = await agent.classify_ticket("App is crashing")
        assert result["category"] == "technical"
Benefits:
- Fast (no LLM API call)
- Free (no API costs)
- Deterministic (same input = same output)
Limitations:
- Doesn't test actual LLM behavior
- Might pass even if prompt is broken
Rule: 95% of tests should mock LLMs, 5% use real LLM calls (integration/E2E tests).
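One way to enforce the mock-by-default rule is a shared fixture that patches the LLM helper, so each test only decides what canned response to return. A minimal sketch, assuming the agent.call_llm helper shown above (the fixture name and default payload are illustrative):
# conftest.py - reusable LLM mock so unit tests never hit the API
import pytest
from unittest.mock import patch

@pytest.fixture
def mock_llm():
    """Patch agent.call_llm and yield the mock for per-test configuration."""
    with patch("agent.call_llm") as mocked:
        mocked.return_value = '{"category": "billing", "confidence": 0.95}'
        yield mocked
A test then takes mock_llm as an argument and overrides mock_llm.return_value whenever it needs a different canned response.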
Level 2: Integration Tests
What: Test agent with real external services (databases, APIs, LLMs) in staging environment.
Goal: Catch integration issues before production.
Example: Test Agent + Database + LLM
@pytest.mark.integration  # Mark as integration test
@pytest.mark.asyncio
async def test_agent_end_to_end_search():
    """Test agent can search database and format results"""
    # Setup: staging database with test data
    test_db = setup_staging_db()
    test_db.insert("test_record", {"id": 1, "content": "Integration test data"})
    # Create agent connected to the staging DB
    agent = SearchAgent(database=test_db)
    # Execute agent with a REAL LLM call
    result = await agent.execute("Find records about integration test")
    # Verify
    assert "integration test data" in result.lower()
    assert len(result) > 50  # Agent formatted a response (not just raw data)
    # Cleanup
    test_db.cleanup()
Run time: 5-10 seconds per test (LLM call adds latency).
Cost: $0.001-0.01 per test (LLM API calls).
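The setup_staging_db() call above is a project-specific helper; a common refinement is to wrap setup and cleanup in a pytest fixture so teardown always runs, even when an assertion fails. A sketch under that assumption (the helper and fixture names are hypothetical):
# conftest.py (integration suite) - pair staging-DB setup with guaranteed cleanup
import pytest

@pytest.fixture
def staging_db():
    db = setup_staging_db()  # hypothetical project helper
    db.insert("test_record", {"id": 1, "content": "Integration test data"})
    yield db  # the test runs here
    db.cleanup()  # teardown runs even if the test fails
The integration test above can then accept staging_db as a parameter instead of doing its own setup and cleanup.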
Test Agent Workflow
@pytest.mark.integration
@pytest.mark.asyncio
async def test_customer_support_workflow():
    """Test full support ticket workflow"""
    agent = CustomerSupportAgent()
    # Step 1: Classify
    classification = await agent.classify_ticket("My payment failed but I was charged")
    assert classification["category"] == "billing"
    # Step 2: Retrieve context
    context = await agent.get_customer_context(user_id="test_user_123")
    assert "payment_method" in context
    # Step 3: Generate response
    response = await agent.generate_response(classification, context)
    # Verify response quality (fuzzy matching, not exact)
    assert "refund" in response.lower() or "charge" in response.lower()
    assert len(response) > 100  # Substantial response
When to run: On pull requests, before merging to main.
Level 3: E2E (End-to-End) Tests
What: Test complete user workflow from input to final output, including all agent steps.
Goal: Verify agent works as users experience it.
Example: Multi-Step Research Agent
@pytest.mark.e2e
@pytest.mark.asyncio
@pytest.mark.slow  # Mark slow tests
async def test_research_agent_full_workflow():
    """
    Test research agent:
    1. Receives research query
    2. Searches web
    3. Analyzes sources
    4. Generates report
    """
    agent = ResearchAgent()
    # Execute full workflow (takes minutes)
    report = await agent.research("What are the latest developments in quantum computing?")
    # Verify report structure
    assert "## Summary" in report
    assert "## Key Findings" in report
    assert "## Sources" in report
    # Verify quality with LLM-as-judge (judge returns a 1-10 score)
    quality_score = await evaluate_report_quality(report)
    assert quality_score >= 7  # At least 7/10 quality
LLM-as-Judge for E2E Tests
Problem: Can't use exact string matching (LLM outputs vary).
Solution: Use another LLM to evaluate output quality.
async def evaluate_report_quality(report: str) -> float:
    """Use GPT-4 to score report quality 1-10"""
    judge_prompt = f"""
    Evaluate this research report on a scale of 1-10.
    Criteria:
    - Accuracy: Information appears correct
    - Completeness: Covers topic thoroughly
    - Clarity: Well-organized and readable
    - Sources: Includes credible citations

    Report:
    {report}

    Respond with just a number 1-10.
    """
    score = await call_llm(judge_prompt, model="gpt-4-turbo", temperature=0)
    return float(score.strip())
Reliability: LLM-as-judge agrees with humans 85-90% of the time.
Cost: Doubles test cost (2 LLM calls instead of 1).
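Judge scores also drift from run to run, even at temperature 0. A common mitigation is to call the judge a few times and average the scores, trading a little extra cost for a more stable signal. A sketch building on evaluate_report_quality above (the repeat count is an arbitrary choice):
async def stable_quality_score(report: str, runs: int = 3) -> float:
    """Average several judge calls to smooth out run-to-run variance."""
    scores = [await evaluate_report_quality(report) for _ in range(runs)]
    return sum(scores) / len(scores)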
CI/CD Integration
Continuous Integration pipeline:
# .github/workflows/test.yml
name: Agent Tests

on:
  push:
  pull_request:
  schedule:
    - cron: "0 2 * * *"  # nightly E2E run

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run unit tests
        run: pytest tests/unit -v
    # Run on every commit
    # Time: 10-30 seconds

  integration-tests:
    runs-on: ubuntu-latest
    needs: unit-tests  # Only if unit tests pass
    steps:
      - uses: actions/checkout@v4
      - name: Setup staging environment
        run: ./scripts/setup-staging.sh
      - name: Run integration tests
        run: pytest tests/integration -v
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    # Run on pull requests
    # Time: 5-10 minutes

  e2e-tests:
    runs-on: ubuntu-latest
    if: github.event_name == 'schedule' || github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4
      - name: Run E2E tests
        run: pytest tests/e2e -v
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    # Run nightly (cron schedule) or on main
    # Time: 30-60 minutes
Test execution frequency:
- Unit: Every commit (seconds)
- Integration: Every PR (minutes)
- E2E: Nightly or pre-release (hours)
Test Data Management
Golden Datasets
Create fixed test datasets for consistent evaluation.
# tests/data/golden_dataset.json
[
  {
    "id": "test_001",
    "input": "Analyze Q3 revenue trends",
    "expected_contains": ["revenue", "Q3", "trend"],
    "expected_min_length": 200,
    "expected_quality_score": 7
  },
  {
    "id": "test_002",
    "input": "Summarize customer feedback from last month",
    "expected_contains": ["customer", "feedback", "summary"],
    "expected_min_length": 150,
    "expected_quality_score": 7
  }
]
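The load_golden_dataset() helper used below is not shown elsewhere; a minimal version simply reads the JSON file, assuming the path above:
# tests/conftest.py (or a small helpers module)
import json
from pathlib import Path

def load_golden_dataset(path="tests/data/golden_dataset.json"):
    """Load the curated golden test cases for parameterized runs."""
    return json.loads(Path(path).read_text())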
# test_agent_golden_dataset.py
@pytest.mark.parametrize("test_case", load_golden_dataset())
@pytest.mark.asyncio
async def test_agent_on_golden_dataset(test_case):
    """Test agent on curated golden dataset"""
    agent = AnalysisAgent()
    result = await agent.analyze(test_case["input"])
    # Verify expected keywords are present
    for keyword in test_case["expected_contains"]:
        assert keyword.lower() in result.lower()
    # Verify minimum length
    assert len(result) >= test_case["expected_min_length"]
    # Verify quality
    quality = await evaluate_quality(result)
    assert quality >= test_case["expected_quality_score"]
Benefits:
- Consistent benchmarking
- Catch regressions (new version performs worse)
- Track improvement over time
Regression Testing
After each change, re-run golden dataset:
def test_no_regression():
    """Ensure new version performs at least as well as previous version"""
    current_scores = run_golden_dataset(current_agent)
    previous_scores = load_previous_scores("v1.2_scores.json")
    avg_current = sum(current_scores) / len(current_scores)
    avg_previous = sum(previous_scores) / len(previous_scores)
    # Allow 5% degradation tolerance
    assert avg_current >= avg_previous * 0.95, "Performance regression detected"
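run_golden_dataset and load_previous_scores are project helpers; the essential idea is to persist each release's scores so the next run has a baseline to compare against. A sketch under that assumption (the file layout and names are illustrative):
import json
from pathlib import Path

def save_scores(scores, version, directory="tests/baselines"):
    """Persist per-case scores so future runs can compare against this release."""
    Path(directory).mkdir(parents=True, exist_ok=True)
    Path(directory, f"{version}_scores.json").write_text(json.dumps(scores))

def load_previous_scores(filename, directory="tests/baselines"):
    """Load the baseline scores written by an earlier release."""
    return json.loads(Path(directory, filename).read_text())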
Testing Best Practices
1. Test pyramid ratio: 70% unit, 25% integration, 5% E2E.
2. Mock by default: Mock LLM calls in unit/integration tests, use real LLM only in E2E.
3. Test failure modes:
def test_agent_handles_api_timeout():
    """Verify agent handles API timeout gracefully"""
    agent = Agent()
    with patch("agent.call_llm", side_effect=TimeoutError):
        result = agent.execute("test")
        # Should return an error message, not crash
        assert "error" in result.lower()
        assert "timeout" in result.lower()
4. Test edge cases (see the sketch after this list):
- Empty input
- Very long input (exceeds context window)
- Invalid JSON responses from LLM
- External API down
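For the last case, a unit test can feed the agent an unparseable LLM reply and assert that it degrades gracefully. This sketch assumes classify_ticket is extended to catch json.JSONDecodeError and fall back to an "unknown" category (the version shown earlier would need that change):
@pytest.mark.asyncio
async def test_classify_ticket_invalid_llm_json():
    """Agent should not crash when the LLM returns something that isn't JSON."""
    agent = CustomerSupportAgent()
    with patch("agent.call_llm") as mock_llm:
        mock_llm.return_value = "Sorry, I can't help with that."  # not valid JSON
        result = await agent.classify_ticket("Hello?")
        # Assumes classify_ticket catches json.JSONDecodeError and falls back
        assert result["category"] == "unknown"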
5. Parameterized tests for multiple scenarios:
@pytest.mark.parametrize("ticket,expected_category", [
    ("Payment failed", "billing"),
    ("App crashes on startup", "technical"),
    ("Can't reset password", "account"),
    ("Upgrade to pro plan", "sales"),
])
def test_classify_various_tickets(ticket, expected_category):
    result = classify_ticket(ticket)
    assert result["category"] == expected_category
Measuring Test Coverage
# Run tests with coverage report
pytest --cov=agents --cov-report=html
# View coverage
open htmlcov/index.html
Target coverage:
- Unit tests: >90% code coverage
- Integration tests: >70% workflow coverage
- E2E tests: >50% user journey coverage
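To turn the unit-test target into a hard gate, pytest-cov's fail-under option makes CI fail when coverage drops below the threshold (90 here matches the target above):
# Fail the build if unit-test coverage drops below 90%
pytest tests/unit --cov=agents --cov-fail-under=90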
Frequently Asked Questions
How do I test non-deterministic LLM outputs?
Approaches:
- Fuzzy matching: Check keywords present, not exact string
- LLM-as-judge: Use another LLM to evaluate quality
- Seed/temperature=0: Push outputs toward determinism (reduces variance, but isn't guaranteed and not every API supports it)
- Statistical testing: Run the same case 10 times and require >80% to pass (sketch below)
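The statistical approach can be written directly as a test: run the same case several times and assert a pass-rate threshold instead of requiring every run to succeed. A minimal sketch (10 runs and an 80% threshold mirror the bullet above; it uses real LLM calls, so it is marked accordingly):
@pytest.mark.integration  # real LLM calls, keep it out of the unit suite
@pytest.mark.asyncio
async def test_classification_pass_rate():
    """Accept non-determinism: require at least 80% of runs to classify correctly."""
    agent = CustomerSupportAgent()
    runs = 10
    passes = 0
    for _ in range(runs):
        result = await agent.classify_ticket("My payment failed")
        if result.get("category") == "billing":
            passes += 1
    assert passes / runs >= 0.8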
Should I test prompts?
Yes. Prompt changes can break agents.
def test_prompt_produces_valid_json():
    """Verify prompt reliably produces parseable JSON"""
    for _ in range(10):  # Run 10 times (account for variability)
        response = call_llm(classification_prompt, temperature=0)
        # Should be valid JSON
        try:
            parsed = json.loads(response)
            assert "category" in parsed
        except json.JSONDecodeError:
            pytest.fail("Prompt produced invalid JSON")
How often should I run E2E tests?
Recommendation:
- Nightly (automated)
- Before every release (manual trigger)
- After major changes (on-demand)
Don't run on every commit (too slow/expensive).
Bottom line: Comprehensive testing requires unit (fast, many), integration (medium, some), E2E (slow, few) tests. Mock LLM calls in 95% of tests for speed. Use LLM-as-judge for E2E quality evaluation. Run unit tests on every commit, integration on PRs, E2E nightly. Teams with systematic testing deploy 2.5× more frequently with 60% fewer bugs.
Next: Read our Agent Evaluation guide for performance measurement strategies.