Academy · 10 May 2024 · 10 min read

Agent Testing Strategies: Unit, Integration, and End-to-End Testing for AI Systems

Comprehensive testing strategies for AI agents: unit tests for components, integration tests for workflows, and E2E tests for full systems, with mocking patterns and CI/CD integration.

MB
Max Beech
Head of Content

TL;DR

  • Three test levels: Unit (individual components), Integration (component interactions), E2E (full agent workflows).
  • Unit tests: Fast, deterministic, test logic without LLM calls. Mock LLM responses.
  • Integration tests: Test agent + external services (APIs, databases). Use staging environment.
  • E2E tests: Test complete user workflow. Slow, expensive, but catches real issues.
  • Mocking: Replace LLM calls with fixed responses for fast, cheap tests. 95% of tests should use mocks.
  • LLM-as-judge: For E2E tests, use another LLM to evaluate output quality (can't use exact string matching).
  • CI/CD: Run unit tests on every commit (seconds), integration tests on PR (minutes), E2E tests nightly (hours).
  • Real data: Teams with comprehensive testing deploy 2.5× more frequently with 60% fewer production bugs.

Agent Testing Strategies

Without testing:

Engineer: "I updated the prompt"
Deploy to production
User: "Agent is broken!"
Engineer: "Oops, didn't test that"

With testing:

Engineer: "I updated the prompt"
Run tests → 12/15 tests fail
Engineer: "Found the issue, fixing..."
Run tests → 15/15 pass
Deploy with confidence

Testing Pyramid for AI Agents

Traditional software testing pyramid:

      E2E (few)
   Integration (some)
  Unit (many)

AI agent testing pyramid (same structure):

      E2E: Full workflow (10 tests, run nightly)
   Integration: Agent + services (50 tests, run on PR)
  Unit: Components (200 tests, run every commit)

Why pyramid shape: Unit tests fast/cheap (run constantly), E2E tests slow/expensive (run sparingly).
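
The examples below separate the levels with pytest markers (integration, e2e, slow). A minimal conftest.py sketch that registers those markers so each level can be selected with pytest's -m flag and pytest does not warn about unknown markers:

# conftest.py -- register the marker names used throughout this guide
# (they can also be declared in pytest.ini instead).
def pytest_configure(config):
    config.addinivalue_line("markers", "integration: tests that hit real services (run on PRs)")
    config.addinivalue_line("markers", "e2e: full-workflow tests (run nightly)")
    config.addinivalue_line("markers", "slow: tests that take more than a few seconds")

With the markers registered, a command like pytest -m "not integration and not e2e" runs only the unit level.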

Level 1: Unit Tests

What: Test individual components in isolation (parsers, validators, formatters, tool functions).

Goal: Fast, deterministic, no external dependencies.

Example: Test Tool Function

# tool: search_database.py
# `db` is the module-level database client, assumed to be initialised elsewhere.
def search_database(query: str, limit: int = 10) -> list:
    """Search database for records matching query"""
    if not query or len(query) < 3:
        raise ValueError("Query must be at least 3 characters")

    if limit < 1 or limit > 100:
        raise ValueError("Limit must be between 1 and 100")

    # Execute database search with a parameterised query (avoids SQL injection)
    results = db.execute(
        "SELECT * FROM records WHERE content LIKE ? LIMIT ?",
        (f"%{query}%", limit),
    )
    return results

# test_search_database.py
import pytest
from unittest.mock import patch

from tools.search_database import search_database  # assumes the tool lives in tools/

def test_search_database_valid_query():
    """Test search with valid query returns results"""
    # Patch the db client where search_database looks it up
    with patch('tools.search_database.db') as mock_db:
        mock_db.execute.return_value = [{"id": 1, "content": "test result"}]

        results = search_database("test", limit=10)

        assert len(results) == 1
        assert results[0]["content"] == "test result"
        mock_db.execute.assert_called_once()

def test_search_database_short_query():
    """Test search rejects query < 3 chars"""
    with pytest.raises(ValueError, match="at least 3 characters"):
        search_database("ab")

def test_search_database_invalid_limit():
    """Test search rejects invalid limit"""
    with pytest.raises(ValueError, match="between 1 and 100"):
        search_database("test", limit=200)

Run time: <1 second for 100 unit tests.

Mock LLM Responses

Problem: LLM calls slow, expensive, non-deterministic.

Solution: Mock responses in unit tests.

# agent.py
import json

# call_llm: async helper that wraps the LLM API, assumed to be imported into this module

class CustomerSupportAgent:
    async def classify_ticket(self, ticket_text):
        """Classify support ticket into category"""
        response = await call_llm(
            f"Classify this ticket: {ticket_text}\nCategories: billing, technical, account",
            model="gpt-3.5-turbo"
        )
        return json.loads(response)

# test_agent.py
import pytest
from unittest.mock import AsyncMock, patch

from agent import CustomerSupportAgent

@pytest.mark.asyncio
async def test_classify_ticket_billing():
    """Test classification of billing ticket"""
    agent = CustomerSupportAgent()

    # Mock the async LLM call (AsyncMock so the agent can await it)
    with patch('agent.call_llm', new_callable=AsyncMock) as mock_llm:
        mock_llm.return_value = '{"category": "billing", "confidence": 0.95}'

        result = await agent.classify_ticket("My payment failed")

        assert result["category"] == "billing"
        assert result["confidence"] == 0.95
        mock_llm.assert_called_once()

@pytest.mark.asyncio
async def test_classify_ticket_technical():
    """Test classification of technical ticket"""
    agent = CustomerSupportAgent()

    with patch('agent.call_llm', new_callable=AsyncMock) as mock_llm:
        mock_llm.return_value = '{"category": "technical", "confidence": 0.89}'

        result = await agent.classify_ticket("App is crashing")

        assert result["category"] == "technical"

Benefits:

  • Fast (no LLM API call)
  • Free (no API costs)
  • Deterministic (same input = same output)

Limitations:

  • Doesn't test actual LLM behavior
  • Might pass even if prompt is broken

Rule: 95% of tests should mock LLMs, 5% use real LLM calls (integration/E2E tests).
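
One way to enforce that split is an autouse fixture that mocks the LLM by default and only steps aside for tests explicitly marked integration or e2e. A sketch, assuming call_llm is importable as agent.call_llm as in the examples above:

# conftest.py -- mock call_llm by default; real LLM calls only for tests
# marked integration or e2e. Assumes call_llm lives in agent.py.
import pytest
from unittest.mock import AsyncMock, patch

@pytest.fixture(autouse=True)
def mock_llm_by_default(request):
    if request.node.get_closest_marker("integration") or request.node.get_closest_marker("e2e"):
        yield None  # marked test: let it hit the real LLM
        return
    with patch("agent.call_llm", new_callable=AsyncMock) as mock_llm:
        mock_llm.return_value = '{"category": "other", "confidence": 0.5}'
        yield mock_llm

Individual unit tests can still override the default response by patching agent.call_llm themselves, as shown earlier.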

Level 2: Integration Tests

What: Test agent with real external services (databases, APIs, LLMs) in staging environment.

Goal: Catch integration issues before production.

Example: Test Agent + Database + LLM

@pytest.mark.integration  # Mark as integration test
@pytest.mark.asyncio
async def test_agent_end_to_end_search():
    """Test agent can search database and format results"""
    # Setup: Staging database with test data
    test_db = setup_staging_db()
    test_db.insert("test_record", {"id": 1, "content": "Integration test data"})
    
    # Create agent connected to staging DB
    agent = SearchAgent(database=test_db)
    
    # Execute agent with REAL LLM call
    result = await agent.execute("Find records about integration test")
    
    # Verify
    assert "integration test data" in result.lower()
    assert len(result) > 50  # Agent formatted response (not just raw data)
    
    # Cleanup
    test_db.cleanup()

Run time: 5-10 seconds per test (LLM call adds latency).

Cost: $0.001-0.01 per test (LLM API calls).
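
The setup_staging_db helper above is assumed rather than shown; a minimal sketch that stands in for a real staging database with in-memory SQLite:

# staging_db.py -- sketch of the assumed setup_staging_db helper, using
# in-memory SQLite as a stand-in for the real staging database.
import sqlite3

class StagingDB:
    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, content TEXT)")

    def insert(self, _table, record):
        self.conn.execute(
            "INSERT INTO records (id, content) VALUES (?, ?)",
            (record["id"], record["content"]),
        )
        self.conn.commit()

    def execute(self, sql, params=()):
        return self.conn.execute(sql, params).fetchall()

    def cleanup(self):
        self.conn.close()

def setup_staging_db() -> StagingDB:
    return StagingDB()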

Test Agent Workflow

@pytest.mark.integration
@pytest.mark.asyncio
async def test_customer_support_workflow():
    """Test full support ticket workflow"""
    agent = CustomerSupportAgent()
    
    # Step 1: Classify
    classification = await agent.classify_ticket("My payment failed but I was charged")
    assert classification["category"] == "billing"
    
    # Step 2: Retrieve context
    context = await agent.get_customer_context(user_id="test_user_123")
    assert "payment_method" in context
    
    # Step 3: Generate response
    response = await agent.generate_response(classification, context)
    
    # Verify response quality (fuzzy matching, not exact)
    assert "refund" in response.lower() or "charge" in response.lower()
    assert len(response) > 100  # Substantial response

When to run: On pull requests, before merging to main.

Level 3: E2E (End-to-End) Tests

What: Test complete user workflow from input to final output, including all agent steps.

Goal: Verify agent works as users experience it.

Example: Multi-Step Research Agent

@pytest.mark.e2e
@pytest.mark.asyncio
@pytest.mark.slow  # Mark slow tests
async def test_research_agent_full_workflow():
    """
    Test research agent:
    1. Receives research query
    2. Searches web
    3. Analyzes sources
    4. Generates report
    """
    agent = ResearchAgent()
    
    # Execute full workflow (takes minutes)
    report = await agent.research("What are the latest developments in quantum computing?")
    
    # Verify report structure
    assert "## Summary" in report
    assert "## Key Findings" in report
    assert "## Sources" in report
    
    # Verify quality with LLM-as-judge
    quality_score = await evaluate_report_quality(report)
    assert quality_score >= 7  # Judge scores 1-10, so require at least 7/10

LLM-as-Judge for E2E Tests

Problem: Can't use exact string matching (LLM outputs vary).

Solution: Use another LLM to evaluate output quality.

async def evaluate_report_quality(report: str) -> float:
    """Use GPT-4 to score report quality 1-10"""
    judge_prompt = f"""
    Evaluate this research report on a scale of 1-10.
    
    Criteria:
    - Accuracy: Information appears correct
    - Completeness: Covers topic thoroughly
    - Clarity: Well-organized and readable
    - Sources: Includes credible citations
    
    Report:
    {report}
    
    Respond with just a number 1-10.
    """
    
    score = await call_llm(judge_prompt, model="gpt-4-turbo", temperature=0)
    return float(score.strip())

Reliability: LLM-as-judge agrees with humans 85-90% of the time.

Cost: Doubles test cost (2 LLM calls instead of 1).

CI/CD Integration

Continuous Integration pipeline:

# .github/workflows/test.yml
name: Agent Tests

on:
  push:
  pull_request:
  schedule:
    - cron: "0 2 * * *"  # nightly E2E run

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Install dependencies
        run: pip install -r requirements.txt  # assumes requirements.txt lists pytest, pytest-asyncio, etc.
      - name: Run unit tests
        run: pytest tests/unit -v
    # Run on every commit
    # Time: 10-30 seconds

  integration-tests:
    runs-on: ubuntu-latest
    needs: unit-tests  # Only if unit tests pass
    steps:
      - uses: actions/checkout@v2
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Setup staging environment
        run: ./scripts/setup-staging.sh
      - name: Run integration tests
        run: pytest tests/integration -v
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    # Run on pull requests
    # Time: 5-10 minutes

  e2e-tests:
    runs-on: ubuntu-latest
    if: github.event_name == 'schedule'  # Only on the nightly cron run
    steps:
      - uses: actions/checkout@v2
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run E2E tests
        run: pytest tests/e2e -v
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    # Run nightly (cron schedule)
    # Time: 30-60 minutes

Test execution frequency:

  • Unit: Every commit (seconds)
  • Integration: Every PR (minutes)
  • E2E: Nightly or pre-release (hours)
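
To keep ordinary local runs fast, a collection hook can skip the slow level unless it is explicitly requested. A sketch using the e2e marker from above and a hypothetical RUN_E2E environment variable:

# conftest.py -- sketch: skip e2e-marked tests unless RUN_E2E=1, so a plain
# `pytest` run stays fast. RUN_E2E is a made-up variable name.
import os
import pytest

def pytest_collection_modifyitems(config, items):
    if os.environ.get("RUN_E2E") == "1":
        return
    skip_e2e = pytest.mark.skip(reason="E2E tests only run when RUN_E2E=1")
    for item in items:
        if item.get_closest_marker("e2e"):
            item.add_marker(skip_e2e)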

Test Data Management

Golden Datasets

Create fixed test datasets for consistent evaluation.

# tests/data/golden_dataset.json
[
  {
    "id": "test_001",
    "input": "Analyze Q3 revenue trends",
    "expected_contains": ["revenue", "Q3", "trend"],
    "expected_min_length": 200,
    "expected_quality_score": 7
  },
  {
    "id": "test_002",
    "input": "Summarize customer feedback from last month",
    "expected_contains": ["customer", "feedback", "summary"],
    "expected_min_length": 150,
    "expected_quality_score": 7
  }
]

# test_agent_golden_dataset.py
@pytest.mark.parametrize("test_case", load_golden_dataset())
@pytest.mark.asyncio
async def test_agent_on_golden_dataset(test_case):
    """Test agent on curated golden dataset"""
    agent = AnalysisAgent()
    
    result = await agent.analyze(test_case["input"])
    
    # Verify expected keywords present
    for keyword in test_case["expected_contains"]:
        assert keyword.lower() in result.lower()
    
    # Verify minimum length
    assert len(result) >= test_case["expected_min_length"]
    
    # Verify quality
    quality = await evaluate_quality(result)
    assert quality >= test_case["expected_quality_score"]
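
The load_golden_dataset helper used in the parametrize decorator is assumed; a minimal sketch:

# golden_dataset.py -- sketch of the assumed load_golden_dataset helper.
import json
from pathlib import Path

def load_golden_dataset(path="tests/data/golden_dataset.json"):
    """Load the curated test cases so they can feed pytest.mark.parametrize."""
    return json.loads(Path(path).read_text())

Passing ids=[case["id"] for case in load_golden_dataset()] to parametrize gives each case a readable name in test reports.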

Benefits:

  • Consistent benchmarking
  • Catch regressions (new version performs worse)
  • Track improvement over time

Regression Testing

After each change, re-run golden dataset:

def test_no_regression():
    """Ensure new version performs at least as well as previous version"""
    current_scores = run_golden_dataset(current_agent)
    previous_scores = load_previous_scores("v1.2_scores.json")
    
    avg_current = sum(current_scores) / len(current_scores)
    avg_previous = sum(previous_scores) / len(previous_scores)
    
    # Allow 5% degradation tolerance
    assert avg_current >= avg_previous * 0.95, "Performance regression detected"

Testing Best Practices

1. Test pyramid ratio: 70% unit, 25% integration, 5% E2E.

2. Mock by default: Mock LLM calls in unit/integration tests, use real LLM only in E2E.

3. Test failure modes:

def test_agent_handles_api_timeout():
    """Verify agent handles API timeout gracefully"""
    agent = Agent()
    
    with patch('agent.call_llm', side_effect=TimeoutError):
        result = agent.execute("test")
        
        # Should return error message, not crash
        assert "error" in result.lower()
        assert "timeout" in result.lower()

4. Test edge cases (see the sketch after this list):

  • Empty input
  • Very long input (exceeds context window)
  • Invalid JSON responses from LLM
  • External API down
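
A sketch of an edge-case test for malformed JSON coming back from the LLM, reusing the CustomerSupportAgent example from earlier:

# Sketch: the agent as written raises on non-JSON output; pinning that
# behaviour down catches silent changes to the error handling.
import json

import pytest
from unittest.mock import AsyncMock, patch

from agent import CustomerSupportAgent

@pytest.mark.asyncio
async def test_classify_ticket_invalid_json():
    agent = CustomerSupportAgent()

    with patch("agent.call_llm", new_callable=AsyncMock) as mock_llm:
        mock_llm.return_value = "Sorry, I cannot help with that."  # not JSON

        with pytest.raises(json.JSONDecodeError):
            await agent.classify_ticket("My payment failed")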

5. Parameterized tests for multiple scenarios:

@pytest.mark.parametrize("ticket,expected_category", [
    ("Payment failed", "billing"),
    ("App crashes on startup", "technical"),
    ("Can't reset password", "account"),
    ("Upgrade to pro plan", "sales")
])
def test_classify_various_tickets(ticket, expected_category):
    result = classify_ticket(ticket)
    assert result["category"] == expected_category

Measuring Test Coverage

# Run tests with coverage report
pytest --cov=agents --cov-report=html

# View coverage
open htmlcov/index.html

Target coverage:

  • Unit tests: >90% code coverage
  • Integration tests: >70% workflow coverage
  • E2E tests: >50% user journey coverage

Frequently Asked Questions

How do I test non-deterministic LLM outputs?

Approaches:

  1. Fuzzy matching: Check keywords present, not exact string
  2. LLM-as-judge: Use another LLM to evaluate quality
  3. Seed/temperature=0: Force deterministic outputs (not always available)
  4. Statistical testing: Run 10 times, verify >80% pass threshold (see the sketch below)
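
A sketch of the statistical approach: wrap the check in a repeated run and assert a pass rate (10 runs and an 80% threshold, per the list above). Each run is a real LLM call, so reserve this for a handful of critical behaviours:

# Sketch: statistical testing for non-deterministic outputs.
import asyncio

import pytest

async def pass_rate(check, runs=10):
    """Run an async boolean check `runs` times; return the fraction that passed."""
    results = await asyncio.gather(*(check() for _ in range(runs)))
    return sum(bool(r) for r in results) / runs

@pytest.mark.integration  # real LLM calls, so treat it as an integration-level test
@pytest.mark.asyncio
async def test_analysis_mentions_revenue():
    agent = AnalysisAgent()  # the agent from the golden-dataset example

    async def check():
        result = await agent.analyze("Analyze Q3 revenue trends")
        return "revenue" in result.lower()

    assert await pass_rate(check, runs=10) >= 0.8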

Should I test prompts?

Yes. Prompt changes can break agents.

def test_prompt_produces_valid_json():
    """Verify prompt reliably produces parseable JSON"""
    for _ in range(10):  # Run 10 times (account for variability)
        response = call_llm(classification_prompt, temperature=0)
        
        # Should be valid JSON
        try:
            parsed = json.loads(response)
            assert "category" in parsed
        except json.JSONDecodeError:
            pytest.fail("Prompt produced invalid JSON")

How often should I run E2E tests?

Recommendation:

  • Nightly (automated)
  • Before every release (manual trigger)
  • After major changes (on-demand)

Don't run on every commit (too slow/expensive).


Bottom line: Comprehensive testing requires unit (fast, many), integration (medium, some), E2E (slow, few) tests. Mock LLM calls in 95% of tests for speed. Use LLM-as-judge for E2E quality evaluation. Run unit tests on every commit, integration on PRs, E2E nightly. Teams with systematic testing deploy 2.5× more frequently with 60% fewer bugs.

Next: Read our Agent Evaluation guide for performance measurement strategies.