Academy · 18 Apr 2024 · 10 min read

Agent Observability: Monitoring, Tracing, and Debugging Production AI Systems

Complete observability guide for production AI agents: distributed tracing, metrics dashboards, log aggregation, debugging techniques, and monitoring tools like LangSmith, Helicone, and Sentry.

Max Beech
Head of Content

TL;DR

  • Observability: Ability to understand what's happening inside your agent from external outputs (logs, metrics, traces).
  • Three pillars: Metrics (quantitative), Logs (detailed events), Traces (request flow through system).
  • Metrics: Track success rate, latency, cost, error rate. Alert if success <95%, latency p95 >10s.
  • Logs: Record every LLM call (prompt, response, tokens, cost, timestamp). Use structured logging (JSON).
  • Traces: Track multi-step agent workflows (which steps ran, how long each took, where it failed).
  • Tools: LangSmith (agent-specific), Datadog/Grafana (general), Sentry (error tracking), Helicone (LLM analytics).
  • In practice: Teams with comprehensive observability resolve issues roughly 4× faster, reducing MTTR from hours to minutes.

Agent Observability

Production incident without observability:

User: "Agent is broken!"
Engineer: "What's broken?"
User: "It just doesn't work."
Engineer: [Spends 2 hours debugging, can't reproduce]

With observability:

User: "Agent is broken!"
Engineer: [Checks dashboard]
  - Success rate dropped to 72% (normally 95%)
  - Spike in timeouts at 3:42 PM
  - Trace shows: RAG retrieval failing (database connection timeout)
Engineer: [Fixes database connection pool in 10 minutes]

Observability = ability to understand system internals from external data.

The Three Pillars

1. Metrics (Quantitative)

What: Numeric time-series data (counts, rates, durations).

Examples:

  • Requests per second
  • Success rate (%)
  • p50/p95/p99 latency
  • Token usage
  • Cost per request
  • Error rate

Implementation:

from functools import wraps
from prometheus_client import Counter, Histogram, Gauge
import time

# Define metrics
requests_total = Counter('agent_requests_total', 'Total requests', ['agent_name', 'status'])
request_duration = Histogram('agent_request_duration_seconds', 'Request duration')
tokens_used = Counter('agent_tokens_used_total', 'Tokens used', ['model'])
cost_usd = Counter('agent_cost_usd_total', 'Cost in USD')
active_requests = Gauge('agent_active_requests', 'Currently processing requests')

def track_agent_request(agent_name):
    """Decorator factory to track agent metrics"""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            active_requests.inc()
            start_time = time.time()

            try:
                result = await func(*args, **kwargs)

                # Success metrics
                requests_total.labels(agent_name=agent_name, status='success').inc()
                duration = time.time() - start_time
                request_duration.observe(duration)

                # Token/cost tracking
                tokens = result.get('tokens_used', 0)
                tokens_used.labels(model=result.get('model', 'unknown')).inc(tokens)

                cost = calculate_cost(tokens, result.get('model'))  # calculate_cost: your pricing helper
                cost_usd.inc(cost)

                return result

            except Exception:
                # Error metrics
                requests_total.labels(agent_name=agent_name, status='error').inc()
                raise

            finally:
                active_requests.dec()

        return wrapper
    return decorator

# Usage
@track_agent_request("customer_support")
async def handle_support_ticket(ticket):
    # ... agent logic
    pass
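For Prometheus (and therefore Grafana) to see these metrics, the agent process needs to expose a scrape endpoint. A minimal sketch using prometheus_client's built-in HTTP server, assuming your agent runs as a long-lived process and port 8000 is free:

from prometheus_client import start_http_server

# Expose all registered metrics at http://localhost:8000/metrics
# so Prometheus can scrape them on its regular interval.
start_http_server(8000)

# ... start your agent's event loop / web server here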

Visualize in Grafana:

# Success rate over time
sum(rate(agent_requests_total{status="success"}[5m])) 
/ 
sum(rate(agent_requests_total[5m]))

# p95 latency
histogram_quantile(0.95, sum(rate(agent_request_duration_seconds_bucket[5m])) by (le))

2. Logs (Event Details)

What: Detailed records of specific events (LLM calls, decisions, errors).

Structured logging (JSON, not plain text):

import logging
import json
import traceback
from datetime import datetime

class StructuredLogger:
    def __init__(self, name):
        self.logger = logging.getLogger(name)
    
    def log_llm_call(self, model, prompt, response, tokens, cost, latency_ms):
        """Log LLM API call with all details"""
        self.logger.info(json.dumps({
            "event": "llm_call",
            "timestamp": datetime.utcnow().isoformat(),
            "model": model,
            "prompt_preview": prompt[:200],  # First 200 chars
            "prompt_length": len(prompt),
            "response_preview": response[:200],
            "tokens": {
                "prompt": tokens["prompt"],
                "completion": tokens["completion"],
                "total": tokens["total"]
            },
            "cost_usd": cost,
            "latency_ms": latency_ms
        }))
    
    def log_agent_decision(self, decision_point, chosen_action, alternatives, reasoning):
        """Log agent decision-making"""
        self.logger.info(json.dumps({
            "event": "agent_decision",
            "timestamp": datetime.utcnow().isoformat(),
            "decision_point": decision_point,
            "chosen_action": chosen_action,
            "alternatives": alternatives,
            "reasoning": reasoning
        }))
    
    def log_error(self, error, context):
        """Log error with full context"""
        self.logger.error(json.dumps({
            "event": "error",
            "timestamp": datetime.utcnow().isoformat(),
            "error_type": type(error).__name__,
            "error_message": str(error),
            "stack_trace": traceback.format_exc(),
            "context": context
        }))

Usage:

logger = StructuredLogger("customer_support_agent")

# Log LLM call
start_time = time.time()
response = await call_llm(prompt, model="gpt-4-turbo")
latency = (time.time() - start_time) * 1000

logger.log_llm_call(
    model="gpt-4-turbo",
    prompt=prompt,
    response=response["content"],
    tokens=response["usage"],
    cost=calculate_cost(response["usage"], "gpt-4-turbo"),
    latency_ms=latency
)

Query logs (with aggregation tool like Datadog, Splunk):

# Find all failed LLM calls in last hour
event:llm_call status:error timestamp:>now-1h

# Find slow requests (>10s latency)
event:llm_call latency_ms:>10000

# Find expensive requests (>$1 cost)
event:llm_call cost_usd:>1.0
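If you don't have a log aggregator yet, the same structured logs can be filtered with a few lines of Python. A minimal sketch, assuming the logs are written one JSON object per line to logs.json:

import json

def find_slow_calls(path="logs.json", threshold_ms=10_000):
    """Yield llm_call events slower than the threshold."""
    with open(path) as f:
        for line in f:
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip non-JSON lines (plain-text noise)
            if event.get("event") == "llm_call" and event.get("latency_ms", 0) > threshold_ms:
                yield event

for call in find_slow_calls():
    print(call["timestamp"], call["model"], call["latency_ms"])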

3. Traces (Request Flow)

What: Tracks a request through the entire system (which functions were called, in what order, and how long each took).

Implementation with OpenTelemetry:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter  # requires opentelemetry-exporter-jaeger

# Setup tracer
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

# Export to Jaeger (trace visualization)
jaeger_exporter = JaegerExporter(
    agent_host_name="localhost",
    agent_port=6831
)
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)

# Instrument agent
async def customer_support_agent(ticket):
    with tracer.start_as_current_span("customer_support_agent") as span:
        span.set_attribute("ticket_id", ticket["id"])
        span.set_attribute("category", ticket["category"])
        
        # Step 1: Classify
        with tracer.start_as_current_span("classify_ticket") as classify_span:
            classification = await classify(ticket)
            classify_span.set_attribute("classification", classification)
        
        # Step 2: Retrieve context
        with tracer.start_as_current_span("retrieve_context"):
            context = await get_context(ticket["user_id"])
        
        # Step 3: Generate response
        with tracer.start_as_current_span("generate_response") as gen_span:
            response = await generate_response(classification, context)
            gen_span.set_attribute("response_length", len(response))
        
        return response

Visualized trace (in Jaeger UI):

customer_support_agent [2.3s total]
├─ classify_ticket [0.8s]
├─ retrieve_context [0.3s]
└─ generate_response [1.2s]
   ├─ llm_call [1.1s]
   └─ format_output [0.1s]

Debugging: If agent is slow, trace shows exactly which step is the bottleneck.

Production Monitoring Dashboard

Key metrics to track:

1. Success Rate

Target: >95%
Alert if: Drops below 90% for 5 minutes

2. Latency (p50, p95, p99)

Target: p95 <5s
Alert if: p95 >10s for 5 minutes

3. Error Rate

Target: <2%
Alert if: >5% for 5 minutes

4. Cost per Request

Target: <$0.10
Alert if: >$1.00 (possible infinite loop)

5. Token Usage

Track: Prompt vs completion tokens
Alert if: Unusual spike (indicates prompt bloat)
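In Prometheus these thresholds would live in alerting rules; if you prefer to keep the checks in Python, here is a minimal sketch of the same logic (the metrics dict shape and the send_alert channel are assumptions, not part of any particular library):

def check_thresholds(metrics, send_alert):
    """Evaluate the alert thresholds above against a metrics snapshot."""
    if metrics["success_rate"] < 0.90:
        send_alert(f"Success rate dropped to {metrics['success_rate']:.0%}")
    if metrics["p95_latency_s"] > 10:
        send_alert(f"p95 latency is {metrics['p95_latency_s']:.1f}s")
    if metrics["error_rate"] > 0.05:
        send_alert(f"Error rate is {metrics['error_rate']:.0%}")
    if metrics["cost_per_request"] > 1.00:
        send_alert(f"Cost per request is ${metrics['cost_per_request']:.2f} (possible infinite loop)")

# Usage (send_alert could post to Slack, PagerDuty, etc.)
check_thresholds(
    {"success_rate": 0.93, "p95_latency_s": 12.4, "error_rate": 0.03, "cost_per_request": 0.08},
    send_alert=print,
)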

Grafana Dashboard Example:

{
  "dashboard": {
    "title": "Agent Monitoring",
    "panels": [
      {
        "title": "Success Rate (Last Hour)",
        "targets": [{
          "expr": "sum(rate(agent_requests_total{status='success'}[5m])) / sum(rate(agent_requests_total[5m]))"
        }]
      },
      {
        "title": "p95 Latency",
        "targets": [{
          "expr": "histogram_quantile(0.95, agent_request_duration_seconds)"
        }]
      },
      {
        "title": "Cost per Hour",
        "targets": [{
          "expr": "sum(increase(agent_cost_usd_total[1h]))"
        }]
      }
    ]
  }
}

Agent-Specific Observability Tools

LangSmith (by LangChain)

Features:

  • Automatic tracing for LangChain agents
  • Prompt versioning
  • Dataset management for evaluation
  • Debugging UI (see exact prompts, responses)

Setup:

import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your_api_key"

# All LangChain calls now automatically traced
from langchain.agents import create_agent

agent = create_agent(...)
result = agent.invoke("user query")  # Traced automatically

View in LangSmith UI: See full trace, prompts, intermediate steps, costs.
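If your agent isn't built on LangChain, the LangSmith SDK also provides a decorator for plain Python functions (the same environment variables apply). A minimal sketch, reusing the classify/get_context/generate_response helpers from the tracing example above:

from langsmith import traceable

@traceable(run_type="chain", name="customer_support_agent")
async def handle_ticket(ticket: dict) -> str:
    # Each decorated call appears as a run in the LangSmith UI,
    # nested under its parent if called from another traceable function.
    classification = await classify(ticket)
    context = await get_context(ticket["user_id"])
    return await generate_response(classification, context)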

Helicone (LLM Analytics)

Features:

  • LLM call analytics (latency, cost, tokens)
  • Caching layer (reduce duplicate calls)
  • Prompt templates
  • User feedback tracking

Setup:

import openai

# Route OpenAI calls through Helicone's proxy so every call is logged
# (check Helicone's docs for the current base URL and header names)
openai_client = openai.OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": "Bearer your_helicone_api_key"},
)

# All calls automatically logged to Helicone
response = openai_client.chat.completions.create(...)

Dashboard: View cost breakdown, slowest prompts, error rates by model.

Sentry (Error Tracking)

Features:

  • Error aggregation (group similar errors)
  • Stack traces
  • User context (who experienced error)
  • Alerts on new/frequent errors

Setup:

import sentry_sdk

sentry_sdk.init(
    dsn="your_sentry_dsn",
    traces_sample_rate=0.1,  # Sample 10% of traces
    profiles_sample_rate=0.1
)

# Unhandled exceptions are captured automatically; report handled ones explicitly
try:
    result = agent.execute(query)
except Exception as e:
    sentry_sdk.capture_exception(e)

Alert: Get Slack/email when new error pattern detected.
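Errors are much easier to triage when they carry agent context. A short sketch using Sentry's tag and context APIs (the agent and ticket values are illustrative):

import sentry_sdk

def handle_with_context(agent, ticket, query):
    # Attach agent context so any captured error shows what was being processed
    sentry_sdk.set_tag("agent_name", "customer_support")
    sentry_sdk.set_tag("model", "gpt-4-turbo")
    sentry_sdk.set_context("ticket", {"id": ticket["id"], "category": ticket["category"]})

    try:
        return agent.execute(query)
    except Exception as e:
        sentry_sdk.capture_exception(e)
        raise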

Debugging Production Issues

Common debugging workflows:

Issue: Agent success rate dropped to 80%

Steps:

  1. Check metrics dashboard → Identify time when drop started
  2. Query logs for that timeframe → Find common error pattern
  3. Check trace for failed request → See which step failing
  4. Reproduce locally with same inputs → Fix bug
  5. Deploy fix → Monitor metrics for recovery

Example:

# 1. Find time of drop (metrics show 3:42 PM)
# 2. Query logs
grep '"event":"error"' logs.json | grep '2024-04-18T15:4' | jq .

# Output shows: "error": "Database connection timeout"

# 3. Check trace
# Trace shows: retrieve_context step timing out

# 4. Fix: Increase database connection pool
# 5. Deploy, verify success rate recovers

Cost Monitoring and Alerts

Track LLM costs in real-time:

from collections import defaultdict
from datetime import datetime

class CostTracker:
    def __init__(self):
        self.daily_costs = defaultdict(float)
    
    def track_call(self, model, tokens):
        cost = calculate_cost(tokens, model)
        today = datetime.now().date()
        self.daily_costs[today] += cost
        
        # Alert if daily budget exceeded
        if self.daily_costs[today] > 100:  # $100 daily budget
            send_alert(f"Daily LLM cost exceeded: ${self.daily_costs[today]:.2f}")
        
        return cost

Set budgets by agent:

AGENT_BUDGETS = {
    "customer_support": 50,  # $50/day
    "research_agent": 200,   # $200/day
    "code_review": 30        # $30/day
}
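To enforce these budgets, the tracker above can be keyed by agent as well as by day. A minimal sketch building on CostTracker and AGENT_BUDGETS (calculate_cost and send_alert are the same assumed helpers as earlier):

from collections import defaultdict
from datetime import datetime

class PerAgentCostTracker:
    def __init__(self, budgets):
        self.budgets = budgets                  # e.g. AGENT_BUDGETS
        self.daily_costs = defaultdict(float)   # keyed by (date, agent)

    def track_call(self, agent_name, model, tokens):
        cost = calculate_cost(tokens, model)    # assumed pricing helper
        key = (datetime.now().date(), agent_name)
        self.daily_costs[key] += cost

        budget = self.budgets.get(agent_name)
        if budget and self.daily_costs[key] > budget:
            send_alert(
                f"{agent_name} exceeded its ${budget}/day budget: "
                f"${self.daily_costs[key]:.2f}"
            )
        return cost

tracker = PerAgentCostTracker(AGENT_BUDGETS)
tracker.track_call("customer_support", "gpt-4-turbo", tokens={"total": 1200})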

Frequently Asked Questions

How much observability is too much?

Recommendation: Start with metrics + basic logs. Add tracing if you're debugging multi-step workflows. Don't over-instrument (logging costs money and adds latency).

Rule of thumb: Observability overhead should be <5% of total system cost.

Should I log full prompts and responses?

Security consideration: Prompts may contain sensitive data (PII, credentials).

Options:

  1. Log preview only (first 200 chars)
  2. Redact sensitive fields (emails, phone numbers); see the sketch after this list
  3. Encrypt logs (decrypt only when debugging)
  4. Don't log production data (only in staging)
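A minimal redaction sketch for option 2, using simple regex patterns for emails and phone numbers (the patterns are illustrative and won't catch every format); it reuses the logger and variables from the logging usage example above:

import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    """Replace emails and phone numbers with placeholders before logging."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text

# Use it when logging prompts/responses
logger.log_llm_call(
    model="gpt-4-turbo",
    prompt=redact(prompt),
    response=redact(response["content"]),
    tokens=response["usage"],
    cost=calculate_cost(response["usage"], "gpt-4-turbo"),
    latency_ms=latency,
)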

How long should I retain logs?

Recommendation:

  • Metrics: 90 days (cheap, small)
  • Logs: 30 days (expensive, large)
  • Traces: 7 days (very expensive)

Adjust based on compliance requirements (HIPAA, GDPR may require longer).

What metrics trigger alerts?

Critical (page on-call):

  • Success rate <85%
  • p95 latency >30s
  • Error rate >10%

Warning (Slack/email):

  • Success rate <95%
  • p95 latency >10s
  • Error rate >5%
  • Daily cost >$150 (if budget is $100)

Bottom line: Observability is essential for production AI agents. Track metrics (success rate, latency, cost), logs (LLM calls, errors), and traces (request flow). Use agent-specific tools (LangSmith, Helicone) for LLM analytics and general tools (Datadog, Sentry) for infrastructure. Teams with comprehensive observability resolve issues 4× faster, reducing MTTR from hours to minutes.

Next: Read our Error Handling guide for reliability patterns.