TL;DR
- Observability: Ability to understand what's happening inside your agent from external outputs (logs, metrics, traces).
- Three pillars: Metrics (quantitative), Logs (detailed events), Traces (request flow through system).
- Metrics: Track success rate, latency, cost, error rate. Alert if success <95%, latency p95 >10s.
- Logs: Record every LLM call (prompt, response, tokens, cost, timestamp). Use structured logging (JSON).
- Traces: Track multi-step agent workflows (which steps ran, how long each took, where it failed).
- Tools: LangSmith (agent-specific), Datadog/Grafana (general), Sentry (error tracking), Helicone (LLM analytics).
- Impact: Teams with comprehensive observability resolve issues dramatically faster, cutting MTTR from hours to minutes.
Agent Observability
Production incident without observability:
User: "Agent is broken!"
Engineer: "What's broken?"
User: "It just doesn't work."
Engineer: [Spends 2 hours debugging, can't reproduce]
With observability:
User: "Agent is broken!"
Engineer: [Checks dashboard]
- Success rate dropped to 72% (normally 95%)
- Spike in timeouts at 3:42 PM
- Trace shows: RAG retrieval failing (database connection timeout)
Engineer: [Fixes database connection pool in 10 minutes]
Observability = ability to understand system internals from external data.
The Three Pillars
1. Metrics (Quantitative)
What: Numeric time-series data (counts, rates, durations).
Examples:
- Requests per second
- Success rate (%)
- p50/p95/p99 latency
- Token usage
- Cost per request
- Error rate
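Percentile latencies matter because averages hide the slow tail. A quick sketch of what p50/p95/p99 mean over a batch of recorded durations (the `latencies` values are invented for illustration):

```python
# Illustrative only: nearest-rank percentiles over recorded request durations.
def percentile(values, p):
    """Return the smallest value >= p% of the sample (nearest-rank method)."""
    ordered = sorted(values)
    rank = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[rank]

latencies = [0.4, 0.5, 0.6, 0.7, 0.9, 1.1, 1.4, 2.0, 4.5, 12.0]  # seconds
p50 = percentile(latencies, 50)  # 0.9s: typical user experience
p95 = percentile(latencies, 95)  # 12.0s: the slow tail
# The mean here is ~2.4s, which describes neither the typical request
# nor the worst ones -- that's why dashboards track percentiles instead.
```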
Implementation:
from prometheus_client import Counter, Histogram, Gauge
import time
# Define metrics
requests_total = Counter('agent_requests_total', 'Total requests', ['agent_name', 'status'])
request_duration = Histogram('agent_request_duration_seconds', 'Request duration')
tokens_used = Counter('agent_tokens_used_total', 'Tokens used', ['model'])
cost_usd = Counter('agent_cost_usd_total', 'Cost in USD')
active_requests = Gauge('agent_active_requests', 'Currently processing requests')
from functools import wraps

def track_agent_request(agent_name):
    """Decorator factory: track metrics for an agent function"""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            active_requests.inc()
            start_time = time.time()
            try:
                result = await func(*args, **kwargs)
                # Success metrics
                requests_total.labels(agent_name=agent_name, status='success').inc()
                duration = time.time() - start_time
                request_duration.observe(duration)
                # Token/cost tracking
                tokens = result.get('tokens_used', 0)
                tokens_used.labels(model=result.get('model', 'unknown')).inc(tokens)
                cost = calculate_cost(tokens, result.get('model'))
                cost_usd.inc(cost)
                return result
            except Exception:
                # Error metrics
                requests_total.labels(agent_name=agent_name, status='error').inc()
                raise
            finally:
                active_requests.dec()
        return wrapper
    return decorator
# Usage
@track_agent_request("customer_support")
async def handle_support_ticket(ticket):
# ... agent logic
pass
Visualize in Grafana:
# Success rate over time
sum(rate(agent_requests_total{status="success"}[5m]))
/
sum(rate(agent_requests_total[5m]))
# p95 latency
histogram_quantile(0.95, rate(agent_request_duration_seconds_bucket[5m]))
2. Logs (Event Details)
What: Detailed records of specific events (LLM calls, decisions, errors).
Structured logging (JSON, not plain text):
import logging
import json
import traceback
from datetime import datetime
class StructuredLogger:
def __init__(self, name):
self.logger = logging.getLogger(name)
def log_llm_call(self, model, prompt, response, tokens, cost, latency_ms):
"""Log LLM API call with all details"""
self.logger.info(json.dumps({
"event": "llm_call",
"timestamp": datetime.utcnow().isoformat(),
"model": model,
"prompt_preview": prompt[:200], # First 200 chars
"prompt_length": len(prompt),
"response_preview": response[:200],
"tokens": {
"prompt": tokens["prompt"],
"completion": tokens["completion"],
"total": tokens["total"]
},
"cost_usd": cost,
"latency_ms": latency_ms
}))
def log_agent_decision(self, decision_point, chosen_action, alternatives, reasoning):
"""Log agent decision-making"""
self.logger.info(json.dumps({
"event": "agent_decision",
"timestamp": datetime.utcnow().isoformat(),
"decision_point": decision_point,
"chosen_action": chosen_action,
"alternatives": alternatives,
"reasoning": reasoning
}))
def log_error(self, error, context):
"""Log error with full context"""
self.logger.error(json.dumps({
"event": "error",
"timestamp": datetime.utcnow().isoformat(),
"error_type": type(error).__name__,
"error_message": str(error),
"stack_trace": traceback.format_exc(),
"context": context
}))
Usage:
logger = StructuredLogger("customer_support_agent")
# Log LLM call
start_time = time.time()
response = await call_llm(prompt, model="gpt-4-turbo")
latency = (time.time() - start_time) * 1000
logger.log_llm_call(
model="gpt-4-turbo",
prompt=prompt,
response=response["content"],
tokens=response["usage"],
cost=calculate_cost(response["usage"], "gpt-4-turbo"),
latency_ms=latency
)
Query logs (with an aggregation tool like Datadog or Splunk):
# Find all errors in the last hour
event:error timestamp:>now-1h
# Find slow requests (>10s latency)
event:llm_call latency_ms:>10000
# Find expensive requests (>$1 cost)
event:llm_call cost_usd:>1.0
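The same filters can be run locally against a JSON-lines log file with a few lines of Python. The field names match the StructuredLogger above; the sample records are invented for illustration:

```python
import json

def filter_logs(lines, event=None, min_latency_ms=None, min_cost_usd=None):
    """Yield parsed log records matching all given filters."""
    for line in lines:
        record = json.loads(line)
        if event and record.get("event") != event:
            continue
        if min_latency_ms and record.get("latency_ms", 0) <= min_latency_ms:
            continue
        if min_cost_usd and record.get("cost_usd", 0) <= min_cost_usd:
            continue
        yield record

# Example: find slow LLM calls (>10s) in a JSON-lines log
sample = [
    '{"event": "llm_call", "latency_ms": 800, "cost_usd": 0.02}',
    '{"event": "llm_call", "latency_ms": 14000, "cost_usd": 1.40}',
]
slow = list(filter_logs(sample, event="llm_call", min_latency_ms=10000))
# slow contains only the 14000 ms record
```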
3. Traces (Request Flow)
What: Tracks a request through the entire system (which functions were called, in what order, and how long each took).
Implementation with OpenTelemetry:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
# Setup tracer
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)
# Export to Jaeger (trace visualization)
jaeger_exporter = JaegerExporter(
agent_host_name="localhost",
agent_port=6831
)
trace.get_tracer_provider().add_span_processor(
BatchSpanProcessor(jaeger_exporter)
)
# Instrument agent
async def customer_support_agent(ticket):
with tracer.start_as_current_span("customer_support_agent") as span:
span.set_attribute("ticket_id", ticket["id"])
span.set_attribute("category", ticket["category"])
# Step 1: Classify
with tracer.start_as_current_span("classify_ticket"):
classification = await classify(ticket)
span.set_attribute("classification", classification)
# Step 2: Retrieve context
with tracer.start_as_current_span("retrieve_context"):
context = await get_context(ticket["user_id"])
# Step 3: Generate response
with tracer.start_as_current_span("generate_response") as gen_span:
response = await generate_response(classification, context)
gen_span.set_attribute("response_length", len(response))
return response
Visualized trace (in Jaeger UI):
customer_support_agent [2.3s total]
├─ classify_ticket [0.8s]
├─ retrieve_context [0.3s]
└─ generate_response [1.2s]
├─ llm_call [1.1s]
└─ format_output [0.1s]
Debugging: If agent is slow, trace shows exactly which step is the bottleneck.
Production Monitoring Dashboard
Key metrics to track:
1. Success Rate
Target: >95%
Alert if: Drops below 90% for 5 minutes
2. Latency (p50, p95, p99)
Target: p95 <5s
Alert if: p95 >10s for 5 minutes
3. Error Rate
Target: <2%
Alert if: >5% for 5 minutes
4. Cost per Request
Target: <$0.10
Alert if: >$1.00 (possible infinite loop)
5. Token Usage
Track: Prompt vs completion tokens
Alert if: Unusual spike (indicates prompt bloat)
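As a sketch, the first three thresholds translate into Prometheus alerting rules roughly like the following (rule names, severity labels, and the 5m windows are illustrative; adjust to your scrape interval):

```yaml
groups:
  - name: agent_alerts
    rules:
      - alert: AgentSuccessRateLow
        expr: |
          sum(rate(agent_requests_total{status="success"}[5m]))
            / sum(rate(agent_requests_total[5m])) < 0.90
        for: 5m
        labels: {severity: page}
      - alert: AgentLatencyHigh
        expr: |
          histogram_quantile(0.95,
            rate(agent_request_duration_seconds_bucket[5m])) > 10
        for: 5m
        labels: {severity: page}
      - alert: AgentErrorRateHigh
        expr: |
          sum(rate(agent_requests_total{status="error"}[5m]))
            / sum(rate(agent_requests_total[5m])) > 0.05
        for: 5m
        labels: {severity: warn}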
Grafana Dashboard Example:
{
"dashboard": {
"title": "Agent Monitoring",
"panels": [
{
"title": "Success Rate (Last Hour)",
"targets": [{
"expr": "sum(rate(agent_requests_total{status='success'}[5m])) / sum(rate(agent_requests_total[5m]))"
}]
},
{
"title": "p95 Latency",
"targets": [{
"expr": "histogram_quantile(0.95, rate(agent_request_duration_seconds_bucket[5m]))"
}]
},
{
"title": "Cost per Hour",
"targets": [{
"expr": "sum(increase(agent_cost_usd_total[1h]))"
}]
}
]
}
}
Agent-Specific Observability Tools
LangSmith (by LangChain)
Features:
- Automatic tracing for LangChain agents
- Prompt versioning
- Dataset management for evaluation
- Debugging UI (see exact prompts, responses)
Setup:
import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your_api_key"
# All LangChain calls now automatically traced
from langchain.agents import create_agent
agent = create_agent(...)
result = agent.invoke("user query") # Traced automatically
View in LangSmith UI: See full trace, prompts, intermediate steps, costs.
Helicone (LLM Analytics)
Features:
- LLM call analytics (latency, cost, tokens)
- Caching layer (reduce duplicate calls)
- Prompt templates
- User feedback tracking
Setup:
import openai
# Route OpenAI traffic through the Helicone proxy
openai_client = openai.OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": "Bearer your_helicone_api_key"}
)
# All calls are now logged to Helicone
response = openai_client.chat.completions.create(...)
Dashboard: View cost breakdown, slowest prompts, error rates by model.
Sentry (Error Tracking)
Features:
- Error aggregation (group similar errors)
- Stack traces
- User context (who experienced error)
- Alerts on new/frequent errors
Setup:
import sentry_sdk
sentry_sdk.init(
dsn="your_sentry_dsn",
traces_sample_rate=0.1, # Sample 10% of traces
profiles_sample_rate=0.1
)
# Unhandled exceptions are captured automatically; capture handled ones explicitly:
try:
result = agent.execute(query)
except Exception as e:
sentry_sdk.capture_exception(e)
Alert: Get Slack/email when new error pattern detected.
Debugging Production Issues
Common debugging workflows:
Issue: Agent success rate dropped to 80%
Steps:
- Check metrics dashboard → Identify time when drop started
- Query logs for that timeframe → Find common error pattern
- Check trace for a failed request → See which step is failing
- Reproduce locally with same inputs → Fix bug
- Deploy fix → Monitor metrics for recovery
Example:
# 1. Find time of drop (metrics show 3:42 PM)
# 2. Query logs
grep '"event": "error"' logs.json | grep '2024-04-18T15:4' | jq .
# Output shows: "error": "Database connection timeout"
# 3. Check trace
# Trace shows: retrieve_context step timing out
# 4. Fix: Increase database connection pool
# 5. Deploy, verify success rate recovers
Cost Monitoring and Alerts
Track LLM costs in real-time:
from collections import defaultdict
from datetime import datetime

class CostTracker:
    def __init__(self):
        self.daily_costs = defaultdict(float)

    def track_call(self, model, tokens):
        # calculate_cost and send_alert are defined elsewhere
        cost = calculate_cost(tokens, model)
        today = datetime.now().date()
        self.daily_costs[today] += cost
        # Alert if daily budget exceeded
        if self.daily_costs[today] > 100:  # $100 daily budget
            send_alert(f"Daily LLM cost exceeded: ${self.daily_costs[today]:.2f}")
        return cost
Set budgets by agent:
AGENT_BUDGETS = {
"customer_support": 50, # $50/day
"research_agent": 200, # $200/day
"code_review": 30 # $30/day
}
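A per-agent budget check can extend the tracker above. This is a minimal sketch: the budget numbers mirror the table, and alerting/persistence are left out.

```python
from collections import defaultdict
from datetime import date

AGENT_BUDGETS = {"customer_support": 50, "research_agent": 200, "code_review": 30}

class PerAgentCostTracker:
    def __init__(self, budgets):
        self.budgets = budgets
        self.daily = defaultdict(float)  # (agent, date) -> cumulative cost

    def track(self, agent, cost):
        """Record a call's cost; return True if the agent is over budget today."""
        key = (agent, date.today())
        self.daily[key] += cost
        budget = self.budgets.get(agent)
        return budget is not None and self.daily[key] > budget

tracker = PerAgentCostTracker(AGENT_BUDGETS)
tracker.track("code_review", 29.0)        # under the $30/day budget
over = tracker.track("code_review", 2.0)  # cumulative $31 exceeds it
```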
Frequently Asked Questions
How much observability is too much?
Recommendation: Start with metrics + basic logs. Add tracing if you're debugging multi-step workflows. Don't over-instrument (logs cost money and slow the system).
Rule of thumb: Observability overhead should be <5% of total system cost.
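One simple way to cap overhead is sampling: always log failures, but only a fraction of successes. A sketch (the 10% default is an arbitrary example):

```python
import random

def should_log(success: bool, sample_rate: float = 0.1) -> bool:
    """Always log failures; log successes at the given sample rate."""
    if not success:
        return True
    return random.random() < sample_rate
```

Errors are always kept for debugging, while routine successful requests contribute only ~10% of the log volume.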
Should I log full prompts and responses?
Security consideration: Prompts may contain sensitive data (PII, credentials).
Options:
- Log preview only (first 200 chars)
- Redact sensitive fields (emails, phone numbers)
- Encrypt logs (decrypt only when debugging)
- Don't log production data (only in staging)
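A minimal redaction pass before logging might look like this. The regexes cover only simple email and phone shapes; real PII detection needs a dedicated library or service.

```python
import re

# Deliberately simple patterns -- illustration only, not production-grade PII detection
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    """Mask email addresses and phone-like digit runs before logging."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

redact("Contact jane.doe@example.com or +1 (555) 010-4477")
# -> "Contact [EMAIL] or [PHONE]"
```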
How long should I retain logs?
Recommendation:
- Metrics: 90 days (cheap, small)
- Logs: 30 days (expensive, large)
- Traces: 7 days (very expensive)
Adjust based on compliance requirements (HIPAA, GDPR may require longer).
What metrics trigger alerts?
Critical (page on-call):
- Success rate <85%
- p95 latency >30s
- Error rate >10%
Warning (Slack/email):
- Success rate <95%
- p95 latency >10s
- Error rate >5%
- Daily cost >$150 (if budget is $100)
Bottom line: Observability is essential for production AI agents. Track metrics (success rate, latency, cost), logs (LLM calls, errors), and traces (request flow). Use agent-specific tools (LangSmith, Helicone) for LLM analytics and general tools (Datadog, Sentry) for infrastructure. Teams with comprehensive observability resolve issues dramatically faster, cutting MTTR from hours to minutes.
Next: Read our Error Handling guide for reliability patterns.