TL;DR
- Observability: Ability to understand what's happening inside your agent from external outputs (logs, metrics, traces).
- Three pillars: Metrics (quantitative), Logs (detailed events), Traces (request flow through system).
- Metrics: Track success rate, latency, cost, error rate. Alert if success <95%, latency p95 >10s.
- Logs: Record every LLM call (prompt, response, tokens, cost, timestamp). Use structured logging (JSON).
- Traces: Track multi-step agent workflows (which steps ran, how long each took, where it failed).
- Tools: LangSmith (agent-specific), Datadog/Grafana (general), Sentry (error tracking), Helicone (LLM analytics).
- In practice: Teams with comprehensive observability resolve issues roughly 4× faster, cutting MTTR from hours to minutes.
Agent Observability
Production incident without observability:
User: "Agent is broken!"
Engineer: "What's broken?"
User: "It just doesn't work."
Engineer: [Spends 2 hours debugging, can't reproduce]
With observability:
User: "Agent is broken!"
Engineer: [Checks dashboard]
- Success rate dropped to 72% (normally 95%)
- Spike in timeouts at 3:42 PM
- Trace shows: RAG retrieval failing (database connection timeout)
Engineer: [Fixes database connection pool in 10 minutes]
Observability = ability to understand system internals from external data.
The Three Pillars
1. Metrics (Quantitative)
What: Numeric time-series data (counts, rates, durations).
Examples:
- Requests per second
- Success rate (%)
- p50/p95/p99 latency
- Token usage
- Cost per request
- Error rate
Implementation:
from functools import wraps
import time

from prometheus_client import Counter, Histogram, Gauge

# Define metrics
requests_total = Counter('agent_requests_total', 'Total requests', ['agent_name', 'status'])
request_duration = Histogram('agent_request_duration_seconds', 'Request duration')
tokens_used = Counter('agent_tokens_used_total', 'Tokens used', ['model'])
cost_usd = Counter('agent_cost_usd_total', 'Cost in USD')
active_requests = Gauge('agent_active_requests', 'Currently processing requests')

def track_agent_request(agent_name):
    """Decorator factory to track agent metrics"""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            active_requests.inc()
            start_time = time.time()
            try:
                result = await func(*args, **kwargs)
                # Success metrics
                requests_total.labels(agent_name=agent_name, status='success').inc()
                request_duration.observe(time.time() - start_time)
                # Token/cost tracking (calculate_cost is your own pricing helper)
                tokens = result.get('tokens_used', 0)
                tokens_used.labels(model=result.get('model', 'unknown')).inc(tokens)
                cost_usd.inc(calculate_cost(tokens, result.get('model')))
                return result
            except Exception:
                # Error metrics
                requests_total.labels(agent_name=agent_name, status='error').inc()
                raise
            finally:
                active_requests.dec()
        return wrapper
    return decorator

# Usage
@track_agent_request("customer_support")
async def handle_support_ticket(ticket):
    # ... agent logic
    pass
Visualize in Grafana:
# Success rate over time
sum(rate(agent_requests_total{status="success"}[5m]))
/
sum(rate(agent_requests_total[5m]))
# p95 latency
histogram_quantile(0.95, rate(agent_request_duration_seconds_bucket[5m]))
2. Logs (Event Details)
What: Detailed records of specific events (LLM calls, decisions, errors).
Structured logging (JSON, not plain text):
import json
import logging
import traceback
from datetime import datetime
class StructuredLogger:
    def __init__(self, name):
        self.logger = logging.getLogger(name)

    def log_llm_call(self, model, prompt, response, tokens, cost, latency_ms):
        """Log LLM API call with all details"""
        self.logger.info(json.dumps({
            "event": "llm_call",
            "timestamp": datetime.utcnow().isoformat(),
            "model": model,
            "prompt_preview": prompt[:200],  # First 200 chars
            "prompt_length": len(prompt),
            "response_preview": response[:200],
            "tokens": {
                "prompt": tokens["prompt"],
                "completion": tokens["completion"],
                "total": tokens["total"]
            },
            "cost_usd": cost,
            "latency_ms": latency_ms
        }))

    def log_agent_decision(self, decision_point, chosen_action, alternatives, reasoning):
        """Log agent decision-making"""
        self.logger.info(json.dumps({
            "event": "agent_decision",
            "timestamp": datetime.utcnow().isoformat(),
            "decision_point": decision_point,
            "chosen_action": chosen_action,
            "alternatives": alternatives,
            "reasoning": reasoning
        }))

    def log_error(self, error, context):
        """Log error with full context"""
        self.logger.error(json.dumps({
            "event": "error",
            "timestamp": datetime.utcnow().isoformat(),
            "error_type": type(error).__name__,
            "error_message": str(error),
            "stack_trace": traceback.format_exc(),
            "context": context
        }))
Usage:
logger = StructuredLogger("customer_support_agent")
# Log LLM call
start_time = time.time()
response = await call_llm(prompt, model="gpt-4-turbo")
latency = (time.time() - start_time) * 1000
logger.log_llm_call(
    model="gpt-4-turbo",
    prompt=prompt,
    response=response["content"],
    tokens=response["usage"],
    cost=calculate_cost(response["usage"], "gpt-4-turbo"),
    latency_ms=latency
)
Query logs (with a log aggregation tool such as Datadog or Splunk):
# Find all errors in the last hour
event:error timestamp:>now-1h
# Find slow requests (>10s latency)
event:llm_call latency_ms:>10000
# Find expensive requests (>$1 cost)
event:llm_call cost_usd:>1.0
3. Traces (Request Flow)
What: Track a request through the entire system (which functions were called, in what order, and how long each took).
Implementation with OpenTelemetry:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter

# Setup tracer
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

# Export to Jaeger (trace visualization)
jaeger_exporter = JaegerExporter(
    agent_host_name="localhost",
    agent_port=6831
)
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)
# Instrument agent
async def customer_support_agent(ticket):
    with tracer.start_as_current_span("customer_support_agent") as span:
        span.set_attribute("ticket_id", ticket["id"])
        span.set_attribute("category", ticket["category"])

        # Step 1: Classify
        with tracer.start_as_current_span("classify_ticket"):
            classification = await classify(ticket)
            span.set_attribute("classification", classification)

        # Step 2: Retrieve context
        with tracer.start_as_current_span("retrieve_context"):
            context = await get_context(ticket["user_id"])

        # Step 3: Generate response
        with tracer.start_as_current_span("generate_response") as gen_span:
            response = await generate_response(classification, context)
            gen_span.set_attribute("response_length", len(response))

        return response
Visualized trace (in Jaeger UI):
customer_support_agent [2.3s total]
├─ classify_ticket [0.8s]
├─ retrieve_context [0.3s]
└─ generate_response [1.2s]
├─ llm_call [1.1s]
└─ format_output [0.1s]
Debugging: If agent is slow, trace shows exactly which step is the bottleneck.
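Traces also show where a request failed, not just where it was slow. Below is a minimal sketch of recording a failure on a span, assuming the same tracer as above; record_exception and set_status are standard OpenTelemetry span methods.
from opentelemetry.trace import Status, StatusCode

async def retrieve_context_step(ticket):
    with tracer.start_as_current_span("retrieve_context") as span:
        try:
            return await get_context(ticket["user_id"])
        except Exception as exc:
            # Mark the span as failed so the error shows up on the trace in Jaeger
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR, str(exc)))
            raise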
Production Monitoring Dashboard
Key metrics to track:
1. Success Rate
Target: >95%
Alert if: Drops below 90% for 5 minutes
2. Latency (p50, p95, p99)
Target: p95 <5s
Alert if: p95 >10s for 5 minutes
3. Error Rate
Target: <2%
Alert if: >5% for 5 minutes
4. Cost per Request
Target: <$0.10
Alert if: >$1.00 (possible infinite loop)
5. Token Usage
Track: Prompt vs completion tokens
Alert if: Unusual spike (indicates prompt bloat)
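These thresholds translate directly into Prometheus alert expressions over the metrics defined earlier; a sketch (evaluate each with a 5-minute "for" clause in your alerting rules, and tune thresholds to your traffic):
# Success rate below 90%
(sum(rate(agent_requests_total{status="success"}[5m]))
  / sum(rate(agent_requests_total[5m]))) < 0.90

# p95 latency above 10s
histogram_quantile(0.95, rate(agent_request_duration_seconds_bucket[5m])) > 10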
Grafana Dashboard Example:
{
  "dashboard": {
    "title": "Agent Monitoring",
    "panels": [
      {
        "title": "Success Rate (Last Hour)",
        "targets": [{
          "expr": "sum(rate(agent_requests_total{status='success'}[5m])) / sum(rate(agent_requests_total[5m]))"
        }]
      },
      {
        "title": "p95 Latency",
        "targets": [{
          "expr": "histogram_quantile(0.95, rate(agent_request_duration_seconds_bucket[5m]))"
        }]
      },
      {
        "title": "Cost per Hour",
        "targets": [{
          "expr": "sum(increase(agent_cost_usd_total[1h]))"
        }]
      }
    ]
  }
}
Agent-Specific Observability Tools
LangSmith (by LangChain)
Features:
- Automatic tracing for LangChain agents
- Prompt versioning
- Dataset management for evaluation
- Debugging UI (see exact prompts, responses)
Setup:
import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your_api_key"
# All LangChain calls now automatically traced
from langchain.agents import create_agent
agent = create_agent(...)
result = agent.invoke("user query") # Traced automatically
View in LangSmith UI: See full trace, prompts, intermediate steps, costs.
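If your agent isn't built on LangChain, the langsmith SDK's traceable decorator can log runs to LangSmith as well. A sketch, assuming the langsmith package is installed, the same environment variables are set, and classify/generate_response are the hypothetical steps from earlier:
from langsmith import traceable

@traceable(run_type="chain", name="customer_support_agent")
async def handle_ticket(ticket: dict) -> str:
    # Inputs, outputs, and latency for each call are recorded as a run in LangSmith
    classification = await classify(ticket)
    return await generate_response(classification, ticket)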
Helicone (LLM Analytics)
Features:
- LLM call analytics (latency, cost, tokens)
- Caching layer (reduce duplicate calls)
- Prompt templates
- User feedback tracking
Setup:
import os
import openai

# Helicone integrates as a proxy in front of the OpenAI API: point the client
# at Helicone's base URL and pass your Helicone key as a header
client = openai.OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"}
)

# All calls through this client are logged to Helicone
response = client.chat.completions.create(...)
Dashboard: View cost breakdown, slowest prompts, error rates by model.
Sentry (Error Tracking)
Features:
- Error aggregation (group similar errors)
- Stack traces
- User context (who experienced error)
- Alerts on new/frequent errors
Setup:
import sentry_sdk

sentry_sdk.init(
    dsn="your_sentry_dsn",
    traces_sample_rate=0.1,   # Sample 10% of traces
    profiles_sample_rate=0.1
)

# Unhandled errors in instrumented frameworks are captured automatically;
# capture manually where you catch exceptions yourself
try:
    result = agent.execute(query)
except Exception as e:
    sentry_sdk.capture_exception(e)
    raise
Alert: Get Slack/email when new error pattern detected.
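To make the user-context feature useful for agents, attach agent and ticket metadata before errors happen. A sketch using Sentry's standard scope helpers (the field names are illustrative):
sentry_sdk.set_user({"id": ticket["user_id"]})
sentry_sdk.set_tag("agent_name", "customer_support")
sentry_sdk.set_context("ticket", {
    "id": ticket["id"],
    "category": ticket["category"]
})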
Debugging Production Issues
Common debugging workflows:
Issue: Agent success rate dropped to 80%
Steps:
- Check metrics dashboard → Identify time when drop started
- Query logs for that timeframe → Find common error pattern
- Check trace for a failed request → See which step is failing
- Reproduce locally with same inputs → Fix bug
- Deploy fix → Monitor metrics for recovery
Example:
# 1. Find time of drop (metrics show 3:42 PM)
# 2. Query logs
grep '"event": "error"' logs.json | grep '2024-04-18T15:4' | jq .
# Output shows: "error": "Database connection timeout"
# 3. Check trace
# Trace shows: retrieve_context step timing out
# 4. Fix: Increase database connection pool
# 5. Deploy, verify success rate recovers
Cost Monitoring and Alerts
Track LLM costs in real-time:
from collections import defaultdict
from datetime import datetime

class CostTracker:
    def __init__(self):
        self.daily_costs = defaultdict(float)

    def track_call(self, model, tokens):
        cost = calculate_cost(tokens, model)
        today = datetime.now().date()
        self.daily_costs[today] += cost
        # Alert if daily budget exceeded
        if self.daily_costs[today] > 100:  # $100 daily budget
            send_alert(f"Daily LLM cost exceeded: ${self.daily_costs[today]:.2f}")
        return cost
Set budgets by agent:
AGENT_BUDGETS = {
    "customer_support": 50,   # $50/day
    "research_agent": 200,    # $200/day
    "code_review": 30         # $30/day
}
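A per-agent variant of the tracker above could enforce these budgets. The sketch below is a hypothetical extension; send_alert and calculate_cost are the same assumed helpers as before:
class AgentCostTracker(CostTracker):
    def __init__(self, budgets):
        super().__init__()
        self.budgets = budgets
        self.agent_costs = defaultdict(float)  # keyed by (agent_name, date)

    def track_agent_call(self, agent_name, model, tokens):
        cost = self.track_call(model, tokens)  # still tracks the global daily total
        key = (agent_name, datetime.now().date())
        self.agent_costs[key] += cost
        if self.agent_costs[key] > self.budgets.get(agent_name, float("inf")):
            send_alert(f"{agent_name} exceeded its daily budget: ${self.agent_costs[key]:.2f}")
        return cost

tracker = AgentCostTracker(AGENT_BUDGETS)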
Frequently Asked Questions
How much observability is too much?
Recommendation: Start with metrics and basic logs. Add tracing when you need to debug multi-step workflows. Don't over-instrument: logs cost money and slow the system down.
Rule of thumb: Observability overhead should be <5% of total system cost.
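One common way to keep logging overhead down is to sample successful calls while keeping every error. A minimal sketch (the 10% rate is an arbitrary example):
import random

def should_log(status: str, sample_rate: float = 0.1) -> bool:
    # Always log errors; log only a random sample of successful calls
    return status == "error" or random.random() < sample_rate
Gate calls to logger.log_llm_call with this check on the hot path; metrics still count every request, so aggregate numbers stay accurate.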
Should I log full prompts and responses?
Security consideration: Prompts may contain sensitive data (PII, credentials).
Options:
- Log preview only (first 200 chars)
- Redact sensitive fields (emails, phone numbers); see the sketch after this list
- Encrypt logs (decrypt only when debugging)
- Don't log production data (only in staging)
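For the redaction option, a minimal sketch using regular expressions; the patterns below catch common email and phone formats only, not all PII:
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    # Replace emails and phone-like numbers before logging
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)
Apply redact() to prompt and response before passing them to log_llm_call.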
How long should I retain logs?
Recommendation:
- Metrics: 90 days (cheap, small)
- Logs: 30 days (expensive, large)
- Traces: 7 days (very expensive)
Adjust based on compliance requirements (HIPAA, GDPR may require longer).
What metrics trigger alerts?
Critical (page on-call):
- Success rate <85%
- p95 latency >30s
- Error rate >10%
Warning (Slack/email):
- Success rate <95%
- p95 latency >10s
- Error rate >5%
- Daily cost >$150 (if budget is $100)
Bottom line: Observability is essential for production AI agents. Track metrics (success rate, latency, cost), logs (LLM calls, errors), and traces (request flow). Use agent-specific tools (LangSmith, Helicone) for LLM analytics and general tools (Datadog, Sentry) for infrastructure. Teams with comprehensive observability resolve issues roughly 4× faster, cutting MTTR from hours to minutes.
Next: Read our Error Handling guide for reliability patterns.