Real-Time Agent Monitoring: Building Production Observability
Implement comprehensive monitoring for AI agents with trace logging, performance metrics, error tracking, and real-time alerting to catch issues before users do.
TL;DR
This guide covers trace logging architecture, performance metrics, error tracking, and real-time dashboards with alerting.
AI agents fail in creative ways: they hallucinate, time out, invoke wrong tools, or get stuck in loops. Without proper observability, you discover issues hours later when users complain. Real-time monitoring surfaces problems immediately, often before they impact users.
This guide covers building a production monitoring stack for AI agents, drawing from our implementation at Athenic where we track 2,000+ agent executions daily across orchestration, research, development, and partnership workflows.
Key takeaways
- Trace every agent action (thoughts, tool calls, handoffs) with structured logging.
- Monitor success rates, latency percentiles, and token costs as primary health signals.
- Aggregate errors by type (timeout, hallucination, tool failure) to prioritize fixes.
- Alert on anomalies: sudden latency spikes, error rate increases, or cost overruns.
Trace logging architecture
Trace logging records every step of agent execution: what the agent thought, which tools it called, what results it received, and how it responded.
Every trace entry should capture:
| Field | Type | Purpose |
|---|---|---|
| trace_id | UUID | Groups related actions in one execution |
| session_id | UUID | Links traces across multiple user interactions |
| timestamp | ISO 8601 | When the action occurred |
| agent_name | string | Which agent performed the action |
| action_type | enum | start, tool_call, handoff, complete, error |
| input | JSON | Agent input (user message, context) |
| output | JSON | Agent output (response, tool results) |
| metadata | JSON | Latency, tokens used, cost, model |
Schema example:
interface AgentTrace {
trace_id: string;
session_id: string;
timestamp: string;
agent_name: string;
action_type: 'start' | 'tool_call' | 'handoff' | 'complete' | 'error';
input: Record<string, any>;
output?: Record<string, any>;
metadata: {
latency_ms?: number;
tokens_used?: number;
cost_usd?: number;
model?: string;
error_message?: string;
};
}
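For concreteness, a single tool-call entry under this schema might look like the following sketch (all IDs, timings, and the model name are illustrative):

```typescript
// Illustrative trace entry for one tool call; every value here is made up.
const exampleTrace: AgentTrace = {
  trace_id: '7f3c1a2e-9b4d-4c8e-a1f2-3d5e6f7a8b9c',
  session_id: 'c2a4e6f8-1b3d-4f5a-8c9e-0d1f2a3b4c5d',
  timestamp: '2025-01-15T14:32:07.412Z',
  agent_name: 'research',
  action_type: 'tool_call',
  input: { tool: 'web_search', params: { query: 'competitor pricing pages' } },
  output: { results: 5 },
  metadata: { latency_ms: 1840, model: 'gpt-4o' },
};
```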
Store traces in a time-series optimized table (PostgreSQL with partitioning or TimescaleDB).
CREATE TABLE agent_traces (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
trace_id UUID NOT NULL,
session_id UUID NOT NULL,
timestamp TIMESTAMPTZ NOT NULL DEFAULT NOW(),
agent_name TEXT NOT NULL,
action_type TEXT NOT NULL,
input JSONB,
output JSONB,
metadata JSONB,
org_id TEXT NOT NULL
);
-- Index for fast trace reconstruction
CREATE INDEX idx_trace_id ON agent_traces(trace_id, timestamp);
-- Index for session-based queries
CREATE INDEX idx_session_id ON agent_traces(session_id, timestamp);
-- Index for time-range queries (for large volumes, consider partitioning by month for efficient pruning)
CREATE INDEX idx_timestamp ON agent_traces(timestamp);
Logging wrapper:
import { v4 as uuidv4 } from 'uuid';

class AgentTracer {
  readonly traceId: string;
  private sessionId: string;

  constructor(private db: Database, sessionId: string) {
    this.traceId = uuidv4();
    this.sessionId = sessionId;
  }
async logStart(agentName: string, input: any) {
await this.db.agentTraces.insert({
trace_id: this.traceId,
session_id: this.sessionId,
agent_name: agentName,
action_type: 'start',
input,
});
}
async logToolCall(agentName: string, toolName: string, params: any, result: any, latency: number) {
await this.db.agentTraces.insert({
trace_id: this.traceId,
session_id: this.sessionId,
agent_name: agentName,
action_type: 'tool_call',
input: { tool: toolName, params },
output: result,
metadata: { latency_ms: latency },
});
}
async logComplete(agentName: string, output: any, metadata: any) {
await this.db.agentTraces.insert({
trace_id: this.traceId,
session_id: this.sessionId,
agent_name: agentName,
action_type: 'complete',
output,
metadata,
});
}
}
Usage in agent:
async function runAgent(sessionId: string, userMessage: string) {
  const tracer = new AgentTracer(db, sessionId); // db: your application's database client
await tracer.logStart('orchestrator', { message: userMessage });
const startTime = Date.now();
const result = await agent.run({ messages: [{ role: 'user', content: userMessage }] });
const latency = Date.now() - startTime;
await tracer.logComplete('orchestrator', result.content, {
latency_ms: latency,
tokens_used: result.usage.total_tokens,
cost_usd: calculateCost(result.usage),
});
return result;
}
Don't do this:
console.log(`Agent ${agent} called tool ${tool} with params ${JSON.stringify(params)}`);
Unstructured logs are hard to query and analyze at scale.
Do this:
logger.info('agent.tool_call', {
agent_name: agent,
tool_name: tool,
params,
trace_id: traceId,
});
Structured logs enable fast filtering: "Show me all tool calls by the research agent in the last hour."
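As a sketch of that kind of query, assuming traces land in the agent_traces table above and a node-postgres Pool is available:

```typescript
import { Pool } from 'pg';

const pool = new Pool(); // connection settings come from the standard PG* environment variables

// "Show me all tool calls by the research agent in the last hour"
async function recentToolCalls(agentName: string) {
  const { rows } = await pool.query(
    `SELECT timestamp, input->>'tool' AS tool_name, (metadata->>'latency_ms')::int AS latency_ms
     FROM agent_traces
     WHERE agent_name = $1
       AND action_type = 'tool_call'
       AND timestamp > NOW() - INTERVAL '1 hour'
     ORDER BY timestamp DESC`,
    [agentName]
  );
  return rows;
}

// Usage: await recentToolCalls('research');
```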
Performance metrics
Track these core metrics to understand agent health.
Success rate is the percentage of agent executions that complete without errors.
SELECT
  agent_name,
  COUNT(*) FILTER (WHERE action_type = 'complete') AS successful,
  COUNT(*) FILTER (WHERE action_type = 'error') AS failed,
  (COUNT(*) FILTER (WHERE action_type = 'complete')::float /
    NULLIF(COUNT(*) FILTER (WHERE action_type IN ('complete', 'error')), 0) * 100) AS success_rate
FROM agent_traces
WHERE timestamp > NOW() - INTERVAL '1 hour'
GROUP BY agent_name;
Alert threshold: Success rate <95% for any agent.
Track p50, p95, and p99 latency to catch outliers.
SELECT
agent_name,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY (metadata->>'latency_ms')::int) AS p50_ms,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY (metadata->>'latency_ms')::int) AS p95_ms,
PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY (metadata->>'latency_ms')::int) AS p99_ms
FROM agent_traces
WHERE
action_type = 'complete'
AND timestamp > NOW() - INTERVAL '1 hour'
GROUP BY agent_name;
Alert thresholds: p95 latency above 10 seconds is a reasonable default; tune per agent (we alert at >8s for the orchestrator and >30s for research agents).
Monitor daily token usage and costs to prevent budget overruns.
SELECT
DATE_TRUNC('day', timestamp) AS day,
agent_name,
SUM((metadata->>'tokens_used')::int) AS total_tokens,
SUM((metadata->>'cost_usd')::numeric) AS total_cost_usd
FROM agent_traces
WHERE timestamp > NOW() - INTERVAL '7 days'
GROUP BY day, agent_name
ORDER BY day DESC, agent_name;
Alert threshold: Daily cost >120% of 7-day moving average.
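A minimal sketch of that check against the same table, assuming a node-postgres Pool; the 1.2 multiplier mirrors the 120% threshold:

```typescript
import { Pool } from 'pg';

// Returns true when today's spend (so far) already exceeds 120% of the trailing 7-day daily average.
async function isCostOverrun(pool: Pool): Promise<boolean> {
  const { rows } = await pool.query(`
    SELECT
      COALESCE(SUM((metadata->>'cost_usd')::numeric) FILTER (
        WHERE timestamp >= DATE_TRUNC('day', NOW())
      ), 0) AS today_cost,
      COALESCE(SUM((metadata->>'cost_usd')::numeric) FILTER (
        WHERE timestamp >= DATE_TRUNC('day', NOW()) - INTERVAL '7 days'
          AND timestamp < DATE_TRUNC('day', NOW())
      ) / 7, 0) AS avg_daily_cost
    FROM agent_traces
    WHERE timestamp >= DATE_TRUNC('day', NOW()) - INTERVAL '7 days'
  `);
  const { today_cost, avg_daily_cost } = rows[0];
  return Number(avg_daily_cost) > 0 && Number(today_cost) > 1.2 * Number(avg_daily_cost);
}
```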
Understand which tools agents use most.
SELECT
input->>'tool' AS tool_name,
COUNT(*) AS invocations,
AVG((metadata->>'latency_ms')::int) AS avg_latency_ms
FROM agent_traces
WHERE
action_type = 'tool_call'
AND timestamp > NOW() - INTERVAL '24 hours'
GROUP BY tool_name
ORDER BY invocations DESC;
High-frequency tools are the best candidates for latency optimization and response caching.
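One low-effort option is a short-TTL in-memory cache in front of read-only tools. A rough sketch follows; the key scheme and 60-second TTL are arbitrary choices, and it only suits tools without side effects:

```typescript
// Tiny TTL cache for read-only tool results. Not appropriate for tools that mutate state.
const toolCache = new Map<string, { value: unknown; expiresAt: number }>();

async function cachedToolCall<T>(
  toolName: string,
  params: object,
  execute: () => Promise<T>,
  ttlMs = 60_000
): Promise<T> {
  const key = `${toolName}:${JSON.stringify(params)}`;
  const hit = toolCache.get(key);
  if (hit && hit.expiresAt > Date.now()) {
    return hit.value as T; // cache hit: skip the tool entirely
  }
  const value = await execute();
  toolCache.set(key, { value, expiresAt: Date.now() + ttlMs });
  return value;
}

// Usage: await cachedToolCall('web_search', { query }, () => runWebSearch(query));
```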
Use a background job to aggregate metrics every 5 minutes.
// Run every 5 minutes
cron.schedule('*/5 * * * *', async () => {
  const result = await db.query(`
    SELECT
      agent_name,
      COUNT(*) FILTER (WHERE action_type = 'complete') AS successful,
      COUNT(*) FILTER (WHERE action_type = 'error') AS failed,
      PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY (metadata->>'latency_ms')::int) AS p95_latency
    FROM agent_traces
    WHERE timestamp > NOW() - INTERVAL '5 minutes'
    GROUP BY agent_name
  `);
  for (const row of result.rows) {
    // Send to your metrics platform (Datadog, Prometheus, etc.)
    metricsClient.gauge('agent.success_rate', row.successful / (row.successful + row.failed), {
      agent: row.agent_name,
    });
    metricsClient.histogram('agent.latency_p95', row.p95_latency, {
      agent: row.agent_name,
    });
  }
});
Error tracking
Errors fall into two broad categories: transient (timeouts, rate limits, flaky tools) and persistent (hallucinations, logic bugs). The table below maps each type to a cause and retry strategy, with a retry wrapper sketched after it.
| Error type | Cause | Retry strategy |
|---|---|---|
| Timeout | Agent/tool exceeded time limit | Retry with exponential backoff |
| Rate limit | API quota exceeded | Retry after cooldown period |
| Tool failure | External service down | Retry 3x, then fallback |
| Hallucination | Agent generated invalid output | Re-prompt with stronger constraints |
| Logic error | Code bug in agent/tool | No retry, escalate to engineering |
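A sketch of a retry wrapper following those strategies; the error classifier is a hypothetical helper you would adapt to your own error shapes:

```typescript
type ErrorClass = 'timeout' | 'rate_limit' | 'tool_failure' | 'persistent';

// Hypothetical classifier: replace these string checks with your own error types.
function classifyError(error: unknown): ErrorClass {
  const message = error instanceof Error ? error.message.toLowerCase() : '';
  if (message.includes('timeout')) return 'timeout';
  if (message.includes('rate limit') || message.includes('429')) return 'rate_limit';
  if (message.includes('unavailable') || message.includes('503')) return 'tool_failure';
  return 'persistent';
}

// Retries transient failures with exponential backoff; persistent errors are re-thrown immediately.
async function withRetries<T>(fn: () => Promise<T>, maxAttempts = 3): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      const kind = classifyError(error);
      if (kind === 'persistent') throw error; // hallucinations and logic bugs need a different fix
      const baseMs = kind === 'rate_limit' ? 5_000 : 1_000; // rate limits get a longer cooldown
      await new Promise((resolve) => setTimeout(resolve, baseMs * 2 ** (attempt - 1)));
    }
  }
  throw lastError;
}
```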
Use Sentry to aggregate errors with stack traces and context.
import * as Sentry from '@sentry/node';
Sentry.init({
dsn: process.env.SENTRY_DSN,
environment: process.env.NODE_ENV,
});
async function runAgentWithErrorTracking(sessionId: string, userMessage: string) {
try {
return await runAgent(sessionId, userMessage);
} catch (error) {
Sentry.captureException(error, {
tags: {
agent_type: 'orchestrator',
session_id: sessionId,
},
contexts: {
agent: {
input: userMessage,
        // Attach the execution's trace_id here if the tracer for this run is accessible in this scope
},
},
});
throw error;
}
}
The Sentry dashboard groups these errors by type, shows frequency trends over time, and surfaces the tags and context attached above (agent type, session ID, input) so you can jump straight to the failing execution.
Detect hallucinations and invalid outputs with validation rules.
function validateAgentOutput(output: string, expectedFormat: string): boolean {
// Check for common hallucination markers
const hallucinationPatterns = [
/\[INSERT.*?\]/i,
/\[PLACEHOLDER\]/i,
/TODO:/i,
/\[Your.*?here\]/i,
];
for (const pattern of hallucinationPatterns) {
if (pattern.test(output)) {
Sentry.captureMessage('Agent hallucination detected', {
level: 'warning',
tags: { validation_type: 'hallucination' },
extra: { output, pattern: pattern.source },
});
return false;
}
}
// Validate expected format (JSON, markdown, etc.)
if (expectedFormat === 'json') {
try {
JSON.parse(output);
} catch {
Sentry.captureMessage('Invalid JSON output from agent', {
level: 'warning',
extra: { output },
});
return false;
}
}
return true;
}
At Athenic, this validation caught 40+ hallucination instances per week that would have reached users.
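In practice the validator pairs with a single constrained retry before escalating. A rough sketch, assuming the same agent.run interface used earlier and the validateAgentOutput function above:

```typescript
// Re-prompt once with tighter constraints when validation fails, then give up loudly.
async function runWithValidation(prompt: string, expectedFormat: 'json' | 'markdown') {
  let result = await agent.run({ messages: [{ role: 'user', content: prompt }] });
  if (validateAgentOutput(result.content, expectedFormat)) return result;

  result = await agent.run({
    messages: [
      { role: 'user', content: prompt },
      {
        role: 'user',
        content: `Your previous answer was invalid. Respond again strictly as ${expectedFormat}, with no placeholders or TODOs.`,
      },
    ],
  });
  if (!validateAgentOutput(result.content, expectedFormat)) {
    throw new Error('Agent output failed validation after re-prompt');
  }
  return result;
}
```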
Dashboards and alerts
Build real-time dashboards to visualize agent health. A useful layout has four panels:
1. Active jobs panel
2. Performance metrics
3. Error log
4. Cost tracking
Stream live agent status to the dashboard using Supabase Realtime.
// Backend: Publish agent status updates
import { createClient } from '@supabase/supabase-js';

const supabase = createClient(process.env.SUPABASE_URL, process.env.SUPABASE_SERVICE_KEY);
async function publishAgentStatus(agentName: string, status: 'running' | 'completed' | 'error') {
await supabase.from('agent_status').upsert({
agent_name: agentName,
status,
updated_at: new Date(),
});
// Realtime broadcast
await supabase.channel('agent-updates').send({
type: 'broadcast',
event: 'status_change',
payload: { agent: agentName, status },
});
}
// Frontend: Subscribe to updates
const channel = supabase.channel('agent-updates');
channel.on('broadcast', { event: 'status_change' }, (payload) => {
console.log('Agent status changed:', payload);
updateDashboard(payload.agent, payload.status);
}).subscribe();
Define alerts for critical conditions.
interface AlertRule {
name: string;
condition: string;
threshold: number;
check_interval_minutes: number;
notification_channels: string[];
}
const alertRules: AlertRule[] = [
{
name: 'High error rate',
condition: 'error_rate > threshold',
threshold: 0.05, // 5%
check_interval_minutes: 5,
notification_channels: ['slack', 'pagerduty'],
},
{
name: 'Elevated latency',
condition: 'p95_latency_ms > threshold',
threshold: 10000, // 10s
check_interval_minutes: 5,
notification_channels: ['slack'],
},
{
name: 'Cost overrun',
condition: 'daily_cost_usd > threshold',
threshold: 100,
check_interval_minutes: 60,
notification_channels: ['slack', 'email'],
},
];
// Alert checking loop
cron.schedule('*/5 * * * *', async () => {
for (const rule of alertRules) {
const value = await evaluateMetric(rule.condition);
if (value > rule.threshold) {
await sendAlert(rule.name, value, rule.notification_channels);
}
}
});
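The evaluateMetric call above is left abstract; one way to implement it is to key each rule's condition to a SQL aggregate over agent_traces. The queries below are assumptions that mirror the earlier metrics, and pool is the same node-postgres Pool used elsewhere:

```typescript
import { Pool } from 'pg';

const pool = new Pool();

// Hypothetical mapping from alert conditions to the SQL that produces the current value.
const metricQueries: Record<string, string> = {
  'error_rate > threshold': `
    SELECT COUNT(*) FILTER (WHERE action_type = 'error')::float /
           NULLIF(COUNT(*) FILTER (WHERE action_type IN ('complete', 'error')), 0) AS value
    FROM agent_traces
    WHERE timestamp > NOW() - INTERVAL '5 minutes'`,
  'p95_latency_ms > threshold': `
    SELECT PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY (metadata->>'latency_ms')::int) AS value
    FROM agent_traces
    WHERE action_type = 'complete' AND timestamp > NOW() - INTERVAL '5 minutes'`,
  'daily_cost_usd > threshold': `
    SELECT COALESCE(SUM((metadata->>'cost_usd')::numeric), 0) AS value
    FROM agent_traces
    WHERE timestamp >= DATE_TRUNC('day', NOW())`,
};

async function evaluateMetric(condition: string): Promise<number> {
  const sql = metricQueries[condition];
  if (!sql) throw new Error(`No query registered for condition: ${condition}`);
  const { rows } = await pool.query(sql);
  return Number(rows[0]?.value ?? 0);
}
```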
Our production monitoring setup tracks 2,000+ agent executions daily with <5 minute incident detection time.
Architecture:
- Postgres (Supabase) stores every trace using the agent_traces schema above
- A 5-minute cron job aggregates success rate, latency, and cost into our metrics platform
- Sentry captures exceptions and validation failures with agent context
- Supabase Realtime pushes live agent status to the dashboard
- Alerts route to Slack and PagerDuty based on the rules above
Key metrics tracked:
| Metric | Current value | Alert threshold |
|---|---|---|
| Overall success rate | 94.2% | <90% |
| p95 latency (orchestrator) | 3.8s | >8s |
| p95 latency (research) | 18.2s | >30s |
| Daily token cost | $42.15 | >$100 |
| Error rate (last hour) | 1.2% | >5% |
Incident example:
Last month, our GitHub MCP integration error rate spiked from 1% to 12% over 10 minutes. The error-rate alert fired within five minutes, and error aggregation traced the failures to the create_issue tool.
Without monitoring, we'd have discovered this hours later via user reports.
Clone our agent monitoring starter kit with pre-built trace logging, Sentry integration, and dashboard templates.
FAQ

How long should traces be retained? 30 days for detailed traces, 90 days for aggregated metrics. Longer retention increases storage costs without adding much debugging value.

Should full agent inputs and outputs be logged? Yes for debugging, but scrub PII (emails, phone numbers, credentials) before storage. Use regex patterns or NER models to detect sensitive data.

How do you keep logging from slowing agents down? Batch inserts every 5-10 seconds instead of writing each trace individually, and use background workers to keep database writes off the agent execution path (see the sketch after this FAQ).

Can this be done without paid observability tools? Yes. Consider GlitchTip (an open-source Sentry alternative), or custom logging to Elasticsearch/Loki for budget-conscious setups.

How do you monitor multi-agent workflows? Use trace_id to group all agents involved in one workflow. Dashboards should show workflow-level success rates and latency, not just individual agents.
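A minimal sketch of the batching approach from the FAQ, reusing the AgentTrace shape and agent_traces table from earlier (the 5-second flush interval is arbitrary):

```typescript
import { Pool } from 'pg';

const pool = new Pool();
const pending: AgentTrace[] = [];

// Agents enqueue traces in memory; nothing blocks on the database during execution.
function enqueueTrace(trace: AgentTrace) {
  pending.push(trace);
}

// A background flush writes the whole batch as one multi-row insert.
// NOTE: add org_id to the column list if you keep the multi-tenant column from the schema above.
setInterval(async () => {
  if (pending.length === 0) return;
  const batch = pending.splice(0, pending.length);
  const values: unknown[] = [];
  const placeholders = batch
    .map((t, i) => {
      values.push(t.trace_id, t.session_id, t.timestamp, t.agent_name, t.action_type, t.input, t.output ?? null, t.metadata);
      const base = i * 8;
      return `($${base + 1}, $${base + 2}, $${base + 3}, $${base + 4}, $${base + 5}, $${base + 6}, $${base + 7}, $${base + 8})`;
    })
    .join(', ');
  await pool.query(
    `INSERT INTO agent_traces (trace_id, session_id, timestamp, agent_name, action_type, input, output, metadata)
     VALUES ${placeholders}`,
    values
  );
}, 5_000);
```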
Production agent monitoring requires structured trace logging, performance metrics (success rate, latency, cost), error tracking with categorization, and real-time dashboards with alerting. These tools detect issues in minutes rather than hours.