Academy · 8 Aug 2025 · 13 min read

Real-Time Agent Monitoring: Building Production Observability

Implement comprehensive monitoring for AI agents with trace logging, performance metrics, error tracking, and real-time alerting to catch issues before users do.

Max Beech
Head of Content

TL;DR

  • Structured trace logging captures every agent decision, tool call, and handoff for debugging.
  • Track latency (p50/p95/p99), success rates, and token consumption as core metrics.
  • Use Sentry or similar platforms for error aggregation and alerting.
  • Build real-time dashboards showing agent health, active jobs, and performance trends.

Jump to Trace logging architecture · Jump to Performance metrics · Jump to Error tracking · Jump to Dashboards and alerts

AI agents fail in creative ways: they hallucinate, time out, invoke wrong tools, or get stuck in loops. Without proper observability, you discover issues hours later when users complain. Real-time monitoring surfaces problems immediately, often before they impact users.

This guide covers building a production monitoring stack for AI agents, drawing from our implementation at Athenic where we track 2,000+ agent executions daily across orchestration, research, development, and partnership workflows.

Key takeaways

  • Trace every agent action (thoughts, tool calls, handoffs) with structured logging.
  • Monitor success rates, latency percentiles, and token costs as primary health signals.
  • Aggregate errors by type (timeout, hallucination, tool failure) to prioritize fixes.
  • Alert on anomalies: sudden latency spikes, error rate increases, or cost overruns.

Trace logging architecture

Trace logging records every step of agent execution: what the agent thought, which tools it called, what results it received, and how it responded.

What to log

Every trace entry should capture:

| Field | Type | Purpose |
| --- | --- | --- |
| trace_id | UUID | Groups related actions in one execution |
| session_id | UUID | Links traces across multiple user interactions |
| timestamp | ISO 8601 | When the action occurred |
| agent_name | string | Which agent performed the action |
| action_type | enum | start, tool_call, handoff, complete, error |
| input | JSON | Agent input (user message, context) |
| output | JSON | Agent output (response, tool results) |
| metadata | JSON | Latency, tokens used, cost, model |

Schema example:

interface AgentTrace {
  trace_id: string;
  session_id: string;
  timestamp: string;
  agent_name: string;
  action_type: 'start' | 'tool_call' | 'handoff' | 'complete' | 'error';
  input: Record<string, any>;
  output?: Record<string, any>;
  metadata: {
    latency_ms?: number;
    tokens_used?: number;
    cost_usd?: number;
    model?: string;
    error_message?: string;
  };
}

Implementation with database storage

Store traces in a time-series optimized table (PostgreSQL with partitioning or TimescaleDB).

CREATE TABLE agent_traces (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  trace_id UUID NOT NULL,
  session_id UUID NOT NULL,
  timestamp TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  agent_name TEXT NOT NULL,
  action_type TEXT NOT NULL,
  input JSONB,
  output JSONB,
  metadata JSONB,
  org_id TEXT NOT NULL
);

-- Index for fast trace reconstruction
CREATE INDEX idx_trace_id ON agent_traces(trace_id, timestamp);

-- Index for session-based queries
CREATE INDEX idx_session_id ON agent_traces(session_id, timestamp);

-- Index on timestamp for time-range queries and retention cleanup
CREATE INDEX idx_timestamp ON agent_traces(timestamp);
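
If trace volume grows, the same table can instead be declared with native range partitioning so old months can be dropped cheaply rather than deleted row by row. A minimal sketch (note that with declarative partitioning the primary key must include the partition key; the partition name is illustrative):

CREATE TABLE agent_traces (
  id UUID DEFAULT gen_random_uuid(),
  trace_id UUID NOT NULL,
  session_id UUID NOT NULL,
  timestamp TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  agent_name TEXT NOT NULL,
  action_type TEXT NOT NULL,
  input JSONB,
  output JSONB,
  metadata JSONB,
  org_id TEXT NOT NULL,
  PRIMARY KEY (id, timestamp)
) PARTITION BY RANGE (timestamp);

-- One partition per month; drop whole partitions when they age out of retention
CREATE TABLE agent_traces_2025_08 PARTITION OF agent_traces
  FOR VALUES FROM ('2025-08-01') TO ('2025-09-01');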

Logging wrapper:

import { v4 as uuidv4 } from 'uuid';

class AgentTracer {
  private db: Database;
  private traceId: string;
  private sessionId: string;

  constructor(db: Database, sessionId: string) {
    this.db = db; // the tracer needs a database handle to write trace rows
    this.traceId = uuidv4();
    this.sessionId = sessionId;
  }

  async logStart(agentName: string, input: any) {
    await this.db.agentTraces.insert({
      trace_id: this.traceId,
      session_id: this.sessionId,
      agent_name: agentName,
      action_type: 'start',
      input,
    });
  }

  async logToolCall(agentName: string, toolName: string, params: any, result: any, latency: number) {
    await this.db.agentTraces.insert({
      trace_id: this.traceId,
      session_id: this.sessionId,
      agent_name: agentName,
      action_type: 'tool_call',
      input: { tool: toolName, params },
      output: result,
      metadata: { latency_ms: latency },
    });
  }

  async logComplete(agentName: string, output: any, metadata: any) {
    await this.db.agentTraces.insert({
      trace_id: this.traceId,
      session_id: this.sessionId,
      agent_name: agentName,
      action_type: 'complete',
      output,
      metadata,
    });
  }
}

Usage in agent:

async function runAgent(sessionId: string, userMessage: string) {
  const tracer = new AgentTracer(db, sessionId);

  await tracer.logStart('orchestrator', { message: userMessage });

  const startTime = Date.now();
  const result = await agent.run({ messages: [{ role: 'user', content: userMessage }] });
  const latency = Date.now() - startTime;

  await tracer.logComplete('orchestrator', result.content, {
    latency_ms: latency,
    tokens_used: result.usage.total_tokens,
    cost_usd: calculateCost(result.usage),
  });

  return result;
}

Structured vs unstructured logging

Don't do this:

console.log(`Agent ${agent} called tool ${tool} with params ${JSON.stringify(params)}`);

Unstructured logs are hard to query and analyze at scale.

Do this:

logger.info('agent.tool_call', {
  agent_name: agent,
  tool_name: tool,
  params,
  trace_id: traceId,
});

Structured logs enable fast filtering: "Show me all tool calls by the research agent in the last hour."
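
For example, that question becomes a single filter over agent_traces (assuming the schema above and an agent literally named research):

SELECT
  timestamp,
  input->>'tool' AS tool_name,
  (metadata->>'latency_ms')::int AS latency_ms
FROM agent_traces
WHERE
  agent_name = 'research'
  AND action_type = 'tool_call'
  AND timestamp > NOW() - INTERVAL '1 hour'
ORDER BY timestamp DESC;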

Performance metrics

Track these core metrics to understand agent health.

1. Success rate

Percentage of agent executions that complete without errors.

SELECT
  agent_name,
  COUNT(*) FILTER (WHERE action_type = 'complete') AS successful,
  COUNT(*) FILTER (WHERE action_type = 'error') AS failed,
  (COUNT(*) FILTER (WHERE action_type = 'complete')::float /
   NULLIF(COUNT(*) FILTER (WHERE action_type IN ('complete', 'error')), 0) * 100) AS success_rate
FROM agent_traces
WHERE timestamp > NOW() - INTERVAL '1 hour'
GROUP BY agent_name;

Alert threshold: Success rate <95% for any agent.

2. Latency percentiles

Track p50, p95, and p99 latency to catch outliers.

SELECT
  agent_name,
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY (metadata->>'latency_ms')::int) AS p50_ms,
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY (metadata->>'latency_ms')::int) AS p95_ms,
  PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY (metadata->>'latency_ms')::int) AS p99_ms
FROM agent_traces
WHERE
  action_type = 'complete'
  AND timestamp > NOW() - INTERVAL '1 hour'
GROUP BY agent_name;

Alert thresholds:

  • p95 > 10s for orchestrator
  • p95 > 30s for research/developer agents
  • p99 > 60s for any agent

3. Token consumption and cost

Monitor daily token usage and costs to prevent budget overruns.

SELECT
  DATE_TRUNC('day', timestamp) AS day,
  agent_name,
  SUM((metadata->>'tokens_used')::int) AS total_tokens,
  SUM((metadata->>'cost_usd')::numeric) AS total_cost_usd
FROM agent_traces
WHERE timestamp > NOW() - INTERVAL '7 days'
GROUP BY day, agent_name
ORDER BY day DESC, agent_name;

Alert threshold: Daily cost >120% of 7-day moving average.
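
A hedged sketch of how that check could be computed against agent_traces, comparing each day's spend to the average of the preceding seven days (column names follow the schema above):

WITH daily AS (
  SELECT
    DATE_TRUNC('day', timestamp) AS day,
    SUM((metadata->>'cost_usd')::numeric) AS cost_usd
  FROM agent_traces
  WHERE timestamp > NOW() - INTERVAL '8 days'
  GROUP BY day
),
with_avg AS (
  SELECT
    day,
    cost_usd,
    AVG(cost_usd) OVER (ORDER BY day ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING) AS moving_avg_7d
  FROM daily
)
SELECT
  day,
  cost_usd,
  moving_avg_7d,
  cost_usd > 1.2 * moving_avg_7d AS over_budget
FROM with_avg
ORDER BY day DESC;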

4. Tool invocation frequency

Understand which tools agents use most.

SELECT
  input->>'tool' AS tool_name,
  COUNT(*) AS invocations,
  AVG((metadata->>'latency_ms')::int) AS avg_latency_ms
FROM agent_traces
WHERE
  action_type = 'tool_call'
  AND timestamp > NOW() - INTERVAL '24 hours'
GROUP BY tool_name
ORDER BY invocations DESC;

High-frequency tools are the strongest candidates for latency optimization and response caching.
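
As an illustration, a small TTL cache can be wrapped around read-only tools. This is a sketch; the names (ToolFn, withCache) are not from any particular framework.

// Minimal in-memory TTL cache for read-only, high-frequency tool calls
type ToolFn = (params: Record<string, unknown>) => Promise<unknown>;

function withCache(tool: ToolFn, ttlMs = 60_000): ToolFn {
  const cache = new Map<string, { value: unknown; expiresAt: number }>();

  return async (params) => {
    const key = JSON.stringify(params); // cache key derived from the tool parameters
    const hit = cache.get(key);
    if (hit && hit.expiresAt > Date.now()) {
      return hit.value; // serve the cached result and skip the tool call
    }
    const value = await tool(params);
    cache.set(key, { value, expiresAt: Date.now() + ttlMs });
    return value;
  };
}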

Metrics collection pipeline

Use a background job to aggregate metrics every 5 minutes.

// Run every 5 minutes
cron.schedule('*/5 * * * *', async () => {
  const results = await db.query(`
    SELECT
      agent_name,
      COUNT(*) FILTER (WHERE action_type = 'complete') AS successful,
      COUNT(*) FILTER (WHERE action_type = 'error') AS failed,
      PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY (metadata->>'latency_ms')::int) AS p95_latency
    FROM agent_traces
    WHERE timestamp > NOW() - INTERVAL '5 minutes'
    GROUP BY agent_name
  `);

  for (const row of results.rows) {
    // Send to a metrics platform client (Datadog, Prometheus, etc.)
    metricsClient.gauge('agent.success_rate', row.successful / Math.max(row.successful + row.failed, 1), {
      agent: row.agent_name,
    });

    metricsClient.histogram('agent.latency_p95', row.p95_latency, {
      agent: row.agent_name,
    });
  }
});

Error tracking

Errors fall into two broad categories: transient (timeouts, rate limits) and persistent (hallucinations, logic bugs).

Error taxonomy

| Error type | Cause | Retry strategy |
| --- | --- | --- |
| Timeout | Agent/tool exceeded time limit | Retry with exponential backoff |
| Rate limit | API quota exceeded | Retry after cooldown period |
| Tool failure | External service down | Retry 3x, then fallback |
| Hallucination | Agent generated invalid output | Re-prompt with stronger constraints |
| Logic error | Code bug in agent/tool | No retry, escalate to engineering |
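
The taxonomy above can be encoded as a small retry helper. This is a sketch; the classify function (mapping a thrown error to a category) is assumed to be application-specific and is not defined here.

// Route retries by error category; non-retryable categories escalate immediately
type ErrorType = 'timeout' | 'rate_limit' | 'tool_failure' | 'hallucination' | 'logic_error';

async function runWithRetry<T>(
  fn: () => Promise<T>,
  classify: (err: unknown) => ErrorType, // assumed application-specific classifier
  maxAttempts = 3,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const type = classify(err);
      const retryable = type === 'timeout' || type === 'rate_limit' || type === 'tool_failure';
      // Hallucinations need re-prompting and logic errors need a code fix, so neither is blindly retried
      if (!retryable || attempt >= maxAttempts) throw err;
      const delayMs = type === 'rate_limit' ? 30_000 : 1_000 * 2 ** (attempt - 1); // cooldown or exponential backoff
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}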

Sentry integration

Use Sentry to aggregate errors with stack traces and context.

import * as Sentry from '@sentry/node';

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  environment: process.env.NODE_ENV,
});

async function runAgentWithErrorTracking(sessionId: string, userMessage: string) {
  try {
    return await runAgent(sessionId, userMessage);
  } catch (error) {
    Sentry.captureException(error, {
      tags: {
        agent_type: 'orchestrator',
        session_id: sessionId,
      },
      contexts: {
        agent: {
          input: userMessage,
          // attach the trace_id here if the tracer is created in this scope and exposes it
        },
      },
    });

    throw error;
  }
}

Sentry dashboard shows:

  • Error frequency by agent type
  • Stack traces for debugging
  • User impact (how many sessions affected)
  • Release version correlation

Custom error detection

Detect hallucinations and invalid outputs with validation rules.

function validateAgentOutput(output: string, expectedFormat: string): boolean {
  // Check for common hallucination markers
  const hallucinationPatterns = [
    /\[INSERT.*?\]/i,
    /\[PLACEHOLDER\]/i,
    /TODO:/i,
    /\[Your.*?here\]/i,
  ];

  for (const pattern of hallucinationPatterns) {
    if (pattern.test(output)) {
      Sentry.captureMessage('Agent hallucination detected', {
        level: 'warning',
        tags: { validation_type: 'hallucination' },
        extra: { output, pattern: pattern.source },
      });

      return false;
    }
  }

  // Validate expected format (JSON, markdown, etc.)
  if (expectedFormat === 'json') {
    try {
      JSON.parse(output);
    } catch {
      Sentry.captureMessage('Invalid JSON output from agent', {
        level: 'warning',
        extra: { output },
      });

      return false;
    }
  }

  return true;
}

At Athenic, this validation caught 40+ hallucination instances per week that would have reached users.

Dashboards and alerts

Build real-time dashboards to visualize agent health.

Dashboard components

1. Active jobs panel

  • Currently running agents
  • Time elapsed for each
  • Progress indicators

2. Performance metrics

  • Success rate by agent (last hour, last 24h)
  • p95 latency trend (line chart)
  • Token consumption (bar chart by agent)

3. Error log

  • Recent errors with stack traces
  • Error rate trend
  • Top failing agents/tools

4. Cost tracking

  • Daily spend by agent
  • Projected monthly cost
  • Budget utilization percentage
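
As a sketch, the active jobs panel can be driven by a query for executions that have a start entry but no terminal entry yet (assuming the agent_traces schema above):

SELECT
  s.trace_id,
  s.agent_name,
  s.timestamp AS started_at,
  EXTRACT(EPOCH FROM (NOW() - s.timestamp)) AS elapsed_seconds
FROM agent_traces s
WHERE
  s.action_type = 'start'
  AND s.timestamp > NOW() - INTERVAL '1 hour'
  AND NOT EXISTS (
    SELECT 1 FROM agent_traces f
    WHERE f.trace_id = s.trace_id
      AND f.action_type IN ('complete', 'error')
  )
ORDER BY s.timestamp;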

Implementation with Supabase Realtime

Stream live agent status to dashboard using Supabase Realtime.

// Backend: publish agent status updates
import { createClient } from '@supabase/supabase-js';

const supabase = createClient(process.env.SUPABASE_URL, process.env.SUPABASE_SERVICE_KEY);

async function publishAgentStatus(agentName: string, status: 'running' | 'completed' | 'error') {
  await supabase.from('agent_status').upsert({
    agent_name: agentName,
    status,
    updated_at: new Date(),
  });

  // Realtime broadcast
  await supabase.channel('agent-updates').send({
    type: 'broadcast',
    event: 'status_change',
    payload: { agent: agentName, status },
  });
}

// Frontend: Subscribe to updates
const channel = supabase.channel('agent-updates');

channel.on('broadcast', { event: 'status_change' }, (payload) => {
  console.log('Agent status changed:', payload);
  updateDashboard(payload.agent, payload.status);
}).subscribe();

Alerting rules

Define alerts for critical conditions.

interface AlertRule {
  name: string;
  condition: string;
  threshold: number;
  check_interval_minutes: number;
  notification_channels: string[];
}

const alertRules: AlertRule[] = [
  {
    name: 'High error rate',
    condition: 'error_rate > threshold',
    threshold: 0.05, // 5%
    check_interval_minutes: 5,
    notification_channels: ['slack', 'pagerduty'],
  },
  {
    name: 'Elevated latency',
    condition: 'p95_latency_ms > threshold',
    threshold: 10000, // 10s
    check_interval_minutes: 5,
    notification_channels: ['slack'],
  },
  {
    name: 'Cost overrun',
    condition: 'daily_cost_usd > threshold',
    threshold: 100,
    check_interval_minutes: 60,
    notification_channels: ['slack', 'email'],
  },
];

// Alert checking loop (simplified: every rule is evaluated each run rather than per its check_interval_minutes)
cron.schedule('*/5 * * * *', async () => {
  for (const rule of alertRules) {
    const value = await evaluateMetric(rule.condition);

    if (value > rule.threshold) {
      await sendAlert(rule.name, value, rule.notification_channels);
    }
  }
});

Real-world case study: Athenic monitoring stack

Our production monitoring setup tracks 2,000+ agent executions daily with <5 minute incident detection time.

Architecture:

  • Trace storage: Supabase (PostgreSQL) with 30-day retention
  • Error tracking: Sentry for exception aggregation
  • Metrics: Custom dashboard built with Next.js + Recharts
  • Alerting: Slack webhooks for warnings, PagerDuty for critical

Key metrics tracked:

| Metric | Current value | Alert threshold |
| --- | --- | --- |
| Overall success rate | 94.2% | <90% |
| p95 latency (orchestrator) | 3.8s | >8s |
| p95 latency (research) | 18.2s | >30s |
| Daily token cost | $42.15 | >$100 |
| Error rate (last hour) | 1.2% | >5% |

Incident example:

Last month, our GitHub MCP integration error rate spiked from 1% to 12% over 10 minutes. Our monitoring:

  1. Detected anomaly within 5 minutes (first alert check)
  2. Sent Slack alert to on-call engineer
  3. Dashboard showed pattern: all errors on create_issue tool
  4. Engineer identified GitHub API outage via status page
  5. Implemented fallback (queue issues, retry when service recovered)
  6. No user impact due to fast detection and mitigation

Without monitoring, we'd have discovered this hours later via user reports.

Clone our agent monitoring starter kit with pre-built trace logging, Sentry integration, and dashboard templates.

FAQs

How long should I retain traces?

30 days for detailed traces, 90 days for aggregated metrics. Longer retention increases storage costs without much debugging value.
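
A minimal cleanup job, assuming the non-partitioned agent_traces table above (with partitioning, drop whole partitions instead):

-- Run daily: delete detailed traces past the 30-day retention window
DELETE FROM agent_traces
WHERE timestamp < NOW() - INTERVAL '30 days';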

Should I log user inputs and agent outputs?

Yes for debugging, but scrub PII (emails, phone numbers, credentials) before storage. Use regex patterns or NER models to detect sensitive data.
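
A rough illustration of regex-based scrubbing applied before storage; the patterns are examples only and will not catch every PII format.

// Illustrative regex scrubber; replacement tokens and patterns are examples, not exhaustive
function scrubPII(text: string): string {
  return text
    .replace(/[\w.+-]+@[\w-]+\.[\w.]+/g, '[EMAIL]')              // email addresses
    .replace(/\+?\d[\d\s().-]{7,}\d/g, '[PHONE]')                // phone-like digit runs
    .replace(/\b(?:sk|pk|key)[-_][A-Za-z0-9]{16,}\b/gi, '[CREDENTIAL]'); // token-like strings with key prefixes
}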

How do I handle high-volume trace logging?

Batch inserts every 5-10 seconds instead of individual writes. Use background workers to offload database writes from agent execution path.
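
A sketch of a buffered tracer, assuming a bulk insertMany helper on the database client (not shown earlier):

// Accumulate trace rows in memory and flush them in batches on a timer
class BufferedTracer {
  private buffer: AgentTrace[] = [];

  constructor(private db: Database, flushIntervalMs = 5_000) {
    setInterval(() => void this.flush(), flushIntervalMs); // periodic background flush
  }

  log(trace: AgentTrace) {
    this.buffer.push(trace); // no await: agent execution is not blocked on the database
  }

  private async flush() {
    if (this.buffer.length === 0) return;
    const batch = this.buffer.splice(0, this.buffer.length); // take everything currently buffered
    await this.db.agentTraces.insertMany(batch);              // single bulk write (insertMany assumed)
  }
}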

Can I use open-source alternatives to Sentry?

Yes. Consider GlitchTip (an open-source Sentry alternative) or custom logging to Elasticsearch/Loki for budget-conscious setups.

How do I monitor multi-agent workflows?

Use trace_id to group all agents involved in one workflow. Dashboard should show workflow-level success rates and latency, not just individual agents.
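
For example, a workflow-level rollup can group on trace_id (assuming the schema above; a workflow counts as successful here when it produced no error entries):

SELECT
  trace_id,
  BOOL_AND(action_type <> 'error') AS succeeded,
  EXTRACT(EPOCH FROM (MAX(timestamp) - MIN(timestamp))) * 1000 AS workflow_latency_ms,
  COUNT(DISTINCT agent_name) AS agents_involved
FROM agent_traces
WHERE timestamp > NOW() - INTERVAL '24 hours'
GROUP BY trace_id;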

Summary and next steps

Production agent monitoring requires structured trace logging, performance metrics (success rate, latency, cost), error tracking with categorization, and real-time dashboards with alerting. These tools detect issues in minutes rather than hours.

Next steps:

  1. Implement structured trace logging for all agent executions.
  2. Track success rate, p95 latency, and daily cost as core metrics.
  3. Integrate Sentry or similar error tracking platform.
  4. Build a simple dashboard showing active jobs and recent errors.
  5. Set up Slack/email alerts for critical conditions (high error rate, cost overruns).
