TL;DR
- Traditional logging isn't enough - AI agents need trace-level observability showing each decision, tool call, and LLM interaction.
- Log prompts, responses, latencies, and token counts for every LLM call. You'll need these when debugging.
- Build debugging workflows: reproduce failures locally, trace full execution paths, identify prompt regressions.
- Track costs per agent, per user, per feature. LLM costs can spiral without visibility.
Jump to Why observability is different for AI · Jump to Trace logging architecture · Jump to Metrics and dashboards · Jump to Debugging workflows
AI Agent Logging and Observability: Debugging Production Failures
An agent that worked perfectly yesterday starts producing wrong answers today. Users report issues, but you can't reproduce them. Was it a prompt change? A model update? Different input data? Without proper observability, you're guessing.
AI agent observability goes beyond traditional application monitoring. You need to understand not just that a request failed, but why the model made the decisions it made, what context it had, and how slight variations in input led to different outputs.
This guide covers building observability systems for AI agents, from basic logging to sophisticated debugging workflows. We'll implement the patterns we use at Athenic to debug issues across thousands of daily agent executions.
Key takeaways
- AI failures are often non-deterministic. Log everything so you can replay exact conditions.
- Trace IDs tie together multi-step agent executions. Without them, debugging is impossible.
- Metrics need both operational views (latency, errors) and quality views (task success, user satisfaction).
- Cost tracking is observability too. Know where your money goes before it becomes a problem.
Why observability is different for AI
Traditional software is deterministic. Same input, same output. When something breaks, you check the error message, trace the stack, find the bug.
AI agents are probabilistic. Same input might produce different outputs. The "bug" might be a subtle prompt issue, a context length problem, a tool failure, or just model randomness. You need observability that captures:
- What the model saw: Full prompts, including system messages, context, history
- What the model did: Each decision, tool call, reasoning step
- What the model produced: Complete outputs, not just final answers
- What context influenced it: Retrieved documents, conversation history, user metadata
The debugging challenge
Consider debugging this user complaint: "The agent gave me wrong information about our pricing."
Without observability, you're stuck:
- What did the user actually ask?
- What context did the agent retrieve?
- Was the pricing info in the knowledge base correct?
- Did the agent hallucinate or misread the data?
- Has this happened before with similar queries?
With proper observability, you can reconstruct the exact execution, see every prompt and response, and identify whether the issue was retrieval, reasoning, or data.
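Using the trace schema introduced in the next section (and the same hypothetical db client used throughout this guide), that reconstruction becomes a short script rather than guesswork - a minimal sketch:

// Sketch: reconstruct an execution from its stored trace. Assumes the AgentTrace/TraceEvent
// schema defined below, and that events are persisted alongside the trace (otherwise query
// db.traceEvents by traceId).
async function inspectComplaint(traceId: string): Promise<void> {
  const trace = await db.traces.findById(traceId);
  if (!trace) throw new Error(`Trace ${traceId} not found`);

  // What did the user actually ask?
  console.log('User message:', trace.input.userMessage);

  // What context did the agent retrieve, and was it correct?
  console.log('Context documents:', trace.input.contextDocuments);
  trace.events
    .filter(e => e.eventType === 'retrieval')
    .forEach(e => console.log('Retrieval:', e.input, '->', e.output));

  // Did the agent hallucinate or misread the data? Compare what the model saw with what it said.
  trace.events
    .filter(e => e.eventType === 'llm_call')
    .forEach(e => console.log('LLM response:', e.output?.response));
  console.log('Final answer:', trace.output?.response);
}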
Trace logging architecture
Build logging around traces - units of work that span multiple steps.
Trace schema
interface AgentTrace {
// Identification
traceId: string; // Unique ID for this execution
parentTraceId?: string; // For sub-agent calls
sessionId: string; // Conversation session
userId: string;
orgId: string;
// Timing
startedAt: Date;
completedAt?: Date;
durationMs?: number;
// Execution context
agentName: string;
agentVersion: string;
modelId: string;
temperature: number;
// Input/Output
input: {
userMessage: string;
systemPrompt: string;
contextDocuments: string[];
conversationHistory: Message[];
};
output?: {
response: string;
toolCalls: ToolCall[];
finishReason: string;
};
// Metrics
tokens: {
input: number;
output: number;
total: number;
};
cost: {
inputCost: number;
outputCost: number;
totalCost: number;
};
// Status
status: 'running' | 'completed' | 'failed' | 'timeout';
error?: {
code: string;
message: string;
stack?: string;
};
// Nested events
events: TraceEvent[];
}
interface TraceEvent {
eventId: string;
traceId: string;
timestamp: Date;
eventType: 'llm_call' | 'tool_call' | 'retrieval' | 'handoff' | 'error' | 'custom';
name: string;
input: any;
output?: any;
durationMs: number;
metadata: Record<string, any>;
}
Logging implementation
class TraceLogger {
  private db: Database;
  private currentTrace: AgentTrace | null = null;

  constructor(db: Database) {
    this.db = db;
  }
startTrace(input: {
sessionId: string;
userId: string;
orgId: string;
agentName: string;
userMessage: string;
systemPrompt: string;
contextDocuments: string[];
conversationHistory: Message[];
modelId: string;
}): string {
const traceId = generateId();
this.currentTrace = {
traceId,
sessionId: input.sessionId,
userId: input.userId,
orgId: input.orgId,
agentName: input.agentName,
agentVersion: process.env.AGENT_VERSION || 'unknown',
modelId: input.modelId,
temperature: 0.7,
startedAt: new Date(),
input: {
userMessage: input.userMessage,
systemPrompt: input.systemPrompt,
contextDocuments: input.contextDocuments,
conversationHistory: input.conversationHistory
},
tokens: { input: 0, output: 0, total: 0 },
cost: { inputCost: 0, outputCost: 0, totalCost: 0 },
status: 'running',
events: []
};
// Write initial trace record
this.db.traces.insert(this.currentTrace);
return traceId;
}
logEvent(event: Omit<TraceEvent, 'eventId' | 'traceId' | 'timestamp'>): void {
if (!this.currentTrace) {
console.warn('No active trace for event');
return;
}
const fullEvent: TraceEvent = {
...event,
eventId: generateId(),
traceId: this.currentTrace.traceId,
timestamp: new Date()
};
this.currentTrace.events.push(fullEvent);
// Write event to database
this.db.traceEvents.insert(fullEvent);
}
logLLMCall(input: {
prompt: string;
response: string;
model: string;
tokens: { input: number; output: number };
durationMs: number;
}): void {
this.logEvent({
eventType: 'llm_call',
name: 'chat_completion',
input: { prompt: input.prompt, model: input.model },
output: { response: input.response },
durationMs: input.durationMs,
metadata: {
inputTokens: input.tokens.input,
outputTokens: input.tokens.output
}
});
// Update trace totals
if (this.currentTrace) {
this.currentTrace.tokens.input += input.tokens.input;
this.currentTrace.tokens.output += input.tokens.output;
this.currentTrace.tokens.total += input.tokens.input + input.tokens.output;
// Calculate costs
const costs = calculateCosts(input.model, input.tokens);
this.currentTrace.cost.inputCost += costs.input;
this.currentTrace.cost.outputCost += costs.output;
this.currentTrace.cost.totalCost += costs.total;
}
}
logToolCall(input: {
toolName: string;
arguments: any;
result: any;
success: boolean;
durationMs: number;
}): void {
this.logEvent({
eventType: 'tool_call',
name: input.toolName,
input: input.arguments,
output: { result: input.result, success: input.success },
durationMs: input.durationMs,
metadata: {}
});
}
completeTrace(output: {
response: string;
toolCalls: ToolCall[];
finishReason: string;
}): void {
if (!this.currentTrace) return;
this.currentTrace.completedAt = new Date();
this.currentTrace.durationMs =
this.currentTrace.completedAt.getTime() - this.currentTrace.startedAt.getTime();
this.currentTrace.status = 'completed';
this.currentTrace.output = output;
// Update database
this.db.traces.update(this.currentTrace.traceId, this.currentTrace);
this.currentTrace = null;
}
failTrace(error: Error): void {
if (!this.currentTrace) return;
this.currentTrace.completedAt = new Date();
this.currentTrace.durationMs =
this.currentTrace.completedAt.getTime() - this.currentTrace.startedAt.getTime();
this.currentTrace.status = 'failed';
this.currentTrace.error = {
code: error.name,
message: error.message,
stack: error.stack
};
this.db.traces.update(this.currentTrace.traceId, this.currentTrace);
this.currentTrace = null;
}
}
Integration with agent execution
async function executeAgent(
userMessage: string,
context: ExecutionContext
): Promise<AgentResponse> {
const logger = new TraceLogger(db);
const traceId = logger.startTrace({
sessionId: context.sessionId,
userId: context.userId,
orgId: context.orgId,
agentName: 'research',
userMessage,
systemPrompt: RESEARCH_AGENT_PROMPT,
contextDocuments: context.retrievedDocs,
conversationHistory: context.history,
modelId: 'gpt-4o'
});
try {
// Log retrieval
const startRetrieval = Date.now();
const docs = await retrieveDocuments(userMessage);
logger.logEvent({
eventType: 'retrieval',
name: 'vector_search',
input: { query: userMessage },
output: { documentCount: docs.length },
durationMs: Date.now() - startRetrieval,
metadata: {}
});
// Log LLM call
const startLLM = Date.now();
const messages = buildMessages(userMessage, docs, context.history);
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages
});
logger.logLLMCall({
prompt: JSON.stringify(messages),
response: response.choices[0].message.content ?? '',
model: 'gpt-4o',
tokens: {
input: response.usage.prompt_tokens,
output: response.usage.completion_tokens
},
durationMs: Date.now() - startLLM
});
// Handle tool calls if any
if (response.choices[0].message.tool_calls) {
for (const toolCall of response.choices[0].message.tool_calls) {
const startTool = Date.now();
const result = await executeTool(
toolCall.function.name,
JSON.parse(toolCall.function.arguments)
);
logger.logToolCall({
toolName: toolCall.function.name,
arguments: JSON.parse(toolCall.function.arguments),
result,
success: true,
durationMs: Date.now() - startTool
});
}
}
logger.completeTrace({
  response: response.choices[0].message.content ?? '',
  toolCalls: response.choices[0].message.tool_calls || [],
  finishReason: response.choices[0].finish_reason
});
return { content: response.choices[0].message.content ?? '', traceId };
} catch (error) {
logger.failTrace(error instanceof Error ? error : new Error(String(error)));
throw error;
}
}
Metrics and dashboards
Transform trace data into actionable metrics.
Core metrics to track
| Metric | Type | Purpose |
|---|---|---|
| Request count | Counter | Volume tracking |
| Latency p50/p95/p99 | Histogram | Performance monitoring |
| Error rate | Ratio | Reliability |
| Token usage | Counter | Cost projection |
| Cost per request | Histogram | Budget management |
| Tool call success rate | Ratio | Integration health |
| Task completion rate | Ratio | Agent quality |
Implementation
class AgentMetrics {
  private metrics: MetricsClient;

  constructor(metrics: MetricsClient) {
    this.metrics = metrics;
  }
recordRequest(trace: AgentTrace): void {
// Request count
this.metrics.increment('agent.requests', {
agent: trace.agentName,
status: trace.status
});
// Latency
if (trace.durationMs) {
this.metrics.histogram('agent.latency_ms', trace.durationMs, {
agent: trace.agentName
});
}
// Token usage
this.metrics.increment('agent.tokens.input', trace.tokens.input, {
agent: trace.agentName,
model: trace.modelId
});
this.metrics.increment('agent.tokens.output', trace.tokens.output, {
agent: trace.agentName,
model: trace.modelId
});
// Cost
this.metrics.histogram('agent.cost_usd', trace.cost.totalCost, {
agent: trace.agentName
});
// Error tracking
if (trace.status === 'failed') {
this.metrics.increment('agent.errors', {
agent: trace.agentName,
error_code: trace.error?.code || 'unknown'
});
}
}
recordToolCall(event: TraceEvent): void {
this.metrics.increment('agent.tool_calls', {
tool: event.name,
success: String(event.output?.success ?? true)
});
this.metrics.histogram('agent.tool_latency_ms', event.durationMs, {
tool: event.name
});
}
}
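One way to feed these metrics is to record each trace once it finishes. A minimal wiring sketch on top of the executeAgent function from earlier - the metricsClient instance is assumed to be whatever MetricsClient implementation you already run (StatsD, Datadog, etc.), and recording after execution rather than inside completeTrace/failTrace is just one option:

// Sketch: emit operational metrics after each execution. Failed executions throw before
// reaching this point, so record error metrics inside failTrace if you need them as well.
const agentMetrics = new AgentMetrics(metricsClient);

async function executeAgentWithMetrics(
  userMessage: string,
  context: ExecutionContext
): Promise<AgentResponse> {
  const result = await executeAgent(userMessage, context);
  const trace = await db.traces.findById(result.traceId);
  if (trace) {
    agentMetrics.recordRequest(trace);
    trace.events
      .filter(e => e.eventType === 'tool_call')
      .forEach(e => agentMetrics.recordToolCall(e));
  }
  return result;
}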
Dashboard queries
-- Request volume and latency by agent (last 24h)
SELECT
agent_name,
COUNT(*) as request_count,
AVG(duration_ms) as avg_latency,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY duration_ms) as p95_latency,
SUM(CASE WHEN status = 'failed' THEN 1 ELSE 0 END)::float / COUNT(*) as error_rate
FROM agent_traces
WHERE started_at > NOW() - INTERVAL '24 hours'
GROUP BY agent_name
ORDER BY request_count DESC;
-- Cost breakdown by agent and model (last 7 days)
SELECT
agent_name,
model_id,
SUM(total_cost) as total_cost,
SUM(input_tokens) as total_input_tokens,
SUM(output_tokens) as total_output_tokens,
COUNT(*) as request_count,
AVG(total_cost) as avg_cost_per_request
FROM agent_traces
WHERE started_at > NOW() - INTERVAL '7 days'
GROUP BY agent_name, model_id
ORDER BY total_cost DESC;
-- Tool reliability (last 24h)
SELECT
name as tool_name,
COUNT(*) as call_count,
SUM(CASE WHEN (output->>'success')::boolean THEN 1 ELSE 0 END)::float / COUNT(*) as success_rate,
AVG(duration_ms) as avg_latency
FROM trace_events
WHERE event_type = 'tool_call'
AND timestamp > NOW() - INTERVAL '24 hours'
GROUP BY name
ORDER BY call_count DESC;
Debugging workflows
Metrics tell you something is wrong. Debugging workflows help you fix it.
Workflow 1: Reproduce a failure
When a user reports an issue, reproduce the exact conditions.
async function reproduceTrace(traceId: string): Promise<void> {
// Fetch the original trace
const trace = await db.traces.findById(traceId);
if (!trace) {
throw new Error(`Trace ${traceId} not found`);
}
console.log('=== Original Execution ===');
console.log('User message:', trace.input.userMessage);
console.log('System prompt length:', trace.input.systemPrompt.length);
console.log('Context documents:', trace.input.contextDocuments.length);
console.log('History length:', trace.input.conversationHistory.length);
console.log('Model:', trace.modelId);
console.log('Output:', trace.output?.response);
console.log('Status:', trace.status);
// Replay with same inputs
console.log('\n=== Replaying ===');
const replayResult = await executeAgentWithInputs({
userMessage: trace.input.userMessage,
systemPrompt: trace.input.systemPrompt,
contextDocuments: trace.input.contextDocuments,
conversationHistory: trace.input.conversationHistory,
modelId: trace.modelId
});
console.log('Replay output:', replayResult.content);
// With non-zero temperature, exact matches are unlikely; treat a mismatch as a starting point, not proof of a regression.
console.log('Match:', replayResult.content === trace.output?.response ? 'YES' : 'NO');
}
Workflow 2: Compare prompt versions
When you suspect a prompt change caused issues:
async function comparePromptVersions(
userMessage: string,
promptVersions: { version: string; prompt: string }[]
): Promise<void> {
console.log('=== Prompt Comparison ===');
console.log('User message:', userMessage);
for (const { version, prompt } of promptVersions) {
console.log(`\n--- Version: ${version} ---`);
const result = await executeAgentWithInputs({
userMessage,
systemPrompt: prompt,
modelId: 'gpt-4o'
});
console.log('Response:', result.content);
console.log('Tokens:', result.tokens);
console.log('Cost:', result.cost);
}
}
Workflow 3: Trace analysis for patterns
Find patterns in failures:
async function analyseFailures(
agentName: string,
hours: number = 24
): Promise<void> {
const failures = await db.traces.find({
agentName,
status: 'failed',
startedAt: { $gt: new Date(Date.now() - hours * 3600000) }
});
console.log(`Found ${failures.length} failures in last ${hours} hours`);
// Group by error code
const byError = groupBy(failures, t => t.error?.code || 'unknown');
console.log('\nFailures by error code:');
Object.entries(byError).forEach(([code, traces]) => {
console.log(` ${code}: ${traces.length}`);
});
// Common input patterns
const inputLengths = failures.map(t => t.input.userMessage.length);
console.log('\nInput message length stats:');
console.log(` Min: ${Math.min(...inputLengths)}`);
console.log(` Max: ${Math.max(...inputLengths)}`);
console.log(` Avg: ${average(inputLengths).toFixed(0)}`);
// Look for common keywords in failed inputs
const allWords = failures
.flatMap(t => t.input.userMessage.toLowerCase().split(/\s+/))
.filter(w => w.length > 4);
const wordCounts = countBy(allWords);
const topWords = Object.entries(wordCounts)
.sort((a, b) => b[1] - a[1])
.slice(0, 10);
console.log('\nMost common words in failed inputs:');
topWords.forEach(([word, count]) => {
console.log(` ${word}: ${count}`);
});
}
Workflow 4: Cost anomaly detection
Alert when costs spike unexpectedly:
async function detectCostAnomalies(): Promise<void> {
// Get hourly costs for last 48 hours
const hourlyCosts = await db.query(`
SELECT
DATE_TRUNC('hour', started_at) as hour,
SUM(total_cost) as cost,
COUNT(*) as requests
FROM agent_traces
WHERE started_at > NOW() - INTERVAL '48 hours'
GROUP BY 1
ORDER BY 1
`);
// Calculate baseline (first 24 hours)
const baseline = hourlyCosts.slice(0, 24);
const avgCost = average(baseline.map(h => h.cost));
const stdDev = standardDeviation(baseline.map(h => h.cost));
// Check recent hours against baseline
const recent = hourlyCosts.slice(24);
for (const hour of recent) {
const zScore = (hour.cost - avgCost) / stdDev;
if (zScore > 3) { // More than 3 standard deviations
console.log(`ALERT: Cost anomaly detected`);
console.log(` Hour: ${hour.hour}`);
console.log(` Cost: $${hour.cost.toFixed(2)}`);
console.log(` Baseline: $${avgCost.toFixed(2)} ± $${stdDev.toFixed(2)}`);
console.log(` Z-score: ${zScore.toFixed(2)}`);
console.log(` Requests: ${hour.requests}`);
// Find what's driving the cost
const topAgents = await db.query(`
SELECT agent_name, SUM(total_cost) as cost
FROM agent_traces
WHERE DATE_TRUNC('hour', started_at) = $1
GROUP BY agent_name
ORDER BY cost DESC
LIMIT 5
`, [hour.hour]);
console.log(' Top cost drivers:');
topAgents.forEach(a => {
console.log(` ${a.agent_name}: $${a.cost.toFixed(2)}`);
});
}
}
}
FAQs
How much logging is too much?
Log everything in development. In production, sample non-error traces if volume is high. Always log full traces for errors.
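A minimal sampling sketch - the rate and keying it off trace status are assumptions to adapt to your volume:

// Sketch: always persist failures, sample everything else. TRACE_SAMPLE_RATE is illustrative.
const TRACE_SAMPLE_RATE = Number(process.env.TRACE_SAMPLE_RATE ?? '0.1'); // keep ~10% of healthy traces

function shouldPersistTrace(trace: AgentTrace): boolean {
  if (trace.status === 'failed' || trace.status === 'timeout') return true;
  return Math.random() < TRACE_SAMPLE_RATE;
}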
Should I log full prompts? What about PII?
Yes, log full prompts - you need them for debugging. But implement PII scrubbing and access controls. Only authorised personnel should access trace data.
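A rough scrubbing sketch, applied before traces are written - the regex patterns are illustrative and nowhere near exhaustive, so a dedicated PII detection library is the safer choice in production:

// Sketch: redact common PII patterns from trace text before persisting.
const PII_PATTERNS: [RegExp, string][] = [
  [/[\w.+-]+@[\w-]+\.[\w.]+/g, '[EMAIL]'],  // email addresses
  [/\+?\d[\d\s().-]{8,}\d/g, '[PHONE]'],    // phone-number-like sequences
  [/\b(?:\d[ -]*?){13,16}\b/g, '[CARD]']    // card-number-like sequences
];

function scrubText(text: string): string {
  return PII_PATTERNS.reduce((acc, [pattern, token]) => acc.replace(pattern, token), text);
}

function scrubTrace(trace: AgentTrace): AgentTrace {
  return {
    ...trace,
    input: {
      ...trace.input,
      userMessage: scrubText(trace.input.userMessage),
      contextDocuments: trace.input.contextDocuments.map(scrubText)
    },
    output: trace.output
      ? { ...trace.output, response: scrubText(trace.output.response) }
      : undefined
  };
}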
How long should I retain traces?
Keep full traces for 30-90 days. Keep aggregated metrics longer. For compliance-sensitive applications, follow your retention policies.
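A retention sweep can be as simple as a scheduled job against the same tables used in the dashboard queries - a sketch, with the 90-day default purely illustrative:

// Sketch: prune traces and events older than the retention window. Run on a daily schedule.
async function pruneOldTraces(retentionDays = 90): Promise<void> {
  await db.query(
    `DELETE FROM trace_events WHERE timestamp < NOW() - make_interval(days => $1)`,
    [retentionDays]
  );
  await db.query(
    `DELETE FROM agent_traces WHERE started_at < NOW() - make_interval(days => $1)`,
    [retentionDays]
  );
}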
What's the performance overhead of logging?
Async logging has minimal overhead (<5ms per request). Write to fast storage (memory queue → background worker → database), not synchronously.
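A simplified version of that pipeline - the in-memory queue and flush interval are assumptions, and a production version would add backpressure and retry handling:

// Sketch: buffer trace writes in memory and flush them in the background so logging
// never blocks the request path. Interval and batch handling are deliberately naive.
class AsyncTraceWriter {
  private queue: AgentTrace[] = [];

  constructor(private db: Database, flushIntervalMs = 250) {
    setInterval(
      () => this.flush().catch(err => console.error('Trace flush failed', err)),
      flushIntervalMs
    );
  }

  enqueue(trace: AgentTrace): void {
    this.queue.push(trace); // no I/O on the request path
  }

  private async flush(): Promise<void> {
    if (this.queue.length === 0) return;
    const batch = this.queue.splice(0, this.queue.length);
    for (const trace of batch) {
      await this.db.traces.insert(trace); // same Database interface used by TraceLogger above
    }
  }
}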
Should I use commercial observability tools?
Tools like LangSmith, Helicone, and Langfuse are excellent and save development time. Build custom if you have specific requirements or compliance constraints.
Summary and next steps
AI agent observability requires thinking beyond traditional logging. Trace-level visibility, full prompt capture, and debugging workflows let you understand and fix issues in probabilistic systems.
Implementation checklist:
- Design trace schema capturing full execution context
- Instrument agent code with trace logging
- Build metrics dashboards for operational health
- Create debugging workflows for common scenarios
- Add cost tracking and anomaly detection
- Implement access controls for trace data (see the sketch below)
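For the access-control item, a simple guard over who can read a trace goes a long way - a sketch, with the role names and viewer shape as assumptions:

// Sketch: gate trace access by organisation and role before returning trace data.
type TraceViewer = { userId: string; orgId: string; role: 'admin' | 'support' | 'member' };

function canViewTrace(viewer: TraceViewer, trace: AgentTrace): boolean {
  if (viewer.orgId !== trace.orgId) return false;                        // never cross org boundaries
  if (viewer.role === 'admin' || viewer.role === 'support') return true; // trusted roles see full traces
  return viewer.userId === trace.userId;                                 // members only see their own executions
}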
Quick wins:
- Start logging full prompts and responses, even before the rest of the infrastructure exists (see the sketch after this list)
- Add trace IDs to all agent executions
- Track token usage and costs from day one
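For that first quick win, even appending prompt/response pairs to a local JSONL file is enough to start - a throwaway sketch, with the file path and fields arbitrary:

// Sketch: the simplest possible starting point - append every LLM call to a JSONL file.
// Swap in real storage once the trace schema above is in place.
import { appendFileSync } from 'fs';
import { randomUUID } from 'crypto';

function logLLMCallToFile(entry: {
  prompt: string;
  response: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  durationMs: number;
}): string {
  const traceId = randomUUID();
  appendFileSync(
    'llm-calls.jsonl',
    JSON.stringify({ traceId, timestamp: new Date().toISOString(), ...entry }) + '\n'
  );
  return traceId;
}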