TL;DR
- Traditional logging isn't enough - AI agents need trace-level observability showing each decision, tool call, and LLM interaction.
- Log prompts, responses, latencies, and token counts for every LLM call. You'll need these when debugging.
- Build debugging workflows: reproduce failures locally, trace full execution paths, identify prompt regressions.
- Track costs per agent, per user, per feature. LLM costs can spiral without visibility.
Jump to Why observability is different for AI · Jump to Trace logging architecture · Jump to Metrics and dashboards · Jump to Debugging workflows
AI Agent Logging and Observability: Debugging Production Failures
An agent that worked perfectly yesterday starts producing wrong answers today. Users report issues, but you can't reproduce them. Was it a prompt change? A model update? Different input data? Without proper observability, you're guessing.
AI agent observability goes beyond traditional application monitoring. You need to understand not just that a request failed, but why the model made the decisions it made, what context it had, and how slight variations in input led to different outputs.
This guide covers building observability systems for AI agents, from basic logging to sophisticated debugging workflows. We'll implement the patterns we use at Athenic to debug issues across thousands of daily agent executions.
Key takeaways
- AI failures are often non-deterministic. Log everything so you can replay exact conditions.
- Trace IDs tie together multi-step agent executions. Without them, debugging is impossible.
- Metrics need both operational views (latency, errors) and quality views (task success, user satisfaction).
- Cost tracking is observability too. Know where your money goes before it becomes a problem.
Why observability is different for AI
Traditional software is deterministic. Same input, same output. When something breaks, you check the error message, trace the stack, find the bug.
AI agents are probabilistic. Same input might produce different outputs. The "bug" might be a subtle prompt issue, a context length problem, a tool failure, or just model randomness. You need observability that captures:
- What the model saw: Full prompts, including system messages, context, history
- What the model did: Each decision, tool call, reasoning step
- What the model produced: Complete outputs, not just final answers
- What context influenced it: Retrieved documents, conversation history, user metadata
The debugging challenge
Consider debugging this user complaint: "The agent gave me wrong information about our pricing."
Without observability, you're stuck:
- What did the user actually ask?
- What context did the agent retrieve?
- Was the pricing info in the knowledge base correct?
- Did the agent hallucinate or misread the data?
- Has this happened before with similar queries?
With proper observability, you can reconstruct the exact execution, see every prompt and response, and identify whether the issue was retrieval, reasoning, or data.
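Using the trace schema introduced in the next section (and the same hypothetical db client used throughout this guide), that reconstruction becomes a short script rather than guesswork - a minimal sketch:

// Sketch: reconstruct an execution from its stored trace. Assumes the AgentTrace/TraceEvent
// schema defined below, and that events are persisted alongside the trace (otherwise query
// db.traceEvents by traceId).
async function inspectComplaint(traceId: string): Promise<void> {
  const trace = await db.traces.findById(traceId);
  if (!trace) throw new Error(`Trace ${traceId} not found`);

  // What did the user actually ask?
  console.log('User message:', trace.input.userMessage);

  // What context did the agent retrieve, and was it correct?
  console.log('Context documents:', trace.input.contextDocuments);
  trace.events
    .filter(e => e.eventType === 'retrieval')
    .forEach(e => console.log('Retrieval:', e.input, '->', e.output));

  // Did the agent hallucinate or misread the data? Compare what the model saw with what it said.
  trace.events
    .filter(e => e.eventType === 'llm_call')
    .forEach(e => console.log('LLM response:', e.output?.response));
  console.log('Final answer:', trace.output?.response);
}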
Trace logging architecture
Build logging around traces - units of work that span multiple steps.
Trace schema
interface AgentTrace {
// Identification
traceId: string; // Unique ID for this execution
parentTraceId?: string; // For sub-agent calls
sessionId: string; // Conversation session
userId: string;
orgId: string;
// Timing
startedAt: Date;
completedAt?: Date;
durationMs?: number;
// Execution context
agentName: string;
agentVersion: string;
modelId: string;
temperature: number;
// Input/Output
input: {
userMessage: string;
systemPrompt: string;
contextDocuments: string[];
conversationHistory: Message[];
};
output?: {
response: string;
toolCalls: ToolCall[];
finishReason: string;
};
// Metrics
tokens: {
input: number;
output: number;
total: number;
};
cost: {
inputCost: number;
outputCost: number;
totalCost: number;
};
// Status
status: 'running' | 'completed' | 'failed' | 'timeout';
error?: {
code: string;
message: string;
stack?: string;
};
// Nested events
events: TraceEvent[];
}
interface TraceEvent {
eventId: string;
traceId: string;
timestamp: Date;
eventType: 'llm_call' | 'tool_call' | 'retrieval' | 'handoff' | 'error' | 'custom';
name: string;
input: any;
output?: any;
durationMs: number;
metadata: Record<string, any>;
}
Logging implementation
class TraceLogger {
  private db: Database;
  private currentTrace: AgentTrace | null = null;

  constructor(db: Database) {
    this.db = db;
  }
startTrace(input: {
sessionId: string;
userId: string;
orgId: string;
agentName: string;
userMessage: string;
systemPrompt: string;
contextDocuments: string[];
conversationHistory: Message[];
modelId: string;
}): string {
const traceId = generateId();
this.currentTrace = {
traceId,
sessionId: input.sessionId,
userId: input.userId,
orgId: input.orgId,
agentName: input.agentName,
agentVersion: process.env.AGENT_VERSION || 'unknown',
modelId: input.modelId,
temperature: 0.7,
startedAt: new Date(),
input: {
userMessage: input.userMessage,
systemPrompt: input.systemPrompt,
contextDocuments: input.contextDocuments,
conversationHistory: input.conversationHistory
},
tokens: { input: 0, output: 0, total: 0 },
cost: { inputCost: 0, outputCost: 0, totalCost: 0 },
status: 'running',
events: []
};
// Write initial trace record
this.db.traces.insert(this.currentTrace);
return traceId;
}
logEvent(event: Omit<TraceEvent, 'eventId' | 'traceId' | 'timestamp'>): void {
if (!this.currentTrace) {
console.warn('No active trace for event');
return;
}
const fullEvent: TraceEvent = {
...event,
eventId: generateId(),
traceId: this.currentTrace.traceId,
timestamp: new Date()
};
this.currentTrace.events.push(fullEvent);
// Write event to database
this.db.traceEvents.insert(fullEvent);
}
logLLMCall(input: {
prompt: string;
response: string;
model: string;
tokens: { input: number; output: number };
durationMs: number;
}): void {
this.logEvent({
eventType: 'llm_call',
name: 'chat_completion',
input: { prompt: input.prompt, model: input.model },
output: { response: input.response },
durationMs: input.durationMs,
metadata: {
inputTokens: input.tokens.input,
outputTokens: input.tokens.output
}
});
// Update trace totals
if (this.currentTrace) {
this.currentTrace.tokens.input += input.tokens.input;
this.currentTrace.tokens.output += input.tokens.output;
this.currentTrace.tokens.total += input.tokens.input + input.tokens.output;
// Calculate costs
const costs = calculateCosts(input.model, input.tokens);
this.currentTrace.cost.inputCost += costs.input;
this.currentTrace.cost.outputCost += costs.output;
this.currentTrace.cost.totalCost += costs.total;
}
}
logToolCall(input: {
toolName: string;
arguments: any;
result: any;
success: boolean;
durationMs: number;
}): void {
this.logEvent({
eventType: 'tool_call',
name: input.toolName,
input: input.arguments,
output: { result: input.result, success: input.success },
durationMs: input.durationMs,
metadata: {}
});
}
completeTrace(output: {
response: string;
toolCalls: ToolCall[];
finishReason: string;
}): void {
if (!this.currentTrace) return;
this.currentTrace.completedAt = new Date();
this.currentTrace.durationMs =
this.currentTrace.completedAt.getTime() - this.currentTrace.startedAt.getTime();
this.currentTrace.status = 'completed';
this.currentTrace.output = output;
// Update database
this.db.traces.update(this.currentTrace.traceId, this.currentTrace);
this.currentTrace = null;
}
failTrace(error: Error): void {
if (!this.currentTrace) return;
this.currentTrace.completedAt = new Date();
this.currentTrace.durationMs =
this.currentTrace.completedAt.getTime() - this.currentTrace.startedAt.getTime();
this.currentTrace.status = 'failed';
this.currentTrace.error = {
code: error.name,
message: error.message,
stack: error.stack
};
this.db.traces.update(this.currentTrace.traceId, this.currentTrace);
this.currentTrace = null;
}
}
Integration with agent execution
async function executeAgent(
userMessage: string,
context: ExecutionContext
): Promise<AgentResponse> {
const logger = new TraceLogger(db);
const traceId = logger.startTrace({
sessionId: context.sessionId,
userId: context.userId,
orgId: context.orgId,
agentName: 'research',
userMessage,
systemPrompt: RESEARCH_AGENT_PROMPT,
contextDocuments: context.retrievedDocs,
conversationHistory: context.history,
modelId: 'gpt-4o'
});
try {
// Log retrieval
const startRetrieval = Date.now();
const docs = await retrieveDocuments(userMessage);
logger.logEvent({
eventType: 'retrieval',
name: 'vector_search',
input: { query: userMessage },
output: { documentCount: docs.length },
durationMs: Date.now() - startRetrieval,
metadata: {}
});
// Log LLM call
const startLLM = Date.now();
const messages = buildMessages(userMessage, docs, context.history);
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages
});
logger.logLLMCall({
prompt: JSON.stringify(messages),
response: response.choices[0].message.content ?? '',
model: 'gpt-4o',
tokens: {
input: response.usage.prompt_tokens,
output: response.usage.completion_tokens
},
durationMs: Date.now() - startLLM
});
// Handle tool calls if any
if (response.choices[0].message.tool_calls) {
for (const toolCall of response.choices[0].message.tool_calls) {
const startTool = Date.now();
const result = await executeTool(
toolCall.function.name,
JSON.parse(toolCall.function.arguments)
);
logger.logToolCall({
toolName: toolCall.function.name,
arguments: JSON.parse(toolCall.function.arguments),
result,
success: true,
durationMs: Date.now() - startTool
});
}
}
logger.completeTrace({
  response: response.choices[0].message.content ?? '',
  toolCalls: response.choices[0].message.tool_calls || [],
  finishReason: response.choices[0].finish_reason
});
return { content: response.choices[0].message.content ?? '', traceId };
} catch (error) {
logger.failTrace(error instanceof Error ? error : new Error(String(error)));
throw error;
}
}
Metrics and dashboards
Transform trace data into actionable metrics.
Core metrics to track
| Metric | Type | Purpose |
|---|---|---|
| Request count | Counter | Volume tracking |
| Latency p50/p95/p99 | Histogram | Performance monitoring |
| Error rate | Ratio | Reliability |
| Token usage | Counter | Cost projection |
| Cost per request | Histogram | Budget management |
| Tool call success rate | Ratio | Integration health |
| Task completion rate | Ratio | Agent quality |
Implementation
class AgentMetrics {
  private metrics: MetricsClient;

  constructor(metrics: MetricsClient) {
    this.metrics = metrics;
  }
recordRequest(trace: AgentTrace): void {
// Request count
this.metrics.increment('agent.requests', {
agent: trace.agentName,
status: trace.status
});
// Latency
if (trace.durationMs) {
this.metrics.histogram('agent.latency_ms', trace.durationMs, {
agent: trace.agentName
});
}
// Token usage
this.metrics.increment('agent.tokens.input', trace.tokens.input, {
agent: trace.agentName,
model: trace.modelId
});
this.metrics.increment('agent.tokens.output', trace.tokens.output, {
agent: trace.agentName,
model: trace.modelId
});
// Cost
this.metrics.histogram('agent.cost_usd', trace.cost.totalCost, {
agent: trace.agentName
});
// Error tracking
if (trace.status === 'failed') {
this.metrics.increment('agent.errors', {
agent: trace.agentName,
error_code: trace.error?.code || 'unknown'
});
}
}
recordToolCall(event: TraceEvent): void {
this.metrics.increment('agent.tool_calls', {
tool: event.name,
success: String(event.output?.success ?? true)
});
this.metrics.histogram('agent.tool_latency_ms', event.durationMs, {
tool: event.name
});
}
}
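One way to feed these metrics is to record each trace once it finishes. A minimal wiring sketch on top of the executeAgent function from earlier - the metricsClient instance is assumed to be whatever MetricsClient implementation you already run (StatsD, Datadog, etc.), and recording after execution rather than inside completeTrace/failTrace is just one option:

// Sketch: emit operational metrics after each execution. Failed executions throw before
// reaching this point, so record error metrics inside failTrace if you need them as well.
const agentMetrics = new AgentMetrics(metricsClient);

async function executeAgentWithMetrics(
  userMessage: string,
  context: ExecutionContext
): Promise<AgentResponse> {
  const result = await executeAgent(userMessage, context);
  const trace = await db.traces.findById(result.traceId);
  if (trace) {
    agentMetrics.recordRequest(trace);
    trace.events
      .filter(e => e.eventType === 'tool_call')
      .forEach(e => agentMetrics.recordToolCall(e));
  }
  return result;
}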
Dashboard queries
-- Request volume and latency by agent (last 24h)
SELECT
agent_name,
COUNT(*) as request_count,
AVG(duration_ms) as avg_latency,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY duration_ms) as p95_latency,
SUM(CASE WHEN status = 'failed' THEN 1 ELSE 0 END)::float / COUNT(*) as error_rate
FROM agent_traces
WHERE started_at > NOW() - INTERVAL '24 hours'
GROUP BY agent_name
ORDER BY request_count DESC;
-- Cost breakdown by agent and model (last 7 days)
SELECT
agent_name,
model_id,
SUM(total_cost) as total_cost,
SUM(input_tokens) as total_input_tokens,
SUM(output_tokens) as total_output_tokens,
COUNT(*) as request_count,
AVG(total_cost) as avg_cost_per_request
FROM agent_traces
WHERE started_at > NOW() - INTERVAL '7 days'
GROUP BY agent_name, model_id
ORDER BY total_cost DESC;
-- Tool reliability (last 24h)
SELECT
name as tool_name,
COUNT(*) as call_count,
SUM(CASE WHEN (output->>'success')::boolean THEN 1 ELSE 0 END)::float / COUNT(*) as success_rate,
AVG(duration_ms) as avg_latency
FROM trace_events
WHERE event_type = 'tool_call'
AND timestamp > NOW() - INTERVAL '24 hours'
GROUP BY name
ORDER BY call_count DESC;
Debugging workflows
Metrics tell you something is wrong. Debugging workflows help you fix it.
Workflow 1: Reproduce a failure
When a user reports an issue, reproduce the exact conditions.
async function reproduceTrace(traceId: string): Promise<void> {
// Fetch the original trace
const trace = await db.traces.findById(traceId);
if (!trace) {
throw new Error(`Trace ${traceId} not found`);
}
console.log('=== Original Execution ===');
console.log('User message:', trace.input.userMessage);
console.log('System prompt length:', trace.input.systemPrompt.length);
console.log('Context documents:', trace.input.contextDocuments.length);
console.log('History length:', trace.input.conversationHistory.length);
console.log('Model:', trace.modelId);
console.log('Output:', trace.output?.response);
console.log('Status:', trace.status);
// Replay with same inputs
console.log('\n=== Replaying ===');
const replayResult = await executeAgentWithInputs({
userMessage: trace.input.userMessage,
systemPrompt: trace.input.systemPrompt,
contextDocuments: trace.input.contextDocuments,
conversationHistory: trace.input.conversationHistory,
modelId: trace.modelId
});
console.log('Replay output:', replayResult.content);
// With non-zero temperature, exact matches are unlikely; treat a mismatch as a starting point, not proof of a regression.
console.log('Match:', replayResult.content === trace.output?.response ? 'YES' : 'NO');
}
Workflow 2: Compare prompt versions
When you suspect a prompt change caused issues:
async function comparePromptVersions(
userMessage: string,
promptVersions: { version: string; prompt: string }[]
): Promise<void> {
console.log('=== Prompt Comparison ===');
console.log('User message:', userMessage);
for (const { version, prompt } of promptVersions) {
console.log(`\n--- Version: ${version} ---`);
const result = await executeAgentWithInputs({
userMessage,
systemPrompt: prompt,
modelId: 'gpt-4o'
});
console.log('Response:', result.content);
console.log('Tokens:', result.tokens);
console.log('Cost:', result.cost);
}
}
Workflow 3: Trace analysis for patterns
Find patterns in failures:
async function analyseFailures(
agentName: string,
hours: number = 24
): Promise<void> {
const failures = await db.traces.find({
agentName,
status: 'failed',
startedAt: { $gt: new Date(Date.now() - hours * 3600000) }
});
console.log(`Found ${failures.length} failures in last ${hours} hours`);
// Group by error code
const byError = groupBy(failures, t => t.error?.code || 'unknown');
console.log('\nFailures by error code:');
Object.entries(byError).forEach(([code, traces]) => {
console.log(` ${code}: ${traces.length}`);
});
// Common input patterns
const inputLengths = failures.map(t => t.input.userMessage.length);
console.log('\nInput message length stats:');
console.log(` Min: ${Math.min(...inputLengths)}`);
console.log(` Max: ${Math.max(...inputLengths)}`);
console.log(` Avg: ${average(inputLengths).toFixed(0)}`);
// Look for common keywords in failed inputs
const allWords = failures
.flatMap(t => t.input.userMessage.toLowerCase().split(/\s+/))
.filter(w => w.length > 4);
const wordCounts = countBy(allWords);
const topWords = Object.entries(wordCounts)
.sort((a, b) => b[1] - a[1])
.slice(0, 10);
console.log('\nMost common words in failed inputs:');
topWords.forEach(([word, count]) => {
console.log(` ${word}: ${count}`);
});
}
Workflow 4: Cost anomaly detection
Alert when costs spike unexpectedly:
async function detectCostAnomalies(): Promise<void> {
// Get hourly costs for last 48 hours
const hourlyCosts = await db.query(`
SELECT
DATE_TRUNC('hour', started_at) as hour,
SUM(total_cost) as cost,
COUNT(*) as requests
FROM agent_traces
WHERE started_at > NOW() - INTERVAL '48 hours'
GROUP BY 1
ORDER BY 1
`);
// Calculate baseline (first 24 hours)
const baseline = hourlyCosts.slice(0, 24);
const avgCost = average(baseline.map(h => h.cost));
const stdDev = standardDeviation(baseline.map(h => h.cost));
// Check recent hours against baseline
const recent = hourlyCosts.slice(24);
for (const hour of recent) {
const zScore = (hour.cost - avgCost) / stdDev;
if (zScore > 3) { // More than 3 standard deviations
console.log(`ALERT: Cost anomaly detected`);
console.log(` Hour: ${hour.hour}`);
console.log(` Cost: $${hour.cost.toFixed(2)}`);
console.log(` Baseline: $${avgCost.toFixed(2)} ± $${stdDev.toFixed(2)}`);
console.log(` Z-score: ${zScore.toFixed(2)}`);
console.log(` Requests: ${hour.requests}`);
// Find what's driving the cost
const topAgents = await db.query(`
SELECT agent_name, SUM(total_cost) as cost
FROM agent_traces
WHERE DATE_TRUNC('hour', started_at) = $1
GROUP BY agent_name
ORDER BY cost DESC
LIMIT 5
`, [hour.hour]);
console.log(' Top cost drivers:');
topAgents.forEach(a => {
console.log(` ${a.agent_name}: $${a.cost.toFixed(2)}`);
});
}
}
}
FAQs
How much logging is too much?
Log everything in development. In production, sample non-error traces if volume is high. Always log full traces for errors.
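A minimal sampling sketch - the rate and keying it off trace status are assumptions to adapt to your volume:

// Sketch: always persist failures, sample everything else. TRACE_SAMPLE_RATE is illustrative.
const TRACE_SAMPLE_RATE = Number(process.env.TRACE_SAMPLE_RATE ?? '0.1'); // keep ~10% of healthy traces

function shouldPersistTrace(trace: AgentTrace): boolean {
  if (trace.status === 'failed' || trace.status === 'timeout') return true;
  return Math.random() < TRACE_SAMPLE_RATE;
}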
Should I log full prompts? What about PII?
Yes, log full prompts - you need them for debugging. But implement PII scrubbing and access controls. Only authorised personnel should access trace data.
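A rough scrubbing sketch, applied before traces are written - the regex patterns are illustrative and nowhere near exhaustive, so a dedicated PII detection library is the safer choice in production:

// Sketch: redact common PII patterns from trace text before persisting.
const PII_PATTERNS: [RegExp, string][] = [
  [/[\w.+-]+@[\w-]+\.[\w.]+/g, '[EMAIL]'],  // email addresses
  [/\+?\d[\d\s().-]{8,}\d/g, '[PHONE]'],    // phone-number-like sequences
  [/\b(?:\d[ -]*?){13,16}\b/g, '[CARD]']    // card-number-like sequences
];

function scrubText(text: string): string {
  return PII_PATTERNS.reduce((acc, [pattern, token]) => acc.replace(pattern, token), text);
}

function scrubTrace(trace: AgentTrace): AgentTrace {
  return {
    ...trace,
    input: {
      ...trace.input,
      userMessage: scrubText(trace.input.userMessage),
      contextDocuments: trace.input.contextDocuments.map(scrubText)
    },
    output: trace.output
      ? { ...trace.output, response: scrubText(trace.output.response) }
      : undefined
  };
}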
How long should I retain traces?
Keep full traces for 30-90 days. Keep aggregated metrics longer. For compliance-sensitive applications, follow your retention policies.
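A retention sweep can be as simple as a scheduled job against the same tables used in the dashboard queries - a sketch, with the 90-day default purely illustrative:

// Sketch: prune traces and events older than the retention window. Run on a daily schedule.
async function pruneOldTraces(retentionDays = 90): Promise<void> {
  await db.query(
    `DELETE FROM trace_events WHERE timestamp < NOW() - make_interval(days => $1)`,
    [retentionDays]
  );
  await db.query(
    `DELETE FROM agent_traces WHERE started_at < NOW() - make_interval(days => $1)`,
    [retentionDays]
  );
}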
What's the performance overhead of logging?
Async logging has minimal overhead (<5ms per request). Write to fast storage (memory queue → background worker → database), not synchronously.
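A simplified version of that pipeline - the in-memory queue and flush interval are assumptions, and a production version would add backpressure and retry handling:

// Sketch: buffer trace writes in memory and flush them in the background so logging
// never blocks the request path. Interval and batch handling are deliberately naive.
class AsyncTraceWriter {
  private queue: AgentTrace[] = [];

  constructor(private db: Database, flushIntervalMs = 250) {
    setInterval(
      () => this.flush().catch(err => console.error('Trace flush failed', err)),
      flushIntervalMs
    );
  }

  enqueue(trace: AgentTrace): void {
    this.queue.push(trace); // no I/O on the request path
  }

  private async flush(): Promise<void> {
    if (this.queue.length === 0) return;
    const batch = this.queue.splice(0, this.queue.length);
    for (const trace of batch) {
      await this.db.traces.insert(trace); // same Database interface used by TraceLogger above
    }
  }
}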
Should I use commercial observability tools?
Tools like LangSmith, Helicone, and Langfuse are excellent and save development time. Build custom if you have specific requirements or compliance constraints.
Summary and next steps
AI agent observability requires thinking beyond traditional logging. Trace-level visibility, full prompt capture, and debugging workflows let you understand and fix issues in probabilistic systems.
Implementation checklist:
- Design trace schema capturing full execution context
- Instrument agent code with trace logging
- Build metrics dashboards for operational health
- Create debugging workflows for common scenarios
- Add cost tracking and anomaly detection
- Implement access controls for trace data (see the sketch below)
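For the access-control item, a simple guard over who can read a trace goes a long way - a sketch, with the role names and viewer shape as assumptions:

// Sketch: gate trace access by organisation and role before returning trace data.
type TraceViewer = { userId: string; orgId: string; role: 'admin' | 'support' | 'member' };

function canViewTrace(viewer: TraceViewer, trace: AgentTrace): boolean {
  if (viewer.orgId !== trace.orgId) return false;                        // never cross org boundaries
  if (viewer.role === 'admin' || viewer.role === 'support') return true; // trusted roles see full traces
  return viewer.userId === trace.userId;                                 // members only see their own executions
}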
Quick wins:
- Start logging full prompts and responses, even before the rest of the infrastructure exists (see the sketch after this list)
- Add trace IDs to all agent executions
- Track token usage and costs from day one
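For that first quick win, even appending prompt/response pairs to a local JSONL file is enough to start - a throwaway sketch, with the file path and fields arbitrary:

// Sketch: the simplest possible starting point - append every LLM call to a JSONL file.
// Swap in real storage once the trace schema above is in place.
import { appendFileSync } from 'fs';
import { randomUUID } from 'crypto';

function logLLMCallToFile(entry: {
  prompt: string;
  response: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  durationMs: number;
}): string {
  const traceId = randomUUID();
  appendFileSync(
    'llm-calls.jsonl',
    JSON.stringify({ traceId, timestamp: new Date().toISOString(), ...entry }) + '\n'
  );
  return traceId;
}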