TL;DR
- Context windows are finite; conversations are not. Managing this mismatch is the core challenge.
- Allocate token budgets: system prompt (fixed), context/RAG (variable), history (managed), response (reserved).
- Sliding windows drop old messages; summarisation compresses them. Combine both for best results.
- Track what matters: user preferences, decisions made, commitments given. Losing these breaks coherence.
Building Conversational AI Agents: Context Management for Multi-Turn Dialogues
A user has been chatting with your agent for 45 minutes. They've explained their project, made decisions, discussed alternatives, and are now asking a follow-up question. Your agent responds as if they've never met - because the conversation exceeded the context window and you truncated the wrong parts.
Context management determines whether conversational agents feel intelligent or amnesia-afflicted. Done well, agents maintain coherence across dozens of turns, remember important details, and handle long sessions gracefully. Done poorly, users repeat themselves endlessly and lose trust.
This guide covers production patterns for context management, from simple sliding windows to sophisticated summarisation pipelines. We'll build systems that handle hour-long conversations without losing the plot.
Key takeaways
- Context isn't just "previous messages" - it's system prompt, retrieved knowledge, conversation history, and reserved response space.
- Summarise strategically: compress routine exchanges, preserve decisions and commitments verbatim.
- Implement graceful degradation: when context fills, warn users and offer to start fresh.
- Monitor context quality, not just quantity. 10K tokens of noise is worse than 5K of signal.
The context management problem
Every LLM call has a context window limit. GPT-4o supports 128K tokens; Claude supports 200K. These sound massive until you realise where tokens go:
Typical agent prompt composition:
- System prompt: 1,500 tokens
- Retrieved documents (RAG): 4,000-15,000 tokens
- Conversation history: 500-50,000+ tokens
- User's current message: 50-500 tokens
- Reserved for response: 2,000-4,000 tokens
A 128K context window fills surprisingly fast when RAG retrieves multiple documents and conversations run long. Without management, you hit limits, truncate randomly, and break coherence.
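To make the arithmetic concrete, here's a rough headroom calculation using figures from the ranges above (the specific numbers are illustrative assumptions, not measurements):
// Rough headroom estimate for a single call (illustrative figures only)
const contextLimit = 128_000;

const composition = {
  systemPrompt: 1_500,
  ragDocuments: 12_000,   // a handful of retrieved chunks
  history: 40_000,        // roughly 45 minutes of back-and-forth
  currentMessage: 300,
  responseReserve: 4_000
};

const used = Object.values(composition).reduce((sum, t) => sum + t, 0);
console.log(`Used: ${used} tokens, headroom: ${contextLimit - used}`);
// Used: 57800 tokens, headroom: 70200 - and history grows every turn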
Why simple truncation fails
The naive approach: when context exceeds limits, drop the oldest messages.
// Don't do this
function truncateHistory(messages: Message[], maxTokens: number) {
while (countTokens(messages) > maxTokens) {
messages.shift(); // Remove oldest
}
return messages;
}
Problems:
- Losing context setup: Early messages often contain crucial context ("I'm building a marketing dashboard for my SaaS")
- Losing decisions: User agreed to approach A in message 4; you drop it and suggest approach B again
- Losing corrections: User corrected your misunderstanding; you make the same mistake
- Abrupt transitions: Message 12 references message 3; dropping message 3 makes message 12 nonsensical
Token budget allocation
Before managing history, establish how you'll allocate your token budget.
Budget framework
interface TokenBudget {
total: number; // Model's context limit
systemPrompt: number; // Fixed allocation
ragContext: number; // Variable, based on query
conversationHistory: number; // Managed, compressed
currentTurn: number; // User message + tools
responseReserve: number; // Never exceed
}
function calculateBudget(modelLimit: number): TokenBudget {
return {
total: modelLimit,
systemPrompt: 2000, // Fixed overhead
ragContext: 8000, // Typical retrieval
conversationHistory: Math.floor(modelLimit * 0.5), // Half for history
currentTurn: 1000, // User message + tool calls
responseReserve: 4000 // Model output space
};
}
// For GPT-4o (128K)
const budget = calculateBudget(128000);
// Available for history: 64,000 tokens
// But practical limit is often lower to leave headroom
Dynamic allocation
Not every request needs the same allocation. Adjust based on context:
function dynamicBudget(
modelLimit: number,
ragNeeded: boolean,
conversationLength: number
): TokenBudget {
const base = {
total: modelLimit,
systemPrompt: 2000,
responseReserve: 4000,
currentTurn: 1000
};
// Short conversations: allocate more to RAG
if (conversationLength < 10) {
return {
...base,
ragContext: 15000,
conversationHistory: modelLimit - 22000
};
}
// Long conversations: prioritise history
if (conversationLength > 50) {
return {
...base,
ragContext: 5000,
conversationHistory: modelLimit - 12000
};
}
// Balanced default
return {
...base,
ragContext: ragNeeded ? 10000 : 2000,
conversationHistory: modelLimit - (ragNeeded ? 17000 : 9000)
};
}
History management strategies
Three core strategies exist for managing conversation history within token budgets.
Strategy 1: Sliding window
Keep the N most recent messages. Simple but loses early context.
interface SlidingWindowConfig {
maxMessages: number;
preserveFirst: number; // Keep first N messages (context setup)
preserveLast: number; // Keep last N messages (recent context)
}
function slidingWindow(
messages: Message[],
config: SlidingWindowConfig
): Message[] {
if (messages.length <= config.maxMessages) {
return messages;
}
const first = messages.slice(0, config.preserveFirst);
const last = messages.slice(-config.preserveLast);
return [...first, ...last];
}
// Usage: Keep first 2 and last 10 messages
const windowed = slidingWindow(history, {
maxMessages: 20,
preserveFirst: 2,
preserveLast: 10
});
Best for: Quick interactions, technical support with independent questions.
Limitations: Loses middle context, can't handle long-form discussions.
Strategy 2: Summarisation
Compress old messages into summaries, preserving key information in fewer tokens.
async function summariseHistory(
messages: Message[],
maxOutputTokens: number = 500
): Promise<string> {
const response = await openai.chat.completions.create({
model: 'gpt-4o-mini', // Cheaper model for summarisation
messages: [
{
role: 'system',
content: `Summarise this conversation history concisely. Preserve:
- User's goals and requirements
- Decisions made and reasons
- Commitments from either party
- Corrections and clarifications
- Technical details and preferences
Format as bullet points. Be concise but complete.`
},
{
role: 'user',
content: messages.map(m => `${m.role}: ${m.content}`).join('\n\n')
}
],
max_tokens: maxOutputTokens
});
return response.choices[0].message.content ?? '';
}
Best for: Long-running conversations, customer support, advisory interactions.
Limitations: Summarisation can lose nuance; costs extra API calls.
Strategy 3: Hybrid approach (recommended)
Combine sliding window with summarisation for best of both worlds.
interface ConversationContext {
summary: string | null;
recentMessages: Message[];
importantMessages: Message[];
}
class HybridContextManager {
private maxRecentMessages = 10;
private summariseThreshold = 20;
private importantPatterns = [
/decision|decided|agree|let's go with/i,
/commitment|promise|will do|I'll/i,
/correction|actually|I meant/i,
/preference|prefer|always want/i
];
async buildContext(
fullHistory: Message[],
tokenBudget: number
): Promise<ConversationContext> {
// Always keep recent messages
const recentMessages = fullHistory.slice(-this.maxRecentMessages);
// Find important messages throughout history
const importantMessages = this.extractImportantMessages(
fullHistory.slice(0, -this.maxRecentMessages)
);
// Calculate tokens used by the verbatim messages; drop the oldest
// important messages if they push us past the history budget
let usedTokens = countTokens([...recentMessages, ...importantMessages]);
while (usedTokens > tokenBudget && importantMessages.length > 0) {
importantMessages.shift();
usedTokens = countTokens([...recentMessages, ...importantMessages]);
}
// Summarise if needed
let summary: string | null = null;
if (fullHistory.length > this.summariseThreshold) {
const toSummarise = fullHistory.slice(0, -this.maxRecentMessages)
.filter(m => !importantMessages.includes(m));
if (toSummarise.length > 0) {
summary = await summariseHistory(toSummarise);
}
}
return { summary, recentMessages, importantMessages };
}
private extractImportantMessages(messages: Message[]): Message[] {
return messages.filter(msg =>
this.importantPatterns.some(pattern => pattern.test(msg.content))
);
}
formatForPrompt(context: ConversationContext): string {
let formatted = '';
if (context.summary) {
formatted += `## Conversation Summary\n${context.summary}\n\n`;
}
if (context.importantMessages.length > 0) {
formatted += `## Key Points from Earlier\n`;
for (const msg of context.importantMessages) {
formatted += `[${msg.role}]: ${msg.content}\n`;
}
formatted += '\n';
}
formatted += `## Recent Messages\n`;
for (const msg of context.recentMessages) {
formatted += `[${msg.role}]: ${msg.content}\n`;
}
return formatted;
}
}
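Here's a sketch of how the manager might be wired into a request handler. The openai client matches the earlier examples; fullHistory, systemPrompt, userMessage and the 8,000-token history budget are assumed to come from the surrounding application:
const manager = new HybridContextManager();

// Build the managed history block before each model call
const context = await manager.buildContext(fullHistory, 8000);
const historyBlock = manager.formatForPrompt(context);

const completion = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [
    { role: 'system', content: systemPrompt },
    { role: 'user', content: `${historyBlock}\n## Current Message\n${userMessage}` }
  ]
});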
Implementation guide
Let's build a complete context management system step by step.
Step 1: Message storage with metadata
Store messages with metadata that enables smart filtering.
interface StoredMessage {
id: string;
sessionId: string;
role: 'user' | 'assistant' | 'system';
content: string;
tokenCount: number;
createdAt: Date;
metadata: {
importance: 'low' | 'medium' | 'high';
containsDecision: boolean;
containsCommitment: boolean;
referencedBy: string[]; // IDs of messages that reference this
};
}
async function storeMessage(
sessionId: string,
role: string,
content: string
): Promise<StoredMessage> {
const message: StoredMessage = {
id: generateId(),
sessionId,
role: role as StoredMessage['role'],
content,
tokenCount: countTokens(content),
createdAt: new Date(),
metadata: await analyseMessage(content)
};
await db.messages.insert(message);
return message;
}
async function analyseMessage(content: string): Promise<StoredMessage['metadata']> {
// Quick heuristic analysis (or use LLM for higher quality)
return {
importance: calculateImportance(content),
containsDecision: /\b(decided|agree|let's|going with)\b/i.test(content),
containsCommitment: /\b(will|promise|commit|guarantee)\b/i.test(content),
referencedBy: []
};
}
function calculateImportance(content: string): 'low' | 'medium' | 'high' {
const highPatterns = [/decision|agreed|confirmed|approved/i];
const mediumPatterns = [/prefer|suggest|recommend|consider/i];
if (highPatterns.some(p => p.test(content))) return 'high';
if (mediumPatterns.some(p => p.test(content))) return 'medium';
return 'low';
}
Step 2: Context building pipeline
class ConversationContextBuilder {
private contextManager: HybridContextManager;
private systemPrompt: string; // Set at construction; referenced when assembling the prompt
async buildContext(
sessionId: string,
currentMessage: string,
budget: TokenBudget
): Promise<{
systemPrompt: string;
ragContext: string;
conversationContext: string;
availableForResponse: number;
}> {
// 1. Get full history
const history = await db.messages.findBySession(sessionId);
// 2. Build managed context
const managedContext = await this.contextManager.buildContext(
history,
budget.conversationHistory
);
// 3. Get RAG context if needed
const ragContext = await this.retrieveRelevantDocs(
currentMessage,
managedContext.summary,
budget.ragContext
);
// 4. Format everything
const conversationContext = this.contextManager.formatForPrompt(managedContext);
// 5. Calculate remaining budget
const usedTokens = countTokens([
this.systemPrompt,
ragContext,
conversationContext,
currentMessage
]);
const availableForResponse = budget.total - usedTokens;
// 6. Warn if tight
if (availableForResponse < 1000) {
console.warn('Context budget tight, consider starting new session');
}
return {
systemPrompt: this.systemPrompt,
ragContext,
conversationContext,
availableForResponse
};
}
private async retrieveRelevantDocs(
query: string,
conversationSummary: string | null,
tokenBudget: number
): Promise<string> {
// Combine current query with conversation context for better retrieval
const enhancedQuery = conversationSummary
? `${query}\n\nConversation context: ${conversationSummary}`
: query;
const docs = await vectorSearch(enhancedQuery, {
maxTokens: tokenBudget,
topK: 5
});
return docs.map(d => d.content).join('\n\n---\n\n');
}
}
Step 3: Graceful overflow handling
When context truly fills up, handle it gracefully.
interface OverflowHandling {
strategy: 'warn' | 'compress' | 'new_session';
threshold: number; // Percentage of budget used
}
class ContextOverflowHandler {
private thresholds = {
warn: 0.85,
compress: 0.95,
newSession: 0.98
};
async handleOverflow(
used: number,
total: number,
sessionId: string
): Promise<{ action: string; message?: string }> {
const utilisation = used / total;
if (utilisation >= this.thresholds.newSession) {
return {
action: 'new_session',
message: `This conversation has been quite detailed. Would you like me to summarise what we've covered and start a fresh session? I'll preserve all important decisions and context.`
};
}
if (utilisation >= this.thresholds.compress) {
// Force aggressive summarisation
await this.aggressiveCompress(sessionId);
return {
action: 'compressed',
message: `I've compressed our earlier conversation to make room. All important points are preserved.`
};
}
if (utilisation >= this.thresholds.warn) {
return {
action: 'warn',
message: `Just a note: we're approaching the limit of what I can hold in context for this conversation. If we need to continue much longer, I may need to summarise earlier parts.`
};
}
return { action: 'none' };
}
private async aggressiveCompress(sessionId: string): Promise<void> {
const history = await db.messages.findBySession(sessionId);
// Keep only high-importance messages verbatim
const highImportance = history.filter(m => m.metadata.importance === 'high');
// Summarise everything else
const toSummarise = history.filter(m => m.metadata.importance !== 'high');
const summary = await summariseHistory(toSummarise, 300);
// Store compressed version
await db.conversationSummaries.upsert({
sessionId,
summary,
preservedMessages: highImportance.map(m => m.id),
compressedAt: new Date()
});
}
}
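One way to wire the handler into a turn loop, assuming the builder and budget from the earlier steps and a hypothetical sendSystemNotice helper for surfacing the message to the user:
const handler = new ContextOverflowHandler();

// After building context for the turn, check how full we are
const built = await builder.buildContext(sessionId, userMessage, budget);
const usedTokens = budget.total - built.availableForResponse;

const overflow = await handler.handleOverflow(usedTokens, budget.total, sessionId);
if (overflow.action !== 'none' && overflow.message) {
  // Surface the warning alongside the normal reply
  await sendSystemNotice(sessionId, overflow.message);
}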
Step 4: Session continuity
When starting new sessions, carry over relevant context.
async function createContinuationSession(
previousSessionId: string,
userId: string
): Promise<{ sessionId: string; welcomeMessage: string }> {
// Get summary of previous session
const previousHistory = await db.messages.findBySession(previousSessionId);
const summary = await summariseHistory(previousHistory);
// Extract key points
const keyPoints = await extractKeyPoints(previousHistory);
// Create new session with context
const newSessionId = generateId();
// Store continuation context
await db.sessionContext.insert({
sessionId: newSessionId,
previousSessionId,
carryoverSummary: summary,
carryoverKeyPoints: keyPoints
});
const welcomeMessage = `I remember our previous conversation. Here's what we covered:
${keyPoints.map(p => `• ${p}`).join('\n')}
${summary.length > 200 ? `\nMore detail: ${summary.slice(0, 200)}...` : ''}
What would you like to continue with?`;
return { sessionId: newSessionId, welcomeMessage };
}
async function extractKeyPoints(history: Message[]): Promise<string[]> {
const response = await openai.chat.completions.create({
model: 'gpt-4o-mini',
messages: [
{
role: 'system',
content: 'Extract 3-5 key points from this conversation that would be essential to remember in a follow-up session. Focus on decisions, preferences, and outcomes. Respond with a JSON object of the form {"keyPoints": ["..."]}.'
},
{
role: 'user',
content: history.map(m => `${m.role}: ${m.content}`).join('\n')
}
],
response_format: { type: 'json_object' }
});
return JSON.parse(response.choices[0].message.content ?? '{}').keyPoints ?? [];
}
Monitoring and quality
Context management quality directly impacts conversation quality. Monitor both.
Key metrics
| Metric | Target | Indicates |
|---|---|---|
| Avg context utilisation | 60-80% | Budget sizing |
| Overflow events | <5% of sessions | Headroom adequacy |
| Summarisation accuracy | >85% | Summary quality |
| User repetition rate | <10% | Context preservation |
Monitoring implementation
const contextMetrics = {
recordContextBuild(sessionId: string, utilisation: number, components: {
systemPrompt: number;
ragContext: number;
history: number;
available: number;
}) {
metrics.gauge('context.utilisation', utilisation, { sessionId });
metrics.gauge('context.system_prompt_tokens', components.systemPrompt);
metrics.gauge('context.rag_tokens', components.ragContext);
metrics.gauge('context.history_tokens', components.history);
metrics.gauge('context.available_tokens', components.available);
if (utilisation > 0.9) {
metrics.increment('context.near_overflow');
}
},
recordSummarisation(originalTokens: number, summaryTokens: number) {
const compressionRatio = summaryTokens / originalTokens;
metrics.histogram('context.compression_ratio', compressionRatio);
}
};
FAQs
How do I handle tool call results in context?
Tool call results can be verbose. Summarise them before storing:
async function storeToolResult(result: any): Promise<string> {
const resultString = JSON.stringify(result);
if (countTokens(resultString) < 200) {
return resultString; // Keep short results verbatim
}
// Summarise long results (assumes a generic summarise(text, maxTokens)
// helper, analogous to summariseHistory above)
const summary = await summarise(resultString, 150);
return `[Tool result summary: ${summary}]`;
}
Should I include system messages in history?
Generally no. System prompts are injected fresh each turn. Including them in history wastes tokens and can cause instruction drift.
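If your storage layer keeps system messages, filter them out when rebuilding the prompt (storedMessages and systemPrompt below are assumed to exist in the calling code):
// Keep stored system messages out of the history block; the live system
// prompt is injected fresh on every turn
const historyForPrompt = storedMessages.filter(m => m.role !== 'system');
const promptMessages = [
  { role: 'system', content: systemPrompt },
  ...historyForPrompt.map(m => ({ role: m.role, content: m.content }))
];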
How do I handle multi-user conversations?
Track speaker identity and include in context:
const message = {
role: 'user',
content: `[${userName}]: ${content}`
};
What's the right summarisation frequency?
Summarise when history exceeds 30-40 messages or when context utilisation exceeds 70%. More frequent summarisation loses nuance; less frequent risks overflow.
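Expressed as a simple guard (a hypothetical helper, using the thresholds above):
// Summarise when either trigger fires: message count past the 30-40 range,
// or context utilisation above 70%
function shouldSummarise(messageCount: number, contextUtilisation: number): boolean {
  return messageCount > 35 || contextUtilisation > 0.7;
}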
Can I use smaller models for summarisation?
Yes, and you should. GPT-4o-mini or Claude Haiku handle summarisation well at a fraction of the cost. Reserve larger models for the main conversation.
Summary and next steps
Context management separates coherent conversational agents from forgetful ones. The hybrid approach - combining sliding windows with smart summarisation - handles most production scenarios effectively.
Implementation checklist:
- Define token budget allocation strategy
- Implement sliding window with first/last preservation
- Add summarisation for long conversations
- Build importance detection for message preservation
- Handle overflow gracefully with user communication
- Monitor utilisation and quality metrics
Quick wins:
- Add importance tagging to messages (decisions, commitments)
- Implement basic sliding window with first-2/last-10 preservation
- Add utilisation warnings at 85% threshold