TL;DR
- Context windows are finite; conversations are not. Managing this mismatch is the core challenge.
- Allocate token budgets: system prompt (fixed), context/RAG (variable), history (managed), response (reserved).
- Sliding windows drop old messages; summarisation compresses them. Combine both for best results.
- Track what matters: user preferences, decisions made, commitments given. Losing these breaks coherence.
Building Conversational AI Agents: Context Management for Multi-Turn Dialogues
A user has been chatting with your agent for 45 minutes. They've explained their project, made decisions, discussed alternatives, and are now asking a follow-up question. Your agent responds as if they've never met - because the conversation exceeded the context window and you truncated the wrong parts.
Context management determines whether conversational agents feel intelligent or amnesia-afflicted. Done well, agents maintain coherence across dozens of turns, remember important details, and handle long sessions gracefully. Done poorly, users repeat themselves endlessly and lose trust.
This guide covers production patterns for context management, from simple sliding windows to sophisticated summarisation pipelines. We'll build systems that handle hour-long conversations without losing the plot.
Key takeaways
- Context isn't just "previous messages" - it's system prompt, retrieved knowledge, conversation history, and reserved response space.
- Summarise strategically: compress routine exchanges, preserve decisions and commitments verbatim.
- Implement graceful degradation: when context fills, warn users and offer to start fresh.
- Monitor context quality, not just quantity. 10K tokens of noise is worse than 5K of signal.
The context management problem
Every LLM call has a context window limit. GPT-4o supports 128K tokens; Claude supports 200K. These sound massive until you realise where tokens go:
Typical agent prompt composition:
- System prompt: 1,500 tokens
- Retrieved documents (RAG): 4,000-15,000 tokens
- Conversation history: 500-50,000+ tokens
- User's current message: 50-500 tokens
- Reserved for response: 2,000-4,000 tokens
A 128K context window fills surprisingly fast when RAG retrieves multiple documents and conversations run long. Without management, you hit limits, truncate randomly, and break coherence.
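To make the arithmetic concrete, here's a rough headroom calculation using figures from the ranges above (the specific numbers are illustrative assumptions, not measurements):
// Rough headroom estimate for a single call (illustrative figures only)
const contextLimit = 128_000;

const composition = {
  systemPrompt: 1_500,
  ragDocuments: 12_000,   // a handful of retrieved chunks
  history: 40_000,        // roughly 45 minutes of back-and-forth
  currentMessage: 300,
  responseReserve: 4_000
};

const used = Object.values(composition).reduce((sum, t) => sum + t, 0);
console.log(`Used: ${used} tokens, headroom: ${contextLimit - used}`);
// Used: 57800 tokens, headroom: 70200 - and history grows every turn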
Why simple truncation fails
The naive approach: when context exceeds limits, drop the oldest messages.
// Don't do this
function truncateHistory(messages: Message[], maxTokens: number) {
while (countTokens(messages) > maxTokens) {
messages.shift(); // Remove oldest
}
return messages;
}
Problems:
- Losing context setup: Early messages often contain crucial context ("I'm building a marketing dashboard for my SaaS")
- Losing decisions: User agreed to approach A in message 4; you drop it and suggest approach B again
- Losing corrections: User corrected your misunderstanding; you make the same mistake
- Abrupt transitions: Message 12 references message 3; dropping message 3 makes message 12 nonsensical
Token budget allocation
Before managing history, establish how you'll allocate your token budget.
Budget framework
interface TokenBudget {
total: number; // Model's context limit
systemPrompt: number; // Fixed allocation
ragContext: number; // Variable, based on query
conversationHistory: number; // Managed, compressed
currentTurn: number; // User message + tools
responseReserve: number; // Never exceed
}
function calculateBudget(modelLimit: number): TokenBudget {
return {
total: modelLimit,
systemPrompt: 2000, // Fixed overhead
ragContext: 8000, // Typical retrieval
conversationHistory: Math.floor(modelLimit * 0.5), // Half for history
currentTurn: 1000, // User message + tool calls
responseReserve: 4000 // Model output space
};
}
// For GPT-4o (128K)
const budget = calculateBudget(128000);
// Available for history: 64,000 tokens
// But practical limit is often lower to leave headroom
Dynamic allocation
Not every request needs the same allocation. Adjust based on context:
function dynamicBudget(
modelLimit: number,
ragNeeded: boolean,
conversationLength: number
): TokenBudget {
const base = {
total: modelLimit,
systemPrompt: 2000,
responseReserve: 4000,
currentTurn: 1000
};
// Short conversations: allocate more to RAG
if (conversationLength < 10) {
return {
...base,
ragContext: 15000,
conversationHistory: modelLimit - 22000
};
}
// Long conversations: prioritise history
if (conversationLength > 50) {
return {
...base,
ragContext: 5000,
conversationHistory: modelLimit - 12000
};
}
// Balanced default
return {
...base,
ragContext: ragNeeded ? 10000 : 2000,
conversationHistory: modelLimit - (ragNeeded ? 17000 : 9000)
};
}
History management strategies
Three core strategies exist for managing conversation history within token budgets.
Strategy 1: Sliding window
Keep the N most recent messages. Simple but loses early context.
interface SlidingWindowConfig {
maxMessages: number;
preserveFirst: number; // Keep first N messages (context setup)
preserveLast: number; // Keep last N messages (recent context)
}
function slidingWindow(
messages: Message[],
config: SlidingWindowConfig
): Message[] {
if (messages.length <= config.maxMessages) {
return messages;
}
const first = messages.slice(0, config.preserveFirst);
const last = messages.slice(-config.preserveLast);
return [...first, ...last];
}
// Usage: Keep first 2 and last 10 messages
const windowed = slidingWindow(history, {
maxMessages: 20,
preserveFirst: 2,
preserveLast: 10
});
Best for: Quick interactions, technical support with independent questions.
Limitations: Loses middle context, can't handle long-form discussions.
Strategy 2: Summarisation
Compress old messages into summaries, preserving key information in fewer tokens.
async function summariseHistory(
messages: Message[],
maxOutputTokens: number = 500
): Promise<string> {
const response = await openai.chat.completions.create({
model: 'gpt-4o-mini', // Cheaper model for summarisation
messages: [
{
role: 'system',
content: `Summarise this conversation history concisely. Preserve:
- User's goals and requirements
- Decisions made and reasons
- Commitments from either party
- Corrections and clarifications
- Technical details and preferences
Format as bullet points. Be concise but complete.`
},
{
role: 'user',
content: messages.map(m => `${m.role}: ${m.content}`).join('\n\n')
}
],
max_tokens: maxOutputTokens
});
return response.choices[0].message.content ?? '';
}
Best for: Long-running conversations, customer support, advisory interactions.
Limitations: Summarisation can lose nuance; costs extra API calls.
Strategy 3: Hybrid approach (recommended)
Combine sliding window with summarisation for best of both worlds.
interface ConversationContext {
summary: string | null;
recentMessages: Message[];
importantMessages: Message[];
}
class HybridContextManager {
private maxRecentMessages = 10;
private summariseThreshold = 20;
private importantPatterns = [
/decision|decided|agree|let's go with/i,
/commitment|promise|will do|I'll/i,
/correction|actually|I meant/i,
/preference|prefer|always want/i
];
async buildContext(
fullHistory: Message[],
tokenBudget: number
): Promise<ConversationContext> {
// Always keep recent messages
const recentMessages = fullHistory.slice(-this.maxRecentMessages);
// Find important messages throughout history
const importantMessages = this.extractImportantMessages(
fullHistory.slice(0, -this.maxRecentMessages)
);
// Calculate tokens used by the verbatim messages; drop the oldest
// important messages if they push us past the history budget
let usedTokens = countTokens([...recentMessages, ...importantMessages]);
while (usedTokens > tokenBudget && importantMessages.length > 0) {
importantMessages.shift();
usedTokens = countTokens([...recentMessages, ...importantMessages]);
}
// Summarise if needed
let summary: string | null = null;
if (fullHistory.length > this.summariseThreshold) {
const toSummarise = fullHistory.slice(0, -this.maxRecentMessages)
.filter(m => !importantMessages.includes(m));
if (toSummarise.length > 0) {
summary = await summariseHistory(toSummarise);
}
}
return { summary, recentMessages, importantMessages };
}
private extractImportantMessages(messages: Message[]): Message[] {
return messages.filter(msg =>
this.importantPatterns.some(pattern => pattern.test(msg.content))
);
}
formatForPrompt(context: ConversationContext): string {
let formatted = '';
if (context.summary) {
formatted += `## Conversation Summary\n${context.summary}\n\n`;
}
if (context.importantMessages.length > 0) {
formatted += `## Key Points from Earlier\n`;
for (const msg of context.importantMessages) {
formatted += `[${msg.role}]: ${msg.content}\n`;
}
formatted += '\n';
}
formatted += `## Recent Messages\n`;
for (const msg of context.recentMessages) {
formatted += `[${msg.role}]: ${msg.content}\n`;
}
return formatted;
}
}
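Here's a sketch of how the manager might be wired into a request handler. The openai client matches the earlier examples; fullHistory, systemPrompt, userMessage and the 8,000-token history budget are assumed to come from the surrounding application:
const manager = new HybridContextManager();

// Build the managed history block before each model call
const context = await manager.buildContext(fullHistory, 8000);
const historyBlock = manager.formatForPrompt(context);

const completion = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [
    { role: 'system', content: systemPrompt },
    { role: 'user', content: `${historyBlock}\n## Current Message\n${userMessage}` }
  ]
});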
Implementation guide
Let's build a complete context management system step by step.
Step 1: Message storage with metadata
Store messages with metadata that enables smart filtering.
interface StoredMessage {
id: string;
sessionId: string;
role: 'user' | 'assistant' | 'system';
content: string;
tokenCount: number;
createdAt: Date;
metadata: {
importance: 'low' | 'medium' | 'high';
containsDecision: boolean;
containsCommitment: boolean;
referencedBy: string[]; // IDs of messages that reference this
};
}
async function storeMessage(
sessionId: string,
role: string,
content: string
): Promise<StoredMessage> {
const message: StoredMessage = {
id: generateId(),
sessionId,
role: role as StoredMessage['role'],
content,
tokenCount: countTokens(content),
createdAt: new Date(),
metadata: await analyseMessage(content)
};
await db.messages.insert(message);
return message;
}
async function analyseMessage(content: string): Promise<StoredMessage['metadata']> {
// Quick heuristic analysis (or use LLM for higher quality)
return {
importance: calculateImportance(content),
containsDecision: /\b(decided|agree|let's|going with)\b/i.test(content),
containsCommitment: /\b(will|promise|commit|guarantee)\b/i.test(content),
referencedBy: []
};
}
function calculateImportance(content: string): 'low' | 'medium' | 'high' {
const highPatterns = [/decision|agreed|confirmed|approved/i];
const mediumPatterns = [/prefer|suggest|recommend|consider/i];
if (highPatterns.some(p => p.test(content))) return 'high';
if (mediumPatterns.some(p => p.test(content))) return 'medium';
return 'low';
}
Step 2: Context building pipeline
class ConversationContextBuilder {
private contextManager: HybridContextManager;
private systemPrompt: string; // Set at construction; referenced when assembling the prompt
async buildContext(
sessionId: string,
currentMessage: string,
budget: TokenBudget
): Promise<{
systemPrompt: string;
ragContext: string;
conversationContext: string;
availableForResponse: number;
}> {
// 1. Get full history
const history = await db.messages.findBySession(sessionId);
// 2. Build managed context
const managedContext = await this.contextManager.buildContext(
history,
budget.conversationHistory
);
// 3. Get RAG context if needed
const ragContext = await this.retrieveRelevantDocs(
currentMessage,
managedContext.summary,
budget.ragContext
);
// 4. Format everything
const conversationContext = this.contextManager.formatForPrompt(managedContext);
// 5. Calculate remaining budget
const usedTokens = countTokens([
this.systemPrompt,
ragContext,
conversationContext,
currentMessage
]);
const availableForResponse = budget.total - usedTokens;
// 6. Warn if tight
if (availableForResponse < 1000) {
console.warn('Context budget tight, consider starting new session');
}
return {
systemPrompt: this.systemPrompt,
ragContext,
conversationContext,
availableForResponse
};
}
private async retrieveRelevantDocs(
query: string,
conversationSummary: string | null,
tokenBudget: number
): Promise<string> {
// Combine current query with conversation context for better retrieval
const enhancedQuery = conversationSummary
? `${query}\n\nConversation context: ${conversationSummary}`
: query;
const docs = await vectorSearch(enhancedQuery, {
maxTokens: tokenBudget,
topK: 5
});
return docs.map(d => d.content).join('\n\n---\n\n');
}
}
Step 3: Graceful overflow handling
When context truly fills up, handle it gracefully.
interface OverflowHandling {
strategy: 'warn' | 'compress' | 'new_session';
threshold: number; // Percentage of budget used
}
class ContextOverflowHandler {
private thresholds = {
warn: 0.85,
compress: 0.95,
newSession: 0.98
};
async handleOverflow(
used: number,
total: number,
sessionId: string
): Promise<{ action: string; message?: string }> {
const utilisation = used / total;
if (utilisation >= this.thresholds.newSession) {
return {
action: 'new_session',
message: `This conversation has been quite detailed. Would you like me to summarise what we've covered and start a fresh session? I'll preserve all important decisions and context.`
};
}
if (utilisation >= this.thresholds.compress) {
// Force aggressive summarisation
await this.aggressiveCompress(sessionId);
return {
action: 'compressed',
message: `I've compressed our earlier conversation to make room. All important points are preserved.`
};
}
if (utilisation >= this.thresholds.warn) {
return {
action: 'warn',
message: `Just a note: we're approaching the limit of what I can hold in context for this conversation. If we need to continue much longer, I may need to summarise earlier parts.`
};
}
return { action: 'none' };
}
private async aggressiveCompress(sessionId: string): Promise<void> {
const history = await db.messages.findBySession(sessionId);
// Keep only high-importance messages verbatim
const highImportance = history.filter(m => m.metadata.importance === 'high');
// Summarise everything else
const toSummarise = history.filter(m => m.metadata.importance !== 'high');
const summary = await summariseHistory(toSummarise, 300);
// Store compressed version
await db.conversationSummaries.upsert({
sessionId,
summary,
preservedMessages: highImportance.map(m => m.id),
compressedAt: new Date()
});
}
}
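One way to wire the handler into a turn loop, assuming the builder and budget from the earlier steps and a hypothetical sendSystemNotice helper for surfacing the message to the user:
const handler = new ContextOverflowHandler();

// After building context for the turn, check how full we are
const built = await builder.buildContext(sessionId, userMessage, budget);
const usedTokens = budget.total - built.availableForResponse;

const overflow = await handler.handleOverflow(usedTokens, budget.total, sessionId);
if (overflow.action !== 'none' && overflow.message) {
  // Surface the warning alongside the normal reply
  await sendSystemNotice(sessionId, overflow.message);
}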
Step 4: Session continuity
When starting new sessions, carry over relevant context.
async function createContinuationSession(
previousSessionId: string,
userId: string
): Promise<{ sessionId: string; welcomeMessage: string }> {
// Get summary of previous session
const previousHistory = await db.messages.findBySession(previousSessionId);
const summary = await summariseHistory(previousHistory);
// Extract key points
const keyPoints = await extractKeyPoints(previousHistory);
// Create new session with context
const newSessionId = generateId();
// Store continuation context
await db.sessionContext.insert({
sessionId: newSessionId,
previousSessionId,
carryoverSummary: summary,
carryoverKeyPoints: keyPoints
});
const welcomeMessage = `I remember our previous conversation. Here's what we covered:
${keyPoints.map(p => `• ${p}`).join('\n')}
${summary.length > 200 ? `\nMore detail: ${summary.slice(0, 200)}...` : ''}
What would you like to continue with?`;
return { sessionId: newSessionId, welcomeMessage };
}
async function extractKeyPoints(history: Message[]): Promise<string[]> {
const response = await openai.chat.completions.create({
model: 'gpt-4o-mini',
messages: [
{
role: 'system',
content: 'Extract 3-5 key points from this conversation that would be essential to remember in a follow-up session. Focus on decisions, preferences, and outcomes. Respond with a JSON object of the form {"keyPoints": ["..."]}.'
},
{
role: 'user',
content: history.map(m => `${m.role}: ${m.content}`).join('\n')
}
],
response_format: { type: 'json_object' }
});
return JSON.parse(response.choices[0].message.content ?? '{}').keyPoints ?? [];
}
Monitoring and quality
Context management quality directly impacts conversation quality. Monitor both.
Key metrics
| Metric | Target | Indicates |
|---|---|---|
| Avg context utilisation | 60-80% | Budget sizing |
| Overflow events | <5% of sessions | Headroom adequacy |
| Summarisation accuracy | >85% | Summary quality |
| User repetition rate | <10% | Context preservation |
Monitoring implementation
const contextMetrics = {
recordContextBuild(sessionId: string, utilisation: number, components: {
systemPrompt: number;
ragContext: number;
history: number;
available: number;
}) {
metrics.gauge('context.utilisation', utilisation, { sessionId });
metrics.gauge('context.system_prompt_tokens', components.systemPrompt);
metrics.gauge('context.rag_tokens', components.ragContext);
metrics.gauge('context.history_tokens', components.history);
metrics.gauge('context.available_tokens', components.available);
if (utilisation > 0.9) {
metrics.increment('context.near_overflow');
}
},
recordSummarisation(originalTokens: number, summaryTokens: number) {
const compressionRatio = summaryTokens / originalTokens;
metrics.histogram('context.compression_ratio', compressionRatio);
}
};
FAQs
How do I handle tool call results in context?
Tool call results can be verbose. Summarise them before storing:
async function storeToolResult(result: any): Promise<string> {
const resultString = JSON.stringify(result);
if (countTokens(resultString) < 200) {
return resultString; // Keep short results verbatim
}
// Summarise long results (assumes a generic summarise(text, maxTokens)
// helper, analogous to summariseHistory above)
const summary = await summarise(resultString, 150);
return `[Tool result summary: ${summary}]`;
}
Should I include system messages in history?
Generally no. System prompts are injected fresh each turn. Including them in history wastes tokens and can cause instruction drift.
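If your storage layer keeps system messages, filter them out when rebuilding the prompt (storedMessages and systemPrompt below are assumed to exist in the calling code):
// Keep stored system messages out of the history block; the live system
// prompt is injected fresh on every turn
const historyForPrompt = storedMessages.filter(m => m.role !== 'system');
const promptMessages = [
  { role: 'system', content: systemPrompt },
  ...historyForPrompt.map(m => ({ role: m.role, content: m.content }))
];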
How do I handle multi-user conversations?
Track speaker identity and include in context:
const message = {
role: 'user',
content: `[${userName}]: ${content}`
};
What's the right summarisation frequency?
Summarise when history exceeds 30-40 messages or when context utilisation exceeds 70%. More frequent summarisation loses nuance; less frequent risks overflow.
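Expressed as a simple guard (a hypothetical helper, using the thresholds above):
// Summarise when either trigger fires: message count past the 30-40 range,
// or context utilisation above 70%
function shouldSummarise(messageCount: number, contextUtilisation: number): boolean {
  return messageCount > 35 || contextUtilisation > 0.7;
}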
Can I use smaller models for summarisation?
Yes, and you should. GPT-4o-mini or Claude Haiku handle summarisation well at a fraction of the cost. Reserve larger models for the main conversation.
Summary and next steps
Context management separates coherent conversational agents from forgetful ones. The hybrid approach - combining sliding windows with smart summarisation - handles most production scenarios effectively.
Implementation checklist:
- Define token budget allocation strategy
- Implement sliding window with first/last preservation
- Add summarisation for long conversations
- Build importance detection for message preservation
- Handle overflow gracefully with user communication
- Monitor utilisation and quality metrics
Quick wins:
- Add importance tagging to messages (decisions, commitments)
- Implement basic sliding window with first-2/last-10 preservation
- Add utilisation warnings at 85% threshold