TL;DR
- Agent memory splits into three tiers: working memory (current context window), short-term memory (session-level), and long-term memory (cross-session persistence).
- Use vector databases with semantic search for long-term memory retrieval - keyword matching fails when users phrase things differently.
- Implement memory consolidation to move important short-term memories to long-term storage; let trivial context decay.
- Production systems need memory quotas per user/org to prevent unbounded storage growth.
Jump to: Why agents need memory · Memory architecture patterns · Implementation guide · Production considerations
Agent Memory Architecture: Building Persistent Context Systems That Scale
Stateless agents forget everything the moment a conversation ends. Ask an agent to remember your preference for British English spelling on Monday, and by Tuesday it has no idea. This isn't just annoying - it wastes tokens re-explaining context and makes agents feel robotic rather than collaborative.
Agent memory systems solve this by persisting relevant context across sessions, enabling agents to build understanding over time. Done well, memory transforms agents from single-use tools into genuine assistants that know your codebase, your preferences, and your business context.
This guide covers memory architecture patterns we've deployed in production, including the specific trade-offs between retrieval accuracy, storage costs, and latency. We'll build a working implementation step by step.
Key takeaways
- Memory is not just "saving chat history" - it requires intentional design around what to remember, how long to keep it, and when to surface it.
- Three-tier architecture (working, short-term, long-term) maps to how humans actually process information.
- Retrieval quality matters more than storage volume - 100 well-indexed memories outperform 10,000 poorly organised ones.
- Memory systems need maintenance: consolidation, decay, and quota management prevent unbounded growth.
Why agents need memory
Consider a product manager using an AI agent for competitive research. On day one, they explain their market segment, key competitors, and what signals matter. Without memory, day two starts from zero. The agent doesn't remember that Competitor X launched a new feature last week, or that the PM prefers bullet points over paragraphs.
Memory enables three capabilities that stateless agents cannot achieve:
1. Preference learning
Users have implicit preferences: communication style, formatting choices, areas of focus. An agent with memory adapts to these over time rather than requiring explicit instruction every session.
Example: After three interactions where a user edits agent outputs to remove emojis, a memory-enabled agent stores this preference and stops including them.
2. Contextual continuity
Business workflows span multiple sessions. A sales agent tracking deal progress needs to remember previous conversations, commitments made, and follow-up actions - not just within a single chat, but across weeks.
3. Knowledge accumulation
Agents can build domain expertise specific to your organisation. When you correct an agent's misunderstanding about your product architecture, that correction should persist for future queries.
Research from Stanford's Human-Centered AI group found that memory-enabled assistants reduced task completion time by 34% compared to stateless alternatives, primarily through reduced context re-establishment (HAI, 2024).
Memory architecture patterns
Human memory isn't a single system - it comprises working memory, short-term memory, and long-term memory with different characteristics. Agent memory benefits from similar stratification.
Three-tier memory model
Tier 1: Working Memory
The agent's current context window. Limited by model constraints (around 128K tokens for GPT-4o, 200K for Claude models). This is what the agent actively reasons about in a single turn.
- Scope: Current conversation turn
- Capacity: Model context limit
- Persistence: None (cleared after response)
- Retrieval: Automatic (in prompt)
Tier 2: Short-term Memory
Session-level context that persists within a conversation but not across sessions. Includes recent messages, intermediate results, and temporary state.
- Scope: Current session/conversation
- Capacity: Configurable (typically 50-200 messages)
- Persistence: Session duration
- Retrieval: Sliding window, summarisation
Tier 3: Long-term Memory
Cross-session persistence for facts, preferences, and important interactions. Requires explicit storage and retrieval mechanisms.
- Scope: All time
- Capacity: Storage-bound (quotas recommended)
- Persistence: Indefinite until deleted
- Retrieval: Semantic search, explicit queries
| Memory Tier | Latency | Accuracy | Cost | Use Case |
|---|---|---|---|---|
| Working | 0ms | 100% | High (tokens) | Current reasoning |
| Short-term | 5-20ms | 95%+ | Low | Session context |
| Long-term | 50-200ms | 80-95% | Medium | Cross-session knowledge |
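To make the tiers concrete, here's a minimal configuration sketch - the type names and values are illustrative assumptions, not an API from any particular framework:

// Illustrative tier configuration - names and defaults are assumptions, not a library API
type MemoryTier = 'working' | 'short_term' | 'long_term';

interface TierConfig {
  tier: MemoryTier;
  persistence: 'none' | 'session' | 'indefinite';
  maxItems?: number; // undefined = bounded by the model context or a storage quota
  retrieval: 'in_prompt' | 'sliding_window' | 'semantic_search';
}

const TIERS: TierConfig[] = [
  { tier: 'working', persistence: 'none', retrieval: 'in_prompt' },
  { tier: 'short_term', persistence: 'session', maxItems: 200, retrieval: 'sliding_window' },
  { tier: 'long_term', persistence: 'indefinite', retrieval: 'semantic_search' },
];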
Memory types within long-term storage
Long-term memory further subdivides into semantic, episodic, and procedural memory:
Semantic memory: Facts and knowledge. "The user works at Acme Corp." "Their product uses Next.js."
Episodic memory: Specific interactions and events. "On 15 November, we discussed the Q4 roadmap." "User mentioned competitor launched feature X."
Procedural memory: Learned behaviours and preferences. "User prefers concise responses." "Always include code examples for technical questions."
Each type benefits from different storage and retrieval approaches, which we'll cover in implementation.
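As a quick illustration (the contents are invented for the example), a single conversation can yield all three types:

// Example extraction output - one memory of each long-term type
const extracted = [
  { type: 'semantic', content: 'User is a product manager at Acme Corp tracking the analytics market', importance: 0.8 },
  { type: 'episodic', content: 'On 15 November, user asked for a summary of Competitor X\'s new feature launch', importance: 0.6 },
  { type: 'procedural', content: 'User prefers bullet-point summaries without emojis', importance: 0.7 },
];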
Implementation guide
We'll build a production-ready memory system using TypeScript, PostgreSQL with pgvector, and OpenAI embeddings. The same patterns apply regardless of your specific stack.
Step 1: Database schema design
Start with a schema that supports all three memory types and efficient retrieval.
-- Enable vector extension
CREATE EXTENSION IF NOT EXISTS vector;
-- Core memories table
CREATE TABLE agent_memories (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
org_id TEXT NOT NULL,
user_id TEXT NOT NULL,
-- Memory classification
memory_type TEXT NOT NULL CHECK (memory_type IN ('semantic', 'episodic', 'procedural')),
-- Content
content TEXT NOT NULL,
embedding vector(1536), -- OpenAI text-embedding-3-small dimensions
-- Metadata for retrieval and filtering
importance FLOAT DEFAULT 0.5 CHECK (importance >= 0 AND importance <= 1),
access_count INTEGER DEFAULT 0,
last_accessed_at TIMESTAMPTZ,
-- Source tracking
source_type TEXT, -- 'conversation', 'document', 'user_correction'
source_id TEXT,
-- Timestamps
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW(),
expires_at TIMESTAMPTZ -- For automatic decay
);
-- Index for semantic search
CREATE INDEX ON agent_memories
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
-- Index for filtering
CREATE INDEX ON agent_memories (org_id, user_id, memory_type);
CREATE INDEX ON agent_memories (importance DESC);
CREATE INDEX ON agent_memories (expires_at) WHERE expires_at IS NOT NULL;
Why this schema works:
- Separate importance scoring: Not all memories are equal. A user's name matters more than what they asked about last Tuesday.
- Access tracking: Frequently accessed memories are likely important; this feeds into consolidation logic.
- Expiry support: Procedural memories might update, episodic memories might decay.
- Source tracking: Knowing where a memory came from helps with trust and updates.
Step 2: Memory creation pipeline
When should agents create memories? Not every message deserves persistence. Implement extraction logic that identifies memorable content.
import { OpenAI } from 'openai';
interface ExtractedMemory {
content: string;
type: 'semantic' | 'episodic' | 'procedural';
importance: number;
}
async function extractMemories(
conversation: Message[],
existingMemories: Memory[]
): Promise<ExtractedMemory[]> {
const openai = new OpenAI();
const response = await openai.chat.completions.create({
model: 'gpt-4o',
messages: [
{
role: 'system',
content: `Analyse this conversation and extract memories worth persisting.
Types:
- semantic: Facts about the user, their work, their organisation
- episodic: Specific events, decisions, or notable interactions
- procedural: Preferences, communication style, learned behaviours
Only extract genuinely useful memories. Skip:
- Trivial small talk
- One-off questions unlikely to recur
- Information already in existing memories
Existing memories (don't duplicate):
${existingMemories.map(m => `- ${m.content}`).join('\n')}
Return a JSON object: { "memories": [{ "content": string, "type": string, "importance": number (0-1) }] }`
},
{
role: 'user',
content: conversation.map(m => `${m.role}: ${m.content}`).join('\n')
}
],
response_format: { type: 'json_object' }
});
return JSON.parse(response.choices[0].message.content ?? '{"memories": []}').memories ?? [];
}
Extraction triggers:
Run memory extraction at conversation end, not after every message. Real-time extraction burns tokens and creates noise. Batch processing lets you identify patterns across the full conversation.
async function onConversationEnd(
  conversationId: string,
  userId: string,
  messages: Message[]
) {
// Get existing memories for deduplication
const existing = await getMemories(userId, { limit: 100 });
// Extract new memories
const newMemories = await extractMemories(messages, existing);
// Generate embeddings and store
for (const memory of newMemories) {
const embedding = await generateEmbedding(memory.content);
await storeMemory({
...memory,
embedding,
sourceType: 'conversation',
sourceId: conversationId
});
}
}
Step 3: Memory retrieval system
Retrieval quality determines whether memory helps or hurts. Poor retrieval surfaces irrelevant context, wasting tokens and confusing the agent.
interface RetrievalOptions {
query: string;
userId: string;
orgId: string;
limit?: number;
minImportance?: number;
types?: MemoryType[];
recencyBoost?: boolean;
}
async function retrieveMemories(
options: RetrievalOptions
): Promise<Memory[]> {
const {
query,
userId,
orgId,
limit = 10,
minImportance = 0.3,
types,
recencyBoost = true
} = options;
// Generate query embedding
const queryEmbedding = await generateEmbedding(query);
// Build retrieval query with hybrid scoring
const memories = await db.query(`
WITH semantic_matches AS (
SELECT
*,
1 - (embedding <=> $1) AS similarity
FROM agent_memories
WHERE org_id = $2
AND user_id = $3
AND importance >= $4
AND ($5::text[] IS NULL OR memory_type = ANY($5::text[]))
ORDER BY embedding <=> $1
LIMIT $6
)
SELECT
*,
-- Combined score: similarity + importance + recency
similarity * 0.6
+ importance * 0.25
+ ${recencyBoost ? `(1.0 / (1 + EXTRACT(EPOCH FROM NOW() - created_at) / 86400)) * 0.15` : '0'}
AS relevance_score
FROM semantic_matches
ORDER BY relevance_score DESC
`, [queryEmbedding, orgId, userId, minImportance, types ?? null, limit * 2]);
// Update access tracking
await updateAccessCounts(memories.map(m => m.id));
return memories.slice(0, limit);
}
Hybrid scoring explained:
Pure semantic similarity isn't enough. A memory from yesterday about the current project should rank higher than a memory from six months ago with similar vector distance.
Our scoring weights:
- Similarity (60%): Semantic relevance to current query
- Importance (25%): Pre-computed significance score
- Recency (15%): Decay function for older memories
Tune these weights based on your use case. Customer support might weight recency higher; knowledge bases might weight importance higher.
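If you'd rather tune these weights in application code than in SQL, the same score can be sketched as a small helper that mirrors the query above:

// Mirrors the hybrid score computed in the SQL query, with the weights as tunable constants
interface ScoringWeights {
  similarity: number;
  importance: number;
  recency: number;
}

const DEFAULT_WEIGHTS: ScoringWeights = { similarity: 0.6, importance: 0.25, recency: 0.15 };

function relevanceScore(
  similarity: number, // 1 - cosine distance, in [0, 1]
  importance: number, // stored importance score, in [0, 1]
  ageDays: number,    // days since the memory was created
  w: ScoringWeights = DEFAULT_WEIGHTS
): number {
  const recency = 1 / (1 + ageDays); // same decay curve as the SQL (epoch seconds / 86400)
  return similarity * w.similarity + importance * w.importance + recency * w.recency;
}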
Step 4: Context injection
Retrieved memories need formatting that helps the agent use them effectively.
function formatMemoriesForContext(
memories: Memory[],
query: string
): string {
if (memories.length === 0) return '';
const grouped = groupBy(memories, 'memory_type');
let context = `## Relevant Context from Memory\n\n`;
if (grouped.semantic?.length) {
context += `### Facts\n`;
context += grouped.semantic
.map(m => `- ${m.content}`)
.join('\n');
context += '\n\n';
}
if (grouped.procedural?.length) {
context += `### User Preferences\n`;
context += grouped.procedural
.map(m => `- ${m.content}`)
.join('\n');
context += '\n\n';
}
if (grouped.episodic?.length) {
context += `### Previous Interactions\n`;
context += grouped.episodic
.map(m => `- [${formatDate(m.created_at)}] ${m.content}`)
.join('\n');
context += '\n\n';
}
return context;
}
// Usage in agent prompt construction
async function buildAgentPrompt(
  userMessage: string,
  conversationHistory: Message[],
  userId: string,
  orgId: string,
  systemPrompt: string
): Promise<string> {
const memories = await retrieveMemories({
query: userMessage,
userId,
orgId,
limit: 15
});
const memoryContext = formatMemoriesForContext(memories, userMessage);
return `${systemPrompt}
${memoryContext}
${formatConversationHistory(conversationHistory)}
User: ${userMessage}`;
}
Step 5: Memory maintenance
Without maintenance, memory systems bloat with stale, redundant, or low-value content. Implement three maintenance processes:
Consolidation: Merge similar memories and increase importance of frequently accessed ones.
async function consolidateMemories(userId: string) {
// Find similar memories that could merge
const candidates = await db.query(`
SELECT
m1.id as id1,
m2.id as id2,
m1.content as content1,
m2.content as content2,
1 - (m1.embedding <=> m2.embedding) as similarity
FROM agent_memories m1
JOIN agent_memories m2
ON m1.user_id = m2.user_id
AND m1.id < m2.id
AND m1.memory_type = m2.memory_type
WHERE m1.user_id = $1
AND 1 - (m1.embedding <=> m2.embedding) > 0.92
`, [userId]);
  for (const pair of candidates) {
    // Use an LLM to merge the two contents into a single memory
    const merged = await mergeMemories(pair.content1, pair.content2);

    // Re-embed the merged content so retrieval stays accurate
    const mergedEmbedding = await generateEmbedding(merged);

    // Keep the higher importance, combine access counts
    await db.query(`
      UPDATE agent_memories
      SET content = $1,
          embedding = $2,
          importance = GREATEST(importance, (
            SELECT importance FROM agent_memories WHERE id = $3
          )),
          access_count = access_count + (
            SELECT access_count FROM agent_memories WHERE id = $3
          )
      WHERE id = $4
    `, [merged, mergedEmbedding, pair.id2, pair.id1]);

    await db.query(`DELETE FROM agent_memories WHERE id = $1`, [pair.id2]);
}
}
Decay: Reduce importance of untouched memories over time; delete expired ones.
async function applyMemoryDecay() {
// Reduce importance of memories not accessed in 30+ days
await db.query(`
UPDATE agent_memories
SET importance = importance * 0.9
WHERE last_accessed_at < NOW() - INTERVAL '30 days'
AND importance > 0.1
`);
// Delete expired memories
await db.query(`
DELETE FROM agent_memories
WHERE expires_at < NOW()
`);
// Delete low-importance, old, unaccessed memories
await db.query(`
DELETE FROM agent_memories
WHERE importance < 0.2
AND access_count < 2
AND created_at < NOW() - INTERVAL '90 days'
`);
}
Quota enforcement: Prevent runaway storage per user/org.
const MEMORY_QUOTA_PER_USER = 500;
async function enforceQuota(userId: string) {
  // COUNT(*) comes back as a string, so alias and cast it before comparing
  const [{ count }] = await db.query(
    `SELECT COUNT(*)::int AS count FROM agent_memories WHERE user_id = $1`,
    [userId]
  );

  if (count > MEMORY_QUOTA_PER_USER) {
// Delete lowest-value memories until under quota
await db.query(`
DELETE FROM agent_memories
WHERE id IN (
SELECT id FROM agent_memories
WHERE user_id = $1
ORDER BY importance ASC, access_count ASC, created_at ASC
LIMIT $2
)
`, [userId, count - MEMORY_QUOTA_PER_USER]);
}
}
Production considerations
Memory systems introduce failure modes that don't exist in stateless agents.
Retrieval latency budget
Memory retrieval adds latency to every agent response. Set a budget and enforce it.
async function retrieveWithTimeout(
  options: RetrievalOptions,
  timeoutMs: number = 200
): Promise<Memory[]> {
  let timer: ReturnType<typeof setTimeout> | undefined;

  // Resolve with an empty result if the latency budget is exceeded
  const timeout = new Promise<Memory[]>((resolve) => {
    timer = setTimeout(() => {
      console.warn('Memory retrieval timed out, proceeding without memories');
      resolve([]);
    }, timeoutMs);
  });

  try {
    return await Promise.race([retrieveMemories(options), timeout]);
  } catch (error) {
    console.warn('Memory retrieval failed, proceeding without memories', error);
    return [];
  } finally {
    clearTimeout(timer);
  }
}
At Athenic, we set a 150ms retrieval budget. If memory lookup takes longer, we proceed without it rather than blocking the user.
Privacy and data retention
Memories may contain sensitive information. Implement:
- User deletion: When users request data deletion, purge all their memories.
- Retention policies: Automatically expire memories after configurable periods.
- Access controls: Memories should respect the same permissions as the source data.
async function deleteUserMemories(userId: string) {
await db.query(`DELETE FROM agent_memories WHERE user_id = $1`, [userId]);
}
async function setRetentionPolicy(
orgId: string,
maxAgeDays: number
) {
await db.query(`
UPDATE agent_memories
SET expires_at = created_at + INTERVAL '1 day' * $2
WHERE org_id = $1 AND expires_at IS NULL
`, [orgId, maxAgeDays]);
}
Memory accuracy and trust
Agents can misremember. Retrieved memories might be outdated or incorrectly extracted. Build in correction mechanisms:
// Allow users to view and correct memories
async function getUserMemories(userId: string): Promise<Memory[]> {
return db.query(`
SELECT id, content, memory_type, importance, created_at
FROM agent_memories
WHERE user_id = $1
ORDER BY importance DESC, created_at DESC
`, [userId]);
}
async function correctMemory(
memoryId: string,
newContent: string,
userId: string
) {
// Verify ownership
  const [memory] = await db.query(
    `SELECT id FROM agent_memories WHERE id = $1 AND user_id = $2`,
    [memoryId, userId]
  );
  if (!memory) throw new Error('Memory not found');
// Update content and re-embed
const embedding = await generateEmbedding(newContent);
await db.query(`
UPDATE agent_memories
SET content = $1, embedding = $2, updated_at = NOW()
WHERE id = $3
`, [newContent, embedding, memoryId]);
}
Monitoring and debugging
Track memory system health:
| Metric | Target | Alert Threshold |
|---|---|---|
| Retrieval p50 latency | <100ms | >200ms |
| Retrieval p99 latency | <300ms | >500ms |
| Memory hit rate | >70% | <50% |
| Avg memories per user | 50-200 | >400 |
| Storage per org | <1GB | >5GB |
// Log retrieval metrics
async function instrumentedRetrieval(
options: RetrievalOptions
): Promise<Memory[]> {
const start = Date.now();
const memories = await retrieveMemories(options);
const duration = Date.now() - start;
metrics.histogram('memory.retrieval_duration_ms', duration);
metrics.increment('memory.retrieval_count');
metrics.gauge('memory.results_returned', memories.length);
if (memories.length === 0) {
metrics.increment('memory.retrieval_miss');
}
return memories;
}
Real-world case study: Customer success agent
We deployed memory architecture for a customer success agent handling renewal conversations. Before memory, agents asked customers to re-explain their use case every call. After memory:
Setup:
- Semantic memories for customer context (company size, use cases, key contacts)
- Episodic memories for interaction history (previous calls, feature requests, issues)
- Procedural memories for communication preferences
Results over 6 months:
- Customer satisfaction (CSAT) improved from 4.1 to 4.6 out of 5
- Average call duration dropped 23% (less context re-establishment)
- Renewal rate increased 8% (agents referenced previous value discussions)
- Memory retrieval added 85ms p50 latency (acceptable)
Lesson learned: The biggest win wasn't fact recall - it was the agent remembering customer frustrations and proactively addressing them. "Last time you mentioned the dashboard was slow - we've shipped three performance updates since then" closed more renewals than any feature pitch.
FAQs
How much storage should I budget per user?
Start with 500 memories per user, approximately 2-5MB including embeddings. Monitor actual usage and adjust. Power users might need 1,000+; casual users might only generate 50.
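For a rough back-of-envelope check, assuming 1536-dimension float4 embeddings and around 300 bytes of text and metadata per memory (both figures are assumptions to verify against your own data):

// Per-user storage estimate - the content size is an assumed average
const EMBEDDING_BYTES = 1536 * 4;   // ~6 KB per pgvector embedding
const AVG_CONTENT_BYTES = 300;      // assumed average memory text + metadata
const MEMORIES_PER_USER = 500;

const bytesPerUser = MEMORIES_PER_USER * (EMBEDDING_BYTES + AVG_CONTENT_BYTES);
console.log(`${(bytesPerUser / 1024 / 1024).toFixed(1)} MB per user`); // ≈ 3.1 MB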
Should I store raw conversation history or extracted memories?
Both. Store raw history for compliance and debugging, but retrieve from extracted memories for agent context. Raw history is too noisy for effective retrieval.
How do I handle memory conflicts?
When new information contradicts stored memories (e.g., user changes jobs), the new information wins. Implement update-on-conflict logic in extraction, or let users manually correct.
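One way to sketch update-on-conflict logic, reusing the helpers from the implementation guide (upsertMemory is a hypothetical name; the 0.92 similarity threshold is borrowed from the consolidation step):

// If a new memory closely matches an existing one of the same type, overwrite it instead of inserting
async function upsertMemory(userId: string, candidate: ExtractedMemory) {
  const embedding = await generateEmbedding(candidate.content);

  // Find the nearest existing memory of the same type
  const [match] = await db.query(`
    SELECT id, 1 - (embedding <=> $1) AS similarity
    FROM agent_memories
    WHERE user_id = $2 AND memory_type = $3
    ORDER BY embedding <=> $1
    LIMIT 1
  `, [embedding, userId, candidate.type]);

  if (match && match.similarity > 0.92) {
    // New information wins: replace the content and refresh the embedding
    await db.query(`
      UPDATE agent_memories
      SET content = $1, embedding = $2, updated_at = NOW()
      WHERE id = $3
    `, [candidate.content, embedding, match.id]);
  } else {
    await storeMemory({ ...candidate, embedding, sourceType: 'conversation' });
  }
}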
Can I share memories across agents?
Yes, if they serve the same user. Use org_id/user_id scoping. Cross-user memory sharing requires explicit consent and careful access control.
What's the right embedding model?
OpenAI's text-embedding-3-small offers a strong balance of quality and cost for most cases. Cohere Embed V4 excels for multilingual deployments. Avoid older models like text-embedding-ada-002.
Summary and next steps
Agent memory transforms single-use tools into persistent collaborators. The three-tier architecture (working, short-term, long-term) mirrors human cognition and provides clear implementation boundaries.
Key implementation steps:
- Design schema with importance scoring and expiry support
- Build extraction pipeline that identifies genuinely memorable content
- Implement hybrid retrieval combining similarity, importance, and recency
- Add maintenance processes: consolidation, decay, quota enforcement
- Instrument for latency and hit rate monitoring
Next steps:
- Review your current agent architecture for memory integration points
- Start with semantic memory for user facts - it's the highest-value, lowest-complexity memory type
- Implement quota enforcement before you have a storage problem, not after
- Build user-facing memory management to enable corrections and deletions