TL;DR
- Agent memory splits into three tiers: working memory (current context window), short-term memory (session-level), and long-term memory (cross-session persistence).
- Use vector databases with semantic search for long-term memory retrieval - keyword matching fails when users phrase things differently.
- Implement memory consolidation to move important short-term memories to long-term storage; let trivial context decay.
- Production systems need memory quotas per user/org to prevent unbounded storage growth.
Jump to: Why agents need memory · Memory architecture patterns · Implementation guide · Production considerations
Agent Memory Architecture: Building Persistent Context Systems That Scale
Stateless agents forget everything the moment a conversation ends. Ask an agent to remember your preference for British English spelling on Monday, and by Tuesday it has no idea. This isn't just annoying - it wastes tokens re-explaining context and makes agents feel robotic rather than collaborative.
Agent memory systems solve this by persisting relevant context across sessions, enabling agents to build understanding over time. Done well, memory transforms agents from single-use tools into genuine assistants that know your codebase, your preferences, and your business context.
This guide covers memory architecture patterns we've deployed in production, including the specific trade-offs between retrieval accuracy, storage costs, and latency. We'll build a working implementation step by step.
Key takeaways
- Memory is not just "saving chat history" - it requires intentional design around what to remember, how long to keep it, and when to surface it.
- Three-tier architecture (working, short-term, long-term) maps to how humans actually process information.
- Retrieval quality matters more than storage volume - 100 well-indexed memories outperform 10,000 poorly organised ones.
- Memory systems need maintenance: consolidation, decay, and quota management prevent unbounded growth.
Why agents need memory
Consider a product manager using an AI agent for competitive research. On day one, they explain their market segment, key competitors, and what signals matter. Without memory, day two starts from zero. The agent doesn't remember that Competitor X launched a new feature last week, or that the PM prefers bullet points over paragraphs.
Memory enables three capabilities that stateless agents cannot achieve:
1. Preference learning
Users have implicit preferences: communication style, formatting choices, areas of focus. An agent with memory adapts to these over time rather than requiring explicit instruction every session.
Example: After three interactions where a user edits agent outputs to remove emojis, a memory-enabled agent stores this preference and stops including them.
2. Contextual continuity
Business workflows span multiple sessions. A sales agent tracking deal progress needs to remember previous conversations, commitments made, and follow-up actions - not just within a single chat, but across weeks.
3. Knowledge accumulation
Agents can build domain expertise specific to your organisation. When you correct an agent's misunderstanding about your product architecture, that correction should persist for future queries.
Research from Stanford's Human-Centered AI group found that memory-enabled assistants reduced task completion time by 34% compared to stateless alternatives, primarily through reduced context re-establishment (HAI, 2024).
Memory architecture patterns
Human memory isn't a single system - it comprises working memory, short-term memory, and long-term memory with different characteristics. Agent memory benefits from similar stratification.
Three-tier memory model
Tier 1: Working Memory
The agent's current context window. Limited by model constraints (around 128K tokens for GPT-4o, 200K for Claude models). This is what the agent actively reasons about in a single turn.
- Scope: Current conversation turn
- Capacity: Model context limit
- Persistence: None (cleared after response)
- Retrieval: Automatic (in prompt)
Tier 2: Short-term Memory
Session-level context that persists within a conversation but not across sessions. Includes recent messages, intermediate results, and temporary state.
- Scope: Current session/conversation
- Capacity: Configurable (typically 50-200 messages)
- Persistence: Session duration
- Retrieval: Sliding window, summarisation
Tier 3: Long-term Memory
Cross-session persistence for facts, preferences, and important interactions. Requires explicit storage and retrieval mechanisms.
- Scope: All time
- Capacity: Storage-bound (quotas recommended)
- Persistence: Indefinite until deleted
- Retrieval: Semantic search, explicit queries
| Memory Tier | Latency | Accuracy | Cost | Use Case |
|---|---|---|---|---|
| Working | 0ms | 100% | High (tokens) | Current reasoning |
| Short-term | 5-20ms | 95%+ | Low | Session context |
| Long-term | 50-200ms | 80-95% | Medium | Cross-session knowledge |
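To make the tiers concrete, here's a minimal configuration sketch - the type names and values are illustrative assumptions, not an API from any particular framework:

// Illustrative tier configuration - names and defaults are assumptions, not a library API
type MemoryTier = 'working' | 'short_term' | 'long_term';

interface TierConfig {
  tier: MemoryTier;
  persistence: 'none' | 'session' | 'indefinite';
  maxItems?: number; // undefined = bounded by the model context or a storage quota
  retrieval: 'in_prompt' | 'sliding_window' | 'semantic_search';
}

const TIERS: TierConfig[] = [
  { tier: 'working', persistence: 'none', retrieval: 'in_prompt' },
  { tier: 'short_term', persistence: 'session', maxItems: 200, retrieval: 'sliding_window' },
  { tier: 'long_term', persistence: 'indefinite', retrieval: 'semantic_search' },
];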
Memory types within long-term storage
Long-term memory further subdivides into semantic, episodic, and procedural memory:
Semantic memory: Facts and knowledge. "The user works at Acme Corp." "Their product uses Next.js."
Episodic memory: Specific interactions and events. "On 15 November, we discussed the Q4 roadmap." "User mentioned competitor launched feature X."
Procedural memory: Learned behaviours and preferences. "User prefers concise responses." "Always include code examples for technical questions."
Each type benefits from different storage and retrieval approaches, which we'll cover in implementation.
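As a quick illustration (the contents are invented for the example), a single conversation can yield all three types:

// Example extraction output - one memory of each long-term type
const extracted = [
  { type: 'semantic', content: 'User is a product manager at Acme Corp tracking the analytics market', importance: 0.8 },
  { type: 'episodic', content: 'On 15 November, user asked for a summary of Competitor X\'s new feature launch', importance: 0.6 },
  { type: 'procedural', content: 'User prefers bullet-point summaries without emojis', importance: 0.7 },
];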
Implementation guide
We'll build a production-ready memory system using TypeScript, PostgreSQL with pgvector, and OpenAI embeddings. The same patterns apply regardless of your specific stack.
Step 1: Database schema design
Start with a schema that supports all three memory types and efficient retrieval.
-- Enable vector extension
CREATE EXTENSION IF NOT EXISTS vector;
-- Core memories table
CREATE TABLE agent_memories (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
org_id TEXT NOT NULL,
user_id TEXT NOT NULL,
-- Memory classification
memory_type TEXT NOT NULL CHECK (memory_type IN ('semantic', 'episodic', 'procedural')),
-- Content
content TEXT NOT NULL,
embedding vector(1536), -- OpenAI text-embedding-3-small dimensions
-- Metadata for retrieval and filtering
importance FLOAT DEFAULT 0.5 CHECK (importance >= 0 AND importance <= 1),
access_count INTEGER DEFAULT 0,
last_accessed_at TIMESTAMPTZ,
-- Source tracking
source_type TEXT, -- 'conversation', 'document', 'user_correction'
source_id TEXT,
-- Timestamps
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW(),
expires_at TIMESTAMPTZ -- For automatic decay
);
-- Index for semantic search
CREATE INDEX ON agent_memories
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
-- Index for filtering
CREATE INDEX ON agent_memories (org_id, user_id, memory_type);
CREATE INDEX ON agent_memories (importance DESC);
CREATE INDEX ON agent_memories (expires_at) WHERE expires_at IS NOT NULL;
Why this schema works:
- Separate importance scoring: Not all memories are equal. A user's name matters more than what they asked about last Tuesday.
- Access tracking: Frequently accessed memories are likely important; this feeds into consolidation logic.
- Expiry support: Procedural memories might update, episodic memories might decay.
- Source tracking: Knowing where a memory came from helps with trust and updates.
Step 2: Memory creation pipeline
When should agents create memories? Not every message deserves persistence. Implement extraction logic that identifies memorable content.
import { OpenAI } from 'openai';
interface ExtractedMemory {
content: string;
type: 'semantic' | 'episodic' | 'procedural';
importance: number;
}
async function extractMemories(
conversation: Message[],
existingMemories: Memory[]
): Promise<ExtractedMemory[]> {
const openai = new OpenAI();
const response = await openai.chat.completions.create({
model: 'gpt-4o',
messages: [
{
role: 'system',
content: `Analyse this conversation and extract memories worth persisting.
Types:
- semantic: Facts about the user, their work, their organisation
- episodic: Specific events, decisions, or notable interactions
- procedural: Preferences, communication style, learned behaviours
Only extract genuinely useful memories. Skip:
- Trivial small talk
- One-off questions unlikely to recur
- Information already in existing memories
Existing memories (don't duplicate):
${existingMemories.map(m => `- ${m.content}`).join('\n')}
Return a JSON object: { "memories": [{ "content": string, "type": string, "importance": number (0-1) }] }`
},
{
role: 'user',
content: conversation.map(m => `${m.role}: ${m.content}`).join('\n')
}
],
response_format: { type: 'json_object' }
});
return JSON.parse(response.choices[0].message.content ?? '{"memories": []}').memories ?? [];
}
Extraction triggers:
Run memory extraction at conversation end, not after every message. Real-time extraction burns tokens and creates noise. Batch processing lets you identify patterns across the full conversation.
async function onConversationEnd(
  conversationId: string,
  userId: string,
  messages: Message[]
) {
// Get existing memories for deduplication
const existing = await getMemories(userId, { limit: 100 });
// Extract new memories
const newMemories = await extractMemories(messages, existing);
// Generate embeddings and store
for (const memory of newMemories) {
const embedding = await generateEmbedding(memory.content);
await storeMemory({
...memory,
embedding,
sourceType: 'conversation',
sourceId: conversationId
});
}
}
Step 3: Memory retrieval system
Retrieval quality determines whether memory helps or hurts. Poor retrieval surfaces irrelevant context, wasting tokens and confusing the agent.
interface RetrievalOptions {
query: string;
userId: string;
orgId: string;
limit?: number;
minImportance?: number;
types?: MemoryType[];
recencyBoost?: boolean;
}
async function retrieveMemories(
options: RetrievalOptions
): Promise<Memory[]> {
const {
query,
userId,
orgId,
limit = 10,
minImportance = 0.3,
types,
recencyBoost = true
} = options;
// Generate query embedding
const queryEmbedding = await generateEmbedding(query);
// Build retrieval query with hybrid scoring
const memories = await db.query(`
WITH semantic_matches AS (
SELECT
*,
1 - (embedding <=> $1) AS similarity
FROM agent_memories
WHERE org_id = $2
AND user_id = $3
AND importance >= $4
AND ($5::text[] IS NULL OR memory_type = ANY($5::text[]))
ORDER BY embedding <=> $1
LIMIT $6
)
SELECT
*,
-- Combined score: similarity + importance + recency
similarity * 0.6
+ importance * 0.25
+ ${recencyBoost ? `(1.0 / (1 + EXTRACT(EPOCH FROM NOW() - created_at) / 86400)) * 0.15` : '0'}
AS relevance_score
FROM semantic_matches
ORDER BY relevance_score DESC
`, [queryEmbedding, orgId, userId, minImportance, types ?? null, limit * 2]);
// Update access tracking
await updateAccessCounts(memories.map(m => m.id));
return memories.slice(0, limit);
}
Hybrid scoring explained:
Pure semantic similarity isn't enough. A memory from yesterday about the current project should rank higher than a memory from six months ago with similar vector distance.
Our scoring weights:
- Similarity (60%): Semantic relevance to current query
- Importance (25%): Pre-computed significance score
- Recency (15%): Decay function for older memories
Tune these weights based on your use case. Customer support might weight recency higher; knowledge bases might weight importance higher.
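If you'd rather tune these weights in application code than in SQL, the same score can be sketched as a small helper that mirrors the query above:

// Mirrors the hybrid score computed in the SQL query, with the weights as tunable constants
interface ScoringWeights {
  similarity: number;
  importance: number;
  recency: number;
}

const DEFAULT_WEIGHTS: ScoringWeights = { similarity: 0.6, importance: 0.25, recency: 0.15 };

function relevanceScore(
  similarity: number, // 1 - cosine distance, in [0, 1]
  importance: number, // stored importance score, in [0, 1]
  ageDays: number,    // days since the memory was created
  w: ScoringWeights = DEFAULT_WEIGHTS
): number {
  const recency = 1 / (1 + ageDays); // same decay curve as the SQL (epoch seconds / 86400)
  return similarity * w.similarity + importance * w.importance + recency * w.recency;
}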
Step 4: Context injection
Retrieved memories need formatting that helps the agent use them effectively.
function formatMemoriesForContext(
memories: Memory[],
query: string
): string {
if (memories.length === 0) return '';
const grouped = groupBy(memories, 'memory_type');
let context = `## Relevant Context from Memory\n\n`;
if (grouped.semantic?.length) {
context += `### Facts\n`;
context += grouped.semantic
.map(m => `- ${m.content}`)
.join('\n');
context += '\n\n';
}
if (grouped.procedural?.length) {
context += `### User Preferences\n`;
context += grouped.procedural
.map(m => `- ${m.content}`)
.join('\n');
context += '\n\n';
}
if (grouped.episodic?.length) {
context += `### Previous Interactions\n`;
context += grouped.episodic
.map(m => `- [${formatDate(m.created_at)}] ${m.content}`)
.join('\n');
context += '\n\n';
}
return context;
}
// Usage in agent prompt construction
async function buildAgentPrompt(
  userMessage: string,
  conversationHistory: Message[],
  userId: string,
  orgId: string,
  systemPrompt: string
): Promise<string> {
const memories = await retrieveMemories({
query: userMessage,
userId,
orgId,
limit: 15
});
const memoryContext = formatMemoriesForContext(memories, userMessage);
return `${systemPrompt}
${memoryContext}
${formatConversationHistory(conversationHistory)}
User: ${userMessage}`;
}
Step 5: Memory maintenance
Without maintenance, memory systems bloat with stale, redundant, or low-value content. Implement three maintenance processes:
Consolidation: Merge similar memories and increase importance of frequently accessed ones.
async function consolidateMemories(userId: string) {
// Find similar memories that could merge
const candidates = await db.query(`
SELECT
m1.id as id1,
m2.id as id2,
m1.content as content1,
m2.content as content2,
1 - (m1.embedding <=> m2.embedding) as similarity
FROM agent_memories m1
JOIN agent_memories m2
ON m1.user_id = m2.user_id
AND m1.id < m2.id
AND m1.memory_type = m2.memory_type
WHERE m1.user_id = $1
AND 1 - (m1.embedding <=> m2.embedding) > 0.92
`, [userId]);
  for (const pair of candidates) {
    // Use an LLM to merge the two contents into a single memory
    const merged = await mergeMemories(pair.content1, pair.content2);

    // Re-embed the merged content so retrieval stays accurate
    const mergedEmbedding = await generateEmbedding(merged);

    // Keep the higher importance, combine access counts
    await db.query(`
      UPDATE agent_memories
      SET content = $1,
          embedding = $2,
          importance = GREATEST(importance, (
            SELECT importance FROM agent_memories WHERE id = $3
          )),
          access_count = access_count + (
            SELECT access_count FROM agent_memories WHERE id = $3
          )
      WHERE id = $4
    `, [merged, mergedEmbedding, pair.id2, pair.id1]);

    await db.query(`DELETE FROM agent_memories WHERE id = $1`, [pair.id2]);
}
}
Decay: Reduce importance of untouched memories over time; delete expired ones.
async function applyMemoryDecay() {
// Reduce importance of memories not accessed in 30+ days
await db.query(`
UPDATE agent_memories
SET importance = importance * 0.9
WHERE last_accessed_at < NOW() - INTERVAL '30 days'
AND importance > 0.1
`);
// Delete expired memories
await db.query(`
DELETE FROM agent_memories
WHERE expires_at < NOW()
`);
// Delete low-importance, old, unaccessed memories
await db.query(`
DELETE FROM agent_memories
WHERE importance < 0.2
AND access_count < 2
AND created_at < NOW() - INTERVAL '90 days'
`);
}
Quota enforcement: Prevent runaway storage per user/org.
const MEMORY_QUOTA_PER_USER = 500;
async function enforceQuota(userId: string) {
  // COUNT(*) comes back as a string, so alias and cast it before comparing
  const [{ count }] = await db.query(
    `SELECT COUNT(*)::int AS count FROM agent_memories WHERE user_id = $1`,
    [userId]
  );

  if (count > MEMORY_QUOTA_PER_USER) {
// Delete lowest-value memories until under quota
await db.query(`
DELETE FROM agent_memories
WHERE id IN (
SELECT id FROM agent_memories
WHERE user_id = $1
ORDER BY importance ASC, access_count ASC, created_at ASC
LIMIT $2
)
`, [userId, count - MEMORY_QUOTA_PER_USER]);
}
}
Production considerations
Memory systems introduce failure modes that don't exist in stateless agents.
Retrieval latency budget
Memory retrieval adds latency to every agent response. Set a budget and enforce it.
async function retrieveWithTimeout(
  options: RetrievalOptions,
  timeoutMs: number = 200
): Promise<Memory[]> {
  let timer: ReturnType<typeof setTimeout> | undefined;

  // Resolve with an empty result if the latency budget is exceeded
  const timeout = new Promise<Memory[]>((resolve) => {
    timer = setTimeout(() => {
      console.warn('Memory retrieval timed out, proceeding without memories');
      resolve([]);
    }, timeoutMs);
  });

  try {
    return await Promise.race([retrieveMemories(options), timeout]);
  } catch (error) {
    console.warn('Memory retrieval failed, proceeding without memories', error);
    return [];
  } finally {
    clearTimeout(timer);
  }
}
At Athenic, we set a 150ms retrieval budget. If memory lookup takes longer, we proceed without it rather than blocking the user.
Privacy and data retention
Memories may contain sensitive information. Implement:
- User deletion: When users request data deletion, purge all their memories.
- Retention policies: Automatically expire memories after configurable periods.
- Access controls: Memories should respect the same permissions as the source data.
async function deleteUserMemories(userId: string) {
await db.query(`DELETE FROM agent_memories WHERE user_id = $1`, [userId]);
}
async function setRetentionPolicy(
orgId: string,
maxAgeDays: number
) {
await db.query(`
UPDATE agent_memories
SET expires_at = created_at + INTERVAL '1 day' * $2
WHERE org_id = $1 AND expires_at IS NULL
`, [orgId, maxAgeDays]);
}
Memory accuracy and trust
Agents can misremember. Retrieved memories might be outdated or incorrectly extracted. Build in correction mechanisms:
// Allow users to view and correct memories
async function getUserMemories(userId: string): Promise<Memory[]> {
return db.query(`
SELECT id, content, memory_type, importance, created_at
FROM agent_memories
WHERE user_id = $1
ORDER BY importance DESC, created_at DESC
`, [userId]);
}
async function correctMemory(
memoryId: string,
newContent: string,
userId: string
) {
// Verify ownership
  const [memory] = await db.query(
    `SELECT id FROM agent_memories WHERE id = $1 AND user_id = $2`,
    [memoryId, userId]
  );
  if (!memory) throw new Error('Memory not found');
// Update content and re-embed
const embedding = await generateEmbedding(newContent);
await db.query(`
UPDATE agent_memories
SET content = $1, embedding = $2, updated_at = NOW()
WHERE id = $3
`, [newContent, embedding, memoryId]);
}
Monitoring and debugging
Track memory system health:
| Metric | Target | Alert Threshold |
|---|---|---|
| Retrieval p50 latency | <100ms | >200ms |
| Retrieval p99 latency | <300ms | >500ms |
| Memory hit rate | >70% | <50% |
| Avg memories per user | 50-200 | >400 |
| Storage per org | <1GB | >5GB |
// Log retrieval metrics
async function instrumentedRetrieval(
options: RetrievalOptions
): Promise<Memory[]> {
const start = Date.now();
const memories = await retrieveMemories(options);
const duration = Date.now() - start;
metrics.histogram('memory.retrieval_duration_ms', duration);
metrics.increment('memory.retrieval_count');
metrics.gauge('memory.results_returned', memories.length);
if (memories.length === 0) {
metrics.increment('memory.retrieval_miss');
}
return memories;
}
Real-world case study: Customer success agent
We deployed memory architecture for a customer success agent handling renewal conversations. Before memory, agents asked customers to re-explain their use case every call. After memory:
Setup:
- Semantic memories for customer context (company size, use cases, key contacts)
- Episodic memories for interaction history (previous calls, feature requests, issues)
- Procedural memories for communication preferences
Results over 6 months:
- Customer satisfaction (CSAT) improved from 4.1 to 4.6 out of 5
- Average call duration dropped 23% (less context re-establishment)
- Renewal rate increased 8% (agents referenced previous value discussions)
- Memory retrieval added 85ms p50 latency (acceptable)
Lesson learned: The biggest win wasn't fact recall - it was the agent remembering customer frustrations and proactively addressing them. "Last time you mentioned the dashboard was slow - we've shipped three performance updates since then" closed more renewals than any feature pitch.
FAQs
How much storage should I budget per user?
Start with 500 memories per user, approximately 2-5MB including embeddings. Monitor actual usage and adjust. Power users might need 1,000+; casual users might only generate 50.
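For a rough back-of-envelope check, assuming 1536-dimension float4 embeddings and around 300 bytes of text and metadata per memory (both figures are assumptions to verify against your own data):

// Per-user storage estimate - the content size is an assumed average
const EMBEDDING_BYTES = 1536 * 4;   // ~6 KB per pgvector embedding
const AVG_CONTENT_BYTES = 300;      // assumed average memory text + metadata
const MEMORIES_PER_USER = 500;

const bytesPerUser = MEMORIES_PER_USER * (EMBEDDING_BYTES + AVG_CONTENT_BYTES);
console.log(`${(bytesPerUser / 1024 / 1024).toFixed(1)} MB per user`); // ≈ 3.1 MB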
Should I store raw conversation history or extracted memories?
Both. Store raw history for compliance and debugging, but retrieve from extracted memories for agent context. Raw history is too noisy for effective retrieval.
How do I handle memory conflicts?
When new information contradicts stored memories (e.g., user changes jobs), the new information wins. Implement update-on-conflict logic in extraction, or let users manually correct.
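One way to sketch update-on-conflict logic, reusing the helpers from the implementation guide (upsertMemory is a hypothetical name; the 0.92 similarity threshold is borrowed from the consolidation step):

// If a new memory closely matches an existing one of the same type, overwrite it instead of inserting
async function upsertMemory(userId: string, candidate: ExtractedMemory) {
  const embedding = await generateEmbedding(candidate.content);

  // Find the nearest existing memory of the same type
  const [match] = await db.query(`
    SELECT id, 1 - (embedding <=> $1) AS similarity
    FROM agent_memories
    WHERE user_id = $2 AND memory_type = $3
    ORDER BY embedding <=> $1
    LIMIT 1
  `, [embedding, userId, candidate.type]);

  if (match && match.similarity > 0.92) {
    // New information wins: replace the content and refresh the embedding
    await db.query(`
      UPDATE agent_memories
      SET content = $1, embedding = $2, updated_at = NOW()
      WHERE id = $3
    `, [candidate.content, embedding, match.id]);
  } else {
    await storeMemory({ ...candidate, embedding, sourceType: 'conversation' });
  }
}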
Can I share memories across agents?
Yes, if they serve the same user. Use org_id/user_id scoping. Cross-user memory sharing requires explicit consent and careful access control.
What's the right embedding model?
OpenAI's text-embedding-3-small offers a strong balance of quality and cost for most cases. Cohere Embed V4 excels for multilingual deployments. Avoid older models like text-embedding-ada-002.
Summary and next steps
Agent memory transforms single-use tools into persistent collaborators. The three-tier architecture (working, short-term, long-term) mirrors human cognition and provides clear implementation boundaries.
Key implementation steps:
- Design schema with importance scoring and expiry support
- Build extraction pipeline that identifies genuinely memorable content
- Implement hybrid retrieval combining similarity, importance, and recency
- Add maintenance processes: consolidation, decay, quota enforcement
- Instrument for latency and hit rate monitoring
Next steps:
- Review your current agent architecture for memory integration points
- Start with semantic memory for user facts - it's the highest-value, lowest-complexity memory type
- Implement quota enforcement before you have a storage problem, not after
- Build user-facing memory management to enable corrections and deletions