Academy · 18 Sept 2024 · 10 min read

Agent Memory Systems: How to Build AI Agents That Learn from Conversations

Implementation guide for agent memory: session management, long-term storage, context windows, and memory architectures for agents that remember past interactions.

Max Beech
Head of Content

TL;DR

  • Agents without memory forget everything between sessions. Users hate this.
  • Three memory types: Short-term (conversation history), Long-term (facts about user), Semantic (retrieved knowledge).
  • Buffer memory (simple): Keep last N messages. Works for <10 turn conversations.
  • Summary memory (better): Summarize old messages, keep recent ones. Scales to 100+ turns.
  • Entity memory (best for personalization): Extract facts about user (preferences, history), store in database.
  • Cost: Memory adds 20-40% to context tokens. Optimize with sliding windows, summarization.
  • Real example: Customer support agent with memory has 34% higher satisfaction vs memoryless.

Agent Memory Systems: Build Agents That Remember

User's first conversation:

User: "I prefer communications by email, not phone."
Agent: "Got it, I'll note that."

User's second conversation (next day):

User: "Can you contact me about this issue?"
Agent: "Sure! What's the best way to reach you -email or phone?"
User: 😡 "I told you yesterday, email only!"

Problem: Agent forgot. Users expect agents to remember context, preferences, past interactions.

Here's how to build memory into agents.

Three Types of Memory

1. Short-Term Memory (Conversational Context)

What: Recent conversation history (last 3-10 turns).

Duration: Current session only.

Use: Maintain coherent conversation flow.

Example:

User: "What's the weather in London?"
Agent: "It's 15°C and cloudy."
User: "What about tomorrow?"
Agent: [Knows "What about tomorrow" = weather in London tomorrow]

Implementation: Simple buffer (keep last N messages).

2. Long-Term Memory (User Facts)

What: Persistent facts about user (preferences, history, profile).

Duration: Across sessions (days, months, years).

Use: Personalization, continuity across conversations.

Example:

Session 1: User shares preference for email
Session 2 (next week): Agent remembers, uses email without asking

Implementation: Database storage (SQL, NoSQL, vector DB).

3. Semantic Memory (Retrieved Knowledge)

What: External knowledge retrieved on-demand (RAG).

Duration: Per-query (not stored in conversation).

Use: Answer questions using knowledge base without fine-tuning.

Example:

User: "What's our return policy?"
Agent: [Retrieves policy from knowledge base, doesn't memorize it]

Implementation: Vector database + retrieval. Covered in our RAG guide.

This guide focuses on Short-Term and Long-Term memory.

Short-Term Memory Strategies

Strategy 1: Buffer Memory (Simplest)

Keep last N messages in context window.

class BufferMemory:
    def __init__(self, max_messages=10):
        self.messages = []
        self.max_messages = max_messages

    def add_message(self, role, content):
        self.messages.append({"role": role, "content": content})
        if len(self.messages) > self.max_messages:
            self.messages.pop(0)  # Remove oldest

    def get_context(self):
        return self.messages

# Usage
memory = BufferMemory(max_messages=6)  # Last 3 turns (6 messages)

memory.add_message("user", "What's the weather?")
memory.add_message("assistant", "It's sunny, 22°C.")
memory.add_message("user", "What about tomorrow?")

# Agent sees: All 3 messages for context
context = memory.get_context()

Pros:

  • Simple (10 lines of code)
  • Preserves exact conversation

Cons:

  • Fixed size (oldest messages are silently dropped)
  • Doesn't scale (100 messages = 50K+ tokens = expensive)

Use when: Conversations <10 turns, <2K tokens total.

Strategy 2: Summary Memory

Summarize old conversation, keep recent messages verbatim.

class SummaryMemory:
    def __init__(self, recent_k=4, summarize_threshold=10):
        self.messages = []
        self.summary = None
        self.recent_k = recent_k
        self.summarize_threshold = summarize_threshold

    def add_message(self, role, content):
        self.messages.append({"role": role, "content": content})

        if len(self.messages) > self.summarize_threshold:
            self._summarize_old_messages()

    def _summarize_old_messages(self):
        old_messages = self.messages[:-self.recent_k]

        # Use a cheap model to summarize. call_llm is a thin placeholder
        # wrapper around your LLM client; it is not defined in this guide.
        summary_prompt = f"Summarize this conversation:\n{old_messages}"
        self.summary = call_llm(summary_prompt, model="gpt-3.5-turbo")

        # Keep only recent messages
        self.messages = self.messages[-self.recent_k:]

    def get_context(self):
        context = []
        if self.summary:
            context.append({"role": "system", "content": f"Summary of earlier conversation: {self.summary}"})
        context.extend(self.messages)
        return context

Example:

After 12 messages:

Summary: "User asked about product features. Agent explained A, B, C. User expressed interest in B."

Recent messages:
User: "What's the price for B?"
Agent: "$99/month"
User: "Any discounts?"

Total tokens: 150 (summary) + 50 (recent) = 200 tokens
vs Buffer: 1,200 tokens (all 12 messages)

Savings: 83% reduction in context tokens.

Pros:

  • Scales to long conversations (100+ turns)
  • Much cheaper than buffer (6× fewer tokens)

Cons:

  • Loses detail (summary compresses)
  • Summarization adds latency (extra LLM call)

Use when: Conversations >10 turns, cost-sensitive.

Strategy 3: Sliding Window with Highlights

Keep recent messages + important moments from earlier.

class WindowMemory:
    def __init__(self, window_size=6, highlights_size=3):
        self.messages = []
        self.highlights = []  # Important messages
        self.window_size = window_size
        self.highlights_size = highlights_size

    def add_message(self, role, content, is_important=False):
        msg = {"role": role, "content": content}
        self.messages.append(msg)

        if is_important:
            self.highlights.append(msg)
            if len(self.highlights) > self.highlights_size:
                self.highlights.pop(0)

    def get_context(self):
        recent = self.messages[-self.window_size:]
        return self.highlights + recent  # Highlights + recent window

How to determine "important":

def is_important(message):
    # Rule-based
    important_keywords = ["prefer", "always", "never", "email me", "don't call"]
    if any(kw in message.lower() for kw in important_keywords):
        return True

    # Or use cheap LLM classifier
    prompt = f"Is this message important to remember? (yes/no): {message}"
    response = call_llm(prompt, model="gpt-3.5-turbo")
    return "yes" in response.lower()

Use when: Need full detail + cost efficiency, can identify important moments.
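Wiring the two snippets above together looks like the following minimal sketch. It keeps only the rule-based check (the LLM-classifier fallback is omitted), and the sample messages are invented for illustration:

```python
IMPORTANT_KEYWORDS = ["prefer", "always", "never", "email me", "don't call"]

def is_important(message):
    # Rule-based check only; the cheap-LLM fallback above is omitted here
    return any(kw in message.lower() for kw in IMPORTANT_KEYWORDS)

class WindowMemory:
    def __init__(self, window_size=6, highlights_size=3):
        self.messages = []
        self.highlights = []  # Important messages pinned outside the window
        self.window_size = window_size
        self.highlights_size = highlights_size

    def add_message(self, role, content, is_important=False):
        msg = {"role": role, "content": content}
        self.messages.append(msg)
        if is_important:
            self.highlights.append(msg)
            if len(self.highlights) > self.highlights_size:
                self.highlights.pop(0)

    def get_context(self):
        return self.highlights + self.messages[-self.window_size:]

memory = WindowMemory(window_size=2)
text = "I prefer email, never phone."
memory.add_message("user", text, is_important=is_important(text))
for i in range(5):
    memory.add_message("user", f"filler message {i}")

contents = [m["content"] for m in memory.get_context()]
# The preference stays in context even after it scrolls out of the window;
# old filler messages do not.
```

The point of the split: the preference would be dropped by a plain 2-message window, but survives as a highlight.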

Long-Term Memory (Cross-Session)

Entity Memory

Extract facts about user, store persistently.

import sqlite3

class EntityMemory:
    def __init__(self, user_id):
        self.user_id = user_id
        self.db = sqlite3.connect('memory.db')
        self._create_table()

    def _create_table(self):
        self.db.execute("""
            CREATE TABLE IF NOT EXISTS user_facts (
                user_id TEXT,
                key TEXT,
                value TEXT,
                timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
                PRIMARY KEY (user_id, key)
            )
        """)

    def store_fact(self, key, value):
        self.db.execute("""
            INSERT OR REPLACE INTO user_facts (user_id, key, value)
            VALUES (?, ?, ?)
        """, (self.user_id, key, value))
        self.db.commit()

    def get_fact(self, key):
        cursor = self.db.execute("""
            SELECT value FROM user_facts
            WHERE user_id = ? AND key = ?
        """, (self.user_id, key))
        result = cursor.fetchone()
        return result[0] if result else None

    def get_all_facts(self):
        cursor = self.db.execute("""
            SELECT key, value FROM user_facts WHERE user_id = ?
        """, (self.user_id,))
        return dict(cursor.fetchall())

# Usage
memory = EntityMemory(user_id="user_123")

# Extract from conversation
message = "I prefer email communication, not phone calls."
# Use LLM to extract fact
fact_prompt = f"""
Extract key facts from this message in JSON format:
Message: {message}

Return: {{"key": "communication_preference", "value": "email"}}
"""
fact = extract_fact_with_llm(fact_prompt)  # placeholder wrapper around your LLM client
memory.store_fact(fact['key'], fact['value'])

# Later conversation
prefs = memory.get_all_facts()
# {'communication_preference': 'email'}

# Include in agent prompt
system_prompt = f"""
You are a helpful assistant.
User preferences: {prefs}
"""

What to store:

  • Communication preferences (email vs phone)
  • Product preferences (favorites, dislikes)
  • Interaction history (past purchases, tickets)
  • Personal context (timezone, language, role)

Extraction pipeline:

import json

def extract_entities_from_conversation(conversation):
    prompt = f"""
    Extract important facts about the user from this conversation.
    Return as JSON list: [{{"key": "...", "value": "..."}}, ...]

    Conversation:
    {conversation}

    Facts:
    """

    response = call_llm(prompt, model="gpt-4-turbo")
    facts = json.loads(response)
    return facts

Run after each conversation, store facts in database.
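Wired into the EntityMemory store, the end-of-session hook might look like this sketch. `persist_session_facts` is an illustrative name, and the facts list is canned here in place of a live extraction call:

```python
import sqlite3

def persist_session_facts(db, user_id, facts):
    # Upsert each extracted fact; mirrors EntityMemory.store_fact
    for fact in facts:
        db.execute(
            "INSERT OR REPLACE INTO user_facts (user_id, key, value) VALUES (?, ?, ?)",
            (user_id, fact["key"], fact["value"]),
        )
    db.commit()

# Demo with a canned extraction result (normally the output of
# extract_entities_from_conversation)
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE IF NOT EXISTS user_facts (
        user_id TEXT,
        key TEXT,
        value TEXT,
        timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
        PRIMARY KEY (user_id, key)
    )
""")
facts = [
    {"key": "communication_preference", "value": "email"},
    {"key": "timezone", "value": "UTC-8"},
]
persist_session_facts(db, "user_123", facts)
stored = dict(
    db.execute("SELECT key, value FROM user_facts WHERE user_id = 'user_123'").fetchall()
)
```

INSERT OR REPLACE means re-extracting the same key updates the stored value rather than duplicating it, so facts stay current as preferences change.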

Memory Cost Analysis

Without memory (typical query):

System prompt: 100 tokens
User query: 50 tokens
Total input: 150 tokens
Cost: 150 × $0.01/1K = $0.0015

With buffer memory (10-turn conversation):

System prompt: 100 tokens
Conversation history: 2,000 tokens (10 turns)
User query: 50 tokens
Total input: 2,150 tokens
Cost: 2,150 × $0.01/1K = $0.0215

14× more expensive.

With summary memory (same conversation):

System prompt: 100 tokens
Summary: 200 tokens
Recent messages (4): 400 tokens
User query: 50 tokens
Total input: 750 tokens
Cost: 750 × $0.01/1K = $0.0075

~3× cheaper than buffer, 5× more expensive than no memory.

With entity memory only:

System prompt: 100 tokens
User facts: 50 tokens ("communication_preference: email")
User query: 50 tokens
Total input: 200 tokens
Cost: 200 × $0.01/1K = $0.002

33% more expensive than no memory, 10× cheaper than buffer.
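The arithmetic above fits in a small back-of-envelope helper. The flat $0.01 per 1K input tokens is the illustrative rate used throughout this section, not any provider's actual pricing:

```python
PRICE_PER_1K_INPUT = 0.01  # illustrative $/1K input tokens, as above

def query_cost(system=100, history=0, summary=0, facts=0, query=50):
    # Sum the context components, then price them at the flat input rate
    tokens = system + history + summary + facts + query
    return tokens, tokens * PRICE_PER_1K_INPUT / 1000

strategies = {
    "no_memory": query_cost(),
    "buffer": query_cost(history=2000),               # 10 turns verbatim
    "summary": query_cost(summary=200, history=400),  # summary + 4 recent messages
    "entity": query_cost(facts=50),                   # stored user facts only
}
# Reproduces the figures above: 150/$0.0015, 2150/$0.0215,
# 750/$0.0075, 200/$0.0020
```

Swap in your provider's real rates and measured token counts to size the trade-off for your own workload.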

Memory Cost Optimization

Strategy           Tokens per Query   Cost per Query   Use Case
No memory          150                $0.0015          One-off queries, no context needed
Entity only        200                $0.0020          Personalization without conversation history
Summary            750                $0.0075          Long conversations, cost-sensitive
Buffer (10 turns)  2,150              $0.0215          Short conversations, need exact history

Recommendation: Start with summary memory + entity memory. Best cost/quality trade-off.

Real-World Example: Customer Support Agent

Before memory:

  • User asks question → Agent answers → Session ends
  • Next question → Agent has no context
  • User satisfaction: 3.2/5

After adding memory:

  • Short-term: Summary memory (recent 4 messages + summary)
  • Long-term: Entity memory (user preferences, past tickets)
  • User satisfaction: 4.3/5 (+34%)

Cost impact:

  • Before: $0.0015/query
  • After: $0.0085/query (6× increase)
  • ROI: 34% satisfaction gain for 6× cost = worth it

Quote from Maria Santos, Head of Support: "Adding memory to our support agent was game-changing. Users stopped having to repeat themselves. Satisfaction jumped 34%, first-contact resolution improved 28%."

Hybrid Memory Architecture (Production)

Combine all three types:

class HybridMemory:
    def __init__(self, user_id):
        self.short_term = SummaryMemory()  # Conversation context
        self.long_term = EntityMemory(user_id)  # User facts
        self.semantic = RAGRetriever()  # Knowledge base

    def build_context(self, user_query):
        # 1. Get conversation history
        conversation_context = self.short_term.get_context()

        # 2. Get user facts
        user_facts = self.long_term.get_all_facts()

        # 3. Retrieve relevant knowledge
        knowledge = self.semantic.retrieve(user_query, top_k=3)

        # 4. Combine into prompt
        prompt = f"""
        User facts: {user_facts}

        Relevant knowledge:
        {knowledge}

        Conversation history:
        {conversation_context}

        User query: {user_query}
        """

        return prompt

Result: Agent has short-term context + knows user + accesses knowledge base.

Frequently Asked Questions

How long should I keep conversation history?

  • Short-term: Current session only (clear after session ends or 30 min inactivity)
  • Long-term: Forever (disk is cheap, users expect permanent memory)

Exception: Privacy-sensitive conversations (medical, legal). Auto-delete after N days per compliance.
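A retention sweep against the user_facts schema above might look like this sketch. `purge_expired_facts` and the 30-day window are illustrative; set the window per your compliance requirements:

```python
import sqlite3

RETENTION_DAYS = 30  # illustrative; set per your compliance requirements

def purge_expired_facts(db, days=RETENTION_DAYS):
    # Remove long-term facts older than the retention window,
    # using the timestamp column from the user_facts schema above
    db.execute(
        "DELETE FROM user_facts WHERE timestamp < datetime('now', ?)",
        (f"-{days} days",),
    )
    db.commit()

# Demo against an in-memory copy of the schema
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE user_facts (
        user_id TEXT,
        key TEXT,
        value TEXT,
        timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
        PRIMARY KEY (user_id, key)
    )
""")
db.execute("INSERT INTO user_facts VALUES ('u1', 'diagnosis', 'x', datetime('now', '-90 days'))")
db.execute("INSERT INTO user_facts VALUES ('u1', 'communication_preference', 'email', datetime('now'))")
purge_expired_facts(db)
remaining = [r[0] for r in db.execute("SELECT key FROM user_facts").fetchall()]
# Only the recent fact survives the sweep
```

Run it on a schedule (cron, or a background task) so expired facts are removed without waiting for the user's next session.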

What about GDPR/privacy regulations?

Store minimum necessary:

  • Short-term: Session-scoped, auto-delete after session
  • Long-term: Get user consent, provide deletion mechanism

Implementation:

def delete_user_data(user_id):
    # GDPR right to be forgotten
    db.execute("DELETE FROM user_facts WHERE user_id = ?", (user_id,))
    db.execute("DELETE FROM conversation_history WHERE user_id = ?", (user_id,))
    db.commit()

How do I handle memory across multiple agents?

Shared memory store: All agents access same database.

# Agent A stores fact
memory_a = EntityMemory(user_id="user_123")
memory_a.store_fact("timezone", "UTC-8")

# Agent B retrieves fact
memory_b = EntityMemory(user_id="user_123")
timezone = memory_b.get_fact("timezone")  # "UTC-8"

Consistency: Both agents see same user facts.


Bottom line: Memory transforms stateless agents into personalized assistants. Use summary memory for conversations, entity memory for user facts. Costs 5-6× more but improves satisfaction 30-40% for customer-facing use cases.

Next: Read our Multi-Agent Systems guide for memory sharing across agents.