How to Give Your AI Agent Memory (And Why Most Get It Wrong)
Most AI agents forget everything the moment a conversation ends. That's not an agent — that's a chatbot with extra steps.
Memory is what separates a truly autonomous agent from a glorified autocomplete. But implementing memory well is harder than it looks. Here's what actually works in production.
The Memory Problem
LLMs have a fixed context window. GPT-4o gives you 128K tokens. Claude gives you 200K. Sounds like a lot — until your agent has been running for a week and has processed thousands of interactions.
Without memory, your agent:
- Asks the same questions repeatedly
- Loses track of user preferences
- Can't learn from past mistakes
- Treats every conversation like a first date
With memory, your agent:
- Builds a persistent understanding of users, context, and goals
- Improves over time
- Handles complex, multi-session workflows
- Actually feels intelligent
The 4 Types of Agent Memory
1. Conversation History (Short-Term)
The simplest form: keep the last N messages in context. Every framework does this.
Pros: Easy to implement, good for single-session tasks.
Cons: Disappears when the session ends. Doesn't scale.
When to use: Always, as a baseline. But never rely on it alone.
# Simple sliding window memory
class ConversationMemory:
    def __init__(self, max_messages=50):
        self.messages = []
        self.max_messages = max_messages

    def add(self, role, content):
        self.messages.append({"role": role, "content": content})
        if len(self.messages) > self.max_messages:
            self.messages = self.messages[-self.max_messages:]

    def get_context(self):
        return self.messages
2. Summary Memory (Compressed)
Periodically summarize the conversation and inject the summary into new sessions. Reduces token usage while preserving key context.
Pros: Token-efficient, captures the gist.
Cons: Lossy — summaries miss nuance. The summarizer can hallucinate.
When to use: Long-running sessions where you need to stay within context limits.
import openai

def summarize_conversation(messages, client):
    """Compress conversation history into a summary.

    `client` is an openai.OpenAI() instance.
    """
    conversation_text = "\n".join(
        f"{m['role']}: {m['content']}" for m in messages
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": "Summarize this conversation. Preserve: key decisions, user preferences, action items, and unresolved questions."
        }, {
            "role": "user",
            "content": conversation_text
        }]
    )
    return response.choices[0].message.content
3. Episodic Memory (Structured Logs)
Store discrete events: "User asked about pricing on Feb 12," "Agent deployed code to staging on Feb 14." Retrieve relevant episodes when needed.
Pros: Precise, searchable, auditable.
Cons: Requires a retrieval system (embeddings + vector DB or keyword search).
When to use: Agents that need to recall specific past actions or decisions.
import chromadb
from datetime import datetime

class EpisodicMemory:
    def __init__(self):
        self.client = chromadb.Client()
        # get_or_create avoids an error if the collection already exists
        self.collection = self.client.get_or_create_collection("episodes")

    def store_episode(self, event, metadata=None):
        """Store a discrete event with timestamp."""
        self.collection.add(
            documents=[event],
            metadatas=[{
                "timestamp": datetime.now().isoformat(),
                **(metadata or {})
            }],
            ids=[f"ep_{datetime.now().timestamp()}"]
        )

    def recall(self, query, n_results=5):
        """Retrieve relevant past episodes."""
        results = self.collection.query(
            query_texts=[query],
            n_results=n_results
        )
        return results["documents"][0]
4. Semantic Memory (Knowledge Base)
Curated, long-term knowledge: user preferences, company facts, learned procedures. Think of it as the agent's "long-term memory" — distilled from experience, not raw logs.
Pros: Highly relevant, compact, doesn't grow unboundedly.
Cons: Requires curation (manual or automated). Can become stale.
When to use: Always, for any agent that runs more than once.
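Semantic memory doesn't need heavy infrastructure to start. A minimal sketch, assuming a single JSON knowledge file with a flat key/value schema (both the file name and schema are illustrative, not from any framework):

```python
import json
from pathlib import Path

class SemanticMemory:
    """Curated long-term facts, persisted as a JSON file.

    The file name and flat key/value layout are illustrative assumptions.
    """

    def __init__(self, path="semantic_memory.json"):
        self.path = Path(path)
        # Load existing knowledge at session start, or begin empty
        self.facts = (
            json.loads(self.path.read_text()) if self.path.exists() else {}
        )

    def remember(self, key, value):
        """Record or update a durable fact, e.g. a user preference."""
        self.facts[key] = value
        self.path.write_text(json.dumps(self.facts, indent=2))

    def snapshot(self):
        """Render the knowledge base as text for the system prompt."""
        return "\n".join(f"- {k}: {v}" for k, v in self.facts.items())
```

Because the whole file reloads at session start, anything the agent learned last week is available on turn one of the next session.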
The Architecture That Works
The best production agents combine all four:
┌─────────────────────────────────┐
│         Context Window          │
│                                 │
│  [System prompt]                │
│  [Semantic memory snapshot]     │
│  [Relevant episodic memories]   │
│  [Conversation summary]         │
│  [Recent messages]              │
│  [User message]                 │
└─────────────────────────────────┘
The key insight: Memory isn't just storage — it's retrieval. The hard part isn't saving information; it's knowing what to recall and when.
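The assembly itself is straightforward; a sketch, assuming you already have the four memory layers as strings and message lists (a real implementation would also enforce a token budget per layer):

```python
# Sketch: assemble the layered context window in the order above.
# Stable context goes first, the freshest information last.
def build_context(system_prompt, semantic_snapshot, episodes,
                  summary, recent_messages, user_message):
    system_parts = [
        system_prompt,
        "Known facts:\n" + semantic_snapshot if semantic_snapshot else "",
        "Relevant past events:\n" + "\n".join(episodes) if episodes else "",
        "Conversation so far:\n" + summary if summary else "",
    ]
    messages = [{
        "role": "system",
        # Drop empty layers so they cost no tokens
        "content": "\n\n".join(p for p in system_parts if p),
    }]
    messages += recent_messages
    messages.append({"role": "user", "content": user_message})
    return messages
```

Keeping the layers in a fixed order also makes prompt regressions easier to diff when you change one memory component.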
Common Mistakes
Mistake 1: Stuffing Everything Into Context
More context ≠ better performance. LLMs get confused with too much information. The "lost in the middle" problem is real — models pay less attention to information in the middle of long contexts.
Fix: Be selective. Retrieve only what's relevant to the current task.
Mistake 2: No Memory Hierarchy
Treating all memories equally means your agent wastes tokens on irrelevant details while missing critical context.
Fix: Layer your memory. Semantic memory (always loaded) → episodic memory (retrieved on demand) → conversation history (recent only).
Mistake 3: Never Pruning
Memory that grows forever becomes noise. Old, irrelevant memories dilute the signal.
Fix: Implement decay. Archive old episodic memories. Periodically review and update semantic memory. Delete what's no longer relevant.
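A minimal age-based decay sketch, assuming episodes are dicts carrying the ISO "timestamp" metadata from the episodic example; the 90-day cutoff is an arbitrary illustration:

```python
from datetime import datetime, timedelta

def prune_episodes(episodes, max_age_days=90, now=None):
    """Keep only episodes newer than the cutoff.

    In a real system, archive the old ones somewhere cheap
    instead of discarding them outright.
    """
    now = now or datetime.now()
    cutoff = now - timedelta(days=max_age_days)
    return [
        e for e in episodes
        if datetime.fromisoformat(e["timestamp"]) >= cutoff
    ]
```

Run it on a schedule (or at session end) so retrieval quality doesn't degrade as the store grows.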
Mistake 4: Ignoring Memory in Testing
You test your agent's responses but not its memory retrieval. In production, bad retrieval = bad responses.
Fix: Test memory separately. Verify that the right memories surface for the right queries.
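A retrieval test can be as simple as seeding known episodes and asserting the right one surfaces. Sketched here against a trivial keyword matcher so it runs anywhere; in practice, point the same assertion at your real vector store:

```python
# Toy keyword-overlap retriever, standing in for a vector store.
def keyword_recall(episodes, query, n_results=1):
    scored = sorted(
        episodes,
        key=lambda e: sum(w in e.lower() for w in query.lower().split()),
        reverse=True,
    )
    return scored[:n_results]

def test_recall_surfaces_pricing_episode():
    episodes = [
        "User asked about pricing on Feb 12",
        "Agent deployed code to staging on Feb 14",
    ]
    top = keyword_recall(episodes, "what did the user ask about pricing")
    # The pricing episode, not the deploy episode, must rank first
    assert top[0] == "User asked about pricing on Feb 12"

test_recall_surfaces_pricing_episode()
```

The same pattern catches embedding or chunking regressions: seed fixtures, query, assert the expected memory ranks first.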
Connecting Memory to the Real World
An agent with great memory but no access to live data is still limited. The most powerful agents combine persistent memory with real-time web access — remembering what they've seen and being able to check for updates.
For example, a price monitoring agent with episodic memory can:
- Scrape competitor prices using the WebPerception API
- Store each price check as an episodic memory with timestamp
- Recall historical prices when asked "how has competitor X's pricing changed?"
- Alert when it detects a significant change compared to stored memories
import requests

# Scrape current price (WebPerception API)
resp = requests.post(
    "https://api.mantisapi.com/v1/extract",
    json={
        "url": "https://competitor.com/pricing",
        "schema": {"plan_name": "string", "price": "number", "features": "string[]"}
    },
    headers={"Authorization": "Bearer YOUR_API_KEY"}
)
current_prices = resp.json()

# Store as episodic memory
memory.store_episode(
    f"Price check: {current_prices}",
    metadata={"type": "price_check", "source": "competitor.com"}
)

# Compare with last known prices
history = memory.recall("competitor.com pricing", n_results=1)
# ... detect changes and alert
Implementation Checklist
- Short-term: Conversation history with a sliding window (last 20-50 messages)
- Compressed: Automatic summarization every N messages or on session end
- Episodic: Event logging with timestamps, tags, and embeddings for retrieval
- Semantic: Curated knowledge file(s) loaded at session start
- Retrieval: Semantic search (embeddings) or keyword search for episodic recall
- Pruning: Scheduled cleanup of old/irrelevant memories
- Testing: Memory retrieval tests alongside response quality tests
The Bottom Line
Memory is the single biggest differentiator between a toy demo and a production agent. Get it right, and your agent compounds in value over time — every interaction makes it smarter. Get it wrong, and you've built an expensive goldfish.
The agents that win in 2026 won't be the ones with the biggest models. They'll be the ones with the best memory.