How to Give Your AI Agent Memory (And Why Most Get It Wrong)
Most AI agents forget everything the moment a conversation ends. That's not an agent — that's a chatbot with extra steps.
Memory is what separates a truly autonomous agent from a glorified autocomplete. But implementing memory well is harder than it looks. Here's what actually works in production.
The Memory Problem
LLMs have a fixed context window. GPT-4o gives you 128K tokens. Claude gives you 200K. Sounds like a lot — until your agent has been running for a week and has processed thousands of interactions.
Without memory, your agent:
- Asks the same questions repeatedly
- Loses track of user preferences
- Can't learn from past mistakes
- Treats every conversation like a first date
With memory, your agent:
- Builds a persistent understanding of users, context, and goals
- Improves over time
- Handles complex, multi-session workflows
- Actually feels intelligent
The 4 Types of Agent Memory
1. Conversation History (Short-Term)
The simplest form: keep the last N messages in context. Every framework does this.
Pros: Easy to implement, good for single-session tasks.
Cons: Disappears when the session ends. Doesn't scale.
When to use: Always, as a baseline. But never rely on it alone.
# Simple sliding window memory
class ConversationMemory:
    def __init__(self, max_messages=50):
        self.messages = []
        self.max_messages = max_messages

    def add(self, role, content):
        self.messages.append({"role": role, "content": content})
        if len(self.messages) > self.max_messages:
            self.messages = self.messages[-self.max_messages:]

    def get_context(self):
        return self.messages
2. Summary Memory (Compressed)
Periodically summarize the conversation and inject the summary into new sessions. Reduces token usage while preserving key context.
Pros: Token-efficient, captures the gist.
Cons: Lossy — summaries miss nuance. The summarizer can hallucinate.
When to use: Long-running sessions where you need to stay within context limits.
import openai

def summarize_conversation(messages, client):
    """Compress conversation history into a summary.

    `client` is an openai.OpenAI() instance.
    """
    conversation_text = "\n".join(
        f"{m['role']}: {m['content']}" for m in messages
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": "Summarize this conversation. Preserve: key decisions, user preferences, action items, and unresolved questions."
        }, {
            "role": "user",
            "content": conversation_text
        }]
    )
    return response.choices[0].message.content
3. Episodic Memory (Structured Logs)
Store discrete events: "User asked about pricing on Feb 12," "Agent deployed code to staging on Feb 14." Retrieve relevant episodes when needed.
Pros: Precise, searchable, auditable.
Cons: Requires a retrieval system (embeddings + vector DB or keyword search).
When to use: Agents that need to recall specific past actions or decisions.
import chromadb
from datetime import datetime

class EpisodicMemory:
    def __init__(self):
        self.client = chromadb.Client()
        # get_or_create avoids an error if the collection already exists
        self.collection = self.client.get_or_create_collection("episodes")

    def store_episode(self, event, metadata=None):
        """Store a discrete event with timestamp."""
        self.collection.add(
            documents=[event],
            metadatas=[{
                "timestamp": datetime.now().isoformat(),
                **(metadata or {})
            }],
            ids=[f"ep_{datetime.now().timestamp()}"]
        )

    def recall(self, query, n_results=5):
        """Retrieve relevant past episodes."""
        results = self.collection.query(
            query_texts=[query],
            n_results=n_results
        )
        return results["documents"][0]
4. Semantic Memory (Knowledge Base)
Curated, long-term knowledge: user preferences, company facts, learned procedures. Think of it as the agent's "long-term memory" — distilled from experience, not raw logs.
Pros: Highly relevant, compact, doesn't grow unboundedly.
Cons: Requires curation (manual or automated). Can become stale.
When to use: Always, for any agent that runs more than once.
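Semantic memory doesn't need heavy infrastructure to start. A minimal sketch, assuming a single JSON knowledge file with a flat key/value schema (both the file name and schema are illustrative, not from any framework):

```python
import json
from pathlib import Path

class SemanticMemory:
    """Curated long-term facts, persisted as a JSON file.

    The file name and flat key/value layout are illustrative assumptions.
    """

    def __init__(self, path="semantic_memory.json"):
        self.path = Path(path)
        # Load existing knowledge at session start, or begin empty
        self.facts = (
            json.loads(self.path.read_text()) if self.path.exists() else {}
        )

    def remember(self, key, value):
        """Record or update a durable fact, e.g. a user preference."""
        self.facts[key] = value
        self.path.write_text(json.dumps(self.facts, indent=2))

    def snapshot(self):
        """Render the knowledge base as text for the system prompt."""
        return "\n".join(f"- {k}: {v}" for k, v in self.facts.items())
```

Because the whole file reloads at session start, anything the agent learned last week is available on turn one of the next session.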
The Architecture That Works
The best production agents combine all four:
┌─────────────────────────────────┐
│         Context Window          │
│                                 │
│  [System prompt]                │
│  [Semantic memory snapshot]     │
│  [Relevant episodic memories]   │
│  [Conversation summary]         │
│  [Recent messages]              │
│  [User message]                 │
└─────────────────────────────────┘
The key insight: Memory isn't just storage — it's retrieval. The hard part isn't saving information; it's knowing what to recall and when.
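The assembly itself is straightforward; a sketch, assuming you already have the four memory layers as strings and message lists (a real implementation would also enforce a token budget per layer):

```python
# Sketch: assemble the layered context window in the order above.
# Stable context goes first, the freshest information last.
def build_context(system_prompt, semantic_snapshot, episodes,
                  summary, recent_messages, user_message):
    system_parts = [
        system_prompt,
        "Known facts:\n" + semantic_snapshot if semantic_snapshot else "",
        "Relevant past events:\n" + "\n".join(episodes) if episodes else "",
        "Conversation so far:\n" + summary if summary else "",
    ]
    messages = [{
        "role": "system",
        # Drop empty layers so they cost no tokens
        "content": "\n\n".join(p for p in system_parts if p),
    }]
    messages += recent_messages
    messages.append({"role": "user", "content": user_message})
    return messages
```

Keeping the layers in a fixed order also makes prompt regressions easier to diff when you change one memory component.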
Common Mistakes
Mistake 1: Stuffing Everything Into Context
More context ≠ better performance. LLMs get confused with too much information. The "lost in the middle" problem is real — models pay less attention to information in the middle of long contexts.
Fix: Be selective. Retrieve only what's relevant to the current task.
Mistake 2: No Memory Hierarchy
Treating all memories equally means your agent wastes tokens on irrelevant details while missing critical context.
Fix: Layer your memory. Semantic memory (always loaded) → episodic memory (retrieved on demand) → conversation history (recent only).
Mistake 3: Never Pruning
Memory that grows forever becomes noise. Old, irrelevant memories dilute the signal.
Fix: Implement decay. Archive old episodic memories. Periodically review and update semantic memory. Delete what's no longer relevant.
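A minimal age-based decay sketch, assuming episodes are dicts carrying the ISO "timestamp" metadata from the episodic example; the 90-day cutoff is an arbitrary illustration:

```python
from datetime import datetime, timedelta

def prune_episodes(episodes, max_age_days=90, now=None):
    """Keep only episodes newer than the cutoff.

    In a real system, archive the old ones somewhere cheap
    instead of discarding them outright.
    """
    now = now or datetime.now()
    cutoff = now - timedelta(days=max_age_days)
    return [
        e for e in episodes
        if datetime.fromisoformat(e["timestamp"]) >= cutoff
    ]
```

Run it on a schedule (or at session end) so retrieval quality doesn't degrade as the store grows.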
Mistake 4: Ignoring Memory in Testing
You test your agent's responses but not its memory retrieval. In production, bad retrieval = bad responses.
Fix: Test memory separately. Verify that the right memories surface for the right queries.
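A retrieval test can be as simple as seeding known episodes and asserting the right one surfaces. Sketched here against a trivial keyword matcher so it runs anywhere; in practice, point the same assertion at your real vector store:

```python
# Toy keyword-overlap retriever, standing in for a vector store.
def keyword_recall(episodes, query, n_results=1):
    scored = sorted(
        episodes,
        key=lambda e: sum(w in e.lower() for w in query.lower().split()),
        reverse=True,
    )
    return scored[:n_results]

def test_recall_surfaces_pricing_episode():
    episodes = [
        "User asked about pricing on Feb 12",
        "Agent deployed code to staging on Feb 14",
    ]
    top = keyword_recall(episodes, "what did the user ask about pricing")
    # The pricing episode, not the deploy episode, must rank first
    assert top[0] == "User asked about pricing on Feb 12"

test_recall_surfaces_pricing_episode()
```

The same pattern catches embedding or chunking regressions: seed fixtures, query, assert the expected memory ranks first.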
Connecting Memory to the Real World
An agent with great memory but no access to live data is still limited. The most powerful agents combine persistent memory with real-time web access — remembering what they've seen and being able to check for updates.
For example, a price monitoring agent with episodic memory can:
- Scrape competitor prices using the WebPerception API
- Store each price check as an episodic memory with timestamp
- Recall historical prices when asked "how has competitor X's pricing changed?"
- Alert when it detects a significant change compared to stored memories
import requests

# Scrape current price (WebPerception API)
resp = requests.post(
    "https://api.mantisapi.com/v1/extract",
    json={
        "url": "https://competitor.com/pricing",
        "schema": {"plan_name": "string", "price": "number", "features": "string[]"}
    },
    headers={"Authorization": "Bearer YOUR_API_KEY"}
)
current_prices = resp.json()

# Store as episodic memory
memory.store_episode(
    f"Price check: {current_prices}",
    metadata={"type": "price_check", "source": "competitor.com"}
)

# Compare with last known prices
history = memory.recall("competitor.com pricing", n_results=1)
# ... detect changes and alert
Implementation Checklist
- Short-term: Conversation history with a sliding window (last 20-50 messages)
- Compressed: Automatic summarization every N messages or on session end
- Episodic: Event logging with timestamps, tags, and embeddings for retrieval
- Semantic: Curated knowledge file(s) loaded at session start
- Retrieval: Semantic search (embeddings) or keyword search for episodic recall
- Pruning: Scheduled cleanup of old/irrelevant memories
- Testing: Memory retrieval tests alongside response quality tests
The Bottom Line
Memory is the single biggest differentiator between a toy demo and a production agent. Get it right, and your agent compounds in value over time — every interaction makes it smarter. Get it wrong, and you've built an expensive goldfish.
The agents that win in 2026 won't be the ones with the biggest models. They'll be the ones with the best memory.