AI agents are only as useful as the context they retain. Without persistent memory, every conversation starts from zero — agents forget preferences, lose track of multi-session projects, and repeat the same questions endlessly. In 2026, agent memory has matured from a research curiosity into critical infrastructure. Frameworks now compete on retrieval accuracy, deployment simplicity, latency, and data sovereignty.
This guide compares the five most relevant agent memory frameworks available today: Dakera, Mem0, Letta, Zep, and Hindsight. We evaluate each on benchmark performance, retrieval architecture, deployment model, dependency footprint, encryption, and pricing — then provide clear recommendations for when each framework makes sense.
Why Agent Memory Matters in 2026
The shift from single-turn chatbots to autonomous multi-step agents has made memory non-negotiable. Consider what breaks without it:
- Coding agents forget project conventions between sessions, generating inconsistent code
- Customer support agents re-ask for order numbers and preferences every interaction
- Research agents lose track of what they've already explored, duplicating work
- Personal assistants can't learn user preferences over time
The memory layer sits between the LLM and the application — ingesting conversation turns, extracting salient facts, storing them durably, and retrieving the right context at query time. The quality of this pipeline directly determines whether an agent feels intelligent or broken.
Evaluation Criteria
We use six criteria to compare frameworks:
- Benchmark accuracy — LoCoMo (Long Conversation Memory) scores across single-hop, multi-hop, and temporal reasoning categories
- Retrieval architecture — vector-only vs. hybrid (vector + keyword + reranking), graph enrichment, temporal awareness
- Deployment model — self-hosted vs. cloud-only, binary vs. container, operational overhead
- Dependency footprint — external services required (embedding APIs, databases, LLMs)
- Security and encryption — at-rest encryption, tenant isolation, data residency
- Pricing — open-source vs. proprietary, per-query costs, cloud markup
LoCoMo is a benchmark designed specifically for evaluating long-conversation memory systems. It tests three categories: Category 1 (single-hop factual recall), Category 2 (multi-hop reasoning across memories), and Category 3 (temporal reasoning — understanding that facts change over time). Category 3 is the hardest: it requires knowing that "Alice moved to Berlin in March" supersedes "Alice lives in London" from an earlier conversation.
Dakera
Overview
Dakera is a Rust-based memory engine distributed as a single 44 MB static binary. It runs entirely on-device with no external dependencies — no cloud embedding API, no separate database, no Python runtime. The HNSW vector index, BM25 full-text index, and cross-encoder reranker all execute locally within the same process.
Retrieval Architecture
Dakera uses a three-stage hybrid retrieval pipeline:
- Candidate generation — parallel HNSW vector search and BM25 keyword search produce initial candidate sets
- Fusion — candidates are merged using reciprocal rank fusion (RRF), eliminating duplicates while preserving signal from both retrieval paths
- Reranking — a cross-encoder model rescores the top candidates for semantic relevance, with temporal decay and importance weighting applied
This architecture handles the failure modes of vector-only retrieval. BM25 catches exact-match queries that embedding models fumble (names, IDs, specific dates), while the cross-encoder compensates for embedding space limitations on nuanced semantic queries.
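The fusion step is easy to illustrate. Below is a minimal, self-contained sketch of reciprocal rank fusion over two ranked candidate lists; the function names, constant, and example IDs are illustrative only and are not Dakera's internal API.

```python
# Illustrative sketch of reciprocal rank fusion (RRF) over two candidate lists.
# Helper names and the k constant are assumptions, not Dakera's actual API.

def reciprocal_rank_fusion(vector_hits: list[str], keyword_hits: list[str], k: int = 60) -> list[str]:
    """Merge two ranked lists of memory IDs into one fused ranking."""
    scores: dict[str, float] = {}
    for ranked in (vector_hits, keyword_hits):
        for rank, memory_id in enumerate(ranked, start=1):
            # Each list contributes 1 / (k + rank); duplicates accumulate score.
            scores[memory_id] = scores.get(memory_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: "m7" ranks high in both lists, so it tops the fused ranking.
fused = reciprocal_rank_fusion(
    vector_hits=["m3", "m7", "m1"],    # from HNSW vector search
    keyword_hits=["m7", "m9", "m3"],   # from BM25 keyword search
)
print(fused[:3])  # ['m7', 'm3', 'm9']
```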
On-Device Inference
Embeddings are generated locally using quantized ONNX models bundled with the binary. No data leaves the machine for inference — there are no API calls to OpenAI or any external embedding service. This eliminates network latency from the critical path and removes a recurring cost center.
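As a rough illustration of what fully local embedding inference looks like, the sketch below runs a quantized ONNX encoder with onnxruntime and mean-pools the token embeddings. The file names, input names, and pooling choice are assumptions made for the example; Dakera's bundled models are internal to the binary.

```python
# Minimal sketch of fully local embedding inference with onnxruntime.
# Paths, input names, and pooling are assumptions for illustration; they are
# not Dakera's bundled models, which are internal to the binary.
import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("model/tokenizer.json")
session = ort.InferenceSession("model/encoder.int8.onnx")  # quantized encoder

def embed(text: str) -> np.ndarray:
    enc = tokenizer.encode(text)
    inputs = {
        "input_ids": np.array([enc.ids], dtype=np.int64),
        "attention_mask": np.array([enc.attention_mask], dtype=np.int64),
        # Note: some encoders also expect "token_type_ids".
    }
    token_embeddings = session.run(None, inputs)[0]  # shape (1, seq_len, dim)
    return token_embeddings.mean(axis=1)[0]          # mean-pool to one vector

vector = embed("Alice moved to Berlin in March")
print(vector.shape)  # e.g. (384,) -- produced without any network call
```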
Security
All memory data is encrypted at rest with AES-256-GCM. Namespace-level isolation enforces tenant boundaries at the storage layer. The binary runs without network access requirements — it can operate in air-gapped environments.
Integration
Dakera exposes 83 tools via the Model Context Protocol (MCP), plus gRPC and REST APIs. Native SDKs exist for Python, JavaScript, Rust, and Go. The MCP interface means any MCP-compatible agent can use Dakera as its memory backend without custom integration code:
```json
{
  "mcpServers": {
    "dakera": {
      "command": "dakera",
      "args": ["mcp", "--namespace", "my-agent"]
    }
  }
}
```
When Dakera Excels
- Production deployments where benchmark accuracy matters
- Privacy-sensitive workloads (healthcare, legal, finance) that cannot send data to third-party APIs
- Self-hosted infrastructure where you need a single binary, not a Docker Compose stack
- High-throughput scenarios requiring low-latency retrieval without garbage collection pauses
- Multi-agent systems needing cross-agent knowledge sharing with tenant isolation
Mem0
Overview
Mem0 is a Python-based memory framework that has gained significant traction in the prototyping and startup community. It offers both a managed cloud platform and self-hosted deployment, with a clean API that makes integration straightforward. Mem0 focuses on simplicity — get memory working in your agent with minimal code.
Retrieval Architecture
Mem0 uses vector-only retrieval powered by external embedding models (typically OpenAI's text-embedding-3-small or text-embedding-3-large). Memories are stored in a vector database (Qdrant, Pinecone, or ChromaDB depending on configuration). Search is cosine similarity against the embedding space.
The vector-only approach works well for semantic similarity queries but has known weaknesses: exact-match failures (searching for a specific name or date), keyword-dependent queries, and temporal reasoning (no mechanism to prefer recent facts over stale ones without additional application logic).
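To make the limitation concrete, here is a minimal sketch of what vector-only search reduces to: cosine similarity is the only ranking signal, so exact identifiers and recency have no dedicated pathway. The names are illustrative and not Mem0's implementation.

```python
# Sketch of vector-only retrieval: cosine similarity is the sole ranking signal.
# All names are illustrative; this is not Mem0's internal implementation.
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(query_vec: np.ndarray, store: dict[str, np.ndarray], top_k: int = 3):
    # No keyword index and no recency signal: an exact token like "EMP-4892"
    # or a phrase like "last Tuesday" matters only insofar as it shifts the embedding.
    ranked = sorted(store.items(), key=lambda kv: cosine_sim(query_vec, kv[1]), reverse=True)
    return ranked[:top_k]
```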
Dependency Footprint
A Mem0 deployment requires: Python runtime, an embedding API (OpenAI or similar), a vector database (Qdrant/Pinecone/Chroma), and optionally an LLM for memory extraction. The cloud version abstracts these dependencies; self-hosted requires managing them yourself.
Strengths
- Developer experience — clean Python API, excellent documentation, fast time-to-prototype
- Cloud option — managed platform eliminates infrastructure concerns for early-stage projects
- Ecosystem — integrations with LangChain, LlamaIndex, CrewAI, and other popular agent frameworks
- Community — active open-source community with frequent releases
Limitations
- Vector-only retrieval misses keyword and temporal queries
- Dependent on external embedding APIs (latency + cost + data leaves your infrastructure)
- Self-hosted requires managing multiple services (vector DB + embedding API + application layer)
- No built-in encryption at rest in the open-source version
When Mem0 Excels
- Rapid prototyping where time-to-first-memory matters most
- Teams already using OpenAI embeddings who want to minimize new infrastructure
- Cloud-native deployments where managed services are preferred
- Simple use cases where semantic similarity is sufficient (preferences, general facts)
Letta (formerly MemGPT)
Overview
Letta takes a fundamentally different approach to agent memory. Instead of a traditional retrieval pipeline, Letta puts an LLM in the loop of memory management itself. The LLM decides what to remember, how to organize memories, and what to retrieve — treating memory as an LLM reasoning problem rather than an information retrieval problem.
This "LLM-as-memory-manager" paradigm is inspired by the MemGPT paper, which proposed using the LLM's own capabilities to manage a tiered memory system (core memory + archival memory + recall memory).
Architecture
Letta maintains three memory tiers:
- Core memory — always in the LLM's context window (persona, user preferences, key facts)
- Archival memory — long-term storage searched on demand via the LLM's tool calls
- Recall memory — recent conversation history with automatic summarization
The LLM itself issues memory operations (search, insert, update, delete) as tool calls during conversation. This means the quality of memory management depends heavily on the underlying LLM's capabilities.
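The pattern is easiest to see as tool definitions handed to a chat-completions-style API. The schemas below are a generic illustration of LLM-managed memory operations, not Letta's exact tool definitions.

```python
# Generic sketch of the LLM-as-memory-manager pattern: memory operations are
# exposed as tools and the model decides when to call them. These schemas are
# illustrative; they are not Letta's actual tool definitions.
memory_tools = [
    {
        "type": "function",
        "function": {
            "name": "archival_memory_insert",
            "description": "Store a fact in long-term archival memory.",
            "parameters": {
                "type": "object",
                "properties": {"content": {"type": "string"}},
                "required": ["content"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "archival_memory_search",
            "description": "Search archival memory for facts relevant to a query.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
]
# Each turn, the agent loop sends these tools with the conversation; when the model
# emits a tool call, the application executes it and feeds the result back. Every
# round trip is an extra LLM call, which is where the added latency and cost come from.
```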
Strengths
- Creative architecture — the LLM can reason about what's worth remembering, perform implicit deduplication, and summarize proactively
- Flexible memory organization — no fixed schema; the LLM organizes memories however makes sense for the use case
- Conversation continuity — excellent at maintaining narrative coherence across sessions
- Active development — well-funded team with a clear vision for autonomous agent infrastructure
Limitations
- Latency — every memory operation requires an LLM call, adding 500ms-2s per operation
- Cost — memory management consumes LLM tokens, which can be significant at scale
- LLM dependency — memory quality is bounded by the underlying model's capabilities
- Determinism — identical inputs may produce different memory states depending on LLM sampling
- Scale concerns — LLM-in-the-loop doesn't scale to thousands of concurrent agents as efficiently as traditional retrieval
When Letta Excels
- Conversational agents where narrative coherence matters more than raw retrieval speed
- Research and experimentation with novel memory architectures
- Use cases where memory organization is complex and benefits from LLM reasoning
- Small-scale deployments where per-query LLM cost is acceptable
Zep
Overview
Zep combines vector search with knowledge graph enrichment, automatically extracting entities and relationships from conversations and building a graph structure alongside the vector index. Originally open-source, Zep has transitioned to a cloud-first model — the managed Zep Cloud is the primary product, while the self-hosted community edition has been deprecated.
Architecture
Zep's retrieval pipeline enriches memories with structured entity data:
- Ingestion — conversations are processed for embedding generation and entity extraction simultaneously
- Graph construction — extracted entities (people, places, organizations, events) are linked into a knowledge graph
- Hybrid retrieval — queries search both the vector index and traverse the entity graph for related facts
The graph layer adds value for entity-centric queries ("What do I know about Alice?") that might scatter across many individual memory entries in a vector-only system.
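A simplified sketch of the idea — not Zep's actual data model — is to maintain an entity-to-memory index next to the vector store, so entity-centric questions become a direct lookup:

```python
# Hedged sketch of graph-enriched memory: each memory is linked to the entities it
# mentions, so entity-centric queries can be answered by graph lookup rather than
# similarity alone. Illustrative only; not Zep's actual data model.
from collections import defaultdict

entity_index: dict[str, set[str]] = defaultdict(set)   # entity -> memory IDs
memories: dict[str, str] = {}

def ingest(memory_id: str, text: str, entities: list[str]) -> None:
    memories[memory_id] = text
    for entity in entities:                  # entities come from an LLM/NER pass
        entity_index[entity].add(memory_id)

def about(entity: str) -> list[str]:
    """'What do I know about Alice?' -- collect every memory linked to the entity."""
    return [memories[mid] for mid in entity_index.get(entity, set())]

ingest("m1", "Alice moved to Berlin in March", ["Alice", "Berlin"])
ingest("m2", "Alice works at Initech", ["Alice", "Initech"])
print(about("Alice"))  # both facts, even if they sit far apart in vector space
```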
Strengths
- Graph-enriched retrieval — entity extraction and relationship mapping improve recall for "tell me everything about X" queries
- Automatic summarization — conversations are summarized progressively, reducing storage and improving retrieval relevance
- Enterprise features — user management, audit logs, and compliance controls in the cloud version
- Structured data extraction — entities, relationships, and facts are extracted into queryable structures
Limitations
- Cloud lock-in — the OSS edition is deprecated; production use requires Zep Cloud
- No self-hosted path — organizations requiring data sovereignty have limited options
- External LLM dependency — entity extraction and summarization require LLM API calls
- Pricing opacity — cloud costs scale with usage in ways that are hard to predict upfront
When Zep Excels
- Enterprise teams who want managed infrastructure with graph capabilities out of the box
- Use cases heavily focused on entity relationships (CRM agents, people-centric assistants)
- Organizations that prefer cloud services over self-hosted infrastructure
- Teams needing structured entity extraction alongside unstructured memory
Hindsight
Overview
Hindsight is a newer entrant in the agent memory space, emerging from academic research into practical tooling. It focuses on reflective memory — the idea that agents should periodically review and reorganize their memories, identifying patterns and synthesizing insights that weren't apparent during initial storage.
Architecture
Hindsight introduces a "reflection" pass where stored memories are periodically re-examined by an LLM to generate higher-order insights. This is inspired by the Generative Agents paper's reflection mechanism, applied to persistent memory rather than in-context simulation.
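Conceptually, a reflection pass looks something like the sketch below: recent memories are periodically handed to an LLM that distills higher-order insights, which are then stored as meta-memories. The llm() helper is a placeholder, not Hindsight's actual interface.

```python
# Hedged sketch of a reflection pass. The llm() callable is a placeholder for
# whatever model client the application uses; this is not Hindsight's API.
def reflect(recent_memories: list[str], llm) -> list[str]:
    prompt = (
        "Given these observations, state up to three higher-level insights "
        "or patterns, one per line:\n" + "\n".join(f"- {m}" for m in recent_memories)
    )
    insights = [line.strip("- ").strip() for line in llm(prompt).splitlines() if line.strip()]
    return insights  # persisted alongside the raw memories they were derived from
```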
Strengths
- Research-informed design — built on solid cognitive science and AI research foundations
- Insight generation — produces meta-memories that capture patterns across individual facts
- Novel approach — addresses a gap other frameworks ignore (memory consolidation and synthesis)
Limitations
- Early stage — fewer production deployments and less battle-tested than alternatives
- Limited documentation — community and docs are still maturing
- Performance unknown — no published LoCoMo or equivalent benchmark scores
- LLM cost for reflections — periodic re-processing of memory stores adds ongoing compute cost
When Hindsight Excels
- Research projects exploring novel memory architectures
- Use cases where pattern discovery across memories adds value (journaling agents, learning assistants)
- Teams comfortable with early-stage tooling who want to contribute upstream
Head-to-Head Comparison
| Framework | LoCoMo Score | Retrieval | Deployment | Dependencies | Encryption | Pricing |
|---|---|---|---|---|---|---|
| Dakera | 87.6% | HNSW + BM25 + cross-encoder | Self-hosted (single binary) | None (fully self-contained) | AES-256-GCM at rest | Open-core, free tier |
| Mem0 | ~70%* | Vector-only (cosine similarity) | Cloud + self-hosted | OpenAI API + vector DB | Cloud-managed TLS | Free OSS / Cloud pay-per-use |
| Letta | ~65%* | LLM-in-the-loop | Self-hosted (Python) | LLM API (GPT-4/Claude) | Application-level | Open-source / Cloud |
| Zep | ~72%* | Vector + knowledge graph | Cloud-only (OSS deprecated) | LLM API for extraction | Cloud-managed | Cloud pay-per-use |
| Hindsight | Not published | Vector + reflective synthesis | Self-hosted (Python) | LLM API for reflections | Not specified | Open-source |
* Estimated scores based on architecture analysis and community-reported results. Only Dakera publishes official LoCoMo scores from a reproducible benchmark suite run against the full 1,540 question set.
Architecture Deep Dive: Why Retrieval Method Matters
Vector-Only Limitations
Vector search excels at semantic similarity but fails predictably in several cases:
- Exact-match queries — "What is Alice's employee ID?" The answer (a number like "EMP-4892") has no semantic meaning in embedding space
- Temporal queries — "What did Bob say last Tuesday?" requires date awareness that embeddings don't capture
- Negation — "Which projects am I NOT involved in?" is semantically similar to "Which projects am I involved in?" in embedding space
- Keyword specificity — searching for a specific API name, error code, or technical term that embeddings smooth away
Hybrid Retrieval Advantages
Adding BM25 keyword search alongside vector search covers the exact-match and keyword-specificity gaps. The cross-encoder reranking layer then resolves conflicts between the two signal sources, promoting results that are both semantically relevant and lexically precise.
This three-stage pipeline is why Dakera's LoCoMo scores significantly exceed those of vector-only systems. Category 1 (single-hop) benefits from BM25 catching specific facts. Category 2 (multi-hop) benefits from broader candidate generation across both indices. Category 3 (temporal) benefits from the reranker's ability to weight recency signals.
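As an illustration of how recency can be folded into reranking, the sketch below blends a relevance score with an exponential time-decay term; the weights and half-life are assumptions for the example, not Dakera's published parameters.

```python
# Illustrative sketch of recency-aware reranking: blend a semantic relevance score
# with exponential time decay so newer facts outrank stale ones. The weight and
# half-life values are assumptions, not Dakera's published parameters.
import math
import time

def final_score(relevance: float, created_at: float,
                half_life_days: float = 30.0, recency_weight: float = 0.3) -> float:
    age_days = (time.time() - created_at) / 86_400
    recency = math.exp(-math.log(2) * age_days / half_life_days)  # 1.0 now, 0.5 at half-life
    return (1 - recency_weight) * relevance + recency_weight * recency

# "Alice lives in London" (old) vs. "Alice moved to Berlin" (recent): with similar
# relevance scores, the recency term pushes the newer fact to the top.
```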
Deployment and Operations Compared
Binary Simplicity vs. Service Orchestration
The operational difference between frameworks is dramatic:
```bash
# Dakera: one binary, one command
curl -sL https://get.dakera.ai | sh
dakera serve --port 3300

# Mem0 (self-hosted): Python + vector DB + embedding API
pip install mem0ai
# Also need: Qdrant running, OpenAI API key configured
docker run -p 6333:6333 qdrant/qdrant
export OPENAI_API_KEY="sk-..."
python -c "from mem0 import Memory; m = Memory()"

# Letta: Python + LLM API
pip install letta
export OPENAI_API_KEY="sk-..."
letta server --port 8283
```
For production deployments, the dependency count matters. Each external service is a potential failure point, a version to maintain, and a cost to monitor. Dakera's single-binary approach eliminates entire categories of operational incidents.
Resource Footprint
| Framework | RAM (100K memories) | Disk | CPU | Network |
|---|---|---|---|---|
| Dakera | ~400 MB | ~2 GB | Any (ARM/x64) | None required |
| Mem0 | ~1.5 GB (with Qdrant) | ~3 GB | x64 typical | Embedding API calls |
| Letta | ~800 MB | ~1 GB | x64 typical | LLM API calls per operation |
| Zep | Managed (cloud) | Managed (cloud) | Managed (cloud) | All operations via API |
Security and Data Sovereignty
For many organizations, where memory data lives is as important as how well it's retrieved. Agent memories contain sensitive information — user preferences, business context, personal details, and proprietary knowledge.
| Framework | Data Residency | Encryption at Rest | Air-Gap Capable | Tenant Isolation |
|---|---|---|---|---|
| Dakera | Your infrastructure | AES-256-GCM | Yes | Namespace-level |
| Mem0 | Your infra or Mem0 Cloud | Cloud-managed only | No (needs embedding API) | API key level |
| Letta | Your infrastructure | Application-level | No (needs LLM API) | Agent-level |
| Zep | Zep Cloud (AWS regions) | Cloud-managed | No | Project-level |
Only Dakera can operate in a fully air-gapped environment — no network required for any operation including embedding generation. This makes it the only viable option for classified environments, on-premises healthcare systems, and edge deployments without reliable internet.
When to Use Each Framework
Decision Guide
- Choose Dakera for production workloads where benchmark accuracy, data sovereignty, and single-binary operations matter most
- Choose Mem0 for rapid prototyping when time-to-first-memory and a managed cloud option outweigh retrieval accuracy
- Choose Letta when narrative coherence and LLM-driven memory organization matter more than latency and per-query cost
- Choose Zep for entity-centric use cases where managed infrastructure and knowledge-graph features are acceptable trade-offs
- Choose Hindsight for research and experimentation with reflective memory and cross-memory insight generation
Common Migration Paths
Teams often start with one framework and migrate as requirements crystallize:
- Mem0 to Dakera — teams outgrow vector-only retrieval accuracy or want to eliminate the OpenAI embedding dependency. Dakera's import tools support migrating existing memory stores.
- Letta to Dakera — teams find LLM-in-the-loop latency unacceptable at scale and need deterministic, fast retrieval without per-query LLM cost.
- Zep to Dakera — organizations need self-hosting for data sovereignty or want to eliminate cloud vendor lock-in after Zep deprecated their OSS edition.
The State of Agent Memory in 2026
The field has consolidated around several clear approaches: traditional information retrieval (hybrid search), LLM-in-the-loop management, and graph-enriched memory. Each serves different trade-off preferences.
Key trends shaping the landscape:
- MCP as the standard interface — Model Context Protocol is becoming the de facto way agents communicate with memory systems. Frameworks that don't support MCP are increasingly friction-heavy to integrate.
- Self-hosting resurgence — after the initial rush to cloud-managed everything, organizations are pulling sensitive data back on-premises. Agent memories are particularly sensitive — they contain the distilled knowledge of every user interaction.
- Benchmark-driven development — LoCoMo and MTOB have given the field objective quality metrics. Teams can now make informed decisions based on measured accuracy rather than marketing claims.
- Temporal reasoning as differentiator — the hardest category in memory benchmarks (Category 3: temporal) separates production-ready systems from prototypes. Handling "facts change over time" requires architectural choices that can't be bolted on after the fact.
For teams building production agents today, the decision comes down to what you value most: raw accuracy and operational simplicity (Dakera), rapid prototyping speed (Mem0), creative LLM-driven memory (Letta), graph features with managed infrastructure (Zep), or research exploration (Hindsight). There's no wrong choice for a prototype — but for production, benchmark scores and deployment economics should drive the decision.
Ready to evaluate Dakera for your agent memory needs? Install the binary in under 30 seconds and run the full LoCoMo benchmark yourself. The benchmark suite is included — no separate download required. See the quickstart guide to begin.