Why Benchmarking Memory is Hard
Benchmarking AI memory systems is fundamentally different from benchmarking language models or retrieval systems in isolation. Memory involves a pipeline: ingestion, storage, retrieval, temporal reasoning, and finally the quality of the answer produced using retrieved context. A weakness at any stage cascades into a wrong answer, but the benchmark only observes the final output.
Additionally, memory benchmarks must test capabilities that don't exist in standard retrieval evaluations:
- Temporal reasoning — can the system answer "when did X change?" or "what was the most recent Y?"
- Multi-hop inference — can it combine facts across multiple memories?
- Contradiction handling — when older and newer information conflict, does it prefer the update?
- Long-term retention — does performance degrade as memory volume grows?
The LoCoMo Benchmark
LoCoMo (Long-term Conversational Memory) is the most widely used benchmark for evaluating AI agent memory. It was introduced in a 2024 research paper and has become the de facto standard for comparing memory systems.
Structure
LoCoMo consists of 1,540 questions derived from long conversations (10,000+ tokens each). Questions are categorized into three types:
| Category | Type | What It Tests | Example |
|---|---|---|---|
| Cat 1 | Single-hop | Direct fact retrieval | "What is Alice's favorite color?" |
| Cat 2 | Multi-hop | Combining multiple memories | "What do Alice and Bob have in common?" |
| Cat 3 | Temporal | Time-aware reasoning | "When did Alice change her mind about the project?" |
Evaluation Method
Each question has a ground-truth answer. The memory system ingests the conversation, then answers each question. Answers are evaluated using an LLM-as-judge approach that checks for semantic correctness — not exact string matching. This allows for different phrasings of correct answers.
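To make the judging step concrete, here is a minimal sketch of an LLM-as-judge scorer. The prompt wording, the `judge_answer` function, and the `call_judge` callable are all assumptions for illustration; they are not LoCoMo's official judge prompt or any library's API. `call_judge` stands in for whatever function sends a prompt to your judge model and returns its text reply.

```python
from typing import Callable

# Hypothetical prompt template; the official benchmark prompt may differ.
JUDGE_TEMPLATE = (
    "You are grading an answer for semantic correctness.\n"
    "Question: {question}\n"
    "Reference answer: {ground_truth}\n"
    "Candidate answer: {predicted}\n"
    "Reply with exactly one word: CORRECT or WRONG."
)


def judge_answer(question: str, predicted: str, ground_truth: str,
                 call_judge: Callable[[str], str]) -> float:
    """Return 1.0 if the judge accepts the candidate answer, else 0.0."""
    prompt = JUDGE_TEMPLATE.format(
        question=question, ground_truth=ground_truth, predicted=predicted
    )
    verdict = call_judge(prompt).strip().upper()
    # Anything other than an explicit CORRECT counts as wrong,
    # so malformed judge output never inflates the score.
    return 1.0 if verdict.startswith("CORRECT") else 0.0
```

Passing the judge in as a callable keeps the scoring logic testable without network access: a stub such as `lambda prompt: "CORRECT"` is enough to exercise it.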
Why LoCoMo Matters
Before LoCoMo, memory systems were evaluated on retrieval metrics alone (recall@k, NDCG). But high retrieval recall doesn't guarantee correct answers — you might retrieve the right passages but still fail at temporal inference or multi-hop reasoning. LoCoMo evaluates the full pipeline end-to-end.
Dakera's LoCoMo Performance
Dakera scores 88.2% overall on the full 1,540-question LoCoMo benchmark. Here's the breakdown by category:
| Category | Questions | Score | Key Capability |
|---|---|---|---|
| Cat 1 (Single-hop) | ~700 | ~96% | Hybrid retrieval (HNSW + BM25) |
| Cat 2 (Multi-hop) | ~500 | ~90% | Cross-encoder reranking + context fusion |
| Cat 3 (Temporal) | ~340 | 70.7% | Temporal inference + ML classification |
Category 3 remains the most challenging — temporal reasoning requires understanding not just what was said, but when it was said and how recent information supersedes older statements. Dakera's ML query classifier identifies temporal queries at routing time and applies specialized retrieval strategies (recency-weighted scoring, temporal filtering) for these cases.
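As an illustration of what recency-weighted scoring can look like, the sketch below blends a similarity score with an exponential decay over conversation turns. This is not Dakera's actual scoring function: the `Candidate` type, the `half_life` and `weight` parameters, and their default values are assumptions chosen only to show the idea.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    content: str
    similarity: float  # retrieval score in [0, 1]
    turn: int          # position in the conversation; larger means more recent


def recency_weighted(candidates, latest_turn, half_life=20.0, weight=0.3):
    """Rank candidates by a blend of similarity and exponential recency decay."""
    def score(c):
        age = max(latest_turn - c.turn, 0)
        # 1.0 for the newest turn, halving every `half_life` turns.
        recency = 0.5 ** (age / half_life)
        return (1.0 - weight) * c.similarity + weight * recency
    return sorted(candidates, key=score, reverse=True)


candidates = [
    Candidate("favorite language is Python", 0.80, turn=5),
    Candidate("switched to Rust", 0.78, turn=23),
    Candidate("went back to Python", 0.76, turn=47),
]
ranked = recency_weighted(candidates, latest_turn=47)
```

With `weight=0` the ranking falls back to pure similarity, which makes it easy to measure how much the recency term changes results on temporal queries.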
What LoCoMo Doesn't Test
No single benchmark captures every aspect of memory quality. LoCoMo has known limitations:
- Static conversations — the test conversations are synthetic and pre-written, not generated from real agent interactions
- No knowledge graph evaluation — LoCoMo doesn't test entity extraction or graph traversal capabilities
- No scale testing — conversations are ~10K tokens. Real agents may accumulate millions of tokens over months
- Single-user only — doesn't test multi-tenant isolation or cross-agent memory sharing
- English only — no multilingual evaluation
The MTOB Benchmark
MTOB (Machine Theory of Belief) evaluates whether a memory system can track changing beliefs and states over time. While LoCoMo tests factual recall, MTOB tests whether the system understands that facts can change and that the most recent statement supersedes older ones.
Structure
MTOB presents conversations where entities change state multiple times:
"Alice's favorite programming language is Python." (turn 5)
"Alice switched to Rust last month." (turn 23)
"Actually, Alice went back to Python after trying Rust." (turn 47)
The system must correctly answer "What is Alice's current favorite language?" by understanding the temporal sequence of updates.
Why MTOB Complements LoCoMo
A system could score well on LoCoMo's temporal category by simply preferring recent results (high recency weight). MTOB specifically tests whether the system understands state transitions — not just recency, but the logical relationship between contradicting statements.
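One way to picture state tracking, as opposed to simple recency weighting, is to replay extracted updates in turn order and keep the full history per entity and attribute. The sketch below assumes the updates have already been extracted into `(turn, entity, attribute, value)` tuples, which is a hypothetical layout and also the hard part in practice; it is not how MTOB or Dakera represents state.

```python
from collections import defaultdict


def replay_states(updates):
    """Replay (turn, entity, attribute, value) updates in turn order.

    Returns {(entity, attribute): [values...]} with consecutive
    duplicates collapsed, so the list reads as a sequence of transitions.
    """
    history = defaultdict(list)
    for turn, entity, attribute, value in sorted(updates, key=lambda u: u[0]):
        states = history[(entity, attribute)]
        if not states or states[-1] != value:
            states.append(value)
    return dict(history)


# Retrieval order is arbitrary; turn numbers carry the real sequence.
updates = [
    (47, "Alice", "favorite_language", "Python"),
    (5, "Alice", "favorite_language", "Python"),
    (23, "Alice", "favorite_language", "Rust"),
]
history = replay_states(updates)
current = history[("Alice", "favorite_language")][-1]
```

Keeping the whole transition list, rather than only the latest value, lets the system answer both "what is it now?" and "did it ever change?", which recency weighting alone cannot distinguish.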
Evaluation Beyond Benchmarks
Retrieval Quality Metrics
Before looking at the end-to-end benchmark score, measure your retrieval pipeline's raw performance:
# Evaluate retrieval quality with Dakera's built-in analytics
stats = client.analytics.retrieval_quality(
    namespace="test-ns",
    evaluation_set="locomo-eval",  # pre-loaded evaluation queries
)
# Returns:
# {
#     "recall_at_5": 0.94,
#     "recall_at_10": 0.97,
#     "mrr": 0.91,
#     "avg_latency_ms": ...,
#     "p99_latency_ms": ...
# }
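Recall@k and MRR are standard ranking metrics, and it can help to compute them yourself from raw search results as a cross-check. The sketch below is a plain-Python version under the assumption that each query comes with a ranked list of retrieved memory IDs and a list of relevant IDs; the function names and the sample data are illustrative and are not part of Dakera's analytics API.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant ID, or 0.0 if none was retrieved."""
    relevant = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


# Each entry: (ranked retrieved IDs, relevant IDs) for one query.
runs = [
    (["m3", "m7", "m1"], ["m1", "m7"]),
    (["m9", "m2", "m4"], ["m4"]),
]
avg_recall_at_2 = mean(recall_at_k(r, rel, 2) for r, rel in runs)
mrr = mean(reciprocal_rank(r, rel) for r, rel in runs)
```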
Operational Metrics That Matter
Beyond accuracy, production memory systems need to perform under real-world constraints:
| Metric | What It Tells You | Dakera Design |
|---|---|---|
| Search latency | Query speed under load | Low-latency (no GC pauses) |
| Write throughput | Memories ingested per second | Concurrent, lock-free writes |
| Memory overhead | RAM per 100K memories | ~400 MB |
| Index build time | Time to build HNSW from cold | ~45s for 100K vectors |
How to Run LoCoMo Against Your Memory System
The LoCoMo benchmark is open-source and can be run against any memory system with a search API. Here's how to evaluate Dakera:
from dakera import DakeraClient
from locomo_eval import LoCoMoEvaluator, load_conversations, load_questions

client = DakeraClient(base_url="http://localhost:3300")

# Step 1: Ingest all LoCoMo conversations
conversations = load_conversations("locomo_v1/conversations.json")
for conv in conversations:
    for turn in conv["turns"]:
        client.store_memory(
            agent_id=f"locomo-{conv['id']}",
            content=turn["content"],
            metadata={"turn_number": turn["number"], "speaker": turn["speaker"]},
        )

# Step 2: Run questions against the memory
questions = load_questions("locomo_v1/questions.json")
evaluator = LoCoMoEvaluator(judge_model="claude-sonnet-4-20250514")
results = []
for q in questions:
    # Retrieve relevant memories
    memories = client.search_memories(
        agent_id=f"locomo-{q['conversation_id']}",
        query=q["question"],
        top_k=10,
    )
    # Generate answer using retrieved context.
    # NOTE: `llm` is not defined in this snippet; it stands for whatever
    # generation-model client you use.
    context = "\n".join(m.content for m in memories)
    answer = llm.complete(f"Context:\n{context}\n\nQuestion: {q['question']}\nAnswer:")
    # Evaluate
    score = evaluator.judge(
        question=q["question"],
        predicted=answer,
        ground_truth=q["answer"],
    )
    results.append({"category": q["category"], "score": score})

# Step 3: Aggregate scores
for cat in [1, 2, 3]:
    cat_results = [r for r in results if r["category"] == cat]
    if not cat_results:
        continue  # avoid dividing by zero when a category has no questions
    avg = sum(r["score"] for r in cat_results) / len(cat_results)
    print(f"Category {cat}: {avg:.1%}")
overall = sum(r["score"] for r in results) / len(results)
print(f"Overall: {overall:.1%}")
Common Pitfalls in Memory Benchmarking
1. Testing on Subsets
Running only 100 questions instead of the full 1,540 produces unstable results. Category 3 in particular has high variance on small samples. Always run the full set for publishable numbers.
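A quick back-of-the-envelope calculation shows why small subsets are unstable. The sketch below uses the normal approximation for a proportion and assumes each question is an independent pass/fail trial, which is a simplification; the 0.88 accuracy is just a round figure near the overall score quoted earlier.

```python
import math


def margin_of_error(accuracy, n, z=1.96):
    """Half-width of the normal-approximation 95% interval for a proportion."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n)


for n in (100, 340, 1540):
    print(f"n={n:>5}: ±{margin_of_error(0.88, n):.1%}")
```

Under these assumptions a 100-question subset carries a margin of roughly six points, while the full 1,540-question set brings it under two.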
2. Optimizing for the Benchmark
It's tempting to tune hyperparameters specifically for LoCoMo's question distribution. This produces misleading scores that don't transfer to real workloads. Optimize for your actual use case, then report benchmark scores as a sanity check.
3. Ignoring the Judge Model
LLM-as-judge evaluation introduces variance. Different judge models (GPT-4, Claude, Gemini) may score the same answer differently. Always report which judge model you used, and be aware that comparing scores across different judge models is not meaningful.
4. Not Controlling for the Generation Model
The final answer quality depends on both the memory system (retrieval) and the generation model (answer production). A better generation model can compensate for weaker retrieval. Control this by using the same generation model across all systems you're comparing.
The Future of Memory Benchmarks
The field is moving toward more comprehensive evaluation suites that test:
- Scale — how performance changes from 1K to 1M memories
- Multi-session — retrieval across separate conversations over weeks/months
- Adversarial — handling conflicting information, hallucination detection
- Knowledge graph — entity extraction accuracy, relationship inference
- Efficiency — quality-per-compute, scoring both accuracy and resource usage
Until these benchmarks mature, LoCoMo remains the best single number for comparing memory systems. But treat it as one signal among many — not the complete picture.
Benchmark Reproducibility
A benchmark score is only meaningful if others can reproduce it. Here's what you need to control for reproducible memory evaluation:
Variables to Control
| Variable | Impact | Recommendation |
|---|---|---|
| Judge model | ±3% score variance | Fix to one model, report which |
| Temperature | ±1% on temporal questions | Set to 0 for deterministic judging |
| Retrieval limit (K) | Higher K = higher recall, slower | K=10 standard, report if different |
| Embedding model | ±5% depending on domain fit | Report exact model and version |
| Chunk strategy | Major impact on Cat 2/3 | Document chunking approach used |
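One lightweight way to keep these variables fixed and reported is to write them into a manifest that is saved next to every result file. The sketch below is an assumption about how you might structure that record: the `EvalRunConfig` class and its field names are hypothetical, and the angle-bracket placeholders are left for you to fill in with the exact models you used.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class EvalRunConfig:
    judge_model: str
    judge_temperature: float
    generation_model: str
    embedding_model: str
    top_k: int
    chunk_strategy: str
    num_questions: int


config = EvalRunConfig(
    judge_model="claude-sonnet-4-20250514",
    judge_temperature=0.0,
    generation_model="<your generation model>",
    embedding_model="<embedding model and version>",
    top_k=10,
    chunk_strategy="one memory per conversation turn",
    num_questions=1540,
)

# Sorted keys make manifests from different runs easy to diff.
manifest = json.dumps(asdict(config), indent=2, sort_keys=True)
```

Publishing the manifest alongside the score lets readers see at a glance whether two reported numbers are comparable.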
Statistical Significance
A 1-2% difference between systems is likely noise, not signal. Category 3 (temporal) has particularly high variance due to the ambiguity of temporal questions. When reporting results:
- Run the full 1,540 questions (no subsets)
- Report confidence intervals — bootstrap sampling over the question set gives ±1.5% at 95% CI for overall scores
- On Cat 3 specifically, ±3% is within noise. Only claim improvement at ≥5% delta
- If comparing two versions of your system, use paired comparisons (same questions, same judge) to reduce variance
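The confidence-interval and paired-comparison advice above can be sketched with a percentile bootstrap over per-question scores. This is a minimal illustration using only the standard library, assuming scores are stored as one number per question in a fixed question order; the function names and defaults are hypothetical.

```python
import random


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-question scores."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def paired_bootstrap_ci(scores_a, scores_b, **kwargs):
    """CI for mean(B - A) when both systems answered the same questions in the same order."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired comparison needs one score per question for each system")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    return bootstrap_ci(diffs, **kwargs)
```

If the paired interval for `B - A` contains zero, the difference between the two versions is not distinguishable from noise on this question set. Resampling the per-question differences, rather than each system separately, is what removes the question-difficulty variance.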
Building Your Own Evaluation Suite
While LoCoMo is the standard public benchmark, you should also evaluate against your specific workload. Here's a framework for building domain-specific memory evaluations:
# Create a custom evaluation set from your production conversations
from dakera import DakeraClient

client = DakeraClient(base_url="http://localhost:3300")

# Step 1: Sample real conversations from your agent
conversations = client.list_sessions(agent_id="production", limit=50)

# Step 2: Generate ground-truth Q&A pairs (human-labeled or LLM-generated)
# NOTE: generate_eval_pairs() and evaluate() are not defined in this snippet;
# treat them as placeholders for your own helpers.
eval_pairs = []
for conv in conversations:
    memories = client.session_memories(session_id=conv.id)
    # Generate questions that test the capabilities you care about:
    # - Single-hop: "What did the user request in this session?"
    # - Multi-hop: "How does this session's request relate to last week's?"
    # - Temporal: "What changed between session 3 and session 7?"
    # - Entity: "What projects does this user work on?"
    pairs = generate_eval_pairs(memories, categories=["single", "multi", "temporal", "entity"])
    eval_pairs.extend(pairs)

# Step 3: Run evaluation and track over time
results = evaluate(client, eval_pairs, judge="claude-sonnet-4-20250514")
print(f"Domain-specific score: {results.overall:.1%}")
print(f"By category: {results.by_category}")
What to Test Beyond Accuracy
A production memory system isn't just about getting the right answer — it's about getting it reliably under real conditions:
- Latency under load — P50 and P99 search latency with 100 concurrent queries
- Accuracy at scale — does Cat 3 score drop when the agent has 100K memories vs. 1K?
- Cold start behavior — performance after server restart (HNSW index reload time)
- Ingestion throughput — memories stored per second during active conversation
- Cross-namespace isolation — ensure Agent A's memories never leak into Agent B's results
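The first item above, P50 and P99 latency under concurrent load, can be measured with a small harness like the one below. It is a sketch under stated assumptions: `search_fn` is a hypothetical callable that wraps your search call for a single query, threads are used as a stand-in for concurrent clients, and percentiles use the simple nearest-rank method.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(len(sorted_values) * pct / 100))
    return sorted_values[rank - 1]


def measure_latency(search_fn, queries, concurrency=100):
    """Run search_fn over queries with `concurrency` worker threads.

    Returns (p50_ms, p99_ms) measured from the caller's side.
    """
    def timed(query):
        start = time.perf_counter()
        search_fn(query)
        return (time.perf_counter() - start) * 1000.0

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed, queries))
    return percentile(latencies, 50), percentile(latencies, 99)
```

Because timing happens on the client side, the numbers include network and serialization overhead, which is usually what matters to the agent calling the memory system.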
Competitive Landscape (2026)
The agent memory space is maturing rapidly. Here's how the major systems perform on LoCoMo as of May 2026:
| System | Overall | Cat 1 | Cat 2 | Cat 3 | Architecture |
|---|---|---|---|---|---|
| Dakera v0.11.55 | 88.2% | ~96% | ~90% | 70.7% | Hybrid + ML classifier |
| Mem0 | 92.5%* | — | — | — | LLM-assisted reranking, OpenAI embeddings |
| Mnemis | 93.9% | — | — | — | Research system |
| 0GMem | 88.7% | — | — | — | Graph-based |
* Mem0's 92.5% score is from their token-efficient memory algorithm research (June 2026), which uses LLM-assisted reranking per query. Mnemis is a research system, not production software. Dakera runs entirely on-device with ONNX inference and makes no API calls during search. The tradeoff is a lower absolute score in exchange for predictable latency, full data privacy, and no per-query API cost.
For more details on how we run and interpret LoCoMo results, see our benchmark methodology post and the live results on our benchmark page.