AI Agent Memory Benchmarks 2026: LoCoMo, MTOB, and Beyond

Why Benchmarking Memory Is Hard

Benchmarking AI memory systems is fundamentally different from benchmarking language models or retrieval systems in isolation. Memory involves a pipeline: ingestion, storage, retrieval, temporal reasoning, and finally the quality of the answer produced using retrieved context. A weakness at any stage cascades into a wrong answer, but the benchmark only observes the final output.

[Figure: LoCoMo benchmark comparison showing Dakera at 88.2% vs Mem0 at 91.6% and Zep at 74.6%]

Memory benchmarks must also test capabilities that standard retrieval evaluations do not cover:

Temporal reasoning — can the system answer "when did X change?" or "what was the most recent Y?"
Multi-hop inference — can it combine facts across multiple memories?
Contradiction handling — when older and newer information conflict, does it prefer the update?
Long-term retention — does performance degrade as memory volume grows?

The LoCoMo Benchmark

LoCoMo (Long Conversation Memory) is the most widely used benchmark for evaluating AI agent memory. It was introduced in a 2024 research paper and has become the de facto standard for comparing memory systems.

Structure

LoCoMo consists of 1,540 questions derived from long conversations (10,000+ tokens each). Questions are categorized into three types:

Category	Type	What It Tests	Example
Cat 1	Single-hop	Direct fact retrieval	"What is Alice's favorite color?"
Cat 2	Multi-hop	Combining multiple memories	"What do Alice and Bob have in common?"
Cat 3	Temporal	Time-aware reasoning	"When did Alice change her mind about the project?"

Evaluation Method

Each question has a ground-truth answer. The memory system ingests the conversation, then answers each question. Answers are evaluated using an LLM-as-judge approach that checks for semantic correctness — not exact string matching. This allows for different phrasings of correct answers.
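
As a rough illustration of the judging step, the sketch below builds a grading prompt and parses the judge's reply into a score. The prompt wording, the CORRECT/WRONG labels, and both function names are assumptions made for this example, not LoCoMo's actual judge prompt. The call to the judge model is deliberately left out.

```python
def build_judge_prompt(question: str, predicted: str, ground_truth: str) -> str:
    """Assemble a grading prompt (hypothetical wording, not LoCoMo's own)."""
    return (
        "You are grading an answer for semantic correctness.\n"
        f"Question: {question}\n"
        f"Reference answer: {ground_truth}\n"
        f"Predicted answer: {predicted}\n"
        "Reply with exactly one word: CORRECT or WRONG."
    )


def parse_verdict(judge_reply: str) -> float:
    """Map the judge's reply to a score; anything unrecognized counts as wrong."""
    words = judge_reply.strip().upper().split()
    first = words[0].strip(".,!:;") if words else ""
    return 1.0 if first == "CORRECT" else 0.0
```

Treating unparseable replies as wrong is a conservative choice. Counting how often that happens is worth doing, since a high rate points at the judge prompt rather than the memory system.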

Why LoCoMo Matters

Before LoCoMo, memory systems were evaluated on retrieval metrics alone (recall@k, NDCG). But high retrieval recall doesn't guarantee correct answers — you might retrieve the right passages but still fail at temporal inference or multi-hop reasoning. LoCoMo evaluates the full pipeline end-to-end.

Dakera's LoCoMo Performance

Dakera scores 88.2% overall on the full 1,540-question LoCoMo benchmark. Here's the breakdown by category:

Category	Questions	Score	Key Capability
Cat 1 (Single-hop)	~700	~96%	Hybrid retrieval (HNSW + BM25)
Cat 2 (Multi-hop)	~500	~90%	Cross-encoder reranking + context fusion
Cat 3 (Temporal)	~340	70.7%	Temporal inference + ML classification

Category 3 remains the most challenging — temporal reasoning requires understanding not just what was said, but when it was said and how recent information supersedes older statements. Dakera's ML query classifier identifies temporal queries at routing time and applies specialized retrieval strategies (recency-weighted scoring, temporal filtering) for these cases.
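
To make recency-weighted scoring concrete, here is a minimal sketch that blends a similarity score with an exponential decay on memory age. The `Hit` type, the half-life, and the blend weight are illustrative assumptions, not Dakera's actual implementation or defaults.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    content: str
    similarity: float  # relevance score from retrieval, assumed to lie in [0, 1]
    age_days: float    # time since the memory was stored


def recency_weighted(hits, half_life_days=30.0, recency_weight=0.3):
    """Blend similarity with an exponential recency decay, best match first."""
    def score(hit):
        # 1.0 for a brand-new memory, 0.5 after one half-life, and so on.
        recency = 0.5 ** (hit.age_days / half_life_days)
        return (1 - recency_weight) * hit.similarity + recency_weight * recency
    return sorted(hits, key=score, reverse=True)


hits = [
    Hit("Alice prefers the old design", similarity=0.82, age_days=120),
    Hit("Alice now prefers the new design", similarity=0.80, age_days=2),
]
ranked = recency_weighted(hits)
print(ranked[0].content)
```

With `recency_weight=0` the ranking falls back to pure similarity, which makes the weight easy to ablate when checking how much of a temporal score comes from recency alone.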

What LoCoMo Doesn't Test

No single benchmark captures every aspect of memory quality. LoCoMo has known limitations:

Static conversations — the test conversations are synthetic and pre-written, not generated from real agent interactions
No knowledge graph evaluation — LoCoMo doesn't test entity extraction or graph traversal capabilities
No scale testing — conversations are ~10K tokens. Real agents may accumulate millions of tokens over months
Single-user only — doesn't test multi-tenant isolation or cross-agent memory sharing
English only — no multilingual evaluation

The MTOB Benchmark

MTOB (Machine Theory of Belief) evaluates whether a memory system can track changing beliefs and states over time. While LoCoMo tests factual recall, MTOB tests whether the system understands that facts can change and that the most recent statement supersedes older ones.

Structure

MTOB presents conversations where entities change state multiple times:

"Alice's favorite programming language is Python." (turn 5)
"Alice switched to Rust last month." (turn 23)
"Actually, Alice went back to Python after trying Rust." (turn 47)

The system must correctly answer "What is Alice's current favorite language?" by understanding the temporal sequence of updates.
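
The expected behavior can be sketched as replaying state updates in turn order. This assumes the updates have already been extracted into structured form, which is the hard part in practice. The `Update` type and the `favorite_language` attribute name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Update:
    turn: int
    entity: str
    attribute: str
    value: str


def current_state(updates):
    """Replay updates in turn order; later turns overwrite earlier ones."""
    state = {}
    for update in sorted(updates, key=lambda u: u.turn):
        state[(update.entity, update.attribute)] = update.value
    return state


# The three statements from the example, deliberately out of order.
updates = [
    Update(23, "Alice", "favorite_language", "Rust"),
    Update(5, "Alice", "favorite_language", "Python"),
    Update(47, "Alice", "favorite_language", "Python"),
]
state = current_state(updates)
print(state[("Alice", "favorite_language")])
```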

Why MTOB Complements LoCoMo

A system could score well on LoCoMo's temporal category by simply preferring recent results (high recency weight). MTOB specifically tests whether the system understands state transitions — not just recency, but the logical relationship between contradicting statements.

Evaluation Beyond Benchmarks

Retrieval Quality Metrics

Before the end-to-end benchmark score, you should understand your retrieval pipeline's raw performance:

# Evaluate retrieval quality with Dakera's built-in analytics
from dakera import DakeraClient

client = DakeraClient(base_url="http://localhost:3300")

stats = client.analytics.retrieval_quality(
    namespace="test-ns",
    evaluation_set="locomo-eval"  # pre-loaded evaluation queries
)

# Returns:
# {
#   "recall_at_5": 0.94,
#   "recall_at_10": 0.97,
#   "mrr": 0.91,
#   "avg_latency_ms": ...,
#   "p99_latency_ms": ...
# }
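
For readers who want to compute these metrics independently of any particular client, the sketch below shows the standard definitions of recall@k and mean reciprocal rank. The memory ids and relevance labels are made up for illustration.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant items that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical per-query data: (ranked result ids, ids labeled relevant).
queries = [
    (["m3", "m7", "m1"], ["m7"]),
    (["m2", "m9", "m4"], ["m4", "m8"]),
]
mean_recall_at_2 = sum(recall_at_k(r, rel, 2) for r, rel in queries) / len(queries)
mrr = sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)
print(f"recall@2={mean_recall_at_2:.2f}  mrr={mrr:.3f}")
```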

Operational Metrics That Matter

Beyond accuracy, production memory systems need to perform under real-world constraints:

Metric	What It Tells You	Dakera Design
Search latency	Query speed under load	Low-latency (no GC pauses)
Write throughput	Memories ingested per second	Concurrent, lock-free writes
Memory overhead	RAM per 100K memories	~400 MB
Index build time	Time to build HNSW from cold	~45s for 100K vectors

How to Run LoCoMo Against Your Memory System

The LoCoMo benchmark is open source and can be run against any memory system with a search API. Here's how to evaluate Dakera:

from dakera import DakeraClient
from locomo_eval import LoCoMoEvaluator, load_conversations, load_questions

client = DakeraClient(base_url="http://localhost:3300")

# Step 1: Ingest all LoCoMo conversations
conversations = load_conversations("locomo_v1/conversations.json")
for conv in conversations:
    for turn in conv["turns"]:
        client.store_memory(
            agent_id=f"locomo-{conv['id']}",
            content=turn["content"],
            metadata={"turn_number": turn["number"], "speaker": turn["speaker"]}
        )

# Step 2: Run questions against the memory
questions = load_questions("locomo_v1/questions.json")
evaluator = LoCoMoEvaluator(judge_model="claude-sonnet-4-20250514")

results = []
for q in questions:
    # Retrieve relevant memories
    memories = client.search_memories(
        agent_id=f"locomo-{q['conversation_id']}",
        query=q["question"],
        top_k=10
    )

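    # Generate answer using retrieved context.
    # `llm` is a placeholder for your generation-model client; it is not defined
    # in this snippet. Use the same model for every system you compare.
    context = "\n".join(m.content for m in memories)
    answer = llm.complete(f"Context:\n{context}\n\nQuestion: {q['question']}\nAnswer:")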

    # Evaluate
    score = evaluator.judge(
        question=q["question"],
        predicted=answer,
        ground_truth=q["answer"]
    )
    results.append({"category": q["category"], "score": score})

# Step 3: Aggregate scores
for cat in [1, 2, 3]:
    cat_results = [r for r in results if r["category"] == cat]
    if not cat_results:
        continue  # avoid dividing by zero when a category has no questions
    avg = sum(r["score"] for r in cat_results) / len(cat_results)
    print(f"Category {cat}: {avg:.1%}")

overall = sum(r["score"] for r in results) / len(results)
print(f"Overall: {overall:.1%}")

Common Pitfalls in Memory Benchmarking

1. Testing on Subsets

Running only 100 questions instead of the full 1,540 produces unstable results. Category 3 in particular has high variance on small samples. Always run the full set for publishable numbers.
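
A back-of-the-envelope calculation shows why small subsets are unstable. The sketch below uses the normal approximation for a binomial proportion and assumes questions are scored independently as right or wrong, which is a simplification. The 0.88 accuracy is just a representative value.

```python
import math


def margin_of_error(accuracy, n, z=1.96):
    """Approximate 95% margin of error for an accuracy measured on n questions."""
    return z * math.sqrt(accuracy * (1 - accuracy) / n)


for n in (100, 340, 1540):
    print(f"n={n}: +/- {margin_of_error(0.88, n):.1%}")
```

Under these assumptions, a 100-question subset carries a margin of roughly six percentage points, while the full 1,540 questions bring it under two.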

2. Optimizing for the Benchmark

It's tempting to tune hyperparameters specifically for LoCoMo's question distribution. This produces misleading scores that don't transfer to real workloads. Optimize for your actual use case, then report benchmark scores as a sanity check.

3. Ignoring the Judge Model

LLM-as-judge evaluation introduces variance. Different judge models (GPT-4, Claude, Gemini) may score the same answer differently. Always report which judge model you used, and be aware that comparing scores across different judge models is not meaningful.

4. Not Controlling for the Generation Model

The final answer quality depends on both the memory system (retrieval) and the generation model (answer production). A better generation model can compensate for weaker retrieval. Control this by using the same generation model across all systems you're comparing.

The Future of Memory Benchmarks

The field is moving toward more comprehensive evaluation suites that test:

Scale — how performance changes from 1K to 1M memories
Multi-session — retrieval across separate conversations over weeks/months
Adversarial — handling conflicting information, hallucination detection
Knowledge graph — entity extraction accuracy, relationship inference
Efficiency — quality-per-compute, scoring both accuracy and resource usage

Until these benchmarks mature, LoCoMo remains the best single number for comparing memory systems. But treat it as one signal among many — not the complete picture.

Benchmark Reproducibility

A benchmark score is only meaningful if others can reproduce it. Here's what you need to control for reproducible memory evaluation:

Environment Variables

Variable	Impact	Recommendation
Judge model	±3% score variance	Fix to one model, report which
Temperature	±1% on temporal questions	Set to 0 for deterministic judging
Retrieval limit (K)	Higher K = higher recall, slower	K=10 standard, report if different
Embedding model	±5% depending on domain fit	Report exact model and version
Chunk strategy	Major impact on Cat 2/3	Document chunking approach used
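
One lightweight way to keep these variables under control is to record them in a small manifest stored next to the scores. The sketch below is one possible shape for that record. The field names are assumptions, and the angle-bracket values are placeholders to fill in.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalConfig:
    judge_model: str
    judge_temperature: float
    generation_model: str
    retrieval_k: int
    embedding_model: str
    chunk_strategy: str


config = EvalConfig(
    judge_model="claude-sonnet-4-20250514",
    judge_temperature=0.0,
    generation_model="<your generation model>",
    retrieval_k=10,
    embedding_model="<exact embedding model and version>",
    chunk_strategy="one memory per conversation turn",
)

# Store this next to the scores so the run can be reproduced later.
manifest = json.dumps(asdict(config), indent=2, sort_keys=True)
print(manifest)
```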

Statistical Significance

A difference of 1-2 percentage points between systems is likely noise, not signal. Category 3 (temporal) has particularly high variance because it has the fewest questions and temporal questions are often ambiguous. When reporting results:

Run the full 1,540 questions (no subsets)
Report confidence intervals: bootstrap sampling over the question set gives roughly ±1.5 percentage points at 95% confidence for overall scores
On Cat 3 specifically, ±3 percentage points is within noise. Only claim an improvement at a delta of 5 points or more
If comparing two versions of your system, use paired comparisons (same questions, same judge) to reduce variance
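
The bootstrap and paired-comparison advice can be sketched with the standard library alone. The scores below are synthetic 0/1 values generated for illustration, not real benchmark results, and the function names are hypothetical.

```python
import random


def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-question scores."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def paired_bootstrap_ci(scores_a, scores_b, **kwargs):
    """CI for the mean difference (A - B) on the same questions and judge."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired comparison needs the same questions for both systems")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return bootstrap_ci(diffs, **kwargs)


# Synthetic scores for two hypothetical systems on 1,540 questions.
rng = random.Random(42)
system_a = [1.0 if rng.random() < 0.88 else 0.0 for _ in range(1540)]
system_b = [1.0 if rng.random() < 0.87 else 0.0 for _ in range(1540)]

print("A:", bootstrap_ci(system_a))
print("A - B:", paired_bootstrap_ci(system_a, system_b))
```

If the interval for the difference includes zero, the two systems are not distinguishable on this question set.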

Building Your Own Evaluation Suite

While LoCoMo is the standard public benchmark, you should also evaluate against your specific workload. Here's a framework for building domain-specific memory evaluations:

# Create a custom evaluation set from your production conversations
from dakera import DakeraClient

client = DakeraClient(base_url="http://localhost:3300")

# Step 1: Sample real conversations from your agent
conversations = client.list_sessions(agent_id="production", limit=50)

# Step 2: Generate ground-truth Q&A pairs (human-labeled or LLM-generated)
eval_pairs = []
for conv in conversations:
    memories = client.session_memories(session_id=conv.id)
    # Generate questions that test the capabilities you care about:
    # - Single-hop: "What did the user request in this session?"
    # - Multi-hop: "How does this session's request relate to last week's?"
    # - Temporal: "What changed between session 3 and session 7?"
    # - Entity: "What projects does this user work on?"
    # generate_eval_pairs is a helper you supply (human labeling or an LLM prompt).
    pairs = generate_eval_pairs(memories, categories=["single", "multi", "temporal", "entity"])
    eval_pairs.extend(pairs)

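# Step 3: Run evaluation and track over time
# evaluate is likewise a helper you supply: it should run each question through
# retrieval, generation, and judging.
results = evaluate(client, eval_pairs, judge="claude-sonnet-4-20250514")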
print(f"Domain-specific score: {results.overall:.1%}")
print(f"By category: {results.by_category}")

What to Test Beyond Accuracy

A production memory system isn't just about getting the right answer — it's about getting it reliably under real conditions:

Latency under load — P50 and P99 search latency with 100 concurrent queries
Accuracy at scale — does Cat 3 score drop when the agent has 100K memories vs. 1K?
Cold start behavior — performance after server restart (HNSW index reload time)
Ingestion throughput — memories stored per second during active conversation
Cross-namespace isolation — ensure Agent A's memories never leak into Agent B's results
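
For the latency item, a minimal load-test harness might look like the sketch below. It is client-agnostic: `fake_search` is a stand-in that just sleeps, and you would replace it with a call to your own search method. The concurrency level and query count are arbitrary.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(search_fn, query):
    """Return the wall-clock latency of one search call in milliseconds."""
    start = time.perf_counter()
    search_fn(query)
    return (time.perf_counter() - start) * 1000.0


def latency_percentiles(search_fn, queries, concurrency=100):
    """Run queries from a thread pool and report P50 and P99 latency in ms."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(lambda q: timed_call(search_fn, q), queries))
    cuts = statistics.quantiles(latencies, n=100)  # 99 cut points
    return {"p50_ms": cuts[49], "p99_ms": cuts[98]}


def fake_search(query):
    """Stand-in for a real search call; replace with your client's search method."""
    time.sleep(0.001)
    return []


print(latency_percentiles(fake_search, ["example query"] * 500))
```

Threads are enough here because the work is I/O-bound. Measuring from the caller's side also includes network and serialization time, which is usually what matters in production.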

Competitive Landscape (2026)

The agent memory space is maturing rapidly. Here's how the major systems perform on LoCoMo as of May 2026:

System	Overall	Cat 1	Cat 2	Cat 3	Architecture
Dakera v0.11.55	88.2%	~96%	~90%	70.7%	Hybrid + ML classifier
Mem0	92.5%*	—	—	—	LLM-assisted reranking, OpenAI embeddings
Mnemis	93.9%	—	—	—	Research system
0GMem	88.7%	—	—	—	Graph-based

* Mem0's 92.5% score is from their token-efficient memory algorithm research (June 2026), which uses LLM-assisted reranking per query. Mnemis is a research system, not production software. Dakera runs entirely on-device with ONNX inference, making no API calls during search. The tradeoff is a lower absolute score in exchange for predictable latency, full data privacy, and no per-query API cost.

For more details on how we run and interpret LoCoMo results, see our benchmark methodology post and the live results on our benchmark page.
