AI Agent Memory Benchmarks 2026: LoCoMo, MTOB, and Beyond

A technical breakdown of how to evaluate AI memory systems: what benchmarks measure, where they fall short, and what scores actually mean.

Why Benchmarking Memory is Hard

Benchmarking AI memory systems is fundamentally different from benchmarking language models or retrieval systems in isolation. Memory involves a pipeline: ingestion, storage, retrieval, temporal reasoning, and finally the quality of the answer produced using retrieved context. A weakness at any stage cascades into a wrong answer, but the benchmark only observes the final output.

Figure: LoCoMo benchmark comparison showing Dakera at 88.2% vs. Mem0 at 91.6% and Zep at 74.6%.

Additionally, memory benchmarks must test capabilities that don't exist in standard retrieval evaluations:

The LoCoMo Benchmark

LoCoMo (Long Conversation Memory) is the most widely used benchmark for evaluating AI agent memory. It was introduced in a 2024 research paper and has become the de facto standard for comparing memory systems.

Structure

LoCoMo consists of 1,540 questions derived from long conversations (10,000+ tokens each). Questions are categorized into three types:

| Category | Type | What It Tests | Example |
| --- | --- | --- | --- |
| Cat 1 | Single-hop | Direct fact retrieval | "What is Alice's favorite color?" |
| Cat 2 | Multi-hop | Combining multiple memories | "What do Alice and Bob have in common?" |
| Cat 3 | Temporal | Time-aware reasoning | "When did Alice change her mind about the project?" |

Evaluation Method

Each question has a ground-truth answer. The memory system ingests the conversation, then answers each question. Answers are evaluated using an LLM-as-judge approach that checks for semantic correctness — not exact string matching. This allows for different phrasings of correct answers.
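To make the judging step concrete, here is a minimal sketch of an LLM-as-judge scorer. The prompt wording, the `judge_answer` function, and the `call_judge` callable are all assumptions for illustration; this is not LoCoMo's official judge prompt. A stand-in `fake_judge` replaces the real model call so the sketch runs offline.

```python
from typing import Callable

# Hypothetical grading prompt; real harnesses use their own wording.
JUDGE_PROMPT = """You are grading an answer to a question about a conversation.
Question: {question}
Reference answer: {ground_truth}
Candidate answer: {predicted}
Reply with exactly one word: CORRECT if the candidate conveys the same
meaning as the reference, otherwise WRONG."""


def judge_answer(
    question: str,
    predicted: str,
    ground_truth: str,
    call_judge: Callable[[str], str],
) -> float:
    """Return 1.0 if the judge model accepts the answer, else 0.0."""
    prompt = JUDGE_PROMPT.format(
        question=question, ground_truth=ground_truth, predicted=predicted
    )
    verdict = call_judge(prompt).strip().upper()
    # Anything other than an explicit CORRECT counts as wrong, so a
    # malformed judge reply can never inflate the score.
    return 1.0 if verdict.startswith("CORRECT") else 0.0


def fake_judge(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return "CORRECT" if "Candidate answer: She likes blue" in prompt else "WRONG"


score = judge_answer(
    question="What is Alice's favorite color?",
    predicted="She likes blue best.",
    ground_truth="Blue",
    call_judge=fake_judge,
)
print(score)
```

In a real run, `call_judge` would wrap whichever judge model you have fixed for the evaluation.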

Why LoCoMo Matters

Before LoCoMo, memory systems were evaluated on retrieval metrics alone (recall@k, NDCG). But high retrieval recall doesn't guarantee correct answers — you might retrieve the right passages but still fail at temporal inference or multi-hop reasoning. LoCoMo evaluates the full pipeline end-to-end.
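For reference, the retrieval-only metrics mentioned above can be computed in a few lines. This is a sketch using binary relevance labels; the memory IDs are made up for illustration. Note that recall@5 is perfect in the example even though the relevant memories are not ranked first, which is exactly the kind of gap an end-to-end benchmark exposes.

```python
import math
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant items that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def ndcg_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG: rewards placing relevant items near the top."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(retrieved[:k])
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical ranking returned for one question.
retrieved = ["m7", "m2", "m9", "m4", "m1"]
relevant = {"m2", "m4"}

print(f"recall@5 = {recall_at_k(retrieved, relevant, 5):.2f}")
print(f"ndcg@5   = {ndcg_at_k(retrieved, relevant, 5):.2f}")
```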

Dakera's LoCoMo Performance

Dakera scores 88.2% overall on the full 1,540-question LoCoMo benchmark. Here's the breakdown by category:

| Category | Questions | Score | Key Capability |
| --- | --- | --- | --- |
| Cat 1 (Single-hop) | ~700 | ~96% | Hybrid retrieval (HNSW + BM25) |
| Cat 2 (Multi-hop) | ~500 | ~90% | Cross-encoder reranking + context fusion |
| Cat 3 (Temporal) | ~340 | 70.7% | Temporal inference + ML classification |

Category 3 remains the most challenging — temporal reasoning requires understanding not just what was said, but when it was said and how recent information supersedes older statements. Dakera's ML query classifier identifies temporal queries at routing time and applies specialized retrieval strategies (recency-weighted scoring, temporal filtering) for these cases.
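As an illustration of what recency-weighted scoring can look like, here is a generic sketch that blends similarity with an exponential decay over conversation turns. It is not Dakera's implementation: the `Candidate` type, the half-life, and the 0.3 blend weight are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Candidate:
    text: str
    similarity: float  # retrieval similarity in [0, 1]
    turn: int          # position of the memory in the conversation


def recency_weighted(
    candidates: Sequence[Candidate],
    latest_turn: int,
    half_life_turns: float = 20.0,  # assumed value, tune per workload
    recency_weight: float = 0.3,    # assumed value, tune per workload
) -> List[Candidate]:
    """Rank candidates by a blend of similarity and exponential recency decay."""

    def score(c: Candidate) -> float:
        age = max(latest_turn - c.turn, 0)
        recency = 0.5 ** (age / half_life_turns)  # 1.0 for the newest turn
        return (1.0 - recency_weight) * c.similarity + recency_weight * recency

    return sorted(candidates, key=score, reverse=True)


# Hypothetical memories about the same topic, stated at different turns.
candidates = [
    Candidate("The launch is planned for March.", 0.82, 5),
    Candidate("The launch slipped to April.", 0.80, 23),
    Candidate("The launch is now confirmed for May.", 0.78, 47),
]
ranked = recency_weighted(candidates, latest_turn=47)
for c in ranked:
    print(c.turn, c.text)
```

With `recency_weight=0.0` the ranking falls back to pure similarity, which puts the oldest statement first in this example.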

What LoCoMo Doesn't Test

No single benchmark captures every aspect of memory quality. LoCoMo has known limitations:

The MTOB Benchmark

MTOB (Machine Theory of Belief) evaluates whether a memory system can track changing beliefs and states over time. While LoCoMo tests factual recall, MTOB tests whether the system understands that facts can change and that the most recent statement supersedes older ones.

Structure

MTOB presents conversations where entities change state multiple times:

"Alice's favorite programming language is Python." (turn 5)
"Alice switched to Rust last month." (turn 23)
"Actually, Alice went back to Python after trying Rust." (turn 47)

The system must correctly answer "What is Alice's current favorite language?" by understanding the temporal sequence of updates.
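One way to picture the bookkeeping this requires is a per-attribute timeline that keeps every update instead of overwriting the old value. The sketch below assumes the updates have already been extracted from the conversation into structured form, which is the hard part in practice; `StateUpdate` and `StateTimeline` are hypothetical names, not part of MTOB or any library.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class StateUpdate:
    turn: int
    entity: str
    attribute: str
    value: str


class StateTimeline:
    """Keeps every update per (entity, attribute) so transitions stay visible."""

    def __init__(self) -> None:
        self._history: Dict[Tuple[str, str], List[StateUpdate]] = {}

    def record(self, update: StateUpdate) -> None:
        updates = self._history.setdefault((update.entity, update.attribute), [])
        updates.append(update)
        updates.sort(key=lambda u: u.turn)  # tolerate out-of-order ingestion

    def current(self, entity: str, attribute: str) -> Optional[str]:
        """Value from the latest turn, or None if nothing was recorded."""
        updates = self._history.get((entity, attribute))
        return updates[-1].value if updates else None

    def transitions(self, entity: str, attribute: str) -> List[Tuple[str, str]]:
        """Ordered (old, new) pairs for every change of value."""
        updates = self._history.get((entity, attribute), [])
        return [
            (a.value, b.value)
            for a, b in zip(updates, updates[1:])
            if a.value != b.value
        ]


timeline = StateTimeline()
# Recorded out of order on purpose; turn numbers decide the sequence.
timeline.record(StateUpdate(47, "Alice", "favorite_language", "Python"))
timeline.record(StateUpdate(5, "Alice", "favorite_language", "Python"))
timeline.record(StateUpdate(23, "Alice", "favorite_language", "Rust"))

print(timeline.current("Alice", "favorite_language"))
print(timeline.transitions("Alice", "favorite_language"))
```

Keeping the full history lets the system answer both "what is the current value?" and "what did it change from?", which a single overwritten field cannot do.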

Why MTOB Complements LoCoMo

A system could score well on LoCoMo's temporal category by simply preferring recent results (high recency weight). MTOB specifically tests whether the system understands state transitions — not just recency, but the logical relationship between contradicting statements.

Evaluation Beyond Benchmarks

Retrieval Quality Metrics

Before looking at the end-to-end benchmark score, measure your retrieval pipeline's raw performance:

# Evaluate retrieval quality with Dakera's built-in analytics
from dakera import DakeraClient

client = DakeraClient(base_url="http://localhost:3300")

stats = client.analytics.retrieval_quality(
    namespace="test-ns",
    evaluation_set="locomo-eval"  # pre-loaded evaluation queries
)

# Returns:
# {
#   "recall_at_5": 0.94,
#   "recall_at_10": 0.97,
#   "mrr": 0.91,
#   "avg_latency_ms": ...,
#   "p99_latency_ms": ...
# }

Operational Metrics That Matter

Beyond accuracy, production memory systems need to perform under real-world constraints:

| Metric | What It Tells You | Dakera Design |
| --- | --- | --- |
| Search latency | Query speed under load | Low-latency (no GC pauses) |
| Write throughput | Memories ingested per second | Concurrent, lock-free writes |
| Memory overhead | RAM per 100K memories | ~400 MB |
| Index build time | Time to build HNSW from cold | ~45s for 100K vectors |
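If you want to measure search latency yourself, a small timing harness is enough to get started. The sketch below is system-agnostic: `search` is any callable that takes a query string, and `fake_search` is a stand-in so the example runs without a server. It is single-threaded, so it measures per-query latency rather than behavior under concurrent load, and it needs at least two queries to compute percentiles.

```python
import statistics
import time
from typing import Callable, Dict, Sequence


def measure_latency(
    search: Callable[[str], object],
    queries: Sequence[str],
    warmup: int = 5,
) -> Dict[str, float]:
    """Time each query and report mean, p50, and p99 latency in milliseconds."""
    for query in queries[:warmup]:
        search(query)  # warm caches so cold-start cost does not skew results

    samples_ms = []
    for query in queries:
        start = time.perf_counter()
        search(query)
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    # quantiles(n=100) returns 99 cut points: index 49 is p50, index 98 is p99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "avg_ms": statistics.fmean(samples_ms),
        "p50_ms": cuts[49],
        "p99_ms": cuts[98],
    }


def fake_search(query: str) -> int:
    """Stand-in workload; replace with a call to your memory system."""
    return sum(range(1000))


queries = [f"query {i}" for i in range(200)]
stats = measure_latency(fake_search, queries)
print(stats)
```

Report p99 alongside the average: tail latency is usually what users notice.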

How to Run LoCoMo Against Your Memory System

The LoCoMo benchmark is open-source and can be run against any memory system with a search API. Here's how to evaluate Dakera:

from dakera import DakeraClient
from locomo_eval import LoCoMoEvaluator, load_conversations, load_questions

client = DakeraClient(base_url="http://localhost:3300")

# Step 1: Ingest all LoCoMo conversations
conversations = load_conversations("locomo_v1/conversations.json")
for conv in conversations:
    for turn in conv["turns"]:
        client.store_memory(
            agent_id=f"locomo-{conv['id']}",
            content=turn["content"],
            metadata={"turn_number": turn["number"], "speaker": turn["speaker"]}
        )

# Step 2: Run questions against the memory
questions = load_questions("locomo_v1/questions.json")
evaluator = LoCoMoEvaluator(judge_model="claude-sonnet-4-20250514")

results = []
for q in questions:
    # Retrieve relevant memories
    memories = client.search_memories(
        agent_id=f"locomo-{q['conversation_id']}",
        query=q["question"],
        top_k=10
    )

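    # Generate answer using retrieved context.
    # `llm` is a placeholder for whichever generation client you use; it is not
    # defined in this snippet. Keep it fixed across every system you compare.
    context = "\n".join(m.content for m in memories)
    answer = llm.complete(f"Context:\n{context}\n\nQuestion: {q['question']}\nAnswer:")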

    # Evaluate
    score = evaluator.judge(
        question=q["question"],
        predicted=answer,
        ground_truth=q["answer"]
    )
    results.append({"category": q["category"], "score": score})

# Step 3: Aggregate scores
for cat in [1, 2, 3]:
    cat_results = [r for r in results if r["category"] == cat]
    if not cat_results:
        continue  # no questions in this category; avoid dividing by zero
    avg = sum(r["score"] for r in cat_results) / len(cat_results)
    print(f"Category {cat}: {avg:.1%}")

overall = sum(r["score"] for r in results) / len(results)
print(f"Overall: {overall:.1%}")

Common Pitfalls in Memory Benchmarking

1. Testing on Subsets

Running only 100 questions instead of the full 1,540 produces unstable results. Category 3 in particular has high variance on small samples. Always run the full set for publishable numbers.
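A quick calculation shows why subsets are unstable. The sketch below uses the normal approximation for a binomial proportion, which assumes each question is scored 0 or 1 and that questions are independent; both are simplifications. The 0.882 accuracy is the overall score quoted earlier, and 340 is the approximate size of Category 3.

```python
import math


def ci_half_width(accuracy: float, n_questions: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation 95% interval for an accuracy."""
    std_err = math.sqrt(accuracy * (1.0 - accuracy) / n_questions)
    return z * std_err


for n in (100, 340, 1540):
    print(f"n={n:>4}: +/- {ci_half_width(0.882, n):.1%}")
```

Under these assumptions the uncertainty is roughly six points at 100 questions and under two points at 1,540, so a 100-question run cannot separate systems that are a few points apart.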

2. Optimizing for the Benchmark

It's tempting to tune hyperparameters specifically for LoCoMo's question distribution. This produces misleading scores that don't transfer to real workloads. Optimize for your actual use case, then report benchmark scores as a sanity check.

3. Ignoring the Judge Model

LLM-as-judge evaluation introduces variance. Different judge models (GPT-4, Claude, Gemini) may score the same answer differently. Always report which judge model you used, and be aware that comparing scores across different judge models is not meaningful.
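To see how much two judges actually disagree, you can score the same answers with both and compare the verdicts. The sketch below computes raw agreement and Cohen's kappa, which corrects for agreement expected by chance. The verdict lists are invented for illustration, not measured results.

```python
from typing import Sequence


def agreement_rate(judge_a: Sequence[int], judge_b: Sequence[int]) -> float:
    """Share of answers on which both judges gave the same 0/1 verdict."""
    return sum(a == b for a, b in zip(judge_a, judge_b)) / len(judge_a)


def cohens_kappa(judge_a: Sequence[int], judge_b: Sequence[int]) -> float:
    """Chance-corrected agreement between two binary judges."""
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("verdict lists must be non-empty and the same length")
    n = len(judge_a)
    observed = agreement_rate(judge_a, judge_b)
    p_a = sum(judge_a) / n  # how often judge A says "correct"
    p_b = sum(judge_b) / n  # how often judge B says "correct"
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        return 1.0  # both judges are constant and identical
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts from two judge models on the same ten answers.
judge_a = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
judge_b = [1, 1, 0, 0, 1, 0, 1, 1, 1, 1]

print(f"agreement = {agreement_rate(judge_a, judge_b):.2f}")
print(f"kappa     = {cohens_kappa(judge_a, judge_b):.2f}")
```

In this made-up example both judges award the same 70% accuracy while disagreeing on two of the ten answers, so matching headline scores do not imply matching judgments.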

4. Not Controlling for the Generation Model

The final answer quality depends on both the memory system (retrieval) and the generation model (answer production). A better generation model can compensate for weaker retrieval. Control this by using the same generation model across all systems you're comparing.

The Future of Memory Benchmarks

The field is moving toward more comprehensive evaluation suites that test:

Until these benchmarks mature, LoCoMo remains the best single number for comparing memory systems. But treat it as one signal among many — not the complete picture.

Benchmark Reproducibility

A benchmark score is only meaningful if others can reproduce it. Here's what you need to control for reproducible memory evaluation:

Environment Variables

| Variable | Impact | Recommendation |
| --- | --- | --- |
| Judge model | ±3% score variance | Fix to one model, report which |
| Temperature | ±1% on temporal questions | Set to 0 for deterministic judging |
| Retrieval limit (K) | Higher K = higher recall, slower | K=10 standard, report if different |
| Embedding model | ±5% depending on domain fit | Report exact model and version |
| Chunk strategy | Major impact on Cat 2/3 | Document the chunking approach used |
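A simple way to keep these variables honest is to write them into a manifest that ships with every reported score. The sketch below is one possible shape; `EvalRunConfig` and its field names are assumptions, and the bracketed strings are placeholders to fill in with your own setup.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalRunConfig:
    """Everything a reader needs to rerun the evaluation (hypothetical schema)."""
    judge_model: str
    judge_temperature: float
    generation_model: str
    retrieval_k: int
    embedding_model: str
    chunk_strategy: str
    question_count: int


config = EvalRunConfig(
    judge_model="claude-sonnet-4-20250514",
    judge_temperature=0.0,
    generation_model="<your generation model>",
    retrieval_k=10,
    embedding_model="<embedding model and version>",
    chunk_strategy="one memory per conversation turn",
    question_count=1540,
)

# Sorted keys keep the manifest stable across runs, so diffs stay readable.
manifest = json.dumps(asdict(config), indent=2, sort_keys=True)
print(manifest)
```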

Statistical Significance

A 1-2% difference between systems is likely noise, not signal. Category 3 (temporal) has particularly high variance due to the ambiguity of temporal questions. When reporting results:
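One way to check whether a gap between two systems exceeds the noise is a paired bootstrap over per-question scores. The sketch below assumes both systems were scored on the same questions in the same order; the scores are synthetic, generated only to exercise the function, and the function name is hypothetical.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_diff(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 2000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Mean accuracy difference (A - B) with a 95% bootstrap interval."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("both systems must be scored on the same questions")
    rng = random.Random(seed)
    n = len(scores_a)
    # Pair by question so shared question difficulty cancels out.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = sum(diffs) / n

    resampled = []
    for _ in range(n_resamples):
        sample = rng.choices(diffs, k=n)  # resample questions with replacement
        resampled.append(sum(sample) / n)
    resampled.sort()

    lower = resampled[int(0.025 * n_resamples)]
    upper = resampled[int(0.975 * n_resamples) - 1]
    return observed, lower, upper


# Synthetic per-question scores for two systems about one point apart.
gen = random.Random(42)
scores_a = [1.0 if gen.random() < 0.88 else 0.0 for _ in range(1540)]
scores_b = [1.0 if gen.random() < 0.87 else 0.0 for _ in range(1540)]

observed, lower, upper = paired_bootstrap_diff(scores_a, scores_b)
print(f"difference = {observed:+.1%}  (95% interval {lower:+.1%} to {upper:+.1%})")
```

If the interval includes zero, the gap is not distinguishable from noise at this sample size.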

Building Your Own Evaluation Suite

While LoCoMo is the standard public benchmark, you should also evaluate against your specific workload. Here's a framework for building domain-specific memory evaluations:

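# Create a custom evaluation set from your production conversations
# Note: generate_eval_pairs() and evaluate() are placeholders for helpers you
# write yourself; they are not defined in this snippet.
from dakera import DakeraClient

client = DakeraClient(base_url="http://localhost:3300")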

# Step 1: Sample real conversations from your agent
conversations = client.list_sessions(agent_id="production", limit=50)

# Step 2: Generate ground-truth Q&A pairs (human-labeled or LLM-generated)
eval_pairs = []
for conv in conversations:
    memories = client.session_memories(session_id=conv.id)
    # Generate questions that test the capabilities you care about:
    # - Single-hop: "What did the user request in this session?"
    # - Multi-hop: "How does this session's request relate to last week's?"
    # - Temporal: "What changed between session 3 and session 7?"
    # - Entity: "What projects does this user work on?"
    pairs = generate_eval_pairs(memories, categories=["single", "multi", "temporal", "entity"])
    eval_pairs.extend(pairs)

# Step 3: Run evaluation and track over time
results = evaluate(client, eval_pairs, judge="claude-sonnet-4-20250514")
print(f"Domain-specific score: {results.overall:.1%}")
print(f"By category: {results.by_category}")

What to Test Beyond Accuracy

A production memory system isn't just about getting the right answer — it's about getting it reliably under real conditions:

Competitive Landscape (2026)

The agent memory space is maturing rapidly. Here's how the major systems perform on LoCoMo as of May 2026:

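| System | Overall | Cat 1 | Cat 2 | Cat 3 | Architecture |
| --- | --- | --- | --- | --- | --- |
| Dakera v0.11.55 | 88.2% | ~96% | ~90% | 70.7% | Hybrid + ML classifier |
| Mem0 | 92.5%* | - | - | - | LLM-assisted reranking, OpenAI embeddings |
| Mnemis | 93.9% | - | - | - | Research system |
| 0GMem | 88.7% | - | - | - | Graph-based |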

* Mem0's 92.5% score is from their token-efficient memory algorithm research (June 2026), which uses LLM-assisted reranking per query. Mnemis is a research system, not production software. Dakera runs entirely on-device with ONNX inference — no API calls during search. The tradeoff is lower absolute scores but predictable latency, full data privacy, and zero operational cost per query.

For more details on how we run and interpret LoCoMo results, see our benchmark methodology post and the live results on our benchmark page.
