
AI Agent Memory Benchmarks 2026: LoCoMo, MTOB, and Beyond

A technical breakdown of how to evaluate AI memory systems: what benchmarks measure, where they fall short, and what scores actually mean.

Why Benchmarking Memory Is Hard

Benchmarking AI memory systems is fundamentally different from benchmarking language models or retrieval systems in isolation. Memory involves a pipeline: ingestion, storage, retrieval, temporal reasoning, and finally the quality of the answer produced using retrieved context. A weakness at any stage cascades into a wrong answer, but the benchmark only observes the final output.

Additionally, memory benchmarks must test capabilities that don't exist in standard retrieval evaluations:

  • Temporal reasoning — can the system answer "when did X change?" or "what was the most recent Y?"
  • Multi-hop inference — can it combine facts across multiple memories?
  • Contradiction handling — when older and newer information conflict, does it prefer the update?
  • Long-term retention — does performance degrade as memory volume grows?

The LoCoMo Benchmark

LoCoMo (Long Conversation Memory) is the most widely used benchmark for evaluating AI agent memory. It was introduced in a 2024 research paper and has become the de facto standard for comparing memory systems.

Structure

LoCoMo consists of 1,540 questions derived from long conversations (10,000+ tokens each). Questions are categorized into three types:

Category | Type       | What It Tests               | Example
Cat 1    | Single-hop | Direct fact retrieval       | "What is Alice's favorite color?"
Cat 2    | Multi-hop  | Combining multiple memories | "What do Alice and Bob have in common?"
Cat 3    | Temporal   | Time-aware reasoning        | "When did Alice change her mind about the project?"

Evaluation Method

Each question has a ground-truth answer. The memory system ingests the conversation, then answers each question. Answers are evaluated using an LLM-as-judge approach that checks for semantic correctness — not exact string matching. This allows for different phrasings of correct answers.
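
To make this concrete, here is a minimal sketch of what a semantic-correctness judge can look like. The prompt wording and the `judge_llm` client are illustrative assumptions, not the official LoCoMo harness:

# Minimal LLM-as-judge sketch (illustrative; the official LoCoMo harness may differ).
# `judge_llm` is assumed to be any text-completion client with a .complete() method.
def judge_answer(judge_llm, question: str, predicted: str, ground_truth: str) -> float:
    prompt = (
        "You are grading an answer for semantic correctness.\n"
        f"Question: {question}\n"
        f"Ground-truth answer: {ground_truth}\n"
        f"Predicted answer: {predicted}\n"
        "Reply with exactly one word: CORRECT or INCORRECT."
    )
    verdict = judge_llm.complete(prompt).strip().upper()
    return 1.0 if verdict.startswith("CORRECT") else 0.0

The binary CORRECT/INCORRECT verdict keeps aggregation simple; some harnesses instead ask for a graded score, which changes how category averages should be interpreted.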

Why LoCoMo Matters

Before LoCoMo, memory systems were evaluated on retrieval metrics alone (recall@k, NDCG). But high retrieval recall doesn't guarantee correct answers — you might retrieve the right passages but still fail at temporal inference or multi-hop reasoning. LoCoMo evaluates the full pipeline end-to-end.

Dakera's LoCoMo Performance

Dakera scores 87.6% overall on the full 1,540-question LoCoMo benchmark. Here's the breakdown by category:

Category           | Questions | Score | Key Capability
Cat 1 (Single-hop) | ~700      | ~96%  | Hybrid retrieval (HNSW + BM25)
Cat 2 (Multi-hop)  | ~500      | ~90%  | Cross-encoder reranking + context fusion
Cat 3 (Temporal)   | ~340      | 70.7% | Temporal inference + ML classification

Category 3 remains the most challenging — temporal reasoning requires understanding not just what was said, but when it was said and how recent information supersedes older statements. Dakera's ML query classifier identifies temporal queries at routing time and applies specialized retrieval strategies (recency-weighted scoring, temporal filtering) for these cases.
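
As a rough illustration of what recency-weighted scoring means in practice, the sketch below blends a similarity score with an exponential recency decay. The half-life and blend weight are hypothetical values for illustration, not Dakera's internal parameters:

import math
from datetime import datetime, timezone

# Hypothetical recency-weighted rescoring for temporal queries.
# half_life_days and recency_weight are illustrative, not Dakera's internals.
# created_at must be a timezone-aware datetime.
def recency_weighted_score(similarity: float, created_at: datetime,
                           half_life_days: float = 30.0,
                           recency_weight: float = 0.3) -> float:
    age_days = (datetime.now(timezone.utc) - created_at).total_seconds() / 86400
    recency = math.exp(-math.log(2) * age_days / half_life_days)  # 1.0 now, 0.5 after one half-life
    return (1 - recency_weight) * similarity + recency_weight * recency

A scheme like this boosts recent memories for "most recent Y" questions without discarding older ones, which still matter for "when did X change?" questions.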

What LoCoMo Doesn't Test

No single benchmark captures every aspect of memory quality. LoCoMo has known limitations:

  • Static conversations — the test conversations are synthetic and pre-written, not generated from real agent interactions
  • No knowledge graph evaluation — LoCoMo doesn't test entity extraction or graph traversal capabilities
  • No scale testing — conversations are ~10K tokens. Real agents may accumulate millions of tokens over months
  • Single-user only — doesn't test multi-tenant isolation or cross-agent memory sharing
  • English only — no multilingual evaluation

The MTOB Benchmark

MTOB (Machine Theory of Belief) evaluates whether a memory system can track changing beliefs and states over time. While LoCoMo tests factual recall, MTOB tests whether the system understands that facts can change and that the most recent statement supersedes older ones.

Structure

MTOB presents conversations where entities change state multiple times:

"Alice's favorite programming language is Python." (turn 5)
"Alice switched to Rust last month." (turn 23)
"Actually, Alice went back to Python after trying Rust." (turn 47)

The system must correctly answer "What is Alice's current favorite language?" by understanding the temporal sequence of updates.
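
One simple way to model this is to keep every statement but resolve queries about an entity's current state by taking the latest update in turn order. The sketch below is a minimal illustration; the record fields are hypothetical, not a specific benchmark format:

# Minimal state-resolution sketch: the latest statement about an attribute wins.
# Field names (entity, attribute, value, turn) are illustrative.
def current_value(memories: list[dict], entity: str, attribute: str):
    relevant = [m for m in memories
                if m["entity"] == entity and m["attribute"] == attribute]
    if not relevant:
        return None
    return max(relevant, key=lambda m: m["turn"])["value"]

history = [
    {"entity": "Alice", "attribute": "favorite_language", "value": "Python", "turn": 5},
    {"entity": "Alice", "attribute": "favorite_language", "value": "Rust", "turn": 23},
    {"entity": "Alice", "attribute": "favorite_language", "value": "Python", "turn": 47},
]
print(current_value(history, "Alice", "favorite_language"))  # -> "Python"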

Why MTOB Complements LoCoMo

A system could score well on LoCoMo's temporal category by simply preferring recent results (high recency weight). MTOB specifically tests whether the system understands state transitions — not just recency, but the logical relationship between contradicting statements.

Evaluation Beyond Benchmarks

Retrieval Quality Metrics

Before looking at the end-to-end benchmark score, it helps to understand your retrieval pipeline's raw performance:

from dakera import Dakera

client = Dakera(base_url="http://localhost:3300")

# Evaluate retrieval quality with Dakera's built-in analytics
stats = client.analytics.retrieval_quality(
    namespace="test-ns",
    evaluation_set="locomo-eval"  # pre-loaded evaluation queries
)

# Returns:
# {
#   "recall_at_5": 0.94,
#   "recall_at_10": 0.97,
#   "mrr": 0.91,
#   "avg_latency_ms": 12,
#   "p99_latency_ms": 23
# }

Operational Metrics That Matter

Beyond accuracy, production memory systems need to perform under real-world constraints:

Metric             | What It Tells You            | Dakera (Typical)
Search latency p50 | Typical query speed          | 8 ms
Search latency p99 | Worst-case query speed       | 23 ms
Write throughput   | Memories ingested per second | 1,200/s
Memory overhead    | RAM per 100K memories        | ~400 MB
Index build time   | Time to build HNSW from cold | ~45 s for 100K vectors
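
You can measure the latency side of this table yourself by timing repeated searches and computing percentiles. A minimal sketch, assuming a Dakera client and a placeholder query set of your own:

import time
import statistics

# Rough latency measurement sketch; namespace and queries are placeholders.
def measure_search_latency(client, namespace: str, queries: list[str], limit: int = 10):
    latencies_ms = []
    for q in queries:
        start = time.perf_counter()
        client.memory.search(namespace=namespace, query=q, limit=limit)
        latencies_ms.append((time.perf_counter() - start) * 1000)
    latencies_ms.sort()
    return {
        "p50_ms": statistics.median(latencies_ms),
        "p99_ms": latencies_ms[int(0.99 * (len(latencies_ms) - 1))],
    }

For stable p99 numbers, run at least a few thousand queries and warm the index first; cold-start outliers can dominate the tail.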

How to Run LoCoMo Against Your Memory System

The LoCoMo benchmark is open-source and can be run against any memory system with a search API. Here's how to evaluate Dakera:

from dakera import Dakera
from locomo_eval import LoCoMoEvaluator, load_conversations, load_questions

client = Dakera(base_url="http://localhost:3300")

# Step 1: Ingest all LoCoMo conversations
conversations = load_conversations("locomo_v1/conversations.json")
for conv in conversations:
    for turn in conv["turns"]:
        client.memory.add(
            namespace=f"locomo-{conv['id']}",
            content=turn["content"],
            metadata={"turn_number": turn["number"], "speaker": turn["speaker"]}
        )

# Step 2: Run questions against the memory
questions = load_questions("locomo_v1/questions.json")
evaluator = LoCoMoEvaluator(judge_model="claude-sonnet-4-20250514")

results = []
for q in questions:
    # Retrieve relevant memories
    memories = client.memory.search(
        namespace=f"locomo-{q['conversation_id']}",
        query=q["question"],
        limit=10
    )

    # Generate answer using retrieved context
    # (`llm` is any text-completion client you supply; it is not part of the Dakera SDK)
    context = "\n".join([m.content for m in memories])
    answer = llm.complete(f"Context:\n{context}\n\nQuestion: {q['question']}\nAnswer:")

    # Evaluate
    score = evaluator.judge(
        question=q["question"],
        predicted=answer,
        ground_truth=q["answer"]
    )
    results.append({"category": q["category"], "score": score})

# Step 3: Aggregate scores
for cat in [1, 2, 3]:
    cat_results = [r for r in results if r["category"] == cat]
    avg = sum(r["score"] for r in cat_results) / len(cat_results)
    print(f"Category {cat}: {avg:.1%}")

overall = sum(r["score"] for r in results) / len(results)
print(f"Overall: {overall:.1%}")

Common Pitfalls in Memory Benchmarking

1. Testing on Subsets

Running only 100 questions instead of the full 1,540 produces unstable results. Category 3 in particular has high variance on small samples. Always run the full set for publishable numbers.
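
If you must work with a subset, at least quantify the uncertainty. A simple bootstrap over per-question scores makes the variance visible; this sketch is illustrative and not part of the official harness:

import random
import statistics

# Bootstrap a confidence interval over per-question scores (e.g. 0/1 judgments).
# Illustrative only; not part of the LoCoMo harness.
def bootstrap_ci(scores: list[float], n_resamples: int = 2000, alpha: float = 0.05):
    means = [statistics.mean(random.choices(scores, k=len(scores)))
             for _ in range(n_resamples)]
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples)]
    return lo, hi

On a 100-question subset the resulting interval is typically wide enough to make single-point comparisons between systems meaningless.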

2. Optimizing for the Benchmark

It's tempting to tune hyperparameters specifically for LoCoMo's question distribution. This produces misleading scores that don't transfer to real workloads. Optimize for your actual use case, then report benchmark scores as a sanity check.

3. Ignoring the Judge Model

LLM-as-judge evaluation introduces variance. Different judge models (GPT-4, Claude, Gemini) may score the same answer differently. Always report which judge model you used, and be aware that comparing scores across different judge models is not meaningful.

4. Not Controlling for the Generation Model

The final answer quality depends on both the memory system (retrieval) and the generation model (answer production). A better generation model can compensate for weaker retrieval. Control this by using the same generation model across all systems you're comparing.

The Future of Memory Benchmarks

The field is moving toward more comprehensive evaluation suites that test:

  • Scale — how performance changes from 1K to 1M memories
  • Multi-session — retrieval across separate conversations over weeks/months
  • Adversarial — handling conflicting information, hallucination detection
  • Knowledge graph — entity extraction accuracy, relationship inference
  • Efficiency — quality-per-compute, scoring both accuracy and resource usage

Until these benchmarks mature, LoCoMo remains the best single number for comparing memory systems. But treat it as one signal among many — not the complete picture.

For more details on how we run and interpret LoCoMo results, see our benchmark methodology post and the live results on our benchmark page.

Try Dakera Today

Single binary, zero dependencies, 87.6% on the LoCoMo benchmark.

Get Started