
How We Benchmark Memory: Dakera on LoCoMo

A complete walkthrough of Dakera's 87.6% LoCoMo score — the benchmark, the four question categories, how we measure each release, and how to run the same evaluation against your own Dakera instance.


When we claim Dakera scores 87.6% on LoCoMo, that number deserves an explanation. What is LoCoMo? What does that score actually mean for an AI agent remembering real conversations? And how can you verify the number yourself?

This post covers all of it. No hand-waving — just the methodology, the data, and the tools to reproduce it.

What is LoCoMo?

LoCoMo (Long-term Conversational Memory) is a benchmark for evaluating how well an AI memory system retains and recalls information from long conversational histories. It was introduced in the paper "Evaluating Very Long-Term Conversational Memory of LLM Agents" (Maharana et al., 2024) and has become the de facto standard evaluation for dedicated AI memory engines.

The benchmark works by providing a memory system with a long conversational history — months of simulated interactions between two people — then asking questions that require recalling specific facts from that history. The questions are designed to reflect the kinds of recall failures that break real agents in production:

| Category | What it tests | Score (v0.11.54) | Questions |
|---|---|---|---|
| Cat 1: Single-hop recall | Direct facts from recent or distant sessions. Does the memory system store and retrieve faithfully? | 86.9% | 282 |
| Cat 2: Multi-hop reasoning | Inference across two or more stored memories. Does recall compound correctly across sessions? | 85.4% | 321 (improved by entity graph traversal) |
| Cat 3: Temporal inference | Sequences, durations, and "what happened before/after". Temporal reasoning across stored memories. | 73.9% | 92 (hardest category, actively improving) |
| Cat 4: Open-domain recall | Free-form questions spanning multiple topics and entities. Breadth and versatility of the retrieval system. | 91.0% | 841 (best category) |

Across 1,540 questions spanning all four categories, Dakera v0.11.54 scores 87.6% overall. This is evaluated on a standardized dataset with an LLM judge scoring recall accuracy.

Why 1,540 questions? The full LoCoMo dataset covers 50 conversations, and we evaluate across all of them (1,540 total questions) rather than a sampled subset. Smaller evaluations (say, 100 questions) produce misleading variance: a two-percentage-point difference on a 100-question eval is statistically meaningless, while on 1,540 questions it represents real signal.
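To make that concrete, treat each question as an independent trial (an assumption; LoCoMo questions cluster by conversation, so the true variance is somewhat higher). The standard error of an accuracy estimate $\hat{p}$ over $n$ questions is

$$\mathrm{SE}(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

At $\hat{p} = 0.876$, that works out to about 3.3 percentage points for $n = 100$ but only about 0.8 points for $n = 1540$: a two-point swing sits inside the noise band on the small sample and well outside it on the full set.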

Dakera's score: the full picture

  - Overall score: 87.6% (full dataset, no LLM post-processing)
  - Questions evaluated: 1,540 (full dataset, 50 conversations)
  - Evaluated: May 2026, on Dakera v0.11.54 (scores tracked per release)

We publish our evaluation methodology in full so you can reproduce, audit, or challenge any of these numbers. Scores change with releases — track them in our GitHub releases.

What drives the score

Temporal inference at 73.9% is our hardest category and the most active area of development. Questions like "how long did they discuss the apartment renovation?" or "what did they decide after the trip to Japan?" require the memory system to reason about time sequences — not just retrieve stored facts.

Three architectural choices drive Dakera's overall score:

1. Hybrid retrieval

Vector similarity alone fails for queries with specific terms ("that API endpoint with the dry_run flag"). BM25 alone fails for semantic queries ("what was that issue with the server last month?"). Dakera runs both in parallel and combines them with a configurable scoring function:

score = α × vector_similarity + β × bm25_score + γ × importance

Defaults are α = 0.5, β = 0.3, γ = 0.2, tuned against the LoCoMo training set; the weights are configurable per query.
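A minimal sketch of that combination in Rust (the struct and function names here are illustrative, not Dakera's actual API), assuming each component score is already normalized to [0, 1]:

/// Weights for the hybrid scoring function. Names are assumptions
/// for this sketch, not Dakera's API.
struct HybridWeights {
    alpha: f64, // weight on vector similarity
    beta: f64,  // weight on BM25 lexical score
    gamma: f64, // weight on stored importance
}

impl Default for HybridWeights {
    fn default() -> Self {
        // Defaults quoted above: α = 0.5, β = 0.3, γ = 0.2
        Self { alpha: 0.5, beta: 0.3, gamma: 0.2 }
    }
}

/// Linear combination of the three signals; assumes each input
/// is already normalized to [0, 1].
fn hybrid_score(vector_similarity: f64, bm25_score: f64, importance: f64, w: &HybridWeights) -> f64 {
    w.alpha * vector_similarity + w.beta * bm25_score + w.gamma * importance
}

fn main() {
    let w = HybridWeights::default();
    // A memory that is semantically close, lexically weak, fairly important:
    let s = hybrid_score(0.82, 0.30, 0.70, &w);
    println!("hybrid score = {s:.3}"); // 0.5·0.82 + 0.3·0.30 + 0.2·0.70 = 0.640
}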

2. Temporal re-ranking

After hybrid retrieval returns candidates, Dakera applies a post-reranker that scales scores multiplicatively based on memory age and recency signals in the query. A question about "the recent camping trip" down-weights memories from years ago, even if they're semantically similar to the query. The temporal re-ranker (CE-86) added +2.2pp on Cat3; session-date consensus year injection (CE-109) added another +1.0pp; and stronger temporal proximity scoring (CE-115) added a further +2.2pp. Together these brought Cat3 from roughly 67% to the current 73.9%.
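The mechanics look roughly like the following sketch. The exponential decay shape and 90-day half-life are invented for illustration, not Dakera's internals:

/// Sketch of a multiplicative temporal re-ranker. Decay shape and
/// half-life are illustrative assumptions, not Dakera's tuning.
fn temporal_rerank(base_score: f64, age_days: f64, query_is_recency_biased: bool) -> f64 {
    if !query_is_recency_biased {
        return base_score; // no recency signal in the query: leave ranking alone
    }
    // Exponential decay with an assumed 90-day half-life: a 90-day-old
    // memory keeps half its score, a 180-day-old one a quarter.
    let half_life_days = 90.0;
    base_score * (0.5f64).powf(age_days / half_life_days)
}

fn main() {
    // "the recent camping trip": a two-year-old memory is heavily
    // down-weighted even if it is semantically similar to the query.
    println!("{:.4}", temporal_rerank(0.80, 730.0, true)); // ≈ 0.0029
    println!("{:.4}", temporal_rerank(0.80, 14.0, true));  // ≈ 0.7182
}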

3. Importance-weighted recall

Not all memories are equal. A memory stored with importance=0.95 (a critical configuration, a key decision) outranks a 0.4-importance observation even when both match the query semantically. The importance score compounds with decay — memories that prove useful over time stay sharp; memories that are never recalled fade away.
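A toy model of that compounding (the reinforcement and decay rates here are invented for illustration; Dakera's actual update rule may differ):

/// Toy model of importance-weighted recall with decay. Rates and
/// update rule are illustrative assumptions.
struct Memory {
    importance: f64, // 0.0..=1.0, set at store time
}

impl Memory {
    /// Each recall reinforces the memory, drifting importance toward 1.0.
    fn on_recall(&mut self) {
        self.importance += 0.05 * (1.0 - self.importance);
    }

    /// Each idle day decays it; never-recalled memories fade away.
    fn on_decay_tick(&mut self) {
        self.importance *= 0.98;
    }
}

fn main() {
    let mut key_decision = Memory { importance: 0.95 };
    let mut observation = Memory { importance: 0.40 };

    // Simulate 30 days in which only the key decision is recalled (weekly).
    for day in 1..=30 {
        key_decision.on_decay_tick();
        observation.on_decay_tick();
        if day % 7 == 0 {
            key_decision.on_recall();
        }
    }
    println!("key decision: {:.2}", key_decision.importance); // ≈ 0.57, stays sharp
    println!("observation:  {:.2}", observation.importance);  // ≈ 0.22, fading
}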

Together these three systems explain why Dakera outperforms pure vector-database approaches. Vector DBs handle single-hop recall well but struggle with temporal inference and importance weighting. Dakera's retrieval pipeline is purpose-built for agent memory workloads, not generic similarity search.

Our evaluation setup

Every Dakera release that touches retrieval runs the full 1,540-question LoCoMo evaluation against a live server instance. The pipeline is:

  1. Spin up a fresh Dakera instance with default configuration
  2. Ingest all 50 LoCoMo conversations into the memory store
  3. Run 1,540 questions against the memory API
  4. Score answers with an LLM judge (GPT-4o)
  5. Report overall accuracy and per-category breakdown

We intentionally use default configuration — no per-benchmark tuning. The scores you see are what you'd get if you dropped the Dakera binary on a server and ran the benchmark yourself.

Run it yourself

The benchmark tool is open-source at github.com/Dakera-AI/dakera-bench. To run the full evaluation against any Dakera instance:

# Clone the benchmark tool
git clone https://github.com/Dakera-AI/dakera-bench
cd dakera-bench

# Set your Dakera server URL and API key
export DAKERA_URL=http://localhost:3300
export DAKERA_API_KEY=your-api-key
export OPENAI_API_KEY=sk-...  # for LLM judging

# Run the full 1,540-question evaluation
cargo run --release -- bench --questions 1540 --output results.json

# Output: overall score + per-category breakdown
# → overall: 87.6%, cat_temporal: 73.9%, ...

The benchmark takes 20–40 minutes depending on your hardware and API rate limits. Results are written to JSON with per-question detail so you can audit individual failures.
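For example, you could count the failures with jq. The `questions` array and `correct` field are assumptions here; inspect results.json for the actual schema:

# Count incorrectly answered questions (field names assumed)
jq '[.questions[] | select(.correct == false)] | length' results.json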

Temporal inference: the hard problem

A 73.9% score on temporal inference deserves more explanation, because it's the category most relevant to production agent workloads.

Temporal questions ask things like: "Who was mentioned more recently — Alex or Sam?" or "How many months passed between the camping trip and the house purchase?" These require the memory system to reason about time sequences across dozens of stored memories — not just retrieve one fact, but synthesize a temporal narrative from multiple fragments.

This is genuinely hard. The LLM judge (GPT-4o) evaluates whether the agent's answer is factually correct about the timeline, not just semantically related. A response that correctly retrieves "camping trip: July 2023" and "house purchase: November 2023" but calculates "five months" instead of the correct "four months" scores zero for that question.

Our temporal re-ranker (CE-86 + CE-88), session-date consensus year injection (CE-109), and stronger temporal proximity scoring (CE-115, which raised `INFERENCE_TEMPORAL_MULT_BETA` from 0.5 to 0.65) have brought this category from roughly 67% to 73.9%, by multiplicatively boosting temporally relevant memories and anchoring year inference to session metadata. Temporal inference remains our most actively developed category.

Why publish the hard category? Because developers evaluating memory engines deserve real numbers. A vendor that only reports their best categories isn't giving you useful information. We report all four categories in every release. Temporal inference at 73.9% is where we are today — and it's better than where we were last quarter.

What 87.6% means in practice

Benchmark scores are proxies. What matters for production is whether your agents recall the right context at the right time. A few things the score tells you — and doesn't tell you:

| Question | What the score tells you |
|---|---|
| Will my agent remember facts accurately? | Yes. 87.6% on a standardized long-context eval is strong coverage of factual recall. |
| Will it handle temporal reasoning? | Partially. 73.9% on temporal questions; complex timeline reasoning is harder. |
| Will it work for your specific domain? | Unknown. LoCoMo uses general conversational data; run domain-specific evals if precision matters. |
| Is 87.6% better than rolling your own? | Yes. Pure vector-DB approaches typically score 60–70% on LoCoMo; hybrid retrieval matters. |

The benchmark is a starting point, not an end state. We publish it with every release precisely so you can watch the trajectory — not just trust a static number.

For the full category breakdown and evaluation details, see the dedicated benchmark page.

Run Dakera against your own workload

Early access is open. Drop a single binary, point your agents at it, and measure recall on your own data.