
How Agent Memory Actually Works: Hybrid Retrieval and Importance Decay

A technical look at why naive vector search fails for agent memory, how Dakera combines HNSW and BM25 for production-grade recall, and the decay engine that keeps relevance sharp over time.


When developers first add memory to an AI agent, they usually do the same thing: store text, embed it, retrieve by cosine similarity. It works in demos. It falls apart in production.

The failure modes are specific. An agent helping a developer has a memory from six weeks ago about a Kubernetes cluster — highly relevant context. But the query "what was that deployment issue?" returns a dozen other deployment-related memories scored higher by cosine similarity because they're more recent and the embeddings happen to cluster. The right memory is buried. The agent hallucinates because it couldn't find what it stored.

Building production memory means solving four problems that vector-only systems don't address: recall precision (finding the right memory, not just semantically similar ones), temporal relevance (recent context matters more), importance weighting (not all memories are equal), and cross-session continuity (agents shouldn't forget things across conversations just because a session ended).

Here's how Dakera solves them.

Problem 1: Vectors alone aren't enough

Embedding models map text into high-dimensional vector space. Semantic similarity queries — "what do I know about authentication?" — work well here. But agents issue many recall queries that aren't purely semantic. They ask "what was the exact error message last Tuesday?" or "what's the API endpoint for the billing service?" These queries need exact term matching, not semantic similarity.

Consider an agent that stored this memory two months ago:

POST /v1/invoice/generate?customer_id=cus_4f8a&dry_run=true

If the user later asks "what's the invoice generation endpoint?", a vector search might correctly find it. But if they ask "what was the exact URL for dry-run invoice generation?", the query is so specific that its embedding may land far from the stored text's embedding, while BM25 term matching ("dry_run=true", "invoice/generate") would find it instantly.
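To make the term-matching side concrete, here is a minimal Okapi BM25 scorer with a crude identifier-friendly tokenizer. This is an illustration of the standard IDF-weighted scoring technique, not Dakera's internal implementation:

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Crude tokenizer that keeps identifiers like dry_run and paths like /v1/invoice/generate
    return re.findall(r"[a-z0-9_/]+", text.lower())

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25: IDF-weighted exact-term matching with document-length normalization."""
    corpus = [tokenize(d) for d in docs]
    avgdl = sum(len(d) for d in corpus) / len(corpus)
    df = Counter(t for doc in corpus for t in set(doc))  # document frequency per term
    n = len(corpus)
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        s = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            s += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(s)
    return scores

docs = [
    "POST /v1/invoice/generate?customer_id=cus_4f8a&dry_run=true",
    "Deployment failed on prod-cluster-3 after the nginx upgrade",
]
print(bm25_scores("dry_run=true invoice generation endpoint", docs))
```

The exact tokens "dry_run" and "true" give the stored endpoint a positive score while the unrelated deployment memory scores zero, no matter how the embeddings happen to cluster.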

Dakera uses hybrid retrieval: both signals are computed independently and combined via a weighted scoring function.

score = α × vector_similarity + β × bm25_score + γ × importance

α, β, γ are configurable per-query; defaults tuned against the LoCoMo benchmark.

The BM25 index uses English stemming, stop-word filtering, and IDF-weighted term scoring — standard information retrieval techniques that handle vocabulary variations correctly ("generating" matches "generate", "invoices" matches "invoice").
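A rough sketch of how such a weighted combination can behave. The α/β/γ values and the choice of squashing function here are illustrative assumptions, not Dakera's actual defaults:

```python
def hybrid_score(vector_sim, bm25, importance, alpha=0.6, beta=0.3, gamma=0.1):
    # BM25 scores are unbounded, so squash into [0, 1) before mixing signals
    bm25_norm = bm25 / (bm25 + 1.0)
    return alpha * vector_sim + beta * bm25_norm + gamma * importance

candidates = [
    # (content, vector_similarity, bm25_score, importance)
    ("dry_run invoice endpoint memory", 0.55, 8.2, 0.7),
    ("deployment retro notes",          0.71, 0.0, 0.5),
]
ranked = sorted(candidates, key=lambda c: hybrid_score(*c[1:]), reverse=True)
print(ranked[0][0])  # the exact-term match outranks the purely semantic one
```

Note how the memory with the lower vector similarity still wins: a strong exact-term signal compensates for embedding drift, which is the point of combining the two.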

| Query type | Vector score | BM25 score | Winner |
| --- | --- | --- | --- |
| "deployment issues last month" | High (semantic) | Low (no exact terms) | Vector |
| "dry_run=true invoice endpoint" | Medium (topic match) | High (exact terms) | BM25 |
| "billing API timeout config" | Medium-high | Medium-high | Hybrid (both reinforce) |

On the LoCoMo long-context memory benchmark — 1,540 questions across four recall categories — hybrid retrieval consistently outperforms either signal alone. Dakera's current release achieves 87.6% overall recall accuracy against a benchmark designed to stress-test exactly these failure modes.

Problem 2: Not all memories are equal

When you store a memory, you assign it an importance score between 0.0 and 1.0. This isn't just metadata — it's a first-class retrieval signal.

# High-importance: critical configuration
client.memories.store(
    agent_id="ops-agent",
    content="Production DB replica lag threshold: 500ms — above this triggers failover",
    importance=0.95
)

# Lower-importance: incidental context
client.memories.store(
    agent_id="ops-agent",
    content="User mentioned they prefer dark mode in the UI settings",
    importance=0.4
)

Importance affects both retrieval ranking and decay rate. Two memories with the same vector similarity score to a query are ranked by importance first. A 0.95-importance memory about a production threshold surfaces above a 0.4-importance memory about UI preferences, even if both match the query semantically.
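The tie-breaking behavior can be pictured as a simple sort key (a toy model of the ranking, not the actual scoring pipeline):

```python
candidates = [
    {"content": "User prefers dark mode",                         "similarity": 0.82, "importance": 0.40},
    {"content": "Replica lag threshold: 500ms triggers failover", "similarity": 0.82, "importance": 0.95},
]
# Identical similarity scores: the tie breaks toward higher importance
ranked = sorted(candidates, key=lambda m: (m["similarity"], m["importance"]), reverse=True)
print(ranked[0]["content"])
```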

Problem 3: Decay — keeping memory sharp over time

The hardest problem in agent memory is staleness. An agent that stores everything indefinitely accumulates noise. A memory from six months ago about a staging environment configuration might conflict with current production settings. Without decay, older, stale context pollutes recall.

Dakera implements a half-life decay engine. Each memory's importance decays exponentially over time according to:

I(t) = I₀ × e^(−λt)   where   λ = ln(2) / half_life

Default half_life = 30 days for episodic memories. Configurable per memory type and namespace.

Critically, access resets decay. When a memory is recalled — because an agent used it — its importance score is restored toward its original value. This creates a natural selection pressure: memories that prove useful stay sharp, memories that are never recalled fade away. You don't need to manually curate your agent's memory; the workload curates it for you.
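The decay curve is easy to sanity-check numerically. The exact restoration policy on access isn't specified here, so this sketch models a recall simply as resetting t to zero:

```python
import math

HALF_LIFE_DAYS = 30.0               # default for episodic memories
LAM = math.log(2) / HALF_LIFE_DAYS  # λ = ln(2) / half_life

def decayed_importance(i0, days_since_last_access):
    """I(t) = I0 * e^(-λt); recalling a memory resets t to 0."""
    return i0 * math.exp(-LAM * days_since_last_access)

print(decayed_importance(0.8, 30))  # untouched for one half-life: down to ~0.4
print(decayed_importance(0.8, 0))   # just recalled: back at 0.8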

The decay engine runs as a background scheduler with configurable sweep intervals. Memories that decay below a minimum importance threshold are archived to cold storage and eventually pruned. The hot index stays fast because it only contains relevant, active memories.
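One sweep pass reduces to: compute current importance for each memory, keep what's above the floor, archive the rest. The threshold value and record shape below are hypothetical:

```python
import math

HALF_LIFE_DAYS = 30.0
DECAY = math.log(2) / HALF_LIFE_DAYS
MIN_IMPORTANCE = 0.05  # hypothetical archive threshold

def sweep(memories, now_day):
    """One background sweep: decay everything, split hot index vs cold archive."""
    hot, archive = [], []
    for m in memories:
        age = now_day - m["last_access_day"]
        current = m["importance"] * math.exp(-DECAY * age)
        (hot if current >= MIN_IMPORTANCE else archive).append(m)
    return hot, archive

memories = [
    {"content": "prod failover threshold", "importance": 0.95, "last_access_day": 90},
    {"content": "old staging note",        "importance": 0.30, "last_access_day": 0},
]
hot, archive = sweep(memories, now_day=180)
```

The high-importance memory survives three half-lives of neglect; the low-importance one, untouched for six, drops out of the hot index.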

Problem 4: Session isolation vs. cross-session continuity

Most agent workloads have a natural session structure: a conversation, a task, a workflow. But agents also need to remember things across sessions. The developer who asked about the Kubernetes deployment last Tuesday is the same developer asking about it again today — and the agent should recognize that.

Dakera uses sessions as a first-class primitive:

# Open a session for a conversation
session = client.sessions.start(agent_id="dev-assistant")

# Store memories tagged to this session
client.memories.store(
    agent_id="dev-assistant",
    session_id=session.id,
    content="Debugging nginx 502 errors on prod-cluster-3 — caused by upstream timeout",
    importance=0.85
)

# Cross-session recall — finds memories from all past sessions
results = client.memories.recall(
    agent_id="dev-assistant",
    query="nginx issues prod cluster",
    session_id=None,  # no session filter = cross-session search
    top_k=5
)

For agents that operate over long time horizons — months of interactions with the same user or system — the knowledge graph layer connects memories through shared entities. A memory about "Alice's billing configuration" and a later memory about "Alice's preferred payment method" both link to the entity person:Alice. A query about Alice can traverse the knowledge graph to surface both, even if the semantic similarity between the two memories is low.
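The entity-traversal idea can be modeled with a plain inverted index from entities to memory ids. This is a toy in-memory sketch of the concept, not Dakera's knowledge graph API:

```python
from collections import defaultdict

memories = {
    1: "Alice's billing configuration uses annual invoicing",
    2: "Alice's preferred payment method is ACH",
    3: "Bob asked about SSO setup for his team",
}
entity_links = defaultdict(set)
entity_links["person:alice"].update({1, 2})
entity_links["person:bob"].add(3)

def recall_by_entity(entity):
    """Surface every memory linked to an entity, regardless of embedding similarity."""
    return [memories[mid] for mid in sorted(entity_links[entity])]

print(recall_by_entity("person:alice"))
```

Both Alice memories surface through the shared entity link even though "billing configuration" and "payment method" may sit far apart in embedding space.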

Under the hood: HNSW and RocksDB

Dakera's vector index uses HNSW (Hierarchical Navigable Small World graphs) — a graph-based approximate nearest neighbor algorithm that trades a small amount of recall accuracy for dramatically faster query times compared to exact search. At 10 million vectors, an exact flat search might take 200ms; HNSW returns results in under 2ms at 99%+ recall accuracy.

Persistent storage uses RocksDB — a write-optimized LSM-tree key-value store. WAL (write-ahead logging) ensures durability: every store operation is fsync'd before the response returns. Snapshots enable point-in-time backups without taking the server offline.
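The fsync-before-ack contract is independent of RocksDB and easy to demonstrate directly (the file path and record format here are illustrative):

```python
import os
import tempfile

def durable_append(path, record):
    """Append a record and fsync before returning — the WAL durability
    contract: bytes are on stable storage before the caller sees success."""
    with open(path, "ab") as f:
        f.write(record + b"\n")
        f.flush()             # user-space buffer -> OS page cache
        os.fsync(f.fileno())  # OS page cache -> disk

wal = os.path.join(tempfile.gettempdir(), "dakera_wal_demo.log")
durable_append(wal, b'{"op":"store","agent_id":"my-agent"}')
```

If the process crashes after `durable_append` returns, the record is still recoverable from the log, which is what lets an LSM-tree buffer writes in memory without risking loss.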

The in-memory cache layer (Moka) keeps hot memories resident in process memory. For agents with a working set under ~50k memories — the majority of production deployments — most recall operations hit cache and return in sub-millisecond times without touching disk or the HNSW index.

Benchmark transparency: All performance claims are validated against the LoCoMo long-context memory benchmark — 1,540 questions, four recall categories (recent-event, cross-session, multi-hop, temporal). We publish full benchmark results with every release. You can run the bench yourself against any Dakera instance using the open-source dakera-bench tool.

Putting it together: a 5-line integration

All of this machinery is invisible to the developer using the SDK. Hybrid retrieval, importance decay, session management, and knowledge graph traversal are all automatic. You store memories, you recall memories.

from dakera import DakeraClient

client = DakeraClient(url="http://localhost:3300", api_key="your-key")

# Store context from this interaction
client.memories.store(agent_id="my-agent", content=context, importance=0.8)

# Before the next response, inject relevant memory
memories = client.memories.recall(agent_id="my-agent", query=user_query, top_k=5)
context_block = "\n".join(m.content for m in memories)

# Use context_block in your system prompt

The full details — including configuration options for decay half-life, importance thresholds, BM25 weight, and knowledge graph entity types — are in the documentation.

Try Dakera in your agent

Five lines of code to persistent memory. Early access is open now.