The problem
Traditional RAG retrieves documents from a static corpus — it doesn't learn from interactions. If a user corrects the system, asks a follow-up, or provides new context, that knowledge is lost after the session ends. The system makes the same mistakes repeatedly.
Consider an internal knowledge base assistant. A user asks about the API rate limit. The RAG system retrieves an outdated document that says "500 requests per minute." The user corrects it: "That changed in v2.3 — it's 1000 now." The correction happens in the chat, but the underlying retrieval system doesn't learn from it. The next user who asks the same question gets the same wrong answer.
The problem compounds over time. As documents age, the gap between what the corpus says and what's actually true widens. Without a mechanism to incorporate user feedback, corrections, and new information into the retrieval pipeline, RAG systems degrade rather than improve.
How Dakera solves it
Dakera sits alongside your existing RAG pipeline as a persistent memory layer. It stores interactions, corrections, and learned context — then includes them in future retrievals alongside your static corpus.
- Hybrid retrieval (HNSW vector + BM25 fulltext) with cross-encoder reranking. Dakera doesn't just do vector similarity. It combines semantic search with full-text keyword matching, then reranks results using a cross-encoder model. This means better recall than vector-only RAG — especially for queries that include specific terms, IDs, or technical jargon.
- Memories have importance scores that decay over time. A correction stored with high importance (0.95) surfaces prominently in results, while outdated context naturally fades as its importance decays. The half-life is configurable: 30 days by default, and you can set it per memory type or per namespace (see the decay sketch after this list).
- New interactions are stored as memories that influence future retrievals. When a user provides a correction or new information, store it as a memory. The next recall query automatically includes it alongside your static corpus results. The system gets better with every interaction.
- Built-in embeddings (bge-large, 1024-dim) — no external embedding API calls. Dakera ships with its own embedding model. Your data never leaves your infrastructure for embedding computation. No OpenAI or Cohere API keys required.
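To make the decay behavior concrete, here is a minimal sketch of a half-life-based importance model. The `effective_importance` helper is hypothetical, for illustration only; Dakera applies its own decay internally, and this only shows the general shape of an exponential half-life.

```python
import math
from datetime import datetime, timezone

def effective_importance(importance: float, stored_at: datetime,
                         half_life_days: float = 30.0) -> float:
    """Exponential half-life decay: after one half-life the score halves.

    Hypothetical helper -- illustrates the model, not Dakera's internal code.
    Expects a timezone-aware `stored_at`.
    """
    age_days = (datetime.now(timezone.utc) - stored_at).total_seconds() / 86400
    return importance * math.exp(-math.log(2) * age_days / half_life_days)

# A 0.95-importance correction stored 30 days ago decays to ~0.475,
# while one stored yesterday stays close to its original score.
```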
Static RAG vs. RAG + Dakera
| Capability | Static RAG | RAG + Dakera |
|---|---|---|
| Retrieval method | Vector similarity only | Hybrid (HNSW + BM25 + cross-encoder) |
| Learns from corrections | No | Yes — stored as high-importance memories |
| Stale information | Stays forever | Fades via importance decay |
| Cross-session context | Lost after session ends | Persists and accumulates |
| Embedding dependency | External API (OpenAI, etc.) | Built-in (bge-large, 1024-dim) |
Implementation
Here's how to add a persistent memory layer to an existing RAG pipeline:
```python
from dakera import DakeraClient

client = DakeraClient(base_url="http://localhost:3300", api_key="dk-...")

# User provides a correction: store it with high importance
client.store(
    agent_id="rag-assistant",
    content="User corrected: The API rate limit is 1000 req/min, not 500. Updated in v2.3 release notes.",
    importance=0.95,
    tags=["correction", "api-limits", "verified"],
)

# Next query combines the RAG corpus with learned memories
memories = client.recall(
    agent_id="rag-assistant",
    query="What is the current API rate limit?",
    top_k=5,
)
# The correction ranks high due to its importance and recency
```
The pattern is straightforward: after your static RAG retrieval, also call `client.recall()` to pull relevant memories. Merge both result sets into your LLM context. Memories with high importance and recent timestamps naturally surface above stale corpus documents.
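As a concrete illustration of that merge step, here is a minimal sketch. The `retrieve_from_corpus` callable and the shape of the recalled memory objects (a `content` field) are assumptions made for the example, not part of Dakera's documented API; adapt them to your pipeline.

```python
from typing import Callable, List

def build_context(
    query: str,
    client,                                             # DakeraClient from the earlier snippet
    retrieve_from_corpus: Callable[[str], List[str]],   # your existing static-RAG retriever
    top_k: int = 5,
) -> str:
    # 1. Static retrieval over your existing corpus (vector store, search index, ...)
    corpus_docs = retrieve_from_corpus(query)

    # 2. Learned memories from Dakera: corrections, prior interactions
    memories = client.recall(agent_id="rag-assistant", query=query, top_k=top_k)

    # Assumption: each recalled memory exposes a `content` field; adjust to
    # whatever structure your client version actually returns.
    memory_texts = [m["content"] for m in memories]

    # 3. Put memories ahead of corpus documents so high-importance corrections
    #    outrank stale docs in the prompt handed to the LLM.
    parts = ["Learned context:"] + memory_texts + ["Corpus documents:"] + corpus_docs
    return "\n\n".join(parts)
```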
Storing learned context automatically
You don't need to wait for explicit corrections. Any interaction that produces useful context can be stored:
```python
# After the LLM generates a response, store the Q&A pair
client.store(
    agent_id="rag-assistant",
    content=f"Q: {user_query}\nA: {llm_response}\nUser feedback: {feedback}",
    importance=0.6 if feedback == "helpful" else 0.3,
    tags=["qa-pair", "auto-captured"],
)
```
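Putting the pieces together, a single request cycle might look like the sketch below. It reuses the `build_context` helper sketched earlier; the `generate_answer` and `collect_feedback` callables stand in for your own LLM call and feedback UI, so treat them as placeholders rather than Dakera APIs.

```python
from typing import Callable, List

def handle_request(
    client,
    user_query: str,
    retrieve_from_corpus: Callable[[str], List[str]],
    generate_answer: Callable[[str, str], str],   # placeholder for your LLM call
    collect_feedback: Callable[[], str],          # placeholder: returns e.g. "helpful"
) -> str:
    # Merge static corpus results with Dakera memories (see build_context above)
    context = build_context(user_query, client, retrieve_from_corpus)

    llm_response = generate_answer(context, user_query)
    feedback = collect_feedback()

    # Auto-capture the turn so future recalls benefit from it
    client.store(
        agent_id="rag-assistant",
        content=f"Q: {user_query}\nA: {llm_response}\nUser feedback: {feedback}",
        importance=0.6 if feedback == "helpful" else 0.3,
        tags=["qa-pair", "auto-captured"],
    )
    return llm_response
```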
Decay keeps things clean: You don't need to manually curate memories. Low-importance auto-captured context fades naturally. High-importance corrections persist. Memories that are recalled frequently have their decay reset — the system self-curates based on actual usage patterns.
Deploy persistent memory for your agents
Self-hosted, no external API dependencies, production-ready. Add a learning layer to your RAG pipeline in under 10 minutes.