Context Window Management

Category: Optimization

Problem

LLM context windows are finite. Even with 128k- or 200k-token models, injecting all available memories into the prompt is wasteful and degrades response quality. The agent needs a strategy to decide which memories to include, how to rank them, and when to truncate, maximizing the signal-to-noise ratio within the token budget.

Architecture

This pattern recalls a broad set of candidate memories, then applies a scoring function combining relevance, recency, and importance. Memories are ranked by composite score and trimmed to fit within a configurable token budget. This ensures the most valuable context always makes it into the prompt.
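As a quick numeric sanity check of this weighting (the 0.5/0.3/0.2 weights come from the implementation below; the sample values are made up), a fresh, highly relevant memory outranks an older memory even when the older one carries a higher importance score:

```python
def composite(relevance: float, importance: float, recency: float) -> float:
    # Weighted sum; weights match score_memory() in the implementation below.
    return (relevance * 0.5) + (importance * 0.3) + (recency * 0.2)

# Fresh, highly relevant memory vs. an older but more "important" one:
fresh = composite(relevance=0.9, importance=0.4, recency=1.0)  # 0.45 + 0.12 + 0.20 = 0.77
stale = composite(relevance=0.6, importance=0.9, recency=0.1)  # 0.30 + 0.27 + 0.02 = 0.59
```

Relevance dominates by design; recency acts as a tiebreaker so that stale matches gradually drop out of the selection.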

Flow

1. Recall a broad candidate set from the memory store.
2. Score each candidate on relevance, recency, and importance.
3. Sort candidates by composite score, descending.
4. Greedily select memories in rank order until the token budget is reached.
5. Join the selected memories and inject them into the system prompt.

Implementation

from dakera import Dakera
import time

client = Dakera(base_url="http://localhost:3300", api_key="dk-...")

def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 chars per token for English."""
    return len(text) // 4

def score_memory(memory: dict, now: float) -> float:
    """Compute composite score from relevance, recency, and importance."""
    relevance = memory.get("score", 0.5)
    importance = memory.get("metadata", {}).get("importance", 0.5)
    created = memory.get("metadata", {}).get("timestamp", now)
    age_hours = (now - created) / 3600
    recency = max(0, 1.0 - (age_hours / 720))  # Decays over 30 days
    return (relevance * 0.5) + (importance * 0.3) + (recency * 0.2)

def get_context_memories(query: str, namespace: str, token_budget: int = 2000) -> str:
    """Recall and trim memories to fit within a token budget."""
    results = client.memory.recall(
        query=query,
        namespace=namespace,
        top_k=50
    )

    now = time.time()
    memories = results["results"]

    # Score and rank
    for m in memories:
        m["_composite"] = score_memory(m, now)
    memories.sort(key=lambda x: x["_composite"], reverse=True)

    # Trim to token budget
    selected = []
    tokens_used = 0
    for m in memories:
        cost = estimate_tokens(m["content"])
        if tokens_used + cost > token_budget:
            break
        selected.append(m["content"])
        tokens_used += cost

    return "\n---\n".join(selected)

# Usage: inject into prompt with budget
context = get_context_memories(
    query="How should I structure my API?",
    namespace="user-alice",
    token_budget=1500
)

system_prompt = f"""You are a helpful assistant.

Relevant context from memory (do not repeat verbatim):
{context}
"""

When to Use This Pattern

Key Considerations