Context Window Management

Category: Optimization

Problem

LLM context windows are finite. Even with 128k- or 200k-token models, injecting all available memories into the prompt is wasteful and degrades response quality. The agent needs a strategy to decide which memories to include, how to rank them, and when to truncate, maximizing the signal-to-noise ratio within the token budget.

Architecture

This pattern recalls a broad set of candidate memories, then applies a scoring function combining relevance, recency, and importance. Memories are ranked by composite score and trimmed to fit within a configurable token budget. This ensures the most valuable context always makes it into the prompt.
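For example, with the weights used in the implementation below (0.5 relevance, 0.3 importance, 0.2 recency), a memory with relevance 0.9, importance 0.7, and recency 0.4 gets a composite score of 0.5 × 0.9 + 0.3 × 0.7 + 0.2 × 0.4 = 0.74.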

Flow

  • Recall a generous set of candidate memories (top_k=50+)
  • Score each memory: composite of relevance score, recency decay, and importance weight
  • Sort by composite score descending
  • Accumulate memories until token budget is reached, then stop

Implementation

from dakera import Dakera
import time

client = Dakera(base_url="http://localhost:3300", api_key="dk-...")

def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 chars per token for English."""
    return len(text) // 4
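
# The 4-chars-per-token heuristic is deliberately cheap and approximate. For an
# exact count, a tokenizer such as tiktoken can be swapped in (assumption: the
# optional tiktoken package is installed and "cl100k_base" roughly matches the
# target model's tokenizer):
#
#   import tiktoken
#   _enc = tiktoken.get_encoding("cl100k_base")
#
#   def estimate_tokens(text: str) -> int:
#       return len(_enc.encode(text))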

def score_memory(memory: dict, now: float) -> float:
    """Compute composite score from relevance, recency, and importance."""
    relevance = memory.get("score", 0.5)
    importance = memory.get("metadata", {}).get("importance", 0.5)
    created = memory.get("metadata", {}).get("timestamp", now)
    age_hours = (now - created) / 3600
    recency = max(0.0, 1.0 - (age_hours / 720))  # Linear decay to zero over 30 days (720 hours)
    return (relevance * 0.5) + (importance * 0.3) + (recency * 0.2)

def get_context_memories(query: str, namespace: str, token_budget: int = 2000) -> str:
    """Recall and trim memories to fit within a token budget."""
    results = client.memory.recall(
        query=query,
        namespace=namespace,
        top_k=50
    )

    now = time.time()
    memories = results["results"]

    # Score and rank
    for m in memories:
        m["_composite"] = score_memory(m, now)
    memories.sort(key=lambda x: x["_composite"], reverse=True)

    # Trim to token budget
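    # Note: `break` stops at the first memory that exceeds the remaining budget,
    # which keeps only the top-scored prefix; using `continue` instead would keep
    # packing smaller, lower-scored memories that still fit. The "---" separators
    # added below are not counted against the budget.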
    selected = []
    tokens_used = 0
    for m in memories:
        cost = estimate_tokens(m["content"])
        if tokens_used + cost > token_budget:
            break
        selected.append(m["content"])
        tokens_used += cost

    return "\n---\n".join(selected)

# Usage: inject into prompt with budget
context = get_context_memories(
    query="How should I structure my API?",
    namespace="user-alice",
    token_budget=1500
)

system_prompt = f"""You are a helpful assistant.

Relevant context from memory (do not repeat verbatim):
{context}
"""

When to Use This Pattern

  • Any production agent that injects memory into LLM prompts
  • Cost-sensitive applications where token usage maps to spend
  • Agents with large memory stores where naive recall returns too much
  • Real-time applications where prompt size affects latency

Key Considerations

  • Tune the scoring weights (relevance/importance/recency) based on your use case
  • Reserve token budget for the user message and expected response length (see the sketch after this list)
  • Consider using memory compression for frequently recalled but verbose memories
  • Monitor hit rates — if important memories are being trimmed, increase the budget or importance scores
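
A minimal sketch of the budget-reservation point above, reusing estimate_tokens and get_context_memories from the implementation; the context_window, system_overhead, and max_response numbers are illustrative assumptions, not Dakera defaults:

def memory_budget(user_message: str,
                  context_window: int = 8000,
                  system_overhead: int = 500,
                  max_response: int = 1000) -> int:
    """Reserve space for the system prompt scaffolding, the user message, and
    the expected response, then give whatever is left to recalled memories."""
    reserved = system_overhead + estimate_tokens(user_message) + max_response
    return max(0, context_window - reserved)

user_message = "How should I structure my API?"
context = get_context_memories(
    query=user_message,
    namespace="user-alice",
    token_budget=memory_budget(user_message)
)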