Context Window Management

Category: Optimization

Problem

LLM context windows are finite. Even with 128k- or 200k-token models, injecting all available memories into the prompt is wasteful and degrades response quality. The agent needs a strategy to decide which memories to include, how to rank them, and when to truncate, maximizing the signal-to-noise ratio within the token budget.

Architecture

This pattern recalls a broad set of candidate memories, then applies a scoring function combining relevance, recency, and importance. Memories are ranked by composite score and trimmed to fit within a configurable token budget. This ensures the most valuable context always makes it into the prompt.
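For example, with the weights used in the implementation below (0.5 relevance, 0.3 importance, 0.2 recency), a memory with relevance 0.9, importance 0.7, and recency 0.4 gets a composite score of 0.5 × 0.9 + 0.3 × 0.7 + 0.2 × 0.4 = 0.74.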

Flow

  • Recall a generous set of candidate memories (top_k=50+)
  • Score each memory: composite of relevance score, recency decay, and importance weight
  • Sort by composite score descending
  • Accumulate memories until token budget is reached, then stop

Implementation

from dakera import Dakera
import time

client = Dakera(base_url="http://localhost:3300", api_key="dk-...")

def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 chars per token for English."""
    return len(text) // 4
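
# The 4-chars-per-token heuristic is deliberately cheap and approximate. For an
# exact count, a tokenizer such as tiktoken can be swapped in (assumption: the
# optional tiktoken package is installed and "cl100k_base" roughly matches the
# target model's tokenizer):
#
#   import tiktoken
#   _enc = tiktoken.get_encoding("cl100k_base")
#
#   def estimate_tokens(text: str) -> int:
#       return len(_enc.encode(text))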

def score_memory(memory: dict, now: float) -> float:
    """Compute composite score from relevance, recency, and importance."""
    relevance = memory.get("score", 0.5)
    importance = memory.get("metadata", {}).get("importance", 0.5)
    created = memory.get("metadata", {}).get("timestamp", now)
    age_hours = (now - created) / 3600
    recency = max(0.0, 1.0 - (age_hours / 720))  # Linear decay to zero over 30 days (720 hours)
    return (relevance * 0.5) + (importance * 0.3) + (recency * 0.2)

def get_context_memories(query: str, namespace: str, token_budget: int = 2000) -> str:
    """Recall and trim memories to fit within a token budget."""
    results = client.memory.recall(
        query=query,
        namespace=namespace,
        top_k=50
    )

    now = time.time()
    memories = results["results"]

    # Score and rank
    for m in memories:
        m["_composite"] = score_memory(m, now)
    memories.sort(key=lambda x: x["_composite"], reverse=True)

    # Trim to token budget
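    # Note: `break` stops at the first memory that exceeds the remaining budget,
    # which keeps only the top-scored prefix; using `continue` instead would keep
    # packing smaller, lower-scored memories that still fit. The "---" separators
    # added below are not counted against the budget.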
    selected = []
    tokens_used = 0
    for m in memories:
        cost = estimate_tokens(m["content"])
        if tokens_used + cost > token_budget:
            break
        selected.append(m["content"])
        tokens_used += cost

    return "\n---\n".join(selected)

# Usage: inject into prompt with budget
context = get_context_memories(
    query="How should I structure my API?",
    namespace="user-alice",
    token_budget=1500
)

system_prompt = f"""You are a helpful assistant.

Relevant context from memory (do not repeat verbatim):
{context}
"""

When to Use This Pattern

  • Any production agent that injects memory into LLM prompts
  • Cost-sensitive applications where token usage maps to spend
  • Agents with large memory stores where naive recall returns too much
  • Real-time applications where prompt size affects latency

Key Considerations

  • Tune the scoring weights (relevance/importance/recency) based on your use case
  • Reserve token budget for the user message and expected response length (see the sketch after this list)
  • Consider using memory compression for frequently recalled but verbose memories
  • Monitor hit rates — if important memories are being trimmed, increase the budget or importance scores
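
A minimal sketch of the budget-reservation point above, reusing estimate_tokens and get_context_memories from the implementation; the context_window, system_overhead, and max_response numbers are illustrative assumptions, not Dakera defaults:

def memory_budget(user_message: str,
                  context_window: int = 8000,
                  system_overhead: int = 500,
                  max_response: int = 1000) -> int:
    """Reserve space for the system prompt scaffolding, the user message, and
    the expected response, then give whatever is left to recalled memories."""
    reserved = system_overhead + estimate_tokens(user_message) + max_response
    return max(0, context_window - reserved)

user_message = "How should I structure my API?"
context = get_context_memories(
    query=user_message,
    namespace="user-alice",
    token_budget=memory_budget(user_message)
)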