Context Window Management
Category: Optimization
Problem
LLM context windows are finite. Even with 128k- or 200k-token models, injecting every available memory into the prompt is wasteful and degrades response quality. The agent needs a strategy for deciding which memories to include, how to rank them, and when to truncate, maximizing the signal-to-noise ratio within the token budget.
Architecture
This pattern recalls a broad set of candidate memories, then applies a scoring function combining relevance, recency, and importance. Memories are ranked by composite score and trimmed to fit within a configurable token budget. This ensures the most valuable context always makes it into the prompt.
Flow
- Recall a generous set of candidate memories (top_k=50+)
- Score each memory: composite of relevance score, recency decay, and importance weight
- Sort by composite score descending
- Accumulate memories until the token budget is exhausted, then stop
Implementation
from dakera import Dakera
import time
client = Dakera(base_url="http://localhost:3300", api_key="dk-...")
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 chars per token for English text."""
    return len(text) // 4
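# If you need exact counts, swap in a real tokenizer. A sketch using tiktoken
# (an assumption; pick the encoding that matches your model):
#
#   import tiktoken
#   _enc = tiktoken.get_encoding("cl100k_base")
#
#   def estimate_tokens(text: str) -> int:
#       return len(_enc.encode(text))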
def score_memory(memory: dict, now: float) -> float:
    """Compute composite score from relevance, recency, and importance."""
    relevance = memory.get("score", 0.5)
    importance = memory.get("metadata", {}).get("importance", 0.5)
    created = memory.get("metadata", {}).get("timestamp", now)
    age_hours = (now - created) / 3600
    recency = max(0.0, 1.0 - (age_hours / 720))  # Linear decay to zero over 30 days (720 hours)
    return (relevance * 0.5) + (importance * 0.3) + (recency * 0.2)
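# A gentler alternative is exponential decay, which never reaches exactly zero
# (the 10-day half-life is an illustrative assumption):
#
#   recency = 0.5 ** (age_hours / 240)  # Halves every 240 hours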
def get_context_memories(query: str, namespace: str, token_budget: int = 2000) -> str:
    """Recall and trim memories to fit within a token budget."""
    results = client.memory.recall(
        query=query,
        namespace=namespace,
        top_k=50,
    )
    now = time.time()
    memories = results["results"]

    # Score and rank, best first
    for m in memories:
        m["_composite"] = score_memory(m, now)
    memories.sort(key=lambda x: x["_composite"], reverse=True)

    # Trim to token budget
    selected = []
    tokens_used = 0
    for m in memories:
        cost = estimate_tokens(m["content"])
        if tokens_used + cost > token_budget:
            break  # Stop at the first memory that would overflow the budget
        selected.append(m["content"])
        tokens_used += cost

    return "\n---\n".join(selected)
# Usage: inject into the prompt with an explicit budget
context = get_context_memories(
    query="How should I structure my API?",
    namespace="user-alice",
    token_budget=1500,
)
system_prompt = f"""You are a helpful assistant.
Relevant context from memory (do not repeat verbatim):
{context}
"""
When to Use This Pattern
- Any production agent that injects memory into LLM prompts
- Cost-sensitive applications where token usage maps to spend
- Agents with large memory stores where naive recall returns too much
- Real-time applications where prompt size affects latency
Key Considerations
- Tune the scoring weights (relevance/importance/recency) based on your use case
- Reserve part of the token budget for the user message and the expected response length (see the budgeting sketch after this list)
- Consider using memory compression for frequently recalled but verbose memories
- Monitor hit rates: if important memories are being trimmed, increase the budget or their importance scores (see the monitoring sketch after this list)
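A minimal budgeting sketch for the reservation advice above, assuming a 128k-token model; the context size, response reserve, and system overhead are illustrative numbers, not values from the Dakera API:

MODEL_CONTEXT = 128_000    # Assumed total context window, in tokens
RESPONSE_RESERVE = 4_000   # Assumed head-room for the model's reply

def memory_budget(user_message: str, system_overhead: int = 500) -> int:
    """Tokens left for memories after reserving space for everything else."""
    used = estimate_tokens(user_message) + system_overhead + RESPONSE_RESERVE
    return max(0, MODEL_CONTEXT - used)

user_message = "How should I structure my API?"
context = get_context_memories(
    query=user_message,
    namespace="user-alice",
    token_budget=min(2000, memory_budget(user_message)),
)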
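And a hypothetical monitoring hook for the hit-rate advice; the function name and threshold are illustrative. Given the ranked candidates and how many of them fit the budget, it counts important memories that were dropped:

def count_trimmed_important(ranked: list, kept: int, threshold: float = 0.8) -> int:
    """Count dropped memories whose importance meets the threshold."""
    dropped = ranked[kept:]
    return sum(
        1 for m in dropped
        if m.get("metadata", {}).get("importance", 0.0) >= threshold
    )

If this count stays above zero across many requests, raise the token budget or revisit how importance is assigned at write time.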