Track
Window · Summarize · Retrieve · Evict
LLMs have a hard limit on how much text they can process at once — the context window. Once the conversation grows beyond that window, older messages get dropped. This is fine for a single Q&A, but agents often need to maintain context across many turns, or remember facts from earlier in a long session.
Memory systems solve this by selectively deciding what to keep, what to compress, and what to discard. Instead of one fixed window, you build a tiered system: recent messages stay intact, older ones get summarized, and important facts get pinned so they're never evicted.
The simplest memory strategy is a sliding window: keep the most recent messages within a token budget, always preserve the system prompt, and drop the oldest non-system messages when the budget is exceeded.
When the budget fills, the oldest messages are evicted. The agent can still see the recent ones plus the system prompt. This keeps memory bounded and predictable.
Sliding windows work by recency, but sometimes the most relevant piece of information isn't the most recent. Query-based retrieval lets the agent search memory by meaning rather than time.
The agent stores facts as they accumulate, and when it needs something specific, it queries the store. This turns memory from a simple FIFO queue into a searchable knowledge base.
Concrete Example
class SlidingMemory:
def __init__(self, max_tokens, count_tokens):
self.max_tokens = max_tokens
self.count_tokens = count_tokens
self.messages = []
def add(self, role, content):
self.messages.append({"role": role, "content": content})
def render(self):
system = [m for m in self.messages
if m["role"] == "system"]
others = [m for m in self.messages
if m["role"] != "system"]
budget = self.max_tokens
for m in system:
budget -= self.count_tokens(m["content"])
kept = []
for msg in reversed(others):
tokens = self.count_tokens(msg["content"])
if budget - tokens >= 0:
budget -= tokens
kept.append(msg)
return system + list(reversed(kept))System messages are pinned — they're never evicted. Non-system messages are scanned newest-first; each fits within the remaining budget or is dropped. The result is always within max_tokens while retaining the most recent context.
Keep the newest messages within a token budget; drop oldest first.
System prompts are always preserved and never evicted.
Compress dropped context into a compact summary before removal.
Search memory by semantic relevance, not just recency.
Automatically expire facts that are too old, keeping the store fresh.
6 problems. Sign in to start solving.
Sign in to open a workspace and solve these problems.