
A simple relational shape works well and keeps options open:
```sql
-- Core session objects
CREATE TABLE sessions (
    session_id   TEXT PRIMARY KEY,
    user_id      TEXT NOT NULL,
    status       TEXT CHECK (status IN ('opened','active','compacted','archived')),
    created_at   TIMESTAMP NOT NULL,
    last_turn_at TIMESTAMP,
    turn_count   INT DEFAULT 0,
    token_in     INT DEFAULT 0,
    token_out    INT DEFAULT 0
);

CREATE TABLE messages (
    msg_id     TEXT PRIMARY KEY,
    session_id TEXT NOT NULL,
    role       TEXT CHECK (role IN ('user','assistant','system','tool')),
    content    TEXT NOT NULL,
    tokens     INT,
    created_at TIMESTAMP NOT NULL,
    FOREIGN KEY (session_id) REFERENCES sessions(session_id)
);

CREATE INDEX idx_messages_session_time ON messages(session_id, created_at);
```
Short-term history can be modeled as a ring buffer of most-recent N tokens/turns. Beyond thresholds, compact older spans into summaries and milestone notes.
```python
class Window:
    def __init__(self, max_tokens: int, keep_last_turns: int = 6):
        self.max_tokens = max_tokens
        self.keep_last_turns = keep_last_turns

    def assemble(self, sys_msg, recent_msgs, milestone_notes, retrieved_snippets, tool_outputs):
        # Deterministic ordering: system prompt, recent turns, notes, retrieval, tools
        context = [sys_msg]
        context += recent_msgs[-self.keep_last_turns:]
        context += milestone_notes
        context += retrieved_snippets
        context += tool_outputs
        return trim_to_token_budget(context, self.max_tokens)
```
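`trim_to_token_budget` is left undefined above; a minimal sketch, assuming each context item carries a `tokens` count (as the `messages` schema does) and that the system prompt must always survive the trim:

```python
def trim_to_token_budget(context, max_tokens):
    # Walk from the end (most recent) backward, keeping items while they fit.
    # The system prompt (index 0) is always retained.
    kept, used = [], context[0].tokens
    for item in reversed(context[1:]):
        if used + item.tokens <= max_tokens:
            kept.append(item)
            used += item.tokens
    return [context[0]] + list(reversed(kept))
```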
Compaction triggers keep the window stable:
```python
def maybe_compact(session, messages, thresholds):
    too_many_turns = session.turn_count > thresholds.max_turns
    too_many_tokens = (session.token_in + session.token_out) > thresholds.max_tokens
    if not (too_many_turns or too_many_tokens):
        return None
    span = select_older_span(messages, keep_last=thresholds.keep_last_turns)
    summary = summarize_extractive_then_abstractive(span)
    milestone = extract_structured_facts(span)  # commitments, constraints, ids
    persist_summary_and_milestone(session.id, summary, milestone)
    mark_span_compacted(span)
    return summary, milestone
```
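`select_older_span` just needs to carve off everything except the turns the window keeps verbatim; a sketch under that assumption (the `compacted` flag is illustrative):

```python
def select_older_span(messages, keep_last: int):
    # Everything except the last `keep_last` turns is eligible for compaction;
    # already-compacted messages are skipped so each span is summarized once.
    older = messages[:-keep_last] if keep_last else messages
    return [m for m in older if not getattr(m, "compacted", False)]
```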
Key design notes
Think of PageIn as a ranked include-list and PageOut as principled forgetting in AI Agent context management.
Scoring function (example):
```python
def score(item, now):
    # item: {type, text, embedding, timestamp, role, metadata}
    w_recency = 0.35
    w_semantic = 0.45
    w_role = 0.10
    w_signal = 0.10  # clicks, tool success, citations
    recency = exp_decay(now - item.timestamp, half_life_minutes=45)
    semantic = cosine(item.embedding, query_embedding())
    role_boost = 1.0 if item.role in ("system", "milestone") else 0.7
    signal = min(1.0, item.metadata.get("utilization_rate", 0.0))
    return w_recency*recency + w_semantic*semantic + w_role*role_boost + w_signal*signal
```
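The `exp_decay` helper above is assumed; one way to write it so an item's recency score halves every `half_life_minutes`:

```python
import math

def exp_decay(age, half_life_minutes: float) -> float:
    # Score halves every half-life: 1.0 when brand new, 0.5 at one half-life.
    minutes = age.total_seconds() / 60.0  # `age` is a datetime.timedelta
    return 0.5 ** (minutes / half_life_minutes)
```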
Selection and eviction:
```python
def page_in_out(candidates, budget_tokens):
    ranked = sorted(candidates, key=lambda x: score(x, now()), reverse=True)
    selected, used = [], 0
    for c in ranked:
        if used + c.tokens <= budget_tokens:
            selected.append(c)
            used += c.tokens
    # PageOut policy: LRU for plain chat, semantic TTL for knowledge
    evict = [x for x in candidates if x not in selected and should_evict(x)]
    return selected, evict

def should_evict(item):
    if item.type == 'verbatim_turn' and is_older_than(item, minutes=120):
        return True  # LRU-style: stale verbatim turns age out
    if item.type == 'snippet' and below_similarity(item, 0.25):
        return True  # semantic TTL: snippets that drifted off-topic drop
    return False
```
Operational notes
Complexity and failure modes
You’ll need both hard compression (extractive selection, token pruning) and soft compression (abstractive summarization). A hybrid pipeline works well in practice:
```python
def summarize_extractive_then_abstractive(span):
    # Extract first so numbers and IDs survive, then abstract for coherence
    key_sents = extract_top_k_sentences(span, k=8, with_numbers=True)
    draft = llm_abstractive_summary(key_sents, style="bullet+yaml_facts")
    return draft
```
Safeguards against summary drift
Comparison (rule-of-thumb)
| Method | Strength | Risk | Added latency |
|---|---|---|---|
| Extractive | Faithful, cheap | Fragmented context | Low |
| Abstractive | Coherent gist | Miss rare facts | Medium |
| Token-pruning | Big savings | Cryptic prompts | Medium–High |
Citations: See the 2024 Prompt Compression Survey and the 2025 LLM-DCP paper for technique overviews and trade-offs.
Vector-only retrieval is great for paraphrases but brittle for IDs and policies. In AI Agent context management, hybrid pipelines combine filters, sparse lexical signals, and dense vectors; top-K gets fused and optionally reranked.
Retrieval plan
Pseudocode
```python
def hybrid_retrieve(query, k=20, filters=None):
    # Over-fetch from each retriever, then fuse down to the top k
    cand_a = bm25_search(query, filters=filters, k=3*k)
    cand_b = vector_search(embedding(query), filters=filters, k=3*k)
    fused = reciprocal_rank_fusion(cand_a, cand_b, top=k)
    return fused
```
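`reciprocal_rank_fusion` follows the standard RRF formula (each list contributes 1/(k0 + rank) per document); a minimal sketch, assuming results carry a stable `doc_id`:

```python
def reciprocal_rank_fusion(*result_lists, top=20, k0=60):
    # Standard RRF scoring; k0=60 is the commonly used damping constant.
    scores = {}
    for results in result_lists:
        for rank, doc in enumerate(results, start=1):
            scores[doc.doc_id] = scores.get(doc.doc_id, 0.0) + 1.0 / (k0 + rank)
    by_id = {d.doc_id: d for results in result_lists for d in results}
    best = sorted(scores, key=scores.get, reverse=True)[:top]
    return [by_id[i] for i in best]
```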
When to bypass vectors
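As noted above, dense retrieval is brittle for exact identifiers and policy names. A sketch of a pre-retrieval router that sends ID-shaped queries straight to an exact-match lookup; the regex and the `exact_lookup` helper are illustrative assumptions:

```python
import re

# Illustrative pattern for ticket/order style identifiers, e.g. "TKT-48213"
ID_PATTERN = re.compile(r"\b[A-Z]{2,5}-\d{3,}\b")

def retrieve(query, k=20, filters=None):
    ids = ID_PATTERN.findall(query)
    if ids:
        # Exact-match lookup; no embedding involved
        return exact_lookup(ids, filters=filters)
    return hybrid_retrieve(query, k=k, filters=filters)
```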
Authoritative references: Elastic’s hybrid search guidance (2024–2026), Weaviate hybrid search primers, and Pinecone’s RAG guides offer concrete fusion and reranking recipes.
Rather than stuffing tool specs and transcripts into the prompt, call tools programmatically and log the results. MCP formalizes secure, two-way connections to tools and data.
Principles
Sketch
```python
@retry(idempotent=True, backoff=expo)
def update_ticket(tool, ticket_id, payload):
    # Write and audit-log inside one transaction, so the log never lies
    with atomic():
        ok = tool.call("update_ticket", {"id": ticket_id, "payload": payload})
        log_tool_io(tool="ticketing", op="update", id=ticket_id, ok=ok)
        return ok
```
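The `idempotent=True` flag matters: retries with exponential backoff are only safe when replaying the same update cannot double-apply it, so non-idempotent operations should carry a client-generated request ID the tool can deduplicate on.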
References: Anthropic’s MCP docs and engineering posts (2024–2026) detail patterns for programmatic tool calling, catalogs, and secure connections.
Goal: an agent that remembers user preferences and prior resolutions across sessions, retrieves the right policy/ticket context, and safely executes updates.
Minimal schema
```sql
CREATE TABLE users (
    user_id    TEXT PRIMARY KEY,
    locale     TEXT,
    tier       TEXT,
    created_at TIMESTAMP
);

CREATE TABLE memories (
    memory_id    TEXT PRIMARY KEY,
    user_id      TEXT NOT NULL,
    kind         TEXT CHECK (kind IN ('preference','fact','episodic')),
    key          TEXT,
    value        JSONB,
    embedding    VECTOR,  -- e.g. pgvector's vector type
    created_at   TIMESTAMP,
    last_used_at TIMESTAMP,
    FOREIGN KEY (user_id) REFERENCES users(user_id)
);

CREATE TABLE tool_logs (
    id         TEXT PRIMARY KEY,
    session_id TEXT,
    tool       TEXT,
    op         TEXT,
    req        JSONB,
    res        JSONB,
    ok         BOOLEAN,
    created_at TIMESTAMP
);
```
Memory capture and consolidation
```python
def capture_long_term(session_id, turn_bundle):
    prefs = extract_preferences(turn_bundle)
    facts = extract_facts(turn_bundle)
    episodic = summarize_episode(turn_bundle)
    upsert_memories(prefs + facts + [episodic])
```
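`upsert_memories` should be keyed so repeated extractions update rather than duplicate; a sketch against the `memories` schema above, assuming a `UNIQUE (user_id, kind, key)` constraint has been added and a `db` handle exists:

```python
def upsert_memories(items):
    # Dedup on (user_id, kind, key): new evidence refreshes the value and
    # bumps last_used_at instead of inserting a near-duplicate row.
    for m in items:
        db.execute(
            """
            INSERT INTO memories (memory_id, user_id, kind, key, value,
                                  embedding, created_at, last_used_at)
            VALUES (%(memory_id)s, %(user_id)s, %(kind)s, %(key)s, %(value)s,
                    %(embedding)s, now(), now())
            ON CONFLICT (user_id, kind, key)
            DO UPDATE SET value = EXCLUDED.value, last_used_at = now()
            """,
            m,
        )
```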
Per-turn retrieval and assembly
```python
def assemble_turn_context(session, user_id, query):
    filters = {"locale": user_locale(user_id), "tier": user_tier(user_id)}
    mem_cands = search_memories(user_id, embedding(query), k=12)
    kb_cands = hybrid_retrieve(query, k=20, filters=filters)
    candidates = (
        last_k_turns(session.id, k=6)
        + milestone_notes(session.id)
        + mem_cands
        + kb_cands
        + recent_tool_summaries(session.id)
    )
    selected, _ = page_in_out(candidates, budget_tokens=8_000)
    return selected
```
Tool actions with MCP
```python
def maybe_escalate(session, answer):
    if needs_escalation(answer):
        return update_ticket(mcp_ticket_tool(), ticket_id(answer), payload(answer))
    return None
```
Turn loop (simplified)
```python
def handle_turn(session, user_id, user_msg):
    query = user_msg.content
    selected = assemble_turn_context(session, user_id, query)
    reply = llm_chat(selected + [user_msg])
    action = maybe_escalate(session, reply)
    capture_long_term(session.id, turn_bundle=(user_msg, reply, action))
    return reply
```
A neutral product example (Knowledge Base Source)
For customer data, keep the context kernel (memories, summaries, tool logs) close to the data itself, under the same access controls and audit trail.
Checklist (mapped to widely used controls)
Background sources: NIST CSRC materials on encryption, access control, and auditing provide stable anchors for privacy-first operations.
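One concrete control worth wiring in early: redact direct identifiers before stored content leaves the database for the LLM. A minimal sketch; the patterns and placeholder tokens are illustrative assumptions, not production-grade coverage:

```python
import re

# Illustrative patterns; real systems need locale-aware, audited coverage
REDACTIONS = [
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "<EMAIL>"),
    (re.compile(r"\b\d{3}[- ]?\d{2}[- ]?\d{4}\b"), "<SSN>"),
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "<CARD>"),
]

def redact(text: str) -> str:
    for pattern, token in REDACTIONS:
        text = pattern.sub(token, text)
    return text
```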
Define success with a small pilot before broad rollout.
Metrics
Pilot plan (example targets)
| Area | Baseline | Target after windowing + hybrid |
|---|---|---|
| Token in (p50) | 11k | 6k |
| Latency (p50) | 3.2s | 2.1s |
| nDCG@10 | 0.62 | 0.74 |
| Task success | 72% | 83% |
Run A/Bs: toggle compression, vary K in hybrid retrieval, and compare keep_last_turns=4 vs. 8. Instrument compaction overhead to confirm savings aren’t eaten by summarization calls.
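A sketch of that toggle matrix, just to make the A/B surface explicit; the knob names are illustrative:

```python
from itertools import product

# Illustrative A/B grid over the knobs mentioned above
GRID = {
    "compression": [False, True],
    "retrieval_k": [10, 20, 40],
    "keep_last_turns": [4, 8],
}

def experiment_arms():
    keys = list(GRID)
    for values in product(*GRID.values()):
        yield dict(zip(keys, values))

# Measure per arm: token-in p50, latency p50, nDCG@10, task success,
# plus the summarization-call overhead incurred by compaction.
```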
Keep the window small, the rules explicit, and the logs honest. Start with deterministic trims, add hybrid retrieval, and layer compression only where it pays. For reliable AI Agent context management at scale, treat recall as a product feature, not a side effect. If you need a context base purpose-built for agents, evaluate options like puppyone alongside your existing stack.