AI & Agents

How to Implement Memory Compaction for Long-Running AI Agents

Long-running AI agents lose track of earlier reasoning as conversations grow beyond context window limits. Memory compaction solves this by summarizing, pruning, and compressing conversation history while preserving the facts and decisions that matter. This guide walks through five production-tested compaction strategies, from anchored summarization to hybrid graph-vector retrieval, with concrete implementation patterns for each.

Fast.io Editorial Team 12 min read
Compaction keeps agents sharp across thousands of turns.

Why Context Drift Kills Agents Before Token Limits Do

Most developers assume that context window size is the hard constraint on agent sessions. The real problem is subtler. As conversations grow, the agent's attention spreads thin across thousands of tokens, and critical early decisions get diluted by later noise. This is context drift, and according to enterprise deployment data from 2025, roughly 65% of AI agent failures traced back to context drift or memory loss during multi-step reasoning rather than raw token exhaustion.

Context drift manifests in predictable ways. An agent building a codebase over 50 turns might forget an architectural decision from turn 3. A research agent might contradict its own earlier findings. A customer support agent might ask for information the user already provided. These failures erode trust and force human operators to intervene, defeating the purpose of autonomous agents.

Expanding the context window does not fix drift. A 200K-token window still suffers from attention degradation on long sequences, and inference costs scale with input length. The practical solution is compaction: reducing the conversation to its essential facts, decisions, and state while discarding redundant or superseded information. The industry has converged on this insight. As of early 2026, the focus has shifted from "bigger windows" to "smarter context management" across every major provider.

Five strategies have emerged as production-viable approaches to compaction. Each addresses a different failure mode, and most production systems combine two or more. The rest of this guide walks through each one with enough detail to implement it.

Dashboard showing AI audit trail and conversation context tracking

How Anchored Summarization Preserves Decision History

Anchored iterative summarization is the most widely adopted compaction technique. The core idea: maintain a persistent, structured summary that gets incrementally updated as new conversation segments are processed. When the context window approaches capacity, only the newly truncated portion gets summarized and merged into the existing summary.

This differs from naive summarization, where the entire conversation is re-summarized from scratch at each compression point. Re-summarization loses detail with every pass. Facts that seemed unimportant in one summary might turn out to be critical later, and they are already gone. Anchored summarization avoids this by never regenerating the full summary from scratch.

Factory.ai published evaluations comparing anchored summarization against provider-native approaches. Their structured summarization scored 3.70 overall across accuracy, context retention, artifact tracking, completeness, and continuity dimensions, compared to 3.44 for Anthropic's default compaction and 3.35 for OpenAI's. The key advantage was artifact tracking: anchored summaries maintained explicit sections for session intent, file modifications, decisions made, and next steps, which prevented the "silent dropping" of file paths and variable names that freeform summarization tends to cause.

To implement anchored summarization in your own agent, structure the persistent summary with explicit sections:

### Session Intent
Building a data pipeline for customer analytics

### Key Decisions
- Using PostgreSQL over MongoDB (turn 12, rationale: ACID compliance)
- Batch size set to 1000 records (turn 23, after benchmarking)

### Modified Artifacts
- pipeline/etl.py: added retry logic (turn 31)
- config/db.yaml: connection pool set to 20 (turn 15)

### Current State
ETL pipeline runs but fails on malformed dates in source CSV

### Next Steps
Add date validation before insert stage

When compression triggers, summarize only the new turns since the last compression, then merge those facts into the relevant sections. This preserves the full decision history while keeping the summary compact. The structured format also makes it easy for developers to inspect and debug the agent's memory state.

Provider-Native Compaction APIs

If you are building on Claude or OpenAI, both now offer built-in compaction that handles summarization server-side. These APIs remove the need to manage compression logic yourself, though they come with tradeoffs in control and customization.

Anthropic shipped its compaction API under the beta flag compact-2026-01-12. When enabled, the API monitors input token count and automatically summarizes earlier conversation history when it crosses a configurable threshold. The minimum trigger threshold is 50,000 tokens. The summary gets stored in a special compaction block, and subsequent requests continue from that compressed state with no client-side conversation management required. It currently supports Claude Opus 4.6 and Claude Sonnet 4.6, and works across the direct API, AWS Bedrock, Google Vertex AI, and Microsoft Foundry.

You can customize what the summarizer preserves by passing an instructions field. This is important for technical agents: the default summarization prompt focuses on conversational continuity, but coding agents need to preserve exact variable names, file paths, and error messages. A custom instruction like "preserve all file paths, function signatures, and error stack traces verbatim" improves downstream accuracy.

OpenAI's approach through the Agents SDK uses OpenAIResponsesCompactionSession, which can auto-compact after each turn based on should_trigger_compaction. For lower-latency scenarios, you can disable auto-compaction and call run_compaction() manually between turns when your own heuristics say it is time.

The tradeoff with provider-native compaction is visibility. You send the conversation in, get a compressed version back, but have limited control over what gets preserved or dropped. Factory.ai's benchmarks showed that custom anchored summarization outperformed both Anthropic and OpenAI defaults, particularly on artifact tracking (maintaining references to specific files, configs, and code changes). For straightforward conversational agents, provider-native APIs are the fast path to production. For agents that manipulate files, code, or structured data, layering custom summarization on top yields better results.

Cloudflare's Agent Memory takes a different approach entirely. Rather than compacting and discarding, it ships the conversation to a separate memory service during compaction for permanent ingestion. Extracted facts get classified as Facts, Events, Instructions, or Tasks, stored with content-addressed IDs, and made retrievable through five parallel channels: full-text search, exact key lookup, raw message search, direct vector similarity, and HyDE (Hypothetical Document Embedding). This turns compaction from a lossy compression step into an archival opportunity.

Neural network visualization representing AI memory indexing and retrieval
Fastio features

Persist Agent Memory Across Sessions Without Managing Infrastructure

Fast.io gives your agents a shared workspace where compacted memory, session logs, and project files live side by side. Intelligence Mode auto-indexes everything for semantic search. 50 GB free, no credit card required.

Failure-Driven Optimization with ACON

Anchored summarization and provider APIs both rely on human intuition about what information matters. ACON (Agent Context Optimization) takes a different approach: it learns what to preserve by analyzing where compression causes failures.

The ACON framework, published by researchers at KAIST and other institutions, compresses both environment observations and interaction histories. What makes it distinct is the optimization loop. ACON runs tasks twice: once with full context and once with compressed context. When the compressed version fails but the full version succeeds, an LLM analyzes the failure and updates the compression guidelines to preserve whatever was missing.

In benchmarks across AppWorld, OfficeBench, and multi-objective QA tasks, ACON reduced peak token usage by 26 to 54% while preserving task performance. When the optimized compressor was distilled into smaller models, it retained over 95% of accuracy. For smaller language models, the improvement was even more dramatic, with up to 46% performance gains on long-horizon tasks.

The practical takeaway is that compression guidelines should be treated as living documents, not static prompts. If your agent fails after compaction, the failure itself contains information about what the compressor should have preserved. You can implement a lightweight version of this pattern without the full ACON framework:

  1. Log every case where an agent produces an error or unexpected result after a compaction event
  2. Have a separate LLM compare the pre-compaction and post-compaction context for that failure
  3. Identify what information was lost that would have prevented the failure
  4. Update your summarization prompt or anchored summary structure to explicitly preserve that category of information

Over time, this builds a compression policy tuned to your specific agent's failure modes rather than generic "what seems important" heuristics.

Graph-Enhanced Memory and Hybrid Retrieval

Vector databases excel at finding semantically similar past exchanges, but they treat each memory as an independent point. When an agent needs to reason about relationships between facts (this user prefers Python, they switched from JavaScript last month because of a performance issue on the analytics dashboard), flat vector retrieval falls short. Graph memory preserves those connections.

The distinction matters for compaction because graph structures let you prune aggressively while keeping relationship chains intact. You can remove the raw conversation about the JavaScript-to-Python migration while preserving the preference node, the reason node, and the edge connecting them. Vector-only retrieval might surface "user likes Python" but lose the context of why, which the agent may need later.

GraphRAG approaches show measurable improvements. Benchmarks from early 2026 indicate up to 35% precision improvement over vector-only retrieval for queries requiring multi-hop reasoning. The hybrid pattern that has gained traction combines vector search as the semantic entry point (finding the right neighborhood of memories) with graph traversal for relational depth (following connections between those memories).

Mem0 provides a production-ready implementation of this pattern. Their documentation recommends enabling graph memory when your use case involves complex entity relationships: medical patient contexts, enterprise account hierarchies, or technical system interdependencies. For simpler personalization (user preferences without deep relational context), vector-only retrieval performs adequately with lower latency.

Letta (formerly MemGPT) approaches the problem from an operating systems perspective, with three memory tiers inspired by computer architecture. Core Memory lives in the context window like RAM. Recall Memory stores searchable conversation history outside context like a disk cache. Archival Memory provides long-term storage the agent queries via tool calls, like cold storage. The agent manages its own memory through explicit read and write operations, deciding what to promote to core memory and what to archive.

ReMe, released by AgentScope in April 2026, offers another take: three complementary memory types (personal, task, and tool memory) plus a working memory buffer that keeps recent reasoning compact and accessible. When messages are compressed, they are automatically persisted to dated JSONL files, creating an audit trail that the agent or a human can review later.

For most production deployments, the recommended architecture is a hybrid: vector search for semantic retrieval, graph storage for relationship-heavy domains, and a structured working memory buffer for the current session. The compaction strategy then becomes: compress the working buffer periodically, archive raw conversations, and let the graph and vector indices serve as the agent's long-term memory.

Audit log interface showing connected AI agent actions and memory states

How to Store Compacted Memory in a Shared Workspace

Compaction strategies produce summaries, extracted facts, and archived conversations. All of that data needs to live somewhere that survives session restarts, is accessible to both agents and humans, and does not require managing a separate database cluster.

Local file storage works for single-developer prototyping but breaks down in team environments. An agent's compacted memory sitting in /tmp/agent_memory/ on one machine is invisible to teammates and lost on the next reboot. Standalone vector databases like Pinecone or Weaviate handle the retrieval side well but isolate agent knowledge from the rest of the team's workspace. When an agent stores its compressed context in a siloed database, human team members cannot easily view, audit, or correct that memory.

What production agents need is a shared workspace where compacted memory lives alongside the project files, documents, and discussions that humans already use. This shared context ensures agents and humans operate from the same source of truth.

Fast.io workspaces handle this by design. When you store compaction artifacts (summary files, extracted facts, session logs) in a Fast.io workspace, they are automatically indexed by Intelligence Mode for semantic search and RAG chat. An agent can write its compressed session state to a workspace file, and a human team member can ask Ripley AI "what decisions did the agent make yesterday?" and get a citation-backed answer.

The MCP server at mcp.fast.io gives agents direct read and write access to workspaces through Streamable HTTP. An agent compacting its context can write the anchored summary to a workspace file, archive the raw conversation as a JSONL log, and update a task tracking its progress, all through MCP tool calls without local filesystem dependencies.

File locks prevent conflicts when multiple agents share a workspace. If two agents are compacting simultaneously and writing to the same summary file, locks ensure one completes before the other starts. Versioning keeps every revision of the compacted state, so you can trace exactly how the agent's memory evolved over the session.

The ownership transfer model also fits well here. An agent builds a workspace with its compaction logs, session artifacts, and final outputs. When the work is complete, it transfers the organization to a human who inherits the full audit trail. The agent retains admin access for future sessions, and the human gets a complete record of what the agent knew and when.

S3 or Google Cloud Storage work too, particularly for high-volume archival of raw conversation logs. But they lack the intelligence layer (automatic indexing, semantic search, RAG) and the collaboration primitives (comments, tasks, approvals) that make agent memory inspectable by humans. For teams building agents that need to hand off work to people, a workspace-native approach reduces friction.

The free agent plan includes 50 GB of storage, 5,000 credits per month, and 5 workspaces with no credit card required, which is enough runway for most compaction workflows. Credits cover AI indexing and search operations, so your agent's compacted memory becomes queryable the moment it is written.

Frequently Asked Questions

What is memory compaction for LLM agents?

Memory compaction is the process of summarizing, pruning, and compressing an AI agent's conversation history to maintain reasoning quality within context window limits. Rather than feeding the entire conversation into each request, compaction distills earlier turns into a compact representation that preserves key decisions, facts, and state while discarding redundant information.

How do you prevent AI agents from losing context?

Use anchored iterative summarization to maintain a structured, persistent summary that gets incrementally updated rather than regenerated from scratch. Combine this with a provider-native compaction API (Anthropic's compact-2026-01-12 or OpenAI's Agents SDK sessions) for automatic trigger management. For relationship-heavy contexts, add graph memory to preserve connections between facts. Store compaction artifacts in a shared workspace like Fast.io so both agents and humans can inspect the compressed state.

How does Anthropic's compaction API work?

Anthropic's compaction API (beta flag compact-2026-01-12) monitors input token count and automatically summarizes earlier conversation when it exceeds a configurable threshold (minimum 50,000 tokens). The summary is stored in a compaction block, and subsequent requests continue from the compressed state. You can customize what gets preserved by passing an instructions field to the API. It supports Claude Opus 4.6 and Sonnet 4.6 across the direct API, AWS Bedrock, Google Vertex AI, and Microsoft Foundry.

What causes context drift in AI agents?

Context drift happens when an agent's attention spreads thin across a growing conversation, diluting critical early decisions and facts. As token count increases, the model gives less weight to information near the beginning of the context. This causes the agent to contradict earlier decisions, forget established constraints, or repeat questions already answered. Roughly 65% of enterprise AI failures in 2025 were attributed to context drift rather than raw context window exhaustion.

What is the difference between graph memory and vector memory for agents?

Vector memory retrieves facts that are semantically similar to the current query but treats each memory independently. Graph memory preserves relationships between facts, so the agent can reason about how information connects. Vector search answers 'what is like this?' while graph traversal answers 'what is connected to this?' Production agents typically use both: vector search as the semantic entry point to find relevant neighborhoods, and graph traversal for relational depth.

How does ACON improve context compression?

ACON (Agent Context Optimization) learns what to preserve by analyzing failures. It runs tasks with both full and compressed context, identifies cases where compression caused failures, and uses an LLM to update compression guidelines based on what information was missing. In benchmarks, ACON reduced peak token usage by 26 to 54% while preserving over 95% of task accuracy when distilled into smaller models.

Related Resources

Fastio features

Persist Agent Memory Across Sessions Without Managing Infrastructure

Fast.io gives your agents a shared workspace where compacted memory, session logs, and project files live side by side. Intelligence Mode auto-indexes everything for semantic search. 50 GB free, no credit card required.