EAAPLEnterprise AI Architecture Pattern Library
EAAPLLibraryAgentic Workflows
Mature
⇄ Compare

Context Compression

[EAAPL-WRK007] Context Compression

Category: Agentic Workflows Sub-category: Memory Management Architecture Version: 1.0 Maturity: Emerging Tags: context-window, summarisation, context-compression, rolling-window, memory-management Regulatory Relevance: ISO 42001 §8.4, APRA CPS 230


1. Executive Summary

The Context Compression Pattern defines techniques for managing an agent's context window during long-running workflows: rolling summarisation of earlier history, selective retention of high-relevance content, episodic retrieval of previously compressed material, and hierarchical memory tiers. Without active context management, long-running agentic workflows eventually exceed the model's context window, causing either hard failures or silent truncation of critical earlier context. This pattern prevents context overflow while preserving the reasoning continuity needed for high-quality multi-step task completion.

For CIO/CTO audiences: every AI model has a limit on how much information it can hold in working memory at one time (the context window). For short tasks this is not a problem. For long-running workflows — a multi-day research project, a complex code generation task, an extended client engagement — the working memory fills up and important earlier context gets forgotten. This pattern is the equivalent of a skilled professional's note-taking discipline: compressing older details into summaries, retaining only the most relevant specifics, and retrieving historical detail on demand. It enables AI agents to work effectively on tasks that span thousands of steps without losing sight of earlier work.


2. Problem Statement

Business Problem

Enterprise long-running tasks — regulatory analysis across hundreds of documents, code review of large codebases, extended document drafting sessions — generate more intermediate context than any current LLM context window can hold. Without context management, these tasks either fail with context overflow errors or produce degraded outputs because critical earlier findings have been silently truncated from the model's working memory.

Technical Problem

LLM context windows are finite (ranging from 8K to 1M tokens across current models). Long-running agentic workflows accumulate thought-action-observation records that grow without bound. Naive strategies (truncate oldest content, always truncate longest observations) destroy reasoning continuity and produce confusion in later reasoning steps that reference context that has been silently removed.

Symptoms of Absence

  • Agent loses track of earlier findings, re-asks questions already answered, or contradicts its own earlier conclusions
  • Hard context overflow errors terminate workflows mid-execution
  • Observations are silently truncated; agent reasons about incomplete tool results without knowing they are incomplete
  • No mechanism to retrieve earlier reasoning context on demand

Cost of Inaction

  • Reliability: Context overflow errors terminate tasks, wasting all prior computation
  • Quality: Silent truncation produces outputs that contradict or ignore earlier findings
  • Trust: Users discover that the agent has "forgotten" key earlier context, destroying confidence

3. Context

When to Apply

  • Tasks generate context that may approach or exceed the model's context window budget
  • Long-running agents with many tool calls (>15 iterations)
  • Tasks spanning multiple sessions (inter-session context preservation)
  • Tasks requiring recall of information from early stages when synthesising final outputs

When NOT to Apply

  • Short tasks with predictably small context footprints
  • Tasks where complete context history must be preserved verbatim (use an explicit log, not the context window)
  • Tasks where summarisation would destroy required precision (highly technical numerical content)

Prerequisites

  • EAAPL-AGT002 (Stateful Agent Memory) for persistent episodic storage
  • Token counting utility for real-time context budget monitoring
  • Summarisation prompt templates per content type (tool observations, reasoning steps, user turns)
  • Context budget thresholds for triggering compression

Industry Applicability

Industry Long-Running Task Context Challenge
Legal Multi-week document review project Thousands of document excerpts accumulate
Financial Services Complex financial model analysis Large data tables plus reasoning accumulate
Technology Large codebase analysis and refactoring File contents plus analysis notes accumulate
Healthcare Extended clinical case analysis Lab results, notes, guidelines accumulate
Government Multi-agency policy development Submissions, analysis, revision history accumulate

4. Architecture Overview

Context compression is implemented as a continuous monitoring and management layer that wraps the agent's context window, operating proactively before overflow occurs.

Token Budget Monitor The token budget monitor tracks the current context window usage in real time, counting tokens after every addition (thought, observation, user message). It emits threshold events at configurable percentages of the context window capacity: a warning threshold (e.g., 70%) triggers proactive compression, a critical threshold (e.g., 85%) triggers aggressive compression.

Rolling Summarisation When the warning threshold is crossed, the Rolling Summariser identifies the oldest N thought-action-observation blocks in the scratchpad and replaces them with a structured summary. The summary preserves: key findings (entities, decisions, conclusions), pending actions, and unresolved questions. The raw blocks are archived to episodic memory (EAAPL-AGT002) before being removed from the active context. The summary explicitly marks itself as a compression artefact so that the agent's reasoning knows it is working with summarised rather than raw context.

Selective Retention Not all context has equal relevance. The Relevance Scorer evaluates each context block against the current task objective and assigns a relevance score. High-relevance blocks (directly referenced by the current reasoning step) are retained in full. Medium-relevance blocks are summarised. Low-relevance blocks are archived to episodic memory. Relevance scoring is performed by a lightweight LLM call or by embedding similarity to the current task objective.

Episodic Retrieval When the agent's reasoning references content that has been compressed out of the active context ("Earlier I noted that clause 12.3 had a specific obligation — what was it exactly?"), the Episodic Retriever performs a semantic search against the episodic memory store (EAAPL-AGT002) and injects the retrieved content back into context as a bounded retrieval observation. This on-demand retrieval prevents the need to retain all historical context proactively.

Hierarchical Memory Tiers Context is managed across three tiers: (1) Active Context — full fidelity, in-window; (2) Summary Buffer — compressed representations of recent history, in-window; (3) Episodic Archive — full-fidelity earlier content, out-of-window but retrievable. The size allocated to each tier is configurable per task type.


5. Architecture Diagram

ARCHITECTURE DIAGRAM
flowchart TD subgraph ActiveContext["Active Context Window"] A[Active Blocks] B[Summary Buffer] end subgraph Monitor["Context Budget Monitor"] C{Token Budget Threshold?} D[Warning 70%] E[Critical 85%] end subgraph Compression["Compression Engine"] F[Relevance Scorer] G[Rolling Summariser] H[Archive to Episodic] end subgraph Retrieval["Episodic Retrieval"] I[Retrieval Trigger] J[Semantic Search] K[Inject Retrieved Content] end subgraph Storage["Memory Tiers"] L[(Active Context)] M[(Episodic Archive)] end A --> C C -->|below threshold| A C -->|warning| D C -->|critical| E D --> F E --> F F --> G G --> H H --> M G --> B A --> I I --> J J --> M J --> K K --> A

6. Components

Component Type Responsibility Technology Options Criticality
Token Budget Monitor Logic Component Real-time token counting; threshold event emission tiktoken (OpenAI); Anthropic token counter; custom Critical
Relevance Scorer AI/Logic Component Scores context blocks against current task objective Embedding similarity (fast); LLM scoring (accurate) High
Rolling Summariser AI Component Generates structured summaries of oldest context blocks GPT-4o-mini, Claude 3 Haiku (cost-optimised for summarisation) Critical
Episodic Archiver Integration Writes full-fidelity blocks to episodic memory before removal EAAPL-AGT002 write API Critical
Episodic Retriever Integration Semantic search against episodic archive; injects retrieved content EAAPL-AGT002 retrieval API; pgvector; Pinecone High
Summary Buffer Manager State Manages the summary buffer section of active context; limits buffer size Custom context management layer High
Compression Audit Logger Governance Records every compression event: what was summarised, what was archived PostgreSQL; append-only log Medium

7. Data Flow

Step Actor Action Output
1 Agent Adds Thought-Action-Observation block 47 to context Context: 71,200 tokens (71.2% of 100K window)
2 Token Budget Monitor 71.2% > warning threshold (70%); trigger compression Compression event fired
3 Relevance Scorer Scores blocks 1–30 against current task objective Blocks 1–10: low relevance; blocks 11–20: medium; blocks 21–30: high
4 Rolling Summariser Summarises blocks 1–10 into structured summary (1,200 tokens) Summary: {key_findings: [...], decisions: [...], pending: [...]}
5 Episodic Archiver Archives raw blocks 1–10 to episodic store 10 blocks stored with embeddings
6 Context Manager Replaces blocks 1–10 with summary in active context Context reduced to 62,000 tokens (62%)
7 Agent (later) Thought references: "In my earlier analysis of document section 4.2..." Retrieval trigger detected
8 Episodic Retriever Semantic search: "document section 4.2 analysis" → finds archived block 7 Retrieved block 7
9 Context Manager Injects retrieved block as bounded observation: [RETRIEVED CONTEXT: Block 7] Active context updated

Error Flow

Error Detection Recovery
Summarisation produces low-quality summary Summary quality check (length, structure) Retry summarisation with more explicit instructions; archive-only as fallback
Episodic archive write failure Archive API error Keep block in active context; alert; retry on next compression cycle
Retrieval returns irrelevant content Relevance threshold on retrieval results Return no-result observation; let agent reason without retrieved context
Context hits critical threshold before compression completes Token count exceeds critical threshold Emergency truncation of oldest low-relevance blocks; immediate archival

8. Security Considerations

Sensitive Content in Episodic Archive

  • The episodic archive persists sensitive intermediate content (PII, commercial-in-confidence data) that may have shorter retention requirements than the task output
  • Mitigation: Apply same retention and access controls to episodic archive as to task audit records; support time-limited archival with automatic deletion

OWASP LLM Top 10

OWASP LLM Risk Context Compression Applicability Mitigation
LLM01 Prompt Injection Archived content retrieved and re-injected may contain injection attempts Sanitise retrieved content before re-injection; apply injection detection on retrieval
LLM06 Sensitive Information Episodic archive accumulates all task context including sensitive data Encrypt episodic archive; apply per-block retention policies; PII detection before archival
LLM09 Overreliance Agent may rely on compressed summaries that omit nuanced details Summaries clearly marked as compressed; agent can trigger retrieval for full fidelity

9. Governance Considerations

Compression Audit Trail

  • Every compression event must be logged: which blocks were compressed, what summary was produced, and what was archived. This ensures that the full task context can be reconstructed for audit purposes even though it was not present in the active context at every point.

Governance Artefacts

Artefact Owner Frequency Purpose
Compression Event Log AI Platform Per event; retained with task audit Documents every compression operation for context reconstruction
Episodic Archive Compliance Per task; retained per policy Full-fidelity earlier context retrievable for audit
Summarisation Quality Benchmark ML Engineering Monthly Validates that summaries preserve key findings; detects quality degradation
Compression Threshold Policy AI Governance Board Quarterly Documents warning/critical thresholds per task type

10. Operational Considerations

SLOs

SLO Target Window Alert
Context overflow rate (hard context window exceeded) 0% 24-hour rolling Any overflow triggers P1; compression triggering too late
Compression-induced quality degradation rate ≤ 2% of tasks Weekly eval > 5% triggers P2; review summarisation quality
Episodic retrieval precision (relevant content retrieved) ≥ 85% Weekly eval < 75% triggers P3; review retrieval embedding quality
Compression overhead latency ≤ 3s per compression event 1-hour rolling > 8s triggers P3

Monitoring

  • Context budget usage trending per task type: approaching ceiling early indicates need for earlier compression or higher budget threshold
  • Summary to raw token ratio: tracks compression efficiency
  • Retrieval usage rate: high retrieval rate indicates compression is too aggressive (too much important content being compressed)

11. Cost Considerations

Cost Factor Impact Mitigation
Summarisation LLM calls Low–Medium (small model, short outputs) Use GPT-4o-mini or Claude 3 Haiku for summarisation; never the primary reasoning model
Episodic retrieval embedding calls Low Batch embedding calls; cache embeddings for stable content
Extended context window (avoid compression) Very High Compression is far cheaper than doubling the context window tier
Compression preventing task failure Very High positive Prevents waste of all prior computation

Indicative Cost Comparison

Approach Cost per 100-iteration task Reliability
No compression (overflow at iteration 50) $0.50 then failure Poor
Large context window (1M tokens) $5.00–20.00 (premium model pricing) High
Context compression (proactive) $0.80–1.20 (base + summarisation overhead) High

12. Trade-Off Analysis

Option Context Continuity Cost Complexity Reliability Best For
A: Rolling summarisation + episodic retrieval (Recommended) High Low–Medium Medium Very High Long-running production agents
B: Fixed rolling window (keep last N blocks) Low Very Low Very Low Medium Short tasks where earlier context is less important
C: Large context window (no compression) Very High Very High Very Low High Tasks requiring full fidelity; budget available
D: Hierarchical summarisation (multi-level) Very High Medium High Very High Very long tasks (hundreds of iterations)

Architectural Tensions

Tension Left Pole Right Pole Balance
Compression aggressiveness Aggressive (low context usage, low fidelity) Conservative (high fidelity, context overflow risk) Compress at 70% warning; aggressive at 85%
Summary fidelity Short, lossy summary Long, detailed summary (uses more context budget) Summary ≤ 20% of source length; structured format
Retrieval frequency Retrieve-on-demand (efficient) Proactive re-injection of all archived content On-demand only; triggered by agent reasoning references

13. Failure Modes

Failure Mode Likelihood Impact Detection Recovery
Summary drops critical finding Medium High — agent loses important earlier information Retrieval usage spikes; agent re-asks answered questions Structured summary format enforces key-finding extraction; retrieval provides fallback
Episodic archive grows unbounded Low–Medium Medium — storage and retrieval cost Archive size monitoring Retention policy; tiered archive (warm/cold)
Retrieval injects irrelevant content Medium Medium — context pollution Retrieval precision monitoring Relevance threshold on retrieval; low-confidence retrievals discarded
Compression loop (compression triggers faster than it reduces context) Low High — agent stuck compressing Compression event frequency alert Emergency mode: archive all low-relevance blocks; force 50% context reduction
PII retained in episodic archive beyond retention period Low High — compliance violation Automated PII scan on archive Time-limited archival with automatic deletion; PII masking before archival

14. Regulatory Considerations

ISO 42001

  • §8.4: Compression decisions affect the AI system's operational quality; compression algorithms must be documented and tested against quality benchmarks.

APRA CPS 230

  • For agents operating in material business processes, the full reasoning context (including archived episodic records) must be retainable for operational resilience investigations.

Australian Context

  • Privacy Act 1988: Content archived to episodic storage inherits the privacy sensitivity of the source task; retention and deletion policies must match the organisation's records management framework.

15. Reference Implementations

AWS

Component Service
Token Counter Lambda function with tiktoken / boto3 tokenizer
Summarisation Amazon Bedrock (Claude 3 Haiku — cost-optimised)
Episodic Archive Amazon OpenSearch Serverless (vector search for retrieval)
Summary Buffer DynamoDB per-task context state
Monitoring CloudWatch custom metrics for context budget

Azure

Component Service
Summarisation Azure OpenAI (GPT-4o-mini)
Episodic Archive Azure AI Search (vector index)
Context State Azure Cosmos DB
Monitoring Azure Monitor with custom context budget metrics

On-Premises

Component Technology
Summarisation vLLM (Llama 3.1 8B for cost-efficient summarisation)
Episodic Archive pgvector (PostgreSQL vector extension)
Context Management LangChain ConversationSummaryBufferMemory; custom

Pattern ID Relationship Type Notes
Stateful Agent Memory EAAPL-AGT002 Depends On Episodic archive is implemented on top of the stateful agent memory store
ReAct Agent Loop EAAPL-WRK001 Integrates With Context compression wraps the ReAct scratchpad; manages its context budget
Long-Running Agent EAAPL-AGT007 Integrates With Long-running agents are the primary consumers of context compression
Workflow Tracing and Replay EAAPL-WRK013 Integrates With Compression event log is part of the workflow trace

17. Maturity Assessment

Overall Maturity: Emerging

Dimension Score (1–5) Evidence
Research Foundation 4 MemGPT, Compressor-Retriever, LongMem papers provide foundation
Production Deployment 3 Deployed in long-running research and code-generation agents; general enterprise tooling maturing
Framework Support 3 LangChain SummaryBufferMemory; MemGPT; custom implementations common
Summarisation Quality 3 Summarisation quality for technical content improving; not yet standardised
Retrieval Integration 3 Retrieval-augmented context injection maturing; precision benchmarks lacking

18. Revision History

Version Date Author Changes
1.0 2025-06-13 Architecture Board Initial publication in Agentic Workflows category
← Back to LibraryMore Agentic Workflows