[EAAPL-WRK007] Context Compression
Category: Agentic Workflows
Sub-category: Memory Management Architecture
Version: 1.0
Maturity: Emerging
Tags: context-window, summarisation, context-compression, rolling-window, memory-management
Regulatory Relevance: ISO 42001 §8.4, APRA CPS 230
1. Executive Summary
The Context Compression Pattern defines techniques for managing an agent's context window during long-running workflows: rolling summarisation of earlier history, selective retention of high-relevance content, episodic retrieval of previously compressed material, and hierarchical memory tiers. Without active context management, long-running agentic workflows eventually exceed the model's context window, causing either hard failures or silent truncation of critical earlier context. This pattern prevents context overflow while preserving the reasoning continuity needed for high-quality multi-step task completion.
For CIO/CTO audiences: every AI model has a limit on how much information it can hold in working memory at one time (the context window). For short tasks this is not a problem. For long-running workflows — a multi-day research project, a complex code generation task, an extended client engagement — the working memory fills up and important earlier context gets forgotten. This pattern is the equivalent of a skilled professional's note-taking discipline: compressing older details into summaries, retaining only the most relevant specifics, and retrieving historical detail on demand. It enables AI agents to work effectively on tasks that span thousands of steps without losing sight of earlier work.
2. Problem Statement
Business Problem
Enterprise long-running tasks (regulatory analysis across hundreds of documents, code review of large codebases, extended document drafting sessions) can generate more intermediate context than the model's context window can hold. Without context management, these tasks either fail with context overflow errors or produce degraded outputs because critical earlier findings have been silently truncated from the model's working memory.
Technical Problem
LLM context windows are finite (ranging from roughly 8K to 1M tokens across current models). Long-running agentic workflows accumulate thought-action-observation records that grow without bound. Naive strategies (truncating the oldest content, or always truncating the longest observations) destroy reasoning continuity: later reasoning steps refer to context that has been silently removed, and the agent has no signal that anything is missing.
Symptoms of Absence
- Agent loses track of earlier findings, re-asks questions already answered, or contradicts its own earlier conclusions
- Hard context overflow errors terminate workflows mid-execution
- Observations are silently truncated; agent reasons about incomplete tool results without knowing they are incomplete
- No mechanism to retrieve earlier reasoning context on demand
Cost of Inaction
- Reliability: Context overflow errors terminate tasks, wasting all prior computation
- Quality: Silent truncation produces outputs that contradict or ignore earlier findings
- Trust: Users discover that the agent has "forgotten" key earlier context, destroying confidence
3. Context
When to Apply
- Tasks generate context that may approach or exceed the model's context window budget
- Long-running agents with many tool calls (>15 iterations)
- Tasks spanning multiple sessions (inter-session context preservation)
- Tasks requiring recall of information from early stages when synthesising final outputs
When NOT to Apply
- Short tasks with predictably small context footprints
- Tasks where complete context history must be preserved verbatim (use an explicit log, not the context window)
- Tasks where summarisation would destroy required precision (highly technical numerical content)
Prerequisites
- EAAPL-AGT002 (Stateful Agent Memory) for persistent episodic storage
- Token counting utility for real-time context budget monitoring
- Summarisation prompt templates per content type (tool observations, reasoning steps, user turns)
- Context budget thresholds for triggering compression
Industry Applicability
| Industry | Long-Running Task | Context Challenge |
|---|---|---|
| Legal | Multi-week document review project | Thousands of document excerpts accumulate |
| Financial Services | Complex financial model analysis | Large data tables plus reasoning accumulate |
| Technology | Large codebase analysis and refactoring | File contents plus analysis notes accumulate |
| Healthcare | Extended clinical case analysis | Lab results, notes, guidelines accumulate |
| Government | Multi-agency policy development | Submissions, analysis, revision history accumulate |
4. Architecture Overview
Context compression is implemented as a continuous monitoring and management layer that wraps the agent's context window, operating proactively before overflow occurs.
Token Budget Monitor
The token budget monitor tracks the current context window usage in real time, counting tokens after every addition (thought, observation, user message). It emits threshold events at configurable percentages of the context window capacity: a warning threshold (e.g., 70%) triggers proactive compression, a critical threshold (e.g., 85%) triggers aggressive compression.
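A minimal sketch of such a monitor follows, assuming a 70% warning and 85% critical threshold as in the examples above. The `count_tokens` helper is a placeholder that counts whitespace-separated words; it is an assumption for illustration, and a production monitor would use the tokenizer of the model actually in use. The class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


def count_tokens(text: str) -> int:
    """Placeholder tokenizer (assumption): one token per whitespace-separated word."""
    return len(text.split())


@dataclass
class TokenBudgetMonitor:
    window_size: int                 # model context window, in tokens
    warning_ratio: float = 0.70      # proactive compression threshold
    critical_ratio: float = 0.85     # aggressive compression threshold
    counter: Callable[[str], int] = count_tokens
    used: int = 0

    def add(self, text: str) -> Optional[str]:
        """Record an addition and return 'warning', 'critical', or None."""
        self.used += self.counter(text)
        ratio = self.used / self.window_size
        if ratio >= self.critical_ratio:
            return "critical"
        if ratio >= self.warning_ratio:
            return "warning"
        return None


# Toy window of 100 tokens: 60 used, then 71, then 86.
monitor = TokenBudgetMonitor(window_size=100)
events = [
    monitor.add("tok " * 60),
    monitor.add("tok " * 11),
    monitor.add("tok " * 15),
]
print(events)
```

The monitor only reports threshold events; deciding what to compress is left to the components described in the rest of this section.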
Rolling Summarisation
When the warning threshold is crossed, the Rolling Summariser identifies the oldest N thought-action-observation blocks in the scratchpad and replaces them with a structured summary. The summary preserves key findings (entities, decisions, conclusions), pending actions, and unresolved questions. The raw blocks are archived to episodic memory (EAAPL-AGT002) before being removed from the active context. The summary explicitly marks itself as a compression artefact, so the agent knows it is reasoning over summarised rather than raw context.
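The archive-then-replace ordering can be sketched as below. The block shape, the `[COMPRESSED SUMMARY]` marker text, and the `fake_summarise` stand-in are assumptions for illustration; in practice `summarise` would call the summarisation model and `archive` would call the EAAPL-AGT002 write API.

```python
from typing import Callable, Dict, List

Block = Dict[str, str]  # hypothetical shape: {"id": ..., "text": ...}


def compress_oldest(
    context: List[Block],
    n: int,
    summarise: Callable[[List[Block]], str],
    archive: Callable[[Block], None],
) -> List[Block]:
    """Return a new context with the oldest n blocks replaced by one marked summary."""
    oldest, rest = context[:n], context[n:]
    if not oldest:
        return context
    for block in oldest:
        # Archive before removal; if this raises, the caller's context is unchanged.
        archive(block)
    summary = {
        "id": f"summary:{oldest[0]['id']}-{oldest[-1]['id']}",
        "text": "[COMPRESSED SUMMARY] " + summarise(oldest),
    }
    return [summary] + rest


# Stand-ins for the summarisation model and the episodic store (assumptions).
episodic_store: List[Block] = []


def fake_summarise(blocks: List[Block]) -> str:
    return f"{len(blocks)} earlier blocks condensed"


context = [{"id": str(i), "text": f"observation {i}"} for i in range(1, 6)]
context = compress_oldest(context, 3, fake_summarise, episodic_store.append)
print([b["id"] for b in context])
```

Returning a new list rather than mutating in place means a failed archive write never leaves the active context half-compressed.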
Selective Retention
Not all context has equal relevance. The Relevance Scorer evaluates each context block against the current task objective and assigns a relevance score. High-relevance blocks (directly referenced by the current reasoning step) are retained in full. Medium-relevance blocks are summarised. Low-relevance blocks are archived to episodic memory. Relevance scoring is performed by a lightweight LLM call or by embedding similarity to the current task objective.
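The three-way split can be illustrated with a small sketch. The bag-of-words `embed` function is a toy stand-in for a real embedding model, and the 0.5 and 0.2 cut-offs are arbitrary assumptions chosen for the example, not recommended values.

```python
import math
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (assumption); a real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def triage(blocks: List[str], objective: str,
           high: float = 0.5, low: float = 0.2) -> Dict[str, List[str]]:
    """Partition blocks into retain / summarise / archive by similarity to the objective."""
    target = embed(objective)
    plan: Dict[str, List[str]] = {"retain": [], "summarise": [], "archive": []}
    for block in blocks:
        score = cosine(embed(block), target)
        if score >= high:
            plan["retain"].append(block)
        elif score >= low:
            plan["summarise"].append(block)
        else:
            plan["archive"].append(block)
    return plan


plan = triage(
    ["termination clause obligations", "clause numbering style", "weather report"],
    objective="termination clause obligations",
)
print(plan)
```

Here the first block matches the objective exactly and is retained, the second shares one term and is summarised, and the third shares none and is archived.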
Episodic Retrieval
When the agent's reasoning references content that has been compressed out of the active context ("Earlier I noted that clause 12.3 had a specific obligation. What was it exactly?"), the Episodic Retriever performs a semantic search against the episodic memory store (EAAPL-AGT002) and injects the retrieved content back into context as a bounded retrieval observation. On-demand retrieval removes the need to keep all historical context in the window proactively.
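A hedged sketch of the retrieval step follows. Jaccard word overlap stands in for embedding similarity, and the `min_score` and `max_chars` parameters are hypothetical; the point is the shape of the flow: search, apply a relevance threshold, and inject a size-bounded, clearly labelled observation.

```python
from typing import Dict, Optional


def overlap(a: str, b: str) -> float:
    """Jaccard word overlap, a stand-in for embedding similarity (assumption)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def retrieve(query: str, archive: Dict[str, str],
             min_score: float = 0.3, max_chars: int = 200) -> str:
    """Return a bounded retrieval observation, or a no-result observation."""
    best_id: Optional[str] = None
    best_score = 0.0
    for block_id, text in archive.items():
        score = overlap(query, text)
        if score > best_score:
            best_id, best_score = block_id, score
    if best_id is None or best_score < min_score:
        return "[RETRIEVED CONTEXT: none found]"
    snippet = archive[best_id][:max_chars]  # bound the injected size
    return f"[RETRIEVED CONTEXT: Block {best_id}] {snippet}"


archive = {
    "7": "analysis of document section 4.2 indemnity cap",
    "9": "summary of board minutes",
}
hit = retrieve("document section 4.2 analysis", archive)
miss = retrieve("quarterly revenue forecast", archive)
print(hit)
print(miss)
```

Returning an explicit no-result observation, rather than the best weak match, keeps low-confidence content out of the active context.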
Hierarchical Memory Tiers
Context is managed across three tiers: (1) Active Context — full fidelity, in-window; (2) Summary Buffer — compressed representations of recent history, in-window; (3) Episodic Archive — full-fidelity earlier content, out-of-window but retrievable. The size allocated to each tier is configurable per task type.
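Per-task-type tier sizing could be expressed as a small configuration object. The sketch below is illustrative only: the `TierBudget` class, the task type names, and the share values are assumptions, not prescribed settings. Only the two in-window tiers receive a share, because the episodic archive sits outside the context window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierBudget:
    """In-window token allocation for one task type (hypothetical shape)."""
    window_size: int
    active_share: float    # tier 1: full-fidelity active context
    summary_share: float   # tier 2: summary buffer

    def __post_init__(self) -> None:
        if self.active_share + self.summary_share > 1.0:
            raise ValueError("in-window tiers exceed the context window")

    @property
    def active_tokens(self) -> int:
        return round(self.window_size * self.active_share)

    @property
    def summary_tokens(self) -> int:
        return round(self.window_size * self.summary_share)


# Illustrative values only; the unallocated remainder is headroom for new additions.
TIER_BUDGETS = {
    "document_review": TierBudget(100_000, active_share=0.60, summary_share=0.25),
    "code_refactoring": TierBudget(100_000, active_share=0.75, summary_share=0.10),
}

print(TIER_BUDGETS["document_review"].active_tokens)
print(TIER_BUDGETS["document_review"].summary_tokens)
```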
5. Architecture Diagram
flowchart TD
    subgraph ActiveContext["Active Context Window"]
        A[Active Blocks]
        B[Summary Buffer]
    end
    subgraph Monitor["Context Budget Monitor"]
        C{Token Budget Threshold?}
        D[Warning 70%]
        E[Critical 85%]
    end
    subgraph Compression["Compression Engine"]
        F[Relevance Scorer]
        G[Rolling Summariser]
        H[Archive to Episodic]
    end
    subgraph Retrieval["Episodic Retrieval"]
        I[Retrieval Trigger]
        J[Semantic Search]
        K[Inject Retrieved Content]
    end
    subgraph Storage["Memory Tiers"]
        L[(Active Context)]
        M[(Episodic Archive)]
    end
    A --> C
    C -->|below threshold| A
    C -->|warning| D
    C -->|critical| E
    D --> F
    E --> F
    F --> G
    G --> H
    H --> M
    G --> B
    A --> I
    I --> J
    J --> M
    J --> K
    K --> A
6. Components
| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| Token Budget Monitor | Logic Component | Real-time token counting; threshold event emission | tiktoken (OpenAI); Anthropic token counter; custom | Critical |
| Relevance Scorer | AI/Logic Component | Scores context blocks against current task objective | Embedding similarity (fast); LLM scoring (accurate) | High |
| Rolling Summariser | AI Component | Generates structured summaries of oldest context blocks | GPT-4o-mini, Claude 3 Haiku (cost-optimised for summarisation) | Critical |
| Episodic Archiver | Integration | Writes full-fidelity blocks to episodic memory before removal | EAAPL-AGT002 write API | Critical |
| Episodic Retriever | Integration | Semantic search against episodic archive; injects retrieved content | EAAPL-AGT002 retrieval API; pgvector; Pinecone | High |
| Summary Buffer Manager | State | Manages the summary buffer section of active context; limits buffer size | Custom context management layer | High |
| Compression Audit Logger | Governance | Records every compression event: what was summarised, what was archived | PostgreSQL; append-only log | Medium |
7. Data Flow
| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | Agent | Adds Thought-Action-Observation block 47 to context | Context: 71,200 tokens (71.2% of 100K window) |
| 2 | Token Budget Monitor | 71.2% > warning threshold (70%); trigger compression | Compression event fired |
| 3 | Relevance Scorer | Scores blocks 1–30 against current task objective | Blocks 1–10: low relevance; blocks 11–20: medium; blocks 21–30: high |
| 4 | Rolling Summariser | Summarises blocks 1–10 into structured summary (1,200 tokens) | Summary: {key_findings: [...], decisions: [...], pending: [...]} |
| 5 | Episodic Archiver | Archives raw blocks 1–10 to episodic store | 10 blocks stored with embeddings |
| 6 | Context Manager | Replaces blocks 1–10 with summary in active context | Context reduced to 62,000 tokens (62%) |
| 7 | Agent (later) | Thought references: "In my earlier analysis of document section 4.2..." | Retrieval trigger detected |
| 8 | Episodic Retriever | Semantic search: "document section 4.2 analysis" → finds archived block 7 | Retrieved block 7 |
| 9 | Context Manager | Injects retrieved block as bounded observation: [RETRIEVED CONTEXT: Block 7] | Active context updated |
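The token figures in steps 1, 4 and 6 imply the size of the raw blocks that were removed and the resulting summary-to-raw ratio. The short calculation below works this through; it uses only the numbers from the table, and the variable names are illustrative.

```python
before = 71_200   # step 1: active context after block 47 is added
after = 62_000    # step 6: active context after compression
summary = 1_200   # step 4: size of the structured summary

# Removing the raw blocks and adding the summary takes the context from before to after.
raw_removed = before - after + summary
ratio = summary / raw_removed

print(raw_removed)       # 10400
print(f"{ratio:.1%}")    # 11.5%
```

Blocks 1–10 therefore held 10,400 tokens, and the 1,200-token summary is about 11.5% of its source, within the "summary ≤ 20% of source length" balance given under Architectural Tensions.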
Error Flow
| Error | Detection | Recovery |
|---|---|---|
| Summarisation produces low-quality summary | Summary quality check (length, structure) | Retry summarisation with more explicit instructions; archive-only as fallback |
| Episodic archive write failure | Archive API error | Keep block in active context; alert; retry on next compression cycle |
| Retrieval returns irrelevant content | Relevance threshold on retrieval results | Return no-result observation; let agent reason without retrieved context |
| Context hits critical threshold before compression completes | Token count exceeds critical threshold | Emergency truncation of oldest low-relevance blocks; immediate archival |
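The first recovery path (quality check, one retry with stricter instructions, then archive-only) might look like the sketch below. The required keys mirror the summary structure shown in the data flow; the `strict` flag, the word-count size estimate, and the `flaky_summarise` stand-in are assumptions for illustration.

```python
from typing import Callable, List, Optional

REQUIRED_KEYS = ("key_findings", "decisions", "pending")


def is_acceptable(summary: dict, max_tokens: int) -> bool:
    """Structural and length check only; it does not judge factual accuracy."""
    has_keys = all(k in summary for k in REQUIRED_KEYS)
    size = sum(len(str(v).split()) for v in summary.values())
    return has_keys and 0 < size <= max_tokens


def summarise_with_fallback(
    blocks: List[str],
    summarise: Callable[[List[str], bool], dict],
    max_tokens: int = 1200,
) -> Optional[dict]:
    """Try once, retry with stricter instructions, else return None.

    None signals archive-only: the caller archives the blocks without a summary.
    """
    for strict in (False, True):
        summary = summarise(blocks, strict)
        if is_acceptable(summary, max_tokens):
            return summary
    return None


# Hypothetical summariser that only produces valid structure when strict.
def flaky_summarise(blocks: List[str], strict: bool) -> dict:
    if not strict:
        return {"text": "some unstructured prose"}
    return {"key_findings": blocks[:1], "decisions": [], "pending": ["confirm clause 12.3"]}


result = summarise_with_fallback(["finding A", "finding B"], flaky_summarise)
print(result)
```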
8. Security Considerations
Sensitive Content in Episodic Archive
- The episodic archive persists sensitive intermediate content (PII, commercial-in-confidence data) that may have shorter retention requirements than the task output
- Mitigation: Apply same retention and access controls to episodic archive as to task audit records; support time-limited archival with automatic deletion
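Time-limited archival with automatic deletion can be sketched as a periodic purge over archived records. The in-memory dictionary and the `expires_at` field are assumptions standing in for whatever the episodic store actually provides; many stores offer native TTL support that would replace this loop.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def purge_expired(archive: Dict[str, dict], now: datetime) -> List[str]:
    """Delete archived blocks whose retention period has elapsed; return their ids."""
    expired = [bid for bid, rec in archive.items() if rec["expires_at"] <= now]
    for bid in expired:
        del archive[bid]
    return expired


now = datetime(2025, 6, 13, tzinfo=timezone.utc)
archive = {
    "1": {"text": "block containing PII", "expires_at": now - timedelta(days=1)},
    "2": {"text": "clause analysis", "expires_at": now + timedelta(days=30)},
}
removed = purge_expired(archive, now)
print(removed, list(archive))
```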
OWASP LLM Top 10
| OWASP LLM Risk | Context Compression Applicability | Mitigation |
|---|---|---|
| LLM01 Prompt Injection | Archived content retrieved and re-injected may contain injection attempts | Sanitise retrieved content before re-injection; apply injection detection on retrieval |
| LLM06 Sensitive Information Disclosure | Episodic archive accumulates all task context including sensitive data | Encrypt episodic archive; apply per-block retention policies; PII detection before archival |
| LLM09 Overreliance | Agent may rely on compressed summaries that omit nuanced details | Summaries clearly marked as compressed; agent can trigger retrieval for full fidelity |
9. Governance Considerations
Compression Audit Trail
- Every compression event must be logged: which blocks were compressed, what summary was produced, and what was archived. This ensures that the full task context can be reconstructed for audit purposes even though it was not present in the active context at every point.
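One possible shape for a compression event record is sketched below as a JSON line appended to a log. The field names and the in-memory `log_lines` list are assumptions; a real deployment would write to the append-only store named in the Components table.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import List


def audit_record(task_id: str, block_ids: List[str],
                 summary_text: str, archived: bool) -> str:
    """Build one JSON line describing a compression event."""
    record = {
        "task_id": task_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "compressed_block_ids": block_ids,
        # The hash lets an auditor confirm the stored summary was not altered later.
        "summary_sha256": hashlib.sha256(summary_text.encode("utf-8")).hexdigest(),
        "summary_text": summary_text,
        "archived": archived,
    }
    return json.dumps(record, sort_keys=True)


log_lines: List[str] = []  # stand-in for an append-only table or file
log_lines.append(audit_record("task-001", ["1", "2", "3"], "3 blocks condensed", True))
entry = json.loads(log_lines[0])
print(entry["compressed_block_ids"])
```

Together with the archived raw blocks, records of this kind are what allow the full task context to be reconstructed after the fact.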
Governance Artefacts
| Artefact | Owner | Frequency | Purpose |
|---|---|---|---|
| Compression Event Log | AI Platform | Per event; retained with task audit | Documents every compression operation for context reconstruction |
| Episodic Archive | Compliance | Per task; retained per policy | Full-fidelity earlier context retrievable for audit |
| Summarisation Quality Benchmark | ML Engineering | Monthly | Validates that summaries preserve key findings; detects quality degradation |
| Compression Threshold Policy | AI Governance Board | Quarterly | Documents warning/critical thresholds per task type |
10. Operational Considerations
SLOs
| SLO | Target | Window | Alert |
|---|---|---|---|
| Context overflow rate (hard context window exceeded) | 0% | 24-hour rolling | Any overflow triggers P1; indicates compression is triggering too late |
| Compression-induced quality degradation rate | ≤ 2% of tasks | Weekly eval | > 5% triggers P2; review summarisation quality |
| Episodic retrieval precision (relevant content retrieved) | ≥ 85% | Weekly eval | < 75% triggers P3; review retrieval embedding quality |
| Compression overhead latency | ≤ 3s per compression event | 1-hour rolling | > 8s triggers P3 |
Monitoring
- Context budget usage trending per task type: approaching the ceiling early in a task indicates a need for earlier compression (a lower warning threshold) or a larger context budget
- Summary to raw token ratio: tracks compression efficiency
- Retrieval usage rate: high retrieval rate indicates compression is too aggressive (too much important content being compressed)
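The two ratio metrics above can be computed from counters the compression layer already holds. The definitions below are one reasonable reading, stated as assumptions: in particular, "retrieval usage rate" is taken here as retrieval calls per compressed block.

```python
def compression_ratio(summary_tokens: int, raw_tokens: int) -> float:
    """Summary-to-raw token ratio; lower means more aggressive compression."""
    if raw_tokens <= 0:
        raise ValueError("raw_tokens must be positive")
    return summary_tokens / raw_tokens


def retrieval_usage_rate(retrievals: int, compressed_blocks: int) -> float:
    """Retrieval calls per compressed block (one possible definition, an assumption)."""
    return retrievals / compressed_blocks if compressed_blocks else 0.0


# Example values: a 1,200-token summary of 10,400 raw tokens; 3 retrievals over 10 blocks.
ratio = compression_ratio(1_200, 10_400)
rate = retrieval_usage_rate(retrievals=3, compressed_blocks=10)
print(round(ratio, 3), rate)
```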
11. Cost Considerations
| Cost Factor | Impact | Mitigation |
|---|---|---|
| Summarisation LLM calls | Low–Medium (small model, short outputs) | Use GPT-4o-mini or Claude 3 Haiku for summarisation; never the primary reasoning model |
| Episodic retrieval embedding calls | Low | Batch embedding calls; cache embeddings for stable content |
| Extended context window (avoid compression) | Very High | Compression is far cheaper than doubling the context window tier |
| Compression preventing task failure | Very High positive | Prevents waste of all prior computation |
Indicative Cost Comparison
| Approach | Cost per 100-iteration task | Reliability |
|---|---|---|
| No compression (overflow at iteration 50) | $0.50 then failure | Poor |
| Large context window (1M tokens) | $5.00–20.00 (premium model pricing) | High |
| Context compression (proactive) | $0.80–1.20 (base + summarisation overhead) | High |
12. Trade-Off Analysis
| Option | Context Continuity | Cost | Complexity | Reliability | Best For |
|---|---|---|---|---|---|
| A: Rolling summarisation + episodic retrieval (Recommended) | High | Low–Medium | Medium | Very High | Long-running production agents |
| B: Fixed rolling window (keep last N blocks) | Low | Very Low | Very Low | Medium | Short tasks where earlier context is less important |
| C: Large context window (no compression) | Very High | Very High | Very Low | High | Tasks requiring full fidelity; budget available |
| D: Hierarchical summarisation (multi-level) | Very High | Medium | High | Very High | Very long tasks (hundreds of iterations) |
Architectural Tensions
| Tension | Left Pole | Right Pole | Balance |
|---|---|---|---|
| Compression aggressiveness | Aggressive (low context usage, low fidelity) | Conservative (high fidelity, context overflow risk) | Compress at 70% warning; aggressive at 85% |
| Summary fidelity | Short, lossy summary | Long, detailed summary (uses more context budget) | Summary ≤ 20% of source length; structured format |
| Retrieval frequency | Retrieve-on-demand (efficient) | Proactive re-injection of all archived content | On-demand only; triggered by agent reasoning references |
13. Failure Modes
| Failure Mode | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| Summary drops critical finding | Medium | High: agent loses important earlier information | Retrieval usage spikes; agent re-asks answered questions | Structured summary format enforces key-finding extraction; retrieval provides fallback |
| Episodic archive grows unbounded | Low–Medium | Medium: storage and retrieval cost | Archive size monitoring | Retention policy; tiered archive (warm/cold) |
| Retrieval injects irrelevant content | Medium | Medium: context pollution | Retrieval precision monitoring | Relevance threshold on retrieval; low-confidence retrievals discarded |
| Compression loop (compression triggers faster than it reduces context) | Low | High: agent stuck compressing | Compression event frequency alert | Emergency mode: archive all low-relevance blocks; force 50% context reduction |
| PII retained in episodic archive beyond retention period | Low | High: compliance violation | Automated PII scan on archive | Time-limited archival with automatic deletion; PII masking before archival |
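The compression event frequency alert for the compression-loop failure mode could be implemented as a sliding-window counter over agent iterations. The sketch below is one such approach; the class name and the default of more than 3 events within 5 iterations are assumptions chosen for illustration.

```python
from collections import deque


class CompressionLoopDetector:
    """Flags a loop when too many compression events fall within a
    sliding window of agent iterations (thresholds are assumptions)."""

    def __init__(self, max_events: int = 3, window: int = 5) -> None:
        self.max_events = max_events
        self.window = window
        self.events: deque = deque()

    def record(self, iteration: int) -> bool:
        """Record a compression at `iteration`; True means enter emergency mode."""
        self.events.append(iteration)
        # Drop events that have fallen out of the sliding window.
        while self.events and self.events[0] <= iteration - self.window:
            self.events.popleft()
        return len(self.events) > self.max_events


detector = CompressionLoopDetector()
flags = [detector.record(i) for i in (10, 20, 21, 22, 23, 24)]
print(flags)
```

Isolated compressions at iterations 10 and 20 do not raise the flag; the run of compressions on consecutive iterations from 20 onwards does, from the fourth event in the window.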
14. Regulatory Considerations
ISO 42001
- §8.4: Compression decisions affect the AI system's operational quality; compression algorithms must be documented and tested against quality benchmarks.
APRA CPS 230
- For agents operating in material business processes, the full reasoning context (including archived episodic records) must be retainable for operational resilience investigations.
Australian Context
- Privacy Act 1988: Content archived to episodic storage inherits the privacy sensitivity of the source task; retention and deletion policies must match the organisation's records management framework.
15. Reference Implementations
AWS
| Component | Service |
|---|---|
| Token Counter | Lambda function with tiktoken or a model-specific tokenizer |
| Summarisation | Amazon Bedrock (Claude 3 Haiku, cost-optimised) |
| Episodic Archive | Amazon OpenSearch Serverless (vector search for retrieval) |
| Summary Buffer | DynamoDB per-task context state |
| Monitoring | CloudWatch custom metrics for context budget |
Azure
| Component | Service |
|---|---|
| Summarisation | Azure OpenAI (GPT-4o-mini) |
| Episodic Archive | Azure AI Search (vector index) |
| Context State | Azure Cosmos DB |
| Monitoring | Azure Monitor with custom context budget metrics |
On-Premises
| Component | Technology |
|---|---|
| Summarisation | vLLM (Llama 3.1 8B for cost-efficient summarisation) |
| Episodic Archive | pgvector (PostgreSQL vector extension) |
| Context Management | LangChain ConversationSummaryBufferMemory; custom |
16. Related Patterns
| Pattern | ID | Relationship Type | Notes |
|---|---|---|---|
| Stateful Agent Memory | EAAPL-AGT002 | Depends On | Episodic archive is implemented on top of the stateful agent memory store |
| ReAct Agent Loop | EAAPL-WRK001 | Integrates With | Context compression wraps the ReAct scratchpad; manages its context budget |
| Long-Running Agent | EAAPL-AGT007 | Integrates With | Long-running agents are the primary consumers of context compression |
| Workflow Tracing and Replay | EAAPL-WRK013 | Integrates With | Compression event log is part of the workflow trace |
17. Maturity Assessment
Overall Maturity: Emerging
| Dimension | Score (1–5) | Evidence |
|---|---|---|
| Research Foundation | 4 | MemGPT, Compressor-Retriever, LongMem papers provide foundation |
| Production Deployment | 3 | Deployed in long-running research and code-generation agents; general enterprise tooling maturing |
| Framework Support | 3 | LangChain ConversationSummaryBufferMemory; MemGPT; custom implementations common |
| Summarisation Quality | 3 | Summarisation quality for technical content improving; not yet standardised |
| Retrieval Integration | 3 | Retrieval-augmented context injection maturing; precision benchmarks lacking |
18. Revision History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2025-06-13 | Architecture Board | Initial publication in Agentic Workflows category |