[EAAPL-WRK007] Context Compression

Category: Agentic Workflows
Sub-category: Memory Management Architecture
Version: 1.0
Maturity: Emerging
Tags: context-window, summarisation, context-compression, rolling-window, memory-management
Regulatory Relevance: ISO 42001 §8.4, APRA CPS 230

1. Executive Summary

The Context Compression Pattern defines techniques for managing an agent's context window during long-running workflows: rolling summarisation of earlier history, selective retention of high-relevance content, episodic retrieval of previously compressed material, and hierarchical memory tiers. Without active context management, long-running agentic workflows eventually exceed the model's context window, causing either hard failures or silent truncation of critical earlier context. This pattern prevents context overflow while preserving the reasoning continuity needed for high-quality multi-step task completion.

For CIO/CTO audiences: every AI model has a limit on how much information it can hold in working memory at one time (the context window). For short tasks this is not a problem. For long-running workflows — a multi-day research project, a complex code generation task, an extended client engagement — the working memory fills up and important earlier context gets forgotten. This pattern is the equivalent of a skilled professional's note-taking discipline: compressing older details into summaries, retaining only the most relevant specifics, and retrieving historical detail on demand. It enables AI agents to work effectively on tasks that span thousands of steps without losing sight of earlier work.

2. Problem Statement

Business Problem

Enterprise long-running tasks — regulatory analysis across hundreds of documents, code review of large codebases, extended document drafting sessions — generate more intermediate context than any current LLM context window can hold. Without context management, these tasks either fail with context overflow errors or produce degraded outputs because critical earlier findings have been silently truncated from the model's working memory.

Technical Problem

LLM context windows are finite (ranging from 8K to 1M tokens across current models). Long-running agentic workflows accumulate thought-action-observation records that grow without bound. Naive strategies (truncating the oldest content, or always truncating the longest observations) destroy reasoning continuity: later reasoning steps refer to context that has been silently removed, and the agent becomes confused.

Symptoms of Absence

Agent loses track of earlier findings, re-asks questions already answered, or contradicts its own earlier conclusions
Hard context overflow errors terminate workflows mid-execution
Observations are silently truncated; agent reasons about incomplete tool results without knowing they are incomplete
No mechanism to retrieve earlier reasoning context on demand

Cost of Inaction

Reliability: Context overflow errors terminate tasks, wasting all prior computation
Quality: Silent truncation produces outputs that contradict or ignore earlier findings
Trust: Users discover that the agent has "forgotten" key earlier context, destroying confidence

3. Context

When to Apply

Tasks generate context that may approach or exceed the model's context window budget
Long-running agents with many tool calls (>15 iterations)
Tasks spanning multiple sessions (inter-session context preservation)
Tasks requiring recall of information from early stages when synthesising final outputs

When NOT to Apply

Short tasks with predictably small context footprints
Tasks where complete context history must be preserved verbatim (use an explicit log, not the context window)
Tasks where summarisation would destroy required precision (highly technical numerical content)

Prerequisites

EAAPL-AGT002 (Stateful Agent Memory) for persistent episodic storage
Token counting utility for real-time context budget monitoring
Summarisation prompt templates per content type (tool observations, reasoning steps, user turns)
Context budget thresholds for triggering compression

Industry Applicability

Industry	Long-Running Task	Context Challenge
Legal	Multi-week document review project	Thousands of document excerpts accumulate
Financial Services	Complex financial model analysis	Large data tables plus reasoning accumulate
Technology	Large codebase analysis and refactoring	File contents plus analysis notes accumulate
Healthcare	Extended clinical case analysis	Lab results, notes, guidelines accumulate
Government	Multi-agency policy development	Submissions, analysis, revision history accumulate

4. Architecture Overview

Context compression is implemented as a continuous monitoring and management layer that wraps the agent's context window, operating proactively before overflow occurs.

Token Budget Monitor

The token budget monitor tracks the current context window usage in real time, counting tokens after every addition (thought, observation, user message). It emits threshold events at configurable percentages of the context window capacity: a warning threshold (e.g., 70%) triggers proactive compression, a critical threshold (e.g., 85%) triggers aggressive compression.
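A minimal sketch of the threshold logic described above. The class name, the `rough_count` tokeniser stand-in, and the status labels are assumptions for illustration; a real deployment would inject a model-specific token counter, and the 70% and 85% values are the example thresholds from the text.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TokenBudgetMonitor:
    """Tracks context usage and reports which threshold, if any, is crossed."""
    window_tokens: int                    # model context window capacity
    count_tokens: Callable[[str], int]    # injected tokeniser (assumption)
    warning_ratio: float = 0.70           # example value from the text
    critical_ratio: float = 0.85          # example value from the text
    used_tokens: int = 0

    def add(self, text: str) -> str:
        """Count a newly added block and return 'ok', 'warning' or 'critical'."""
        self.used_tokens += self.count_tokens(text)
        return self.status()

    def status(self) -> str:
        ratio = self.used_tokens / self.window_tokens
        if ratio >= self.critical_ratio:
            return "critical"
        if ratio >= self.warning_ratio:
            return "warning"
        return "ok"


def rough_count(text: str) -> int:
    # Crude stand-in for a real tokeniser: one token per whitespace-separated word.
    return len(text.split())


monitor = TokenBudgetMonitor(window_tokens=100, count_tokens=rough_count)
print(monitor.add("word " * 60))   # 60 of 100 tokens used
print(monitor.add("word " * 15))   # 75 of 100 tokens used
print(monitor.add("word " * 15))   # 90 of 100 tokens used
```

The monitor only reports status; deciding what to compress is left to the compression engine, which keeps the counting path cheap enough to run after every addition.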

Rolling Summarisation

When the warning threshold is crossed, the Rolling Summariser identifies the oldest N thought-action-observation blocks in the scratchpad and replaces them with a structured summary. The summary preserves: key findings (entities, decisions, conclusions), pending actions, and unresolved questions. The raw blocks are archived to episodic memory (EAAPL-AGT002) before being removed from the active context. The summary explicitly marks itself as a compression artefact so that the agent's reasoning knows it is working with summarised rather than raw context.
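The replace-oldest-N step might look roughly like the sketch below. The `summarise` and `archive` callables are placeholders for the summarisation model call and the EAAPL-AGT002 write API, and the `[COMPRESSED SUMMARY]` marker is a hypothetical format, not one defined by this pattern.

```python
from typing import Callable, List

COMPRESSED_MARKER = "[COMPRESSED SUMMARY]"  # hypothetical marker format


def compress_oldest(
    blocks: List[str],
    n: int,
    summarise: Callable[[List[str]], str],
    archive: Callable[[List[str]], None],
) -> List[str]:
    """Replace the oldest n blocks with one summary block marked as compressed."""
    oldest, rest = blocks[:n], blocks[n:]
    if not oldest:
        return blocks
    # Archive the raw blocks first, so nothing is dropped if archiving raises.
    archive(oldest)
    summary = summarise(oldest)
    return [f"{COMPRESSED_MARKER} {summary}"] + rest


# Usage with in-memory stand-ins for the summariser and the archive.
episodic_store: List[str] = []
scratchpad = ["block 1", "block 2", "block 3", "block 4"]
scratchpad = compress_oldest(
    scratchpad,
    n=2,
    summarise=lambda bs: f"{len(bs)} earlier blocks summarised",
    archive=episodic_store.extend,
)
print(scratchpad)
```

Archiving before summarising means a failed archive write raises before the active context is changed, which matches the ordering the pattern requires.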

Selective Retention

Not all context has equal relevance. The Relevance Scorer evaluates each context block against the current task objective and assigns a relevance score. High-relevance blocks (directly referenced by the current reasoning step) are retained in full. Medium-relevance blocks are summarised. Low-relevance blocks are archived to episodic memory. Relevance scoring is performed by a lightweight LLM call or by embedding similarity to the current task objective.
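As an illustration of the embedding-similarity option, the sketch below maps a cosine score to one of the three actions. The `high` and `medium` cut-offs are assumed values, and the two-dimensional vectors stand in for real embeddings.

```python
import math
from typing import Dict, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def triage(
    block_vectors: Dict[str, Sequence[float]],
    objective_vector: Sequence[float],
    high: float = 0.75,     # assumed cut-off for "retain in full"
    medium: float = 0.40,   # assumed cut-off for "summarise"
) -> Dict[str, str]:
    """Map each block id to 'retain', 'summarise' or 'archive'."""
    actions = {}
    for block_id, vector in block_vectors.items():
        score = cosine(vector, objective_vector)
        if score >= high:
            actions[block_id] = "retain"
        elif score >= medium:
            actions[block_id] = "summarise"
        else:
            actions[block_id] = "archive"
    return actions


# Toy vectors: block-1 matches the objective, block-3 is orthogonal to it.
objective = [1.0, 0.0]
blocks = {"block-1": [1.0, 0.0], "block-2": [1.0, 1.0], "block-3": [0.0, 1.0]}
print(triage(blocks, objective))
```

Suitable cut-offs depend on the embedding model and task type, so they would normally be tuned against the summarisation quality benchmark rather than fixed in code.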

Episodic Retrieval

When the agent's reasoning references content that has been compressed out of the active context ("Earlier I noted that clause 12.3 had a specific obligation — what was it exactly?"), the Episodic Retriever performs a semantic search against the episodic memory store (EAAPL-AGT002) and injects the retrieved content back into context as a bounded retrieval observation. This on-demand retrieval prevents the need to retain all historical context proactively.
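The retrieve-and-inject step can be sketched as follows. To keep the example self-contained, a word-overlap (Jaccard) score stands in for embedding-based semantic search, and the dictionary stands in for the episodic store; `min_score`, `max_chars` and the no-result wording are assumptions.

```python
from typing import Dict


def overlap(query: str, text: str) -> float:
    """Jaccard word overlap; a simple stand-in for embedding similarity."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0


def retrieve(
    query: str,
    archive: Dict[int, str],
    min_score: float = 0.2,   # assumed relevance threshold
    max_chars: int = 500,     # assumed bound on the injected content
) -> str:
    """Return a bounded retrieval observation, or a no-result observation."""
    best_id, best_score = None, 0.0
    for block_id, text in archive.items():
        score = overlap(query, text)
        if score > best_score:
            best_id, best_score = block_id, score
    if best_id is None or best_score < min_score:
        return "[RETRIEVED CONTEXT: none found]"
    return f"[RETRIEVED CONTEXT: Block {best_id}] {archive[best_id][:max_chars]}"


archive = {
    3: "weather lookup returned sunny",
    7: "analysis of document section 4.2 found a notice obligation",
}
print(retrieve("document section 4.2 analysis", archive))
print(retrieve("unrelated topic", archive))
```

Truncating to `max_chars` keeps a single retrieval from consuming the budget that compression just freed, and the explicit no-result observation lets the agent continue without retrieved context.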

Hierarchical Memory Tiers

Context is managed across three tiers: (1) Active Context — full fidelity, in-window; (2) Summary Buffer — compressed representations of recent history, in-window; (3) Episodic Archive — full-fidelity earlier content, out-of-window but retrievable. The size allocated to each tier is configurable per task type.
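One way to express per-task-type tier sizing is a small configuration object, sketched below. The task-type names and percentages are purely illustrative. Only the two in-window tiers receive a token budget here, since the Episodic Archive sits outside the window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierBudget:
    """In-window token split between Active Context and Summary Buffer."""
    window_tokens: int
    summary_percent: int   # share of the window reserved for the Summary Buffer

    @property
    def summary_tokens(self) -> int:
        return self.window_tokens * self.summary_percent // 100

    @property
    def active_tokens(self) -> int:
        return self.window_tokens - self.summary_tokens


# Hypothetical per-task-type settings; the percentages are illustrative only.
TIER_BUDGETS = {
    "document_review": TierBudget(window_tokens=100_000, summary_percent=20),
    "code_refactoring": TierBudget(window_tokens=100_000, summary_percent=10),
}

budget = TIER_BUDGETS["document_review"]
print(budget.active_tokens, budget.summary_tokens)
```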

5. Architecture Diagram

flowchart TD
    subgraph ActiveContext["Active Context Window"]
        A[Active Blocks]
        B[Summary Buffer]
    end
    subgraph Monitor["Context Budget Monitor"]
        C{Token Budget Threshold?}
        D[Warning 70%]
        E[Critical 85%]
    end
    subgraph Compression["Compression Engine"]
        F[Relevance Scorer]
        G[Rolling Summariser]
        H[Archive to Episodic]
    end
    subgraph Retrieval["Episodic Retrieval"]
        I[Retrieval Trigger]
        J[Semantic Search]
        K[Inject Retrieved Content]
    end
    subgraph Storage["Memory Tiers"]
        L[(Active Context)]
        M[(Episodic Archive)]
    end
    A --> C
    C -->|below threshold| A
    C -->|warning| D
    C -->|critical| E
    D --> F
    E --> F
    F --> G
    G --> H
    H --> M
    G --> B
    A --> I
    I --> J
    J --> M
    J --> K
    K --> A

6. Components

Component	Type	Responsibility	Technology Options	Criticality
Token Budget Monitor	Logic Component	Real-time token counting; threshold event emission	tiktoken (OpenAI); Anthropic token counter; custom	Critical
Relevance Scorer	AI/Logic Component	Scores context blocks against current task objective	Embedding similarity (fast); LLM scoring (accurate)	High
Rolling Summariser	AI Component	Generates structured summaries of oldest context blocks	GPT-4o-mini, Claude 3 Haiku (cost-optimised for summarisation)	Critical
Episodic Archiver	Integration	Writes full-fidelity blocks to episodic memory before removal	EAAPL-AGT002 write API	Critical
Episodic Retriever	Integration	Semantic search against episodic archive; injects retrieved content	EAAPL-AGT002 retrieval API; pgvector; Pinecone	High
Summary Buffer Manager	State	Manages the summary buffer section of active context; limits buffer size	Custom context management layer	High
Compression Audit Logger	Governance	Records every compression event: what was summarised, what was archived	PostgreSQL; append-only log	Medium

7. Data Flow

Step	Actor	Action	Output
1	Agent	Adds Thought-Action-Observation block 47 to context	Context: 71,200 tokens (71.2% of 100K window)
2	Token Budget Monitor	71.2% > warning threshold (70%); trigger compression	Compression event fired
3	Relevance Scorer	Scores blocks 1–30 against current task objective	Blocks 1–10: low relevance; blocks 11–20: medium; blocks 21–30: high
4	Rolling Summariser	Summarises blocks 1–10 into structured summary (1,200 tokens)	Summary: `{key_findings: [...], decisions: [...], pending: [...]}`
5	Episodic Archiver	Archives raw blocks 1–10 to episodic store	10 blocks stored with embeddings
6	Context Manager	Replaces blocks 1–10 with summary in active context	Context reduced to 62,000 tokens (62%)
7	Agent (later)	Thought references: "In my earlier analysis of document section 4.2..."	Retrieval trigger detected
8	Episodic Retriever	Semantic search: "document section 4.2 analysis" → finds archived block 7	Retrieved block 7
9	Context Manager	Injects retrieved block as bounded observation: `[RETRIEVED CONTEXT: Block 7]`	Active context updated
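The token figures in steps 1 to 6 can be checked with a few lines of arithmetic. The raw size of blocks 1–10 is not stated in the table; the sketch below derives it from the other figures, so treat it as an implied value rather than a documented one.

```python
window_tokens = 100_000    # context window size (step 1)
before_tokens = 71_200     # usage after block 47 is added (step 1)
summary_tokens = 1_200     # size of the structured summary (step 4)
after_tokens = 62_000      # usage after replacement (step 6)

# after = before - raw_removed + summary, so solve for raw_removed.
raw_removed = before_tokens - after_tokens + summary_tokens
compression_ratio = summary_tokens / raw_removed

print(raw_removed)   # 10400 tokens held by raw blocks 1-10
print(f"usage before: {before_tokens / window_tokens:.1%}")
print(f"usage after: {after_tokens / window_tokens:.1%}")
print(f"summary / raw: {compression_ratio:.1%}")
```

The implied summary-to-raw ratio is below the 20% ceiling suggested in the trade-off analysis, so the worked example is consistent with that guideline.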

Error Flow

Error	Detection	Recovery
Summarisation produces low-quality summary	Summary quality check (length, structure)	Retry summarisation with more explicit instructions; archive-only as fallback
Episodic archive write failure	Archive API error	Keep block in active context; alert; retry on next compression cycle
Retrieval returns irrelevant content	Relevance threshold on retrieval results	Return no-result observation; let agent reason without retrieved context
Context hits critical threshold before compression completes	Token count exceeds critical threshold	Emergency truncation of oldest low-relevance blocks; immediate archival
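The first two recovery paths in the table could be combined along the lines of the sketch below. `ArchiveWriteError`, the outcome labels, and the `summarise(blocks, strict)` signature are hypothetical; the quality check covers only length (character counts as a rough proxy for tokens, using the 20% ceiling from the trade-off analysis) and omits the structure check.

```python
from typing import Callable, List, Tuple


class ArchiveWriteError(Exception):
    """Hypothetical error raised by the archive client on a failed write."""


def compress_with_fallbacks(
    blocks: List[str],
    summarise: Callable[[List[str], bool], str],
    archive: Callable[[List[str]], None],
    max_ratio: float = 0.20,   # length check: summary at most 20% of source
    retries: int = 1,
) -> Tuple[List[str], str]:
    """Return (blocks to keep in active context, outcome label)."""
    try:
        archive(blocks)
    except ArchiveWriteError:
        # Archive write failed: keep the raw blocks and retry next cycle.
        return blocks, "kept_in_context"

    source_len = sum(len(b) for b in blocks)
    for attempt in range(retries + 1):
        # The second argument requests more explicit instructions on a retry.
        summary = summarise(blocks, attempt > 0)
        if summary and len(summary) <= max_ratio * source_len:
            return [summary], "summarised"
    # Summaries kept failing the check: rely on the archived copy alone.
    return [], "archive_only"


archived: List[str] = []
result = compress_with_fallbacks(
    ["x" * 100, "y" * 100],
    summarise=lambda bs, strict: "short summary" if strict else "z" * 500,
    archive=archived.extend,
)
print(result)
```

In this usage the first summary fails the length check and the stricter retry passes, so the blocks are replaced by the retried summary.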

8. Security Considerations

Sensitive Content in Episodic Archive

The episodic archive persists sensitive intermediate content (PII, commercial-in-confidence data) that may have shorter retention requirements than the task output
Mitigation: Apply same retention and access controls to episodic archive as to task audit records; support time-limited archival with automatic deletion

OWASP LLM Top 10

OWASP LLM Risk	Context Compression Applicability	Mitigation
LLM01 Prompt Injection	Archived content retrieved and re-injected may contain injection attempts	Sanitise retrieved content before re-injection; apply injection detection on retrieval
LLM06 Sensitive Information	Episodic archive accumulates all task context including sensitive data	Encrypt episodic archive; apply per-block retention policies; PII detection before archival
LLM09 Overreliance	Agent may rely on compressed summaries that omit nuanced details	Summaries clearly marked as compressed; agent can trigger retrieval for full fidelity

9. Governance Considerations

Compression Audit Trail

Every compression event must be logged: which blocks were compressed, what summary was produced, and what was archived. This ensures that the full task context can be reconstructed for audit purposes even though it was not present in the active context at every point.
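A compression event record might be serialised as one JSON line per event, as in the sketch below. The field names are assumptions, not a schema defined by this pattern; the hash is included only as one possible way to detect later tampering with the stored summary.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import List


def compression_event(
    task_id: str,
    block_ids: List[int],
    summary: str,
    archive_refs: List[str],
) -> str:
    """Build one JSON line describing a single compression event."""
    record = {
        "task_id": task_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "compressed_block_ids": block_ids,     # which blocks were compressed
        "summary": summary,                    # what summary was produced
        "summary_sha256": hashlib.sha256(summary.encode("utf-8")).hexdigest(),
        "archive_refs": archive_refs,          # where the raw blocks were archived
    }
    return json.dumps(record, sort_keys=True)


# In production the line would be appended to the append-only log or database.
line = compression_event(
    task_id="task-001",
    block_ids=list(range(1, 11)),
    summary="key findings, decisions and pending actions for blocks 1-10",
    archive_refs=["episodic/task-001/blocks-1-10"],
)
print(line)
```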

Governance Artefacts

Artefact	Owner	Frequency	Purpose
Compression Event Log	AI Platform	Per event; retained with task audit	Documents every compression operation for context reconstruction
Episodic Archive	Compliance	Per task; retained per policy	Full-fidelity earlier context retrievable for audit
Summarisation Quality Benchmark	ML Engineering	Monthly	Validates that summaries preserve key findings; detects quality degradation
Compression Threshold Policy	AI Governance Board	Quarterly	Documents warning/critical thresholds per task type

10. Operational Considerations

SLOs

SLO	Target	Window	Alert
Context overflow rate (hard context window exceeded)	0%	24-hour rolling	Any overflow triggers P1; indicates compression is triggering too late
Compression-induced quality degradation rate	≤ 2% of tasks	Weekly eval	> 5% triggers P2; review summarisation quality
Episodic retrieval precision (relevant content retrieved)	≥ 85%	Weekly eval	< 75% triggers P3; review retrieval embedding quality
Compression overhead latency	≤ 3s per compression event	1-hour rolling	> 8s triggers P3

Monitoring

Context budget usage trending per task type: usage that approaches the ceiling early in a task indicates a need for earlier compression or a larger context budget
Summary to raw token ratio: tracks compression efficiency
Retrieval usage rate: high retrieval rate indicates compression is too aggressive (too much important content being compressed)
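The last two monitoring signals reduce to simple ratios. The sketch below assumes each compression event is recorded as a `(raw_tokens, summary_tokens)` pair and that retrievals and iterations are counted per task; both are illustrative conventions.

```python
from typing import Iterable, Tuple


def summary_to_raw_ratio(events: Iterable[Tuple[int, int]]) -> float:
    """events holds (raw_tokens, summary_tokens) for each compression event."""
    events = list(events)
    raw = sum(r for r, _ in events)
    summary = sum(s for _, s in events)
    return summary / raw if raw else 0.0


def retrieval_usage_rate(retrievals: int, iterations: int) -> float:
    """Fraction of agent iterations that triggered an episodic retrieval."""
    return retrievals / iterations if iterations else 0.0


events = [(10_000, 1_000), (10_000, 3_000)]
print(f"summary/raw ratio: {summary_to_raw_ratio(events):.2f}")
print(f"retrieval usage rate: {retrieval_usage_rate(5, 100):.2f}")
```

A falling summary-to-raw ratio combined with a rising retrieval rate would suggest summaries are becoming too lossy, which is the over-aggressive compression signal described above.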

11. Cost Considerations

Cost Factor	Impact	Mitigation
Summarisation LLM calls	Low–Medium (small model, short outputs)	Use GPT-4o-mini or Claude 3 Haiku for summarisation; never the primary reasoning model
Episodic retrieval embedding calls	Low	Batch embedding calls; cache embeddings for stable content
Extended context window (avoid compression)	Very High	Compression is far cheaper than doubling the context window tier
Compression preventing task failure	Very High positive	Prevents waste of all prior computation

Indicative Cost Comparison

Approach	Cost per 100-iteration task	Reliability
No compression (overflow at iteration 50)	$0.50 then failure	Poor
Large context window (1M tokens)	$5.00–20.00 (premium model pricing)	High
Context compression (proactive)	$0.80–1.20 (base + summarisation overhead)	High

12. Trade-Off Analysis

Option	Context Continuity	Cost	Complexity	Reliability	Best For
A: Rolling summarisation + episodic retrieval (Recommended)	High	Low–Medium	Medium	Very High	Long-running production agents
B: Fixed rolling window (keep last N blocks)	Low	Very Low	Very Low	Medium	Short tasks where earlier context is less important
C: Large context window (no compression)	Very High	Very High	Very Low	High	Tasks requiring full fidelity; budget available
D: Hierarchical summarisation (multi-level)	Very High	Medium	High	Very High	Very long tasks (hundreds of iterations)

Architectural Tensions

Tension	Left Pole	Right Pole	Balance
Compression aggressiveness	Aggressive (low context usage, low fidelity)	Conservative (high fidelity, context overflow risk)	Compress at 70% warning; aggressive at 85%
Summary fidelity	Short, lossy summary	Long, detailed summary (uses more context budget)	Summary ≤ 20% of source length; structured format
Retrieval frequency	Retrieve-on-demand (efficient)	Proactive re-injection of all archived content	On-demand only; triggered by agent reasoning references

13. Failure Modes

Failure Mode	Likelihood	Impact	Detection	Recovery
Summary drops critical finding	Medium	High — agent loses important earlier information	Retrieval usage spikes; agent re-asks answered questions	Structured summary format enforces key-finding extraction; retrieval provides fallback
Episodic archive grows unbounded	Low–Medium	Medium — storage and retrieval cost	Archive size monitoring	Retention policy; tiered archive (warm/cold)
Retrieval injects irrelevant content	Medium	Medium — context pollution	Retrieval precision monitoring	Relevance threshold on retrieval; low-confidence retrievals discarded
Compression loop (compression triggers faster than it reduces context)	Low	High — agent stuck compressing	Compression event frequency alert	Emergency mode: archive all low-relevance blocks; force 50% context reduction
PII retained in episodic archive beyond retention period	Low	High — compliance violation	Automated PII scan on archive	Time-limited archival with automatic deletion; PII masking before archival
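The compression-loop failure mode can be detected from the iteration numbers at which compression fired. The sketch below flags a loop when the most recent `k` events are each at most `min_gap` iterations apart; both parameters are assumed values that would need tuning per task type.

```python
from typing import List


def in_compression_loop(event_iterations: List[int], k: int = 3, min_gap: int = 2) -> bool:
    """True if the last k compression events were each <= min_gap iterations apart."""
    recent = event_iterations[-k:]
    if len(recent) < k:
        return False
    return all(later - earlier <= min_gap for earlier, later in zip(recent, recent[1:]))


# Compression fired at iterations 5, 20, 21 and 22: the last three are back to back.
print(in_compression_loop([5, 20, 21, 22]))
# Compression fired at widely spaced iterations: no loop.
print(in_compression_loop([5, 20, 40]))
```

A positive result would raise the compression event frequency alert and hand control to the emergency mode listed in the recovery column.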

14. Regulatory Considerations

ISO 42001

§8.4: Compression decisions affect the AI system's operational quality; compression algorithms must be documented and tested against quality benchmarks.

APRA CPS 230

For agents operating in material business processes, the full reasoning context (including archived episodic records) must be retainable for operational resilience investigations.

Australian Context

Privacy Act 1988: Content archived to episodic storage inherits the privacy sensitivity of the source task; retention and deletion policies must match the organisation's records management framework.

15. Reference Implementations

AWS

Component	Service
Token Counter	Lambda function with tiktoken or a model-specific token counter
Summarisation	Amazon Bedrock (Claude 3 Haiku — cost-optimised)
Episodic Archive	Amazon OpenSearch Serverless (vector search for retrieval)
Summary Buffer	DynamoDB per-task context state
Monitoring	CloudWatch custom metrics for context budget

Azure

Component	Service
Summarisation	Azure OpenAI (GPT-4o-mini)
Episodic Archive	Azure AI Search (vector index)
Context State	Azure Cosmos DB
Monitoring	Azure Monitor with custom context budget metrics

On-Premises

Component	Technology
Summarisation	vLLM (Llama 3.1 8B for cost-efficient summarisation)
Episodic Archive	pgvector (PostgreSQL vector extension)
Context Management	LangChain ConversationSummaryBufferMemory; custom

16. Related Patterns

Pattern	ID	Relationship Type	Notes
Stateful Agent Memory	EAAPL-AGT002	Depends On	Episodic archive is implemented on top of the stateful agent memory store
ReAct Agent Loop	EAAPL-WRK001	Integrates With	Context compression wraps the ReAct scratchpad; manages its context budget
Long-Running Agent	EAAPL-AGT007	Integrates With	Long-running agents are the primary consumers of context compression
Workflow Tracing and Replay	EAAPL-WRK013	Integrates With	Compression event log is part of the workflow trace

17. Maturity Assessment

Overall Maturity: Emerging

Dimension	Score (1–5)	Evidence
Research Foundation	4	MemGPT, Compressor-Retriever, LongMem papers provide foundation
Production Deployment	3	Deployed in long-running research and code-generation agents; general enterprise tooling maturing
Framework Support	3	LangChain ConversationSummaryBufferMemory; MemGPT; custom implementations common
Summarisation Quality	3	Summarisation quality for technical content improving; not yet standardised
Retrieval Integration	3	Retrieval-augmented context injection maturing; precision benchmarks lacking

18. Revision History

Version	Date	Author	Changes
1.0	2025-06-13	Architecture Board	Initial publication in Agentic Workflows category
