[EAAPL-AGT007] Long-Running Agent

Category: Agentic AI
Sub-category: Async Execution Architecture
Version: 1.2
Maturity: Proven
Tags: long-running, async, task-queue, heartbeat, cost-budget, partial-results, deadline-management, human-checkin
Regulatory Relevance: APRA CPS 230 (Operational Resilience), ISO 22301, NIST AI RMF (MANAGE 4.1), EU AI Act (Art. 9, 14)

1. Executive Summary

The Long-Running Agent Pattern defines the architecture for AI agents that execute tasks over hours or days — due diligence analysis, large codebase refactoring, enterprise-wide data reconciliation, or extended research synthesis. These tasks cannot fit within the synchronous request-response paradigm: calling systems cannot hold a connection open for hours, LLM context windows cannot hold 48 hours of tool results, and cost controls require active monitoring rather than post-hoc billing surprises.

For CIO/CTO audiences: this pattern transforms AI agents from interactive request-responders into asynchronous workforce members — entities you assign a task to on Monday morning and receive a deliverable from by Friday, with status updates throughout and the ability to pause or redirect them at any point. It defines how to decompose multi-day tasks into manageable segments, how to monitor and control running costs, how to ensure partial results are safely preserved if the task is interrupted, and how to maintain human oversight over extended autonomous operation. The resulting architecture is what separates a toy AI demo from a production AI workforce capability.

2. Problem Statement

Business Problem

High-value knowledge work tasks take hours or days. A due diligence review of 500 contracts, a codebase-wide security audit, or a multi-source research synthesis cannot complete in seconds. If AI agents are restricted to short tasks, the most valuable automation opportunities remain out of reach.

Technical Problem

The synchronous agent execution model (HTTP request/response) is unsuitable for long tasks: connections time out, LLM context windows overflow, token costs become unpredictable, and there is no way to inject human checkpoints. Context window exhaustion is particularly severe on multi-hour tasks: a 100K-token context window fills after 60–100 tool calls with moderate result sizes.

Symptoms of Absence

Tasks taking longer than 30 minutes are decomposed manually by humans into shorter subtasks, negating automation benefits
Cost surprises: a long agent task consumes 10–50× the anticipated token budget with no warning
Partial work is lost when infrastructure restarts or LLM provider timeouts occur at hour 3 of a 5-hour task
No mechanism for human course-correction once a long task is launched

Cost of Inaction

High-value automation opportunities (due diligence, audit, research) remain manual
Ad hoc workarounds (manually splitting tasks) create brittle processes that fail when task sizes vary
Infrastructure teams field escalations about unexplained high AI inference costs from long tasks without budget controls

3. Context

When to Apply

Expected task duration is > 30 minutes
Task involves processing a large corpus (hundreds of documents, thousands of records)
Human review or approval at intermediate milestones is required
Cost predictability and budget control are required
Partial results have value (delivering results incrementally is better than delivering nothing if the task is interrupted)

When NOT to Apply

Tasks that complete in < 5 minutes (async overhead not justified)
Tasks that require a synchronous response in the same user session
Tasks with no natural decomposition into independently useful segments

Prerequisites

EAAPL-AGT005 (Checkpoint and Recovery) — mandatory for multi-hour tasks
Durable task queue with dead-letter handling
Async notification infrastructure (webhooks, event bus, push notifications)
Cost monitoring and kill switch capability
Human management API (pause, redirect, cancel)

Industry Applicability

Industry	Long-Running Task	Duration	Human Check-in Frequency
Legal / M&A	Due diligence (500+ documents)	4–24 hours	At task creation, 50% progress, completion
Financial Services	Regulatory report generation, reconciliation	2–12 hours	At key milestones; anomaly-triggered
Technology	Large codebase security audit, refactoring	4–48 hours	At phase boundaries
Healthcare	Multi-source patient cohort analysis	2–8 hours	At each data source completion
Research	Literature synthesis, competitive analysis	8–72 hours	Daily check-in

4. Architecture Overview

The Long-Running Agent Pattern addresses four fundamental challenges of extended autonomous execution: context window management, task decomposition and progress tracking, cost budget enforcement, and human oversight at meaningful checkpoints.

Task Decomposition and Segment Orchestration

A long task is decomposed by the Task Planner into an ordered sequence of segments: bounded sub-tasks, each of which can complete within the single-agent pattern's standard execution model (typically < 30 minutes, < 50K tokens). The segment plan is stored durably at task creation and is the master execution schedule. Each segment produces a partial result that is stored in the Partial Result Store. If the task is interrupted, the segment plan acts as the recovery map: completed segments are skipped; the next incomplete segment is resumed.
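The recovery-map behaviour described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the field names follow the segment plan shape shown in the Data Flow section, while the `status` field, its values, and the `next_incomplete` helper are assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    segment_id: str
    sub_objective: str
    input_scope: str
    estimated_cost: float
    checkin: bool = False
    status: str = "pending"  # assumed values: "pending" or "completed"


def next_incomplete(plan: List[Segment]) -> Optional[Segment]:
    """Return the first segment that is not completed, or None if the plan is done."""
    for segment in plan:
        if segment.status != "completed":
            return segment
    return None


# Hypothetical plan state after an interruption: segments 1 and 2 finished.
plan = [
    Segment("1", "Index corpus", "docs 1-500", 2.0, status="completed"),
    Segment("2", "Review batch A", "docs 1-100", 5.0, status="completed"),
    Segment("3", "Review batch B", "docs 101-200", 5.0, checkin=True),
    Segment("4", "Review batch C", "docs 201-300", 5.0),
]
resume_at = next_incomplete(plan)
print(resume_at.segment_id)
```

Because completion status lives in the durable plan rather than in worker memory, a restarted worker only needs the plan to know where to resume.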

The segment plan is not a rigid pre-specified plan. The Task Planner can be queried to revise the remaining segment plan based on discoveries made in early segments (adaptive planning). For example, if segment 3 discovers that 200 additional documents need to be reviewed, the plan is revised to add segments 3a–3n before segment 4.
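As a rough sketch of adaptive plan revision, the example below splices newly discovered work in after segment 3. The `insert_after` helper, the batch size of 50 documents, and the `max_segments` guard are all assumptions for illustration; the guard reflects the hard segment count limit listed under Failure Modes.

```python
from typing import Dict, List


def insert_after(plan: List[Dict], after_id: str, new_segments: List[Dict],
                 max_segments: int = 100) -> List[Dict]:
    """Return a new plan with new_segments spliced in directly after after_id."""
    index = next(i for i, s in enumerate(plan) if s["segment_id"] == after_id)
    revised = plan[: index + 1] + new_segments + plan[index + 1:]
    if len(revised) > max_segments:
        # Assumed guard: stop unbounded plan growth and force a human decision.
        raise ValueError("revised plan exceeds the segment count limit")
    return revised


plan = [{"segment_id": str(n), "status": "pending"} for n in range(1, 6)]

# 200 extra documents split into batches of 50 (batch size is an assumption).
extra = [
    {"segment_id": "3" + chr(ord("a") + i), "status": "pending"}
    for i in range(200 // 50)
]
revised = insert_after(plan, "3", extra)
ids = [s["segment_id"] for s in revised]
print(ids)
```

Returning a new list rather than mutating the stored plan makes it straightforward to present the revision for human approval before it replaces the current plan.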

Context Window Management Across Segments

Each segment executes in a fresh context window. The context for segment N includes: the original task objective, a summary of results from segments 1 through N-1 (produced by the Context Summariser component), the current segment's specific sub-objective, and the relevant tools. The summary is a lossy compression of prior results, so the Task Planner specifies in the task plan what information must be preserved across segment boundaries.

This approach solves context exhaustion by design: no single segment accumulates more context than the window can hold. The cost is that inter-segment reasoning is mediated through the summary, which may lose nuance. For tasks that require tight consistency across many segments (e.g., a legal review where clause 400 must reference clause 12), the Task Planner must preserve the critical cross-references in the carry-forward summary.
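One way to assemble the per-segment context is sketched below. The function name, the prompt section labels, and the `must_preserve` parameter are hypothetical; the four ingredients (objective, carry-forward summary, sub-objective, tools) come from the description above.

```python
from typing import Sequence


def build_segment_context(objective: str, summary: str, sub_objective: str,
                          tools: Sequence[str],
                          must_preserve: Sequence[str] = ()) -> str:
    """Assemble the fresh-window context for one segment as a single prompt string."""
    lines = [
        f"TASK OBJECTIVE: {objective}",
        f"SUMMARY OF PRIOR SEGMENTS: {summary or '(none, this is the first segment)'}",
    ]
    if must_preserve:
        # Cross-references the Task Planner marked as critical are carried verbatim
        # so they are not lost to lossy summarisation.
        lines.append("MUST-PRESERVE CROSS-REFERENCES:")
        lines.extend(f"- {item}" for item in must_preserve)
    lines.append(f"CURRENT SUB-OBJECTIVE: {sub_objective}")
    lines.append("AVAILABLE TOOLS: " + ", ".join(tools))
    return "\n".join(lines)


context = build_segment_context(
    objective="Review 500 supplier contracts for liability risks",
    summary="Segments 1-3 found 12 uncapped indemnities.",
    sub_objective="Review contracts 301-400",
    tools=["document_reader", "clause_search"],
    must_preserve=["Clause 12 defines 'Affiliate'; later clauses rely on it"],
)
print(context)
```

Keeping the must-preserve items outside the summarised text is one possible answer to the nuance-loss risk noted above; whether it is sufficient depends on the task type.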

Heartbeat and Progress Monitoring

The long-running agent emits a heartbeat event to the monitoring system at the completion of each segment and at configurable intervals within a segment. The heartbeat includes: current segment number, total segments estimated, cost consumed so far, cost projected to completion (cost consumed so far plus average cost per segment × remaining segments), elapsed time, and an ETA for completion. The Heartbeat Monitor triggers alerts if heartbeat events are not received within the expected interval, which indicates a stuck or crashed agent.
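The heartbeat payload and the staleness check can be sketched as below. This assumes the heartbeat is emitted at segment completion, so `segment_number` equals the count of completed segments; the class and function names are hypothetical, and the 2× tolerance mirrors the heartbeat SLO in the Operational Considerations section.

```python
from dataclasses import dataclass


@dataclass
class Heartbeat:
    segment_number: int      # segments completed so far (assumed >= 1)
    total_segments: int
    cost_consumed: float
    elapsed_seconds: float

    @property
    def projected_cost(self) -> float:
        """Cost so far plus average cost per segment times remaining segments."""
        average = self.cost_consumed / self.segment_number
        remaining = self.total_segments - self.segment_number
        return self.cost_consumed + average * remaining

    @property
    def eta_seconds(self) -> float:
        """Remaining time, assuming future segments take the average time so far."""
        average = self.elapsed_seconds / self.segment_number
        return average * (self.total_segments - self.segment_number)


def is_stale(last_heartbeat_ts: float, now_ts: float,
             expected_interval: float, tolerance: float = 2.0) -> bool:
    """True when no heartbeat arrived within tolerance x the expected interval."""
    return (now_ts - last_heartbeat_ts) > tolerance * expected_interval


hb = Heartbeat(segment_number=4, total_segments=10,
               cost_consumed=8.0, elapsed_seconds=7200.0)
print(hb.projected_cost, hb.eta_seconds)
print(is_stale(last_heartbeat_ts=0.0, now_ts=650.0, expected_interval=300.0))
```

A simple average is the projection named in the text; a production system might weight recent segments more heavily once task history is available.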

Human Check-in Points

The task plan defines human check-in points, typically at task creation (human reviews and approves the decomposition plan), at significant milestones (e.g., 50% completion), and at completion. At check-in points, the long-running agent pauses execution (using the checkpoint mechanism from EAAPL-AGT005), delivers the partial results and a progress summary to the human via a notification, and waits for human acknowledgment or instruction. The human can: approve and resume, redirect (modify the remaining segment plan), or cancel. This implements EU AI Act Art. 14 human oversight for high-risk long-running tasks.
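The three human responses at a check-in point map naturally onto a small state transition. The sketch below is illustrative only: the state names (`awaiting_checkin`, `running`, `cancelled`) and the `apply_decision` function are assumptions, not part of the pattern's specification.

```python
from enum import Enum
from typing import Dict, List, Optional


class Decision(Enum):
    APPROVE = "approve"
    REDIRECT = "redirect"
    CANCEL = "cancel"


def apply_decision(task: Dict, decision: Decision,
                   revised_plan: Optional[List[str]] = None) -> Dict:
    """Return the updated task state after a human check-in decision."""
    if task["state"] != "awaiting_checkin":
        raise ValueError("task is not paused at a check-in point")
    updated = dict(task)
    if decision is Decision.APPROVE:
        updated["state"] = "running"
    elif decision is Decision.REDIRECT:
        if not revised_plan:
            raise ValueError("redirect requires a revised remaining plan")
        updated["remaining_plan"] = list(revised_plan)
        updated["state"] = "running"
    else:
        updated["state"] = "cancelled"
    return updated


paused = {"task_id": "t-1", "state": "awaiting_checkin",
          "remaining_plan": ["4", "5", "6"]}
redirected = apply_decision(paused, Decision.REDIRECT, revised_plan=["4", "6"])
print(redirected["state"], redirected["remaining_plan"])
```

Rejecting decisions for tasks that are not paused keeps the audit trail unambiguous: every recorded human decision corresponds to an actual check-in.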

Cost Budget and Kill Switch

Before execution begins, the calling system specifies a cost budget (maximum token spend for the task). The Cost Controller monitors cumulative spend at each segment boundary. If projected cost-to-completion exceeds the budget, the Cost Controller pauses the task and notifies the human with the current partial results and a cost projection. The human can approve budget extension or accept the partial results. A hard kill switch (emergency stop) is available to humans at any time, delivering immediately available partial results and a clean task termination.
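A segment-boundary budget check could look like the sketch below. The `BudgetDecision` type, the action names, and the `kill_switch` flag are hypothetical; the projection uses the same average-cost-per-segment approach described for heartbeats.

```python
from dataclasses import dataclass


@dataclass
class BudgetDecision:
    action: str             # "continue", "pause_and_notify", or "terminate"
    projected_total: float


def check_budget(cost_consumed: float, completed: int, total: int,
                 budget: float, kill_switch: bool = False) -> BudgetDecision:
    """Evaluate the budget at a segment boundary."""
    if kill_switch:
        # The emergency stop takes priority over any projection.
        return BudgetDecision("terminate", cost_consumed)
    if completed == 0:
        # No history yet, so no projection is possible.
        return BudgetDecision("continue", cost_consumed)
    average = cost_consumed / completed
    projected = cost_consumed + average * (total - completed)
    action = "pause_and_notify" if projected > budget else "continue"
    return BudgetDecision(action, projected)


over = check_budget(30.0, completed=3, total=10, budget=80.0)
within = check_budget(30.0, completed=3, total=10, budget=120.0)
stopped = check_budget(30.0, completed=3, total=10, budget=120.0, kill_switch=True)
print(over, within, stopped)
```

Pausing on the projection rather than on actual spend gives the human time to decide before the ceiling is reached.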

Partial Result Delivery

Each completed segment's output is written to the Partial Result Store immediately upon completion. A Partial Result Aggregator compiles the running partial results into a human-consumable intermediate deliverable. The calling system can request partial results at any time via the management API, regardless of whether the task is still running. This enables progressive value delivery: a due diligence review that identifies 20 critical issues in the first 30% of documents is actionable immediately, before the full 500-document review completes.
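The write-on-completion and query-at-any-time behaviour can be illustrated with an in-memory stand-in for the Partial Result Store. The class, its method names, and the snapshot shape are assumptions; a real deployment would use one of the persistence options listed under Components.

```python
from typing import Dict, List


class PartialResultStore:
    """In-memory stand-in for a durable partial result store (illustrative only)."""

    def __init__(self) -> None:
        self._results: Dict[str, Dict[str, List[str]]] = {}

    def write(self, task_id: str, segment_id: str, findings: List[str]) -> None:
        """Record a completed segment's output immediately."""
        self._results.setdefault(task_id, {})[segment_id] = list(findings)

    def snapshot(self, task_id: str, plan_order: List[str]) -> Dict:
        """Aggregate whatever is available, in plan order, even mid-task."""
        done = self._results.get(task_id, {})
        completed = [sid for sid in plan_order if sid in done]
        return {
            "completed_segments": completed,
            "total_segments": len(plan_order),
            "findings": [f for sid in completed for f in done[sid]],
        }


store = PartialResultStore()
store.write("t-1", "2", ["Uncapped indemnity in contract 117"])
store.write("t-1", "1", ["Missing termination clause in contract 14"])
report = store.snapshot("t-1", plan_order=["1", "2", "3", "4"])
print(report)
```

Ordering findings by the segment plan rather than by arrival time keeps the intermediate deliverable stable if segments are retried or run in parallel.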

5. Architecture Diagram

flowchart TD
    subgraph Input["Task Initiation"]
        A[Long Task Request]
        B[Task Planner]
    end
    subgraph Execution["Async Execution Engine"]
        C[Task Queue]
        D[Segment Worker]
        E[Cost Controller]
    end
    subgraph Storage["State and Results"]
        F[(Checkpoint Store)]
        G[(Partial Result Store)]
    end
    A --> B
    B -->|segment plan + human approval| C
    C --> D
    D -->|checkpoint each segment| F
    D -->|segment output| G
    D --> E
    E -->|over budget| B
    F -->|recover on failure| D
    G -->|final aggregation| A
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#f0fdf4,stroke:#22c55e
    style C fill:#f0fdf4,stroke:#22c55e
    style D fill:#f0fdf4,stroke:#22c55e
    style E fill:#f3e8ff,stroke:#a855f7
    style F fill:#fef9c3,stroke:#eab308
    style G fill:#fef9c3,stroke:#eab308

6. Components

Component	Type	Responsibility	Technology Options	Criticality
Task Planner	Orchestration / AI	Decomposes long task into segment plan; revises plan adaptively	LLM-based planner; rule-based decomposition for structured tasks	Critical
Task Plan Store	Persistence	Stores segment plan; tracks segment completion status	DynamoDB, PostgreSQL, Azure Cosmos DB	Critical
Task Queue	Message Queue	Durable queue for segment execution; dead-letter for failed segments	SQS, Azure Service Bus, Google Pub/Sub, Kafka	Critical
Segment Worker	Compute	Executes each segment as a standard agent loop (EAAPL-AGT001)	Containerised agent runtime; ECS, AKS, Cloud Run	Critical
Context Summariser	AI Component	Compresses prior segment results into carry-forward context	LLM summarisation; structured summary template per task type	High
Partial Result Store	Persistence	Stores completed segment outputs; supports partial result queries	PostgreSQL, S3 + DynamoDB index, Cosmos DB	High
Partial Result Aggregator	Orchestration	Compiles segment outputs into intermediate deliverable	Custom; LLM-assisted for natural language outputs	Medium
Heartbeat Emitter	Monitoring	Emits heartbeat events at configurable intervals	Custom; part of segment worker	High
Heartbeat Monitor	Monitoring	Detects missed heartbeats; triggers recovery	CloudWatch Alarms, Azure Monitor, custom	High
Cost Controller	Governance	Tracks cumulative cost; projects to completion; enforces budget ceiling	Custom + LLM provider usage APIs	Critical
Management API	Operations	Exposes pause, redirect, cancel, status, partial-result endpoints	REST API; API Gateway + Lambda/Functions	High
Human Check-in Queue	Human Oversight	Delivers milestone notifications to human approvers; collects decisions	Email, Slack, Teams, custom approval portal	High
Checkpoint Store	Recovery	Stores segment-level checkpoints (EAAPL-AGT005)	Redis, DynamoDB, Cosmos DB	Critical
Deadline Manager	SLA	Monitors task ETA vs. deadline; alerts if deadline at risk	Custom scheduler + ETA calculation	Medium

7. Data Flow

Task Initiation

Step	Actor	Action	Output
1	Calling System	Submits long task: instruction, corpus reference, cost_budget, deadline, checkin_points	Task request
2	Task Planner	Analyses task; decomposes into N segments; assigns cost estimate per segment; identifies checkin milestones	Segment plan: [{segment_id, sub_objective, input_scope, estimated_cost, checkin: bool}]
3	Human Check-in	Delivers plan to human for review; awaits approval	Approved / Modified plan
4	Task Queue	Enqueues segment 1 for execution	Segment 1 in queue

Segment Execution

Step	Actor	Action	Output
1	Segment Worker	Dequeues segment N; loads carry-forward context from Context Summariser	Assembled context
2	Agent Loop	Executes standard agent loop for segment N scope	Segment N result
3	Partial Result Store	Writes segment N result	Partial result record
4	Cost Controller	Updates cumulative cost; projects remaining cost	Cost status
5	Heartbeat Emitter	Emits segment completion heartbeat	Heartbeat event
6	Checkpoint	Writes segment N checkpoint	Recovery state
7	Context Summariser	Produces carry-forward summary including segment N findings	Updated cross-segment summary
8	Checkin Gate	If checkin milestone: pause; notify human; await instruction	Human instruction
9	Task Queue	Enqueues segment N+1 (or revised plan if redirected)	Next segment queued
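The nine segment execution steps above can be sketched as a single function with injected dependencies. Everything here is a simplified stand-in: `run_segment`, the `state` dictionary, and the fake `execute` and `summarise` callables are hypothetical, and real implementations would call the agent loop, durable stores, and the task queue.

```python
from typing import Callable, Dict, Tuple


def run_segment(segment: Dict, summary: str,
                execute: Callable[[Dict], Tuple[str, float]],
                summarise: Callable[[str, str], str],
                state: Dict) -> Tuple[str, str]:
    """Run one segment and return (new carry-forward summary, next action)."""
    sid = segment["segment_id"]
    context = {"summary": summary,
               "sub_objective": segment["sub_objective"]}           # step 1
    result, cost = execute(context)                                  # step 2
    state["results"][sid] = result                                   # step 3
    state["cost"] += cost                                            # step 4
    state["heartbeats"].append((sid, state["cost"]))                 # step 5
    state["checkpoints"][sid] = "completed"                          # step 6
    new_summary = summarise(summary, result)                         # step 7
    # Steps 8-9: pause for a human at milestones, otherwise queue the next segment.
    next_action = "await_human" if segment.get("checkin") else "enqueue_next"
    return new_summary, next_action


def fake_execute(context: Dict) -> Tuple[str, float]:
    return f"done: {context['sub_objective']}", 1.5


def fake_summarise(previous: str, result: str) -> str:
    return (previous + " | " + result).strip(" |")


state = {"results": {}, "cost": 0.0, "heartbeats": [], "checkpoints": {}}
segment = {"segment_id": "1", "sub_objective": "batch A", "checkin": True}
summary, action = run_segment(segment, "", fake_execute, fake_summarise, state)
print(summary, action, state["cost"])
```

Writing the partial result before the checkpoint, as the table orders it, means a crash between the two steps at worst repeats a segment rather than losing its output.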

Error Flow

Error	Detection	Recovery
Segment worker crashes mid-execution	Missed heartbeat	Heartbeat monitor triggers recovery; resume from last checkpoint within segment
Segment message fails delivery or processing	Message arrives in dead-letter queue	DLQ alarm; reprocess segment from last checkpoint
LLM provider outage	Segment worker invocation failure	Exponential backoff retry; failover to secondary LLM provider if configured; alert
Cost overrun projection	Cost Controller	Pause task; notify human; await budget decision
Deadline at risk	Deadline Manager	Alert human; option to increase parallelism or reduce scope
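The LLM provider outage recovery in the table above (exponential backoff, then failover) might be sketched as follows. The function name, the retry limits, and the use of `ConnectionError` as the failure signal are assumptions; providers are modelled as plain callables rather than any real SDK, and jitter is omitted for brevity.

```python
import time
from typing import Callable, List, Optional


def call_with_backoff(providers: List[Callable[[str], str]], prompt: str,
                      max_attempts: int = 4, base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> str:
    """Try each provider in order, retrying with exponential backoff before failover."""
    last_error: Optional[Exception] = None
    for provider in providers:
        for attempt in range(max_attempts):
            try:
                return provider(prompt)
            except ConnectionError as exc:
                last_error = exc
                if attempt < max_attempts - 1:
                    # Delay doubles on every retry: base, 2x base, 4x base, ...
                    sleep(base_delay * 2 ** attempt)
    raise RuntimeError("all providers failed") from last_error


def primary(prompt: str) -> str:
    raise ConnectionError("primary provider unavailable")


def secondary(prompt: str) -> str:
    return f"ok: {prompt}"


delays: List[float] = []
answer = call_with_backoff([primary, secondary], "summarise", sleep=delays.append)
print(answer, delays)
```

Injecting `sleep` keeps the sketch testable without real waiting; in a segment worker the final `RuntimeError` would be the point at which the alert in the table is raised.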

8. Security Considerations

Long-Running Identity Tokens

Expiry of the authentication tokens the agent uses to access external tools must not interrupt a multi-hour task
Implement token refresh within the segment worker; use long-lived service account credentials, not short-lived user tokens
Dynamic secrets (auto-rotating) must have rotation intervals longer than the maximum task duration

Data Retention of In-Progress Tasks

Partial results contain sensitive intermediate data; they must be encrypted and access-controlled
Partial results for cancelled tasks must be cleaned up according to the data retention policy
Cross-segment context summaries may contain PII extracted from processed documents; apply the same classification and retention rules as the source data

OWASP LLM Top 10

OWASP LLM Risk	Long-Running Applicability	Mitigation
LLM08 Excessive Agency	A long-running agent operating autonomously for hours may drift from its initial scope without human awareness	Mandatory human check-in at milestones; segment plan visible to humans from task creation; Management API enables real-time course correction at any point
LLM04 Model Denial of Service	Runaway long tasks consume excessive compute and API quotas	Hard cost ceiling; segment count limit; deadline enforcement
LLM01 Prompt Injection	Documents processed by the agent may contain injected instructions	Content sanitisation on all ingested documents before task planning and segment execution
LLM09 Overreliance	Business stakeholders may trust long-running agent outputs without appropriate scrutiny	Output metadata includes confidence and completeness indicators; human check-in at completion is mandatory for high-stakes tasks

9. Governance Considerations

Human Oversight for Long-Running Tasks

All long-running tasks must have a named human owner who is notified of check-in points and receives partial results
Tasks exceeding a configured duration (default: 4 hours) automatically escalate to the human owner's manager
No task may run longer than 72 hours without a human re-approval of the segment plan
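The duration-based oversight rules above can be expressed as a small policy check. This is a sketch under assumptions: the function and action names are hypothetical, and the thresholds default to the 4-hour escalation and 72-hour re-approval values stated in the list.

```python
from typing import List


def oversight_actions(elapsed_hours: float, hours_since_plan_approval: float,
                      escalate_after: float = 4.0,
                      reapprove_after: float = 72.0) -> List[str]:
    """Return the governance actions due for a running task."""
    actions: List[str] = []
    if elapsed_hours > escalate_after:
        actions.append("escalate_to_owner_manager")
    if hours_since_plan_approval >= reapprove_after:
        # The task must not continue past this point without a fresh approval.
        actions.append("pause_for_plan_reapproval")
    return actions


print(oversight_actions(elapsed_hours=2.0, hours_since_plan_approval=2.0))
print(oversight_actions(elapsed_hours=6.0, hours_since_plan_approval=6.0))
print(oversight_actions(elapsed_hours=80.0, hours_since_plan_approval=72.0))
```

Tracking time since the last plan approval separately from total elapsed time allows a re-approval to reset the 72-hour clock without resetting the escalation history.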

Governance Artefacts

Artefact	Owner	Frequency	Purpose
Task Execution Log	Platform Engineering	Per task	Complete segment-by-segment execution record with costs, durations, and human decisions
Cost Budget Report	FinOps	Monthly	Aggregate long-task spend vs. budget; overrun analysis
Missed Deadline Report	Operations	Monthly	Tasks that exceeded deadline; root cause analysis
Human Check-in Audit	AI Governance	Quarterly	Review of human check-in compliance; decision quality audit

10. Operational Considerations

SLOs

SLO	Target	Window	Alert
Heartbeat interval compliance	100% heartbeats within 2× expected interval	Per task	Any missed heartbeat triggers P2
Task completion rate	≥ 95% of started tasks complete	Monthly	< 90% triggers investigation
Segment retry rate	≤ 5% of segments require retry	24-hour rolling	> 10% indicates infrastructure instability
Human check-in response time	≤ 4 hours for milestone approvals	Per check-in	> 8 hours triggers escalation to task owner's manager

Capacity

Segment workers are stateless containers; horizontal scaling is bounded by LLM provider quota and tool API rate limits
Estimate: 1 worker per 5 concurrent segments for 30-minute segments; scale up to 1 worker per concurrent segment for 5-minute segments
Partial result storage grows with task count × average output size; provision for 30-day retention of all partial results

11. Cost Considerations

Cost Drivers

Cost Driver	Example	Control
Total token consumption	500-doc due diligence: ~5M tokens	Budget ceiling; scope reduction option
Context summarisation overhead	5–10% of total tokens for summaries	Efficient summarisation prompt; smaller model for summaries
Segment retry cost	Redundant work on retry	Checkpoint granularity; reliable infrastructure
Long-running compute	Worker idle time between segments	Event-driven scaling; scale-to-zero between segments

Indicative Cost Range (USD)

Task Type	Scale	Estimated Token Count	Estimated LLM Cost
Contract review (50 documents)	Medium	~1.5M tokens	$15–60
Contract review (500 documents)	Large	~12M tokens	$120–480
Codebase security audit (100K LOC)	Large	~8M tokens	$80–320
Research synthesis (200 papers)	Large	~6M tokens	$60–240
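Dividing each cost bound in the table by its token count gives an implied blended rate of roughly $10 to $40 per million tokens. That rate is derived here from the table's own figures and is not stated in the source; actual provider pricing varies by model and by input/output mix. A minimal sketch of the arithmetic, with a hypothetical `cost_range` helper:

```python
from typing import Tuple


def cost_range(tokens: int, low_per_million: float = 10.0,
               high_per_million: float = 40.0) -> Tuple[float, float]:
    """Estimate a (low, high) USD cost range from a token count.

    The default rates are back-calculated from the indicative cost table
    and are assumptions, not published prices.
    """
    millions = tokens / 1_000_000
    return millions * low_per_million, millions * high_per_million


print(cost_range(1_500_000))    # contract review, 50 documents
print(cost_range(12_000_000))   # contract review, 500 documents
print(cost_range(8_000_000))    # codebase security audit
```

The same function can seed an initial cost budget at task creation, to be refined once per-segment cost history is available.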

12. Trade-Off Analysis

Task Decomposition Options

Option	Description	Pros	Cons	Best For
A: LLM-Planned Segmentation (Recommended)	Task Planner uses LLM to decompose task into segments	Adaptive; handles irregular corpora	Planner itself consumes tokens; plan quality depends on model	Complex, variable tasks
B: Rule-Based Segmentation	Fixed rules decompose by document count, page count, or time estimate	Predictable; no LLM planning overhead	Inflexible; poor fit for varied task types	Well-structured, homogeneous tasks
C: User-Defined Milestones	Human specifies segment boundaries upfront	Maximum human control	Requires human upfront effort; may mis-estimate	Regulated tasks where human defines scope
D: Workflow Engine Native	Temporal or Durable Functions handle segmentation	Built-in persistence and retry; mature tooling	Less LLM-native; segment boundaries are code-defined	Engineering-intensive regulated workloads

Architectural Tensions

Tension	Left Pole	Right Pole	Balance
Segment granularity vs. Context continuity	Many small segments — low risk per segment	Few large segments — better cross-segment reasoning	20–30 minute segments balancing context continuity and recovery granularity
Cost certainty vs. Completeness	Hard budget ceiling — task may not complete	Best-effort — may overrun budget	Budget ceiling with human escalation at 80% spend; partial results delivered at ceiling
Human oversight frequency vs. Task latency	Check-in after every segment	Single check-in at completion	Risk-tiered: check-in at task creation, major milestones, and completion

13. Failure Modes

Failure Mode	Likelihood	Impact	Detection	Recovery
Agent drifts from task scope over many segments	Medium	High — wasted work; wrong outputs	Human check-in reveals drift; output quality monitoring	Re-anchor with task objective in carry-forward context; human redirect
Cross-segment context summary loses critical information	Medium	High — logical inconsistencies in final output	Human review of final output; quality scoring	Preserve critical references explicitly in summary template; test on sample tasks
Task never terminates (segment count grows adaptively)	Low	High — cost overrun	Segment count limit alert; cost ceiling	Hard limit on total segment count; cost ceiling enforcement
Partial results delivered to wrong principal	Very Low	Critical — data breach	Access control on partial result store	IAM on partial result endpoints; audit of all retrievals
Infrastructure change invalidates checkpoint schema	Low	Medium — recovery fails	Checkpoint deserialisation failure	Schema versioning; migration function
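The schema versioning and migration recovery listed for the last failure mode can be sketched as a chain of upgrade functions. The checkpoint fields (`cost`, `cost_consumed`, `schema_version`) and the version numbers are invented for illustration; they are not the checkpoint format defined by EAAPL-AGT005.

```python
from typing import Callable, Dict

CURRENT_VERSION = 2


def _v1_to_v2(checkpoint: Dict) -> Dict:
    """Hypothetical migration: v2 renamed 'cost' to 'cost_consumed'."""
    upgraded = dict(checkpoint)
    upgraded["cost_consumed"] = upgraded.pop("cost", 0.0)
    upgraded["schema_version"] = 2
    return upgraded


MIGRATIONS: Dict[int, Callable[[Dict], Dict]] = {1: _v1_to_v2}


def load_checkpoint(raw: Dict) -> Dict:
    """Upgrade a stored checkpoint step by step to the current schema version."""
    checkpoint = dict(raw)
    # Checkpoints written before versioning existed are treated as version 1.
    version = checkpoint.get("schema_version", 1)
    while version < CURRENT_VERSION:
        checkpoint = MIGRATIONS[version](checkpoint)
        version = checkpoint["schema_version"]
    if version != CURRENT_VERSION:
        raise ValueError(f"unsupported checkpoint schema version {version}")
    return checkpoint


old = {"segment_id": "7", "cost": 3.2}
restored = load_checkpoint(old)
print(restored)
```

Failing loudly on an unknown newer version is deliberate: resuming a multi-hour task from a misread checkpoint is likely to be worse than pausing for investigation.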

14. Regulatory Considerations

APRA CPS 230

Long-running agents supporting material business services require RTO/RPO; the checkpoint + segmentation architecture enables sub-segment RTO
Multi-hour tasks interacting with critical systems require operational risk assessment and business impact analysis

EU AI Act

Art. 14 (Human Oversight): mandatory human check-ins at task creation and significant milestones implement the "meaningful human oversight" requirement for high-risk long-running agents
For high-risk AI systems: the complete task execution log (all segments, costs, human decisions, partial results) is a required audit artefact

15. Reference Implementations

AWS

Component	Service
Task Queue	Amazon SQS (FIFO with DLQ)
Segment Worker	AWS ECS Fargate (event-triggered)
Task Plan + Partial Results	Amazon DynamoDB
Workflow	AWS Step Functions (for structured decomposition)
Heartbeat Monitor	CloudWatch Alarms
Human Check-in	Amazon SNS + custom approval portal or AWS Step Functions human task

Azure

Component	Service
Task Queue	Azure Service Bus
Segment Worker	Azure Container Apps
Task Plan + Partial Results	Azure Cosmos DB
Workflow	Azure Durable Functions
Human Check-in	Azure Logic Apps + Adaptive Cards (Teams)

On-Premises

Component	Technology
Task Queue	Apache Kafka or RabbitMQ
Segment Worker	Kubernetes Jobs
Task Plan + Partial Results	PostgreSQL
Workflow	Temporal OSS

16. Related Patterns

Pattern	ID	Relationship Type	Notes
Single Agent Pattern	EAAPL-AGT001	Extended By	Each segment is a single agent loop execution
Agent Checkpoint and Recovery	EAAPL-AGT005	Depends On	Checkpointing is mandatory for multi-hour tasks
Agent Cost Governance	EAAPL-AGT010	Integrates With	Budget ceiling and kill switch are cost governance capabilities
Human-in-the-Loop Agent	EAAPL-MAG003	Extends	Human check-in at milestones is a specialised application of HITL
Supervisor Agent	EAAPL-MAG002	Related	Supervisor can orchestrate long-running segments; alternative decomposition model

17. Maturity Assessment

Overall Maturity: Proven

Dimension	Score (1–5)	Evidence
Core Technology (Queuing + Checkpointing)	5	Durable queues and checkpointing are mature distributed systems patterns
Context Summarisation Quality	3	Cross-segment context compression is a known challenge; LLM summarisation quality varies
Human Check-in UX	3	Tooling for human review of multi-hour tasks improving; no standard UX pattern yet
Cost Estimation Accuracy	3	Per-segment cost estimates improve with task history; initial estimates are rough
Adaptive Re-planning	2	Adaptive segment plan revision is emerging; limited production evidence

18. Revision History

Version	Date	Author	Changes
1.0	2024-06-01	Architecture Board	Initial publication
1.1	2024-10-15	Platform Engineering	Added adaptive re-planning; deadline manager; partial result aggregator
1.2	2025-03-01	Architecture Board	Added EU AI Act Art. 14 mapping; human check-in escalation policy; cost estimation table
