EAAPL: Enterprise AI Architecture Pattern Library

[EAAPL-AGT005] Agent Checkpoint and Recovery

Category: Agentic AI
Sub-category: Reliability Architecture
Version: 1.4
Maturity: Proven
Tags: checkpoint, durable-execution, idempotency, recovery, workflow-orchestration, state-serialisation, human-pause-resume
Regulatory Relevance: APRA CPS 230 (Operational Resilience), ISO 22301 (BCM), NIST AI RMF (MANAGE 4.1), EU AI Act (Art. 9, Art. 14)


1. Executive Summary

The Agent Checkpoint and Recovery Pattern defines the durable execution architecture that enables AI agents running long-horizon tasks to survive infrastructure failures, model provider outages, and intentional human pauses without losing progress or re-executing actions that have already completed. Without checkpointing, a 45-minute agent execution that fails at iteration 38 must restart from the beginning — at full cost, with risk of re-triggering side effects (duplicate emails, duplicate payments, duplicate database records).

For CIO/CTO audiences: this pattern is the difference between a production-grade AI agent and a fragile experiment. In financial services, a reconciliation agent that re-executes after failure without idempotency protection could create duplicate transactions. In healthcare, a treatment plan agent that replays all its tool calls after a crash could submit duplicate orders. This pattern addresses both failure modes by ensuring that each side-effecting action takes effect only once (completed calls are served from the checkpoint, and in-flight calls are retried under the same idempotency key), that failures are recoverable with at most one iteration of lost work, and that humans can pause and resume agent tasks at defined checkpoints. It is a prerequisite for deploying agents on tasks where the consequence of re-execution is unacceptable.


2. Problem Statement

Business Problem

Long-horizon agent tasks — multi-document review, multi-step research, complex data processing — take minutes to hours. In any distributed system, infrastructure failures during this window are not exceptional events; they are expected occurrences. An agent with no recovery mechanism treats these failures as task failures, wasting all prior work and risking incorrect outcomes from partial re-execution.

Technical Problem

LLM agent loops are inherently stateful but execute on stateless infrastructure. The state — conversation history, tool call results, partial outputs, memory references — lives in process memory. A process crash loses all of it. Restarting the task from scratch without idempotency guarantees on tool calls causes duplicate side effects on non-idempotent external systems (APIs, databases, message queues).

Symptoms of Absence

  • Failed agent tasks require complete restart from scratch, consuming full token and compute budget again
  • Duplicate records appear in downstream systems after agent failures (duplicate emails, duplicate API calls)
  • Human-pause functionality does not exist; humans cannot safely interrupt a running agent without losing all progress
  • A single LLM provider timeout causes a cascading task failure with no recovery
  • Operations team has no visibility into how far through a long task a failed agent had progressed

Cost of Inaction

  • Financial: Re-execution from scratch on long tasks doubles or triples LLM token costs per failure event
  • Risk: Duplicate side effects from non-idempotent re-execution create data integrity issues and potential regulatory events
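  • Operational: APRA CPS 230 requires tolerance levels for critical operations, including maximum disruption time and maximum data loss (comparable to RTO/RPO); an agent with no recovery cannot meet any realistic recovery-time tolerance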
  • Human Oversight: Inability to pause and resume means human-in-the-loop controls cannot be applied mid-task

3. Context

When to Apply

  • Agent tasks that routinely exceed 5 minutes of wall-clock time
  • Tasks that invoke non-idempotent external systems (payment APIs, email APIs, database mutations)
  • Tasks that require human approval at intermediate steps (see EAAPL-MAG003)
  • Environments with regulated RTO requirements (APRA CPS 230, ISO 22301)
  • Tasks where partial results have value (a 90%-complete document review is useful even if the last 10% failed)

When NOT to Apply

  • Tasks that complete in under 60 seconds with a single LLM call and ≤3 tool calls (overhead not justified)
  • Fully idempotent tasks where re-execution from scratch is safe and cost-acceptable
  • Tasks where the checkpoint store introduces unacceptable latency on each iteration

Prerequisites

  • Durable, low-latency state store (Redis, DynamoDB, Cosmos DB, or equivalent) accessible from agent runtime
  • Idempotency key generation per tool call
  • Workflow orchestration integration (optional but recommended for complex flows)
  • Human approval queue infrastructure (if pause/resume is needed for HITL gates)

Industry Applicability

| Industry | Use Case | Recovery RTO Requirement | Checkpointing Priority |
| --- | --- | --- | --- |
| Financial Services | Multi-document reconciliation, regulatory report generation | Minutes | Critical |
| Healthcare | Clinical summary generation, multi-system data aggregation | Minutes | Critical |
| Legal / Professional Services | Multi-contract review, due diligence | Hours | High |
| Technology / SaaS | Large codebase refactoring agent, multi-repo analysis | Hours | High |
| Government | Complex case assessment, multi-agency data aggregation | Hours | High |

4. Architecture Overview

The Agent Checkpoint and Recovery Pattern introduces three core mechanisms to the baseline agent loop: state serialisation at each checkpoint, idempotency keys on all tool calls, and a recovery protocol that replays the logical execution plan while skipping already-completed actions.

Why checkpoint at every iteration rather than every N iterations? The answer is the cost of re-execution. Each LLM iteration consumes tokens; each tool call may have side effects. Re-executing N iterations to restore state costs N × (per-iteration token cost), plus the risk of repeating side effects. Checkpointing at every iteration ensures that at most one iteration is lost to recovery, regardless of when the failure occurs. For short iterations (sub-second), the checkpoint write overhead is small relative to LLM inference latency. For expensive iterations (multi-second tool calls), the checkpoint write is even easier to justify.
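The trade-off described above can be put into numbers with a small sketch. The function names and the example figures are illustrative assumptions, not values from this pattern; the only relationships taken from the text are that up to N iterations can be lost when checkpointing every N iterations, and that write volume falls as N grows.

```python
def worst_case_replay_cost(interval: int, cost_per_iteration: float) -> float:
    """Upper bound on the cost of iterations that must be redone after a crash."""
    if interval < 1:
        raise ValueError("interval must be at least 1")
    # Up to `interval` iterations can sit between the last checkpoint and the crash.
    return interval * cost_per_iteration


def checkpoint_write_count(total_iterations: int, interval: int) -> int:
    """Number of checkpoint writes for a task that runs to completion."""
    if interval < 1:
        raise ValueError("interval must be at least 1")
    return total_iterations // interval


if __name__ == "__main__":
    # Hypothetical task: 40 iterations at 0.5 cost units each.
    for interval in (1, 5, 10):
        print(
            f"every {interval}: "
            f"worst-case replay={worst_case_replay_cost(interval, 0.5)}, "
            f"writes={checkpoint_write_count(40, interval)}"
        )
```

The sketch ignores side effects on purpose: the replay cost it reports is a lower bound on the real damage when replayed iterations touch non-idempotent systems.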

State Serialisation: At the end of each loop iteration, the agent's execution state is serialised to a durable checkpoint store. The state object is a versioned JSON document containing: task_id, current iteration number, tool call history (IDs, arguments, results), memory references (episodic/semantic record IDs), current context window snapshot (or a reference to it), partial results, and metadata (timestamps, token consumption, cost so far). The serialisation is an atomic write: either the full state is written or nothing is written. Partial checkpoint states are detected and rejected during recovery.
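A minimal sketch of such a state object follows, assuming a SHA-256 digest envelope as the way partial writes are detected. The field names follow the list above where possible; `SCHEMA_VERSION`, `context_ref`, and the envelope layout are hypothetical choices made for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional

SCHEMA_VERSION = "1.0"  # hypothetical version label


@dataclass
class CheckpointState:
    task_id: str
    iteration: int
    tool_history: List[Dict[str, Any]] = field(default_factory=list)
    memory_refs: List[str] = field(default_factory=list)
    context_ref: Optional[str] = None  # reference to a stored context snapshot
    partial_output: str = ""
    token_count: int = 0
    status: str = "running"
    schema_version: str = SCHEMA_VERSION


def serialise(state: CheckpointState) -> str:
    """Wrap the JSON body in an envelope carrying a digest of that body."""
    body = json.dumps(asdict(state), sort_keys=True)
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return json.dumps({"body": body, "sha256": digest})


def deserialise(raw: str) -> CheckpointState:
    """Reject truncated or altered checkpoints before restoring from them."""
    try:
        envelope = json.loads(raw)
        body = envelope["body"]
        expected = envelope["sha256"]
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        raise ValueError("checkpoint envelope is malformed or truncated") from exc
    if hashlib.sha256(body.encode("utf-8")).hexdigest() != expected:
        raise ValueError("checkpoint digest mismatch")
    return CheckpointState(**json.loads(body))
```

The digest only detects damage; atomicity itself still has to come from the store (for example a single-item conditional write).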

Checkpoint Store Design: The checkpoint store must provide durability (survives process and node failures), low write latency (ideally ≤10ms per checkpoint write, so that it does not dominate iteration latency), and support for conditional writes (CAS, compare-and-swap) to prevent concurrent checkpoint writes from two instances of the same task. Redis with AOF persistence, DynamoDB with conditional writes, or Azure Cosmos DB with optimistic concurrency are appropriate. For the highest durability requirements, a write-ahead log pattern (checkpoint written to a durable log first, then to the fast store) provides stronger guarantees.
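The conditional-write requirement can be illustrated with an in-memory stand-in. This is a sketch of the compare-and-swap contract only; `InMemoryCheckpointStore` and `CasConflict` are hypothetical names, and a real deployment would rely on the store's own conditional write rather than a process-local lock.

```python
import threading
from typing import Dict, Optional, Tuple


class CasConflict(Exception):
    """Raised when another writer has already advanced the checkpoint version."""


class InMemoryCheckpointStore:
    """Process-local stand-in for a store that supports conditional writes."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._data: Dict[str, Tuple[int, str]] = {}

    def read(self, task_id: str) -> Optional[Tuple[int, str]]:
        """Return (version, payload), or None if no checkpoint exists."""
        with self._lock:
            return self._data.get(task_id)

    def write(self, task_id: str, payload: str, expected_version: int) -> int:
        """Write only if the stored version equals `expected_version` (0 = absent)."""
        with self._lock:
            current = self._data.get(task_id)
            current_version = current[0] if current else 0
            if current_version != expected_version:
                raise CasConflict(
                    f"{task_id}: expected v{expected_version}, found v{current_version}"
                )
            new_version = current_version + 1
            self._data[task_id] = (new_version, payload)
            return new_version
```

A writer that receives `CasConflict` has learned that a second instance of the same task is running and should abort rather than retry the write.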

Idempotency Keys: Every tool call is issued with a unique idempotency key: a UUID generated at the time the call is first planned, persisted in the checkpoint state before the call is sent, and reused on replay. External APIs that support idempotency keys (for example Stripe, and other REST APIs that accept an idempotency header) deduplicate re-submissions with the same key, returning the original response rather than executing the operation again. For APIs that do not natively support idempotency keys, the checkpoint includes the tool result, and the recovery protocol skips re-calling the tool entirely, returning the stored result. This is the "result cache" pattern for recovery.
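The two behaviours above (key reuse and the result cache) can be sketched together. `IdempotencyManager`, the `call_id` scheme, and the record layout are assumptions for illustration; in a real agent the `records` dictionary would be part of the checkpoint state and would be persisted before the tool is invoked.

```python
import uuid
from typing import Any, Callable, Dict, Optional


class IdempotencyManager:
    """Tracks one idempotency key and one cached result per logical tool call."""

    def __init__(self, records: Optional[Dict[str, Dict[str, Any]]] = None) -> None:
        # `records` is the part of the checkpoint state this manager owns.
        self.records: Dict[str, Dict[str, Any]] = records if records is not None else {}

    def key_for(self, call_id: str) -> str:
        """Return the stored key for this call, generating it on first use."""
        if call_id not in self.records:
            self.records[call_id] = {
                "idempotency_key": str(uuid.uuid4()),
                "status": "pending",
                "result": None,
            }
        return self.records[call_id]["idempotency_key"]

    def execute(self, call_id: str, tool: Callable[[str], Any]) -> Any:
        """Run the tool once; on replay, return the cached result instead."""
        key = self.key_for(call_id)
        record = self.records[call_id]
        if record["status"] == "complete":
            return record["result"]  # result cache: the tool is not called again
        # A pending record means the call may have been sent before a crash,
        # so it is re-sent under the same key and the API can deduplicate it.
        result = tool(key)
        record["status"] = "complete"
        record["result"] = result
        return result
```

Passing the key into the tool callable keeps the manager independent of any particular API client; the callable decides whether the key goes into a header, a request field, or is ignored.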

Recovery Protocol: On task startup, the agent checks the checkpoint store for an existing checkpoint for the task_id. If found, it loads the checkpoint state, reconstructs the context (injecting the stored tool call history and partial results), and resumes from the iteration after the last checkpointed iteration. Tool calls in the history are marked as complete and their results are returned from the checkpoint rather than re-executed. This ensures that external systems see at-most-once execution semantics for non-idempotent calls.
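A compact sketch of this startup check follows, using a plain dictionary as a stand-in for the checkpoint store. `run_task`, `crash_before`, and the state layout are hypothetical; `crash_before` exists only to simulate a mid-task failure.

```python
import json
from typing import Callable, Dict, List, Optional


def run_task(
    task_id: str,
    store: Dict[str, str],
    step: Callable[[int, List[str]], str],
    total_iterations: int,
    crash_before: Optional[int] = None,
) -> List[str]:
    """Run or resume a task, writing a checkpoint after every iteration."""
    raw = store.get(task_id)
    if raw is None:
        state = {"task_id": task_id, "iteration": 0, "results": [], "status": "running"}
    else:
        state = json.loads(raw)  # restore: prior results are reused, not recomputed

    # Resume from the iteration after the last checkpointed one.
    for i in range(state["iteration"] + 1, total_iterations + 1):
        if crash_before == i:
            raise RuntimeError(f"simulated crash before iteration {i}")
        state["results"].append(step(i, state["results"]))
        state["iteration"] = i
        store[task_id] = json.dumps(state)

    state["status"] = "complete"
    store[task_id] = json.dumps(state)
    return state["results"]
```

Calling `run_task` a second time with the same `task_id` after a crash executes only the iterations that were not yet checkpointed.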

Human Pause and Resume: The checkpoint mechanism naturally supports human-controlled pause and resume. When a human sends a pause signal (via the management API), the Pause Controller sets a pause flag in the task state. The Termination Controller checks this flag at each iteration boundary and, when set, writes a checkpoint with status: paused and stops execution. The task remains in the checkpoint store, frozen in time. When the human sends a resume signal, the agent restarts from the paused checkpoint exactly as if recovering from a failure, with all prior context intact. This enables safe human review of partial results and context injection before resumption.
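The iteration-boundary check can be sketched as follows. The function and field names are illustrative assumptions; `pause_requested` stands in for reading the pause flag that the Pause Controller sets, and the task dictionary stands in for the checkpointed state.

```python
from typing import Any, Callable, Dict


def run_until_pause_or_done(
    task: Dict[str, Any],
    pause_requested: Callable[[], bool],
    step: Callable[[int], Any],
    total_iterations: int,
) -> Dict[str, Any]:
    """Advance the task, honouring the pause flag only between iterations."""
    while task["iteration"] < total_iterations:
        # The flag is checked at the boundary, never in the middle of an iteration.
        if pause_requested():
            task["status"] = "paused"
            return task
        task["status"] = "running"
        task["iteration"] += 1
        task["outputs"].append(step(task["iteration"]))
    task["status"] = "complete"
    return task
```

Resuming is the same call with the same task state once the flag has been cleared, which mirrors the text's point that resume and crash recovery share one code path.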

Workflow Orchestration Integration: For complex multi-stage agent tasks (where the agent itself is a step in a larger workflow), this pattern integrates with workflow orchestration engines. Temporal and Azure Durable Functions provide built-in state persistence and replay-safe execution semantics at the workflow level. In this mode, the agent loop is implemented as a Temporal Workflow or a Durable Functions orchestration, with LLM and tool calls run as activities because orchestration code must be deterministic for replay. The platform handles checkpointing automatically via its event-sourced execution model. This is the preferred implementation for tasks that compose multiple agent instances or require saga-style compensation logic.


5. Architecture Diagram

ARCHITECTURE DIAGRAM
flowchart TD
    subgraph Input["Input Layer"]
        A[Task Request]
        B[Human Control API]
    end
    subgraph Core["Agent Execution Core"]
        C[Task Initialiser]
        D{Checkpoint Exists?}
        E[Agent Loop]
        F[Idempotency Manager]
    end
    subgraph Storage["State Storage"]
        G[(Checkpoint Store)]
        H[(Audit Log)]
    end
    subgraph Output["Output Layer"]
        I[Final Output]
        J[Paused State]
    end
    A --> C
    B -->|pause/resume| G
    C --> D
    D -->|found| E
    D -->|not found| E
    E --> F
    F -->|cached result| E
    F -->|new call + save| G
    G -->|restore state| E
    E -->|complete| I
    E -->|paused| J
    J -->|resume| E
    E --> H
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#dbeafe,stroke:#3b82f6
    style C fill:#f0fdf4,stroke:#22c55e
    style D fill:#f3e8ff,stroke:#a855f7
    style E fill:#f0fdf4,stroke:#22c55e
    style F fill:#f0fdf4,stroke:#22c55e
    style G fill:#fef9c3,stroke:#eab308
    style H fill:#fef9c3,stroke:#eab308
    style I fill:#d1fae5,stroke:#10b981
    style J fill:#fee2e2,stroke:#ef4444

6. Components

| Component | Type | Responsibility | Technology Options | Criticality |
| --- | --- | --- | --- | --- |
| Checkpoint Store | Durable State Store | Stores serialised task state per iteration; supports CAS writes; low latency | Redis (AOF) + conditional SET, DynamoDB (condition expressions), Azure Cosmos DB (ETag CAS) | Critical |
| Task Initialiser | Orchestration | Checks for existing checkpoint; routes to restore or fresh execution | Custom; part of agent framework | Critical |
| State Restorer | Orchestration | Loads checkpoint; reconstructs context; marks completed tool calls | Custom; integrated into agent loop | Critical |
| Idempotency Key Manager | Reliability | Generates and stores UUID idempotency keys per tool call; retrieves keys on replay | Custom; UUID v4 generation; stored in checkpoint state | Critical |
| Checkpoint Writer | Persistence | Atomically serialises and writes state after each iteration | Custom; Redis SETNX + pipeline; DynamoDB PutItem with condition | Critical |
| Pause Controller | Human Control | Sets pause flag in checkpoint state on human signal; ensures clean checkpoint before stop | Custom management API | High |
| Retry Controller | Reliability | Implements exponential backoff retry policy on transient failures; respects max retry limit | Custom; Temporal retry policy; AWS Step Functions | High |
| Management API | Operations | Exposes pause, resume, cancel, and status endpoints for human operators | REST API; FastAPI, Express, Azure Functions | High |
| Task Status Viewer | Operations | Reads checkpoint store to display current task state, progress, and cost | Dashboard UI; custom + Grafana | Medium |
| Temporal / Durable Functions Engine | Workflow Orchestration | Provides event-sourced durable execution natively (optional but recommended) | Temporal OSS, Temporal Cloud, Azure Durable Functions, AWS Step Functions | High (if used) |
| Audit Log | Compliance | Records checkpoint writes, restores, pauses, and resumes with timestamps | WORM store: S3 Object Lock, Azure Immutable Blob | Critical |

7. Data Flow

Fresh Execution with Checkpointing

| Step | Actor | Action | Output |
| --- | --- | --- | --- |
| 1 | Calling System | Submits task with unique task_id | Task queued |
| 2 | Task Initialiser | Queries checkpoint store for task_id | No checkpoint found |
| 3 | Agent Loop | Executes iteration 1: context assembly → plan → tool call → result | Iteration 1 result |
| 4 | Idempotency Manager | Generates UUID idempotency key for tool call; stores in pending state | Keyed tool call record |
| 5 | Checkpoint Writer | Atomically writes state: {task_id, iteration: 1, tool_history: [{tool_id, idempotency_key, result}], partial_output, token_count} | Checkpoint record v1 |
| 6 | Agent Loop | Continues for iterations 2..N | Checkpoint written after each iteration |
| 7 | Termination | Task completes; final output returned; checkpoint marked status: complete | Final output |

Recovery from Mid-Task Failure

| Step | Actor | Action | Output |
| --- | --- | --- | --- |
| 1 | Retry Controller | Detects failure; initiates recovery after backoff | Recovery signal |
| 2 | Task Initialiser | Queries checkpoint store for task_id | Checkpoint found at iteration K |
| 3 | State Restorer | Loads checkpoint; reconstructs context with full tool call history up to iteration K | Restored context |
| 4 | Idempotency Manager | For each tool call in history with a stored result: marks as complete, returns cached result | Cached results injected |
| 5 | Agent Loop | Resumes from iteration K+1; LLM receives full context including all prior tool results | Execution continues |
| 6 | External API | If iteration K+1 tool call has idempotency key already stored (sent before failure): API returns original response | No duplicate side effect |
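Step 1 of the recovery flow depends on a backoff policy. The sketch below shows one plausible shape for it, assuming a doubling delay with a cap; `TransientError`, `run_with_retries`, and the default values are hypothetical, and the `sleep` parameter is injectable so the policy can be exercised without waiting.

```python
import time
from typing import Any, Callable


class TransientError(Exception):
    """Stand-in for failures worth retrying (timeouts, throttling, node loss)."""


def run_with_retries(
    operation: Callable[[], Any],
    max_retries: int = 3,
    base_seconds: float = 1.0,
    cap_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call `operation`, retrying transient failures with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_retries:
                raise  # retry limit reached: surface the failure to the caller
            # Delay doubles per attempt: base, 2x base, 4x base, up to the cap.
            sleep(min(cap_seconds, base_seconds * (2 ** attempt)))
    raise AssertionError("unreachable")  # the loop always returns or raises
```

In the recovery flow, `operation` would be the resume-from-checkpoint call, so each retry picks up from the latest checkpoint rather than from the start of the task.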

Error Flow

| Error | Detection | Recovery |
| --- | --- | --- |
| Checkpoint write failure (store unavailable) | Write exception; circuit breaker | Retry write with backoff; if checkpoint store unavailable for > threshold, abort task cleanly; alert |
| Checkpoint CAS failure (concurrent write) | Conditional write rejection | Indicates duplicate execution; one instance wins; other aborts; coordination via distributed lock |
| Idempotency key not accepted by external API | HTTP 422 / API-specific error | Log; attempt with new key if API behaviour permits; escalate if duplicate detected |
| State deserialisation failure on restore | Schema version mismatch | Versioned state schema; migration function for minor versions; fresh execution if major version mismatch |
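The schema-version rule in the last row can be sketched as a small migration chain. The version numbers, the `recovery_attempts` field added by the migration, and the function names are all hypothetical; the only behaviour taken from the table is "migrate across minor versions, fall back to fresh execution on a major mismatch".

```python
from typing import Any, Callable, Dict, Optional, Tuple

State = Dict[str, Any]
CURRENT_SCHEMA: Tuple[int, int] = (2, 1)  # hypothetical current version


def _migrate_2_0_to_2_1(state: State) -> State:
    """Example minor migration: add a field with a safe default."""
    migrated = dict(state)
    migrated.setdefault("recovery_attempts", 0)
    migrated["schema_version"] = "2.1"
    return migrated


MIGRATIONS: Dict[Tuple[int, int], Callable[[State], State]] = {
    (2, 0): _migrate_2_0_to_2_1,
}


def _parse(version: str) -> Tuple[int, int]:
    major, minor = version.split(".")
    return int(major), int(minor)


def restore_state(state: State) -> Optional[State]:
    """Return a state at the current schema, or None to signal fresh execution."""
    version = _parse(state["schema_version"])
    if version[0] != CURRENT_SCHEMA[0]:
        return None  # major mismatch: do not attempt to interpret the state
    while version != CURRENT_SCHEMA:
        migrate = MIGRATIONS.get(version)
        if migrate is None:
            return None  # no known path, including states newer than this code
        state = migrate(state)
        version = _parse(state["schema_version"])
    return state
```

Returning `None` rather than raising keeps the decision about fresh execution (and the accompanying side-effect audit) with the caller.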

8. Security Considerations

Checkpoint State Protection

  • Checkpoint state may contain sensitive intermediate tool results (customer data, partial financial records); it must be encrypted at rest with CMK
  • Checkpoint store access is restricted to the agent service identity; human operators can view task status via the management API but cannot read raw checkpoint state without elevated access
  • Checkpoint states are automatically expired after the task retention period; no indefinite accumulation of sensitive data

Idempotency Key Exposure

  • Idempotency keys must be treated as sensitive: if an attacker can obtain the key for a payment API call, they could potentially replay or probe the API
  • Keys are stored in the encrypted checkpoint state, not in logs or observable metadata

OWASP LLM Top 10

| OWASP LLM Risk | Checkpoint Relevance | Mitigation |
| --- | --- | --- |
| LLM08 Excessive Agency | Recovery replay could re-execute a previously blocked action if policy changed after checkpoint | Policy check is re-evaluated at each iteration after restore, not skipped based on checkpoint history |
| LLM01 Prompt Injection | Checkpoint state could contain injected content from a compromised tool result | Content validation applied to restored context before injection into LLM prompt |
| LLM06 Sensitive Information Disclosure | Checkpoint state contains intermediate sensitive data | Encryption at rest; access controls on checkpoint store; expiry policy |

9. Governance Considerations

Audit Trail

  • Every checkpoint write, restore, pause, resume, and cancel event is recorded in the immutable audit log
  • The audit trail enables reconstruction of a complete task execution timeline, including any human interventions
  • For regulated tasks (financial, clinical), the checkpoint audit trail is a material compliance artefact

Human Override Records

  • Every pause, resume, and cancel action through the management API is recorded with the human operator's identity, timestamp, and justification
  • Context injected at resume points (human feedback, corrected data) is appended to the task audit trail

Governance Artefacts

| Artefact | Owner | Frequency | Purpose |
| --- | --- | --- | --- |
| Task Execution Audit Trail | Platform Engineering | Per task | Complete timeline of execution, checkpoints, human interventions |
| Recovery Incident Log | Operations | Per recovery event | Records failure, recovery attempt, outcome, and any re-execution anomalies |
| Idempotency Violation Report | Operations | Monthly | Documents any detected duplicate side effects; root cause analysis |
| Checkpoint Store Capacity Report | Platform Engineering | Monthly | Storage growth, TTL expiry rates, capacity planning |

10. Operational Considerations

SLOs

| SLO | Target | Window | Alert |
| --- | --- | --- | --- |
| Checkpoint write latency | ≤ 15ms p95 | 1-hour rolling | > 50ms triggers P2 |
| Recovery time (from failure detection to resumed execution) | ≤ 60 seconds | Per event | > 5 minutes triggers P1 |
| Task completion rate (including recovered tasks) | ≥ 98% | 24-hour rolling | < 95% triggers P2 |
| Checkpoint store availability | 99.99% | Monthly | Any degradation triggers P1 |

Monitoring

  • Checkpoint write success/failure rate per task_id
  • Task recovery event rate — spike indicates infrastructure instability
  • Average iterations per task — significant increase may indicate agent getting stuck in recovery loops
  • Cost per task with recovery events — quantifies the financial impact of failures

DR and Capacity

| DR Tier | Checkpoint Store Config | RTO | RPO |
| --- | --- | --- | --- |
| Standard | Redis with AOF + daily snapshot to S3 | 5 min | 1 iteration (1 checkpoint interval) |
| Enhanced | DynamoDB Global Tables (multi-region) | < 1 min | Near-zero (cross-region replication is asynchronous by default, typically completing within about a second) |
| Premium | Temporal Cloud (fully managed) | < 30 sec | Zero (event-sourced; no data loss) |

11. Cost Considerations

Cost Drivers

| Cost Driver | Scaling Behaviour | Control Lever |
| --- | --- | --- |
| Checkpoint store writes | Linear with tasks × iterations per task | Checkpoint every N iterations instead of every 1 (trade recovery granularity for cost) |
| Checkpoint state storage | Linear with concurrent tasks × state size × retention period | Compress checkpoint state; aggressive TTL on completed tasks |
| Recovery LLM calls | Proportional to failure rate × iterations replayed | Improve infrastructure reliability to reduce failure rate; idempotent cached results avoid re-inference |

Indicative Cost Impact

| Scenario | Cost Impact vs. No Checkpoint |
| --- | --- |
| 0% failure rate, 100% completion | +2–5% (checkpoint write overhead only) |
| 5% failure rate, recovery from checkpoint | Break-even to 10% savings (avoid full re-execution cost) |
| 10% failure rate without recovery | 2–3× cost for each failed task (full re-execution + risk of duplicate side effects) |

12. Trade-Off Analysis

Checkpointing Strategy Options

| Option | Granularity | Write Overhead | Recovery Granularity | Best For |
| --- | --- | --- | --- | --- |
| A: Per-Iteration (Recommended) | Every loop iteration | Low (Redis ≤15ms) | Maximum: at most 1 iteration lost | Tasks with expensive or side-effectful iterations |
| B: Per-N-Iterations | Every N iterations | Proportionally lower | At most N iterations lost | Short, cheap iterations where overhead matters |
| C: Workflow Engine Native | Platform-managed (Temporal/Durable Functions) | Platform overhead | No completed work lost (event-sourced replay) | Complex multi-stage workflows; regulated workloads |
| D: No Checkpointing (Idempotent Tasks Only) | N/A (full re-execution) | Zero | Full re-execution from start | Fully idempotent tasks where re-execution is safe and cheap |

Architectural Tensions

| Tension | Left Pole | Right Pole | Balance |
| --- | --- | --- | --- |
| Recovery granularity vs. Checkpoint overhead | Checkpoint after every action (microseconds apart) | Checkpoint only at major milestones | Per-iteration checkpointing is the practical optimum |
| Idempotency key lifespan vs. Storage cost | Keep keys indefinitely | Expire after task completion | Expire with task retention period; dedup window for external APIs is typically 24h |
| Human pause flexibility vs. Task complexity | Allow pause at any iteration | Only allow pause at defined milestones | Pause at any iteration for safety; display milestone progress for human context |

13. Failure Modes

| Failure Mode | Likelihood | Impact | Detection | Recovery |
| --- | --- | --- | --- | --- |
| Checkpoint store failure during write | Low | Medium: last iteration lost; replay from prior checkpoint | Write exception + circuit breaker | Recover from last successful checkpoint; at most 1 iteration re-executed |
| Duplicate task execution (two instances race) | Low | High: duplicate side effects if not protected by idempotency keys | CAS conflict on checkpoint write | One instance wins CAS; other detects conflict and aborts |
| Checkpoint state corruption | Very Low | High: task cannot recover; must restart from scratch | Deserialisation failure | Versioned schema migration; if unrecoverable, fresh start with duplicate side effect audit |
| Idempotency key rejected by external API | Low | Medium: tool call may not complete | API error code | Log and investigate; if API does not support idempotency, implement result cache in checkpoint |
| Recovery loop (agent stuck, keeps recovering) | Low | High: cost overrun | Recovery attempt count in checkpoint; alert after N recoveries | Kill task after max recovery attempts; alert; human review |
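The recovery-loop guard in the last row can be sketched as a counter carried in the checkpoint state. The field name `recovery_attempts`, the status value `failed_needs_review`, and the default limit of 3 are illustrative assumptions rather than values defined by this pattern.

```python
from typing import Any, Dict


def register_recovery(state: Dict[str, Any], max_attempts: int = 3) -> Dict[str, Any]:
    """Record one more recovery attempt and stop the task once the limit is exceeded."""
    attempts = state.get("recovery_attempts", 0) + 1
    updated = {**state, "recovery_attempts": attempts}
    if attempts > max_attempts:
        # The caller is expected to alert and route the task to human review.
        updated["status"] = "failed_needs_review"
    return updated


def may_resume(state: Dict[str, Any]) -> bool:
    """True while the task is still allowed to resume automatically."""
    return state.get("status") != "failed_needs_review"
```

Because the counter lives in the checkpoint, it survives the very crashes it is counting, which a process-local counter would not.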

14. Regulatory Considerations

APRA CPS 230 (Operational Resilience)

  • Critical operations must have documented tolerance levels for disruption time and data loss (the CPS 230 counterparts of RTO/RPO); this pattern enables recovery measured in seconds/minutes for agent-powered services
  • Recovery testing is required; the checkpoint restore path must be exercised in DR testing

ISO 22301 (Business Continuity)

  • Agent checkpoint/recovery maps to §8.4 (Business Continuity Plans); the restore protocol is the BCM procedure for agent task failure

EU AI Act

  • Art. 9 (Risk Management): the ability to safely pause and inspect a running agent implements a key risk management control
  • Art. 14 (Human Oversight): the pause/resume mechanism is a direct implementation of the human oversight requirement — humans can halt agent execution at any point without losing task progress

NIST AI RMF

  • MANAGE 4.1: The recovery protocol and idempotency design implement the incident response and recovery management requirement

15. Reference Implementations

AWS

| Component | Service |
| --- | --- |
| Checkpoint Store | Amazon DynamoDB (conditional PutItem; strong consistency) |
| Workflow Engine | AWS Step Functions (built-in state persistence) |
| Management API | AWS API Gateway + Lambda |
| Audit Log | AWS CloudTrail + S3 Object Lock |

Azure

| Component | Service |
| --- | --- |
| Checkpoint Store | Azure Cosmos DB (ETag-based optimistic concurrency) |
| Workflow Engine | Azure Durable Functions (event-sourced; built-in checkpointing) |
| Management API | Azure Functions + API Management |

GCP

| Component | Service |
| --- | --- |
| Checkpoint Store | Cloud Spanner (strong consistency; CAS transactions) |
| Workflow Engine | Cloud Workflows + Cloud Run |
| Management API | Cloud Run API endpoint |

On-Premises

| Component | Technology |
| --- | --- |
| Checkpoint Store | Redis 7+ with AOF persistence (fsync always for highest durability) |
| Workflow Engine | Temporal OSS (self-hosted on Kubernetes) |
| Management API | FastAPI on Kubernetes |

16. Related Patterns

| Pattern | ID | Relationship Type | Notes |
| --- | --- | --- | --- |
| Single Agent Pattern | EAAPL-AGT001 | Extends | Checkpoint adds durable state to the base agent loop |
| Stateful Agent Memory | EAAPL-AGT002 | Integrates With | Checkpoint references memory record IDs; restore reloads referenced memories |
| Long-Running Agent | EAAPL-AGT007 | Depends On | Long-running agent requires checkpointing as a foundational capability |
| Human-in-the-Loop Agent | EAAPL-MAG003 | Integrates With | Pause/resume mechanism is the implementation enabler for HITL gates mid-task |
| Agent Handoff Protocol | EAAPL-MAG006 | Peer | Handoff payload includes checkpoint state for seamless context transfer between agents |

17. Maturity Assessment

Overall Maturity: Proven

| Dimension | Score (1–5) | Evidence |
| --- | --- | --- |
| Checkpointing Technology Maturity | 5 | Redis, DynamoDB, Cosmos DB battle-tested at hyperscale; Temporal/Durable Functions proven |
| Idempotency Pattern Maturity | 5 | Industry-standard pattern (Stripe, Twilio, etc.) with established implementation guidance |
| AI-Agent-Specific Integration | 3 | Agent framework integration (LangGraph, Temporal) maturing; some custom implementation still required |
| Human Pause/Resume UX | 3 | Management API well-defined; UI tooling for operator visibility still developing |
| Regulatory Evidence | 4 | CPS 230 and ISO 22301 mapping well-established; audit trail design proven |

18. Revision History

| Version | Date | Author | Changes |
| --- | --- | --- | --- |
| 1.0 | 2024-03-15 | Architecture Board | Initial publication |
| 1.1 | 2024-06-01 | Platform Engineering | Added Temporal integration option; idempotency key management detail |
| 1.2 | 2024-09-15 | Reliability Engineering | Added DR tiers; SLO table; recovery loop detection failure mode |
| 1.3 | 2025-01-05 | Architecture Board | Added EU AI Act Art. 14 mapping; pause/resume governance artefacts |
| 1.4 | 2025-04-20 | Reliability Engineering | Added CAS conflict resolution; Cosmos DB ETag implementation reference |