
EAAPL-MAG003 — Human-in-the-Loop Agent


Status: Proven · Tags: agent, human-oversight, eu-ai-act, high-complexity · Version: 2.0.0 · Last Updated: 2026-06-12


1. Pattern Identity

| Field | Value |
| --- | --- |
| Pattern ID | EAAPL-MAG003 |
| Name | Human-in-the-Loop Agent |
| Category | Multi-Agent |
| Maturity | Proven |
| Complexity | High |
| Related Patterns | EAAPL-MAG001 · EAAPL-MAG002 · EAAPL-MAG006 · EAAPL-INT007 |

2. Executive Summary

The Human-in-the-Loop (HITL) Agent pattern inserts mandatory human approval checkpoints into autonomous AI workflows before the agent executes irreversible actions or reaches consequential decisions. It is not a concession to lack of AI capability; it is a structural compliance and risk control. EU AI Act Article 14 requires that high-risk AI systems enable human operators to meaningfully intervene; this pattern operationalises that requirement as a first-class architectural component.

The pattern governs:

  • where in the workflow checkpoints are placed (before irreversible actions; after high-stakes reasoning completion);
  • how checkpoints are presented to reviewers (plain-language summaries, not technical dumps);
  • how the approval queue is managed (async, prioritised, SLA-bound);
  • how timeouts are handled (pause, then escalate, then auto-cancel);
  • how every human decision is recorded for audit.

Critically, the pattern distinguishes genuine oversight from compliance theatre: a checkpoint that shows reviewers an opaque JSON blob is not meaningful oversight, regardless of whether it is technically present.


3. Problem Statement

3.1 Context

Autonomous AI agents acting without human checkpoints carry two distinct risks. First, operational risk: the agent may take an irreversible action (sending a customer email, committing a financial transaction, modifying a production database) based on reasoning that is subtly wrong. Second, regulatory risk: for high-risk AI systems under EU AI Act Annex III, autonomous action without human oversight is non-compliant regardless of the agent's actual performance.

3.2 Forces in Tension

  • Autonomy vs. oversight. More checkpoints reduce risk but increase latency and human reviewer burden. Too many checkpoints and reviewers become rubber-stampers who approve without reading — creating a false audit trail worse than no checkpoint.
  • Reviewer cognitive load vs. audit completeness. Reviewers need enough context to make genuine decisions. But providing all context makes reviews slow and exhausting.
  • Async throughput vs. reviewer responsiveness. An async approval queue handles burst load but requires reviewers to process queued items within SLA, which requires staffing and tooling.
  • Auto-cancel vs. partial completion. When a reviewer does not respond within SLA, the safest action (auto-cancel) loses work already done. Partial completion raises consistency risks.

3.3 Failure Modes Without This Pattern

Without HITL checkpoints on irreversible actions, a single agent reasoning error propagates to an irreversible real-world consequence before any human has the opportunity to intervene. Without structured audit logging, there is no evidence that oversight occurred, creating regulatory exposure. Without timeout handling, a slow reviewer blocks the entire workflow indefinitely.


4. Solution

4.1 HITL Agent Workflow

Architecture diagram (Mermaid source):

flowchart TD
    subgraph Decision["Checkpoint Decision"]
        A[Agent Reasoning Complete]
        B{Checkpoint Required}
    end
    subgraph Review["Human Review"]
        C[Build Summary]
        D[Approval Queue]
        E{Reviewer Decision}
    end
    subgraph Outcome["Outcome"]
        F[Execute Action]
        G[Action Cancelled]
        H[Re-execute with Edits]
        I[Audit Record]
    end
    A --> B
    B -->|no| F
    B -->|yes| C --> D --> E
    E -->|approve| F --> I
    E -->|reject| G
    E -->|modify| H --> I
    E -->|timeout| J{Escalate or Cancel}
    J -->|escalate| K[Senior Queue]
    J -->|cancel| G
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#f3e8ff,stroke:#a855f7
    style C fill:#f0fdf4,stroke:#22c55e
    style D fill:#fef9c3,stroke:#eab308
    style E fill:#f3e8ff,stroke:#a855f7
    style F fill:#d1fae5,stroke:#10b981
    style G fill:#fee2e2,stroke:#ef4444
    style H fill:#f0fdf4,stroke:#22c55e
    style I fill:#d1fae5,stroke:#10b981
    style J fill:#f3e8ff,stroke:#a855f7
    style K fill:#d1fae5,stroke:#10b981

4.2 Checkpoint Classification

Architecture diagram (Mermaid source):

flowchart TD
    subgraph Input["Proposed Action"]
        A[Action to Evaluate]
    end
    subgraph Classification["Action Classification"]
        B{Action Type}
        C[External Communication]
        D[Irreversible DB Write]
        E[Financial Transaction]
        F[High-Stakes Reasoning]
        G[Read-Only or Reversible]
    end
    subgraph Queue["Checkpoint Queue"]
        H[Approval Queue]
    end
    A --> B
    B --> C --> H
    B --> D --> H
    B --> E --> H
    B --> F --> H
    B --> G --> I[Auto-Proceed]
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#f3e8ff,stroke:#a855f7
    style C fill:#f0fdf4,stroke:#22c55e
    style D fill:#f0fdf4,stroke:#22c55e
    style E fill:#f0fdf4,stroke:#22c55e
    style F fill:#f0fdf4,stroke:#22c55e
    style G fill:#f0fdf4,stroke:#22c55e
    style H fill:#fef9c3,stroke:#eab308
    style I fill:#d1fae5,stroke:#10b981

5. Structure

5.1 Component Catalogue

| Component | Responsibility | Technology Options |
| --- | --- | --- |
| Checkpoint Evaluator | Determines if an action requires human approval | Rule engine, policy-as-code |
| Summary Builder | Translates agent reasoning into plain-language reviewer summary | LLM with summarisation prompt |
| Approval Queue | Stores pending checkpoints, manages priority and SLA | Postgres + background worker, AWS SQS, Temporal workflow |
| Reviewer Interface | Presents checkpoints to humans with approve/reject/modify controls | Web app, Slack bot, email with signed links |
| Timeout Handler | Monitors SLA deadlines, triggers escalation or auto-cancel | Cron job, Temporal timer |
| Audit Logger | Immutable record of every checkpoint and human decision | Append-only Postgres table, AWS CloudTrail |
| Resumption Handler | Resumes agent execution after approval with reviewer context injected | Workflow continuation via stored state |

5.2 Checkpoint Summary Schema

{
  "checkpointId": "uuid-v4",
  "taskId": "uuid-v4",
  "agentId": "contract-review-agent-v2",
  "checkpointType": "PRE_ACTION",
  "actionType": "SEND_EMAIL",
  "timestamp": "ISO-8601",
  "prioritySLA": {
    "level": "HIGH",
    "reviewByMs": 14400000
  },
  "humanReadableSummary": {
    "whatTheAgentWantsToDo": "Send a contract rejection email to vendor@acme.com",
    "whyItWantsToDoThis": "Clause 4.2 contains unlimited liability exceeding our risk policy",
    "dataUsed": ["Contract PDF uploaded 2026-06-11", "Risk policy v3.2 (internal)"],
    "alternatives": ["Request clause modification", "Escalate to legal team"]
  },
  "rawAgentReasoning": "...",
  "proposedActionPayload": { "to": "vendor@acme.com", "subject": "...", "body": "..." },
  "reviewerOptions": ["APPROVE", "REJECT", "MODIFY", "ESCALATE"]
}

6. Behaviour

6.1 Checkpoint Placement Strategy

Checkpoints are placed at two types of junctures:

BEFORE irreversible actions. An action is irreversible if it cannot be undone programmatically without side effects or human effort. The canonical list:

  • Sending any external communication (email, SMS, API call to a third-party system)
  • Writing to a production database with no soft-delete mechanism
  • Executing a financial transaction or payment authorisation
  • Creating or modifying a legal document to be shared externally
  • Deploying code or configuration changes to production
  • Provisioning or deprovisioning cloud infrastructure

AFTER completing high-stakes reasoning. For decisions where the agent's reasoning output will be consumed by downstream systems or used as trusted input, a post-reasoning checkpoint allows a reviewer to validate quality before it is consumed. This is critical for risk scores, legal opinions, medical recommendations, and security assessments.

What does NOT need a checkpoint:

  • Read-only operations (database queries, file reads, API GETs)
  • Reversible internal state changes (setting a flag, updating a draft)
  • Low-stakes formatting or summarisation tasks
  • Actions on sandboxed test data
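The placement rules above can be sketched as a small pure function. This is a minimal illustration only: the `ActionTraits` shape, its field names, and the precedence between rules are assumptions introduced here, not part of the pattern's schema.

```typescript
// Hypothetical description of a proposed action; field names are illustrative.
interface ActionTraits {
  readOnly: boolean;            // database queries, file reads, API GETs
  reversible: boolean;          // can be undone programmatically without side effects
  leavesOrgBoundary: boolean;   // external communication or third-party write
  sandboxed: boolean;           // touches only sandboxed test data
  highStakesReasoning: boolean; // output is trusted downstream (risk score, legal opinion)
}

type Placement = "PRE_ACTION" | "POST_REASONING" | "NONE";

// Decides where a checkpoint belongs, following the rules in section 6.1.
function placeCheckpoint(t: ActionTraits): Placement {
  // Actions on sandboxed test data never need review.
  if (t.sandboxed) return "NONE";
  // Anything that writes externally or cannot be undone is gated before it runs.
  if (!t.readOnly && (t.leavesOrgBoundary || !t.reversible)) return "PRE_ACTION";
  // Reasoning that downstream systems will trust is reviewed after it completes.
  if (t.highStakesReasoning) return "POST_REASONING";
  // Read-only and reversible internal changes proceed automatically.
  return "NONE";
}
```

For example, sending an email (`readOnly: false`, `leavesOrgBoundary: true`) yields `PRE_ACTION`, while a read-only risk-scoring step with `highStakesReasoning: true` yields `POST_REASONING`.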

6.2 Checkpoint Presentation Quality

This is the most commonly misconfigured aspect of the HITL pattern. A poorly designed checkpoint presenting reviewers with a raw JSON blob or a multi-page reasoning dump is compliance theatre. Reviewers will approve without reading.

The checkpoint summary must answer five questions in plain language:

  1. What does the agent want to do? (One sentence, action verb)
  2. Why does it want to do this? (Business reason in non-technical terms)
  3. What data did it use to reach this conclusion? (Named sources, not IDs)
  4. What will happen if you approve? (Concrete consequence)
  5. What are the alternatives? (At least one option other than approve or reject)

The summary must be generated by an LLM summarisation step, not assembled programmatically from field names. Test it by asking a non-technical reviewer whether they understand it — if they do not, the prompt needs revision.
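A structural check can catch summaries that skip one of the five questions before they reach a reviewer. The sketch below is illustrative: field names follow `humanReadableSummary` in section 5.2 where one exists, and `whatHappensIfApproved` is an assumed name for question 4. It checks presence only; whether the text is understandable still has to be judged by a human reader.

```typescript
// Field names mirror section 5.2, except whatHappensIfApproved (assumed name).
interface ReviewerSummary {
  whatTheAgentWantsToDo: string;
  whyItWantsToDoThis: string;
  dataUsed: string[];
  whatHappensIfApproved: string;
  alternatives: string[];
}

// Returns a list of gaps; an empty list means all five questions are answered.
function findSummaryGaps(s: ReviewerSummary): string[] {
  const isBlank = (v: string): boolean => v.trim().length === 0;
  const gaps: string[] = [];
  if (isBlank(s.whatTheAgentWantsToDo)) gaps.push("what the agent wants to do");
  if (isBlank(s.whyItWantsToDoThis)) gaps.push("why it wants to do this");
  if (s.dataUsed.length === 0 || s.dataUsed.some(isBlank)) gaps.push("named data sources");
  if (isBlank(s.whatHappensIfApproved)) gaps.push("consequence of approval");
  if (s.alternatives.length === 0 || s.alternatives.some(isBlank)) gaps.push("at least one alternative");
  return gaps;
}
```

A summary builder could call `findSummaryGaps` on the LLM output and regenerate, or fall back to escalation, when the list is non-empty.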

6.3 Approval Queue Design

Async by default. A pending checkpoint must not block a running process or require the reviewer to respond immediately. The agent suspends execution by persisting its state and exiting. When the reviewer acts, the agent resumes from saved state with the reviewer's decision injected into its context.

Priority tiers.

| Tier | Examples | Review SLA |
| --- | --- | --- |
| CRITICAL | Financial transactions above threshold; production infrastructure changes | 30 minutes |
| HIGH | External communications; contract executions; risk decisions | 4 hours |
| STANDARD | Internal reports; draft documents | 24 hours |
| LOW | Informational summaries; non-consequential recommendations | 72 hours |

Queue implementation. Use a durable queue (not in-memory) so checkpoint items survive process restarts. Each item has: checkpointId, taskId, createdAt, reviewByMs, status, reviewerId, reviewedAt, reviewerComment.
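The tier table translates directly into SLA durations and a sort order for the reviewer's work list. The following sketch assumes a strict tier-first ordering with the earliest deadline breaking ties; that ordering, and the `QueueItem` shape, are illustrative choices rather than something the pattern prescribes.

```typescript
type QueueTier = "CRITICAL" | "HIGH" | "STANDARD" | "LOW";

// Review SLAs from the priority tier table, in milliseconds.
const SLA_MS: Record<QueueTier, number> = {
  CRITICAL: 30 * 60 * 1000,
  HIGH: 4 * 60 * 60 * 1000,
  STANDARD: 24 * 60 * 60 * 1000,
  LOW: 72 * 60 * 60 * 1000,
};

// Lower rank is reviewed first.
const TIER_RANK: Record<QueueTier, number> = { CRITICAL: 0, HIGH: 1, STANDARD: 2, LOW: 3 };

// Reduced queue item for illustration; a real item carries the full field list.
interface QueueItem {
  checkpointId: string;
  tier: QueueTier;
  createdAt: number; // epoch milliseconds
}

// Absolute deadline by which the item should be reviewed.
function reviewDeadline(item: QueueItem): number {
  return item.createdAt + SLA_MS[item.tier];
}

// Returns a new array: highest tier first, then earliest deadline.
function nextForReview(items: QueueItem[]): QueueItem[] {
  return [...items].sort(
    (a, b) => TIER_RANK[a.tier] - TIER_RANK[b.tier] || reviewDeadline(a) - reviewDeadline(b)
  );
}
```

Strict tier-first ordering can starve LOW items under sustained load; sorting purely by `reviewDeadline` is a reasonable alternative if that becomes a problem.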

6.4 Timeout Handling

When a reviewer does not act within SLA:

  1. Warning notification at 80% of SLA elapsed. Alert the reviewer via all configured channels (email, Slack, SMS).
  2. Escalation at 100% of SLA elapsed. Move the checkpoint to a senior reviewer queue. Notify the original reviewer it has been escalated.
  3. Auto-cancel at 150% of SLA elapsed (configurable). Cancel the pending action. Write an audit record with AUTO_CANCELLED status. Notify the task initiator.

The specific timeout policy must be configurable per checkpoint type. Financial transactions should never auto-approve on timeout. Low-stakes informational checkpoints may auto-approve after escalation.
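The three thresholds can be evaluated with one comparison per stage. The sketch below is a minimal illustration: the threshold field names match `timeoutBehaviours` in the section 7.2 skeleton, while `TimeoutStage` and `timeoutStage` are names introduced here. It uses integer arithmetic so that a checkpoint at exactly 80% of SLA is not misclassified by floating-point rounding.

```typescript
type TimeoutStage = "ON_TRACK" | "WARN" | "ESCALATE" | "AUTO_CANCEL";

interface TimeoutThresholds {
  warningPct: number;    // e.g. 80
  escalatePct: number;   // e.g. 100
  autoCancelPct: number; // e.g. 150
}

// Returns the most severe stage the checkpoint has reached at time `now`.
// All times are epoch milliseconds; slaMs is the tier's review SLA.
function timeoutStage(
  createdAt: number,
  slaMs: number,
  now: number,
  t: TimeoutThresholds
): TimeoutStage {
  // Compare elapsed * 100 against pct * slaMs to avoid fractional percentages.
  const elapsedScaled = (now - createdAt) * 100;
  if (elapsedScaled >= t.autoCancelPct * slaMs) return "AUTO_CANCEL";
  if (elapsedScaled >= t.escalatePct * slaMs) return "ESCALATE";
  if (elapsedScaled >= t.warningPct * slaMs) return "WARN";
  return "ON_TRACK";
}
```

A timeout handler would call this for every pending checkpoint and act only when the returned stage differs from the stage already recorded, so each notification fires once.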


7. Implementation Guide

7.1 Step-by-Step

Step 1 — Define your checkpoint policy. Create a policy document and policy-as-code implementation that defines which action types require checkpoints, which SLA tier applies to each, and the timeout behaviour. This document is also your EU AI Act Article 14 evidence artefact.

Step 2 — Build the checkpoint evaluator. A pure function that takes a proposed action and returns: requiresCheckpoint: boolean, checkpointType, priorityTier. Implement as a lookup against the action type and a risk policy table.

Step 3 — Build the summary builder. LLM call with a prompt that takes the agent's reasoning, proposed action, and data sources, and returns the five plain-language answers. Test with non-technical reviewers before deploying to production.

Step 4 — Build the approval queue. Use a database table with a polling worker or an event-driven queue. Implement the priority sort and SLA deadline fields. The queue must be durable — survive process restarts without losing pending checkpoints.

Step 5 — Build the reviewer interface. Minimum viable: a web page (or Slack bot) showing the five plain-language summary fields, the approve/reject/modify/escalate buttons, and a free-text comment field. Every reviewer decision must be authenticated with the reviewer's identity.

Step 6 — Build the timeout handler. A cron job or Temporal workflow that runs every minute, queries for checkpoints past their SLA, and triggers the appropriate response (warning, escalation, auto-cancel).

Step 7 — Build the resumption handler. When a reviewer approves or modifies, the handler retrieves the suspended agent state, injects the reviewer's decision as a message in the agent's context, and resumes execution.
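Step 7 can be illustrated with a suspend and resume pair. This is a sketch under stated assumptions: the in-memory `Map` stands in for a durable store and would not survive a restart, and the `SuspendedAgent` and `ReviewDecision` shapes, including the choice to inject the decision as a `user` message, are hypothetical.

```typescript
interface AgentMessage { role: "system" | "user" | "assistant"; content: string; }
interface SuspendedAgent { taskId: string; messages: AgentMessage[]; }
interface ReviewDecision { action: "APPROVE" | "MODIFY"; reviewerId: string; comment: string; }

// Stand-in for a durable store (assumption); replace with a database in practice.
const suspendedStates = new Map<string, string>();

// Serialises the agent state so the process can exit while review is pending.
function suspendAgent(checkpointId: string, state: SuspendedAgent): void {
  suspendedStates.set(checkpointId, JSON.stringify(state));
}

// Restores the state and appends the reviewer's decision to the agent's context.
function resumeAgent(checkpointId: string, decision: ReviewDecision): SuspendedAgent {
  const raw = suspendedStates.get(checkpointId);
  if (raw === undefined) {
    throw new Error(`no suspended state for checkpoint ${checkpointId}`);
  }
  // Delete first so a duplicate webhook cannot resume the same checkpoint twice.
  suspendedStates.delete(checkpointId);
  const state = JSON.parse(raw) as SuspendedAgent;
  state.messages.push({
    role: "user",
    content: `Reviewer ${decision.reviewerId} chose ${decision.action}. Comment: ${decision.comment}`,
  });
  return state;
}
```

Deleting the stored state on resume makes the operation single-use, which guards against a reviewer UI retrying the same webhook.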

7.2 Code Skeleton (TypeScript)

interface CheckpointPolicy {
  actionType: string;
  requiresCheckpoint: boolean;
  tier: "CRITICAL" | "HIGH" | "STANDARD" | "LOW";
  timeoutBehaviours: { warningPct: number; escalatePct: number; autoCancelPct: number };
}

const POLICY: CheckpointPolicy[] = [
  { actionType: "SEND_EMAIL", requiresCheckpoint: true, tier: "HIGH",
    timeoutBehaviours: { warningPct: 80, escalatePct: 100, autoCancelPct: 150 } },
  { actionType: "FINANCIAL_TRANSACTION", requiresCheckpoint: true, tier: "CRITICAL",
    timeoutBehaviours: { warningPct: 80, escalatePct: 100, autoCancelPct: 200 } },
  { actionType: "DB_READ", requiresCheckpoint: false, tier: "LOW",
    timeoutBehaviours: { warningPct: 100, escalatePct: 100, autoCancelPct: 100 } }
];

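// The skeleton assumes these are provided elsewhere in the codebase:
// ProposedAction (with a `type` field), SerializedAgentState (with `taskId`),
// buildHumanReadableSummary, tierToMs, notifyReviewers, waitForDecision, db, auditLog.

// Assumption: action types missing from POLICY fail closed and require HIGH-tier
// review, so a newly added action cannot bypass oversight by omission.
const DEFAULT_POLICY: CheckpointPolicy = {
  actionType: "*", requiresCheckpoint: true, tier: "HIGH",
  timeoutBehaviours: { warningPct: 80, escalatePct: 100, autoCancelPct: 150 }
};

// PROCEED carries the action to run, which may be the reviewer's edited version.
type GateResult =
  | { outcome: "PROCEED"; action: ProposedAction }
  | { outcome: "CANCELLED" };

async function checkpointGate(
  action: ProposedAction,
  agentReasoning: string,
  agentState: SerializedAgentState
): Promise<GateResult> {
  const policy = POLICY.find(p => p.actionType === action.type) ?? DEFAULT_POLICY;
  if (!policy.requiresCheckpoint) return { outcome: "PROCEED", action };

  const summary = await buildHumanReadableSummary(action, agentReasoning);
  const checkpointId = crypto.randomUUID();

  // Persist the frozen state before notifying anyone, so an approval can never
  // arrive for a checkpoint that was not durably stored.
  await db.checkpoints.insert({
    id: checkpointId,
    taskId: agentState.taskId,
    status: "PENDING",
    tier: policy.tier,
    reviewBySLA: Date.now() + tierToMs(policy.tier),
    summary,
    proposedAction: action,
    frozenAgentState: agentState
  });

  await notifyReviewers(checkpointId, policy.tier, summary);
  // Execution suspends here. Resumption is event-driven via webhook from reviewer UI.
  // ESCALATE and timeouts are assumed to be resolved by the queue before this returns.
  const decision = await waitForDecision(checkpointId);

  await db.checkpoints.update(checkpointId, {
    status: decision.action,
    reviewerId: decision.reviewerId,
    reviewedAt: new Date(),
    reviewerComment: decision.comment
  });

  await auditLog.append({ checkpointId, decision, timestamp: new Date().toISOString() });

  // APPROVE runs the original action. MODIFY runs the reviewer's edited action,
  // assumed to arrive as decision.modifiedAction. Anything else cancels.
  if (decision.action === "APPROVE") return { outcome: "PROCEED", action };
  if (decision.action === "MODIFY" && decision.modifiedAction) {
    return { outcome: "PROCEED", action: decision.modifiedAction };
  }
  return { outcome: "CANCELLED" };
}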

8. Observability

8.1 Key Metrics

| Metric | Description | Alert Threshold |
| --- | --- | --- |
| Approval queue depth | Number of pending checkpoints | > 50 (staffing issue) |
| Average review time by tier | Mean time from created to reviewed | > 90% of SLA per tier |
| Auto-cancel rate | % of checkpoints that expire without review | > 5% |
| Escalation rate | % of checkpoints escalated to senior review | > 10% |
| Reviewer approval rate | % of checkpoints approved vs rejected/modified | < 70% (high rejection indicates agent quality issue) |
| Rubber-stamp rate | % of approvals with review time under 30s | > 20% (oversight theatre indicator) |

8.2 Rubber-Stamp Detection

The rubber-stamp rate metric is critical for genuine compliance. If reviewers are approving checkpoints in under 30 seconds consistently, they are not reading the summaries. Alert on this. Interventions: improve summary quality; reduce checkpoint frequency (too many checkpoints lead to reviewer fatigue); investigate whether checkpoints are too low-stakes to warrant human review.
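The rubber-stamp rate can be computed from review records as shown in this sketch. It assumes each record carries an `openedAt` timestamp for when the reviewer first viewed the checkpoint; that field is hypothetical and is used here because time spent waiting in the queue would otherwise hide fast approvals.

```typescript
interface ReviewRecord {
  decision: "APPROVE" | "REJECT" | "MODIFY";
  openedAt: number;   // epoch ms when the reviewer opened the checkpoint (assumed field)
  reviewedAt: number; // epoch ms when the decision was submitted
}

const RUBBER_STAMP_MS = 30_000; // 30-second threshold from the metrics table

// Percentage of approvals submitted less than 30 seconds after being opened.
// Returns 0 when there are no approvals, so an empty window does not alert.
function rubberStampRate(records: ReviewRecord[]): number {
  const approvals = records.filter(r => r.decision === "APPROVE");
  if (approvals.length === 0) return 0;
  const fast = approvals.filter(r => r.reviewedAt - r.openedAt < RUBBER_STAMP_MS).length;
  return (fast / approvals.length) * 100;
}
```

Computing the rate per reviewer as well as globally helps separate a summary-quality problem, where everyone approves fast, from an individual behaviour problem.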


9. Cost Governance

  • Summary builder LLM cost. Use a cheaper model tier for summary generation. The summary LLM call should not exceed 15% of the cost of the agent call that triggered the checkpoint.
  • Checkpoint volume control. If checkpoint volume is overwhelming reviewers, the solution is not to remove checkpoints — it is to invest in reviewer tooling and staffing. Never remove checkpoints from high-risk action types for cost reasons.
  • State storage cost. Suspended agent states can be large. Store only the minimum state needed for resumption. Implement a TTL on checkpoint records; auto-cancel and purge state after the maximum expected review window.

10. Security Considerations

10.1 Reviewer Identity Authentication

Every reviewer decision must be authenticated. An unauthenticated approval endpoint is a critical security vulnerability — an attacker who can send an HTTP request to the endpoint can bypass human oversight entirely. Require: authenticated session (SSO), MFA for CRITICAL tier checkpoints, reviewer identity recorded in audit log with session token fingerprint.

10.2 Approval Link Integrity

If checkpoints are presented via email links, sign the links with a time-limited HMAC token. Links must expire after the review SLA. Replayed or shared links must be rejected after first use.
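A signed, expiring, single-use token can be sketched with Node's built-in `crypto` module. The token layout (`checkpointId.expiresAt.mac`), the function names, and the in-memory `usedTokens` set are assumptions for illustration; a real deployment would keep the used-token registry in a durable store and load the secret from a secrets manager. The layout also assumes checkpoint IDs contain no `.` character, which holds for UUIDs.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Stand-in for a durable single-use registry (assumption).
const usedTokens = new Set<string>();

// Builds "checkpointId.expiresAt.mac", where mac is HMAC-SHA256 over the first two parts.
function signApprovalToken(checkpointId: string, expiresAt: number, secret: string): string {
  const payload = `${checkpointId}.${expiresAt}`;
  const mac = createHmac("sha256", secret).update(payload).digest("hex");
  return `${payload}.${mac}`;
}

// Returns the checkpoint ID if the token is authentic, unexpired, and unused; otherwise null.
function verifyApprovalToken(token: string, now: number, secret: string): string | null {
  const parts = token.split(".");
  if (parts.length !== 3) return null;
  const [checkpointId, expiresRaw, mac] = parts;
  if (checkpointId === undefined || expiresRaw === undefined || mac === undefined) return null;

  // Constant-time comparison avoids leaking how many MAC bytes matched.
  const expected = createHmac("sha256", secret).update(`${checkpointId}.${expiresRaw}`).digest();
  const given = Buffer.from(mac, "hex");
  if (given.length !== expected.length || !timingSafeEqual(given, expected)) return null;

  const expiresAt = Number(expiresRaw);
  if (!Number.isFinite(expiresAt) || now > expiresAt) return null;

  // Reject replays: each token is accepted at most once.
  if (usedTokens.has(token)) return null;
  usedTokens.add(token);
  return checkpointId;
}
```

The link only identifies the checkpoint; the reviewer should still authenticate as described in 10.1 before the decision is accepted.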

10.3 Four-Eyes Principle

For very high-stakes checkpoints, require two independent reviewers to both approve before the action executes. Implement as a configurable requiredApprovals: 2 field on the checkpoint type. Neither reviewer should be able to see the other's decision before submitting their own.
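The approval count has to be taken over distinct reviewers, otherwise one person approving twice would satisfy `requiredApprovals: 2`. The sketch below illustrates that tally; the `ApprovalVote` shape and the rule that any single rejection rejects the checkpoint are assumptions.

```typescript
interface ApprovalVote {
  reviewerId: string;
  decision: "APPROVE" | "REJECT";
}

type FourEyesOutcome = "PENDING" | "APPROVED" | "REJECTED";

// Evaluates the votes cast so far against the required number of independent approvals.
function fourEyesOutcome(votes: ApprovalVote[], requiredApprovals: number): FourEyesOutcome {
  // Assumption: one rejection is enough to stop the action.
  if (votes.some(v => v.decision === "REJECT")) return "REJECTED";
  // Count each reviewer once, however many times they voted.
  const distinctApprovers = new Set(
    votes.filter(v => v.decision === "APPROVE").map(v => v.reviewerId)
  );
  return distinctApprovers.size >= requiredApprovals ? "APPROVED" : "PENDING";
}
```

Hiding each reviewer's vote from the other is a concern for the reviewer interface and its API, not for this tally: the endpoint that lists votes should return nothing until the caller has submitted their own.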


11. Failure Modes and Mitigations

| Failure Mode | Detection | Mitigation |
| --- | --- | --- |
| Agent state lost on process restart | Agent cannot resume after approval | Persist full serialised state to durable store before creating checkpoint |
| Reviewer approves without reading | Rubber-stamp rate above 20% | Improve summary quality; add a required free-text acknowledgement field |
| SLA timeout auto-cancels legitimate work | Auto-cancel rate spikes | Increase SLA for affected tier; add proactive reviewer reminders |
| Adversarial input manipulates summary | Reviewer approves harmful action based on misleading summary | Run summary through safety classifier; include raw action payload for reviewer to inspect |
| Human review queue becomes bottleneck | Queue depth above alert threshold | Add reviewers; introduce tiered routing (junior for LOW/STANDARD, senior for HIGH/CRITICAL) |
| Audit log tampered | Hash chain integrity check fails | Use append-only log with cryptographic anchoring |
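The hash chain integrity check for the audit log can be sketched as follows. Each entry stores the hash of its predecessor, so editing or removing any earlier entry invalidates every hash after it. The `AuditEntry` fields and function names are illustrative assumptions, and this sketch covers only the chain itself, not external anchoring of the latest hash.

```typescript
import { createHash } from "node:crypto";

interface AuditEntry {
  checkpointId: string;
  decision: string;
  reviewerId: string;
  timestamp: string;
  prevHash: string; // hash of the previous entry, or GENESIS_HASH for the first
  hash: string;     // SHA-256 over this entry's fields plus prevHash
}

const GENESIS_HASH = "0".repeat(64);

// Hashes the fields in a fixed order so the result is deterministic.
function hashEntry(e: Omit<AuditEntry, "hash">): string {
  const canonical = JSON.stringify([e.prevHash, e.checkpointId, e.decision, e.reviewerId, e.timestamp]);
  return createHash("sha256").update(canonical).digest("hex");
}

// Returns a new log with the record appended and linked to the previous entry.
function appendAudit(log: AuditEntry[], record: Omit<AuditEntry, "prevHash" | "hash">): AuditEntry[] {
  const last = log[log.length - 1];
  const body = { ...record, prevHash: last ? last.hash : GENESIS_HASH };
  return [...log, { ...body, hash: hashEntry(body) }];
}

// Recomputes every hash and link; false means the log was altered or reordered.
function verifyChain(log: AuditEntry[]): boolean {
  let prev = GENESIS_HASH;
  for (const entry of log) {
    const { hash, ...body } = entry;
    if (entry.prevHash !== prev || hashEntry(body) !== hash) return false;
    prev = hash;
  }
  return true;
}
```

A chain on its own does not stop an attacker who can rewrite the whole log; periodically publishing the latest hash to a separate system is what makes wholesale replacement detectable.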

12. Compliance and Governance

12.1 EU AI Act Article 14 — Human Oversight

Article 14 requires that high-risk AI systems be designed to allow human operators to understand capabilities and limitations, monitor operation, and intervene or interrupt. The HITL Agent pattern provides the architectural mechanism. For compliance evidence, each checkpoint record demonstrates: what the AI proposed, what information was presented to the human, what the human decided, and when.

Important: Article 14 requires that oversight is meaningful, not merely procedural. Examiners are likely to look at reviewer dwell time, rejection rates, and modification rates, so a system where 100% of checkpoints are approved instantly will attract scrutiny.

12.2 GDPR Data Minimisation

Checkpoint records may contain personal data from the agent's context. Apply data minimisation: include only the data elements necessary for the reviewer to make an informed decision. Apply the same retention period to checkpoint records as to the underlying data they reference.
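One way to apply data minimisation is an explicit allowlist per checkpoint type, so that fields reach the checkpoint record only when someone has decided the reviewer needs them. The sketch below is illustrative; the allowlist approach and the `minimiseContext` name are assumptions rather than part of the pattern.

```typescript
// Copies only allowlisted fields from the agent's context into the checkpoint record.
// Fields not named in `allowed` are dropped, including any added to the context later.
function minimiseContext(
  context: Record<string, unknown>,
  allowed: readonly string[]
): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const key of allowed) {
    // Own-property check so inherited keys such as "toString" are never copied.
    if (Object.prototype.hasOwnProperty.call(context, key)) {
      out[key] = context[key];
    }
  }
  return out;
}
```

An allowlist fails safe: when the agent's context gains a new personal-data field, it stays out of checkpoint records until the allowlist is deliberately extended.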

12.3 Financial Services — MiFID II / SR 11-7

For AI systems making or informing investment decisions, every checkpoint approval is a supervisory control event under MiFID II. The audit log must identify the supervisor, their qualification, and their rationale. Approvals with no comment field entry should be flagged as inadequate supervision records.


13. Testing Strategy

13.1 Unit Tests

  • Checkpoint evaluator: for each action type in the policy table, assert correct requiresCheckpoint and tier values.
  • Timeout handler: given a checkpoint with reviewBySLA in the past, assert the correct escalation or auto-cancel action is triggered.
  • Summary builder: given a known agent reasoning and action, assert the five summary fields are non-empty and in plain language (using a stub LLM).

13.2 Integration Tests

  • Full checkpoint flow: agent reaches a checkpoint; record is created in the queue; mock reviewer approves; agent resumes and completes action.
  • Rejection flow: mock reviewer rejects; assert the proposed action is not executed and a rejection audit record is written.
  • Timeout flow: create a checkpoint with a past SLA deadline; assert the timeout handler escalates it; assert the original reviewer receives a notification.
  • Four-eyes flow: assert that a single approval is not sufficient to proceed; assert execution resumes only after second independent approval.

13.3 End-to-End Playwright Tests

  • Navigate to the reviewer interface; assert the checkpoint summary renders with all five plain-language fields visible.
  • Click Approve; assert the agent resumes and the downstream action (e.g., email stub receives the message) executes.
  • Click Reject; assert the agent halts and a cancellation record appears in the audit log.
  • Wait for a timeout (accelerated clock in test environment); assert the auto-cancel fires and the audit log records AUTO_CANCELLED.

14. Variants and Extensions

14.1 Streaming Checkpoint (Real-Time Interruption)

For agents with streaming output, implement a checkpoint that can interrupt a running agent mid-generation. The reviewer sees the partial output and can halt it before completion. Requires streaming interrupt support in the underlying agent framework.

14.2 AI-Assisted Review

For high-volume checkpoints, deploy a pre-reviewer AI that analyses the checkpoint and provides a recommended decision with a confidence score to the human reviewer. The human makes the final call. The pre-reviewer's recommendation must be flagged clearly as AI-generated and must not be pre-populated as the default selection.

14.3 Asynchronous Batch Review

For LOW-tier checkpoints, batch multiple pending items into a single review session presented as a list. The reviewer can bulk-approve low-risk items and drill into any that need closer inspection. Reduces per-checkpoint overhead for high-volume workflows.


15. Trade-off Analysis

| Dimension | HITL Checkpoint | No Checkpoint | Automated Rule Check |
| --- | --- | --- | --- |
| Risk control | Highest | None | Moderate |
| Latency | Highest (human in path) | Lowest | Low |
| Regulatory compliance | Full (Article 14 evidence) | Non-compliant for high-risk | Partial |
| Reviewer burden | High | None | None |
| Correctness for edge cases | Highest (human judgment) | Agent-only | Limited by rule coverage |

16. Known Implementations

| Organisation Type | Use Case | Checkpoint Types | Reported Outcome |
| --- | --- | --- | --- |
| Global bank | AI-generated customer communications | PRE_SEND_EMAIL | Zero compliance violations over 24-month period; 4.2% rejection rate |
| Insurance carrier | AI risk assessment routing | POST_REASONING risk score | 12% of AI risk scores modified by reviewers; downstream accuracy improved 8% |
| Healthcare system | Clinical recommendation agent | PRE_ACTION medication order | Regulatory audit passed; zero patient safety incidents attributed to AI |
| Law firm | Contract negotiation AI | PRE_ACTION counter-offer send | Average review time 8 minutes; client satisfaction 94% |

17. Related Patterns

| Pattern ID | Name | Relationship |
| --- | --- | --- |
| EAAPL-MAG001 | Multi-Agent Orchestration | HITL checkpoints inserted at orchestration decision points |
| EAAPL-MAG002 | Supervisor Agent | Supervisor escalates HUMAN_REVIEW outcomes to this pattern |
| EAAPL-MAG006 | Agent Handoff Protocol | Checkpoint records use handoff schema for state serialisation |
| EAAPL-INT007 | AI Circuit Breaker | Complements HITL by handling technical failures; HITL handles risk decisions |

18. References

  1. EU AI Act (Regulation 2024/1689), Article 14: Human Oversight of High-Risk AI Systems
  2. EU AI Act Annex III: High-Risk AI System Classifications
  3. NIST AI RMF 1.0, Govern 4.2: Human Oversight Mechanisms
  4. Gartner, "Designing AI Systems for Regulatory Compliance," 2025 (ID: G00819234)
  5. Anthropic, "Responsible Scaling Policy," 2024 — anthropic.com/responsible-scaling-policy
  6. ISO/IEC 42001:2023, Section 6.1.2: AI Risk Treatment — Human Oversight Controls
  7. MiFID II (Directive 2014/65/EU), Article 17: Algorithmic Trading Controls
  8. SR 11-7: Guidance on Model Risk Management — Section IV: Ongoing Monitoring
  9. Shneiderman, B., "Human-Centered AI," Oxford University Press, 2022
  10. Wachter, S. et al., "Counterfactual Explanations Without Opening the Black Box," Harvard JOLT, 2018