EAAPL-MAG003 — Human-in-the-Loop Agent

Status: Proven
Tags: agent, human-oversight, eu-ai-act, high-complexity
Version: 2.0.0
Last Updated: 2026-06-12

1. Pattern Identity

Field	Value
Pattern ID	EAAPL-MAG003
Name	Human-in-the-Loop Agent
Category	Multi-Agent
Maturity	Proven
Complexity	High
Related Patterns	EAAPL-MAG001 · EAAPL-MAG002 · EAAPL-MAG006 · EAAPL-INT007

2. Executive Summary

The Human-in-the-Loop (HITL) Agent pattern inserts mandatory human approval checkpoints into autonomous AI workflows before the agent executes irreversible actions or reaches consequential decisions. It is not a concession to lack of AI capability — it is a structural compliance and risk control. EU AI Act Article 14 requires that high-risk AI systems enable human operators to meaningfully intervene; this pattern operationalises that requirement as a first-class architectural component. The pattern governs: where in the workflow checkpoints are placed (before irreversible actions; after high-stakes reasoning completion); how checkpoints are presented to reviewers (plain-language summaries, not technical dumps); how the approval queue is managed (async, prioritised, SLA-bound); how timeouts are handled (pause then escalate then auto-cancel); and how every human decision is recorded for audit. Critically, the pattern distinguishes genuine oversight from compliance theatre — a checkpoint that shows reviewers an opaque JSON blob is not meaningful oversight regardless of whether it is technically present.

3. Problem Statement

3.1 Context

Autonomous AI agents acting without human checkpoints carry two distinct risks. First, operational risk: the agent may take an irreversible action (sending a customer email, committing a financial transaction, modifying a production database) based on reasoning that is subtly wrong. Second, regulatory risk: for high-risk AI systems under EU AI Act Annex III, autonomous action without human oversight is non-compliant regardless of the agent's actual performance.

3.2 Forces in Tension

Autonomy vs. oversight. More checkpoints reduce risk but increase latency and human reviewer burden. Too many checkpoints and reviewers become rubber-stampers who approve without reading — creating a false audit trail worse than no checkpoint.
Reviewer cognitive load vs. audit completeness. Reviewers need enough context to make genuine decisions. But providing all context makes reviews slow and exhausting.
Async throughput vs. reviewer responsiveness. An async approval queue handles burst load but requires reviewers to process queued items within SLA, which requires staffing and tooling.
Auto-cancel vs. partial completion. When a reviewer does not respond within SLA, the safest action (auto-cancel) loses work already done. Partial completion raises consistency risks.

3.3 Failure Modes Without This Pattern

Without HITL checkpoints on irreversible actions, a single agent reasoning error propagates to an irreversible real-world consequence before any human has the opportunity to intervene. Without structured audit logging, there is no evidence that oversight occurred, creating regulatory exposure. Without timeout handling, a slow reviewer blocks the entire workflow indefinitely.

4. Solution

4.1 HITL Agent Workflow

flowchart TD
    subgraph Decision["Checkpoint Decision"]
        A[Agent Reasoning Complete]
        B{Checkpoint Required}
    end
    subgraph Review["Human Review"]
        C[Build Summary]
        D[Approval Queue]
        E{Reviewer Decision}
    end
    subgraph Outcome["Outcome"]
        F[Execute Action]
        G[Action Cancelled]
        H[Re-execute with Edits]
        I[Audit Record]
    end
    A --> B
    B -->|no| F
    B -->|yes| C --> D --> E
    E -->|approve| F --> I
    E -->|reject| G
    E -->|modify| H --> I
    E -->|timeout| J{Escalate or Cancel}
    J -->|escalate| K[Senior Queue]
    J -->|cancel| G
    G --> I
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#f3e8ff,stroke:#a855f7
    style C fill:#f0fdf4,stroke:#22c55e
    style D fill:#fef9c3,stroke:#eab308
    style E fill:#f3e8ff,stroke:#a855f7
    style F fill:#d1fae5,stroke:#10b981
    style G fill:#fee2e2,stroke:#ef4444
    style H fill:#f0fdf4,stroke:#22c55e
    style I fill:#d1fae5,stroke:#10b981
    style J fill:#f3e8ff,stroke:#a855f7
    style K fill:#d1fae5,stroke:#10b981

4.2 Checkpoint Classification

flowchart TD
    subgraph Input["Proposed Action"]
        A[Action to Evaluate]
    end
    subgraph Classification["Action Classification"]
        B{Action Type}
        C[External Communication]
        D[Irreversible DB Write]
        E[Financial Transaction]
        F[High-Stakes Reasoning]
        G[Read-Only or Reversible]
    end
    subgraph Queue["Checkpoint Queue"]
        H[Approval Queue]
    end
    A --> B
    B --> C --> H
    B --> D --> H
    B --> E --> H
    B --> F --> H
    B --> G --> I[Auto-Proceed]
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#f3e8ff,stroke:#a855f7
    style C fill:#f0fdf4,stroke:#22c55e
    style D fill:#f0fdf4,stroke:#22c55e
    style E fill:#f0fdf4,stroke:#22c55e
    style F fill:#f0fdf4,stroke:#22c55e
    style G fill:#f0fdf4,stroke:#22c55e
    style H fill:#fef9c3,stroke:#eab308
    style I fill:#d1fae5,stroke:#10b981

5. Structure

5.1 Component Catalogue

Component	Responsibility	Technology Options
Checkpoint Evaluator	Determines if an action requires human approval	Rule engine, policy-as-code
Summary Builder	Translates agent reasoning into plain-language reviewer summary	LLM with summarisation prompt
Approval Queue	Stores pending checkpoints, manages priority and SLA	Postgres + background worker, AWS SQS, Temporal workflow
Reviewer Interface	Presents checkpoints to humans with approve/reject/modify controls	Web app, Slack bot, email with signed links
Timeout Handler	Monitors SLA deadlines, triggers escalation or auto-cancel	Cron job, Temporal timer
Audit Logger	Immutable record of every checkpoint and human decision	Append-only Postgres table, AWS CloudTrail
Resumption Handler	Resumes agent execution after approval with reviewer context injected	Workflow continuation via stored state

5.2 Checkpoint Summary Schema

{
  "checkpointId": "uuid-v4",
  "taskId": "uuid-v4",
  "agentId": "contract-review-agent-v2",
  "checkpointType": "PRE_ACTION",
  "actionType": "SEND_EMAIL",
  "timestamp": "ISO-8601",
  "prioritySLA": {
    "level": "HIGH",
    "reviewByMs": 14400000
  },
  "humanReadableSummary": {
    "whatTheAgentWantsToDo": "Send a contract rejection email to vendor@acme.com",
    "whyItWantsToDoThis": "Clause 4.2 contains unlimited liability exceeding our risk policy",
    "dataUsed": ["Contract PDF uploaded 2026-06-11", "Risk policy v3.2 (internal)"],
    "whatHappensIfApproved": "The rejection email is sent to the vendor and cannot be recalled",
    "alternatives": ["Request clause modification", "Escalate to legal team"]
  },
  "rawAgentReasoning": "...",
  "proposedActionPayload": { "to": "vendor@acme.com", "subject": "...", "body": "..." },
  "reviewerOptions": ["APPROVE", "REJECT", "MODIFY", "ESCALATE"]
}

6. Behaviour

6.1 Checkpoint Placement Strategy

Checkpoints are placed at two types of junctures:

BEFORE irreversible actions. An action is irreversible if it cannot be undone programmatically without side effects or human effort. The canonical list:

Sending any external communication (email, SMS, API call to a third-party system)
Writing to a production database with no soft-delete mechanism
Executing a financial transaction or payment authorisation
Creating or modifying a legal document to be shared externally
Deploying code or configuration changes to production
Provisioning or deprovisioning cloud infrastructure

AFTER completing high-stakes reasoning. For decisions where the agent's reasoning output will be consumed by downstream systems or used as trusted input, a post-reasoning checkpoint allows a reviewer to validate quality before it is consumed. This is critical for risk scores, legal opinions, medical recommendations, and security assessments.

What does NOT need a checkpoint:

Read-only operations (database queries, file reads, API GETs)
Reversible internal state changes (setting a flag, updating a draft)
Low-stakes formatting or summarisation tasks
Actions on sandboxed test data
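
The placement rules above can be expressed as a small pure function. The sketch below is illustrative only: the action-kind names, the `reversible` and `sandboxed` flags, and `classifyCheckpoint` itself are assumptions, not part of the pattern's defined interface. Any kind that is not explicitly exempt falls through to a pre-action checkpoint.

```typescript
// Hypothetical action kinds drawn from the lists above; the names are illustrative.
type ActionKind =
  | "EXTERNAL_COMMUNICATION"
  | "DB_WRITE"
  | "FINANCIAL_TRANSACTION"
  | "LEGAL_DOCUMENT_CHANGE"
  | "PRODUCTION_DEPLOY"
  | "INFRASTRUCTURE_CHANGE"
  | "HIGH_STAKES_REASONING"
  | "READ_ONLY"
  | "DRAFT_UPDATE"
  | "FORMATTING";

interface ProposedAction {
  kind: ActionKind;
  reversible: boolean; // true if it can be undone programmatically (e.g. soft-delete)
  sandboxed: boolean;  // true if it only touches sandboxed test data
}

type CheckpointType = "PRE_ACTION" | "POST_REASONING" | "NONE";

// Kinds the text lists as not needing a checkpoint.
const EXEMPT_KINDS: ReadonlySet<ActionKind> = new Set<ActionKind>([
  "READ_ONLY",
  "DRAFT_UPDATE",
  "FORMATTING",
]);

function classifyCheckpoint(action: ProposedAction): CheckpointType {
  if (action.sandboxed) return "NONE";
  if (EXEMPT_KINDS.has(action.kind)) return "NONE";
  if (action.kind === "HIGH_STAKES_REASONING") return "POST_REASONING";
  // A database write with a soft-delete mechanism counts as reversible.
  if (action.kind === "DB_WRITE" && action.reversible) return "NONE";
  // Everything else is treated as irreversible and gated before execution.
  return "PRE_ACTION";
}
```

Defaulting to `PRE_ACTION` means a newly added action kind is gated until someone deliberately exempts it.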

6.2 Checkpoint Presentation Quality

This is the most commonly misconfigured aspect of the HITL pattern. A poorly designed checkpoint presenting reviewers with a raw JSON blob or a multi-page reasoning dump is compliance theatre. Reviewers will approve without reading.

The checkpoint summary must answer five questions in plain language:

What does the agent want to do? (One sentence, action verb)
Why does it want to do this? (Business reason in non-technical terms)
What data did it use to reach this conclusion? (Named sources, not IDs)
What will happen if you approve? (Concrete consequence)
What are the alternatives? (At least one option other than approve or reject)

The summary must be generated by an LLM summarisation step, not assembled programmatically from field names. Test it by asking a non-technical reviewer whether they understand it — if they do not, the prompt needs revision.
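
A structural check can catch summaries that skip one of the five questions before they reach a reviewer. The following is a minimal sketch under stated assumptions: the field names are hypothetical (in particular `whatHappensIfApproved` is an assumed name for the fourth question), and `findSummaryGaps` is not part of the pattern. It verifies only that each answer is present; whether the wording is plain language still needs the human test described above.

```typescript
// Assumed shape: one field per question in the list above.
interface CheckpointSummary {
  whatTheAgentWantsToDo: string;
  whyItWantsToDoThis: string;
  dataUsed: string[];
  whatHappensIfApproved: string;
  alternatives: string[];
}

function isBlank(text: string): boolean {
  return text.trim().length === 0;
}

function hasNoContent(items: string[]): boolean {
  return items.filter((item) => !isBlank(item)).length === 0;
}

// Returns the names of unanswered questions; an empty array means all five are answered.
function findSummaryGaps(summary: CheckpointSummary): string[] {
  const gaps: string[] = [];
  if (isBlank(summary.whatTheAgentWantsToDo)) gaps.push("whatTheAgentWantsToDo");
  if (isBlank(summary.whyItWantsToDoThis)) gaps.push("whyItWantsToDoThis");
  if (hasNoContent(summary.dataUsed)) gaps.push("dataUsed");
  if (isBlank(summary.whatHappensIfApproved)) gaps.push("whatHappensIfApproved");
  if (hasNoContent(summary.alternatives)) gaps.push("alternatives");
  return gaps;
}
```

A summary with gaps would typically be regenerated rather than queued, since a reviewer cannot make up for a missing answer.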

6.3 Approval Queue Design

Async by default. A pending checkpoint must not hold a running agent process open, nor demand an immediate response from the reviewer. The agent suspends execution by persisting its state and exiting. When the reviewer acts, the agent resumes from the saved state with the reviewer's decision injected into its context.

Priority tiers.

Tier	Examples	Review SLA
CRITICAL	Financial transactions above threshold; production infrastructure changes	30 minutes
HIGH	External communications; contract executions; risk decisions	4 hours
STANDARD	Internal reports; draft documents	24 hours
LOW	Informational summaries; non-consequential recommendations	72 hours

Queue implementation. Use a durable queue (not in-memory) so checkpoint items survive process restarts. Each item has: checkpointId, taskId, createdAt, reviewByMs, status, reviewerId, reviewedAt, reviewerComment.
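
The priority tiers and SLAs above translate directly into a sort order for the reviewer's work list. This is a sketch, assuming a simplified `QueueItem` with only the fields needed for ordering; `TIER_SLA_MS`, `deadlineOf`, and `sortForReview` are illustrative names.

```typescript
type Tier = "CRITICAL" | "HIGH" | "STANDARD" | "LOW";

const MINUTE_MS = 60_000;
const HOUR_MS = 60 * MINUTE_MS;

// Review SLAs from the priority tier table, in milliseconds.
const TIER_SLA_MS: Record<Tier, number> = {
  CRITICAL: 30 * MINUTE_MS,
  HIGH: 4 * HOUR_MS,
  STANDARD: 24 * HOUR_MS,
  LOW: 72 * HOUR_MS,
};

const TIER_RANK: Record<Tier, number> = { CRITICAL: 0, HIGH: 1, STANDARD: 2, LOW: 3 };

// Simplified queue item (assumption): only the fields needed for ordering.
interface QueueItem {
  checkpointId: string;
  tier: Tier;
  createdAt: number; // epoch milliseconds
}

function deadlineOf(item: QueueItem): number {
  return item.createdAt + TIER_SLA_MS[item.tier];
}

// Most urgent tier first; within a tier, earliest deadline first.
function sortForReview(items: QueueItem[]): QueueItem[] {
  return [...items].sort(
    (a, b) => TIER_RANK[a.tier] - TIER_RANK[b.tier] || deadlineOf(a) - deadlineOf(b)
  );
}
```

In a Postgres-backed queue the same ordering would normally live in the query's ORDER BY clause rather than in application code.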

6.4 Timeout Handling

When a reviewer does not act within SLA:

Warning notification at 80% of SLA elapsed. Alert the reviewer via all configured channels (email, Slack, SMS).
Escalation at 100% of SLA elapsed. Move the checkpoint to a senior reviewer queue. Notify the original reviewer it has been escalated.
Auto-cancel at 150% of SLA elapsed (configurable). Cancel the pending action. Write an audit record with AUTO_CANCELLED status. Notify the task initiator.

The specific timeout policy must be configurable per checkpoint type. Financial transactions should never auto-approve on timeout. Low-stakes informational checkpoints may auto-approve after escalation.
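
The three thresholds can be evaluated with one comparison per stage. The sketch below assumes thresholds are expressed as percentages of the SLA, as in the list above; `timeoutStage` and its parameter names are illustrative. For a HIGH checkpoint with a 4-hour SLA and thresholds of 80/100/150, the warning fires at 3 h 12 min, escalation at 4 h, and auto-cancel at 6 h.

```typescript
interface TimeoutPolicy {
  warningPct: number;
  escalatePct: number;
  autoCancelPct: number;
}

type TimeoutStage = "NONE" | "WARN" | "ESCALATE" | "AUTO_CANCEL";

// Returns the most severe stage the checkpoint has reached at time `now`.
// All times are epoch milliseconds; slaMs is the review SLA for the tier.
function timeoutStage(
  createdAt: number,
  slaMs: number,
  now: number,
  policy: TimeoutPolicy
): TimeoutStage {
  const elapsed = now - createdAt;
  // Compare elapsed * 100 against pct * slaMs to avoid floating-point division.
  const reached = (pct: number): boolean => elapsed * 100 >= pct * slaMs;
  if (reached(policy.autoCancelPct)) return "AUTO_CANCEL";
  if (reached(policy.escalatePct)) return "ESCALATE";
  if (reached(policy.warningPct)) return "WARN";
  return "NONE";
}
```

The function only reports the stage. Whether the final stage cancels or, for low-stakes types, auto-approves remains a per-checkpoint-type policy decision.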

7. Implementation Guide

7.1 Step-by-Step

Step 1 — Define your checkpoint policy. Create a policy document and policy-as-code implementation that defines which action types require checkpoints, which SLA tier applies to each, and the timeout behaviour. This document is also your EU AI Act Article 14 evidence artefact.

Step 2 — Build the checkpoint evaluator. A pure function that takes a proposed action and returns: requiresCheckpoint: boolean, checkpointType, priorityTier. Implement as a lookup against the action type and a risk policy table.

Step 3 — Build the summary builder. LLM call with a prompt that takes the agent's reasoning, proposed action, and data sources, and returns the five plain-language answers. Test with non-technical reviewers before deploying to production.

Step 4 — Build the approval queue. Use a database table with a polling worker or an event-driven queue. Implement the priority sort and SLA deadline fields. The queue must be durable — survive process restarts without losing pending checkpoints.

Step 5 — Build the reviewer interface. Minimum viable: a web page (or Slack bot) showing the five plain-language summary fields, the approve/reject/modify/escalate buttons, and a free-text comment field. Every reviewer decision must be authenticated with the reviewer's identity.

Step 6 — Build the timeout handler. A cron job or Temporal workflow that runs every minute, queries for pending checkpoints that have crossed a warning, escalation, or auto-cancel threshold, and triggers the corresponding response.

Step 7 — Build the resumption handler. When a reviewer approves or modifies, the handler retrieves the suspended agent state, injects the reviewer's decision as a message in the agent's context, and resumes execution.
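
The suspend-and-resume cycle in Step 7 can be sketched as follows. This is an illustration under assumptions: the message shape, the `suspend` and `resume` names, and the wording of the injected note are all hypothetical, and the in-memory `Map` stands in for the durable store that a real deployment requires.

```typescript
type ReviewerAction = "APPROVE" | "REJECT" | "MODIFY";

interface AgentMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface SuspendedState {
  taskId: string;
  messages: AgentMessage[];
}

interface ReviewerDecision {
  action: ReviewerAction;
  reviewerId: string;
  comment?: string;
}

// Stand-in for a durable store. A Map does NOT survive process restarts.
const suspendedStates = new Map<string, string>();

function suspend(checkpointId: string, state: SuspendedState): void {
  // Serialise, as a durable store would, so later mutations cannot leak in.
  suspendedStates.set(checkpointId, JSON.stringify(state));
}

// Returns the state to resume from, or null when the reviewer rejected the action.
function resume(checkpointId: string, decision: ReviewerDecision): SuspendedState | null {
  const stored = suspendedStates.get(checkpointId);
  if (stored === undefined) {
    throw new Error(`No suspended state for checkpoint ${checkpointId}`);
  }
  suspendedStates.delete(checkpointId); // each checkpoint resumes at most once
  if (decision.action === "REJECT") return null;

  const state = JSON.parse(stored) as SuspendedState;
  const note =
    `Human reviewer ${decision.reviewerId} chose ${decision.action}` +
    (decision.comment ? `: ${decision.comment}` : "");
  // Inject the decision as a new message in the agent's context.
  return { ...state, messages: [...state.messages, { role: "user", content: note }] };
}
```

Deleting the stored state on first use prevents a replayed approval from resuming the same action twice.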

7.2 Code Skeleton (TypeScript)

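import { randomUUID } from "node:crypto";

type Tier = "CRITICAL" | "HIGH" | "STANDARD" | "LOW";

interface CheckpointPolicy {
  actionType: string;
  requiresCheckpoint: boolean;
  tier: Tier;
  // Thresholds expressed as a percentage of the tier's review SLA.
  timeoutBehaviours: { warningPct: number; escalatePct: number; autoCancelPct: number };
}

// Placeholder shapes and collaborators: supply your own implementations.
interface ProposedAction { type: string; payload: unknown }
interface SerializedAgentState { taskId: string }
interface ReviewerDecision { action: string; reviewerId: string; comment?: string }
declare const db: {
  checkpoints: {
    insert(row: Record<string, unknown>): Promise<void>;
    update(id: string, patch: Record<string, unknown>): Promise<void>;
  };
};
declare const auditLog: { append(entry: Record<string, unknown>): Promise<void> };
declare function buildHumanReadableSummary(action: ProposedAction, reasoning: string): Promise<unknown>;
declare function notifyReviewers(checkpointId: string, tier: Tier, summary: unknown): Promise<void>;
declare function waitForDecision(checkpointId: string): Promise<ReviewerDecision>;
declare function tierToMs(tier: Tier): number;

const POLICY: CheckpointPolicy[] = [
  { actionType: "SEND_EMAIL", requiresCheckpoint: true, tier: "HIGH",
    timeoutBehaviours: { warningPct: 80, escalatePct: 100, autoCancelPct: 150 } },
  { actionType: "FINANCIAL_TRANSACTION", requiresCheckpoint: true, tier: "CRITICAL",
    timeoutBehaviours: { warningPct: 80, escalatePct: 100, autoCancelPct: 200 } },
  // Timeout thresholds are unused for action types that never create a checkpoint.
  { actionType: "DB_READ", requiresCheckpoint: false, tier: "LOW",
    timeoutBehaviours: { warningPct: 100, escalatePct: 100, autoCancelPct: 100 } }
];

async function checkpointGate(
  action: ProposedAction,
  agentReasoning: string,
  agentState: SerializedAgentState
): Promise<"PROCEED" | "CANCELLED"> {
  const policy = POLICY.find(p => p.actionType === action.type);
  // Only an explicit policy entry can waive the checkpoint. An action type
  // missing from the table fails closed and is reviewed as CRITICAL.
  if (policy && !policy.requiresCheckpoint) return "PROCEED";
  const tier: Tier = policy?.tier ?? "CRITICAL";

  const summary = await buildHumanReadableSummary(action, agentReasoning);
  const checkpointId = randomUUID();

  // Persist the frozen agent state before notifying anyone, so an approval
  // can always be resumed even if this process restarts.
  await db.checkpoints.insert({
    id: checkpointId,
    taskId: agentState.taskId,
    status: "PENDING",
    tier,
    reviewBySLA: Date.now() + tierToMs(tier), // absolute deadline, epoch ms
    summary,
    proposedAction: action,
    frozenAgentState: agentState
  });

  await notifyReviewers(checkpointId, tier, summary);
  // Execution suspends here. Resumption is event-driven via webhook from reviewer UI.
  const decision = await waitForDecision(checkpointId);

  await db.checkpoints.update(checkpointId, {
    status: decision.action,
    reviewerId: decision.reviewerId,
    reviewedAt: new Date(),
    reviewerComment: decision.comment
  });

  await auditLog.append({ checkpointId, decision, timestamp: new Date().toISOString() });
  // Only APPROVE executes the original action. MODIFY is handed to the
  // resumption handler, which re-runs the agent with the reviewer's edits.
  return decision.action === "APPROVE" ? "PROCEED" : "CANCELLED";
}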

8. Observability

8.1 Key Metrics

Metric	Description	Alert Threshold
Approval queue depth	Number of pending checkpoints	> 50 (staffing issue)
Average review time by tier	Mean time from created to reviewed	> 90% of SLA per tier
Auto-cancel rate	% of checkpoints that expire without review	> 5%
Escalation rate	% of checkpoints escalated to senior review	> 10%
Reviewer approval rate	% of checkpoints approved vs rejected/modified	< 70% (high rejection indicates agent quality issue)
Rubber-stamp rate	% of approvals with review time under 30s	> 20% (oversight theatre indicator)

8.2 Rubber-Stamp Detection

The rubber-stamp rate metric is critical for genuine compliance. If reviewers are approving checkpoints in under 30 seconds consistently, they are not reading the summaries. Alert on this. Interventions: improve summary quality; reduce checkpoint frequency (too many checkpoints lead to reviewer fatigue); investigate whether checkpoints are too low-stakes to warrant human review.
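
The metric itself is a simple ratio. In this sketch the `ReviewRecord` shape is an assumption, and review time is measured from when the reviewer opened the checkpoint (`openedAt`, a hypothetical field) to when they decided. If only queue creation time is recorded, the measure would overstate dwell time.

```typescript
interface ReviewRecord {
  decision: "APPROVE" | "REJECT" | "MODIFY";
  openedAt: number;  // epoch ms when the reviewer opened the checkpoint (assumed field)
  decidedAt: number; // epoch ms when the decision was submitted
}

const RUBBER_STAMP_THRESHOLD_MS = 30_000;

// Percentage of approvals decided in under 30 seconds; 0 when there are no approvals.
function rubberStampRate(records: ReviewRecord[]): number {
  const approvals = records.filter((r) => r.decision === "APPROVE");
  if (approvals.length === 0) return 0;
  const fast = approvals.filter(
    (r) => r.decidedAt - r.openedAt < RUBBER_STAMP_THRESHOLD_MS
  );
  return (fast.length / approvals.length) * 100;
}
```

Rejections and modifications are excluded from the denominator because a fast rejection is not evidence of rubber-stamping.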

9. Cost Governance

Summary builder LLM cost. Use a cheaper model tier for summary generation. The summary LLM call should not exceed 15% of the cost of the agent call that triggered the checkpoint.
Checkpoint volume control. If checkpoint volume is overwhelming reviewers, the solution is not to remove checkpoints — it is to invest in reviewer tooling and staffing. Never remove checkpoints from high-risk action types for cost reasons.
State storage cost. Suspended agent states can be large. Store only the minimum state needed for resumption. Implement a TTL on suspended state: once the maximum expected review window has passed, auto-cancel the checkpoint and purge the stored state. The audit record of the checkpoint is retained.

10. Security Considerations

10.1 Reviewer Identity Authentication

Every reviewer decision must be authenticated. An unauthenticated approval endpoint is a critical security vulnerability — an attacker who can send an HTTP request to the endpoint can bypass human oversight entirely. Require: authenticated session (SSO), MFA for CRITICAL tier checkpoints, reviewer identity recorded in audit log with session token fingerprint.

10.2 Approval Link Integrity

If checkpoints are presented via email links, sign the links with a time-limited HMAC token. Links must expire after the review SLA. Replayed or shared links must be rejected after first use.
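
A minimal sketch of such a token, using Node's built-in `crypto` module. The token layout (`checkpointId.expiresAt.signature`), the function names, and the in-memory `redeemed` set are assumptions for illustration. A real deployment would keep redeemed tokens in a durable store and the secret in a secrets manager.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Stand-in for a durable store of redeemed tokens.
const redeemed = new Set<string>();

function sign(secret: string, checkpointId: string, expiresAt: number): string {
  return createHmac("sha256", secret).update(`${checkpointId}.${expiresAt}`).digest("hex");
}

// expiresAt is an epoch-millisecond deadline, e.g. the end of the review SLA.
function issueToken(secret: string, checkpointId: string, expiresAt: number): string {
  return `${checkpointId}.${expiresAt}.${sign(secret, checkpointId, expiresAt)}`;
}

// Returns the checkpoint ID if the token is authentic, unexpired, and unused; otherwise null.
function redeemToken(secret: string, token: string, now: number): string | null {
  const parts = token.split(".");
  if (parts.length !== 3) return null;
  const [checkpointId, expiresRaw, signature] = parts;
  const expiresAt = Number(expiresRaw);
  if (!Number.isSafeInteger(expiresAt)) return null;

  const expected = Buffer.from(sign(secret, checkpointId, expiresAt), "hex");
  const given = Buffer.from(signature, "hex");
  // timingSafeEqual requires equal lengths, so check that first.
  if (given.length !== expected.length || !timingSafeEqual(given, expected)) return null;
  if (now >= expiresAt) return null;

  // Key on the canonical payload so re-encoded copies of a token count as the same link.
  const key = `${checkpointId}.${expiresAt}`;
  if (redeemed.has(key)) return null;
  redeemed.add(key);
  return checkpointId;
}
```

A signed link proves only that the link is genuine. The reviewer clicking it still needs an authenticated session, as required in 10.1.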

10.3 Four-Eyes Principle

For very high-stakes checkpoints, require two independent reviewers to both approve before the action executes. Implement as a configurable requiredApprovals: 2 field on the checkpoint type. Neither reviewer should be able to see the other's decision before submitting their own.
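
One way to sketch this is a small vote tracker that exposes only the aggregate status. The class name, the rule that a single rejection rejects the checkpoint, and the method names are assumptions, not part of the pattern.

```typescript
type Vote = "APPROVE" | "REJECT";
type FourEyesStatus = "PENDING" | "APPROVED" | "REJECTED";

class FourEyesCheckpoint {
  private readonly votes = new Map<string, Vote>();
  private readonly requiredApprovals: number;

  constructor(requiredApprovals: number) {
    this.requiredApprovals = requiredApprovals;
  }

  // Each reviewer may vote once, so the same person cannot supply both approvals.
  submit(reviewerId: string, vote: Vote): void {
    if (this.votes.has(reviewerId)) {
      throw new Error(`Reviewer ${reviewerId} has already voted`);
    }
    this.votes.set(reviewerId, vote);
  }

  // Exposes only the aggregate outcome, never individual votes, so reviewers
  // cannot see each other's decisions before submitting their own.
  status(): FourEyesStatus {
    const all = Array.from(this.votes.values());
    // Assumption: any single rejection rejects the checkpoint.
    if (all.some((v) => v === "REJECT")) return "REJECTED";
    const approvals = all.filter((v) => v === "APPROVE").length;
    return approvals >= this.requiredApprovals ? "APPROVED" : "PENDING";
  }
}
```

The reviewer interface would need to enforce the same blindness, for example by hiding the status from a reviewer who has not yet voted.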

11. Failure Modes and Mitigations

Failure Mode	Detection	Mitigation
Agent state lost on process restart	Agent cannot resume after approval	Persist full serialised state to durable store before creating checkpoint
Reviewer approves without reading	Rubber-stamp rate above 20%	Improve summary quality; add a required free-text acknowledgement field
SLA timeout auto-cancels legitimate work	Auto-cancel rate spikes	Increase SLA for affected tier; add proactive reviewer reminders
Adversarial input manipulates summary	Reviewer approves harmful action based on misleading summary	Run summary through safety classifier; include raw action payload for reviewer to inspect
Human review queue becomes bottleneck	Queue depth above alert threshold	Add reviewers; introduce tiered routing (junior for LOW/STANDARD, senior for HIGH/CRITICAL)
Audit log tampered	Hash chain integrity check fails	Use append-only log with cryptographic anchoring
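
The hash chain integrity check in the last row can be sketched as follows, assuming SHA-256 from Node's `crypto` module. The entry shape and function names are illustrative. A chain only makes tampering detectable: anchoring the latest hash somewhere the writer cannot modify is still needed to stop a wholesale rewrite.

```typescript
import { createHash } from "node:crypto";

interface AuditEntry {
  checkpointId: string;
  decision: string;
  reviewerId: string;
  timestamp: string;
}

interface ChainedEntry {
  entry: AuditEntry;
  prevHash: string;
  hash: string;
}

const GENESIS_HASH = "0".repeat(64);

function hashEntry(entry: AuditEntry, prevHash: string): string {
  // Fields are listed explicitly so the serialisation order is deterministic.
  const canonical = JSON.stringify([
    entry.checkpointId, entry.decision, entry.reviewerId, entry.timestamp, prevHash,
  ]);
  return createHash("sha256").update(canonical).digest("hex");
}

// Returns a new log with the entry linked to the hash of the previous one.
function appendEntry(log: ChainedEntry[], entry: AuditEntry): ChainedEntry[] {
  const prevHash = log.length > 0 ? log[log.length - 1].hash : GENESIS_HASH;
  return [...log, { entry, prevHash, hash: hashEntry(entry, prevHash) }];
}

// Returns the index of the first broken link, or -1 if the chain is intact.
function firstBrokenLink(log: ChainedEntry[]): number {
  let prevHash = GENESIS_HASH;
  for (let i = 0; i < log.length; i++) {
    const link = log[i];
    if (link.prevHash !== prevHash || link.hash !== hashEntry(link.entry, prevHash)) return i;
    prevHash = link.hash;
  }
  return -1;
}
```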

12. Compliance and Governance

12.1 EU AI Act Article 14 — Human Oversight

Article 14 requires that high-risk AI systems be designed to allow human operators to understand capabilities and limitations, monitor operation, and intervene or interrupt. The HITL Agent pattern provides the architectural mechanism. For compliance evidence, each checkpoint record demonstrates: what the AI proposed, what information was presented to the human, what the human decided, and when.

Important: Article 14 requires that oversight is meaningful, not merely procedural. Examiners assessing this are likely to look at reviewer dwell time, rejection rates, and modification rates. A system where 100% of checkpoints are approved instantly should be expected to attract scrutiny.

12.2 GDPR Data Minimisation

Checkpoint records may contain personal data from the agent's context. Apply data minimisation: include only the data elements necessary for the reviewer to make an informed decision. Apply the same retention period to checkpoint records as to the underlying data they reference.

12.3 Financial Services — MiFID II / SR 11-7

For AI systems making or informing investment decisions, every checkpoint approval is a supervisory control event under MiFID II. The audit log must identify the supervisor, their qualification, and their rationale. Approvals with no comment field entry should be flagged as inadequate supervision records.

13. Testing Strategy

13.1 Unit Tests

Checkpoint evaluator: for each action type in the policy table, assert correct requiresCheckpoint and tier values.
Timeout handler: given a checkpoint with reviewBySLA in the past, assert the correct escalation or auto-cancel action is triggered.
Summary builder: given a known agent reasoning and action, assert the five summary fields are non-empty and in plain language (using a stub LLM).

13.2 Integration Tests

Full checkpoint flow: agent reaches a checkpoint; record is created in the queue; mock reviewer approves; agent resumes and completes action.
Rejection flow: mock reviewer rejects; assert the proposed action is not executed and a rejection audit record is written.
Timeout flow: create a checkpoint with a past SLA deadline; assert the timeout handler escalates it; assert the original reviewer receives a notification.
Four-eyes flow: assert that a single approval is not sufficient to proceed; assert execution resumes only after second independent approval.

13.3 End-to-End Playwright Tests

Navigate to the reviewer interface; assert the checkpoint summary renders with all five plain-language fields visible.
Click Approve; assert the agent resumes and the downstream action executes (e.g., the email stub receives the message).
Click Reject; assert the agent halts and a cancellation record appears in the audit log.
Wait for a timeout (accelerated clock in test environment); assert the auto-cancel fires and the audit log records AUTO_CANCELLED.

14. Variants and Extensions

14.1 Streaming Checkpoint (Real-Time Interruption)

For agents with streaming output, implement a checkpoint that can interrupt a running agent mid-generation. The reviewer sees the partial output and can halt it before completion. Requires streaming interrupt support in the underlying agent framework.

14.2 AI-Assisted Review

For high-volume checkpoints, deploy a pre-reviewer AI that analyses the checkpoint and provides a recommended decision with a confidence score to the human reviewer. The human makes the final call. The pre-reviewer's recommendation must be flagged clearly as AI-generated and must not be pre-populated as the default selection.

14.3 Asynchronous Batch Review

For LOW-tier checkpoints, batch multiple pending items into a single review session presented as a list. The reviewer can bulk-approve low-risk items and drill into any that need closer inspection. Reduces per-checkpoint overhead for high-volume workflows.

15. Trade-off Analysis

Dimension	HITL Checkpoint	No Checkpoint	Automated Rule Check
Risk control	Highest	None	Moderate
Latency	Highest (human in path)	Lowest	Low
Regulatory compliance	Full (Article 14 evidence)	Non-compliant for high-risk	Partial
Reviewer burden	High	None	None
Correctness for edge cases	Highest (human judgment)	Agent-only	Limited by rule coverage

16. Known Implementations

Organisation Type	Use Case	Checkpoint Types	Reported Outcome
Global bank	AI-generated customer communications	PRE_SEND_EMAIL	Zero compliance violations over 24-month period; 4.2% rejection rate
Insurance carrier	AI risk assessment routing	POST_REASONING risk score	12% of AI risk scores modified by reviewers; downstream accuracy improved 8%
Healthcare system	Clinical recommendation agent	PRE_ACTION medication order	Regulatory audit passed; zero patient safety incidents attributed to AI
Law firm	Contract negotiation AI	PRE_ACTION counter-offer send	Average review time 8 minutes; client satisfaction 94%

17. Related Patterns

Pattern ID	Name	Relationship
EAAPL-MAG001	Multi-Agent Orchestration	HITL checkpoints inserted at orchestration decision points
EAAPL-MAG002	Supervisor Agent	Supervisor escalates HUMAN_REVIEW outcomes to this pattern
EAAPL-MAG006	Agent Handoff Protocol	Checkpoint records use handoff schema for state serialisation
EAAPL-INT007	AI Circuit Breaker	Complements HITL by handling technical failures; HITL handles risk decisions

18. References

EU AI Act (Regulation 2024/1689), Article 14: Human Oversight of High-Risk AI Systems
EU AI Act Annex III: High-Risk AI System Classifications
NIST AI RMF 1.0, Govern 4.2: Human Oversight Mechanisms
Gartner, "Designing AI Systems for Regulatory Compliance," 2025 (ID: G00819234)
Anthropic, "Responsible Scaling Policy," 2024 — anthropic.com/responsible-scaling-policy
ISO/IEC 42001:2023, Section 6.1.2: AI Risk Treatment — Human Oversight Controls
MiFID II (Directive 2014/65/EU), Article 17: Algorithmic Trading Controls
SR 11-7: Guidance on Model Risk Management — Section IV: Ongoing Monitoring
Shneiderman, B., "Human-Centered AI," Oxford University Press, 2022
Wachter, S. et al., "Counterfactual Explanations Without Opening the Black Box," Harvard JOLT, 2018
