EAAPL-MAG003 — Human-in-the-Loop Agent
Status: Proven
Tags: agent human-oversight eu-ai-act high-complexity
Version: 2.0.0
Last Updated: 2026-06-12
1. Pattern Identity
| Field | Value |
|---|---|
| Pattern ID | EAAPL-MAG003 |
| Name | Human-in-the-Loop Agent |
| Category | Multi-Agent |
| Maturity | Proven |
| Complexity | High |
| Related Patterns | EAAPL-MAG001 · EAAPL-MAG002 · EAAPL-MAG006 · EAAPL-INT007 |
2. Executive Summary
The Human-in-the-Loop (HITL) Agent pattern inserts mandatory human approval checkpoints into autonomous AI workflows before the agent executes irreversible actions or reaches consequential decisions. It is not a concession to lack of AI capability — it is a structural compliance and risk control. EU AI Act Article 14 requires that high-risk AI systems enable human operators to meaningfully intervene; this pattern operationalises that requirement as a first-class architectural component. The pattern governs: where in the workflow checkpoints are placed (before irreversible actions; after high-stakes reasoning completion); how checkpoints are presented to reviewers (plain-language summaries, not technical dumps); how the approval queue is managed (async, prioritised, SLA-bound); how timeouts are handled (pause then escalate then auto-cancel); and how every human decision is recorded for audit. Critically, the pattern distinguishes genuine oversight from compliance theatre — a checkpoint that shows reviewers an opaque JSON blob is not meaningful oversight regardless of whether it is technically present.
3. Problem Statement
3.1 Context
Autonomous AI agents acting without human checkpoints carry two distinct risks. First, operational risk: the agent may take an irreversible action (sending a customer email, committing a financial transaction, modifying a production database) based on reasoning that is subtly wrong. Second, regulatory risk: for high-risk AI systems under EU AI Act Annex III, autonomous action without human oversight is non-compliant regardless of the agent's actual performance.
3.2 Forces in Tension
- Autonomy vs. oversight. More checkpoints reduce risk but increase latency and human reviewer burden. With too many checkpoints, reviewers become rubber-stampers who approve without reading, which creates a false audit trail that is worse than no checkpoint.
- Reviewer cognitive load vs. audit completeness. Reviewers need enough context to make genuine decisions. But providing all context makes reviews slow and exhausting.
- Async throughput vs. reviewer responsiveness. An async approval queue handles burst load but requires reviewers to process queued items within SLA, which requires staffing and tooling.
- Auto-cancel vs. partial completion. When a reviewer does not respond within SLA, the safest action (auto-cancel) loses work already done. Partial completion raises consistency risks.
3.3 Failure Modes Without This Pattern
Without HITL checkpoints on irreversible actions, a single agent reasoning error propagates to an irreversible real-world consequence before any human has the opportunity to intervene. Without structured audit logging, there is no evidence that oversight occurred, creating regulatory exposure. Without timeout handling, a slow reviewer blocks the entire workflow indefinitely.
4. Solution
4.1 HITL Agent Workflow
4.2 Checkpoint Classification
5. Structure
5.1 Component Catalogue
| Component | Responsibility | Technology Options |
|---|---|---|
| Checkpoint Evaluator | Determines if an action requires human approval | Rule engine, policy-as-code |
| Summary Builder | Translates agent reasoning into plain-language reviewer summary | LLM with summarisation prompt |
| Approval Queue | Stores pending checkpoints, manages priority and SLA | Postgres + background worker, AWS SQS, Temporal workflow |
| Reviewer Interface | Presents checkpoints to humans with approve/reject/modify controls | Web app, Slack bot, email with signed links |
| Timeout Handler | Monitors SLA deadlines, triggers escalation or auto-cancel | Cron job, Temporal timer |
| Audit Logger | Immutable record of every checkpoint and human decision | Append-only Postgres table, AWS CloudTrail |
| Resumption Handler | Resumes agent execution after approval with reviewer context injected | Workflow continuation via stored state |
5.2 Checkpoint Summary Schema
{
  "checkpointId": "uuid-v4",
  "taskId": "uuid-v4",
  "agentId": "contract-review-agent-v2",
  "checkpointType": "PRE_ACTION",
  "actionType": "SEND_EMAIL",
  "timestamp": "ISO-8601",
  "prioritySLA": {
    "level": "HIGH",
    "reviewByMs": 14400000
  },
  "humanReadableSummary": {
    "whatTheAgentWantsToDo": "Send a contract rejection email to vendor@acme.com",
    "whyItWantsToDoThis": "Clause 4.2 contains unlimited liability exceeding our risk policy",
    "dataUsed": ["Contract PDF uploaded 2026-06-11", "Risk policy v3.2 (internal)"],
    "alternatives": ["Request clause modification", "Escalate to legal team"]
  },
  "rawAgentReasoning": "...",
  "proposedActionPayload": { "to": "vendor@acme.com", "subject": "...", "body": "..." },
  "reviewerOptions": ["APPROVE", "REJECT", "MODIFY", "ESCALATE"]
}
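The record above can be mirrored as a TypeScript type so that producers and the reviewer interface agree on its shape. The following is a minimal sketch: the names `CheckpointSummary` and `reviewDeadline` are illustrative, and it assumes that `reviewByMs` is a duration counted from `timestamp`, which the schema does not state explicitly.

```typescript
// Illustrative types mirroring the JSON example above. The type and function
// names are assumptions, not part of the pattern itself.
type Tier = "CRITICAL" | "HIGH" | "STANDARD" | "LOW";
type ReviewerOption = "APPROVE" | "REJECT" | "MODIFY" | "ESCALATE";

interface CheckpointSummary {
  checkpointId: string;
  taskId: string;
  agentId: string;
  checkpointType: string; // e.g. "PRE_ACTION"
  actionType: string;     // e.g. "SEND_EMAIL"
  timestamp: string;      // ISO-8601 creation time
  prioritySLA: { level: Tier; reviewByMs: number };
  humanReadableSummary: {
    whatTheAgentWantsToDo: string;
    whyItWantsToDoThis: string;
    dataUsed: string[];
    alternatives: string[];
  };
  rawAgentReasoning: string;
  proposedActionPayload: Record<string, unknown>;
  reviewerOptions: ReviewerOption[];
}

// Assumption: reviewByMs is a duration counted from `timestamp`.
function reviewDeadline(
  c: Pick<CheckpointSummary, "timestamp" | "prioritySLA">
): Date {
  const createdMs = Date.parse(c.timestamp);
  if (Number.isNaN(createdMs)) {
    throw new Error(`Invalid ISO-8601 timestamp: ${c.timestamp}`);
  }
  return new Date(createdMs + c.prioritySLA.reviewByMs);
}
```

Deriving the absolute deadline once, at creation time, keeps the timeout handler and the reviewer interface from each computing it differently.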
6. Behaviour
6.1 Checkpoint Placement Strategy
Checkpoints are placed at two types of junctures:
BEFORE irreversible actions. An action is irreversible if it cannot be undone programmatically without side effects or human effort. The canonical list:
- Sending any external communication (email, SMS, API call to a third-party system)
- Writing to a production database with no soft-delete mechanism
- Executing a financial transaction or payment authorisation
- Creating or modifying a legal document to be shared externally
- Deploying code or configuration changes to production
- Provisioning or deprovisioning cloud infrastructure
AFTER completing high-stakes reasoning. For decisions where the agent's reasoning output will be consumed by downstream systems or used as trusted input, a post-reasoning checkpoint allows a reviewer to validate quality before it is consumed. This is critical for risk scores, legal opinions, medical recommendations, and security assessments.
What does NOT need a checkpoint:
- Read-only operations (database queries, file reads, API GETs)
- Reversible internal state changes (setting a flag, updating a draft)
- Low-stakes formatting or summarisation tasks
- Actions on sandboxed test data
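The placement rules above can be expressed as a small classification function. This is a sketch under stated assumptions: the `ActionProfile` flags and the `classifyCheckpoint` name are hypothetical, and how an action acquires these flags (for example from a policy table) is left open.

```typescript
type CheckpointType = "PRE_ACTION" | "POST_REASONING" | "NONE";

// Hypothetical description of a proposed action; the flags map to the
// placement rules in this section.
interface ActionProfile {
  readOnly: boolean;            // queries, file reads, API GETs
  reversible: boolean;          // can be undone programmatically without side effects
  sandboxed: boolean;           // touches only sandboxed test data
  highStakesReasoning: boolean; // output is consumed downstream as trusted input
}

function classifyCheckpoint(p: ActionProfile): CheckpointType {
  // Actions on sandboxed test data never need review.
  if (p.sandboxed) return "NONE";
  // A state-changing action that cannot be undone is gated before it runs.
  if (!p.readOnly && !p.reversible) return "PRE_ACTION";
  // Risk scores, legal opinions and similar outputs are reviewed after reasoning.
  if (p.highStakesReasoning) return "POST_REASONING";
  // Read-only operations and reversible internal changes pass through.
  return "NONE";
}
```

For example, sending an external email (`readOnly: false`, `reversible: false`) classifies as `PRE_ACTION`, while a database query classifies as `NONE`.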
6.2 Checkpoint Presentation Quality
This is the most commonly misconfigured aspect of the HITL pattern. A poorly designed checkpoint presenting reviewers with a raw JSON blob or a multi-page reasoning dump is compliance theatre. Reviewers will approve without reading.
The checkpoint summary must answer five questions in plain language:
- What does the agent want to do? (One sentence, action verb)
- Why does it want to do this? (Business reason in non-technical terms)
- What data did it use to reach this conclusion? (Named sources, not IDs)
- What will happen if you approve? (Concrete consequence)
- What are the alternatives? (At least one option other than approve or reject)
The summary must be generated by an LLM summarisation step, not assembled programmatically from field names. Test it by asking a non-technical reviewer whether they understand it — if they do not, the prompt needs revision.
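A structural check can catch incomplete summaries before they reach a reviewer, although it cannot judge whether the language is genuinely plain; that still needs the human test described above. The sketch below is illustrative only: `summaryProblems` is a hypothetical helper, and `whatHappensIfApproved` is an assumed field name for the fourth question, which the example in section 5.2 does not include.

```typescript
// Field names follow section 5.2 where possible; `whatHappensIfApproved` is a
// hypothetical name for the fourth question.
interface ReviewerSummary {
  whatTheAgentWantsToDo: string;
  whyItWantsToDoThis: string;
  dataUsed: string[];
  whatHappensIfApproved: string;
  alternatives: string[];
}

const UUID_LIKE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Rough sentence count: splits on terminal punctuation followed by whitespace,
// so "vendor@acme.com" is not treated as a sentence boundary.
function countSentences(text: string): number {
  return text.split(/[.!?]+\s+/).filter(part => part.trim() !== "").length;
}

// Returns a list of structural problems; an empty list means the summary
// answers all five questions in the expected form.
function summaryProblems(s: ReviewerSummary): string[] {
  const problems: string[] = [];
  if (s.whatTheAgentWantsToDo.trim() === "") {
    problems.push("missing: what the agent wants to do");
  } else if (countSentences(s.whatTheAgentWantsToDo) > 1) {
    problems.push("the action should be stated in one sentence");
  }
  if (s.whyItWantsToDoThis.trim() === "") problems.push("missing: why");
  if (s.dataUsed.length === 0) problems.push("missing: data sources");
  if (s.dataUsed.some(d => UUID_LIKE.test(d.trim()))) {
    problems.push("data sources must be named, not referenced by ID");
  }
  if (s.whatHappensIfApproved.trim() === "") problems.push("missing: consequence of approval");
  if (s.alternatives.length === 0) problems.push("missing: at least one alternative");
  return problems;
}
```

A checkpoint whose summary returns any problems could be regenerated or held back rather than queued for review.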
6.3 Approval Queue Design
Async by default. Checkpoints must not block the reviewer's workflow. The agent suspends execution by persisting its state and exiting. When the reviewer acts, the agent resumes from saved state with the reviewer's decision injected into its context.
Priority tiers.
| Tier | Examples | Review SLA |
|---|---|---|
| CRITICAL | Financial transactions above threshold; production infrastructure changes | 30 minutes |
| HIGH | External communications; contract executions; risk decisions | 4 hours |
| STANDARD | Internal reports; draft documents | 24 hours |
| LOW | Informational summaries; non-consequential recommendations | 72 hours |
Queue implementation. Use a durable queue (not in-memory) so checkpoint items survive process restarts. Each item has: checkpointId, taskId, createdAt, reviewByMs, status, reviewerId, reviewedAt, reviewerComment.
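The tier table translates directly into a lookup of review windows plus an ordering for the reviewer's work list. The sketch below uses the SLA values from the table; the `QueueItem` shape and the ordering rule (tier first, then earliest deadline) are assumptions, and other orderings such as strict earliest-deadline-first are equally defensible.

```typescript
type Tier = "CRITICAL" | "HIGH" | "STANDARD" | "LOW";

const MINUTE_MS = 60_000;
const HOUR_MS = 60 * MINUTE_MS;

// Review SLAs taken from the priority tier table above.
const TIER_SLA_MS: Record<Tier, number> = {
  CRITICAL: 30 * MINUTE_MS,
  HIGH: 4 * HOUR_MS,
  STANDARD: 24 * HOUR_MS,
  LOW: 72 * HOUR_MS,
};

const TIER_RANK: Record<Tier, number> = { CRITICAL: 0, HIGH: 1, STANDARD: 2, LOW: 3 };

function tierToMs(tier: Tier): number {
  return TIER_SLA_MS[tier];
}

// Hypothetical minimal queue item; createdAt is epoch milliseconds.
interface QueueItem {
  checkpointId: string;
  tier: Tier;
  createdAt: number;
}

// Orders pending items by tier, then by earliest SLA deadline within a tier.
// Returns a new array and leaves the input untouched.
function sortForReview(items: readonly QueueItem[]): QueueItem[] {
  return [...items].sort((a, b) => {
    const byTier = TIER_RANK[a.tier] - TIER_RANK[b.tier];
    if (byTier !== 0) return byTier;
    return (a.createdAt + tierToMs(a.tier)) - (b.createdAt + tierToMs(b.tier));
  });
}
```

In a Postgres-backed queue the same ordering would normally live in the query's `ORDER BY` clause rather than in application code.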
6.4 Timeout Handling
When a reviewer does not act within SLA:
- Warning notification at 80% of SLA elapsed. Alert the reviewer via all configured channels (email, Slack, SMS).
- Escalation at 100% of SLA elapsed. Move the checkpoint to a senior reviewer queue. Notify the original reviewer it has been escalated.
- Auto-cancel at 150% of SLA elapsed (configurable). Cancel the pending action. Write an audit record with `AUTO_CANCELLED` status. Notify the task initiator.
The specific timeout policy must be configurable per checkpoint type. Financial transactions should never auto-approve on timeout. Low-stakes informational checkpoints may auto-approve after escalation.
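The three thresholds can be evaluated with a single pure function that the timeout handler calls for each pending checkpoint. This is a sketch: `timeoutStage` and the stage names other than `AUTO_CANCELLED` are illustrative, and the percentages are passed in so that they remain configurable per checkpoint type.

```typescript
type TimeoutStage = "ON_TIME" | "WARNING" | "ESCALATED" | "AUTO_CANCELLED";

interface TimeoutBehaviours {
  warningPct: number;
  escalatePct: number;
  autoCancelPct: number;
}

// Determines which timeout stage a pending checkpoint has reached.
// All times are epoch milliseconds; slaMs is the review window for the tier.
function timeoutStage(
  createdAt: number,
  slaMs: number,
  now: number,
  b: TimeoutBehaviours
): TimeoutStage {
  const elapsed = now - createdAt;
  // Compare cross-multiplied values to avoid floating-point percentage rounding.
  const reached = (pct: number): boolean => elapsed * 100 >= pct * slaMs;
  if (reached(b.autoCancelPct)) return "AUTO_CANCELLED";
  if (reached(b.escalatePct)) return "ESCALATED";
  if (reached(b.warningPct)) return "WARNING";
  return "ON_TIME";
}
```

With a 4-hour SLA and thresholds of 80/100/150, a checkpoint moves to `WARNING` at 3 hours 12 minutes, `ESCALATED` at 4 hours, and `AUTO_CANCELLED` at 6 hours. The handler should act only when the stage changes, so that a reviewer is not notified on every run.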
7. Implementation Guide
7.1 Step-by-Step
Step 1 — Define your checkpoint policy. Create a policy document and policy-as-code implementation that defines which action types require checkpoints, which SLA tier applies to each, and the timeout behaviour. This document is also your EU AI Act Article 14 evidence artefact.
Step 2 — Build the checkpoint evaluator. A pure function that takes a proposed action and returns: requiresCheckpoint: boolean, checkpointType, priorityTier. Implement as a lookup against the action type and a risk policy table.
Step 3 — Build the summary builder. LLM call with a prompt that takes the agent's reasoning, proposed action, and data sources, and returns the five plain-language answers. Test with non-technical reviewers before deploying to production.
Step 4 — Build the approval queue. Use a database table with a polling worker or an event-driven queue. Implement the priority sort and SLA deadline fields. The queue must be durable — survive process restarts without losing pending checkpoints.
Step 5 — Build the reviewer interface. Minimum viable: a web page (or Slack bot) showing the five plain-language summary fields, the approve/reject/modify/escalate buttons, and a free-text comment field. Every reviewer decision must be authenticated with the reviewer's identity.
Step 6 — Build the timeout handler. A cron job or Temporal workflow that runs every minute, queries for pending checkpoints that have crossed a warning, escalation, or auto-cancel threshold (the warning fires at 80% of the SLA, before the deadline itself), and triggers the appropriate response.
Step 7 — Build the resumption handler. When a reviewer approves or modifies, the handler retrieves the suspended agent state, injects the reviewer's decision as a message in the agent's context, and resumes execution.
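Step 7 can be sketched as a pure function that turns a suspended state and a reviewer decision into the state the agent resumes from. Everything here is an assumption for illustration: the `SuspendedState` and `ReviewerDecision` shapes, the `modifiedAction` field, and the choice to inject the decision as a `system` message all depend on the agent framework in use.

```typescript
interface AgentMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Hypothetical shape of the state persisted at the checkpoint.
interface SuspendedState {
  taskId: string;
  messages: AgentMessage[];
  pendingAction: Record<string, unknown>;
}

interface ReviewerDecision {
  action: "APPROVE" | "REJECT" | "MODIFY" | "ESCALATE";
  reviewerId: string;
  comment?: string;
  modifiedAction?: Record<string, unknown>; // expected when action is MODIFY
}

// Returns the state to resume from, or null when the decision does not
// resume execution (REJECT and ESCALATE are handled elsewhere).
function buildResumedState(
  state: SuspendedState,
  decision: ReviewerDecision
): SuspendedState | null {
  if (decision.action !== "APPROVE" && decision.action !== "MODIFY") return null;

  const action = decision.action === "MODIFY" ? decision.modifiedAction : state.pendingAction;
  if (action === undefined) {
    throw new Error("MODIFY decision is missing the modified action");
  }

  const note =
    `Human reviewer ${decision.reviewerId} chose ${decision.action}` +
    (decision.comment ? `: ${decision.comment}` : ".");

  // Copy rather than mutate, so the persisted checkpoint record stays intact.
  return {
    ...state,
    pendingAction: action,
    messages: [...state.messages, { role: "system", content: note }],
  };
}
```

Keeping this step pure makes it straightforward to unit test and to replay from the audit log.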
7.2 Code Skeleton (TypeScript)
import { randomUUID } from "node:crypto";

// Skeleton only. ProposedAction, SerializedAgentState, db, auditLog,
// buildHumanReadableSummary, tierToMs, notifyReviewers and waitForDecision
// are assumed to be supplied by the surrounding application.

interface CheckpointPolicy {
  actionType: string;
  requiresCheckpoint: boolean;
  tier: "CRITICAL" | "HIGH" | "STANDARD" | "LOW";
  // Percentages of the tier's review SLA at which each timeout stage fires.
  timeoutBehaviours: { warningPct: number; escalatePct: number; autoCancelPct: number };
}

const POLICY: CheckpointPolicy[] = [
  { actionType: "SEND_EMAIL", requiresCheckpoint: true, tier: "HIGH",
    timeoutBehaviours: { warningPct: 80, escalatePct: 100, autoCancelPct: 150 } },
  { actionType: "FINANCIAL_TRANSACTION", requiresCheckpoint: true, tier: "CRITICAL",
    timeoutBehaviours: { warningPct: 80, escalatePct: 100, autoCancelPct: 200 } },
  { actionType: "DB_READ", requiresCheckpoint: false, tier: "LOW",
    timeoutBehaviours: { warningPct: 100, escalatePct: 100, autoCancelPct: 100 } }
];

async function checkpointGate(
  action: ProposedAction,
  agentReasoning: string,
  agentState: SerializedAgentState
): Promise<"PROCEED" | "CANCELLED"> {
  const policy = POLICY.find(p => p.actionType === action.type);
  // Fail closed: an action type missing from the policy table must not run
  // unreviewed. Cancelling is this skeleton's assumption; routing unknown
  // types to a default checkpoint is an alternative.
  if (!policy) return "CANCELLED";
  if (!policy.requiresCheckpoint) return "PROCEED";

  const summary = await buildHumanReadableSummary(action, agentReasoning);
  const checkpointId = randomUUID();

  // Persist the frozen agent state before notifying reviewers, so the
  // checkpoint survives a process restart.
  await db.checkpoints.insert({
    id: checkpointId,
    taskId: agentState.taskId,
    status: "PENDING",
    tier: policy.tier,
    reviewBySLA: Date.now() + tierToMs(policy.tier),
    summary,
    proposedAction: action,
    frozenAgentState: agentState
  });
  await notifyReviewers(checkpointId, policy.tier, summary);

  // Execution suspends here. Resumption is event-driven via webhook from reviewer UI.
  const decision = await waitForDecision(checkpointId);

  await db.checkpoints.update(checkpointId, {
    status: decision.action,
    reviewerId: decision.reviewerId,
    reviewedAt: new Date(),
    reviewerComment: decision.comment
  });
  await auditLog.append({ checkpointId, decision, timestamp: new Date().toISOString() });

  // Only an explicit APPROVE proceeds. MODIFY and ESCALATE need their own
  // handling (see Step 7) and are treated as not approved in this skeleton.
  return decision.action === "APPROVE" ? "PROCEED" : "CANCELLED";
}
8. Observability
8.1 Key Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| Approval queue depth | Number of pending checkpoints | > 50 (staffing issue) |
| Average review time by tier | Mean time from created to reviewed | > 90% of SLA per tier |
| Auto-cancel rate | % of checkpoints that expire without review | > 5% |
| Escalation rate | % of checkpoints escalated to senior review | > 10% |
| Reviewer approval rate | % of checkpoints approved vs rejected/modified | < 70% (high rejection indicates agent quality issue) |
| Rubber-stamp rate | % of approvals with review time under 30s | > 20% (oversight theatre indicator) |
8.2 Rubber-Stamp Detection
The rubber-stamp rate metric is critical for genuine compliance. If reviewers are approving checkpoints in under 30 seconds consistently, they are not reading the summaries. Alert on this. Interventions: improve summary quality; reduce checkpoint frequency (too many checkpoints lead to reviewer fatigue); investigate whether checkpoints are too low-stakes to warrant human review.
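The rubber-stamp rate can be computed from review records as the share of approvals decided in under 30 seconds. This sketch assumes the reviewer interface records when a checkpoint was opened (`openedAt`, a hypothetical field); measuring from creation time instead would understate rubber-stamping, because queue wait time would be counted as reading time.

```typescript
interface ReviewRecord {
  decision: "APPROVE" | "REJECT" | "MODIFY" | "ESCALATE";
  openedAt: number;   // epoch ms when the reviewer opened the checkpoint (assumed field)
  reviewedAt: number; // epoch ms when the decision was submitted
}

// Fraction (0..1) of approvals whose dwell time is below the threshold.
// Returns 0 when there are no approvals, to avoid dividing by zero.
function rubberStampRate(records: readonly ReviewRecord[], thresholdMs = 30_000): number {
  const approvals = records.filter(r => r.decision === "APPROVE");
  if (approvals.length === 0) return 0;
  const fast = approvals.filter(r => r.reviewedAt - r.openedAt < thresholdMs).length;
  return fast / approvals.length;
}

// Alert threshold from the metrics table: more than 20% of approvals.
function shouldAlertOnRubberStamping(records: readonly ReviewRecord[]): boolean {
  return rubberStampRate(records) > 0.2;
}
```

Computing the rate per reviewer as well as per tier helps separate a summary-quality problem from an individual workload problem.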
9. Cost Governance
- Summary builder LLM cost. Use a cheaper model tier for summary generation. The summary LLM call should not exceed 15% of the cost of the agent call that triggered the checkpoint.
- Checkpoint volume control. If checkpoint volume is overwhelming reviewers, the solution is not to remove checkpoints — it is to invest in reviewer tooling and staffing. Never remove checkpoints from high-risk action types for cost reasons.
- State storage cost. Suspended agent states can be large. Store only the minimum state needed for resumption. Implement a TTL on checkpoint records; auto-cancel and purge state after the maximum expected review window.
10. Security Considerations
10.1 Reviewer Identity Authentication
Every reviewer decision must be authenticated. An unauthenticated approval endpoint is a critical security vulnerability — an attacker who can send an HTTP request to the endpoint can bypass human oversight entirely. Require: authenticated session (SSO), MFA for CRITICAL tier checkpoints, reviewer identity recorded in audit log with session token fingerprint.
10.2 Approval Link Integrity
If checkpoints are presented via email links, sign the links with a time-limited HMAC token. Links must expire after the review SLA. Replayed or shared links must be rejected after first use.
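A signed, expiring, single-use token can be built with Node's built-in `crypto` module. The sketch below is illustrative: the token layout (`checkpointId.expiresAt.mac`), the function names, and the in-memory `usedTokens` set are assumptions, and a production system would track used tokens in a durable store shared by all instances.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Produces "checkpointId.expiresAt.mac". Assumes checkpointId contains no dots
// (true for UUIDs). expiresAt is epoch milliseconds.
function signReviewToken(checkpointId: string, expiresAt: number, secret: string): string {
  const payload = `${checkpointId}.${expiresAt}`;
  const mac = createHmac("sha256", secret).update(payload).digest("hex");
  return `${payload}.${mac}`;
}

// In-memory for illustration only; use a durable store in practice.
const usedTokens = new Set<string>();

// Returns the checkpointId when the token is authentic, unexpired and unused;
// otherwise returns null.
function verifyReviewToken(token: string, secret: string, now: number): string | null {
  const parts = token.split(".");
  if (parts.length !== 3) return null;
  const [checkpointId, expiresRaw, mac] = parts;

  const expected = createHmac("sha256", secret)
    .update(`${checkpointId}.${expiresRaw}`)
    .digest("hex");
  const given = Buffer.from(mac, "hex");
  const wanted = Buffer.from(expected, "hex");
  // timingSafeEqual throws on length mismatch, so check lengths first.
  if (given.length !== wanted.length || !timingSafeEqual(given, wanted)) return null;

  const expiresAt = Number(expiresRaw);
  if (!Number.isFinite(expiresAt) || now > expiresAt) return null;

  if (usedTokens.has(token)) return null; // replayed or shared link
  usedTokens.add(token);
  return checkpointId;
}
```

Setting `expiresAt` to the checkpoint's review deadline ties link validity to the SLA, and the link should still lead to an authenticated session rather than approving on click.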
10.3 Four-Eyes Principle
For very high-stakes checkpoints, require two independent reviewers to both approve before the action executes. Implement as a configurable requiredApprovals: 2 field on the checkpoint type. Neither reviewer should be able to see the other's decision before submitting their own.
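The quorum and blind-review rules can be sketched as two small functions. The names are hypothetical, and the rule that a single rejection ends the review is an assumption of this sketch; the pattern itself only requires that both reviewers approve before the action executes.

```typescript
interface ReviewVote {
  reviewerId: string;
  decision: "APPROVE" | "REJECT";
}

type QuorumOutcome = "PENDING" | "APPROVED" | "REJECTED";

// Counts approvals from distinct reviewers, so one person approving twice
// never satisfies the quorum.
function fourEyesOutcome(votes: readonly ReviewVote[], requiredApprovals = 2): QuorumOutcome {
  // Assumption: any rejection ends the review.
  if (votes.some(v => v.decision === "REJECT")) return "REJECTED";
  const approvers = new Set(
    votes.filter(v => v.decision === "APPROVE").map(v => v.reviewerId)
  );
  return approvers.size >= requiredApprovals ? "APPROVED" : "PENDING";
}

// Blind review: a reviewer sees other votes only after submitting their own.
function visibleVotes(votes: readonly ReviewVote[], viewerId: string): ReviewVote[] {
  const viewerHasVoted = votes.some(v => v.reviewerId === viewerId);
  return viewerHasVoted ? [...votes] : [];
}
```

Enforcing the visibility rule on the server, rather than hiding votes in the interface, prevents a second reviewer from reading the first decision through the API.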
11. Failure Modes and Mitigations
| Failure Mode | Detection | Mitigation |
|---|---|---|
| Agent state lost on process restart | Agent cannot resume after approval | Persist full serialised state to durable store before creating checkpoint |
| Reviewer approves without reading | Rubber-stamp rate above 20% | Improve summary quality; add a required free-text acknowledgement field |
| SLA timeout auto-cancels legitimate work | Auto-cancel rate spikes | Increase SLA for affected tier; add proactive reviewer reminders |
| Adversarial input manipulates summary | Reviewer approves harmful action based on misleading summary | Run summary through safety classifier; include raw action payload for reviewer to inspect |
| Human review queue becomes bottleneck | Queue depth above alert threshold | Add reviewers; introduce tiered routing (junior for LOW/STANDARD, senior for HIGH/CRITICAL) |
| Audit log tampered | Hash chain integrity check fails | Use append-only log with cryptographic anchoring |
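The hash chain integrity check named in the last row can be illustrated with SHA-256 from Node's built-in `crypto` module. This is a minimal sketch under stated assumptions: the entry shape and function names are hypothetical, the payload is treated as an already-serialised string, and external anchoring of the latest hash is left out.

```typescript
import { createHash } from "node:crypto";

interface AuditEntry {
  payload: string;  // serialised checkpoint decision
  prevHash: string; // hash of the previous entry, or GENESIS for the first
  hash: string;     // SHA-256 over prevHash followed by payload
}

const GENESIS = "0".repeat(64);

function hashEntry(prevHash: string, payload: string): string {
  return createHash("sha256").update(prevHash).update(payload).digest("hex");
}

// Returns a new log with the entry appended; existing entries are never modified.
function appendEntry(log: readonly AuditEntry[], payload: string): AuditEntry[] {
  const prevHash = log[log.length - 1]?.hash ?? GENESIS;
  return [...log, { payload, prevHash, hash: hashEntry(prevHash, payload) }];
}

// Returns the index of the first entry that breaks the chain, or -1 if intact.
function firstTamperedIndex(log: readonly AuditEntry[]): number {
  let prev = GENESIS;
  let index = 0;
  for (const entry of log) {
    if (entry.prevHash !== prev || entry.hash !== hashEntry(prev, entry.payload)) {
      return index;
    }
    prev = entry.hash;
    index++;
  }
  return -1;
}
```

A chain alone only detects edits by someone who cannot rewrite every later hash; publishing the latest hash to an independent system is what makes a full rewrite detectable.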
12. Compliance and Governance
12.1 EU AI Act Article 14 — Human Oversight
Article 14 requires that high-risk AI systems be designed to allow human operators to understand capabilities and limitations, monitor operation, and intervene or interrupt. The HITL Agent pattern provides the architectural mechanism. For compliance evidence, each checkpoint record demonstrates: what the AI proposed, what information was presented to the human, what the human decided, and when.
Important: Article 14 requires that oversight is meaningful, not merely procedural. Regulatory examiners will look at reviewer dwell time, rejection rates, and modification rates. A system where 100% of checkpoints are approved instantly will attract scrutiny.
12.2 GDPR Data Minimisation
Checkpoint records may contain personal data from the agent's context. Apply data minimisation: include only the data elements necessary for the reviewer to make an informed decision. Apply the same retention period to checkpoint records as to the underlying data they reference.
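One way to apply data minimisation is an explicit allow-list of fields copied from the agent's context into the checkpoint record, so that anything not listed is excluded by default. The helper below is a hypothetical sketch; which fields a reviewer actually needs is a policy decision per checkpoint type.

```typescript
// Copies only the allow-listed keys from the agent context into a new object.
// Keys absent from the context are skipped rather than set to undefined.
function minimiseForReview<T extends Record<string, unknown>>(
  context: T,
  allowedKeys: readonly (keyof T)[]
): Partial<T> {
  const minimal: Partial<T> = {};
  for (const key of allowedKeys) {
    if (key in context) {
      minimal[key] = context[key];
    }
  }
  return minimal;
}
```

An allow-list fails safe: a new personal-data field added to the agent context stays out of checkpoint records until someone deliberately lists it.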
12.3 Financial Services — MiFID II / SR 11-7
For AI systems making or informing investment decisions, every checkpoint approval is a supervisory control event under MiFID II. The audit log must identify the supervisor, their qualification, and their rationale. Approvals with no comment field entry should be flagged as inadequate supervision records.
13. Testing Strategy
13.1 Unit Tests
- Checkpoint evaluator: for each action type in the policy table, assert correct `requiresCheckpoint` and `tier` values.
- Timeout handler: given a checkpoint with `reviewBySLA` in the past, assert the correct escalation or auto-cancel action is triggered.
- Summary builder: given a known agent reasoning and action, assert the five summary fields are non-empty and in plain language (using a stub LLM).
13.2 Integration Tests
- Full checkpoint flow: agent reaches a checkpoint; record is created in the queue; mock reviewer approves; agent resumes and completes action.
- Rejection flow: mock reviewer rejects; assert the proposed action is not executed and a rejection audit record is written.
- Timeout flow: create a checkpoint with a past SLA deadline; assert the timeout handler escalates it; assert the original reviewer receives a notification.
- Four-eyes flow: assert that a single approval is not sufficient to proceed; assert execution resumes only after second independent approval.
13.3 End-to-End Playwright Tests
- Navigate to the reviewer interface; assert the checkpoint summary renders with all five plain-language fields visible.
- Click Approve; assert the agent resumes and the downstream action (e.g., email stub receives the message) executes.
- Click Reject; assert the agent halts and a cancellation record appears in the audit log.
- Wait for a timeout (accelerated clock in test environment); assert the auto-cancel fires and the audit log records `AUTO_CANCELLED`.
14. Variants and Extensions
14.1 Streaming Checkpoint (Real-Time Interruption)
For agents with streaming output, implement a checkpoint that can interrupt a running agent mid-generation. The reviewer sees the partial output and can halt it before completion. Requires streaming interrupt support in the underlying agent framework.
14.2 AI-Assisted Review
For high-volume checkpoints, deploy a pre-reviewer AI that analyses the checkpoint and provides a recommended decision with a confidence score to the human reviewer. The human makes the final call. The pre-reviewer's recommendation must be flagged clearly as AI-generated and must not be pre-populated as the default selection.
14.3 Asynchronous Batch Review
For LOW-tier checkpoints, batch multiple pending items into a single review session presented as a list. The reviewer can bulk-approve low-risk items and drill into any that need closer inspection. Reduces per-checkpoint overhead for high-volume workflows.
15. Trade-off Analysis
| Dimension | HITL Checkpoint | No Checkpoint | Automated Rule Check |
|---|---|---|---|
| Risk control | Highest | None | Moderate |
| Latency | Highest (human in path) | Lowest | Low |
| Regulatory compliance | Full (Article 14 evidence) | Non-compliant for high-risk | Partial |
| Reviewer burden | High | None | None |
| Correctness for edge cases | Highest (human judgment) | Agent-only | Limited by rule coverage |
16. Known Implementations
| Organisation Type | Use Case | Checkpoint Types | Reported Outcome |
|---|---|---|---|
| Global bank | AI-generated customer communications | PRE_SEND_EMAIL | Zero compliance violations over 24-month period; 4.2% rejection rate |
| Insurance carrier | AI risk assessment routing | POST_REASONING risk score | 12% of AI risk scores modified by reviewers; downstream accuracy improved 8% |
| Healthcare system | Clinical recommendation agent | PRE_ACTION medication order | Regulatory audit passed; zero patient safety incidents attributed to AI |
| Law firm | Contract negotiation AI | PRE_ACTION counter-offer send | Average review time 8 minutes; client satisfaction 94% |
17. Related Patterns
| Pattern ID | Name | Relationship |
|---|---|---|
| EAAPL-MAG001 | Multi-Agent Orchestration | HITL checkpoints inserted at orchestration decision points |
| EAAPL-MAG002 | Supervisor Agent | Supervisor escalates HUMAN_REVIEW outcomes to this pattern |
| EAAPL-MAG006 | Agent Handoff Protocol | Checkpoint records use handoff schema for state serialisation |
| EAAPL-INT007 | AI Circuit Breaker | Complements HITL by handling technical failures; HITL handles risk decisions |
18. References
- EU AI Act (Regulation 2024/1689), Article 14: Human Oversight of High-Risk AI Systems
- EU AI Act Annex III: High-Risk AI System Classifications
- NIST AI RMF 1.0, Govern 4.2: Human Oversight Mechanisms
- Gartner, "Designing AI Systems for Regulatory Compliance," 2025 (ID: G00819234)
- Anthropic, "Responsible Scaling Policy," 2024 — anthropic.com/responsible-scaling-policy
- ISO/IEC 42001:2023, Section 6.1.2: AI Risk Treatment — Human Oversight Controls
- MiFID II (Directive 2014/65/EU), Article 17: Algorithmic Trading Controls
- SR 11-7: Guidance on Model Risk Management — Section IV: Ongoing Monitoring
- Shneiderman, B., "Human-Centered AI," Oxford University Press, 2022
- Wachter, S. et al., "Counterfactual Explanations Without Opening the Black Box," Harvard JOLT, 2018