EAAPL-MAG002 — Supervisor Agent
Status: Proven
Tags: agent orchestration human-oversight high-complexity
Version: 2.0.0
Last Updated: 2026-06-12
1. Pattern Identity
| Field | Value |
|---|---|
| Pattern ID | EAAPL-MAG002 |
| Name | Supervisor Agent |
| Category | Multi-Agent |
| Maturity | Proven |
| Complexity | High |
| Related Patterns | EAAPL-MAG001 · EAAPL-MAG003 · EAAPL-MAG005 · EAAPL-MAG006 · EAAPL-INT007 |
2. Executive Summary
The Supervisor Agent pattern establishes a hierarchical two-tier architecture in which a supervisor agent manages a pool of worker agents, each responsible for executing a bounded, well-defined subtask. Unlike the general orchestration pattern (EAAPL-MAG001), the supervisor is responsible not only for decomposition and dispatch but for continuous active oversight: monitoring worker progress against SLAs, validating each worker's output before incorporating it into the next reasoning step, recovering from worker failures through reassignment or escalation, and controlling the total cost of the worker pool. The critical distinguishing characteristic is the supervisor's validation gate: every worker output passes through a quality review before being used. This prevents hallucination compounding — the phenomenon where an incorrect intermediate output, when passed unchecked to the next worker, cascades into a deeply wrong final result. Enterprises deploy this pattern in high-stakes workflows (legal, financial, clinical, security) where intermediate errors carry regulatory or financial consequences.
3. Problem Statement
3.1 Context
In complex AI workflows, intermediate agent outputs are frequently passed forward without review. This is acceptable for low-stakes tasks where end-to-end correctness is verifiable cheaply. For high-stakes domains, a single fabricated clause in a contract analysis, a single incorrect risk score in a credit workflow, or a single misidentified vulnerability in a security audit can invalidate the entire downstream result — and may not be detected until human review occurs hours or days later.
3.2 Forces in Tension
- Throughput vs. correctness. Validating every worker output adds latency and cost. Skipping validation allows errors to compound.
- Autonomy vs. control. Workers that act independently are efficient but unsupervised. Workers that constantly check back with the supervisor are safe but slow.
- Static pool vs. dynamic spawn. A static worker pool has predictable cost and cold-start time but cannot scale to burst demand. Dynamic spawning scales elastically but introduces provisioning latency.
- Cost containment vs. completeness. More workers running in parallel completes the task faster but multiplies token spend.
3.3 Failure Modes Without This Pattern
Without a validation gate, hallucinated worker outputs become ground truth for subsequent reasoning steps, producing confidently wrong final outputs. Without SLA monitoring, a single slow worker holds up synthesis indefinitely. Without a capability registry, task assignments default to the cheapest or most-available worker rather than the most-appropriate one.
4. Solution
4.1 Supervisor-Worker Architecture
4.2 Worker Selection Flow
5. Structure
5.1 Component Catalogue
| Component | Responsibility | Technology Options |
|---|---|---|
| Supervisor Agent | Plan, assign, monitor, validate, synthesise | LLM with supervisor prompt, LangGraph StateGraph |
| Capability Registry | Worker discovery, capability matching, cost/latency metadata | In-memory map, Redis, service registry |
| Worker Pool | Bounded-capability task execution | LLM + tools, specialised models, RPA bots |
| Validation Gate | Quality review of each worker output before use | LLM-as-judge, JSON schema validator, business rule engine |
| SLA Monitor | Track per-worker deadlines, trigger reassignment on breach | Cron + deadline field in task message |
| Result Synthesiser | Combine validated outputs into final response | LLM synthesis prompt |
| Cost Controller | Track and enforce token budget across the worker pool | Middleware that intercepts LLM calls |
5.2 Capability Registry Schema
{
  "workerId": "legal-clause-extractor-v2",
  "capabilities": ["contract-analysis", "clause-extraction", "risk-identification"],
  "acceptedInputSchema": "schemas/contract-input.json",
  "outputSchema": "schemas/clause-extraction-output.json",
  "model": "claude-3-7-sonnet",
  "averageLatencyMs": 4200,
  "costPerCallEstimateUSD": 0.024,
  "maxConcurrentTasks": 5,
  "healthEndpoint": "https://workers/legal-clause-extractor/health"
}
6. Behaviour
6.1 Supervisor Responsibilities
Receive and decompose task. The supervisor receives a task specification and breaks it into subtasks using its planning prompt. Each subtask maps to one or more capability tags in the registry.
Select and assign workers. For each subtask, the supervisor queries the capability registry for workers whose capability tags match. Selection criteria: capability match (required), current load (prefer idle), cost estimate (prefer cheaper if quality is equivalent), SLA compatibility (reject if worker's average latency would breach the subtask deadline).
Monitor progress. The supervisor maintains a live state map: { subtaskId → { workerId, assignedAt, deadlineMs, status } }. A background monitor checks for deadline breaches every N seconds. On breach, the supervisor either reassigns (if another capable worker is available) or marks the subtask as timed-out.
Validate worker outputs. This is the critical differentiator. For each worker output, the supervisor invokes a validation check before incorporating the result. Validation is multi-layered: schema validation (structured output format); factual consistency check (does the output contradict known facts from the input?); completeness check (did the worker address all required aspects of the subtask?); confidence scoring (if the worker returns a confidence score below threshold, flag for additional review).
Recover from failures. See Section 6.3.
Synthesise final result. Once all validated subtask results are collected (or gracefully degraded), the supervisor invokes a synthesis step that combines them into a coherent final output, explicitly noting any subtask gaps.
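The selection criteria listed under "Select and assign workers" can be sketched as a filter followed by a ranking. This is a minimal sketch rather than a reference implementation: `RegistryEntry` mirrors a subset of the Section 5.2 schema, while the `activeTasks` field, the `selectWorker` name, and the tie-breaking order (load first, then cost) are assumptions made for illustration.

```typescript
// Subset of the Section 5.2 registry entry; activeTasks is an assumed runtime field.
interface RegistryEntry {
  workerId: string;
  capabilities: string[];
  averageLatencyMs: number;
  costPerCallEstimateUSD: number;
  maxConcurrentTasks: number;
  activeTasks: number;
}

// Returns the preferred worker for a capability, or undefined when none qualifies
// (the NO_CAPABLE_WORKER case).
function selectWorker(
  registry: RegistryEntry[],
  capability: string,
  timeBudgetMs: number,
): RegistryEntry | undefined {
  const eligible = registry.filter(
    (w) =>
      w.capabilities.includes(capability) && // capability match (required)
      w.activeTasks < w.maxConcurrentTasks && // spare capacity
      w.averageLatencyMs <= timeBudgetMs, // SLA compatibility
  );
  // Prefer the least-loaded worker; break ties on estimated cost.
  eligible.sort(
    (a, b) =>
      a.activeTasks / a.maxConcurrentTasks - b.activeTasks / b.maxConcurrentTasks ||
      a.costPerCallEstimateUSD - b.costPerCallEstimateUSD,
  );
  return eligible[0];
}
```

The sketch ranks on cost alone once load is equal; the "if quality is equivalent" condition from the text would need a quality signal that the registry schema does not currently carry.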
6.2 Quality Review — Preventing Hallucination Compounding
The validation gate is not optional. The most common failure mode in multi-agent systems is a hallucinated intermediate output being treated as ground truth by the next agent in the chain. The supervisor's validation gate breaks this chain.
Validation prompt structure:
You are a quality reviewer. Given the following subtask specification and worker output,
determine:
1. SCHEMA: Does the output conform to the expected JSON schema? [PASS/FAIL]
2. COMPLETENESS: Did the worker address all required aspects of the subtask? [PASS/FAIL + details]
3. CONSISTENCY: Does the output contradict any facts in the original input? [PASS/FAIL + details]
4. CONFIDENCE: Rate your confidence in the output quality from 1-10.
Return JSON: { "schemaPass": bool, "completenessPass": bool, "consistencyPass": bool,
"confidence": int, "issues": [], "recommendation": "ACCEPT|REJECT|HUMAN_REVIEW" }
A REJECT causes a retry with a correction prompt appended to the worker's context. A HUMAN_REVIEW recommendation escalates to the human review queue. An ACCEPT with confidence below 7 is flagged in the final output as low-confidence.
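The routing rules above can be sketched as a small parse-then-route step. The `Verdict` fields follow the JSON shape in the validation prompt; `parseVerdict`, `routeVerdict`, the `Action` names, and the choice to escalate when the reviewer's own reply is malformed are assumptions for illustration. The single retry follows Section 6.3.

```typescript
type Recommendation = "ACCEPT" | "REJECT" | "HUMAN_REVIEW";

interface Verdict {
  schemaPass: boolean;
  completenessPass: boolean;
  consistencyPass: boolean;
  confidence: number;
  issues: string[];
  recommendation: Recommendation;
}

type Action = "ACCEPT" | "ACCEPT_LOW_CONFIDENCE" | "RETRY" | "ESCALATE";

const LOW_CONFIDENCE_THRESHOLD = 7; // ACCEPT below 7 is flagged as low-confidence
const MAX_RETRIES = 1; // assumed from Section 6.3: retry once, then escalate

// Parse the reviewer's JSON reply; returns undefined when it is not well-formed.
function parseVerdict(raw: string): Verdict | undefined {
  let v: any; // untyped until the shape checks below pass
  try {
    v = JSON.parse(raw);
  } catch {
    return undefined;
  }
  const ok =
    v !== null &&
    typeof v === "object" &&
    typeof v.schemaPass === "boolean" &&
    typeof v.completenessPass === "boolean" &&
    typeof v.consistencyPass === "boolean" &&
    typeof v.confidence === "number" &&
    Array.isArray(v.issues) &&
    ["ACCEPT", "REJECT", "HUMAN_REVIEW"].includes(v.recommendation);
  return ok ? (v as Verdict) : undefined;
}

// Map a verdict to the supervisor's next action.
function routeVerdict(v: Verdict | undefined, retriesUsed: number): Action {
  // Assumption: an unreadable reviewer reply is escalated rather than trusted.
  if (v === undefined || v.recommendation === "HUMAN_REVIEW") return "ESCALATE";
  if (v.recommendation === "REJECT") {
    return retriesUsed < MAX_RETRIES ? "RETRY" : "ESCALATE";
  }
  return v.confidence < LOW_CONFIDENCE_THRESHOLD ? "ACCEPT_LOW_CONFIDENCE" : "ACCEPT";
}
```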
6.3 Worker Failure Recovery
| Failure Type | Recovery Action |
|---|---|
| Worker timeout (deadline breach) | Reassign to next-best capable worker. If none available, mark subtask as degraded. |
| Worker returns invalid schema | Retry once with schema correction in prompt. On second failure, reject and attempt reassignment. |
| Worker quality validation fails | Retry with correction prompt citing specific validation failures. On second failure, escalate to HUMAN_REVIEW. |
| Worker crashes / goes unhealthy | Circuit breaker (EAAPL-INT007) opens. Reassign to healthy worker. Alert on-call. |
| No capable worker available | Return NO_CAPABLE_WORKER error for that subtask. Supervisor flags in final output. |
| Budget exceeded before completion | Complete highest-priority subtasks. Mark remaining as BUDGET_EXCEEDED. Return partial with explicit warning. |
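
The recovery table can be encoded as a pure decision function, which keeps the policy testable apart from the side effects (alerts, circuit-breaker state). A minimal sketch follows; the `Failure` and `Recovery` names are hypothetical, and falling back to a degraded subtask when no alternative worker exists is an assumption for the rows where the table does not say what happens.

```typescript
type Failure =
  | { kind: "TIMEOUT" }
  | { kind: "INVALID_SCHEMA"; failures: number } // failures so far, including this one
  | { kind: "VALIDATION_FAILED"; failures: number }
  | { kind: "WORKER_UNHEALTHY" }
  | { kind: "NO_CAPABLE_WORKER" }
  | { kind: "BUDGET_EXCEEDED" };

type Recovery =
  | "REASSIGN"
  | "RETRY_WITH_CORRECTION"
  | "ESCALATE_TO_HUMAN"
  | "MARK_DEGRADED"
  | "FLAG_IN_OUTPUT"
  | "RETURN_PARTIAL";

function chooseRecovery(failure: Failure, alternativeWorkerAvailable: boolean): Recovery {
  const reassignOrDegrade: Recovery = alternativeWorkerAvailable ? "REASSIGN" : "MARK_DEGRADED";
  switch (failure.kind) {
    case "TIMEOUT":
    case "WORKER_UNHEALTHY": // circuit breaker and on-call alert are handled elsewhere
      return reassignOrDegrade;
    case "INVALID_SCHEMA":
      return failure.failures < 2 ? "RETRY_WITH_CORRECTION" : reassignOrDegrade;
    case "VALIDATION_FAILED":
      return failure.failures < 2 ? "RETRY_WITH_CORRECTION" : "ESCALATE_TO_HUMAN";
    case "NO_CAPABLE_WORKER":
      return "FLAG_IN_OUTPUT";
    case "BUDGET_EXCEEDED":
      return "RETURN_PARTIAL";
    default: {
      // Compile-time exhaustiveness check: adding a new failure kind breaks the build here.
      const unreachable: never = failure;
      return unreachable;
    }
  }
}
```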
6.4 Dynamic Pool Management
Auto-scale. When all static pool workers for a capability type are at capacity and a new subtask arrives, the supervisor requests a dynamic worker spawn if the remaining task budget allows. Dynamic workers are terminated after completing their assigned subtask (or after an idle timeout of 60s) to control cost.
Scale-in. The supervisor tracks utilisation across the pool. Workers with zero tasks for > 5 minutes (configurable) are released to reduce standing cost.
Cost ceiling enforcement. The supervisor tracks cumulative token spend across all workers. When cumulative spend reaches 80% of the task budget, the supervisor switches to single-model-tier workers for remaining subtasks.
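One way to sketch cost ceiling enforcement is a small controller that tracks cumulative spend and reports a budget mode. The `CostController` class, its method names, and the three mode labels are hypothetical; only the 80% threshold comes from the text above.

```typescript
type BudgetMode = "NORMAL" | "RESTRICTED" | "EXHAUSTED";

class CostController {
  private spentUSD = 0;

  constructor(
    private readonly budgetUSD: number,
    private readonly restrictAtRatio = 0.8, // 80% threshold from the text
  ) {}

  // Record the cost of a completed worker or validation call.
  record(costUSD: number): void {
    this.spentUSD += costUSD;
  }

  remainingUSD(): number {
    return Math.max(0, this.budgetUSD - this.spentUSD);
  }

  // NORMAL below the threshold, RESTRICTED from the threshold up to the budget,
  // EXHAUSTED once the full budget has been spent.
  mode(): BudgetMode {
    if (this.spentUSD >= this.budgetUSD) return "EXHAUSTED";
    if (this.spentUSD >= this.budgetUSD * this.restrictAtRatio) return "RESTRICTED";
    return "NORMAL";
  }
}
```

In `RESTRICTED` mode the supervisor would limit remaining subtasks to one model tier; which tier that is remains a deployment decision that this sketch leaves open.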
7. Implementation Guide
7.1 Step-by-Step
Step 1 — Build the Capability Registry. Start with a static JSON file or database table. Each entry describes a worker type, its capabilities, schemas, and cost/latency profile. The supervisor reads this at startup and caches it.
Step 2 — Implement the Supervisor Planning Prompt. The supervisor system prompt must include the capability registry as context, instructing it to return a structured plan mapping each subtask to a specific worker type. Include the instruction: "Always prefer lower-cost workers when quality requirements are equivalent."
Step 3: Implement the Validation Gate. Build the validation prompt as a separate LLM call after each worker response. Use a cheaper, faster model for validation (e.g., Claude Haiku, GPT-4o-mini) so that validation overhead stays below roughly 15% of the worker call cost, the target set in Section 9.
Step 4 — Implement the SLA Monitor. Use a lightweight polling loop or event-driven timer that checks all in-flight subtask deadlines every 5 seconds. On deadline breach, the monitor calls supervisor.reassign(subtaskId).
Step 5 — Implement Result Synthesis. The synthesis step should be a single LLM call with all validated subtask results and an explicit instruction to note any missing or low-confidence domains.
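The SLA monitor from Step 4 can be sketched as a pure breach check plus a thin polling wrapper. `InFlight` is a trimmed version of the assignment record from Section 6.1; `findBreaches` and `startSlaMonitor` are hypothetical names, and the `reassign` callback stands in for `supervisor.reassign(subtaskId)`.

```typescript
interface InFlight {
  subtaskId: string;
  workerId: string;
  deadlineMs: number; // absolute deadline (epoch ms)
  status: "running" | "completed" | "failed" | "timeout";
}

// Marks every running assignment whose deadline has passed as timed out
// and returns the affected subtask IDs.
function findBreaches(assignments: Map<string, InFlight>, nowMs: number): string[] {
  const breached: string[] = [];
  assignments.forEach((a) => {
    if (a.status === "running" && nowMs > a.deadlineMs) {
      a.status = "timeout"; // prevents the same breach firing on the next poll
      breached.push(a.subtaskId);
    }
  });
  return breached;
}

// Polls every intervalMs (5 s by default, as in Step 4).
// Returns a function that stops the monitor.
function startSlaMonitor(
  assignments: Map<string, InFlight>,
  reassign: (subtaskId: string) => void,
  intervalMs = 5000,
): () => void {
  const timer = setInterval(() => {
    findBreaches(assignments, Date.now()).forEach(reassign);
  }, intervalMs);
  return () => clearInterval(timer);
}
```

Keeping the breach check separate from the timer lets the unit test in Section 13.1 pass a fixed clock value instead of waiting on real time.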
7.2 Code Skeleton (TypeScript)
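// Supporting types are assumed shapes; align them with the registry (Section 5.2)
// and the handoff schema (EAAPL-MAG006).
interface SubTask { id: string; requiredCapability: string; }
interface Worker { id: string; averageLatencyMs: number; costPerCallEstimateUSD: number; }
interface ValidationResult {
  recommendation: "ACCEPT" | "REJECT" | "HUMAN_REVIEW";
  confidence: number;
  issues: string[];
}
interface ValidationLLM {
  invoke(input: { subtask: SubTask; output: unknown }): Promise<ValidationResult>;
}
interface SupervisorResult {
  outputs: Record<string, unknown>; // validated outputs keyed by subtaskId
  gaps: Record<string, string>;     // subtaskId -> reason no result is available
}

interface WorkerAssignment {
  subtaskId: string;
  workerId: string;
  assignedAt: number;
  deadlineMs: number; // absolute deadline (epoch ms)
  status: "running" | "completed" | "failed" | "timeout";
}

abstract class SupervisorAgent {
  private assignments: Map<string, WorkerAssignment> = new Map();
  private validated: Record<string, unknown> = {};
  private gaps: Record<string, string> = {};
  private costSpent = 0;

  constructor(private readonly validationLLM: ValidationLLM) {}

  // Integration points supplied by the concrete supervisor.
  protected abstract plan(task: string): Promise<{ subtasks: SubTask[] }>;
  protected abstract selectWorkers(subtasks: SubTask[]): Promise<Map<string, Worker>>;
  protected abstract executeWithTimeout(worker: Worker, subtask: SubTask, deadlineMs: number): Promise<unknown>;
  protected abstract retryWithCorrection(worker: Worker, subtask: SubTask, previous: unknown, issues: string[]): Promise<unknown>;
  protected abstract escalateToHuman(subtask: SubTask, output: unknown, validation: ValidationResult): Promise<void>;
  protected abstract synthesise(outputs: Record<string, unknown>, gaps: Record<string, string>): Promise<SupervisorResult>;

  async supervise(task: string, budgetUSD: number): Promise<SupervisorResult> {
    const plan = await this.plan(task);
    const workerPool = await this.selectWorkers(plan.subtasks);

    for (const subtask of plan.subtasks) {
      const worker = workerPool.get(subtask.requiredCapability);
      if (!worker) {
        // Flag the gap instead of aborting the whole run (Section 6.3).
        this.gaps[subtask.id] = "NO_CAPABLE_WORKER";
        continue;
      }
      // Check the budget before dispatching, not after the money is spent.
      if (this.costSpent + worker.costPerCallEstimateUSD > budgetUSD) {
        this.gaps[subtask.id] = "BUDGET_EXCEEDED";
        continue;
      }

      const now = Date.now();
      const assignment: WorkerAssignment = {
        subtaskId: subtask.id,
        workerId: worker.id,
        assignedAt: now,
        deadlineMs: now + worker.averageLatencyMs * 2,
        status: "running",
      };
      this.assignments.set(subtask.id, assignment);

      let output = await this.executeWithTimeout(worker, subtask, assignment.deadlineMs);
      this.costSpent += worker.costPerCallEstimateUSD;
      let validation = await this.validate(subtask, output);

      if (validation.recommendation === "REJECT") {
        // Retry once; the retried output replaces the rejected one.
        output = await this.retryWithCorrection(worker, subtask, output, validation.issues);
        this.costSpent += worker.costPerCallEstimateUSD;
        validation = await this.validate(subtask, output);
      }
      if (validation.recommendation !== "ACCEPT") {
        assignment.status = "failed";
        this.gaps[subtask.id] = "HUMAN_REVIEW";
        await this.escalateToHuman(subtask, output, validation);
        continue;
      }

      assignment.status = "completed";
      this.validated[subtask.id] = output; // only validated outputs reach synthesis
    }
    return this.synthesise(this.validated, this.gaps);
  }

  private async validate(subtask: SubTask, output: unknown): Promise<ValidationResult> {
    return this.validationLLM.invoke({ subtask, output });
  }
}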
8. Observability
8.1 Supervisor-Level Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| Validation pass rate | % of worker outputs that pass validation first attempt | < 85% |
| Worker reassignment rate | % of subtasks requiring reassignment | > 10% |
| Supervisor plan latency | Time to produce worker assignment plan | > 5s |
| Human escalation rate | % of subtasks escalated to human review | > 5% |
| Pool utilisation | % of static pool workers busy at any time | > 90% sustained |
| Cost efficiency ratio | Total task cost / estimated cost at plan time | > 1.5× (overspend) |
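
The alert thresholds in the table can be evaluated from per-run counters. The sketch below assumes a hypothetical `RunStats` record and alert labels; pool utilisation is omitted because "sustained" needs a time window rather than a single run's counters.

```typescript
interface RunStats {
  outputsValidated: number;
  passedFirstAttempt: number;
  subtasks: number;
  reassigned: number;
  escalated: number;
  planLatencyMs: number;
  actualCostUSD: number;
  plannedCostUSD: number;
}

// Safe division: a zero denominator yields 0 instead of NaN or Infinity.
const ratio = (n: number, d: number): number => (d === 0 ? 0 : n / d);

// Returns the names of all metrics that breach their alert threshold.
function evaluateAlerts(s: RunStats): string[] {
  const alerts: string[] = [];
  if (s.outputsValidated > 0 && ratio(s.passedFirstAttempt, s.outputsValidated) < 0.85) {
    alerts.push("validation_pass_rate");
  }
  if (ratio(s.reassigned, s.subtasks) > 0.1) alerts.push("worker_reassignment_rate");
  if (s.planLatencyMs > 5000) alerts.push("supervisor_plan_latency");
  if (ratio(s.escalated, s.subtasks) > 0.05) alerts.push("human_escalation_rate");
  if (ratio(s.actualCostUSD, s.plannedCostUSD) > 1.5) alerts.push("cost_efficiency_ratio");
  return alerts;
}
```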
8.2 Per-Worker Metrics
Each worker emits spans with: workerId, subtaskId, taskId, input/output token counts, model version, latency, and validation result. These spans roll up to the supervisor trace.
9. Cost Governance
- Validation cost overhead. Use a cheaper model tier for the validation gate. Target < 15% overhead on the worker call cost.
- Dynamic worker budget gate. Before spawning a dynamic worker, check that estimatedWorkerCost < costBudgetRemaining × 0.5. Never spend more than 50% of remaining budget on a single worker spawn.
- Worker model tiering. Maintain a tiered cost map: Tier 1 (critical subtasks: frontier model), Tier 2 (standard subtasks: mid-tier model), Tier 3 (simple tasks: efficient model). The supervisor assigns tiers based on subtask criticality flags in the plan.
- Synthesis model. The final synthesis step should always use at minimum a Tier 2 model to ensure coherent final output quality regardless of which worker tiers were used.
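
The spawn gate and tiering rules above reduce to a few small functions. A minimal sketch, assuming hypothetical names (`canSpawnDynamicWorker`, `TIER_FOR`, `synthesisTier`) and assuming criticality flags take the three values shown:

```typescript
type Tier = 1 | 2 | 3; // 1 = frontier, 2 = mid-tier, 3 = efficient
type Criticality = "critical" | "standard" | "simple";

// Tier assignment by subtask criticality flag.
const TIER_FOR: Record<Criticality, Tier> = { critical: 1, standard: 2, simple: 3 };

// Budget gate: a single spawn must cost less than 50% of the remaining budget.
function canSpawnDynamicWorker(
  estimatedWorkerCostUSD: number,
  costBudgetRemainingUSD: number,
): boolean {
  return estimatedWorkerCostUSD < costBudgetRemainingUSD * 0.5;
}

// Synthesis never drops below Tier 2. Lower numbers are more capable,
// so "at minimum Tier 2" means the tier number is capped at 2.
function synthesisTier(requested: Tier): Tier {
  return requested > 2 ? 2 : requested;
}
```

For example, with 1.00 USD remaining, a worker estimated at 0.40 USD passes the gate, while one estimated at 0.50 USD does not because the comparison is strict.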
10. Security Considerations
10.1 Worker Isolation
Workers must not share memory state or credentials with each other. Each worker invocation is stateless. Shared data passes only through the supervisor's validated result store, never through direct worker-to-worker communication.
10.2 Supervisor Prompt Integrity
The supervisor prompt contains the capability registry and task context. If an attacker can inject into the task input, they could attempt to hijack worker selection. Mitigations: sanitise all task inputs before passing to the supervisor; use a structured input schema with strict type validation; never allow free-form user input to modify the supervisor's system prompt.
10.3 Validation Gate Bypass Attempts
A sophisticated adversarial input may be crafted to produce output that passes the validation gate while still being harmful. Supplement LLM-based validation with deterministic rule-based checks for known sensitive patterns (e.g., PII in output when output should not contain PII; external URLs in code output when no URLs are expected).
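The deterministic checks can run before or alongside the LLM-based review. The sketch below is illustrative only: the two regular expressions are simplistic stand-ins (an email pattern for PII, an http/https pattern for URLs), and `checkOutput` and its options are hypothetical names. A production gate would use a dedicated PII detector and domain-specific rules.

```typescript
interface RuleViolation {
  rule: string;
  match: string;
}

// Deliberately simple patterns for illustration; not a complete PII or URL detector.
const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const URL_PATTERN = /https?:\/\/[^\s"'<>]+/g;

// Returns every rule violation found in a worker's output text.
function checkOutput(
  text: string,
  opts: { allowPII: boolean; allowUrls: boolean },
): RuleViolation[] {
  const violations: RuleViolation[] = [];
  if (!opts.allowPII) {
    for (const m of text.match(EMAIL_PATTERN) ?? []) {
      violations.push({ rule: "pii.email", match: m });
    }
  }
  if (!opts.allowUrls) {
    for (const m of text.match(URL_PATTERN) ?? []) {
      violations.push({ rule: "url.external", match: m });
    }
  }
  return violations;
}
```

Any non-empty result would force a REJECT or HUMAN_REVIEW regardless of the LLM reviewer's recommendation, so a prompt-level bypass cannot override the rule.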
11. Failure Modes and Mitigations
| Failure Mode | Detection | Mitigation |
|---|---|---|
| Validation gate is too lenient | High human correction rate post-output | Increase validation prompt stringency; add targeted rules for known failure patterns |
| Supervisor itself hallucinates a plan | Plan schema validation fails | Validate plan is DAG; validate all referenced worker types exist in registry |
| All workers of a type are busy | Assignment queue growing; SLA breach | Auto-scale trigger; alert; fallback to higher-tier worker if available |
| Validation adds unacceptable latency | P95 latency breaches SLA | Use faster validation model; parallelise validation with next worker dispatch |
| Cost overrun from retries | Budget exceeded before task completion | Cap retries at 2 per subtask; escalate the third failure as HUMAN_REVIEW and move on |
| Dynamic worker spawn failure | Spawn request returns error | Retry once; if still failing, use available static pool worker even if sub-optimal capability match |
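
The plan check in the "Supervisor itself hallucinates a plan" row can be sketched with Kahn's algorithm for cycle detection plus a registry lookup. The `PlannedSubtask` shape, in particular the `dependsOn` field, is an assumption; the pattern does not specify how a plan encodes dependencies.

```typescript
interface PlannedSubtask {
  id: string;
  workerType: string;
  dependsOn: string[]; // assumed field: IDs of subtasks that must finish first
}

// Returns a list of problems; an empty list means the plan is acceptable.
function validatePlan(subtasks: PlannedSubtask[], knownWorkerTypes: Set<string>): string[] {
  const errors: string[] = [];
  const byId = new Map<string, PlannedSubtask>();
  for (const s of subtasks) {
    if (byId.has(s.id)) errors.push(`duplicate subtask id: ${s.id}`);
    byId.set(s.id, s);
    if (!knownWorkerTypes.has(s.workerType)) {
      errors.push(`unknown worker type: ${s.workerType}`);
    }
  }
  for (const s of subtasks) {
    for (const d of s.dependsOn) {
      if (!byId.has(d)) errors.push(`unknown dependency: ${s.id} -> ${d}`);
    }
  }

  // Kahn's algorithm: repeatedly remove subtasks with no unmet dependencies.
  const unmet = new Map<string, number>();
  byId.forEach((s, id) => unmet.set(id, s.dependsOn.filter((d) => byId.has(d)).length));
  const ready = Array.from(unmet.keys()).filter((id) => unmet.get(id) === 0);
  let visited = 0;
  while (ready.length > 0) {
    const done = ready.pop()!;
    visited++;
    byId.forEach((s, id) => {
      const hits = s.dependsOn.filter((d) => d === done).length;
      if (hits === 0) return;
      const left = unmet.get(id)! - hits;
      unmet.set(id, left);
      if (left === 0) ready.push(id);
    });
  }
  // Anything never removed sits on, or downstream of, a cycle.
  if (visited < byId.size) errors.push("plan contains a dependency cycle");
  return errors;
}
```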
12. Compliance and Governance
12.1 EU AI Act Article 14 — Human Oversight
The supervisor's escalation-to-human pathway is a compliance artefact demonstrating that human oversight is structurally embedded, not optional. For high-risk AI systems, every instance where a worker output was escalated to human review must be logged with: the subtask description, the worker's output, the validation failure reason, the human reviewer identity, and the human's decision. This log is the regulatory evidence of meaningful human control.
12.2 Audit Trail Requirements
Every supervisor run produces an immutable audit record containing: task ID, task description, full plan, every worker assignment with timestamps, every validation result, every retry and its reason, every human escalation, the final synthesised output, and total cost. This record must be tamper-evident (append-only store or cryptographic hash chain) for regulated use cases.
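A cryptographic hash chain is one of the two tamper-evidence options named above. The sketch below assumes a Node.js runtime (it uses `createHash` from `node:crypto`); the `AuditLog` class, the `AuditEntry` shape, and the all-zero genesis hash are illustrative choices, not a prescribed format.

```typescript
import { createHash } from "node:crypto";

interface AuditEntry {
  seq: number;
  event: string;
  payload: string; // JSON-serialised event data
  prevHash: string; // hash of the previous entry, or GENESIS for the first
  hash: string;
}

const GENESIS = "0".repeat(64);

// SHA-256 over the entry's content and its link to the previous entry.
function digest(seq: number, event: string, payload: string, prevHash: string): string {
  return createHash("sha256")
    .update(JSON.stringify([seq, event, payload, prevHash]))
    .digest("hex");
}

class AuditLog {
  private readonly entries: AuditEntry[] = [];

  append(event: string, data: Record<string, unknown>): AuditEntry {
    const seq = this.entries.length;
    const prevHash = seq === 0 ? GENESIS : this.entries[seq - 1].hash;
    const payload = JSON.stringify(data);
    const entry: AuditEntry = { seq, event, payload, prevHash, hash: digest(seq, event, payload, prevHash) };
    this.entries.push(entry);
    return entry;
  }

  // Returns copies so callers cannot mutate the log in place.
  snapshot(): AuditEntry[] {
    return this.entries.map((e) => ({ ...e }));
  }
}

// Recomputes every hash and link; false means an entry was altered, removed, or reordered.
function verifyChain(entries: AuditEntry[]): boolean {
  let prev = GENESIS;
  return entries.every((e, i) => {
    const ok =
      e.seq === i && e.prevHash === prev && e.hash === digest(e.seq, e.event, e.payload, e.prevHash);
    prev = e.hash;
    return ok;
  });
}
```

A chain like this detects edits to existing entries, but not truncation of the tail or a wholesale rewrite, unless the latest hash is also anchored somewhere the writer cannot modify.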
13. Testing Strategy
13.1 Unit Tests
- Capability registry lookup: given a subtask with capability tag X, assert the correct worker type is selected.
- Validation gate: given a known-good and known-bad worker output, assert correct ACCEPT/REJECT classification.
- SLA monitor: given an assignment with a deadline 100ms in the past, assert reassignment is triggered.
13.2 Integration Tests
- Full supervisor run with stub workers: one worker returns invalid schema on first call, valid on second. Assert final output is produced and retry counter is incremented.
- Budget exhaustion: stub workers report high cost. Assert supervisor stops assigning new subtasks when budget ceiling is reached and returns partial result.
- Human escalation path: stub worker fails validation twice. Assert escalation event is written to the human review queue and subtask is marked as escalated.
13.3 Load Tests
- Simulate 50 concurrent supervisor runs sharing a static worker pool of 10. Assert p95 latency remains within SLA and no worker is assigned more than maxConcurrentTasks subtasks simultaneously.
13.4 End-to-End Tests (Playwright)
For each supported high-stakes task type, run a live end-to-end test. Assert: validation gate fires for each worker output; at least one subtask is retried (using a stub that initially returns a structurally incomplete output); human escalation queue receives escalation events; final output schema is valid.
14. Variants and Extensions
14.1 Peer Supervisor Hierarchy
For very large tasks, multiple supervisors can operate in parallel, each managing a sub-pool of workers, with a meta-supervisor coordinating between them. Maximum hierarchy depth: 2 tiers of supervisors to avoid management overhead exceeding task value.
14.2 Specialised Supervisor per Domain
Rather than a general-purpose supervisor, deploy domain-specific supervisors (legal supervisor, financial supervisor, code review supervisor) each with a registry tuned to their domain. A routing layer directs incoming tasks to the correct domain supervisor.
14.3 Continuous Supervision (Streaming)
For real-time workflows, the supervisor monitors streaming worker outputs and interrupts a worker mid-generation if the partial output shows signs of heading off-track (e.g., generating content outside scope). Requires streaming support in the LLM provider API.
15. Trade-off Analysis
| Dimension | Supervisor Agent | Basic Orchestration | No Supervision |
|---|---|---|---|
| Output quality | Highest (validation gate) | Moderate | Lowest |
| Latency | Highest (validation overhead) | Moderate | Lowest |
| Cost | Highest (validation LLM calls) | Moderate | Lowest |
| Error recovery | Structured (reassign/escalate) | Basic (retry) | None |
| Compliance suitability | Highest | Moderate | Not suitable for regulated use |
Use supervisor pattern when: task errors have regulatory, financial, or safety consequences; intermediate results feed further reasoning steps; you need an audit trail of quality decisions.
Use basic orchestration when: tasks are lower-stakes; end-to-end validation of the final output is sufficient; latency budget is tight.
16. Known Implementations
| Organisation Type | Use Case | Worker Pool Size | Reported Outcome |
|---|---|---|---|
| Global insurance carrier | Policy document risk analysis | 6 specialist workers | Validation gate catches 23% of worker outputs; final error rate < 0.5% |
| Pharmaceutical company | Regulatory submission drafting | 8 specialist workers | Human escalation rate 3.2%; 0 regulatory rejections in 12 months |
| Tier-1 investment bank | Credit memo automation | 4 specialist workers | 65% reduction in analyst prep time; SR 11-7 audit passed |
| Healthcare network | Prior authorisation review | 5 specialist workers | 18% improvement in approval accuracy vs. unvalidated pipeline |
17. Related Patterns
| Pattern ID | Name | Relationship |
|---|---|---|
| EAAPL-MAG001 | Multi-Agent Orchestration | Foundation pattern; supervisor adds active oversight and validation |
| EAAPL-MAG003 | Human-in-the-Loop Agent | Used as the escalation endpoint for HUMAN_REVIEW outcomes |
| EAAPL-MAG006 | Agent Handoff Protocol | Defines message schema for supervisor-to-worker and worker-to-supervisor communication |
| EAAPL-INT007 | AI Circuit Breaker | Applied per worker type to handle worker health failures |
| EAAPL-MAG005 | Debate Agent | Supervisor can invoke debate between two workers on a contested subtask |
18. References
- Gartner, "AI Agent Topologies: From Orchestration to Supervision," 2025 (ID: G00817884)
- Anthropic, "Building Effective Agents," 2025 — anthropic.com/research/building-effective-agents
- Microsoft AutoGen: Agent Supervision Patterns — microsoft.github.io/autogen/docs/topics/supervision
- LangGraph: Supervisor Multi-Agent Architecture — langchain-ai.github.io/langgraph/tutorials/multi_agent/agent_supervisor
- EU AI Act (Regulation 2024/1689), Article 14: Human Oversight of High-Risk AI Systems
- NIST AI RMF 1.0, Map 5.2: Operator Monitoring and Human Review
- SR 11-7: Guidance on Model Risk Management — federalreserve.gov/supervisionreg/srletters/sr1107.htm
- Liang et al., "Encouraging Divergent Thinking in Large Language Models through Debate," 2023 — arxiv.org/abs/2305.19118
- Chase, H., "LangChain Expression Language and Multi-Agent Patterns," 2024
- AWS, "Building Reliable AI Agents with Step Functions and Bedrock," 2025 — aws.amazon.com/blogs/machine-learning