EAAPL-MAG002 — Supervisor Agent

Status: Proven
Tags: agent orchestration human-oversight high-complexity
Version: 2.0.0
Last Updated: 2026-06-12

1. Pattern Identity

Field	Value
Pattern ID	EAAPL-MAG002
Name	Supervisor Agent
Category	Multi-Agent
Maturity	Proven
Complexity	High
Related Patterns	EAAPL-MAG001 · EAAPL-MAG003 · EAAPL-MAG005 · EAAPL-MAG006 · EAAPL-INT007

2. Executive Summary

The Supervisor Agent pattern establishes a hierarchical two-tier architecture in which a supervisor agent manages a pool of worker agents, each responsible for executing a bounded, well-defined subtask. Unlike the general orchestration pattern (EAAPL-MAG001), the supervisor is responsible not only for decomposition and dispatch but for continuous active oversight: monitoring worker progress against SLAs, validating each worker's output before incorporating it into the next reasoning step, recovering from worker failures through reassignment or escalation, and controlling the total cost of the worker pool. The critical distinguishing characteristic is the supervisor's validation gate: every worker output passes through a quality review before being used. This prevents hallucination compounding — the phenomenon where an incorrect intermediate output, when passed unchecked to the next worker, cascades into a deeply wrong final result. Enterprises deploy this pattern in high-stakes workflows (legal, financial, clinical, security) where intermediate errors carry regulatory or financial consequences.

3. Problem Statement

3.1 Context

In complex AI workflows, intermediate agent outputs are frequently passed forward without review. This is acceptable for low-stakes tasks where end-to-end correctness is verifiable cheaply. For high-stakes domains, a single fabricated clause in a contract analysis, a single incorrect risk score in a credit workflow, or a single misidentified vulnerability in a security audit can invalidate the entire downstream result — and may not be detected until human review occurs hours or days later.

3.2 Forces in Tension

Throughput vs. correctness. Validating every worker output adds latency and cost. Skipping validation allows errors to compound.
Autonomy vs. control. Workers that act independently are efficient but unsupervised. Workers that constantly check back with the supervisor are safe but slow.
Static pool vs. dynamic spawn. A static worker pool has predictable cost and cold-start time but cannot scale to burst demand. Dynamic spawning scales elastically but introduces provisioning latency.
Cost containment vs. completeness. More workers running in parallel completes the task faster but multiplies token spend.

3.3 Failure Modes Without This Pattern

Without a validation gate, hallucinated worker outputs become ground truth for subsequent reasoning steps, producing confidently wrong final outputs. Without SLA monitoring, a single slow worker holds up synthesis indefinitely. Without a capability registry, task assignments default to the cheapest or most-available worker rather than the most-appropriate one.

4. Solution

4.1 Supervisor-Worker Architecture

ARCHITECTURE DIAGRAM

flowchart TD
    subgraph Control["Supervisor Control"]
        A[Task Assignment]
        B[Supervisor Agent]
        C[Capability Registry]
    end
    subgraph Workers["Worker Pool"]
        D[Worker 1]
        E[Worker 2]
        F[Worker 3]
    end
    subgraph Validation["Result Handling"]
        G{Validation Gate}
        H[Result Synthesis]
        I[Human Review Queue]
    end
    A --> B --> C --> B
    B --> D --> G
    B --> E --> G
    B --> F --> G
    G -->|pass| H --> J[Final Output]
    G -->|fail| I
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#f0fdf4,stroke:#22c55e
    style C fill:#fef9c3,stroke:#eab308
    style D fill:#f0fdf4,stroke:#22c55e
    style E fill:#f0fdf4,stroke:#22c55e
    style F fill:#f0fdf4,stroke:#22c55e
    style G fill:#f3e8ff,stroke:#a855f7
    style H fill:#f0fdf4,stroke:#22c55e
    style I fill:#fee2e2,stroke:#ef4444
    style J fill:#d1fae5,stroke:#10b981

4.2 Worker Selection Flow

ARCHITECTURE DIAGRAM

flowchart TD
    subgraph Selection["Worker Selection"]
        A[Subtask Arrives]
        B[Capability Registry]
        C{Matching Workers}
    end
    subgraph Allocation["Worker Allocation"]
        D{Static Pool Free}
        E[Assign Static Worker]
        F{Budget Allows Spawn}
        G[Dynamic Spawn Worker]
    end
    A --> B --> C
    C -->|yes| D
    C -->|no| H[No Worker Error]
    D -->|yes| E --> I[Worker Executing]
    D -->|no| F
    F -->|yes| G --> I
    F -->|no| J[Budget Error]
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#fef9c3,stroke:#eab308
    style C fill:#f3e8ff,stroke:#a855f7
    style D fill:#f3e8ff,stroke:#a855f7
    style E fill:#f0fdf4,stroke:#22c55e
    style F fill:#f3e8ff,stroke:#a855f7
    style G fill:#f0fdf4,stroke:#22c55e
    style H fill:#fee2e2,stroke:#ef4444
    style I fill:#d1fae5,stroke:#10b981
    style J fill:#fee2e2,stroke:#ef4444
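The three decision nodes of the selection flow can be read as one small function. The following is a minimal sketch, not a reference implementation: `PoolState`, `Allocation`, and `allocate` are hypothetical names, and the budget decision is passed in as a boolean so that the sketch makes no assumption about how "budget allows spawn" is computed.

```typescript
// Hypothetical snapshot of the pool for one incoming subtask.
interface PoolState {
  matchingWorkers: number;    // registry entries whose capabilities match the subtask
  freeStaticWorkers: number;  // matching static-pool workers with spare capacity
  budgetAllowsSpawn: boolean; // outcome of the budget check, computed elsewhere
}

type Allocation = "ASSIGN_STATIC" | "SPAWN_DYNAMIC" | "NO_WORKER_ERROR" | "BUDGET_ERROR";

// Walks the decision nodes in the same order as the flowchart.
function allocate(state: PoolState): Allocation {
  if (state.matchingWorkers === 0) return "NO_WORKER_ERROR"; // Matching Workers: no
  if (state.freeStaticWorkers > 0) return "ASSIGN_STATIC";   // Static Pool Free: yes
  return state.budgetAllowsSpawn ? "SPAWN_DYNAMIC" : "BUDGET_ERROR";
}
```

Keeping the function pure makes each branch of the flow straightforward to unit test.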

5. Structure

5.1 Component Catalogue

Component	Responsibility	Technology Options
Supervisor Agent	Plan, assign, monitor, validate, synthesise	LLM with supervisor prompt, LangGraph StateGraph
Capability Registry	Worker discovery, capability matching, cost/latency metadata	In-memory map, Redis, service registry
Worker Pool	Bounded-capability task execution	LLM + tools, specialised models, RPA bots
Validation Gate	Quality review of each worker output before use	LLM-as-judge, JSON schema validator, business rule engine
SLA Monitor	Track per-worker deadlines, trigger reassignment on breach	Cron + deadline field in task message
Result Synthesiser	Combine validated outputs into final response	LLM synthesis prompt
Cost Controller	Track and enforce token budget across the worker pool	Middleware that intercepts LLM calls

5.2 Capability Registry Schema

{
  "workerId": "legal-clause-extractor-v2",
  "capabilities": ["contract-analysis", "clause-extraction", "risk-identification"],
  "acceptedInputSchema": "schemas/contract-input.json",
  "outputSchema": "schemas/clause-extraction-output.json",
  "model": "claude-3-7-sonnet",
  "averageLatencyMs": 4200,
  "costPerCallEstimateUSD": 0.024,
  "maxConcurrentTasks": 5,
  "healthEndpoint": "https://workers/legal-clause-extractor/health"
}
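One way to hold such entries in memory is sketched below. The `RegistryEntry` fields mirror the JSON example above; the `CapabilityRegistry` class and its `find` method are hypothetical and only illustrate an index keyed by capability tag.

```typescript
// Shape of one registry entry, taken field for field from the JSON example.
interface RegistryEntry {
  workerId: string;
  capabilities: string[];
  acceptedInputSchema: string;
  outputSchema: string;
  model: string;
  averageLatencyMs: number;
  costPerCallEstimateUSD: number;
  maxConcurrentTasks: number;
  healthEndpoint: string;
}

// Hypothetical in-memory registry that indexes entries by capability tag.
class CapabilityRegistry {
  private byCapability = new Map<string, RegistryEntry[]>();

  constructor(entries: RegistryEntry[]) {
    for (const entry of entries) {
      for (const tag of entry.capabilities) {
        const bucket = this.byCapability.get(tag) ?? [];
        bucket.push(entry);
        this.byCapability.set(tag, bucket);
      }
    }
  }

  // Returns every worker advertising the tag, or an empty array if none do.
  find(capability: string): RegistryEntry[] {
    return this.byCapability.get(capability) ?? [];
  }
}
```

An empty result from `find` corresponds to the "No Worker Error" branch of the selection flow.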

6. Behaviour

6.1 Supervisor Responsibilities

Receive and decompose task. The supervisor receives a task specification and breaks it into subtasks using its planning prompt. Each subtask maps to one or more capability tags in the registry.

Select and assign workers. For each subtask, the supervisor queries the capability registry for workers whose capability tags match. Selection criteria: capability match (required), current load (prefer idle), cost estimate (prefer cheaper if quality is equivalent), SLA compatibility (reject if worker's average latency would breach the subtask deadline).

Monitor progress. The supervisor maintains a live state map: { subtaskId → { workerId, assignedAt, deadlineMs, status } }. A background monitor checks for deadline breaches every N seconds. On breach, the supervisor either reassigns (if another capable worker is available) or marks the subtask as timed-out.

Validate worker outputs. This is the critical differentiator. For each worker output, the supervisor invokes a validation check before incorporating the result. Validation is multi-layered: schema validation (structured output format); factual consistency check (does the output contradict known facts from the input?); completeness check (did the worker address all required aspects of the subtask?); confidence scoring (if the worker returns a confidence score below threshold, flag for additional review).

Recover from failures. See Section 6.3.

Synthesise final result. Once all validated subtask results are collected (or gracefully degraded), the supervisor invokes a synthesis step that combines them into a coherent final output, explicitly noting any subtask gaps.
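The selection criteria listed under "Select and assign workers" can be sketched as a filter followed by a sort. This is an illustrative sketch only: `Candidate`, `activeTasks`, and `selectWorker` are hypothetical names, quality equivalence is not modelled, and ranking load ahead of cost is an assumption rather than something the pattern prescribes.

```typescript
interface Candidate {
  workerId: string;
  capabilities: string[];
  activeTasks: number;            // hypothetical live load counter
  averageLatencyMs: number;
  costPerCallEstimateUSD: number;
}

// Picks a worker for one subtask, or undefined if no candidate qualifies.
function selectWorker(
  candidates: Candidate[],
  capability: string,
  timeToDeadlineMs: number,
): Candidate | undefined {
  const eligible = candidates
    .filter(c => c.capabilities.includes(capability))     // capability match is required
    .filter(c => c.averageLatencyMs <= timeToDeadlineMs); // reject workers likely to breach the SLA
  eligible.sort((a, b) =>
    a.activeTasks - b.activeTasks ||                      // prefer idle workers
    a.costPerCallEstimateUSD - b.costPerCallEstimateUSD); // then prefer the cheaper one
  return eligible[0];
}
```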

6.2 Quality Review — Preventing Hallucination Compounding

The validation gate is not optional. The most common failure mode in multi-agent systems is a hallucinated intermediate output being treated as ground truth by the next agent in the chain. The supervisor's validation gate breaks this chain.

Validation prompt structure:

You are a quality reviewer. Given the following subtask specification and worker output,
determine:
1. SCHEMA: Does the output conform to the expected JSON schema? [PASS/FAIL]
2. COMPLETENESS: Did the worker address all required aspects of the subtask? [PASS/FAIL + details]
3. CONSISTENCY: Does the output contradict any facts in the original input? [PASS/FAIL + details]
4. CONFIDENCE: Rate your confidence in the output quality from 1-10.

Return JSON: { "schemaPass": bool, "completenessPass": bool, "consistencyPass": bool,
               "confidence": int, "issues": [], "recommendation": "ACCEPT|REJECT|HUMAN_REVIEW" }

A REJECT causes a retry with a correction prompt appended to the worker's context. A HUMAN_REVIEW recommendation escalates to the human review queue. An ACCEPT with confidence below 7 is flagged in the final output as low-confidence.
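The handling rules above map onto a small parse-and-decide step. The sketch below assumes the validator replies with the JSON shape shown in the prompt; `parseVerdict`, `gateAction`, and the `GateAction` labels are hypothetical, and treating a malformed reply as HUMAN_REVIEW is an assumption (fail closed), not something the pattern specifies.

```typescript
type Recommendation = "ACCEPT" | "REJECT" | "HUMAN_REVIEW";

interface ValidationVerdict {
  schemaPass: boolean;
  completenessPass: boolean;
  consistencyPass: boolean;
  confidence: number;
  issues: string[];
  recommendation: Recommendation;
}

type GateAction = "USE" | "USE_FLAG_LOW_CONFIDENCE" | "RETRY_WITH_CORRECTION" | "ESCALATE";

// Parses the validator's raw reply; anything malformed becomes a HUMAN_REVIEW verdict.
function parseVerdict(raw: string): ValidationVerdict {
  const failClosed: ValidationVerdict = {
    schemaPass: false, completenessPass: false, consistencyPass: false,
    confidence: 0, issues: ["unparseable validator reply"], recommendation: "HUMAN_REVIEW",
  };
  try {
    const v = JSON.parse(raw);
    const wellFormed =
      typeof v === "object" && v !== null &&
      typeof v.schemaPass === "boolean" &&
      typeof v.completenessPass === "boolean" &&
      typeof v.consistencyPass === "boolean" &&
      typeof v.confidence === "number" &&
      Array.isArray(v.issues) &&
      ["ACCEPT", "REJECT", "HUMAN_REVIEW"].includes(v.recommendation);
    return wellFormed ? (v as ValidationVerdict) : failClosed;
  } catch {
    return failClosed;
  }
}

// Applies the rules from the text: REJECT retries, HUMAN_REVIEW escalates,
// and an ACCEPT with confidence below 7 is used but flagged.
function gateAction(v: ValidationVerdict, lowConfidenceBelow = 7): GateAction {
  if (v.recommendation === "HUMAN_REVIEW") return "ESCALATE";
  if (v.recommendation === "REJECT") return "RETRY_WITH_CORRECTION";
  return v.confidence < lowConfidenceBelow ? "USE_FLAG_LOW_CONFIDENCE" : "USE";
}
```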

6.3 Worker Failure Recovery

Failure Type	Recovery Action
Worker timeout (deadline breach)	Reassign to next-best capable worker. If none available, mark subtask as degraded.
Worker returns invalid schema	Retry once with schema correction in prompt. On second failure, reject and attempt reassignment.
Worker quality validation fails	Retry with correction prompt citing specific validation failures. On second failure, escalate to HUMAN_REVIEW.
Worker crashes / goes unhealthy	Circuit breaker (EAAPL-INT007) opens. Reassign to healthy worker. Alert on-call.
No capable worker available	Return `NO_CAPABLE_WORKER` error for that subtask. Supervisor flags in final output.
Budget exceeded before completion	Complete highest-priority subtasks. Mark remaining as `BUDGET_EXCEEDED`. Return partial with explicit warning.
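The recovery table can be encoded as a single lookup function, which keeps the policy in one place. This is a sketch under stated assumptions: the `Failure` and `Recovery` labels and the `recover` function are hypothetical names, and `failuresSoFar` counts failures of the same type for the subtask, including the current one.

```typescript
type Failure =
  | "TIMEOUT" | "INVALID_SCHEMA" | "VALIDATION_FAILED"
  | "UNHEALTHY" | "NO_CAPABLE_WORKER" | "BUDGET_EXCEEDED";

type Recovery =
  | "REASSIGN" | "MARK_DEGRADED" | "RETRY_WITH_CORRECTION" | "ESCALATE_HUMAN_REVIEW"
  | "OPEN_CIRCUIT_AND_REASSIGN" | "FLAG_IN_OUTPUT" | "RETURN_PARTIAL";

// One case per row of the recovery table.
function recover(failure: Failure, failuresSoFar: number, alternativeAvailable: boolean): Recovery {
  switch (failure) {
    case "TIMEOUT":
      return alternativeAvailable ? "REASSIGN" : "MARK_DEGRADED";
    case "INVALID_SCHEMA":
      // Retry once with a schema correction; the second failure triggers reassignment.
      return failuresSoFar < 2 ? "RETRY_WITH_CORRECTION" : "REASSIGN";
    case "VALIDATION_FAILED":
      // Retry once citing the validation failures; the second failure goes to a human.
      return failuresSoFar < 2 ? "RETRY_WITH_CORRECTION" : "ESCALATE_HUMAN_REVIEW";
    case "UNHEALTHY":
      return "OPEN_CIRCUIT_AND_REASSIGN";
    case "NO_CAPABLE_WORKER":
      return "FLAG_IN_OUTPUT";
    case "BUDGET_EXCEEDED":
      return "RETURN_PARTIAL";
  }
}
```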

6.4 Dynamic Pool Management

Auto-scale. When all static pool workers for a capability type are at capacity and a new subtask arrives, the supervisor requests a dynamic worker spawn if the remaining task budget allows. Dynamic workers are terminated after completing their assigned subtask (or after an idle timeout of 60s) to control cost.

Scale-in. The supervisor tracks utilisation across the pool. Workers with zero tasks for > 5 minutes (configurable) are released to reduce standing cost.

Cost ceiling enforcement. The supervisor tracks cumulative token spend across all workers. When cumulative spend reaches 80% of the task budget, the supervisor stops mixing model tiers and assigns all remaining subtasks to workers from a single model tier.
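The 80% ceiling and the idle-release rule can both be sketched as simple bookkeeping. The names `CostController` and `idleWorkers` are hypothetical, spend is tracked in USD rather than tokens for brevity, and the sketch covers only detection, not what the supervisor does once a threshold is crossed.

```typescript
// Tracks cumulative spend for one task against its budget.
class CostController {
  private spentUSD = 0;

  constructor(
    private readonly budgetUSD: number,
    private readonly restrictAt = 0.8, // share of budget at which restrictions begin
  ) {}

  record(costUSD: number): void {
    this.spentUSD += costUSD;
  }

  get remainingUSD(): number {
    return Math.max(0, this.budgetUSD - this.spentUSD);
  }

  // True once cumulative spend reaches the threshold share of the budget.
  get restricted(): boolean {
    return this.spentUSD >= this.budgetUSD * this.restrictAt;
  }
}

// Returns ids of workers idle for longer than the limit (5 minutes by default).
function idleWorkers(
  lastBusyAt: Map<string, number>, // workerId -> epoch ms of the last finished task
  now: number,
  idleLimitMs = 300000,
): string[] {
  return Array.from(lastBusyAt.entries())
    .filter(([, busyAt]) => now - busyAt > idleLimitMs)
    .map(([workerId]) => workerId);
}
```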

7. Implementation Guide

7.1 Step-by-Step

Step 1 — Build the Capability Registry. Start with a static JSON file or database table. Each entry describes a worker type, its capabilities, schemas, and cost/latency profile. The supervisor reads this at startup and caches it.

Step 2 — Implement the Supervisor Planning Prompt. The supervisor system prompt must include the capability registry as context, instructing it to return a structured plan mapping each subtask to a specific worker type. Include the instruction: "Always prefer lower-cost workers when quality requirements are equivalent."

Step 3 — Implement the Validation Gate. Build the validation prompt as a separate LLM call after each worker response. Use a cheaper, faster model for validation (e.g., Claude Haiku, GPT-4o-mini) to limit the cost overhead of the validation step to approximately 15% of the worker call cost.

Step 4 — Implement the SLA Monitor. Use a lightweight polling loop or event-driven timer that checks all in-flight subtask deadlines every 5 seconds. On deadline breach, the monitor calls supervisor.reassign(subtaskId).

Step 5 — Implement Result Synthesis. The synthesis step should be a single LLM call with all validated subtask results and an explicit instruction to note any missing or low-confidence domains.
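Step 4 can be sketched as a pure breach check wrapped in a polling timer. `InFlight`, `findBreaches`, and `startSlaMonitor` are hypothetical names; the `reassign` callback stands in for `supervisor.reassign(subtaskId)`, and `deadlineMs` is assumed to be an absolute epoch timestamp.

```typescript
interface InFlight {
  subtaskId: string;
  workerId: string;
  deadlineMs: number; // absolute epoch timestamp in ms
  status: "running" | "completed" | "failed" | "timeout";
}

// Pure check: which running subtasks are past their deadline at time `now`?
function findBreaches(assignments: InFlight[], now: number): InFlight[] {
  return assignments.filter(a => a.status === "running" && now > a.deadlineMs);
}

// Polls every intervalMs (5 seconds by default) and reports each breach once.
// Returns a function that stops the monitor.
function startSlaMonitor(
  assignments: Map<string, InFlight>,
  reassign: (subtaskId: string) => void,
  intervalMs = 5000,
): () => void {
  const timer = setInterval(() => {
    for (const breach of findBreaches(Array.from(assignments.values()), Date.now())) {
      breach.status = "timeout"; // mark first so the same breach is not reported twice
      reassign(breach.subtaskId);
    }
  }, intervalMs);
  return () => clearInterval(timer);
}
```

Separating `findBreaches` from the timer lets the deadline logic be tested with a fixed `now` instead of real time.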

7.2 Code Skeleton (TypeScript)

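// Minimal supporting types, assumed for this skeleton. Field names follow Section 5.2.
interface SubTask { id: string; requiredCapability: string; }
interface PoolWorker { id: string; averageLatencyMs: number; costPerCallEstimateUSD: number; }
interface ValidationResult {
  recommendation: "ACCEPT" | "REJECT" | "HUMAN_REVIEW";
  confidence: number;
  issues: string[];
}
interface SupervisorResult { output: unknown; warnings: string[]; }
interface ValidationLLM {
  invoke(input: { subtask: SubTask; output: unknown }): Promise<ValidationResult>;
}

interface WorkerAssignment {
  subtaskId: string;
  workerId: string;
  assignedAt: number;
  deadlineMs: number; // absolute epoch timestamp in ms, not a duration
  status: "running" | "completed" | "failed" | "timeout" | "escalated";
}

abstract class SupervisorAgent {
  private assignments: Map<string, WorkerAssignment> = new Map();
  private costSpent = 0;

  constructor(private readonly validationLLM: ValidationLLM) {}

  // Integration points left to the implementer.
  protected abstract plan(task: string): Promise<{ subtasks: SubTask[] }>;
  protected abstract selectWorkers(subtasks: SubTask[]): Promise<Map<string, PoolWorker>>;
  protected abstract executeWithTimeout(worker: PoolWorker, subtask: SubTask, deadlineMs: number): Promise<unknown>;
  protected abstract retryWithCorrection(worker: PoolWorker, subtask: SubTask, previous: unknown, issues: string[]): Promise<unknown>;
  protected abstract escalateToHuman(subtask: SubTask, output: unknown, validation: ValidationResult): Promise<void>;
  protected abstract synthesise(validated: Map<string, unknown>, warnings: string[]): Promise<SupervisorResult>;

  async supervise(task: string, budgetUSD: number): Promise<SupervisorResult> {
    this.costSpent = 0;
    const validated = new Map<string, unknown>(); // only outputs that passed the gate
    const warnings: string[] = [];                // subtask gaps surfaced in the final output
    const plan = await this.plan(task);
    const workerPool = await this.selectWorkers(plan.subtasks);

    for (const subtask of plan.subtasks) {
      const worker = workerPool.get(subtask.requiredCapability);
      if (!worker) {
        // Section 6.3: flag the gap instead of aborting the whole run.
        warnings.push(`NO_CAPABLE_WORKER:${subtask.id}`);
        continue;
      }
      // Check the budget before dispatch so the ceiling is not overshot by a new call.
      if (this.costSpent + worker.costPerCallEstimateUSD > budgetUSD) {
        warnings.push(`BUDGET_EXCEEDED:${subtask.id}`);
        continue;
      }

      const now = Date.now();
      const assignment: WorkerAssignment = {
        subtaskId: subtask.id,
        workerId: worker.id,
        assignedAt: now,
        deadlineMs: now + worker.averageLatencyMs * 2,
        status: "running"
      };
      this.assignments.set(subtask.id, assignment);

      let output = await this.executeWithTimeout(worker, subtask, assignment.deadlineMs);
      this.costSpent += worker.costPerCallEstimateUSD;
      let validation = await this.validate(subtask, output);

      if (validation.recommendation === "REJECT") {
        // One retry citing the specific validation failures; the retry is billed too.
        output = await this.retryWithCorrection(worker, subtask, output, validation.issues);
        this.costSpent += worker.costPerCallEstimateUSD;
        validation = await this.validate(subtask, output);
      }
      if (validation.recommendation !== "ACCEPT") {
        // HUMAN_REVIEW on the first pass, or anything other than ACCEPT after the retry.
        await this.escalateToHuman(subtask, output, validation);
        assignment.status = "escalated";
        warnings.push(`HUMAN_REVIEW:${subtask.id}`);
        continue;
      }
      if (validation.confidence < 7) warnings.push(`LOW_CONFIDENCE:${subtask.id}`);
      assignment.status = "completed";
      validated.set(subtask.id, output); // the retried output is stored if a retry happened
    }
    return this.synthesise(validated, warnings);
  }

  private async validate(subtask: SubTask, output: unknown): Promise<ValidationResult> {
    return this.validationLLM.invoke({ subtask, output });
  }
}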

8. Observability

8.1 Supervisor-Level Metrics

Metric	Description	Alert Threshold
Validation pass rate	% of worker outputs that pass validation first attempt	< 85%
Worker reassignment rate	% of subtasks requiring reassignment	> 10%
Supervisor plan latency	Time to produce worker assignment plan	> 5s
Human escalation rate	% of subtasks escalated to human review	> 5%
Pool utilisation	% of static pool workers busy at any time	> 90% sustained
Cost efficiency ratio	Total task cost / estimated cost at plan time	> 1.5× (overspend)
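The thresholds in the table can be checked with a small evaluator. This is a sketch only: `SupervisorMetrics` and `breachedAlerts` are hypothetical names, rates are assumed to be fractions between 0 and 1, and the "sustained" qualifier on pool utilisation is not modelled because it needs a time window rather than a single snapshot.

```typescript
interface SupervisorMetrics {
  validationPassRate: number;  // share of outputs passing validation on the first attempt
  reassignmentRate: number;    // share of subtasks reassigned
  planLatencyMs: number;       // time to produce the assignment plan
  humanEscalationRate: number; // share of subtasks escalated to human review
  poolUtilisation: number;     // share of static pool workers busy
  costEfficiencyRatio: number; // total task cost / estimated cost at plan time
}

// Returns the names of all metrics that currently breach their alert threshold.
function breachedAlerts(m: SupervisorMetrics): string[] {
  const checks: Array<[string, boolean]> = [
    ["validation_pass_rate", m.validationPassRate < 0.85],
    ["worker_reassignment_rate", m.reassignmentRate > 0.10],
    ["supervisor_plan_latency", m.planLatencyMs > 5000],
    ["human_escalation_rate", m.humanEscalationRate > 0.05],
    ["pool_utilisation", m.poolUtilisation > 0.90],
    ["cost_efficiency_ratio", m.costEfficiencyRatio > 1.5],
  ];
  return checks.filter(([, breached]) => breached).map(([metric]) => metric);
}
```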

8.2 Per-Worker Metrics

Each worker emits spans with: workerId, subtaskId, taskId, input/output token counts, model version, latency, and validation result. These spans roll up to the supervisor trace.

9. Cost Governance

Validation cost overhead. Use a cheaper model tier for the validation gate. Target < 15% overhead on the worker call cost.
Dynamic worker budget gate. Before spawning a dynamic worker, check that estimatedWorkerCost < costBudgetRemaining × 0.5. Never spend more than 50% of remaining budget on a single worker spawn.
Worker model tiering. Maintain a tiered cost map: Tier 1 (critical subtasks — frontier model), Tier 2 (standard subtasks — mid-tier model), Tier 3 (simple tasks — efficient model). The supervisor assigns tiers based on subtask criticality flags in the plan.
Synthesis model. The final synthesis step should always use at minimum a Tier 2 model to ensure coherent final output quality regardless of which worker tiers were used.
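The spawn gate and tier rules above reduce to a few lines. In this sketch, `maySpawn`, `TIER_FOR`, and `synthesisTier` are hypothetical names, and the criticality labels are assumed values for the criticality flags that the plan carries.

```typescript
type Tier = 1 | 2 | 3; // 1 = frontier model, 2 = mid-tier model, 3 = efficient model
type Criticality = "critical" | "standard" | "simple";

// Tier assigned to a subtask from its criticality flag.
const TIER_FOR: Record<Criticality, Tier> = { critical: 1, standard: 2, simple: 3 };

// Dynamic worker budget gate: a single spawn must cost less than half of what remains.
function maySpawn(estimatedWorkerCostUSD: number, costBudgetRemainingUSD: number): boolean {
  return estimatedWorkerCostUSD < costBudgetRemainingUSD * 0.5;
}

// Synthesis uses at minimum a Tier 2 model (a lower number means a stronger model).
function synthesisTier(preferred: Tier): Tier {
  return preferred > 2 ? 2 : preferred;
}
```

For example, with 1.00 USD remaining, a worker estimated at 0.40 USD passes the gate and one estimated at 0.50 USD does not.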

10. Security Considerations

10.1 Worker Isolation

Workers must not share memory state or credentials with each other. Each worker invocation is stateless. Shared data passes only through the supervisor's validated result store, never through direct worker-to-worker communication.

10.2 Supervisor Prompt Integrity

The supervisor prompt contains the capability registry and task context. If an attacker can inject into the task input, they could attempt to hijack worker selection. Mitigations: sanitise all task inputs before passing to the supervisor; use a structured input schema with strict type validation; never allow free-form user input to modify the supervisor's system prompt.

10.3 Validation Gate Bypass Attempts

A sophisticated adversarial input may be crafted to produce output that passes the validation gate while still being harmful. Supplement LLM-based validation with deterministic rule-based checks for known sensitive patterns (e.g., PII in output when output should not contain PII; external URLs in code output when no URLs are expected).
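A deterministic rule layer of this kind might look like the sketch below. The two regular expressions are deliberately simple illustrations (email addresses as a stand-in for PII, and http/https URLs); they are assumptions for the example, not a vetted rule set, and `deterministicChecks` is a hypothetical name.

```typescript
interface RuleFinding {
  rule: string;
  match: string;
}

// Illustrative patterns only; production PII detection needs a reviewed rule set.
const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const URL_PATTERN = /https?:\/\/[^\s"'<>)]+/g;

// Runs after the LLM-based gate; any finding should block or escalate the output.
function deterministicChecks(
  output: string,
  opts: { allowPii: boolean; allowUrls: boolean },
): RuleFinding[] {
  const findings: RuleFinding[] = [];
  if (!opts.allowPii) {
    for (const m of output.match(EMAIL_PATTERN) ?? []) findings.push({ rule: "PII_EMAIL", match: m });
  }
  if (!opts.allowUrls) {
    for (const m of output.match(URL_PATTERN) ?? []) findings.push({ rule: "UNEXPECTED_URL", match: m });
  }
  return findings;
}
```

Because these checks do not depend on a model, an input crafted to sway the LLM reviewer has no effect on them.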

11. Failure Modes and Mitigations

Failure Mode	Detection	Mitigation
Validation gate is too lenient	High human correction rate post-output	Increase validation prompt stringency; add targeted rules for known failure patterns
Supervisor itself hallucinates a plan	Plan schema validation fails	Validate plan is DAG; validate all referenced worker types exist in registry
All workers of a type are busy	Assignment queue growing; SLA breach	Auto-scale trigger; alert; fallback to higher-tier worker if available
Validation adds unacceptable latency	P95 latency breaches SLA	Use faster validation model; parallelise validation with next worker dispatch
Cost overrun from retries	Budget exceeded before task completion	Cap retries at 2 per subtask; mark third failure as `NEEDS_HUMAN` and move on
Dynamic worker spawn failure	Spawn request returns error	Retry once; if still failing, use available static pool worker even if sub-optimal capability match
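The mitigation for a hallucinated plan (validate that the plan is a DAG and that every referenced worker type exists) can be sketched with Kahn's topological sort. `PlannedSubtask`, its `dependsOn` field, and `validatePlan` are hypothetical; the pattern does not define the plan schema, and subtask ids are assumed to be unique.

```typescript
interface PlannedSubtask {
  id: string;
  workerType: string;
  dependsOn: string[]; // ids of subtasks that must complete first
}

// Returns a list of problems; an empty list means the plan passed both checks.
function validatePlan(subtasks: PlannedSubtask[], knownWorkerTypes: Set<string>): string[] {
  const errors: string[] = [];
  const ids = new Set(subtasks.map(s => s.id));

  for (const s of subtasks) {
    if (!knownWorkerTypes.has(s.workerType)) errors.push(`UNKNOWN_WORKER_TYPE:${s.id}`);
    for (const d of s.dependsOn) {
      if (!ids.has(d)) errors.push(`UNKNOWN_DEPENDENCY:${s.id}->${d}`);
    }
  }

  // Kahn's algorithm: repeatedly resolve subtasks whose dependencies are all resolved.
  const unresolved = new Map<string, number>();  // id -> number of unresolved dependencies
  const dependents = new Map<string, string[]>(); // id -> subtasks waiting on it
  for (const s of subtasks) {
    const deps = s.dependsOn.filter(d => ids.has(d));
    unresolved.set(s.id, deps.length);
    for (const d of deps) dependents.set(d, [...(dependents.get(d) ?? []), s.id]);
  }
  const ready = subtasks.filter(s => unresolved.get(s.id) === 0).map(s => s.id);
  let resolved = 0;
  while (ready.length > 0) {
    const id = ready.pop()!;
    resolved++;
    for (const waiting of dependents.get(id) ?? []) {
      const left = unresolved.get(waiting)! - 1;
      unresolved.set(waiting, left);
      if (left === 0) ready.push(waiting);
    }
  }
  // Anything never resolved sits on, or behind, a dependency cycle.
  if (resolved < subtasks.length) errors.push("PLAN_HAS_CYCLE");
  return errors;
}
```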

12. Compliance and Governance

12.1 EU AI Act Article 14 — Human Oversight

The supervisor's escalation-to-human pathway is a compliance artefact demonstrating that human oversight is structurally embedded, not optional. For high-risk AI systems, every instance where a worker output was escalated to human review must be logged with: the subtask description, the worker's output, the validation failure reason, the human reviewer identity, and the human's decision. This log is the regulatory evidence of meaningful human control.

12.2 Audit Trail Requirements

Every supervisor run produces an immutable audit record containing: task ID, task description, full plan, every worker assignment with timestamps, every validation result, every retry and its reason, every human escalation, the final synthesised output, and total cost. This record must be tamper-evident (append-only store or cryptographic hash chain) for regulated use cases.
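A cryptographic hash chain of the kind mentioned above can be sketched with Node.js's built-in `crypto` module. `AuditEntry`, `appendEntry`, and `verifyChain` are hypothetical names, payloads are assumed to be pre-serialised strings, and durable storage is out of scope for the sketch.

```typescript
import { createHash } from "node:crypto";

interface AuditEntry {
  seq: number;
  event: string;    // e.g. "validation", "retry", "human_escalation"
  payload: string;  // pre-serialised event details
  prevHash: string; // hash of the previous entry, chaining the records together
  hash: string;
}

const GENESIS_HASH = "0".repeat(64);

// SHA-256 over the entry's fields and the previous hash.
function entryDigest(seq: number, event: string, payload: string, prevHash: string): string {
  return createHash("sha256").update(JSON.stringify([seq, event, payload, prevHash])).digest("hex");
}

// Returns a new log with the entry appended; the input log is not modified.
function appendEntry(log: AuditEntry[], event: string, payload: string): AuditEntry[] {
  const seq = log.length;
  const prevHash = seq === 0 ? GENESIS_HASH : log[seq - 1].hash;
  return [...log, { seq, event, payload, prevHash, hash: entryDigest(seq, event, payload, prevHash) }];
}

// Recomputes every hash; editing, removing, or reordering an entry makes this return false.
function verifyChain(log: AuditEntry[]): boolean {
  return log.every((e, i) =>
    e.seq === i &&
    e.prevHash === (i === 0 ? GENESIS_HASH : log[i - 1].hash) &&
    e.hash === entryDigest(e.seq, e.event, e.payload, e.prevHash));
}
```

Because each hash covers the previous one, altering any earlier record invalidates every record after it, which is what makes the trail tamper-evident.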

13. Testing Strategy

13.1 Unit Tests

Capability registry lookup: given a subtask with capability tag X, assert the correct worker type is selected.
Validation gate: given a known-good and known-bad worker output, assert correct ACCEPT/REJECT classification.
SLA monitor: given an assignment with a deadline 100ms in the past, assert reassignment is triggered.

13.2 Integration Tests

Full supervisor run with stub workers: one worker returns invalid schema on first call, valid on second. Assert final output is produced and retry counter is incremented.
Budget exhaustion: stub workers report high cost. Assert supervisor stops assigning new subtasks when budget ceiling is reached and returns partial result.
Human escalation path: stub worker fails validation twice. Assert escalation event is written to the human review queue and subtask is marked as escalated.

13.3 Load Tests

Simulate 50 concurrent supervisor runs sharing a static worker pool of 10. Assert p95 latency remains within SLA and no worker is assigned more than maxConcurrentTasks subtasks simultaneously.

13.4 End-to-End Tests (Playwright)

For each supported high-stakes task type, run a live end-to-end test. Assert: validation gate fires for each worker output; at least one subtask is retried (using a stub that initially returns a structurally incomplete output); human escalation queue receives escalation events; final output schema is valid.

14. Variants and Extensions

14.1 Peer Supervisor Hierarchy

For very large tasks, multiple supervisors can operate in parallel, each managing a sub-pool of workers, with a meta-supervisor coordinating between them. Maximum hierarchy depth: 2 tiers of supervisors to avoid management overhead exceeding task value.

14.2 Specialised Supervisor per Domain

Rather than a general-purpose supervisor, deploy domain-specific supervisors (legal supervisor, financial supervisor, code review supervisor) each with a registry tuned to their domain. A routing layer directs incoming tasks to the correct domain supervisor.

14.3 Continuous Supervision (Streaming)

For real-time workflows, the supervisor monitors streaming worker outputs and interrupts a worker mid-generation if the partial output shows signs of heading off-track (e.g., generating content outside scope). Requires streaming support in the LLM provider API.

15. Trade-off Analysis

Dimension	Supervisor Agent	Basic Orchestration	No Supervision
Output quality	Highest (validation gate)	Moderate	Lowest
Latency	Highest (validation overhead)	Moderate	Lowest
Cost	Highest (validation LLM calls)	Moderate	Lowest
Error recovery	Structured (reassign/escalate)	Basic (retry)	None
Compliance suitability	Highest	Moderate	Not suitable for regulated use

Use supervisor pattern when: task errors have regulatory, financial, or safety consequences; intermediate results feed further reasoning steps; you need an audit trail of quality decisions.

Use basic orchestration when: tasks are lower-stakes; end-to-end validation of the final output is sufficient; latency budget is tight.

16. Known Implementations

Organisation Type	Use Case	Worker Pool Size	Reported Outcome
Global insurance carrier	Policy document risk analysis	6 specialist workers	Validation gate catches 23% of worker outputs; final error rate < 0.5%
Pharmaceutical company	Regulatory submission drafting	8 specialist workers	Human escalation rate 3.2%; 0 regulatory rejections in 12 months
Tier-1 investment bank	Credit memo automation	4 specialist workers	65% reduction in analyst prep time; SR 11-7 audit passed
Healthcare network	Prior authorisation review	5 specialist workers	18% improvement in approval accuracy vs. unvalidated pipeline

17. Related Patterns

Pattern ID	Name	Relationship
EAAPL-MAG001	Multi-Agent Orchestration	Foundation pattern; supervisor adds active oversight and validation
EAAPL-MAG003	Human-in-the-Loop Agent	Used as the escalation endpoint for HUMAN_REVIEW outcomes
EAAPL-MAG006	Agent Handoff Protocol	Defines message schema for supervisor-to-worker and worker-to-supervisor communication
EAAPL-INT007	AI Circuit Breaker	Applied per worker type to handle worker health failures
EAAPL-MAG005	Debate Agent	Supervisor can invoke debate between two workers on a contested subtask

18. References

Gartner, "AI Agent Topologies: From Orchestration to Supervision," 2025 (ID: G00817884)
Anthropic, "Building Effective Agents," 2025 — anthropic.com/research/building-effective-agents
Microsoft AutoGen: Agent Supervision Patterns — microsoft.github.io/autogen/docs/topics/supervision
LangGraph: Supervisor Multi-Agent Architecture — langchain-ai.github.io/langgraph/tutorials/multi_agent/agent_supervisor
EU AI Act (Regulation 2024/1689), Article 14: Human Oversight of High-Risk AI Systems
NIST AI RMF 1.0, Map 5.2: Operator Monitoring and Human Review
SR 11-7: Guidance on Model Risk Management — federalreserve.gov/supervisionreg/srletters/sr1107.htm
Liang et al., "Encouraging Divergent Thinking in Large Language Models through Debate," 2023 — arxiv.org/abs/2305.19118
Chase, H., "LangChain Expression Language and Multi-Agent Patterns," 2024
AWS, "Building Reliable AI Agents with Step Functions and Bedrock," 2025 — aws.amazon.com/blogs/machine-learning
