EAAPL: Enterprise AI Architecture Pattern Library

EAAPL-MAG002 — Supervisor Agent

Status: Proven
Tags: agent orchestration human-oversight high-complexity
Version: 2.0.0
Last Updated: 2026-06-12


1. Pattern Identity

| Field | Value |
|---|---|
| Pattern ID | EAAPL-MAG002 |
| Name | Supervisor Agent |
| Category | Multi-Agent |
| Maturity | Proven |
| Complexity | High |
| Related Patterns | EAAPL-MAG001 · EAAPL-MAG003 · EAAPL-MAG006 · EAAPL-INT007 |

2. Executive Summary

The Supervisor Agent pattern establishes a hierarchical two-tier architecture in which a supervisor agent manages a pool of worker agents, each responsible for executing a bounded, well-defined subtask. Unlike the general orchestration pattern (EAAPL-MAG001), the supervisor is responsible not only for decomposition and dispatch but for continuous active oversight: monitoring worker progress against SLAs, validating each worker's output before incorporating it into the next reasoning step, recovering from worker failures through reassignment or escalation, and controlling the total cost of the worker pool. The critical distinguishing characteristic is the supervisor's validation gate: every worker output passes through a quality review before being used. This prevents hallucination compounding — the phenomenon where an incorrect intermediate output, when passed unchecked to the next worker, cascades into a deeply wrong final result. Enterprises deploy this pattern in high-stakes workflows (legal, financial, clinical, security) where intermediate errors carry regulatory or financial consequences.


3. Problem Statement

3.1 Context

In complex AI workflows, intermediate agent outputs are frequently passed forward without review. This is acceptable for low-stakes tasks where end-to-end correctness is verifiable cheaply. For high-stakes domains, a single fabricated clause in a contract analysis, a single incorrect risk score in a credit workflow, or a single misidentified vulnerability in a security audit can invalidate the entire downstream result — and may not be detected until human review occurs hours or days later.

3.2 Forces in Tension

  • Throughput vs. correctness. Validating every worker output adds latency and cost. Skipping validation allows errors to compound.
  • Autonomy vs. control. Workers that act independently are efficient but unsupervised. Workers that constantly check back with the supervisor are safe but slow.
  • Static pool vs. dynamic spawn. A static worker pool has predictable cost and cold-start time but cannot scale to burst demand. Dynamic spawning scales elastically but introduces provisioning latency.
  • Cost containment vs. completeness. Running more workers in parallel completes the task faster but multiplies token spend.

3.3 Failure Modes Without This Pattern

Without a validation gate, hallucinated worker outputs become ground truth for subsequent reasoning steps, producing confidently wrong final outputs. Without SLA monitoring, a single slow worker holds up synthesis indefinitely. Without a capability registry, task assignments default to the cheapest or most-available worker rather than the most-appropriate one.


4. Solution

4.1 Supervisor-Worker Architecture

ARCHITECTURE DIAGRAM
flowchart TD
    subgraph Control["Supervisor Control"]
        A[Task Assignment]
        B[Supervisor Agent]
        C[Capability Registry]
    end
    subgraph Workers["Worker Pool"]
        D[Worker 1]
        E[Worker 2]
        F[Worker 3]
    end
    subgraph Validation["Result Handling"]
        G{Validation Gate}
        H[Result Synthesis]
        I[Human Review Queue]
    end
    A --> B --> C --> B
    B --> D --> G
    B --> E --> G
    B --> F --> G
    G -->|pass| H --> J[Final Output]
    G -->|fail| I
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#f0fdf4,stroke:#22c55e
    style C fill:#fef9c3,stroke:#eab308
    style D fill:#f0fdf4,stroke:#22c55e
    style E fill:#f0fdf4,stroke:#22c55e
    style F fill:#f0fdf4,stroke:#22c55e
    style G fill:#f3e8ff,stroke:#a855f7
    style H fill:#f0fdf4,stroke:#22c55e
    style I fill:#fee2e2,stroke:#ef4444
    style J fill:#d1fae5,stroke:#10b981

4.2 Worker Selection Flow

ARCHITECTURE DIAGRAM
flowchart TD
    subgraph Selection["Worker Selection"]
        A[Subtask Arrives]
        B[Capability Registry]
        C{Matching Workers}
    end
    subgraph Allocation["Worker Allocation"]
        D{Static Pool Free}
        E[Assign Static Worker]
        F{Budget Allows Spawn}
        G[Dynamic Spawn Worker]
    end
    A --> B --> C
    C -->|yes| D
    C -->|no| H[No Worker Error]
    D -->|yes| E --> I[Worker Executing]
    D -->|no| F
    F -->|yes| G --> I
    F -->|no| J[Budget Error]
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#fef9c3,stroke:#eab308
    style C fill:#f3e8ff,stroke:#a855f7
    style D fill:#f3e8ff,stroke:#a855f7
    style E fill:#f0fdf4,stroke:#22c55e
    style F fill:#f3e8ff,stroke:#a855f7
    style G fill:#f0fdf4,stroke:#22c55e
    style H fill:#fee2e2,stroke:#ef4444
    style I fill:#d1fae5,stroke:#10b981
    style J fill:#fee2e2,stroke:#ef4444
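
The allocation branch of the flow above can be read as a small decision function. The sketch below is illustrative only: `PoolWorker`, `Allocation`, `allocateWorker`, and the `activeTasks` field are hypothetical names, and for brevity it treats registry entries and static pool members as a single list.

```typescript
// Hypothetical shapes; only the order of the decisions comes from the flowchart.
interface PoolWorker {
  workerId: string;
  capabilities: string[];
  activeTasks: number; // assumed live load counter
  maxConcurrentTasks: number;
}

type Allocation =
  | { kind: "ASSIGN_STATIC"; workerId: string }
  | { kind: "SPAWN_DYNAMIC"; capability: string }
  | { kind: "NO_WORKER_ERROR" }
  | { kind: "BUDGET_ERROR" };

function allocateWorker(
  capability: string,
  staticPool: PoolWorker[],
  spawnCostUSD: number,
  budgetRemainingUSD: number
): Allocation {
  // 1. Matching workers? If no worker type offers the capability, fail fast.
  const matching = staticPool.filter((w) => w.capabilities.includes(capability));
  if (matching.length === 0) return { kind: "NO_WORKER_ERROR" };

  // 2. Static pool free? Prefer a static worker with spare capacity.
  const free = matching.find((w) => w.activeTasks < w.maxConcurrentTasks);
  if (free) return { kind: "ASSIGN_STATIC", workerId: free.workerId };

  // 3. Budget allows spawn? Otherwise surface a budget error.
  if (spawnCostUSD <= budgetRemainingUSD) return { kind: "SPAWN_DYNAMIC", capability };
  return { kind: "BUDGET_ERROR" };
}
```

Returning a tagged union rather than throwing keeps both error branches visible to the supervisor, which can then flag them in the final output.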

5. Structure

5.1 Component Catalogue

| Component | Responsibility | Technology Options |
|---|---|---|
| Supervisor Agent | Plan, assign, monitor, validate, synthesise | LLM with supervisor prompt, LangGraph StateGraph |
| Capability Registry | Worker discovery, capability matching, cost/latency metadata | In-memory map, Redis, service registry |
| Worker Pool | Bounded-capability task execution | LLM + tools, specialised models, RPA bots |
| Validation Gate | Quality review of each worker output before use | LLM-as-judge, JSON schema validator, business rule engine |
| SLA Monitor | Track per-worker deadlines, trigger reassignment on breach | Cron + deadline field in task message |
| Result Synthesiser | Combine validated outputs into final response | LLM synthesis prompt |
| Cost Controller | Track and enforce token budget across the worker pool | Middleware that intercepts LLM calls |

5.2 Capability Registry Schema

{
  "workerId": "legal-clause-extractor-v2",
  "capabilities": ["contract-analysis", "clause-extraction", "risk-identification"],
  "acceptedInputSchema": "schemas/contract-input.json",
  "outputSchema": "schemas/clause-extraction-output.json",
  "model": "claude-3-7-sonnet",
  "averageLatencyMs": 4200,
  "costPerCallEstimateUSD": 0.024,
  "maxConcurrentTasks": 5,
  "healthEndpoint": "https://workers/legal-clause-extractor/health"
}
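
As a minimal sketch, the registry entry above could be typed and indexed by capability tag as follows. The field names mirror the JSON schema; the `CapabilityRegistry` class and its `find` method are hypothetical and assume the in-memory map option from the component catalogue.

```typescript
// Mirrors the JSON schema above field for field.
interface CapabilityRegistryEntry {
  workerId: string;
  capabilities: string[];
  acceptedInputSchema: string;
  outputSchema: string;
  model: string;
  averageLatencyMs: number;
  costPerCallEstimateUSD: number;
  maxConcurrentTasks: number;
  healthEndpoint: string;
}

// Hypothetical in-memory registry: builds a capability -> entries index once at startup.
class CapabilityRegistry {
  private byCapability = new Map<string, CapabilityRegistryEntry[]>();

  constructor(entries: CapabilityRegistryEntry[]) {
    for (const entry of entries) {
      for (const capability of entry.capabilities) {
        const list = this.byCapability.get(capability) ?? [];
        list.push(entry);
        this.byCapability.set(capability, list);
      }
    }
  }

  // Returns every worker type advertising the capability, or an empty array.
  find(capability: string): CapabilityRegistryEntry[] {
    return this.byCapability.get(capability) ?? [];
  }
}
```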

6. Behaviour

6.1 Supervisor Responsibilities

Receive and decompose task. The supervisor receives a task specification and breaks it into subtasks using its planning prompt. Each subtask maps to one or more capability tags in the registry.

Select and assign workers. For each subtask, the supervisor queries the capability registry for workers whose capability tags match. Selection criteria: capability match (required), current load (prefer idle), cost estimate (prefer cheaper if quality is equivalent), SLA compatibility (reject if worker's average latency would breach the subtask deadline).
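
One way to express these selection criteria is a filter followed by a sort. This is a sketch under stated assumptions: `Candidate`, `selectWorker`, and the `activeTasks` load counter are hypothetical, and all candidates are assumed to be of equivalent quality so that cost can serve as the tie-breaker.

```typescript
interface Candidate {
  workerId: string;
  capabilities: string[];
  activeTasks: number; // assumed live load counter
  averageLatencyMs: number;
  costPerCallEstimateUSD: number;
}

function selectWorker(
  candidates: Candidate[],
  capability: string,
  deadlineBudgetMs: number
): Candidate | undefined {
  return candidates
    // Required: the worker must advertise the capability.
    .filter((c) => c.capabilities.includes(capability))
    // SLA compatibility: reject workers whose average latency would breach the deadline.
    .filter((c) => c.averageLatencyMs <= deadlineBudgetMs)
    // Prefer the least-loaded worker, then the cheaper one.
    .sort(
      (a, b) =>
        a.activeTasks - b.activeTasks ||
        a.costPerCallEstimateUSD - b.costPerCallEstimateUSD
    )[0];
}
```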

Monitor progress. The supervisor maintains a live state map: { subtaskId → { workerId, assignedAt, deadlineMs, status } }. A background monitor checks for deadline breaches every N seconds. On breach, the supervisor either reassigns (if another capable worker is available) or marks the subtask as timed-out.
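
The state map and breach check might look like the sketch below. `findBreaches` and `applyBreach` are hypothetical helper names, and the current time is passed in explicitly so the logic can be tested without real timers.

```typescript
type AssignmentStatus = "running" | "completed" | "failed" | "timeout";

interface Assignment {
  workerId: string;
  assignedAt: number;
  deadlineMs: number; // absolute deadline, epoch milliseconds
  status: AssignmentStatus;
}

// Returns the ids of running subtasks whose deadline has passed.
function findBreaches(state: Map<string, Assignment>, now: number): string[] {
  const breached: string[] = [];
  state.forEach((assignment, subtaskId) => {
    if (assignment.status === "running" && now > assignment.deadlineMs) {
      breached.push(subtaskId);
    }
  });
  return breached;
}

// Reassigns when a replacement worker exists; otherwise marks the subtask timed out.
function applyBreach(
  state: Map<string, Assignment>,
  subtaskId: string,
  replacementWorkerId: string | undefined,
  now: number,
  newDeadlineMs: number
): void {
  const current = state.get(subtaskId);
  if (!current) return;
  if (replacementWorkerId) {
    state.set(subtaskId, {
      workerId: replacementWorkerId,
      assignedAt: now,
      deadlineMs: newDeadlineMs,
      status: "running"
    });
  } else {
    current.status = "timeout";
  }
}
```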

Validate worker outputs. This is the critical differentiator. For each worker output, the supervisor invokes a validation check before incorporating the result. Validation is multi-layered: schema validation (structured output format); factual consistency check (does the output contradict known facts from the input?); completeness check (did the worker address all required aspects of the subtask?); confidence scoring (if the worker returns a confidence score below threshold, flag for additional review).

Recover from failures. See Section 6.3.

Synthesise final result. Once all validated subtask results are collected (or gracefully degraded), the supervisor invokes a synthesis step that combines them into a coherent final output, explicitly noting any subtask gaps.

6.2 Quality Review — Preventing Hallucination Compounding

The validation gate is not optional. The most common failure mode in multi-agent systems is a hallucinated intermediate output being treated as ground truth by the next agent in the chain. The supervisor's validation gate breaks this chain.

Validation prompt structure:

You are a quality reviewer. Given the following subtask specification and worker output,
determine:
1. SCHEMA: Does the output conform to the expected JSON schema? [PASS/FAIL]
2. COMPLETENESS: Did the worker address all required aspects of the subtask? [PASS/FAIL + details]
3. CONSISTENCY: Does the output contradict any facts in the original input? [PASS/FAIL + details]
4. CONFIDENCE: Rate your confidence in the output quality from 1-10.

Return JSON: { "schemaPass": bool, "completenessPass": bool, "consistencyPass": bool,
               "confidence": int, "issues": [], "recommendation": "ACCEPT|REJECT|HUMAN_REVIEW" }

A REJECT causes a retry with a correction prompt appended to the worker's context. A HUMAN_REVIEW recommendation escalates to the human review queue. An ACCEPT with confidence below 7 is flagged in the final output as low-confidence.
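
The routing rules in the paragraph above can be captured in one pure function. In this sketch `ValidationResult` follows the JSON shape of the validation prompt, while `GateAction`, `decide`, and the wording of the correction prompt are hypothetical.

```typescript
interface ValidationResult {
  schemaPass: boolean;
  completenessPass: boolean;
  consistencyPass: boolean;
  confidence: number; // 1-10, as requested in the validation prompt
  issues: string[];
  recommendation: "ACCEPT" | "REJECT" | "HUMAN_REVIEW";
}

type GateAction =
  | { action: "USE"; lowConfidence: boolean }
  | { action: "RETRY"; correctionPrompt: string }
  | { action: "ESCALATE" };

const LOW_CONFIDENCE_THRESHOLD = 7;

function decide(validation: ValidationResult): GateAction {
  switch (validation.recommendation) {
    case "REJECT":
      // Retry with the specific validation failures appended to the worker's context.
      return {
        action: "RETRY",
        correctionPrompt:
          "Your previous output was rejected. Fix these issues:\n- " +
          validation.issues.join("\n- ")
      };
    case "HUMAN_REVIEW":
      return { action: "ESCALATE" };
    case "ACCEPT":
      // Accepted outputs below the threshold are used but flagged as low-confidence.
      return {
        action: "USE",
        lowConfidence: validation.confidence < LOW_CONFIDENCE_THRESHOLD
      };
  }
}
```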

6.3 Worker Failure Recovery

| Failure Type | Recovery Action |
|---|---|
| Worker timeout (deadline breach) | Reassign to next-best capable worker. If none available, mark subtask as degraded. |
| Worker returns invalid schema | Retry once with schema correction in prompt. On second failure, reject and attempt reassignment. |
| Worker quality validation fails | Retry with correction prompt citing specific validation failures. On second failure, escalate to HUMAN_REVIEW. |
| Worker crashes / goes unhealthy | Circuit breaker (EAAPL-INT007) opens. Reassign to healthy worker. Alert on-call. |
| No capable worker available | Return NO_CAPABLE_WORKER error for that subtask. Supervisor flags in final output. |
| Budget exceeded before completion | Complete highest-priority subtasks. Mark remaining as BUDGET_EXCEEDED. Return partial with explicit warning. |
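
The recovery table can be encoded as a lookup so that the policy lives in one place. This is a rough sketch: the `Failure` and `Recovery` labels and the `recover` function are hypothetical names, `attempt` counts failures of the same subtask starting at 1, and side effects such as opening the circuit breaker or alerting on-call are left out.

```typescript
type Failure =
  | "TIMEOUT"
  | "INVALID_SCHEMA"
  | "VALIDATION_FAILED"
  | "UNHEALTHY"
  | "NO_CAPABLE_WORKER"
  | "BUDGET_EXCEEDED";

type Recovery =
  | "REASSIGN"
  | "MARK_DEGRADED"
  | "RETRY_WITH_CORRECTION"
  | "ESCALATE_HUMAN"
  | "FLAG_IN_OUTPUT"
  | "RETURN_PARTIAL";

function recover(
  failure: Failure,
  attempt: number,
  alternativeAvailable: boolean
): Recovery {
  switch (failure) {
    case "TIMEOUT":
      return alternativeAvailable ? "REASSIGN" : "MARK_DEGRADED";
    case "INVALID_SCHEMA":
      // Retry once with a schema correction, then hand over to another worker.
      return attempt === 1 ? "RETRY_WITH_CORRECTION" : "REASSIGN";
    case "VALIDATION_FAILED":
      // Retry once citing the validation failures, then escalate to a human.
      return attempt === 1 ? "RETRY_WITH_CORRECTION" : "ESCALATE_HUMAN";
    case "UNHEALTHY":
      return "REASSIGN";
    case "NO_CAPABLE_WORKER":
      return "FLAG_IN_OUTPUT";
    case "BUDGET_EXCEEDED":
      return "RETURN_PARTIAL";
  }
}
```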

6.4 Dynamic Pool Management

Auto-scale. When all static pool workers for a capability type are at capacity and a new subtask arrives, the supervisor requests a dynamic worker spawn if the remaining task budget allows. Dynamic workers are terminated after completing their assigned subtask (or after an idle timeout of 60s) to control cost.

Scale-in. The supervisor tracks utilisation across the pool. Workers with zero tasks for > 5 minutes (configurable) are released to reduce standing cost.

Cost ceiling enforcement. The supervisor tracks cumulative token spend across all workers. When cumulative spend reaches 80% of the task budget, the supervisor switches to single-model-tier workers for remaining subtasks.
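
A minimal sketch of the cost ceiling follows, assuming spend is tracked in USD and reported by some upstream middleware. `CostController`, its method names, and the three mode labels are hypothetical; only the 80% threshold comes from the text above.

```typescript
type BudgetMode = "NORMAL" | "RESTRICTED" | "EXHAUSTED";

class CostController {
  private spentUSD = 0;
  private readonly budgetUSD: number;
  private readonly softLimitRatio: number;

  constructor(budgetUSD: number, softLimitRatio = 0.8) {
    this.budgetUSD = budgetUSD;
    this.softLimitRatio = softLimitRatio;
  }

  // Called after every worker or validation call with its measured cost.
  record(costUSD: number): void {
    this.spentUSD += costUSD;
  }

  get remainingUSD(): number {
    return Math.max(0, this.budgetUSD - this.spentUSD);
  }

  // RESTRICTED is the point at which the supervisor changes its worker tier policy.
  mode(): BudgetMode {
    if (this.spentUSD >= this.budgetUSD) return "EXHAUSTED";
    if (this.spentUSD >= this.budgetUSD * this.softLimitRatio) return "RESTRICTED";
    return "NORMAL";
  }
}
```

The supervisor would consult `mode()` before each assignment and apply its tier restriction once the controller reports `RESTRICTED`.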


7. Implementation Guide

7.1 Step-by-Step

Step 1 — Build the Capability Registry. Start with a static JSON file or database table. Each entry describes a worker type, its capabilities, schemas, and cost/latency profile. The supervisor reads this at startup and caches it.

Step 2 — Implement the Supervisor Planning Prompt. The supervisor system prompt must include the capability registry as context, instructing it to return a structured plan mapping each subtask to a specific worker type. Include the instruction: "Always prefer lower-cost workers when quality requirements are equivalent."

Step 3 — Implement the Validation Gate. Build the validation prompt as a separate LLM call after each worker response. Use a cheaper, faster model for validation (e.g., Claude Haiku, GPT-4o-mini) to limit the cost overhead of the validation step to approximately 15% of the worker call cost.

Step 4 — Implement the SLA Monitor. Use a lightweight polling loop or event-driven timer that checks all in-flight subtask deadlines every 5 seconds. On deadline breach, the monitor calls supervisor.reassign(subtaskId).

Step 5 — Implement Result Synthesis. The synthesis step should be a single LLM call with all validated subtask results and an explicit instruction to note any missing or low-confidence domains.

7.2 Code Skeleton (TypeScript)

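// Supporting types. Worker and ValidationResult follow Sections 5.2 and 6.2;
// the SubTask, Plan and SupervisorResult shapes are illustrative.
interface SubTask { id: string; requiredCapability: string; }
interface Plan { subtasks: SubTask[]; }
interface Worker { id: string; averageLatencyMs: number; costPerCallEstimateUSD: number; }
interface ValidationResult {
  confidence: number;
  issues: string[];
  recommendation: "ACCEPT" | "REJECT" | "HUMAN_REVIEW";
}
interface SupervisorResult { output: unknown; warnings: string[]; }

// Validation model client, assumed to be provided elsewhere.
declare const validationLLM: {
  invoke(input: { subtask: SubTask; output: unknown }): Promise<ValidationResult>;
};

interface WorkerAssignment {
  subtaskId: string;
  workerId: string;
  assignedAt: number;
  deadlineMs: number; // absolute deadline, epoch milliseconds
  status: "running" | "completed" | "failed" | "timeout" | "escalated";
}

abstract class SupervisorAgent {
  private assignments: Map<string, WorkerAssignment> = new Map();
  protected validated: Map<string, unknown> = new Map(); // only validated outputs reach synthesis
  protected warnings: string[] = [];
  private costSpent = 0;

  // Integration points left to the implementer.
  protected abstract plan(task: string): Promise<Plan>;
  protected abstract selectWorkers(subtasks: SubTask[]): Promise<Map<string, Worker>>;
  protected abstract executeWithTimeout(worker: Worker, subtask: SubTask, deadlineMs: number): Promise<unknown>;
  protected abstract retryWithCorrection(worker: Worker, subtask: SubTask, previous: unknown, issues: string[]): Promise<unknown>;
  protected abstract escalateToHuman(subtask: SubTask, output: unknown, validation: ValidationResult): Promise<void>;
  protected abstract synthesise(): Promise<SupervisorResult>;
  protected abstract gracefulDegradation(reason: string): Promise<SupervisorResult>;

  async supervise(task: string, budgetUSD: number): Promise<SupervisorResult> {
    const plan = await this.plan(task);
    const workerPool = await this.selectWorkers(plan.subtasks);

    for (const subtask of plan.subtasks) {
      const worker = workerPool.get(subtask.requiredCapability);
      if (!worker) {
        // Section 6.3: flag the gap in the final output rather than aborting the run.
        this.warnings.push(`NO_CAPABLE_WORKER:${subtask.id}`);
        continue;
      }
      // Check the budget before dispatching rather than after spending.
      if (this.costSpent + worker.costPerCallEstimateUSD > budgetUSD) {
        return this.gracefulDegradation("BUDGET_EXCEEDED");
      }

      const now = Date.now();
      const assignment: WorkerAssignment = {
        subtaskId: subtask.id,
        workerId: worker.id,
        assignedAt: now,
        deadlineMs: now + worker.averageLatencyMs * 2,
        status: "running"
      };
      this.assignments.set(subtask.id, assignment);

      let output = await this.executeWithTimeout(worker, subtask, assignment.deadlineMs);
      this.costSpent += worker.costPerCallEstimateUSD;
      let validation = await this.validate(subtask, output);

      if (validation.recommendation === "REJECT") {
        // One retry with the validator's issues appended; the retry is billed too.
        output = await this.retryWithCorrection(worker, subtask, output, validation.issues);
        this.costSpent += worker.costPerCallEstimateUSD;
        validation = await this.validate(subtask, output);
      }
      if (validation.recommendation !== "ACCEPT") {
        // Failed retry or explicit HUMAN_REVIEW: escalate and move on.
        await this.escalateToHuman(subtask, output, validation);
        assignment.status = "escalated";
        continue;
      }
      assignment.status = "completed";
      this.validated.set(subtask.id, output); // store the output that actually passed
    }
    return this.synthesise();
  }

  private async validate(subtask: SubTask, output: unknown): Promise<ValidationResult> {
    return validationLLM.invoke({ subtask, output });
  }
}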

8. Observability

8.1 Supervisor-Level Metrics

| Metric | Description | Alert Threshold |
|---|---|---|
| Validation pass rate | % of worker outputs that pass validation first attempt | < 85% |
| Worker reassignment rate | % of subtasks requiring reassignment | > 10% |
| Supervisor plan latency | Time to produce worker assignment plan | > 5s |
| Human escalation rate | % of subtasks escalated to human review | > 5% |
| Pool utilisation | % of static pool workers busy at any time | > 90% sustained |
| Cost efficiency ratio | Total task cost / estimated cost at plan time | > 1.5× (overspend) |

8.2 Per-Worker Metrics

Each worker emits spans with: workerId, subtaskId, taskId, input/output token counts, model version, latency, and validation result. These spans roll up to the supervisor trace.


9. Cost Governance

  • Validation cost overhead. Use a cheaper model tier for the validation gate. Target < 15% overhead on the worker call cost.
  • Dynamic worker budget gate. Before spawning a dynamic worker, check that estimatedWorkerCost < costBudgetRemaining × 0.5. Never spend more than 50% of remaining budget on a single worker spawn.
  • Worker model tiering. Maintain a tiered cost map: Tier 1 (critical subtasks — frontier model), Tier 2 (standard subtasks — mid-tier model), Tier 3 (simple tasks — efficient model). The supervisor assigns tiers based on subtask criticality flags in the plan.
  • Synthesis model. The final synthesis step should always use at minimum a Tier 2 model to ensure coherent final output quality regardless of which worker tiers were used.
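
The spawn gate and the tier rules in this list reduce to a few small functions. The sketch below is illustrative: the function names and the `Criticality` labels are hypothetical, and tiers are numbered as in the list, with 1 the most capable.

```typescript
type Tier = 1 | 2 | 3;
type Criticality = "critical" | "standard" | "simple";

// Tier map as described above: 1 = frontier, 2 = mid-tier, 3 = efficient.
const TIER_BY_CRITICALITY: Record<Criticality, Tier> = {
  critical: 1,
  standard: 2,
  simple: 3
};

// Dynamic worker budget gate: one spawn may use less than 50% of the remaining budget.
function canSpawnDynamicWorker(
  estimatedWorkerCostUSD: number,
  costBudgetRemainingUSD: number
): boolean {
  return estimatedWorkerCostUSD < costBudgetRemainingUSD * 0.5;
}

// Synthesis uses at least a Tier 2 model, so a Tier 3 preference is raised to Tier 2.
function synthesisTier(preferred: Tier): Tier {
  return Math.min(preferred, 2) as Tier;
}
```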

10. Security Considerations

10.1 Worker Isolation

Workers must not share memory state or credentials with each other. Each worker invocation is stateless. Shared data passes only through the supervisor's validated result store, never through direct worker-to-worker communication.

10.2 Supervisor Prompt Integrity

The supervisor prompt contains the capability registry and task context. If an attacker can inject into the task input, they could attempt to hijack worker selection. Mitigations: sanitise all task inputs before passing to the supervisor; use a structured input schema with strict type validation; never allow free-form user input to modify the supervisor's system prompt.

10.3 Validation Gate Bypass Attempts

A sophisticated adversarial input may be crafted to produce output that passes the validation gate while still being harmful. Supplement LLM-based validation with deterministic rule-based checks for known sensitive patterns (e.g., PII in output when output should not contain PII; external URLs in code output when no URLs are expected).
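
A sketch of such deterministic checks is shown below. The regular expressions are deliberately simple assumptions (email addresses stand in for PII) and are not a complete PII or URL detector; `deterministicChecks`, `RuleViolation`, and the rule names are hypothetical.

```typescript
interface RuleViolation {
  rule: string;
  match: string;
}

interface CheckOptions {
  allowPII: boolean;
  allowUrls: boolean;
}

// Simplified patterns for illustration only.
const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const URL_PATTERN = /https?:\/\/[^\s"'<>)]+/g;

// Runs alongside the LLM judge; any violation should block an ACCEPT.
function deterministicChecks(output: string, options: CheckOptions): RuleViolation[] {
  const violations: RuleViolation[] = [];
  if (!options.allowPII) {
    for (const match of output.match(EMAIL_PATTERN) ?? []) {
      violations.push({ rule: "PII_EMAIL", match });
    }
  }
  if (!options.allowUrls) {
    for (const match of output.match(URL_PATTERN) ?? []) {
      violations.push({ rule: "UNEXPECTED_URL", match });
    }
  }
  return violations;
}
```

Because these rules are deterministic, an input crafted to persuade the LLM judge has no effect on them.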


11. Failure Modes and Mitigations

| Failure Mode | Detection | Mitigation |
|---|---|---|
| Validation gate is too lenient | High human correction rate post-output | Increase validation prompt stringency; add targeted rules for known failure patterns |
| Supervisor itself hallucinates a plan | Plan schema validation fails | Validate plan is DAG; validate all referenced worker types exist in registry |
| All workers of a type are busy | Assignment queue growing; SLA breach | Auto-scale trigger; alert; fallback to higher-tier worker if available |
| Validation adds unacceptable latency | P95 latency breaches SLA | Use faster validation model; parallelise validation with next worker dispatch |
| Cost overrun from retries | Budget exceeded before task completion | Cap retries at 2 per subtask; mark third failure as NEEDS_HUMAN and move on |
| Dynamic worker spawn failure | Spawn request returns error | Retry once; if still failing, use available static pool worker even if sub-optimal capability match |
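
For the hallucinated-plan row, the two plan checks (all worker types exist, plan is a DAG) could be sketched as follows. The `PlannedSubtask` shape with a `dependsOn` list is an assumption about how the plan encodes ordering, subtask ids are assumed unique, and cycle detection uses Kahn's algorithm.

```typescript
interface PlannedSubtask {
  id: string;
  workerType: string;
  dependsOn: string[]; // ids of subtasks that must finish first (assumed plan field)
}

// Returns a list of problems; an empty list means the plan is structurally valid.
function validatePlan(
  subtasks: PlannedSubtask[],
  knownWorkerTypes: Set<string>
): string[] {
  const errors: string[] = [];
  const ids = new Set(subtasks.map((s) => s.id));

  for (const s of subtasks) {
    if (!knownWorkerTypes.has(s.workerType)) {
      errors.push(`UNKNOWN_WORKER_TYPE:${s.id}:${s.workerType}`);
    }
    for (const dep of s.dependsOn) {
      if (!ids.has(dep)) errors.push(`UNKNOWN_DEPENDENCY:${s.id}:${dep}`);
    }
  }

  // Kahn's algorithm: repeatedly remove subtasks whose dependencies are all resolved.
  const indegree = new Map<string, number>();
  subtasks.forEach((s) =>
    indegree.set(s.id, new Set(s.dependsOn.filter((d) => ids.has(d))).size)
  );
  const ready = subtasks.filter((s) => indegree.get(s.id) === 0).map((s) => s.id);
  let visited = 0;
  while (ready.length > 0) {
    const done = ready.pop() as string;
    visited++;
    for (const s of subtasks) {
      if (s.dependsOn.includes(done)) {
        const remaining = (indegree.get(s.id) ?? 0) - 1;
        indegree.set(s.id, remaining);
        if (remaining === 0) ready.push(s.id);
      }
    }
  }
  // Any subtask never reached sits on, or behind, a dependency cycle.
  if (visited < subtasks.length) errors.push("CYCLE_DETECTED");
  return errors;
}
```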

12. Compliance and Governance

12.1 EU AI Act Article 14 — Human Oversight

The supervisor's escalation-to-human pathway is a compliance artefact demonstrating that human oversight is structurally embedded, not optional. For high-risk AI systems, every instance where a worker output was escalated to human review must be logged with: the subtask description, the worker's output, the validation failure reason, the human reviewer identity, and the human's decision. This log is the regulatory evidence of meaningful human control.

12.2 Audit Trail Requirements

Every supervisor run produces an immutable audit record containing: task ID, task description, full plan, every worker assignment with timestamps, every validation result, every retry and its reason, every human escalation, the final synthesised output, and total cost. This record must be tamper-evident (append-only store or cryptographic hash chain) for regulated use cases.
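
One way to make the record tamper-evident is a hash chain, where each entry's hash covers the previous entry's hash. The sketch below assumes a Node.js runtime (it uses `createHash` from `node:crypto`); `AuditEntry`, `appendEntry`, and `verifyChain` are hypothetical names, and a production store would also need durable, append-only persistence.

```typescript
import { createHash } from "node:crypto";

interface AuditEntry {
  index: number;
  event: string;
  payload: string; // JSON-serialised event details
  prevHash: string;
  hash: string;
}

const GENESIS_HASH = "0".repeat(64);

// SHA-256 over the entry contents plus the previous entry's hash.
function hashEntry(index: number, event: string, payload: string, prevHash: string): string {
  return createHash("sha256")
    .update(JSON.stringify([index, event, payload, prevHash]))
    .digest("hex");
}

// Returns a new log with the entry appended; existing entries are never modified.
function appendEntry(log: AuditEntry[], event: string, payload: unknown): AuditEntry[] {
  const prevHash = log.length > 0 ? log[log.length - 1].hash : GENESIS_HASH;
  const index = log.length;
  const body = JSON.stringify(payload);
  const hash = hashEntry(index, event, body, prevHash);
  return [...log, { index, event, payload: body, prevHash, hash }];
}

// Recomputes every hash; any edited, reordered, or removed entry breaks the chain.
function verifyChain(log: AuditEntry[]): boolean {
  return log.every((entry, i) => {
    const expectedPrev = i === 0 ? GENESIS_HASH : log[i - 1].hash;
    return (
      entry.index === i &&
      entry.prevHash === expectedPrev &&
      entry.hash === hashEntry(entry.index, entry.event, entry.payload, entry.prevHash)
    );
  });
}
```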


13. Testing Strategy

13.1 Unit Tests

  • Capability registry lookup: given a subtask with capability tag X, assert the correct worker type is selected.
  • Validation gate: given a known-good and known-bad worker output, assert correct ACCEPT/REJECT classification.
  • SLA monitor: given an assignment with a deadline 100ms in the past, assert reassignment is triggered.

13.2 Integration Tests

  • Full supervisor run with stub workers: one worker returns invalid schema on first call, valid on second. Assert final output is produced and retry counter is incremented.
  • Budget exhaustion: stub workers report high cost. Assert supervisor stops assigning new subtasks when budget ceiling is reached and returns partial result.
  • Human escalation path: stub worker fails validation twice. Assert escalation event is written to the human review queue and subtask is marked as escalated.

13.3 Load Tests

  • Simulate 50 concurrent supervisor runs sharing a static worker pool of 10. Assert p95 latency remains within SLA and no worker is assigned more than maxConcurrentTasks subtasks simultaneously.

13.4 End-to-End Tests (Playwright)

For each supported high-stakes task type, run a live end-to-end test. Assert: validation gate fires for each worker output; at least one subtask is retried (using a stub that initially returns a structurally incomplete output); human escalation queue receives escalation events; final output schema is valid.


14. Variants and Extensions

14.1 Peer Supervisor Hierarchy

For very large tasks, multiple supervisors can operate in parallel, each managing a sub-pool of workers, with a meta-supervisor coordinating between them. Maximum hierarchy depth: 2 tiers of supervisors to avoid management overhead exceeding task value.

14.2 Specialised Supervisor per Domain

Rather than a general-purpose supervisor, deploy domain-specific supervisors (legal supervisor, financial supervisor, code review supervisor) each with a registry tuned to their domain. A routing layer directs incoming tasks to the correct domain supervisor.

14.3 Continuous Supervision (Streaming)

For real-time workflows, the supervisor monitors streaming worker outputs and interrupts a worker mid-generation if the partial output shows signs of heading off-track (e.g., generating content outside scope). Requires streaming support in the LLM provider API.


15. Trade-off Analysis

| Dimension | Supervisor Agent | Basic Orchestration | No Supervision |
|---|---|---|---|
| Output quality | Highest (validation gate) | Moderate | Lowest |
| Latency | Highest (validation overhead) | Moderate | Lowest |
| Cost | Highest (validation LLM calls) | Moderate | Lowest |
| Error recovery | Structured (reassign/escalate) | Basic (retry) | None |
| Compliance suitability | Highest | Moderate | Not suitable for regulated use |

Use supervisor pattern when: task errors have regulatory, financial, or safety consequences; intermediate results feed further reasoning steps; you need an audit trail of quality decisions.

Use basic orchestration when: tasks are lower-stakes; end-to-end validation of the final output is sufficient; latency budget is tight.


16. Known Implementations

| Organisation Type | Use Case | Worker Pool Size | Reported Outcome |
|---|---|---|---|
| Global insurance carrier | Policy document risk analysis | 6 specialist workers | Validation gate catches 23% of worker outputs; final error rate < 0.5% |
| Pharmaceutical company | Regulatory submission drafting | 8 specialist workers | Human escalation rate 3.2%; 0 regulatory rejections in 12 months |
| Tier-1 investment bank | Credit memo automation | 4 specialist workers | 65% reduction in analyst prep time; SR 11-7 audit passed |
| Healthcare network | Prior authorisation review | 5 specialist workers | 18% improvement in approval accuracy vs. unvalidated pipeline |

17. Related Patterns

| Pattern ID | Name | Relationship |
|---|---|---|
| EAAPL-MAG001 | Multi-Agent Orchestration | Foundation pattern; supervisor adds active oversight and validation |
| EAAPL-MAG003 | Human-in-the-Loop Agent | Used as the escalation endpoint for HUMAN_REVIEW outcomes |
| EAAPL-MAG006 | Agent Handoff Protocol | Defines message schema for supervisor-to-worker and worker-to-supervisor communication |
| EAAPL-INT007 | AI Circuit Breaker | Applied per worker type to handle worker health failures |
| EAAPL-MAG005 | Debate Agent | Supervisor can invoke debate between two workers on a contested subtask |

18. References

  1. Gartner, "AI Agent Topologies: From Orchestration to Supervision," 2025 (ID: G00817884)
  2. Anthropic, "Building Effective Agents," 2025 — anthropic.com/research/building-effective-agents
  3. Microsoft AutoGen: Agent Supervision Patterns — microsoft.github.io/autogen/docs/topics/supervision
  4. LangGraph: Supervisor Multi-Agent Architecture — langchain-ai.github.io/langgraph/tutorials/multi_agent/agent_supervisor
  5. EU AI Act (Regulation 2024/1689), Article 14: Human Oversight of High-Risk AI Systems
  6. NIST AI RMF 1.0, Map 5.2: Operator Monitoring and Human Review
  7. SR 11-7: Guidance on Model Risk Management — federalreserve.gov/supervisionreg/srletters/sr1107.htm
  8. Liang et al., "Encouraging Divergent Thinking in Large Language Models through Debate," 2023 — arxiv.org/abs/2305.19118
  9. Chase, H., "LangChain Expression Language and Multi-Agent Patterns," 2024
  10. AWS, "Building Reliable AI Agents with Step Functions and Bedrock," 2025 — aws.amazon.com/blogs/machine-learning