EAAPLEnterprise AI Architecture Pattern Library
EAAPLLibraryMulti-Agent Systems
Proven
⇄ Compare

EAAPL-MAG004 — Agent Swarm

EAAPL-MAG004 — Agent Swarm

Status: Emerging Tags: agent orchestration high-complexity enterprise-only Version: 2.0.0 Last Updated: 2026-06-12


1. Pattern Identity

Field Value
Pattern ID EAAPL-MAG004
Name Agent Swarm
Category Multi-Agent
Maturity Emerging
Complexity High
Related Patterns EAAPL-MAG001 · EAAPL-MAG002 · EAAPL-MAG003 · EAAPL-MAG006

2. Executive Summary

The Agent Swarm pattern coordinates a population of peer agents operating without a central controller. Rather than a supervisor decomposing and assigning work, swarm agents observe a shared world state (a "blackboard"), self-assign to available tasks based on local rules, deposit results back onto the blackboard, and leave markers (stigmergic signals) that guide subsequent agent behaviour. Coordination is emergent rather than designed. This produces a system that degrades gracefully — losing any single agent does not halt the swarm — and scales horizontally without a coordination bottleneck. The price is reduced predictability and harder observability: emergent behaviour can be difficult to explain, and convergence is probabilistic rather than deterministic. Agent swarms are an enterprise-grade pattern only for organisations that have established multi-agent orchestration maturity (EAAPL-MAG001, EAAPL-MAG002) and have invested in swarm-level observability infrastructure. They are inappropriate for regulated workflows requiring deterministic audit trails of decision logic.


3. Problem Statement

3.1 Context

Centralised orchestration (EAAPL-MAG001, EAAPL-MAG002) introduces a single point of failure and a coordination bottleneck at the orchestrator. For massively parallel workloads — indexing millions of documents, distributed web research across thousands of URLs, large-scale code repository analysis — the orchestrator becomes the limiting factor in throughput. Furthermore, if the orchestrator fails, all in-flight work is at risk. A decentralised architecture that eliminates the orchestrator bottleneck is needed for these at-scale use cases.

3.2 Forces in Tension

  • Resilience vs. predictability. Removing central control eliminates the single point of failure but makes the execution path non-deterministic. You cannot replay exactly what happened.
  • Throughput vs. coordination. Peer agents each make local decisions quickly but may duplicate work or create oscillation loops without careful stigmergy design.
  • Scalability vs. observability. Adding more agents improves throughput but multiplies the observability challenge — aggregating and interpreting signals from hundreds of agents requires dedicated infrastructure.
  • Emergent quality vs. guaranteed quality. Swarm results emerge from the aggregate of many agent outputs. Quality is probabilistically higher for large tasks but cannot be guaranteed for any specific subtask.

3.3 Failure Modes Without This Pattern

Without swarm architecture, highly parallel workloads require either a very large orchestrator (single point of failure, expensive) or nested orchestration hierarchies (complex, slow). The swarm pattern specifically addresses the throughput ceiling and the single-point-of-failure problem that centralised orchestration cannot efficiently solve at scale.


4. Solution

4.1 Swarm Architecture Overview

ARCHITECTURE DIAGRAM
flowchart TD subgraph Input["Task Entry"] A[Task Posted to Blackboard] end subgraph Swarm["Swarm Agents"] B[Agent Alpha] C[Agent Beta] D[Agent Gamma] E[Agent Delta] end subgraph Shared["Shared Blackboard"] F[Task Queue] G[Results Store] H[Stigmergy Markers] end subgraph Output["Convergence"] I{Termination Check} J[Swarm Output Synthesiser] K[Final Result] end A --> F F --> B F --> C F --> D F --> E B --> G C --> G D --> G E --> H H --> F G --> I I -->|not done| F I -->|done| J --> K style A fill:#dbeafe,stroke:#3b82f6 style B fill:#f0fdf4,stroke:#22c55e style C fill:#f0fdf4,stroke:#22c55e style D fill:#f0fdf4,stroke:#22c55e style E fill:#f0fdf4,stroke:#22c55e style F fill:#fef9c3,stroke:#eab308 style G fill:#fef9c3,stroke:#eab308 style H fill:#fef9c3,stroke:#eab308 style I fill:#f3e8ff,stroke:#a855f7 style J fill:#f0fdf4,stroke:#22c55e style K fill:#d1fae5,stroke:#10b981

4.2 Stigmergy Signal Flow

ARCHITECTURE DIAGRAM
flowchart TD subgraph Agent["Agent Processing"] A[Agent Reads Blackboard] B[Claims Available Task] C[Executes Task] D[Deposits Result] E[Deposits Stigmergy Marker] end subgraph Board["Blackboard State"] F[Task: Unclaimed] G[Task: In-Progress] H[Task: Complete] I[Marker: HotZone] J[Marker: Explored] end A --> B B --> G C --> D --> H D --> E --> I I --> F style A fill:#dbeafe,stroke:#3b82f6 style B fill:#f0fdf4,stroke:#22c55e style C fill:#f0fdf4,stroke:#22c55e style D fill:#f0fdf4,stroke:#22c55e style E fill:#f0fdf4,stroke:#22c55e style F fill:#fef9c3,stroke:#eab308 style G fill:#fef9c3,stroke:#eab308 style H fill:#d1fae5,stroke:#10b981 style I fill:#f3e8ff,stroke:#a855f7 style J fill:#fef9c3,stroke:#eab308

5. Structure

5.1 Component Catalogue

Component Responsibility Technology Options
Blackboard Shared world state — tasks, results, markers Redis, DynamoDB, Postgres
Swarm Agents Self-directed task execution based on blackboard state LLM instances with tool access
Stigmergy Engine Manages markers that guide agent self-selection Weighted counters on the blackboard
Termination Monitor Detects convergence and triggers synthesis Background process checking blackboard state
Swarm Synthesiser Aggregates all agent results into a final output LLM with aggregation prompt
Swarm Observability Aggregates signals from all agents OpenTelemetry collector, time-series DB

5.2 Blackboard Record Schema

{
  "taskId": "uuid-v4",
  "taskType": "document-chunk-analysis",
  "status": "UNCLAIMED | IN_PROGRESS | COMPLETE | FAILED",
  "payload": { "chunkId": "...", "text": "..." },
  "claimedBy": "agent-uuid-or-null",
  "claimedAt": "ISO-8601-or-null",
  "completedAt": "ISO-8601-or-null",
  "result": { "entities": [], "sentiment": "...", "summary": "..." },
  "stigmergy": {
    "hotZone": 3,
    "explored": true,
    "explorationDepth": 2
  },
  "ttlMs": 300000
}

6. Behaviour

6.1 Shared Blackboard Communication

The blackboard is the sole communication channel between agents. Agents do not communicate directly with each other. The blackboard exposes:

  • Task queue. Ordered list of unclaimed tasks with priority and TTL.
  • Results store. Completed task records including agent output.
  • Stigmergy markers. Weighted signals left by agents to indicate areas of high or low value for further exploration.

Agent task selection uses an atomic claim operation (compare-and-swap on status: UNCLAIMED -> IN_PROGRESS with claimedBy: agentId). This prevents two agents from claiming the same task. If a claim fails (another agent beat them to it), the agent immediately re-evaluates the blackboard for the next available task.

6.2 Stigmergy

Stigmergy is the mechanism by which agents indirectly influence each other's behaviour through environmental markers, without direct communication. In an AI swarm:

  • Positive pheromone (hot zone marker): an agent that finds a highly productive task area (e.g., a document section with many relevant entities) increments a hotZone counter on that area. Other agents probabilistically bias their task selection toward high hot-zone areas.
  • Negative pheromone (explored marker): an agent that exhausts a task area marks it as explored: true. Other agents deprioritise already-explored areas.
  • Marker decay. Stigmergy markers decay over time (TTL-based counter reduction). This prevents the swarm from permanently fixating on a historically productive area that is no longer relevant. Decay rate is a tuning parameter.

6.3 Consensus Without Central Coordinator

For tasks requiring agreement among agents (e.g., document classification where multiple agents analyse the same document and must agree on a label):

  1. Each agent deposits its classification result on the blackboard.
  2. After N agents have deposited results (N is the consensus threshold), the termination monitor reads all results.
  3. If majority agreement exists (> 50% for binary, configurable for multi-class), the consensus result is recorded.
  4. If no consensus: spawn an additional agent with the full set of disagreeing results in context, asking it to adjudicate.

6.4 Swarm Stability Controls

Termination conditions. The swarm terminates when one of: all tasks in the blackboard are in COMPLETE or FAILED status; a wall-clock deadline is reached; the remaining unclaimed task count falls below a minimum threshold; the quality score of results reaches a target threshold.

Convergence detection. The termination monitor tracks the rate of new results being deposited. If the rate drops below a minimum threshold for a sustained period (configurable: e.g., fewer than 5 results per minute for 3 consecutive minutes), the swarm is declared converged even if tasks remain, indicating they are likely infeasible or blocked.

Anti-oscillation. Oscillation occurs when agents repeatedly claim and release the same tasks without making progress. Detect by tracking the number of IN_PROGRESS -> UNCLAIMED transitions per task. A task that has been claimed and abandoned more than 3 times is marked FAILED and removed from circulation.

Agent health monitoring. An agent that has been in IN_PROGRESS state for longer than 2× the expected task duration is presumed crashed. Its claimed tasks are returned to UNCLAIMED status for other agents to pick up.


7. Implementation Guide

7.1 Step-by-Step

Step 1 — Design the blackboard schema. Define your task record, result record, and stigmergy marker fields. Ensure the claim operation is atomic at the database level (use a transaction or conditional write).

Step 2 — Define agent selection logic. Each agent runs a loop: read blackboard → select best unclaimed task (weighted by priority + stigmergy) → atomic claim → execute → deposit result + update markers → repeat.

Step 3 — Implement termination conditions. Decide your termination criteria before deploying. Unclear termination is the most common swarm failure mode.

Step 4 — Implement marker decay. Run a background process that reduces stigmergy marker values by a decay factor every N seconds. Without decay, the swarm becomes permanently biased toward early high-value areas.

Step 5 — Build swarm observability. Before deploying to production, ensure you can answer: how many agents are currently active, what is the task completion rate, what is the current blackboard depth, and are any tasks oscillating?

Step 6 — Implement the swarm synthesiser. After termination, a single synthesis agent reads all completed results from the blackboard and produces the final output. This is the one centralised step in an otherwise decentralised architecture.

7.2 Code Skeleton (TypeScript)

class SwarmAgent {
  private agentId = crypto.randomUUID();

  async run(blackboard: Blackboard, maxIterations = 1000): Promise<void> {
    for (let i = 0; i < maxIterations; i++) {
      const task = await blackboard.claimNextTask(this.agentId);
      if (!task) {
        await sleep(500); // No tasks available, backoff
        continue;
      }

      const span = tracer.startSpan("swarm.agent.execute", { taskId: task.taskId, agentId: this.agentId });
      try {
        const result = await this.executeTask(task);
        await blackboard.depositResult(task.taskId, result);
        await blackboard.updateStigmergy(task.taskId, {
          hotZone: result.entityCount > 10 ? 3 : 1,
          explored: true
        });
        span.setStatus({ code: "OK" });
      } catch (e) {
        await blackboard.markFailed(task.taskId, this.agentId, String(e));
        span.setStatus({ code: "ERROR", message: String(e) });
      } finally {
        span.end();
      }
    }
  }

  private async executeTask(task: BlackboardTask): Promise<TaskResult> {
    return agentLLM.invoke({
      system: "You are a document analysis agent. Extract entities, sentiment, and key facts.",
      user: task.payload.text
    });
  }
}

// Launch swarm
const swarm = Array.from({ length: 20 }, () => new SwarmAgent());
await Promise.all(swarm.map(agent => agent.run(blackboard)));

8. Observability

8.1 Swarm-Level Metrics

The challenge of swarm observability is that individual agent traces are necessary but not sufficient — you need aggregate swarm health metrics in addition to per-agent spans.

Metric Description Alert Threshold
Active agent count Agents currently executing tasks < configured minimum (swarm shrinking unexpectedly)
Task completion rate Tasks completed per minute < 10% of initial rate sustained for 5m
Blackboard depth Unclaimed tasks remaining > 0 after termination deadline
Oscillating task rate Tasks claimed and abandoned > 3 times > 5% of total tasks
Convergence progress % of tasks in COMPLETE or FAILED state Used for progress estimation
Stigmergy concentration Whether 80% of agent activity is concentrated on 20% of tasks High concentration may indicate suboptimal coverage

8.2 Trace Aggregation

Each agent emits OpenTelemetry spans with the swarm run ID as the root trace context. The trace aggregation system must be able to: group spans by swarm run ID; show the timeline of task claims and completions across all agents; identify which agents had the highest error rates; show the evolution of the blackboard state over time.


9. Cost Governance

  • Agent count ceiling. Set a hard maximum on the number of agents that can run concurrently for a single swarm run. Without this ceiling, a runaway swarm can exhaust token budgets in minutes.
  • Per-task token budget. Each task on the blackboard has a maxTokensPerExecution field. Agents must honour this limit.
  • Swarm budget envelope. Set a total token budget for the entire swarm run. The termination monitor halts the swarm when this budget is reached, even if tasks remain.
  • Model tiering per task type. Simple tasks (chunked text extraction) use efficient models; complex tasks (cross-document reasoning) use frontier models. Encode the required model tier in the task record.

10. Security Considerations

10.1 Blackboard Isolation

The blackboard stores all task payloads and results. It must enforce tenant isolation — agents from one tenant must not read tasks or results belonging to another. Implement row-level security or key-prefix namespace separation.

10.2 Agent Identity

Each agent must authenticate to the blackboard using a short-lived token scoped to the current swarm run. Tokens expire when the swarm run ends. This prevents orphaned agents from continuing to access the blackboard after the run concludes.

10.3 Prompt Injection via Blackboard

Task payloads read from the blackboard may contain adversarial content. Sanitise task payloads before injecting them into agent prompts. Never allow task payload content to appear in the agent's system prompt — only in the user turn, clearly demarcated.


11. Failure Modes and Mitigations

Failure Mode Detection Mitigation
Swarm fails to converge Completion rate drops to near zero before all tasks complete Convergence detection triggers early termination; synthesiser works with partial results
Oscillating tasks block progress Oscillation rate above threshold Mark oscillating tasks as FAILED after 3 abandoned claims
Swarm fixates on one area Stigmergy concentration above threshold Increase marker decay rate; cap hot-zone score maximum
Agent flood (too many agents spawn) Cost spike alert Hard agent count ceiling per swarm run
Blackboard becomes consistency bottleneck Claim operation latency spikes Shard blackboard by task type; use optimistic locking
Human oversight loses track of emergent behaviour No swarm-level audit trail Swarm synthesiser must produce a narrative explaining which areas were explored and which were missed

12. Compliance and Governance

12.1 Auditability of Emergent Behaviour

The principal compliance challenge of the swarm pattern is that the execution path is non-deterministic — the same input will produce a different order of agent operations on each run. For regulated use cases requiring a deterministic audit trail, the swarm pattern is inappropriate. The centralised orchestration pattern (EAAPL-MAG001) or supervisor agent pattern (EAAPL-MAG002) should be used instead.

For enterprise use cases where swarm is appropriate (non-regulated, large-scale analysis), the audit record must capture: the full blackboard state at start and end of run; the aggregate list of tasks completed and failed; the final synthesised output; and the swarm run parameters (agent count, termination conditions, budget).

12.2 Human Oversight Integration

Because swarm behaviour is emergent and difficult to predict, human oversight must occur at the swarm output level rather than at individual agent decision points. Integrate EAAPL-MAG003 as a post-swarm checkpoint: before the swarm synthesiser's output is consumed by downstream systems, a human reviewer validates the aggregate result and approves publication.


13. Testing Strategy

13.1 Unit Tests

  • Atomic claim operation: two concurrent agents attempt to claim the same task; assert exactly one succeeds.
  • Stigmergy decay: a blackboard marker is written; after decay interval, assert the value has decreased by the expected factor.
  • Anti-oscillation: a task is claimed and abandoned 3 times; assert it is marked FAILED and removed from circulation.
  • Termination: all tasks transition to COMPLETE; assert the termination monitor fires and triggers synthesis.

13.2 Integration Tests

  • Swarm run with 5 agents and 50 pre-loaded tasks; assert all tasks complete within a configurable time limit.
  • Swarm run with one agent crashed mid-run; assert its claimed tasks are reclaimed by other agents and completed.
  • Swarm run with budget ceiling set to exhaust after 30 tasks; assert the swarm halts at the budget ceiling and returns a partial result.

13.3 Chaos Tests

  • Kill 50% of agents mid-run; assert remaining agents complete all tasks (possibly with increased latency).
  • Corrupt the blackboard state for 10% of task records; assert corrupted tasks are marked failed and do not block swarm completion.

13.4 Observability Tests

  • Assert that after a swarm run, the trace aggregation system contains spans from all active agents grouped under the swarm run ID.
  • Assert that the swarm summary metric (tasks completed / tasks total) reaches 100% or reports the correct partial completion rate.

14. Variants and Extensions

14.1 Hierarchical Swarm

A swarm that produces sub-tasks deposits them onto a secondary blackboard consumed by a child swarm. Enables recursive decomposition without a central orchestrator. Maximum depth: 2 levels recommended.

14.2 Swarm with Referee Agent

A single referee agent monitors swarm output quality in real time (without blocking swarm execution). If quality falls below threshold (e.g., too many agent results contradicting each other), the referee posts a correction task onto the blackboard for the swarm to address.

14.3 Hybrid Swarm-Orchestrator

A central orchestrator handles task decomposition and final synthesis; the execution of individual subtasks is delegated to a swarm of peer agents rather than assigned by the orchestrator. Preserves orchestrator observability for decomposition and synthesis while gaining swarm resilience for execution.


15. Trade-off Analysis

Dimension Agent Swarm Centralised Orchestration Supervisor Agent
Throughput ceiling None (horizontal scale) Limited by orchestrator Limited by supervisor
Single point of failure None Orchestrator Supervisor
Predictability Low (emergent) High (deterministic) High
Observability complexity High Moderate Moderate
Compliance suitability Low (non-regulated only) High Highest
Minimum viable team maturity High Moderate Moderate

16. Known Implementations

Organisation Type Use Case Swarm Size Reported Outcome
Legal tech platform Large-scale contract corpus analysis (10K+ docs) 50 agents 14× throughput vs orchestrated approach; 3% missed task rate
Research institution Distributed literature review across 100K papers 100 agents Covered 94% of relevant papers in 4 hours vs 3 days manually
E-commerce Product catalogue enrichment (1M+ SKUs) 200 agents 99.2% task completion rate; 0.8% required human review
Cybersecurity firm Distributed vulnerability scanning across large codebase 30 agents 8× faster than sequential scan; false positive rate 2.1%

Pattern ID Name Relationship
EAAPL-MAG001 Multi-Agent Orchestration Centralised alternative; recommended for regulated or lower-scale use cases
EAAPL-MAG002 Supervisor Agent Hybrid: supervisor handles quality gates; swarm handles parallel execution
EAAPL-MAG003 Human-in-the-Loop Agent Applied at swarm output level for post-synthesis human validation
EAAPL-MAG006 Agent Handoff Protocol Informs blackboard task record schema design

18. References

  1. Gartner, "Emergent AI Architectures: Beyond Orchestration," 2025 (ID: G00821567)
  2. Dorigo, M. and Stutzle, T., "Ant Colony Optimization," MIT Press, 2004
  3. Bonabeau, E. et al., "Swarm Intelligence: From Natural to Artificial Systems," Oxford University Press, 1999
  4. Microsoft Research, "Magnetic-One: A Generalist Multi-Agent System for Solving Complex Tasks," 2024
  5. AutoGen: Enabling Next-Generation Large Language Model Applications — arxiv.org/abs/2308.08155
  6. LangGraph: Multi-Agent Networks — langchain-ai.github.io/langgraph/tutorials/multi_agent/multi-agent-network
  7. Anthropic, "Building Effective Agents," 2025 — anthropic.com/research/building-effective-agents
  8. NIST SP 800-204D: Strategies for the Integration of Software Supply Chains (emergent system auditability principles)
  9. Wooldridge, M., "An Introduction to MultiAgent Systems," 2nd ed., Wiley, 2009
  10. OpenTelemetry Specification: Trace Context Propagation — opentelemetry.io/docs/reference/specification/trace
← Back to LibraryMore Multi-Agent Systems