EAAPL: Enterprise AI Architecture Pattern Library


[EAAPL-WRK008] Mixture of Agents

Category: Agentic Workflows
Sub-category: Ensemble Architecture
Version: 1.0
Maturity: Emerging
Tags: mixture-of-agents, ensemble, consensus, multi-model, voting, aggregation, diverse-sampling
Regulatory Relevance: ISO 42001 §8.4, EU AI Act (Art. 15), NIST AI RMF (MEASURE 2.5)


1. Executive Summary

The Mixture of Agents (MoA) pattern defines an architecture in which multiple independent agents (potentially using different LLM providers, models, or sampling configurations) produce independent outputs for the same task, and a dedicated aggregator synthesises a final result from them. Unlike Fan-Out/Fan-In (EAAPL-WRK003), which fans out for coverage or throughput, MoA fans out for quality improvement through diversity: different models make different errors, so a well-calibrated aggregation can produce a result that outperforms each individual model. Published benchmarks report quality improvements of roughly 5–20% over single-model approaches on complex reasoning and analysis tasks.

For CIO/CTO audiences: this is the AI equivalent of getting multiple independent expert opinions before making an important decision. A law firm, a medical panel, and a financial advisory board all use this approach, not because any individual expert is wrong, but because independent experts catch different errors and the consensus or synthesis is more reliable than any single view. Cost scales with the number of agents (3 agents cost roughly 3× as much as 1), so MoA is reserved for tasks where the quality improvement justifies the cost: high-stakes decisions, regulated analyses, and outputs that will not be reviewed by a human expert.


2. Problem Statement

Business Problem

High-stakes enterprise decisions — legal risk assessments, clinical summaries, regulatory compliance determinations, executive briefings — require output reliability that a single LLM invocation cannot guarantee. The error rate of any individual model, even the most capable available, is non-zero and the failure mode is invisible: the model produces a confident, plausible-sounding output that contains material errors.

Technical Problem

Single-model outputs exhibit correlated errors: if the model makes a mistake in a particular domain or reasoning pattern, it makes the same mistake consistently. There is no mechanism within a single inference call to detect and correct errors that are within the model's systematic blind spots. Reflection (EAAPL-AGT006) addresses some errors but shares the same model's limitations in the critique phase.

Symptoms of Absence

  • High-stakes outputs require expensive human expert review because single-model reliability is insufficient
  • No diversity in output generation; all inference calls to the same model are correlated
  • Quality ceiling is the individual model's capability, with no ensemble improvement possible
  • Single-model failure modes are invisible until outputs are reviewed

Cost of Inaction

  • Quality Risk: Single-model errors in high-stakes outputs create compliance and liability exposure
  • Human Review Bottleneck: Expert review of every output is the only quality gate, creating a throughput bottleneck
  • Opportunity: Peers using ensemble approaches achieve demonstrably better quality without proportionally higher cost for high-value tasks

3. Context

When to Apply

  • Task output quality has material financial, legal, or safety consequences
  • The task has an articulable quality benchmark against which improvement can be measured
  • The additional cost (N× model inference) is justified by quality improvement and risk reduction
  • Multiple capable models are available (different providers or model families)
  • Aggregation strategy is clearly defined and produces consistently better results than individual outputs

When NOT to Apply

  • High-volume, low-stakes tasks where N× cost is not justified
  • Tasks requiring real-time responses with hard latency constraints
  • Tasks where all available models have the same systematic blind spots (diversity is the source of benefit)
  • Tasks that are too subjective to define a quality improvement metric

Prerequisites

  • Access to multiple independent LLM models/providers
  • Defined aggregation strategy (voting, synthesis, best-of-N)
  • Quality benchmark for measuring MoA improvement over single-model baseline
  • Fan-Out infrastructure (EAAPL-WRK003) for parallel worker execution

Industry Applicability

| Industry | MoA Use Case | Quality Benefit |
| --- | --- | --- |
| Legal | Contract risk assessment | Different models identify different risk clauses; synthesis is more comprehensive |
| Financial Services | Analyst report generation | Diverse reasoning catches analytical blind spots; synthesis is more balanced |
| Healthcare | Clinical decision support | Different models apply different clinical guidelines; synthesis reconciles differences |
| Government | Policy impact assessment | Multi-model diversity catches different stakeholder implications |
| Cybersecurity | Threat analysis | Different models identify different attack vectors; union analysis is more comprehensive |

4. Architecture Overview

The MoA architecture has three layers: parallel proposers, optional discussion/critique, and an aggregator.

Proposer Layer

Multiple proposer agents independently process the same task input and produce independent output proposals. The diversity between proposers is the source of quality improvement. Diversity can be achieved through: (a) different LLM providers (GPT-4o + Claude 3.5 Sonnet + Gemini 1.5 Pro), (b) different model sizes from the same provider, (c) same model with different sampling temperatures, (d) same model with different system prompts (each with a different expert persona), or (e) same model with different context window positions (to avoid position bias).
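
The independent, parallel nature of the proposer layer can be sketched with `asyncio`. This is a minimal illustration, not a reference implementation: the `Proposal` type, the `fan_out` and `make_stub` helpers, and the persona names are all hypothetical, and the stub stands in for a real provider call.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable

# Hypothetical proposer signature: takes the task text, returns the proposal text.
Proposer = Callable[[str], Awaitable[str]]


@dataclass
class Proposal:
    proposer: str
    output: str


async def fan_out(task: str, proposers: dict[str, Proposer]) -> list[Proposal]:
    """Send the same task to every proposer concurrently.

    No proposer sees another proposer's output at this stage.
    """
    names = list(proposers)
    outputs = await asyncio.gather(*(proposers[name](task) for name in names))
    return [Proposal(name, output) for name, output in zip(names, outputs)]


def make_stub(persona: str) -> Proposer:
    """Stand-in for a real model call (assumption: one persona per proposer)."""
    async def propose(task: str) -> str:
        await asyncio.sleep(0)  # yield control, as a network call would
        return f"[{persona}] assessment of: {task}"
    return propose


proposers = {
    "contracts-lawyer": make_stub("contracts-lawyer"),
    "regulatory-analyst": make_stub("regulatory-analyst"),
    "litigation-counsel": make_stub("litigation-counsel"),
}
proposals = asyncio.run(fan_out("clause 14 enforceability", proposers))
```

In this sketch diversity comes from option (d), different personas on what could be the same model; swapping the stubs for calls to different providers would give option (a) without changing `fan_out`.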

Optional Discussion Phase

In higher-quality configurations, proposers are shown each other's outputs and produce a revised proposal that takes account of what others observed. This multi-round "discussion" mimics the expert panel model and can improve consensus quality. It adds latency and cost (each proposer makes an additional inference call per discussion round).
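
A discussion round can be sketched as a loop in which each proposer revises its draft after seeing its peers' drafts from the previous round. The function names and the `adopt_peer_findings` stub below are hypothetical; in a real deployment each reviser call would be one additional LLM inference.

```python
from typing import Callable

# Hypothetical reviser signature: (own_findings, peer_findings) -> revised findings.
Reviser = Callable[[set, list], set]


def discussion_round(drafts: dict[str, set], revisers: dict[str, Reviser]) -> dict[str, set]:
    """Show every proposer its peers' drafts from the previous round."""
    revised = {}
    for name, own in drafts.items():
        peers = [draft for other, draft in drafts.items() if other != name]
        revised[name] = revisers[name](own, peers)
    return revised


def run_discussion(drafts, revisers, rounds=1):
    """Run N rounds and count the extra inference calls they would cost."""
    extra_calls = 0
    for _ in range(rounds):
        drafts = discussion_round(drafts, revisers)
        extra_calls += len(drafts)  # one additional call per proposer per round
    return drafts, extra_calls


def adopt_peer_findings(own: set, peers: list) -> set:
    """Stub reviser: keep own findings and adopt everything peers raised."""
    return own.union(*peers)


drafts = {
    "A": {"ACL §64"},
    "B": {"ACL §64", "notice clause"},
    "C": {"notice clause"},
}
revisers = {name: adopt_peer_findings for name in drafts}
final_drafts, extra_calls = run_discussion(drafts, revisers, rounds=1)
```

With three proposers and one round, `extra_calls` is 3, which is the added cost the paragraph above refers to. A real reviser would weigh peer findings rather than adopt them wholesale.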

Aggregation Layer

The Aggregator receives all proposer outputs and produces a final synthesised result. Aggregation strategies:

  • Majority Voting: For categorical decisions (classify this document as "high/medium/low risk"), take the majority vote. Simple, deterministic, explainable.
  • Weighted Voting: Weight votes by each proposer's historical accuracy on this task type. Requires tracked historical accuracy.
  • Union Aggregation: For coverage-oriented tasks (find all risk flags), take the union of all proposer findings. Each finding attributed to the proposer(s) that identified it.
  • LLM Synthesis: An aggregator LLM receives all proposer outputs and synthesises a unified final output. Highest quality, adds cost and latency. The aggregator is instructed to identify agreement, reconcile disagreements, and explain reasoning.
  • Best-of-N Selection: Score all proposers' outputs on a quality rubric and return the highest-scoring output. Preserves full fidelity of the best output; does not synthesise.
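
The deterministic strategies above are small enough to sketch directly. The function names, the accuracy weights, and the scoring rubric below are illustrative assumptions; LLM Synthesis is omitted because it depends on a model call.

```python
from collections import Counter, defaultdict


def majority_vote(votes: dict[str, str]) -> tuple[str, float]:
    """Return the most common label and the share of proposers that chose it.

    Ties resolve to the label seen first; a real system may prefer to escalate.
    """
    label, count = Counter(votes.values()).most_common(1)[0]
    return label, count / len(votes)


def weighted_vote(votes: dict[str, str], accuracy: dict[str, float]) -> str:
    """Weight each vote by the proposer's tracked historical accuracy."""
    totals = defaultdict(float)
    for proposer, label in votes.items():
        totals[label] += accuracy[proposer]
    return max(totals, key=totals.get)


def union_findings(findings: dict[str, list[str]]) -> dict[str, list[str]]:
    """Union of all findings, each attributed to the proposers that raised it."""
    attributed = defaultdict(list)
    for proposer, items in findings.items():
        for item in items:
            attributed[item].append(proposer)
    return dict(attributed)


def best_of_n(outputs: dict[str, str], score) -> str:
    """Return the name of the proposer whose output scores highest on the rubric."""
    return max(outputs, key=lambda name: score(outputs[name]))


votes = {"A": "high", "B": "high", "C": "medium"}
label, share = majority_vote(votes)
# Hypothetical accuracy weights: C has been far more reliable on this task type.
weighted = weighted_vote(votes, {"A": 0.4, "B": 0.4, "C": 0.9})
```

The example shows why the choice of strategy matters: majority voting returns "high" here, while the weighted vote returns "medium" because proposer C's single vote outweighs the other two combined.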

Confidence and Agreement Reporting

The aggregation result includes metadata: the number of proposers that agreed (for voting strategies), the agreement rate (for structured outputs), and the individual proposer outputs (for audit and transparency). Low agreement (e.g., fewer than two-thirds of proposers agreeing) signals that the task is genuinely ambiguous or that the proposers have conflicting knowledge. This signal should be reported to the caller and, for high-stakes tasks, escalated to human review.
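
One way to carry this metadata is a small result record with an escalation flag. The `AggregationResult` type and `evaluate_agreement` function are hypothetical names, and the two-thirds default is only the example threshold mentioned above; exact fractions are used so that "exactly two-thirds" is not misjudged by floating-point rounding.

```python
from collections import Counter
from dataclasses import dataclass
from fractions import Fraction


@dataclass
class AggregationResult:
    final: str
    agreeing: int                      # number of proposers behind the final answer
    agreement_rate: Fraction
    proposer_outputs: dict[str, str]   # kept for audit and transparency
    escalate: bool                     # True when agreement is below the threshold


def evaluate_agreement(
    outputs: dict[str, str],
    threshold: Fraction = Fraction(2, 3),  # assumption: tuned per task type in practice
) -> AggregationResult:
    label, count = Counter(outputs.values()).most_common(1)[0]
    rate = Fraction(count, len(outputs))
    return AggregationResult(
        final=label,
        agreeing=count,
        agreement_rate=rate,
        proposer_outputs=dict(outputs),
        escalate=rate < threshold,
    )


agreed = evaluate_agreement({"A": "high", "B": "high", "C": "medium"})
split = evaluate_agreement({"A": "high", "B": "medium", "C": "low"})
```

Here `agreed` passes (2 of 3 meets the threshold) while `split` is flagged for escalation (1 of 3).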


5. Architecture Diagram

flowchart TD
    subgraph Input["Task Input"]
        A[Task + Quality Objective]
    end
    subgraph Proposers["Proposer Layer - Parallel Execution"]
        P1[Proposer A]
        P2[Proposer B]
        P3[Proposer C]
        PN[Proposer N]
    end
    subgraph Discussion["Optional Discussion Phase"]
        D[Peer Review + Revise]
    end
    subgraph Aggregation["Aggregation Layer"]
        C[Result Collector]
        E{Aggregation Strategy}
        F[Voting / Union]
        G[LLM Synthesis]
        H{Agreement Threshold?}
    end
    subgraph Output["Output"]
        I[Final Result]
        J[Low Agreement Escalation]
    end
    A --> P1 & P2 & P3 & PN
    P1 & P2 & P3 & PN --> D
    D --> C
    C --> E
    E -->|deterministic| F
    E -->|synthesis| G
    F --> H
    G --> H
    H -->|sufficient| I
    H -->|low agreement| J

6. Components

| Component | Type | Responsibility | Technology Options | Criticality |
| --- | --- | --- | --- | --- |
| Proposer Pool | AI Components | N independent agents producing independent proposals | Different LLM providers; different models; different personas | Critical |
| Fan-Out Dispatcher | Orchestration | Submits all proposer tasks concurrently | EAAPL-WRK003 implementation | Critical |
| Discussion Coordinator | AI Orchestration | Optional: routes peer outputs to each proposer for revision | Custom multi-round coordination loop | Medium |
| Result Collector | State | Waits for all proposer completions; handles partial failures | Async futures; Step Functions; Durable Functions | Critical |
| Aggregation Engine | AI/Logic Component | Applies chosen aggregation strategy | Custom voting/union logic; LLM synthesis call | Critical |
| Agreement Evaluator | Logic Component | Computes agreement rate; triggers escalation on low agreement | Custom; configurable threshold per task type | High |
| Quality Benchmark Evaluator | Quality Control | Periodically evaluates MoA output quality vs. single-model baseline | Offline evaluation pipeline; human-labelled benchmark | High |

7. Data Flow

| Step | Actor | Action | Output |
| --- | --- | --- | --- |
| 1 | Caller | Submits high-stakes task: "Assess legal enforceability risk of clause 14 in contract C-4921" | Task with quality objective: "comprehensive legal risk assessment" |
| 2 | Fan-Out Dispatcher | Dispatches same task to 3 proposers concurrently | 3 parallel invocations (GPT-4o, Claude 3.5, Gemini 1.5) |
| 3 | Proposer A (GPT-4o) | Produces risk assessment | {risks: ["limitation of liability clause unenforceable under ACL §64"], severity: "high"} |
| 4 | Proposer B (Claude 3.5) | Produces risk assessment | {risks: ["limitation of liability clause unenforceable under ACL §64", "notice clause ambiguous"], severity: "high"} |
| 5 | Proposer C (Gemini 1.5) | Produces risk assessment | {risks: ["notice clause ambiguous"], severity: "medium"} |
| 6 | Result Collector | All 3 results received | Proposer outputs aggregated |
| 7 | Agreement Evaluator | ACL §64 finding: 2/3 agree (67%). Notice clause: 2/3 agree. Severity: 2/3 "high". | Agreement: 67% on key findings |
| 8 | Aggregation Engine (synthesis) | LLM synthesises: confirms ACL §64 (2/3) + notice clause (2/3); rates overall severity "high" | Synthesised result |
| 9 | Caller | Receives result with metadata: {final_assessment, agreement_rate: 0.67, proposer_outputs: [A, B, C]} | Final report |
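
The agreement figures in step 7 can be reproduced from the three proposer outputs in steps 3 to 5. The sketch below is illustrative only: the `summarise` function is a hypothetical name, and reporting the overall `agreement_rate` as the lowest per-item agreement is an assumption, since the source does not say how the single 0.67 figure is derived.

```python
from collections import Counter

ACL_64 = "limitation of liability clause unenforceable under ACL §64"
NOTICE = "notice clause ambiguous"

# Proposer outputs copied from steps 3-5 of the data flow.
proposer_outputs = {
    "A": {"risks": [ACL_64], "severity": "high"},
    "B": {"risks": [ACL_64, NOTICE], "severity": "high"},
    "C": {"risks": [NOTICE], "severity": "medium"},
}


def summarise(outputs: dict) -> dict:
    n = len(outputs)
    # Which proposers raised each finding.
    support = {}
    for name, out in outputs.items():
        for risk in out["risks"]:
            support.setdefault(risk, []).append(name)
    finding_agreement = {risk: len(names) / n for risk, names in support.items()}
    # Majority severity and the share of proposers behind it.
    severity, votes = Counter(o["severity"] for o in outputs.values()).most_common(1)[0]
    severity_agreement = votes / n
    # Assumption: overall rate is the weakest agreement across findings and severity.
    overall = min([*finding_agreement.values(), severity_agreement])
    return {
        "finding_support": support,
        "severity": severity,
        "agreement_rate": round(overall, 2),
        "proposer_outputs": outputs,
    }


summary = summarise(proposer_outputs)
```

For the table's data, both findings are supported by two of three proposers, the majority severity is "high", and `agreement_rate` comes out at 0.67, matching the metadata returned in step 9.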

Error Flow

| Error | Detection | Recovery |
| --- | --- | --- |
| Proposer failure (one model unavailable) | Partial failure detection | Return partial with 2/N proposers; flag reduced confidence |
| All proposers fail | Complete failure | Fallback to single-model with failure metadata |
| Low agreement triggers escalation | Agreement threshold | Route to human expert queue with all proposer outputs |
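
The first two rows of the error flow can be sketched with `asyncio.gather(..., return_exceptions=True)`, which lets one proposer fail without cancelling the others. The `collect` function, the status strings, and the stub proposers are hypothetical; what the caller does on `all_failed` (for example, the single-model fallback) is left outside the sketch.

```python
import asyncio


async def collect(task: str, proposers: dict) -> dict:
    """Run all proposers; tolerate individual failures instead of aborting."""
    names = list(proposers)
    results = await asyncio.gather(
        *(proposers[name](task) for name in names),
        return_exceptions=True,  # failures come back as exception objects
    )
    outputs = {n: r for n, r in zip(names, results) if not isinstance(r, BaseException)}
    failed = [n for n, r in zip(names, results) if isinstance(r, BaseException)]
    if not outputs:
        status = "all_failed"   # caller should fall back to a single model
    elif failed:
        status = "partial"      # continue with the proposers that answered
    else:
        status = "complete"
    return {
        "status": status,
        "outputs": outputs,
        "failed": failed,
        "reduced_confidence": bool(failed),
    }


async def healthy(task: str) -> str:
    return f"assessment of {task}"


async def unavailable(task: str) -> str:
    raise ConnectionError("provider unavailable")  # simulated outage


partial = asyncio.run(collect("clause 14", {"A": healthy, "B": healthy, "C": unavailable}))
total = asyncio.run(collect("clause 14", {"A": unavailable, "B": unavailable}))
```

In the partial case the result still carries two outputs, with `failed` naming the missing proposer so the reduced confidence can be surfaced to the caller.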

8. Security Considerations

Cross-Provider Data Sharing

  • In multi-provider MoA, task data is sent to multiple LLM provider APIs
  • Mitigation: Review data classification against each provider's data processing agreement; use single-provider MoA (different models or sampling configurations) for sensitive data that cannot be sent to multiple providers
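
A data classification gate of this kind can be sketched as a filter over the proposer pool. Everything named here is hypothetical: the classification labels, the provider names, and the approval table would come from the organisation's own data processing agreements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposerConfig:
    provider: str
    model: str
    temperature: float


# Hypothetical policy: which providers may receive each data classification.
APPROVED_PROVIDERS = {
    "public": {"provider_a", "provider_b", "provider_c"},
    "confidential": {"provider_a"},
}


def select_proposers(classification: str, pool: list) -> list:
    """Keep only proposers whose provider is approved for this classification.

    Unknown classifications fail closed rather than defaulting to the full pool.
    """
    if classification not in APPROVED_PROVIDERS:
        raise ValueError(f"no dispatch policy for classification {classification!r}")
    allowed = APPROVED_PROVIDERS[classification]
    return [p for p in pool if p.provider in allowed]


pool = [
    ProposerConfig("provider_a", "large", 0.2),
    ProposerConfig("provider_a", "large", 0.9),
    ProposerConfig("provider_a", "small", 0.2),
    ProposerConfig("provider_b", "large", 0.2),
    ProposerConfig("provider_c", "large", 0.2),
]
confidential_pool = select_proposers("confidential", pool)
```

For confidential data the pool collapses to three configurations on a single provider, which is the single-provider MoA described above: diversity then comes from model size and sampling temperature rather than from different vendors.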

OWASP LLM Top 10

| OWASP LLM Risk | MoA Applicability | Mitigation |
| --- | --- | --- |
| LLM01 Prompt Injection | Task input sent to N providers; single poisoned input affects all proposers similarly | Input sanitisation before dispatch; diversity does not defend against shared input injection |
| LLM09 Overreliance | High agreement creates false confidence in potentially shared errors | Agreement rate ≠ accuracy guarantee; quality benchmark required; human review for critical outputs |
| LLM04 Model DoS | N× API calls per task can exhaust rate limits | Rate limiting per provider; fan-out concurrency limits |
| LLM06 Sensitive Information | Task data sent to multiple providers | Data classification gate before multi-provider dispatch |
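
The fan-out concurrency limit listed for LLM04 can be sketched with one `asyncio.Semaphore` per provider. The `dispatch_with_limits` function, the provider name, and the limit of 2 are illustrative assumptions; the stub call records peak concurrency so the effect is visible.

```python
import asyncio


async def dispatch_with_limits(calls: list, limits: dict) -> list:
    """Run (provider, coroutine_function) pairs with a per-provider concurrency cap."""
    # Semaphores are created inside the running event loop.
    semaphores = {provider: asyncio.Semaphore(n) for provider, n in limits.items()}

    async def guarded(provider, call):
        async with semaphores[provider]:  # wait here if the provider is at its cap
            return await call()

    return await asyncio.gather(*(guarded(provider, call) for provider, call in calls))


# Demo: a stub call that tracks how many invocations are in flight at once.
stats = {"in_flight": 0, "peak": 0}


async def stub_call() -> str:
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)  # simulated provider latency
    stats["in_flight"] -= 1
    return "ok"


results = asyncio.run(
    dispatch_with_limits([("provider_a", stub_call)] * 5, {"provider_a": 2})
)
```

All five calls complete, but no more than two run against `provider_a` at any moment. A concurrency cap is not the same as a requests-per-minute limit; a token-bucket limiter would be needed for the latter.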

9. Governance Considerations

Multi-Provider Data Governance

  • For each provider used in the proposer pool, verify data processing agreements, data residency requirements, and model training opt-out options
  • For APRA-regulated data, confirm each provider is an approved supply chain entity

Governance Artefacts

| Artefact | Owner | Frequency | Purpose |
| --- | --- | --- | --- |
| Proposer Pool Configuration | AI Platform | On change; quarterly review | Documents which models/providers are in the proposer pool and their configuration |
| MoA Quality Benchmark | ML Engineering | Monthly | Compares MoA output quality vs. single-model baseline; validates investment |
| Provider Data Processing Agreements | Legal | On provider addition | Confirms data handling compliance for each proposer |
| Low-Agreement Escalation Log | Compliance | Per escalation event | Tracks tasks escalated due to low proposer agreement |

10. Operational Considerations

SLOs

| SLO | Target | Window | Alert |
| --- | --- | --- | --- |
| MoA completion rate (all proposers succeed) | ≥ 95% | 24-hour rolling | < 90% triggers P2 |
| Average proposer agreement rate | ≥ 60% (task-type dependent) | 24-hour rolling | Trending down triggers P3 review |
| Quality improvement over single-model baseline | ≥ 10% improvement on benchmark | Monthly eval | < 5% improvement questions MoA ROI |
| p95 wall-clock latency (parallel proposers) | ≤ slowest proposer p95 × 1.3 | 1-hour rolling | Exceeds 2× triggers P2 |
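
The completion-rate and latency rows translate into simple threshold checks. The sketch below is an assumption-laden illustration: `evaluate_slos` and the status strings are hypothetical, and "Exceeds 2×" is read here as twice the slowest proposer's p95, which the table does not state explicitly.

```python
def evaluate_slos(completion_rate: float, moa_p95_s: float, proposer_p95_s: dict) -> dict:
    """Classify two of the SLOs above as ok, target_missed, or P2."""
    # Completion rate: target >= 95%, P2 alert below 90%.
    if completion_rate >= 0.95:
        completion = "ok"
    elif completion_rate >= 0.90:
        completion = "target_missed"
    else:
        completion = "P2"

    # Latency: target <= 1.3x the slowest proposer's p95; P2 above 2x (assumed base).
    slowest = max(proposer_p95_s.values())
    if moa_p95_s <= 1.3 * slowest:
        latency = "ok"
    elif moa_p95_s <= 2 * slowest:
        latency = "target_missed"
    else:
        latency = "P2"

    return {"completion": completion, "latency": latency}


proposer_p95 = {"A": 8.0, "B": 10.0}  # seconds, illustrative values
healthy = evaluate_slos(0.97, 12.0, proposer_p95)
degraded = evaluate_slos(0.93, 15.0, proposer_p95)
paging = evaluate_slos(0.85, 25.0, proposer_p95)
```

With a slowest proposer p95 of 10 s, the latency target is 13 s and the P2 line is 20 s, so the three sample readings fall into the ok, target_missed, and P2 bands respectively.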

11. Cost Considerations

| Configuration | Proposers | Cost vs. Single Model | Quality Improvement |
| --- | --- | --- | --- |
| 2-proposer (same model, different temperature) | 2 | | +5–10% |
| 3-proposer (different providers) | 3 | 2.5–3× (provider price differences) | +10–20% |
| 3-proposer + discussion | 3 + 3 discussion | 4–5× | +15–25% |
| 4-proposer + LLM synthesis | 4 + 1 | 5–6× | +15–25% |

Optimisation Strategy

  • Reserve MoA for tasks where single-model error rate is unacceptably high (> 5% material error rate)
  • Use same-provider multi-temperature configuration as cheapest diversity strategy
  • Route routine tasks to single-model; route high-stakes tasks to MoA via Router/Dispatcher (EAAPL-WRK004)

12. Trade-Off Analysis

Option Quality Cost Latency Explainability Best For
A: 3-proposer + synthesis (Recommended for high-stakes) Very High High Medium Medium High-stakes outputs; justified by risk reduction
B: 2-proposer + voting High Low High Medium-stakes; need explainable agreement
C: Single model + reflection (EAAPL-AGT006) High Low Medium High Most production tasks
D: Best-of-N selection High Low High Candidate generation; when best individual suffices

13. Failure Modes

| Failure Mode | Likelihood | Impact | Detection | Recovery |
| --- | --- | --- | --- | --- |
| Proposer model downtime (provider outage) | Low–Medium | Medium: reduced diversity | Health check per provider | Complete task with available proposers; flag reduced confidence |
| Correlated errors across proposers (shared training data blind spot) | Medium | High: false consensus | Quality benchmark testing | Maximise architectural diversity (different providers, not just different models) |
| Aggregator synthesis introduces new errors | Low–Medium | Medium: synthesis degrades quality | Aggregation quality benchmark | Benchmark aggregator quality; fallback to voting if synthesis degrades |
| Low agreement always triggers escalation (poorly calibrated threshold) | Medium | Medium: unnecessary human escalation | Escalation rate monitoring | Tune agreement threshold per task type based on historical data |

14. Regulatory Considerations

EU AI Act

  • Art. 15 (Accuracy, Robustness and Cybersecurity): MoA supports the accuracy and robustness requirements for high-risk AI systems; the quality benchmark provides evidence of measurable improvement.

ISO 42001

  • §8.4: Multi-provider proposer pool requires supply chain AI governance; each provider must be assessed under the organisation's AI supplier framework.

NIST AI RMF

  • MEASURE 2.5: The MoA quality benchmark (measured improvement over single-model baseline) provides evidence for the AI performance measurement that this subcategory describes.

15. Reference Implementations

AWS

| Component | Service |
| --- | --- |
| Proposer Pool | Amazon Bedrock (Anthropic Claude + Amazon Nova + Mistral) via unified API |
| Fan-Out | AWS Step Functions Map state |
| Aggregation (LLM synthesis) | Amazon Bedrock (Claude 3.5 Sonnet) |
| Result Store | Amazon DynamoDB |

Azure

| Component | Service |
| --- | --- |
| Proposer Pool | Azure OpenAI (GPT-4o) + Azure AI Models (Llama, Mistral) |
| Fan-Out | Azure Durable Functions |
| Aggregation | Azure OpenAI synthesis call |

On-Premises

| Component | Technology |
| --- | --- |
| Proposer Pool | vLLM serving multiple models (Llama 3.1 70B, Mistral Large) + OpenAI API |
| Orchestration | Python asyncio; LangGraph multi-agent graph |

16. Related Patterns

| Pattern | ID | Relationship Type | Notes |
| --- | --- | --- | --- |
| Parallel Fan-Out/Fan-In | EAAPL-WRK003 | Base Pattern | MoA specialises fan-out/fan-in for quality improvement through model diversity |
| Multi-Agent Orchestration | EAAPL-MAG001 | Peer | Orchestration for coordinated multi-agent work; MoA for independent parallel quality improvement |
| Reflexive Agent | EAAPL-AGT006 | Alternative | Self-critique quality improvement; lower cost than MoA; MoA for higher stakes |
| Router/Dispatcher | EAAPL-WRK004 | Integrates With | Dispatcher routes high-stakes tasks to MoA; routine tasks to single model |

17. Maturity Assessment

Overall Maturity: Emerging

| Dimension | Score (1–5) | Evidence |
| --- | --- | --- |
| Research Foundation | 4 | MoA paper (Wang et al., 2024); ensemble LLM literature; strong academic evidence |
| Production Deployment | 3 | Early production deployments in research/legal tools; enterprise adoption starting |
| Framework Support | 2 | Custom implementations common; no dominant framework abstraction yet |
| Cost Optimisation | 3 | Provider pricing dynamics evolving; ROI models maturing |
| Aggregation Tooling | 2 | Custom synthesis aggregators common; no standard aggregation library |

18. Revision History

| Version | Date | Author | Changes |
| --- | --- | --- | --- |
| 1.0 | 2025-06-13 | Architecture Board | Initial publication in Agentic Workflows category |