[EAAPL-WRK008] Mixture of Agents

Category: Agentic Workflows
Sub-category: Ensemble Architecture
Version: 1.0
Maturity: Emerging
Tags: mixture-of-agents, ensemble, consensus, multi-model, voting, aggregation, diverse-sampling
Regulatory Relevance: ISO 42001 §8.4, EU AI Act (Art. 15), NIST AI RMF (MEASURE 2.5)

1. Executive Summary

The Mixture of Agents (MoA) pattern defines an architecture in which multiple independent agents (potentially using different LLM providers, models, or sampling configurations) produce independent outputs for the same task, and a dedicated aggregator synthesises a final result from these diverse outputs. Unlike Fan-Out/Fan-In (EAAPL-WRK003), which fans out for coverage or throughput, MoA fans out to improve quality through diversity: different models make different errors, so a well-calibrated aggregator can produce a result that outperforms any individual proposer. Published benchmarks report quality improvements of roughly 5–20% over single-model approaches on complex reasoning and analysis tasks; the actual gain depends on the task and should be confirmed against the organisation's own quality benchmark.

For CIO/CTO audiences: this is the AI equivalent of getting multiple independent expert opinions before making an important decision. A law firm, a medical panel, and a financial advisory board all use this approach, not because any individual expert is wrong, but because independent experts catch different errors and the consensus or synthesis is more reliable than any single view. Cost scales with the number of agents (three proposers cost roughly 3× a single call, before any aggregation call), so MoA is reserved for tasks where the quality improvement justifies the cost: high-stakes decisions, regulated analyses, and outputs that will not be reviewed by a human expert.

2. Problem Statement

Business Problem

High-stakes enterprise decisions — legal risk assessments, clinical summaries, regulatory compliance determinations, executive briefings — require output reliability that a single LLM invocation cannot guarantee. The error rate of any individual model, even the most capable available, is non-zero and the failure mode is invisible: the model produces a confident, plausible-sounding output that contains material errors.

Technical Problem

Single-model outputs exhibit correlated errors: if the model makes a mistake in a particular domain or reasoning pattern, it makes the same mistake consistently. There is no mechanism within a single inference call to detect and correct errors that are within the model's systematic blind spots. Reflection (EAAPL-AGT006) addresses some errors but shares the same model's limitations in the critique phase.

Symptoms of Absence

High-stakes outputs require expensive human expert review because single-model reliability is insufficient
No diversity in output generation; all inference calls to the same model are correlated
Quality ceiling is the individual model's capability, with no ensemble improvement possible
Single-model failure modes are invisible until outputs are reviewed

Cost of Inaction

Quality Risk: Single-model errors in high-stakes outputs create compliance and liability exposure
Human Review Bottleneck: Expert review of every output is the only quality gate, creating a throughput bottleneck
Opportunity Cost: Peers that reserve ensemble approaches for high-value tasks can achieve better quality on those tasks while containing the added inference cost

3. Context

When to Apply

Task output quality has material financial, legal, or safety consequences
The task has an articulable quality benchmark against which improvement can be measured
The additional cost (N× model inference) is justified by quality improvement and risk reduction
Multiple capable models are available (different providers or model families)
Aggregation strategy is clearly defined and produces consistently better results than individual outputs

When NOT to Apply

High-volume, low-stakes tasks where N× cost is not justified
Tasks requiring real-time responses with hard latency constraints
Tasks where all available models have the same systematic blind spots (diversity is the source of benefit)
Tasks that are too subjective to define a quality improvement metric

Prerequisites

Access to multiple independent LLM models/providers
Defined aggregation strategy (voting, synthesis, best-of-N)
Quality benchmark for measuring MoA improvement over single-model baseline
Fan-Out infrastructure (EAAPL-WRK003) for parallel worker execution

Industry Applicability

Industry	MoA Use Case	Quality Benefit
Legal	Contract risk assessment	Different models identify different risk clauses; synthesis is more comprehensive
Financial Services	Analyst report generation	Diverse reasoning catches analytical blind spots; synthesis is more balanced
Healthcare	Clinical decision support	Different models apply different clinical guidelines; synthesis reconciles differences
Government	Policy impact assessment	Multi-model diversity catches different stakeholder implications
Cybersecurity	Threat analysis	Different models identify different attack vectors; union analysis is more comprehensive

4. Architecture Overview

The MoA architecture has three layers: parallel proposers, optional discussion/critique, and an aggregator.

Proposer Layer Multiple proposer agents independently process the same task input and produce independent output proposals. The diversity between proposers is the source of quality improvement. Diversity can be achieved through: (a) different LLM providers (GPT-4o + Claude 3.5 Sonnet + Gemini 1.5 Pro), (b) different model sizes from the same provider, (c) the same model with different sampling temperatures, (d) the same model with different system prompts (each with a different expert persona), or (e) the same model with the input material presented in a different order within the context window (to reduce position bias).
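The proposer layer can be sketched as a concurrent fan-out, assuming Python asyncio (listed under the on-premises reference implementation). The names `ProposerFn`, `Proposal`, `make_stub_proposer`, and `run_proposers` are hypothetical, and the stub proposers stand in for real provider client calls, which this pattern does not prescribe.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Dict, List

# Hypothetical type: a proposer is any async callable from task text to output text.
ProposerFn = Callable[[str], Awaitable[str]]


@dataclass
class Proposal:
    proposer: str
    output: str


def make_stub_proposer(name: str, reply: str) -> ProposerFn:
    """Stand-in for a real provider client call (an assumption, not a real API)."""
    async def propose(task: str) -> str:
        await asyncio.sleep(0)  # yield control, as a network call would
        return f"{reply} [{name}]"
    return propose


async def run_proposers(task: str, proposers: Dict[str, ProposerFn]) -> List[Proposal]:
    """Send the same task to every proposer concurrently and collect the proposals."""
    names = list(proposers)
    outputs = await asyncio.gather(*(proposers[n](task) for n in names))
    return [Proposal(n, o) for n, o in zip(names, outputs)]


if __name__ == "__main__":
    pool = {
        "proposer_a": make_stub_proposer("proposer_a", "risk: high"),
        "proposer_b": make_stub_proposer("proposer_b", "risk: high"),
        "proposer_c": make_stub_proposer("proposer_c", "risk: medium"),
    }
    for proposal in asyncio.run(run_proposers("Assess clause 14", pool)):
        print(proposal.proposer, "->", proposal.output)
```

Because the proposers run concurrently, wall-clock latency is bounded by the slowest proposer rather than by the sum of all calls.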

Optional Discussion Phase In higher-quality configurations, proposers are shown each other's outputs and produce a revised proposal that takes account of what others observed. This multi-round "discussion" mimics the expert panel model and can improve consensus quality. It adds latency and cost (each proposer makes an additional inference call per discussion round).
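One way to picture a discussion round is as a second prompt per proposer that includes the peers' first-round outputs. The sketch below is illustrative only: the prompt wording and the function names `build_revision_prompt`, `discussion_prompts`, and `total_inference_calls` are assumptions, not part of the pattern.

```python
from typing import Dict, List


def build_revision_prompt(task: str, own: str, peers: List[str]) -> str:
    """Build the prompt one proposer receives in a discussion round."""
    peer_block = "\n".join(f"- Peer {i + 1}: {p}" for i, p in enumerate(peers))
    return (
        f"Task: {task}\n"
        f"Your earlier proposal: {own}\n"
        f"Proposals from other agents:\n{peer_block}\n"
        "Revise your proposal, keeping what you still believe is correct."
    )


def discussion_prompts(task: str, proposals: Dict[str, str]) -> Dict[str, str]:
    """For each proposer, show it every proposal except its own."""
    return {
        name: build_revision_prompt(
            task, own, [out for other, out in proposals.items() if other != name]
        )
        for name, own in proposals.items()
    }


def total_inference_calls(n_proposers: int, discussion_rounds: int) -> int:
    """Each proposer makes one initial call plus one call per discussion round."""
    return n_proposers * (1 + discussion_rounds)


if __name__ == "__main__":
    first_round = {"A": "severity high", "B": "severity high", "C": "severity medium"}
    print(discussion_prompts("Assess clause 14", first_round)["C"])
    print(total_inference_calls(n_proposers=3, discussion_rounds=1))
```

The call count makes the cost of discussion explicit: one round doubles the number of proposer calls, and each revision prompt is longer than the original because it carries the peers' outputs.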

Aggregation Layer The Aggregator receives all proposer outputs and produces a final synthesised result. Aggregation strategies:

Majority Voting: For categorical decisions (classify this document as "high/medium/low risk"), take the majority vote. Simple, deterministic, explainable.
Weighted Voting: Weight votes by each proposer's historical accuracy on this task type. Requires tracked historical accuracy.
Union Aggregation: For coverage-oriented tasks (find all risk flags), take the union of all proposer findings. Each finding attributed to the proposer(s) that identified it.
LLM Synthesis: An aggregator LLM receives all proposer outputs and synthesises a unified final output. Highest quality, adds cost and latency. The aggregator is instructed to identify agreement, reconcile disagreements, and explain reasoning.
Best-of-N Selection: Score all proposers' outputs on a quality rubric and return the highest-scoring output. Preserves full fidelity of the best output; does not synthesise.
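The deterministic strategies above (majority voting, weighted voting, and best-of-N selection) are small enough to sketch directly. This is a minimal illustration under stated assumptions: the function names are hypothetical, the accuracy weights are made-up values, and the scoring function passed to `best_of_n` stands in for whatever quality rubric an implementation uses.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, Optional, Tuple


def majority_vote(votes: Dict[str, str]) -> Tuple[Optional[str], float]:
    """Return (winning label, its vote share); the label is None without a strict majority."""
    counts = Counter(votes.values())
    label, n = counts.most_common(1)[0]
    share = n / len(votes)
    return (label if n * 2 > len(votes) else None), share


def weighted_vote(votes: Dict[str, str], accuracy: Dict[str, float]) -> str:
    """Weight each proposer's vote by its tracked historical accuracy."""
    totals: Dict[str, float] = defaultdict(float)
    for proposer, label in votes.items():
        totals[label] += accuracy[proposer]
    # Sorting first makes ties resolve deterministically (alphabetically first label).
    return max(sorted(totals), key=lambda label: totals[label])


def best_of_n(outputs: Dict[str, str], score: Callable[[str], float]) -> Tuple[str, str]:
    """Return (proposer, output) for the highest-scoring output, unmodified."""
    winner = max(outputs, key=lambda name: score(outputs[name]))
    return winner, outputs[winner]


if __name__ == "__main__":
    votes = {"A": "high", "B": "high", "C": "medium"}
    print(majority_vote(votes))
    # Hypothetical accuracy weights: a strong minority proposer can outweigh two weak ones.
    print(weighted_vote(votes, {"A": 0.4, "B": 0.4, "C": 0.9}))
    print(best_of_n({"A": "short", "B": "a longer answer"}, score=len))
```

Note that weighted voting can overturn a simple majority, which is why it depends on accuracy figures that are actually tracked per task type rather than assumed.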

Confidence and Agreement Reporting The aggregation result includes metadata: the number of proposers that agreed (for voting strategies), the agreement rate (for structured outputs), and individual proposer outputs (for audit and transparency). Low agreement (e.g., < 2/3 proposers agree) is a signal that the task is genuinely ambiguous or that the proposers have conflicting knowledge, and this signal should be reported to the caller and, for high-stakes tasks, escalated to human review.
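The agreement metadata and the escalation signal described above can be sketched as a small result type. The 2/3 threshold mirrors the example in the text; `AggregationResult` and `evaluate_agreement` are hypothetical names, and exact string equality is a simplifying assumption that suits categorical outputs only.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import Dict


@dataclass
class AggregationResult:
    final: str
    agreeing: int
    total: int
    proposer_outputs: Dict[str, str]  # kept for audit and transparency
    escalate: bool

    @property
    def agreement_rate(self) -> float:
        return self.agreeing / self.total


def evaluate_agreement(
    final: str,
    proposer_outputs: Dict[str, str],
    threshold: Fraction = Fraction(2, 3),
) -> AggregationResult:
    """Count proposers matching the final answer and flag low agreement."""
    total = len(proposer_outputs)
    if total == 0:
        raise ValueError("at least one proposer output is required")
    agreeing = sum(1 for out in proposer_outputs.values() if out == final)
    # Fractions avoid floating-point surprises when comparing against 2/3.
    escalate = Fraction(agreeing, total) < threshold
    return AggregationResult(final, agreeing, total, dict(proposer_outputs), escalate)


if __name__ == "__main__":
    result = evaluate_agreement("high", {"A": "high", "B": "high", "C": "medium"})
    print(result.agreeing, "of", result.total, "agree; escalate =", result.escalate)
```

With this threshold, two of three proposers agreeing is exactly at the limit and does not escalate, while one of three does.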

5. Architecture Diagram


flowchart TD
    subgraph Input["Task Input"]
        A[Task + Quality Objective]
    end
    subgraph Proposers["Proposer Layer - Parallel Execution"]
        P1[Proposer A]
        P2[Proposer B]
        P3[Proposer C]
        PN[Proposer N]
    end
    subgraph Discussion["Optional Discussion Phase"]
        D[Peer Review + Revise]
    end
    subgraph Aggregation["Aggregation Layer"]
        C[Result Collector]
        E{Aggregation Strategy}
        F[Voting / Union]
        G[LLM Synthesis]
        H{Agreement Threshold?}
    end
    subgraph Output["Output"]
        I[Final Result]
        J[Low Agreement Escalation]
    end
    A --> P1 & P2 & P3 & PN
    P1 & P2 & P3 & PN --> D
    D --> C
    C --> E
    E -->|deterministic| F
    E -->|synthesis| G
    F --> H
    G --> H
    H -->|sufficient| I
    H -->|low agreement| J

6. Components

Component	Type	Responsibility	Technology Options	Criticality
Proposer Pool	AI Components	N independent agents producing independent proposals	Different LLM providers; different models; different personas	Critical
Fan-Out Dispatcher	Orchestration	Submits all proposer tasks concurrently	EAAPL-WRK003 implementation	Critical
Discussion Coordinator	AI Orchestration	Optional: routes peer outputs to each proposer for revision	Custom multi-round coordination loop	Medium
Result Collector	State	Waits for all proposer completions; handles partial failures	Async futures; Step Functions; Durable Functions	Critical
Aggregation Engine	AI/Logic Component	Applies chosen aggregation strategy	Custom voting/union logic; LLM synthesis call	Critical
Agreement Evaluator	Logic Component	Computes agreement rate; triggers escalation on low agreement	Custom; configurable threshold per task type	High
Quality Benchmark Evaluator	Quality Control	Periodically evaluates MoA output quality vs. single-model baseline	Offline evaluation pipeline; human labelled benchmark	High

7. Data Flow

Step	Actor	Action	Output
1	Caller	Submits high-stakes task: "Assess legal enforceability risk of clause 14 in contract C-4921"	Task with quality objective: "comprehensive legal risk assessment"
2	Fan-Out Dispatcher	Dispatches same task to 3 proposers concurrently	3 parallel invocations (GPT-4o, Claude 3.5, Gemini 1.5)
3	Proposer A (GPT-4o)	Produces risk assessment	`{risks: ["limitation of liability clause unenforceable under ACL §64"], severity: "high"}`
4	Proposer B (Claude 3.5)	Produces risk assessment	`{risks: ["limitation of liability clause unenforceable under ACL §64", "notice clause ambiguous"], severity: "high"}`
5	Proposer C (Gemini 1.5)	Produces risk assessment	`{risks: ["notice clause ambiguous"], severity: "medium"}`
6	Result Collector	All 3 results received	Proposer outputs collected and passed on for agreement evaluation
7	Agreement Evaluator	ACL §64 finding: 2/3 agree (67%). Notice clause: 2/3 agree. Severity: 2/3 "high".	Agreement: 67% on key findings
8	Aggregation Engine (synthesis)	LLM synthesises: confirms ACL §64 (2/3) + notice clause (2/3); rates overall severity "high"	Synthesised result
9	Caller	Receives result with metadata: `{final_assessment, agreement_rate: 0.67, proposer_outputs: [A, B, C]}`	Final report
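The agreement figures in steps 7 to 9 can be reproduced from the three proposer outputs in steps 3 to 5. The sketch below applies union aggregation with attribution; it assumes findings match by exact string, whereas a real system would need to normalise or semantically match differently worded findings. The function name `finding_agreement` is hypothetical.

```python
from collections import Counter, defaultdict
from typing import Any, Dict, List

ACL = "limitation of liability clause unenforceable under ACL §64"
NOTICE = "notice clause ambiguous"

# Proposer outputs copied from steps 3-5 of the data flow.
proposer_outputs: Dict[str, Dict[str, Any]] = {
    "A": {"risks": [ACL], "severity": "high"},
    "B": {"risks": [ACL, NOTICE], "severity": "high"},
    "C": {"risks": [NOTICE], "severity": "medium"},
}


def finding_agreement(outputs: Dict[str, Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Union of all findings, each attributed to the proposers that raised it."""
    support: Dict[str, List[str]] = defaultdict(list)
    for name, out in outputs.items():
        for risk in out["risks"]:
            support[risk].append(name)
    n = len(outputs)
    return {
        risk: {"proposers": names, "agreement": round(len(names) / n, 2)}
        for risk, names in support.items()
    }


if __name__ == "__main__":
    for risk, info in finding_agreement(proposer_outputs).items():
        print(f"{info['agreement']:.2f} {info['proposers']} {risk}")
    severity_votes = Counter(out["severity"] for out in proposer_outputs.values())
    print("severity votes:", dict(severity_votes))
```

Each finding is supported by two of three proposers (0.67), and "high" wins the severity vote two to one, matching the `agreement_rate: 0.67` returned to the caller.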

Error Flow

Error	Detection	Recovery
Proposer failure (one model unavailable)	Partial failure detection	Return a partial result from the remaining proposers (e.g., 2 of 3); flag reduced confidence
All proposers fail	Complete failure	Fall back to a designated single model and attach failure metadata
Low agreement triggers escalation	Agreement threshold	Route to human expert queue with all proposer outputs

8. Security Considerations

Cross-Provider Data Sharing

In multi-provider MoA, task data is sent to multiple LLM provider APIs
Mitigation: Review data classification against each provider's data processing agreement; use single-provider MoA (different models or sampling configurations) for sensitive data that cannot be sent to multiple providers
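The mitigation above amounts to a gate that filters the proposer pool by data classification before dispatch. In this sketch the classification levels, the provider names, and the approval mapping are all hypothetical placeholders; real values would come from the organisation's data classification policy and each provider's data processing agreement.

```python
from typing import Dict, List

# Hypothetical classification scheme, ordered from least to most sensitive.
CLASSIFICATION_RANK: Dict[str, int] = {
    "public": 0,
    "internal": 1,
    "confidential": 2,
    "restricted": 3,
}

# Hypothetical mapping: highest classification each provider is approved to receive.
PROVIDER_MAX_CLASSIFICATION: Dict[str, str] = {
    "provider_a": "confidential",
    "provider_b": "internal",
    "in_house": "restricted",
}


def allowed_providers(data_classification: str, provider_max: Dict[str, str]) -> List[str]:
    """Return providers approved to receive data at this classification level."""
    level = CLASSIFICATION_RANK[data_classification]
    return [p for p, mx in provider_max.items() if CLASSIFICATION_RANK[mx] >= level]


if __name__ == "__main__":
    for level in ("internal", "confidential", "restricted"):
        print(level, "->", allowed_providers(level, PROVIDER_MAX_CLASSIFICATION))
```

When the gate leaves only one provider, the pool can fall back to single-provider MoA (different models or sampling configurations), as the mitigation describes.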

OWASP LLM Top 10

OWASP LLM Risk	MoA Applicability	Mitigation
LLM01 Prompt Injection	Task input is sent to N providers; a single poisoned input affects all proposers similarly	Sanitise input before dispatch; proposer diversity does not defend against injection in the shared input
LLM04 Model Denial of Service	N× API calls per task can exhaust rate limits	Rate limiting per provider; fan-out concurrency limits
LLM06 Sensitive Information Disclosure	Task data is sent to multiple providers	Data classification gate before multi-provider dispatch
LLM09 Overreliance	High agreement creates false confidence in potentially shared errors	Treat agreement rate as a signal, not an accuracy guarantee; require a quality benchmark; human review for critical outputs

9. Governance Considerations

Multi-Provider Data Governance

For each provider used in the proposer pool, verify data processing agreements, data residency requirements, and model training opt-out options
For APRA-regulated data, confirm each provider is an approved supply chain entity

Governance Artefacts

Artefact	Owner	Frequency	Purpose
Proposer Pool Configuration	AI Platform	On change; quarterly review	Documents which models/providers are in the proposer pool and their configuration
MoA Quality Benchmark	ML Engineering	Monthly	Compares MoA output quality vs. single-model baseline; validates investment
Provider Data Processing Agreements	Legal	On provider addition	Confirms data handling compliance for each proposer
Low-Agreement Escalation Log	Compliance	Per escalation event	Tracks tasks escalated due to low proposer agreement

10. Operational Considerations

SLOs

SLO	Target	Window	Alert
MoA completion rate (all proposers succeed)	≥ 95%	24-hour rolling	< 90% triggers P2
Average proposer agreement rate	≥ 60% (task-type dependent)	24-hour rolling	Trending down triggers P3 review
Quality improvement over single-model baseline	≥ 10% improvement on benchmark	Monthly eval	< 5% improvement questions MoA ROI
p95 wall-clock latency (parallel proposers)	≤ 1.3 × slowest proposer's p95	1-hour rolling	Exceeding 2 × slowest proposer's p95 triggers P2
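The latency SLO is relative: it compares end-to-end MoA latency with the slowest proposer's own p95. A minimal sketch of that check follows, with the 1.3× target and 2× alert factor taken from the table; the function name and the status labels are assumptions.

```python
def latency_slo_status(moa_p95_s: float, slowest_proposer_p95_s: float) -> str:
    """Classify MoA p95 latency relative to the slowest proposer's p95."""
    if slowest_proposer_p95_s <= 0:
        raise ValueError("slowest proposer p95 must be positive")
    ratio = moa_p95_s / slowest_proposer_p95_s
    if ratio > 2.0:
        return "p2_alert"    # exceeds 2x the slowest proposer's p95
    if ratio > 1.3:
        return "slo_breach"  # over target, but below the alert level
    return "ok"


if __name__ == "__main__":
    for moa_p95 in (12.0, 15.0, 21.0):
        print(moa_p95, latency_slo_status(moa_p95, slowest_proposer_p95_s=10.0))
```

The margin above 1.0× is the overhead budget for dispatch, result collection, and aggregation on top of the slowest parallel call.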

11. Cost Considerations

Configuration	Proposers	Cost vs. Single Model	Quality Improvement
2-proposer (same model, different temperature)	2	2×	+5–10%
3-proposer (different providers)	3	2.5–3× (provider price differences)	+10–20%
3-proposer + discussion	3 + 3 discussion	4–5×	+15–25%
4-proposer + LLM synthesis	4 + 1	5–6×	+15–25%
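A rough way to estimate the cost multiplier for a configuration is to sum the per-call costs and divide by the single-model baseline. The sketch below is a simplification: the relative prices are made-up values, `cost_multiplier` is a hypothetical helper, and it ignores the longer prompts that discussion and synthesis calls carry.

```python
from typing import Sequence


def cost_multiplier(
    proposer_costs: Sequence[float],
    baseline_cost: float,
    synthesis_cost: float = 0.0,
) -> float:
    """Total cost of one MoA task relative to a single baseline model call."""
    if baseline_cost <= 0:
        raise ValueError("baseline cost must be positive")
    return (sum(proposer_costs) + synthesis_cost) / baseline_cost


if __name__ == "__main__":
    # Two calls to the baseline model at different temperatures.
    print(round(cost_multiplier([1.0, 1.0], baseline_cost=1.0), 2))
    # Three providers with hypothetical relative prices; cheaper providers pull the total below 3x.
    print(round(cost_multiplier([1.0, 0.9, 0.7], baseline_cost=1.0), 2))
    # Four proposers plus one LLM synthesis call.
    print(round(cost_multiplier([1.0] * 4, baseline_cost=1.0, synthesis_cost=1.0), 2))
```

Provider price differences are what bring a three-provider pool below a flat 3×, which is consistent with the 2.5–3× range in the table.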

Optimisation Strategy

Reserve MoA for tasks where single-model error rate is unacceptably high (> 5% material error rate)
Use same-provider multi-temperature configuration as cheapest diversity strategy
Route routine tasks to single-model; route high-stakes tasks to MoA via Router/Dispatcher (EAAPL-WRK004)
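The routing rule above can be expressed as a simple predicate. This sketch combines the two criteria from the list (high stakes, and a single-model material error rate above 5%) with a logical AND, which is an assumption; an organisation may reasonably choose to route on either criterion alone. The function name and the stakes labels are hypothetical.

```python
def choose_path(stakes: str, material_error_rate: float, threshold: float = 0.05) -> str:
    """Route a task to MoA only when it is high-stakes and single-model errors are too frequent."""
    if stakes == "high" and material_error_rate > threshold:
        return "moa"
    return "single_model"


if __name__ == "__main__":
    print(choose_path("high", 0.08))     # high stakes, error rate above threshold
    print(choose_path("high", 0.02))     # high stakes, but single model is reliable enough
    print(choose_path("routine", 0.08))  # routine task stays on the single model
```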

12. Trade-Off Analysis

Option	Quality	Cost	Latency	Explainability	Best For
A: 3-proposer + synthesis (Recommended for high-stakes)	Very High	High	Medium	Medium	High-stakes outputs; justified by risk reduction
B: 2-proposer + voting	High	2×	Low	High	Medium-stakes; need explainable agreement (two proposers can only agree or split, so a split needs a tie-break or escalation)
C: Single model + reflection (EAAPL-AGT006)	High	Low	Medium	High	Most production tasks
D: Best-of-N selection	High	N×	Low	High	Candidate generation; when best individual suffices

13. Failure Modes

Failure Mode	Likelihood	Impact	Detection	Recovery
Proposer model downtime (provider outage)	Low–Medium	Medium: reduced diversity	Health check per provider	Complete task with available proposers; flag reduced confidence
Correlated errors across proposers (shared training data blind spot)	Medium	High: false consensus	Quality benchmark testing	Maximise architectural diversity (different providers, not just different models)
Aggregator synthesis introduces new errors	Low–Medium	Medium: synthesis degrades quality	Aggregation quality benchmark	Benchmark aggregator quality; fall back to voting if synthesis degrades
Low agreement triggers escalation too often (poorly calibrated threshold)	Medium	Medium: unnecessary human escalation	Escalation rate monitoring	Tune agreement threshold per task type based on historical data

14. Regulatory Considerations

EU AI Act

Art. 15 (Accuracy, Robustness and Cybersecurity): MoA supports the accuracy and robustness requirements for high-risk AI systems; the quality benchmark provides measurable evidence of improvement over a single-model baseline. MoA alone does not establish compliance.

ISO 42001

§8.4: Multi-provider proposer pool requires supply chain AI governance; each provider must be assessed under the organisation's AI supplier framework.

NIST AI RMF

MEASURE 2.5: The MoA quality benchmark (measured improvement over single-model baseline) supports the requirement to demonstrate that the deployed AI system is valid and reliable.

15. Reference Implementations

AWS

Component	Service
Proposer Pool	Amazon Bedrock (Anthropic Claude + Amazon Nova + Mistral) via unified API
Fan-Out	AWS Step Functions Map state
Aggregation (LLM synthesis)	Amazon Bedrock (Claude 3.5 Sonnet)
Result Store	Amazon DynamoDB

Azure

Component	Service
Proposer Pool	Azure OpenAI (GPT-4o) + Azure AI Models (Llama, Mistral)
Fan-Out	Azure Durable Functions
Aggregation	Azure OpenAI synthesis call

On-Premises

Component	Technology
Proposer Pool	vLLM serving multiple models (Llama 3.1 70B, Mistral Large) + OpenAI API
Orchestration	Python asyncio; LangGraph multi-agent graph

16. Related Patterns

Pattern	ID	Relationship Type	Notes
Parallel Fan-Out/Fan-In	EAAPL-WRK003	Base Pattern	MoA specialises fan-out/fan-in for quality improvement through model diversity
Multi-Agent Orchestration	EAAPL-MAG001	Peer	Orchestration for coordinated multi-agent work; MoA for independent parallel quality improvement
Reflexive Agent	EAAPL-AGT006	Alternative	Self-critique quality improvement; lower cost than MoA; MoA for higher stakes
Router/Dispatcher	EAAPL-WRK004	Integrates With	Dispatcher routes high-stakes tasks to MoA; routine tasks to single model

17. Maturity Assessment

Overall Maturity: Emerging

Dimension	Score (1–5)	Evidence
Research Foundation	4	MoA paper (Wang et al., 2024); ensemble LLM literature; strong academic evidence
Production Deployment	3	Early production deployments in research/legal tools; enterprise adoption starting
Framework Support	2	Custom implementations common; no dominant framework abstraction yet
Cost Optimisation	3	Provider pricing dynamics evolving; ROI models maturing
Aggregation Tooling	2	Custom synthesis aggregators common; no standard aggregation library

18. Revision History

Version	Date	Author	Changes
1.0	2025-06-13	Architecture Board	Initial publication in Agentic Workflows category
