[EAAPL-WRK008] Mixture of Agents
Category: Agentic Workflows
Sub-category: Ensemble Architecture
Version: 1.0
Maturity: Emerging
Tags: mixture-of-agents, ensemble, consensus, multi-model, voting, aggregation, diverse-sampling
Regulatory Relevance: ISO 42001 §8.4, EU AI Act (Art. 15), NIST AI RMF (MEASURE 2.5)
1. Executive Summary
The Mixture of Agents (MoA) pattern defines an architecture in which multiple independent agents (potentially using different LLM providers, models, or sampling configurations) each produce an output for the same task, and a dedicated aggregator synthesises a final result from these diverse outputs. Unlike Fan-Out/Fan-In (EAAPL-WRK003), which fans out for coverage or throughput, MoA fans out for quality improvement through diversity: different models make different errors, so a well-calibrated aggregation can produce a result that outperforms any individual model. Published benchmarks report 5–20% quality improvement over single-model approaches on complex reasoning and analysis tasks.
For CIO/CTO audiences: this is the AI equivalent of getting multiple independent expert opinions before making an important decision. A law firm, a medical panel, and a financial advisory board all use this approach, not because any individual expert is wrong, but because independent experts catch different errors and the consensus or synthesis is more reliable than any individual view. Cost scales with the number of agents (3 agents cost roughly 3× as much as 1), so MoA is reserved for tasks where the quality improvement justifies the cost: high-stakes decisions, regulated analyses, and outputs that will not be reviewed by a human expert.
2. Problem Statement
Business Problem
High-stakes enterprise decisions — legal risk assessments, clinical summaries, regulatory compliance determinations, executive briefings — require output reliability that a single LLM invocation cannot guarantee. The error rate of any individual model, even the most capable available, is non-zero and the failure mode is invisible: the model produces a confident, plausible-sounding output that contains material errors.
Technical Problem
Single-model outputs exhibit correlated errors: if the model makes a mistake in a particular domain or reasoning pattern, it makes the same mistake consistently. There is no mechanism within a single inference call to detect and correct errors that are within the model's systematic blind spots. Reflection (EAAPL-AGT006) addresses some errors but shares the same model's limitations in the critique phase.
Symptoms of Absence
- High-stakes outputs require expensive human expert review because single-model reliability is insufficient
- No diversity in output generation; all inference calls to the same model are correlated
- Quality ceiling is the individual model's capability, with no ensemble improvement possible
- Single-model failure modes are invisible until outputs are reviewed
Cost of Inaction
- Quality Risk: Single-model errors in high-stakes outputs create compliance and liability exposure
- Human Review Bottleneck: Expert review of every output is the only quality gate, creating a throughput bottleneck
- Opportunity: Peers using ensemble approaches achieve demonstrably better quality without proportionally higher cost for high-value tasks
3. Context
When to Apply
- Task output quality has material financial, legal, or safety consequences
- The task has an articulable quality benchmark against which improvement can be measured
- The additional cost (N× model inference) is justified by quality improvement and risk reduction
- Multiple capable models are available (different providers or model families)
- Aggregation strategy is clearly defined and produces consistently better results than individual outputs
When NOT to Apply
- High-volume, low-stakes tasks where N× cost is not justified
- Tasks requiring real-time responses with hard latency constraints
- Tasks where all available models have the same systematic blind spots (diversity is the source of benefit)
- Tasks that are too subjective to define a quality improvement metric
Prerequisites
- Access to multiple independent LLM models/providers
- Defined aggregation strategy (voting, synthesis, best-of-N)
- Quality benchmark for measuring MoA improvement over single-model baseline
- Fan-Out infrastructure (EAAPL-WRK003) for parallel worker execution
Industry Applicability
| Industry | MoA Use Case | Quality Benefit |
|---|---|---|
| Legal | Contract risk assessment | Different models identify different risk clauses; synthesis is more comprehensive |
| Financial Services | Analyst report generation | Diverse reasoning catches analytical blind spots; synthesis is more balanced |
| Healthcare | Clinical decision support | Different models apply different clinical guidelines; synthesis reconciles differences |
| Government | Policy impact assessment | Multi-model diversity catches different stakeholder implications |
| Cybersecurity | Threat analysis | Different models identify different attack vectors; union analysis is more comprehensive |
4. Architecture Overview
The MoA architecture has three layers: parallel proposers, optional discussion/critique, and an aggregator.
Proposer Layer
Multiple proposer agents independently process the same task input and produce independent output proposals. The diversity between proposers is the source of quality improvement. Diversity can be achieved through: (a) different LLM providers (GPT-4o + Claude 3.5 Sonnet + Gemini 1.5 Pro), (b) different model sizes from the same provider, (c) the same model with different sampling temperatures, (d) the same model with different system prompts (each with a different expert persona), or (e) the same model with the input context presented in a different order (to reduce position bias).
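The proposer layer can be sketched as a concurrent fan-out over a pool of interchangeable callables. This is a minimal illustration, not a reference implementation: `Proposal`, `make_stub_proposer`, and `run_proposers` are hypothetical names, and the stub proposers stand in for real provider clients so that the example runs without network access.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Dict, List

# A proposer is any async callable mapping a task string to a proposal string.
Proposer = Callable[[str], Awaitable[str]]


@dataclass
class Proposal:
    proposer: str
    output: str


def make_stub_proposer(persona: str) -> Proposer:
    """Hypothetical stand-in for a real model call (no network access)."""
    async def propose(task: str) -> str:
        await asyncio.sleep(0)  # yield control, as a real API call would
        return f"[{persona}] assessment of: {task}"
    return propose


async def run_proposers(task: str, pool: Dict[str, Proposer]) -> List[Proposal]:
    """Send the same task to every proposer concurrently; results keep pool order."""
    names = list(pool)
    outputs = await asyncio.gather(*(pool[name](task) for name in names))
    return [Proposal(name, out) for name, out in zip(names, outputs)]


# Diversity here comes from personas (option d); real clients could supply (a)-(c).
pool = {
    "A": make_stub_proposer("contracts lawyer"),
    "B": make_stub_proposer("consumer-law specialist"),
    "C": make_stub_proposer("litigation counsel"),
}
proposals = asyncio.run(run_proposers("clause 14 of contract C-4921", pool))
```

Because every proposer shares one call signature, the pool can mix providers, model sizes, temperatures, or personas without changing the dispatch code.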
Optional Discussion Phase
In higher-quality configurations, proposers are shown each other's outputs and produce a revised proposal that takes account of what others observed. This multi-round "discussion" mimics the expert panel model and can improve consensus quality. It adds latency and cost (each proposer makes an additional inference call per discussion round).
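One discussion round might look like the sketch below. The prompt wording and the names `build_revision_prompt` and `discussion_round` are assumptions made for illustration; `revise` is a hypothetical callback that would wrap the extra inference call each proposer makes per round.

```python
from typing import Callable, Dict


def build_revision_prompt(task: str, own: str, peers: Dict[str, str]) -> str:
    """Show a proposer its own output alongside every peer's output."""
    peer_text = "\n".join(f"- {name}: {text}" for name, text in sorted(peers.items()))
    return (
        f"Task: {task}\n"
        f"Your previous proposal: {own}\n"
        f"Peer proposals:\n{peer_text}\n"
        "Revise your proposal, keeping points you still hold and adopting valid peer findings."
    )


def discussion_round(
    task: str,
    proposals: Dict[str, str],
    revise: Callable[[str, str], str],
) -> Dict[str, str]:
    """Run one round: each proposer revises against the round's original proposals."""
    revised = {}
    for name, own in proposals.items():
        peers = {n: p for n, p in proposals.items() if n != name}
        revised[name] = revise(name, build_revision_prompt(task, own, peers))
    return revised


# Echo callback: returns the prompt itself so the routing is visible without a model.
first_round = {"A": "liability cap unenforceable", "B": "notice clause ambiguous"}
second_round = discussion_round("Assess clause 14", first_round, lambda name, prompt: prompt)
```

Every revision is built from the same snapshot of the previous round, so the order in which proposers are processed does not influence what each one sees.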
Aggregation Layer
The Aggregator receives all proposer outputs and produces a final synthesised result. Aggregation strategies:
- Majority Voting: For categorical decisions (classify this document as "high/medium/low risk"), take the majority vote. Simple, deterministic, explainable.
- Weighted Voting: Weight votes by each proposer's historical accuracy on this task type. Requires tracked historical accuracy.
- Union Aggregation: For coverage-oriented tasks (find all risk flags), take the union of all proposer findings. Each finding attributed to the proposer(s) that identified it.
- LLM Synthesis: An aggregator LLM receives all proposer outputs and synthesises a unified final output. Highest quality, adds cost and latency. The aggregator is instructed to identify agreement, reconcile disagreements, and explain reasoning.
- Best-of-N Selection: Score all proposers' outputs on a quality rubric and return the highest-scoring output. Preserves full fidelity of the best output; does not synthesise.
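The deterministic strategies above can be sketched in a few lines each. The function names, the default weight of 1.0 for proposers without history, and the tie-breaking behaviour are assumptions for illustration; LLM synthesis is omitted because it depends on a provider-specific call.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Tuple


def majority_vote(votes: Dict[str, str]) -> Tuple[str, int]:
    """Return (winning label, vote count). Ties go to the label seen first."""
    label, count = Counter(votes.values()).most_common(1)[0]
    return label, count


def weighted_vote(votes: Dict[str, str], weights: Dict[str, float]) -> str:
    """Sum each proposer's historical-accuracy weight per label; highest total wins."""
    totals: Dict[str, float] = defaultdict(float)
    for proposer, label in votes.items():
        totals[label] += weights.get(proposer, 1.0)  # assumed default weight
    return max(totals, key=totals.get)


def union_findings(findings: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map every distinct finding to the proposers that reported it."""
    attributed: Dict[str, List[str]] = {}
    for proposer, items in findings.items():
        for item in items:
            sources = attributed.setdefault(item, [])
            if proposer not in sources:
                sources.append(proposer)
    return attributed


def best_of_n(outputs: Dict[str, str], score: Callable[[str], float]) -> str:
    """Return the name of the proposer whose output scores highest on the rubric."""
    return max(outputs, key=lambda name: score(outputs[name]))


votes = {"A": "high", "B": "high", "C": "medium"}
winner = majority_vote(votes)
weighted = weighted_vote(votes, {"A": 0.5, "B": 0.5, "C": 1.5})
merged = union_findings({"A": ["x"], "B": ["x", "y"], "C": ["y"]})
```

In this example the majority label is "high", while the weighted vote selects "medium" because proposer C carries more weight than A and B combined, which shows how the two strategies can diverge on the same votes.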
Confidence and Agreement Reporting
The aggregation result includes metadata: the number of proposers that agreed (for voting strategies), the agreement rate (for structured outputs), and individual proposer outputs (for audit and transparency). Low agreement (e.g., < 2/3 proposers agree) is a signal that the task is genuinely ambiguous or that the proposers have conflicting knowledge, and this signal should be reported to the caller and, for high-stakes tasks, escalated to human review.
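A sketch of the agreement metadata for a voting strategy follows, assuming a non-empty set of categorical votes. The field names and the `aggregate_with_agreement` helper are hypothetical; the default threshold of 2/3 is taken from the example above, and `Fraction` is used so the comparison at exactly 2/3 is not affected by floating-point rounding.

```python
from collections import Counter
from fractions import Fraction
from typing import Any, Dict


def aggregate_with_agreement(
    votes: Dict[str, str],
    threshold: Fraction = Fraction(2, 3),
) -> Dict[str, Any]:
    """Majority vote plus agreement metadata; escalate when agreement < threshold."""
    label, count = Counter(votes.values()).most_common(1)[0]
    agreement = Fraction(count, len(votes))
    return {
        "final_assessment": label,
        "agreeing_proposers": count,
        "agreement_rate": float(agreement),
        "escalate_to_human": agreement < threshold,
        "proposer_outputs": dict(votes),  # kept for audit and transparency
    }


clear = aggregate_with_agreement({"A": "high", "B": "high", "C": "medium"})
split = aggregate_with_agreement({"A": "high", "B": "medium", "C": "low"})
```

With two of three proposers agreeing, the result sits exactly on the threshold and is not escalated; a three-way split falls below it and is flagged for human review.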
5. Architecture Diagram
flowchart TD
subgraph Input["Task Input"]
A[Task + Quality Objective]
end
subgraph Proposers["Proposer Layer - Parallel Execution"]
P1[Proposer A]
P2[Proposer B]
P3[Proposer C]
PN[Proposer N]
end
subgraph Discussion["Optional Discussion Phase"]
D[Peer Review + Revise]
end
subgraph Aggregation["Aggregation Layer"]
C[Result Collector]
E{Aggregation Strategy}
F[Voting / Union]
G[LLM Synthesis]
H{Agreement Threshold?}
end
subgraph Output["Output"]
I[Final Result]
J[Low Agreement Escalation]
end
A --> P1 & P2 & P3 & PN
P1 & P2 & P3 & PN --> D
D --> C
C --> E
E -->|deterministic| F
E -->|synthesis| G
F --> H
G --> H
H -->|sufficient| I
H -->|low agreement| J
6. Components
| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| Proposer Pool | AI Components | N independent agents producing independent proposals | Different LLM providers; different models; different personas | Critical |
| Fan-Out Dispatcher | Orchestration | Submits all proposer tasks concurrently | EAAPL-WRK003 implementation | Critical |
| Discussion Coordinator | AI Orchestration | Optional: routes peer outputs to each proposer for revision | Custom multi-round coordination loop | Medium |
| Result Collector | State | Waits for all proposer completions; handles partial failures | Async futures; Step Functions; Durable Functions | Critical |
| Aggregation Engine | AI/Logic Component | Applies chosen aggregation strategy | Custom voting/union logic; LLM synthesis call | Critical |
| Agreement Evaluator | Logic Component | Computes agreement rate; triggers escalation on low agreement | Custom; configurable threshold per task type | High |
| Quality Benchmark Evaluator | Quality Control | Periodically evaluates MoA output quality vs. single-model baseline | Offline evaluation pipeline; human labelled benchmark | High |
7. Data Flow
| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | Caller | Submits high-stakes task: "Assess legal enforceability risk of clause 14 in contract C-4921" | Task with quality objective: "comprehensive legal risk assessment" |
| 2 | Fan-Out Dispatcher | Dispatches same task to 3 proposers concurrently | 3 parallel invocations (GPT-4o, Claude 3.5, Gemini 1.5) |
| 3 | Proposer A (GPT-4o) | Produces risk assessment | {risks: ["limitation of liability clause unenforceable under ACL §64"], severity: "high"} |
| 4 | Proposer B (Claude 3.5) | Produces risk assessment | {risks: ["limitation of liability clause unenforceable under ACL §64", "notice clause ambiguous"], severity: "high"} |
| 5 | Proposer C (Gemini 1.5) | Produces risk assessment | {risks: ["notice clause ambiguous"], severity: "medium"} |
| 6 | Result Collector | All 3 results received | Proposer outputs aggregated |
| 7 | Agreement Evaluator | ACL §64 finding: 2/3 agree (67%). Notice clause: 2/3 agree. Severity: 2/3 "high". | Agreement: 67% on key findings |
| 8 | Aggregation Engine (synthesis) | LLM synthesises: confirms ACL §64 (2/3) + notice clause (2/3); rates overall severity "high" | Synthesised result |
| 9 | Caller | Receives result with metadata: {final_assessment, agreement_rate: 0.67, proposer_outputs: [A, B, C]} | Final report |
Error Flow
| Error | Detection | Recovery |
|---|---|---|
| Proposer failure (one model unavailable) | Partial failure detection | Return a partial result from the remaining proposers (e.g., 2 of 3); flag reduced confidence |
| All proposers fail | Complete failure | Fall back to a single model and attach failure metadata |
| Low agreement triggers escalation | Agreement threshold | Route to human expert queue with all proposer outputs |
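The first two recovery paths in the error flow can be sketched with `asyncio.gather(..., return_exceptions=True)`, which returns raised exceptions in place of results instead of aborting the whole batch. The result fields and the names `collect_with_fallback`, `ok`, and `down` are hypothetical, and the stub coroutines stand in for real provider calls.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict

Proposer = Callable[[str], Awaitable[str]]


async def collect_with_fallback(
    task: str, pool: Dict[str, Proposer], fallback: Proposer
) -> Dict[str, Any]:
    """Gather proposer results, tolerate partial failure, fall back if all fail."""
    names = list(pool)
    results = await asyncio.gather(
        *(pool[name](task) for name in names), return_exceptions=True
    )
    outputs = {n: r for n, r in zip(names, results) if not isinstance(r, BaseException)}
    failures = {n: repr(r) for n, r in zip(names, results) if isinstance(r, BaseException)}
    if not outputs:
        # Complete failure: single-model fallback, with failure metadata attached.
        return {
            "mode": "single_model_fallback",
            "outputs": {"fallback": await fallback(task)},
            "failures": failures,
            "reduced_confidence": True,
        }
    # Partial or full success: continue with whichever proposers answered.
    return {
        "mode": "moa",
        "outputs": outputs,
        "failures": failures,
        "reduced_confidence": bool(failures),
    }


async def ok(task: str) -> str:  # stub: healthy provider
    return f"assessment of {task}"


async def down(task: str) -> str:  # stub: provider outage
    raise ConnectionError("provider unavailable")


partial = asyncio.run(collect_with_fallback("clause 14", {"A": ok, "B": ok, "C": down}, ok))
```

Here `partial` carries outputs from A and B, records the failure of C, and sets `reduced_confidence` so the caller can see that diversity was lower than planned.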
8. Security Considerations
Cross-Provider Data Sharing
- In multi-provider MoA, task data is sent to multiple LLM provider APIs
- Mitigation: Review data classification against each provider's data processing agreement; use single-provider MoA (different models or sampling configurations) for sensitive data that cannot be sent to multiple providers
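A data classification gate of this kind might be sketched as follows. Everything in it is an illustrative assumption: the classification levels, the provider registry, the temperature values, and the function names are hypothetical and would be replaced by the organisation's own classification scheme and data processing agreement records.

```python
from typing import Dict, List

# Hypothetical classification levels, ordered from least to most sensitive.
LEVELS = ["public", "internal", "confidential", "restricted"]

# Hypothetical registry: highest classification each provider's agreement permits.
PROVIDER_MAX_LEVEL: Dict[str, str] = {
    "provider_a": "restricted",
    "provider_b": "internal",
    "provider_c": "confidential",
}


def permitted_providers(classification: str) -> List[str]:
    """Providers cleared for this classification; unknown labels raise ValueError."""
    needed = LEVELS.index(classification)
    return [
        provider
        for provider, max_level in PROVIDER_MAX_LEVEL.items()
        if LEVELS.index(max_level) >= needed
    ]


def plan_proposer_pool(classification: str, size: int = 3) -> List[dict]:
    """Prefer one proposer per provider; otherwise vary temperature on cleared ones."""
    providers = permitted_providers(classification)
    if not providers:
        raise ValueError(f"no provider is cleared for {classification!r} data")
    if len(providers) >= size:
        return [{"provider": p, "temperature": 0.7} for p in providers[:size]]
    temperatures = [0.2, 0.7, 1.0]  # assumed sampling settings for diversity
    return [
        {
            "provider": providers[i % len(providers)],
            "temperature": temperatures[i % len(temperatures)],
        }
        for i in range(size)
    ]


internal_pool = plan_proposer_pool("internal")
restricted_pool = plan_proposer_pool("restricted")
```

For "internal" data the gate yields a three-provider pool; for "restricted" data only one provider is cleared, so the pool falls back to three sampling configurations of that single provider.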
OWASP LLM Top 10
| OWASP LLM Risk | MoA Applicability | Mitigation |
|---|---|---|
| LLM01 Prompt Injection | Task input sent to N providers; single poisoned input affects all proposers similarly | Input sanitisation before dispatch; diversity does not defend against shared input injection |
| LLM09 Overreliance | High agreement creates false confidence in potentially shared errors | Agreement rate ≠ accuracy guarantee; quality benchmark required; human review for critical outputs |
| LLM04 Model DoS | N× API calls per task can exhaust rate limits | Rate limiting per provider; fan-out concurrency limits |
| LLM06 Sensitive Information | Task data sent to multiple providers | Data classification gate before multi-provider dispatch |
9. Governance Considerations
Multi-Provider Data Governance
- For each provider used in the proposer pool, verify data processing agreements, data residency requirements, and model training opt-out options
- For APRA-regulated data, confirm each provider is an approved supply chain entity
Governance Artefacts
| Artefact | Owner | Frequency | Purpose |
|---|---|---|---|
| Proposer Pool Configuration | AI Platform | On change; quarterly review | Documents which models/providers are in the proposer pool and their configuration |
| MoA Quality Benchmark | ML Engineering | Monthly | Compares MoA output quality vs. single-model baseline; validates investment |
| Provider Data Processing Agreements | Legal | On provider addition | Confirms data handling compliance for each proposer |
| Low-Agreement Escalation Log | Compliance | Per escalation event | Tracks tasks escalated due to low proposer agreement |
10. Operational Considerations
SLOs
| SLO | Target | Window | Alert |
|---|---|---|---|
| MoA completion rate (all proposers succeed) | ≥ 95% | 24-hour rolling | < 90% triggers P2 |
| Average proposer agreement rate | ≥ 60% (task-type dependent) | 24-hour rolling | Trending down triggers P3 review |
| Quality improvement over single-model baseline | ≥ 10% improvement on benchmark | Monthly eval | < 5% improvement questions MoA ROI |
| p95 wall-clock latency (parallel proposers) | ≤ slowest proposer p95 × 1.3 | 1-hour rolling | Exceeds 2× triggers P2 |
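The latency SLO can be checked with a small helper. This sketch assumes that "Exceeds 2×" means more than twice the slowest proposer's p95, and that the p95 values are already computed by the monitoring stack; the function name and the status labels are hypothetical.

```python
from typing import Sequence


def latency_slo_status(moa_p95_s: float, proposer_p95_s: Sequence[float]) -> str:
    """Compare end-to-end MoA p95 latency with the slowest proposer's p95."""
    slowest = max(proposer_p95_s)
    if moa_p95_s > 2.0 * slowest:
        return "P2"        # assumed reading of the alert column
    if moa_p95_s > 1.3 * slowest:
        return "SLO_MISS"  # above target, below the alert level
    return "OK"


status = latency_slo_status(10.0, [6.0, 8.0, 7.0])
```

With a slowest proposer p95 of 8 s, the target is 10.4 s and the alert level is 16 s, so an end-to-end p95 of 10 s is within the SLO.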
11. Cost Considerations
| Configuration | Proposers | Cost vs. Single Model | Quality Improvement |
|---|---|---|---|
| 2-proposer (same model, different temperature) | 2 | 2× | +5–10% |
| 3-proposer (different providers) | 3 | 2.5–3× (provider price differences) | +10–20% |
| 3-proposer + discussion | 3 + 3 discussion | 4–5× | +15–25% |
| 4-proposer + LLM synthesis | 4 + 1 | 5–6× | +15–25% |
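The cost multipliers in the table can be approximated by summing the relative price of each call. The sketch below is a rough estimate under stated assumptions: prices are hypothetical and expressed relative to one call to the single-model baseline, and the longer input that a synthesis call receives is ignored.

```python
from typing import Sequence


def cost_multiplier(
    proposer_prices: Sequence[float],
    synthesis_price: float = 0.0,
    baseline_price: float = 1.0,
) -> float:
    """Total relative cost of one MoA task versus one single-model call."""
    return (sum(proposer_prices) + synthesis_price) / baseline_price


same_model_pair = cost_multiplier([1.0, 1.0])
# Hypothetical relative prices for three different providers.
mixed_providers = round(cost_multiplier([1.0, 0.9, 0.7]), 2)
four_plus_synthesis = cost_multiplier([1.0] * 4, synthesis_price=1.0)
```

Under these assumed prices the three configurations come out at about 2×, 2.6×, and 5× the single-model cost, in line with the corresponding rows of the table.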
Optimisation Strategy
- Reserve MoA for tasks where single-model error rate is unacceptably high (> 5% material error rate)
- Use same-provider multi-temperature configuration as cheapest diversity strategy
- Route routine tasks to single-model; route high-stakes tasks to MoA via Router/Dispatcher (EAAPL-WRK004)
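The routing rule can be sketched as a simple predicate. Combining the stakes level with the 5% material error threshold in a single condition is an assumption made for illustration, as are the function name and the string labels; a real Router/Dispatcher (EAAPL-WRK004) would apply its own policy.

```python
def select_workflow(stakes: str, material_error_rate: float) -> str:
    """Route to MoA only for high-stakes tasks whose single-model error rate is too high."""
    if stakes not in {"routine", "high"}:
        raise ValueError(f"unknown stakes level: {stakes!r}")
    if stakes == "high" and material_error_rate > 0.05:
        return "moa"
    return "single_model"


route = select_workflow("high", 0.08)
```

A high-stakes task with an 8% measured material error rate is routed to MoA, while routine tasks stay on the single-model path regardless of error rate.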
12. Trade-Off Analysis
| Option | Quality | Cost | Latency | Explainability | Best For |
|---|---|---|---|---|---|
| A: 3-proposer + synthesis (Recommended for high-stakes) | Very High | High | Medium | Medium | High-stakes outputs; justified by risk reduction |
| B: 2-proposer + voting | High | 2× | Low | High | Medium-stakes; need explainable agreement |
| C: Single model + reflection (EAAPL-AGT006) | High | Low | Medium | High | Most production tasks |
| D: Best-of-N selection | High | N× | Low | High | Candidate generation; when best individual suffices |
13. Failure Modes
| Failure Mode | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| Proposer model downtime (provider outage) | Low–Medium | Medium: reduced diversity | Health check per provider | Complete task with available proposers; flag reduced confidence |
| Correlated errors across proposers (shared training data blind spot) | Medium | High: false consensus | Quality benchmark testing | Maximise architectural diversity (different providers, not just different models) |
| Aggregator synthesis introduces new errors | Low–Medium | Medium: synthesis degrades quality | Aggregation quality benchmark | Benchmark aggregator quality; fallback to voting if synthesis degrades |
| Low agreement always triggers escalation (poorly calibrated threshold) | Medium | Medium: unnecessary human escalation | Escalation rate monitoring | Tune agreement threshold per task type based on historical data |
14. Regulatory Considerations
EU AI Act
- Art. 15 (Accuracy, Robustness and Cybersecurity): MoA supports the accuracy and robustness requirements for high-risk AI systems; the quality benchmark provides measurable evidence of improvement.
ISO 42001
- §8.4: Multi-provider proposer pool requires supply chain AI governance; each provider must be assessed under the organisation's AI supplier framework.
NIST AI RMF
- MEASURE 2.5: The MoA quality benchmark (measured improvement over single-model baseline) supports the AI performance measurement requirement.
15. Reference Implementations
AWS
| Component | Service |
|---|---|
| Proposer Pool | Amazon Bedrock (Anthropic Claude + Amazon Nova + Mistral) via unified API |
| Fan-Out | AWS Step Functions Map state |
| Aggregation (LLM synthesis) | Amazon Bedrock (Claude 3.5 Sonnet) |
| Result Store | Amazon DynamoDB |
Azure
| Component | Service |
|---|---|
| Proposer Pool | Azure OpenAI (GPT-4o) + Azure AI Models (Llama, Mistral) |
| Fan-Out | Azure Durable Functions |
| Aggregation | Azure OpenAI synthesis call |
On-Premises
| Component | Technology |
|---|---|
| Proposer Pool | vLLM serving multiple models (Llama 3.1 70B, Mistral Large) + OpenAI API |
| Orchestration | Python asyncio; LangGraph multi-agent graph |
16. Related Patterns
| Pattern | ID | Relationship Type | Notes |
|---|---|---|---|
| Parallel Fan-Out/Fan-In | EAAPL-WRK003 | Base Pattern | MoA specialises fan-out/fan-in for quality improvement through model diversity |
| Multi-Agent Orchestration | EAAPL-MAG001 | Peer | Orchestration for coordinated multi-agent work; MoA for independent parallel quality improvement |
| Reflexive Agent | EAAPL-AGT006 | Alternative | Self-critique quality improvement; lower cost than MoA; MoA for higher stakes |
| Router/Dispatcher | EAAPL-WRK004 | Integrates With | Dispatcher routes high-stakes tasks to MoA; routine tasks to single model |
17. Maturity Assessment
Overall Maturity: Emerging
| Dimension | Score (1–5) | Evidence |
|---|---|---|
| Research Foundation | 4 | MoA paper (Wang et al., 2024); ensemble LLM literature; strong academic evidence |
| Production Deployment | 3 | Early production deployments in research/legal tools; enterprise adoption starting |
| Framework Support | 2 | Custom implementations common; no dominant framework abstraction yet |
| Cost Optimisation | 3 | Provider pricing dynamics evolving; ROI models maturing |
| Aggregation Tooling | 2 | Custom synthesis aggregators common; no standard aggregation library |
18. Revision History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2025-06-13 | Architecture Board | Initial publication in Agentic Workflows category |