[EAAPL-WRK003] Parallel Fan-Out / Fan-In
Category: Agentic Workflows
Sub-category: Parallel Execution Architecture
Version: 1.0
Maturity: Proven
Tags: fan-out, fan-in, parallel-execution, map-reduce, aggregation, fork-join
Regulatory Relevance: ISO 42001 §8.4, NIST AI RMF (MANAGE 2.2)
1. Executive Summary
The Parallel Fan-Out / Fan-In pattern defines an architecture in which a single task is decomposed into N independent sub-tasks (fan-out), each executed concurrently by a separate agent or LLM worker, with the results then combined by a fan-in aggregator (fork-join). It is the AI equivalent of map-reduce: distribute parallel work, then collect and synthesise the results. Compared with sequential execution, fan-out/fan-in reduces end-to-end latency by up to the degree of parallelism, because wall-clock time is bounded by the slowest worker rather than by the sum of all workers. It also supplies the raw material for ensemble-style quality improvement through aggregation.
For CIO/CTO audiences: if a task can be broken into independent chunks (analyse 10 contracts simultaneously, generate 5 candidate responses in parallel, search 8 data sources concurrently), this pattern executes all chunks at the same time and synthesises the results, rather than processing them one after another. The cost is roughly the same as sequential execution (higher if an LLM synthesis step or different models are used), but wall-clock time drops by up to the parallelism factor, subject to provider rate limits. For time-sensitive workflows such as due diligence, incident response, and regulatory scanning, this latency reduction is the primary value. The secondary value is resilience: a single worker failure need not fail the entire task.
2. Problem Statement
Business Problem
Enterprise tasks frequently involve analysing multiple independent sources simultaneously: reviewing all contracts in a portfolio, scanning multiple regulatory databases, generating multiple solution candidates for comparison. Sequential processing makes total latency proportional to the number of sources, which is unacceptable for time-sensitive business processes.
Technical Problem
Sequential LLM execution does not utilise available parallelism. When sub-tasks are mutually independent — each sub-task's execution does not depend on another's result — sequential execution wastes wall-clock time and increases total task latency linearly with the number of sub-tasks.
Symptoms of Absence
- Portfolio reviews, multi-source scans, or candidate generation tasks take N× longer than necessary
- Partial failures in multi-source tasks fail the entire operation rather than returning partial results
- No mechanism to compare multiple independent outputs for quality improvement
- Throughput limited by sequential LLM token generation rate
Cost of Inaction
- Latency: A 10-source sequential scan at 5s/source takes 50s; run in parallel it takes roughly 5s, bounded by the slowest worker. For interactive or SLA-bound processes, this is the difference between usable and unusable.
- Resilience: In a sequential pipeline, one failed step blocks every step after it; parallel workers isolate failures
- Quality: Without parallel candidate generation, there is no basis for output quality improvement through selection or synthesis
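The latency arithmetic above can be made explicit with a small model. This is a rough sketch, not a benchmark: it assumes every sub-task takes the same time and that a concurrency cap (the hypothetical `max_concurrency` parameter) forces work into waves. In practice each wave is bounded by its slowest worker.

```python
import math


def sequential_latency(n_tasks: int, seconds_per_task: float) -> float:
    """Total wall-clock time when sub-tasks run one after another."""
    return n_tasks * seconds_per_task


def parallel_latency(n_tasks: int, seconds_per_task: float, max_concurrency: int) -> float:
    """Idealised wall-clock time when sub-tasks run in waves of at most max_concurrency."""
    waves = math.ceil(n_tasks / max_concurrency)
    return waves * seconds_per_task


print(sequential_latency(10, 5.0))    # 50.0
print(parallel_latency(10, 5.0, 10))  # 5.0
print(parallel_latency(10, 5.0, 4))   # 15.0 (three waves of up to four workers)
```

The third call shows why the speed-up is "up to" the parallelism factor: once the concurrency cap is lower than N, the benefit shrinks accordingly.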
3. Context
When to Apply
- Task decomposes into N mutually independent sub-tasks (no data dependency between sub-tasks)
- Total latency is a primary constraint and parallel execution infrastructure is available
- Sub-tasks are homogeneous (same prompt template, different inputs) or well-defined heterogeneous workers
- Aggregation strategy is clearly defined, and deterministic wherever the output must be reproducible
When NOT to Apply
- Sub-tasks are data-dependent (output of task A is input to task B) — use Sequential Chain (EAAPL-WRK002)
- Sub-tasks require coordinated state (use Multi-Agent Orchestration, EAAPL-MAG001)
- Cost, rather than latency, is the binding constraint: parallelism does not reduce token cost relative to sequential execution
- Aggregation result is non-deterministic and variance is unacceptable
Prerequisites
- Task decomposition function that produces independent sub-tasks
- Defined aggregation strategy (union, intersection, voting, synthesis, best-of-N selection)
- Concurrency infrastructure (async executor, worker pool, parallel workflow engine)
- Partial result handling policy (fail-all vs. return-available-on-partial-failure)
Industry Applicability
| Industry | Fan-Out Use Case | Aggregation Strategy |
|---|---|---|
| Financial Services | Parallel credit bureau checks (Equifax, Experian, illion) | Synthesis: merge scores + discrepancies |
| Legal | Simultaneous review of 20 contracts in a portfolio | Union: collect all findings; de-duplicate |
| Cybersecurity | Parallel threat intelligence source query | Union with deduplication; priority weighting |
| Healthcare | Parallel guideline database search across multiple bodies | Synthesis: reconcile potentially conflicting guidance |
| Government | Parallel policy impact assessment across agencies | Voting + synthesis for consensus recommendation |
4. Architecture Overview
The Fan-Out/Fan-In architecture has three stages: decomposition, parallel execution, and aggregation.
Decomposition Phase
The task decomposer receives the original task and produces a set of independent sub-task specifications. Each sub-task contains: the sub-task input, the prompt template to use, the worker configuration, and a correlation ID linking it back to the parent task. The decomposer is deterministic: for the same input, it produces the same sub-task set. This enables replay and deterministic debugging.
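A deterministic decomposer along these lines might look like the following. It is a minimal sketch under stated assumptions: the `SubTask` fields, the `decompose` function, and the correlation ID format (parent ID, position, content hash) are all illustrative choices, not part of the pattern specification.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SubTask:
    correlation_id: str   # links the sub-task back to its parent task
    parent_task_id: str
    payload: str          # the sub-task input, e.g. one contract's text
    prompt_template: str  # name of the template the worker should render
    model: str            # worker configuration, reduced here to a model name


def decompose(parent_task_id: str, documents: list[str],
              prompt_template: str, model: str) -> list[SubTask]:
    """Produce one sub-task per document, in input order, with stable IDs."""
    sub_tasks = []
    for index, doc in enumerate(documents):
        # Hashing the content makes the ID reproducible across replays.
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()[:12]
        sub_tasks.append(SubTask(
            correlation_id=f"{parent_task_id}:{index}:{digest}",
            parent_task_id=parent_task_id,
            payload=doc,
            prompt_template=prompt_template,
            model=model,
        ))
    return sub_tasks
```

Because no randomness or clock value enters the ID, running `decompose` twice on the same input yields identical sub-task sets, which is what makes replay possible.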
Fan-Out Phase
The fan-out dispatcher submits all sub-tasks to the worker pool concurrently. Workers are stateless and, in the common case, homogeneous: each executes the same pattern (render prompt → invoke LLM → validate output → return result), even when heterogeneous workers use different templates or models. The dispatcher tracks outstanding sub-tasks by correlation ID. The maximum degree of parallelism is configurable per task type, balancing API rate limits, cost controls, and latency objectives.
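One way to sketch the dispatcher is with `asyncio.gather` and an `asyncio.Semaphore` as the concurrency cap. The `fan_out` function, its `max_concurrency` parameter, and `fake_worker` (a stand-in for a real LLM call) are hypothetical names introduced for illustration.

```python
import asyncio
from typing import Awaitable, Callable


async def fan_out(sub_tasks: dict[str, str],
                  worker: Callable[[str], Awaitable[str]],
                  max_concurrency: int = 4) -> dict[str, object]:
    """Run worker(payload) for every sub-task, at most max_concurrency at a time.

    Returns a mapping of correlation ID to either the worker's result or the
    exception it raised, so one failure never cancels the others.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(payload: str) -> str:
        async with semaphore:  # blocks while max_concurrency workers are active
            return await worker(payload)

    ids = list(sub_tasks)
    outcomes = await asyncio.gather(
        *(run_one(sub_tasks[cid]) for cid in ids),
        return_exceptions=True,  # collect failures instead of raising the first one
    )
    return dict(zip(ids, outcomes))


async def fake_worker(payload: str) -> str:
    """Stand-in for an LLM call (hypothetical)."""
    await asyncio.sleep(0.01)
    if payload == "bad":
        raise ValueError("simulated worker failure")
    return payload.upper()
```

`asyncio.gather` returns results in submission order, so zipping them with the correlation IDs is safe even though workers finish in arbitrary order.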
Worker Execution
Each worker is a complete mini-pipeline: it renders its prompt from the sub-task specification, invokes the LLM, validates the output against the sub-task schema, and returns the validated result (or a structured error). Workers are independent — a failure in one worker does not affect others. Workers emit per-execution metrics (latency, token usage, validation result) for observability.
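The worker mini-pipeline can be sketched as a single function that never raises and always returns either a validated result or a structured error. Everything named here is an assumption for illustration: `WorkerResult`, `run_worker`, the injected `call_llm` callable, and the `REQUIRED_KEYS` schema (borrowed from the liability-review example later in this document).

```python
import json
import time
from dataclasses import dataclass
from typing import Callable, Optional

REQUIRED_KEYS = {"contract_id", "risk_flags", "severity_max"}  # assumed sub-task schema


@dataclass
class WorkerResult:
    correlation_id: str
    ok: bool
    output: Optional[dict] = None
    error: Optional[str] = None
    latency_s: float = 0.0  # per-execution metric for observability


def run_worker(correlation_id: str, template: str, payload: str,
               call_llm: Callable[[str], str]) -> WorkerResult:
    """Render prompt -> invoke LLM -> validate output -> return result or structured error."""
    started = time.perf_counter()
    try:
        prompt = template.format(document=payload)
        raw = call_llm(prompt)
        parsed = json.loads(raw)
        if not isinstance(parsed, dict):
            raise ValueError("output is not a JSON object")
        missing = REQUIRED_KEYS - parsed.keys()
        if missing:
            raise ValueError(f"missing keys: {sorted(missing)}")
        return WorkerResult(correlation_id, True, output=parsed,
                            latency_s=time.perf_counter() - started)
    except Exception as exc:  # a worker failure must not escape to its siblings
        return WorkerResult(correlation_id, False,
                            error=f"{type(exc).__name__}: {exc}",
                            latency_s=time.perf_counter() - started)
```

Injecting `call_llm` keeps the worker stateless and testable: a real deployment would pass a function wrapping the provider client, while tests pass a lambda.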
Fan-In Phase
The fan-in aggregator receives all worker results (including partial results if some workers failed). It applies the configured aggregation strategy: union (combine all results), intersection (keep only results reported by at least K of the N workers), voting (majority rule for categorical decisions), synthesis (LLM-based synthesis of all results into a unified output), or best-of-N (score the outputs and keep the best). The aggregation strategy is the primary design decision and must be chosen based on the task's quality requirements.
Aggregation Strategies
- Union: Concatenate all results; suitable for comprehensive information gathering (all risk flags from all contracts)
- Intersection: Keep only results confirmed by at least K of the N workers; suitable for high-confidence claims
- Voting: For categorical decisions, take the majority label; suitable for classification tasks
- Synthesis: LLM call that synthesises all worker outputs into a unified narrative; highest quality, adds latency and cost
- Best-of-N: Score all outputs and return the highest-scoring; suitable for candidate generation (EAAPL-WRK008)
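The deterministic strategies above are simple enough to sketch directly. The function names and signatures below are illustrative assumptions; synthesis is omitted because it requires an LLM call.

```python
from collections import Counter
from typing import Callable, Iterable


def union(results: Iterable[Iterable[str]]) -> list[str]:
    """All distinct items from every worker, in first-seen order."""
    seen: dict[str, None] = {}
    for items in results:
        for item in items:
            seen.setdefault(item, None)
    return list(seen)


def intersection(results: Iterable[Iterable[str]], k: int) -> list[str]:
    """Items reported by at least k workers (each worker counted once per item)."""
    counts: Counter = Counter()
    for items in results:
        counts.update(set(items))
    return sorted(item for item, n in counts.items() if n >= k)  # sorted for determinism


def vote(labels: list[str]) -> str:
    """Majority label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def best_of_n(candidates: list[str], score: Callable[[str], float]) -> str:
    """Highest-scoring candidate under a caller-supplied scoring function."""
    return max(candidates, key=score)


flags = [["a", "b"], ["b", "c"], ["b", "a"]]
print(union(flags))            # ['a', 'b', 'c']
print(intersection(flags, 2))  # ['a', 'b']
print(vote(["high", "low", "high"]))  # high
```

All four functions are pure, so the same worker outputs always produce the same aggregate, which is the property regulated use cases depend on.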
5. Architecture Diagram
flowchart TD
subgraph Input["Task Input"]
A[Original Task]
end
subgraph Decompose["Decomposition"]
B[Task Decomposer]
end
subgraph Workers["Parallel Workers Fan-Out"]
W1[Worker 1]
W2[Worker 2]
W3[Worker 3]
WN[Worker N]
end
subgraph FanIn["Fan-In Aggregation"]
C[Result Collector]
D{Aggregation Strategy}
E[Union / Voting]
F[LLM Synthesis]
end
subgraph Output["Output"]
G[Aggregated Result]
H[Partial Result]
end
A --> B
B --> W1 & W2 & W3 & WN
W1 & W2 & W3 & WN --> C
C --> D
D -->|deterministic| E
D -->|synthesis needed| F
E --> G
F --> G
C -->|partial failure| H
6. Components
| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| Task Decomposer | Logic Component | Splits original task into N independent sub-tasks | Deterministic rules; LLM-based decomposition; hybrid | Critical |
| Fan-Out Dispatcher | Orchestration | Submits N sub-tasks to worker pool concurrently | asyncio.gather (Python); AWS Step Functions Map state; Durable Fan-Out | Critical |
| Worker | AI Component | Executes single sub-task: prompt → LLM → validate → return | Stateless function; Lambda; container; same or different models | Critical |
| Rate Limiter | Resilience | Enforces concurrency limit to respect API rate limits | Token bucket; semaphore; API gateway throttle | High |
| Result Collector | State | Tracks outstanding sub-tasks; collects results on completion | Asyncio futures; Step Functions; Durable entity | Critical |
| Aggregation Engine | Logic Component | Applies configured aggregation strategy to collected results | Custom Python; LangChain; dedicated LLM synthesis call | Critical |
| Partial Failure Handler | Resilience | Decides whether to fail-all or return partial results on worker failure | Configurable threshold: e.g., ≥ 80% workers must succeed | High |
| Fan-Out Metrics Emitter | Observability | Per-worker and per-aggregation latency, token usage, success rate | Prometheus; CloudWatch; Datadog | Medium |
7. Data Flow
| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | Caller | Submits task: "Review all 8 vendor contracts for liability exposure" | Task with 8 contract documents |
| 2 | Task Decomposer | Creates 8 sub-tasks, one per contract | [{sub_task_id: "ST-1", contract: doc1, prompt: "liability_review_v2"}, ...] |
| 3 | Fan-Out Dispatcher | Submits all 8 sub-tasks concurrently | 8 concurrent worker invocations |
| 4 | Workers 1–8 | Each executes liability review on its assigned contract | [{contract_id, risk_flags: [...], severity_max: "high"}, ...] |
| 5 | Result Collector | Receives results as workers complete (non-blocking) | 8/8 results received; 0 failures |
| 6 | Aggregation Engine | Applies union strategy: merges all risk flags | {total_risk_flags: 23, high_severity: 5, contracts_reviewed: 8} |
| 7 | Caller | Receives aggregated result with per-contract breakdown | Final report |
Error Flow
| Error | Detection | Recovery |
|---|---|---|
| Worker timeout | Per-worker timeout in dispatcher | Mark worker as failed; continue collecting other results |
| Worker validation failure | Schema validation error in worker | Retry worker once; if it fails again, mark as failed result |
| Partial failure (one or more workers failed) | Partial Failure Handler | If successes ≥ minimum success threshold: return partial result with failure list; else: fail entire task |
| Rate limit exceeded (too many concurrent API calls) | HTTP 429 from LLM provider | Rate limiter queues excess workers and retries them after backoff; no sub-task is dropped |
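The partial-failure rule in the error flow can be sketched as a small policy function. The names `FanInDecision` and `apply_partial_failure_policy`, and the default `min_success_ratio` of 0.8, are assumptions for illustration (the 80% figure mirrors the example threshold used elsewhere in this document).

```python
from dataclasses import dataclass


@dataclass
class FanInDecision:
    status: str            # "complete", "partial", or "failed"
    succeeded: list[str]   # correlation IDs with a valid result
    failed: list[str]      # correlation IDs to report in the failure list


def apply_partial_failure_policy(outcomes: dict[str, bool],
                                 min_success_ratio: float = 0.8) -> FanInDecision:
    """outcomes maps correlation ID -> True if that worker returned a valid result."""
    succeeded = [cid for cid, ok in outcomes.items() if ok]
    failed = [cid for cid, ok in outcomes.items() if not ok]
    if succeeded and not failed:
        status = "complete"
    elif outcomes and len(succeeded) / len(outcomes) >= min_success_ratio:
        status = "partial"   # return what is available, plus the failure list
    else:
        status = "failed"    # below threshold (or nothing ran): fail the entire task
    return FanInDecision(status, succeeded, failed)
```

With eight workers and the default threshold, seven successes (87.5%) yield a partial result, while six (75%) fail the task.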
8. Security Considerations
Parallel Execution Amplifies Injection Risk
- Fan-out submits N simultaneous LLM calls with potentially attacker-controlled inputs
- A single poisoned input document affects only one worker; aggregation stage must not blindly trust any single worker's output
OWASP LLM Top 10
| OWASP LLM Risk | Fan-Out/Fan-In Applicability | Mitigation |
|---|---|---|
| LLM01 Prompt Injection | Each worker processes potentially untrusted content | Per-worker input sanitisation; content delimiters |
| LLM04 Model DoS | N parallel calls can exhaust API rate limits | Rate limiter with configurable max concurrency; cost ceiling |
| LLM08 Excessive Agency | N parallel workers × write-capable tools = N× side-effect amplification | Read-only tools in workers by default; write actions require explicit fan-out permission |
| LLM09 Overreliance | Aggregated result presented with false consensus confidence | Aggregation metadata includes worker agreement rate; low agreement flags for human review |
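The LLM09 mitigation, attaching a worker agreement rate to the aggregated result, might be sketched as follows. The `agreement_summary` function and its `review_threshold` default of 0.6 are hypothetical; the appropriate threshold is a governance decision per task class.

```python
from collections import Counter


def agreement_summary(labels: list[str], review_threshold: float = 0.6) -> dict:
    """Majority label plus the share of workers that agreed with it."""
    if not labels:
        raise ValueError("no worker labels to aggregate")
    label, votes = Counter(labels).most_common(1)[0]
    rate = votes / len(labels)
    return {
        "label": label,
        "agreement_rate": rate,
        "workers": len(labels),
        "needs_human_review": rate < review_threshold,  # low agreement is flagged
    }
```

For example, four "approve" votes and one "decline" give an agreement rate of 0.8 and no review flag, whereas a 2-1-1 split gives 0.5 and is routed to a human.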
9. Governance Considerations
Aggregation Strategy Governance
- The aggregation strategy (especially voting thresholds and synthesis prompts) has material impact on output quality and must be owned by domain SMEs
- Aggregation strategies for regulated decisions (credit, underwriting) require model risk review
Governance Artefacts
| Artefact | Owner | Frequency | Purpose |
|---|---|---|---|
| Task Decomposition Specification | AI Platform | On change | Documents how tasks are decomposed; decomposition logic is version-controlled |
| Aggregation Strategy Register | Domain SME + AI Platform | Per use case | Documents chosen aggregation strategy, threshold values, and justification |
| Worker Result Archive | Compliance | Per execution (regulated) | Individual worker outputs preserved for audit alongside aggregated result |
| Partial Failure Threshold Policy | AI Governance Board | Quarterly | Documents acceptable failure thresholds per task class |
10. Operational Considerations
SLOs
| SLO | Target | Window | Alert |
|---|---|---|---|
| Fan-out completion rate (all workers succeed) | ≥ 97% | 24-hour rolling | < 93% triggers P2; check worker reliability |
| p95 fan-out wall-clock latency | ≤ max(single worker p95) × 1.5 | 1-hour rolling | Significant excess triggers P2; investigate stragglers |
| Worker success rate per task type | ≥ 98% | 24-hour rolling | < 95% triggers P3 |
| Aggregation synthesis latency | ≤ 10s (for LLM synthesis) | 1-hour rolling | > 20s triggers P3 |
Monitoring
- Straggler worker detection: workers taking >3× median latency slow down the entire fan-in
- Worker result variance: high variance in outputs may indicate ambiguous sub-task specification
- Aggregation confidence distribution: track worker agreement rates across task types
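The straggler rule above (latency greater than 3× the median) is easy to express directly. This is a minimal sketch; `find_stragglers` and its `factor` parameter are illustrative names, and the sample latencies are made up.

```python
from statistics import median


def find_stragglers(latencies_s: dict[str, float], factor: float = 3.0) -> list[str]:
    """Return IDs of workers whose latency exceeds factor x the median latency."""
    if not latencies_s:
        return []
    cutoff = factor * median(latencies_s.values())
    return sorted(cid for cid, t in latencies_s.items() if t > cutoff)


sample = {"w1": 4.8, "w2": 5.1, "w3": 5.0, "w4": 21.0}
print(find_stragglers(sample))  # ['w4']
```

Using the median rather than the mean keeps the cutoff stable when the straggler itself is extreme.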
11. Cost Considerations
| Configuration | Workers | Approx. Cost per Fan-Out (GPT-4o) | Latency Benefit |
|---|---|---|---|
| Small fan-out | 3–5 | $0.05–0.20 | Up to 3–5× faster than sequential |
| Medium fan-out | 6–10 | $0.20–0.60 | Up to 6–10× faster than sequential |
| Large fan-out | 11–20 | $0.60–2.00 | Up to 15× faster (API rate limits constrain max concurrency) |
| With LLM synthesis | Any + 1 | +$0.05–0.20 | Additional synthesis call overhead |
Optimisations
- Use smaller, faster models for individual workers; reserve larger model for synthesis aggregation only
- Cache worker results (by content hash) to avoid reprocessing identical sub-task inputs
- Tune max concurrency to stay within LLM provider rate limits without queuing overhead
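The content-hash caching optimisation might be sketched like this. `WorkerResultCache` is a hypothetical in-memory stand-in; a production deployment would more likely back it with a shared store. Including the template and model in the key is an assumption made here so that a prompt or model change invalidates stale results.

```python
import hashlib
from typing import Callable


class WorkerResultCache:
    """In-memory cache keyed by a hash of (prompt template, model, input)."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(template: str, model: str, payload: str) -> str:
        h = hashlib.sha256()
        for part in (template, model, payload):
            data = part.encode("utf-8")
            h.update(len(data).to_bytes(8, "big"))  # length prefix avoids ambiguous joins
            h.update(data)
        return h.hexdigest()

    def get_or_compute(self, template: str, model: str, payload: str,
                       compute: Callable[[], str]) -> str:
        k = self.key(template, model, payload)
        if k in self._store:
            self.hits += 1
            return self._store[k]
        self.misses += 1
        self._store[k] = compute()  # only invoked on a cache miss
        return self._store[k]
```

A caller would wrap each worker invocation in `get_or_compute`, passing the LLM call as `compute`, so identical sub-task inputs are processed once.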
12. Trade-Off Analysis
| Option | Latency | Cost | Quality | Complexity | Best For |
|---|---|---|---|---|---|
| A: Fan-out with deterministic aggregation (Recommended for structured tasks) | Low | Equal to sequential | High | Medium | Portfolio review, multi-source scan |
| B: Fan-out with LLM synthesis | Low + synthesis | Higher | Very High | Medium–High | Complex synthesis needed |
| C: Sequential processing | High (N×) | Equal | High | Low | Small N; dependency between steps |
| D: Mixture-of-Agents (EAAPL-WRK008) | Low | Higher (different models) | Very High | High | Quality improvement through diversity |
Architectural Tensions
| Tension | Left Pole | Right Pole | Balance |
|---|---|---|---|
| Parallelism vs. Rate limits | Maximum parallelism for minimum latency | Low concurrency to respect API limits | Configure max concurrency per provider; use token bucket |
| Fail-all vs. Return-partial | Return nothing unless all workers succeed | Return whatever is available | Configurable threshold (e.g., 80%); task-class specific |
| Deterministic vs. Synthesis aggregation | Pure union/voting (fast, deterministic) | LLM synthesis (higher quality, non-deterministic) | Use deterministic for regulated decisions; synthesis for executive reports |
13. Failure Modes
| Failure Mode | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| Straggler workers (one slow worker blocks fan-in) | Medium | Medium: overall latency spike | Per-worker timeout monitoring | Per-worker timeout; return partial without straggler |
| Correlated worker failures (all workers fail same way) | Low | High: aggregation receives no valid results | All-workers-failed detection | Fallback to sequential processing or sequential retry |
| Aggregation bias (synthesis LLM over-weights first worker result) | Medium | Medium: result quality skewed | Worker agreement rate monitoring | Randomise worker result ordering before synthesis; use structured aggregation |
| Decomposition producing dependent sub-tasks | Low | High: workers produce incorrect results due to missing context | Integration testing of decomposition logic | Explicit data-independence check in decomposer; test with N=2 case |
| API cost explosion (N workers × unexpected long context) | Low–Medium | High: cost overrun | Per-task cost ceiling; fan-out cost estimate before dispatch | Pre-estimate total cost before dispatch; abort if > ceiling |
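The pre-dispatch cost check in the last row could be sketched as follows. The function names and the per-1k-token prices used in the example are placeholders, not real provider pricing; token counts are assumed to come from whatever tokenizer the deployment already uses.

```python
def estimate_fan_out_cost(input_tokens: list[int], max_output_tokens: int,
                          usd_per_1k_input: float, usd_per_1k_output: float) -> float:
    """Upper-bound cost: every worker is assumed to emit max_output_tokens."""
    total_in = sum(input_tokens)
    total_out = max_output_tokens * len(input_tokens)
    return total_in / 1000 * usd_per_1k_input + total_out / 1000 * usd_per_1k_output


def check_cost_ceiling(estimate_usd: float, ceiling_usd: float) -> None:
    """Abort before dispatch if the estimate exceeds the per-task ceiling."""
    if estimate_usd > ceiling_usd:
        raise RuntimeError(
            f"estimated cost ${estimate_usd:.2f} exceeds ceiling ${ceiling_usd:.2f}")


# Placeholder prices: two workers with 1,000 and 2,000 input tokens, 500 output tokens each.
estimate = estimate_fan_out_cost([1000, 2000], 500, usd_per_1k_input=0.01, usd_per_1k_output=0.03)
check_cost_ceiling(estimate, ceiling_usd=1.00)  # passes silently; raises if over the ceiling
```

Estimating against the output-token cap rather than an expected length is deliberately pessimistic, so the ceiling acts as a hard bound rather than an average.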
14. Regulatory Considerations
ISO 42001
- §8.4: Parallel execution introduces non-determinism in timing; the pattern must ensure that the final aggregated output is deterministically reproducible from the worker inputs (deterministic aggregation strategies) or explicitly flagged as synthesis-based.
NIST AI RMF
- MANAGE 2.2: Risk of correlated worker failures is a documented failure mode that must be managed; the partial failure handling policy is the control.
Australian Context
- For APRA-regulated use cases, individual worker outputs must be retained alongside the aggregated result so that the aggregation can be audited and replayed.
- For consumer-facing decisions (credit, insurance), the aggregation must not produce outcomes that cannot be explained to the affected individual; voting-based aggregation provides the most explainable audit trail.
15. Reference Implementations
AWS
| Component | Service |
|---|---|
| Fan-Out Dispatcher | AWS Step Functions Map state (distributed mode for > 40 concurrent) |
| Workers | AWS Lambda functions (one per sub-task invocation) |
| Result Collection | Step Functions state machine synchronises Map outputs |
| LLM Synthesis | Amazon Bedrock InvokeModel (Claude 3.5 Sonnet) |
| Rate Limiting | Concurrency limit on Lambda + Step Functions MaxConcurrency |
Azure
| Component | Service |
|---|---|
| Fan-Out Dispatcher | Azure Durable Functions Fan-out/Fan-in pattern |
| Workers | Durable Activity Functions |
| Result Collection | Durable Orchestration Function Task.WhenAll |
| LLM Synthesis | Azure OpenAI Service |
On-Premises
| Component | Technology |
|---|---|
| Fan-Out Dispatcher | Python asyncio.gather with semaphore for concurrency control |
| Workers | Async coroutines; Ray for large-scale parallelism |
| Aggregation | Custom Python; LangChain parallel chain |
| LLM | vLLM with async OpenAI-compatible API |
16. Related Patterns
| Pattern | ID | Relationship Type | Notes |
|---|---|---|---|
| Mixture of Agents | EAAPL-WRK008 | Specialisation | MoA uses fan-out with diverse models for quality improvement; this pattern uses fan-out for throughput/coverage |
| Multi-Agent Orchestration | EAAPL-MAG001 | Peer | Orchestration manages agent coordination; fan-out is a specific execution topology within an orchestrated system |
| Plan-and-Execute | EAAPL-WRK005 | Complementary | Plan-and-Execute uses fan-out to execute parallelisable planned sub-tasks |
| Sequential Chain | EAAPL-WRK002 | Alternative | Sequential for dependent steps; fan-out for independent steps |
| Workflow State Machine | EAAPL-WRK012 | Integrates With | State machine governs fan-out state transitions and failure handling |
17. Maturity Assessment
Overall Maturity: Proven
| Dimension | Score (1–5) | Evidence |
|---|---|---|
| Research Foundation | 4 | Map-reduce heritage; ensemble learning literature; LLM parallelism well-documented |
| Production Deployment | 4 | Deployed in document processing, multi-source search, candidate generation |
| Framework Support | 4 | LangChain parallel chains; Step Functions Map; Durable Functions fan-out |
| Aggregation Tooling | 3 | Deterministic aggregation mature; LLM synthesis aggregation still evolving best practices |
| Observability | 3 | Per-worker observability available; straggler detection tooling maturing |
18. Revision History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2025-06-13 | Architecture Board | Initial publication in Agentic Workflows category |