[EAAPL-DAT006] Federated Learning Pattern
Category: Data Architecture
Sub-category: Distributed AI / Privacy-Preserving ML
Version: 1.1
Maturity: Emerging
Tags: federated-learning, FedAvg, differential-privacy, gradient-aggregation, consortium-AI, cross-silo, cross-device
Regulatory Relevance: GDPR Article 5/25, Privacy Act (Australia) APP 3, EU AI Act Article 10, APRA CPS 234, ISO 42001 §8.4
1. Executive Summary
Federated learning enables multiple organisations or devices to collaboratively train a shared AI model without any participant sharing their raw data. Each participant trains a local model on their own data and shares only model gradients or weights with a central coordinator, which aggregates them into an improved global model. No raw data ever leaves the participant's environment.
This pattern is transformative for industries where data cannot be centralised due to privacy regulation, competitive sensitivity, or jurisdictional restrictions: hospital consortia training clinical AI without sharing patient records; competing banks jointly training fraud detection models; mobile devices improving language models without uploading private messages.
The pattern covers cross-silo federation (organisation-to-organisation, typically fewer than 100 participants with high-quality data) and cross-device federation (device-to-server, potentially millions of participants with intermittent connectivity). It addresses the critical engineering challenges: communication efficiency, statistical heterogeneity (non-IID data), system heterogeneity, privacy protection through differential privacy on gradient updates, and robustness to malicious or faulty participants.
Target audience: Chief Data Officers, AI Research leads, Healthcare/Banking consortium architects, ML Platform leads.
2. Problem Statement
Business Problem
High-value AI use cases require data that no single organisation can legally or competitively accumulate. Healthcare networks want clinical AI trained on all patient populations; banks want fraud models trained on cross-institution transaction patterns; governments want public health AI trained across jurisdictions. Traditional data sharing is blocked by privacy law, competition law, or data sovereignty requirements.
Technical Problem
- Centralising personal data from multiple organisations can breach GDPR, the Australian Privacy Act, and similar privacy regulations, and can conflict with prudential standards such as APRA CPS 234.
- Data sharing agreements between competing organisations are commercially infeasible.
- Cross-border data transfer restrictions prevent cloud-based centralisation.
- Each organisation has too little data alone; combined they have sufficient statistical power.
- Even with anonymisation, competitive organisations will not share detailed customer data.
Symptoms
- AI use case identified as high-value but blocked at data access stage indefinitely.
- Each participant's model has poor performance due to small or non-representative local dataset.
- Privacy counsel blocking data consortium proposals.
- Regulators signalling openness to federated approaches (e.g., EBA guidance on federated fraud detection).
Cost of Inaction
| Dimension | Impact |
| --- | --- |
| Model quality | Individual models underperform federated models because each has insufficient data |
| Competitive | Organisations that have accumulated the most data win; federation levels the playing field |
| Regulatory | Data centralisation attempts attract regulatory scrutiny; federated approaches are easier to justify to regulators |
| Healthcare | Clinical AI trained on single-hospital data misses rare conditions, risking patient harm |
3. Context
When to Apply
- Multiple organisations or devices hold relevant training data that cannot be centralised.
- Privacy regulation or competition law prohibits raw data sharing.
- Cross-border data transfer restrictions apply.
- Data is sufficiently heterogeneous that centralisation would require complex harmonisation.
- Participants have sufficient compute to run local training (cross-silo: always; cross-device: modern smartphones).
When NOT to Apply
- Data can be legally and practically centralised (federated adds unnecessary complexity).
- Participants lack compute for local training.
- Data is severely heterogeneous (non-IID) to the point where federated training diverges (use transfer learning instead).
- Security of gradient transmission cannot be guaranteed (gradient inversion attacks possible in low-participant settings).
- Regulatory framework requires centralised data audit (some jurisdictions require data in a single auditable location).
Prerequisites
| Prerequisite | Minimum Viable | Preferred |
| --- | --- | --- |
| Participant compute | Modern CPU for tabular; GPU for deep learning | Dedicated GPU nodes at each participant |
| Network connectivity | Reliable internet (cross-silo) | High-bandwidth private network |
| Federation framework | Flower (FL framework), PySyft | Enterprise: NVIDIA FLARE, IBM FL, TensorFlow Federated (Google) |
| DP integration | Opacus DP-SGD | Calibrated DP budget per participant |
| Legal agreement | Data sharing agreement (data stays local) | Multilateral federated learning agreement with IP provisions |
Industry Applicability
| Industry | Applicability | Driver |
| --- | --- | --- |
| Healthcare | Critical | Patient privacy; multi-hospital clinical AI; rare disease research |
| Financial Services | High | Cross-bank fraud detection; credit risk consortium; AML |
| Telecommunications | High | Shared network anomaly detection; cross-carrier fraud |
| Government | Medium | Cross-agency AI without central data lake |
| Retail | Medium | Cross-retailer demand forecasting consortium |
| Automotive | High | Cross-manufacturer autonomous driving model improvement |
4. Architecture Overview
Design Philosophy
Federated learning inverts the traditional ML paradigm: rather than bringing data to the model, the model is brought to the data. The architecture must solve five engineering challenges simultaneously.
Challenge 1: Communication Efficiency. Gradient transmission between participants and the coordinator is the primary bottleneck in cross-silo FL: full gradient transmission for large neural networks can involve gigabytes per round. The pattern addresses this through gradient compression. Sparsification transmits only the top-k% of gradient entries by magnitude; quantisation reduces gradient precision from float32 to int8 or 4-bit. Together these can achieve a 10–100× communication reduction, usually with only a small accuracy loss.
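A minimal NumPy sketch of the two compression steps, assuming the gradient has already been flattened into a single float32 vector. The function names (`topk_sparsify`, `quantise_int8`, `decompress`) are illustrative and do not come from any federation framework.

```python
import numpy as np

def topk_sparsify(grad: np.ndarray, fraction: float):
    """Keep only the `fraction` of entries with the largest magnitude."""
    flat = grad.ravel()
    k = max(1, int(round(fraction * flat.size)))
    idx = np.argpartition(np.abs(flat), -k)[-k:]  # indices of the k largest |g|
    return idx.astype(np.int32), flat[idx].astype(np.float32)

def quantise_int8(values: np.ndarray):
    """Symmetric linear quantisation to int8 using one shared scale."""
    max_abs = float(np.max(np.abs(values)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(values / scale), -127, 127).astype(np.int8)
    return q, scale

def decompress(idx, q, scale, size):
    """Rebuild a dense vector; dropped entries are treated as zero."""
    dense = np.zeros(size, dtype=np.float32)
    dense[idx] = q.astype(np.float32) * scale
    return dense

# Example: compress a synthetic gradient (hypothetical data).
rng = np.random.default_rng(0)
grad = rng.normal(size=10_000).astype(np.float32)
idx, vals = topk_sparsify(grad, fraction=0.10)
q, scale = quantise_int8(vals)
restored = decompress(idx, q, scale, grad.size)

dense_bytes = grad.nbytes
compressed_bytes = idx.nbytes + q.nbytes + 4  # +4 bytes for the float32 scale
```

In this sketch the int32 indices dominate the payload, so the saving is well short of 100×. Reaching the upper end of the range generally needs compact index encoding, and implementations often accumulate the dropped residual locally so it is not lost.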
Challenge 2: Statistical Heterogeneity (Non-IID Data). Each participant's data reflects its own population, which may differ significantly from the global distribution, and naive FedAvg converges poorly on such non-IID data. The pattern addresses this through FedProx (adds a proximal term to the local loss function, preventing local models from drifting too far from the global model) and SCAFFOLD (corrects for client drift using control variates). For severely heterogeneous data, personalised federated learning is used: each participant maintains a locally fine-tuned head on top of the shared global representation.
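The FedProx idea can be illustrated with a small sketch. It assumes a least-squares local loss and plain gradient descent; the local objective becomes the local loss plus (μ/2)·‖w − w_global‖², so its gradient gains the term μ·(w − w_global). The function name and hyperparameter values are illustrative only.

```python
import numpy as np

def fedprox_local_train(w_global, X, y, mu=0.1, lr=0.05, epochs=50):
    """Gradient descent on least-squares loss plus the FedProx proximal term."""
    w = w_global.copy()
    n = len(y)
    for _ in range(epochs):
        grad_loss = X.T @ (X @ w - y) / n   # gradient of the local loss
        grad_prox = mu * (w - w_global)     # pulls w back towards the global model
        w -= lr * (grad_loss + grad_prox)
    return w

# Example: one participant whose local optimum is far from the global model.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5])
w_global = np.zeros(3)

w_free = fedprox_local_train(w_global, X, y, mu=0.0)  # plain local training
w_prox = fedprox_local_train(w_global, X, y, mu=5.0)  # strong proximal term

drift_free = np.linalg.norm(w_free - w_global)
drift_prox = np.linalg.norm(w_prox - w_global)
```

A larger μ keeps the local model closer to the global model at the cost of fitting the local data less closely, which is the lever the pattern uses when rounds diverge.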
Challenge 3: Privacy Protection. Gradient sharing is safer than raw data sharing but not perfectly private: gradient inversion attacks can recover training data from gradients, particularly in small-participant settings. The pattern applies differential privacy via DP-SGD (Opacus) at each participant before gradient transmission: gradients are clipped to a fixed norm and calibrated Gaussian noise is added to the clipped gradients. Secure aggregation protocols (such as Google's SecAgg) let the coordinator compute the aggregate gradient without seeing any individual participant's gradient, so even the coordinator cannot invert a specific participant's contribution.
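The clip-then-noise step can be sketched as follows. This is a simplified illustration that clips a participant's whole update once, whereas DP-SGD as implemented in libraries such as Opacus clips per-sample gradients at every training step. The function name `privatise_update` and the parameter values are assumptions made for the example.

```python
import numpy as np

def privatise_update(update, clip_norm, noise_multiplier, rng):
    """Clip the update to L2 norm `clip_norm`, then add Gaussian noise.

    The noise standard deviation is noise_multiplier * clip_norm, so the
    noise is calibrated to the maximum influence of the clipped update.
    """
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / (norm + 1e-12))
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=update.shape)
    return clipped + noise

# Example with a hypothetical two-parameter update of norm 5.
rng = np.random.default_rng(0)
update = np.array([3.0, 4.0])
clipped_only = privatise_update(update, clip_norm=1.0, noise_multiplier=0.0, rng=rng)
noisy = privatise_update(update, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
```

The resulting privacy guarantee (ε, δ) depends on the noise multiplier, the sampling rate, and the number of rounds, and has to be computed by a privacy accountant; this sketch does not compute it.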
Challenge 4: System Heterogeneity. In cross-device FL, participants have widely varying compute capabilities and connectivity. The pattern implements asynchronous federated learning (FedAsync): participants submit gradients when ready rather than in synchronised rounds, and the global model is updated with each received update using a mixing hyperparameter that weights the update against the current global model. This handles stragglers without blocking the federation round.
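A sketch of the asynchronous mixing update, assuming model weights are flat NumPy vectors. The polynomial staleness discount is one common choice for down-weighting late updates and is an assumption here; the pattern itself only specifies a mixing hyperparameter.

```python
import numpy as np

def fedasync_update(global_w, client_w, alpha, staleness, a=0.5):
    """Mix one client model into the global model as soon as it arrives.

    alpha      base mixing hyperparameter in (0, 1)
    staleness  number of global versions produced since the client
               downloaded its copy (0 means fully up to date)
    a          exponent of the polynomial staleness discount (assumed)
    """
    alpha_t = alpha * (1.0 + staleness) ** (-a)
    return (1.0 - alpha_t) * global_w + alpha_t * client_w

# Example: a fresh update moves the model further than a stale one.
global_w = np.zeros(2)
client_w = np.ones(2)
fresh = fedasync_update(global_w, client_w, alpha=0.5, staleness=0)
stale = fedasync_update(global_w, client_w, alpha=0.5, staleness=3)
```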
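Challenge 5: Byzantine Robustness. Malicious or faulty participants may submit adversarial gradients to corrupt the global model (model poisoning). The coordinator applies gradient validation: per-participant gradient norms are compared, and outliers (more than 3σ from the round mean) are rejected or down-weighted. FedMedian aggregation (coordinate-wise median instead of mean) provides Byzantine-robust aggregation in adversarial settings. Because per-participant validation needs visibility of individual updates, it has to be reconciled with secure aggregation, for example by enforcing norm bounds on the participant side before masking.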
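Both defences can be sketched in a few lines of NumPy, assuming each participant's update is one row of a 2-D array. The function names are illustrative.

```python
import numpy as np

def norm_outlier_mask(updates, z_threshold=3.0):
    """Return True for updates whose L2 norm is within z_threshold sigma."""
    norms = np.linalg.norm(updates, axis=1)
    mean, std = norms.mean(), norms.std()
    if std == 0:
        return np.ones(len(updates), dtype=bool)
    return np.abs(norms - mean) <= z_threshold * std

def coordinate_median(updates):
    """Coordinate-wise median across participants."""
    return np.median(updates, axis=0)

# Example: 11 honest participants and one poisoned update (synthetic data).
rng = np.random.default_rng(0)
honest = rng.normal(1.0, 0.05, size=(11, 4))
poisoned = np.full((1, 4), 50.0)
updates = np.vstack([honest, poisoned])

keep = norm_outlier_mask(updates)
robust = coordinate_median(updates)
```

One caveat worth checking against the consortium size: with the population standard deviation, a single value can lie at most sqrt(n − 1) standard deviations from the mean of n values, so a 3σ rule cannot fire with 10 or fewer participants. Small cross-silo federations may therefore need a median-based rule or an absolute norm bound instead.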
Federation Coordinator. The coordinator orchestrates rounds: selects participants, distributes the global model, collects and aggregates gradients, validates gradient integrity, and updates the global model. The coordinator does not process raw data. In cross-silo settings, the coordinator may be hosted by a neutral third party (industry consortium body, regulator-approved platform) to ensure no participant gains competitive advantage.
5. Architecture Diagram
flowchart TD
subgraph Participants["Participant Environments"]
A[Participant A Local Data]
B[Participant B Local Data]
C[Participant C Local Data]
end
subgraph Training["Local Training"]
D[Local Model + DP-SGD]
end
subgraph Coordinator["Federation Coordinator"]
E[Gradient Validator]
F[Secure Aggregator]
G[(Global Model Registry)]
end
A --> D
B --> D
C --> D
D -->|compressed DP gradients| E
E --> F
F --> G
G -->|updated global model| D
style A fill:#dbeafe,stroke:#3b82f6
style B fill:#dbeafe,stroke:#3b82f6
style C fill:#dbeafe,stroke:#3b82f6
style D fill:#f0fdf4,stroke:#22c55e
style E fill:#f3e8ff,stroke:#a855f7
style F fill:#f0fdf4,stroke:#22c55e
style G fill:#fef9c3,stroke:#eab308
6. Components
| Component | Type | Responsibility | Technology Options | Criticality |
| --- | --- | --- | --- | --- |
| Federation Coordinator | Service | Round orchestration; global model distribution; gradient aggregation | Flower (flwr), NVIDIA FLARE, IBM FL, Google Federated Core | Critical |
| Secure Aggregator | Processing | Aggregates gradients without exposing individual contributions (SecAgg protocol) | Google SecAgg, PySyft SecureSum, custom MPC | High |
| Gradient Validator | Processing | Detects and rejects outlier/adversarial gradients; Byzantine robustness | Custom norm-based filter; FedMedian aggregation | High |
| Local Training Engine | Processing (per participant) | Trains local model on participant's data; implements FedProx/SCAFFOLD | PyTorch + Flower client, TensorFlow Federated, NVIDIA FLARE client | Critical |
| DP-SGD Engine | Processing (per participant) | Applies differential privacy to gradients before transmission | Opacus (PyTorch), TensorFlow Privacy | Critical |
| Gradient Compressor | Processing (per participant) | Sparsification and quantisation of gradients for transmission efficiency | Custom Python; PowerSGD; TopK sparsification | High |
| Global Model Registry | Storage | Stores global model versions; tracks federation round history | MLflow, DVC, Weights & Biases, custom | High |
| DP Budget Tracker | Processing | Tracks cumulative privacy budget (ε) across rounds per participant | Opacus privacy accounting, custom ε tracker | Critical |
| Hold-out Evaluator | Processing | Evaluates global model on neutral validation dataset not owned by any participant | Custom Python evaluation harness | High |
| Federation Agreement Registry | Governance | Stores legal federation agreement; approved use cases; participant consent | Custom registry; legal document management | High |
7. Data Flow
Primary Flow
| Step | Actor | Action | Output |
| --- | --- | --- | --- |
| 1 | Coordinator | Initialises global model; selects participants for round | Round configuration with global model checkpoint |
| 2 | Participants | Download global model checkpoint | Local copy of global model |
| 3 | Each participant | Trains local model on local data using FedProx with proximal term | Updated local model weights |
| 4 | Each participant | Applies differential privacy: clips gradients and adds calibrated Gaussian noise (with DP-SGD this happens at each step of local training) | DP-protected gradient |
| 5 | Each participant | Compresses gradient: sparsification + quantisation | Compressed DP gradient |
| 6 | Coordinator | Receives compressed DP gradients from participants | Gradient set for round |
| 7 | Gradient Validator | Checks gradient norms; rejects outliers | Validated gradient set |
| 8 | Secure Aggregator | Aggregates validated gradients (FedAvg-style weighted mean; FedProx changes only the local objective and uses the same server-side averaging) | Aggregated global gradient |
| 9 | Coordinator | Updates global model with aggregated gradient | New global model version |
| 10 | Hold-out Evaluator | Evaluates global model on neutral validation set | Accuracy + fairness metrics |
| 11 | Coordinator | If metrics meet threshold: promote global model; begin next round | Promoted global model; federated training continues |
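The aggregation and update steps of the primary flow (steps 8 and 9) can be sketched as a weighted average, assuming each participant reports a flattened model delta (local weights minus global weights) together with its number of training examples. The function names and the `server_lr` parameter are illustrative.

```python
import numpy as np

def fedavg_aggregate(updates, num_examples):
    """Weighted mean of participant updates, weighted by local dataset size."""
    weights = np.asarray(num_examples, dtype=np.float64)
    stacked = np.stack([np.asarray(u, dtype=np.float64) for u in updates])
    return (weights[:, None] * stacked).sum(axis=0) / weights.sum()

def apply_round(global_model, updates, num_examples, server_lr=1.0):
    """Apply the aggregated update to produce the next global model version."""
    return global_model + server_lr * fedavg_aggregate(updates, num_examples)

# Example: two participants, the second holding three times as much data.
global_model = np.zeros(2)
updates = [np.array([1.0, 1.0]), np.array([3.0, 3.0])]
num_examples = [100, 300]

aggregated = fedavg_aggregate(updates, num_examples)
new_global = apply_round(global_model, updates, num_examples)
```

Weighting by dataset size means large participants dominate the global model, which is one reason the pattern asks for per-participant fairness metrics before promotion.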
Error Flow
| Error Condition | Trigger | Response | Recovery |
| --- | --- | --- | --- |
| Participant dropout mid-round | Network failure; compute failure | Coordinator proceeds with available participants (minimum quorum check) | Retry dropped participant in next round |
| Gradient norm outlier (possible poisoning) | Gradient norm >3σ from round mean | Gradient rejected; participant flagged for review | Review participant's local training setup; escalate to consortium governance |
| DP budget exhausted (ε > threshold) | Cumulative rounds exceed privacy budget | Participant stops contributing; new training data or budget reset required | Negotiate new DP budget; assess if prior training meets privacy requirements |
| Global model divergence (loss increases) | Non-IID data; insufficient FedProx proximal term | Round rejected; proximal term strength increased | Adjust FedProx μ hyperparameter; reduce local training epochs |
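The dropout and divergence rows above can be expressed as a small decision function. This is a sketch under assumed policy values: the 80% quorum and the doubling of μ are hypothetical defaults, not values prescribed by the pattern.

```python
from dataclasses import dataclass

@dataclass
class RoundDecision:
    accept: bool
    mu: float      # FedProx proximal strength to use in the next round
    reason: str

def decide_round(n_received, n_selected, prev_loss, new_loss, mu,
                 min_quorum=0.8, mu_growth=2.0):
    """Accept or reject a round based on quorum and hold-out loss."""
    if n_received / n_selected < min_quorum:
        return RoundDecision(False, mu, "quorum not met")
    if new_loss > prev_loss:
        # Divergence: reject the round and strengthen the proximal term.
        return RoundDecision(False, mu * mu_growth, "divergence")
    return RoundDecision(True, mu, "accepted")

# Example outcomes for three hypothetical rounds.
dropout = decide_round(3, 5, prev_loss=0.5, new_loss=0.4, mu=0.1)
diverged = decide_round(5, 5, prev_loss=0.5, new_loss=0.7, mu=0.1)
healthy = decide_round(5, 5, prev_loss=0.5, new_loss=0.4, mu=0.1)
```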
8. Security Considerations
Authentication & Authorisation
- Each participant authenticates to coordinator using mutual TLS certificates; participant identity linked to federation legal agreement.
- Coordinator validates participant identity before distributing global model.
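A minimal sketch of a coordinator-side TLS policy that requires client certificates, using only the Python standard library. The function name and the file-path parameters are hypothetical; in a real deployment the paths would point at the coordinator's certificate, its private key, and the consortium's CA bundle.

```python
import ssl

def coordinator_tls_context(cert_file=None, key_file=None, ca_file=None):
    """Build a server-side context that enforces mutual TLS over TLS 1.3."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3
    ctx.verify_mode = ssl.CERT_REQUIRED  # reject clients without a valid cert
    if ca_file:
        # Trust only certificates issued by the consortium CA.
        ctx.load_verify_locations(cafile=ca_file)
    if cert_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx

# Policy-only context; certificate files are omitted in this example.
ctx = coordinator_tls_context()
```

Mapping the authenticated certificate subject to a signatory of the federation agreement is a separate application-level check that this sketch does not cover.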
Secrets Management
- DP-SGD noise seed managed locally by each participant; not shared with coordinator.
- Secure Aggregation protocol keys ephemeral per round; participants derive shared secrets using Diffie-Hellman.
Data Classification
- Raw training data classified as Confidential or higher at each participant; never transmitted.
- DP gradients classified as Internal; coordinator sees only aggregated gradient.
- Global model classified per use case sensitivity; clinical AI models typically Confidential.
Encryption
- All gradient transmission encrypted using TLS 1.3.
- Secure Aggregation provides additional cryptographic guarantee: coordinator cannot decrypt individual participant gradients.
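The core idea behind secure aggregation, pairwise masks that cancel in the sum, can be illustrated with a toy sketch. It assumes integer-encoded updates and hands each pair of participants a shared seed directly; a real protocol derives those seeds through key agreement, handles dropouts with secret sharing, and uses a cryptographic generator rather than NumPy's. All names are illustrative.

```python
import itertools
import numpy as np

MOD = 2**32  # all arithmetic is modulo MOD, mimicking a finite-field protocol

def masked_update(i, x, n, pair_seeds):
    """Participant i adds one mask per peer; the masks cancel in the sum."""
    y = np.asarray(x, dtype=np.int64) % MOD
    for j in range(n):
        if j == i:
            continue
        seed = pair_seeds[(min(i, j), max(i, j))]
        mask = np.random.default_rng(seed).integers(
            0, MOD, size=y.size, dtype=np.int64)
        # The lower-indexed participant adds the mask, the other subtracts it.
        y = (y + mask) % MOD if i < j else (y - mask) % MOD
    return y

def unmask_sum(masked):
    """Coordinator sums masked vectors and decodes signed integers."""
    total = np.sum(masked, axis=0) % MOD
    return np.where(total >= MOD // 2, total - MOD, total)

# Example: three participants with small integer updates (toy data).
updates = [np.array([5, -3, 2]), np.array([-1, 4, 0]), np.array([7, 7, -9])]
n = len(updates)
pair_seeds = {pair: 1000 + k
              for k, pair in enumerate(itertools.combinations(range(n), 2))}
masked = [masked_update(i, x, n, pair_seeds) for i, x in enumerate(updates)]
aggregate = unmask_sum(masked)
```

Each masked vector looks random on its own, yet the coordinator recovers the exact sum. Floating-point gradients would first need fixed-point encoding to fit this integer arithmetic.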
Auditability
- Every federation round logged: round number, participants, aggregation method, global model version, evaluation metrics.
- DP budget consumption per participant logged; shared with participant for their own privacy accounting.
- Gradient rejection events logged with reason; reviewed by consortium governance.
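The per-participant DP budget log could be backed by a tracker like the following sketch. It assumes basic sequential composition, where the ε of each round is simply summed; this is a conservative upper bound, and tighter accountants (for example the Rényi DP accounting available in Opacus) would report a smaller cumulative ε for the same training. The class and method names are hypothetical.

```python
from collections import defaultdict

class PrivacyBudgetTracker:
    """Tracks cumulative epsilon per participant against a fixed budget."""

    def __init__(self, epsilon_budget: float):
        self.epsilon_budget = epsilon_budget
        self.spent = defaultdict(float)

    def record_round(self, participant: str, epsilon: float) -> bool:
        """Record a round's epsilon; refuse if it would exceed the budget."""
        if self.spent[participant] + epsilon > self.epsilon_budget:
            return False  # participant must sit out or renegotiate the budget
        self.spent[participant] += epsilon
        return True

    def remaining(self, participant: str) -> float:
        return self.epsilon_budget - self.spent[participant]

# Example: a budget of 1.0 allows two rounds at epsilon 0.4, not three.
tracker = PrivacyBudgetTracker(epsilon_budget=1.0)
results = [tracker.record_round("hospital_a", 0.4) for _ in range(3)]
left = tracker.remaining("hospital_a")
```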
OWASP LLM Top 10 Mapping
| OWASP LLM Risk | Relevance | Mitigation |
| --- | --- | --- |
| LLM03: Training Data Poisoning | Malicious participant submits adversarial gradients | Gradient validation; FedMedian aggregation; Byzantine-robust aggregation |
| LLM06: Sensitive Information Disclosure | Gradient inversion could recover training data | DP-SGD limits what inversion can recover; SecAgg hides individual gradients from coordinator |
| LLM04: Model Denial of Service | Participant submits malformed gradient causing coordinator crash | Gradient schema validation; norm check before aggregation |
9. Governance Considerations
Responsible AI
- Federated models may perform differently across participant populations (fairness concern): hold-out evaluation must include per-participant subgroup performance metrics.
- Consortium governance board responsible for global model promotion decisions when performance is uneven.
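A per-participant fairness report can start from something as simple as the sketch below, assuming a classification task and a hold-out set in which every example is labelled with the participant subpopulation it represents. The function names and the use of accuracy as the metric are illustrative choices.

```python
import numpy as np

def subgroup_accuracy(y_true, y_pred, group_ids):
    """Accuracy of the global model on each participant subpopulation."""
    y_true, y_pred, group_ids = map(np.asarray, (y_true, y_pred, group_ids))
    return {
        str(g): float(np.mean(y_true[group_ids == g] == y_pred[group_ids == g]))
        for g in np.unique(group_ids)
    }

def accuracy_gap(per_group):
    """Difference between the best- and worst-served subpopulation."""
    return max(per_group.values()) - min(per_group.values())

# Example with toy predictions for two subpopulations.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 1]
groups = ["A", "A", "A", "B", "B", "B"]

report = subgroup_accuracy(y_true, y_pred, groups)
gap = accuracy_gap(report)
```

A large gap is the kind of uneven performance that would go to the consortium governance board before promotion.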
Model Risk Management
- Global model is trained on distributed data with different quality levels; model risk documentation must describe participant data quality standards.
- Model risk committee approval required before production deployment of federated model.
Human Approval Checkpoints
- Federation agreement signed by legal representatives of all participants.
- Global model promotion requires consortium governance board approval.
- DP budget reset (increasing privacy expenditure) requires participant-level DPO approval.
Governance Artefacts
| Artefact | Owner | Cadence | Purpose |
| --- | --- | --- | --- |
| Federation Agreement | Legal / Consortium | On establishment | Defines IP ownership of global model; permitted uses; data sovereignty |
| Round Audit Log | Coordinator | Per round | Immutable log of participants, gradients received/rejected, model version |
| DP Budget Report | Each Participant | Per round | Cumulative ε consumption; participant's own privacy accounting |
| Per-Participant Fairness Report | Coordinator | Per promotion | Global model performance on each participant's subpopulation |
| Model Card (Federated) | Consortium ML Team | Per model version | Training cohort summary; DP parameters; known limitations per participant |
10. Operational Considerations
Monitoring
| Metric | Alert Threshold | Tooling |
| --- | --- | --- |
| Round completion rate | <80% participants completing round | Coordinator logs |
| Global model accuracy (hold-out) | <performance floor | Evaluation pipeline |
| Gradient rejection rate | >10% in a round | Coordinator metrics |
| DP budget consumption rate | >budget plan | Budget tracker |
| Per-participant contribution latency | >round deadline | Coordinator timing |
SLOs
| SLO | Target | Measurement |
| --- | --- | --- |
| Federation round completion | <4 hours per round (cross-silo) | Round timing logs |
| Global model hold-out evaluation | <1 hour after round completion | Evaluation pipeline |
| Gradient transmission availability | >99.5% per participant | Network monitoring |
Disaster Recovery
| Component | RTO | RPO | Strategy |
| --- | --- | --- | --- |
| Coordinator | 2 hours | Last completed round | Stateless except model registry; restore from model registry |
| Global Model Registry | 4 hours | 1 hour | Cross-region replication |
| Participant Local Training | Per participant | Per participant | Each participant manages their own training infrastructure |
11. Cost Considerations
Cost Drivers
| Cost Driver | Typical Range | Notes |
| --- | --- | --- |
| Coordinator compute | $500–$5,000/month | Lightweight for aggregation; scales with participant count |
| Participant training compute | $1,000–$10,000/month per participant | GPU training; largest cost component |
| Network (gradient transmission) | $100–$1,000/month | Reduced by compression; scales with model size × rounds |
| Secure Aggregation compute | $200–$2,000/month | Cryptographic overhead; scales with participant count |
| Legal (federation agreement) | $20,000–$100,000 one-time | Multilateral consortium agreement |
Indicative Cost Range
| Scale | Monthly Cost (Coordinator) | Monthly Cost (Per Participant) |
| --- | --- | --- |
| Small consortium (3–5 participants) | $1,000–$5,000 | $1,000–$5,000 |
| Medium consortium (10–20 participants) | $3,000–$15,000 | $2,000–$8,000 |
| Large / cross-device (100+ participants) | $10,000–$50,000 | Varies widely |
12. Trade-Off Analysis
Option Comparison
| Option | Pros | Cons | Recommended When |
| --- | --- | --- | --- |
| A: Federated Learning (this pattern) | True data locality; privacy-preserving; legally viable | Complex; convergence slower than centralised; requires participant compute | Data cannot be centralised; regulatory blocking; >3 participants |
| B: Data clean room / privacy sandbox | Raw data never shared; analytics on aggregate queries | Limited to query-based insights; cannot train complex ML models | Analytics use cases; not full ML model training |
| C: Centralised with contractual data sharing | Better convergence; simpler architecture | Legally complex; GDPR/competition risk; single point of breach | Trusted consortium; single jurisdiction; non-sensitive data |
| D: Transfer learning (pre-train centrally, fine-tune locally) | No raw data sharing for fine-tuning; good performance | Requires large public pre-training dataset; may not transfer to specialist domains | Public pre-training data available; specialist fine-tuning needed |
Architectural Tensions
| Tension | Trade-Off | Resolution |
| --- | --- | --- |
| Privacy (DP noise) vs. model utility | More DP noise = better privacy, worse model quality | Tune ε per risk level; accept utility reduction for high-risk use cases |
| Communication efficiency vs. convergence speed | More compression = faster rounds, slower convergence | Use TopK sparsification (top 10% gradients) as default; adjust per round budget |
| Cross-silo vs. cross-device architecture | Cross-device needs async + partial participation; cross-silo needs synchronous consensus | Implement FedAsync for cross-device; synchronous FedProx for cross-silo |
13. Failure Modes
| Failure | Likelihood | Impact | Detection | Recovery |
| --- | --- | --- | --- | --- |
| Byzantine participant (model poisoning via adversarial gradients) | Low | High: global model corrupted | Gradient norm outlier detection; FedMedian | Isolate participant; revert to last clean model; re-run rounds without participant |
| Global model divergence on non-IID data | Medium | High: useless global model | Hold-out evaluation after each round | Increase FedProx proximal term; reduce local epochs; use SCAFFOLD |
| DP budget exhaustion (participant cannot contribute) | Medium | Medium: federation loses participant | DP budget tracker per participant | Assess privacy-utility trade-off; negotiate extended budget or retire participant |
| Coordinator compromise | Very Low | Critical: adversary sees round gradients | Intrusion detection; secure aggregation | Secure aggregation prevents coordinator from seeing individual gradients; rotate keys |
| Legal dispute over global model IP | Low | High: programme halted | Legal agreement monitoring | Federation agreement must pre-define IP ownership; no execution dependency on dispute resolution |
14. Regulatory Considerations
| Regulation | Requirement | Pattern Response |
| --- | --- | --- |
| GDPR Article 5(1)(b) | Purpose limitation | Federation agreement defines permitted use of global model |
| GDPR Article 25 | Privacy by design | Federated architecture keeps data local; DP-SGD minimises gradient leakage |
| EU AI Act Article 10 | Training data governance | Global model inherits data governance requirements of all participants' data |
| APRA CPS 234 | Third-party risk | Coordinator is a third party; security standards contractually required |
| Competition law | Anti-trust compliance | Coordinator must not aggregate commercially sensitive information; legal sign-off required |
| Cross-border data transfer | Data sovereignty | Raw data stays local; gradient transmission reviewed per jurisdiction |
15. Reference Implementations
AWS
| Component | AWS Service |
| --- | --- |
| Coordinator | Custom Flower coordinator on Amazon ECS, integrated with Amazon SageMaker |
| Participant training | SageMaker Training Jobs at each participant site |
| Global model registry | SageMaker Model Registry |
| Secure communication | AWS PrivateLink (cross-silo) |
Azure
| Component | Azure Service |
| --- | --- |
| Coordinator | Azure ML Federated Learning (preview) + NVIDIA FLARE on AKS |
| Participant training | Azure ML Compute at each participant |
| Model registry | Azure ML Model Registry |
GCP
| Component | GCP Service |
| --- | --- |
| Coordinator | Vertex AI + Flower on Cloud Run |
| Participant training | Vertex AI Training per participant |
| Global model | Vertex AI Model Registry |
On-Premises
| Component | Technology |
| --- | --- |
| Coordinator | Flower (flwr) or NVIDIA FLARE on Kubernetes |
| Participant | PyTorch + Opacus on GPU node |
| Communication | mTLS over private network |
| Model registry | MLflow |
16. Related Patterns
| Pattern | ID | Relationship | Notes |
| --- | --- | --- | --- |
| Privacy by Design for AI Data | EAAPL-DAT005 | Complements | DP-SGD is a privacy-by-design technique |
| Synthetic Data Generation | EAAPL-DAT004 | Alternative | Synthetic data is an alternative when federated training is infeasible |
| AI Training Data Governance | EAAPL-DAT007 | Depends on | Federation agreement is a training data governance artefact |
| Model Versioning | EAAPL-MDL001 | Depends on | Global model versioning per federation round |
| Fine-Tuning Pipeline | EAAPL-MDL006 | Complements | Global federated model fine-tuned locally per participant |
17. Maturity Assessment
Overall Maturity: Emerging — Federated learning frameworks (Flower, NVIDIA FLARE) are production-ready. Cross-silo deployments are proven in healthcare and finance consortia. Cross-device FL at scale is mature (Google deployed FL for Gboard). However, regulatory frameworks for federated AI governance are still developing.
| Dimension | Score (1–5) | Notes |
| --- | --- | --- |
| Architectural clarity | 4 | Well-defined federation patterns; non-IID handling still research-active |
| Tooling maturity | 3 | Flower/NVIDIA FLARE production-ready; enterprise tooling maturing |
| Regulatory alignment | 3 | GDPR alignment good; AI Act treatment of federated models unclear |
| Operational complexity | 2 | High operational complexity; multi-party coordination challenging |
| Cost efficiency | 3 | High participant compute cost; offset by regulatory compliance enablement |
| Security | 4 | DP + SecAgg provides strong privacy guarantees |
18. Revision History
| Version | Date | Author | Changes |
| --- | --- | --- | --- |
| 1.0 | 2024-03-01 | EAAPL Working Group | Initial publication |
| 1.1 | 2025-03-01 | EAAPL Working Group | Added FedProx/SCAFFOLD detail; Byzantine robustness; NVIDIA FLARE reference |