[EAAPL-DAT006] Federated Learning Pattern

Category: Data Architecture
Sub-category: Distributed AI / Privacy-Preserving ML
Version: 1.1
Maturity: Emerging
Tags: federated-learning, FedAvg, differential-privacy, gradient-aggregation, consortium-AI, cross-silo, cross-device
Regulatory Relevance: GDPR Articles 5 and 25, Privacy Act (Australia) APP 3, EU AI Act Article 10, APRA CPS 234, ISO/IEC 42001 §8.4

1. Executive Summary

Federated learning enables multiple organisations or devices to collaboratively train a shared AI model without any participant sharing their raw data. Each participant trains a local model on their own data and shares only model gradients or weights with a central coordinator, which aggregates them into an improved global model. No raw data ever leaves the participant's environment.

This pattern is transformative for industries where data cannot be centralised due to privacy regulation, competitive sensitivity, or jurisdictional restrictions: hospital consortia training clinical AI without sharing patient records; competing banks jointly training fraud detection models; mobile devices improving language models without uploading private messages.

The pattern covers cross-silo federation (organisation-to-organisation, typically <100 participants with high-quality data) and cross-device federation (device-to-server, potentially millions of participants with intermittent connectivity). It addresses the critical engineering challenges: communication efficiency, statistical heterogeneity (non-IID data), system heterogeneity, and privacy amplification through differential privacy on gradient updates.

Target audience: Chief Data Officers, AI Research leads, Healthcare/Banking consortium architects, ML Platform leads.

2. Problem Statement

Business Problem

High-value AI use cases require data that no single organisation can legally or competitively accumulate. Healthcare networks want clinical AI trained on all patient populations; banks want fraud models trained on cross-institution transaction patterns; governments want public health AI trained across jurisdictions. Traditional data sharing is blocked by privacy law, competition law, or data sovereignty requirements.

Technical Problem

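Centralising personal data from multiple organisations can breach GDPR, the Privacy Act (Australia), and similar privacy regulations, and complicates compliance with prudential standards such as APRA CPS 234.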
Data sharing agreements between competing organisations are commercially infeasible.
Cross-border data transfer restrictions prevent cloud-based centralisation.
Each organisation has too little data alone; combined they have sufficient statistical power.
Even with anonymisation, competing organisations will not share detailed customer data.

Symptoms

AI use case identified as high-value but blocked at data access stage indefinitely.
Each participant's model has poor performance due to small or non-representative local dataset.
Privacy counsel blocking data consortium proposals.
Regulators signalling openness to federated approaches (e.g., EBA guidance on federated fraud detection).

Cost of Inaction

Dimension	Impact
Model quality	Individual models trained on insufficient local data underperform federated models
Competitive	Organisations that have accumulated the most data win; federation levels the playing field
Regulatory	Data centralisation attempts attract regulatory scrutiny; regulators tend to prefer federated approaches
Healthcare	Clinical AI trained on single-hospital data misses rare conditions, risking patient harm

3. Context

When to Apply

Multiple organisations or devices hold relevant training data that cannot be centralised.
Privacy regulation or competition law prohibits raw data sharing.
Cross-border data transfer restrictions apply.
Data is sufficiently heterogeneous that centralisation would require complex harmonisation.
Participants have sufficient compute to run local training (cross-silo: always; cross-device: modern smartphones).

When NOT to Apply

Data can be legally and practically centralised (federated adds unnecessary complexity).
Participants lack compute for local training.
Data is severely heterogeneous (non-IID) to the point where federated training diverges (use transfer learning instead).
Security of gradient transmission cannot be guaranteed (gradient inversion attacks possible in low-participant settings).
Regulatory framework requires centralised data audit (some jurisdictions require data in a single auditable location).

Prerequisites

Prerequisite	Minimum Viable	Preferred
Participant compute	Modern CPU for tabular; GPU for deep learning	Dedicated GPU nodes at each participant
Network connectivity	Reliable internet (cross-silo)	High-bandwidth private network
Federation framework	Flower (flwr), PySyft	Enterprise: NVIDIA FLARE, IBM Federated Learning, TensorFlow Federated
DP integration	Opacus DP-SGD	Calibrated DP budget per participant
Legal agreement	Data sharing agreement (data stays local)	Multilateral federated learning agreement with IP provisions

Industry Applicability

Industry	Applicability	Driver
Healthcare	Critical	Patient privacy; multi-hospital clinical AI; rare disease research
Financial Services	High	Cross-bank fraud detection; credit risk consortium; AML
Telecommunications	High	Shared network anomaly detection; cross-carrier fraud
Government	Medium	Cross-agency AI without central data lake
Retail	Medium	Cross-retailer demand forecasting consortium
Automotive	High	Cross-manufacturer autonomous driving model improvement

4. Architecture Overview

Design Philosophy

Federated learning inverts the traditional ML paradigm: rather than bringing data to the model, the model is brought to the data. The architecture must solve five engineering challenges simultaneously.

Challenge 1 — Communication Efficiency. In cross-silo FL, gradient transmission between participants and the coordinator is the primary bottleneck. Full gradient transmission for large neural networks can involve gigabytes per round. The pattern addresses this through gradient compression (sparsification: transmit only top-k% gradients by magnitude; quantisation: reduce gradient precision from float32 to int8 or 4-bit), achieving 10–100× communication reduction with minimal accuracy loss.
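The compression step described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the pattern's reference implementation: the function names, the 30% keep fraction, and the gradient values are all assumptions made for the example, and a production system would operate on tensors rather than Python lists.

```python
from typing import Dict, List, Tuple


def top_k_sparsify(grad: List[float], fraction: float) -> Dict[int, float]:
    """Keep the largest-magnitude entries; return {index: value}."""
    k = max(1, int(len(grad) * fraction))
    ranked = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    return {i: grad[i] for i in sorted(ranked[:k])}


def quantise_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric linear quantisation into the int8 range [-127, 127]."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    return [round(v / scale) for v in values], scale


def dequantise(codes: List[int], scale: float) -> List[float]:
    """Receiver side: map int8 codes back to approximate float values."""
    return [c * scale for c in codes]


# Illustrative 10-element gradient (made-up values).
grad = [0.02, -1.27, 0.003, 0.64, -0.01, 0.0, 0.4, -0.05, 0.001, 0.07]

sparse = top_k_sparsify(grad, fraction=0.3)        # keep the top 30% by magnitude
codes, scale = quantise_int8(list(sparse.values()))
restored = dict(zip(sparse.keys(), dequantise(codes, scale)))
```

Only the kept indices, the int8 codes, and one scale factor need to be transmitted. Note that the indices themselves cost bytes, so the achievable reduction depends on how they are encoded.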

Challenge 2 — Statistical Heterogeneity (Non-IID Data). Each participant's data reflects their specific population, which may differ significantly from the global distribution. Naive FedAvg converges poorly on non-IID data. The pattern addresses this through FedProx (adds a proximal term to local loss function, preventing local models from drifting too far from the global model) and SCAFFOLD (corrects for client drift using control variates). For severely heterogeneous data, personalised federated learning (each participant maintains a local fine-tuned head on top of the shared global representation) is used.
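To make the proximal term concrete, the sketch below runs local gradient descent on a toy quadratic loss with and without the FedProx penalty (μ/2)·‖w − w_global‖². The quadratic loss, the learning rate, and the example optimum are assumptions chosen so the effect is easy to see; they are not taken from the pattern.

```python
from typing import List


def fedprox_local_train(w_global: List[float], local_optimum: List[float],
                        mu: float, lr: float = 0.1, steps: int = 200) -> List[float]:
    """Gradient descent on 0.5*||w - c||^2 + (mu/2)*||w - w_global||^2."""
    w = list(w_global)
    for _ in range(steps):
        for i in range(len(w)):
            grad_loss = w[i] - local_optimum[i]      # pulls towards the local optimum c
            grad_prox = mu * (w[i] - w_global[i])    # pulls back towards the global model
            w[i] -= lr * (grad_loss + grad_prox)
    return w


w_global = [0.0, 0.0]
local_optimum = [10.0, -4.0]   # hypothetical optimum of one participant's non-IID data

w_plain = fedprox_local_train(w_global, local_optimum, mu=0.0)    # no proximal term
w_fedprox = fedprox_local_train(w_global, local_optimum, mu=1.0)  # with proximal term
```

With μ = 0 the local model settles on the participant's own optimum; with μ = 1 it stops halfway between the global model and the local optimum. That halfway point is the drift limit the proximal term is meant to enforce.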

Challenge 3 — Privacy Amplification. Gradient sharing, while safer than raw data sharing, is not perfectly private: gradient inversion attacks can recover training data from gradients, particularly in small-participant settings. The pattern applies differential privacy via DP-SGD (Opacus) at each participant before gradient transmission: per-sample gradients are clipped to a fixed norm and calibrated Gaussian noise is added. Secure aggregation protocols (Google's SecAgg) allow the coordinator to compute the aggregate gradient without seeing any individual participant's gradient, so even the coordinator cannot invert a specific participant's contribution.
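The clip-then-noise mechanism can be illustrated as follows. This is a simplified sketch that clips and perturbs a single update vector; Opacus applies the same two operations per sample inside each optimiser step, which this example does not reproduce. The function names, the clipping norm, and the noise multiplier are assumptions for illustration only.

```python
import math
import random
from typing import List


def l2_norm(vec: List[float]) -> float:
    return math.sqrt(sum(x * x for x in vec))


def clip_update(update: List[float], clip_norm: float) -> List[float]:
    """Scale the vector down so its L2 norm is at most clip_norm."""
    norm = l2_norm(update)
    factor = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [x * factor for x in update]


def add_gaussian_noise(update: List[float], clip_norm: float,
                       noise_multiplier: float, rng: random.Random) -> List[float]:
    """Add N(0, (noise_multiplier * clip_norm)^2) noise to each coordinate."""
    sigma = noise_multiplier * clip_norm
    return [x + rng.gauss(0.0, sigma) for x in update]


rng = random.Random(0)          # fixed seed only so the sketch is repeatable
update = [3.0, 4.0]             # L2 norm 5.0, above the clipping bound
clipped = clip_update(update, clip_norm=1.0)
private = add_gaussian_noise(clipped, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
```

Clipping bounds how much any one contribution can move the result, which is what lets the noise scale be calibrated against a known sensitivity.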

Challenge 4 — System Heterogeneity. In cross-device FL, participants have wildly varying compute capabilities and connectivity. The pattern implements asynchronous federated learning (FedAsync): participants submit gradients when ready rather than in synchronised rounds; the global model is updated with each received gradient using a mixing hyperparameter. This handles stragglers without blocking the federation round.
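The asynchronous update can be sketched as a weighted blend of the current global model and the arriving client model. The polynomial staleness discount used here is one commonly cited choice for FedAsync-style updates; the exponent, the base mixing value, and the function names are assumptions for the example.

```python
from typing import List


def staleness_weight(staleness: int, a: float = 0.5) -> float:
    """Polynomial discount: older updates get a smaller mixing weight."""
    return (staleness + 1) ** (-a)


def fedasync_update(global_w: List[float], client_w: List[float],
                    alpha: float, staleness: int) -> List[float]:
    """Blend a client model into the global model as soon as it arrives."""
    alpha_t = alpha * staleness_weight(staleness)
    return [(1 - alpha_t) * g + alpha_t * c for g, c in zip(global_w, client_w)]


global_w = [0.0, 0.0]
client_w = [2.0, 4.0]

fresh = fedasync_update(global_w, client_w, alpha=0.5, staleness=0)  # trained on the latest model
stale = fedasync_update(global_w, client_w, alpha=0.5, staleness=3)  # trained 3 versions ago
```

Here `staleness` is the number of global model versions published since the client downloaded its copy, so a straggler still contributes but moves the global model less.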

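Challenge 5 — Byzantine Robustness. Malicious or faulty participants may submit adversarial gradients to corrupt the global model (model poisoning). The coordinator applies gradient validation: per-participant gradient norms are compared, and outliers (>3σ from the round mean) are rejected or down-weighted. FedMedian aggregation (coordinate-wise median instead of mean) provides Byzantine-robust aggregation in adversarial settings. Norm checks need visibility of individual updates, which secure aggregation (Challenge 3) hides from the coordinator, so each deployment must decide where validation runs relative to masking.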
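Both defences can be sketched with the standard library. The example below flags updates whose norm lies more than three standard deviations from the round mean and, separately, computes a coordinate-wise median. The function names and the example updates are hypothetical.

```python
import math
import statistics
from typing import List, Tuple


def l2_norm(vec: List[float]) -> float:
    return math.sqrt(sum(x * x for x in vec))


def filter_by_norm(updates: List[List[float]],
                   z_threshold: float = 3.0) -> Tuple[List[int], List[int]]:
    """Return (kept, rejected) indices based on the z-score of each update's norm."""
    norms = [l2_norm(u) for u in updates]
    mean = statistics.fmean(norms)
    std = statistics.pstdev(norms)
    if std == 0:
        return list(range(len(updates))), []
    kept, rejected = [], []
    for i, norm in enumerate(norms):
        (rejected if abs(norm - mean) / std > z_threshold else kept).append(i)
    return kept, rejected


def coordinate_median(updates: List[List[float]]) -> List[float]:
    """FedMedian-style aggregation: median of each coordinate across participants."""
    return [statistics.median(column) for column in zip(*updates)]


honest = [[1.0, 1.0] for _ in range(11)]
poisoned = [[50.0, -50.0]]            # hypothetical adversarial update
updates = honest + poisoned

kept, rejected = filter_by_norm(updates)
robust = coordinate_median(updates)
```

One caveat follows from the arithmetic: when the mean and standard deviation include the suspect value, a single outlier's z-score cannot exceed √(n−1), so a 3σ rule cannot fire with fewer than 11 participants in the round. Small consortia would need a different statistic (for example a median-based one) or a lower threshold.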

Federation Coordinator. The coordinator orchestrates rounds: selects participants, distributes the global model, collects and aggregates gradients, validates gradient integrity, and updates the global model. The coordinator does not process raw data. In cross-silo settings, the coordinator may be hosted by a neutral third party (industry consortium body, regulator-approved platform) to ensure no participant gains competitive advantage.

5. Architecture Diagram

ARCHITECTURE DIAGRAM

flowchart TD
    subgraph Participants["Participant Environments"]
        A[Participant A Local Data]
        B[Participant B Local Data]
        C[Participant C Local Data]
    end
    subgraph Training["Local Training"]
        D[Local Model + DP-SGD]
    end
    subgraph Coordinator["Federation Coordinator"]
        E[Gradient Validator]
        F[Secure Aggregator]
        G[(Global Model Registry)]
    end
    A --> D
    B --> D
    C --> D
    D -->|compressed DP gradients| E
    E --> F
    F --> G
    G -->|updated global model| D
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#dbeafe,stroke:#3b82f6
    style C fill:#dbeafe,stroke:#3b82f6
    style D fill:#f0fdf4,stroke:#22c55e
    style E fill:#f3e8ff,stroke:#a855f7
    style F fill:#f0fdf4,stroke:#22c55e
    style G fill:#fef9c3,stroke:#eab308

6. Components

Component	Type	Responsibility	Technology Options	Criticality
Federation Coordinator	Service	Round orchestration; global model distribution; gradient aggregation	Flower (flwr), NVIDIA FLARE, IBM FL, Google Federated Core	Critical
Secure Aggregator	Processing	Aggregates gradients without exposing individual contributions (SecAgg protocol)	Google SecAgg, PySyft SecureSum, custom MPC	High
Gradient Validator	Processing	Detects and rejects outlier/adversarial gradients; Byzantine robustness	Custom norm-based filter; FedMedian aggregation	High
Local Training Engine	Processing (per participant)	Trains local model on participant's data; implements FedProx/SCAFFOLD	PyTorch + Flower client, TensorFlow Federated, NVIDIA FLARE client	Critical
DP-SGD Engine (per participant)	Processing	Applies differential privacy to gradients before transmission	Opacus (PyTorch), TensorFlow Privacy	Critical
Gradient Compressor	Processing (per participant)	Sparsification and quantisation of gradients for transmission efficiency	Custom Python; PowerSGD; TopK sparsification	High
Global Model Registry	Storage	Stores global model versions; tracks federation round history	MLflow, DVC, Weights & Biases, custom	High
DP Budget Tracker	Processing	Tracks cumulative privacy budget (ε) across rounds per participant	Opacus privacy accounting, custom ε tracker	Critical
Hold-out Evaluator	Processing	Evaluates global model on neutral validation dataset not owned by any participant	Custom Python evaluation harness	High
Federation Agreement Registry	Governance	Stores legal federation agreement; approved use cases; participant consent	Custom registry; legal document management	High

7. Data Flow

Primary Flow

Step	Actor	Action	Output
1	Coordinator	Initialises global model; selects participants for round	Round configuration with global model checkpoint
2	Participants	Download global model checkpoint	Local copy of global model
3	Each participant	Trains local model on local data using FedProx with proximal term	Updated local model weights
4	Each participant	Applies DP-SGD: clips gradients; adds Gaussian noise	DP-protected gradient
5	Each participant	Compresses gradient: sparsification + quantisation	Compressed DP gradient
6	Coordinator	Receives compressed DP gradients from participants	Gradient set for round
7	Gradient Validator	Checks gradient norms; rejects outliers	Validated gradient set
8	Secure Aggregator	Aggregates validated gradients (FedAvg weighted mean; FedProx changes only the local objective, not the aggregation)	Aggregated global gradient
9	Coordinator	Updates global model with aggregated gradient	New global model version
10	Hold-out Evaluator	Evaluates global model on neutral validation set	Accuracy + fairness metrics
11	Coordinator	If metrics meet threshold: promote global model; begin next round	Promoted global model; federated training continues
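Steps 8, 9 and 11 of the primary flow can be sketched as a weighted average followed by a promotion gate. This is a simplified illustration that assumes each participant sends a weight delta together with its local example count; the accuracy figure and threshold are hypothetical placeholders for the hold-out evaluation in step 10.

```python
from typing import List, Tuple


def fedavg(updates: List[Tuple[List[float], int]]) -> List[float]:
    """Average the updates, weighting each by its participant's example count."""
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    return [sum(delta[i] * n for delta, n in updates) / total for i in range(dim)]


def apply_update(global_model: List[float], delta: List[float]) -> List[float]:
    return [g + d for g, d in zip(global_model, delta)]


def promote_if_ok(candidate: List[float], current: List[float],
                  metric: float, threshold: float) -> List[float]:
    """Keep the current model unless the candidate meets the evaluation threshold."""
    return candidate if metric >= threshold else current


global_model = [1.0, 1.0]
round_updates = [([0.5, -1.0], 100),   # (weight delta, local examples), made-up values
                 ([1.5, 1.0], 300)]

aggregated = fedavg(round_updates)
candidate = apply_update(global_model, aggregated)
hold_out_accuracy = 0.91               # stand-in for the step 10 evaluation result
promoted = promote_if_ok(candidate, global_model, hold_out_accuracy, threshold=0.90)
```

Weighting by example count means a participant with three times the data has three times the influence, which is the default FedAvg behaviour and may itself need review in consortia with very uneven participant sizes.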

Error Flow

Error Condition	Trigger	Response	Recovery
Participant dropout mid-round	Network failure; compute failure	Coordinator proceeds with available participants (minimum quorum check)	Retry dropped participant in next round
Gradient norm outlier (possible poisoning)	Gradient norm >3σ from round mean	Gradient rejected; participant flagged for review	Review participant's local training setup; escalate to consortium governance
DP budget exhausted (ε > threshold)	Cumulative rounds exceed privacy budget	Participant stops contributing; new training data or budget reset required	Negotiate new DP budget; assess if prior training meets privacy requirements
Global model divergence (loss increases)	Non-IID data; insufficient FedProx proximal term	Round rejected; proximal term strength increased	Adjust FedProx μ hyperparameter; reduce local training epochs
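The dropout and budget-exhaustion rows can be combined into one round-planning check, sketched below. The tracker uses simple sequential composition (per-round ε values are summed), which is a conservative bound and not the tighter accounting a library such as Opacus performs. Class and function names, budgets, and the quorum value are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class BudgetTracker:
    epsilon_limit: float
    spent: float = 0.0

    def can_spend(self, epsilon: float) -> bool:
        return self.spent + epsilon <= self.epsilon_limit

    def spend(self, epsilon: float) -> None:
        if not self.can_spend(epsilon):
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon


def plan_round(responded: List[str], trackers: Dict[str, BudgetTracker],
               epsilon_per_round: float, quorum: int) -> List[str]:
    """Return the participants that contribute, or [] if the quorum is not met."""
    eligible = [p for p in responded if trackers[p].can_spend(epsilon_per_round)]
    if len(eligible) < quorum:
        return []                      # round skipped; nobody is charged
    for p in eligible:
        trackers[p].spend(epsilon_per_round)
    return eligible


trackers = {"A": BudgetTracker(3.0), "B": BudgetTracker(3.0), "C": BudgetTracker(1.0)}

round1 = plan_round(["A", "B", "C"], trackers, epsilon_per_round=1.0, quorum=2)
round2 = plan_round(["A", "B", "C"], trackers, epsilon_per_round=1.0, quorum=2)  # C exhausted
round3 = plan_round(["A", "C"], trackers, epsilon_per_round=1.0, quorum=2)       # B dropped out
```

Checking the quorum before charging any budget avoids spending ε on a round that will be discarded.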

8. Security Considerations

Authentication & Authorisation

Each participant authenticates to coordinator using mutual TLS certificates; participant identity linked to federation legal agreement.
Coordinator validates participant identity before distributing global model.

Secrets Management

DP-SGD noise seed managed locally by each participant; not shared with coordinator.
Secure Aggregation protocol keys ephemeral per round; participants derive shared secrets using Diffie-Hellman.
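The idea behind pairwise shared secrets can be shown with a toy masking scheme: each pair of participants expands a shared seed into the same pseudo-random mask, one adds it and the other subtracts it, and the masks cancel in the sum. This is a heavily simplified sketch. The hard-coded seeds stand in for secrets that would be derived by key agreement, `random.Random` is not a cryptographic generator, and dropout recovery is not handled.

```python
import random
from typing import Dict, List, Tuple

MODULUS = 2 ** 32  # arithmetic is done modulo a fixed integer so masks cancel exactly


def pairwise_mask(pid: str, participants: List[str],
                  pair_seeds: Dict[Tuple[str, str], int], dim: int) -> List[int]:
    """Sum of the masks this participant shares with every other participant."""
    mask = [0] * dim
    for peer in participants:
        if peer == pid:
            continue
        pair = (min(pid, peer), max(pid, peer))
        rng = random.Random(pair_seeds[pair])        # both sides expand the same seed
        stream = [rng.randrange(MODULUS) for _ in range(dim)]
        sign = 1 if pid < peer else -1               # one side adds, the other subtracts
        mask = [(m + sign * s) % MODULUS for m, s in zip(mask, stream)]
    return mask


participants = ["A", "B", "C"]
pair_seeds = {("A", "B"): 11, ("A", "C"): 22, ("B", "C"): 33}   # placeholder secrets
inputs = {"A": [1, 2], "B": [3, 4], "C": [5, 6]}                # integer-encoded updates

masked = {
    p: [(x + m) % MODULUS
        for x, m in zip(inputs[p], pairwise_mask(p, participants, pair_seeds, 2))]
    for p in participants
}
aggregate = [sum(column) % MODULUS for column in zip(*masked.values())]
```

The coordinator only ever receives the `masked` vectors, yet their sum equals the sum of the true inputs. Updates must be integer-encoded first, which fits naturally with the quantisation step applied before transmission.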

Data Classification

Raw training data classified as Confidential or higher at each participant; never transmitted.
DP gradients classified as Internal; coordinator sees only aggregated gradient.
Global model classified per use case sensitivity; clinical AI models typically Confidential.

Encryption

All gradient transmission encrypted using TLS 1.3.
Secure Aggregation provides additional cryptographic guarantee: coordinator cannot decrypt individual participant gradients.

Auditability

Every federation round logged: round number, participants, aggregation method, global model version, evaluation metrics.
DP budget consumption per participant logged; shared with participant for their own privacy accounting.
Gradient rejection events logged with reason; reviewed by consortium governance.
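One way to make the round log tamper-evident is to chain entries by hash, so that editing an earlier record invalidates every later one. The sketch below is an assumption about how such a log could be built, not a description of any specific coordinator product; the record fields mirror the items listed above and the values are made up.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS = "0" * 64


def entry_hash(prev_hash: str, record: Dict[str, Any]) -> str:
    payload = json.dumps(record, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(log: List[Dict[str, Any]], record: Dict[str, Any]) -> None:
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"record": record, "prev_hash": prev_hash,
                "hash": entry_hash(prev_hash, record)})


def verify(log: List[Dict[str, Any]]) -> bool:
    """Recompute every hash and check each entry links to its predecessor."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True


audit_log: List[Dict[str, Any]] = []
append_entry(audit_log, {"round": 1, "participants": ["A", "B", "C"], "rejected": ["C"],
                         "aggregation": "fedavg", "model_version": "v1", "accuracy": 0.88})
append_entry(audit_log, {"round": 2, "participants": ["A", "B"], "rejected": [],
                         "aggregation": "fedavg", "model_version": "v2", "accuracy": 0.90})

intact = verify(audit_log)
tampered = json.loads(json.dumps(audit_log))      # deep copy via JSON round-trip
tampered[0]["record"]["rejected"] = []            # hide the round 1 rejection
tamper_detected = not verify(tampered)
```

A hash chain only detects modification; preventing it still requires write-once storage or publishing the latest hash to the participants.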

OWASP LLM Top 10 Mapping

OWASP LLM Risk	Relevance	Mitigation
LLM03: Training Data Poisoning	Malicious participant submits adversarial gradients	Gradient validation; FedMedian aggregation; Byzantine-robust aggregation
LLM06: Sensitive Information Disclosure	Gradient inversion could recover training data	DP-SGD bounds what inversion can recover; SecAgg hides individual gradients from coordinator
LLM04: Model Denial of Service	Participant submits malformed gradient causing coordinator crash	Gradient schema validation; norm check before aggregation

9. Governance Considerations

Responsible AI

Federated models may perform differently across participant populations (fairness concern): hold-out evaluation must include per-participant subgroup performance metrics.
Consortium governance board responsible for global model promotion decisions when performance is uneven.
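A per-participant performance check can be as simple as comparing accuracy across participant subpopulations and escalating when the spread is too wide. The participant names, counts, and the 10-point tolerance below are hypothetical; a real evaluation would use the metrics and thresholds agreed by the consortium governance board.

```python
from typing import Dict, Tuple


def per_participant_accuracy(results: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """results maps participant -> (correct predictions, total examples)."""
    return {name: correct / total for name, (correct, total) in results.items()}


def fairness_gap(accuracy: Dict[str, float]) -> float:
    """Difference between the best-served and worst-served participant."""
    return max(accuracy.values()) - min(accuracy.values())


# Made-up hold-out results for the global model, split by participant population.
results = {"hospital_a": (90, 100), "hospital_b": (85, 100), "hospital_c": (70, 100)}

accuracy = per_participant_accuracy(results)
gap = fairness_gap(accuracy)
worst = min(accuracy, key=accuracy.get)
needs_board_review = gap > 0.10        # hypothetical tolerance
```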

Model Risk Management

Global model is trained on distributed data with different quality levels; model risk documentation must describe participant data quality standards.
Model risk committee approval required before production deployment of federated model.

Human Approval Checkpoints

Federation agreement signed by legal representatives of all participants.
Global model promotion requires consortium governance board approval.
DP budget reset (increasing privacy expenditure) requires participant-level DPO approval.

Governance Artefacts

Artefact	Owner	Cadence	Purpose
Federation Agreement	Legal / Consortium	On establishment	Defines IP ownership of global model; permitted uses; data sovereignty
Round Audit Log	Coordinator	Per round	Immutable log of participants, gradients received/rejected, model version
DP Budget Report	Each Participant	Per round	Cumulative ε consumption; participant's own privacy accounting
Per-Participant Fairness Report	Coordinator	Per promotion	Global model performance on each participant's subpopulation
Model Card (Federated)	Consortium ML Team	Per model version	Training cohort summary; DP parameters; known limitations per participant

10. Operational Considerations

Monitoring

Metric	Alert Threshold	Tooling
Round completion rate	<80% participants completing round	Coordinator logs
Global model accuracy (hold-out)	<performance floor	Evaluation pipeline
Gradient rejection rate	>10% in a round	Coordinator metrics
DP budget consumption rate	>budget plan	Budget tracker
Per-participant contribution latency	>round deadline	Coordinator timing
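The first and third rows of the table translate directly into a per-round check. The sketch below assumes the rejection rate is measured against gradients actually received in the round; the function and alert names are illustrative.

```python
from typing import List


def round_alerts(selected: int, completed: int, rejected: int,
                 min_completion: float = 0.80,
                 max_rejection: float = 0.10) -> List[str]:
    """Return the names of the alert thresholds breached in this round."""
    alerts = []
    if selected and completed / selected < min_completion:
        alerts.append("round_completion_rate")
    if completed and rejected / completed > max_rejection:
        alerts.append("gradient_rejection_rate")
    return alerts


unhealthy = round_alerts(selected=20, completed=15, rejected=2)   # 75% completion, 13% rejected
healthy = round_alerts(selected=20, completed=18, rejected=1)     # 90% completion, 6% rejected
```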

SLOs

SLO	Target	Measurement
Federation round completion	<4 hours per round (cross-silo)	Round timing logs
Global model hold-out evaluation	<1 hour after round completion	Evaluation pipeline
Gradient transmission availability	>99.5% per participant	Network monitoring

Disaster Recovery

Component	RTO	RPO	Strategy
Coordinator	2 hours	Last completed round	Stateless except model registry; restore from model registry
Global Model Registry	4 hours	1 hour	Cross-region replication
Participant Local Training	Per participant	Per participant	Each participant manages their own training infrastructure

11. Cost Considerations

Cost Drivers

Cost Driver	Typical Range	Notes
Coordinator compute	$500–$5,000/month	Lightweight for aggregation; scales with participant count
Participant training compute	$1,000–$10,000/month per participant	GPU training; largest cost component
Network (gradient transmission)	$100–$1,000/month	Reduced by compression; scales with model size × rounds
Secure Aggregation compute	$200–$2,000/month	Cryptographic overhead; scales with participant count
Legal (federation agreement)	$20,000–$100,000 one-time	Multilateral consortium agreement

Indicative Cost Range

Scale	Monthly Cost (Coordinator)	Monthly Cost (Per Participant)
Small consortium (3–5 participants)	$1,000–$5,000	$1,000–$5,000
Medium consortium (10–20 participants)	$3,000–$15,000	$2,000–$8,000
Large / cross-device (100+ participants)	$10,000–$50,000	Varies widely

12. Trade-Off Analysis

Option Comparison

Option	Pros	Cons	Recommended When
A: Federated Learning (this pattern)	True data locality; privacy-preserving; legally viable	Complex; convergence slower than centralised; requires participant compute	Data cannot be centralised; regulation blocks data sharing; three or more participants
B: Data clean room / privacy sandbox	Raw data never shared; analytics on aggregate queries	Limited to query-based insights; cannot train complex ML models	Analytics use cases; not full ML model training
C: Centralised with contractual data sharing	Better convergence; simpler architecture	Legally complex; GDPR/competition risk; single point of breach	Trusted consortium; single jurisdiction; non-sensitive data
D: Transfer learning (pre-train centrally, fine-tune locally)	No raw data sharing for fine-tuning; good performance	Requires large public pre-training dataset; may not transfer to specialist domains	Public pre-training data available; specialist fine-tuning needed

Architectural Tensions

Tension	Trade-Off	Resolution
Privacy (DP noise) vs. model utility	More DP noise = better privacy, worse model quality	Tune ε per risk level; accept utility reduction for high-risk use cases
Communication efficiency vs. convergence speed	More compression = faster rounds, slower convergence	Use TopK sparsification (top 10% gradients) as default; adjust per round budget
Cross-silo vs. cross-device architecture	Cross-device needs async + partial participation; cross-silo needs synchronous consensus	Implement FedAsync for cross-device; synchronous FedProx for cross-silo

13. Failure Modes

Failure	Likelihood	Impact	Detection	Recovery
Byzantine participant (model poisoning via adversarial gradients)	Low	High — global model corrupted	Gradient norm outlier detection; FedMedian	Isolate participant; revert to last clean model; re-run rounds without participant
Global model divergence on non-IID data	Medium	High — useless global model	Hold-out evaluation after each round	Increase FedProx proximal term; reduce local epochs; use SCAFFOLD
DP budget exhaustion — participant cannot contribute	Medium	Medium — federation loses participant	DP budget tracker per participant	Assess privacy-utility trade-off; negotiate extended budget or retire participant
Coordinator compromise	Very Low	Critical — adversary sees round gradients	Intrusion detection; secure aggregation	Secure aggregation prevents coordinator from seeing individual gradients; rotate keys
Legal dispute over global model IP	Low	High — programme halted	Legal agreement monitoring	Federation agreement must pre-define IP ownership; no execution dependency on dispute resolution

14. Regulatory Considerations

Regulation	Requirement	Pattern Response
GDPR Article 5(1)(b)	Purpose limitation	Federation agreement defines permitted use of global model
GDPR Article 25	Privacy by design	Federated architecture keeps data local; DP-SGD minimises gradient leakage
EU AI Act Article 10	Training data governance	Global model inherits data governance requirements of all participants' data
APRA CPS 234	Third-party risk	Coordinator is a third party; security standards contractually required
Competition law	Anti-trust compliance	Coordinator must not aggregate commercially sensitive information; legal sign-off required
Cross-border data transfer	Data sovereignty	Raw data stays local; gradient transmission reviewed per jurisdiction

15. Reference Implementations

AWS

Component	AWS Service
Coordinator	Custom Flower coordinator on Amazon ECS, integrated with SageMaker training and model registry
Participant training	SageMaker Training Jobs at each participant site
Global model registry	SageMaker Model Registry
Secure communication	AWS PrivateLink (cross-silo)

Azure

Component	Azure Service
Coordinator	Azure ML Federated Learning (preview) + NVIDIA FLARE on AKS
Participant training	Azure ML Compute at each participant
Model registry	Azure ML Model Registry

GCP

Component	GCP Service
Coordinator	Vertex AI + Flower on Cloud Run
Participant training	Vertex AI Training per participant
Global model	Vertex AI Model Registry

On-Premises

Component	Technology
Coordinator	Flower (flwr) or NVIDIA FLARE on Kubernetes
Participant	PyTorch + Opacus on GPU node
Communication	mTLS over private network
Model registry	MLflow

16. Related Patterns

Pattern	ID	Relationship	Notes
Privacy by Design for AI Data	EAAPL-DAT005	Complements	DP-SGD is a privacy-by-design technique
Synthetic Data Generation	EAAPL-DAT004	Alternative	Synthetic data is an alternative when federated training is infeasible
AI Training Data Governance	EAAPL-DAT007	Depends on	Federation agreement is a training data governance artefact
Model Versioning	EAAPL-MDL001	Depends on	Global model versioning per federation round
Fine-Tuning Pipeline	EAAPL-MDL006	Complements	Global federated model fine-tuned locally per participant

17. Maturity Assessment

Overall Maturity: Emerging — Federated learning frameworks (Flower, NVIDIA FLARE) are production-ready. Cross-silo deployments are proven in healthcare and finance consortia. Cross-device FL at scale is mature (Google deployed FL for Gboard). However, regulatory frameworks for federated AI governance are still developing.

Dimension	Score (1–5)	Notes
Architectural clarity	4	Well-defined federation patterns; non-IID handling still research-active
Tooling maturity	3	Flower/NVIDIA FLARE production-ready; enterprise tooling maturing
Regulatory alignment	3	GDPR alignment good; AI Act treatment of federated models unclear
Operational complexity	2	High operational complexity; multi-party coordination challenging
Cost efficiency	3	High participant compute cost; offset by regulatory compliance enablement
Security	4	DP + SecAgg provides strong privacy guarantees

18. Revision History

Version	Date	Author	Changes
1.0	2024-03-01	EAAPL Working Group	Initial publication
1.1	2025-03-01	EAAPL Working Group	Added FedProx/SCAFFOLD detail; Byzantine robustness; NVIDIA FLARE reference
