AI Training Data Governance

Data Architecture · EU AI Act · ISO/IEC 42001 · Field-tested in AU

[EAAPL-DAT007] AI Training Data Governance

Category: Data Architecture
Sub-category: Data Governance / AI Training
Version: 1.2
Maturity: Proven
Tags: training-data-governance, dataset-versioning, provenance, bias-assessment, licence-management, consent-records
Regulatory Relevance: EU AI Act Articles 10/17, APRA CPS 234, Privacy Act APP 3/6, ISO 42001 §8.4, NIST AI RMF GOVERN-1.2

1. Executive Summary

AI training data governance is the foundation of responsible AI. A model is only as trustworthy as the data it was trained on. Yet most organisations lack systematic governance for training datasets: no formal registration, no version history, no bias assessment, no IP/licence tracking for third-party data, and no audit-grade consent records.

This pattern defines a comprehensive AI training data governance framework covering the full lifecycle from dataset acquisition through model deprecation. It establishes a Training Data Registry as the system of record, with mandatory governance artefacts for every dataset used in production AI training: provenance declaration, bias assessment report, licence and IP clearance, consent record, and quality scorecard.

Organisations that implement this pattern can answer EU AI Act Article 10 enquiries and other regulatory audit requests in hours rather than months, demonstrate systematic bias management to auditors, and reduce the risk of costly legal disputes over training data IP ownership.

Target audience: Chief Data Officers, AI Governance leads, Legal/IP Counsel, ML Platform leads.

2. Problem Statement

Business Problem

Organisations face increasing regulatory and legal pressure to demonstrate that AI training data was lawfully acquired, appropriately consented, free from prohibited bias, and licensed for AI training use. Without systematic governance, they cannot make this demonstration.

Technical Problem

Training datasets are created ad hoc by ML engineers; no formal registration or versioning.
No systematic tracking of whether training data contains third-party IP with AI training restrictions.
Bias assessments (if done) are informal; not linked to the training dataset version or model version.
Consent records for data used in AI training are not linked to training datasets, so the organisation cannot prove consent was valid at training time.
Dataset versions are not immutable; datasets are overwritten, destroying the audit trail.

Symptoms

Cannot answer "which data trained this model?" for a production model built 12 months ago.
Legal team discovers training data included copyrighted text without AI training licence.
Regulatory audit requires bias assessment for training data; no formal assessment exists.
Data subject withdraws consent; organisation cannot determine if that subject's data was used in training.
Training dataset changed after model validation but before production deployment; discrepancy discovered in audit.

Cost of Inaction

Dimension	Impact
Regulatory	EU AI Act Article 10 violation; APRA enforcement; Privacy Act penalty
Legal	Copyright infringement claims for training data; settlements in tens of millions
Reputational	Public disclosure of biased training data triggers brand crisis
Operational	Manual reconstruction of training data history takes weeks per model

3. Context

When to Apply

Any AI system trained on data for production deployment.
AI systems subject to regulatory review (EU AI Act, APRA, Privacy Act).
AI using third-party licensed data or web-scraped data.
AI systems where bias is a material risk (credit, employment, health, law enforcement).
Organisations with multiple ML teams producing models (governance prevents divergent practices).

When NOT to Apply

Pure research experimentation with public benchmark datasets.
AI trained entirely on proprietary, clearly consented, non-sensitive internal data with no regulatory obligation.

Prerequisites

Prerequisite	Minimum Viable	Preferred
Dataset storage	File system with versioning	Immutable object store with version IDs
ML platform	MLflow (basic)	MLflow + DVC + Model Registry
Data catalogue	Spreadsheet	DataHub / Atlan with API
Legal counsel	Internal review	IP specialist + privacy counsel
Bias assessment tooling	Manual statistical analysis	AI Fairness 360, Aequitas, Fairlearn

Industry Applicability

Industry	Applicability	Driver
Financial Services	Critical	APRA model risk; credit decision AI; GDPR
Healthcare	Critical	Clinical AI; patient consent; EU AI Act high-risk
Government	Critical	Public sector AI accountability; FOI obligations
Legal / RegTech	High	AI-assisted legal decisions; IP liability
Retail	Medium	Personalisation AI; consent management
Technology	High	Foundation model training; IP clearance critical

4. Architecture Overview

Design Philosophy

The core principle of AI training data governance is that a training dataset is a first-class governed artefact, managed as formally as a production software release. The Training Data Registry is the system of record: every dataset used in production model training must have a registered, versioned, governance-approved entry before a training run can proceed.

Dataset Registration and Versioning. Each training dataset is registered with a unique ID and version in the Training Data Registry. The dataset is stored in an immutable object store (S3 with Object Lock, GCS with retention policy) — once registered, the dataset content cannot be changed. If the dataset is updated, a new version is registered. This immutability is the foundation of reproducible AI: given a model version, the exact training data can always be retrieved.
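The registration rule above can be sketched in a few lines of Python. This is a minimal illustration, assuming an in-memory registry and SHA-256 content hashes; `TrainingDataRegistry`, `DatasetVersion`, and the dataset name are hypothetical, and a real registry would persist entries in a database and write the bytes to an Object Lock bucket.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetVersion:
    dataset_id: str
    version: int
    content_hash: str  # SHA-256 of the dataset bytes


class TrainingDataRegistry:
    """In-memory stand-in for the registry (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._versions: dict[str, list[DatasetVersion]] = {}

    def register(self, dataset_id: str, content: bytes) -> DatasetVersion:
        digest = hashlib.sha256(content).hexdigest()
        history = self._versions.setdefault(dataset_id, [])
        # Identical content maps back to the version already registered.
        for existing in history:
            if existing.content_hash == digest:
                return existing
        # Changed content never overwrites; it becomes the next version.
        entry = DatasetVersion(dataset_id, len(history) + 1, digest)
        history.append(entry)
        return entry

    def get(self, dataset_id: str, version: int) -> DatasetVersion:
        return self._versions[dataset_id][version - 1]


registry = TrainingDataRegistry()
v1 = registry.register("credit-applications", b"id,income\n1,52000\n")
v2 = registry.register("credit-applications", b"id,income\n1,52000\n2,61000\n")
```

Because existing entries are never mutated, looking up a dataset ID and version always returns the hash that was recorded at registration time.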

Provenance Declaration. For each dataset, the registering team must declare: data sources (which operational systems, external datasets, or acquired datasets contributed records); transformation logic (which pipelines produced the dataset from sources); collection period (the date range of data collection); and known exclusions (records excluded and why). This information is captured in a structured Provenance Record and linked to the dataset version.
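One possible shape for the Provenance Record, with a completeness check of the kind the automated governance checks could run, is sketched below. The field names and the `provenance_gaps` helper are assumptions made for illustration; the pattern itself only specifies which information must be declared.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ProvenanceRecord:
    dataset_id: str
    version: int
    sources: list[str]
    transformations: list[str]
    collection_start: date
    collection_end: date
    # Maps each excluded record group to the reason it was excluded.
    exclusions: dict[str, str] = field(default_factory=dict)


def provenance_gaps(record: ProvenanceRecord) -> list[str]:
    """Return a list of problems; an empty list means the record is complete."""
    gaps = []
    if not record.sources:
        gaps.append("no data sources declared")
    if not record.transformations:
        gaps.append("no transformation logic declared")
    if record.collection_end < record.collection_start:
        gaps.append("collection period ends before it starts")
    for excluded, reason in record.exclusions.items():
        if not reason.strip():
            gaps.append(f"exclusion '{excluded}' has no stated reason")
    return gaps


complete = ProvenanceRecord(
    dataset_id="credit-applications",
    version=1,
    sources=["loan-origination-db"],
    transformations=["pipelines/build_credit_dataset.py"],
    collection_start=date(2022, 1, 1),
    collection_end=date(2023, 12, 31),
    exclusions={"test accounts": "synthetic records created by QA"},
)
incomplete = ProvenanceRecord(
    dataset_id="credit-applications",
    version=2,
    sources=[],
    transformations=["pipelines/build_credit_dataset.py"],
    collection_start=date(2024, 1, 1),
    collection_end=date(2023, 1, 1),
    exclusions={"minors": " "},
)
```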

Bias Assessment. For every training dataset used in consequential AI (EU AI Act Annex III, or internally classified high-risk), a Bias Assessment Report is mandatory. The assessment evaluates: demographic distribution (is the training population representative of the inference population?); historical bias (does the data encode historical discrimination?); proxy variable risk (do features correlate with protected attributes?); label bias (were labels applied inconsistently across demographic groups?). The assessment uses standardised tools (AI Fairness 360, Aequitas) and is reviewed by a designated bias assessor (independent of the ML team that built the dataset).
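As a rough illustration of the proxy variable check, the sketch below flags features whose Pearson correlation with a binary protected attribute exceeds a threshold. The 0.5 threshold, the feature names, and the sample values are assumptions for the example, not values from this pattern. Pearson correlation only detects linear association, so a production assessment would rely on the dedicated tooling named above.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def proxy_candidates(
    features: dict[str, list[float]],
    protected: list[float],
    threshold: float = 0.5,  # illustrative cut-off, to be set by the bias assessor
) -> list[str]:
    """Names of features strongly correlated with the protected attribute."""
    return sorted(
        name
        for name, values in features.items()
        if abs(pearson(values, protected)) >= threshold
    )


protected_attribute = [0, 0, 0, 1, 1, 1]  # e.g. membership of a protected group
features = {
    "postcode_income_rank": [1, 2, 3, 7, 8, 9],
    "account_tenure_years": [3, 5, 4, 4, 5, 3],
}
flagged = proxy_candidates(features, protected_attribute)
```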

Licence and IP Management. Third-party data (purchased datasets, web-scraped data, open datasets) must have IP clearance before use in AI training. The IP Clearance Record documents: source licence type; whether the licence explicitly permits AI training use; jurisdictional restrictions; expiry date; any attribution obligations. This is enforced by the governance workflow: training runs cannot proceed for datasets with expired, missing, or prohibitive IP clearance.
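The enforcement rule (no training on expired, missing, or prohibitive clearance) could look like the following sketch. `IPClearanceRecord` and `clearance_blockers` are hypothetical names, and jurisdictional restrictions are left out to keep the example short.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class IPClearanceRecord:
    source: str
    licence_type: str
    permits_ai_training: bool
    expires_on: Optional[date]  # None means the licence has no expiry
    attribution_required: bool = False


def clearance_blockers(
    sources: list[str],
    clearances: dict[str, IPClearanceRecord],
    today: date,
) -> list[str]:
    """Reasons a training run must not proceed; empty when all sources are clear."""
    blockers = []
    for source in sources:
        record = clearances.get(source)
        if record is None:
            blockers.append(f"{source}: no IP clearance record")
        elif not record.permits_ai_training:
            blockers.append(f"{source}: licence does not permit AI training")
        elif record.expires_on is not None and record.expires_on < today:
            blockers.append(f"{source}: clearance expired on {record.expires_on.isoformat()}")
    return blockers


clearances = {
    "vendor-feed": IPClearanceRecord("vendor-feed", "commercial", True, date(2025, 12, 31)),
    "web-scrape": IPClearanceRecord("web-scrape", "unknown", False, None),
    "old-corpus": IPClearanceRecord("old-corpus", "commercial", True, date(2024, 12, 31)),
}
blockers = clearance_blockers(
    ["vendor-feed", "web-scrape", "old-corpus", "unknown-set"],
    clearances,
    today=date(2025, 3, 1),
)
```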

Consent Record Integration. For datasets containing personal information, a Consent Record is required documenting: the legal basis for processing (consent, legitimate interest, statutory obligation); the consent scope (which uses are covered); the consent date range (were all subjects consenting when the data was collected?); and the consent withdrawal propagation mechanism. This integrates with the Privacy by Design pattern (EAAPL-DAT005).
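A simplified view of the consent check is sketched below: for each data subject in a dataset version, confirm that consent covered the training purpose on a given date and had not been withdrawn. The `Consent` class, the `"ai_training"` scope label, and the subject IDs are assumptions for illustration; a real integration would query the consent management platform.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Consent:
    subject_id: str
    scopes: frozenset[str]
    granted_on: date
    withdrawn_on: Optional[date] = None

    def covers(self, purpose: str, on: date) -> bool:
        """True if consent for `purpose` was in force on the given date."""
        if purpose not in self.scopes or on < self.granted_on:
            return False
        return self.withdrawn_on is None or on < self.withdrawn_on


def subjects_without_consent(
    subject_ids: list[str],
    consents: dict[str, Consent],
    purpose: str,
    on: date,
) -> list[str]:
    """Subjects whose records lack valid consent for `purpose` on date `on`."""
    return sorted(
        sid
        for sid in subject_ids
        if sid not in consents or not consents[sid].covers(purpose, on)
    )


consents = {
    "s1": Consent("s1", frozenset({"service", "ai_training"}), date(2023, 1, 10)),
    "s2": Consent("s2", frozenset({"service"}), date(2023, 2, 1)),
    "s3": Consent("s3", frozenset({"ai_training"}), date(2023, 3, 1), date(2024, 6, 1)),
}
dataset_subjects = ["s1", "s2", "s3", "s4"]
at_training = subjects_without_consent(dataset_subjects, consents, "ai_training", date(2024, 1, 15))
after_withdrawal = subjects_without_consent(dataset_subjects, consents, "ai_training", date(2024, 7, 1))
```

Running the same check with a later date shows which subjects a withdrawal affects, which is the information needed to re-register a dataset version without their records.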

Governance Workflow. Dataset registration triggers an automated governance workflow: (1) automated checks (schema validation, quality scorecard linkage, completeness of provenance record); (2) bias assessment submission (if required by risk classification); (3) IP clearance review (if third-party data); (4) consent record review (if personal data); (5) approval by Dataset Governance Officer. Only approved datasets appear in the "approved for production training" view of the registry.
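The conditional steps of the workflow can be expressed as a small routing function. This is a sketch under the assumption that steps run in the order listed above; `Step`, `DatasetProfile`, and `next_step` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Step(Enum):
    AUTOMATED_CHECKS = 1
    BIAS_ASSESSMENT = 2
    IP_CLEARANCE = 3
    CONSENT_REVIEW = 4
    DGO_APPROVAL = 5


@dataclass(frozen=True)
class DatasetProfile:
    high_risk: bool         # risk classification requires a bias assessment
    third_party_data: bool  # contains purchased, scraped, or open data
    personal_data: bool     # contains personal information


def required_steps(profile: DatasetProfile) -> list[Step]:
    """Workflow steps that apply to this dataset, in execution order."""
    steps = [Step.AUTOMATED_CHECKS]
    if profile.high_risk:
        steps.append(Step.BIAS_ASSESSMENT)
    if profile.third_party_data:
        steps.append(Step.IP_CLEARANCE)
    if profile.personal_data:
        steps.append(Step.CONSENT_REVIEW)
    steps.append(Step.DGO_APPROVAL)
    return steps


def next_step(profile: DatasetProfile, completed: set[Step]) -> Optional[Step]:
    """First outstanding step, or None once every required step is complete."""
    for step in required_steps(profile):
        if step not in completed:
            return step
    return None


profile = DatasetProfile(high_risk=True, third_party_data=False, personal_data=True)
```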

5. Architecture Diagram

flowchart TD
    subgraph Input["Dataset Acquisition"]
        A[Internal Data]
        B[Third-Party Data]
    end
    subgraph Governance["Governance Workflow"]
        C[Dataset Registry]
        D[Bias and IP Assessment]
        E{Governance Approval}
    end
    subgraph Output["Production Pipeline"]
        F[(Approved Dataset Store)]
        G[ML Training Pipeline]
    end
    A --> C
    B --> C
    C --> D
    D --> E
    E -->|approved| F
    E -->|rejected| C
    F --> G
    G -->|lineage| C
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#dbeafe,stroke:#3b82f6
    style C fill:#f0fdf4,stroke:#22c55e
    style D fill:#f0fdf4,stroke:#22c55e
    style E fill:#f3e8ff,stroke:#a855f7
    style F fill:#fef9c3,stroke:#eab308
    style G fill:#d1fae5,stroke:#10b981

6. Components

Component	Type	Responsibility	Technology Options	Criticality
Training Data Registry	Database + API	System of record for all training datasets; version management; governance status	Custom PostgreSQL + REST API; MLflow Dataset tracking; DVC	Critical
Immutable Dataset Store	Storage	Content-addressable, write-once storage for registered training datasets	S3 Object Lock, GCS Retention Policy, Azure Immutable Blob Storage	Critical
Provenance Record Schema	Data Schema	Structured provenance declaration per dataset version	JSON Schema, linked to registry via dataset ID	High
Bias Assessment Engine	Processing	Automated demographic distribution analysis; proxy variable detection	AI Fairness 360, Aequitas, Fairlearn, custom pandas/scipy	High
IP Clearance Database	Database	Tracks licence type, AI training permission, expiry, attribution requirements per data source	Custom PostgreSQL; Collibra data governance; spreadsheet (minimum)	High
Consent Record Integration	Integration	Links training dataset to consent records from consent management platform	Custom integration; OneTrust API	High
Governance Workflow Engine	Orchestration	Manages multi-step dataset approval workflow; notifications; escalation	Jira workflows, custom Airflow DAG, ServiceNow	High
Dataset Governance Officer Role	Human Role	Reviews and approves datasets; owns governance workflow	Organisational role; may delegate to domain owners	Critical
Model Registry Linkage	Integration	Bidirectional link: model version → dataset version; dataset version → model versions	MLflow dataset tracking; custom bidirectional index	Critical
Compliance Dashboard	Application	Shows governance coverage gaps; expiry alerts; regulatory query support	Grafana, custom React, Metabase	Medium
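The bidirectional Model Registry Linkage in the table above amounts to keeping two indexes in step. A minimal in-memory sketch follows; `LineageIndex` and the `name:version` identifier format are assumptions made for illustration.

```python
from collections import defaultdict


class LineageIndex:
    """Keeps model -> dataset and dataset -> model links consistent."""

    def __init__(self) -> None:
        self._datasets_by_model: dict[str, set[str]] = defaultdict(set)
        self._models_by_dataset: dict[str, set[str]] = defaultdict(set)

    def link(self, model_version: str, dataset_version: str) -> None:
        # Both directions are written together so neither can drift.
        self._datasets_by_model[model_version].add(dataset_version)
        self._models_by_dataset[dataset_version].add(model_version)

    def datasets_for(self, model_version: str) -> set[str]:
        """Which data trained this model?"""
        return set(self._datasets_by_model.get(model_version, set()))

    def models_for(self, dataset_version: str) -> set[str]:
        """Which models are affected if this dataset version is restricted?"""
        return set(self._models_by_dataset.get(dataset_version, set()))


index = LineageIndex()
index.link("credit-scorer:3", "credit-applications:1")
index.link("credit-scorer:4", "credit-applications:2")
index.link("fraud-detector:1", "credit-applications:2")
```

The forward lookup answers audit questions about a model; the reverse lookup supports impact analysis when a clearance expires or consent is withdrawn.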

7. Data Flow

Primary Flow

Step	Actor	Action	Output
1	ML Team / Data Engineer	Acquires dataset; registers in Training Data Registry with provenance declaration	Dataset ID + version; Provenance Record
2	Immutable Store	Dataset written to Object-Lock storage; hash computed	Immutable dataset with content hash
3	Governance Workflow	Automated checks: schema valid, quality scorecard linked, provenance complete	Check pass/fail report
4	Bias Assessor	Runs bias assessment; submits Bias Assessment Report	Bias Assessment Report linked to dataset version
5	IP Counsel	Reviews licence; records IP Clearance Record	IP Clearance: approved/restricted/prohibited
6	Privacy Officer	Reviews consent record; confirms legal basis; links to consent system	Consent Record linked to dataset version
7	Dataset Governance Officer	Reviews all artefacts; approves or rejects	Dataset status: Approved / Rejected / Conditional
8	ML Platform	Training pipeline validates dataset ID is in Approved status before starting training	Training run approved to start
9	Model Registry	Training run completes; model version linked to dataset version	Bidirectional model ↔ dataset lineage
10	Compliance Dashboard	Continuously monitors for expiring IP clearances; consent renewals; bias reassessment triggers	Expiry alerts; governance gap report
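Step 8 of the flow above, the governance gate in the training pipeline, might be implemented along these lines. The registry entry is shown as a plain dictionary with hypothetical keys; the gate refuses to start a run unless the dataset version is Approved and the bytes about to be read match the registered content hash.

```python
import hashlib


class GovernanceGateError(RuntimeError):
    """Raised when a training run must not start."""


def enforce_training_gate(entry: dict, content: bytes) -> None:
    """Raise unless the dataset version is approved and unmodified."""
    label = f"{entry['dataset_id']} v{entry['version']}"
    if entry["status"] != "Approved":
        raise GovernanceGateError(f"{label} has status {entry['status']}, not Approved")
    actual_hash = hashlib.sha256(content).hexdigest()
    if actual_hash != entry["content_hash"]:
        raise GovernanceGateError(f"{label} content does not match registered hash")


data = b"id,income\n1,52000\n"
approved_entry = {
    "dataset_id": "credit-applications",
    "version": 1,
    "status": "Approved",
    "content_hash": hashlib.sha256(data).hexdigest(),
}
enforce_training_gate(approved_entry, data)  # returns silently, run may start
```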

Error Flow

Error Condition	Trigger	Response	Recovery
Training run attempted with unapproved dataset	Pipeline requests training on unapproved dataset ID	Training pipeline blocked by governance gate	Team completes governance approval workflow before resubmitting
IP clearance expired for training dataset	Clearance expiry date reached	Dataset status set to Restricted; dependent models flagged	Legal team renews licence or confirms expiry acceptable; status updated
Bias assessment finds high-risk demographic skew	Population Stability Index (PSI) > 0.25 for a protected group	Dataset flagged; human review required before approval	ML team and domain expert review skew; remediation (resampling, additional data collection) or documented acceptance
Consent record invalidated (consent withdrawn at scale)	Large-scale consent withdrawal affecting training dataset	Training pipeline notified; dataset flagged for re-evaluation	Remove withdrawn records; re-register updated dataset version
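The PSI trigger in the error flow above can be computed as follows. This sketch uses the common definition, the sum over categories of (actual − expected) × ln(actual / expected) on proportions, with a small floor to avoid taking the logarithm of zero. The age bands and counts are invented for illustration, and the choice of reference population is an assumption that the bias assessor would need to make.

```python
from math import log


def psi(
    expected_counts: dict[str, int],
    actual_counts: dict[str, int],
    floor: float = 1e-6,
) -> float:
    """Population Stability Index between two categorical distributions."""
    expected_total = sum(expected_counts.values())
    actual_total = sum(actual_counts.values())
    total = 0.0
    for category in set(expected_counts) | set(actual_counts):
        # Proportions are floored so an empty category does not cause log(0).
        expected = max(expected_counts.get(category, 0) / expected_total, floor)
        actual = max(actual_counts.get(category, 0) / actual_total, floor)
        total += (actual - expected) * log(actual / expected)
    return total


# Expected: population the model will serve; actual: training dataset.
inference_population = {"18-34": 300, "35-54": 400, "55+": 300}
training_dataset = {"18-34": 600, "35-54": 300, "55+": 100}

score = psi(inference_population, training_dataset)
needs_review = score > 0.25  # threshold taken from the error flow table
```

With these example counts the training data over-represents the youngest band and the score exceeds the 0.25 threshold, so the dataset would be flagged for human review.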

8. Security Considerations

Authentication & Authorisation

Training Data Registry write access restricted to ML Platform service identity and designated data engineers.
Dataset content in immutable store: write access locked after registration; read access controlled by ML Platform.
Governance workflow approval requires authenticated Dataset Governance Officer identity.

Secrets Management

No secrets in training dataset files; credentials for accessing source systems managed in secrets manager.

Data Classification

Training datasets classified based on most sensitive data element; classification enforced in registry metadata.
Immutable store access tiered by dataset classification.

Encryption

Datasets encrypted at rest (AES-256); encryption keys in KMS.
Dataset content hash computed before encryption and stored for integrity verification.

Auditability

Every governance workflow decision logged with actor, decision, timestamp, and justification.
Dataset access for training logged: which training run read which dataset version.
IP clearance status changes logged; ownership trail maintained.

OWASP LLM Top 10 Mapping

OWASP LLM Risk	Relevance	Mitigation
LLM03: Training Data Poisoning	Unreviewed dataset could contain adversarial records	Governance approval workflow; quality scorecard gate
LLM06: Sensitive Information Disclosure	PII in training data surfaces in model	Consent record + privacy review gate in governance workflow
LLM02: Insecure Output Handling	Model trained on biased data produces biased outputs	Bias Assessment Report gate; downstream bias monitoring

9. Governance Considerations

Responsible AI

Bias Assessment is a mandatory governance gate for all consequential AI training datasets.
Dataset Governance Officer is accountable for approving bias assessment outcomes.

Model Risk Management

Model risk frameworks require training data governance documentation; Training Data Registry provides this automatically.
Model lifecycle audit requires dataset version lineage; registry + model registry link provides this.

Human Approval Checkpoints

Dataset Governance Officer approval required before any dataset enters Approved status.
Conditional approval (with documented exceptions) requires CDO sign-off.
IP clearance renewal requires legal counsel review.

Governance Artefacts

Artefact	Owner	Cadence	Purpose
Provenance Record	Data Engineer	Per dataset version	Documents data sources, transformations, collection period
Bias Assessment Report	Bias Assessor	Per dataset version (consequential AI)	Demographic distribution, proxy analysis, label bias
IP Clearance Record	Legal / IP Counsel	Per third-party data source	Licence type, AI training permission, expiry
Consent Record	Privacy Officer	Per dataset version (personal data)	Legal basis, consent scope, date range, withdrawal status
Governance Approval Record	Dataset Governance Officer	Per dataset version	Decision, conditions, approver identity, timestamp
Dataset Deprecation Impact Report	ML Platform	Before deprecation	Models and predictions impacted by dataset removal

10. Operational Considerations

Monitoring

Metric	Alert Threshold	Tooling
Governance approval SLA	>10 business days without decision	Workflow system alert
IP clearance expiry	90 days before expiry	Compliance dashboard alert
Datasets in Approved status without bias assessment (if required)	Any	Governance gap report
Training runs using unapproved dataset (blocked)	Any attempted bypass	Pipeline security gate log
Consent record linkage for personal data datasets	<100%	Governance gap report
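The IP clearance expiry alert in the table above reduces to a date-window query. A minimal sketch, assuming expiry dates are available per data source (the source names and dates are invented):

```python
from datetime import date, timedelta


def expiring_clearances(
    expiry_by_source: dict[str, date],
    today: date,
    warning_days: int = 90,  # alert threshold from the monitoring table
) -> list[tuple[str, int]]:
    """(source, days remaining) for clearances expiring within the window.

    Already-expired clearances are excluded; those are handled by the
    governance gate rather than by an early-warning alert.
    """
    horizon = today + timedelta(days=warning_days)
    alerts = [
        (source, (expires - today).days)
        for source, expires in expiry_by_source.items()
        if today <= expires <= horizon
    ]
    return sorted(alerts, key=lambda item: item[1])


alerts = expiring_clearances(
    {
        "news-corpus": date(2025, 4, 15),
        "image-set": date(2026, 1, 1),
        "forum-scrape": date(2025, 2, 1),
    },
    today=date(2025, 3, 1),
)
```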

SLOs

SLO	Target	Measurement
Dataset governance approval (standard datasets)	≤5 business days	Workflow timestamps
Governance gap closure (missing artefact)	≤10 business days after detection	Dashboard + Jira tracking
Training Data Registry availability	99.9%	Health check
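Measuring the approval SLO from workflow timestamps requires counting business days. The sketch below treats Monday to Friday as business days and ignores public holidays, which is a simplifying assumption; the function names are hypothetical.

```python
from datetime import date, timedelta


def business_days_elapsed(start: date, end: date) -> int:
    """Count weekdays after `start`, up to and including `end`."""
    days = 0
    current = start
    while current < end:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            days += 1
    return days


def approval_slo_breached(submitted: date, as_of: date, target_days: int = 5) -> bool:
    """True if a dataset has waited longer than the SLO target for a decision."""
    return business_days_elapsed(submitted, as_of) > target_days


submitted = date(2025, 3, 3)  # a Monday
on_target = approval_slo_breached(submitted, date(2025, 3, 10))
breached = approval_slo_breached(submitted, date(2025, 3, 11))
```

Passing `target_days=10` gives the monitoring alert threshold from the table in the Monitoring subsection.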

11. Cost Considerations

Cost Drivers

Cost Driver	Typical Range	Notes
Training Data Registry (custom build)	$5,000–$50,000 one-time + $500–$2,000/month ops	Custom database + API
Immutable dataset storage	$100–$3,000/month	Scales with dataset volume
Bias assessment tooling	$0–$2,000/month	AI Fairness 360 is free open-source software; enterprise bias platforms are licensed
IP counsel reviews	$500–$5,000 per dataset	Per third-party dataset
Governance workflow engineering	0.5–1 FTE	Setup + ongoing management
Dataset Governance Officer time	0.25–0.5 FTE	Review and approval workload

Indicative Cost Range

Scale	Monthly Cost	Basis
Small (1–3 models, <10 datasets)	$2,000–$8,000	Custom registry + manual workflow
Medium (5–15 models, 20–50 datasets)	$8,000–$25,000	Custom registry + automated workflow + bias tooling
Large (20+ models, 100+ datasets)	$25,000–$80,000	Enterprise governance platform + full automation

12. Trade-Off Analysis

Option Comparison

Option	Pros	Cons	Recommended When
A: Full Training Data Governance Framework (this pattern)	Regulatory-grade; complete audit trail; IP protection	High governance overhead; slows initial dataset registration	Regulated industry; production AI; EU AI Act obligation
B: MLflow Dataset Tracking only	Lightweight; integrated with existing MLflow	No bias assessment, IP clearance, or consent management	Research AI; no regulatory obligation
C: DVC (Data Version Control) only	Good versioning; reproducibility; git-like workflow	No governance workflow; no bias/IP/consent management	Open-source / research context
D: No training data governance	Zero overhead	Fails regulatory audit; legal IP risk; no reproducibility	Never for production AI

Architectural Tensions

Tension	Trade-Off	Resolution
Governance thoroughness vs. ML team velocity	Full governance slows dataset iteration	Tiered governance: lightweight for experiments; full for production
Immutability vs. data correction	Immutable storage prevents correcting bad data	Corrections create new dataset versions; governance workflow for corrections
Centralised governance vs. domain ownership	Central team = bottleneck; domain teams = inconsistency	Domain-owned datasets + central governance standards + automated checks

13. Failure Modes

Failure	Likelihood	Impact	Detection	Recovery
Training dataset modified after governance approval	Medium	High — regulatory audit finds discrepancy	Content hash comparison; immutable storage	Object Lock prevents modification; hash check in training pipeline
IP clearance missed for third-party data subset	Medium	High — copyright infringement risk	Governance workflow IP check gate	Legal review of training dataset composition; remove or relicence affected data
Bias assessment not triggered (automation gap)	Medium	High — biased model deployed without assessment	Governance gap report	Mandatory bias assessment in workflow automation; backfill for existing datasets
Governance Officer backlog — approval SLA missed	High	Medium — ML team blocked; velocity impact	SLA monitoring in workflow system	Delegate approval authority; increase DGO capacity; automate low-risk approvals

14. Regulatory Considerations

Regulation	Article/Clause	Requirement	Pattern Response
EU AI Act	Article 10(3)	Training data requirements: relevance, representativeness, absence of errors	Provenance + quality scorecard + bias assessment
EU AI Act	Article 10(2)(f)-(g)	Examine data for biases; take corrective action	Mandatory bias assessment gate
EU AI Act	Article 17	Quality management system documentation	Training Data Registry serves as quality management documentation
EU AI Act	Article 18	Documentation kept for 10 years after the system is placed on the market or put into service	Immutable dataset store + registry retained per schedule
APRA CPS 234	§32	Information asset management	Dataset registration and version control
Privacy Act (Australia)	APP 3/6	Collection and use limitation	Consent Record gate in governance workflow
Copyright law	Various	AI training on copyrighted data	IP Clearance Record; licence review gate
ISO 42001	§8.4	Data governance for AI	Training Data Registry implements ISO 42001 §8.4

15. Reference Implementations

AWS

Component	AWS Service
Training Data Registry	Amazon DynamoDB + API Gateway (custom)
Immutable Dataset Store	Amazon S3 with Object Lock (WORM)
Governance Workflow	AWS Step Functions + SNS notifications
Bias Assessment	SageMaker Clarify
Model Registry Linkage	SageMaker Model Registry + dataset tracking

Azure

Component	Azure Service
Training Data Registry	Azure Cosmos DB + custom API
Immutable Dataset Store	Azure Immutable Blob Storage
Governance Workflow	Azure Logic Apps + Azure DevOps
Bias Assessment	Azure ML Responsible AI dashboard
Model Linkage	Azure ML Model Registry

GCP

Component	GCP Service
Training Data Registry	Cloud Firestore + custom API
Immutable Dataset Store	GCS with retention policy
Governance Workflow	Cloud Workflows + Pub/Sub
Bias Assessment	Vertex Explainable AI + custom AIF360 job
Model Linkage	Vertex AI Model Registry

On-Premises

Component	Technology
Training Data Registry	PostgreSQL + FastAPI
Immutable Store	MinIO with Object Lock
Governance Workflow	Apache Airflow (human-in-the-loop tasks)
Bias Assessment	AI Fairness 360 + Aequitas on Kubernetes
Model Linkage	MLflow

16. Related Patterns

Pattern	ID	Relationship	Notes
Data Lineage for AI	EAAPL-DAT003	Complements	Dataset versions are key nodes in the AI lineage graph
Data Quality for AI	EAAPL-DAT002	Depends on	Quality Scorecard is a mandatory artefact in governance workflow
Privacy by Design for AI Data	EAAPL-DAT005	Depends on	Consent Record integration is a governance workflow step
Synthetic Data Generation	EAAPL-DAT004	Complements	Synthetic datasets must be registered and governed
Model Versioning	EAAPL-MDL001	Bidirectional	Model version ↔ dataset version lineage
Fine-Tuning Pipeline	EAAPL-MDL006	Depends on	Fine-tuning training data must be registered and governed

17. Maturity Assessment

Overall Maturity: Proven — Training data versioning (DVC, MLflow) is mature. Formal governance workflows for bias/IP/consent are increasingly required by regulation; tooling is maturing rapidly. EU AI Act enforcement starting 2026 is accelerating adoption.

Dimension	Score (1–5)	Notes
Architectural clarity	5	Well-defined components and workflow
Tooling maturity	3	Registry custom-built in most orgs; integrated platforms emerging
Regulatory alignment	5	Direct EU AI Act Art. 10/17 implementation
Operational complexity	3	Governance officer workload; automation reduces over time
Cost efficiency	4	Offset by regulatory risk reduction and IP protection
Security	4	Immutable storage; access controls; audit logging

18. Revision History

Version	Date	Author	Changes
1.0	2023-10-15	EAAPL Working Group	Initial publication
1.1	2024-07-01	EAAPL Working Group	Added EU AI Act Article 10 deep mapping; IP clearance detail
1.2	2025-03-01	EAAPL Working Group	Added copyright law section; updated tooling references
