EAAPL: Enterprise AI Architecture Pattern Library

AI Training Data Governance

Data Architecture | EU AI Act | ISO/IEC 42001 | Field-tested in AU

[EAAPL-DAT007] AI Training Data Governance

Category: Data Architecture
Sub-category: Data Governance / AI Training
Version: 1.2
Maturity: Proven
Tags: training-data-governance, dataset-versioning, provenance, bias-assessment, licence-management, consent-records
Regulatory Relevance: EU AI Act Articles 10/17, APRA CPS 234, Privacy Act APP 3/6, ISO 42001 §8.4, NIST AI RMF GOVERN-1.2


1. Executive Summary

AI training data governance is the foundation of responsible AI. A model is only as trustworthy as the data it was trained on. Yet most organisations lack systematic governance for training datasets: no formal registration, no version history, no bias assessment, no IP/licence tracking for third-party data, and no audit-grade consent records.

This pattern defines a comprehensive AI training data governance framework covering the full lifecycle from dataset acquisition through model deprecation. It establishes a Training Data Registry as the system of record, with mandatory governance artefacts for every dataset used in production AI training: provenance declaration, bias assessment report, licence and IP clearance, consent record, and quality scorecard.

Organisations that implement this pattern can respond to EU AI Act Article 10 enquiries and other regulatory audits in hours rather than months, demonstrate systematic bias management to auditors, and prevent costly legal disputes over training data IP ownership.

Target audience: Chief Data Officers, AI Governance leads, Legal/IP Counsel, ML Platform leads.


2. Problem Statement

Business Problem

Organisations face increasing regulatory and legal pressure to demonstrate that AI training data was lawfully acquired, appropriately consented, free from prohibited bias, and licensed for AI training use. Without systematic governance, they cannot make this demonstration.

Technical Problem

  • Training datasets are created ad hoc by ML engineers, with no formal registration or versioning.
  • Nothing systematically tracks whether training data contains third-party IP with AI training restrictions.
  • Bias assessments, where they are done at all, are informal and not linked to the training dataset version or model version.
  • Consent records for data used in AI training are not linked to training datasets, so the organisation cannot prove consent was valid at training time.
  • Dataset versions are not immutable; datasets are overwritten, destroying the audit trail.

Symptoms

  • Cannot answer "which data trained this model?" for a production model built 12 months ago.
  • Legal team discovers training data included copyrighted text without an AI training licence.
  • Regulatory audit requires bias assessment for training data; no formal assessment exists.
  • Data subject withdraws consent; organisation cannot determine if that subject's data was used in training.
  • Training dataset changed after model validation but before production deployment; discrepancy discovered in audit.

Cost of Inaction

| Dimension | Impact |
|---|---|
| Regulatory | EU AI Act Article 10 violation; APRA enforcement; Privacy Act penalty |
| Legal | Copyright infringement claims for training data; settlements in tens of millions |
| Reputational | Public disclosure of biased training data triggers brand crisis |
| Operational | Manual reconstruction of training data history takes weeks per model |

3. Context

When to Apply

  • Any AI system trained on data for production deployment.
  • AI systems subject to regulatory review (EU AI Act, APRA, Privacy Act).
  • AI using third-party licensed data or web-scraped data.
  • AI systems where bias is a material risk (credit, employment, health, law enforcement).
  • Organisations with multiple ML teams producing models (governance prevents divergent practices).

When NOT to Apply

  • Pure research experimentation with public benchmark datasets.
  • AI trained entirely on proprietary, clearly consented, non-sensitive internal data with no regulatory obligation.

Prerequisites

| Prerequisite | Minimum Viable | Preferred |
|---|---|---|
| Dataset storage | File system with versioning | Immutable object store with version IDs |
| ML platform | MLflow (basic) | MLflow + DVC + Model Registry |
| Data catalogue | Spreadsheet | DataHub / Atlan with API |
| Legal counsel | Internal review | IP specialist + privacy counsel |
| Bias assessment tooling | Manual statistical analysis | AI Fairness 360, Aequitas, Fairlearn |

Industry Applicability

| Industry | Applicability | Driver |
|---|---|---|
| Financial Services | Critical | APRA model risk; credit decision AI; GDPR |
| Healthcare | Critical | Clinical AI; patient consent; EU AI Act high-risk |
| Government | Critical | Public sector AI accountability; FOI obligations |
| Legal / RegTech | High | AI-assisted legal decisions; IP liability |
| Retail | Medium | Personalisation AI; consent management |
| Technology | High | Foundation model training; IP clearance critical |

4. Architecture Overview

Design Philosophy

The core principle of AI training data governance is that a training dataset is a first-class governed artefact — as formally managed as a production software release. The Training Data Registry is the system of record: every dataset used in production model training must have a registered, versioned, governance-approved entry before a training run can proceed.

Dataset Registration and Versioning. Each training dataset is registered with a unique ID and version in the Training Data Registry. The dataset is stored in an immutable object store (S3 with Object Lock, GCS with retention policy) — once registered, the dataset content cannot be changed. If the dataset is updated, a new version is registered. This immutability is the foundation of reproducible AI: given a model version, the exact training data can always be retrieved.
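
The registration rule above (content never changes; an update becomes a new version) can be sketched in a few lines. This is a minimal in-memory illustration, not the pattern's prescribed implementation: `InMemoryRegistry`, `DatasetVersion`, and the choice of SHA-256 are assumptions made for the example, and a real registry would sit in front of an object-locked store.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetVersion:
    dataset_id: str
    version: int
    sha256: str  # content hash recorded at registration time


class InMemoryRegistry:
    """Hypothetical stand-in for the Training Data Registry."""

    def __init__(self) -> None:
        self._versions: dict[str, list[DatasetVersion]] = {}

    def register(self, dataset_id: str, content: bytes) -> DatasetVersion:
        digest = hashlib.sha256(content).hexdigest()
        history = self._versions.setdefault(dataset_id, [])
        # Re-registering identical content returns the existing version.
        for existing in history:
            if existing.sha256 == digest:
                return existing
        # Changed content never overwrites; it becomes the next version.
        entry = DatasetVersion(dataset_id, len(history) + 1, digest)
        history.append(entry)
        return entry

    def verify(self, dataset_id: str, version: int, content: bytes) -> bool:
        """Check that retrieved bytes still match the registered hash."""
        entry = self._versions[dataset_id][version - 1]
        return hashlib.sha256(content).hexdigest() == entry.sha256


registry = InMemoryRegistry()
v1 = registry.register("credit-applications", b"row1\nrow2\n")
v2 = registry.register("credit-applications", b"row1\nrow2\nrow3\n")
print(v1.version, v2.version)
```

A training pipeline could call `verify` on the bytes it reads back, which is one way to confirm that the data behind a given model version is exactly what was registered.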

Provenance Declaration. For each dataset, the registering team must declare: data sources (which operational systems, external datasets, or acquired datasets contributed records); transformation logic (which pipelines produced the dataset from sources); collection period (the date range of data collection); and known exclusions (records excluded and why). This information is captured in a structured Provenance Record and linked to the dataset version.
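
One possible shape for the structured Provenance Record is shown below. The field names and the `missing_fields` completeness check are illustrative assumptions; the pattern only specifies which information must be declared, not how it is encoded.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date


@dataclass(frozen=True)
class ProvenanceRecord:
    dataset_id: str
    version: int
    sources: tuple[str, ...]                   # operational systems, external or acquired datasets
    transformation_pipelines: tuple[str, ...]  # pipelines that produced the dataset
    collection_start: date
    collection_end: date
    known_exclusions: tuple[str, ...] = ()     # what was excluded and why

    def missing_fields(self) -> list[str]:
        """List declaration gaps that an automated check could reject."""
        problems = []
        if not self.sources:
            problems.append("sources")
        if not self.transformation_pipelines:
            problems.append("transformation_pipelines")
        if self.collection_end < self.collection_start:
            problems.append("collection_period")
        return problems

    def to_json(self) -> str:
        # Dates are serialised as ISO strings via default=str.
        return json.dumps(asdict(self), default=str, sort_keys=True)


record = ProvenanceRecord(
    dataset_id="credit-applications",
    version=1,
    sources=("loan-origination-db",),
    transformation_pipelines=("dedupe-and-mask",),
    collection_start=date(2023, 1, 1),
    collection_end=date(2023, 12, 31),
    known_exclusions=("test accounts removed",),
)
print(record.missing_fields())
print(record.to_json())
```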

Bias Assessment. For every training dataset used in consequential AI (EU AI Act Annex III, or internally classified high-risk), a Bias Assessment Report is mandatory. The assessment evaluates: demographic distribution (is the training population representative of the inference population?); historical bias (does the data encode historical discrimination?); proxy variable risk (do features correlate with protected attributes?); label bias (were labels applied inconsistently across demographic groups?). The assessment uses standardised tools (AI Fairness 360, Aequitas) and is reviewed by a designated bias assessor (independent of the ML team that built the dataset).

Licence and IP Management. Third-party data (purchased datasets, web-scraped data, open datasets) must have IP clearance before use in AI training. The IP Clearance Record documents: source licence type; whether the licence explicitly permits AI training use; jurisdictional restrictions; expiry date; any attribution obligations. This is enforced by the governance workflow: training runs cannot proceed for datasets with expired, missing, or prohibitive IP clearance.
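
The enforcement rule (no training with expired, missing, or prohibitive clearance) could be expressed as a simple gate. The sketch below assumes a hypothetical `IPClearance` record and source names invented for the example; it omits jurisdictional restrictions and attribution obligations for brevity.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class IPClearance:
    source: str
    licence_type: str
    permits_ai_training: bool
    expiry: Optional[date] = None  # None means the licence has no expiry date


def clearance_blockers(
    third_party_sources: list[str],
    clearances: dict[str, IPClearance],
    today: date,
) -> list[str]:
    """Return the reasons training must not proceed; empty means clear."""
    blockers = []
    for source in third_party_sources:
        record = clearances.get(source)
        if record is None:
            blockers.append(f"{source}: missing IP clearance")
        elif not record.permits_ai_training:
            blockers.append(f"{source}: licence does not permit AI training")
        elif record.expiry is not None and record.expiry < today:
            blockers.append(f"{source}: clearance expired on {record.expiry.isoformat()}")
    return blockers


clearances = {
    "vendor-a": IPClearance("vendor-a", "commercial", True, date(2024, 1, 1)),
    "vendor-b": IPClearance("vendor-b", "research-only", False),
}
for reason in clearance_blockers(["vendor-a", "vendor-b", "vendor-c"], clearances, date(2025, 1, 1)):
    print(reason)
```

Returning reasons rather than a bare boolean gives the governance workflow something to show the team that was blocked.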

Consent Record Integration. For datasets containing personal information, a Consent Record is required documenting: the legal basis for processing (consent, legitimate interest, statutory obligation); the consent scope (which uses are covered); the consent date range (were all subjects consenting when the data was collected?); and the consent withdrawal propagation mechanism. This integrates with the Privacy by Design pattern (EAAPL-DAT005).
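
Two questions recur in this area: was a subject's consent valid for this use on the training date, and which dataset versions contain a subject who has since withdrawn? A rough sketch of both follows. `SubjectConsent`, the scope string, and the membership index keyed by strings such as "churn@2" are all assumptions for illustration; in practice the index would hold pseudonymous identifiers supplied by the consent management platform.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class SubjectConsent:
    subject_id: str
    scopes: frozenset[str]
    granted: date
    withdrawn: Optional[date] = None

    def valid_on(self, day: date, scope: str) -> bool:
        """True if consent covered `scope` and was in force on `day`."""
        if scope not in self.scopes or day < self.granted:
            return False
        return self.withdrawn is None or day < self.withdrawn


def affected_versions(membership: dict[str, set[str]], subject_id: str) -> list[str]:
    """Dataset versions whose records include the given subject."""
    return sorted(v for v, subjects in membership.items() if subject_id in subjects)


consent = SubjectConsent(
    "s1", frozenset({"model-training"}), date(2023, 1, 1), withdrawn=date(2024, 6, 1)
)
membership = {"churn@1": {"s1"}, "churn@2": {"s1", "s2"}, "fraud@1": {"s3"}}
print(consent.valid_on(date(2024, 1, 15), "model-training"))
print(affected_versions(membership, "s1"))
```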

Governance Workflow. Dataset registration triggers an automated governance workflow: (1) automated checks (schema validation, quality scorecard linkage, completeness of provenance record); (2) bias assessment submission (if required by risk classification); (3) IP clearance review (if third-party data); (4) consent record review (if personal data); (5) approval by Dataset Governance Officer. Only approved datasets appear in the "approved for production training" view of the registry.
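
The conditional steps in this workflow can be modelled as a small routing function. The sketch below is one way to do it under simplifying assumptions: the `Step` enum and function names are hypothetical, and the steps run strictly in the listed order, whereas a real workflow engine might run reviews in parallel.

```python
from enum import Enum
from typing import Optional


class Step(Enum):
    AUTOMATED_CHECKS = 1
    BIAS_ASSESSMENT = 2
    IP_CLEARANCE = 3
    CONSENT_REVIEW = 4
    OFFICER_APPROVAL = 5


def required_steps(high_risk: bool, third_party: bool, personal_data: bool) -> list[Step]:
    """Select workflow steps from the dataset's risk attributes."""
    steps = [Step.AUTOMATED_CHECKS]
    if high_risk:
        steps.append(Step.BIAS_ASSESSMENT)
    if third_party:
        steps.append(Step.IP_CLEARANCE)
    if personal_data:
        steps.append(Step.CONSENT_REVIEW)
    steps.append(Step.OFFICER_APPROVAL)  # always the final gate
    return steps


def next_step(required: list[Step], passed: set[Step]) -> Optional[Step]:
    """First outstanding step, or None once every required step has passed."""
    for step in required:
        if step not in passed:
            return step
    return None


plan = required_steps(high_risk=True, third_party=False, personal_data=True)
print([s.name for s in plan])
print(next_step(plan, {Step.AUTOMATED_CHECKS}))
```

A dataset would appear in the "approved for production training" view only when `next_step` returns `None`.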


5. Architecture Diagram

flowchart TD
    subgraph Input["Dataset Acquisition"]
        A[Internal Data]
        B[Third-Party Data]
    end
    subgraph Governance["Governance Workflow"]
        C[Dataset Registry]
        D[Bias and IP Assessment]
        E{Governance Approval}
    end
    subgraph Output["Production Pipeline"]
        F[(Approved Dataset Store)]
        G[ML Training Pipeline]
    end
    A --> C
    B --> C
    C --> D
    D --> E
    E -->|approved| F
    E -->|rejected| C
    F --> G
    G -->|lineage| C
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#dbeafe,stroke:#3b82f6
    style C fill:#f0fdf4,stroke:#22c55e
    style D fill:#f0fdf4,stroke:#22c55e
    style E fill:#f3e8ff,stroke:#a855f7
    style F fill:#fef9c3,stroke:#eab308
    style G fill:#d1fae5,stroke:#10b981

6. Components

| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| Training Data Registry | Database + API | System of record for all training datasets; version management; governance status | Custom PostgreSQL + REST API; MLflow Dataset tracking; DVC | Critical |
| Immutable Dataset Store | Storage | Content-addressable, write-once storage for registered training datasets | S3 Object Lock, GCS Retention Policy, Azure Immutable Blob Storage | Critical |
| Provenance Record Schema | Data Schema | Structured provenance declaration per dataset version | JSON Schema, linked to registry via dataset ID | High |
| Bias Assessment Engine | Processing | Automated demographic distribution analysis; proxy variable detection | AI Fairness 360, Aequitas, Fairlearn, custom pandas/scipy | High |
| IP Clearance Database | Database | Tracks licence type, AI training permission, expiry, attribution requirements per data source | Custom PostgreSQL; Collibra data governance; spreadsheet (minimum) | High |
| Consent Record Integration | Integration | Links training dataset to consent records from consent management platform | Custom integration; OneTrust API | High |
| Governance Workflow Engine | Orchestration | Manages multi-step dataset approval workflow; notifications; escalation | Jira workflows, custom Airflow DAG, ServiceNow | High |
| Dataset Governance Officer Role | Human Role | Reviews and approves datasets; owns governance workflow | Organisational role; may delegate to domain owners | Critical |
| Model Registry Linkage | Integration | Bidirectional link: model version → dataset version; dataset version → model versions | MLflow dataset tracking; custom bidirectional index | Critical |
| Compliance Dashboard | Application | Shows governance coverage gaps; expiry alerts; regulatory query support | Grafana, custom React, Metabase | Medium |

7. Data Flow

Primary Flow

| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | ML Team / Data Engineer | Acquires dataset; registers in Training Data Registry with provenance declaration | Dataset ID + version; Provenance Record |
| 2 | Immutable Store | Dataset written to Object-Lock storage; hash computed | Immutable dataset with content hash |
| 3 | Governance Workflow | Automated checks: schema valid, quality scorecard linked, provenance complete | Check pass/fail report |
| 4 | Bias Assessor | Runs bias assessment; submits Bias Assessment Report | Bias Assessment Report linked to dataset version |
| 5 | IP Counsel | Reviews licence; records IP Clearance Record | IP Clearance: approved/restricted/prohibited |
| 6 | Privacy Officer | Reviews consent record; confirms legal basis; links to consent system | Consent Record linked to dataset version |
| 7 | Dataset Governance Officer | Reviews all artefacts; approves or rejects | Dataset status: Approved / Rejected / Conditional |
| 8 | ML Platform | Training pipeline validates dataset ID is in Approved status before starting training | Training run approved to start |
| 9 | Model Registry | Training run completes; model version linked to dataset version | Bidirectional model ↔ dataset lineage |
| 10 | Compliance Dashboard | Continuously monitors for expiring IP clearances; consent renewals; bias reassessment triggers | Expiry alerts; governance gap report |
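
Steps 8 and 9 of the primary flow (the approval gate and the bidirectional lineage link) might look like the following. This is a sketch under stated assumptions: registry status is a plain dictionary, `train` is any callable returning a model version string, and the version strings such as "churn@3" are invented for the example.

```python
from collections import defaultdict
from typing import Callable


class DatasetNotApproved(Exception):
    """Raised when a training run targets a dataset that is not Approved."""


class LineageIndex:
    """Hypothetical bidirectional model/dataset lineage index."""

    def __init__(self) -> None:
        self.datasets_by_model: dict[str, set[str]] = defaultdict(set)
        self.models_by_dataset: dict[str, set[str]] = defaultdict(set)

    def link(self, model_version: str, dataset_version: str) -> None:
        self.datasets_by_model[model_version].add(dataset_version)
        self.models_by_dataset[dataset_version].add(model_version)


def run_training(
    dataset_version: str,
    registry_status: dict[str, str],
    train: Callable[[str], str],
    lineage: LineageIndex,
) -> str:
    # Step 8: refuse to start unless the registry says Approved.
    status = registry_status.get(dataset_version, "Unregistered")
    if status != "Approved":
        raise DatasetNotApproved(f"{dataset_version} has status {status}")
    model_version = train(dataset_version)
    # Step 9: record lineage in both directions.
    lineage.link(model_version, dataset_version)
    return model_version


lineage = LineageIndex()
statuses = {"churn@3": "Approved", "churn@4": "Rejected"}
model = run_training("churn@3", statuses, lambda ds: "churn-model@7", lineage)
print(model, lineage.models_by_dataset["churn@3"])
```

With both indexes populated, "which data trained this model?" and "which models used this dataset?" are each a single lookup.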

Error Flow

| Error Condition | Trigger | Response | Recovery |
|---|---|---|---|
| Training run attempted with unapproved dataset | Pipeline requests training on unapproved dataset ID | Training pipeline blocked by governance gate | Team completes governance approval workflow before resubmitting |
| IP clearance expired for training dataset | Clearance expiry date reached | Dataset status set to Restricted; dependent models flagged | Legal team renews licence or confirms expiry acceptable; status updated |
| Bias assessment finds high-risk demographic skew | Population Stability Index (PSI) > 0.25 for a protected group | Dataset flagged; human review required before approval | ML team and domain expert review skew; remediation (resampling, additional data collection) or documented acceptance |
| Consent record invalidated (consent withdrawn at scale) | Large-scale consent withdrawal affecting training dataset | Training pipeline notified; dataset flagged for re-evaluation | Remove withdrawn records; re-register updated dataset version |
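
The PSI trigger in the error flow can be illustrated with a short calculation. The sketch assumes PSI here means the Population Stability Index, computed over one protected attribute by comparing the training distribution against a reference (for example, the expected inference population). The group labels, counts, and the epsilon floor for empty groups are assumptions for the example.

```python
import math


def psi(reference: dict[str, float], training: dict[str, float], eps: float = 1e-6) -> float:
    """Population Stability Index between two categorical distributions.

    Inputs are counts (or weights) per group; both are normalised to shares.
    `eps` floors empty groups so the logarithm stays defined.
    """
    ref_total = sum(reference.values())
    trn_total = sum(training.values())
    total = 0.0
    for group in set(reference) | set(training):
        r = max(reference.get(group, 0.0) / ref_total, eps)
        t = max(training.get(group, 0.0) / trn_total, eps)
        total += (t - r) * math.log(t / r)
    return total


reference_population = {"group_a": 50, "group_b": 50}
training_population = {"group_a": 80, "group_b": 20}
score = psi(reference_population, training_population)
print(f"PSI = {score:.3f}; flag for review: {score > 0.25}")
```

For these illustrative counts the score is roughly 0.416, so the dataset would be flagged for human review under the 0.25 threshold.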

8. Security Considerations

Authentication & Authorisation

  • Training Data Registry write access restricted to ML Platform service identity and designated data engineers.
  • Dataset content in immutable store: write access locked after registration; read access controlled by ML Platform.
  • Governance workflow approval requires authenticated Dataset Governance Officer identity.

Secrets Management

  • No secrets in training dataset files; credentials for accessing source systems managed in secrets manager.

Data Classification

  • Training datasets classified based on most sensitive data element; classification enforced in registry metadata.
  • Immutable store access tiered by dataset classification.

Encryption

  • Datasets encrypted at rest (AES-256); encryption keys in KMS.
  • Dataset content hash computed before encryption and stored for integrity verification.

Auditability

  • Every governance workflow decision logged with actor, decision, timestamp, and justification.
  • Dataset access for training logged: which training run read which dataset version.
  • IP clearance status changes logged; ownership trail maintained.

OWASP LLM Top 10 Mapping

| OWASP LLM Risk | Relevance | Mitigation |
|---|---|---|
| LLM03: Training Data Poisoning | Unreviewed dataset could contain adversarial records | Governance approval workflow; quality scorecard gate |
| LLM06: Sensitive Information Disclosure | PII in training data surfaces in model | Consent record + privacy review gate in governance workflow |
| LLM02: Insecure Output Handling | Model trained on biased data produces biased outputs | Bias Assessment Report gate; downstream bias monitoring |

9. Governance Considerations

Responsible AI

  • Bias Assessment is a mandatory governance gate for all consequential AI training datasets.
  • Dataset Governance Officer is accountable for approving bias assessment outcomes.

Model Risk Management

  • Model risk frameworks require training data governance documentation; Training Data Registry provides this automatically.
  • Model lifecycle audit requires dataset version lineage; registry + model registry link provides this.

Human Approval Checkpoints

  • Dataset Governance Officer approval required before any dataset enters Approved status.
  • Conditional approval (with documented exceptions) requires CDO sign-off.
  • IP clearance renewal requires legal counsel review.

Governance Artefacts

| Artefact | Owner | Cadence | Purpose |
|---|---|---|---|
| Provenance Record | Data Engineer | Per dataset version | Documents data sources, transformations, collection period |
| Bias Assessment Report | Bias Assessor | Per dataset version (consequential AI) | Demographic distribution, proxy analysis, label bias |
| IP Clearance Record | Legal / IP Counsel | Per third-party data source | Licence type, AI training permission, expiry |
| Consent Record | Privacy Officer | Per dataset version (personal data) | Legal basis, consent scope, date range, withdrawal status |
| Governance Approval Record | Dataset Governance Officer | Per dataset version | Decision, conditions, approver identity, timestamp |
| Dataset Deprecation Impact Report | ML Platform | Before deprecation | Models and predictions impacted by dataset removal |

10. Operational Considerations

Monitoring

| Metric | Alert Threshold | Tooling |
|---|---|---|
| Governance approval SLA | >10 business days without decision | Workflow system alert |
| IP clearance expiry | 90 days before expiry | Compliance dashboard alert |
| Datasets in Approved status without bias assessment (if required) | Any | Governance gap report |
| Training runs using unapproved dataset (blocked) | Any attempted bypass | Pipeline security gate log |
| Consent record linkage for personal data datasets | <100% | Governance gap report |
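
Two of these alerts are simple enough to sketch directly: the 90-day IP clearance expiry warning and the "Approved without bias assessment" gap. The dictionary shapes and key names below are assumptions for illustration; a real dashboard would query the registry instead.

```python
from datetime import date, timedelta


def expiring_clearances(
    expiry_by_source: dict[str, date], today: date, window_days: int = 90
) -> list[str]:
    """Sources whose clearance expires within the alert window.

    Already-expired clearances are excluded; those are handled by the
    error flow rather than the early-warning alert.
    """
    cutoff = today + timedelta(days=window_days)
    return sorted(
        source for source, expiry in expiry_by_source.items() if today <= expiry <= cutoff
    )


def bias_assessment_gaps(datasets: list[dict]) -> list[str]:
    """Approved datasets that require a bias assessment but have none linked."""
    return [
        d["id"]
        for d in datasets
        if d["status"] == "Approved" and d["bias_required"] and not d["bias_report"]
    ]


today = date(2025, 1, 1)
expiries = {
    "vendor-a": date(2025, 3, 1),   # inside the 90-day window
    "vendor-b": date(2025, 6, 1),   # outside the window
    "vendor-c": date(2024, 12, 1),  # already expired
}
datasets = [
    {"id": "churn@3", "status": "Approved", "bias_required": True, "bias_report": None},
    {"id": "fraud@1", "status": "Approved", "bias_required": True, "bias_report": "BAR-12"},
    {"id": "logs@2", "status": "Approved", "bias_required": False, "bias_report": None},
]
print(expiring_clearances(expiries, today))
print(bias_assessment_gaps(datasets))
```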

SLOs

| SLO | Target | Measurement |
|---|---|---|
| Dataset governance approval (standard datasets) | ≤5 business days | Workflow timestamps |
| Governance gap closure (missing artefact) | ≤10 business days after detection | Dashboard + Jira tracking |
| Training Data Registry availability | 99.9% | Health check |

11. Cost Considerations

Cost Drivers

| Cost Driver | Typical Range | Notes |
|---|---|---|
| Training Data Registry (custom build) | $5,000–$50,000 one-time + $500–$2,000/month ops | Custom database + API |
| Immutable dataset storage | $100–$3,000/month | Scales with dataset volume |
| Bias assessment tooling | $0–$2,000/month | AI Fairness 360 is free (OSS); enterprise bias platforms are paid |
| IP counsel reviews | $500–$5,000 per dataset | Per third-party dataset |
| Governance workflow engineering | 0.5–1 FTE | Setup + ongoing management |
| Dataset Governance Officer time | 0.25–0.5 FTE | Review and approval workload |

Indicative Cost Range

| Scale | Monthly Cost | Basis |
|---|---|---|
| Small (1–3 models, <10 datasets) | $2,000–$8,000 | Custom registry + manual workflow |
| Medium (5–15 models, 20–50 datasets) | $8,000–$25,000 | Custom registry + automated workflow + bias tooling |
| Large (20+ models, 100+ datasets) | $25,000–$80,000 | Enterprise governance platform + full automation |

12. Trade-Off Analysis

Option Comparison

| Option | Pros | Cons | Recommended When |
|---|---|---|---|
| A: Full Training Data Governance Framework (this pattern) | Regulatory-grade; complete audit trail; IP protection | High governance overhead; slows initial dataset registration | Regulated industry; production AI; EU AI Act obligation |
| B: MLflow Dataset Tracking only | Lightweight; integrated with existing MLflow | No bias assessment, IP clearance, or consent management | Research AI; no regulatory obligation |
| C: DVC (Data Version Control) only | Good versioning; reproducibility; git-like workflow | No governance workflow; no bias/IP/consent management | Open-source / research context |
| D: No training data governance | Zero overhead | Fails regulatory audit; legal IP risk; no reproducibility | Never for production AI |

Architectural Tensions

| Tension | Trade-Off | Resolution |
|---|---|---|
| Governance thoroughness vs. ML team velocity | Full governance slows dataset iteration | Tiered governance: lightweight for experiments; full for production |
| Immutability vs. data correction | Immutable storage prevents correcting bad data | Corrections create new dataset versions; governance workflow for corrections |
| Centralised governance vs. domain ownership | Central team = bottleneck; domain teams = inconsistency | Domain-owned datasets + central governance standards + automated checks |

13. Failure Modes

| Failure | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| Training dataset modified after governance approval | Medium | High: regulatory audit finds discrepancy | Content hash comparison; immutable storage | Object Lock prevents modification; hash check in training pipeline |
| IP clearance missed for third-party data subset | Medium | High: copyright infringement risk | Governance workflow IP check gate | Legal review of training dataset composition; remove or relicence affected data |
| Bias assessment not triggered (automation gap) | Medium | High: biased model deployed without assessment | Governance gap report | Mandatory bias assessment in workflow automation; backfill for existing datasets |
| Governance Officer backlog (approval SLA missed) | High | Medium: ML team blocked; velocity impact | SLA monitoring in workflow system | Delegate approval authority; increase DGO capacity; automate low-risk approvals |

14. Regulatory Considerations

| Regulation | Article/Clause | Requirement | Pattern Response |
|---|---|---|---|
| EU AI Act | Article 10(3) | Training, validation and testing data sets must be relevant, sufficiently representative and, to the best extent possible, free of errors and complete | Provenance + quality scorecard + bias assessment |
| EU AI Act | Article 10(2)(f)-(g) | Examine data for possible biases; take measures to detect, prevent and mitigate them | Mandatory bias assessment gate |
| EU AI Act | Article 17 | Quality management system documentation | Training Data Registry serves as quality management documentation |
| EU AI Act | Articles 12 and 18 | Automatic event logging (Art. 12); technical documentation kept for 10 years after the system is placed on the market (Art. 18) | Immutable dataset store + registry retained per schedule |
| APRA CPS 234 | §32 | Information asset management | Dataset registration and version control |
| Privacy Act (Australia) | APP 3/6 | Collection and use limitation | Consent Record gate in governance workflow |
| Copyright law | Various | AI training on copyrighted data | IP Clearance Record; licence review gate |
| ISO 42001 | §8.4 | Data governance for AI | Training Data Registry implements ISO 42001 §8.4 |

15. Reference Implementations

AWS

| Component | AWS Service |
|---|---|
| Training Data Registry | Amazon DynamoDB + API Gateway (custom) |
| Immutable Dataset Store | Amazon S3 with Object Lock (WORM) |
| Governance Workflow | AWS Step Functions + SNS notifications |
| Bias Assessment | SageMaker Clarify |
| Model Registry Linkage | SageMaker Model Registry + dataset tracking |

Azure

| Component | Azure Service |
|---|---|
| Training Data Registry | Azure Cosmos DB + custom API |
| Immutable Dataset Store | Azure Immutable Blob Storage |
| Governance Workflow | Azure Logic Apps + Azure DevOps |
| Bias Assessment | Azure ML Responsible AI dashboard |
| Model Linkage | Azure ML Model Registry |

GCP

| Component | GCP Service |
|---|---|
| Training Data Registry | Cloud Firestore + custom API |
| Immutable Dataset Store | GCS with retention policy |
| Governance Workflow | Cloud Workflows + Pub/Sub |
| Bias Assessment | Vertex Explainable AI + custom AIF360 job |
| Model Linkage | Vertex AI Model Registry |

On-Premises

| Component | Technology |
|---|---|
| Training Data Registry | PostgreSQL + FastAPI |
| Immutable Store | MinIO with Object Lock |
| Governance Workflow | Apache Airflow (human-in-the-loop tasks) |
| Bias Assessment | AI Fairness 360 + Aequitas on Kubernetes |
| Model Linkage | MLflow |

16. Related Patterns

| Pattern | ID | Relationship | Notes |
|---|---|---|---|
| Data Lineage for AI | EAAPL-DAT003 | Complements | Dataset versions are key nodes in the AI lineage graph |
| Data Quality for AI | EAAPL-DAT002 | Depends on | Quality Scorecard is a mandatory artefact in governance workflow |
| Privacy by Design for AI Data | EAAPL-DAT005 | Depends on | Consent Record integration is a governance workflow step |
| Synthetic Data Generation | EAAPL-DAT004 | Complements | Synthetic datasets must be registered and governed |
| Model Versioning | EAAPL-MDL001 | Bidirectional | Model version ↔ dataset version lineage |
| Fine-Tuning Pipeline | EAAPL-MDL006 | Depends on | Fine-tuning training data must be registered and governed |

17. Maturity Assessment

Overall Maturity: Proven. Training data versioning (DVC, MLflow) is mature. Formal governance workflows for bias, IP, and consent are increasingly required by regulation, and tooling is maturing rapidly. EU AI Act obligations for high-risk systems, which begin to apply in 2026, are accelerating adoption.

| Dimension | Score (1–5) | Notes |
|---|---|---|
| Architectural clarity | 5 | Well-defined components and workflow |
| Tooling maturity | 3 | Registry custom-built in most orgs; integrated platforms emerging |
| Regulatory alignment | 5 | Direct EU AI Act Art. 10/17 implementation |
| Operational complexity | 3 | Governance officer workload; automation reduces over time |
| Cost efficiency | 4 | Offset by regulatory risk reduction and IP protection |
| Security | 4 | Immutable storage; access controls; audit logging |

18. Revision History

| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2023-10-15 | EAAPL Working Group | Initial publication |
| 1.1 | 2024-07-01 | EAAPL Working Group | Added EU AI Act Article 10 deep mapping; IP clearance detail |
| 1.2 | 2025-03-01 | EAAPL Working Group | Added copyright law section; updated tooling references |