EAAPL-DAT003 · Proven

Data Lineage for AI

Data Architecture · EU AI Act · ISO/IEC 42001 · Field-tested in AU

[EAAPL-DAT003] Data Lineage for AI

Category: Data Architecture
Sub-category: Data Lineage / AI Traceability
Version: 1.2
Maturity: Proven
Tags: data-lineage, provenance, OpenLineage, explainability, impact-analysis, traceability, regulatory-audit
Regulatory Relevance: EU AI Act Articles 12 & 13, APRA CPS 234, GDPR Article 22, ISO 42001 §8.6, NIST AI RMF GOVERN-6

1. Executive Summary

Regulators, auditors, and risk committees increasingly demand that organisations explain not just what an AI model predicted, but why — tracing back through the model architecture, training data, and ultimately to raw source systems. Traditional data lineage tools capture ETL pipelines but stop at the data warehouse boundary, leaving the AI layer invisible.

This pattern defines an end-to-end AI data lineage architecture using the OpenLineage standard, capturing lineage from raw source systems through every transformation stage, training run, model version, and inference event. The lineage graph enables regulators to answer: "Which version of which data, processed how, produced this model, which made this prediction on this date?"

Beyond compliance, AI lineage delivers operational value: when a source system changes, impact analysis identifies all downstream models at risk within minutes instead of weeks. Organisations adopting this pattern have reduced regulatory investigation response time from weeks to hours and eliminated surprise model breakages from upstream schema changes.

Target audience: Chief Data Officers, Chief Compliance Officers, Enterprise Architects, ML Platform leads.

2. Problem Statement

Business Problem

When an AI model makes a consequential decision (loan rejection, insurance claim denial, clinical risk assessment), the organisation must be able to explain and defend that decision — including the data that trained the model. Without lineage, this explanation is impossible, creating regulatory and legal exposure.

Technical Problem

Data lineage tools capture ETL/SQL lineage but do not model AI-specific lineage events (feature engineering, training, inference).
Model registries store model artefacts but do not link models to the specific dataset versions used for training.
Inference logs record predictions but not the feature values used to produce each prediction, nor the model version.
Impact analysis of upstream data changes on downstream AI models is manual and error-prone.
No standard schema exists for capturing AI lineage events — leading to bespoke, non-interoperable solutions.

Symptoms

Regulatory enquiry response ("explain this credit decision") takes weeks rather than hours.
Schema change in an operational database silently breaks a downstream AI model.
Model retraining produces different results but the cause cannot be traced to a specific data change.
Multiple incompatible lineage stores exist: one for ETL, one for ML pipelines, one for BI — no unified view.
Audit finds model was trained on data that included consent-withdrawn records; no mechanism to detect this.

Cost of Inaction

Dimension	Impact
Regulatory	EU AI Act Article 12 violation; APRA regulatory action; GDPR Article 22 right-to-explanation breach
Operational	Weeks to diagnose model quality issues from upstream data changes
Legal	Inability to defend AI decisions in tribunal or litigation
Trust	Stakeholders (regulators, customers) cannot verify AI system integrity

3. Context

When to Apply

Any production AI system in a regulated industry.
AI systems where decisions are consequential (credit, insurance, clinical, employment).
Organisations with multiple AI models consuming data from shared source systems (high impact analysis value).
Systems subject to right-to-explanation requirements (GDPR, EU AI Act, Privacy Act).
Organisations where data contracts / data mesh are in use (lineage complements contract governance).

When NOT to Apply

Pure research/PoC AI with no production decisions.
AI systems consuming only fully external, black-box data APIs with no lineage available.
Very simple AI systems (single feature, deterministic rule) where lineage is obvious from inspection.

Prerequisites

Prerequisite	Minimum Viable	Preferred
Data pipeline observability	Ad hoc logging	Structured pipeline execution logs
ML pipeline tooling	Manual training scripts	MLflow / Kubeflow with run tracking
Lineage storage	Flat file (JSON)	Graph database (Neptune, Neo4j) or OpenLineage backend
Inference logging	Basic prediction logs	Structured inference log with feature values + model version

Industry Applicability

Industry	Applicability	Driver
Financial Services	Critical	APRA CPS 234; model risk management; lending decisions
Healthcare	Critical	EU AI Act high-risk; clinical decision support; drug discovery
Insurance	High	Actuarial model explainability; claims decisions
Government	High	Public sector AI accountability; FOI obligations
Retail	Medium	Personalisation; recommendation system transparency
Telecommunications	Medium	Churn; fraud model explainability

4. Architecture Overview

Design Philosophy

The foundational insight of this pattern is that AI lineage is a graph, not a table. The lineage of a single prediction involves a directed acyclic graph connecting: raw source records → data transformations → joined datasets → feature engineering → training dataset version → model training run → model version → inference request → prediction output. Each node in this graph is an immutable versioned artefact; each edge is a transformation event with metadata.

OpenLineage as the Standard. Rather than inventing a proprietary lineage schema, this pattern adopts the OpenLineage standard (openlineage.io), which defines a common event schema for lineage capture across data pipelines, ML platforms, and inference services. OpenLineage events are emitted by each pipeline stage, collected by a lineage backend (Marquez or Atlan), and stored in a queryable lineage graph.

Four Lineage Event Classes. AI lineage comprises four distinct event classes, each requiring specific schema extensions to the base OpenLineage spec:

Dataset lineage events: Emitted by ETL/ELT pipelines; capture source → transformation → output dataset with row counts, schema versions, and quality check results.
Feature engineering events: Emitted by feature pipelines; capture source datasets → feature computation logic → feature set version with temporal validity metadata.
Training events: Emitted by training pipelines; capture feature set versions → training run parameters → model artefact version with quality scorecard ID.
Inference events: Emitted by inference services; capture model version + feature values → prediction output with confidence score. Note: inference events are high-volume; sampling strategies are required for cost management while preserving full lineage for flagged or high-stakes predictions.
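
As an illustration of the training event class above, the following is a minimal sketch of an OpenLineage-style run event built as a plain dictionary. The top-level fields (eventType, eventTime, producer, run, job, inputs, outputs) follow the general OpenLineage run event layout; the facet bodies are simplified, and the namespaces, the producer URL, and the qualityScorecard facet are hypothetical names introduced only for this example.

```python
import json
import uuid
from datetime import datetime, timezone


def build_training_event(feature_set: str, feature_version: str,
                         model_name: str, model_version: str,
                         scorecard_id: str) -> dict:
    """Build a simplified OpenLineage-style COMPLETE event for one training run."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        # Hypothetical producer URI identifying the emitting pipeline.
        "producer": "https://example.com/training-pipeline",
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "ml-training", "name": f"train_{model_name}"},
        # Input: the feature set version consumed by the training run.
        "inputs": [{
            "namespace": "feature-store",
            "name": feature_set,
            "facets": {"version": {"datasetVersion": feature_version}},
        }],
        # Output: the model artefact version produced by the run.
        "outputs": [{
            "namespace": "model-registry",
            "name": model_name,
            "facets": {
                "version": {"datasetVersion": model_version},
                # Hypothetical custom facet carrying the quality scorecard ID.
                "qualityScorecard": {"scorecardId": scorecard_id},
            },
        }],
    }


event = build_training_event("credit_features", "v3", "credit_risk", "2.1", "QS-001")
print(json.dumps(event, indent=2))
```

A real emitter would also attach the producer and schema URL metadata that the OpenLineage spec expects on each facet, and would send the event to the collector rather than print it.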

Lineage Query Patterns. Three primary query patterns drive the lineage architecture's design:

Forward impact query: "Which models and predictions are downstream of dataset X?" — used for impact analysis before schema changes.
Backward provenance query: "What data produced this prediction?" — used for regulatory explanation.
Cross-version diff query: "What changed in the data between model version M1 and M2?" — used for model quality investigation.

These query patterns require a graph-capable storage backend (Neo4j, Amazon Neptune, or a columnar store with graph query extensions).
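
The three query patterns can be sketched on a small in-memory edge list. This is an illustrative sketch only: the node names are hypothetical, and a production deployment would run equivalent traversals in the graph store's own query language rather than in application code.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: (upstream node, downstream node).
EDGES = [
    ("crm.customers", "etl.customer_clean"),
    ("etl.customer_clean", "features.credit_v3"),
    ("core.transactions", "features.credit_v3"),
    ("features.credit_v3", "model.credit_risk:2.1"),
    ("model.credit_risk:2.1", "prediction.p-1001"),
]


def reachable(start, edges, reverse=False):
    """Breadth-first traversal; follows edges downstream, or upstream if reverse."""
    adjacency = defaultdict(set)
    for src, dst in edges:
        if reverse:
            adjacency[dst].add(src)
        else:
            adjacency[src].add(dst)
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def forward_impact(dataset, edges):
    """Everything downstream of a dataset (models, predictions)."""
    return reachable(dataset, edges)


def backward_provenance(prediction, edges):
    """Everything upstream of a prediction (model, features, sources)."""
    return reachable(prediction, edges, reverse=True)


def version_diff(inputs_m1, inputs_m2):
    """Datasets whose version differs between two training runs."""
    names = inputs_m1.keys() | inputs_m2.keys()
    return {n: (inputs_m1.get(n), inputs_m2.get(n))
            for n in names if inputs_m1.get(n) != inputs_m2.get(n)}


print(sorted(forward_impact("core.transactions", EDGES)))
print(sorted(backward_provenance("prediction.p-1001", EDGES)))
print(version_diff({"features.credit": "v2"}, {"features.credit": "v3"}))
```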

Selective Inference Lineage. Full capture of feature values at inference time for every prediction is prohibitively expensive at scale (millions of predictions/day). The pattern uses a tiered strategy: full lineage for consequential predictions (flagged by risk score, decision type, or regulatory classification); sampled lineage (1–5%) for routine predictions; and always-full lineage for predictions that are later reviewed, appealed, or investigated.
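
One way to implement the tiered strategy is a small decision function in the inference middleware. The sketch below assumes each prediction has a stable string ID and that the caller already knows whether the prediction is consequential; the function name, tier labels, and the 2% default sampling rate are illustrative choices, not part of the pattern. Hashing the ID makes the sampling decision deterministic, so the same prediction always lands in the same tier.

```python
import hashlib


def capture_tier(prediction_id: str, consequential: bool,
                 under_review: bool = False, sample_rate: float = 0.02) -> str:
    """Return 'full', 'sampled', or 'minimal' for one prediction.

    'minimal' means only model version and timestamp are recorded.
    """
    if consequential or under_review:
        return "full"
    # Map the ID to a stable 64-bit bucket and compare against the threshold.
    digest = hashlib.sha256(prediction_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    threshold = int(sample_rate * 2**64)
    return "sampled" if bucket < threshold else "minimal"


print(capture_tier("p-1001", consequential=True))
print(capture_tier("p-1002", consequential=False))
```

Note that full lineage for a prediction flagged only after the fact is possible only if its feature values were retained somewhere, or can be reconstructed, at the time of review.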

5. Architecture Diagram

ARCHITECTURE DIAGRAM

flowchart TD
    subgraph Pipeline["Data-to-Model Pipeline"]
        A[Source Systems]
        B[ETL and Feature Engineering]
        C[Model Training]
        D[Inference Service]
    end
    subgraph Lineage["Lineage Backend"]
        E[OpenLineage Collector]
        F[(Lineage Graph Store)]
        G[Lineage Query API]
    end
    A --> B
    B --> C
    C --> D
    B -->|lineage events| E
    C -->|lineage events| E
    D -->|lineage events| E
    E --> F
    F --> G
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#f0fdf4,stroke:#22c55e
    style C fill:#f0fdf4,stroke:#22c55e
    style D fill:#f0fdf4,stroke:#22c55e
    style E fill:#f0fdf4,stroke:#22c55e
    style F fill:#fef9c3,stroke:#eab308
    style G fill:#d1fae5,stroke:#10b981

6. Components

Component	Type	Responsibility	Technology Options	Criticality
OpenLineage Emitter (ETL)	Library / Agent	Emits dataset lineage events from ETL jobs	dbt OpenLineage plugin, Airflow OpenLineage provider, Spark OpenLineage integration	Critical
OpenLineage Emitter (Feature)	Library	Emits feature engineering lineage events	Custom Python OpenLineage client, MLflow OpenLineage integration	Critical
OpenLineage Emitter (Training)	Library	Emits training run lineage events linking feature versions to model version	MLflow OpenLineage plugin, Kubeflow OpenLineage integration	Critical
Selective Inference Lineage Capture	Middleware	Captures full feature values + model version for consequential and sampled predictions	Custom inference middleware, Arize AI, WhyLabs	High
OpenLineage Emitter (Inference)	Library	Emits inference lineage events	Custom Python OpenLineage client	High
OpenLineage Collector	API Service	Receives lineage events from all emitters; validates schema; routes to store	Marquez (OSS), Atlan, OpenMetadata	Critical
Lineage Graph Store	Storage	Stores lineage graph as queryable DAG	Neo4j, Amazon Neptune, Memgraph, PostgreSQL with a graph extension (e.g. Apache AGE)	Critical
Lineage Query API	API Service	Exposes lineage graph for forward/backward/diff queries	Marquez REST API, custom GraphQL API, Neo4j Cypher endpoint	High
Regulatory Audit Tool	Application	Generates human-readable explanation reports from lineage graph	Custom report generator, Collibra compliance reports	High
Impact Analysis Service	Application	Executes forward impact queries before schema changes; produces risk report	Custom Python + lineage API, DataHub impact analysis	High

7. Data Flow

Primary Flow

Step	Actor	Action	Output
1	ETL pipeline	Transforms source data; OpenLineage emitter fires START + COMPLETE events	Dataset lineage events in OpenLineage JSON format
2	OpenLineage Collector	Receives events; validates against OpenLineage schema; writes to graph store	Lineage nodes (datasets) + edges (transformations) in graph
3	Feature pipeline	Computes features; emits feature lineage events with input dataset versions	Feature lineage nodes + edges in graph
4	Training pipeline	Trains model; emits training event with feature set version IDs + hyperparameters	Training lineage node linking feature set version → model version
5	Model Registry	Stores model artefact; receives lineage ID from training emitter	Model version enriched with lineage pointer
6	Inference service	Serves prediction; selective lineage capture for consequential + sampled predictions	Inference lineage event: model version + feature snapshot + prediction
7	OpenLineage Collector	Ingests inference events; appends to lineage graph	Full lineage graph from source → prediction
8	Regulatory audit	Executes backward provenance query for specific prediction	Human-readable provenance report: data sources → transformations → model → prediction
9	Impact analysis	Before schema change: executes forward impact query	Risk report: list of downstream models + predictions at risk
10	Compliance dashboard	Continuously queries lineage completeness	Lineage coverage metric per model

Error Flow

Error Condition	Trigger	Response	Recovery
Lineage emitter failure (event not sent)	Network error; emitter crash	Prediction still served (lineage not on critical path); alert raised; lineage gap recorded	Emitter retries with exponential backoff; gap filled from pipeline logs if available
Lineage collector unavailable	Collector service down	Events queued in emitter buffer (local file or queue); delivered when collector recovers	Collector HA deployment; queue-based event delivery
Lineage graph store corruption	Hardware failure	Lineage queries unavailable; no impact on AI serving	Restore from backup; replay buffered events
Incomplete inference lineage (sampling miss)	Prediction not in sample; not flagged consequential	Prediction served without full lineage; noted in lineage completeness metric	Accept for routine predictions; escalate if prediction later flagged
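
The first two rows of the error flow (retry with exponential backoff, then local buffering until the collector recovers) might look like the sketch below. The class name and parameters are hypothetical; the transport is injected as a `send` callable so the example stays independent of any particular client library, and the buffer is held in memory where a real emitter would persist it to a local file or queue.

```python
import time
from collections import deque


class BufferedEmitter:
    """Sends lineage events; buffers them locally when delivery keeps failing."""

    def __init__(self, send, max_attempts=3, base_delay=0.5, sleep=time.sleep):
        self._send = send              # callable that delivers one event
        self._max_attempts = max_attempts
        self._base_delay = base_delay  # seconds before the first retry
        self._sleep = sleep            # injectable for testing
        self.buffer = deque()          # undelivered events, oldest first

    def emit(self, event) -> bool:
        """Try to deliver; never raises, so serving is not blocked by lineage."""
        for attempt in range(self._max_attempts):
            try:
                self._send(event)
                return True
            except OSError:
                # Back off 0.5 s, 1 s, 2 s, ... between attempts.
                if attempt < self._max_attempts - 1:
                    self._sleep(self._base_delay * 2 ** attempt)
        self.buffer.append(event)      # lineage gap until flushed
        return False

    def flush(self) -> int:
        """Redeliver buffered events in order; stop at the first failure."""
        delivered = 0
        while self.buffer:
            try:
                self._send(self.buffer[0])
            except OSError:
                break
            self.buffer.popleft()
            delivered += 1
        return delivered
```

Because `emit` returns a flag instead of raising, the calling pipeline or inference service can record the gap and raise an alert without failing the prediction itself.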

8. Security Considerations

Authentication & Authorisation

OpenLineage Collector API requires authenticated emitters (API keys per pipeline stage); keys rotated quarterly.
Lineage Query API requires role-based access: data engineers (full read), auditors (read-only subset), business analysts (anonymised lineage).
Feature value snapshots in inference lineage classified as Confidential; access restricted to authorised investigators.

Secrets Management

Collector API keys stored in secrets manager; not in pipeline code.
Lineage graph database credentials rotated every 90 days.

Data Classification

Lineage metadata (dataset names, row counts, schema versions) classified as Internal.
Feature value snapshots in inference lineage classified as Confidential (may contain PII); stored with encryption and strict access control.
Regulatory reports generated from lineage may contain PII context; classified as Confidential.

Encryption

Lineage graph store encrypted at rest (AES-256); in transit TLS 1.3.
Feature value snapshots encrypted at rest with separate encryption keys; key access logged.

Auditability

All lineage query executions logged (who queried what lineage, when, why).
Lineage events immutable once written; no update/delete path.
Lineage completeness gaps (missing events) logged and alerted.
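
Immutability can be made verifiable by chaining event hashes, so that changing or removing any stored event invalidates every later hash. The following is a minimal sketch of that idea, assuming events are JSON-serialisable dictionaries; the function names and the in-memory list are illustrative, and a real deployment would pair this with an append-only store.

```python
import hashlib
import json

GENESIS = "0" * 64  # hash value used before the first event


def event_hash(event: dict, prev_hash: str) -> str:
    """SHA-256 over the previous hash plus a canonical JSON form of the event."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_event(log: list, event: dict) -> None:
    """Append an event together with its chained hash."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "hash": event_hash(event, prev)})


def verify_log(log: list) -> bool:
    """Recompute the chain; False if any event or hash was altered."""
    prev = GENESIS
    for entry in log:
        if entry["hash"] != event_hash(entry["event"], prev):
            return False
        prev = entry["hash"]
    return True


log = []
append_event(log, {"eventType": "START", "job": "train_credit_risk"})
append_event(log, {"eventType": "COMPLETE", "job": "train_credit_risk"})
print(verify_log(log))
```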

OWASP LLM Top 10 Mapping

OWASP LLM Risk	Relevance	Mitigation
LLM01: Prompt Injection	Adversarial input could attempt to manipulate lineage metadata	Lineage events are system-generated, not user-input; validate emitter identity
LLM06: Sensitive Information Disclosure	Feature value snapshots contain PII	Encrypted storage; strict access control; anonymisation for non-investigative queries
LLM02: Insecure Output Handling	Lineage reports consumed without validation	Report generation uses read-only lineage API; no dynamic code execution
LLM09: Overreliance	Auditors trust lineage completeness claims without verification	Lineage completeness metric surfaced in compliance dashboard; gaps explicitly flagged

9. Governance Considerations

Responsible AI

Complete lineage enables bias attribution: if a model exhibits demographic bias, lineage identifies whether the bias originates in source data, feature engineering, or labelling.
Lineage enables right-to-erasure impact analysis: when a data subject requests erasure, lineage identifies all models trained on data linked to that subject (machine unlearning trigger).

Model Risk Management

Model risk committees require provenance validation for all production models; Lineage Query API provides this programmatically.
Model version deprecation requires impact analysis (which predictions were served by this version?); lineage graph enables this query.

Human Approval Checkpoints

Before upstream schema change: Impact Analysis Service report reviewed by data owner + affected ML leads.
Regulatory investigation: Lineage Query API output reviewed by compliance officer before submission to regulator.

Governance Artefacts

Artefact	Owner	Cadence	Purpose
Lineage Completeness Report	ML Platform	Weekly	Coverage % of production models with full source-to-prediction lineage
Provenance Report (per prediction)	Compliance Team (on demand)	Per regulatory enquiry	Full lineage trace for specific prediction; human-readable
Impact Analysis Report	Data Owner (on change)	Before schema changes	Forward impact: which models at risk from proposed change
Consent Withdrawal Impact Report	Privacy Officer (on demand)	Per data subject request	Identifies models trained on data linked to subject

10. Operational Considerations

Monitoring

Metric	Alert Threshold	Tooling
Lineage event delivery success rate	<99.5% over 1 hour	Collector metrics + Grafana
Lineage completeness per model	<95%	Custom completeness query + alert
Lineage graph store query latency (p99)	>2 seconds	Graph store metrics
Inference lineage capture rate (consequential predictions)	<100%	Inference service metrics
Collector queue depth (if event buffering)	>10,000 events	Queue metrics

SLOs

SLO	Target	Measurement
Lineage event delivery (pipeline to graph)	<5 minutes end-to-end	Event timestamp vs. graph ingestion timestamp
Backward provenance query response	<10 seconds	Lineage Query API response time
Lineage graph store availability	99.9%	Health check
Lineage completeness for production models	≥95%	Weekly completeness query
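
The completeness target can be computed from the set of event classes observed per model. The sketch below assumes a model counts as complete only when at least one event of each of the four classes has been recorded for it; that definition, the stage labels, and the function names are assumptions made for illustration.

```python
REQUIRED_STAGES = frozenset({"dataset", "feature", "training", "inference"})


def lineage_completeness(models: dict) -> float:
    """Percentage of models with events recorded for every required stage."""
    if not models:
        return 0.0
    complete = sum(1 for stages in models.values() if REQUIRED_STAGES <= set(stages))
    return 100.0 * complete / len(models)


def models_with_gaps(models: dict) -> dict:
    """Map each incomplete model to the stages it is missing."""
    return {
        name: sorted(REQUIRED_STAGES - set(stages))
        for name, stages in models.items()
        if not REQUIRED_STAGES <= set(stages)
    }


observed = {
    "credit_risk:2.1": {"dataset", "feature", "training", "inference"},
    "churn:1.4": {"dataset", "feature", "training"},
}
print(lineage_completeness(observed))
print(models_with_gaps(observed))
```

Here the second model has no inference events, so coverage falls below the 95% target and the gap report names the missing stage.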

Logging

All OpenLineage events logged in raw form (JSON) alongside graph store; serves as event replay source.
Lineage query audit log retained 7 years.

Incident Management

Lineage gap (missing events for production model) → P2 incident; ML Platform investigates emitter health.
Lineage graph store unavailable → P1 if regulatory investigation in progress; P2 otherwise.

Disaster Recovery

Component	RTO	RPO	Strategy
Lineage Graph Store	4 hours	1 hour	Database backup + standby replica; event replay from raw event store
OpenLineage Collector	1 hour	0	Multi-AZ stateless deployment; events buffered in pipeline until collector recovers
Raw Event Store	8 hours	24 hours	Cross-region object storage replication

11. Cost Considerations

Cost Drivers

Cost Driver	Typical Range	Notes
Lineage graph store	$300–$5,000/month	Neo4j AuraDB / Amazon Neptune; scales with graph size
OpenLineage Collector (Marquez)	$0–$2,000/month	Marquez OSS free; hosted Atlan has licence cost
Inference lineage storage (feature snapshots)	$100–$2,000/month	Object store; scales with prediction volume × sampling rate
Compute for lineage queries	$50–$500/month	Graph query compute; low for typical audit query patterns
Engineering	0.25–0.5 FTE	Emitter maintenance; lineage completeness monitoring

Scaling Risks

Inference lineage feature snapshots at high prediction volume (>1M/day) can generate significant storage cost; use sampling + tiered retention.
Graph store query cost grows with graph depth; optimise with indexed traversal patterns.
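
A rough estimate shows why sampling matters for snapshot storage. The figures below are assumptions chosen only to illustrate the arithmetic: 1M predictions/day, a 2 KB feature snapshot, 5% of predictions classed as consequential, and a 2% sampling rate for the rest.

```python
def monthly_snapshot_gb(predictions_per_day: int, consequential_fraction: float,
                        sample_rate: float, snapshot_bytes: int, days: int = 30) -> float:
    """Estimated feature-snapshot storage written per month, in GB (10^9 bytes)."""
    # Consequential predictions are always captured; the rest are sampled.
    captured_fraction = consequential_fraction + (1 - consequential_fraction) * sample_rate
    daily_bytes = predictions_per_day * captured_fraction * snapshot_bytes
    return daily_bytes * days / 1e9


full = monthly_snapshot_gb(1_000_000, 1.0, 0.0, 2048)      # capture everything
tiered = monthly_snapshot_gb(1_000_000, 0.05, 0.02, 2048)  # tiered capture
print(f"full: {full:.2f} GB/month, tiered: {tiered:.2f} GB/month")
```

Under these assumptions full capture writes roughly 61 GB per month and tiered capture about 4.2 GB per month, before compression, replication, and retention are taken into account.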

Optimisations

Use OpenLineage open-source stack (Marquez + PostgreSQL backend) for cost-sensitive deployments.
Apply sampling for routine inference lineage; full capture only for consequential predictions.
Implement lineage graph compaction: archive lineage for deprecated model versions to cold storage.
Cache frequent backward provenance queries for commonly audited decisions.

Indicative Cost Range

Scale	Monthly Cost	Basis
Small (1–5 models, <100K predictions/day)	$500–$3,000	Marquez OSS + Neo4j Community + light object store
Medium (5–20 models, 1M predictions/day)	$3,000–$12,000	Managed graph store + OSS collector (Marquez or OpenMetadata) + sampled inference lineage
Large (20+ models, 10M+ predictions/day)	$12,000–$50,000	Amazon Neptune + full enterprise stack + tiered lineage storage

12. Trade-Off Analysis

Option Comparison

Option	Pros	Cons	Recommended When
A: Full OpenLineage end-to-end (this pattern)	Standard; interoperable; covers ETL to inference; regulatory-grade	Setup complexity; emitter integration per pipeline tool	Regulated industry; multiple AI systems; regulatory audit requirements
B: Model registry lineage only (MLflow dataset tags)	Simple; low overhead; fast to implement	Misses ETL lineage; no inference-level traceability; not regulatory-grade	Experimental; no regulatory obligation; single model
C: Manual lineage documentation	Near-zero infrastructure cost	Inaccurate; outdated within weeks; fails regulatory scrutiny	Only viable for very small, stable AI systems with no regulatory obligation
D: Proprietary lineage tool (Collibra, Atlan)	Rich UI; enterprise support; cataloguing integrated	High licence cost; vendor lock-in; may not cover all pipeline tools	Large enterprise with existing licence; strong BI lineage needs

Architectural Tensions

Tension	Trade-Off	Resolution
Full inference lineage vs. storage cost	Capturing full feature snapshots at 1M predictions/day is very expensive	Tiered capture: consequential = full, routine = sampled, all = model version + timestamp
Event delivery latency vs. pipeline performance	Synchronous lineage emission adds latency to pipelines	Async event emission; lineage not on the critical serving path
Lineage standardisation vs. pipeline tool flexibility	Standardising on OpenLineage requires emitter integration per tool	OpenLineage has integrations for major tools (Airflow, Spark, dbt, MLflow); coverage now >80% of common tools
Graph query power vs. operational simplicity	Graph databases powerful but operationally complex	PostgreSQL backend for Marquez acceptable for <10M lineage nodes; graph DB for larger deployments

13. Failure Modes

Failure	Likelihood	Impact	Detection	Recovery
Emitter not integrated in new pipeline	High	Medium — lineage gap for new model	Lineage completeness check	Emitter integration checklist in pipeline onboarding; completeness alert
Lineage event schema mismatch	Medium	Medium — events rejected by collector; lineage gap	Collector validation error logs	Schema versioning in emitter; forward-compatible schema evolution
Graph store capacity exhausted	Low	High — lineage writes fail; lineage gap	Graph store disk/capacity alerts	Lineage compaction policy; archive old lineage to cold storage
Inference lineage not captured for appealed decision	Low	Critical — unable to respond to right-to-explanation request	Post-hoc check when appeal received	Always-full lineage for decisions above risk threshold; escalation flag
Lineage tampered after write	Very Low	Critical — regulatory fraud	Immutable write policy; hash verification	Append-only lineage store; cryptographic hash of events

Cascading Failure Scenarios

Schema change cascade without lineage: Data team changes source table schema → no impact analysis (lineage incomplete) → feature pipeline silently produces incorrect features → model degradation → consequential decisions affected → regulatory enquiry → unable to trace root cause.
Lineage backend outage during audit: Regulator requests provenance report → Lineage Query API unavailable → lineage graph must be rebuilt by replaying the raw event store → hours of investigation delay.

14. Regulatory Considerations

Regulation	Article/Clause	Requirement	Pattern Response
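EU AI Act	Article 12	Record-keeping: high-risk AI systems must automatically record events (logs) over their lifetime; Article 19 sets a minimum log retention of six months for providers, and Article 18 requires technical documentation to be kept for 10 years	Lineage graph + raw event store; immutable; retained per regulatory schedule
EU AI Act	Article 13	Transparency: deployers must receive instructions for use, including, where appropriate, relevant information about the training, validation and testing data	Backward provenance query enables training data summary for any prediction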
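GDPR	Article 22	Safeguards for solely automated decisions with legal or similarly significant effects; read with Recital 71, commonly described as a right to explanation	Backward provenance report provides explanation basis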
GDPR	Article 17	Right to erasure	Forward impact query from subject data → identifies models to unlearn
APRA CPS 234	§32	Maintain integrity of information assets	Immutable lineage events; hash verification
Privacy Act (Australia)	APP 13	Correction of personal information	Lineage enables identifying all model versions trained on corrected data
ISO 42001	§8.6	Traceability of AI system inputs	Full source-to-prediction lineage per this pattern
NIST AI RMF	GOVERN-6	Accountability and transparency	Lineage provides accountability trail for AI decisions

15. Reference Implementations

AWS

Component	AWS Service
OpenLineage Collector	Marquez on ECS or Amazon Managed Service for Apache Airflow (lineage)
Lineage Graph Store	Amazon Neptune
ETL Lineage Emitter	AWS Glue with OpenLineage integration
Training Lineage	SageMaker ML Lineage Tracking (native) + OpenLineage emitter
Inference Lineage	SageMaker Model Monitor + custom Lambda emitter
Raw Event Store	S3 with Object Lock (WORM)

Azure

Component	Azure Service
OpenLineage Collector	OpenMetadata on AKS
Lineage Graph Store	Azure Cosmos DB (Gremlin API)
ETL Lineage	Microsoft Purview (formerly Azure Purview; native lineage) + OpenLineage bridge
Training Lineage	Azure ML (native) + OpenLineage emitter
Inference Lineage	Custom Azure Function emitter

GCP

Component	GCP Service
OpenLineage Collector	Marquez on Cloud Run
Lineage Graph Store	Cloud Spanner or Neo4j on GKE
ETL Lineage	Dataplex lineage (native) + Cloud Dataflow OpenLineage
Training Lineage	Vertex AI ML Metadata (native) + OpenLineage bridge
Inference Lineage	Custom Cloud Function emitter

On-Premises

Component	Technology
OpenLineage Collector	Marquez (OSS) on Kubernetes
Lineage Graph Store	Neo4j Community (small) or Enterprise (large)
ETL Lineage	Apache Airflow OpenLineage provider + Spark OpenLineage
Training Lineage	MLflow + custom OpenLineage emitter
Inference Lineage	Custom Python middleware
Raw Event Store	MinIO with object locking

16. Related Patterns

Pattern	ID	Relationship	Notes
AI Data Mesh Integration	EAAPL-DAT001	Depends on	Lineage is core to data product governance in mesh
Data Quality for AI	EAAPL-DAT002	Complements	Quality scorecard IDs embedded in lineage events
AI Training Data Governance	EAAPL-DAT007	Complements	Training data approval records linked via lineage
Privacy by Design for AI Data	EAAPL-DAT005	Enables	Lineage enables right-to-erasure impact analysis
Model Versioning	EAAPL-MDL001	Depends on	Model version IDs are key lineage graph nodes
Model Rollback	EAAPL-MDL004	Enables	Lineage query identifies which predictions were served by rolled-back version
Human Approval Gateway	EAAPL-HIL001	Complements	Approval decisions are lineage events in consequential AI

17. Maturity Assessment

Overall Maturity: Proven — OpenLineage standard is mature and broadly adopted. Graph-based lineage storage is operationally proven. Inference-level lineage capture at scale remains an evolving practice.

Dimension	Score (1–5)	Notes
Architectural clarity	5	OpenLineage standard provides clear event schema
Tooling maturity	4	ETL lineage tools mature; inference lineage tooling maturing
Regulatory alignment	5	Strong EU AI Act and GDPR alignment
Operational complexity	3	Graph store operations require specialist skills
Cost efficiency	4	OSS stack cost-effective; inference lineage at scale requires optimisation
Security	4	Immutable event store; strong access controls defined

18. Revision History

Version	Date	Author	Changes
1.0	2023-11-01	EAAPL Working Group	Initial publication; OpenLineage-based architecture
1.1	2024-05-15	EAAPL Working Group	Added EU AI Act Article 12 alignment; inference lineage tiering
1.2	2025-03-01	EAAPL Working Group	Added right-to-erasure lineage pattern; updated reference implementations
