EAAPL: Enterprise AI Architecture Pattern Library

Data Architecture | EU AI Act | ISO/IEC 42001 | Field-tested in AU

[EAAPL-DAT003] Data Lineage for AI

Category: Data Architecture
Sub-category: Data Lineage / AI Traceability
Version: 1.2
Maturity: Proven
Tags: data-lineage, provenance, OpenLineage, explainability, impact-analysis, traceability, regulatory-audit
Regulatory Relevance: EU AI Act Articles 12 & 13, APRA CPS 234, GDPR Article 22, ISO 42001 §8.6, NIST AI RMF GOVERN-6


1. Executive Summary

Regulators, auditors, and risk committees increasingly demand that organisations explain not just what an AI model predicted, but why — tracing back through the model architecture, training data, and ultimately to raw source systems. Traditional data lineage tools capture ETL pipelines but stop at the data warehouse boundary, leaving the AI layer invisible.

This pattern defines an end-to-end AI data lineage architecture using the OpenLineage standard, capturing lineage from raw source systems through every transformation stage, training run, model version, and inference event. The lineage graph enables regulators to answer: "Which version of which data, processed how, produced this model, which made this prediction on this date?"

Beyond compliance, AI lineage delivers operational value: when a source system changes, impact analysis identifies all downstream models at risk within minutes instead of weeks. Organisations adopting this pattern have reduced regulatory investigation response time from weeks to hours and eliminated surprise model breakages from upstream schema changes.

Target audience: Chief Data Officers, Chief Compliance Officers, Enterprise Architects, ML Platform leads.


2. Problem Statement

Business Problem

When an AI model makes a consequential decision (loan rejection, insurance claim denial, clinical risk assessment), the organisation must be able to explain and defend that decision — including the data that trained the model. Without lineage, this explanation is impossible, creating regulatory and legal exposure.

Technical Problem

  • Data lineage tools capture ETL/SQL lineage but do not model AI-specific lineage events (feature engineering, training, inference).
  • Model registries store model artefacts but do not link models to the specific dataset versions used for training.
  • Inference logs record predictions but not the feature values used to produce each prediction, nor the model version.
  • Impact analysis of upstream data changes on downstream AI models is manual and error-prone.
  • Lineage schemas in common use do not cover AI-specific events (feature engineering, training, inference) out of the box, so teams build bespoke, non-interoperable solutions.

Symptoms

  • Regulatory enquiry response ("explain this credit decision") takes weeks rather than hours.
  • Schema change in an operational database silently breaks a downstream AI model.
  • Model retraining produces different results but the cause cannot be traced to a specific data change.
  • Multiple incompatible lineage stores exist: one for ETL, one for ML pipelines, one for BI — no unified view.
  • Audit finds model was trained on data that included consent-withdrawn records; no mechanism to detect this.

Cost of Inaction

| Dimension | Impact |
|---|---|
| Regulatory | EU AI Act Article 12 violation; APRA regulatory action; GDPR Article 22 right-to-explanation breach |
| Operational | Weeks to diagnose model quality issues from upstream data changes |
| Legal | Inability to defend AI decisions in tribunal or litigation |
| Trust | Stakeholders (regulators, customers) cannot verify AI system integrity |

3. Context

When to Apply

  • Any production AI system in a regulated industry.
  • AI systems where decisions are consequential (credit, insurance, clinical, employment).
  • Organisations with multiple AI models consuming data from shared source systems (high impact analysis value).
  • Systems subject to right-to-explanation requirements (GDPR, EU AI Act, Privacy Act).
  • Organisations where data contracts / data mesh are in use (lineage complements contract governance).

When NOT to Apply

  • Pure research/PoC AI with no production decisions.
  • AI systems consuming only fully external, black-box data APIs with no lineage available.
  • Very simple AI systems (single feature, deterministic rule) where lineage is obvious from inspection.

Prerequisites

| Prerequisite | Minimum Viable | Preferred |
|---|---|---|
| Data pipeline observability | Ad hoc logging | Structured pipeline execution logs |
| ML pipeline tooling | Manual training scripts | MLflow / Kubeflow with run tracking |
| Lineage storage | Flat file (JSON) | Graph database (Neptune, Neo4j) or OpenLineage backend |
| Inference logging | Basic prediction logs | Structured inference log with feature values + model version |

Industry Applicability

| Industry | Applicability | Driver |
|---|---|---|
| Financial Services | Critical | APRA CPS 234; model risk management; lending decisions |
| Healthcare | Critical | EU AI Act high-risk; clinical decision support; drug discovery |
| Insurance | High | Actuarial model explainability; claims decisions |
| Government | High | Public sector AI accountability; FOI obligations |
| Retail | Medium | Personalisation; recommendation system transparency |
| Telecommunications | Medium | Churn; fraud model explainability |

4. Architecture Overview

Design Philosophy

The foundational insight of this pattern is that AI lineage is a graph, not a table. The lineage of a single prediction involves a directed acyclic graph connecting: raw source records → data transformations → joined datasets → feature engineering → training dataset version → model training run → model version → inference request → prediction output. Each node in this graph is an immutable versioned artefact; each edge is a transformation event with metadata.

OpenLineage as the Standard. Rather than inventing a proprietary lineage schema, this pattern adopts the OpenLineage standard (openlineage.io), which defines a common event schema for lineage capture across data pipelines, ML platforms, and inference services. OpenLineage events are emitted by each pipeline stage, collected by a lineage backend (Marquez or Atlan), and stored in a queryable lineage graph.

Four Lineage Event Classes. AI lineage comprises four distinct event classes, each requiring specific schema extensions to the base OpenLineage spec:

  1. Dataset lineage events: Emitted by ETL/ELT pipelines; capture source → transformation → output dataset with row counts, schema versions, and quality check results.
  2. Feature engineering events: Emitted by feature pipelines; capture source datasets → feature computation logic → feature set version with temporal validity metadata.
  3. Training events: Emitted by training pipelines; capture feature set versions → training run parameters → model artefact version with quality scorecard ID.
  4. Inference events: Emitted by inference services; capture model version + feature values → prediction output with confidence score. Note: inference events are high-volume; sampling strategies are required for cost management while preserving full lineage for flagged or high-stakes predictions.
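
As a rough illustration of the training event class above, the sketch below builds an OpenLineage-style run event as a plain dictionary. The top-level keys (`eventType`, `eventTime`, `producer`, `schemaURL`, `run`, `job`, `inputs`, `outputs`) follow the OpenLineage run event structure; everything else is an assumption made for the example: the producer URI, the namespaces, the `eaapl_*` custom facet names, and the spec version in `SCHEMA_URL` are hypothetical and would need to match the deployment.

```python
import json
import uuid
from datetime import datetime, timezone

# Assumed values for illustration only: the producer URI, namespaces and the
# custom facet names are hypothetical, and SCHEMA_URL should match whichever
# OpenLineage spec version the collector validates against.
PRODUCER = "https://example.com/eaapl/training-emitter"
SCHEMA_URL = "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent"
FACET_SCHEMA_BASE = "https://example.com/eaapl/facets/1-0-0"


def custom_facet(name, **fields):
    """Wrap fields with the _producer/_schemaURL keys that facets carry."""
    return {
        "_producer": PRODUCER,
        "_schemaURL": f"{FACET_SCHEMA_BASE}/{name}.json",
        **fields,
    }


def training_event(run_id, feature_set, feature_version,
                   model, model_version, scorecard_id):
    """Build a COMPLETE event linking a feature set version to a model version."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": PRODUCER,
        "schemaURL": SCHEMA_URL,
        "run": {"runId": run_id},
        "job": {"namespace": "ml-training", "name": f"train_{model}"},
        "inputs": [{
            "namespace": "feature-store",
            "name": feature_set,
            "facets": {
                "eaapl_featureSet": custom_facet(
                    "FeatureSetFacet", version=feature_version),
            },
        }],
        "outputs": [{
            "namespace": "model-registry",
            "name": model,
            "facets": {
                "eaapl_model": custom_facet(
                    "ModelFacet", version=model_version,
                    qualityScorecardId=scorecard_id),
            },
        }],
    }


event = training_event(str(uuid.uuid4()), "credit_features", "v3",
                       "credit_risk", "M2", "QS-0042")
print(json.dumps(event, indent=2))
```

The other three event classes would follow the same shape, differing mainly in the job namespace and in which custom facets are attached to inputs and outputs.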

Lineage Query Patterns. Three primary query patterns drive the lineage architecture's design:

  • Forward impact query: "Which models and predictions are downstream of dataset X?" — used for impact analysis before schema changes.
  • Backward provenance query: "What data produced this prediction?" — used for regulatory explanation.
  • Cross-version diff query: "What changed in the data between model version M1 and M2?" — used for model quality investigation.

These query patterns require a graph-capable storage backend (Neo4j, Amazon Neptune, or a columnar store with graph query extensions).
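
The three query patterns can be sketched on a small in-memory graph. This is a minimal illustration, not a substitute for a graph database: the `LineageGraph` class and the node naming scheme (`src.`, `ds.`, `fs.`, `model.`, `pred.`) are hypothetical, and a production backend would express the same traversals in its own query language.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Toy lineage DAG with downstream and upstream adjacency sets."""

    def __init__(self):
        self.down = defaultdict(set)  # node -> direct consumers
        self.up = defaultdict(set)    # node -> direct producers

    def add_edge(self, src, dst):
        self.down[src].add(dst)
        self.up[dst].add(src)

    @staticmethod
    def _walk(start, adjacency):
        """Breadth-first traversal returning every node reachable from start."""
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for nxt in adjacency.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def forward_impact(self, node):
        """Everything downstream of node (models, predictions, ...)."""
        return self._walk(node, self.down)

    def backward_provenance(self, node):
        """Everything upstream of node, back to the raw sources."""
        return self._walk(node, self.up)

    def version_diff(self, a, b):
        """Upstream artefacts that differ between two model versions."""
        up_a, up_b = self.backward_provenance(a), self.backward_provenance(b)
        return {"only_in_a": up_a - up_b, "only_in_b": up_b - up_a}


g = LineageGraph()
for src, dst in [
    ("src.customers", "ds.customers_clean@v1"),
    ("src.customers", "ds.customers_clean@v2"),
    ("ds.customers_clean@v1", "fs.credit@v3"),
    ("ds.customers_clean@v2", "fs.credit@v4"),
    ("fs.credit@v3", "model.credit@M1"),
    ("fs.credit@v4", "model.credit@M2"),
    ("model.credit@M1", "pred.1001"),
]:
    g.add_edge(src, dst)

print(sorted(g.forward_impact("src.customers")))
print(sorted(g.backward_provenance("pred.1001")))
print(g.version_diff("model.credit@M1", "model.credit@M2"))
```

Keeping both adjacency directions makes forward and backward queries symmetric; the diff query is then just a set difference over two backward traversals.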

Selective Inference Lineage. Full capture of feature values at inference time for every prediction is prohibitively expensive at scale (millions of predictions/day). The pattern uses a tiered strategy: full lineage for consequential predictions (flagged by risk score, decision type, or regulatory classification); sampled lineage (1–5%) for routine predictions; and always-full lineage for predictions that are later reviewed, appealed, or investigated.
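
A minimal sketch of the tiering decision, under stated assumptions: the decision types, the risk threshold, and the 2% sample rate are hypothetical values chosen for the example, and hashing the prediction ID is one possible way to make sampling deterministic so that the same prediction always lands in the same tier.

```python
import hashlib

# Hypothetical policy values; real ones would come from risk and compliance.
CONSEQUENTIAL_DECISIONS = {"loan_rejection", "claim_denial"}
RISK_THRESHOLD = 0.8
SAMPLE_RATE = 0.02  # inside the 1-5% band used for routine predictions


def capture_tier(prediction_id, decision_type, risk_score, under_review=False):
    """Return 'full', 'sampled' or 'minimal' for one prediction.

    'minimal' means only model version and timestamp are recorded.
    """
    if (under_review
            or decision_type in CONSEQUENTIAL_DECISIONS
            or risk_score >= RISK_THRESHOLD):
        return "full"
    # Map the prediction ID to a stable value in [0, 1) and compare it with
    # the sample rate, so the decision is repeatable across replays.
    digest = hashlib.sha256(prediction_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "sampled" if bucket < SAMPLE_RATE else "minimal"


print(capture_tier("pred-1", "loan_rejection", 0.10))
print(capture_tier("pred-2", "product_recommendation", 0.95))
print(capture_tier("pred-3", "product_recommendation", 0.10))
```

Deterministic sampling also helps audits: given a prediction ID, an investigator can recompute whether a feature snapshot should exist.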


5. Architecture Diagram

flowchart TD
    subgraph Pipeline["Data-to-Model Pipeline"]
        A[Source Systems]
        B[ETL and Feature Engineering]
        C[Model Training]
        D[Inference Service]
    end
    subgraph Lineage["Lineage Backend"]
        E[OpenLineage Collector]
        F[(Lineage Graph Store)]
        G[Lineage Query API]
    end
    A --> B
    B --> C
    C --> D
    B -->|lineage events| E
    C -->|lineage events| E
    D -->|lineage events| E
    E --> F
    F --> G
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#f0fdf4,stroke:#22c55e
    style C fill:#f0fdf4,stroke:#22c55e
    style D fill:#f0fdf4,stroke:#22c55e
    style E fill:#f0fdf4,stroke:#22c55e
    style F fill:#fef9c3,stroke:#eab308
    style G fill:#d1fae5,stroke:#10b981

6. Components

| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| OpenLineage Emitter (ETL) | Library / Agent | Emits dataset lineage events from ETL jobs | dbt OpenLineage plugin, Airflow OpenLineage provider, Spark OpenLineage integration | Critical |
| OpenLineage Emitter (Feature) | Library | Emits feature engineering lineage events | Custom Python OpenLineage client, MLflow OpenLineage integration | Critical |
| OpenLineage Emitter (Training) | Library | Emits training run lineage events linking feature versions to model version | MLflow OpenLineage plugin, Kubeflow OpenLineage integration | Critical |
| Selective Inference Lineage Capture | Middleware | Captures full feature values + model version for consequential and sampled predictions | Custom inference middleware, Arize AI, WhyLabs | High |
| OpenLineage Emitter (Inference) | Library | Emits inference lineage events | Custom Python OpenLineage client | High |
| OpenLineage Collector | API Service | Receives lineage events from all emitters; validates schema; routes to store | Marquez (OSS), Atlan, OpenMetadata | Critical |
| Lineage Graph Store | Storage | Stores lineage graph as queryable DAG | Neo4j, Amazon Neptune, Memgraph, PostgreSQL + pg_graph | Critical |
| Lineage Query API | API Service | Exposes lineage graph for forward/backward/diff queries | Marquez REST API, custom GraphQL API, Neo4j Cypher endpoint | High |
| Regulatory Audit Tool | Application | Generates human-readable explanation reports from lineage graph | Custom report generator, Collibra compliance reports | High |
| Impact Analysis Service | Application | Executes forward impact queries before schema changes; produces risk report | Custom Python + lineage API, DataHub impact analysis | High |

7. Data Flow

Primary Flow

| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | ETL pipeline | Transforms source data; OpenLineage emitter fires START + COMPLETE events | Dataset lineage events in OpenLineage JSON format |
| 2 | OpenLineage Collector | Receives events; validates against OpenLineage schema; writes to graph store | Lineage nodes (datasets) + edges (transformations) in graph |
| 3 | Feature pipeline | Computes features; emits feature lineage events with input dataset versions | Feature lineage nodes + edges in graph |
| 4 | Training pipeline | Trains model; emits training event with feature set version IDs + hyperparameters | Training lineage node linking feature set version → model version |
| 5 | Model Registry | Stores model artefact; receives lineage ID from training emitter | Model version enriched with lineage pointer |
| 6 | Inference service | Serves prediction; selective lineage capture for consequential + sampled predictions | Inference lineage event: model version + feature snapshot + prediction |
| 7 | Lineage Collector | Ingests inference events; appends to lineage graph | Full lineage graph from source → prediction |
| 8 | Regulatory audit | Executes backward provenance query for specific prediction | Human-readable provenance report: data sources → transformations → model → prediction |
| 9 | Impact analysis | Before schema change: executes forward impact query | Risk report: list of downstream models + predictions at risk |
| 10 | Compliance dashboard | Continuously queries lineage completeness | Lineage coverage metric per model |

Error Flow

| Error Condition | Trigger | Response | Recovery |
|---|---|---|---|
| Lineage emitter failure (event not sent) | Network error; emitter crash | Prediction still served (lineage not on critical path); alert raised; lineage gap recorded | Emitter retries with exponential backoff; gap filled from pipeline logs if available |
| Lineage collector unavailable | Collector service down | Events queued in emitter buffer (local file or queue); delivered when collector recovers | Collector HA deployment; queue-based event delivery |
| Lineage graph store corruption | Hardware failure | Lineage queries unavailable; no impact on AI serving | Restore from backup; replay buffered events |
| Incomplete inference lineage (sampling miss) | Prediction not in sample; not flagged consequential | Prediction served without full lineage; noted in lineage completeness metric | Accept for routine predictions; escalate if prediction later flagged |
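
The emitter-side behaviour in the error flow (retry with exponential backoff, then buffer until the collector recovers) might look roughly like the sketch below. `BufferedEmitter` and its `send` callable are hypothetical: `send` stands in for whatever transport posts an event to the collector and is assumed to raise `ConnectionError` on failure. The buffer here is in memory; the local file or queue mentioned above would replace the `deque`.

```python
import time
from collections import deque


class BufferedEmitter:
    """Retries delivery with exponential backoff, then buffers the event."""

    def __init__(self, send, max_retries=3, base_delay=0.5, sleep=time.sleep):
        self.send = send              # hypothetical transport callable
        self.max_retries = max_retries
        self.base_delay = base_delay  # seconds before the first retry
        self.sleep = sleep            # injectable so tests need not wait
        self.buffer = deque()         # undelivered events, oldest first

    def emit(self, event):
        """Try to deliver; return True on success, False if buffered."""
        for attempt in range(self.max_retries):
            try:
                self.send(event)
                return True
            except ConnectionError:
                if attempt < self.max_retries - 1:
                    # Delays grow as base, 2*base, 4*base, ...
                    self.sleep(self.base_delay * 2 ** attempt)
        self.buffer.append(event)     # lineage gap until flush succeeds
        return False

    def flush(self):
        """Redeliver buffered events in order; stop at the first failure."""
        delivered = 0
        while self.buffer:
            try:
                self.send(self.buffer[0])
            except ConnectionError:
                break
            self.buffer.popleft()
            delivered += 1
        return delivered
```

Because `emit` never raises, the calling pipeline or inference service keeps serving when the collector is down, which keeps lineage off the critical path.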

8. Security Considerations

Authentication & Authorisation

  • OpenLineage Collector API requires authenticated emitters (API keys per pipeline stage); keys rotated quarterly.
  • Lineage Query API requires role-based access: data engineers (full read), auditors (read-only subset), business analysts (anonymised lineage).
  • Feature value snapshots in inference lineage classified as Confidential; access restricted to authorised investigators.

Secrets Management

  • Collector API keys stored in secrets manager; not in pipeline code.
  • Lineage graph database credentials rotated every 90 days.

Data Classification

  • Lineage metadata (dataset names, row counts, schema versions) classified as Internal.
  • Feature value snapshots in inference lineage classified as Confidential (may contain PII); stored with encryption and strict access control.
  • Regulatory reports generated from lineage may contain PII context; classified as Confidential.

Encryption

  • Lineage graph store encrypted at rest (AES-256); in transit TLS 1.3.
  • Feature value snapshots encrypted at rest with separate encryption keys; key access logged.

Auditability

  • All lineage query executions logged (who queried what lineage, when, why).
  • Lineage events immutable once written; no update/delete path.
  • Lineage completeness gaps (missing events) logged and alerted.

OWASP LLM Top 10 Mapping

| OWASP LLM Risk | Relevance | Mitigation |
|---|---|---|
| LLM01: Prompt Injection | Adversarial input could attempt to manipulate lineage metadata | Lineage events are system-generated, not user-input; validate emitter identity |
| LLM02: Insecure Output Handling | Lineage reports consumed without validation | Report generation uses read-only lineage API; no dynamic code execution |
| LLM06: Sensitive Information Disclosure | Feature value snapshots contain PII | Encrypted storage; strict access control; anonymisation for non-investigative queries |
| LLM09: Overreliance | Auditors trust lineage completeness claims without verification | Lineage completeness metric surfaced in compliance dashboard; gaps explicitly flagged |

9. Governance Considerations

Responsible AI

  • Complete lineage enables bias attribution: if a model exhibits demographic bias, lineage identifies whether the bias originates in source data, feature engineering, or labelling.
  • Lineage enables right-to-erasure impact analysis: when a data subject requests erasure, lineage identifies all models trained on data linked to that subject (machine unlearning trigger).

Model Risk Management

  • Model risk committees require provenance validation for all production models; Lineage Query API provides this programmatically.
  • Model version deprecation requires impact analysis (which predictions were served by this version?); lineage graph enables this query.

Human Approval Checkpoints

  • Before upstream schema change: Impact Analysis Service report reviewed by data owner + affected ML leads.
  • Regulatory investigation: Lineage Query API output reviewed by compliance officer before submission to regulator.

Governance Artefacts

| Artefact | Owner | Cadence | Purpose |
|---|---|---|---|
| Lineage Completeness Report | ML Platform | Weekly | Coverage % of production models with full source-to-prediction lineage |
| Provenance Report (per prediction) | Compliance Team (on demand) | Per regulatory enquiry | Full lineage trace for specific prediction; human-readable |
| Impact Analysis Report | Data Owner (on change) | Before schema changes | Forward impact: which models at risk from proposed change |
| Consent Withdrawal Impact Report | Privacy Officer (on demand) | Per data subject request | Identifies models trained on data linked to subject |

10. Operational Considerations

Monitoring

| Metric | Alert Threshold | Tooling |
|---|---|---|
| Lineage event delivery success rate | <99.5% over 1 hour | Collector metrics + Grafana |
| Lineage completeness per model | <95% | Custom completeness query + alert |
| Lineage graph store query latency (p99) | >2 seconds | Graph store metrics |
| Inference lineage capture rate (consequential predictions) | <100% | Inference service metrics |
| Collector queue depth (if event buffering) | >10,000 events | Queue metrics |

SLOs

| SLO | Target | Measurement |
|---|---|---|
| Lineage event delivery (pipeline to graph) | <5 minutes end-to-end | Event timestamp vs. graph ingestion timestamp |
| Backward provenance query response | <10 seconds | Lineage Query API response time |
| Lineage graph store availability | 99.9% | Health check |
| Lineage completeness for production models | ≥95% | Weekly completeness query |
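
One possible reading of the completeness SLO is "the share of production models whose lineage contains all four event classes". The sketch below computes that figure and lists the gaps. The definition, the `lineage_coverage` function, and the model names are assumptions for illustration; an organisation may define completeness differently, for example per pipeline run.

```python
# Event classes a model needs for full source-to-prediction lineage.
REQUIRED_CLASSES = frozenset({"dataset", "feature", "training", "inference"})
SLO_TARGET_PCT = 95.0


def lineage_coverage(observed):
    """Return (coverage %, gaps) for a mapping of model -> observed classes.

    gaps maps each incomplete model to the event classes it is missing.
    """
    gaps = {
        model: set(REQUIRED_CLASSES - classes)
        for model, classes in observed.items()
        if not REQUIRED_CLASSES <= classes
    }
    if not observed:
        return 0.0, gaps
    complete = len(observed) - len(gaps)
    return 100.0 * complete / len(observed), gaps


# Hypothetical result of querying the lineage graph per production model.
observed = {
    "credit_risk": {"dataset", "feature", "training", "inference"},
    "fraud": {"dataset", "feature", "training", "inference"},
    "claims": {"dataset", "feature", "training", "inference"},
    "churn": {"dataset", "feature", "training"},
}

pct, gaps = lineage_coverage(observed)
print(f"coverage={pct:.1f}% breach={pct < SLO_TARGET_PCT} gaps={gaps}")
```

Reporting the missing classes alongside the percentage points the on-call engineer at the emitter that needs attention.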

Logging

  • All OpenLineage events logged in raw form (JSON) alongside graph store; serves as event replay source.
  • Lineage query audit log retained 7 years.

Incident Management

  • Lineage gap (missing events for production model) → P2 incident; ML Platform investigates emitter health.
  • Lineage graph store unavailable → P1 if regulatory investigation in progress; P2 otherwise.

Disaster Recovery

| Component | RTO | RPO | Strategy |
|---|---|---|---|
| Lineage Graph Store | 4 hours | 1 hour | Database backup + standby replica; event replay from raw event store |
| OpenLineage Collector | 1 hour | 0 | Multi-AZ stateless deployment; events buffered in pipeline until collector recovers |
| Raw Event Store | 8 hours | 24 hours | Cross-region object storage replication |

11. Cost Considerations

Cost Drivers

| Cost Driver | Typical Range | Notes |
|---|---|---|
| Lineage graph store | $300–$5,000/month | Neo4j AuraDB / Amazon Neptune; scales with graph size |
| OpenLineage Collector (Marquez) | $0–$2,000/month | Marquez OSS free; hosted Atlan has licence cost |
| Inference lineage storage (feature snapshots) | $100–$2,000/month | Object store; scales with prediction volume × sampling rate |
| Compute for lineage queries | $50–$500/month | Graph query compute; low for typical audit query patterns |
| Engineering | 0.25–0.5 FTE | Emitter maintenance; lineage completeness monitoring |

Scaling Risks

  • Inference lineage feature snapshots at high prediction volume (>1M/day) can generate significant storage cost; use sampling + tiered retention.
  • Graph store query cost grows with graph depth; optimise with indexed traversal patterns.
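
To make the first scaling risk concrete, the sketch below estimates daily inference-lineage storage under tiered capture and compares it with capturing a full snapshot for every prediction. All inputs are assumptions for illustration: a 2 KiB feature snapshot, a 100-byte minimal record (model version + timestamp), 1% of predictions flagged consequential, and a 2% sample of the rest.

```python
def daily_lineage_bytes(predictions_per_day, consequential_share, sample_rate,
                        snapshot_bytes, minimal_bytes):
    """Estimate bytes written per day for tiered inference lineage.

    Every prediction stores a minimal record; consequential and sampled
    predictions additionally store a full feature snapshot.
    """
    consequential = round(predictions_per_day * consequential_share)
    routine = predictions_per_day - consequential
    sampled = round(routine * sample_rate)
    snapshots = (consequential + sampled) * snapshot_bytes
    minimal = predictions_per_day * minimal_bytes
    return snapshots + minimal


# Hypothetical workload: 1M predictions/day.
tiered = daily_lineage_bytes(1_000_000, 0.01, 0.02, 2048, 100)
# Full capture is modelled as every prediction being consequential.
full = daily_lineage_bytes(1_000_000, 1.0, 0.0, 2048, 100)

print(f"tiered: {tiered / 1e6:.1f} MB/day")
print(f"full:   {full / 1e6:.1f} MB/day")
print(f"saving: {100 * (1 - tiered / full):.1f}%")
```

Under these assumptions most of the remaining volume comes from the minimal records rather than the snapshots, which suggests that retention tiering matters as much as the sampling rate.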

Optimisations

  • Use OpenLineage open-source stack (Marquez + PostgreSQL backend) for cost-sensitive deployments.
  • Apply sampling for routine inference lineage; full capture only for consequential predictions.
  • Implement lineage graph compaction: archive lineage for deprecated model versions to cold storage.
  • Cache frequent backward provenance queries for commonly audited decisions.

Indicative Cost Range

| Scale | Monthly Cost | Basis |
|---|---|---|
| Small (1–5 models, <100K predictions/day) | $500–$3,000 | Marquez OSS + Neo4j Community + light object store |
| Medium (5–20 models, 1M predictions/day) | $3,000–$12,000 | Managed graph store + Atlan + sampled inference lineage |
| Large (20+ models, 10M+ predictions/day) | $12,000–$50,000 | Amazon Neptune + full enterprise stack + tiered lineage storage |

12. Trade-Off Analysis

Option Comparison

| Option | Pros | Cons | Recommended When |
|---|---|---|---|
| A: Full OpenLineage end-to-end (this pattern) | Standard; interoperable; covers ETL to inference; regulatory-grade | Setup complexity; emitter integration per pipeline tool | Regulated industry; multiple AI systems; regulatory audit requirements |
| B: Model registry lineage only (MLflow dataset tags) | Simple; low overhead; fast to implement | Misses ETL lineage; no inference-level traceability; not regulatory-grade | Experimental; no regulatory obligation; single model |
| C: Manual lineage documentation | Near-zero infrastructure cost | Inaccurate; outdated within weeks; fails regulatory scrutiny | Only viable for very small, stable AI systems with no regulatory obligation |
| D: Proprietary lineage tool (Collibra, Atlan) | Rich UI; enterprise support; cataloguing integrated | High licence cost; vendor lock-in; may not cover all pipeline tools | Large enterprise with existing licence; strong BI lineage needs |

Architectural Tensions

| Tension | Trade-Off | Resolution |
|---|---|---|
| Full inference lineage vs. storage cost | Full feature snapshots at 1M predictions/day is very expensive | Tiered capture: consequential = full, routine = sampled, all = model version + timestamp |
| Event delivery latency vs. pipeline performance | Synchronous lineage emission adds latency to pipelines | Async event emission; lineage not on the critical serving path |
| Lineage standardisation vs. pipeline tool flexibility | Standardising on OpenLineage requires emitter integration per tool | OpenLineage has integrations for major tools (Airflow, Spark, dbt, MLflow); coverage now >80% of common tools |
| Graph query power vs. operational simplicity | Graph databases powerful but operationally complex | PostgreSQL backend for Marquez acceptable for <10M lineage nodes; graph DB for larger deployments |

13. Failure Modes

| Failure | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| Emitter not integrated in new pipeline | High | Medium: lineage gap for new model | Lineage completeness check | Emitter integration checklist in pipeline onboarding; completeness alert |
| Lineage event schema mismatch | Medium | Medium: events rejected by collector; lineage gap | Collector validation error logs | Schema versioning in emitter; forward-compatible schema evolution |
| Graph store capacity exhausted | Low | High: lineage writes fail; lineage gap | Graph store disk/capacity alerts | Lineage compaction policy; archive old lineage to cold storage |
| Inference lineage not captured for appealed decision | Low | Critical: unable to respond to right-to-explanation request | Post-hoc check when appeal received | Always-full lineage for decisions above risk threshold; escalation flag |
| Lineage tampered after write | Very Low | Critical: regulatory fraud | Immutable write policy; hash verification | Append-only lineage store; cryptographic hash of events |
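
The append-only store with cryptographic hashing named in the last row can be illustrated with a hash chain, where each entry's hash covers the previous entry's hash. Hash chaining is an assumption here, one common way to make tampering detectable; the pattern itself only calls for a cryptographic hash of events. `AppendOnlyLog` is a hypothetical in-memory stand-in for the real store.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash, event):
    """SHA-256 over the previous hash plus a canonical JSON form of the event."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


class AppendOnlyLog:
    """In-memory hash-chained event log with tamper detection."""

    def __init__(self):
        self._entries = []

    def append(self, event):
        """Add an event and return its chained hash."""
        prev = self._entries[-1]["hash"] if self._entries else GENESIS
        entry = {"event": event, "prev": prev, "hash": _digest(prev, event)}
        self._entries.append(entry)
        return entry["hash"]

    def verify(self):
        """Return None if the chain is intact, else the first bad index."""
        prev = GENESIS
        for index, entry in enumerate(self._entries):
            expected = _digest(prev, entry["event"])
            if entry["prev"] != prev or entry["hash"] != expected:
                return index
            prev = entry["hash"]
        return None


log = AppendOnlyLog()
log.append({"dataset": "customers_clean", "version": "v1", "rows": 1000})
log.append({"dataset": "customers_clean", "version": "v2", "rows": 1040})
log.append({"model": "credit_risk", "version": "M2"})
print(log.verify())
```

Publishing the latest hash to an independent location (for example a WORM object store) would let an auditor confirm that the whole chain has not been rewritten.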

Cascading Failure Scenarios

  • Schema change cascade without lineage: Data team changes source table schema → no impact analysis (lineage incomplete) → feature pipeline silently produces incorrect features → model degradation → consequential decisions affected → regulatory enquiry → unable to trace root cause.
  • Lineage collector outage during audit: Regulator requests provenance report → Lineage Query API unavailable → raw event store must be replayed → hours of investigation delay.

14. Regulatory Considerations

| Regulation | Article/Clause | Requirement | Pattern Response |
|---|---|---|---|
| EU AI Act | Article 12 | Record-keeping: high-risk AI systems must automatically record events (logs) over their lifetime; Articles 19 and 26 require those logs to be kept for at least six months, and Article 18 requires technical documentation to be kept for 10 years | Lineage graph + raw event store; immutable; retained per regulatory schedule |
| EU AI Act | Article 13 | Transparency: deployers must receive information about the system, including relevant training data characteristics | Backward provenance query enables training data summary for any prediction |
| GDPR | Article 22 | Safeguards for solely automated decisions, commonly read with Articles 13–15 as a right to explanation | Backward provenance report provides explanation basis |
| GDPR | Article 17 | Right to erasure | Forward impact query from subject data → identifies models to unlearn |
| APRA CPS 234 | §32 | Maintain integrity of information assets | Immutable lineage events; hash verification |
| Privacy Act (Australia) | APP 13 | Correction of personal information | Lineage enables identifying all model versions trained on corrected data |
| ISO 42001 | §8.6 | Traceability of AI system inputs | Full source-to-prediction lineage per this pattern |
| NIST AI RMF | GOVERN-6 | Accountability and transparency | Lineage provides accountability trail for AI decisions |

15. Reference Implementations

AWS

| Component | AWS Service |
|---|---|
| OpenLineage Collector | Marquez on ECS or Amazon Managed Workflows for Apache Airflow (lineage) |
| Lineage Graph Store | Amazon Neptune |
| ETL Lineage Emitter | AWS Glue with OpenLineage integration |
| Training Lineage | SageMaker ML Lineage Tracking (native) + OpenLineage emitter |
| Inference Lineage | SageMaker Model Monitor + custom Lambda emitter |
| Raw Event Store | S3 with Object Lock (WORM) |

Azure

| Component | Azure Service |
|---|---|
| OpenLineage Collector | OpenMetadata on AKS |
| Lineage Graph Store | Azure Cosmos DB (Gremlin API) |
| ETL Lineage | Microsoft Purview, formerly Azure Purview (native lineage) + OpenLineage bridge |
| Training Lineage | Azure ML (native) + OpenLineage emitter |
| Inference Lineage | Custom Azure Function emitter |

GCP

| Component | GCP Service |
|---|---|
| OpenLineage Collector | Marquez on Cloud Run |
| Lineage Graph Store | Cloud Spanner or Neo4j on GKE |
| ETL Lineage | Dataplex lineage (native) + Cloud Dataflow OpenLineage |
| Training Lineage | Vertex AI ML Metadata (native) + OpenLineage bridge |
| Inference Lineage | Custom Cloud Function emitter |

On-Premises

| Component | Technology |
|---|---|
| OpenLineage Collector | Marquez (OSS) on Kubernetes |
| Lineage Graph Store | Neo4j Community (small) or Enterprise (large) |
| ETL Lineage | Apache Airflow OpenLineage provider + Spark OpenLineage |
| Training Lineage | MLflow + custom OpenLineage emitter |
| Inference Lineage | Custom Python middleware |
| Raw Event Store | MinIO with object locking |

16. Related Patterns

| Pattern | ID | Relationship | Notes |
|---|---|---|---|
| AI Data Mesh Integration | EAAPL-DAT001 | Depends on | Lineage is core to data product governance in mesh |
| Data Quality for AI | EAAPL-DAT002 | Complements | Quality scorecard IDs embedded in lineage events |
| AI Training Data Governance | EAAPL-DAT007 | Complements | Training data approval records linked via lineage |
| Privacy by Design for AI Data | EAAPL-DAT005 | Enables | Lineage enables right-to-erasure impact analysis |
| Model Versioning | EAAPL-MDL001 | Depends on | Model version IDs are key lineage graph nodes |
| Model Rollback | EAAPL-MDL004 | Enables | Lineage query identifies which predictions were served by rolled-back version |
| Human Approval Gateway | EAAPL-HIL001 | Complements | Approval decisions are lineage events in consequential AI |

17. Maturity Assessment

Overall Maturity: Proven — OpenLineage standard is mature and broadly adopted. Graph-based lineage storage is operationally proven. Inference-level lineage capture at scale remains an evolving practice.

| Dimension | Score (1–5) | Notes |
|---|---|---|
| Architectural clarity | 5 | OpenLineage standard provides clear event schema |
| Tooling maturity | 4 | ETL lineage tools mature; inference lineage tooling maturing |
| Regulatory alignment | 5 | Strong EU AI Act and GDPR alignment |
| Operational complexity | 3 | Graph store operations require specialist skills |
| Cost efficiency | 4 | OSS stack cost-effective; inference lineage at scale requires optimisation |
| Security | 4 | Immutable event store; strong access controls defined |

18. Revision History

| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2023-11-01 | EAAPL Working Group | Initial publication; OpenLineage-based architecture |
| 1.1 | 2024-05-15 | EAAPL Working Group | Added EU AI Act Article 12 alignment; inference lineage tiering |
| 1.2 | 2025-03-01 | EAAPL Working Group | Added right-to-erasure lineage pattern; updated reference implementations |