[EAAPL-DAT003] Data Lineage for AI
Category: Data Architecture
Sub-category: Data Lineage / AI Traceability
Version: 1.2
Maturity: Proven
Tags: data-lineage, provenance, OpenLineage, explainability, impact-analysis, traceability, regulatory-audit
Regulatory Relevance: EU AI Act Articles 12 & 13, APRA CPS 234, GDPR Article 22, ISO 42001 §8.6, NIST AI RMF GOVERN-6
1. Executive Summary
Regulators, auditors, and risk committees increasingly demand that organisations explain not just what an AI model predicted, but why — tracing back through the model architecture, training data, and ultimately to raw source systems. Traditional data lineage tools capture ETL pipelines but stop at the data warehouse boundary, leaving the AI layer invisible.
This pattern defines an end-to-end AI data lineage architecture based on the OpenLineage standard. It captures lineage from raw source systems through every transformation stage, training run, model version, and inference event. The resulting lineage graph lets the organisation answer the regulator's question: "Which version of which data, processed how, produced this model, which made this prediction on this date?"
Beyond compliance, AI lineage delivers operational value: when a source system changes, impact analysis identifies all downstream models at risk within minutes instead of weeks. Organisations adopting this pattern have reduced regulatory investigation response time from weeks to hours and eliminated surprise model breakages from upstream schema changes.
Target audience: Chief Data Officers, Chief Compliance Officers, Enterprise Architects, ML Platform leads.
2. Problem Statement
Business Problem
When an AI model makes a consequential decision (loan rejection, insurance claim denial, clinical risk assessment), the organisation must be able to explain and defend that decision — including the data that trained the model. Without lineage, this explanation is impossible, creating regulatory and legal exposure.
Technical Problem
- Data lineage tools capture ETL/SQL lineage but do not model AI-specific lineage events (feature engineering, training, inference).
- Model registries store model artefacts but do not link models to the specific dataset versions used for training.
- Inference logs record predictions but not the feature values used to produce each prediction, nor the model version.
- Impact analysis of upstream data changes on downstream AI models is manual and error-prone.
- No standard schema exists for capturing AI lineage events — leading to bespoke, non-interoperable solutions.
Symptoms
- Regulatory enquiry response ("explain this credit decision") takes weeks rather than hours.
- Schema change in an operational database silently breaks a downstream AI model.
- Model retraining produces different results but the cause cannot be traced to a specific data change.
- Multiple incompatible lineage stores exist: one for ETL, one for ML pipelines, one for BI — no unified view.
- Audit finds model was trained on data that included consent-withdrawn records; no mechanism to detect this.
Cost of Inaction
| Dimension | Impact |
|---|---|
| Regulatory | EU AI Act Article 12 violation; APRA regulatory action; GDPR Article 22 right-to-explanation breach |
| Operational | Weeks to diagnose model quality issues from upstream data changes |
| Legal | Inability to defend AI decisions in tribunal or litigation |
| Trust | Stakeholders (regulators, customers) cannot verify AI system integrity |
3. Context
When to Apply
- Any production AI system in a regulated industry.
- AI systems where decisions are consequential (credit, insurance, clinical, employment).
- Organisations with multiple AI models consuming data from shared source systems (high impact analysis value).
- Systems subject to right-to-explanation requirements (GDPR, EU AI Act, Privacy Act).
- Organisations where data contracts / data mesh are in use (lineage complements contract governance).
When NOT to Apply
- Pure research/PoC AI with no production decisions.
- AI systems consuming only fully external, black-box data APIs with no lineage available.
- Very simple AI systems (single feature, deterministic rule) where lineage is obvious from inspection.
Prerequisites
| Prerequisite | Minimum Viable | Preferred |
|---|---|---|
| Data pipeline observability | Ad hoc logging | Structured pipeline execution logs |
| ML pipeline tooling | Manual training scripts | MLflow / Kubeflow with run tracking |
| Lineage storage | Flat file (JSON) | Graph database (Neptune, Neo4j) or OpenLineage backend |
| Inference logging | Basic prediction logs | Structured inference log with feature values + model version |
Industry Applicability
| Industry | Applicability | Driver |
|---|---|---|
| Financial Services | Critical | APRA CPS 234; model risk management; lending decisions |
| Healthcare | Critical | EU AI Act high-risk; clinical decision support; drug discovery |
| Insurance | High | Actuarial model explainability; claims decisions |
| Government | High | Public sector AI accountability; FOI obligations |
| Retail | Medium | Personalisation; recommendation system transparency |
| Telecommunications | Medium | Churn; fraud model explainability |
4. Architecture Overview
Design Philosophy
The foundational insight of this pattern is that AI lineage is a graph, not a table. The lineage of a single prediction involves a directed acyclic graph connecting: raw source records → data transformations → joined datasets → feature engineering → training dataset version → model training run → model version → inference request → prediction output. Each node in this graph is an immutable versioned artefact; each edge is a transformation event with metadata.
OpenLineage as the Standard. Rather than inventing a proprietary lineage schema, this pattern adopts the OpenLineage standard (openlineage.io), which defines a common event schema for lineage capture across data pipelines, ML platforms, and inference services. OpenLineage events are emitted by each pipeline stage, collected by a lineage backend (Marquez or Atlan), and stored in a queryable lineage graph.
Four Lineage Event Classes. AI lineage comprises four distinct event classes, each requiring specific schema extensions to the base OpenLineage spec:
- Dataset lineage events: Emitted by ETL/ELT pipelines; capture source → transformation → output dataset with row counts, schema versions, and quality check results.
- Feature engineering events: Emitted by feature pipelines; capture source datasets → feature computation logic → feature set version with temporal validity metadata.
- Training events: Emitted by training pipelines; capture feature set versions → training run parameters → model artefact version with quality scorecard ID.
- Inference events: Emitted by inference services; capture model version + feature values → prediction output with confidence score. Note: inference events are high-volume; sampling strategies are required for cost management while preserving full lineage for flagged or high-stakes predictions.
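As a rough illustration of the event classes above, a training event can be written as a plain OpenLineage-style run event. The sketch below builds the JSON by hand rather than through a client library. The top-level fields (`eventType`, `eventTime`, `producer`, `schemaURL`, `run`, `job`, `inputs`, `outputs`) follow the OpenLineage run event structure; the namespaces, job and dataset names, the spec version in the schema URL, and the `eaapl_training` facet with its keys are hypothetical placeholders, not part of the standard or of this pattern's specification.

```python
import json
import uuid
from datetime import datetime, timezone

# Hypothetical identifiers: replace with your own producer URI and facet schema.
PRODUCER = "https://example.org/eaapl/training-emitter"
FACET_SCHEMA = "https://example.org/eaapl/facets/TrainingRunFacet.json"
# The spec version in this URL is an example; use the one your collector expects.
RUN_EVENT_SCHEMA = "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent"


def training_event(event_type, run_id, feature_set, model, hyperparameters, scorecard_id):
    """Build an OpenLineage-style run event describing one training run."""
    return {
        "eventType": event_type,  # "START" or "COMPLETE"
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": PRODUCER,
        "schemaURL": RUN_EVENT_SCHEMA,
        "run": {
            "runId": run_id,
            "facets": {
                # Assumed custom facet carrying the AI-specific metadata.
                "eaapl_training": {
                    "_producer": PRODUCER,
                    "_schemaURL": FACET_SCHEMA,
                    "hyperparameters": hyperparameters,
                    "qualityScorecardId": scorecard_id,
                }
            },
        },
        "job": {"namespace": "ml-training", "name": "credit_risk.train"},
        "inputs": [{"namespace": "feature-store", "name": feature_set}],
        "outputs": [{"namespace": "model-registry", "name": model}],
    }


# START and COMPLETE share one runId so the collector can pair them.
run_id = str(uuid.uuid4())
args = (run_id, "credit_features_v12", "credit_risk_v7", {"max_depth": 6}, "QS-0042")
start_event = training_event("START", *args)
complete_event = training_event("COMPLETE", *args)
print(json.dumps(complete_event, indent=2))
```

The input dataset is the feature set version and the output dataset is the model version, which is what lets the graph link the two through the training run.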
Lineage Query Patterns. Three primary query patterns drive the lineage architecture's design:
- Forward impact query: "Which models and predictions are downstream of dataset X?" — used for impact analysis before schema changes.
- Backward provenance query: "What data produced this prediction?" — used for regulatory explanation.
- Cross-version diff query: "What changed in the data between model version M1 and M2?" — used for model quality investigation.
These query patterns require a graph-capable storage backend (Neo4j, Amazon Neptune, or a columnar store with graph query extensions).
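The three query patterns reduce to graph traversals. The following is a minimal in-memory sketch, not a substitute for a graph database; the `LineageGraph` class, its method names, and the artefact identifiers are assumptions made for illustration.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Toy lineage DAG: nodes are versioned artefacts, edges are transformations."""

    def __init__(self):
        self.downstream = defaultdict(set)
        self.upstream = defaultdict(set)

    def add_edge(self, src, dst):
        self.downstream[src].add(dst)
        self.upstream[dst].add(src)

    @staticmethod
    def _walk(start, edges):
        # Breadth-first traversal collecting every node reachable from start.
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def forward_impact(self, node):
        """Everything downstream of a dataset (impact analysis)."""
        return self._walk(node, self.downstream)

    def backward_provenance(self, node):
        """Everything upstream of a prediction or model (regulatory explanation)."""
        return self._walk(node, self.upstream)

    def version_diff(self, first, second):
        """Upstream artefacts that differ between two model versions."""
        a, b = self.backward_provenance(first), self.backward_provenance(second)
        return {"only_in_first": a - b, "only_in_second": b - a}


graph = LineageGraph()
for src, dst in [
    ("loans@v1", "features@v1"),
    ("customers@v1", "features@v1"),
    ("features@v1", "model@M1"),
    ("model@M1", "prediction-001"),
    ("loans@v2", "features@v2"),
    ("customers@v1", "features@v2"),
    ("features@v2", "model@M2"),
]:
    graph.add_edge(src, dst)

impacted = graph.forward_impact("loans@v1")
provenance = graph.backward_provenance("prediction-001")
diff = graph.version_diff("model@M1", "model@M2")
print(sorted(impacted))
print(sorted(provenance))
print({key: sorted(value) for key, value in diff.items()})
```

In a production graph store the same traversals would normally be expressed in the store's query language; the point here is only that all three questions are reachability queries over one graph.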
Selective Inference Lineage. Capturing full feature values for every prediction is prohibitively expensive at scale (millions of predictions per day). The pattern therefore uses a tiered strategy: full lineage for consequential predictions (flagged by risk score, decision type, or regulatory classification); sampled lineage (1–5%) for routine predictions; and a minimal record (model version and timestamp) for every prediction, so that any prediction later reviewed, appealed, or investigated can still be tied to the model version that produced it.
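One way to sketch the tiering decision is a pure function called by the inference middleware. The function name, the 0.8 risk threshold, the set of consequential decision types, and the 2% sample rate are illustrative assumptions; the "minimal" tier corresponds to keeping only model version and timestamp, as in the trade-off analysis in section 12. Sampling is derived from a hash of the prediction ID so that the decision is reproducible across replicas.

```python
import hashlib

SAMPLE_RATE = 0.02  # assumed value inside the 1-5% range
CONSEQUENTIAL_TYPES = frozenset({"credit", "insurance_claim", "clinical"})  # assumed


def capture_tier(prediction_id, risk_score, decision_type, flagged=False,
                 risk_threshold=0.8):
    """Return 'full', 'sampled' or 'minimal' for one prediction."""
    # Consequential predictions always get full feature snapshots.
    if flagged or risk_score >= risk_threshold or decision_type in CONSEQUENTIAL_TYPES:
        return "full"
    # Deterministic sampling: map the ID to a number in [0, 1).
    digest = hashlib.sha256(prediction_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "sampled" if bucket < SAMPLE_RATE else "minimal"


print(capture_tier("pred-1", 0.93, "recommendation"))  # high risk score
print(capture_tier("pred-2", 0.10, "credit"))          # consequential decision type
print(capture_tier("pred-3", 0.10, "recommendation"))  # routine prediction
```

Because the sample is a function of the ID alone, a later investigation can recompute whether a given routine prediction should have a snapshot.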
5. Architecture Diagram
flowchart TD
subgraph Pipeline["Data-to-Model Pipeline"]
A[Source Systems]
B[ETL and Feature Engineering]
C[Model Training]
D[Inference Service]
end
subgraph Lineage["Lineage Backend"]
E[OpenLineage Collector]
F[(Lineage Graph Store)]
G[Lineage Query API]
end
A --> B
B --> C
C --> D
B -->|lineage events| E
C -->|lineage events| E
D -->|lineage events| E
E --> F
F --> G
style A fill:#dbeafe,stroke:#3b82f6
style B fill:#f0fdf4,stroke:#22c55e
style C fill:#f0fdf4,stroke:#22c55e
style D fill:#f0fdf4,stroke:#22c55e
style E fill:#f0fdf4,stroke:#22c55e
style F fill:#fef9c3,stroke:#eab308
style G fill:#d1fae5,stroke:#10b981
6. Components
| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| OpenLineage Emitter (ETL) | Library / Agent | Emits dataset lineage events from ETL jobs | dbt OpenLineage plugin, Airflow OpenLineage provider, Spark OpenLineage integration | Critical |
| OpenLineage Emitter (Feature) | Library | Emits feature engineering lineage events | Custom Python OpenLineage client, MLflow OpenLineage integration | Critical |
| OpenLineage Emitter (Training) | Library | Emits training run lineage events linking feature versions to model version | MLflow OpenLineage plugin, Kubeflow OpenLineage integration | Critical |
| Selective Inference Lineage Capture | Middleware | Captures full feature values + model version for consequential and sampled predictions | Custom inference middleware, Arize AI, WhyLabs | High |
| OpenLineage Emitter (Inference) | Library | Emits inference lineage events | Custom Python OpenLineage client | High |
| OpenLineage Collector | API Service | Receives lineage events from all emitters; validates schema; routes to store | Marquez (OSS), Atlan, OpenMetadata | Critical |
| Lineage Graph Store | Storage | Stores lineage graph as queryable DAG | Neo4j, Amazon Neptune, Memgraph, PostgreSQL + pg_graph | Critical |
| Lineage Query API | API Service | Exposes lineage graph for forward/backward/diff queries | Marquez REST API, custom GraphQL API, Neo4j Cypher endpoint | High |
| Regulatory Audit Tool | Application | Generates human-readable explanation reports from lineage graph | Custom report generator, Collibra compliance reports | High |
| Impact Analysis Service | Application | Executes forward impact queries before schema changes; produces risk report | Custom Python + lineage API, DataHub impact analysis | High |
7. Data Flow
Primary Flow
| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | ETL pipeline | Transforms source data; OpenLineage emitter fires START + COMPLETE events | Dataset lineage events in OpenLineage JSON format |
| 2 | OpenLineage Collector | Receives events; validates against OpenLineage schema; writes to graph store | Lineage nodes (datasets) + edges (transformations) in graph |
| 3 | Feature pipeline | Computes features; emits feature lineage events with input dataset versions | Feature lineage nodes + edges in graph |
| 4 | Training pipeline | Trains model; emits training event with feature set version IDs + hyperparameters | Training lineage node linking feature set version → model version |
| 5 | Model Registry | Stores model artefact; receives lineage ID from training emitter | Model version enriched with lineage pointer |
| 6 | Inference service | Serves prediction; selective lineage capture for consequential + sampled predictions | Inference lineage event: model version + feature snapshot + prediction |
| 7 | Lineage Collector | Ingests inference events; appends to lineage graph | Full lineage graph from source → prediction |
| 8 | Regulatory audit | Executes backward provenance query for specific prediction | Human-readable provenance report: data sources → transformations → model → prediction |
| 9 | Impact analysis | Before schema change: executes forward impact query | Risk report: list of downstream models + predictions at risk |
| 10 | Compliance dashboard | Continuously queries lineage completeness | Lineage coverage metric per model |
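Step 8 turns a backward provenance trace into text an auditor can read. A minimal sketch of that rendering follows, assuming the trace has already been fetched as a mapping from each artefact to the transformations that produced it; the `provenance_report` function, the mapping shape, and the artefact names are hypothetical, and the sketch assumes the trace is acyclic.

```python
def provenance_report(prediction_id, parents):
    """Render an indented provenance trace.

    parents maps an artefact ID to a list of (transformation, upstream artefact)
    pairs; artefacts with no entry are treated as raw sources.
    """
    lines = [f"Provenance report for {prediction_id}"]

    def visit(node, depth):
        for transformation, upstream in parents.get(node, []):
            indent = "  " * depth
            lines.append(f"{indent}- {node} was produced by '{transformation}' from {upstream}")
            visit(upstream, depth + 1)

    visit(prediction_id, 1)
    return "\n".join(lines)


parents = {
    "prediction-001": [("inference", "model@M1")],
    "model@M1": [("training run", "features@v1")],
    "features@v1": [("feature engineering", "loans@v1")],
    "loans@v1": [("ETL", "core_banking.loans")],
}
report = provenance_report("prediction-001", parents)
print(report)
```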
Error Flow
| Error Condition | Trigger | Response | Recovery |
|---|---|---|---|
| Lineage emitter failure (event not sent) | Network error; emitter crash | Prediction still served (lineage not on critical path); alert raised; lineage gap recorded | Emitter retries with exponential backoff; gap filled from pipeline logs if available |
| Lineage collector unavailable | Collector service down | Events queued in emitter buffer (local file or queue); delivered when collector recovers | Collector HA deployment; queue-based event delivery |
| Lineage graph store corruption | Hardware failure | Lineage queries unavailable; no impact on AI serving | Restore from backup; replay buffered events |
| Incomplete inference lineage (sampling miss) | Prediction not in sample; not flagged consequential | Prediction served without full lineage; noted in lineage completeness metric | Accept for routine predictions; escalate if prediction later flagged |
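The first two error rows (retry with exponential backoff, buffer until the collector recovers) can be sketched as a small wrapper around whatever transport the emitter uses. The `BufferedEmitter` class, its parameters, and the in-memory buffer are assumptions for illustration; a real emitter would typically persist the buffer to a local file or queue and run off the serving path.

```python
import time
from collections import deque


class BufferedEmitter:
    """Sends lineage events in order; keeps them buffered if the collector is down."""

    def __init__(self, send, max_retries=3, base_delay=0.5, sleep=time.sleep):
        self.send = send  # callable(event) that raises on failure
        self.max_retries = max_retries
        self.base_delay = base_delay
        self.sleep = sleep  # injectable so tests do not actually wait
        self.buffer = deque()

    def emit(self, event):
        """Queue the event and try to deliver; never raises to the caller."""
        self.buffer.append(event)
        return self.flush()

    def flush(self):
        while self.buffer:
            if not self._send_with_retry(self.buffer[0]):
                return False  # collector still unavailable; keep order intact
            self.buffer.popleft()
        return True

    def _send_with_retry(self, event):
        for attempt in range(self.max_retries):
            try:
                self.send(event)
                return True
            except Exception:
                if attempt < self.max_retries - 1:
                    self.sleep(self.base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
        return False


# Simulated collector that rejects the first four calls, then recovers.
sent, delays, calls = [], [], {"n": 0}

def flaky_send(event):
    calls["n"] += 1
    if calls["n"] <= 4:
        raise ConnectionError("collector unavailable")
    sent.append(event)

emitter = BufferedEmitter(flaky_send, sleep=delays.append)
first_ok = emitter.emit({"id": 1})   # all retries fail; event stays buffered
second_ok = emitter.emit({"id": 2})  # collector recovers; both events delivered
print(first_ok, second_ok, sent)
```

Because `emit` returns a status instead of raising, the prediction is still served when delivery fails, which matches the "lineage not on critical path" response above.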
8. Security Considerations
Authentication & Authorisation
- OpenLineage Collector API requires authenticated emitters (API keys per pipeline stage); keys rotated quarterly.
- Lineage Query API requires role-based access: data engineers (full read), auditors (read-only subset), business analysts (anonymised lineage).
- Feature value snapshots in inference lineage classified as Confidential; access restricted to authorised investigators.
Secrets Management
- Collector API keys stored in secrets manager; not in pipeline code.
- Lineage graph database credentials rotated every 90 days.
Data Classification
- Lineage metadata (dataset names, row counts, schema versions) classified as Internal.
- Feature value snapshots in inference lineage classified as Confidential (may contain PII); stored with encryption and strict access control.
- Regulatory reports generated from lineage may contain PII context; classified as Confidential.
Encryption
- Lineage graph store encrypted at rest (AES-256); in transit TLS 1.3.
- Feature value snapshots encrypted at rest with separate encryption keys; key access logged.
Auditability
- All lineage query executions logged (who queried what lineage, when, why).
- Lineage events immutable once written; no update/delete path.
- Lineage completeness gaps (missing events) logged and alerted.
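Immutability can be made verifiable by chaining event hashes, so that any later modification breaks the chain. The sketch below is one possible approach, not a prescribed mechanism: the record layout, the `append_event` and `verify` helpers, and the all-zero genesis value are assumptions.

```python
import copy
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the first record


def _digest(prev_hash, event):
    # Canonical JSON so the same event always hashes to the same value.
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_event(log, event):
    """Append an event, linking it to the hash of the previous record."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev_hash": prev_hash, "hash": _digest(prev_hash, event)})


def verify(log):
    """Return None if the chain is intact, else the index of the first bad record."""
    prev_hash = GENESIS
    for index, record in enumerate(log):
        expected = _digest(prev_hash, record["event"])
        if record["prev_hash"] != prev_hash or record["hash"] != expected:
            return index
        prev_hash = record["hash"]
    return None


log = []
for rows in (1000, 1200, 1150):
    append_event(log, {"dataset": "loans", "rows": rows})

tampered = copy.deepcopy(log)
tampered[1]["event"]["rows"] = 999  # simulate an edit after write

print(verify(log))       # intact chain
print(verify(tampered))  # index of the modified record
```

Periodically anchoring the latest hash in a separate system (for example the WORM-protected raw event store) would make it harder to rewrite the whole chain unnoticed.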
OWASP LLM Top 10 Mapping
| OWASP LLM Risk | Relevance | Mitigation |
|---|---|---|
| LLM01: Prompt Injection | Adversarial input could attempt to manipulate lineage metadata | Lineage events are system-generated, not user-input; validate emitter identity |
| LLM06: Sensitive Information Disclosure | Feature value snapshots contain PII | Encrypted storage; strict access control; anonymisation for non-investigative queries |
| LLM02: Insecure Output Handling | Lineage reports consumed without validation | Report generation uses read-only lineage API; no dynamic code execution |
| LLM09: Overreliance | Auditors trust lineage completeness claims without verification | Lineage completeness metric surfaced in compliance dashboard; gaps explicitly flagged |
9. Governance Considerations
Responsible AI
- Complete lineage enables bias attribution: if a model exhibits demographic bias, lineage identifies whether the bias originates in source data, feature engineering, or labelling.
- Lineage enables right-to-erasure impact analysis: when a data subject requests erasure, lineage identifies all models trained on data linked to that subject (machine unlearning trigger).
Model Risk Management
- Model risk committees require provenance validation for all production models; Lineage Query API provides this programmatically.
- Model version deprecation requires impact analysis (which predictions were served by this version?); lineage graph enables this query.
Human Approval Checkpoints
- Before upstream schema change: Impact Analysis Service report reviewed by data owner + affected ML leads.
- Regulatory investigation: Lineage Query API output reviewed by compliance officer before submission to regulator.
Governance Artefacts
| Artefact | Owner | Cadence | Purpose |
|---|---|---|---|
| Lineage Completeness Report | ML Platform | Weekly | Coverage % of production models with full source-to-prediction lineage |
| Provenance Report (per prediction) | Compliance Team (on demand) | Per regulatory enquiry | Full lineage trace for specific prediction; human-readable |
| Impact Analysis Report | Data Owner (on change) | Before schema changes | Forward impact: which models at risk from proposed change |
| Consent Withdrawal Impact Report | Privacy Officer (on demand) | Per data subject request | Identifies models trained on data linked to subject |
10. Operational Considerations
Monitoring
| Metric | Alert Threshold | Tooling |
|---|---|---|
| Lineage event delivery success rate | <99.5% over 1 hour | Collector metrics + Grafana |
| Lineage completeness per model | <95% | Custom completeness query + alert |
| Lineage graph store query latency (p99) | >2 seconds | Graph store metrics |
| Inference lineage capture rate (consequential predictions) | <100% | Inference service metrics |
| Collector queue depth (if event buffering) | >10,000 events | Queue metrics |
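The completeness metric is not defined precisely in this pattern. One plausible definition, assumed here, is the share of a model's expected pipeline runs for which a lineage event was actually received. The function names and run identifiers below are hypothetical; the 95% threshold is the alert threshold from the table above.

```python
def completeness(expected_run_ids, received_run_ids):
    """Fraction of expected runs that have a lineage event (1.0 if none expected)."""
    expected = set(expected_run_ids)
    if not expected:
        return 1.0
    return len(expected & set(received_run_ids)) / len(expected)


def models_below_threshold(expected_by_model, received_by_model, threshold=0.95):
    """Return {model: completeness} for every model that should raise an alert."""
    alerts = {}
    for model, expected in expected_by_model.items():
        ratio = completeness(expected, received_by_model.get(model, ()))
        if ratio < threshold:
            alerts[model] = ratio
    return alerts


# Expected runs would come from the scheduler; received runs from the lineage store.
expected_by_model = {
    "credit_risk": [f"run-{i}" for i in range(20)],
    "churn": [f"run-{i}" for i in range(10)],
}
received_by_model = {
    "credit_risk": [f"run-{i}" for i in range(19)],  # one run missing
    "churn": [f"run-{i}" for i in range(8)],         # two runs missing
}
alerts = models_below_threshold(expected_by_model, received_by_model)
print(alerts)
```

Comparing against an independent source of expected runs matters: counting only the events that arrived cannot reveal an emitter that was never integrated.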
SLOs
| SLO | Target | Measurement |
|---|---|---|
| Lineage event delivery (pipeline to graph) | <5 minutes end-to-end | Event timestamp vs. graph ingestion timestamp |
| Backward provenance query response | <10 seconds | Lineage Query API response time |
| Lineage graph store availability | 99.9% | Health check |
| Lineage completeness for production models | ≥95% | Weekly completeness query |
Logging
- All OpenLineage events logged in raw form (JSON) alongside graph store; serves as event replay source.
- Lineage query audit log retained 7 years.
Incident Management
- Lineage gap (missing events for production model) → P2 incident; ML Platform investigates emitter health.
- Lineage graph store unavailable → P1 if regulatory investigation in progress; P2 otherwise.
Disaster Recovery
| Component | RTO | RPO | Strategy |
|---|---|---|---|
| Lineage Graph Store | 4 hours | 1 hour | Database backup + standby replica; event replay from raw event store |
| OpenLineage Collector | 1 hour | 0 | Multi-AZ stateless deployment; events buffered in pipeline until collector recovers |
| Raw Event Store | 8 hours | 24 hours | Cross-region object storage replication |
11. Cost Considerations
Cost Drivers
| Cost Driver | Typical Range | Notes |
|---|---|---|
| Lineage graph store | $300–$5,000/month | Neo4j AuraDB / Amazon Neptune; scales with graph size |
| OpenLineage Collector (Marquez) | $0–$2,000/month | Marquez OSS free; hosted Atlan has licence cost |
| Inference lineage storage (feature snapshots) | $100–$2,000/month | Object store; scales with prediction volume × sampling rate |
| Compute for lineage queries | $50–$500/month | Graph query compute; low for typical audit query patterns |
| Engineering | 0.25–0.5 FTE | Emitter maintenance; lineage completeness monitoring |
Scaling Risks
- Inference lineage feature snapshots at high prediction volume (>1M/day) can generate significant storage cost; use sampling + tiered retention.
- Graph store query cost grows with graph depth; optimise with indexed traversal patterns.
Optimisations
- Use OpenLineage open-source stack (Marquez + PostgreSQL backend) for cost-sensitive deployments.
- Apply sampling for routine inference lineage; full capture only for consequential predictions.
- Implement lineage graph compaction: archive lineage for deprecated model versions to cold storage.
- Cache frequent backward provenance queries for commonly audited decisions.
Indicative Cost Range
| Scale | Monthly Cost | Basis |
|---|---|---|
| Small (1–5 models, <100K predictions/day) | $500–$3,000 | Marquez OSS + Neo4j Community + light object store |
| Medium (5–20 models, 1M predictions/day) | $3,000–$12,000 | Managed graph store + Atlan + sampled inference lineage |
| Large (20+ models, 10M+ predictions/day) | $12,000–$50,000 | Amazon Neptune + full enterprise stack + tiered lineage storage |
12. Trade-Off Analysis
Option Comparison
| Option | Pros | Cons | Recommended When |
|---|---|---|---|
| A: Full OpenLineage end-to-end (this pattern) | Standard; interoperable; covers ETL to inference; regulatory-grade | Setup complexity; emitter integration per pipeline tool | Regulated industry; multiple AI systems; regulatory audit requirements |
| B: Model registry lineage only (MLflow dataset tags) | Simple; low overhead; fast to implement | Misses ETL lineage; no inference-level traceability; not regulatory-grade | Experimental; no regulatory obligation; single model |
| C: Manual lineage documentation | Near-zero infrastructure cost | Inaccurate; outdated within weeks; fails regulatory scrutiny | Only viable for very small, stable AI systems with no regulatory obligation |
| D: Proprietary lineage tool (Collibra, Atlan) | Rich UI; enterprise support; cataloguing integrated | High licence cost; vendor lock-in; may not cover all pipeline tools | Large enterprise with existing licence; strong BI lineage needs |
Architectural Tensions
| Tension | Trade-Off | Resolution |
|---|---|---|
| Full inference lineage vs. storage cost | Full feature snapshots at 1M predictions/day is very expensive | Tiered capture: consequential = full, routine = sampled, all = model version + timestamp |
| Event delivery latency vs. pipeline performance | Synchronous lineage emission adds latency to pipelines | Async event emission; lineage not on the critical serving path |
| Lineage standardisation vs. pipeline tool flexibility | Standardising on OpenLineage requires emitter integration per tool | OpenLineage has integrations for major tools (Airflow, Spark, dbt, MLflow); coverage now >80% of common tools |
| Graph query power vs. operational simplicity | Graph databases powerful but operationally complex | PostgreSQL backend for Marquez acceptable for <10M lineage nodes; graph DB for larger deployments |
13. Failure Modes
| Failure | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| Emitter not integrated in new pipeline | High | Medium: lineage gap for new model | Lineage completeness check | Emitter integration checklist in pipeline onboarding; completeness alert |
| Lineage event schema mismatch | Medium | Medium: events rejected by collector; lineage gap | Collector validation error logs | Schema versioning in emitter; forward-compatible schema evolution |
| Graph store capacity exhausted | Low | High: lineage writes fail; lineage gap | Graph store disk/capacity alerts | Lineage compaction policy; archive old lineage to cold storage |
| Inference lineage not captured for appealed decision | Low | Critical: unable to respond to right-to-explanation request | Post-hoc check when appeal received | Always-full lineage for decisions above risk threshold; escalation flag |
| Lineage tampered after write | Very Low | Critical: regulatory fraud | Immutable write policy; hash verification | Append-only lineage store; cryptographic hash of events |
Cascading Failure Scenarios
- Schema change cascade without lineage: Data team changes source table schema → no impact analysis (lineage incomplete) → feature pipeline silently produces incorrect features → model degradation → consequential decisions affected → regulatory enquiry → unable to trace root cause.
- Lineage backend outage during audit: Regulator requests provenance report → lineage graph store or Lineage Query API unavailable → lineage must be rebuilt by replaying the raw event store → hours of investigation delay.
14. Regulatory Considerations
| Regulation | Article/Clause | Requirement | Pattern Response |
|---|---|---|---|
| EU AI Act | Article 12 | Record-keeping: high-risk AI systems must automatically record events (logs) over the lifetime of the system | Lineage graph + raw event store; immutable; retained per regulatory schedule |
| EU AI Act | Article 13 | Transparency: users must receive information about training data characteristics | Backward provenance query enables training data summary for any prediction |
| GDPR | Article 22 | Right to explanation for automated decisions | Backward provenance report provides explanation basis |
| GDPR | Article 17 | Right to erasure | Forward impact query from subject data → identifies models to unlearn |
| APRA CPS 234 | §32 | Maintain integrity of information assets | Immutable lineage events; hash verification |
| Privacy Act (Australia) | APP 13 | Correction of personal information | Lineage enables identifying all model versions trained on corrected data |
| ISO 42001 | §8.6 | Traceability of AI system inputs | Full source-to-prediction lineage per this pattern |
| NIST AI RMF | GOVERN-6 | Accountability and transparency | Lineage provides accountability trail for AI decisions |
15. Reference Implementations
AWS
| Component | AWS Service |
|---|---|
| OpenLineage Collector | Marquez on ECS or Amazon Managed Service for Apache Airflow (lineage) |
| Lineage Graph Store | Amazon Neptune |
| ETL Lineage Emitter | AWS Glue with OpenLineage integration |
| Training Lineage | SageMaker ML Lineage Tracking (native) + OpenLineage emitter |
| Inference Lineage | SageMaker Model Monitor + custom Lambda emitter |
| Raw Event Store | S3 with Object Lock (WORM) |
Azure
| Component | Azure Service |
|---|---|
| OpenLineage Collector | OpenMetadata on AKS |
| Lineage Graph Store | Azure Cosmos DB (Gremlin API) |
| ETL Lineage | Azure Purview (native lineage) + OpenLineage bridge |
| Training Lineage | Azure ML (native) + OpenLineage emitter |
| Inference Lineage | Custom Azure Function emitter |
GCP
| Component | GCP Service |
|---|---|
| OpenLineage Collector | Marquez on Cloud Run |
| Lineage Graph Store | Cloud Spanner or Neo4j on GKE |
| ETL Lineage | Dataplex lineage (native) + Cloud Dataflow OpenLineage |
| Training Lineage | Vertex AI ML Metadata (native) + OpenLineage bridge |
| Inference Lineage | Custom Cloud Function emitter |
On-Premises
| Component | Technology |
|---|---|
| OpenLineage Collector | Marquez (OSS) on Kubernetes |
| Lineage Graph Store | Neo4j Community (small) or Enterprise (large) |
| ETL Lineage | Apache Airflow OpenLineage provider + Spark OpenLineage |
| Training Lineage | MLflow + custom OpenLineage emitter |
| Inference Lineage | Custom Python middleware |
| Raw Event Store | MinIO with object locking |
16. Related Patterns
| Pattern | ID | Relationship | Notes |
|---|---|---|---|
| AI Data Mesh Integration | EAAPL-DAT001 | Depends on | Lineage is core to data product governance in mesh |
| Data Quality for AI | EAAPL-DAT002 | Complements | Quality scorecard IDs embedded in lineage events |
| AI Training Data Governance | EAAPL-DAT007 | Complements | Training data approval records linked via lineage |
| Privacy by Design for AI Data | EAAPL-DAT005 | Enables | Lineage enables right-to-erasure impact analysis |
| Model Versioning | EAAPL-MDL001 | Depends on | Model version IDs are key lineage graph nodes |
| Model Rollback | EAAPL-MDL004 | Enables | Lineage query identifies which predictions were served by rolled-back version |
| Human Approval Gateway | EAAPL-HIL001 | Complements | Approval decisions are lineage events in consequential AI |
17. Maturity Assessment
Overall Maturity: Proven — OpenLineage standard is mature and broadly adopted. Graph-based lineage storage is operationally proven. Inference-level lineage capture at scale remains an evolving practice.
| Dimension | Score (1–5) | Notes |
|---|---|---|
| Architectural clarity | 5 | OpenLineage standard provides clear event schema |
| Tooling maturity | 4 | ETL lineage tools mature; inference lineage tooling maturing |
| Regulatory alignment | 5 | Strong EU AI Act and GDPR alignment |
| Operational complexity | 3 | Graph store operations require specialist skills |
| Cost efficiency | 4 | OSS stack cost-effective; inference lineage at scale requires optimisation |
| Security | 4 | Immutable event store; strong access controls defined |
18. Revision History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2023-11-01 | EAAPL Working Group | Initial publication; OpenLineage-based architecture |
| 1.1 | 2024-05-15 | EAAPL Working Group | Added EU AI Act Article 12 alignment; inference lineage tiering |
| 1.2 | 2025-03-01 | EAAPL Working Group | Added right-to-erasure lineage pattern; updated reference implementations |