EAAPL: Enterprise AI Architecture Pattern Library
[EAAPL-DAT002] Data Quality for AI

Category: Data Architecture
Sub-category: Data Quality / AI Readiness
Version: 1.3
Maturity: Proven
Tags: data-quality, feature-validation, quality-gates, drift-detection, label-quality, AI-readiness
Regulatory Relevance: EU AI Act Article 10, APRA CPS 234, ISO 42001 §8.4, NIST AI RMF MAP-2.3


1. Executive Summary

AI systems are uniquely sensitive to data quality failures in ways that traditional BI systems are not: a 3% missingness rate in a key feature can degrade a fraud model's recall by 15–25%. Yet most enterprises apply generic data quality frameworks designed for reporting, not for machine learning. This pattern defines an AI-specific data quality management pipeline that enforces quality gates at every stage of the AI data lifecycle — from source ingestion through training to live inference.

The pattern introduces six AI-specific quality dimensions beyond the classic accuracy/completeness/timeliness trio: representativeness, label quality, inter-feature consistency, distribution stability, temporal validity, and lineage completeness. Automated quality scoring, threshold-based pipeline gates, and remediation workflows ensure that only fit-for-purpose data reaches model training and inference serving.

Organisations that implement this pattern report a 30–50% reduction in model retraining cycles triggered by data quality degradation and a significant improvement in their ability to satisfy regulatory enquiries about training data fitness under EU AI Act Article 10.

Target audience: Chief Data Officers, ML Platform leads, Data Engineering leads.


2. Problem Statement

Business Problem

AI models in production degrade silently because the data feeding them changes without detection. Business decisions based on degraded model outputs cause financial loss, regulatory exposure, and erosion of stakeholder trust in AI programmes.

Technical Problem

  • Standard data quality tools (Great Expectations, Deequ) test for completeness/uniqueness/referential integrity — necessary but insufficient for AI.
  • AI training requires representativeness (does the training distribution match the inference population?), label quality (are ground-truth labels correct and consistent?), and temporal validity (are time-windowed features computed correctly?).
  • Quality checks are typically applied at source ingestion, not at the point of feature computation or model serving — leaving a gap in the AI data pipeline.
  • There is no standard mechanism for quality failures to trigger model rollback or human review rather than silent degradation.

Symptoms

  • Model performance metrics (AUC, F1) decline between retraining cycles without obvious cause.
  • ML engineers discover data quality issues only after model deployment.
  • Training datasets fail regulatory audit because quality assessment documentation is absent.
  • Feature pipelines pass unit tests but produce subtly incorrect features in production (e.g., data leakage from incorrect time-windowing).

Cost of Inaction

| Dimension | Impact |
|---|---|
| Model quality | Silent degradation; 15–40% performance loss before detection |
| Regulatory | EU AI Act Article 10 audit failure; potential prohibition on high-risk AI use |
| Engineering | Unplanned retraining cycles costing $20K–$200K each in compute + engineering time |
| Business | Incorrect AI-driven decisions (fraud missed, credit mispriced, clinical risk underestimated) |

3. Context

When to Apply

  • Any AI system where training data originates from operational systems (ETL pipelines, event streams, external feeds).
  • AI systems subject to regulatory oversight (high-risk AI per EU AI Act Annex III; APRA-regulated institutions).
  • Production AI models where retraining is costly or infrequent (>2 weeks between retraining cycles).
  • Systems where model output quality directly affects business decisions or customer outcomes.

When NOT to Apply

  • Pure research/experimentation environments where data quality enforcement would slow iteration.
  • AI systems consuming already-validated data products from a mature Data Mesh (quality gates already enforced upstream).
  • Very simple rule-based classifiers where feature engineering is trivial and interpretable.

Prerequisites

| Prerequisite | Minimum Viable | Preferred |
|---|---|---|
| Data pipeline observability | Ad hoc logging | Structured logs + metrics pipeline |
| Feature store | None (flat files acceptable) | Managed feature store with versioning |
| Quality framework | Great Expectations / Deequ | Enterprise quality platform |
| Model monitoring | None (manual review) | Automated drift detection |
| Data catalogue | Spreadsheet | DataHub / Atlan with lineage |

Industry Applicability

| Industry | Applicability | Driver |
|---|---|---|
| Financial Services | Critical | APRA CPS 234; credit/fraud model regulatory requirements |
| Healthcare | Critical | Clinical AI; EU AI Act high-risk classification |
| Insurance | High | Actuarial model data governance |
| Retail | High | Personalisation model quality; recommendation accuracy |
| Telecommunications | Medium | Churn/network AI models |
| Manufacturing | Medium | Predictive maintenance; sensor data quality |

4. Architecture Overview

Design Philosophy

The core insight of this pattern is that data quality for AI is a pipeline property, not a dataset property. A dataset can be perfectly accurate yet produce incorrect features due to wrong join logic, data leakage, or distribution shift. Quality must therefore be assessed and enforced at each stage of the AI data pipeline, not only at source.

Stage 1 — Source Quality. Traditional quality checks (completeness, accuracy, referential integrity, format validity) are applied at data ingestion. These checks use proven frameworks (Great Expectations, Deequ, dbt tests) and block pipeline execution on hard failures. This stage is necessary but not sufficient.

Stage 2 — AI-Specific Feature Quality. After feature engineering, a second quality pass applies AI-specific checks:

  • Representativeness: Statistical tests (Population Stability Index, KS test, chi-squared) compare the feature distribution in the current training cohort against a reference distribution (typically the first production training run). A PSI > 0.25 indicates significant distribution shift requiring human review.
  • Temporal validity: For time-windowed features, validate that no future data has leaked into the training window (a subtle but catastrophic quality failure).
  • Label quality (for supervised learning): Estimate the label error rate using Cleanlab or comparable methods that flag suspect labels from cross-validated predicted probabilities. Label error rates above 5% in training data typically degrade model performance below acceptable thresholds.
  • Inter-feature consistency: Validate that combinations of features are logically consistent (e.g., account_age cannot be negative; customer_tenure cannot exceed account_age).
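
Three of the Stage 2 checks above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: the function names, the equal-width binning scheme, and the field names `prediction_time` and `latest_source_event_time` are assumptions made for the example. Only the PSI > 0.25 review threshold and the two consistency rules come from the text.

```python
import math
from datetime import datetime


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Bins are equal-width over the reference range (an assumption; quantile
    bins are a common alternative).
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant features

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp so values outside the reference range land in the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


def leaked_rows(rows):
    """Rows whose features used source events later than the prediction time."""
    return [r for r in rows if r["latest_source_event_time"] > r["prediction_time"]]


def consistency_violations(row):
    """Apply the two inter-feature rules given as examples in the text."""
    problems = []
    if row["account_age"] < 0:
        problems.append("account_age is negative")
    if row["customer_tenure"] > row["account_age"]:
        problems.append("customer_tenure exceeds account_age")
    return problems


if __name__ == "__main__":
    reference = list(range(100))
    shifted = [v + 50 for v in reference]
    print("PSI needs review:", psi(reference, shifted) > 0.25)
    row = {
        "prediction_time": datetime(2025, 1, 1),
        "latest_source_event_time": datetime(2025, 1, 3),
    }
    print("Leaked rows:", len(leaked_rows([row])))
```

In practice the reference sample would be loaded from the stored reference training run rather than generated inline.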

Stage 3 — Training Dataset Quality Gate. Before a training run is initiated, an automated Quality Scorecard is computed across all six AI quality dimensions. Each dimension has a threshold. A training run is blocked if any dimension falls below its hard threshold, or if the weighted quality score falls below an overall threshold (recommended: 0.85 on a 0–1 scale).
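
The gate rule described above (block on any hard-threshold breach, or on a weighted score below the overall threshold) could look roughly like the following. The `Dimension` class, the `evaluate_gate` function, and the weights and per-dimension hard thresholds in the usage example are hypothetical; only the 0.85 overall threshold is taken from the text.

```python
from dataclasses import dataclass

OVERALL_THRESHOLD = 0.85  # recommended overall threshold from the text


@dataclass(frozen=True)
class Dimension:
    score: float           # normalised score in [0, 1]
    weight: float          # relative weight in the overall score
    hard_threshold: float  # below this, training is blocked outright


def evaluate_gate(dimensions, overall_threshold=OVERALL_THRESHOLD):
    """Return the gate decision for a mapping of dimension name -> Dimension."""
    total_weight = sum(d.weight for d in dimensions.values())
    overall = sum(d.score * d.weight for d in dimensions.values()) / total_weight
    hard_failures = sorted(
        name for name, d in dimensions.items() if d.score < d.hard_threshold
    )
    allowed = not hard_failures and overall >= overall_threshold
    return {
        "allowed": allowed,
        "overall_score": round(overall, 4),
        "hard_failures": hard_failures,
    }


if __name__ == "__main__":
    # Illustrative values only; real weights are set per use case.
    scores = {
        "representativeness": Dimension(0.92, 2.0, 0.70),
        "label_quality": Dimension(0.60, 2.0, 0.70),
        "inter_feature_consistency": Dimension(0.99, 1.0, 0.80),
        "distribution_stability": Dimension(0.90, 1.0, 0.70),
        "temporal_validity": Dimension(1.00, 1.0, 0.95),
        "lineage_completeness": Dimension(0.95, 1.0, 0.70),
    }
    print(evaluate_gate(scores))
```

Returning the list of failed dimensions alongside the decision gives the remediation workflow something concrete to route.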

Stage 4 — Inference-Time Feature Quality. At inference time, online features are validated before being passed to the model. This catches data pipeline failures that would otherwise cause the model to serve incorrect predictions silently. The validation is lightweight (schema check + null check + range check) to avoid inference latency impact.
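
A lightweight validator of the kind described above might be sketched as follows, using only the standard library (the Components table lists Pydantic as one option; this sketch deliberately avoids dependencies). The feature names other than `account_age`, the ranges, and the fallback behaviour are assumptions for illustration.

```python
import math

# Hypothetical feature specification: name -> (accepted types, min, max).
FEATURE_SPEC = {
    "transaction_amount": ((int, float), 0.0, 1_000_000.0),
    "account_age": ((int,), 0, 36_500),
}


def validate_features(features, spec=FEATURE_SPEC):
    """Schema, null, and range checks; returns a list of error strings."""
    errors = []
    for name, (types, lo, hi) in spec.items():
        value = features.get(name)
        if value is None or (isinstance(value, float) and math.isnan(value)):
            errors.append(f"{name}: missing or null")
        elif isinstance(value, bool) or not isinstance(value, types):
            # bool is a subclass of int in Python, so reject it explicitly.
            errors.append(f"{name}: wrong type {type(value).__name__}")
        elif not lo <= value <= hi:
            errors.append(f"{name}: {value} outside [{lo}, {hi}]")
    return errors


def predict_with_fallback(features, model_fn, fallback):
    """Call the model only when validation passes; otherwise use the fallback."""
    errors = validate_features(features)
    if errors:
        return fallback, errors
    return model_fn(features), []
```

A caller would typically emit the returned errors as metrics, so the inference-time null rate can be tracked alongside the prediction itself.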

Stage 5 — Prediction Quality Monitoring. Model outputs are monitored for prediction distribution drift. A sudden shift in the distribution of predicted classes or scores is an early indicator of underlying data quality degradation. This feeds back into a retraining trigger mechanism.
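
As a rough sketch of prediction drift monitoring for a classifier, the PSI can be computed over predicted class proportions in a baseline window and a recent window. The function names and the window construction are assumptions; the 0.1 alert level matches the monitoring threshold given later in this pattern.

```python
import math
from collections import Counter


def class_distribution_psi(baseline, recent, eps=1e-6):
    """PSI between the predicted-class proportions of two windows."""
    labels = set(baseline) | set(recent)
    base_counts, recent_counts = Counter(baseline), Counter(recent)
    total = 0.0
    for label in labels:
        # Floor at eps so a class absent from one window does not yield log(0).
        b = max(base_counts[label] / len(baseline), eps)
        r = max(recent_counts[label] / len(recent), eps)
        total += (r - b) * math.log(r / b)
    return total


def drift_status(psi_value, alert_threshold=0.1):
    """Map a PSI value to an alert decision."""
    return "alert" if psi_value > alert_threshold else "ok"


if __name__ == "__main__":
    baseline = ["legit"] * 95 + ["fraud"] * 5
    recent = ["legit"] * 80 + ["fraud"] * 20
    print(drift_status(class_distribution_psi(baseline, recent)))
```

For score outputs rather than classes, the same comparison applies after binning the scores.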

Quality Scorecard Design. Each of the six dimensions produces a normalised score (0–1). The overall quality score is a weighted average, with weights configurable by use case (e.g., healthcare AI weights label quality higher; fraud detection weights representativeness higher). The scorecard is stored alongside the training dataset version and model version for audit purposes.


5. Architecture Diagram

flowchart TD
    subgraph Ingestion["Source and Feature Layer"]
        A[Raw Data Source]
        B{Source Quality Gate}
        C[Feature Engineering]
    end
    subgraph Training["Training Quality Gate"]
        D[Quality Scorecard]
        E{Score above threshold?}
        F[Training Pipeline]
    end
    subgraph Inference["Inference and Monitor"]
        G[Inference Feature Validation]
        H[Model Inference]
        I[Drift Monitor]
    end
    A --> B
    B -->|pass| C
    B -->|fail| A
    C --> D
    D --> E
    E -->|pass| F
    E -->|fail| C
    F --> G
    G --> H
    H --> I
    I -->|drift| D
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#f3e8ff,stroke:#a855f7
    style C fill:#f0fdf4,stroke:#22c55e
    style D fill:#f0fdf4,stroke:#22c55e
    style E fill:#f3e8ff,stroke:#a855f7
    style F fill:#d1fae5,stroke:#10b981
    style G fill:#f0fdf4,stroke:#22c55e
    style H fill:#d1fae5,stroke:#10b981
    style I fill:#fee2e2,stroke:#ef4444

6. Components

| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| Source Quality Checker | Processing Service | Schema validation; completeness; referential integrity; format checks | Great Expectations, Deequ, dbt tests, Soda Core | Critical |
| Quarantine Store | Storage | Isolates bad records; preserves quality report for remediation | S3 / GCS / ADLS quarantine bucket + metadata table | High |
| Feature Computation Engine | Processing Service | Computes AI features from validated source data | Apache Spark, dbt, Flink, Databricks | Critical |
| AI Quality Checker | Processing Service | Representativeness (PSI/KS); temporal validity; label quality (Cleanlab); inter-feature consistency | Custom Python + scipy/statsmodels; Cleanlab; Great Expectations custom expectations | Critical |
| Quality Scorecard Engine | Processing Service | Aggregates 6-dimension scores into weighted quality score; stores scorecard with dataset version | Custom Python service; dbt exposures; MLflow tags | High |
| Quality Gate Controller | Orchestration | Enforces threshold-based gates; blocks or allows downstream pipeline execution | Apache Airflow sensors; Kubeflow pipeline conditions; custom Lambda | Critical |
| Inference-Time Validator | Processing Service | Lightweight feature validation at inference request time; low-latency path | Custom middleware in inference service; Pydantic schema validation | High |
| Prediction Drift Monitor | Monitoring Service | Monitors prediction distribution over time; triggers retraining alerts | Evidently AI, Arize AI, WhyLabs, custom statsmodels pipeline | High |
| Quality Remediation Workflow | Human Process + Tooling | Routes quality failures to appropriate owners; tracks remediation status | Jira integration; Slack alerts; custom workflow tool | Medium |
| Quality Dashboard | Observability | Visualises quality scores over time per dataset + model | Grafana, Metabase, custom React dashboard | Medium |

7. Data Flow

Primary Flow

| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | Ingestion pipeline | Reads raw data from source; runs source quality checks | Quality report + pass/fail per record |
| 2 | Quality Gate 1 | Evaluates source quality against thresholds; passes clean records; quarantines failures | Clean dataset partition; quarantine records |
| 3 | Feature computation | Computes AI features from clean source data | Raw feature dataset |
| 4 | AI Quality Checker | Runs representativeness, temporal validity, label quality, inter-feature consistency checks | Per-dimension quality scores |
| 5 | Quality Gate 2 | Evaluates AI-specific quality; passes if all dimensions within tolerance | Pass signal or human review request |
| 6 | Quality Scorecard Engine | Computes weighted quality score; attaches to dataset version | Scored dataset version with quality certificate |
| 7 | Quality Gate 3 | Checks overall quality score ≥ 0.85; allows or blocks training run | Allow signal or escalation |
| 8 | Training pipeline | Trains model on quality-certified dataset | Trained model artefact with quality score in metadata |
| 9 | Inference-time validator | Validates online features at each inference request | Valid feature vector or fallback trigger |
| 10 | Prediction drift monitor | Continuously monitors prediction distributions | Drift alert or retraining trigger |

Error Flow

| Error Condition | Trigger | Response | Recovery |
|---|---|---|---|
| Source missingness >5% on key feature | Quality Gate 1 | Records quarantined; data owner alerted; pipeline paused | Data owner investigates source; missing data remediated or imputed per approved strategy |
| PSI >0.25 on critical feature (distribution shift) | AI Quality Check | Human review requested; training blocked | ML lead reviews distribution shift; determines if shift is real or artefact; approves or rejects training |
| Label error rate >5% | Cleanlab check | Training blocked; label owner alerted | Label review workflow; re-labelling of suspect records |
| Inference-time feature null rate >10% | Inference validator | Fallback to cached/default prediction; alert raised | Feature pipeline investigated; SLA breach review |
| Prediction distribution PSI >0.1 | Drift monitor | Retraining trigger or alert (per configured sensitivity) | Retraining cycle initiated; root cause (data vs. concept drift) investigated |

8. Security Considerations

Authentication & Authorisation

  • Quality check pipelines run under service identity with read-only access to source data; no write-back to source systems.
  • Quality scorecard store protected by role-based access; model training pipeline must present valid scorecard ID to proceed.

Secrets Management

  • Database credentials for quality checks stored in secrets manager; rotated every 90 days.
  • No credentials embedded in pipeline code or Great Expectations config files.

Data Classification

  • Quarantine store inherits classification of source data; treated as Confidential minimum.
  • Quality reports do not contain sample records — only aggregate statistics — to avoid PII exposure in quality artefacts.

Encryption

  • Quarantine store encrypted at rest (AES-256); in transit TLS 1.3.
  • Quality scorecards and reports encrypted at rest; access logged.

Auditability

  • Every quality check execution logged with: dataset version, check type, result, timestamp, pipeline run ID.
  • Quality gate decisions (pass/block/override) logged immutably; human review decisions captured with reviewer identity.
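
One way to make such log entries tamper-evident is to chain each entry to the hash of the previous one. The sketch below is an assumption about how "logged immutably" might be approximated in application code; write-once storage or a managed ledger would serve the same purpose. The function names and the in-memory list are hypothetical, while the logged fields follow the list above.

```python
import hashlib
import json
from datetime import datetime, timezone


def append_audit_entry(log, *, dataset_version, check_type, result,
                       pipeline_run_id, reviewer=None, timestamp=None):
    """Append a hash-chained audit entry to an in-memory log (a list)."""
    entry = {
        "dataset_version": dataset_version,
        "check_type": check_type,
        "result": result,  # e.g. "pass", "block", "override"
        "timestamp": (timestamp or datetime.now(timezone.utc)).isoformat(),
        "pipeline_run_id": pipeline_run_id,
        "reviewer": reviewer,  # set for human review decisions
        "prev_hash": log[-1]["entry_hash"] if log else "",
    }
    payload = json.dumps(entry, sort_keys=True)
    entry["entry_hash"] = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    log.append(entry)
    return entry


def verify_chain(log):
    """Return True if no entry has been altered, removed, or reordered."""
    prev = ""
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        if body["prev_hash"] != prev:
            return False
        payload = json.dumps(body, sort_keys=True)
        if hashlib.sha256(payload.encode("utf-8")).hexdigest() != entry["entry_hash"]:
            return False
        prev = entry["entry_hash"]
    return True
```

Serialising with `sort_keys=True` keeps the hash stable regardless of key insertion order.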

OWASP LLM Top 10 Mapping

| OWASP LLM Risk | Relevance | Mitigation |
|---|---|---|
| LLM03: Training Data Poisoning | Malicious data injected into training pipeline degrades model | Source data integrity checks (hash verification); anomaly detection in quality checks |
| LLM04: Model Denial of Service | Malformed features at inference time cause model errors or crashes | Inference-time validator rejects malformed inputs before reaching model |
| LLM06: Sensitive Information Disclosure | PII in training data may leak via model memorisation | Quality pipeline enforces data minimisation; PII scanner before training |
| LLM02: Insecure Output Handling | Degraded model outputs consumed without validation | Prediction distribution monitoring; consumer alerts for quality degradation |

9. Governance Considerations

Responsible AI

  • Representativeness checks surface demographic imbalance in training data before training, reducing the risk of introducing systematic bias.
  • Quality scorecards are mandatory artefacts for high-risk AI impact assessments.

Model Risk Management

  • Model risk management frameworks require training data quality attestation; Quality Scorecard provides this automatically.
  • Quality score < 0.85 is an automatic risk flag requiring risk committee review before model deployment.

Human Approval Checkpoints

  • Distribution shift alerts (PSI > 0.25) require human ML lead approval before training proceeds.
  • Label quality failures require label owner review and sign-off before training.
  • Quality score override (proceeding with score < 0.85) requires written justification from CDO-delegated authority.

Governance Artefacts

| Artefact | Owner | Cadence | Purpose |
|---|---|---|---|
| Quality Scorecard | Data Quality Platform (automated) | Per training dataset version | Machine-readable quality certificate; linked to model version |
| Quality Exception Log | Data Quality Owner | Per exception | Records human overrides with justification; retained 7 years |
| Drift Alert Log | ML Platform | Continuous | Record of prediction drift events; links to remediation action |
| Label Quality Report | Domain / Annotation Team | Per labelling cycle | Cleanlab output; inter-annotator agreement; label error rate |
| Quarantine Report | Data Engineering | Per pipeline run | Counts and reasons for quarantined records |

10. Operational Considerations

Monitoring

| Metric | Alert Threshold | Tooling |
|---|---|---|
| Source data completeness per key feature | <95% | Great Expectations / Soda alerts |
| PSI per feature (distribution shift) | >0.1 warning; >0.25 block | Custom monitor + Grafana |
| Label error rate (supervised models) | >3% warning; >5% block | Cleanlab job output |
| Overall quality score | <0.90 warning; <0.85 block | Quality Scorecard Engine |
| Inference-time feature null rate | >2% | Inference service metrics |
| Prediction distribution PSI | >0.1 alert | Evidently / WhyLabs |
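
The two-level thresholds above (warning, then block) could be held as configuration and evaluated with a small helper. This is a sketch only: the dictionary layout, metric keys, and `classify` function are assumptions, while the numeric levels are the ones listed in the monitoring table.

```python
# Two-level thresholds from the monitoring table.
# "above" means higher values are worse; "below" means lower values are worse.
THRESHOLDS = {
    "feature_psi": {"direction": "above", "warning": 0.10, "block": 0.25},
    "label_error_rate": {"direction": "above", "warning": 0.03, "block": 0.05},
    "overall_quality_score": {"direction": "below", "warning": 0.90, "block": 0.85},
}


def classify(metric, value, thresholds=THRESHOLDS):
    """Return "block", "warning", or "ok" for a metric observation."""
    rule = thresholds[metric]
    if rule["direction"] == "above":
        breached = lambda limit: value > limit
    else:
        breached = lambda limit: value < limit
    if breached(rule["block"]):
        return "block"
    if breached(rule["warning"]):
        return "warning"
    return "ok"
```

Keeping thresholds in data rather than code makes per-feature tuning a configuration change rather than a redeployment.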

SLOs

| SLO | Target | Measurement |
|---|---|---|
| Source quality check completion (batch) | <30 minutes for datasets up to 100GB | Pipeline execution time |
| Quality scorecard availability | 99.9% | Scorecard service uptime |
| Inference-time validation latency overhead | <5ms p99 | Inference service metrics |
| Drift detection latency (time from drift to alert) | <1 hour | Monitor execution frequency |

Logging

  • All quality check results logged in structured JSON; retained 7 years for regulatory compliance.
  • Quarantine records retained for 90 days minimum; extended for records involved in regulatory investigations.

Incident Management

  • Quality gate block → P2 incident; data owner and ML lead notified within 15 minutes.
  • Prediction drift alert → P2 incident; ML platform team investigates root cause.
  • Label quality failure → P1 if the affected model is in production; investigate immediately whether the production model should be frozen.

Disaster Recovery

| Component | RTO | RPO | Strategy |
|---|---|---|---|
| Quality Scorecard Store | 4 hours | 1 hour | Database backup + restore; scores immutable once written |
| Drift Monitor | 2 hours | 1 hour | Stateless compute; redeploy from IaC |
| Quarantine Store | 8 hours | 24 hours | Cross-region object storage replication |

Capacity Planning

  • Source quality checks run as batch jobs; scale Spark/Deequ executors based on daily data volume.
  • Inference-time validation is in-process; minimal compute overhead (target <5ms).
  • Drift monitors run on prediction sample (typically 1–10% of predictions); compute scales with sampling rate.
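
For the prediction sampling mentioned above, a deterministic hash-based sampler is one option: the same prediction ID always gets the same decision, so reruns of the monitor see the same sample. The function name and the use of a string prediction ID are assumptions for illustration.

```python
import hashlib


def in_sample(prediction_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of all prediction IDs."""
    digest = hashlib.sha256(prediction_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

With `rate=0.05`, about 5% of predictions are forwarded to the drift monitor, and raising the rate only adds IDs to the sample rather than reshuffling it.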

11. Cost Considerations

Cost Drivers

| Cost Driver | Typical Range | Notes |
|---|---|---|
| Quality check compute (batch) | $300–$3,000/month | Spark/Deequ job costs; scales with data volume |
| Great Expectations / Soda licence | $0–$2,000/month | OSS vs. enterprise tier |
| Cleanlab (label quality) | $500–$5,000/month | Managed service pricing; or open-source self-hosted |
| Drift monitoring platform | $500–$5,000/month | Evidently OSS (free) vs. Arize/WhyLabs SaaS |
| Quality scorecard storage | $50–$300/month | Minimal; JSON artefacts in object store |
| Engineering operational cost | 0.5–1.5 FTE | Ongoing threshold tuning; remediation workflow management |

Scaling Risks

  • Batch quality checks on very large datasets (>10TB) can become expensive; use sampling for distribution checks.
  • Inference-time validation overhead grows with feature vector size and request volume; keep validation logic lightweight.

Optimisations

  • Use statistical sampling for representativeness checks; full dataset scanning is rarely necessary.
  • Cache quality check results for unchanged dataset partitions (incremental quality check pattern).
  • Use open-source Evidently AI for drift monitoring rather than SaaS platforms for cost-sensitive deployments.
  • Run Cleanlab label quality checks once per labelling cycle, not per training run.
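
The incremental quality check pattern mentioned above can be sketched by keying cached results on a content fingerprint of each partition. The class and method names are hypothetical, and the in-memory dictionary stands in for whatever persistent store a real pipeline would use.

```python
import hashlib


class IncrementalChecker:
    """Re-run a quality check only when a partition's content has changed."""

    def __init__(self, check_fn):
        self._check_fn = check_fn
        self._cache = {}  # partition fingerprint -> cached check result
        self.runs = 0     # number of times the underlying check executed

    @staticmethod
    def fingerprint(partition_bytes: bytes) -> str:
        return hashlib.sha256(partition_bytes).hexdigest()

    def check(self, partition_bytes: bytes):
        key = self.fingerprint(partition_bytes)
        if key not in self._cache:
            self.runs += 1
            self._cache[key] = self._check_fn(partition_bytes)
        return self._cache[key]
```

For large partitions, a fingerprint taken from file metadata or a storage-provided checksum avoids reading the data just to decide whether to skip it; that substitution is left as an assumption here.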

Indicative Cost Range

| Scale | Monthly Cost | Basis |
|---|---|---|
| Small (1–3 models, <10GB/day) | $500–$3,000 | OSS stack (Great Expectations + Evidently + custom scorecard) |
| Medium (5–15 models, 100GB/day) | $3,000–$15,000 | Managed quality platform + Soda/Arize |
| Large (20+ models, 1TB+/day) | $15,000–$60,000 | Enterprise quality platform + full drift monitoring suite |

12. Trade-Off Analysis

Option Comparison

| Option | Pros | Cons | Recommended When |
|---|---|---|---|
| A: Full 6-dimension AI quality pipeline (this pattern) | Comprehensive; regulatory-grade; covers the AI quality failure modes described in this pattern | High setup complexity; per-dimension threshold tuning required | Regulated industry; production AI with business-critical decisions |
| B: Source-only quality checks (Great Expectations at ingestion) | Simple; fast to implement; catches structural data issues | Misses AI-specific failures (leakage, representativeness, label quality) | Experimental/low-risk AI; no regulatory obligation |
| C: Post-deployment monitoring only | Catches problems after they surface; low pipeline overhead | Model quality has already degraded before detection; reactive not preventive | Only viable as a complement to A/B, not a replacement |

Architectural Tensions

| Tension | Trade-Off | Resolution |
|---|---|---|
| Thoroughness vs. pipeline latency | Full quality checks add hours to training cycle | Parallelise quality checks; use sampling for distribution tests |
| Strict gates vs. iteration speed | Hard quality gates block experimentation | Two-tier gates: hard gates for production; soft warning gates for experiments |
| Centralised vs. domain quality ownership | Central team can standardise but creates bottleneck | Domain-owned quality checks with central governance of thresholds |
| Statistical sensitivity vs. false positives | Sensitive thresholds catch real drift but also flag seasonal patterns | Tune PSI thresholds per feature; exclude known seasonal features from blocking gates |

13. Failure Modes

| Failure | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| Data leakage not caught (future data in training) | Medium | Critical: model appears to perform well but fails in production | Temporal validity check in Stage 2 | Rebuild feature pipeline with correct time-window logic; retrain |
| Quality threshold set too high (blocks valid training) | Medium | Medium: unnecessary delays; team loses confidence in quality pipeline | Human review requests piling up | Review and recalibrate thresholds quarterly; track gate block rate |
| Quality threshold set too low (lets bad data through) | Low | High: biased or degraded model deployed | Post-deployment drift monitoring | Emergency retraining; threshold review |
| Quarantine store full (silent pass of bad records) | Low | High: quality check bypassed | Quarantine store capacity monitoring | Increase quarantine store capacity; alert on >80% usage |
| Label quality check not run (annotation skipped) | Medium | High: noisy labels degrade supervised model | Cleanlab job missing from pipeline run | Gate training on label quality check completion |
| Drift monitor false positive (unnecessary retraining) | Medium | Low-Medium: wasted compute; engineering distraction | Track retraining trigger rate vs. actual drift | Tune drift thresholds; add seasonal adjustment |

Cascading Failure Scenarios

  • Silent distribution shift cascade: PSI threshold misconfigured → distribution shift not detected → model trained on shifted data → model deployed → prediction quality degrades → business decisions affected → detected only at quarterly review.
  • Quality gate outage cascade: Quality gate controller service down → pipeline bypasses quality checks → bad data reaches training → biased model trained → deployed to production → regulatory audit finds no quality attestation for model.
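
The quality gate outage cascade above arises when the pipeline treats an unreachable gate as a pass. A fail-closed wrapper is one way to avoid that; the sketch below assumes a hypothetical `fetch_scorecard` callable that returns a dictionary with an `overall_score` key, or `None` when no scorecard exists.

```python
def gate_decision(fetch_scorecard, dataset_version, threshold=0.85):
    """Fail closed: any error reaching the gate blocks training."""
    try:
        scorecard = fetch_scorecard(dataset_version)
    except Exception as exc:
        # An outage must never be interpreted as a pass.
        return {"allowed": False, "reason": f"gate unavailable: {exc}"}
    if scorecard is None:
        return {"allowed": False, "reason": "no scorecard for dataset version"}
    if scorecard["overall_score"] < threshold:
        return {"allowed": False, "reason": "score below threshold"}
    return {"allowed": True, "reason": "scorecard meets threshold"}
```

Recording the returned reason in the audit log means a later regulatory enquiry can distinguish a blocked run from one that never reached the gate.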

14. Regulatory Considerations

| Regulation | Article/Clause | Requirement | Pattern Response |
|---|---|---|---|
| EU AI Act | Article 10(3) | Training data must be relevant, sufficiently representative and, to the best extent possible, free of errors and complete | Representativeness (PSI/KS), completeness, and accuracy checks enforced at Stages 1 and 2 |
| EU AI Act | Article 10(2)(f) | Examine data for possible biases | Demographic distribution checks in representativeness dimension |
| EU AI Act | Article 12 | Record-keeping for high-risk AI | Quality Scorecard stored with training dataset version; retained 7 years |
| APRA CPS 234 | §32 | Data integrity and accuracy controls | Source quality gate enforces integrity; audit log of quality check results |
| Privacy Act (Australia) | APP 11 | Security of personal information | Quarantine store classification; PII minimisation in quality artefacts |
| ISO 42001 | §8.4 | Data quality management for AI | Six-dimension quality framework maps to ISO 42001 data quality requirements |
| NIST AI RMF | MAP-2.3 | Scientific validity; data fitness for purpose | Representativeness and temporal validity checks directly address scientific validity |

15. Reference Implementations

AWS

| Component | AWS Service |
|---|---|
| Source quality checks | AWS Glue Data Quality + Great Expectations on Glue |
| Feature computation | AWS Glue / EMR |
| AI quality checks | Custom Python on SageMaker Processing Jobs |
| Label quality | Cleanlab on SageMaker Processing Jobs |
| Quality scorecard | DynamoDB + S3 |
| Quality gate controller | Step Functions |
| Drift monitoring | SageMaker Model Monitor |
| Quarantine store | S3 with S3 Object Lock |

Azure

| Component | Azure Service |
|---|---|
| Source quality checks | Azure Data Factory data validation + Great Expectations |
| Feature computation | Azure Databricks |
| AI quality checks | Custom Python on Azure ML Compute |
| Quality scorecard | Azure Cosmos DB + ADLS |
| Quality gate controller | Azure ML Pipelines conditions |
| Drift monitoring | Azure ML Data Drift monitor |

GCP

| Component | GCP Service |
|---|---|
| Source quality checks | Dataplex data quality + Deequ on Dataproc |
| Feature computation | Cloud Dataflow |
| AI quality checks | Custom Python on Vertex AI Custom Jobs |
| Quality scorecard | Firestore + GCS |
| Quality gate controller | Vertex AI Pipelines |
| Drift monitoring | Vertex AI Model Monitoring |

On-Premises

| Component | Technology |
|---|---|
| Source quality checks | Great Expectations on Kubernetes |
| Feature computation | Apache Spark |
| AI quality checks | Custom Python + Cleanlab on Kubernetes |
| Quality scorecard | PostgreSQL + MinIO |
| Quality gate controller | Apache Airflow |
| Drift monitoring | Evidently AI (self-hosted) |

16. Related Patterns

| Pattern | ID | Relationship | Notes |
|---|---|---|---|
| AI Data Mesh Integration | EAAPL-DAT001 | Depends on | Quality gates are enforced within data product contracts |
| Data Lineage for AI | EAAPL-DAT003 | Complements | Lineage needed to trace quality failures to source |
| AI Training Data Governance | EAAPL-DAT007 | Depends on | Quality scorecard is a governance artefact in training data governance |
| Real-Time Feature Engineering | EAAPL-DAT008 | Complements | Inference-time validation is a component of real-time feature serving |
| Active Learning Loop | EAAPL-HIL002 | Complements | Label quality improvements feed back via active learning |
| Model Rollback | EAAPL-MDL004 | Triggers | Severe quality failures may trigger model rollback |

17. Maturity Assessment

Overall Maturity: Proven — Source quality frameworks (Great Expectations, Deequ) are mature. AI-specific quality dimensions (representativeness, label quality) are emerging as industry standard practice, supported by Cleanlab and Evidently.

| Dimension | Score (1–5) | Notes |
|---|---|---|
| Architectural clarity | 5 | Well-defined pipeline stages and gate logic |
| Tooling maturity | 4 | Source quality tools mature; AI-specific tools (Cleanlab) still maturing |
| Regulatory alignment | 5 | Strong EU AI Act Art. 10 and APRA CPS 234 alignment |
| Operational complexity | 3 | Threshold tuning and remediation workflow require ongoing attention |
| Cost efficiency | 4 | OSS stack is cost-effective; enterprise platforms add cost |
| Security | 4 | Good controls defined; PII-safe quality reporting |

18. Revision History

| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2023-09-01 | EAAPL Working Group | Initial publication |
| 1.1 | 2024-02-15 | EAAPL Working Group | Added label quality dimension; Cleanlab integration |
| 1.2 | 2024-08-01 | EAAPL Working Group | Added EU AI Act Article 10 alignment; inference-time validation |
| 1.3 | 2025-03-01 | EAAPL Working Group | Updated drift monitoring options; expanded failure modes |