[EAAPL-DAT002] Data Quality for AI
Category: Data Architecture
Sub-category: Data Quality / AI Readiness
Version: 1.3
Maturity: Proven
Tags: data-quality, feature-validation, quality-gates, drift-detection, label-quality, AI-readiness
Regulatory Relevance: EU AI Act Article 10, APRA CPS 234, ISO 42001 §8.4, NIST AI RMF MAP-2.3
1. Executive Summary
AI systems are uniquely sensitive to data quality failures in ways that traditional BI systems are not: a 3% missingness rate in a key feature can degrade a fraud model's recall by 15–25%. Yet most enterprises apply generic data quality frameworks designed for reporting, not for machine learning. This pattern defines an AI-specific data quality management pipeline that enforces quality gates at every stage of the AI data lifecycle — from source ingestion through training to live inference.
The pattern introduces six AI-specific quality dimensions beyond the classic accuracy/completeness/timeliness trio: representativeness, label quality, inter-feature consistency, distribution stability, temporal validity, and lineage completeness. Automated quality scoring, threshold-based pipeline gates, and remediation workflows ensure that only fit-for-purpose data reaches model training and inference serving.
Organisations that implement this pattern report a 30–50% reduction in model retraining cycles triggered by data quality degradation and a significant improvement in their ability to satisfy regulatory enquiries about training data fitness under EU AI Act Article 10.
Target audience: Chief Data Officers, ML Platform leads, Data Engineering leads.
2. Problem Statement
Business Problem
AI models in production degrade silently because the data feeding them changes without detection. Business decisions based on degraded model outputs cause financial loss, regulatory exposure, and erosion of stakeholder trust in AI programmes.
Technical Problem
- Standard data quality tools (Great Expectations, Deequ) test for completeness/uniqueness/referential integrity — necessary but insufficient for AI.
- AI training requires representativeness (does the training distribution match the inference population?), label quality (are ground-truth labels correct and consistent?), and temporal validity (are time-windowed features computed correctly?).
- Quality checks are typically applied at source ingestion, not at the point of feature computation or model serving — leaving a gap in the AI data pipeline.
- There is no standard mechanism for quality failures to trigger model rollback or human review rather than silent degradation.
Symptoms
- Model performance metrics (AUC, F1) decline between retraining cycles without obvious cause.
- ML engineers discover data quality issues only after model deployment.
- Training datasets fail regulatory audit because quality assessment documentation is absent.
- Feature pipelines pass unit tests but produce subtly incorrect features in production (e.g., data leakage from incorrect time-windowing).
Cost of Inaction
| Dimension | Impact |
| Model quality | Silent degradation; 15–40% performance loss before detection |
| Regulatory | EU AI Act Article 10 audit failure; potential prohibition on high-risk AI use |
| Engineering | Unplanned retraining cycles costing $20K–$200K each in compute + engineering time |
| Business | Incorrect AI-driven decisions (fraud missed, credit mispriced, clinical risk underestimated) |
3. Context
When to Apply
- Any AI system where training data originates from operational systems (ETL pipelines, event streams, external feeds).
- AI systems subject to regulatory oversight (high-risk AI per EU AI Act Annex III; APRA-regulated institutions).
- Production AI models where retraining is costly or infrequent (>2 weeks between retraining cycles).
- Systems where model output quality directly affects business decisions or customer outcomes.
When NOT to Apply
- Pure research/experimentation environments where data quality enforcement would slow iteration.
- AI systems consuming already-validated data products from a mature Data Mesh (quality gates already enforced upstream).
- Very simple rule-based classifiers where feature engineering is trivial and interpretable.
Prerequisites
| Prerequisite | Minimum Viable | Preferred |
| Data pipeline observability | Ad hoc logging | Structured logs + metrics pipeline |
| Feature store | None (flat files acceptable) | Managed feature store with versioning |
| Quality framework | Great Expectations / Deequ | Enterprise quality platform |
| Model monitoring | None (manual review) | Automated drift detection |
| Data catalogue | Spreadsheet | DataHub / Atlan with lineage |
Industry Applicability
| Industry | Applicability | Driver |
| Financial Services | Critical | APRA CPS 234; credit/fraud model regulatory requirements |
| Healthcare | Critical | Clinical AI; EU AI Act high-risk classification |
| Insurance | High | Actuarial model data governance |
| Retail | High | Personalisation model quality; recommendation accuracy |
| Telecommunications | Medium | Churn/network AI models |
| Manufacturing | Medium | Predictive maintenance; sensor data quality |
4. Architecture Overview
Design Philosophy
The core insight of this pattern is that data quality for AI is a pipeline property, not a dataset property. A dataset can be perfectly accurate yet produce incorrect features due to wrong join logic, data leakage, or distribution shift. Quality must therefore be assessed and enforced at each stage of the AI data pipeline, not only at source.
Stage 1 — Source Quality. Traditional quality checks (completeness, accuracy, referential integrity, format validity) are applied at data ingestion. These checks use proven frameworks (Great Expectations, Deequ, dbt tests) and block pipeline execution on hard failures. This stage is necessary but not sufficient.
Stage 2 — AI-Specific Feature Quality. After feature engineering, a second quality pass applies AI-specific checks:
- Representativeness: Statistical tests (Population Stability Index, KS test, chi-squared) compare the feature distribution in the current training cohort against a reference distribution (typically the first production training run). A PSI > 0.25 indicates significant distribution shift requiring human review.
- Temporal validity: For time-windowed features, validate that no future data has leaked into the training window (a subtle but catastrophic quality failure).
- Label quality (for supervised learning): Estimate the label error rate using Cleanlab (confident learning over out-of-sample predicted probabilities) or a comparable cross-validated, confidence-based method. Label error rates above 5% in training data typically degrade model performance below acceptable thresholds.
- Inter-feature consistency: Validate that combinations of features are logically consistent (e.g., account_age cannot be negative; customer_tenure cannot exceed account_age).
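The representativeness check above can be sketched with a small Population Stability Index function. This is a minimal illustration rather than a reference implementation: the equal-width binning over the reference range, the bin count of 10, and the `1e-6` floor for empty bins are all assumptions, and quantile-based bins are an equally common choice.

```python
import math
from typing import Sequence


def psi(reference: Sequence[float], current: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between a reference and a current sample.

    Bin edges are equal-width over the reference range (an assumption).
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # guard against a constant reference feature

    def proportions(values: Sequence[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            # Clamp values outside the reference range into the first or last bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        eps = 1e-6  # floor so that empty bins do not produce log(0)
        return [max(c / len(values), eps) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


if __name__ == "__main__":
    reference = [float(i) for i in range(100)]
    shifted = [v + 50 for v in reference]
    print(psi(reference, reference))      # identical samples give no shift
    print(psi(reference, shifted) > 0.25)  # a large shift exceeds the review threshold
```

A score above 0.25 would route the feature to human review, as described in the representativeness bullet.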
Stage 3 — Training Dataset Quality Gate. Before a training run is initiated, an automated Quality Scorecard is computed across all six AI quality dimensions. Each dimension has a threshold. A training run is blocked if any dimension falls below its hard threshold, or if the weighted quality score falls below an overall threshold (recommended: 0.85 on a 0–1 scale).
Stage 4 — Inference-Time Feature Quality. At inference time, online features are validated before being passed to the model. This catches data pipeline failures that would otherwise cause the model to serve incorrect predictions silently. The validation is lightweight (schema check + null check + range check) to avoid inference latency impact.
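A lightweight validator of the kind Stage 4 describes might look like the sketch below. The feature contract is hypothetical: `account_age` is borrowed from the consistency example above, while `txn_amount` and all ranges are invented for illustration. In a real deployment the contract would come from the feature store or a schema library such as Pydantic.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class FeatureSpec:
    dtype: type
    min_value: Optional[float] = None
    max_value: Optional[float] = None


# Hypothetical feature contract; names and ranges are illustrative only.
SPECS = {
    "account_age": FeatureSpec(int, min_value=0),
    "txn_amount": FeatureSpec(float, min_value=0.0, max_value=1_000_000.0),
}


def validate_features(features: Mapping[str, Any]) -> list[str]:
    """Return a list of violations; an empty list means the vector is valid."""
    errors = []
    for name, spec in SPECS.items():
        if name not in features:
            errors.append(f"{name}: missing")  # schema check
            continue
        value = features[name]
        if value is None:
            errors.append(f"{name}: null")  # null check
            continue
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, spec.dtype):
            errors.append(f"{name}: expected {spec.dtype.__name__}")
            continue
        if spec.min_value is not None and value < spec.min_value:
            errors.append(f"{name}: below {spec.min_value}")  # range check
        if spec.max_value is not None and value > spec.max_value:
            errors.append(f"{name}: above {spec.max_value}")
    return errors
```

A non-empty result would trigger the fallback path (cached or default prediction) rather than passing the vector to the model. The checks are plain dictionary lookups and comparisons, which keeps the per-request overhead small.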
Stage 5 — Prediction Quality Monitoring. Model outputs are monitored for prediction distribution drift. A sudden shift in the distribution of predicted classes or scores is an early indicator of underlying data quality degradation. This feeds back into a retraining trigger mechanism.
Quality Scorecard Design. Each of the six dimensions produces a normalised score (0–1). The overall quality score is a weighted average, with weights configurable by use case (e.g., healthcare AI weights label quality higher; fraud detection weights representativeness higher). The scorecard is stored alongside the training dataset version and model version for audit purposes.
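The gate logic from Stage 3 and the weighted average described here can be sketched as follows. The function name, the dictionary layout, and any weights or per-dimension hard thresholds passed in are assumptions for illustration; only the 0.85 overall threshold comes from this pattern.

```python
from typing import Mapping

OVERALL_THRESHOLD = 0.85  # recommended overall threshold from this pattern


def evaluate_scorecard(
    scores: Mapping[str, float],
    weights: Mapping[str, float],
    hard_thresholds: Mapping[str, float],
) -> dict:
    """Combine per-dimension scores (0 to 1) into a weighted score and a gate decision."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same dimensions")
    total_weight = sum(weights.values())
    overall = sum(scores[d] * weights[d] for d in scores) / total_weight
    # A dimension below its own hard threshold blocks regardless of the average.
    failed = sorted(d for d, s in scores.items() if s < hard_thresholds.get(d, 0.0))
    allowed = not failed and overall >= OVERALL_THRESHOLD
    return {
        "overall": round(overall, 4),
        "failed_dimensions": failed,
        "allow_training": allowed,
    }


if __name__ == "__main__":
    scores = {
        "representativeness": 0.90,
        "label_quality": 0.95,
        "inter_feature_consistency": 1.00,
        "distribution_stability": 0.80,
        "temporal_validity": 1.00,
        "lineage_completeness": 0.90,
    }
    weights = {d: 1.0 for d in scores}  # equal weights, purely illustrative
    hard = {d: 0.70 for d in scores}    # illustrative hard thresholds
    print(evaluate_scorecard(scores, weights, hard))
```

Returning the failed dimensions alongside the overall score lets the stored scorecard explain why a run was blocked, which is useful when it is later reviewed for audit.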
5. Architecture Diagram
flowchart TD
subgraph Ingestion["Source and Feature Layer"]
A[Raw Data Source]
B{Source Quality Gate}
C[Feature Engineering]
end
subgraph Training["Training Quality Gate"]
D[Quality Scorecard]
E{Score above threshold?}
F[Training Pipeline]
end
subgraph Inference["Inference and Monitor"]
G[Inference Feature Validation]
H[Model Inference]
I[Drift Monitor]
end
A --> B
B -->|pass| C
B -->|fail| A
C --> D
D --> E
E -->|pass| F
E -->|fail| C
F --> G
G --> H
H --> I
I -->|drift| D
style A fill:#dbeafe,stroke:#3b82f6
style B fill:#f3e8ff,stroke:#a855f7
style C fill:#f0fdf4,stroke:#22c55e
style D fill:#f0fdf4,stroke:#22c55e
style E fill:#f3e8ff,stroke:#a855f7
style F fill:#d1fae5,stroke:#10b981
style G fill:#f0fdf4,stroke:#22c55e
style H fill:#d1fae5,stroke:#10b981
style I fill:#fee2e2,stroke:#ef4444
6. Components
| Component | Type | Responsibility | Technology Options | Criticality |
| Source Quality Checker | Processing Service | Schema validation; completeness; referential integrity; format checks | Great Expectations, Deequ, dbt tests, Soda Core | Critical |
| Quarantine Store | Storage | Isolates bad records; preserves quality report for remediation | S3 / GCS / ADLS quarantine bucket + metadata table | High |
| Feature Computation Engine | Processing Service | Computes AI features from validated source data | Apache Spark, dbt, Flink, Databricks | Critical |
| AI Quality Checker | Processing Service | Representativeness (PSI/KS); temporal validity; label quality (Cleanlab); inter-feature consistency | Custom Python + scipy/statsmodels; Cleanlab; Great Expectations custom expectations | Critical |
| Quality Scorecard Engine | Processing Service | Aggregates 6-dimension scores into weighted quality score; stores scorecard with dataset version | Custom Python service; dbt exposures; MLflow tags | High |
| Quality Gate Controller | Orchestration | Enforces threshold-based gates; blocks or allows downstream pipeline execution | Apache Airflow sensors; Kubeflow pipeline conditions; custom Lambda | Critical |
| Inference-Time Validator | Processing Service | Lightweight feature validation at inference request time; low-latency path | Custom middleware in inference service; Pydantic schema validation | High |
| Prediction Drift Monitor | Monitoring Service | Monitors prediction distribution over time; triggers retraining alerts | Evidently AI, Arize AI, WhyLabs, custom statsmodels pipeline | High |
| Quality Remediation Workflow | Human Process + Tooling | Routes quality failures to appropriate owners; tracks remediation status | Jira integration; Slack alerts; custom workflow tool | Medium |
| Quality Dashboard | Observability | Visualises quality scores over time per dataset + model | Grafana, Metabase, custom React dashboard | Medium |
7. Data Flow
Primary Flow
| Step | Actor | Action | Output |
| 1 | Ingestion pipeline | Reads raw data from source; runs source quality checks | Quality report + pass/fail per record |
| 2 | Quality Gate 1 | Evaluates source quality against thresholds; passes clean records; quarantines failures | Clean dataset partition; quarantine records |
| 3 | Feature computation | Computes AI features from clean source data | Raw feature dataset |
| 4 | AI Quality Checker | Runs representativeness, temporal validity, label quality, inter-feature consistency checks | Per-dimension quality scores |
| 5 | Quality Gate 2 | Evaluates AI-specific quality; passes if all dimensions within tolerance | Pass signal or human review request |
| 6 | Quality Scorecard Engine | Computes weighted quality score; attaches to dataset version | Scored dataset version with quality certificate |
| 7 | Quality Gate 3 | Checks overall quality score ≥ 0.85; allows or blocks training run | Allow signal or escalation |
| 8 | Training pipeline | Trains model on quality-certified dataset | Trained model artefact with quality score in metadata |
| 9 | Inference-time validator | Validates online features at each inference request | Valid feature vector or fallback trigger |
| 10 | Prediction drift monitor | Continuously monitors prediction distributions | Drift alert or retraining trigger |
Error Flow
| Error Condition | Trigger | Response | Recovery |
| Source completeness failure >5% on key feature | Quality Gate 1 | Records quarantined; data owner alerted; pipeline paused | Data owner investigates source; missing data remediated or imputed per approved strategy |
| PSI >0.25 on critical feature (distribution shift) | AI Quality Check | Human review requested; training blocked | ML lead reviews distribution shift; determines if shift is real or artefact; approves or rejects training |
| Label error rate >5% | Cleanlab check | Training blocked; label owner alerted | Label review workflow; re-labelling of suspect records |
| Inference-time feature null rate >10% | Inference validator | Fallback to cached/default prediction; alert raised | Feature pipeline investigated; SLA breach review |
| Prediction distribution PSI >0.1 | Drift monitor | Retraining trigger or alert (per configured sensitivity) | Retraining cycle initiated; root cause (data vs. concept drift) investigated |
8. Security Considerations
Authentication & Authorisation
- Quality check pipelines run under service identity with read-only access to source data; no write-back to source systems.
- Quality scorecard store protected by role-based access; model training pipeline must present valid scorecard ID to proceed.
Secrets Management
- Database credentials for quality checks stored in secrets manager; rotated every 90 days.
- No credentials embedded in pipeline code or Great Expectations config files.
Data Classification
- Quarantine store inherits classification of source data; treated as Confidential minimum.
- Quality reports do not contain sample records — only aggregate statistics — to avoid PII exposure in quality artefacts.
Encryption
- Quarantine store encrypted at rest (AES-256); in transit TLS 1.3.
- Quality scorecards and reports encrypted at rest; access logged.
Auditability
- Every quality check execution logged with: dataset version, check type, result, timestamp, pipeline run ID.
- Quality gate decisions (pass/block/override) logged immutably; human review decisions captured with reviewer identity.
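One way to make gate decisions tamper-evident is to chain each log entry to the hash of the previous one. The sketch below is an assumption about how "logged immutably" could be approached at the application level, not a statement of how this pattern mandates it; in practice write-once object storage or a managed ledger may be used instead, and the record field names are illustrative.

```python
import hashlib
import json
from datetime import datetime, timezone

GENESIS_HASH = "0" * 64  # placeholder hash for the first entry in the chain


def _digest(body: dict) -> str:
    """Stable SHA-256 over the JSON form of an entry body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def append_entry(log: list[dict], record: dict) -> dict:
    """Append a gate decision, chaining it to the previous entry's hash."""
    prev_hash = log[-1]["entry_hash"] if log else GENESIS_HASH
    body = {
        **record,
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "prev_hash": prev_hash,
    }
    entry = {**body, "entry_hash": _digest(body)}
    log.append(entry)
    return entry


def verify_chain(log: list[dict]) -> bool:
    """Recompute every hash; an edited entry or one removed mid-chain fails."""
    prev_hash = GENESIS_HASH
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        if entry["prev_hash"] != prev_hash or entry["entry_hash"] != _digest(body):
            return False
        prev_hash = entry["entry_hash"]
    return True
```

Note that a chain alone does not detect truncation of the most recent entries; anchoring the latest hash in a separate system would be needed for that.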
OWASP LLM Top 10 Mapping
| OWASP LLM Risk | Relevance | Mitigation |
| LLM03: Training Data Poisoning | Malicious data injected into training pipeline degrades model | Source data integrity checks (hash verification); anomaly detection in quality checks |
| LLM04: Model Denial of Service | Malformed features at inference time cause model errors or crashes | Inference-time validator rejects malformed inputs before reaching model |
| LLM06: Sensitive Information Disclosure | PII in training data may leak via model memorisation | Quality pipeline enforces data minimisation; PII scanner before training |
| LLM02: Insecure Output Handling | Degraded model outputs consumed without validation | Prediction distribution monitoring; consumer alerts for quality degradation |
9. Governance Considerations
Responsible AI
- Representativeness checks surface demographic imbalance in training data, reducing the risk of introducing systematic bias.
- Quality scorecards are mandatory artefacts for high-risk AI impact assessments.
Model Risk Management
- Model risk management frameworks require training data quality attestation; Quality Scorecard provides this automatically.
- Quality score < 0.85 is an automatic risk flag requiring risk committee review before model deployment.
Human Approval Checkpoints
- Distribution shift alerts (PSI > 0.25) require human ML lead approval before training proceeds.
- Label quality failures require label owner review and sign-off before training.
- Quality score override (proceeding with score < 0.85) requires written justification from CDO-delegated authority.
Governance Artefacts
| Artefact | Owner | Cadence | Purpose |
| Quality Scorecard | Data Quality Platform (automated) | Per training dataset version | Machine-readable quality certificate; linked to model version |
| Quality Exception Log | Data Quality Owner | Per exception | Records human overrides with justification; retained 7 years |
| Drift Alert Log | ML Platform | Continuous | Record of prediction drift events; links to remediation action |
| Label Quality Report | Domain / Annotation Team | Per labelling cycle | Cleanlab output; inter-annotator agreement; label error rate |
| Quarantine Report | Data Engineering | Per pipeline run | Counts and reasons for quarantined records |
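The Label Quality Report lists inter-annotator agreement without naming a metric. Cohen's kappa is one common choice for two annotators, and the sketch below shows it as an assumption about what such a report might contain; other measures (for example Fleiss' kappa for more than two annotators) may suit a given labelling process better.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labelled independently at their own rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical class throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    annotator_1 = ["fraud", "ok", "ok", "fraud", "ok", "ok"]
    annotator_2 = ["fraud", "ok", "fraud", "fraud", "ok", "ok"]
    print(round(cohens_kappa(annotator_1, annotator_2), 3))
```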
10. Operational Considerations
Monitoring
| Metric | Alert Threshold | Tooling |
| Source data completeness per key feature | <95% | Great Expectations / Soda alerts |
| PSI per feature (distribution shift) | >0.1 warning; >0.25 block | Custom monitor + Grafana |
| Label error rate (supervised models) | >3% warning; >5% block | Cleanlab job output |
| Overall quality score | <0.90 warning; <0.85 block | Quality Scorecard Engine |
| Inference-time feature null rate | >2% | Inference service metrics |
| Prediction distribution PSI | >0.1 alert | Evidently / WhyLabs |
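The warning and block bands in the monitoring table can be expressed as a small declarative configuration. The sketch below is illustrative: the threshold values are taken from the table, while the metric keys, the `Threshold` class, and the `evaluate` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Threshold:
    warn: Optional[float]
    block: Optional[float]
    higher_is_worse: bool = True


# Values from the monitoring table; the metric keys are illustrative.
THRESHOLDS = {
    "feature_psi": Threshold(warn=0.1, block=0.25),
    "label_error_rate": Threshold(warn=0.03, block=0.05),
    "overall_quality_score": Threshold(warn=0.90, block=0.85, higher_is_worse=False),
}


def evaluate(metric: str, value: float) -> str:
    """Return 'block', 'warn' or 'ok' for a metric value."""
    t = THRESHOLDS[metric]

    def breached(limit: Optional[float]) -> bool:
        if limit is None:
            return False
        return value > limit if t.higher_is_worse else value < limit

    # Check the stricter band first so a blocking value is never reported as a warning.
    if breached(t.block):
        return "block"
    if breached(t.warn):
        return "warn"
    return "ok"
```

Keeping thresholds in data rather than in code makes the quarterly recalibration described under Failure Modes a configuration change rather than a deployment.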
SLOs
| SLO | Target | Measurement |
| Source quality check completion (batch) | <30 minutes for datasets up to 100GB | Pipeline execution time |
| Quality scorecard availability | 99.9% | Scorecard service uptime |
| Inference-time validation latency overhead | <5ms p99 | Inference service metrics |
| Drift detection latency (time from drift to alert) | <1 hour | Monitor execution frequency |
Logging
- All quality check results logged in structured JSON; retained 7 years for regulatory compliance.
- Quarantine records retained for 90 days minimum; extended for records involved in regulatory investigations.
Incident Management
- Quality gate block → P2 incident; data owner and ML lead notified within 15 minutes.
- Prediction drift alert → P2 incident; ML platform team investigates root cause.
- Label quality failure → P1 if the model is in production; investigate immediately whether the production model should be frozen.
Disaster Recovery
| Component | RTO | RPO | Strategy |
| Quality Scorecard Store | 4 hours | 1 hour | Database backup + restore; scores immutable once written |
| Drift Monitor | 2 hours | 1 hour | Stateless compute; redeploy from IaC |
| Quarantine Store | 8 hours | 24 hours | Cross-region object storage replication |
Capacity Planning
- Source quality checks run as batch jobs; scale Spark/Deequ executors based on daily data volume.
- Inference-time validation is in-process; minimal compute overhead (target <5ms).
- Drift monitors run on prediction sample (typically 1–10% of predictions); compute scales with sampling rate.
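Sampling predictions for drift monitoring can be made deterministic by hashing a stable identifier, so the same prediction is always in or out of the sample regardless of which replica handles it. This is a sketch under assumptions: the `prediction_id` field and the choice of SHA-256 are illustrative, not part of the pattern.

```python
import hashlib


def in_sample(prediction_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of predictions for drift monitoring."""
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(hashlib.sha256(prediction_id.encode()).hexdigest()[:8], 16) / 0x100000000
    return bucket < rate


if __name__ == "__main__":
    sampled = sum(in_sample(f"pred-{i}", 0.05) for i in range(10_000))
    print(f"sampled {sampled} of 10000 predictions")
```

Raising the rate only adds predictions to the sample, so history collected at a lower rate remains a subset of the new sample.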
11. Cost Considerations
Cost Drivers
| Cost Driver | Typical Range | Notes |
| Quality check compute (batch) | $300–$3,000/month | Spark/Deequ job costs; scales with data volume |
| Great Expectations / Soda licence | $0–$2,000/month | OSS vs. enterprise tier |
| Cleanlab (label quality) | $500–$5,000/month | Managed service pricing; or open-source self-hosted |
| Drift monitoring platform | $500–$5,000/month | Evidently OSS (free) vs. Arize/WhyLabs SaaS |
| Quality scorecard storage | $50–$300/month | Minimal; JSON artefacts in object store |
| Engineering operational cost | 0.5–1.5 FTE | Ongoing threshold tuning; remediation workflow management |
Scaling Risks
- Batch quality checks on very large datasets (>10TB) can become expensive; use sampling for distribution checks.
- Inference-time validation overhead grows with feature vector size and request volume; keep validation logic lightweight.
Optimisations
- Use statistical sampling for representativeness checks; full dataset scanning is rarely necessary.
- Cache quality check results for unchanged dataset partitions (incremental quality check pattern).
- Use open-source Evidently AI for drift monitoring rather than SaaS platforms for cost-sensitive deployments.
- Run Cleanlab label quality checks once per labelling cycle, not per training run.
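The incremental quality check optimisation can be sketched as a cache keyed by partition and content fingerprint. Everything here is illustrative: the function names are hypothetical, and hashing row contents is only one way to detect change. File checksums or object-store ETags would avoid re-reading the data and are likely preferable at scale.

```python
import hashlib
from typing import Any, Callable


def partition_fingerprint(rows: list[tuple]) -> str:
    """Content hash of a partition; sorted so row order does not affect it."""
    digest = hashlib.sha256()
    for row_repr in sorted(repr(r) for r in rows):
        digest.update(row_repr.encode() + b"\n")
    return digest.hexdigest()


def check_with_cache(
    partition_id: str,
    rows: list[tuple],
    run_check: Callable[[list[tuple]], Any],
    cache: dict,
) -> Any:
    """Run the quality check only if the partition content has changed."""
    fingerprint = partition_fingerprint(rows)
    cached = cache.get(partition_id)
    if cached is not None and cached["fingerprint"] == fingerprint:
        return cached["result"]  # unchanged partition: reuse the stored result
    result = run_check(rows)
    cache[partition_id] = {"fingerprint": fingerprint, "result": result}
    return result


if __name__ == "__main__":
    def null_rate(rows: list[tuple]) -> dict:
        return {"null_rate": sum(r[1] is None for r in rows) / len(rows)}

    cache: dict = {}
    rows = [(1, 10.0), (2, None)]
    print(check_with_cache("2025-03-01", rows, null_rate, cache))
    print(check_with_cache("2025-03-01", rows, null_rate, cache))  # served from cache
```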
Indicative Cost Range
| Scale | Monthly Cost | Basis |
| Small (1–3 models, <10GB/day) | $500–$3,000 | OSS stack (Great Expectations + Evidently + custom scorecard) |
| Medium (5–15 models, 100GB/day) | $3,000–$15,000 | Managed quality platform + Soda/Arize |
| Large (20+ models, 1TB+/day) | $15,000–$60,000 | Enterprise quality platform + full drift monitoring suite |
12. Trade-Off Analysis
Option Comparison
| Option | Pros | Cons | Recommended When |
| A: Full 6-dimension AI quality pipeline (this pattern) | Comprehensive; regulatory-grade; covers the AI quality failure modes identified in this pattern | High setup complexity; per-dimension threshold tuning required | Regulated industry; production AI with business-critical decisions |
| B: Source-only quality checks (Great Expectations at ingestion) | Simple; fast to implement; catches structural data issues | Misses AI-specific failures (leakage, representativeness, label quality) | Experimental/low-risk AI; no regulatory obligation |
| C: Post-deployment monitoring only | Catches problems after they surface; low pipeline overhead | Model quality has already degraded before detection; reactive not preventive | Only viable as a complement to A/B, not a replacement |
Architectural Tensions
| Tension | Trade-Off | Resolution |
| Thoroughness vs. pipeline latency | Full quality checks add hours to training cycle | Parallelise quality checks; use sampling for distribution tests |
| Strict gates vs. iteration speed | Hard quality gates block experimentation | Two-tier gates: hard gates for production; soft warning gates for experiments |
| Centralised vs. domain quality ownership | Central team can standardise but creates bottleneck | Domain-owned quality checks with central governance of thresholds |
| Statistical sensitivity vs. false positives | Sensitive thresholds catch real drift but also flag seasonal patterns | Tune PSI thresholds per feature; exclude known seasonal features from blocking gates |
13. Failure Modes
| Failure | Likelihood | Impact | Detection | Recovery |
| Data leakage not caught (future data in training) | Medium | Critical: model appears to perform well but fails in production | Temporal validity check in Stage 2 | Rebuild feature pipeline with correct time-window logic; retrain |
| Quality threshold set too high (blocks valid training) | Medium | Medium: unnecessary delays; team loses confidence in quality pipeline | Human review requests piling up | Review and recalibrate thresholds quarterly; track gate block rate |
| Quality threshold set too low (lets bad data through) | Low | High: biased or degraded model deployed | Post-deployment drift monitoring | Emergency retraining; threshold review |
| Quarantine store full (silent pass of bad records) | Low | High: quality check bypassed | Quarantine store capacity monitoring | Increase quarantine store capacity; alert on >80% usage |
| Label quality check not run (annotation skipped) | Medium | High: noisy labels degrade supervised model | Cleanlab job missing from pipeline run | Gate training on label quality check completion |
| Drift monitor false positive (unnecessary retraining) | Medium | Low-Medium: wasted compute; engineering distraction | Track retraining trigger rate vs. actual drift | Tune drift thresholds; add seasonal adjustment |
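The temporal validity check named as the detection for data leakage can be sketched as a window test on the timestamps of the events behind each time-windowed feature. The function name and the half-open window convention are assumptions, and the sketch presumes that event timestamps are retained through feature computation and that all datetimes share one timezone convention.

```python
from datetime import datetime, timedelta
from typing import Iterable


def window_violations(
    prediction_time: datetime,
    event_times: Iterable[datetime],
    window: timedelta,
) -> dict:
    """Find events outside the half-open window [prediction_time - window, prediction_time)."""
    events = list(event_times)
    start = prediction_time - window
    return {
        # Events at or after the prediction time are leakage from the future.
        "future": [t for t in events if t >= prediction_time],
        # Events before the window start indicate incorrect window logic.
        "stale": [t for t in events if t < start],
    }


if __name__ == "__main__":
    result = window_violations(
        prediction_time=datetime(2025, 3, 1),
        event_times=[datetime(2025, 2, 15), datetime(2025, 3, 1), datetime(2025, 1, 1)],
        window=timedelta(days=30),
    )
    print(result)
```

Any non-empty `future` list would be a hard failure for the training gate, since a single leaked event can inflate offline metrics.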
Cascading Failure Scenarios
- Silent distribution shift cascade: PSI threshold misconfigured → distribution shift not detected → model trained on shifted data → model deployed → prediction quality degrades → business decisions affected → detected only at quarterly review.
- Quality gate outage cascade: Quality gate controller service down → pipeline bypasses quality checks → bad data reaches training → biased model trained → deployed to production → regulatory audit finds no quality attestation for model.
14. Regulatory Considerations
| Regulation | Article/Clause | Requirement | Pattern Response |
| EU AI Act | Article 10(3) | Training data must be relevant, sufficiently representative and, to the best extent possible, free of errors and complete | Completeness and accuracy checks enforced at Stage 1; representativeness checks (PSI/KS) enforced at Stage 2 |
| EU AI Act | Article 10(2)(f) | Examine data for possible biases | Demographic distribution checks in representativeness dimension |
| EU AI Act | Article 12 | Record-keeping for high-risk AI | Quality Scorecard stored with training dataset version; retained 7 years |
| APRA CPS 234 | §32 | Data integrity and accuracy controls | Source quality gate enforces integrity; audit log of quality check results |
| Privacy Act (Australia) | APP 11 | Security of personal information | Quarantine store classification; PII minimisation in quality artefacts |
| ISO 42001 | §8.4 | Data quality management for AI | Six-dimension quality framework maps to ISO 42001 data quality requirements |
| NIST AI RMF | MAP-2.3 | Scientific validity; data fitness for purpose | Representativeness and temporal validity checks directly address scientific validity |
15. Reference Implementations
AWS
| Component | AWS Service |
| Source quality checks | AWS Glue Data Quality + Great Expectations on Glue |
| Feature computation | AWS Glue / EMR |
| AI quality checks | Custom Python on SageMaker Processing Jobs |
| Label quality | Cleanlab on SageMaker Processing Jobs |
| Quality scorecard | DynamoDB + S3 |
| Quality gate controller | Step Functions |
| Drift monitoring | SageMaker Model Monitor |
| Quarantine store | S3 with S3 Object Lock |
Azure
| Component | Azure Service |
| Source quality checks | Azure Data Factory data validation + Great Expectations |
| Feature computation | Azure Databricks |
| AI quality checks | Custom Python on Azure ML Compute |
| Quality scorecard | Azure Cosmos DB + ADLS |
| Quality gate controller | Azure ML Pipelines conditions |
| Drift monitoring | Azure ML Data Drift monitor |
GCP
| Component | GCP Service |
| Source quality checks | Dataplex data quality + Deequ on Dataproc |
| Feature computation | Cloud Dataflow |
| AI quality checks | Custom Python on Vertex AI Custom Jobs |
| Quality scorecard | Firestore + GCS |
| Quality gate controller | Vertex AI Pipelines |
| Drift monitoring | Vertex AI Model Monitoring |
On-Premises
| Component | Technology |
| Source quality checks | Great Expectations on Kubernetes |
| Feature computation | Apache Spark |
| AI quality checks | Custom Python + Cleanlab on Kubernetes |
| Quality scorecard | PostgreSQL + MinIO |
| Quality gate controller | Apache Airflow |
| Drift monitoring | Evidently AI (self-hosted) |
16. Related Patterns
| Pattern | ID | Relationship | Notes |
| AI Data Mesh Integration | EAAPL-DAT001 | Depends on | Quality gates are enforced within data product contracts |
| Data Lineage for AI | EAAPL-DAT003 | Complements | Lineage needed to trace quality failures to source |
| AI Training Data Governance | EAAPL-DAT007 | Depends on | Quality scorecard is a governance artefact in training data governance |
| Real-Time Feature Engineering | EAAPL-DAT008 | Complements | Inference-time validation is a component of real-time feature serving |
| Active Learning Loop | EAAPL-HIL002 | Complements | Label quality improvements feed back via active learning |
| Model Rollback | EAAPL-MDL004 | Triggers | Severe quality failures may trigger model rollback |
17. Maturity Assessment
Overall Maturity: Proven — Source quality frameworks (Great Expectations, Deequ) are mature. AI-specific quality dimensions (representativeness, label quality) are emerging as industry standard practice, supported by Cleanlab and Evidently.
| Dimension | Score (1–5) | Notes |
| Architectural clarity | 5 | Well-defined pipeline stages and gate logic |
| Tooling maturity | 4 | Source quality tools mature; AI-specific tools (Cleanlab) still maturing |
| Regulatory alignment | 5 | Strong EU AI Act Art. 10 and APRA CPS 234 alignment |
| Operational complexity | 3 | Threshold tuning and remediation workflow require ongoing attention |
| Cost efficiency | 4 | OSS stack is cost-effective; enterprise platforms add cost |
| Security | 4 | Good controls defined; PII-safe quality reporting |
18. Revision History
| Version | Date | Author | Changes |
| 1.0 | 2023-09-01 | EAAPL Working Group | Initial publication |
| 1.1 | 2024-02-15 | EAAPL Working Group | Added label quality dimension; Cleanlab integration |
| 1.2 | 2024-08-01 | EAAPL Working Group | Added EU AI Act Article 10 alignment; inference-time validation |
| 1.3 | 2025-03-01 | EAAPL Working Group | Updated drift monitoring options; expanded failure modes |