EAAPL-OBS005 · Model Drift Detection

📊 Observability & Monitoring · 🏭 Field-tested in AU

Pattern ID: EAAPL-OBS005
Status: Proven
Complexity: High
Tags: observability, model-risk, alerting, slo, high-complexity
Version: 1.0.0
Last Reviewed: 2026-06-12


1. Executive Summary

AI models degrade silently. Unlike a server that crashes with a clear error, a drifting model continues returning HTTP 200 responses while its outputs become progressively less accurate, more biased, or less relevant. Data drift — changes in the statistical distribution of inputs — and concept drift — changes in the relationship between inputs and desired outputs — are the two primary mechanisms. Without continuous drift monitoring, organisations discover model degradation through business metric decline, customer complaints, or regulatory findings — all lagging indicators that allow harm to accumulate.

This pattern defines a continuous monitoring system for statistical drift in model inputs and outputs across production AI deployments. It covers:

  • Data drift detection using Kolmogorov-Smirnov tests for continuous features, chi-squared tests for categorical features, and Population Stability Index (PSI) for combined assessment
  • Concept drift detection through output distribution monitoring, accuracy on labeled holdout sets, and Jensen-Shannon divergence
  • Reference dataset management with versioning and seasonal adjustment
  • Drift severity classification (warning/alert/critical)
  • Automated retraining triggers on critical drift
  • Visualisation dashboards for per-feature drift over time
  • Integration with model registries
  • The critical distinction between benign drift (seasonal patterns, legitimate distribution shift) and harmful drift (data quality degradation, adversarial shift, world-state change invalidating model assumptions)

Target Audience: CIO, CTO, Chief Risk Officer, Head of AI/ML Engineering, Model Risk Manager
Time to Implement: 8–14 weeks


2. Problem Statement

Business Problem

Organisations deploy AI models and assume they will continue performing as they did in testing. They don't. The world changes, user behaviour changes, data pipelines evolve, and model performance degrades. Most organisations have no systematic mechanism for detecting this until a material business event forces attention — a regulatory finding, a wave of customer complaints, an unexpected decline in conversion or retention. At that point, the degradation may have been occurring for months.

Technical Problem

Drift detection requires statistical comparison of production data distributions against reference baselines, at scale, in near-real-time. The challenge is multi-dimensional: many ML models have hundreds or thousands of features; each requires its own statistical test; feature drift does not always imply output degradation; and the relationship between measured drift and model performance impact is non-linear. Additionally, distinguishing harmful drift from legitimate distribution changes (new product launches, seasonal patterns, geographic expansion) requires both statistical and domain knowledge.

Symptoms

  • Model deployed in January performs well; by June, customer satisfaction with AI features has quietly declined
  • Business metrics (task completion rate, recommendation click-through) declining with no engineering change attributed
  • RAG retrieval quality degraded because the vector index was built on stale embeddings but no monitoring detected this
  • Model retrained annually on schedule, not triggered by evidence of performance degradation
  • Regulatory review reveals model was trained on data no longer representative of current customer base

Cost of Inaction

  • Average enterprise AI model degrades measurably within 6 months of deployment without monitoring (Gartner 2024)
  • Regulatory findings for material models lacking performance monitoring (APRA CPG 234, EU AI Act Article 9)
  • Silent accuracy regression in credit scoring, fraud detection, or clinical triage has direct financial and safety consequences
  • Unnecessary scheduled retraining (without drift evidence) wastes compute and introduces regression risk from needless model changes

3. Context

When to Apply

  • Any production ML model with a defined performance baseline and ongoing inference traffic
  • AI systems where input distributions are expected to be stable (any significant change is an anomaly)
  • Models used for regulated decisions (credit, fraud, clinical, underwriting) requiring ongoing performance evidence
  • RAG systems where retrieval quality depends on embedding models that may become stale
  • Prerequisite: EAAPL-OBS001 provides the input/output data stream required for drift computation

When NOT to Apply

  • One-off batch models with no ongoing deployment
  • Purely generative tasks (creative writing) where output distribution monitoring is not meaningful
  • Models retrained continuously (online learning) where the model itself is always adapting — drift is by design

Prerequisites

| Prerequisite | Required | Notes |
|---|---|---|
| EAAPL-OBS001 AI Telemetry Infrastructure | Required | Input feature logging and output logging required |
| Reference baseline dataset (labeled) | Required | Drift comparison requires a reference distribution |
| Model registry with versioned metadata | Required | Drift events must link to model versions |
| Statistical compute runtime | Required | Python scipy; PySpark for high-volume feature sets |
| Model performance ground truth mechanism | Strongly Recommended | Without labels, concept drift detection is indirect |

Industry Applicability

| Industry | Applicability | Primary Driver |
|---|---|---|
| Financial Services | Critical | APRA CPG 234, ASIC model risk, credit/fraud model degradation |
| Healthcare | Critical | Clinical model safety obligation; performance monitoring mandatory |
| Insurance | Critical | Underwriting model accuracy directly impacts financial outcomes |
| Retail / E-Commerce | High | Recommendation and personalisation models degrade with catalogue changes |
| Government | High | Decision-support models require ongoing performance evidence |
| Technology / SaaS | High | RAG freshness; NLP model drift with language evolution |

4. Architecture Overview

The Model Drift Detection Architecture is a statistical monitoring system that operates asynchronously on the production inference data stream. It is composed of five functional layers: data capture, reference management, statistical analysis, severity classification, and action triggering.

Data Capture Layer

Every AI inference request logs the input features and the model output. For structured ML models, input features are captured in the telemetry log. For LLM and RAG systems, proxies are used: prompt token count, query type classification (derived from prompt metadata), retrieved document distribution, and output characteristics (length, entropy, sentiment score). The data capture layer feeds both real-time streaming analysis (for rapid drift detection on output distributions) and batch statistical analysis (for feature-level drift computation, which requires sufficient data volume for statistical power).
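The record shape below is a minimal sketch of what the capture layer might emit. The field names follow the data flow table (modelId, modelVersion, features, output, timestamp); the `FeatureLogRecord` class, the `llm_proxy_features` helper, and the whitespace-based token count are illustrative assumptions, not part of the pattern.

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import Any


@dataclass(frozen=True)
class FeatureLogRecord:
    """One inference event as written to the feature log store (illustrative)."""
    model_id: str
    model_version: str
    features: dict[str, Any]
    output: Any
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json_line(self) -> str:
        # One JSON object per line keeps the log append-only and easy to batch-load.
        return json.dumps(asdict(self), sort_keys=True)


def llm_proxy_features(prompt: str, completion: str, retrieved_doc_ids: list[str]) -> dict[str, Any]:
    """Proxy features for LLM/RAG systems that have no structured feature vector.

    Whitespace splitting stands in for a real tokenizer (assumption).
    """
    return {
        "prompt_token_count": len(prompt.split()),
        "output_length": len(completion.split()),
        "retrieved_doc_count": len(retrieved_doc_ids),
    }


record = FeatureLogRecord(
    model_id="credit-scoring",          # hypothetical model name
    model_version="3.2.0",
    features={"income": 72000.0, "employment_type": "permanent"},
    output={"score": 0.81},
)
line = record.to_json_line()
```

Writing the record off the request thread (for example to a queue) keeps logging out of the inference latency path.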

Reference Dataset Management

The reference dataset defines the expected distribution. It is not a static artifact — it must be managed actively. The reference is versioned: each model version has its own reference distribution. The reference is updated when a new model version is deployed (new baseline from evaluation data) or when a known distribution shift occurs (e.g., product launch changing customer demographic) and the shift is deemed legitimate. Reference datasets are stored in the model registry alongside model artifacts. A reference management API allows data scientists to approve reference updates; unapproved reference changes are blocked and alerted.
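The approval rule can be sketched as a small in-memory registry. This is an illustration only: `ReferenceRegistry`, `ReferenceVersion`, and their method names are hypothetical, and a real implementation would persist to the model registry and write an audit record.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReferenceVersion:
    model_version: str
    reference_id: str
    requested_by: str
    rationale: str
    approved_by: Optional[str] = None


class ReferenceRegistry:
    """In-memory stand-in for the reference dataset store (assumption)."""

    def __init__(self) -> None:
        self._pending: dict[str, ReferenceVersion] = {}
        self._active: dict[str, ReferenceVersion] = {}  # keyed by model version

    def propose(self, ref: ReferenceVersion) -> None:
        # A proposed reference is not used for drift analysis until approved.
        self._pending[ref.reference_id] = ref

    def approve(self, reference_id: str, approver: str) -> ReferenceVersion:
        ref = self._pending.pop(reference_id)
        ref.approved_by = approver
        self._active[ref.model_version] = ref
        return ref

    def active_for(self, model_version: str) -> ReferenceVersion:
        # Drift jobs call this; a missing reference is an error, not a silent skip.
        if model_version not in self._active:
            raise LookupError(f"no approved reference for model version {model_version}")
        return self._active[model_version]


registry = ReferenceRegistry()
registry.propose(ReferenceVersion("3.2.0", "ref-001", "alice", "baseline from evaluation data"))
```

Until `approve` is called, `active_for("3.2.0")` raises, which corresponds to the "missing reference" error scenario in the data flow section.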

Statistical Analysis: Data Drift

Data drift is detected at the feature level. For each numerical feature, the Kolmogorov-Smirnov (KS) test compares the current production distribution against the reference distribution. KS test statistic and p-value are computed. A p-value < 0.05 with KS statistic > 0.10 indicates statistically significant drift. For categorical features, the chi-squared test compares observed vs. expected frequency distributions. For composite drift assessment, the Population Stability Index (PSI) is computed per feature: PSI = sum over bins of (actual% - expected%) × ln(actual% / expected%). PSI < 0.10 is stable (no concern), 0.10–0.25 is moderate drift (warning), > 0.25 is significant drift (alert). The overall drift index aggregates per-feature PSI scores weighted by feature importance.
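The sketch below works through the PSI formula and the KS decision rule using only the Python standard library, so it runs without dependencies. In practice `scipy.stats.ks_2samp` and `scipy.stats.chisquare` would usually replace the hand-written statistics. The p-value here is the common asymptotic approximation, the `eps` floor for empty bins is an assumption, and the weighted mean in `weighted_drift_index` is one plausible reading of "weighted by feature importance".

```python
import math
from bisect import bisect_right


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index over pre-binned counts (same bin edges for both)."""
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        # Floor empty bins at eps so the logarithm stays defined (assumption).
        e_pct = max(e / e_total, eps)
        a_pct = max(a / a_total, eps)
        total += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return total


def ks_statistic(reference, current):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    return max(
        abs(bisect_right(ref, x) / len(ref) - bisect_right(cur, x) / len(cur))
        for x in ref + cur
    )


def ks_p_value(d, n, m, terms=100):
    """Asymptotic p-value from the Kolmogorov distribution (approximation)."""
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.3:
        return 1.0  # the series converges slowly here and the true value is close to 1
    series = sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2) for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * series))


def numeric_feature_drifted(reference, current, alpha=0.05, min_statistic=0.10):
    """Apply the rule from the text: p-value < 0.05 AND KS statistic > 0.10."""
    d = ks_statistic(reference, current)
    p = ks_p_value(d, len(reference), len(current))
    return (p < alpha and d > min_statistic), d, p


def weighted_drift_index(psi_by_feature, importance):
    """Importance-weighted mean of per-feature PSI (normalisation is an assumption)."""
    weight_sum = sum(importance[f] for f in psi_by_feature)
    return sum(psi_by_feature[f] * importance[f] for f in psi_by_feature) / weight_sum


reference = list(range(100))
current = [x + 30 for x in reference]           # same shape, shifted by 30
drifted, d_stat, p_val = numeric_feature_drifted(reference, current)

income_psi = psi([25, 25, 25, 25], [10, 20, 30, 40])   # falls in the 0.10-0.25 band
overall = weighted_drift_index(
    {"income": income_psi, "age": 0.02},        # hypothetical feature names
    {"income": 0.8, "age": 0.2},
)
```

Requiring both a small p-value and a minimum statistic matters at production volumes, where even negligible shifts become statistically significant.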

Statistical Analysis: Concept Drift

Concept drift, a change in the relationship between inputs and the desired output, is harder to detect without labels. Three complementary approaches are used:

  • Output distribution monitoring: track the distribution of model outputs (predicted class distribution for classifiers; output length and vocabulary distribution for LLMs) and compute the Jensen-Shannon divergence between current and reference output distributions. Significant shifts in output distribution without corresponding input drift suggest concept drift.
  • Accuracy on labeled holdout: a holdout set with human-labeled ground truth is evaluated periodically against the current model. An unchanged model scores the same on an unchanged holdout, so the set needs regular additions of recently labeled production samples for this signal to move. Declining accuracy on those recent labels, while the input distribution is stable, indicates concept drift.
  • Error rate trend monitoring: for models with feedback mechanisms, track error rate (user corrections, thumbs down, escalations) as a proxy for accuracy.
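As an illustration of the output-distribution check, the following standard-library sketch computes Jensen-Shannon divergence between two sets of class counts. Base-2 logarithms are an assumption that bounds the result to [0, 1]. Note that `scipy.spatial.distance.jensenshannon` returns the square root of this quantity (the JS distance), so thresholds are not interchangeable between the two.

```python
import math


def _normalise(counts):
    total = sum(counts)
    return [c / total for c in counts]


def _kl(p, q):
    # Terms with p_i == 0 contribute nothing; q_i > 0 wherever p_i > 0 because q is a mixture.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(reference_counts, current_counts):
    """Jensen-Shannon divergence (base 2) between two count vectors over the same classes."""
    p, q = _normalise(reference_counts), _normalise(current_counts)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)


# Predicted-class counts: reference window vs. current window (made-up numbers).
reference_outputs = [700, 200, 100]
current_outputs = [500, 300, 200]
jsd = js_divergence(reference_outputs, current_outputs)
```

The severity rules refer to an increase in JS divergence, so a deployment would compare `jsd` against the value normally observed between two healthy windows rather than against zero.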

Drift Severity Classification

  • Warning: PSI 0.10–0.25 on one or more features, or JS divergence increase of 0.05–0.10 on output. No immediate action; increased monitoring frequency; notify ML engineer.
  • Alert: PSI > 0.25 on important features, or JS divergence increase > 0.10, or holdout accuracy drop > 5%. Schedule retraining review within 2 weeks.
  • Critical: PSI > 0.50, or holdout accuracy drop > 10%, or error rate 2x baseline. Trigger automated retraining pipeline immediately; page ML engineer on-call; notify model risk manager.
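These thresholds translate into a short rule function. The sketch below is one possible encoding: the `DriftSignals` fields are hypothetical names, accuracy drop is treated as an absolute fraction (0.05 meaning five percentage points), and a PSI above 0.25 on a feature not flagged as important falls back to warning. Each of these is an assumption the text leaves open.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    STABLE = 0
    WARNING = 1
    ALERT = 2
    CRITICAL = 3


@dataclass
class DriftSignals:
    max_psi: float                 # highest PSI across all features
    max_important_psi: float       # highest PSI among features flagged important
    jsd_increase: float            # rise in output JS divergence over its baseline
    holdout_accuracy_drop: float   # absolute drop, e.g. 0.06 = six points (assumption)
    error_rate_ratio: float        # current error rate / baseline error rate


def classify(s: DriftSignals) -> Severity:
    # Check the most severe tier first so the highest matching level wins.
    if s.max_psi > 0.50 or s.holdout_accuracy_drop > 0.10 or s.error_rate_ratio >= 2.0:
        return Severity.CRITICAL
    if s.max_important_psi > 0.25 or s.jsd_increase > 0.10 or s.holdout_accuracy_drop > 0.05:
        return Severity.ALERT
    if s.max_psi >= 0.10 or s.jsd_increase >= 0.05:
        return Severity.WARNING
    return Severity.STABLE


example = classify(DriftSignals(
    max_psi=0.31, max_important_psi=0.31,
    jsd_increase=0.02, holdout_accuracy_drop=0.01, error_rate_ratio=1.1,
))
```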

Automated Retraining Trigger

Critical drift triggers the automated retraining pipeline. The trigger event is published to the model registry, which kicks off the organisation's standard model retraining workflow. The retraining pipeline uses the current production data (within the retention window) as training data, trains a new model version, evaluates against the holdout set, and if quality improves or is maintained, submits for deployment review. Human sign-off is required before the new model version is promoted to production.
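The two gates described here (holdout quality, then human sign-off) can be sketched as follows. `CandidateModel`, the `tolerance` parameter, and the function names are hypothetical; the only rules taken from the text are that quality must improve or be maintained and that promotion needs a human approver.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CandidateModel:
    version: str
    holdout_accuracy: float
    approved_by: Optional[str] = None   # set only by a human reviewer


def passes_holdout_gate(candidate: CandidateModel, production_accuracy: float,
                        tolerance: float = 0.0) -> bool:
    """Quality must improve or be maintained relative to the production model."""
    return candidate.holdout_accuracy >= production_accuracy - tolerance


def can_promote(candidate: CandidateModel, production_accuracy: float) -> bool:
    """Promotion needs both the automated gate and a recorded human approval."""
    return passes_holdout_gate(candidate, production_accuracy) and candidate.approved_by is not None


candidate = CandidateModel(version="3.3.0", holdout_accuracy=0.91)
ready_for_review = passes_holdout_gate(candidate, production_accuracy=0.89)
promotable_before_signoff = can_promote(candidate, production_accuracy=0.89)
```

Keeping approval as data on the candidate, rather than a flag on the pipeline, makes the sign-off auditable alongside the model version.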

Benign vs. Harmful Drift Classification

Not all drift is harmful. Seasonal patterns (retail models drifting at Christmas), known distribution shifts from product changes (new customer segment acquired), or deliberate training data diversity expansion are benign. The benign drift classifier consults a calendar of known events (product launches, campaigns, data pipeline changes) and applies a rule: if drift onset correlates with a known event within a 3-day window, classify as potentially benign and route to ML engineer review rather than auto-triggering retraining.
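A minimal sketch of the calendar rule follows. It reads "within a 3-day window" as three days either side of the event, which is an assumption; `KnownEvent` and `classify_onset` are illustrative names, and a match only routes the case to review rather than clearing it.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class KnownEvent:
    name: str
    day: date


def classify_onset(onset: date, calendar: list[KnownEvent],
                   window_days: int = 3) -> tuple[str, Optional[KnownEvent]]:
    """Return ("potentially_benign", matched_event) or ("unknown", None)."""
    candidates = [e for e in calendar if abs((onset - e.day).days) <= window_days]
    if not candidates:
        return "unknown", None
    # Record the closest event so the classification log can cite the matched entry.
    nearest = min(candidates, key=lambda e: abs((onset - e.day).days))
    return "potentially_benign", nearest


calendar = [
    KnownEvent("product launch", date(2026, 3, 10)),     # made-up entries
    KnownEvent("pipeline migration", date(2026, 5, 1)),
]
label, matched = classify_onset(date(2026, 3, 12), calendar)
```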


5. Architecture Diagram

flowchart TD
    subgraph Capture["Data Capture"]
        A[Production Inference]
        B[(Feature Log Store)]
        C[Reference Dataset]
    end
    subgraph Analysis["Drift Analysis"]
        D[Data Drift Tests]
        E[Concept Drift Monitor]
        F[Benign Drift Classifier]
    end
    subgraph Action["Severity and Action"]
        G{Severity Classifier}
        H[Retraining Pipeline]
    end
    A --> B
    B --> D
    C --> D
    A --> E
    D --> F
    E --> F
    F --> G
    G -->|warning| I[Dashboard Alert]
    G -->|critical| H
    H --> J[Human Sign-Off]
    J -->|approved| C
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#fef9c3,stroke:#eab308
    style C fill:#fef9c3,stroke:#eab308
    style D fill:#f0fdf4,stroke:#22c55e
    style E fill:#f0fdf4,stroke:#22c55e
    style F fill:#f0fdf4,stroke:#22c55e
    style G fill:#f3e8ff,stroke:#a855f7
    style H fill:#fee2e2,stroke:#ef4444
    style I fill:#d1fae5,stroke:#10b981
    style J fill:#d1fae5,stroke:#10b981

6. Components

| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| Feature Logger | SDK / Sidecar | Capture input features and output at inference time; forward to feature log store | Custom wrapper; Arize AI; WhyLabs; Evidently AI agent | Critical |
| Feature Log Store | Storage | Time-series storage of production inference features | ClickHouse, BigQuery, Apache Hudi/Delta Lake on S3 | Critical |
| KS Test Processor | Batch Job | Kolmogorov-Smirnov test on numerical feature distributions | Python scipy.stats; PySpark on Databricks/Glue | High |
| Chi-Squared Processor | Batch Job | Chi-squared test on categorical feature distributions | Python scipy.stats; PySpark | High |
| PSI Calculator | Batch Job | Population Stability Index per feature and aggregate | Python/Spark; Evidently AI; WhyLabs | High |
| Jensen-Shannon Divergence Engine | Streaming + Batch | JS divergence on output distributions vs. reference | Python scipy; Flink streaming | High |
| Accuracy Monitor | Batch Job | Evaluate current model on fixed holdout set periodically | MLflow evaluation; custom Python script | High |
| Reference Dataset Store | Storage | Versioned reference distributions per model version | S3/GCS/Azure Blob + DVC; MLflow artifacts | Critical |
| Benign Drift Classifier | Service | Correlate drift onset with known events calendar | Custom rule engine + event calendar API | Medium |
| Retraining Pipeline Trigger | Integration | Publish critical drift event to retraining workflow | Airflow/Prefect sensor; MLflow webhook; Kubeflow trigger | High |
| Drift Dashboard | UI | Per-feature drift over time; severity summary; trend | Grafana; Evidently AI UI; WhyLabs; custom React app | Medium |
| Model Registry Integration | Integration | Link drift events to model versions; trigger review workflow | MLflow, SageMaker Model Registry, Vertex AI Model Registry | High |

7. Data Flow

Primary Flow

| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | Feature Logger | Captures input features and model output at every inference call | Feature log record with modelId, modelVersion, features{}, output, timestamp |
| 2 | Feature Log Store | Ingests and indexes feature records; enables time-window queries | Queryable time-series feature data |
| 3 | KS / Chi-Squared / PSI Processors | Run hourly batch analysis on 1-hour window vs. reference distribution | Per-feature drift scores; PSI values; test statistics and p-values |
| 4 | JSD Engine | Computes JS divergence on output distribution in rolling window | Output distribution divergence score |
| 5 | Accuracy Monitor | Evaluates model on holdout set (daily or per-deployment) | Accuracy, F1, or task-specific quality metrics |
| 6 | Severity Classifier | Applies severity rules to aggregate drift signals | Severity label: stable / warning / alert / critical |
| 7 | Benign Drift Classifier | Checks drift onset against known events calendar | Benign / unknown classification with rationale |
| 8 | Action Router | Routes by severity and benign classification to appropriate action | Alert, scheduled review, or retraining trigger |
| 9 | Retraining Pipeline (if critical) | Initiates model retraining on recent production data | New model version submitted for review |
| 10 | Human Reviewer | Reviews new model version quality; approves or rejects promotion | Approval decision; model registry updated |

Error Flow

| Error Scenario | Detection | Action | Recovery |
|---|---|---|---|
| Feature log store query times out | Batch job failure alert; lag metric | Alert ML platform; skip batch; run catch-up | Investigate store performance; run catch-up analysis |
| Reference distribution missing for new model version | Drift job raises missing reference error | Alert to ML engineer; skip drift computation until reference set | ML engineer creates reference on model deployment |
| Benign drift classifier incorrectly clears harmful drift | Accuracy holdout detects concurrent quality decline | Accuracy alert overrides benign classification; escalate | Investigate; tune benign classifier; enforce dual confirmation |
| Retraining pipeline fails | Pipeline failure alert; Airflow/Prefect failure | Alert ML engineer; manual retraining trigger | Fix pipeline; retry; monitor new model version |
| Holdout set becomes stale (labels no longer representative) | Holdout accuracy diverges from production feedback | Alert to ML team; schedule holdout refresh | Refresh holdout with new labels |
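The routing step and the accuracy override from the error flow can be combined in one small function. The sketch below is an assumption about how the Action Router might encode them; the action names and the `route` signature are hypothetical.

```python
def route(severity: str, potentially_benign: bool, accuracy_declining: bool) -> str:
    """Pick an action from the severity label and the benign classification.

    A concurrent accuracy decline overrides a benign classification, so drift
    that coincides with a known event is still actioned when quality is falling.
    """
    if potentially_benign and accuracy_declining:
        potentially_benign = False
    if severity == "critical":
        # Benign-looking critical drift goes to a human instead of auto-retraining.
        return "engineer_review" if potentially_benign else "trigger_retraining"
    if severity == "alert":
        return "schedule_retraining_review"
    if severity == "warning":
        return "notify_ml_engineer"
    return "no_action"


action = route("critical", potentially_benign=True, accuracy_declining=True)
```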

8. Security Considerations

Authentication: Feature log store access requires service authentication. Drift analysis jobs authenticate via service accounts. Reference dataset store access is write-restricted to approved ML engineers; reads are available to drift analysis services.

Authorisation: Feature log data may contain sensitive model inputs (e.g., credit application features, health data proxies). Access to feature logs is restricted to ML engineers and data scientists with specific model ownership. Audit log of all accesses.

Secrets Management: Cloud storage credentials for feature log store and reference dataset store in secrets manager. Retraining pipeline trigger credentials rotated quarterly.

Data Classification: Feature logs are classified at the level of the most sensitive input feature (often Confidential for financial or health models). Reference datasets are classified as Internal. Drift event records are classified as Internal.

Encryption: Feature log data encrypted at rest (AES-256) and in transit (TLS 1.3). Reference dataset store encrypted. Long-term retention of feature logs for regulatory audit requires customer-managed encryption keys.

Auditability: Every reference dataset update is audited with requester, approver, timestamp, and rationale. Every retraining trigger event is logged immutably. Benign drift classifications are logged with the event calendar entry they matched.

OWASP LLM Top 10 Coverage

| OWASP LLM Risk | Drift Detection Control | Implementation |
|---|---|---|
| LLM01 Prompt Injection | Prompt length and structure distribution drift detects systematic injection patterns | Distribution shift in prompt structure is drift signal |
| LLM02 Insecure Output Handling | Output distribution monitoring detects systematic output changes from injection | JSD alert if output distribution shifts toward unsafe patterns |
| LLM03 Training Data Poisoning | Feature distribution drift may indicate poisoned training affecting production distribution | Input drift concurrent with accuracy decline = poisoning signal |
| LLM04 Model Denial of Service | Token usage distribution drift detects abusive usage patterns | Token count distribution shift = anomaly signal |
| LLM05 Supply Chain Vulnerabilities | Unexpected model version in registry triggers investigation | Model version audit in drift monitoring |
| LLM06 Sensitive Information Disclosure | Input feature drift monitoring may detect feature set changes that introduce PII | Alert on new feature categories appearing in feature distribution |
| LLM07 Insecure Plugin Design | Tool call distribution monitoring detects shifts in tool usage patterns | Tool call frequency is a monitored distribution |
| LLM08 Excessive Agency | Agent action distribution drift detects scope expansion | Output action type distribution monitored |
| LLM09 Overreliance | Accuracy monitoring surfaces model quality degradation before users over-rely on degraded outputs | Accuracy SLO directly measures overreliance risk |
| LLM10 Model Theft | Unusual output volume distribution shift may indicate bulk extraction | Output volume distribution is a monitored signal |

9. Governance Considerations

Responsible AI: Drift monitoring is the technical implementation of the principle that AI systems must perform as intended over their operational lifecycle, not only at deployment. Governance frameworks must mandate drift monitoring as a condition of continued production deployment for material models.

Model Risk Management: The drift event history is a key model risk management artefact. Material models must have a documented drift monitoring configuration reviewed by model risk. Critical drift events are Key Risk Indicator (KRI) breaches reported to the model risk committee.

Human Approval: All retraining decisions triggered by drift require human approval before deployment. The retraining pipeline produces a candidate model; an ML engineer and model risk reviewer approve promotion. Automated promotion without human review is not permitted for material models.

Policy: The model drift monitoring policy must define: which models require drift monitoring (materiality threshold), required monitoring frequency, reference dataset update criteria and approval process, drift severity thresholds, retraining trigger criteria, and escalation requirements for critical drift.

Traceability: Every drift event is linked to the model version, the reference dataset version, the statistical test result, and the action taken. This chain supports model risk management audit trails and regulatory evidence production.

Governance Artefacts

| Artefact | Owner | Frequency | Format |
|---|---|---|---|
| Model Drift Monitoring Register | Model Risk Manager | Per model, updated on drift events | Registry with model, monitoring config, last assessment |
| Drift Event Log | ML Platform | Continuous | Immutable event store |
| Reference Dataset Approval Record | ML Engineering + Model Risk | Per update | Signed approval with rationale |
| Critical Drift KRI Report | Model Risk Manager | Monthly | Dashboard export + risk committee briefing |
| Retraining Decision Record | ML Engineering + Model Risk | Per retraining trigger | Signed decision with drift evidence and new model evaluation |
| Benign Drift Classification Log | ML Engineering | Per benign classification | Event log with matched calendar entry and rationale |

10. Operational Considerations

Monitoring: The drift detection system is itself monitored. Batch job completion, processing lag, reference dataset freshness, and holdout evaluation frequency are all tracked. A drift detection system that hasn't run in 48 hours is as dangerous as a smoke alarm with a dead battery.

Logging: Drift analysis job logs stored separately from AI application logs. Drift event records are immutable.

Incident Response: Critical drift triggers the AI incident management process (EAAPL-OBS004) with a P1 quality incident. ML engineer on-call is paged. The retraining trigger is a parallel action, not a substitute for incident response — the current model may need to be limited or disabled while retraining proceeds.

Disaster Recovery: Drift detection is not in the critical inference path. A 4-hour outage of the drift detection system is acceptable. The risk is undetected drift during the outage window. Batch jobs can run catch-up analysis when the system recovers.

Capacity Planning: Feature log storage grows with inference volume. For a model with 100 features and 1M daily requests, each log record is approximately 1–5KB; total daily storage is 1–5GB. Plan for 90-day retention in hot storage and 2-year retention in warm storage for regulatory audit.
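The storage arithmetic can be checked with a few lines. This sketch assumes decimal units (1 GB = 1,000,000 KB), uncompressed records, and 730 days for the 2-year tier; the function names are illustrative.

```python
def daily_storage_gb(daily_requests: int, record_kb: float) -> float:
    """Raw feature log volume per day, in decimal gigabytes."""
    return daily_requests * record_kb / 1_000_000


def retained_gb(daily_gb: float, retention_days: int) -> float:
    """Steady-state volume held in a tier with the given retention."""
    return daily_gb * retention_days


daily_low = daily_storage_gb(1_000_000, 1)     # 1 KB records
daily_high = daily_storage_gb(1_000_000, 5)    # 5 KB records
hot_low, hot_high = retained_gb(daily_low, 90), retained_gb(daily_high, 90)
warm_low, warm_high = retained_gb(daily_low, 730), retained_gb(daily_high, 730)
```

For the worked example this gives 1 to 5 GB per day, 90 to 450 GB in hot storage, and 730 to 3,650 GB in warm storage before compression or summarisation.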

SLO Table

| SLO | Target | Measurement | Alert Threshold |
|---|---|---|---|
| Drift detection freshness | Analysis runs within 2 hours of schedule | Job completion timestamp | > 4 hours behind schedule |
| Critical drift alert time | < 30 minutes from breach to alert | Alert delivery timestamp vs. PSI breach | > 60 minutes |
| Holdout accuracy evaluation | Daily for material models | Evaluation job completion log | > 48 hours since last evaluation |
| Reference dataset freshness | Updated within 5 days of model version deployment | Reference update timestamp vs. model deployment | > 7 days stale |

Disaster Recovery Table

| Component | RTO | RPO | Recovery Approach |
|---|---|---|---|
| Feature Log Store | 30 minutes | 1 hour | Replicated storage; catch-up analysis on recovery |
| Drift Analysis Jobs | 4 hours | 4 hours (catch-up) | Re-run batch jobs for missed windows |
| Reference Dataset Store | 30 minutes | Near-zero | Replicated object storage |
| Drift Dashboard | 60 minutes | N/A (read-only) | Redeploy dashboard from version control |

11. Cost Considerations

Cost Drivers

| Driver | Description | Relative Cost |
|---|---|---|
| Feature log storage | 1–5KB per inference record; 90-day hot retention | High at scale |
| Statistical analysis compute (Spark/Glue) | Hourly batch jobs on feature data; scales with feature count and volume | Medium |
| Holdout accuracy evaluation | Model inference cost on holdout set (daily) | Low to Medium |
| JSD streaming computation | Real-time output distribution monitoring; minimal compute | Low |
| Reference dataset storage | Relatively small; versioned reference distributions | Low |

Scaling Risks: Feature log storage is the primary scaling cost. At 10M+ daily inferences with 100+ features, storage costs can exceed $10K/month without optimisation. Use columnar compression (Parquet) and aggressive downsampling for older data.

Optimisations:

  • Store feature summaries (histogram buckets) rather than raw feature values for high-volume models
  • Run statistical tests on stratified samples (10K records sufficient for KS test statistical power) rather than full population
  • Use serverless compute (AWS Glue, BigQuery) to eliminate idle compute costs between batch windows
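The first optimisation, storing histogram buckets instead of raw values, can be sketched as below. The bin edges, the `summarise` and `merge` helpers, and the sample values are illustrative assumptions; the point is that per-window counts can be added together later and fed straight into a PSI calculation.

```python
from bisect import bisect_right


def summarise(values, edges):
    """Reduce raw feature values to bucket counts; len(edges) + 1 buckets."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        # bisect_right places v in the bucket whose lower edge is <= v.
        counts[bisect_right(edges, v)] += 1
    return counts


def merge(*histograms):
    """Combine summaries built on the same edges, e.g. hourly into daily."""
    return [sum(column) for column in zip(*histograms)]


INCOME_EDGES = [20_000, 50_000, 100_000]   # hypothetical, fixed per reference version

hour_1 = summarise([15_000, 42_000, 42_500, 120_000], INCOME_EDGES)
hour_2 = summarise([30_000, 75_000, 99_999, 100_000], INCOME_EDGES)
daily = merge(hour_1, hour_2)
```

The trade-off is that bin edges are frozen at summarisation time, so tests that need raw samples (such as KS) can no longer be run on the summarised windows.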

Indicative Cost Range

| Scale | Daily Inferences | Estimated Drift Detection Cost/Month |
|---|---|---|
| Small | 10,000 | $200–$600 |
| Medium | 500,000 | $1,500–$4,000 |
| Large | 5,000,000 | $5,000–$15,000 |
| Enterprise | 50,000,000+ | $20,000–$60,000 (with summarisation optimisation) |

12. Trade-Off Analysis

Approach Comparison

| Approach | Pros | Cons | Best For |
|---|---|---|---|
| Full feature-level drift monitoring (KS + Chi-Sq + PSI) | Precise; identifies which feature is drifting; enables targeted remediation | High compute and storage; requires feature access (not always available for LLMs) | Structured ML models with well-defined feature sets; regulated decisions |
| Output-only distribution monitoring (JSD on outputs) | Minimal infrastructure; no feature logging required; applicable to LLMs | Detects that something changed but not what; concept drift only, not data drift | LLM and generative systems; quick-start implementation |
| Human-labeled holdout evaluation only | Highest accuracy; directly measures real performance | Slow (labels take time); samples a small fraction of production | High-risk decisions where detection accuracy is paramount; complement to automated methods |

Architectural Tensions

| Tension | Description | Resolution |
|---|---|---|
| Sensitivity vs. False Alarms | Low thresholds detect early drift but generate false alarms that erode trust | Graduated response: PSI 0.10 warning (no page), 0.25 alert (notify), 0.50 critical (page) |
| Feature granularity vs. Cost | Per-feature monitoring is precise but expensive at scale | Monitor all features for regulated models; monitor key features only for lower-risk models |
| Detection speed vs. Statistical power | Very fast detection requires small windows with low statistical power | Accept 1-hour minimum window for KS/Chi-sq; use streaming output monitoring for faster preliminary signal |
| Automation vs. Human oversight | Automated retraining is fast but may introduce new problems | Automated trigger only; human must approve new model promotion |

13. Failure Modes

| Failure | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| Reference dataset not updated after model deployment | High | High (drift false alarms or misses) | Reference freshness SLO alert | Enforce reference update as deployment gate |
| Benign classifier clears actual harmful drift | Medium | High (harmful drift not actioned) | Accuracy holdout detects concurrent quality decline | Require dual signal for benign classification; accuracy must not decline |
| Feature logger adds significant latency | Low | High (performance degradation) | Feature logger latency SLO breach | Async logging; decouple from inference path |
| Statistical test fails with insufficient data | High (for low-volume models) | Medium (no drift detection) | Test failure log; minimum data check | Enforce minimum sample size before running tests; alert if sample size insufficient |
| Retraining pipeline introduces regression | Medium | High (new model worse) | New model evaluation before promotion | Holdout gate in retraining pipeline; manual sign-off required |

Cascading Scenarios

  • Scenario 1: Reference dataset never updated after major product launch → all features show critical drift → retraining triggered repeatedly → new models trained on shifted distribution → performance remains poor → wasted retraining compute and model risk review cycles. Mitigation: approved reference update is required within 5 days of known distribution shifts.
  • Scenario 2: Feature logger fails silently → feature store contains stale data → drift detection runs on old data → no drift detected → actual drift goes undetected for weeks. Mitigation: feature store freshness SLO; alert if no new records ingested in 30 minutes.
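The Scenario 2 mitigation reduces to a timestamp comparison. The sketch below assumes the feature store can report the ingestion time of its newest record; `feature_log_is_stale` is a hypothetical helper, and the 30-minute gap is taken from the text.

```python
from datetime import datetime, timedelta, timezone


def feature_log_is_stale(last_ingested_at: datetime, now: datetime,
                         max_gap: timedelta = timedelta(minutes=30)) -> bool:
    """True when no new record has arrived within the allowed gap."""
    return now - last_ingested_at > max_gap


now = datetime(2026, 6, 12, 12, 0, tzinfo=timezone.utc)
fresh = feature_log_is_stale(now - timedelta(minutes=10), now)
stale = feature_log_is_stale(now - timedelta(minutes=45), now)
```

Running this check before each drift job lets the job fail loudly instead of reporting "no drift" on old data.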

14. Regulatory Considerations

| Regulation | Clause | Requirement | Drift Detection Implementation |
|---|---|---|---|
| APRA CPG 234 | Section 6 (Model Risk) | Material models require ongoing performance monitoring and validation | Drift detection implements continuous monitoring; critical drift triggers validation workflow |
| APRA CPG 234 | Section 8 (Model Review) | Annual (minimum) or event-triggered model review | Critical drift event triggers model review; documentation provided by this pattern |
| EU AI Act | Article 9.5 (Testing) | Ongoing testing to identify appropriate risk management measures for high-risk AI | Holdout evaluation and drift detection implement ongoing testing requirement |
| EU AI Act | Article 12 (Record-Keeping) | High-risk AI must keep automatically generated logs enabling verification of compliance | Drift event log with timestamps, metrics, and actions is the compliance verification record |
| ISO/IEC 42001 | Clause 9.1 (Monitoring, Measurement, Analysis) | AI system performance must be continuously monitored against objectives | Drift detection implements ISO 42001 clause 9.1 at technical layer |
| NIST AI RMF | MANAGE 2.2 | AI risk management includes monitoring for changes in performance over time | Drift monitoring directly implements NIST AI RMF MANAGE 2.2 |
| Privacy Act 1988 (AU) | APP 11 (Security) | Model using PII must continue to protect it; model drift may expose new PII risks | Drift monitoring detects when input distribution shifts to include new PII categories |

15. Reference Implementations

AWS

  • Feature Logger: Custom wrapper; AWS SageMaker Model Monitor (built-in feature capture)
  • Feature Log Store: Amazon S3 (Parquet) + AWS Glue Data Catalog; Amazon Redshift for queries
  • Drift Analysis: SageMaker Model Monitor scheduled monitoring jobs (built-in KS, chi-sq, PSI); AWS Glue ETL jobs
  • Reference Store: SageMaker baseline dataset (S3-backed)
  • Retraining Trigger: SageMaker Model Monitor alert → EventBridge → SageMaker Pipeline
  • Dashboard: SageMaker Model Monitor built-in dashboard; Amazon QuickSight custom dashboards
  • Registry: SageMaker Model Registry

Azure

  • Feature Logger: Azure Machine Learning Data Collector; custom wrapper
  • Feature Log Store: Azure Data Lake Storage Gen2 (Parquet) + Azure Synapse Analytics
  • Drift Analysis: Azure ML Data Drift monitoring (built-in); Azure Databricks jobs
  • Reference Store: Azure ML Dataset with versioning
  • Retraining Trigger: Azure ML Monitoring alert → Azure Event Grid → Azure ML Pipeline
  • Dashboard: Azure ML Studio monitoring dashboard; Power BI
  • Registry: Azure Machine Learning Model Registry

GCP

  • Feature Logger: Vertex AI Feature Store; custom wrapper
  • Feature Log Store: BigQuery (streaming insert); Cloud Storage (Parquet)
  • Drift Analysis: Vertex AI Model Monitoring (built-in skew/drift detection); Dataflow batch jobs
  • Reference Store: Vertex AI Dataset with versioning
  • Retraining Trigger: Vertex AI Model Monitoring alert → Cloud Pub/Sub → Vertex AI Pipeline
  • Dashboard: Vertex AI Model Monitoring dashboard; Looker
  • Registry: Vertex AI Model Registry

On-Premises

  • Feature Logger: Evidently AI (open source); custom wrapper with Kafka sink
  • Feature Log Store: Apache Hudi on HDFS/MinIO; ClickHouse for queries
  • Drift Analysis: Evidently AI reports scheduled via Airflow; custom PySpark jobs
  • Reference Store: MLflow artifacts; DVC
  • Retraining Trigger: Airflow sensor on drift metric; MLflow webhook
  • Dashboard: Evidently AI HTML reports; Grafana with drift metrics
  • Registry: MLflow Model Registry

16. Related Patterns

| Pattern ID | Pattern Name | Relationship | Notes |
|---|---|---|---|
| EAAPL-OBS001 | AI Telemetry Architecture | Foundation | Input/output logging infrastructure required |
| EAAPL-OBS003 | Hallucination Detection | Sibling | Both monitor output quality; drift in hallucination rate is a concept drift signal |
| EAAPL-OBS004 | AI Incident Management | Depends On | Critical drift triggers P1 quality incident in OBS004 |
| EAAPL-OBS008 | AI Performance Benchmarking | Sibling | Offline benchmarking complements online drift monitoring |

17. Maturity Assessment

Overall Maturity: Proven

| Dimension | Score (1–5) | Rationale |
|---|---|---|
| Adoption Breadth | 3 | Adopted by mature ML organisations; cloud-native tools improving accessibility |
| Tooling Ecosystem | 4 | SageMaker Monitor, Vertex AI Monitoring, Evidently AI are mature; LLM drift tooling maturing |
| Operational Runbook Coverage | 3 | Statistical drift runbooks exist; benign drift classification is still manual/custom |
| Regulatory Evidence | 4 | APRA CPG 234 explicitly references model performance monitoring; EU AI Act adds momentum |
| Cost Predictability | 3 | Feature storage cost at scale can surprise teams without upfront capacity planning |
| Team Skill Availability | 3 | Statistical ML skills required; data scientist involvement needed for test interpretation |

18. Revision History

| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0.0 | 2026-06-12 | EAAPL Working Group | Initial publication |