EAAPL: Enterprise AI Architecture Pattern Library

[EAAPL-PLT008] AI Experiment Tracking

Category: Platform Engineering
Sub-category: MLOps / Evaluation
Version: 1.1
Maturity: Proven
Tags: experiment-tracking, mlops, model-evaluation, a-b-testing, evaluation-datasets, metric-tracking, model-registry, promotion-decision
Regulatory Relevance: EU AI Act Article 9 (Risk Management), Article 17 (Quality Management), ISO 42001 Clause 9


1. Executive Summary

AI systems that cannot demonstrate systematic evaluation before production deployment are a governance liability. When a new model version, prompt change, or RAG configuration is promoted without structured comparison data, the organisation has no evidence that the change improves outcomes and no baseline against which to detect degradation. Regulators, auditors, and, increasingly, procurement processes require this evidence.

The AI Experiment Tracking pattern establishes the infrastructure for systematic, reproducible evaluation of all AI configuration changes. It covers the metadata schema for experiments (what was changed, what was measured, over what dataset), evaluation dataset management (the golden datasets that make metrics reproducible), metric computation and comparison (including human evaluation workflows for nuanced quality assessment), multi-run comparison dashboards, and the promotion decision audit trail that links every production configuration to the experiment that justified its deployment. This pattern transforms AI quality management from anecdotal to evidence-based, satisfying both engineering and regulatory audiences.


2. Problem Statement

Business Problem

AI feature quality degrades without detection because there is no systematic measurement. A model upgrade that performs better on marketing copy but worse on technical documentation goes undetected until customers complain. A prompt change that reduces cost also reduces accuracy in a way that only manifests at edge cases. Without experiment tracking, these regressions are invisible until they cause business impact.

Technical Problem

AI experiments are informal: engineers test a new model in a local notebook, eyeball a few outputs, and deploy. There is no reproducible evaluation framework, no standard metric set, no golden dataset, no comparison to baseline, and no record of the decision. When the promotion is later questioned, there is no evidence to review.

Symptoms

  • Model or prompt changes deployed without documented evaluation results
  • Different teams using different metrics to evaluate the same AI capability (no standard metric set)
  • Golden datasets kept on individual engineers' laptops or in ad hoc S3 buckets with no versioning
  • Post-incident analysis unable to identify whether the AI change or a data change caused quality degradation
  • Regulators or auditors requesting evaluation evidence for production AI systems; none available

Cost of Inaction

  • AI quality regressions reaching production that structured evaluation would have caught
  • Regulatory non-compliance due to absence of quality management documentation
  • Duplicated evaluation effort across teams using incompatible methodologies
  • Inability to demonstrate AI improvement over time to business stakeholders

3. Context

When to Apply

  • Organisation has AI models or prompts in production that change over time
  • Multiple teams evaluate AI changes with no standard methodology
  • Regulatory obligations require quality management documentation for AI systems
  • A/B testing of model configurations is needed for production traffic comparison
  • AI programme stakeholders require evidence of continuous improvement

When NOT to Apply

  • Static AI system that never changes and has no evaluation lifecycle
  • One-time AI analysis project with no ongoing deployment: lightweight evaluation is sufficient
  • Research prototype: the overhead of production-grade experiment tracking is not warranted

Prerequisites

  • Evaluation datasets (golden datasets) for each AI use case; this is the prerequisite most organisations lack, and the datasets must be built before the pattern delivers its full value
  • AI API Gateway for A/B traffic routing integration (PLT002/PLT003)
  • Model Registry for linking experiment results to model versions (PLT001 Layer 2)
  • Observability infrastructure for metric ingestion

Industry Applicability

| Industry | Applicability | Evaluation Priority |
|---|---|---|
| Financial Services | Very High | Accuracy of AI-assisted decisions; fairness/bias; regulatory documentation |
| Healthcare | Very High | Clinical accuracy; safety; regulatory approval evidence |
| Technology / SaaS | High | Quality at scale; competitive differentiation through AI quality |
| Legal / Professional Services | High | Accuracy; consistency; professional responsibility evidence |
| Retail / E-commerce | Medium | User satisfaction; conversion metrics; content quality |
| Government | High | Fairness; accuracy; democratic accountability |

4. Architecture Overview

The AI Experiment Tracking system is the measurement and evidence layer for all AI quality decisions. It is structurally analogous to a scientific lab notebook for AI: each experiment has a defined setup, methodology, results, and conclusion. Unlike a notebook, it is automated and operates at engineering scale.

Experiment Metadata Schema defines what an experiment record contains. Every experiment must record:

  • a unique experiment ID
  • the component under evaluation (model name + version, prompt name + version, or RAG configuration version)
  • the baseline configuration it is being compared to
  • the evaluation dataset reference (name + version)
  • the evaluation metrics computed
  • the evaluation execution timestamp
  • the person or automated system that executed the evaluation
  • the promotion decision record (approved/rejected with reason and approver)

This schema is the foundation of the audit trail; every production configuration must have a traceable experiment record.
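As a minimal sketch, the experiment record could be modelled with Python dataclasses. The class and field names below (`ComponentRef`, `PromotionDecision`, `ExperimentRecord`) are illustrative assumptions, not a schema prescribed by this pattern; the example values are taken from the data flow in Section 7.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Dict, Optional


@dataclass(frozen=True)
class ComponentRef:
    """A versioned component under evaluation (hypothetical structure)."""
    kind: str      # e.g. "model", "prompt", "rag_config" (illustrative values)
    name: str
    version: str


@dataclass(frozen=True)
class PromotionDecision:
    approved: bool
    reason: str
    approver: str


@dataclass
class ExperimentRecord:
    experiment_id: str
    candidate: ComponentRef
    baseline: ComponentRef
    dataset_name: str
    dataset_version: str
    metrics: Dict[str, float]
    executed_at: datetime
    executed_by: str
    promotion: Optional[PromotionDecision] = None  # filled in at decision time


record = ExperimentRecord(
    experiment_id="exp-20250612-001",
    candidate=ComponentRef("prompt", "customer-faq", "1.2.0"),
    baseline=ComponentRef("prompt", "customer-faq", "1.1.3"),
    dataset_name="faq-golden",
    dataset_version="3.2",
    metrics={"accuracy": 0.941, "factuality": 0.912},
    executed_at=datetime(2025, 6, 12, tzinfo=timezone.utc),
    executed_by="ci-pipeline",
)

# asdict() gives a plain nested dict, convenient for writing to a document or
# relational store.
print(asdict(record)["candidate"]["version"])
```

Keeping the promotion decision optional reflects the lifecycle: the evaluation fields are written when the run completes, and the decision is attached later by the approver.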

Evaluation Dataset Management is the most operationally demanding part of this pattern. Golden datasets for AI evaluation must be representative, version-controlled, and regularly maintained. A golden dataset consists of: input examples (representative queries or prompts), expected outputs (or reference outputs for similarity scoring), and metadata (creation date, curator, domain coverage statistics, known limitations). Datasets are stored in versioned object storage (S3, GCS) with a content-addressed hash ensuring reproducibility. A dataset update triggers re-evaluation of the current production configuration as a new baseline, ensuring metrics are always comparable on the same dataset version.
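One way to obtain a content-addressed hash is to compute SHA-256 over a canonical serialisation of the examples. The sketch below assumes the dataset is a list of JSON-serialisable dictionaries; the function names `dataset_digest` and `verify_dataset` are hypothetical.

```python
import hashlib
import json
from typing import Dict, List


def dataset_digest(examples: List[Dict]) -> str:
    """Return a SHA-256 hex digest over a canonical JSON encoding.

    sort_keys makes the digest independent of dictionary key order;
    the order of examples in the list still matters.
    """
    canonical = json.dumps(
        examples, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_dataset(examples: List[Dict], expected_digest: str) -> bool:
    """Check loaded examples against the digest pinned for this version."""
    return dataset_digest(examples) == expected_digest


golden = [
    {"input": "How do I reset my password?", "expected": "Use the account settings page."},
    {"input": "What is the refund window?", "expected": "30 days from purchase."},
]
pinned = dataset_digest(golden)  # stored alongside the dataset version

tampered = [dict(golden[0], expected="Call support."), golden[1]]
print(verify_dataset(golden, pinned))    # True
print(verify_dataset(tampered, pinned))  # False
```

Recording the digest in each experiment record lets a later reviewer confirm that two experiments were scored on byte-identical data.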

Metric Computation Framework standardises the metrics computed across all experiments. Common AI quality metrics include: accuracy (for classification tasks), factuality score (for knowledge-retrieval tasks, computed via citation checking or reference comparison), format compliance rate (for structured output tasks), latency percentiles (P50, P95, P99 inference time), token efficiency (output tokens per quality unit), and for safety-critical applications: toxicity rate, bias score, and hallucination rate. Metric computation is automated in a standardised evaluation harness that can be invoked from CI/CD pipelines and from the experiment tracking service.
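A few of the metrics named above can be sketched with the standard library alone. This is an illustrative fragment of an evaluation harness, not a reference implementation: exact-match accuracy, "parses as a JSON object" as the format check, and nearest-rank percentiles are all assumptions chosen for simplicity.

```python
import json
import math
from typing import List, Sequence


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Exact-match accuracy for classification-style outputs."""
    if not predictions or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and aligned")
    return sum(p == r for p, r in zip(predictions, references)) / len(predictions)


def format_compliance_rate(outputs: Sequence[str]) -> float:
    """Share of outputs that parse as a JSON object (assumed output contract)."""
    def is_json_object(text: str) -> bool:
        try:
            return isinstance(json.loads(text), dict)
        except json.JSONDecodeError:
            return False
    return sum(is_json_object(o) for o in outputs) / len(outputs)


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile, with pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


latencies = [0.8, 1.0, 1.2, 1.4, 2.0]  # seconds, illustrative
print(accuracy(["a", "b", "c", "d"], ["a", "b", "x", "d"]))         # 0.75
print(format_compliance_rate(['{"a": 1}', "oops", "[1]", "{}"]))    # 0.5
print(percentile(latencies, 50), percentile(latencies, 95))         # 1.2 2.0
```

Metrics such as factuality, toxicity, or hallucination rate usually need a model-based or reference-based scorer and are out of scope for this sketch.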

Human Evaluation Workflow extends automated evaluation with human judgment for nuanced quality attributes that automated metrics cannot capture reliably: tone and brand voice consistency, logical coherence of long-form outputs, appropriateness of creative content, and clinical appropriateness in healthcare applications. Human evaluation is routed to qualified evaluators via a task queue; results are aggregated with inter-rater reliability scoring (Cohen's kappa for a pair of raters; a multi-rater statistic such as Fleiss' kappa when more than two evaluators label each item) to ensure evaluation quality. Human evaluation is constrained by cost and evaluator availability; the pattern defines when human evaluation is required (high-risk use cases, major version changes) versus when automated evaluation is sufficient.
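Cohen's kappa for two raters compares observed agreement with the agreement expected by chance from each rater's label frequencies. The following is a small self-contained sketch; the labels and ratings are made-up illustrative data.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if not rater_a or len(rater_a) != len(rater_b):
        raise ValueError("ratings must be non-empty and aligned")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


a = ["good"] * 6 + ["bad"] * 4
b = ["good"] * 5 + ["bad"] * 4 + ["good"]
print(round(cohens_kappa(a, b), 3))  # 0.583
```

Here the raters agree on 8 of 10 items (0.80 observed) but would agree on 0.52 by chance, so kappa is (0.80 - 0.52) / (1 - 0.52) ≈ 0.583. What counts as an acceptable kappa is a policy choice for the evaluation guidelines rather than something this sketch decides.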

Multi-Run Comparison Dashboard provides the analytical view across experiments. The dashboard allows comparison of metrics across all experiments for a given component, over time (to detect trend improvements or regressions), and against the current production baseline. Statistical significance testing (two-proportion z-test for classification metrics, t-test for continuous metrics) is applied automatically; the dashboard distinguishes between statistically significant improvements and noise.
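The two-proportion z-test mentioned above can be written with the standard library. This sketch uses the pooled-variance, two-sided form; the counts are hypothetical, and the 0.05 threshold mirrors the one used elsewhere in this pattern.

```python
import math
from typing import Tuple


def two_proportion_z_test(
    successes_a: int, n_a: int, successes_b: int, n_b: int
) -> Tuple[float, float]:
    """Two-sided pooled z-test for a difference in proportions.

    Returns (z, p_value). Group A is the candidate, group B the baseline.
    """
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # all successes or all failures: no evidence of a difference
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of N(0, 1)
    return z, p_value


def verdict(p_value: float, alpha: float = 0.05) -> str:
    return "SIGNIFICANT" if p_value < alpha else "INCONCLUSIVE"


# Hypothetical counts: 94.0% vs 91.0% correct on 1,000 examples each.
z_large, p_large = two_proportion_z_test(940, 1000, 910, 1000)
# Hypothetical counts: 92.0% vs 90.0% correct on 250 examples each.
z_small, p_small = two_proportion_z_test(230, 250, 225, 250)

print(verdict(p_large), verdict(p_small))  # SIGNIFICANT INCONCLUSIVE
```

The second call shows why small golden datasets often produce inconclusive results for differences of a couple of percentage points. Note also that this test treats the two groups as independent samples; when candidate and baseline are scored on the same examples, a paired test such as McNemar's may be the more appropriate choice.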

Promotion Decision Audit Trail links the decision to deploy a new configuration to the experiment that justified it. When a platform team member promotes a model or prompt version to production via the Prompt Registry (PLT005) or Model Registry (PLT001), they must reference an experiment ID that documents the evaluation. This reference is recorded in the promotion record and in the production configuration's metadata. Auditors can trace any production AI configuration to the experiment evidence that justified its deployment.
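The promotion gate can be sketched as a function that refuses to record a promotion unless it references a passing experiment for the same component version. The in-memory dictionary and list stand in for the experiment metadata DB and the audit trail; `PromotionRequest`, `PromotionRejected`, and `record_promotion` are hypothetical names, not registry APIs.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


class PromotionRejected(Exception):
    """Raised when a promotion request lacks valid experiment evidence."""


@dataclass(frozen=True)
class PromotionRequest:
    component: str
    version: str
    approver: str
    experiment_id: Optional[str] = None


def record_promotion(
    request: PromotionRequest,
    experiments: Dict[str, dict],
    audit_trail: List[dict],
) -> dict:
    """Validate the experiment reference, then append an audit entry."""
    if not request.experiment_id:
        raise PromotionRejected("experiment ID is a required field")
    experiment = experiments.get(request.experiment_id)
    if experiment is None:
        raise PromotionRejected(f"unknown experiment {request.experiment_id}")
    if (experiment["component"], experiment["candidate"]) != (
        request.component,
        request.version,
    ):
        raise PromotionRejected("experiment does not cover this component version")
    if experiment["status"] != "PASS":
        raise PromotionRejected(f"experiment status is {experiment['status']}")
    entry = {
        "component": request.component,
        "version": request.version,
        "experiment_id": request.experiment_id,
        "approver": request.approver,
    }
    audit_trail.append(entry)
    return entry


experiments = {
    "exp-20250612-001": {
        "component": "customer-faq",
        "candidate": "v1.2.0",
        "status": "PASS",
    }
}
audit_trail: List[dict] = []
entry = record_promotion(
    PromotionRequest("customer-faq", "v1.2.0", "prompt-owner", "exp-20250612-001"),
    experiments,
    audit_trail,
)
print(entry["experiment_id"])
```

Checking that the experiment's candidate matches the version being promoted prevents an approver from citing evidence gathered for a different version.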


5. Architecture Diagram

flowchart TD
    subgraph Triggers["Experiment Triggers"]
        A[CI Pipeline]
        B[Manual Trigger]
    end
    subgraph Evaluation["Evaluation Service"]
        C[Evaluation Runner]
        D[Metric Computation]
        E[Statistical Comparison]
    end
    subgraph Storage["Data Stores"]
        F[(Golden Dataset Store)]
        G[(Experiment Metadata DB)]
    end
    subgraph Outcome["Outcomes"]
        H[Comparison Dashboard]
        I[Promotion Audit Trail]
    end
    A --> C
    B --> C
    F --> C
    C --> D
    D --> E
    E --> G
    G --> H
    G --> I
    I --> J[Governance Report]
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#dbeafe,stroke:#3b82f6
    style C fill:#f0fdf4,stroke:#22c55e
    style D fill:#f0fdf4,stroke:#22c55e
    style E fill:#f0fdf4,stroke:#22c55e
    style F fill:#fef9c3,stroke:#eab308
    style G fill:#fef9c3,stroke:#eab308
    style H fill:#d1fae5,stroke:#10b981
    style I fill:#d1fae5,stroke:#10b981
    style J fill:#d1fae5,stroke:#10b981

6. Components

| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| Experiment Scheduler | Service | Queue and prioritise evaluation jobs | Custom Celery queue, Temporal workflow | High |
| Evaluation Runner | Service | Execute prompts against dataset; collect raw outputs | Custom Python harness, Ragas, DeepEval | Critical |
| Metric Computation Engine | Service | Compute standard and custom metrics from raw outputs | Ragas (RAG metrics), custom evaluators | Critical |
| Evaluation Dataset Store | Service | Version-controlled storage for golden datasets | S3 + DVC (Data Version Control), GCS | Critical |
| Experiment Metadata DB | Service | Store experiment records with full schema | PostgreSQL, MongoDB | Critical |
| Metric Time Series DB | Service | Store metric values for trend analysis and comparison | Prometheus, ClickHouse, TimescaleDB | High |
| Human Evaluation Task Queue | Service | Route nuanced evaluations to human evaluators | Custom task queue, Label Studio, Scale AI | Medium |
| Inter-Rater Reliability Calculator | Service | Compute Cohen's kappa for human evaluation quality | Custom Python module | Medium |
| Statistical Comparison Engine | Service | Compute significance tests between candidate and baseline | Custom + scipy.stats | High |
| Comparison Dashboard | Service | Visualise experiment results and trends | Grafana, Metabase, custom React dashboard | High |
| Promotion Decision Recorder | Service | Link production promotion events to experiment IDs | Integration with Prompt Registry and Model Registry APIs | Critical |
| Governance Report Generator | Service | Produce quality management documentation for auditors | Custom, Jupyter notebook pipeline | Medium |

7. Data Flow

Primary Flow — Automated CI Evaluation on Prompt Change

| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | CI Pipeline | Detect prompt change in PR; trigger experiment for customer-faq-v1.2.0 vs baseline v1.1.3 | Experiment job created with ID exp-20250612-001 |
| 2 | Experiment Scheduler | Dequeue experiment job; load dataset faq-golden-v3.2 from S3 | Dataset loaded: 250 examples |
| 3 | Evaluation Runner | Execute all 250 examples against baseline v1.1.3 and candidate v1.2.0; collect outputs | 500 raw output records |
| 4 | Metric Computation | Compute: accuracy 94.1% (v1.2.0) vs 92.8% (v1.1.3); factuality 91.2% vs 90.1%; P95 latency 1.2s vs 1.4s | Metric delta: accuracy +1.3 percentage points, factuality +1.1 percentage points, latency -200ms |
| 5 | Statistical Comparison | Two-proportion z-test on accuracy: p=0.031 (< 0.05 threshold) → statistically significant improvement | p-value + confidence intervals |
| 6 | Experiment Record | Write to experiment DB: exp-20250612-001; component=customer-faq; candidate=v1.2.0; status=PASS; significant improvement | Experiment record written |
| 7 | CI Status | Post experiment results as PR comment; mark CI check as PASS | PR author sees metric comparison |
| 8 | Promotion | Prompt owner approves PR; references exp-20250612-001 in promotion request | Experiment ID recorded in promotion audit trail |

Error Flow

| Error | Detection | Response |
|---|---|---|
| Evaluation dataset unavailable (S3 outage) | Dataset load failure | Experiment queued; alert dataset custodian; retry with backoff |
| Model API rate limit during evaluation | Runner error rate | Slow down evaluation; use batch API; emit warning |
| Statistical test inconclusive (p > 0.05) | Comparison engine | Mark experiment as INCONCLUSIVE; alert prompt owner; may require larger dataset |
| Human evaluation SLA breach | Task queue age monitor | Escalate to evaluation team lead; unblock by automated proxy metric |
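The "retry with backoff" response for dataset load failures and rate limits could look like the sketch below. It uses exponential backoff with full jitter; the helper name, default delays, and the choice of `OSError` as the retryable exception are assumptions for illustration.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    *,
    attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(), retrying on the given exceptions.

    The wait before retry k is drawn uniformly from
    [0, min(max_delay, base_delay * 2**k)] (full jitter).
    The final failure is re-raised to the caller.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
    raise ValueError("attempts must be at least 1")


# Simulated dataset load that fails twice before succeeding.
state = {"calls": 0}


def load_dataset() -> str:
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("simulated object store outage")
    return "faq-golden-v3.2"


# A no-op sleep keeps the demonstration instant.
print(retry_with_backoff(load_dataset, sleep=lambda seconds: None))
```

Injecting the `sleep` function keeps the helper testable; in a scheduler, exhausting the retries would be the point at which the dataset custodian is alerted.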

8. Security Considerations

  • Evaluation datasets may contain sensitive representative inputs; they must be classified and stored with the same access controls as production data
  • Experiment results (including individual output comparisons) may reveal model behaviour on sensitive inputs; access to raw outputs is restricted to the model owner and the platform team
  • Human evaluators must be bound by confidentiality agreements if evaluation involves sensitive content

OWASP LLM Controls

| OWASP LLM Risk | Experiment Tracking Control |
|---|---|
| LLM09 Overreliance | Factuality and hallucination metrics in evaluation suite provide evidence of reliability |
| LLM03 Training Data Poisoning | Evaluation dataset integrity checks (content hash verification) detect dataset tampering |

9. Governance Considerations

Quality Management System

  • Experiment tracking supplies core evidence for the quality management system required by EU AI Act Article 17; every production configuration must have a traceable experiment record
  • Promotion without experiment evidence is a governance policy violation; the promotion workflow enforces experiment ID as a required field

Governance Artefacts

| Artefact | Owner | Cadence | Location |
|---|---|---|---|
| Evaluation dataset catalogue | Data Team + Model Owner | Per dataset version | Dataset store + metadata DB |
| Experiment records | Model Owner | Per experiment | Experiment metadata DB |
| Promotion audit trail | Platform Team | Per promotion | Experiment DB + Registry |
| Quarterly quality report | Model Owner | Quarterly | Governance dashboard |
| Human evaluation guidelines | Model Owner | Annual | Internal wiki |

10. Operational Considerations

Monitoring

| Signal | Source | Alert Threshold | Owner |
|---|---|---|---|
| Evaluation pipeline failure rate | Job status | >5% failed experiments | Platform Team |
| Experiment queue depth | Scheduler metrics | >20 jobs pending >30 min | Platform Team |
| Dataset hash mismatch | Integrity check | Any mismatch | Security + Data Team |
| Production quality metric regression (vs. last deployment) | Production sampling | >2% drop on key metric | Model Owner + Platform Team |
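The production regression signal can be sketched as a comparison between sampled production metrics and the values recorded at the last deployment. The threshold is interpreted here as an absolute drop of more than 2 percentage points on metrics where higher is better; both of those are assumptions, since the table does not say whether the drop is absolute or relative.

```python
from typing import Dict


def regression_alerts(
    at_deployment: Dict[str, float],
    sampled: Dict[str, float],
    max_drop: float = 0.02,
) -> Dict[str, float]:
    """Return {metric: drop} for metrics that fell by more than max_drop.

    Metrics are fractions in [0, 1]; metrics missing from the sample are skipped.
    """
    alerts = {}
    for name, deployed_value in at_deployment.items():
        if name not in sampled:
            continue
        drop = deployed_value - sampled[name]
        if drop > max_drop:
            alerts[name] = drop
    return alerts


at_deployment = {"accuracy": 0.94, "factuality": 0.91}
sampled = {"accuracy": 0.93, "factuality": 0.88}
print(sorted(regression_alerts(at_deployment, sampled)))  # ['factuality']
```

A fixed threshold like this ignores sampling noise; for small production samples, pairing it with a significance test would reduce false alarms.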

SLOs

| SLO | Target | Window |
|---|---|---|
| CI evaluation pipeline completion | <15 min for standard 250-example dataset | Per run |
| Experiment DB availability | 99.9% | Rolling 30 days |
| Production sampling metric freshness | <4 hours lag | Rolling 7 days |

Disaster Recovery

| Component | RPO | RTO | Strategy |
|---|---|---|---|
| Experiment metadata DB | 5 min | 30 min | Database replication |
| Dataset store | 0 (content-addressed) | 15 min | S3 cross-region replication |
| Metric time series DB | 1 hour | 30 min | Cross-region replication; recomputable from experiment DB |

11. Cost Considerations

Cost Drivers

| Driver | Description | Relative Weight |
|---|---|---|
| Evaluation API calls | Running golden dataset through model on every PR/experiment | Medium; scales with dataset size and frequency |
| Human evaluation | Evaluator time for nuanced tasks | High; reserved for high-risk changes only |
| Dataset storage | Versioned datasets; relatively small at scale | Low |
| Experiment DB hosting | Low-volume OLTP database | Low |

Optimisations

  • Use the cheapest model that produces comparable signals for evaluation runs where the evaluation target is a more expensive model
  • Maintain a tiered evaluation strategy: fast automated eval (10 examples) for PR gate; comprehensive eval (250 examples) for promotion to staging; extended eval (1000+ examples) for production promotion of high-risk changes
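The tiered strategy can be sketched as a stage-to-sample-size mapping with a seeded sampler, so that a given stage always evaluates the same subset of the golden dataset. The stage names and the use of 1,000 as the extended-tier size are illustrative assumptions.

```python
import random
from typing import Dict, List, Sequence, TypeVar

T = TypeVar("T")

# Example counts per stage, following the tiers described above.
TIER_SIZES: Dict[str, int] = {
    "pr_gate": 10,
    "staging": 250,
    "production_high_risk": 1000,
}


def select_examples(dataset: Sequence[T], stage: str, seed: int = 0) -> List[T]:
    """Pick a reproducible subset of the dataset for the given stage.

    A fixed seed yields the same subset on every run, so metrics from
    successive PRs remain comparable. If the dataset is smaller than the
    tier size, the whole dataset is used.
    """
    size = min(TIER_SIZES[stage], len(dataset))
    return random.Random(seed).sample(list(dataset), size)


dataset = [f"example-{i}" for i in range(1200)]
print(len(select_examples(dataset, "pr_gate")))               # 10
print(len(select_examples(dataset, "staging")))               # 250
print(len(select_examples(dataset, "production_high_risk")))  # 1000
```

A purely random 10-example subset may miss important edge cases; a hand-curated smoke set is a reasonable alternative for the PR gate.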

Indicative Cost Range

| Scale | Monthly Experiment Infra + Evaluation API Cost |
|---|---|
| Small (10 prompts, weekly changes) | $300–$1,000 |
| Medium (50 prompts, daily changes, 3 teams) | $2,000–$6,000 |
| Large (200+ prompts, multiple models, continuous evaluation) | $10,000–$30,000 |

12. Trade-Off Analysis

Evaluation Strategy Options

| Option | Description | Pros | Cons | Best For |
|---|---|---|---|---|
| Automated Only | All evaluation via metric computation against golden dataset | Scalable; fast; cheap | Misses nuanced quality issues; only as good as golden dataset | High-volume, structured output tasks |
| Human Evaluation Only | All evaluation via human judges | Highest quality | Slow; expensive; not scalable; inter-rater inconsistency | Very high-risk, low-volume decisions |
| Automated + Human Gate | Automated for all changes; human required for high-risk/major changes | Balanced quality and scalability | Requires defining "high-risk" criterion carefully | Recommended default |

Metric Strategy Options

| Option | Description | Pros | Cons |
|---|---|---|---|
| Fixed Standard Metric Set | Same metrics for all use cases | Comparability; simplicity | May not capture use-case-specific quality |
| Per-Use-Case Custom Metrics | Metrics defined per use case | Highest relevance | Comparison across use cases harder; more maintenance |
| Composite Score | Weighted combination of multiple metrics | Single promotion threshold | Weight calibration difficult; masks individual metric issues |
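The masking risk of a composite score is easy to demonstrate. In this sketch the weights and metric values are invented for illustration: the candidate's composite score rises even though its factuality falls by ten percentage points.

```python
from typing import Dict


def composite_score(metrics: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of the metrics named in weights."""
    total_weight = sum(weights.values())
    return sum(metrics[name] * w for name, w in weights.items()) / total_weight


weights = {"accuracy": 0.5, "factuality": 0.3, "format_compliance": 0.2}
baseline = {"accuracy": 0.90, "factuality": 0.90, "format_compliance": 0.90}
candidate = {"accuracy": 0.96, "factuality": 0.80, "format_compliance": 0.96}

print(round(composite_score(baseline, weights), 3))   # 0.9
print(round(composite_score(candidate, weights), 3))  # 0.912

# Per-metric guard: list every metric that regressed, regardless of the composite.
regressed = [name for name in weights if candidate[name] < baseline[name]]
print(regressed)  # ['factuality']
```

One common mitigation is to gate on the composite and on per-metric floors together, so that a single regressed metric cannot be traded away silently.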

Architectural Tensions

| Tension | Option A | Option B | Resolution |
|---|---|---|---|
| Evaluation completeness vs. CI speed | Full evaluation on every PR | Fast smoke test on PR; full eval on merge | Tiered: 10-example fast check on PR; 250-example full on merge |
| Golden dataset size vs. cost | Large representative dataset | Minimal viable dataset | Start with 50–100 high-quality examples; expand to 250–500 as infrastructure matures |
| Automated statistical significance vs. business judgment | Block promotion on p>0.05 only | Block on any quality drop | Statistical gate for automated CI; human judgment gate for production promotion |

13. Failure Modes

| Failure | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| Golden dataset staleness (distribution drift from production) | High over time | High: metrics look good but production quality is low | Production quality monitoring diverges from evaluation metrics | Regular dataset refresh cadence; production sampling for dataset maintenance |
| Evaluation runner hitting model rate limits | Medium | Medium: experiments fail or slow | Runner error rate | Use batch API; reduce parallelism; queue retry |
| Human evaluator bias | Medium | Medium: biased promotion decisions | Inter-rater reliability monitoring (low kappa) | Evaluator calibration sessions; blind evaluation protocols |
| Promotion without experiment reference | Low (if enforced) | High: governance gap | Promotion workflow audit | Enforce experiment ID as mandatory in promotion workflow; alert on bypass |

14. Regulatory Considerations

EU AI Act Articles 9 and 17

  • Article 9 requires a risk management system for high-risk AI; experiment evaluation records document the testing that supports its risk management measures
  • The Article 17 quality management system requirement is supported by the experiment tracking infrastructure and promotion audit trail, which provide evidence of evaluation procedures and promotion decisions
  • Technical documentation required by Article 11 must reference the evaluation methodology, datasets, and metrics used

ISO 42001 Clause 9 (Performance Evaluation)

  • Experiment tracking supports Clause 9.1 (monitoring and measurement of AI system performance)
  • Regular production quality sampling provides input to Clause 9.3 (management review of AI performance)

NIST AI RMF MEASURE 2.3

  • Metrics for tracking AI performance over time must be defined and measured; the experiment tracking framework and metric time series DB satisfy this requirement

15. Reference Implementations

AWS

| Component | AWS Service |
|---|---|
| Evaluation runner | SageMaker Processing Jobs or Lambda (batch) |
| Dataset store | S3 + DVC |
| Experiment metadata DB | Amazon RDS PostgreSQL |
| Metric time series | Amazon Timestream or CloudWatch |
| Dashboard | Amazon Managed Grafana |

Azure

| Component | Azure Service |
|---|---|
| Evaluation runner + experiment tracking | Azure ML Experiments |
| Dataset store | Azure ML Datasets + Azure Blob Storage |
| Dashboard | Azure ML Studio |

Open Source / SaaS

| Component | Technology |
|---|---|
| Evaluation framework | Ragas (RAG), DeepEval, HELM |
| Experiment tracking | MLflow, Weights & Biases, Comet ML |
| Human evaluation | Label Studio (open source), Scale AI, Labelbox |

On-Premises

| Component | Technology |
|---|---|
| Evaluation runner | Custom Python harness + Celery |
| Experiment tracking | MLflow self-hosted |
| Dataset store | MinIO (S3-compatible) + DVC |

16. Related Patterns

| Pattern ID | Name | Relationship |
|---|---|---|
| EAAPL-PLT001 | Enterprise AI Platform | Parent: experiment tracking is Layer 4 |
| EAAPL-PLT005 | Prompt Version Control | Dependency: prompt promotions require experiment references |
| EAAPL-PLT003 | Model Routing | Integration: A/B routing data feeds experiment tracking |
| EAAPL-GOV001 | AI Governance Framework | Dependency: experiment records are primary governance evidence |

17. Maturity Assessment

Overall Maturity: Proven

Experiment tracking with MLflow, W&B, and Azure ML is production-proven. AI-specific evaluation frameworks (Ragas, DeepEval) are maturing rapidly. The promotion-decision audit trail link is the least standardised component.

Scoring Matrix

| Dimension | Score (1–5) | Rationale |
|---|---|---|
| Pattern Completeness | 5 | All sections documented |
| Implementation Evidence | 4 | MLOps experiment tracking proven; LLM-specific evaluation less so |
| Tooling Maturity | 4 | Ragas/DeepEval maturing; human eval tooling stable |
| Regulatory Alignment | 5 | Strong EU AI Act / ISO 42001 alignment |
| Dataset Management Maturity | 3 | Golden dataset management is the common weak link in practice |

18. Revision History

| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2024-08-01 | EAAPL Working Group | Initial publication |
| 1.1 | 2025-06-12 | EAAPL Working Group | Human evaluation workflow expanded; ISO 42001 Clause 9 alignment; production sampling guidance added |