[EAAPL-OBS008] A/B Model Evaluation

Category: Observability & Monitoring
Sub-category: Model Lifecycle
Version: 1.0
Maturity: Proven
Tags: ab-testing, canary-deployment, model-upgrade, traffic-splitting, statistical-significance, challenger-model, side-by-side-evaluation, model-promotion
Regulatory Relevance: EU AI Act Article 9 & 13, APRA CPS 230, ISO/IEC 42001 Clause 8.4, NIST AI RMF MANAGE 3.1

1. Executive Summary

Upgrading the LLM powering a production AI application carries risk that cannot be fully characterised in offline evaluation. Offline benchmarks on golden datasets measure quality on known examples under controlled conditions; they do not capture the full distribution of production inputs, real user feedback patterns, actual latency under production load, or the cost-per-unit-of-value economics at scale. A model that scores 5% better on a golden dataset can still perform worse in production — and a model that scores identically on quality metrics can cost 40% more or introduce 200ms additional p99 latency that degrades user experience. The A/B Model Evaluation pattern addresses this gap by routing real production traffic across two models simultaneously, measuring all dimensions of production performance side-by-side, and gating promotion on statistically significant evidence.

This pattern defines a canary deployment architecture for LLM and model upgrades. A traffic splitter routes a configurable percentage of production requests to a challenger model while the control model continues to handle the remainder. Both models operate with identical prompt configurations, context assembly, and output filtering. All quality metrics, latency distributions, cost-per-request, and user feedback signals are collected for both models with the same instrumentation, enabling a true side-by-side comparison on real production data. A promotion controller evaluates the challenger against a multi-dimensional promotion criteria checklist — quality, latency, cost, user signal, and minimum sample size — and gates promotion on statistical significance. The outcome is a model upgrade process where every production model change carries a statistical evidence package demonstrating the upgrade is an improvement (or at minimum, not a regression) on all relevant dimensions.

2. Problem Statement

Business Problem

Model providers release new model versions and organisations must decide: upgrade now and risk regression, or stay on the older model and accumulate capability debt. Neither option is satisfying. The missing ingredient is the ability to run both models simultaneously on real production traffic with side-by-side measurement, so that upgrade decisions are based on observed production evidence rather than benchmark extrapolation or intuition.

Technical Problem

LLM responses are non-deterministic and context-dependent. A challenger model cannot be reliably evaluated by replaying production logs against it — the replay lacks real-time context (user state, session history, real-time data injections) and produces outputs that cannot be compared to user outcomes. The only valid evaluation environment is live production traffic. Implementing traffic splitting for LLMs requires solving: deterministic user assignment (the same user should consistently hit the same model within a session), request-level tracking to correlate model assignment with all downstream signals, statistical analysis of non-Gaussian quality score distributions, and a multi-criteria promotion gate that balances quality, cost, and latency trade-offs.

Symptoms of Absence

Model upgrades are deployed as big-bang switches with no parallel validation; regressions discovered post-deployment require rollback under customer-visible conditions
Team debates model upgrade decisions using benchmark comparisons and intuition; there is no production evidence to resolve disagreements
A model upgrade reduces hallucination rate but increases p99 latency by 400ms; both facts are known only after full deployment
Cost implications of a model upgrade are only understood at the end of the billing month after full deployment
Different product teams upgrade models independently without sharing evaluation evidence; the same model is evaluated multiple times across the organisation

Cost of Inaction

Quality: Big-bang model upgrades produce customer-visible regressions that require rollback; each rollback cycle costs 2–5 days of engineering time and damages user trust
Compliance: EU AI Act Article 13 transparency obligations and Article 9 risk management requirements implicitly require evidence-based model change management; undocumented big-bang upgrades are a compliance gap
Operational: Without side-by-side cost data, model upgrade decisions are made without understanding their P&L impact; a model that appears better on quality alone can increase LLM spend by 60%

3. Context

When to Apply

Any production LLM application considering an upgrade to a new model version or a different model provider
Applications where model upgrade decisions require multi-stakeholder sign-off (product, engineering, finance, legal, compliance)
Systems with defined quality SLOs, latency SLOs, or cost budgets where upgrade impact must be quantified before commitment
Regulated applications where model changes require documented evidence of performance equivalence or improvement
Multi-tenant platforms where a model regression affects a large user population and rollback cost is high

When NOT to Apply

Model upgrades where the new model is architecturally incompatible with the existing prompt format and cannot run with the same prompt configuration as the control (offline evaluation is required first)
Applications with insufficient traffic volume to achieve statistical significance within an acceptable time window (rule of thumb: < 100 production requests per day per model makes statistical significance take weeks)
Emergency security patches to model configuration where the risk of staying on the old model outweighs the risk of untested upgrade
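
The traffic rule of thumb above can be checked with simple arithmetic. The sketch below is illustrative only: the function name, the default gate of 1,000 scored responses (borrowed from Step 4), and the `scoring_rate` parameter are assumptions, not part of the pattern.

```python
import math


def days_to_sample_gate(daily_requests: int, challenger_split: float,
                        min_samples: int = 1000, scoring_rate: float = 1.0) -> int:
    """Days until the challenger cohort accumulates `min_samples` scored responses.

    The challenger is assumed to be the smaller cohort, so it sets the pace.
    `scoring_rate` is the (hypothetical) fraction of responses that get scored.
    """
    if daily_requests <= 0 or not 0 < challenger_split <= 0.5 or not 0 < scoring_rate <= 1:
        raise ValueError("expected positive traffic, split in (0, 0.5], rate in (0, 1]")
    scored_per_day = daily_requests * challenger_split * scoring_rate
    return math.ceil(min_samples / scored_per_day)


# 200 requests/day split evenly gives 100 requests/day per model.
print(days_to_sample_gate(200, 0.5))
# The same traffic at a 25% challenger split takes twice as long.
print(days_to_sample_gate(200, 0.25))
```

Running this estimate before committing to an experiment shows whether the maximum experiment duration is realistic for the available traffic.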

Prerequisites

Traffic splitting infrastructure capable of routing requests to alternative model endpoints with consistent user assignment within a session
Per-request model assignment tracking in all telemetry (every request must carry a model_variant label: control or challenger)
Output scoring infrastructure from EAAPL-OBS006 capable of scoring both model variants with the same scorer
User feedback collection mechanism (explicit thumbs up/down, implicit outcome signals, or downstream conversion events)
Statistical analysis capability to run two-sample tests on non-Gaussian quality score distributions
Promotion controller with multi-criteria checklist and minimum sample size gate

Industry Applicability

Industry	Use Case	Value	Adoption Level
Technology / SaaS	Upgrade LLM powering AI product features (code assistant, chat agent, content generator) with evidence-based promotion	Eliminates big-bang upgrade risk; enables faster model iteration with confidence	Proven
Financial Services	Validate challenger model for advice and document AI against control on quality, latency, and cost before promoting	Provides evidence package required by model risk management process	Emerging
Healthcare	Compare clinical AI model versions on factual accuracy and safety metrics before clinical deployment	Quantifies safety impact of model changes with real clinical query distributions	Emerging
Retail / E-Commerce	Evaluate LLM recommendation and personalisation model upgrades on conversion signal alongside quality metrics	Connects model quality to business outcomes (revenue per session)	Proven
Legal Services	Test new legal research AI model versions against control on citation accuracy with real lawyer queries	Validates that upgrade does not degrade the citation accuracy that creates professional liability exposure	Emerging

4. Architecture Overview

The A/B Model Evaluation architecture introduces a traffic splitting layer between the application and the LLM client. The splitter operates at the session level, not the request level: user or session assignment to control or challenger is determined once (by hashing the user ID or session ID, reducing the hash modulo 100, and comparing the resulting bucket against the split percentage) and held constant throughout the evaluation period. This prevents within-session model switches that would confuse user feedback signals and create inconsistent conversation experiences in multi-turn applications.

The traffic split configuration is stored in a feature flag system or a dedicated experiment configuration store. Configuration defines: the challenger model identifier and endpoint, the split percentage (typically starting at 5% and increasing if the challenger shows no regressions), the evaluation start date, the minimum sample size gate, the promotion criteria thresholds per metric, and the maximum experiment duration (a hard deadline prevents perpetual indecision). The experiment controller monitors sample accumulation and sends automated progress reports at 25%, 50%, and 75% of the minimum sample size gate.
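
One possible shape for the experiment configuration and the progress-report trigger is sketched below. All field and function names are hypothetical, and the per-metric promotion thresholds are left out for brevity. In practice the values would come from the feature flag system or configuration store rather than being constructed in code.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class ExperimentConfig:
    experiment_id: str
    control_model: str
    challenger_model: str
    challenger_endpoint: str
    split_percentage: int           # share of sessions routed to the challenger
    start_date: date
    min_samples_per_variant: int    # minimum sample size gate
    max_duration_days: int          # hard deadline against perpetual indecision

    def deadline(self) -> date:
        return self.start_date + timedelta(days=self.max_duration_days)


def newly_reached_milestones(cfg: ExperimentConfig, challenger_samples: int,
                             already_reported=frozenset()) -> list:
    """Return the 25/50/75% progress milestones crossed but not yet reported."""
    progress_pct = 100 * challenger_samples / cfg.min_samples_per_variant
    return [m for m in (25, 50, 75) if progress_pct >= m and m not in already_reported]


cfg = ExperimentConfig(
    experiment_id="exp-001", control_model="control-model",
    challenger_model="challenger-model", challenger_endpoint="https://example.invalid/v1",
    split_percentage=5, start_date=date(2026, 1, 1),
    min_samples_per_variant=1000, max_duration_days=30,
)
print(cfg.deadline(), newly_reached_milestones(cfg, 600))
```

Freezing the dataclass mirrors the recommendation to treat an activated experiment as immutable.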

Both model variants operate behind the same prompt assembly and context injection pipeline. The LLM client wrapper receives a resolved model endpoint (control or challenger) based on the user's assignment, and the model variant label is injected into all telemetry records at the wrapper level. This ensures the quality scorer, cost calculator, latency recorder, and user feedback collector all see the model_variant dimension without any changes to application logic above the wrapper layer.

The evaluation analysis service runs on a configurable schedule (daily for long-running experiments, more frequently for high-traffic applications where sample accumulation is fast). For each active experiment, it pulls the score distributions for control and challenger, runs two-sample statistical tests (Mann-Whitney U for non-Gaussian quality scores, two-proportion z-test for binary pass/fail rates, Welch's t-test for latency), computes the effect size and confidence intervals, and evaluates the promotion criteria matrix. The promotion criteria matrix is the key governance artefact: it defines, for each measured dimension, whether the challenger must be significantly better, non-inferior (within a defined equivalence margin), or has no requirement. A model upgrade that improves quality by 8% but increases cost by 25% with no latency change may or may not meet promotion criteria — the matrix makes this decision explicit and auditable rather than subjective.
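
The three tests named above can be sketched with the Python standard library alone. This is a teaching sketch under stated assumptions, not the pattern's reference implementation: p-values use a normal approximation (reasonable at the sample sizes this pattern calls for), the Mann-Whitney version applies a tie correction but no continuity correction, and all function names are illustrative. A production service would more likely call a statistics library such as SciPy.

```python
import math
from statistics import NormalDist, mean, variance

_STD_NORMAL = NormalDist()


def _two_sided_p(z: float) -> float:
    return 2 * (1 - _STD_NORMAL.cdf(abs(z)))


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test; returns (U statistic for x, p-value)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x, tie_term, i = 0.0, 0.0, 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2           # mean of ranks i+1 .. j for a tie group
        ties = j - i
        tie_term += ties**3 - ties
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2
    sigma = math.sqrt(n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1))))
    if sigma == 0:                            # every observation identical
        return u, 1.0
    return u, _two_sided_p((u - n1 * n2 / 2) / sigma)


def two_proportion_z(pass_control, n_control, pass_challenger, n_challenger):
    """Pooled two-proportion z-test; positive z means the challenger rate is higher."""
    pooled = (pass_control + pass_challenger) / (n_control + n_challenger)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_challenger))
    z = (pass_challenger / n_challenger - pass_control / n_control) / se
    return z, _two_sided_p(z)


def welch_t(control, challenger):
    """Welch's t statistic, Welch-Satterthwaite df, and a large-sample p-value."""
    vc, vh = variance(control) / len(control), variance(challenger) / len(challenger)
    t = (mean(challenger) - mean(control)) / math.sqrt(vc + vh)
    df = (vc + vh) ** 2 / (vc**2 / (len(control) - 1) + vh**2 / (len(challenger) - 1))
    return t, df, _two_sided_p(t)             # normal approximation, valid for large df
```

Each function takes the control sample first, so a positive z or t statistic means the challenger value is higher. Whether higher is better depends on the metric (quality score versus latency).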

5. Architecture Diagram

flowchart TD
    subgraph Assignment["User Assignment Layer"]
        A[Incoming Request]
        B[Session Hash Splitter]
        C[Control Model Route]
        D[Challenger Model Route]
    end
    subgraph Inference["Model Inference"]
        E[Control LLM - GPT-4o]
        F[Challenger LLM - GPT-4o-mini]
    end
    subgraph Collection["Signal Collection"]
        G[Quality Scorer - Both Variants]
        H[Latency and Cost Recorder]
        I[User Feedback Collector]
    end
    subgraph Analysis["Evaluation Analysis"]
        J[Statistical Test Engine]
        K[Promotion Criteria Matrix]
        L{Promote?}
    end
    subgraph Outcome["Disposition"]
        M[Promote Challenger to Control]
        N[Reject Challenger]
        O[Continue Experiment]
    end
    A --> B
    B -->|90%| C
    B -->|10%| D
    C --> E
    D --> F
    E --> G
    F --> G
    E --> H
    F --> H
    G --> J
    H --> J
    I --> J
    J --> K
    K --> L
    L -->|yes| M
    L -->|no - regression| N
    L -->|insufficient data| O

6. Components

Component	Responsibility	Technology Examples
Session Hash Splitter	Assigns users/sessions deterministically to control or challenger model based on configurable split percentage; maintains assignment consistency within session	Feature flag system (LaunchDarkly, Unleash, AWS AppConfig), custom hash-mod assignment middleware
Experiment Configuration Store	Stores experiment parameters (split %, model identifiers, promotion criteria, start date, max duration, minimum sample size)	LaunchDarkly experiment config, AWS AppConfig, PostgreSQL experiment table
LLM Client Wrapper with Variant Routing	Routes requests to control or challenger model endpoint based on session assignment; injects model_variant label into all telemetry	Custom wrapper over OpenAI/Anthropic/Bedrock SDK; variant label added to OTel span attributes
Quality Scorer	Scores outputs from both variants using the same evaluation pipeline (from EAAPL-OBS006); ensures comparability by using identical judge prompts and deterministic checks	Ragas, DeepEval, judge LLM scorer with pinned model version
User Feedback Collector	Collects explicit user signals (thumbs up/down, star rating, correction actions) and implicit signals (session continuation, task completion) attributed to model variant	Custom feedback widget, PostHog event, Amplitude event, Mixpanel
Evaluation Analysis Service	Runs statistical tests (Mann-Whitney U, Welch's t-test, two-proportion z-test) on accumulated data; computes effect sizes and confidence intervals; evaluates promotion criteria matrix	Python SciPy service, R statistical service, MLflow evaluation module
Promotion Criteria Matrix	Version-controlled configuration defining per-metric promotion requirements (must-beat, non-inferior, no-requirement) with thresholds and minimum effect sizes	YAML config in application repository, reviewed by AI governance team
Promotion Controller	Monitors sample accumulation; generates progress reports; enforces maximum experiment duration; executes promotion or rejection action	AWS Step Functions, Temporal workflow, custom orchestration service

7. Implementation Steps

Step 1: Instrument Traffic Splitting with Deterministic Assignment

Implement session-level assignment using a consistent hash function (e.g., SHA-256(user_id + experiment_id) % 100 < split_percentage). This ensures the same user always hits the same model variant throughout the experiment, which is critical for multi-turn applications and for correlating user feedback to model assignment. Store the assignment as a session attribute accessible to all downstream components. Validate that the assignment distribution is within 0.5% of the target split percentage before activating the experiment (chi-squared test on the assignment distribution against expected proportions). Start with a 5% challenger split and a 95% control split to limit blast radius while accumulating data.
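
A minimal sketch of the assignment rule and the pre-activation split check follows. Two details are assumptions rather than part of the pattern: a `:` separator is placed between the two IDs to avoid ambiguous concatenations, and only the first 8 bytes of the digest are used for the bucket. The chi-squared p-value relies on the closed form for one degree of freedom and requires a split strictly between 0 and 100.

```python
import hashlib
import math


def assign_variant(user_id: str, experiment_id: str, split_percentage: int) -> str:
    """Deterministically map a user to 'control' or 'challenger' for one experiment."""
    key = f"{user_id}:{experiment_id}".encode("utf-8")
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % 100
    return "challenger" if bucket < split_percentage else "control"


def split_check_p_value(n_challenger: int, n_total: int, split_percentage: int) -> float:
    """Chi-squared goodness-of-fit p-value (1 degree of freedom) for the observed split."""
    expected_challenger = n_total * split_percentage / 100
    expected_control = n_total - expected_challenger
    observed_control = n_total - n_challenger
    stat = ((n_challenger - expected_challenger) ** 2 / expected_challenger
            + (observed_control - expected_control) ** 2 / expected_control)
    # For 1 degree of freedom, P(chi2 > stat) = erfc(sqrt(stat / 2)).
    return math.erfc(math.sqrt(stat / 2))


# Dry run over synthetic user IDs before activating the experiment.
users = [f"user-{i}" for i in range(20000)]
n_challenger = sum(assign_variant(u, "exp-001", 5) == "challenger" for u in users)
print(n_challenger / len(users), split_check_p_value(n_challenger, len(users), 5))
```

Including the experiment ID in the hash key means a user who lands in the challenger cohort of one experiment is not systematically placed in the challenger cohort of the next.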

Step 2: Wire Per-Variant Telemetry with Identical Instrumentation

Add the model_variant label (values: control, challenger) to all telemetry records from the LLM client wrapper. This single label is what enables all downstream analysis. Verify that the quality scorer, cost calculator, latency recorder, and user feedback events all carry this label before starting the experiment. Run a pre-experiment data quality check: confirm that model_variant appears on > 99.9% of LLM request telemetry records. A single unlabelled request is not a problem; a systematic labelling gap invalidates the entire experiment. Implement the same check as an ongoing data quality alert throughout the experiment.
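
The labelling step and the coverage check might look roughly like the sketch below. The client callables, the record shape, and both function names are hypothetical stand-ins; a real wrapper would sit over the provider SDK and write to the telemetry backend rather than to an in-memory list.

```python
import time
from typing import Callable, Iterable, Mapping, MutableSequence

VARIANTS = ("control", "challenger")


def call_with_variant(variant: str, prompt: str,
                      clients: Mapping[str, Callable[[str], str]],
                      telemetry: MutableSequence[dict]) -> str:
    """Send the prompt to the assigned variant and record a labelled telemetry row."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown model_variant: {variant!r}")
    started = time.perf_counter()
    response = clients[variant](prompt)
    telemetry.append({
        "model_variant": variant,                      # the single label all analysis keys on
        "latency_s": time.perf_counter() - started,
    })
    return response


def label_coverage(records: Iterable[Mapping]) -> float:
    """Fraction of telemetry records that carry a valid model_variant label."""
    records = list(records)
    if not records:
        return 0.0
    labelled = sum(1 for r in records if r.get("model_variant") in VARIANTS)
    return labelled / len(records)


# Stub clients stand in for the real control and challenger endpoints.
stub_clients = {"control": lambda p: "control:" + p, "challenger": lambda p: "challenger:" + p}
rows: list = []
call_with_variant("challenger", "hello", stub_clients, rows)
print(rows[0]["model_variant"], label_coverage(rows) > 0.999)
```

Because the label is attached inside the wrapper, application code above it needs no changes, which matches the architecture described in Section 4.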

Step 3: Define and Version the Promotion Criteria Matrix

Before activating the experiment, define the promotion criteria matrix as a YAML file committed to the application repository. The matrix must specify, for each measured dimension: the measurement (metric name and aggregation), the requirement type (MUST_BEAT: challenger must be statistically significantly better; NON_INFERIOR: challenger must be within the equivalence margin; NO_REQUIREMENT: informational only), the threshold or equivalence margin, the minimum effect size for MUST_BEAT requirements, and the statistical significance level (default p < 0.05). Example criteria: faithfulness score NON_INFERIOR with equivalence margin -0.03; p99 latency NON_INFERIOR with equivalence margin +50ms; cost per request NON_INFERIOR with equivalence margin +10%; user thumbs-up rate MUST_BEAT with minimum effect size 0.02. The matrix must be approved by the AI governance team before experiment activation.
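
To make the three requirement types concrete, the sketch below evaluates single criteria against analysis results. It uses Python dataclasses in place of the YAML file, and it assumes one particular reading of the rules: NON_INFERIOR is judged on the confidence interval bound of the difference (challenger minus control), and MUST_BEAT needs both significance and the minimum effect size. The class and field names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    metric: str
    requirement: str            # MUST_BEAT | NON_INFERIOR | NO_REQUIREMENT
    higher_is_better: bool
    margin: float = 0.0         # signed equivalence margin, e.g. -0.03 or +50
    min_effect: float = 0.0     # minimum improvement for MUST_BEAT
    alpha: float = 0.05


@dataclass(frozen=True)
class MetricResult:
    diff: float                 # challenger minus control
    ci_low: float               # confidence interval of diff
    ci_high: float
    p_value: float
    n_per_variant: int


def evaluate(criterion: Criterion, result: MetricResult, min_samples: int) -> str:
    if criterion.requirement == "NO_REQUIREMENT":
        return "PASS"                                   # informational only
    if result.n_per_variant < min_samples:
        return "INSUFFICIENT DATA"
    if criterion.requirement == "NON_INFERIOR":
        if criterion.higher_is_better:
            ok = result.ci_low >= criterion.margin      # worst case stays above -margin
        else:
            ok = result.ci_high <= criterion.margin     # worst case stays below +margin
        return "PASS" if ok else "FAIL"
    if criterion.requirement == "MUST_BEAT":
        improvement = result.diff if criterion.higher_is_better else -result.diff
        ok = result.p_value < criterion.alpha and improvement >= criterion.min_effect
        return "PASS" if ok else "FAIL"
    raise ValueError(f"unknown requirement type: {criterion.requirement}")


faithfulness = Criterion("faithfulness", "NON_INFERIOR", higher_is_better=True, margin=-0.03)
thumbs_up = Criterion("thumbs_up_rate", "MUST_BEAT", higher_is_better=True, min_effect=0.02)
print(evaluate(faithfulness, MetricResult(-0.01, -0.02, 0.0, 0.30, 1200), 1000))
print(evaluate(thumbs_up, MetricResult(0.03, 0.01, 0.05, 0.01, 1200), 1000))
```

Writing the rules as code forces decisions that prose leaves open, such as whether non-inferiority is judged on the point estimate or on the interval bound. The governance team should settle these before activation.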

Step 4: Run Analysis, Generate Evidence Package, and Execute Promotion Decision

At minimum sample size gate (typically 1,000+ scored responses per variant for quality metrics; 5,000+ for latency and cost), run the full statistical analysis. Generate the promotion evidence package: a structured document containing the experiment configuration, sample sizes per variant, per-metric score distributions (with box plots or violin plots), statistical test results (test statistic, p-value, confidence interval, effect size), promotion criteria matrix evaluation (each criterion: PASS/FAIL/INSUFFICIENT DATA), and the final promotion recommendation. Route the evidence package to the approvers defined in the experiment configuration (may be fully automated for low-stakes upgrades, or require human sign-off for regulated systems). Execute the approved decision: promote (update experiment configuration to route 100% of traffic to challenger), reject (revert split to 0% challenger, document rejection reason), or extend (increase sample size gate and continue).
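
The final step, turning per-criterion outcomes into a promote, reject, or extend action, can be sketched as follows. The function names are hypothetical. The rule that an experiment past its maximum duration with insufficient data is rejected follows the "inconclusive, revert to control" recovery described under Failure Modes. Human approval, where required, would sit between the decision and the split change.

```python
def promotion_decision(criterion_outcomes: dict, past_max_duration: bool = False) -> str:
    """Map per-criterion outcomes (PASS / FAIL / INSUFFICIENT DATA) to an action."""
    if not criterion_outcomes:
        raise ValueError("at least one criterion outcome is required")
    outcomes = set(criterion_outcomes.values())
    unknown = outcomes - {"PASS", "FAIL", "INSUFFICIENT DATA"}
    if unknown:
        raise ValueError(f"unexpected outcomes: {sorted(unknown)}")
    if "FAIL" in outcomes:
        return "reject"                       # any regression blocks promotion
    if "INSUFFICIENT DATA" in outcomes:
        # Inconclusive experiments revert to control once the deadline has passed.
        return "reject" if past_max_duration else "extend"
    return "promote"


def challenger_split_after(decision: str, current_split: int) -> int:
    """Challenger traffic percentage that results from executing the decision."""
    return {"promote": 100, "reject": 0, "extend": current_split}[decision]


outcomes = {"faithfulness": "PASS", "p99_latency_ms": "PASS", "thumbs_up_rate": "INSUFFICIENT DATA"}
decision = promotion_decision(outcomes)
print(decision, challenger_split_after(decision, current_split=5))
```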

8. Security Considerations

OWASP LLM Top 10 Mapping

OWASP ID	Threat	Mitigation
LLM01 Prompt Injection	Adversarial users who discover they are in a challenger cohort may attempt to manipulate challenger outputs to skew evaluation metrics	Assignment must not be detectable by the user; do not expose model variant in API responses; apply the same output filtering to both variants
LLM04 Model Denial of Service	Traffic splitting routes some production traffic to a challenger model that may have different rate limits or throughput characteristics than the control	Implement per-variant rate limit tracking; configure challenger with capacity headroom appropriate to the split percentage; circuit breaker on challenger failover to control
LLM05 Supply Chain Vulnerabilities	The challenger model is a new supply chain dependency; it may have different safety properties, content filter behaviours, or data handling terms than the control	Conduct security and privacy review of challenger model provider terms before experiment activation; verify content filter equivalence on a safety test set before routing live traffic
LLM10 Model Theft	Side-channel analysis of response differences between control and challenger could expose proprietary prompt engineering details	Ensure responses do not include model identifiers; apply the same prompt confidentiality controls to both variants

9. Governance Artefacts

Experiment Configuration Record: version-controlled YAML defining split percentage, model identifiers, promotion criteria matrix, minimum sample size, maximum duration, and approver chain
Promotion Evidence Package: generated at experiment conclusion; contains statistical analysis, per-metric evidence, criteria evaluation, and promotion decision with approver identities and timestamps
Experiment Registry: central log of all past and active experiments with model versions tested, experiment duration, outcome (promoted/rejected), and link to evidence package
Promotion Criteria Matrix Approval Record: sign-off by AI governance team before each experiment activation; ensures criteria are appropriate to the use case and regulatory context
Post-Promotion Monitoring Alert: configures EAAPL-OBS007 Prompt Drift Detection to run a baseline refresh for the newly promoted model version; ensures drift detection is active immediately after promotion

10. SLOs

SLO	Target	Measurement
Experiment setup time	< 1 business day from decision to activate to first challenger traffic routing	Wall-clock time from experiment config commit to first challenger-labelled telemetry record
Statistical analysis turnaround	Evidence package generated within 4 hours of minimum sample size gate reached	Time from minimum sample size threshold crossed to evidence package available
Challenger availability equivalence	Challenger model error rate within 0.5% of control model error rate during experiment	Challenger error rate minus control error rate; monitored hourly; circuit breaker if exceeded
Assignment consistency	> 99.99% of multi-turn sessions maintain consistent model assignment throughout session	1 - (multi-turn sessions containing a variant switch / total multi-turn sessions)
Promotion decision SLA	Promotion or rejection decision made within 5 business days of evidence package generation	Time from evidence package to approved promotion/rejection action executed
Evaluation latency (CI gate)	<90s per 100-sample batch	P99 pipeline duration
Drift alert MTTD (Mean Time to Detect)	<24 hours	Time from regression onset to alert firing

11. Cost Model

Cost Driver	Estimate	Notes
Challenger model inference cost	Varies; typically the primary cost driver of the evaluation	At 10% split: challenger cost = 10% of total LLM cost × (challenger cost per token / control cost per token); a cheaper challenger reduces total cost; a more expensive challenger increases it
Quality scoring for both variants	$30–$300/month	Same scoring infrastructure as EAAPL-OBS006 production monitor; applied to both variants at the same sample rate; effectively doubles scoring cost at 50% split
Evaluation analysis compute	$5–$20/month	Statistical analysis is computationally lightweight; runs on serverless or small instance
Feature flag / experiment platform	$0–$500/month	LaunchDarkly from $200/month; Unleash open-source is free; custom implementation has engineering cost but no recurring fee
Evidence package generation	$10–$50 per experiment	Automated report generation with visualisation; one-time cost per experiment conclusion
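
The challenger inference cost formula in the table can be worked through as below. The sketch assumes both models consume the same number of tokens per request, which may not hold in practice (a challenger that writes longer answers costs more than the price ratio alone suggests). The function name and the dollar figures are illustrative.

```python
def experiment_llm_cost(baseline_monthly_cost: float, challenger_split: float,
                        price_ratio: float) -> dict:
    """Monthly LLM spend during the experiment.

    baseline_monthly_cost: spend if 100% of traffic stayed on the control model.
    challenger_split: fraction of traffic routed to the challenger (0.10 for 10%).
    price_ratio: challenger cost per token divided by control cost per token.
    """
    if not 0 <= challenger_split <= 1 or price_ratio < 0:
        raise ValueError("split must be in [0, 1] and price_ratio non-negative")
    challenger = baseline_monthly_cost * challenger_split * price_ratio
    control = baseline_monthly_cost * (1 - challenger_split)
    return {"control": control, "challenger": challenger, "total": control + challenger}


# A challenger at half the per-token price, on 10% of a $10,000/month workload.
print(experiment_llm_cost(10_000, 0.10, 0.5))
# A challenger at double the per-token price on the same split.
print(experiment_llm_cost(10_000, 0.10, 2.0))
```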

12. Trade-off Analysis

Dimension	Benefit	Trade-off
Session-level assignment	Consistent user experience within session; feedback signals are unambiguous	Slower sample accumulation than request-level splitting; minimum sample size gate takes longer to reach in low-traffic applications
Small initial split (5%)	Limits blast radius if challenger has unexpected regressions	Slower sample accumulation at small splits; high-traffic applications can increase split as confidence grows
Multi-criteria promotion matrix	Prevents promotion of a model that wins on quality but loses on cost or latency; makes trade-off decisions explicit and auditable	Matrix calibration requires governance effort; if criteria are too strict, no challenger can ever be promoted; too lenient and the gate provides no protection
Statistical significance gate	Prevents promotion on noise; provides defensible evidence for regulatory purposes	Statistical significance does not equal practical significance; a statistically significant but tiny improvement may not justify the operational change
Automated promotion capability	Removes human bottleneck for routine model upgrades; enables fast iteration	Automated promotion requires high confidence in the promotion criteria matrix and the quality of the evaluation signals; inappropriate for regulated high-risk AI systems where human sign-off is required

13. Failure Modes

Failure	Trigger	Recovery
Novelty effect biases early results	Users assigned to the challenger variant interact differently because the responses are slightly different, not because the model is better; early metrics are inflated	Implement a burn-in period (first 48 hours of challenger traffic excluded from analysis); require minimum experiment duration regardless of sample size accumulation speed
Assignment consistency failure corrupts signal	Experiment configuration change mid-experiment causes some users to switch variants; feedback signals are no longer attributable to a single model	Freeze experiment configuration after activation; any configuration change requires a new experiment with a fresh baseline; contaminated period data must be excluded from analysis
Challenger model rate limit reached at scale	Challenger model endpoint has lower rate limits than control; as split percentage increases, challenger requests are throttled while control requests succeed	Monitor per-variant error rates continuously; implement circuit breaker that reverts all traffic to control if challenger error rate exceeds control by more than 0.5%; provision challenger capacity before increasing split
Promotion criteria matrix is wrong for the context	Matrix was calibrated for a different use case; a model is promoted that has a meaningful regression on a dimension not covered by the matrix	Post-promotion monitoring via EAAPL-OBS007 detects regressions that the matrix did not gate on; quarterly matrix review process with retrospective analysis of post-promotion quality trends
Experiment runs indefinitely without decision	Minimum sample size never reached (traffic too low); maximum experiment duration not enforced; experiment becomes stale	Enforce hard maximum experiment duration; generate an escalation alert at 90% of maximum duration; declare inconclusive result and revert to control if sample size gate is not reached
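
The circuit breaker rule from the rate-limit failure mode is sketched below. The 0.5% threshold is read as percentage points of error rate, and the `min_challenger_requests` guard is an added assumption to avoid tripping on a handful of early requests. Neither interpretation is confirmed by the pattern text.

```python
def should_trip_breaker(control_errors: int, control_total: int,
                        challenger_errors: int, challenger_total: int,
                        max_gap: float = 0.005, min_challenger_requests: int = 200) -> bool:
    """True when the challenger error rate exceeds the control rate by more than max_gap."""
    if control_total <= 0 or challenger_total <= 0:
        return False                          # no data yet in one of the cohorts
    if challenger_total < min_challenger_requests:
        return False                          # too few requests for a stable rate
    gap = challenger_errors / challenger_total - control_errors / control_total
    return gap > max_gap


# 1.0% challenger errors against 0.1% control errors: a gap of 0.9 points.
print(should_trip_breaker(10, 10_000, 5, 500))
# 0.2% against 0.1%: within tolerance.
print(should_trip_breaker(10, 10_000, 1, 500))
```

When the function returns true, the recovery action in the table applies: revert all traffic to control and provision challenger capacity before resuming.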

14. Regulatory Mapping

Regulation	Requirement	How Pattern Addresses It
EU AI Act Article 9	Risk management for high-risk AI must include evaluation of performance across the intended purpose with real-world evidence	Production A/B evaluation provides real-world evidence; promotion evidence package is the documented validation artefact
EU AI Act Article 13	Transparency obligations require that high-risk AI systems be designed and developed to allow providers to comply with transparency requirements including performance documentation	Experiment evidence package and experiment registry provide the performance documentation required for Article 13 compliance
APRA CPS 230	Material model changes in financial services require model validation evidence before deployment	Promotion evidence package satisfies the model validation evidence requirement; promotion criteria matrix satisfies the validation criteria documentation requirement
APRA CPS 230 §21	AI systems classified as critical operations require monitoring that demonstrates the system is operating within defined performance parameters	The evaluation pipeline produces the evidence artefact (evaluation scorecard with rolling baseline) that satisfies the 'regular testing of operational resilience' requirement; A/B promotion gate ensures a challenger model meets the same operational resilience bar as the control before assuming production load
APRA CPS 234 §36	Material changes to AI system behaviour may constitute a 'material information security incident' or 'material service provider change' requiring APRA notification within 72 hours	The detection capability provided by this pattern is the prerequisite for meeting that notification timeline; model version promotion is a recorded change event, and per-variant quality divergence detected during the experiment surfaces material behavioural changes before they reach full production
ISO/IEC 42001 Clause 8.4	AI systems must be evaluated before deployment and after significant changes	Pattern implements evaluation before promotion (by design); post-promotion monitoring integration (EAAPL-OBS007) implements the after-change monitoring
NIST AI RMF MANAGE 3.1	AI risks identified in deployment must be tracked and managed including through testing and validation mechanisms	A/B evaluation with promotion gate implements the pre-deployment risk management mechanism; circuit breaker and rollback implement the deployment risk management mechanism

15. Reference Implementations

AWS

Traffic Splitter: AWS AppConfig feature flag with weighted variant assignment; Lambda@Edge or API Gateway request routing
Experiment Configuration: AWS AppConfig hosted configuration; version-controlled in CodeCommit with approval workflow
LLM Client Wrapper: Python wrapper over Boto3 Bedrock client; variant injection into CloudWatch structured log dimensions
Quality Scoring: AWS Lambda async scorer from SQS; same infrastructure as EAAPL-OBS006
User Feedback: Amazon Pinpoint event ingestion; custom feedback endpoint writing to DynamoDB
Analysis Service: AWS Lambda scheduled via EventBridge; SciPy in Lambda layer; evidence package written to S3
Promotion Controller: AWS Step Functions state machine; approval step via Amazon SNS + human approval token

Azure

Traffic Splitter: Azure App Configuration feature flags with percentage-based targeting; APIM policy for routing
Experiment Configuration: Azure App Configuration with Key Vault reference for model endpoint secrets
LLM Client Wrapper: Python wrapper over Azure OpenAI SDK; variant label in Application Insights custom dimension
Quality Scoring: Azure Functions async scorer via Service Bus
User Feedback: Azure Event Hubs event ingestion; Cosmos DB for feedback records
Analysis Service: Azure Functions timer trigger; SciPy Python runtime
Promotion Controller: Azure Logic Apps workflow with approval action via Teams Adaptive Card

On-Premises

Traffic Splitter: Nginx upstream split with consistent hash on user ID; or custom middleware in the application API gateway
Experiment Configuration: PostgreSQL experiment table; GitOps-managed YAML merged via pull request approval
LLM Client Wrapper: Python wrapper with variant label injected into structured log
Quality Scoring: Kubernetes Job consumer on Redis queue
User Feedback: REST endpoint writing to PostgreSQL feedback table
Analysis Service: Python script running as Kubernetes CronJob; SciPy for statistical tests; Jinja2 for evidence package report generation
Promotion Controller: Jenkins pipeline with manual approval gate for production deployment step

16. Related Patterns

EAAPL-OBS001 AI Telemetry Architecture: provides the per-request model_variant telemetry labelling conventions and the metrics backend that stores evaluation data for both variants
EAAPL-OBS006 LLM Evaluation Pipeline: provides the quality scoring infrastructure used to score both model variants; the challenger model should pass the CI/CD evaluation gate before it is activated in an A/B experiment
EAAPL-OBS007 Prompt Drift Detection: should be activated for the promoted model version immediately after promotion to detect post-promotion quality changes; the control model's stable baseline is the reference for the newly promoted model's drift detection
EAAPL-OBS005 Model Drift Detection: population-level input distribution monitoring; run alongside this pattern to detect whether the experiment cohorts have diverged in input distribution (selection bias that would invalidate the comparison)
EAAPL-OBS004 AI Incident Management: defines the incident response procedure if the challenger model produces a P0 quality event during the experiment; includes the automatic circuit breaker and rollback steps

17. Maturity Assessment

Dimension	Level	Notes
Adoption Breadth	4 — Proven	A/B testing of ML models is a well-established practice at technology companies; application to LLM model upgrades specifically is proven at AI-native companies and is becoming standard practice
Tooling Ecosystem	4 — Proven	Feature flag platforms (LaunchDarkly, Unleash, Split.io), statistical testing libraries (SciPy, statsmodels), and experiment tracking platforms (MLflow, Neptune) are all mature and widely deployed
Regulatory Evidence	3 — Developing	A/B model evaluation aligns with model risk management validation requirements but specific regulatory guidance on LLM A/B evaluation practices is still emerging; early adopters in financial services are defining the standard
Cost Predictability	4 — Predictable	The primary variable cost is the inference cost differential between control and challenger at the configured split percentage; this is precisely calculable once per-token costs and expected traffic volume are known

18. Revision History

Version	Date	Change
1.0	2026-06-14	Initial release
