[EAAPL-OBS008] A/B Model Evaluation
Category: Observability & Monitoring | Sub-category: Model Lifecycle | Version: 1.0 | Maturity: Proven | Tags: ab-testing, canary-deployment, model-upgrade, traffic-splitting, statistical-significance, challenger-model, side-by-side-evaluation, model-promotion | Regulatory Relevance: EU AI Act Article 9 & 13, APRA CPS 230, ISO/IEC 42001 Clause 8.4, NIST AI RMF MANAGE 3.1
1. Executive Summary
Upgrading the LLM powering a production AI application carries risk that cannot be fully characterised in offline evaluation. Offline benchmarks on golden datasets measure quality on known examples under controlled conditions; they do not capture the full distribution of production inputs, real user feedback patterns, actual latency under production load, or the cost-per-unit-of-value economics at scale. A model that scores 5% better on a golden dataset can still perform worse in production — and a model that scores identically on quality metrics can cost 40% more or introduce 200ms additional p99 latency that degrades user experience. The A/B Model Evaluation pattern addresses this gap by routing real production traffic across two models simultaneously, measuring all dimensions of production performance side-by-side, and gating promotion on statistically significant evidence.
This pattern defines a canary deployment architecture for LLM and model upgrades. A traffic splitter routes a configurable percentage of production requests to a challenger model while the control model continues to handle the remainder. Both models operate with identical prompt configurations, context assembly, and output filtering. All quality metrics, latency distributions, cost-per-request, and user feedback signals are collected for both models with the same instrumentation, enabling a true side-by-side comparison on real production data. A promotion controller evaluates the challenger against a multi-dimensional promotion criteria checklist — quality, latency, cost, user signal, and minimum sample size — and gates promotion on statistical significance. The outcome is a model upgrade process where every production model change carries a statistical evidence package demonstrating the upgrade is an improvement (or at minimum, not a regression) on all relevant dimensions.
2. Problem Statement
Business Problem
Model providers release new model versions and organisations must decide: upgrade now and risk regression, or stay on the older model and accumulate capability debt. Neither option is satisfying. The missing ingredient is the ability to run both models simultaneously on real production traffic with side-by-side measurement, so that upgrade decisions are based on observed production evidence rather than benchmark extrapolation or intuition.
Technical Problem
LLM responses are non-deterministic and context-dependent. A challenger model cannot be reliably evaluated by replaying production logs against it — the replay lacks real-time context (user state, session history, real-time data injections) and produces outputs that cannot be compared to user outcomes. The only valid evaluation environment is live production traffic. Implementing traffic splitting for LLMs requires solving: deterministic user assignment (the same user should consistently hit the same model within a session), request-level tracking to correlate model assignment with all downstream signals, statistical analysis of non-Gaussian quality score distributions, and a multi-criteria promotion gate that balances quality, cost, and latency trade-offs.
Symptoms of Absence
- Model upgrades are deployed as big-bang switches with no parallel validation; regressions discovered post-deployment require rollback under customer-visible conditions
- Team debates model upgrade decisions using benchmark comparisons and intuition; there is no production evidence to resolve disagreements
- A model upgrade reduces hallucination rate but increases p99 latency by 400ms; both facts are known only after full deployment
- Cost implications of a model upgrade are only understood at the end of the billing month after full deployment
- Different product teams upgrade models independently without sharing evaluation evidence; the same model is evaluated multiple times across the organisation
Cost of Inaction
- Quality: Big-bang model upgrades produce customer-visible regressions that require rollback; each rollback cycle costs 2–5 days of engineering time and damages user trust
- Compliance: EU AI Act Article 13 transparency obligations and Article 9 risk management requirements implicitly require evidence-based model change management; undocumented big-bang upgrades are a compliance gap
- Operational: Without side-by-side cost data, model upgrade decisions are made without understanding their P&L impact; a model that appears better on quality alone can increase LLM spend by 60%
3. Context
When to Apply
- Any production LLM application considering an upgrade to a new model version or a different model provider
- Applications where model upgrade decisions require multi-stakeholder sign-off (product, engineering, finance, legal, compliance)
- Systems with defined quality SLOs, latency SLOs, or cost budgets where upgrade impact must be quantified before commitment
- Regulated applications where model changes require documented evidence of performance equivalence or improvement
- Multi-tenant platforms where a model regression affects a large user population and rollback cost is high
When NOT to Apply
- Model upgrades where the new model is architecturally incompatible with the existing prompt format and cannot run with the same prompt configuration as the control (offline evaluation is required first)
- Applications with insufficient traffic volume to achieve statistical significance within an acceptable time window (rule of thumb: < 100 production requests per day per model makes statistical significance take weeks)
- Emergency security patches to model configuration where the risk of staying on the old model outweighs the risk of untested upgrade
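The traffic-volume rule of thumb above can be checked with simple arithmetic. The sketch below estimates how many days an experiment needs before the slower-filling variant reaches its sample gate. The function name, the `scoring_sample_rate` parameter, and the assumption of steady daily traffic are illustrative and not part of the pattern.

```python
import math


def days_to_gate(daily_requests: float, split_percentage: float,
                 min_samples_per_variant: int,
                 scoring_sample_rate: float = 1.0) -> float:
    """Estimate days until BOTH variants hold min_samples_per_variant scored responses.

    Assumes steady traffic and that every variant is scored at the same rate.
    """
    challenger_per_day = daily_requests * split_percentage / 100 * scoring_sample_rate
    control_per_day = daily_requests * (100 - split_percentage) / 100 * scoring_sample_rate
    slowest = min(challenger_per_day, control_per_day)
    if slowest <= 0:
        return math.inf  # one variant receives no scored traffic, so the gate is never reached
    return math.ceil(min_samples_per_variant / slowest)


# 2,000 requests/day at a 5% split gives the challenger about 100 requests/day.
print(days_to_gate(2000, 5, 1000))
```

With a gate of 1,000 scored responses per variant, a variant that receives about 100 requests per day needs roughly ten days, and smaller splits or partial scoring stretch that further.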
Prerequisites
- Traffic splitting infrastructure capable of routing requests to alternative model endpoints with consistent user assignment within a session
- Per-request model assignment tracking in all telemetry (every request must carry a model_variant label: control or challenger)
- Output scoring infrastructure from EAAPL-OBS006 capable of scoring both model variants with the same scorer
- User feedback collection mechanism (explicit thumbs up/down, implicit outcome signals, or downstream conversion events)
- Statistical analysis capability to run two-sample tests on non-Gaussian quality score distributions
- Promotion controller with multi-criteria checklist and minimum sample size gate
Industry Applicability
| Industry | Use Case | Value | Adoption Level |
|---|---|---|---|
| Technology / SaaS | Upgrade LLM powering AI product features (code assistant, chat agent, content generator) with evidence-based promotion | Eliminates big-bang upgrade risk; enables faster model iteration with confidence | Proven |
| Financial Services | Validate challenger model for advice and document AI against control on quality, latency, and cost before promoting | Provides evidence package required by model risk management process | Emerging |
| Healthcare | Compare clinical AI model versions on factual accuracy and safety metrics before clinical deployment | Quantifies safety impact of model changes with real clinical query distributions | Emerging |
| Retail / E-Commerce | Evaluate LLM recommendation and personalisation model upgrades on conversion signal alongside quality metrics | Connects model quality to business outcomes (revenue per session) | Proven |
| Legal Services | Test new legal research AI model versions against control on citation accuracy with real lawyer queries | Validates that upgrade does not degrade the citation accuracy that creates professional liability exposure | Emerging |
4. Architecture Overview
The A/B Model Evaluation architecture introduces a traffic splitting layer between the application and the LLM client. The splitter operates at the session level, not the request level: user or session assignment to control or challenger is determined once (by hashing the user ID or session ID, reducing the hash modulo 100, and comparing the resulting bucket against the split percentage) and held constant throughout the evaluation period. This prevents within-session model switches that would confuse user feedback signals and create inconsistent conversation experiences in multi-turn applications.
The traffic split configuration is stored in a feature flag system or a dedicated experiment configuration store. Configuration defines: the challenger model identifier and endpoint, the split percentage (typically starting at 5% and increasing if the challenger shows no regressions), the evaluation start date, the minimum sample size gate, the promotion criteria thresholds per metric, and the maximum experiment duration (a hard deadline prevents perpetual indecision). The promotion controller monitors sample accumulation and sends automated progress reports at 25%, 50%, and 75% of the minimum sample size gate.
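As an illustration of the configuration fields listed above, the following sketch models an experiment record and the 25/50/75% progress milestones. The class name, field names, and validation rules are assumptions made for this example. Per-metric promotion thresholds are left out here for brevity.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass(frozen=True)
class ExperimentConfig:
    """Hypothetical experiment record; field names are illustrative only."""
    challenger_model: str
    challenger_endpoint: str
    split_percentage: int          # share of sessions routed to the challenger
    start_date: date
    min_samples_per_variant: int   # minimum sample size gate
    max_duration_days: int         # hard deadline to prevent perpetual indecision

    def __post_init__(self) -> None:
        if not 0 < self.split_percentage < 100:
            raise ValueError("split_percentage must be between 1 and 99")
        if self.min_samples_per_variant <= 0 or self.max_duration_days <= 0:
            raise ValueError("sample gate and duration must be positive")

    def deadline(self) -> date:
        return self.start_date + timedelta(days=self.max_duration_days)


def milestones_reached(config: ExperimentConfig, samples_per_variant: int) -> List[int]:
    """Return the progress-report milestones (in % of the gate) already crossed."""
    return [m for m in (25, 50, 75)
            if samples_per_variant * 100 >= m * config.min_samples_per_variant]


cfg = ExperimentConfig("challenger-model-id", "https://challenger.example",
                       5, date(2026, 1, 1), 1000, 30)
print(cfg.deadline(), milestones_reached(cfg, 500))
```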
Both model variants operate behind the same prompt assembly and context injection pipeline. The LLM client wrapper receives a resolved model endpoint (control or challenger) based on the user's assignment, and the model variant label is injected into all telemetry records at the wrapper level. This ensures the quality scorer, cost calculator, latency recorder, and user feedback collector all see the model_variant dimension without any changes to application logic above the wrapper layer.
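A minimal sketch of the wrapper idea described above, assuming a hypothetical `send(endpoint, prompt)` callable standing in for the real provider SDK and a plain list standing in for the telemetry backend. It shows the variant label being attached at the wrapper, so code above it never handles the label.

```python
import time
from typing import Callable, Dict, List


class VariantRoutingClient:
    """Routes a request to the endpoint for its variant and labels the telemetry record.

    `send` is a placeholder for a real SDK call; `sink` is a placeholder for the
    telemetry pipeline. Both are assumptions made for this sketch.
    """

    def __init__(self, endpoints: Dict[str, str],
                 send: Callable[[str, str], str], sink: List[dict]) -> None:
        self.endpoints = endpoints
        self.send = send
        self.sink = sink

    def complete(self, prompt: str, variant: str, request_id: str) -> str:
        endpoint = self.endpoints[variant]  # KeyError on an unknown variant is intentional
        start = time.perf_counter()
        text = self.send(endpoint, prompt)
        latency_ms = (time.perf_counter() - start) * 1000
        # Every record carries model_variant so downstream scorers can group by it.
        self.sink.append({
            "request_id": request_id,
            "model_variant": variant,
            "endpoint": endpoint,
            "latency_ms": latency_ms,
        })
        return text


records: List[dict] = []
client = VariantRoutingClient(
    {"control": "https://control.example", "challenger": "https://challenger.example"},
    lambda endpoint, prompt: f"echo from {endpoint}: {prompt}",
    records,
)
client.complete("hello", "challenger", "req-1")
print(records[0]["model_variant"])
```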
The evaluation analysis service runs on a configurable schedule (daily for long-running experiments, more frequently for high-traffic applications where sample accumulation is fast). For each active experiment, it pulls the score distributions for control and challenger, runs two-sample statistical tests (Mann-Whitney U for non-Gaussian quality scores, two-proportion z-test for binary pass/fail rates, Welch's t-test for latency), computes the effect size and confidence intervals, and evaluates the promotion criteria matrix. The promotion criteria matrix is the key governance artefact: it defines, for each measured dimension, whether the challenger must be significantly better, non-inferior (within a defined equivalence margin), or has no requirement. A model upgrade that improves quality by 8% but increases cost by 25% with no latency change may or may not meet promotion criteria — the matrix makes this decision explicit and auditable rather than subjective.
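To make two of the tests named above concrete, here is a standard-library sketch of a Mann-Whitney U test (normal approximation) and a two-proportion z-test. It is a simplified illustration: the Mann-Whitney variance omits the tie and continuity corrections, so a production service would more likely call a maintained library such as SciPy (`scipy.stats.mannwhitneyu`). Function names are chosen for this example.

```python
import math
from statistics import NormalDist
from typing import Sequence, Tuple


def mann_whitney_u(control: Sequence[float],
                   challenger: Sequence[float]) -> Tuple[float, float]:
    """Return (U for challenger, two-sided p-value) using the normal approximation."""
    combined = sorted([(v, 0) for v in control] + [(v, 1) for v in challenger])
    rank_sum_challenger = 0.0
    i, n = 0, len(combined)
    while i < n:
        j = i
        while j < n and combined[j][0] == combined[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of ranks i+1..j shared by the tied values
        rank_sum_challenger += avg_rank * sum(g for _, g in combined[i:j])
        i = j
    n1, n2 = len(control), len(challenger)
    u = rank_sum_challenger - n2 * (n2 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction (simplification)
    z = (u - mu) / sigma
    return u, 2 * (1 - NormalDist().cdf(abs(z)))


def two_proportion_z_test(pass_control: int, n_control: int,
                          pass_challenger: int, n_challenger: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for challenger pass rate minus control pass rate."""
    p_c, p_ch = pass_control / n_control, pass_challenger / n_challenger
    pooled = (pass_control + pass_challenger) / (n_control + n_challenger)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_challenger))
    if se == 0:
        return 0.0, 1.0  # both rates are identical at 0% or 100%
    z = (p_ch - p_c) / se
    return z, 2 * (1 - NormalDist().cdf(abs(z)))


print(mann_whitney_u([0.71, 0.80, 0.65, 0.77], [0.82, 0.79, 0.90, 0.85]))
print(two_proportion_z_test(500, 1000, 560, 1000))
```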
5. Architecture Diagram
6. Components
| Component | Responsibility | Technology Examples |
|---|---|---|
| Session Hash Splitter | Assigns users/sessions deterministically to control or challenger model based on configurable split percentage; maintains assignment consistency within session | Feature flag system (LaunchDarkly, Unleash, AWS AppConfig), custom hash-mod assignment middleware |
| Experiment Configuration Store | Stores experiment parameters (split %, model identifiers, promotion criteria, start date, max duration, minimum sample size) | LaunchDarkly experiment config, AWS AppConfig, PostgreSQL experiment table |
| LLM Client Wrapper with Variant Routing | Routes requests to control or challenger model endpoint based on session assignment; injects model_variant label into all telemetry | Custom wrapper over OpenAI/Anthropic/Bedrock SDK; variant label added to OTel span attributes |
| Quality Scorer | Scores outputs from both variants using the same evaluation pipeline (from EAAPL-OBS006); ensures comparability by using identical judge prompts and deterministic checks | Ragas, DeepEval, judge LLM scorer with pinned model version |
| User Feedback Collector | Collects explicit user signals (thumbs up/down, star rating, correction actions) and implicit signals (session continuation, task completion) attributed to model variant | Custom feedback widget, PostHog event, Amplitude event, Mixpanel |
| Evaluation Analysis Service | Runs statistical tests (Mann-Whitney U, Welch's t-test, two-proportion z-test) on accumulated data; computes effect sizes and confidence intervals; evaluates promotion criteria matrix | Python SciPy service, R statistical service, MLflow evaluation module |
| Promotion Criteria Matrix | Version-controlled configuration defining per-metric promotion requirements (must-beat, non-inferior, no-requirement) with thresholds and minimum effect sizes | YAML config in application repository, reviewed by AI governance team |
| Promotion Controller | Monitors sample accumulation; generates progress reports; enforces maximum experiment duration; executes promotion or rejection action | AWS Step Functions, Temporal workflow, custom orchestration service |
7. Implementation Steps
Step 1: Instrument Traffic Splitting with Deterministic Assignment
Implement session-level assignment using a consistent hash function (e.g., SHA-256(user_id + experiment_id) % 100 < split_percentage). This ensures the same user always hits the same model variant throughout the experiment, which is critical for multi-turn applications and for correlating user feedback to model assignment. Store the assignment as a session attribute accessible to all downstream components. Validate that the assignment distribution is within 0.5% of the target split percentage before activating the experiment (chi-squared test on the assignment distribution against expected proportions). Start with a 5% challenger split and a 95% control split to limit blast radius while accumulating data.
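The hash-based assignment in this step can be sketched as follows. The `:` separator between the two IDs, the use of the first eight digest bytes, and the helper names are assumptions for illustration. The chi-squared helper compares an observed challenger count with the expected split (one degree of freedom, critical value 3.841 at the 0.05 level).

```python
import hashlib
from typing import Iterable


def assign_variant(user_id: str, experiment_id: str, split_percentage: int) -> str:
    """Deterministically map a user to 'challenger' or 'control' for one experiment."""
    digest = hashlib.sha256(f"{user_id}:{experiment_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return "challenger" if bucket < split_percentage else "control"


def observed_split(user_ids: Iterable[str], experiment_id: str,
                   split_percentage: int) -> float:
    """Fraction of the given users that land in the challenger cohort."""
    ids = list(user_ids)
    hits = sum(assign_variant(u, experiment_id, split_percentage) == "challenger"
               for u in ids)
    return hits / len(ids)


def chi_squared_stat(challenger_count: int, total: int, split_percentage: int) -> float:
    """Chi-squared statistic against the expected split; requires 0 < split < 100."""
    expected_challenger = total * split_percentage / 100
    expected_control = total - expected_challenger
    control_count = total - challenger_count
    return ((challenger_count - expected_challenger) ** 2 / expected_challenger
            + (control_count - expected_control) ** 2 / expected_control)


sample_ids = [f"user-{i}" for i in range(20000)]
share = observed_split(sample_ids, "exp-001", 5)
stat = chi_squared_stat(round(share * len(sample_ids)), len(sample_ids), 5)
print(f"challenger share={share:.4f}, chi-squared={stat:.3f} (reject uniformity if > 3.841)")
```

Including the experiment ID in the hash input means a user's bucket differs between experiments, so the same users are not always the first to see every challenger.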
Step 2: Wire Per-Variant Telemetry with Identical Instrumentation
Add the model_variant label (values: control, challenger) to all telemetry records from the LLM client wrapper. This single label is what enables all downstream analysis. Verify that the quality scorer, cost calculator, latency recorder, and user feedback events all carry this label before starting the experiment. Run a pre-experiment data quality check: confirm that model_variant appears on > 99.9% of LLM request telemetry records. A single unlabelled request is not a problem; a systematic labelling gap invalidates the entire experiment. Implement the same check as an ongoing data quality alert throughout the experiment.
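The pre-experiment data quality check described in this step might look like the sketch below. Telemetry records are modelled as plain dictionaries, and the function names are hypothetical. The 99.9% threshold is the one stated above.

```python
from typing import Iterable, Mapping

VALID_VARIANTS = {"control", "challenger"}


def label_coverage(records: Iterable[Mapping]) -> float:
    """Fraction of telemetry records that carry a valid model_variant label."""
    rows = list(records)
    if not rows:
        return 0.0  # no telemetry at all is treated as a failed check
    labelled = sum(1 for r in rows if r.get("model_variant") in VALID_VARIANTS)
    return labelled / len(rows)


def coverage_ok(records: Iterable[Mapping], threshold: float = 0.999) -> bool:
    """True when label coverage is strictly above the threshold."""
    return label_coverage(records) > threshold


telemetry = [{"model_variant": "control"}] * 9999 + [{"request_id": "unlabelled"}]
print(label_coverage(telemetry), coverage_ok(telemetry))
```

Running the same function on a rolling window of recent records is one way to turn the one-off check into the ongoing alert.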
Step 3: Define and Version the Promotion Criteria Matrix
Before activating the experiment, define the promotion criteria matrix as a YAML file committed to the application repository. The matrix must specify, for each measured dimension: the measurement (metric name and aggregation), the requirement type (MUST_BEAT: challenger must be statistically significantly better; NON_INFERIOR: challenger must be within the equivalence margin; NO_REQUIREMENT: informational only), the threshold or equivalence margin, the minimum effect size for MUST_BEAT requirements, and the statistical significance level (default p < 0.05). Example criteria: faithfulness score NON_INFERIOR with equivalence margin -0.03; p99 latency NON_INFERIOR with equivalence margin +50ms; cost per request NON_INFERIOR with equivalence margin +10%; user thumbs-up rate MUST_BEAT with minimum effect size 0.02. The matrix must be approved by the AI governance team before experiment activation.
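The pattern stores the matrix as YAML. To keep the examples in one language, the sketch below expresses the same four example criteria as Python data and shows one possible evaluation rule. The class names, the `higher_is_better` flag, and the use of a confidence interval on the challenger-minus-control difference for the non-inferiority check are assumptions, not requirements of the pattern. The cost margin is expressed in percent relative to control.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Criterion:
    metric: str
    requirement: str           # "MUST_BEAT", "NON_INFERIOR" or "NO_REQUIREMENT"
    higher_is_better: bool
    margin: float = 0.0        # equivalence margin magnitude (NON_INFERIOR)
    min_effect: float = 0.0    # minimum effect size (MUST_BEAT)
    alpha: float = 0.05        # significance level


@dataclass(frozen=True)
class Observation:
    diff: float                # challenger minus control
    ci_low: float              # confidence interval on diff
    ci_high: float
    p_value: float
    n_per_variant: int


MATRIX = [
    Criterion("faithfulness", "NON_INFERIOR", higher_is_better=True, margin=0.03),
    Criterion("p99_latency_ms", "NON_INFERIOR", higher_is_better=False, margin=50.0),
    Criterion("cost_per_request_pct", "NON_INFERIOR", higher_is_better=False, margin=10.0),
    Criterion("thumbs_up_rate", "MUST_BEAT", higher_is_better=True, min_effect=0.02),
]


def evaluate(criterion: Criterion, obs: Optional[Observation], min_samples: int) -> str:
    """Return PASS, FAIL or INSUFFICIENT DATA for one criterion."""
    if criterion.requirement == "NO_REQUIREMENT":
        return "PASS"  # informational only
    if obs is None or obs.n_per_variant < min_samples:
        return "INSUFFICIENT DATA"
    # Orient values so that positive always means "challenger is better".
    if criterion.higher_is_better:
        gain, worst_case = obs.diff, obs.ci_low
    else:
        gain, worst_case = -obs.diff, -obs.ci_high
    if criterion.requirement == "NON_INFERIOR":
        return "PASS" if worst_case > -criterion.margin else "FAIL"
    if criterion.requirement == "MUST_BEAT":
        beats = obs.p_value < criterion.alpha and gain >= criterion.min_effect
        return "PASS" if beats else "FAIL"
    raise ValueError(f"unknown requirement type: {criterion.requirement}")


latency_obs = Observation(diff=30.0, ci_low=10.0, ci_high=60.0, p_value=0.01, n_per_variant=6000)
print(evaluate(MATRIX[1], latency_obs, min_samples=5000))
```

In this reading, a latency increase whose interval reaches 60 ms fails a 50 ms margin even though the point estimate of 30 ms is inside it, which is the conservative interpretation of non-inferiority.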
Step 4: Run Analysis, Generate Evidence Package, and Execute Promotion Decision
Once the minimum sample size gate is reached (typically 1,000+ scored responses per variant for quality metrics; 5,000+ for latency and cost), run the full statistical analysis. Generate the promotion evidence package: a structured document containing the experiment configuration, sample sizes per variant, per-metric score distributions (with box plots or violin plots), statistical test results (test statistic, p-value, confidence interval, effect size), promotion criteria matrix evaluation (each criterion: PASS/FAIL/INSUFFICIENT DATA), and the final promotion recommendation. Route the evidence package to the approvers defined in the experiment configuration (approval may be fully automated for low-stakes upgrades, or require human sign-off for regulated systems). Execute the approved decision: promote (update experiment configuration to route 100% of traffic to challenger), reject (revert split to 0% challenger, document rejection reason), or extend (increase sample size gate and continue).
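One way to turn per-criterion results into the three outcomes named in this step is sketched below. The mapping (any FAIL rejects, otherwise any INSUFFICIENT DATA extends, otherwise promote) and the helper names are assumptions. The pattern itself leaves the final call to the configured approvers. The gate values are the typical figures quoted above.

```python
from typing import Dict, Mapping

# Typical per-variant sample gates quoted in the step above.
SAMPLE_GATES = {"quality": 1000, "latency": 5000, "cost": 5000}


def gate_reached(samples: Mapping[str, Mapping[str, int]],
                 gates: Mapping[str, int] = SAMPLE_GATES) -> bool:
    """True when every metric family has enough samples for BOTH variants."""
    return all(
        min(samples.get(family, {"control": 0, "challenger": 0}).values()) >= gate
        for family, gate in gates.items()
    )


def promotion_recommendation(results: Dict[str, str]) -> str:
    """Map criterion results (PASS / FAIL / INSUFFICIENT DATA) to a recommendation."""
    statuses = set(results.values())
    if not results:
        return "extend"   # nothing evaluated yet
    if "FAIL" in statuses:
        return "reject"   # revert split to 0% challenger and document the reason
    if "INSUFFICIENT DATA" in statuses:
        return "extend"   # raise the sample gate and continue
    return "promote"      # route 100% of traffic to the challenger


counts = {
    "quality": {"control": 19000, "challenger": 1100},
    "latency": {"control": 95000, "challenger": 5200},
    "cost": {"control": 95000, "challenger": 5200},
}
print(gate_reached(counts),
      promotion_recommendation({"faithfulness": "PASS", "p99_latency_ms": "PASS"}))
```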
8. Security Considerations
OWASP LLM Top 10 Mapping
| OWASP ID | Threat | Mitigation |
|---|---|---|
| LLM01 Prompt Injection | Adversarial users who discover they are in a challenger cohort may attempt to manipulate challenger outputs to skew evaluation metrics | Assignment must not be detectable by the user; do not expose model variant in API responses; apply the same output filtering to both variants |
| LLM04 Model Denial of Service | Traffic splitting routes some production traffic to a challenger model that may have different rate limits or throughput characteristics than the control | Implement per-variant rate limit tracking; configure challenger with capacity headroom appropriate to the split percentage; add a circuit breaker that fails challenger traffic over to the control |
| LLM05 Supply Chain Vulnerabilities | The challenger model is a new supply chain dependency; it may have different safety properties, content filter behaviours, or data handling terms than the control | Conduct security and privacy review of challenger model provider terms before experiment activation; verify content filter equivalence on a safety test set before routing live traffic |
| LLM10 Model Theft | Side-channel analysis of response differences between control and challenger could expose proprietary prompt engineering details | Ensure responses do not include model identifiers; apply the same prompt confidentiality controls to both variants |
9. Governance Artefacts
- Experiment Configuration Record: version-controlled YAML defining split percentage, model identifiers, promotion criteria matrix, minimum sample size, maximum duration, and approver chain
- Promotion Evidence Package: generated at experiment conclusion; contains statistical analysis, per-metric evidence, criteria evaluation, and promotion decision with approver identities and timestamps
- Experiment Registry: central log of all past and active experiments with model versions tested, experiment duration, outcome (promoted/rejected), and link to evidence package
- Promotion Criteria Matrix Approval Record: sign-off by AI governance team before each experiment activation; ensures criteria are appropriate to the use case and regulatory context
- Post-Promotion Monitoring Alert: configures EAAPL-OBS007 Prompt Drift Detection to run a baseline refresh for the newly promoted model version; ensures drift detection is active immediately after promotion
10. SLOs
| SLO | Target | Measurement |
|---|---|---|
| Experiment setup time | < 1 business day from decision to activate to first challenger traffic routing | Wall-clock time from experiment config commit to first challenger-labelled telemetry record |
| Statistical analysis turnaround | Evidence package generated within 4 hours of minimum sample size gate reached | Time from minimum sample size threshold crossed to evidence package available |
| Challenger availability equivalence | Challenger model error rate within 0.5 percentage points of control model error rate during experiment | Challenger error rate minus control error rate; monitored hourly; circuit breaker if exceeded |
| Assignment consistency | > 99.99% of multi-turn sessions maintain consistent model assignment throughout session | 1 − (multi-turn sessions containing a variant switch / total sessions spanning > 1 turn) |
| Promotion decision SLA | Promotion or rejection decision made within 5 business days of evidence package generation | Time from evidence package to approved promotion/rejection action executed |
| Evaluation latency (CI gate) | <90s per 100-sample batch | P99 pipeline duration |
| Drift alert MTTD (Mean Time to Detect) | <24 hours | Time from regression onset to alert firing |
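The assignment consistency SLO in the table above can be measured from request telemetry. The sketch below takes a session-level reading of that SLO (the share of multi-turn sessions that only ever saw one variant). The input shape and function name are assumptions for illustration.

```python
from collections import defaultdict
from typing import Iterable, Tuple


def assignment_consistency(requests: Iterable[Tuple[str, str]]) -> float:
    """Fraction of multi-turn sessions whose requests all carry the same variant.

    `requests` yields (session_id, model_variant) pairs, one per request.
    """
    variants = defaultdict(set)
    turns = defaultdict(int)
    for session_id, variant in requests:
        variants[session_id].add(variant)
        turns[session_id] += 1
    multi_turn = [s for s, n in turns.items() if n > 1]
    if not multi_turn:
        return 1.0  # no multi-turn sessions, so nothing can be inconsistent
    consistent = sum(1 for s in multi_turn if len(variants[s]) == 1)
    return consistent / len(multi_turn)


log = [("a", "control"), ("a", "control"),
       ("b", "challenger"), ("b", "control"),
       ("c", "control")]
print(assignment_consistency(log))
```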
11. Cost Model
| Cost Driver | Estimate | Notes |
|---|---|---|
| Challenger model inference cost | Varies; typically the primary cost driver of the evaluation | At 10% split: challenger cost = 10% of total LLM cost × (challenger cost per token / control cost per token); a cheaper challenger reduces total cost; a more expensive challenger increases it |
| Quality scoring for both variants | $30–$300/month | Same scoring infrastructure as EAAPL-OBS006 production monitor; applied to both variants at the same sample rate; effectively doubles scoring cost at 50% split |
| Evaluation analysis compute | $5–$20/month | Statistical analysis is computationally lightweight; runs on serverless or small instance |
| Feature flag / experiment platform | $0–$500/month | LaunchDarkly from $200/month; Unleash open-source is free; custom implementation has engineering cost but no recurring fee |
| Evidence package generation | $10–$50 per experiment | Automated report generation with visualisation; one-time cost per experiment conclusion |
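The challenger inference cost row can be worked through numerically. This sketch applies the formula from the table under the simplifying assumption that both models consume the same number of tokens per request, so only the per-token price ratio differs. The function name and the example figures are illustrative, not measured data.

```python
from typing import Tuple


def experiment_llm_cost(baseline_monthly_cost: float, split_percentage: float,
                        price_ratio: float) -> Tuple[float, float, float]:
    """Return (control cost, challenger cost, total) for one month of the experiment.

    price_ratio = challenger cost per token / control cost per token.
    Assumes identical token usage per request for both models.
    """
    share = split_percentage / 100
    control_cost = baseline_monthly_cost * (1 - share)
    challenger_cost = baseline_monthly_cost * share * price_ratio
    return control_cost, challenger_cost, control_cost + challenger_cost


# Illustrative figures: 10,000/month baseline, 10% split, challenger 40% pricier per token.
control, challenger, total = experiment_llm_cost(10_000, 10, 1.4)
print(f"control={control:.0f} challenger={challenger:.0f} total={total:.0f}")
```

With these inputs the experiment adds about 4% to monthly inference spend, while a price ratio below 1 would lower it.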
12. Trade-off Analysis
| Dimension | Benefit | Trade-off |
|---|---|---|
| Session-level assignment | Consistent user experience within session; feedback signals are unambiguous | Slower sample accumulation than request-level splitting; minimum sample size gate takes longer to reach in low-traffic applications |
| Small initial split (5%) | Limits blast radius if challenger has unexpected regressions | Slower sample accumulation at small splits; high-traffic applications can increase split as confidence grows |
| Multi-criteria promotion matrix | Prevents promotion of a model that wins on quality but loses on cost or latency; makes trade-off decisions explicit and auditable | Matrix calibration requires governance effort; if criteria are too strict, no challenger can ever be promoted; too lenient and the gate provides no protection |
| Statistical significance gate | Prevents promotion on noise; provides defensible evidence for regulatory purposes | Statistical significance does not equal practical significance; a statistically significant but tiny improvement may not justify the operational change |
| Automated promotion capability | Removes human bottleneck for routine model upgrades; enables fast iteration | Automated promotion requires high confidence in the promotion criteria matrix and the quality of the evaluation signals; inappropriate for regulated high-risk AI systems where human sign-off is required |
13. Failure Modes
| Failure | Trigger | Recovery |
|---|---|---|
| Novelty effect biases early results | Users assigned to the challenger variant interact differently because the responses are slightly different, not because the model is better; early metrics are inflated | Implement a burn-in period (first 48 hours of challenger traffic excluded from analysis); require minimum experiment duration regardless of sample size accumulation speed |
| Assignment consistency failure corrupts signal | Experiment configuration change mid-experiment causes some users to switch variants; feedback signals are no longer attributable to a single model | Freeze experiment configuration after activation; any configuration change requires a new experiment with a fresh baseline; contaminated period data must be excluded from analysis |
| Challenger model rate limit reached at scale | Challenger model endpoint has lower rate limits than control; as split percentage increases, challenger requests are throttled while control requests succeed | Monitor per-variant error rates continuously; implement circuit breaker that reverts all traffic to control if challenger error rate exceeds control by more than 0.5 percentage points; provision challenger capacity before increasing split |
| Promotion criteria matrix is wrong for the context | Matrix was calibrated for a different use case; a model is promoted that has a meaningful regression on a dimension not covered by the matrix | Post-promotion monitoring via EAAPL-OBS007 detects regressions that the matrix did not gate on; quarterly matrix review process with retrospective analysis of post-promotion quality trends |
| Experiment runs indefinitely without decision | Minimum sample size never reached (traffic too low); maximum experiment duration not enforced; experiment becomes stale | Enforce hard maximum experiment duration; generate an escalation alert at 90% of maximum duration; declare inconclusive result and revert to control if sample size gate is not reached |
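Two of the recoveries in the table above, the error-rate circuit breaker and the 48-hour burn-in exclusion, can be sketched in a few lines. The `min_requests` guard, the record shape, and the function names are assumptions added for this example. The 0.5-point gap and the 48-hour window come from the table.

```python
from datetime import datetime, timedelta
from typing import List, Mapping


def should_trip_breaker(control_errors: int, control_total: int,
                        challenger_errors: int, challenger_total: int,
                        max_gap: float = 0.005, min_requests: int = 200) -> bool:
    """True when the challenger error rate exceeds control by more than max_gap.

    min_requests is an assumed guard against tripping on tiny samples.
    """
    if control_total < min_requests or challenger_total < min_requests:
        return False
    gap = challenger_errors / challenger_total - control_errors / control_total
    return gap > max_gap


def exclude_burn_in(records: List[Mapping], experiment_start: datetime,
                    burn_in: timedelta = timedelta(hours=48)) -> List[Mapping]:
    """Drop challenger records logged during the burn-in window; keep everything else."""
    cutoff = experiment_start + burn_in
    return [r for r in records
            if r["model_variant"] != "challenger" or r["timestamp"] >= cutoff]


start = datetime(2026, 1, 1)
rows = [
    {"model_variant": "challenger", "timestamp": start + timedelta(hours=1)},
    {"model_variant": "challenger", "timestamp": start + timedelta(hours=49)},
    {"model_variant": "control", "timestamp": start + timedelta(hours=1)},
]
print(should_trip_breaker(10, 1000, 20, 1000), len(exclude_burn_in(rows, start)))
```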
14. Regulatory Mapping
| Regulation | Requirement | How Pattern Addresses It |
|---|---|---|
| EU AI Act Article 9 | Risk management for high-risk AI must include evaluation of performance across the intended purpose with real-world evidence | Production A/B evaluation provides real-world evidence; promotion evidence package is the documented validation artefact |
| EU AI Act Article 13 | Transparency obligations require that high-risk AI systems be designed and developed to allow providers to comply with transparency requirements including performance documentation | Experiment evidence package and experiment registry provide the performance documentation required for Article 13 compliance |
| APRA CPS 230 | Material model changes in financial services require model validation evidence before deployment | Promotion evidence package satisfies the model validation evidence requirement; promotion criteria matrix satisfies the validation criteria documentation requirement |
| APRA CPS 230 §21 | AI systems classified as critical operations require monitoring that demonstrates the system is operating within defined performance parameters | The evaluation pipeline produces the evidence artefact (evaluation scorecard with rolling baseline) that satisfies the 'regular testing of operational resilience' requirement; A/B promotion gate ensures a challenger model meets the same operational resilience bar as the control before assuming production load |
| APRA CPS 234 §36 | Material changes to AI system behaviour may constitute a 'material information security incident' or 'material service provider change' requiring APRA notification within 72 hours | The detection capability provided by this pattern is the prerequisite for meeting that notification timeline; model version promotion is a recorded change event, and per-variant quality divergence detected during the experiment surfaces material behavioural changes before they reach full production |
| ISO/IEC 42001 Clause 8.4 | AI systems must be evaluated before deployment and after significant changes | Pattern implements evaluation before promotion (by design); post-promotion monitoring integration (EAAPL-OBS007) implements the after-change monitoring |
| NIST AI RMF MANAGE 3.1 | AI risks identified in deployment must be tracked and managed including through testing and validation mechanisms | A/B evaluation with promotion gate implements the pre-deployment risk management mechanism; circuit breaker and rollback implement the deployment risk management mechanism |
15. Reference Implementations
AWS
- Traffic Splitter: AWS AppConfig feature flag with weighted variant assignment; Lambda@Edge or API Gateway request routing
- Experiment Configuration: AWS AppConfig hosted configuration; version-controlled in CodeCommit with approval workflow
- LLM Client Wrapper: Python wrapper over Boto3 Bedrock client; variant injection into CloudWatch structured log dimensions
- Quality Scoring: AWS Lambda async scorer from SQS; same infrastructure as EAAPL-OBS006
- User Feedback: Amazon Pinpoint event ingestion; custom feedback endpoint writing to DynamoDB
- Analysis Service: AWS Lambda scheduled via EventBridge; SciPy in Lambda layer; evidence package written to S3
- Promotion Controller: AWS Step Functions state machine; approval step via Amazon SNS + human approval token
Azure
- Traffic Splitter: Azure App Configuration feature flags with percentage-based targeting; APIM policy for routing
- Experiment Configuration: Azure App Configuration with Key Vault reference for model endpoint secrets
- LLM Client Wrapper: Python wrapper over Azure OpenAI SDK; variant label in Application Insights custom dimension
- Quality Scoring: Azure Functions async scorer via Service Bus
- User Feedback: Azure Event Hubs event ingestion; Cosmos DB for feedback records
- Analysis Service: Azure Functions timer trigger; SciPy Python runtime
- Promotion Controller: Azure Logic Apps workflow with approval action via Teams Adaptive Card
On-Premises
- Traffic Splitter: Nginx upstream split with consistent hash on user ID; or custom middleware in the application API gateway
- Experiment Configuration: PostgreSQL experiment table; GitOps-managed YAML merged via pull request approval
- LLM Client Wrapper: Python wrapper with variant label injected into structured log
- Quality Scoring: Kubernetes Job consumer on Redis queue
- User Feedback: REST endpoint writing to PostgreSQL feedback table
- Analysis Service: Python script running as Kubernetes CronJob; SciPy for statistical tests; Jinja2 for evidence package report generation
- Promotion Controller: Jenkins pipeline with manual approval gate for production deployment step
16. Related Patterns
- EAAPL-OBS001 AI Telemetry Architecture — provides the per-request model_variant telemetry labelling conventions and the metrics backend that stores evaluation data for both variants
- EAAPL-OBS006 LLM Evaluation Pipeline — provides the quality scoring infrastructure used to score both model variants; the CI/CD evaluation gate should be passed by the challenger model before it is activated in an A/B experiment
- EAAPL-OBS007 Prompt Drift Detection — should be activated for the promoted model version immediately after promotion to detect post-promotion quality changes; the control model's stable baseline is the reference for the newly promoted model's drift detection
- EAAPL-OBS005 Model Drift Detection — population-level input distribution monitoring; run alongside this pattern to detect if the experiment cohorts have diverged in input distribution (selection bias that would invalidate the comparison)
- EAAPL-OBS004 AI Incident Management — defines the incident response procedure if the challenger model produces a P0 quality event during the experiment; includes the automatic circuit breaker and rollback steps
17. Maturity Assessment
| Dimension | Level | Notes |
|---|---|---|
| Adoption Breadth | 4 — Proven | A/B testing of ML models is a well-established practice at technology companies; application to LLM model upgrades specifically is proven at AI-native companies and is becoming standard practice |
| Tooling Ecosystem | 4 — Proven | Feature flag platforms (LaunchDarkly, Unleash, Split.io), statistical testing libraries (SciPy, statsmodels), and experiment tracking platforms (MLflow, Neptune) are all mature and widely deployed |
| Regulatory Evidence | 3 — Developing | A/B model evaluation aligns with model risk management validation requirements but specific regulatory guidance on LLM A/B evaluation practices is still emerging; early adopters in financial services are defining the standard |
| Cost Predictability | 4 — Predictable | The primary variable cost is the inference cost differential between control and challenger at the configured split percentage; this is precisely calculable once per-token costs and expected traffic volume are known |
18. Revision History
| Version | Date | Change |
|---|---|---|
| 1.0 | 2026-06-14 | Initial release |