AI Confidence Threshold Routing
Pattern ID: EAAPL-HIL005
Status: Proven
Tags: human-oversight observability llm medium-complexity
Version: 1.0
Last Updated: 2026-06-12
1. Executive Summary
The AI Confidence Threshold Routing pattern dynamically routes inference requests to different handling tiers — fully automated, human-assisted, or human-primary — based on the model's calibrated output confidence. It solves a foundational enterprise AI problem: a single automated handling path produces unacceptable error rates, while routing everything to human review eliminates the economic benefit of AI. Confidence-based routing finds the middle ground, automating only what the model handles reliably.
The pattern addresses four specific technical challenges: raw model confidence scores are poorly calibrated and must be adjusted before they are actionable; thresholds must be set empirically against accuracy data, not guessed; thresholds drift as model behaviour shifts and must be recalibrated quarterly; and business rules must overlay the confidence signal so that certain topics always route to human review regardless of confidence. CIOs and CTOs implementing this pattern gain a precision instrument for balancing automation rate, quality, and human cost — with dashboards that make the trade-off visible and adjustable. It is the prerequisite infrastructure for all other human-in-the-loop patterns that rely on confidence-based escalation.
2. Problem Statement
Business Problem
A flat automation policy — "AI handles everything" or "AI handles if confidence > X" with an arbitrarily chosen X — produces either unacceptable error rates or inefficient human bottlenecks. Organisations need a principled, empirically grounded, and continuously maintained mechanism to determine which requests are safe to automate and which require human involvement.
Technical Problem
Neural network softmax outputs are not, in general, calibrated probabilities. A model that reports 0.95 confidence may be correct only 80–85% of the time. Using raw softmax as a routing gate therefore over-automates systematically: cases the model is actually uncertain about are sent down the automated path. The opposite error also occurs: an overly conservative threshold under-automates, sending cases the model handles reliably to human review and wasting expert time. Neither error is easily visible without calibration infrastructure.
Symptoms
- Automation rate is set but never updated; the original threshold was chosen without empirical data
- Model accuracy by confidence band has never been measured
- Human reviewers complain that most escalated AI cases are trivially correct
- Alternatively: downstream quality metrics show AI error rate higher than expected for the stated confidence threshold
- No mechanism exists to test whether a different threshold would improve the cost/quality trade-off
Cost of Inaction
- Over-automation: AI errors accumulate at scale in domains where the model is poorly calibrated; errors surface in customer complaints, regulatory findings, or quality audits
- Under-automation: human review cost exceeds AI's economic benefit; ROI fails; AI project is defunded
- No recalibration: model behaviour shifts over time; fixed thresholds become increasingly misaligned with actual accuracy; automation rate creeps upward as model confidence inflates on distribution-shifted inputs
3. Context
When to Apply
- Any production AI system where a subset of requests should be handled by humans
- Systems where the accuracy requirement varies by confidence tier (e.g. routine automation at 99% accuracy; exceptional items at 95%)
- Regulated environments where the human oversight boundary must be empirically defensible
- Systems using LLMs where confidence extraction requires bespoke engineering (LLMs do not natively produce calibrated confidence)
When NOT to Apply
- Binary all-or-nothing automation decisions where every case requires the same handling (use a static policy instead)
- Generative output use cases without a well-defined output space (confidence routing requires a probability distribution over a defined outcome set)
- Use cases where the latency of confidence estimation and routing exceeds the acceptable response SLO
Prerequisites
- Model produces a probability distribution over a discrete output space (or a proxy can be constructed — see LLM section in Architecture Overview)
- A validation dataset with known ground truth is available for threshold calibration
- Human review workforce available to handle the routed tier
Industry Applicability
| Industry | Routing Decision | Automation Tier | Human-Assist Tier | Human-Primary Tier |
|---|---|---|---|---|
| Financial Services | Transaction risk classification | Score < 0.2 (low risk, auto-approve) | 0.2–0.7 (review queue) | > 0.7 or regulatory flag (analyst) |
| Insurance | Claims triage | Routine claims with high confidence | Complex claims | Fraud signals + large claims |
| Healthcare | Document classification (ICD coding) | Standard codes, confidence > 0.90 | Confidence 0.70–0.90 | New codes / patient exceptions |
| Legal | Contract clause risk flagging | Standard clauses (boilerplate) | Unusual clauses, medium confidence | High-risk clause types always |
| Customer Service | Intent classification | Known intents, high confidence | Ambiguous intents | Sensitive topics, always |
| Retail | Product category classification | Well-established categories | New product types | Brand-sensitive or restricted categories |
4. Architecture Overview
The AI Confidence Threshold Routing pattern has five layers that must be implemented and maintained together.
Layer 1 — Confidence Extraction and Calibration. Traditional classifiers (logistic regression, gradient boosting, neural networks) produce a probability vector, typically via a softmax or sigmoid. These raw probabilities are often miscalibrated; modern neural networks in particular tend to be overconfident. Platt scaling fits a logistic regression on a held-out validation set, mapping raw scores to calibrated probabilities. Temperature scaling is simpler: it divides the logit vector by a learned scalar T before the softmax. With T > 1 this pulls probabilities toward the centre, and because dividing by a positive scalar preserves the ordering of the logits, the predicted class does not change. Both methods require a validation set with ground truth. The calibration function is stored alongside the model and applied at inference time. For LLMs that do not natively produce probabilities, proxies can be constructed: ask the LLM to rate its own confidence (a weak signal, because LLM self-assessment is poorly calibrated); use token log-probabilities to estimate token-level uncertainty; run an ensemble of LLM calls with different seeds and measure output consistency; or train a classifier to predict LLM accuracy from input features.
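As an illustration of temperature scaling, the following is a minimal sketch using only the Python standard library. The grid search, the function names, and the toy validation set are assumptions made for this example; a production calibration step would more likely use an optimiser or an existing library routine and a much larger validation set.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of the logits after dividing them by a temperature scalar."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    losses = [-math.log(softmax(row, temperature)[y])
              for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)

def fit_temperature(logit_rows, labels, grid=None):
    """Pick the temperature on a grid that minimises validation NLL (illustrative)."""
    grid = grid or [0.5 + 0.05 * i for i in range(91)]  # 0.5 .. 5.0
    return min(grid, key=lambda t: nll(logit_rows, labels, t))

# Toy validation set (hypothetical): confident logits, but 1 of 4 predictions is wrong.
val_logits = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 0.0]]
val_labels = [0, 0, 1, 1]

T = fit_temperature(val_logits, val_labels)
raw_conf = max(softmax(val_logits[0]))      # confidence before calibration
cal_conf = max(softmax(val_logits[0], T))   # confidence after calibration
print(f"T={T:.2f} raw={raw_conf:.3f} calibrated={cal_conf:.3f}")
```

In this toy case the fitted T is greater than 1 and the calibrated confidence moves close to the observed 75% accuracy, while the predicted class is unchanged.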
Layer 2 — Threshold Setting. Thresholds are not chosen arbitrarily — they are derived empirically from the calibrated confidence versus accuracy curve on a validation set. For each confidence bin (e.g. 0.50–0.55, 0.55–0.60, ..., 0.95–1.00), compute the actual accuracy. Plot this curve and identify where accuracy meets the required threshold for automation (typically 99%+ for fully automated decisions). The confidence value at that intersection becomes the automation threshold. The lower threshold (below which human-primary handling is required) is set where accuracy falls below the acceptable bar for human-assisted review. These thresholds should be set independently for each domain or topic cluster if accuracy varies significantly across categories.
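The binning procedure can be sketched as follows. This is one possible reading of the method, assuming the threshold is taken as the lowest bin edge for which that bin and every higher bin meet the accuracy bar; the function names, the `min_count` guard, and the toy data are hypothetical.

```python
def accuracy_by_bin(confidences, correct, edges):
    """Return (lower_edge, count, accuracy) per bin [lo, hi); the last bin includes its upper edge."""
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        is_last = hi == edges[-1]
        hits = [c for p, c in zip(confidences, correct)
                if lo <= p < hi or (is_last and p == hi)]
        if hits:  # skip empty bins
            rows.append((lo, len(hits), sum(hits) / len(hits)))
    return rows

def lowest_edge_meeting(rows, required_accuracy, min_count=1):
    """Lowest bin edge such that this bin and all higher bins meet the accuracy bar."""
    threshold = None
    for lo, n, acc in sorted(rows, reverse=True):  # walk down from the top bin
        if n < min_count or acc < required_accuracy:
            break
        threshold = lo
    return threshold

# Toy validation data (hypothetical): calibrated confidence and correctness flags.
confidences = [0.95, 0.97, 0.92, 0.99, 0.85, 0.88, 0.82, 0.81,
               0.75, 0.72, 0.78, 0.71, 0.65, 0.62]
correct = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0]
edges = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

rows = accuracy_by_bin(confidences, correct, edges)
t_high = lowest_edge_meeting(rows, required_accuracy=0.99)  # automation threshold
t_low = lowest_edge_meeting(rows, required_accuracy=0.70)   # human-assisted floor
print(rows, t_high, t_low)
```

With real data, `min_count` should be large enough that a bin's measured accuracy is statistically meaningful; four items per bin, as in this toy set, cannot support a 99% claim.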
Layer 3 — Multi-Tier Routing Architecture. The routing engine evaluates confidence against the calibrated thresholds and applies business rules to determine the handling tier. Tier 1 (high confidence, above the automation threshold): the request is handled automatically with no human involvement. Tier 2 (medium confidence, between the thresholds): the request is routed to a human-assisted queue where a human reviews and approves the AI recommendation. Tier 3 (low confidence, below the lower threshold): the request is routed to a human-primary queue where a human makes the decision independently, with the AI output as an optional reference. The business rules overlay applies regardless of confidence: topics flagged as always-human (legal advice, clinical diagnosis, high-value financial transactions above a configured amount) route to Tier 3 unconditionally.
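A minimal sketch of the three-tier decision with the business rules overlay evaluated first. The topic names, the amount limit, the `RoutingDecision` type, and the choice to treat a confidence exactly equal to a threshold as meeting it are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

ALWAYS_HUMAN_TOPICS = {"legal_advice", "clinical_diagnosis"}  # hypothetical rule set
MAX_AUTO_AMOUNT = 10_000.0                                    # hypothetical value limit

@dataclass(frozen=True)
class RoutingDecision:
    tier: int
    reason: str
    rule_fired: Optional[str] = None

def route(confidence: float, topic: str, amount: Optional[float],
          t_high: float, t_low: float) -> RoutingDecision:
    """Apply always-human rules first, then compare confidence to T_high and T_low."""
    if not 0.0 <= t_low <= t_high:
        raise ValueError("expected 0 <= t_low <= t_high")
    # Business rules overlay: fires regardless of confidence.
    if topic in ALWAYS_HUMAN_TOPICS:
        return RoutingDecision(3, "always_human_rule", f"topic:{topic}")
    if amount is not None and amount > MAX_AUTO_AMOUNT:
        return RoutingDecision(3, "always_human_rule", "amount_over_limit")
    # Threshold evaluation on the calibrated confidence.
    if confidence >= t_high:
        return RoutingDecision(1, "confidence_at_or_above_t_high")
    if confidence >= t_low:
        return RoutingDecision(2, "confidence_between_thresholds")
    return RoutingDecision(3, "confidence_below_t_low")

print(route(0.97, "billing", 50.0, t_high=0.9, t_low=0.7))
print(route(0.99, "legal_advice", None, t_high=0.9, t_low=0.7))
```

Returning the reason and the rule that fired alongside the tier keeps each decision explainable in the audit log.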
Layer 4 — Threshold Monitoring and Recalibration. Model behaviour shifts over time due to data distribution changes, model updates, and real-world evolution. The calibration relationship between raw confidence and actual accuracy drifts. The system monitors calibration health in three ways: Expected Calibration Error (ECE) is computed on a rolling window of recently processed items where ground truth is available; Reliability Diagrams are generated monthly to visualise calibration drift; and Population Stability Index (PSI) detects input distribution shift that precedes calibration degradation. When any of these signals exceeds its threshold, a recalibration job is triggered. Recalibration uses recent labelled data from the human review tiers — the ground truth generated by human reviewers is fed back to update the calibration function.
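The two numeric drift signals can be computed as in the sketch below, which uses the standard definitions: ECE as the count-weighted mean gap between accuracy and average confidence per bin, and PSI as the sum of (actual − expected) × ln(actual / expected) over histogram bins. The bin count, the epsilon guard, and the sample data are assumptions.

```python
import math

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted mean of |accuracy - average confidence| over equal-width bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for p, c in zip(confidences, correct):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, c))
    ece = 0.0
    for items in bins:
        if items:
            avg_conf = sum(p for p, _ in items) / len(items)
            acc = sum(c for _, c in items) / len(items)
            ece += len(items) / n * abs(acc - avg_conf)
    return ece

def population_stability_index(expected_counts, actual_counts, eps=1e-6):
    """PSI between a reference histogram and a recent histogram over the same bins."""
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    psi = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_pct = max(e / e_total, eps)  # eps avoids log(0) on empty bins
        a_pct = max(a / a_total, eps)
        psi += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return psi

# Hypothetical rolling window: ten items at 0.95 confidence, eight of them correct.
ece = expected_calibration_error([0.95] * 10, [1] * 8 + [0] * 2)
psi = population_stability_index([50, 50], [90, 10])
print(f"ECE={ece:.3f} PSI={psi:.3f}")
```

Both values in this example would exceed the alert levels used elsewhere in this document (ECE > 0.08, PSI > 0.2).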
Layer 5 — A/B Testing Threshold Values. Threshold values represent a trade-off between automation rate and quality. The system supports A/B testing threshold values to find the optimal point. A control group receives the current threshold configuration; a treatment group receives modified thresholds (tighter or looser). Both groups are tracked for: automation rate (efficiency), error rate on automated decisions (quality), human review volume (cost), and customer or business outcome metrics. After a predefined period (typically 2–4 weeks), statistical significance is tested and the winning configuration is promoted. This enables data-driven threshold optimisation rather than intuition-driven adjustment.
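One way to run the comparison is sketched below: a salted hash gives each unit a stable group assignment, and a two-proportion z-test compares the error rates on automated decisions. The choice of test, the salt, and the counts are assumptions; the pattern itself does not prescribe a specific significance test.

```python
import hashlib
import math

def assign_group(unit_id: str, treatment_share: float = 0.5,
                 salt: str = "threshold-ab-1") -> str:
    """Deterministically assign a unit to control or treatment (stable across calls)."""
    digest = hashlib.sha256(f"{salt}:{unit_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"

def two_proportion_z_test(errors_a, n_a, errors_b, n_b):
    """Return (z, two-sided p-value) for the difference in error rates, B minus A."""
    p_a, p_b = errors_a / n_a, errors_b / n_b
    pooled = (errors_a + errors_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0.0:  # no errors (or only errors) in both groups
        return 0.0, 1.0
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value

# Hypothetical result: control 50 errors in 10,000; treatment 90 errors in 10,000.
z, p = two_proportion_z_test(50, 10_000, 90, 10_000)
print(f"z={z:.2f} p={p:.4f} group={assign_group('customer-42')}")
```

Assigning by a stable identifier rather than per request keeps the same unit in the same group for the whole test, which helps avoid overlap between control and treatment.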
5. Architecture Diagram
flowchart TD
subgraph Scoring["Confidence Scoring"]
A[Inference Request]
B[Model + Calibration Function]
end
subgraph Routing["Threshold Routing"]
C{Business Rules Overlay}
D{Threshold Evaluator}
end
subgraph Handling["Handling Tiers"]
E[Tier 1 Automated Decision]
F[Tier 2 Human-Assisted Queue]
G[Tier 3 Human-Primary Queue]
H[Calibration Monitor]
end
A --> B
B --> C
C -->|always-human rule| G
C -->|pass| D
D -->|high confidence| E
D -->|medium confidence| F
D -->|low confidence| G
E --> H
F --> H
G --> H
H -->|drift detected| B
style A fill:#dbeafe,stroke:#3b82f6
style B fill:#f0fdf4,stroke:#22c55e
style C fill:#f3e8ff,stroke:#a855f7
style D fill:#f3e8ff,stroke:#a855f7
style E fill:#d1fae5,stroke:#10b981
style F fill:#f0fdf4,stroke:#22c55e
style G fill:#fee2e2,stroke:#ef4444
style H fill:#fef9c3,stroke:#eab308
6. Components
| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| Inference Engine | ML Serving | Run model forward pass; return raw logits or probabilities | SageMaker, Vertex AI, Azure ML, TorchServe, vLLM | Critical |
| Calibration Function | ML Utility | Apply Platt/temperature scaling to raw outputs | scikit-learn CalibratedClassifierCV, custom temperature scaling layer, ONNX post-processing node | Critical |
| Business Rules Overlay | Rules Engine | Apply topic-based and value-based always-human rules | Python rules engine, Drools, AWS Business Rules Engine | High |
| Threshold Evaluator | Application Service | Compare calibrated confidence to T_high and T_low; output routing decision | Python microservice; sub-millisecond latency target | Critical |
| Tier 1 Automation Handler | Application Service | Execute automated decision; log with full audit trail | Domain-specific service | Critical |
| Tier 2 Human-Assisted Queue | Queue + Interface | Hold medium-confidence items; present to reviewer with AI recommendation | PostgreSQL queue + custom React interface; Zendesk + AI integration | High |
| Tier 3 Human-Primary Queue | Queue + Interface | Hold low-confidence items; present to human with AI as optional reference | Same infrastructure as Tier 2; different routing and interface mode | High |
| Outcome Tracker | Data Service | Collect ground truth when available; link to original request | Batch job or event-driven consumer; PostgreSQL | High |
| Calibration Monitor | Analytics Service | Compute ECE, generate reliability diagrams, compute PSI | Python (scikit-learn, scipy); Evidently AI; custom visualisation | High |
| Recalibration Job | ML Pipeline | Re-fit calibration function on recent labelled data | Python scikit-learn; Airflow DAG; model registry integration | High |
| A/B Traffic Router | Routing Service | Split traffic between threshold configurations for testing | Feature flag service (LaunchDarkly, AWS CloudWatch Evidently) | Medium |
| Threshold Registry | Configuration Store | Store current threshold values; version history; A/B test state | PostgreSQL or AWS Parameter Store with version history | High |
7. Data Flow
Primary Flow
| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | Client | Submits request | Request payload with request_id |
| 2 | Inference Engine | Runs model; returns raw probabilities | raw_probabilities[], predicted_class, request_id |
| 3 | Calibration Function | Applies scaling; returns calibrated confidence | calibrated_confidence, calibration_method, calibration_version |
| 4 | Business Rules Overlay | Evaluates topic and value rules | always_human: true/false, rule_fired: null or rule_id |
| 5 | Threshold Evaluator | Compares to T_high and T_low from Threshold Registry | routing_tier: [1, 2, 3], routing_reason, threshold_snapshot |
| 6a | Tier 1 Handler | Executes automated decision | decision_outcome, automation_audit_record |
| 6b | Tier 2 Queue | Presents to human reviewer; captures approval or override | human_decision, override_reason, reviewer_id, review_latency_ms |
| 6c | Tier 3 Queue | Presents to human decision-maker | human_decision, decision_rationale, decision_maker_id |
| 7 | Outcome Tracker | Receives downstream outcome event | outcome_type, outcome_value, linked to request_id |
| 8 | Calibration Monitor | Computes ECE on recent labelled items; computes PSI on input distribution | ece_score, psi_score, reliability_diagram, alert if threshold exceeded |
| 9 | Recalibration Job | Re-fits calibration function; updates Threshold Registry | New calibration_version; updated T_high and T_low recommendations |
Error Flow
| Error Condition | Detected By | Recovery Action | Notification |
|---|---|---|---|
| Calibration function fails at inference time | Health check; inference error log | Route all requests to Tier 2 (conservative fallback); never fail to Tier 1 automated | ML Ops on-call; human review capacity alert |
| Calibration monitor detects ECE > 0.10 | Calibration Monitor | Trigger emergency recalibration; raise T_high temporarily (reduces automation) | Model Risk Officer; ML Ops |
| PSI > 0.2 (distribution shift) | PSI Monitor | Alert Model Risk; trigger recalibration; optionally pause automation | Model Risk Officer |
| A/B test produces harmful outcome (error rate significantly worse) | A/B experiment monitor | Halt treatment; revert to control thresholds; document result | ML Ops; Model Risk |
| Threshold Registry unavailable | Routing service health check | Route all requests to Tier 2 (conservative fallback) | Operations on-call |
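The conservative fallback can be sketched as a wrapper that fails closed to Tier 2 whenever calibration or the threshold lookup fails. Here `calibrate` and `load_thresholds` are hypothetical callables standing in for the Calibration Function and a Threshold Registry client; the reason strings are illustrative.

```python
import math

FALLBACK_TIER = 2  # conservative fallback tier

def route_with_fallback(raw_score, calibrate, load_thresholds):
    """Route a request; any dependency failure or invalid confidence yields Tier 2."""
    try:
        confidence = float(calibrate(raw_score))
        t_high, t_low = load_thresholds()
        # A NaN or out-of-range value is treated as a calibration failure.
        if math.isnan(confidence) or not 0.0 <= confidence <= 1.0:
            raise ValueError("invalid calibrated confidence")
    except Exception as exc:  # never fall through to Tier 1 on error
        return {"routing_tier": FALLBACK_TIER,
                "routing_reason": f"fallback:{type(exc).__name__}"}
    if confidence >= t_high:
        tier = 1
    elif confidence >= t_low:
        tier = 2
    else:
        tier = 3
    return {"routing_tier": tier, "routing_reason": "threshold_evaluation"}

def broken_registry():
    """Simulates an unavailable Threshold Registry."""
    raise ConnectionError("registry unavailable")

ok = route_with_fallback(0.95, lambda s: s, lambda: (0.9, 0.7))
degraded = route_with_fallback(0.95, lambda s: s, broken_registry)
print(ok, degraded)
```

Note that the fallback deliberately sends even high-confidence requests to human review, so review capacity needs headroom for this case.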
8. Security Considerations
Authentication and Authorisation
- Threshold Registry write access restricted to ML Ops pipeline service accounts and authorised Model Risk personnel
- Calibration function parameters treated as model artefacts; stored in model registry under same access controls as model weights
- A/B test configuration changes require dual-authorisation (ML Ops + Model Risk)
Secrets Management
- Inference endpoint credentials stored in secrets manager
- Human review interface authentication via SSO; no shared credentials
Data Classification
- Confidence scores themselves are not sensitive, but they reveal information about model capability boundaries
- Do not expose raw confidence scores to end users; only expose routing outcomes (automated vs review) if required by transparency obligations
Encryption
- All inference logs (including confidence scores) encrypted at rest
- Calibration data (ground truth labels used for recalibration) encrypted at rest and in transit
Auditability
- Every routing decision logged with: request_id, calibrated_confidence, routing_tier, routing_reason, threshold_values_used, timestamp
- Threshold changes logged with: previous values, new values, change reason, approver, timestamp
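A routing audit record with the fields listed above might be emitted as structured JSON along these lines. The exact key layout, the `threshold_version` field, and the ISO 8601 timestamp format are assumptions rather than a prescribed schema.

```python
import json
from datetime import datetime, timezone

def routing_audit_record(request_id, calibrated_confidence, routing_tier,
                         routing_reason, t_high, t_low, threshold_version,
                         timestamp=None):
    """Serialise one routing decision as a JSON log line (illustrative schema)."""
    ts = timestamp or datetime.now(timezone.utc)
    record = {
        "request_id": request_id,
        "calibrated_confidence": calibrated_confidence,
        "routing_tier": routing_tier,
        "routing_reason": routing_reason,
        "threshold_values_used": {
            "t_high": t_high,
            "t_low": t_low,
            "version": threshold_version,  # hypothetical registry version label
        },
        "timestamp": ts.isoformat(),
    }
    return json.dumps(record, sort_keys=True)

line = routing_audit_record(
    "req-001", 0.93, 1, "confidence_at_or_above_t_high",
    t_high=0.90, t_low=0.70, threshold_version="thr-v3",
    timestamp=datetime(2026, 6, 12, tzinfo=timezone.utc),
)
parsed = json.loads(line)
print(line)
```

Logging the threshold values actually used, not just a reference to the current configuration, lets a later reviewer reconstruct the decision even after thresholds change.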
OWASP LLM Top 10 Considerations
| OWASP LLM Risk | Applicability | Mitigation |
|---|---|---|
| LLM01: Prompt Injection | Low: routing happens at inference output level | N/A for routing layer; mitigate in upstream inference |
| LLM02: Insecure Output Handling | Low | N/A for routing layer |
| LLM03: Training Data Poisoning | Medium: ground truth used for recalibration could be poisoned | Validate recalibration data quality; anomaly detection on label distribution changes |
| LLM04: Model Denial of Service | Medium: threshold set too high routes everything to human, creating queue overflow | Human review capacity monitoring; threshold bounds checking |
| LLM05: Supply Chain Vulnerabilities | Low | Standard model provenance controls |
| LLM06: Sensitive Information Disclosure | Low: routing layer does not process raw content beyond confidence extraction | N/A |
| LLM07: Insecure Plugin Design | Low | N/A |
| LLM08: Excessive Agency | Low: routing pattern is an oversight mechanism, not an autonomy mechanism | By design |
| LLM09: Overreliance | High: if T_high is set too permissively, too many items are automated | Calibration monitoring; ECE alerting; regular threshold review |
| LLM10: Model Theft | Low: threshold values reveal model capability profile | Restrict exposure of T_high value to authorised personnel only |
9. Governance Considerations
Responsible AI
- Threshold settings must be validated independently for each protected demographic group; a threshold that achieves 99% accuracy on the aggregate may achieve only 94% accuracy for a specific group
- Routing tier distribution monitored by input segment; if certain input types disproportionately route to Tier 3, investigate for systematic model deficiency
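The per-group validation can be sketched as a check of Tier 1 accuracy for each segment at a candidate T_high. The record layout (group label, calibrated confidence, correctness flag), the function names, and the sample data are hypothetical.

```python
from collections import defaultdict

def tier1_accuracy_by_group(records, t_high):
    """Accuracy of items that would be automated (confidence >= t_high), per group."""
    stats = defaultdict(lambda: [0, 0])  # group -> [correct, total]
    for group, confidence, correct in records:
        if confidence >= t_high:
            stats[group][0] += int(correct)
            stats[group][1] += 1
    return {g: hits / n for g, (hits, n) in stats.items()}

def groups_below_bar(records, t_high, required_accuracy):
    """Groups whose automated-tier accuracy falls short of the required bar."""
    by_group = tier1_accuracy_by_group(records, t_high)
    return sorted(g for g, acc in by_group.items() if acc < required_accuracy)

# Hypothetical validation records: (group, calibrated confidence, was the model correct).
records = (
    [("A", 0.95, True)] * 4
    + [("B", 0.95, True)] * 3 + [("B", 0.95, False)]
    + [("B", 0.60, False)]  # below t_high, so excluded from the Tier 1 check
)
print(tier1_accuracy_by_group(records, t_high=0.9))
print(groups_below_bar(records, t_high=0.9, required_accuracy=0.99))
```

A group flagged by such a check would need either its own threshold or investigation of the underlying model deficiency before automation is enabled for it.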
Model Risk Management
- Threshold values are a model risk decision — changing them changes the effective automation scope
- All threshold changes (including recalibration-triggered changes) are model risk events requiring documentation
- Quarterly review of calibration health is a Model Risk responsibility
Human Approval Gates
- Changes to T_high (automation threshold) require Model Risk Officer sign-off
- A/B test designs require Model Risk review before launch
- Emergency recalibrations triggered by ECE breach require post-event review within 5 business days
Policy Compliance
- Regulated domains (financial advice, clinical, legal) are never automated: an always-human rule routes them to Tier 3 regardless of confidence, which is equivalent to a T_high that no confidence score can reach; this is a policy decision, not a statistical one
Traceability
- Each automated Tier 1 decision is traceable to: calibrated confidence, threshold version, calibration version
- This trace is the evidence for "the model met the confidence required for automation" in regulatory examination
Governance Artefacts
| Artefact | Owner | Frequency | Purpose |
|---|---|---|---|
| Calibration Health Report (ECE + Reliability Diagram) | ML Ops | Monthly | Track calibration drift; trigger recalibration decisions |
| Threshold Review Record | Model Risk Officer | Quarterly | Document threshold review with supporting data |
| A/B Test Results Report | ML Ops | Per test | Document test design, results, statistical significance, decision |
| Routing Distribution Report | Model Risk | Monthly | Track % of requests in each tier; detect threshold drift |
| Emergency Recalibration Post-Event Report | Model Risk Officer | As triggered | Root cause and resolution for any emergency recalibration event |
10. Operational Considerations
Monitoring
| Metric | SLO | Alert Threshold | Owner |
|---|---|---|---|
| Expected Calibration Error (ECE) | < 0.05 | > 0.08 | ML Ops |
| Tier 1 automation rate | Baseline ± 10% | > +20% (over-automating) or < -20% (under-automating) | ML Ops |
| Tier 3 queue depth | < 2x daily human capacity | > 3x daily capacity | Operations Manager |
| Routing decision latency | < 5ms p99 | > 20ms p99 | Engineering |
| PSI (input distribution shift) | < 0.1 | > 0.2 | ML Ops |
| Tier 1 error rate (sampled audit) | Meets accuracy SLA | Any breach of the accuracy SLA | Model Risk Officer |
Logging
- Structured JSON logs for every routing decision with full confidence, threshold, and tier metadata
- Retained 7 years for regulated decisions; 90 days for non-regulated
- Calibration monitor outputs stored with timestamp for trend analysis
Incident Response
- Calibration failure: immediately raise T_high to the conservative default (reducing automation); page ML Ops; recalibrate before relaxing the threshold again
- PSI alert: notify Model Risk; request domain expert assessment of distribution change; do not automatically recalibrate without Model Risk sign-off (distribution shift may require new training data, not just recalibration)
Disaster Recovery
| Component | RTO | RPO | Strategy |
|---|---|---|---|
| Threshold Evaluator | 5 min | 0 (stateless) | Multi-AZ; auto-scaling; conservative fallback to Tier 2 if unavailable |
| Threshold Registry | 15 min | 5 min | PostgreSQL synchronous standby; fallback to last-known-good threshold from application memory cache |
| Calibration Function | 15 min | N/A (versioned artefact) | Stored in model registry with version history; load previous version on failure |
Capacity Planning
- Tier 2 and Tier 3 human review capacity must be sized for peak routing volume at the configured thresholds
- Scenario-plan for a calibration failure event: if T_high is raised to the conservative default, Tier 2 volume may spike 3–5x; human review capacity must accommodate this
11. Cost Considerations
Cost Drivers
| Driver | Description | Relative Weight |
|---|---|---|
| Human Review Labour | Tier 2 and Tier 3 human review costs; directly proportional to routing volume at each tier | Very High |
| Calibration Compute | Periodic recalibration job; lightweight compared to full model retraining | Low |
| Routing Infrastructure | Sub-millisecond routing service; very low compute cost | Very Low |
| Threshold Management | Model Risk and ML Ops labour for quarterly reviews and A/B tests | Medium |
Scaling Risks
- Conservative thresholds (T_high very high) route most requests to human review; human cost dominates
- Aggressive thresholds (T_high too low) automate too much; quality costs (complaints, regulatory findings) dominate
- Calibration drift without detection causes the actual Tier 1 error rate to diverge from the intended SLA
Optimisations
- Set domain-specific thresholds: do not use a single threshold for all input types; low-variance standard inputs can have a lower T_high (more automation); high-variance novel inputs need a higher T_high
- Use outcome data to continuously refine thresholds toward the optimal efficiency-quality trade-off
- Invest in model accuracy improvement to shift the calibration curve; a more accurate model allows the same accuracy SLA with a lower T_high (more automation)
Indicative Cost Range
| Automation Rate | Tier 2+3 Human Volume (at 1M requests/month) | Monthly Human Review Cost | Notes |
|---|---|---|---|
| 95% automation | 50,000 human reviews/month | $50,000–$150,000 | Suitable for well-calibrated, low-stakes domains |
| 80% automation | 200,000 human reviews/month | $200,000–$600,000 | Balanced profile for medium-stakes domains |
| 60% automation | 400,000 human reviews/month | $400,000–$1.2M | Conservative; appropriate for regulated high-stakes domains |
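The figures in this table follow from simple arithmetic, sketched below. The per-review cost range of $1 to $3 is not stated directly; it is the unit cost implied by the table rows (for example, 50,000 reviews costing $50,000 to $150,000), and the function name is hypothetical.

```python
def monthly_review_cost(requests_per_month, automation_rate,
                        cost_per_review_low, cost_per_review_high):
    """Return (human reviews per month, low cost estimate, high cost estimate)."""
    reviews = round(requests_per_month * (1 - automation_rate))
    return reviews, reviews * cost_per_review_low, reviews * cost_per_review_high

# Reproduce the three scenarios at 1M requests/month, assuming $1-$3 per review.
for rate in (0.95, 0.80, 0.60):
    reviews, low, high = monthly_review_cost(1_000_000, rate, 1.0, 3.0)
    print(f"{rate:.0%} automation: {reviews:,} reviews, ${low:,.0f}-${high:,.0f}")
```

Substituting an organisation's own request volume and loaded cost per review gives a first-order estimate of how a threshold change moves the human review budget.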
12. Trade-Off Analysis
Calibration Method Options
| Method | Accuracy of Calibration | Implementation Complexity | Inference Latency Impact | Recommended |
|---|---|---|---|---|
| Platt Scaling | Good for sigmoid/binary | Low | Negligible | Default for binary classification |
| Temperature Scaling | Good for multi-class; simple | Very Low | Negligible | Default for multi-class; simplest to implement and maintain |
| Isotonic Regression | Very good; non-parametric | Medium | Low | Use when Platt/temperature fit poorly on validation data |
| Conformal Prediction | Statistically rigorous coverage guarantees | High | Low-Medium | Use in regulated domains requiring provable coverage bounds |
| LLM Self-Assessment (ask LLM its confidence) | Poor: known to be miscalibrated | Very Low | Medium (extra LLM call) | Not recommended as primary calibration; acceptable as supplementary signal |
Architectural Tensions
| Tension | Option A | Option B | Resolution Guidance |
|---|---|---|---|
| Single global threshold vs per-domain thresholds | Single threshold: simpler operations | Per-domain: higher accuracy but more maintenance | Per-domain is usually more accurate when accuracy varies across domains; implementing it from day one avoids a costly migration later |
| Recalibrate on drift vs retrain on drift | Recalibrate: fast, cheap; fixes probability distortion | Retrain: fixes underlying accuracy; more expensive | Recalibrate first; if recalibration does not restore ECE, retrain |
| Hard threshold vs soft threshold (probabilistic routing) | Hard: deterministic, auditable | Soft (probabilistic routing at threshold boundary): smooths human review workload | Use hard threshold for audit and compliance; soft routing only in non-regulated contexts where load smoothing is valuable |
13. Failure Modes
| Failure | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| Confidence overinflation post-distribution-shift | High | Critical: Tier 1 automation rate rises; error rate rises silently | PSI monitoring; ECE monitoring on sampled Tier 2 ground truth | Emergency increase of T_high (reducing automation); recalibration with recent data |
| Threshold too permissive from launch | Medium | High: over-automation from day one | Tier 1 error rate monitoring on sampled audit | Raise T_high immediately; recalibrate |
| Always-human business rule misconfigured (topic missed) | Medium | High: regulated topic handled in Tier 1 automatically | Quality audit of Tier 1 automated decisions by topic | Add missing topic to always-human list; retroactive review of affected Tier 1 decisions |
| Calibration function stale (not recalibrated for > 6 months) | Medium | High: silent calibration drift; routing based on outdated accuracy relationship | Calibration recency monitoring | Emergency recalibration job |
| A/B test contamination (control and treatment groups overlap) | Low | Medium: invalid A/B test results lead to wrong threshold decision | A/B test design review before launch | Halt contaminated test; restart with clean groups |
Cascading Failure Scenario
- Input distribution shift (new product type introduced) → PSI alert missed → calibration drifts over 3 months → T_high unchanged → Tier 1 error rate rises from 0.5% to 4% → 8× increase in errors on automated Tier 1 volume → customer complaints, regulatory finding
- Mitigation: PSI monitoring as independent signal from ECE; automated PSI alert to Model Risk triggers mandatory threshold review even before ECE deteriorates
14. Regulatory Considerations
| Regulation | Specific Clause | Requirement | Implementation |
|---|---|---|---|
| EU AI Act | Article 9: Risk management | High-risk AI systems must have technical measures controlling automated decision scope | Confidence threshold routing is one technical control that can support this requirement |
| EU AI Act | Article 15: Accuracy specifications | AI system accuracy must be specified and maintained | Tier 1 automation threshold defines the accuracy target for automated decisions; ECE monitoring demonstrates maintenance |
| APRA CPS 234 | §36: Information security controls testing | Automated decision systems must have controls validated against current threat/accuracy profile | Quarterly threshold review + calibration health report support this obligation |
| APRA CPS 230 | §52: Operational resilience | Degradation of calibration model must not cause operational failure | Conservative fallback to Tier 2 on calibration failure supports this requirement |
| Privacy Act 1988 (Australia) | APP 1.4: Automated decision making | Organisations must identify when automated decision making is used | Routing tier metadata in audit log identifies automated vs human decisions |
| ISO 42001:2023 | §8.4: AI system operation | Operational controls must address AI system performance boundaries | Calibration monitoring and threshold management are the operational controls |
| NIST AI RMF | MEASURE 2.2: AI risk measurement | Quantitative measures of AI system accuracy must be tracked | ECE, Tier 1 error rate, and reliability diagrams can serve as the measurement artefacts |
| SR 11-7 (US Banking) | Model validation: performance monitoring | Model performance including effective automation boundary must be monitored post-deployment | Calibration monitoring and routing distribution report support SR 11-7 post-deployment monitoring |
| GDPR Article 22 | Automated individual decision-making | Solely automated decisions with legal or significant effects require human involvement | Tier 3 routing for high-impact decisions; Tier 1 restricted to low-significance automation |
15. Reference Implementations
AWS
- Inference: SageMaker Real-time Endpoints
- Calibration: SageMaker Pipeline step running scikit-learn CalibratedClassifierCV; artefact stored in S3 model registry
- Threshold Evaluator: AWS Lambda function (sub-millisecond; cold start managed with Provisioned Concurrency)
- Threshold Registry: AWS Parameter Store (versioned parameters) or DynamoDB single-record config table
- Business Rules Overlay: AWS Lambda with JSON-defined rule set
- Tier 2/3 Queues: Amazon SQS FIFO with separate queues per tier
- Calibration Monitor: EventBridge scheduled Lambda; CloudWatch custom metrics; CloudWatch Alarms for ECE and PSI
- A/B Testing: AWS CloudWatch Evidently
Azure
- Inference: Azure Machine Learning Managed Online Endpoints
- Calibration: Azure ML Pipeline with Python calibration step
- Threshold Evaluator: Azure Functions (Consumption or Premium for latency control)
- Threshold Registry: Azure App Configuration with versioned keys
- Calibration Monitor: Azure Monitor custom metrics + Logic Apps for alert workflow
- A/B Testing: Azure Experimentation (Azure App Configuration feature flags)
GCP
- Inference: Vertex AI Online Prediction
- Calibration: Vertex AI Pipeline step
- Threshold Evaluator: Cloud Run service (low-latency container)
- Threshold Registry: Firestore single-document config with audit history
- Calibration Monitor: Cloud Scheduler + Cloud Functions; Cloud Monitoring custom metrics
- A/B Testing: Firebase Remote Config with Firebase A/B Testing
On-Premises / Private Cloud
- Inference: TorchServe or BentoML on Kubernetes
- Calibration: scikit-learn calibration stored in MLflow Model Registry
- Threshold Evaluator: Python FastAPI service on Kubernetes; HPA for auto-scaling
- Threshold Registry: PostgreSQL with versioned threshold records
- Calibration Monitor: Airflow DAG with Evidently AI reports; Grafana dashboards
- A/B Testing: LaunchDarkly or custom feature flag service
16. Related Patterns
| Pattern | ID | Relationship | Notes |
|---|---|---|---|
| Active Learning Loop | EAAPL-HIL002 | Dependency: active learning requires calibrated confidence for candidate selection | Active learning candidate selection is powered by the calibrated confidence produced by this pattern |
| Human Escalation Pattern | EAAPL-HIL003 | Dependency: escalation trigger uses confidence threshold as one signal | Confidence threshold routing is the technical implementation of the confidence-based escalation trigger |
| Collaborative AI Decision | EAAPL-HIL004 | Dependency: collaborative review boundary is defined by thresholds | Tier 2 routing corresponds to collaborative review; Tier 1 to automation; Tier 3 to escalation |
| Human Override Pattern | EAAPL-HIL006 | Complementary: overrides on Tier 1 automated decisions are a valuable calibration signal | Human overrides on automated decisions indicate the automation threshold may be too permissive |
| Annotation and Feedback Loop | EAAPL-HIL007 | Complementary: Tier 2 and 3 human decisions are annotation inputs | Human review decisions feed the annotation pipeline for model training |
| Supervisor Agent | EAAPL-MAG002 | Complementary: supervisor agent can use confidence routing to determine when to invoke worker agents vs human review | Agent architectures benefit from the same confidence-based routing principles |
17. Maturity Assessment
Overall Maturity Level: Proven
| Dimension | Score (1–5) | Rationale |
|---|---|---|
| Technical Maturity | 5 | Platt scaling and temperature scaling are textbook ML techniques; well-supported in scikit-learn and most ML frameworks |
| Operational Maturity | 4 | Calibration monitoring and threshold management require ML Ops discipline; most organisations lack formal recalibration processes |
| Governance Maturity | 4 | EU AI Act and APRA model risk obligations directly require automation boundary governance; threshold management is the implementation |
| Tooling Ecosystem | 5 | scikit-learn, Evidently AI, MLflow, and cloud ML platforms provide native calibration and monitoring support |
| Enterprise Adoption | 4 | Widely adopted in financial services; growing in healthcare and insurance; threshold management formalism is less mature outside financial services |
| Risk Profile | Low-Medium | Well-understood; primary risk is calibration drift without monitoring; ECE monitoring is the standard control |
18. Revision History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2026-06-12 | EAAPL Working Group | Initial publication covering calibration methods, threshold setting methodology, multi-tier routing, threshold monitoring, and A/B testing framework |