
AI Confidence Threshold Routing

Pattern ID: EAAPL-HIL005
Status: Proven
Tags: human-oversight, observability, llm, medium-complexity
Version: 1.0
Last Updated: 2026-06-12

1. Executive Summary

The AI Confidence Threshold Routing pattern dynamically routes inference requests to different handling tiers — fully automated, human-assisted, or human-primary — based on the model's calibrated output confidence. It solves a foundational enterprise AI problem: a single automated handling path produces unacceptable error rates, while routing everything to human review eliminates the economic benefit of AI. Confidence-based routing finds the middle ground, automating only what the model handles reliably.

The pattern addresses four specific technical challenges: raw model confidence scores are poorly calibrated and must be adjusted before they are actionable; thresholds must be set empirically against accuracy data, not guessed; thresholds drift as model behaviour shifts and must be recalibrated quarterly; and business rules must overlay the confidence signal so that certain topics always route to human review regardless of confidence. CIOs and CTOs implementing this pattern gain a precision instrument for balancing automation rate, quality, and human cost — with dashboards that make the trade-off visible and adjustable. It is the prerequisite infrastructure for all other human-in-the-loop patterns that rely on confidence-based escalation.

2. Problem Statement

Business Problem

A flat automation policy — "AI handles everything" or "AI handles if confidence > X" with an arbitrarily chosen X — produces either unacceptable error rates or inefficient human bottlenecks. Organisations need a principled, empirically grounded, and continuously maintained mechanism to determine which requests are safe to automate and which require human involvement.

Technical Problem

Neural network softmax outputs are not calibrated probabilities by default. A model that outputs 0.95 confidence may be correct far less often, for example only 80–85% of the time. Using raw softmax as a routing gate then systematically over-automates: cases the model is actually uncertain about are sent down the automated path. The converse failure also occurs: an overly conservative threshold under-automates, sending cases the model handles reliably to human review and wasting expert time. Neither error is easily visible without calibration infrastructure.

Symptoms

Automation rate is set but never updated; the original threshold was chosen without empirical data
Model accuracy by confidence band has never been measured
Human reviewers complain that most escalated AI cases are trivially correct
Alternatively: downstream quality metrics show AI error rate higher than expected for the stated confidence threshold
No mechanism exists to test whether a different threshold would improve the cost/quality trade-off

Cost of Inaction

Over-automation: AI errors accumulate at scale in domains where the model is poorly calibrated; errors surface in customer complaints, regulatory findings, or quality audits
Under-automation: human review cost exceeds AI's economic benefit; ROI fails; AI project is defunded
No recalibration: model behaviour shifts over time; fixed thresholds become increasingly misaligned with actual accuracy; automation rate creeps upward as model confidence inflates on distribution-shifted inputs

3. Context

When to Apply

Any production AI system where a subset of requests should be handled by humans
Systems where the accuracy requirement varies by confidence tier (e.g. routine automation at 99% accuracy; exceptional items at 95%)
Regulated environments where the human oversight boundary must be empirically defensible
Systems using LLMs where confidence extraction requires bespoke engineering (LLMs do not natively produce calibrated confidence)

When NOT to Apply

Binary all-or-nothing automation decisions where every case requires the same handling (use a static policy instead)
Generative output use cases without a well-defined output space (confidence routing requires a probability distribution over a defined outcome set)
Use cases where the latency of confidence estimation and routing exceeds the acceptable response SLO

Prerequisites

Model produces a probability distribution over a discrete output space (or a proxy can be constructed — see LLM section in Architecture Overview)
A validation dataset with known ground truth is available for threshold calibration
Human review workforce available to handle the routed tier
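
For the LLM case, one of the proxies named in this pattern is output consistency across repeated calls. The sketch below is a minimal illustration of that idea, assuming a hypothetical `ask` callable that wraps whatever LLM client is in use and returns a class label as text; `agreement_confidence` and `fake_llm` are illustrative names, not part of any library. The agreement rate it returns is a raw proxy and would still need calibration against ground truth before it is used for routing.

```python
from collections import Counter
from typing import Callable, Tuple


def agreement_confidence(
    ask: Callable[[str], str], prompt: str, n_samples: int = 5
) -> Tuple[str, float]:
    """Call the model n_samples times; return (majority_label, agreement_rate)."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    # Normalise answers so that "Refund " and "refund" count as the same label.
    answers = [ask(prompt).strip().lower() for _ in range(n_samples)]
    label, votes = Counter(answers).most_common(1)[0]
    return label, votes / n_samples


# Hypothetical stand-in for a real LLM client: replays canned answers.
_canned = iter(["refund", "refund", "Refund ", "billing", "refund"])


def fake_llm(prompt: str) -> str:
    return next(_canned)


label, raw_confidence = agreement_confidence(fake_llm, "Classify: 'I want my money back'")
print(label, raw_confidence)
```

Each extra sample is an extra model call, so the sample count trades confidence resolution against latency and cost.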

Industry Applicability

Industry	Routing Decision	Automation Tier	Human-Assist Tier	Human-Primary Tier
Financial Services	Transaction risk classification	Score < 0.2 (low risk, auto-approve)	0.2–0.7 (review queue)	> 0.7 or regulatory flag (analyst)
Insurance	Claims triage	Routine claims with high confidence	Complex claims	Fraud signals + large claims
Healthcare	Document classification (ICD coding)	Standard codes, confidence > 0.90	Confidence 0.70–0.90	New codes / patient exceptions
Legal	Contract clause risk flagging	Standard clauses (boilerplate)	Unusual clauses, medium confidence	High-risk clause types always
Customer Service	Intent classification	Known intents, high confidence	Ambiguous intents	Sensitive topics, always
Retail	Product category classification	Well-established categories	New product types	Brand-sensitive or restricted categories

4. Architecture Overview

The AI Confidence Threshold Routing pattern has five layers that must be implemented and maintained together.

Layer 1 — Confidence Extraction and Calibration. Traditional classifiers (logistic regression, gradient boosting, neural networks) produce a class probability vector, typically through a sigmoid or softmax output. These raw probabilities are frequently miscalibrated, and modern neural networks in particular tend to be overconfident. Platt scaling fits a logistic regression on a held-out validation set, mapping raw scores to calibrated probabilities. Temperature scaling is simpler: it divides the logit vector by a learned scalar T before the softmax, which (for T > 1) pulls probabilities toward uniform without changing the predicted class. Both methods require a validation set with ground truth. The calibration function is stored alongside the model and applied at inference time. For LLMs that do not natively produce probabilities, proxies can be constructed: ask the LLM to rate its own confidence (a known limitation is that LLM self-assessment is poorly calibrated); use token log-probabilities, where the API exposes them, to estimate token-level uncertainty; use an ensemble of LLM calls with different seeds and measure output consistency; or train a classifier to predict LLM accuracy from input features.
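
Temperature scaling can be illustrated in a few lines. The sketch below is a dependency-free illustration, not a production implementation: it fits T by a simple grid search over negative log-likelihood on a toy validation set. The function names, the grid bounds, and the data are assumptions made for the example; a real deployment would more likely use an optimiser inside its ML framework.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Softmax of logits / temperature, computed stably."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows, labels, temperature: float) -> float:
    """Mean negative log-likelihood of the true labels at a given temperature."""
    losses = [-math.log(softmax(row, temperature)[y]) for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels, lo=0.05, hi=10.0, steps=400) -> float:
    """Grid-search T on a log-spaced grid; returns the T with the lowest NLL."""
    best_t, best_loss = 1.0, float("inf")
    for i in range(steps + 1):
        t = lo * (hi / lo) ** (i / steps)
        loss = nll(logit_rows, labels, t)
        if loss < best_loss:
            best_t, best_loss = t, loss
    return best_t


# Toy validation set: the model is very confident on every row but wrong on one of six.
val_logits = [[4.0, 0.0], [4.0, 0.0], [4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [0.0, 4.0]]
val_labels = [0, 0, 0, 1, 1, 1]

T = fit_temperature(val_logits, val_labels)
raw = softmax(val_logits[0])          # confidence before calibration
calibrated = softmax(val_logits[0], T)  # confidence after calibration
print(round(T, 2), round(max(raw), 3), round(max(calibrated), 3))
```

On this toy data the raw confidence is about 0.98 while the observed accuracy is 5/6, so the fitted temperature is greater than 1 and pulls the confidence down toward the observed accuracy while leaving the predicted class unchanged.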

Layer 2 — Threshold Setting. Thresholds are not chosen arbitrarily — they are derived empirically from the calibrated confidence versus accuracy curve on a validation set. For each confidence bin (e.g. 0.50–0.55, 0.55–0.60, ..., 0.95–1.00), compute the actual accuracy. Plot this curve and identify where accuracy meets the required threshold for automation (typically 99%+ for fully automated decisions). The confidence value at that intersection becomes the automation threshold. The lower threshold (below which human-primary handling is required) is set where accuracy falls below the acceptable bar for human-assisted review. These thresholds should be set independently for each domain or topic cluster if accuracy varies significantly across categories.
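
The binning procedure above can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names, the bin edges, and the toy data are hypothetical, and the rule used here (walk down from the top bin and stop at the first bin that misses the target) is one reasonable reading of "where accuracy meets the required threshold".

```python
from typing import List, Optional, Sequence, Tuple

BinRow = Tuple[float, float, int, Optional[float]]  # (lo, hi, count, accuracy)


def accuracy_by_bin(confidences: Sequence[float], correct: Sequence[int],
                    edges: Sequence[float]) -> List[BinRow]:
    """Observed accuracy per confidence bin; the top bin includes its upper edge."""
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        last = hi == edges[-1]
        hits = [c for p, c in zip(confidences, correct)
                if lo <= p < hi or (last and p == hi)]
        acc = sum(hits) / len(hits) if hits else None
        rows.append((lo, hi, len(hits), acc))
    return rows


def automation_threshold(rows: List[BinRow], required_accuracy: float,
                         min_count: int = 1) -> Optional[float]:
    """Lowest bin edge such that this bin and every higher bin meets the target."""
    threshold = None
    for lo, _hi, count, acc in reversed(rows):
        if count < min_count or acc is None or acc < required_accuracy:
            break
        threshold = lo
    return threshold


# Toy validation data: calibrated confidence and whether the prediction was correct.
confidences = [0.55, 0.58, 0.65, 0.68, 0.75, 0.78, 0.85, 0.88, 0.95, 0.97, 0.99, 1.0]
correct = [0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]
edges = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

rows = accuracy_by_bin(confidences, correct, edges)
t_high = automation_threshold(rows, required_accuracy=0.99)
print(rows, t_high)
```

The toy bins hold only a handful of items each, which is far too few to support a 99% claim; in practice `min_count` would be set high enough that each bin's accuracy estimate is statistically meaningful.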

Layer 3 — Multi-Tier Routing Architecture. The routing engine evaluates confidence against the calibrated thresholds and applies business rules to determine the handling tier. Tier 1 (high confidence, above automation threshold): request is handled automatically with no human involvement. Tier 2 (medium confidence, between thresholds): request is routed to a human-assisted queue where the human reviews and approves the AI recommendation. Tier 3 (low confidence, below lower threshold): request is routed to a human-primary queue where the human makes the decision independently with AI output as an optional reference. Business rules overlay applies regardless of confidence: topics flagged as always-human (legal advice, clinical diagnosis, high-value financial transactions above configured amount) route to Tier 3 unconditionally.
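
A minimal sketch of the routing decision described above, with the business rules overlay evaluated before the confidence comparison. The topic names, the amount limit, and the field names are hypothetical placeholders chosen for the example; only the tier logic itself follows the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Thresholds:
    t_high: float  # at or above this, the request is automated (Tier 1)
    t_low: float   # below this, the request is human-primary (Tier 3)
    version: str


@dataclass(frozen=True)
class RoutingDecision:
    tier: int
    reason: str
    rule_fired: Optional[str] = None


# Hypothetical always-human rule set; real rules would come from configuration.
ALWAYS_HUMAN_TOPICS = frozenset({"legal_advice", "clinical_diagnosis"})


def route(confidence: float, topic: str, amount: float,
          thresholds: Thresholds, amount_limit: float = 10_000.0) -> RoutingDecision:
    # Business rules apply first and override any confidence value.
    if topic in ALWAYS_HUMAN_TOPICS:
        return RoutingDecision(3, "business_rule", f"topic:{topic}")
    if amount > amount_limit:
        return RoutingDecision(3, "business_rule", "amount_limit")
    # Confidence tiers apply only when no rule fired.
    if confidence >= thresholds.t_high:
        return RoutingDecision(1, "confidence_above_t_high")
    if confidence >= thresholds.t_low:
        return RoutingDecision(2, "confidence_between_thresholds")
    return RoutingDecision(3, "confidence_below_t_low")


current = Thresholds(t_high=0.90, t_low=0.60, version="v1")
print(route(0.95, "billing", 50.0, current))
print(route(0.99, "legal_advice", 50.0, current))
```

Whether a score exactly equal to a threshold belongs to the higher or lower tier is a design choice; this sketch treats the boundary as inclusive for the higher tier.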

Layer 4 — Threshold Monitoring and Recalibration. Model behaviour shifts over time due to data distribution changes, model updates, and real-world evolution. The calibration relationship between raw confidence and actual accuracy drifts. The system monitors calibration health in three ways: Expected Calibration Error (ECE) is computed on a rolling window of recently processed items where ground truth is available; Reliability Diagrams are generated monthly to visualise calibration drift; and Population Stability Index (PSI) detects input distribution shift that precedes calibration degradation. When any of these signals exceeds its threshold, a recalibration job is triggered. Recalibration uses recent labelled data from the human review tiers — the ground truth generated by human reviewers is fed back to update the calibration function.
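
Expected Calibration Error is commonly computed as the count-weighted average gap between mean confidence and observed accuracy across confidence bins. The sketch below uses that equal-width-bin form on the top-label confidence; the function name, bin count, and toy data are assumptions for illustration.

```python
from typing import Sequence


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[int], n_bins: int = 10) -> float:
    """ECE = sum over bins of (bin share) * |accuracy - mean confidence|."""
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one labelled item")
    conf_sum = [0.0] * n_bins
    hit_sum = [0.0] * n_bins
    count = [0] * n_bins
    for p, c in zip(confidences, correct):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the top bin
        conf_sum[idx] += p
        hit_sum[idx] += c
        count[idx] += 1
    ece = 0.0
    for b in range(n_bins):
        if count[b]:
            gap = abs(hit_sum[b] / count[b] - conf_sum[b] / count[b])
            ece += (count[b] / n) * gap
    return ece


# Toy rolling window: four items at 0.95 (three correct), four at 0.65 (two correct).
window_conf = [0.95, 0.95, 0.95, 0.95, 0.65, 0.65, 0.65, 0.65]
window_correct = [1, 1, 1, 0, 1, 1, 0, 0]
ece = expected_calibration_error(window_conf, window_correct)
print(round(ece, 3))
```

Here the high-confidence bin is 0.20 too confident and the lower bin 0.15 too confident, each holding half the items, which gives an ECE of 0.175.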

Layer 5 — A/B Testing Threshold Values. Threshold values represent a trade-off between automation rate and quality. The system supports A/B testing threshold values to find the optimal point. A control group receives the current threshold configuration; a treatment group receives modified thresholds (tighter or looser). Both groups are tracked for: automation rate (efficiency), error rate on automated decisions (quality), human review volume (cost), and customer or business outcome metrics. After a predefined period (typically 2–4 weeks), statistical significance is tested and the winning configuration is promoted. This enables data-driven threshold optimisation rather than intuition-driven adjustment.
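
One common way to test whether the treatment thresholds changed the Tier 1 error rate is a two-proportion z-test. The sketch below is an illustration under the usual large-sample assumptions, with made-up counts; the pattern itself does not prescribe a specific test, so treat the choice of test as an assumption.

```python
import math
from typing import Tuple


def two_proportion_z_test(errors_a: int, n_a: int,
                          errors_b: int, n_b: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in error rates b - a."""
    p_a, p_b = errors_a / n_a, errors_b / n_b
    pooled = (errors_a + errors_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # no errors at all (or all errors): no detectable difference
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p_value


# Hypothetical audit counts: control 50 errors in 10,000; treatment 80 in 10,000.
z, p_value = two_proportion_z_test(50, 10_000, 80, 10_000)
print(round(z, 2), round(p_value, 4))
```

A positive z with a small p-value indicates the treatment's error rate is higher than control by more than chance would explain, which in this pattern would argue against promoting the looser threshold.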

5. Architecture Diagram


flowchart TD
    subgraph Scoring["Confidence Scoring"]
        A[Inference Request]
        B[Model + Calibration Function]
    end
    subgraph Routing["Threshold Routing"]
        C{Business Rules Overlay}
        D{Threshold Evaluator}
    end
    subgraph Handling["Handling Tiers"]
        E[Tier 1 Automated Decision]
        F[Tier 2 Human-Assisted Queue]
        G[Tier 3 Human-Primary Queue]
        H[Calibration Monitor]
    end
    A --> B
    B --> C
    C -->|always-human rule| G
    C -->|pass| D
    D -->|high confidence| E
    D -->|medium confidence| F
    D -->|low confidence| G
    E --> H
    F --> H
    G --> H
    H -->|drift detected| B
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#f0fdf4,stroke:#22c55e
    style C fill:#f3e8ff,stroke:#a855f7
    style D fill:#f3e8ff,stroke:#a855f7
    style E fill:#d1fae5,stroke:#10b981
    style F fill:#f0fdf4,stroke:#22c55e
    style G fill:#fee2e2,stroke:#ef4444
    style H fill:#fef9c3,stroke:#eab308

6. Components

Component	Type	Responsibility	Technology Options	Criticality
Inference Engine	ML Serving	Run model forward pass; return raw logits or probabilities	SageMaker, Vertex AI, Azure ML, TorchServe, vLLM	Critical
Calibration Function	ML Utility	Apply Platt/temperature scaling to raw outputs	scikit-learn CalibratedClassifierCV, custom temperature scaling layer, ONNX post-processing node	Critical
Business Rules Overlay	Rules Engine	Apply topic-based and value-based always-human rules	Python rules engine, Drools, AWS Business Rules Engine	High
Threshold Evaluator	Application Service	Compare calibrated confidence to T_high and T_low; output routing decision	Python microservice; sub-millisecond latency target	Critical
Tier 1 Automation Handler	Application Service	Execute automated decision; log with full audit trail	Domain-specific service	Critical
Tier 2 Human-Assisted Queue	Queue + Interface	Hold medium-confidence items; present to reviewer with AI recommendation	PostgreSQL queue + custom React interface; Zendesk + AI integration	High
Tier 3 Human-Primary Queue	Queue + Interface	Hold low-confidence items; present to human with AI as optional reference	Same infrastructure as Tier 2; different routing and interface mode	High
Outcome Tracker	Data Service	Collect ground truth when available; link to original request	Batch job or event-driven consumer; PostgreSQL	High
Calibration Monitor	Analytics Service	Compute ECE, generate reliability diagrams, compute PSI	Python (scikit-learn, scipy); Evidently AI; custom visualisation	High
Recalibration Job	ML Pipeline	Re-fit calibration function on recent labelled data	Python scikit-learn; Airflow DAG; model registry integration	High
A/B Traffic Router	Routing Service	Split traffic between threshold configurations for testing	Feature flag service (LaunchDarkly, AWS CloudWatch Evidently)	Medium
Threshold Registry	Configuration Store	Store current threshold values; version history; A/B test state	PostgreSQL or AWS Parameter Store with version history	High

7. Data Flow

Primary Flow

Step	Actor	Action	Output
1	Client	Submits request	Request payload with request_id
2	Inference Engine	Runs model; returns raw probabilities	raw_probabilities[], predicted_class, request_id
3	Calibration Function	Applies scaling; returns calibrated confidence	calibrated_confidence, calibration_method, calibration_version
4	Business Rules Overlay	Evaluates topic and value rules	always_human: true/false, rule_fired: null or rule_id
5	Threshold Evaluator	Compares to T_high and T_low from Threshold Registry	routing_tier: [1, 2, 3], routing_reason, threshold_snapshot
6a	Tier 1 Handler	Executes automated decision	decision_outcome, automation_audit_record
6b	Tier 2 Queue	Presents to human reviewer; captures approval or override	human_decision, override_reason, reviewer_id, review_latency_ms
6c	Tier 3 Queue	Presents to human decision-maker	human_decision, decision_rationale, decision_maker_id
7	Outcome Tracker	Receives downstream outcome event	outcome_type, outcome_value, linked to request_id
8	Calibration Monitor	Computes ECE on recent labelled items; computes PSI on input distribution	ece_score, psi_score, reliability_diagram, alert if threshold exceeded
9	Recalibration Job	Re-fits calibration function; updates Threshold Registry	New calibration_version; updated T_high and T_low recommendations

Error Flow

Error Condition	Detected By	Recovery Action	Notification
Calibration function fails at inference time	Health check; inference error log	Route all requests to Tier 2 (conservative fallback); never fail to Tier 1 automated	ML Ops on-call; human review capacity alert
Calibration monitor detects ECE > 0.10	Calibration Monitor	Trigger emergency recalibration; temporarily raise T_high so that fewer items are automated	Model Risk Officer; ML Ops
PSI > 0.2 (distribution shift)	PSI Monitor	Alert Model Risk; trigger recalibration; optionally pause automation	Model Risk Officer
A/B test produces harmful outcome (error rate significantly worse)	A/B experiment monitor	Halt treatment; revert to control thresholds; document result	ML Ops; Model Risk
Threshold Registry unavailable	Routing service health check	Route all requests to Tier 2 (conservative fallback)	Operations on-call
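
The conservative fallback in the table above can be sketched as a wrapper around the normal routing path. The callables `calibrate` and `load_thresholds` are hypothetical stand-ins for the Calibration Function and the Threshold Registry client; the point of the sketch is only that any failure, or an out-of-range confidence, lands in Tier 2 and never in Tier 1.

```python
from typing import Callable, Tuple

CONSERVATIVE_TIER = 2  # fallback tier named in the error flow


def route_with_fallback(raw_score: float,
                        calibrate: Callable[[float], float],
                        load_thresholds: Callable[[], Tuple[float, float]]
                        ) -> Tuple[int, str]:
    """Return (tier, reason); any dependency failure routes to the fallback tier."""
    try:
        confidence = calibrate(raw_score)
        t_high, t_low = load_thresholds()
    except Exception as exc:  # deliberately broad: no failure may reach Tier 1
        return CONSERVATIVE_TIER, f"fallback:{type(exc).__name__}"
    # NaN fails this range check too, because comparisons with NaN are False.
    if not (0.0 <= confidence <= 1.0):
        return CONSERVATIVE_TIER, "fallback:invalid_confidence"
    if confidence >= t_high:
        return 1, "confidence_above_t_high"
    if confidence >= t_low:
        return 2, "confidence_between_thresholds"
    return 3, "confidence_below_t_low"


def broken_registry() -> Tuple[float, float]:
    """Hypothetical registry client that is currently unreachable."""
    raise ConnectionError("registry down")


print(route_with_fallback(0.97, lambda s: s, lambda: (0.90, 0.60)))
print(route_with_fallback(0.97, lambda s: s, broken_registry))
```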

8. Security Considerations

Authentication and Authorisation

Threshold Registry write access restricted to ML Ops pipeline service accounts and authorised Model Risk personnel
Calibration function parameters treated as model artefacts; stored in model registry under same access controls as model weights
A/B test configuration changes require dual-authorisation (ML Ops + Model Risk)

Secrets Management

Inference endpoint credentials stored in secrets manager
Human review interface authentication via SSO; no shared credentials

Data Classification

Confidence scores themselves are not sensitive, but they reveal information about model capability boundaries
Do not expose raw confidence scores to end users; only expose routing outcomes (automated vs review) if required by transparency obligations

Encryption

All inference logs (including confidence scores) encrypted at rest
Calibration data (ground truth labels used for recalibration) encrypted at rest and in transit

Auditability

Every routing decision logged with: request_id, calibrated_confidence, routing_tier, routing_reason, threshold_values_used, timestamp
Threshold changes logged with: previous values, new values, change reason, approver, timestamp
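
As an illustration of the routing decision log entry listed above, the sketch below serialises the named fields as one structured JSON record. The function name and the shape of `threshold_values_used` are assumptions for the example; the field names themselves follow the list.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def routing_log_record(request_id: str, calibrated_confidence: float,
                       routing_tier: int, routing_reason: str,
                       threshold_values_used: dict,
                       now: Optional[datetime] = None) -> str:
    """Build one JSON log line for a routing decision (UTC timestamp)."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    return json.dumps({
        "request_id": request_id,
        "calibrated_confidence": calibrated_confidence,
        "routing_tier": routing_tier,
        "routing_reason": routing_reason,
        "threshold_values_used": threshold_values_used,
        "timestamp": timestamp,
    }, sort_keys=True)


record = routing_log_record(
    "req-001", 0.93, 1, "confidence_above_t_high",
    {"t_high": 0.90, "t_low": 0.60, "version": "v7"},  # hypothetical values
    now=datetime(2026, 6, 12, tzinfo=timezone.utc),
)
print(record)
```

Logging the threshold values that were in force, rather than only a version pointer, keeps the record self-explanatory if the registry history is later unavailable.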

OWASP LLM Top 10 Considerations

OWASP LLM Risk	Applicability	Mitigation
LLM01: Prompt Injection	Low — routing happens at inference output level	N/A for routing layer; mitigate in upstream inference
LLM02: Insecure Output Handling	Low	N/A for routing layer
LLM03: Training Data Poisoning	Medium — ground truth used for recalibration could be poisoned	Validate recalibration data quality; anomaly detection on label distribution changes
LLM04: Model Denial of Service	Medium — threshold set too high routes everything to human, creating queue overflow	Human review capacity monitoring; threshold bounds checking
LLM05: Supply Chain Vulnerabilities	Low	Standard model provenance controls
LLM06: Sensitive Information Disclosure	Low — routing layer does not process raw content beyond confidence extraction	N/A
LLM07: Insecure Plugin Design	Low	N/A
LLM08: Excessive Agency	Low — routing pattern is an oversight mechanism, not an autonomy mechanism	By design
LLM09: Overreliance	High — if T_high is set too permissively, too many items are automated	Calibration monitoring; ECE alerting; regular threshold review
LLM10: Model Theft	Low — threshold values reveal model capability profile	Restrict exposure of T_high value to authorised personnel only

9. Governance Considerations

Responsible AI

Threshold settings must be validated independently for each protected demographic group; a threshold that achieves 99% accuracy on the aggregate may achieve only 94% accuracy for a specific group
Routing tier distribution monitored by input segment; if certain input types disproportionately route to Tier 3, investigate for systematic model deficiency
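
The per-group validation described above amounts to recomputing Tier 1 accuracy separately for each group at the proposed threshold. A minimal sketch, assuming validation records of the hypothetical form (group, calibrated confidence, correct flag):

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, float, int]  # (group, calibrated_confidence, correct)


def tier1_accuracy_by_group(records: Iterable[Record], t_high: float) -> Dict[str, float]:
    """Accuracy of items that would be automated (confidence >= t_high), per group."""
    outcomes = defaultdict(list)
    for group, confidence, correct in records:
        if confidence >= t_high:
            outcomes[group].append(correct)
    return {g: sum(v) / len(v) for g, v in outcomes.items()}


def groups_below_target(accuracy: Dict[str, float], target: float) -> List[str]:
    """Groups whose automated-tier accuracy misses the target, sorted by name."""
    return sorted(g for g, acc in accuracy.items() if acc < target)


# Toy validation records for two hypothetical groups.
validation = [("A", 0.95, 1), ("A", 0.92, 1), ("A", 0.50, 0),
              ("B", 0.96, 1), ("B", 0.91, 0)]
by_group = tier1_accuracy_by_group(validation, t_high=0.90)
print(by_group, groups_below_target(by_group, target=0.99))
```

In this toy data the aggregate Tier 1 accuracy is 75%, but it splits into 100% for group A and 50% for group B, which is the kind of gap the aggregate figure hides.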

Model Risk Management

Threshold values are a model risk decision — changing them changes the effective automation scope
All threshold changes (including recalibration-triggered changes) are model risk events requiring documentation
Quarterly review of calibration health is a Model Risk responsibility

Human Approval Gates

Changes to T_high (automation threshold) require Model Risk Officer sign-off
A/B test designs require Model Risk review before launch
Emergency recalibrations triggered by ECE breach require post-event review within 5 business days

Policy Compliance

Regulatory domains (financial advice, clinical, legal) are always routed to Tier 3 regardless of confidence, which is equivalent to setting T_high above 1.0 so that no score can reach the automation tier; this is a policy decision, not a statistical one

Traceability

Each automated Tier 1 decision is traceable to: calibrated confidence, threshold version, calibration version
This trace is the evidence for "the model met the confidence required for automation" in regulatory examination

Governance Artefacts

Artefact	Owner	Frequency	Purpose
Calibration Health Report (ECE + Reliability Diagram)	ML Ops	Monthly	Track calibration drift; trigger recalibration decisions
Threshold Review Record	Model Risk Officer	Quarterly	Document threshold review with supporting data
A/B Test Results Report	ML Ops	Per test	Document test design, results, statistical significance, decision
Routing Distribution Report	Model Risk	Monthly	Track % of requests in each tier; detect threshold drift
Emergency Recalibration Post-Event Report	Model Risk Officer	As triggered	Root cause and resolution for any emergency recalibration event

10. Operational Considerations

Monitoring

Metric	SLO	Alert Threshold	Owner
Calibration Expected Calibration Error (ECE)	< 0.05	> 0.08	ML Ops
Tier 1 automation rate	Baseline ± 10%	> +20% (over-automating) or < -20% (under-automating)	ML Ops
Tier 3 queue depth	< 2x daily human capacity	> 3x daily capacity	Operations Manager
Routing decision latency	< 5ms p99	> 20ms p99	Engineering
PSI (input distribution shift)	< 0.1	> 0.2	ML Ops
Tier 1 error rate (sampled audit)	Within accuracy SLA	Sampled error rate exceeds the SLA limit	Model Risk Officer
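
The PSI metric in the table is usually computed by bucketing a baseline sample and a recent sample with the same bin edges and summing (actual share − expected share) × ln(actual share / expected share). The sketch below follows that common definition; the function name, the bin edges, and the small floor applied to empty buckets are assumptions for the example.

```python
import math
from typing import List, Sequence


def population_stability_index(expected: Sequence[float], actual: Sequence[float],
                               edges: Sequence[float], eps: float = 1e-6) -> float:
    """PSI between a baseline sample and a recent sample over shared bin edges."""

    def shares(values: Sequence[float]) -> List[float]:
        if not values:
            raise ValueError("each sample needs at least one value")
        counts = [0] * (len(edges) - 1)
        for v in values:
            for i in range(len(counts)):
                last = i == len(counts) - 1
                if edges[i] <= v < edges[i + 1] or (last and v == edges[-1]):
                    counts[i] += 1
                    break
        # Floor empty buckets at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Toy feature: baseline is split 50/50 across two buckets, recent traffic is 20/80.
baseline = [0.25] * 50 + [0.75] * 50
recent = [0.25] * 20 + [0.75] * 80
psi = population_stability_index(baseline, recent, edges=[0.0, 0.5, 1.0])
print(round(psi, 3))
```

The toy shift yields a PSI of roughly 0.42, which would breach the 0.2 alert threshold in the table above.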

Logging

Structured JSON logs for every routing decision with full confidence, threshold, and tier metadata
Retained 7 years for regulated decisions; 90 days for non-regulated
Calibration monitor outputs stored with timestamp for trend analysis

Incident Response

Calibration failure: immediately raise T_high to the conservative default so that fewer items are automated; page ML Ops; recalibrate before relaxing the threshold again
PSI alert: notify Model Risk; request domain expert assessment of distribution change; do not automatically recalibrate without Model Risk sign-off (distribution shift may require new training data, not just recalibration)

Disaster Recovery

Component	RTO	RPO	Strategy
Threshold Evaluator	5 min	0 (stateless)	Multi-AZ; auto-scaling; conservative fallback to Tier 2 if unavailable
Threshold Registry	15 min	5 min	PostgreSQL synchronous standby; fallback to last-known-good threshold from application memory cache
Calibration Function	15 min	N/A (versioned artefact)	Stored in model registry with version history; load previous version on failure

Capacity Planning

Tier 2 and Tier 3 human review capacity must be sized for peak routing volume at the configured thresholds
Scenario-plan for a calibration failure event: if T_high is raised to the conservative default, Tier 2 volume may spike 3–5x; human review capacity must accommodate this

11. Cost Considerations

Cost Drivers

Driver	Description	Relative Weight
Human Review Labour	Tier 2 and Tier 3 human review costs; directly proportional to routing volume at each tier	Very High
Calibration Compute	Periodic recalibration job; lightweight compared to full model retraining	Low
Routing Infrastructure	Sub-millisecond routing service; very low compute cost	Very Low
Threshold Management	Model Risk and ML Ops labour for quarterly reviews and A/B tests	Medium

Scaling Risks

Conservative thresholds (T_high very high) route most requests to human review; human cost dominates
Aggressive thresholds (T_high too low) automate too much; quality costs (complaints, regulatory findings) dominate
Calibration drift without detection causes the actual Tier 1 error rate to diverge from the intended SLA

Optimisations

Set domain-specific thresholds: do not use a single threshold for all input types; low-variance standard inputs can have a lower T_high (more automation); high-variance novel inputs need a higher T_high (less automation)
Use outcome data to continuously refine thresholds toward the optimal efficiency-quality trade-off
Invest in model accuracy improvement to shift the calibration curve; a more accurate model allows the same accuracy SLA with a lower T_high (more automation)

Indicative Cost Range

Automation Rate	Tier 2+3 Human Volume (1M/month requests)	Monthly Human Review Cost	Notes
95% automation	50,000 human reviews/month	$50,000–$150,000	Suitable for well-calibrated, low-stakes domains
80% automation	200,000 human reviews/month	$200,000–$600,000	Balanced profile for medium-stakes domains
60% automation	400,000 human reviews/month	$400,000–$1.2M	Conservative; appropriate for regulated high-stakes domains
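
The figures in the table follow from simple arithmetic: review volume is total requests times (1 − automation rate), and the cost ranges imply roughly $1–$3 per human review. That per-review range is inferred from the table's own numbers rather than stated in it, so treat it as an assumption. A small sketch for re-running the calculation with other inputs:

```python
from typing import Tuple


def monthly_review_cost(requests_per_month: int, automation_rate: float,
                        cost_per_review: float) -> Tuple[int, float]:
    """Return (human reviews per month, monthly review cost)."""
    # round() absorbs floating-point noise such as 1 - 0.95 = 0.050000000000000044
    reviews = round(requests_per_month * (1 - automation_rate))
    return reviews, reviews * cost_per_review


for rate in (0.95, 0.80, 0.60):
    low = monthly_review_cost(1_000_000, rate, 1.0)   # assumed $1 per review
    high = monthly_review_cost(1_000_000, rate, 3.0)  # assumed $3 per review
    print(f"{rate:.0%} automation: {low[0]:,} reviews, ${low[1]:,.0f} to ${high[1]:,.0f}")
```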

12. Trade-Off Analysis

Calibration Method Options

Method	Accuracy of Calibration	Implementation Complexity	Inference Latency Impact	Recommended
Platt Scaling	Good for sigmoid/binary	Low	Negligible	Default for binary classification
Temperature Scaling	Good for multi-class; simple	Very Low	Negligible	Default for multi-class; simplest to implement and maintain
Isotonic Regression	Very good; non-parametric	Medium	Low	Use when Platt/temperature fit poorly on validation data
Conformal Prediction	Statistically rigorous coverage guarantees	High	Low-Medium	Use in regulated domains requiring provable coverage bounds
LLM Self-Assessment (ask LLM its confidence)	Poor — known to be miscalibrated	Very Low	Medium (extra LLM call)	Not recommended as primary calibration; acceptable as supplementary signal

Architectural Tensions

Tension	Option A	Option B	Resolution Guidance
Single global threshold vs per-domain thresholds	Single threshold: simpler operations	Per-domain: higher accuracy but more maintenance	Per-domain thresholds are generally more accurate when each domain has enough validation data; implement per-domain from day one to avoid costly migration later
Recalibrate on drift vs retrain on drift	Recalibrate: fast, cheap; fixes probability distortion	Retrain: fixes underlying accuracy; more expensive	Recalibrate first; if recalibration does not restore ECE, retrain
Hard threshold vs soft threshold (probabilistic routing)	Hard: deterministic, auditable	Soft (probabilistic routing at threshold boundary): smooths human review workload	Use hard threshold for audit and compliance; soft routing only in non-regulated contexts where load smoothing is valuable

13. Failure Modes

Failure	Likelihood	Impact	Detection	Recovery
Confidence overinflation post-distribution-shift	High	Critical — Tier 1 automation rate rises; error rate rises silently	PSI monitoring; ECE monitoring on sampled Tier 2 ground truth	Emergency threshold tightening (raise T_high); recalibration with recent data
Threshold too permissive from launch	Medium	High — over-automation from day one	Tier 1 error rate monitoring on sampled audit	Raise T_high immediately; recalibrate
Always-human business rule misconfigured (topic missed)	Medium	High — regulated topic handled in Tier 1 automatically	Quality audit of Tier 1 automated decisions by topic	Add missing topic to always-human list; retroactive review of affected Tier 1 decisions
Calibration function stale (not recalibrated for > 6 months)	Medium	High — silent calibration drift; routing based on outdated accuracy relationship	Calibration recency monitoring	Emergency recalibration job
A/B test contamination (control and treatment groups overlap)	Low	Medium — invalid A/B test results lead to wrong threshold decision	A/B test design review before launch	Halt contaminated test; restart with clean groups

Cascading Failure Scenario

Input distribution shift (new product type introduced) → PSI alert missed → calibration drifts over 3 months → T_high unchanged → Tier 1 error rate rises from 0.5% to 4% → 8× increase in errors on automated Tier 1 volume → customer complaints, regulatory finding
Mitigation: PSI monitoring as independent signal from ECE; automated PSI alert to Model Risk triggers mandatory threshold review even before ECE deteriorates

14. Regulatory Considerations

Regulation	Specific Clause	Requirement	Implementation
EU AI Act	Article 9 — Risk management	High-risk AI systems must have technical measures controlling automated decision scope	Confidence threshold routing is one technical control that can support this requirement
EU AI Act	Article 15 — Accuracy, robustness and cybersecurity	AI system accuracy must be specified and maintained	Tier 1 automation threshold defines the accuracy target for automated decisions; ECE monitoring demonstrates maintenance
APRA CPS 234	§36 — Information security controls testing	Automated decision systems must have controls validated against current threat/accuracy profile	Quarterly threshold review + calibration health report satisfy this obligation
APRA CPS 230	§52 — Operational resilience	Degradation of calibration model must not cause operational failure	Conservative fallback to Tier 2 on calibration failure satisfies this requirement
Privacy Act 1988 (Australia)	APP 1.4 — Automated decision making	Organisations must identify when automated decision making is used	Routing tier metadata in audit log identifies automated vs human decisions
ISO 42001:2023	§8.4 — AI system operation	Operational controls must address AI system performance boundaries	Calibration monitoring and threshold management are the operational controls
NIST AI RMF	MEASURE 2.2 — AI risk measurement	Quantitative measures of AI system accuracy must be tracked	ECE, Tier 1 error rate, and reliability diagrams can serve as the measurement artefacts
SR 11-7 (US Banking)	Model validation — performance monitoring	Model performance including effective automation boundary must be monitored post-deployment	Calibration monitoring and routing distribution report satisfy SR 11-7 post-deployment monitoring
GDPR Article 22	Automated individual decision-making	Individuals have the right not to be subject to solely automated decisions with legal or similarly significant effects, subject to limited exceptions	Tier 3 routing for high-impact decisions; Tier 1 restricted to low-significance automation

15. Reference Implementations

AWS

Inference: SageMaker Real-time Endpoints
Calibration: SageMaker Pipeline step running scikit-learn CalibratedClassifierCV; artefact stored in S3 model registry
Threshold Evaluator: AWS Lambda function (the comparison logic itself is sub-millisecond; cold starts managed with Provisioned Concurrency)
Threshold Registry: AWS Parameter Store (versioned parameters) or DynamoDB single-record config table
Business Rules Overlay: AWS Lambda with JSON-defined rule set
Tier 2/3 Queues: Amazon SQS FIFO with separate queues per tier
Calibration Monitor: EventBridge scheduled Lambda; CloudWatch custom metrics; CloudWatch Alarms for ECE and PSI
A/B Testing: AWS CloudWatch Evidently

Azure

Inference: Azure Machine Learning Managed Online Endpoints
Calibration: Azure ML Pipeline with Python calibration step
Threshold Evaluator: Azure Functions (Consumption or Premium for latency control)
Threshold Registry: Azure App Configuration with versioned keys
Calibration Monitor: Azure Monitor custom metrics + Logic Apps for alert workflow
A/B Testing: Azure Experimentation (Azure App Configuration feature flags)

GCP

Inference: Vertex AI Online Prediction
Calibration: Vertex AI Pipeline step
Threshold Evaluator: Cloud Run service (low-latency container)
Threshold Registry: Firestore single-document config with audit history
Calibration Monitor: Cloud Scheduler + Cloud Functions; Cloud Monitoring custom metrics
A/B Testing: Firebase Remote Config with Firebase A/B Testing (Google Optimize was discontinued in September 2023)

On-Premises / Private Cloud

Inference: TorchServe or BentoML on Kubernetes
Calibration: scikit-learn calibration stored in MLflow Model Registry
Threshold Evaluator: Python FastAPI service on Kubernetes; HPA for auto-scaling
Threshold Registry: PostgreSQL with versioned threshold records
Calibration Monitor: Airflow DAG with Evidently AI reports; Grafana dashboards
A/B Testing: LaunchDarkly or custom feature flag service

16. Related Patterns

Pattern	ID	Relationship	Notes
Active Learning Loop	EAAPL-HIL002	Dependency — active learning requires calibrated confidence for candidate selection	Active learning candidate selection is powered by the calibrated confidence produced by this pattern
Human Escalation Pattern	EAAPL-HIL003	Dependency — escalation trigger uses confidence threshold as one signal	Confidence threshold routing is the technical implementation of the confidence-based escalation trigger
Collaborative AI Decision	EAAPL-HIL004	Dependency — collaborative review boundary is defined by thresholds	Tier 2 routing corresponds to collaborative review; Tier 1 to automation; Tier 3 to escalation
Human Override Pattern	EAAPL-HIL006	Complementary — overrides on Tier 1 automated decisions are a valuable calibration signal	Human overrides on automated decisions indicate the automation threshold may be too permissive
Annotation and Feedback Loop	EAAPL-HIL007	Complementary — Tier 2 and 3 human decisions are annotation inputs	Human review decisions feed the annotation pipeline for model training
Supervisor Agent	EAAPL-MAG002	Complementary — supervisor agent can use confidence routing to determine when to invoke worker agents vs human review	Agent architectures benefit from the same confidence-based routing principles

17. Maturity Assessment

Overall Maturity Level: Proven

Dimension	Score (1–5)	Rationale
Technical Maturity	5	Platt scaling and temperature scaling are textbook ML techniques; well-supported in scikit-learn and most ML frameworks
Operational Maturity	4	Calibration monitoring and threshold management require ML Ops discipline; most organisations lack formal recalibration processes
Governance Maturity	4	EU AI Act and APRA model risk obligations directly require automation boundary governance; threshold management is the implementation
Tooling Ecosystem	5	scikit-learn, Evidently AI, MLflow, and cloud ML platforms provide native calibration and monitoring support
Enterprise Adoption	4	Widely adopted in financial services; growing in healthcare and insurance; threshold management formalism is less mature outside financial services
Risk Profile	Low-Medium	Well-understood; primary risk is calibration drift without monitoring; ECE monitoring is the standard control

18. Revision History

Version	Date	Author	Changes
1.0	2026-06-12	EAAPL Working Group	Initial publication covering calibration methods, threshold setting methodology, multi-tier routing, threshold monitoring, and A/B testing framework
