AI Confidence Threshold Routing

👁️ Human-in-the-Loop | 🏭 Field-tested in AU

Pattern ID: EAAPL-HIL005 | Status: Proven | Tags: human-oversight, observability, llm, medium-complexity | Version: 1.0 | Last Updated: 2026-06-12


1. Executive Summary

The AI Confidence Threshold Routing pattern dynamically routes inference requests to different handling tiers — fully automated, human-assisted, or human-primary — based on the model's calibrated output confidence. It solves a foundational enterprise AI problem: a single automated handling path produces unacceptable error rates, while routing everything to human review eliminates the economic benefit of AI. Confidence-based routing finds the middle ground, automating only what the model handles reliably.

The pattern addresses four specific technical challenges: raw model confidence scores are poorly calibrated and must be adjusted before they are actionable; thresholds must be set empirically against accuracy data, not guessed; thresholds drift as model behaviour shifts and must be recalibrated quarterly; and business rules must overlay the confidence signal so that certain topics always route to human review regardless of confidence. CIOs and CTOs implementing this pattern gain a precision instrument for balancing automation rate, quality, and human cost — with dashboards that make the trade-off visible and adjustable. It is the prerequisite infrastructure for all other human-in-the-loop patterns that rely on confidence-based escalation.


2. Problem Statement

Business Problem

A flat automation policy — "AI handles everything" or "AI handles if confidence > X" with an arbitrarily chosen X — produces either unacceptable error rates or inefficient human bottlenecks. Organisations need a principled, empirically grounded, and continuously maintained mechanism to determine which requests are safe to automate and which require human involvement.

Technical Problem

Neural network softmax outputs are not calibrated probabilities. A model that reports 0.95 confidence may be correct only 80–85% of the time. Using raw softmax as a routing gate therefore systematically over-automates: cases the model is actually uncertain about are sent down the automated path. The opposite error also occurs: an overly conservative threshold under-automates, sending cases the model handles reliably to human review and wasting expert time. Neither error is easily visible without calibration infrastructure.

Symptoms

  • Automation rate is set but never updated; the original threshold was chosen without empirical data
  • Model accuracy by confidence band has never been measured
  • Human reviewers complain that most escalated AI cases are trivially correct
  • Alternatively: downstream quality metrics show AI error rate higher than expected for the stated confidence threshold
  • No mechanism exists to test whether a different threshold would improve the cost/quality trade-off

Cost of Inaction

  • Over-automation: AI errors accumulate at scale in domains where the model is poorly calibrated; errors surface in customer complaints, regulatory findings, or quality audits
  • Under-automation: human review cost exceeds AI's economic benefit; ROI fails; AI project is defunded
  • No recalibration: model behaviour shifts over time; fixed thresholds become increasingly misaligned with actual accuracy; automation rate creeps upward as model confidence inflates on distribution-shifted inputs

3. Context

When to Apply

  • Any production AI system where a subset of requests should be handled by humans
  • Systems where the accuracy requirement varies by confidence tier (e.g. routine automation at 99% accuracy; exceptional items at 95%)
  • Regulated environments where the human oversight boundary must be empirically defensible
  • Systems using LLMs where confidence extraction requires bespoke engineering (LLMs do not natively produce calibrated confidence)

When NOT to Apply

  • Binary all-or-nothing automation decisions where every case requires the same handling (use a static policy instead)
  • Generative output use cases without a well-defined output space (confidence routing requires a probability distribution over a defined outcome set)
  • Use cases where the latency of confidence estimation and routing exceeds the acceptable response SLO

Prerequisites

  • Model produces a probability distribution over a discrete output space (or a proxy can be constructed — see LLM section in Architecture Overview)
  • A validation dataset with known ground truth is available for threshold calibration
  • Human review workforce available to handle the routed tier

Industry Applicability

Industry | Routing Decision | Automation Tier | Human-Assist Tier | Human-Primary Tier
Financial Services | Transaction risk classification | Score < 0.2 (low risk, auto-approve) | 0.2–0.7 (review queue) | > 0.7 or regulatory flag (analyst)
Insurance | Claims triage | Routine claims with high confidence | Complex claims | Fraud signals + large claims
Healthcare | Document classification (ICD coding) | Standard codes, confidence > 0.90 | Confidence 0.70–0.90 | New codes / patient exceptions
Legal | Contract clause risk flagging | Standard clauses (boilerplate) | Unusual clauses, medium confidence | High-risk clause types always
Customer Service | Intent classification | Known intents, high confidence | Ambiguous intents | Sensitive topics, always
Retail | Product category classification | Well-established categories | New product types | Brand-sensitive or restricted categories

4. Architecture Overview

The AI Confidence Threshold Routing pattern has five layers that must be implemented and maintained together.

Layer 1 — Confidence Extraction and Calibration. Traditional classifiers (logistic regression, gradient boosting, neural networks) produce a score or probability vector, for neural networks usually via a softmax. These raw values are frequently miscalibrated, and modern neural networks in particular tend to be overconfident. Platt scaling fits a logistic regression on a held-out validation set, mapping raw scores to calibrated probabilities. Temperature scaling is simpler: it divides the logit vector by a learned scalar T before the softmax. When T > 1 (the usual case for an overconfident model) this softens the probabilities, and because dividing by a positive scalar preserves the ordering of the logits, the predicted class never changes. Both methods require a validation set with ground truth. The calibration function is stored alongside the model and applied at inference time. For LLMs that do not natively produce probabilities, proxies can be constructed: ask the LLM to rate its own confidence (with known limitations: LLM self-assessment is poorly calibrated); use token log-probabilities, where the serving API exposes them, to estimate token-level uncertainty; sample several LLM responses with different seeds and measure output consistency; or train a classifier to predict LLM accuracy from input features.
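As a rough illustration of the temperature scaling step described above, the sketch below fits a single scalar T by minimising negative log-likelihood on a validation set. It is a minimal pure-Python version under stated assumptions: the function names (`softmax`, `mean_nll`, `fit_temperature`), the search bounds, and the toy validation data are all hypothetical, and a production system would more likely use an optimiser from an ML framework.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over one logit vector after dividing by the temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def mean_nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    loss = 0.0
    for logits, label in zip(logit_rows, labels):
        loss -= math.log(max(softmax(logits, temperature)[label], 1e-12))
    return loss / len(labels)

def fit_temperature(logit_rows, labels, lo=0.05, hi=20.0, iters=60):
    """Golden-section search for the T in [lo, hi] that minimises validation NLL."""
    ratio = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    for _ in range(iters):
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if mean_nll(logit_rows, labels, c) < mean_nll(logit_rows, labels, d):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return (a + b) / 2

# Toy validation set (illustrative only): five confident predictions, one of them wrong.
val_logits = [[4.0, 0.0], [4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 0.0]]
val_labels = [0, 0, 0, 1, 1]

T = fit_temperature(val_logits, val_labels)
raw_conf = max(softmax(val_logits[0]))            # confidence before calibration
calibrated_conf = max(softmax(val_logits[0], T))  # confidence after dividing logits by T
```

On this toy data the model is right four times out of five, so the fitted temperature pulls the reported confidence down from about 0.98 toward 0.8 while leaving the predicted class unchanged.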

Layer 2 — Threshold Setting. Thresholds are not chosen arbitrarily — they are derived empirically from the calibrated confidence versus accuracy curve on a validation set. For each confidence bin (e.g. 0.50–0.55, 0.55–0.60, ..., 0.95–1.00), compute the actual accuracy. Plot this curve and identify where accuracy meets the required threshold for automation (typically 99%+ for fully automated decisions). The confidence value at that intersection becomes the automation threshold. The lower threshold (below which human-primary handling is required) is set where accuracy falls below the acceptable bar for human-assisted review. These thresholds should be set independently for each domain or topic cluster if accuracy varies significantly across categories.
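The binning procedure above can be sketched as follows. This is an illustrative pure-Python version, not a prescribed implementation: the function names, the `min_items` guard against thinly populated bins, and the toy data are assumptions introduced here.

```python
def accuracy_by_bin(confidences, correct, start=0.50, bin_width=0.05):
    """Return (bin_lower_edge, n_items, accuracy_or_None) for every bin, ascending."""
    n_bins = round((1.0 - start) / bin_width)
    counts = [0] * n_bins
    hits = [0] * n_bins
    for conf, ok in zip(confidences, correct):
        if conf < start:
            continue  # outside the binned range
        # Small epsilon guards against float error at exact bin edges.
        idx = min(int((conf - start) / bin_width + 1e-9), n_bins - 1)
        counts[idx] += 1
        hits[idx] += bool(ok)
    return [
        (round(start + i * bin_width, 2), counts[i],
         hits[i] / counts[i] if counts[i] else None)
        for i in range(n_bins)
    ]

def automation_threshold(bins, target_accuracy=0.99, min_items=50):
    """Lowest bin edge such that it and every bin above it meet the target accuracy."""
    threshold = None
    for lower, n, acc in sorted(bins, reverse=True):
        # Stop at the first bin that is too sparse to trust or misses the target.
        if n < max(min_items, 1) or acc < target_accuracy:
            break
        threshold = lower
    return threshold

# Toy validation results (illustrative only).
confidences = [0.97] * 200 + [0.92] * 100 + [0.87] * 100
correct = [True] * 200 + [True] * 99 + [False] + [True] * 95 + [False] * 5

bins = accuracy_by_bin(confidences, correct)
t_high = automation_threshold(bins, target_accuracy=0.99)
```

Here the 0.95 and 0.90 bins meet the 99% target and the 0.85 bin does not, so the sketch returns 0.90 as the automation threshold. It returns `None` when even the top bin fails, which a caller would treat as "do not automate".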

Layer 3 — Multi-Tier Routing Architecture. The routing engine evaluates confidence against the calibrated thresholds and applies business rules to determine the handling tier. Tier 1 (high confidence, above automation threshold): request is handled automatically with no human involvement. Tier 2 (medium confidence, between thresholds): request is routed to a human-assisted queue where the human reviews and approves the AI recommendation. Tier 3 (low confidence, below lower threshold): request is routed to a human-primary queue where the human makes the decision independently with AI output as an optional reference. Business rules overlay applies regardless of confidence: topics flagged as always-human (legal advice, clinical diagnosis, high-value financial transactions above configured amount) route to Tier 3 unconditionally.
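A minimal sketch of the three-tier decision with the business rules overlay evaluated first. The topic names, the `HIGH_VALUE_LIMIT` amount, and the choice to treat a confidence exactly equal to a threshold as meeting it are assumptions made for illustration; the source does not specify them.

```python
from dataclasses import dataclass

# Hypothetical always-human configuration; real values are a policy decision.
ALWAYS_HUMAN_TOPICS = {"legal_advice", "clinical_diagnosis"}
HIGH_VALUE_LIMIT = 10_000

@dataclass(frozen=True)
class RoutingDecision:
    tier: int
    reason: str

def route(calibrated_confidence, topic, amount, t_high, t_low):
    """Apply business rules first, then compare confidence to the two thresholds."""
    if topic in ALWAYS_HUMAN_TOPICS:
        return RoutingDecision(3, f"rule:always_human_topic:{topic}")
    if amount is not None and amount > HIGH_VALUE_LIMIT:
        return RoutingDecision(3, "rule:high_value")
    if calibrated_confidence >= t_high:
        return RoutingDecision(1, "confidence>=t_high")
    if calibrated_confidence >= t_low:
        return RoutingDecision(2, "t_low<=confidence<t_high")
    return RoutingDecision(3, "confidence<t_low")
```

Evaluating the overlay before the thresholds means a rule hit can never be overridden by a high confidence score, which matches the unconditional Tier 3 routing described above.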

Layer 4 — Threshold Monitoring and Recalibration. Model behaviour shifts over time due to data distribution changes, model updates, and real-world evolution, so the calibration relationship between raw confidence and actual accuracy drifts. The system monitors calibration health in three ways: Expected Calibration Error (ECE) is computed on a rolling window of recently processed items where ground truth is available; reliability diagrams are generated monthly to visualise calibration drift; and Population Stability Index (PSI) detects input distribution shift, which often precedes calibration degradation. When ECE exceeds its alert threshold, a recalibration job is triggered; a PSI breach is first referred to Model Risk, because distribution shift may call for new training data rather than recalibration alone. Recalibration uses recent labelled data from the human review tiers: the ground truth generated by human reviewers is fed back to update the calibration function.
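The two numeric drift signals can be computed along these lines. This is a sketch using the standard textbook definitions of ECE (bin-weighted gap between accuracy and mean confidence) and PSI (summed over pre-binned counts); the function names, the ten equal-width bins, and the `eps` floor for empty bins are assumptions.

```python
import math

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |accuracy - mean confidence| per bin."""
    count = [0] * n_bins
    conf_sum = [0.0] * n_bins
    hits = [0] * n_bins
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        count[idx] += 1
        conf_sum[idx] += conf
        hits[idx] += bool(ok)
    total = len(confidences)
    return sum(
        count[i] / total * abs(hits[i] / count[i] - conf_sum[i] / count[i])
        for i in range(n_bins)
        if count[i]
    )

def population_stability_index(expected_counts, actual_counts, eps=1e-6):
    """PSI over matching bins: sum of (actual% - expected%) * ln(actual% / expected%)."""
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    psi = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_pct = max(e / e_total, eps)  # floor avoids log(0) on empty bins
        a_pct = max(a / a_total, eps)
        psi += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return psi

# Illustrative inputs: a model reporting 0.9 confidence that is right half the time,
# and an input feature whose bin proportions moved from 50/50 to 80/20.
ece = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)
psi = population_stability_index([50, 50], [80, 20])
```

Both illustrative values sit well above the alert levels used elsewhere in this pattern (ECE above 0.08, PSI above 0.2), so each would raise an alert.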

Layer 5 — A/B Testing Threshold Values. Threshold values represent a trade-off between automation rate and quality. The system supports A/B testing threshold values to find the optimal point. A control group receives the current threshold configuration; a treatment group receives modified thresholds (tighter or looser). Both groups are tracked for: automation rate (efficiency), error rate on automated decisions (quality), human review volume (cost), and customer or business outcome metrics. After a predefined period (typically 2–4 weeks), statistical significance is tested and the winning configuration is promoted. This enables data-driven threshold optimisation rather than intuition-driven adjustment.
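One common way to run the significance check on the Tier 1 error rate is a two-proportion z-test, sketched below. The source does not name a specific test, so this choice, the function name, and the example counts are assumptions; the sketch also assumes at least one error and one non-error across both arms so the pooled standard error is non-zero.

```python
import math

def two_proportion_z_test(errors_control, n_control, errors_treatment, n_treatment):
    """Return (z, two_sided_p_value) for the difference in error rates."""
    p_control = errors_control / n_control
    p_treatment = errors_treatment / n_treatment
    pooled = (errors_control + errors_treatment) / (n_control + n_treatment)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_treatment))
    z = (p_treatment - p_control) / std_err
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Illustrative counts: control at 0.5% error, looser treatment threshold at 0.9%.
z, p_value = two_proportion_z_test(50, 10_000, 90, 10_000)
```

With these counts the treatment arm's higher error rate is statistically significant, so the looser threshold would not be promoted on quality grounds regardless of its automation gain.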


5. Architecture Diagram

flowchart TD
    subgraph Scoring["Confidence Scoring"]
        A[Inference Request]
        B[Model + Calibration Function]
    end
    subgraph Routing["Threshold Routing"]
        C{Business Rules Overlay}
        D{Threshold Evaluator}
    end
    subgraph Handling["Handling Tiers"]
        E[Tier 1 Automated Decision]
        F[Tier 2 Human-Assisted Queue]
        G[Tier 3 Human-Primary Queue]
        H[Calibration Monitor]
    end
    A --> B
    B --> C
    C -->|always-human rule| G
    C -->|pass| D
    D -->|high confidence| E
    D -->|medium confidence| F
    D -->|low confidence| G
    E --> H
    F --> H
    G --> H
    H -->|drift detected| B
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#f0fdf4,stroke:#22c55e
    style C fill:#f3e8ff,stroke:#a855f7
    style D fill:#f3e8ff,stroke:#a855f7
    style E fill:#d1fae5,stroke:#10b981
    style F fill:#f0fdf4,stroke:#22c55e
    style G fill:#fee2e2,stroke:#ef4444
    style H fill:#fef9c3,stroke:#eab308

6. Components

Component | Type | Responsibility | Technology Options | Criticality
Inference Engine | ML Serving | Run model forward pass; return raw logits or probabilities | SageMaker, Vertex AI, Azure ML, TorchServe, vLLM | Critical
Calibration Function | ML Utility | Apply Platt/temperature scaling to raw outputs | scikit-learn CalibratedClassifierCV, custom temperature scaling layer, ONNX post-processing node | Critical
Business Rules Overlay | Rules Engine | Apply topic-based and value-based always-human rules | Python rules engine, Drools, AWS Business Rules Engine | High
Threshold Evaluator | Application Service | Compare calibrated confidence to T_high and T_low; output routing decision | Python microservice; sub-millisecond latency target | Critical
Tier 1 Automation Handler | Application Service | Execute automated decision; log with full audit trail | Domain-specific service | Critical
Tier 2 Human-Assisted Queue | Queue + Interface | Hold medium-confidence items; present to reviewer with AI recommendation | PostgreSQL queue + custom React interface; Zendesk + AI integration | High
Tier 3 Human-Primary Queue | Queue + Interface | Hold low-confidence items; present to human with AI as optional reference | Same infrastructure as Tier 2; different routing and interface mode | High
Outcome Tracker | Data Service | Collect ground truth when available; link to original request | Batch job or event-driven consumer; PostgreSQL | High
Calibration Monitor | Analytics Service | Compute ECE, generate reliability diagrams, compute PSI | Python (scikit-learn, scipy); Evidently AI; custom visualisation | High
Recalibration Job | ML Pipeline | Re-fit calibration function on recent labelled data | Python scikit-learn; Airflow DAG; model registry integration | High
A/B Traffic Router | Routing Service | Split traffic between threshold configurations for testing | Feature flag service (LaunchDarkly, AWS CloudWatch Evidently) | Medium
Threshold Registry | Configuration Store | Store current threshold values; version history; A/B test state | PostgreSQL or AWS Parameter Store with version history | High

7. Data Flow

Primary Flow

Step | Actor | Action | Output
1 | Client | Submits request | Request payload with request_id
2 | Inference Engine | Runs model; returns raw probabilities | raw_probabilities[], predicted_class, request_id
3 | Calibration Function | Applies scaling; returns calibrated confidence | calibrated_confidence, calibration_method, calibration_version
4 | Business Rules Overlay | Evaluates topic and value rules | always_human: true/false, rule_fired: null or rule_id
5 | Threshold Evaluator | Compares to T_high and T_low from Threshold Registry | routing_tier: [1, 2, 3], routing_reason, threshold_snapshot
6a | Tier 1 Handler | Executes automated decision | decision_outcome, automation_audit_record
6b | Tier 2 Queue | Presents to human reviewer; captures approval or override | human_decision, override_reason, reviewer_id, review_latency_ms
6c | Tier 3 Queue | Presents to human decision-maker | human_decision, decision_rationale, decision_maker_id
7 | Outcome Tracker | Receives downstream outcome event | outcome_type, outcome_value, linked to request_id
8 | Calibration Monitor | Computes ECE on recent labelled items; computes PSI on input distribution | ece_score, psi_score, reliability_diagram, alert if threshold exceeded
9 | Recalibration Job | Re-fits calibration function; updates Threshold Registry | New calibration_version; updated T_high and T_low recommendations

Error Flow

Error Condition | Detected By | Recovery Action | Notification
Calibration function fails at inference time | Health check; inference error log | Route all requests to Tier 2 (conservative fallback); never fail to Tier 1 automated | ML Ops on-call; human review capacity alert
Calibration monitor detects ECE > 0.10 | Calibration Monitor | Trigger emergency recalibration; raise T_high temporarily to reduce automation | Model Risk Officer; ML Ops
PSI > 0.2 (distribution shift) | PSI Monitor | Alert Model Risk; recalibrate once Model Risk signs off; optionally pause automation | Model Risk Officer
A/B test produces harmful outcome (error rate significantly worse) | A/B experiment monitor | Halt treatment; revert to control thresholds; document result | ML Ops; Model Risk
Threshold Registry unavailable | Routing service health check | Route all requests to Tier 2 (conservative fallback) | Operations on-call
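The conservative-fallback rows above can be sketched as a wrapper that fails closed to Tier 2. The callables `calibrate` and `load_thresholds` are hypothetical stand-ins for the Calibration Function and Threshold Registry clients, and the extra validity check on the returned values is an assumption added here.

```python
CONSERVATIVE_TIER = 2  # human-assisted queue, per the error flow table

def route_with_fallback(raw_score, calibrate, load_thresholds):
    """Return (tier, reason); any calibration or registry failure yields Tier 2."""
    try:
        confidence = calibrate(raw_score)
        t_high, t_low = load_thresholds()
    except Exception as exc:  # deliberately broad: failing closed matters more here
        return CONSERVATIVE_TIER, f"fallback:{type(exc).__name__}"
    # Reject NaN or out-of-range confidence and inverted thresholds as well.
    if not (0.0 <= confidence <= 1.0) or not (t_low <= t_high):
        return CONSERVATIVE_TIER, "fallback:invalid_value"
    if confidence >= t_high:
        return 1, "confidence>=t_high"
    if confidence >= t_low:
        return 2, "t_low<=confidence<t_high"
    return 3, "confidence<t_low"
```

The key property is that no failure path can ever return Tier 1: an error removes automation rather than granting it.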

8. Security Considerations

Authentication and Authorisation

  • Threshold Registry write access restricted to ML Ops pipeline service accounts and authorised Model Risk personnel
  • Calibration function parameters treated as model artefacts; stored in model registry under same access controls as model weights
  • A/B test configuration changes require dual-authorisation (ML Ops + Model Risk)

Secrets Management

  • Inference endpoint credentials stored in secrets manager
  • Human review interface authentication via SSO; no shared credentials

Data Classification

  • Confidence scores themselves are not sensitive, but they reveal information about model capability boundaries
  • Do not expose raw confidence scores to end users; only expose routing outcomes (automated vs review) if required by transparency obligations

Encryption

  • All inference logs (including confidence scores) encrypted at rest
  • Calibration data (ground truth labels used for recalibration) encrypted at rest and in transit

Auditability

  • Every routing decision logged with: request_id, calibrated_confidence, routing_tier, routing_reason, threshold_values_used, timestamp
  • Threshold changes logged with: previous values, new values, change reason, approver, timestamp
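A routing audit entry carrying the fields listed above might be serialised like this. The function name, the nested `threshold_values_used` layout, and the ISO 8601 UTC timestamp format are assumptions; only the field names come from the list.

```python
import json
from datetime import datetime, timezone

def routing_audit_record(request_id, calibrated_confidence, routing_tier,
                         routing_reason, t_high, t_low, timestamp=None):
    """Serialise one routing decision as a single-line JSON log entry."""
    ts = timestamp or datetime.now(timezone.utc)
    record = {
        "request_id": request_id,
        "calibrated_confidence": round(calibrated_confidence, 6),
        "routing_tier": routing_tier,
        "routing_reason": routing_reason,
        "threshold_values_used": {"t_high": t_high, "t_low": t_low},
        "timestamp": ts.isoformat(),
    }
    # sort_keys keeps the output stable, which simplifies diffing and log parsing.
    return json.dumps(record, sort_keys=True)
```

Recording the threshold values inside each entry, rather than only a pointer to the registry, lets an auditor reconstruct the decision even after the thresholds have changed.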

OWASP LLM Top 10 Considerations

OWASP LLM Risk | Applicability | Mitigation
LLM01: Prompt Injection | Low: routing happens at inference output level | N/A for routing layer; mitigate in upstream inference
LLM02: Insecure Output Handling | Low | N/A for routing layer
LLM03: Training Data Poisoning | Medium: ground truth used for recalibration could be poisoned | Validate recalibration data quality; anomaly detection on label distribution changes
LLM04: Model Denial of Service | Medium: threshold set too high routes everything to human, creating queue overflow | Human review capacity monitoring; threshold bounds checking
LLM05: Supply Chain Vulnerabilities | Low | Standard model provenance controls
LLM06: Sensitive Information Disclosure | Low: routing layer does not process raw content beyond confidence extraction | N/A
LLM07: Insecure Plugin Design | Low | N/A
LLM08: Excessive Agency | Low: routing pattern is an oversight mechanism, not an autonomy mechanism | By design
LLM09: Overreliance | High: if T_high is set too permissively, too many items are automated | Calibration monitoring; ECE alerting; regular threshold review
LLM10: Model Theft | Low: threshold values reveal model capability profile | Restrict exposure of T_high value to authorised personnel only

9. Governance Considerations

Responsible AI

  • Threshold settings must be validated independently for each protected demographic group; a threshold that achieves 99% accuracy on the aggregate may achieve only 94% accuracy for a specific group
  • Routing tier distribution monitored by input segment; if certain input types disproportionately route to Tier 3, investigate for systematic model deficiency
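The per-group validation described above can be sketched as a check of Tier 1 accuracy by group at a candidate threshold. The record layout, function names, and group labels are hypothetical; the toy data mirrors the 99% versus 94% example in the first bullet.

```python
from collections import defaultdict

def tier1_accuracy_by_group(records, t_high):
    """records: iterable of (group, calibrated_confidence, was_correct)."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for group, confidence, was_correct in records:
        if confidence >= t_high:  # only items that would be automated
            totals[group] += 1
            hits[group] += bool(was_correct)
    return {group: hits[group] / totals[group] for group in totals}

def groups_below_target(records, t_high, target_accuracy):
    """Groups whose Tier 1 accuracy falls short of the target at this threshold."""
    by_group = tier1_accuracy_by_group(records, t_high)
    return sorted(g for g, acc in by_group.items() if acc < target_accuracy)

# Toy validation records (illustrative only).
records = (
    [("group_a", 0.95, True)] * 99 + [("group_a", 0.95, False)]
    + [("group_b", 0.95, True)] * 47 + [("group_b", 0.95, False)] * 3
)
flagged = groups_below_target(records, t_high=0.90, target_accuracy=0.99)
```

A non-empty result indicates that the threshold needs to be raised for the flagged group, or set per group, before automation is enabled.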

Model Risk Management

  • Threshold values are a model risk decision — changing them changes the effective automation scope
  • All threshold changes (including recalibration-triggered changes) are model risk events requiring documentation
  • Quarterly review of calibration health is a Model Risk responsibility

Human Approval Gates

  • Changes to T_high (automation threshold) require Model Risk Officer sign-off
  • A/B test designs require Model Risk review before launch
  • Emergency recalibrations triggered by ECE breach require post-event review within 5 business days

Policy Compliance

  • Regulatory domains (financial advice, clinical, legal) have Tier 1 automation disabled: the always-human business rule routes them to Tier 3 regardless of confidence, which is equivalent to a T_high that no confidence value can reach; this is a policy decision, not a statistical one

Traceability

  • Each automated Tier 1 decision is traceable to: calibrated confidence, threshold version, calibration version
  • This trace is the evidence for "the model met the confidence required for automation" in regulatory examination

Governance Artefacts

Artefact | Owner | Frequency | Purpose
Calibration Health Report (ECE + Reliability Diagram) | ML Ops | Monthly | Track calibration drift; trigger recalibration decisions
Threshold Review Record | Model Risk Officer | Quarterly | Document threshold review with supporting data
A/B Test Results Report | ML Ops | Per test | Document test design, results, statistical significance, decision
Routing Distribution Report | Model Risk | Monthly | Track % of requests in each tier; detect threshold drift
Emergency Recalibration Post-Event Report | Model Risk Officer | As triggered | Root cause and resolution for any emergency recalibration event

10. Operational Considerations

Monitoring

Metric | SLO | Alert Threshold | Owner
Expected Calibration Error (ECE) | < 0.05 | > 0.08 | ML Ops
Tier 1 automation rate | Baseline ± 10% | > +20% (over-automating) or < -20% (under-automating) | ML Ops
Tier 3 queue depth | < 2x daily human capacity | > 3x daily capacity | Operations Manager
Routing decision latency | < 5ms p99 | > 20ms p99 | Engineering
PSI (input distribution shift) | < 0.1 | > 0.2 | ML Ops
Tier 1 error rate (sampled audit) | Meets accuracy SLA | Sampled error rate breaches the accuracy SLA | Model Risk Officer

Logging

  • Structured JSON logs for every routing decision with full confidence, threshold, and tier metadata
  • Retained 7 years for regulated decisions; 90 days for non-regulated
  • Calibration monitor outputs stored with timestamp for trend analysis

Incident Response

  • Calibration failure: immediately raise T_high to the conservative default; page ML Ops; recalibrate before lowering the threshold again
  • PSI alert: notify Model Risk; request domain expert assessment of distribution change; do not automatically recalibrate without Model Risk sign-off (distribution shift may require new training data, not just recalibration)

Disaster Recovery

Component | RTO | RPO | Strategy
Threshold Evaluator | 5 min | 0 (stateless) | Multi-AZ; auto-scaling; conservative fallback to Tier 2 if unavailable
Threshold Registry | 15 min | 5 min | PostgreSQL synchronous standby; fallback to last-known-good threshold from application memory cache
Calibration Function | 15 min | N/A (versioned artefact) | Stored in model registry with version history; load previous version on failure

Capacity Planning

  • Tier 2 and Tier 3 human review capacity must be sized for peak routing volume at the configured thresholds
  • Scenario-plan for a calibration failure event: if T_high is raised to the conservative default, Tier 2 volume may spike 3–5x; human review capacity must accommodate this

11. Cost Considerations

Cost Drivers

Driver | Description | Relative Weight
Human Review Labour | Tier 2 and Tier 3 human review costs; directly proportional to routing volume at each tier | Very High
Calibration Compute | Periodic recalibration job; lightweight compared to full model retraining | Low
Routing Infrastructure | Sub-millisecond routing service; very low compute cost | Very Low
Threshold Management | Model Risk and ML Ops labour for quarterly reviews and A/B tests | Medium

Scaling Risks

  • Conservative thresholds (T_high very high) route most requests to human review; human cost dominates
  • Aggressive thresholds (T_high too low) automate too much; quality costs (complaints, regulatory findings) dominate
  • Calibration drift without detection causes the actual Tier 1 error rate to diverge from the intended SLA

Optimisations

  • Set domain-specific thresholds: do not use a single threshold for all input types; low-variance standard inputs can tolerate a lower T_high (more automation); high-variance novel inputs need a higher T_high (less automation)
  • Use outcome data to continuously refine thresholds toward the optimal efficiency-quality trade-off
  • Invest in model accuracy improvement to shift the calibration curve; a more accurate model allows the same accuracy SLA with a lower T_high (more automation)

Indicative Cost Range

Automation Rate | Tier 2+3 Human Volume (1M/month requests) | Monthly Human Review Cost | Notes
95% automation | 50,000 human reviews/month | $50,000–$150,000 | Suitable for well-calibrated, low-stakes domains
80% automation | 200,000 human reviews/month | $200,000–$600,000 | Balanced profile for medium-stakes domains
60% automation | 400,000 human reviews/month | $400,000–$1.2M | Conservative; appropriate for regulated high-stakes domains
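The arithmetic behind the table can be reproduced with a short calculation. The cost ranges correspond to roughly $1 to $3 per human review; that unit cost is inferred from the table's figures rather than stated in the source, and the function names are hypothetical.

```python
def human_review_volume(monthly_requests, automation_rate):
    """Requests per month that land in Tier 2 or Tier 3."""
    return round(monthly_requests * (1 - automation_rate))

def monthly_review_cost(monthly_requests, automation_rate, cost_per_review):
    """Human review cost per month at a given unit cost per review."""
    return human_review_volume(monthly_requests, automation_rate) * cost_per_review

# Reproduce the table rows for 1M requests/month at the inferred $1 to $3 per review.
rows = [
    (rate,
     human_review_volume(1_000_000, rate),
     monthly_review_cost(1_000_000, rate, 1.0),
     monthly_review_cost(1_000_000, rate, 3.0))
    for rate in (0.95, 0.80, 0.60)
]
```

Because cost scales linearly with the non-automated share, moving from 95% to 80% automation quadruples the human review bill at the same unit cost.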

12. Trade-Off Analysis

Calibration Method Options

Method | Accuracy of Calibration | Implementation Complexity | Inference Latency Impact | Recommended
Platt Scaling | Good for sigmoid/binary | Low | Negligible | Default for binary classification
Temperature Scaling | Good for multi-class; simple | Very Low | Negligible | Default for multi-class; simplest to implement and maintain
Isotonic Regression | Very good; non-parametric | Medium | Low | Use when Platt/temperature fit poorly on validation data
Conformal Prediction | Statistically rigorous coverage guarantees | High | Low-Medium | Use in regulated domains requiring provable coverage bounds
LLM Self-Assessment (ask LLM its confidence) | Poor: known to be miscalibrated | Very Low | Medium (extra LLM call) | Not recommended as primary calibration; acceptable as supplementary signal

Architectural Tensions

Tension | Option A | Option B | Resolution Guidance
Single global threshold vs per-domain thresholds | Single threshold: simpler operations | Per-domain: higher accuracy but more maintenance | Per-domain is usually more accurate where accuracy varies across domains; implementing it from day one avoids a costly migration later
Recalibrate on drift vs retrain on drift | Recalibrate: fast, cheap; fixes probability distortion | Retrain: fixes underlying accuracy; more expensive | Recalibrate first; if recalibration does not restore ECE, retrain
Hard threshold vs soft threshold (probabilistic routing) | Hard: deterministic, auditable | Soft (probabilistic routing at threshold boundary): smooths human review workload | Use hard threshold for audit and compliance; soft routing only in non-regulated contexts where load smoothing is valuable

13. Failure Modes

Failure | Likelihood | Impact | Detection | Recovery
Confidence overinflation post-distribution-shift | High | Critical: Tier 1 automation rate rises; error rate rises silently | PSI monitoring; ECE monitoring on sampled Tier 2 ground truth | Emergency tightening of the automation threshold (raise T_high); recalibration with recent data
Threshold too permissive from launch | Medium | High: over-automation from day one | Tier 1 error rate monitoring on sampled audit | Raise T_high immediately; recalibrate
Always-human business rule misconfigured (topic missed) | Medium | High: regulated topic handled in Tier 1 automatically | Quality audit of Tier 1 automated decisions by topic | Add missing topic to always-human list; retroactive review of affected Tier 1 decisions
Calibration function stale (not recalibrated for > 6 months) | Medium | High: silent calibration drift; routing based on outdated accuracy relationship | Calibration recency monitoring | Emergency recalibration job
A/B test contamination (control and treatment groups overlap) | Low | Medium: invalid A/B test results lead to wrong threshold decision | A/B test design review before launch | Halt contaminated test; restart with clean groups

Cascading Failure Scenario

  • Input distribution shift (new product type introduced) → PSI alert missed → calibration drifts over 3 months → T_high unchanged → Tier 1 error rate rises from 0.5% to 4% → 8× increase in errors on automated Tier 1 volume → customer complaints, regulatory finding
  • Mitigation: PSI monitoring as independent signal from ECE; automated PSI alert to Model Risk triggers mandatory threshold review even before ECE deteriorates

14. Regulatory Considerations

Regulation | Specific Clause | Requirement | Implementation
EU AI Act | Article 9: Risk management | High-risk AI systems must have technical measures controlling automated decision scope | Confidence threshold routing is one technical control that bounds automated decision scope
EU AI Act | Article 15: Accuracy specifications | AI system accuracy must be specified and maintained | Tier 1 automation threshold defines the accuracy target for automated decisions; ECE monitoring demonstrates maintenance
APRA CPS 234 | §36: Information security controls testing | Automated decision systems must have controls validated against current threat/accuracy profile | Quarterly threshold review + calibration health report support this obligation
APRA CPS 230 | §52: Operational resilience | Degradation of calibration model must not cause operational failure | Conservative fallback to Tier 2 on calibration failure supports this requirement
Privacy Act 1988 (Australia) | APP 1.4: Automated decision making | Organisations must identify when automated decision making is used | Routing tier metadata in audit log identifies automated vs human decisions
ISO 42001:2023 | §8.4: AI system operation | Operational controls must address AI system performance boundaries | Calibration monitoring and threshold management are the operational controls
NIST AI RMF | MEASURE 2.2: AI risk measurement | Quantitative measures of AI system accuracy must be tracked | ECE, Tier 1 error rate, and reliability diagrams serve as the measurement artefacts
SR 11-7 (US Banking) | Model validation: performance monitoring | Model performance including effective automation boundary must be monitored post-deployment | Calibration monitoring and routing distribution report support SR 11-7 post-deployment monitoring
GDPR | Article 22: Automated individual decision-making | Solely automated decisions with legal or significant effects require human involvement | Tier 3 routing for high-impact decisions; Tier 1 restricted to low-significance automation

15. Reference Implementations

AWS

  • Inference: SageMaker Real-time Endpoints
  • Calibration: SageMaker Pipeline step running scikit-learn CalibratedClassifierCV; artefact stored in S3 model registry
  • Threshold Evaluator: AWS Lambda function (the threshold comparison itself is sub-millisecond; cold starts managed with Provisioned Concurrency)
  • Threshold Registry: AWS Parameter Store (versioned parameters) or DynamoDB single-record config table
  • Business Rules Overlay: AWS Lambda with JSON-defined rule set
  • Tier 2/3 Queues: Amazon SQS FIFO with separate queues per tier
  • Calibration Monitor: EventBridge scheduled Lambda; CloudWatch custom metrics; CloudWatch Alarms for ECE and PSI
  • A/B Testing: AWS CloudWatch Evidently

Azure

  • Inference: Azure Machine Learning Managed Online Endpoints
  • Calibration: Azure ML Pipeline with Python calibration step
  • Threshold Evaluator: Azure Functions (Consumption or Premium for latency control)
  • Threshold Registry: Azure App Configuration with versioned keys
  • Calibration Monitor: Azure Monitor custom metrics + Logic Apps for alert workflow
  • A/B Testing: Azure Experimentation (Azure App Configuration feature flags)

GCP

  • Inference: Vertex AI Online Prediction
  • Calibration: Vertex AI Pipeline step
  • Threshold Evaluator: Cloud Run service (low-latency container)
  • Threshold Registry: Firestore single-document config with audit history
  • Calibration Monitor: Cloud Scheduler + Cloud Functions; Cloud Monitoring custom metrics
  • A/B Testing: Firebase Remote Config with Firebase A/B Testing (Google Optimize was retired in 2023)

On-Premises / Private Cloud

  • Inference: TorchServe or BentoML on Kubernetes
  • Calibration: scikit-learn calibration stored in MLflow Model Registry
  • Threshold Evaluator: Python FastAPI service on Kubernetes; HPA for auto-scaling
  • Threshold Registry: PostgreSQL with versioned threshold records
  • Calibration Monitor: Airflow DAG with Evidently AI reports; Grafana dashboards
  • A/B Testing: LaunchDarkly or custom feature flag service

16. Related Patterns

Pattern | ID | Relationship | Notes
Active Learning Loop | EAAPL-HIL002 | Dependency: active learning requires calibrated confidence for candidate selection | Active learning candidate selection is powered by the calibrated confidence produced by this pattern
Human Escalation Pattern | EAAPL-HIL003 | Dependency: escalation trigger uses confidence threshold as one signal | Confidence threshold routing is the technical implementation of the confidence-based escalation trigger
Collaborative AI Decision | EAAPL-HIL004 | Dependency: collaborative review boundary is defined by thresholds | Tier 2 routing corresponds to collaborative review; Tier 1 to automation; Tier 3 to escalation
Human Override Pattern | EAAPL-HIL006 | Complementary: overrides on Tier 1 automated decisions are a valuable calibration signal | Human overrides on automated decisions indicate the automation threshold may be too permissive
Annotation and Feedback Loop | EAAPL-HIL007 | Complementary: Tier 2 and 3 human decisions are annotation inputs | Human review decisions feed the annotation pipeline for model training
Supervisor Agent | EAAPL-MAG002 | Complementary: supervisor agent can use confidence routing to determine when to invoke worker agents vs human review | Agent architectures benefit from the same confidence-based routing principles

17. Maturity Assessment

Overall Maturity Level: Proven

Dimension | Score (1–5) | Rationale
Technical Maturity | 5 | Platt scaling and temperature scaling are textbook ML techniques; well-supported in scikit-learn and most ML frameworks
Operational Maturity | 4 | Calibration monitoring and threshold management require ML Ops discipline; most organisations lack formal recalibration processes
Governance Maturity | 4 | EU AI Act and APRA model risk obligations directly require automation boundary governance; threshold management is the implementation
Tooling Ecosystem | 5 | scikit-learn, Evidently AI, MLflow, and cloud ML platforms provide native calibration and monitoring support
Enterprise Adoption | 4 | Widely adopted in financial services; growing in healthcare and insurance; threshold management formalism is less mature outside financial services
Risk Profile | Low-Medium | Well-understood; primary risk is calibration drift without monitoring; ECE monitoring is the standard control

18. Revision History

Version | Date | Author | Changes
1.0 | 2026-06-12 | EAAPL Working Group | Initial publication covering calibration methods, threshold setting methodology, multi-tier routing, threshold monitoring, and A/B testing framework