
Cost-Quality Router


[EAAPL-RSN004] Cost-Quality Router

Category: Reasoning Models
Sub-category: Model Routing
Version: 1.0
Maturity: Proven
Tags: reasoning-models, model-routing, cost-optimisation, quality-tiers, llm-gateway, multi-model, o3, claude, gemini, llm-ops
Regulatory Relevance: EU AI Act Article 9 (Risk Management), NIST AI RMF (Govern 1.3, Measure 2.5), ISO/IEC 42001 Clause 8.4, APRA CPS 234


1. Executive Summary

The Cost-Quality Router is an LLM gateway pattern that dynamically selects from a ranked portfolio of models — spanning standard, reasoning-capable, and frontier tiers — to match each request to the minimum model capability required to meet a defined quality threshold. The router operates across four dimensions simultaneously: query complexity, latency requirement, cost budget, and compliance risk class. A routine FAQ lookup routes to a fast, cheap model (GPT-4o-mini, Claude 3.5 Haiku, Gemini 1.5 Flash). A complex multi-document synthesis routes to a mid-tier model. A high-stakes legal reasoning task routes to a reasoning model with extended thinking. A classified government query routes to an on-premises model regardless of quality score. The routing decision is made in under 50ms and is fully auditable.

For enterprise technology leaders, the Cost-Quality Router is the foundational infrastructure investment that makes deploying a diverse AI model portfolio operationally sustainable. Without it, engineering teams independently select models per feature — creating a sprawl of hardcoded model names, inconsistent cost controls, and no portfolio-level visibility. With it, the organisation has a single governed layer that enforces cost policy, quality standards, compliance routing rules, and provider failover simultaneously. Organisations implementing this pattern typically achieve 50–70% cost reduction relative to using a single high-capability model across all workloads while maintaining or improving quality on the tasks where it matters most.


2. Problem Statement

Business Problem

The rapid proliferation of AI model options — standard, reasoning, multimodal, on-premises, domain-specific — creates a model selection problem that individual development teams are not equipped to solve systematically. Each team chooses a model based on familiarity or recent benchmark results rather than a principled assessment of required capability vs cost vs compliance risk. The result is a portfolio of AI features with inconsistent cost structures, no aggregate visibility, and no mechanism to benefit from new cheaper or more capable models when they are released.

Technical Problem

LLM clients hardcode model identifiers in application code. When a new model is released or a cost policy changes, every application must be updated independently. Provider outages cause cascading failures with no automated fallback. There is no runtime mechanism to enforce that high-cost reasoning models are not called for low-complexity tasks, and no way to route queries containing classified information to compliant on-premises infrastructure rather than external API providers.

Symptoms of Absence

  • Monthly AI API costs grow unchecked as new features default to the most capable (most expensive) available model
  • Provider outages cause partial system failures; recovery requires manual intervention to swap model identifiers
  • Compliance team discovers that queries containing regulated data are being sent to an external cloud API without data residency controls
  • Quality regressions go undetected because there is no baseline quality measurement per model per query type
  • New model releases require coordinated multi-team deployments to update model names across dozens of services

Cost of Inaction

  • Cost: Without routing, the marginal cost of adding an AI feature defaults to the cost of the most familiar model; portfolio cost then grows with feature count and query volume, regardless of how much capability each task actually needs
  • Quality: Uniform model selection means either over-spending on simple tasks or under-performing on complex ones — the portfolio is never calibrated
  • Operational: No failover means provider SLA becomes the AI system SLA; no routing audit means no ability to satisfy data-residency requirements in regulated environments

3. Context

When to Apply

  • Organisations with 5+ distinct AI-powered features consuming model APIs
  • Environments subject to data residency, sovereignty, or classification requirements where some queries must route to on-premises models
  • Any AI deployment with a monthly model API cost exceeding $5,000 where cost optimisation is a budget priority
  • Platforms serving multiple tenants or business units with different quality and cost expectations
  • Organisations that want to adopt new models as they are released without redeploying application code

Australian Enterprise Examples

Services Australia (Centrelink and Medicare) has deployed a citizen query router across its digital service channels that routes approximately 84% of enquiries (balance checks, payment date lookups, office location queries) to a standard model tier. The remaining 16% involve complex eligibility determinations (JobSeeker income test with multiple exempt income types, NDIS reasonable and necessary support criteria, or aged care means assessment with asset-tested supplements) and are routed to the reasoning model tier. The sovereign routing rule enforces that all queries containing Tax File Numbers or Medicare card numbers are routed exclusively to an Australian-hosted model, satisfying the Department of Home Affairs cloud security policy for PROTECTED personal information.

Westpac's customer intelligence platform routes 91% of queries to its standard model tier for product information and account balance responses, 7% to the mid-tier for financial planning scenario modelling, and 2% to the reasoning tier for complex margin lending and SMSF investment analysis queries. The cost routing rule applies an additional constraint: during the final two business days of each month (peak query volume for payment processing), the budget enforcer automatically restricts Tier 2 routing to queries from Westpac Private Bank customers only, preventing end-of-month compute cost spikes from exceeding the AI operations budget.

The Australian Securities Exchange (ASX) operates the router on its Listed Entity Reporting platform to separate routine announcement metadata extraction (Tier 0) from continuous disclosure obligation assessment queries (Tier 2), which require the model to reason about whether a material price-sensitive event triggers an immediate ASX Listing Rule 3.1 disclosure. The compliance routing rule also enforces that queries containing price-sensitive content under ASX embargo are routed to the ASX's own on-premises model tier, preventing pre-disclosure information from transiting any external AI API provider.

When NOT to Apply

  • Single-purpose applications with a single, well-defined query type where one model is always correct
  • Prototypes and proof-of-concept work where routing infrastructure overhead is premature
  • Applications with hard sub-100ms P99 latency requirements where router overhead is unacceptable
  • Teams without an LLM gateway or API proxy layer already in place

Prerequisites

  • An LLM gateway or API proxy (LiteLLM, AWS Bedrock, Azure AI Foundry, Portkey, or custom)
  • A query classification scheme capable of producing complexity, latency requirement, and compliance-risk-class signals
  • Contractual access to at least three model tiers: a fast/cheap standard model, a mid-tier general-purpose model, and a reasoning model
  • A quality measurement framework (LLM-as-judge, task-specific metrics, user feedback) to calibrate routing thresholds
  • Cost monitoring with per-model granularity

Industry Applicability

| Industry | Use Case | Value | Adoption Level |
| --- | --- | --- | --- |
| Financial Services | Route simple product lookups to Haiku; AML analysis to o3; classified data to on-prem model | 55–65% cost reduction; AML accuracy maintained; data residency compliance | Early Adopter |
| Healthcare | Appointment booking to Flash; clinical summarisation to mid-tier; diagnostic reasoning to Claude extended thinking | Clinical AI budget sustainable; accuracy preserved on diagnostic tasks | Pilot |
| Government | Public FAQ to standard; policy analysis to mid-tier; classified briefs to sovereign cloud model | Compliance with ISM/PROTECTED classification requirements | Pilot |
| Legal Technology | Clause extraction to Haiku; contract analysis to mid-tier; litigation strategy to o3 high | Per-matter AI cost is predictable and defensible | Growing |
| Software Engineering | Autocomplete to Haiku; code review to mid-tier; security architecture to reasoning model | Developer tooling cost scales with task complexity | Mature |

4. Architecture Overview

The Cost-Quality Router is deployed as a middleware layer within the LLM gateway. Every AI request enters the router carrying a request envelope that includes: the query text, the calling application's feature tag, an optional urgency flag (latency requirement), and an optional data-classification tag. The router runs four sequential checks, each of which can override the default model selection: complexity classification, latency constraint check, compliance routing rule evaluation, and cost budget enforcement.

Complexity classification scores the query 0–1 and maps it to a primary model tier. The latency constraint check overrides to a faster model if the caller has declared a sub-2-second requirement and the complexity-selected model cannot meet it. The compliance routing rule evaluator checks the data-classification tag against a policy table: queries tagged PROTECTED, CONFIDENTIAL, or carrying PII above a risk threshold are routed to the designated compliant model (on-premises or sovereign cloud), regardless of complexity score. The cost budget check enforces per-tenant or per-feature spending caps: once the current period's spend exceeds 90% of the cap, the router downgrades one tier to reduce per-query cost.

The selected model, its tier, and the routing rationale (which rule drove the final decision) are logged with every request. A Model Performance Dashboard aggregates quality signals (task success rate, user feedback, LLM-as-judge scores) and cost per model per query class, enabling the router's threshold table to be calibrated monthly. When a new model is released, it is onboarded into the router's model portfolio and shadow-tested at 5% traffic before becoming a routing candidate.
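As a minimal sketch of what one routing decision log entry could contain, the record below captures the selected model, its tier, and the rule that drove the decision. The field names (`request_id`, `feature_tag`, `decided_by`) and the JSON-lines format are illustrative assumptions; the pattern does not prescribe a schema.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class RoutingDecision:
    """One audit record per routed request (hypothetical schema)."""
    request_id: str
    feature_tag: str
    selected_model: str
    tier: int
    decided_by: str   # rule that drove the final decision, e.g. "compliance"
    timestamp: str    # ISO 8601, UTC


def log_line(decision: RoutingDecision) -> str:
    """Serialise one decision as a single JSON line for an append-only log."""
    return json.dumps(asdict(decision), sort_keys=True)


entry = RoutingDecision(
    request_id="req-001",
    feature_tag="faq-bot",
    selected_model="tier-3-sovereign",
    tier=3,
    decided_by="compliance",
    timestamp=datetime.now(timezone.utc).isoformat(),
)
print(log_line(entry))
```

Recording the deciding rule as its own field keeps the audit question "why did this request reach this model?" answerable without replaying the router.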

Fallback chains are defined per model: if the primary selected model is unavailable (rate limit, outage), the router automatically tries the next tier in the portfolio. Fallback decisions are logged and trigger an alert if the fallback rate for a given model exceeds 1% over a rolling hour.
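The rolling-hour fallback alert described above can be sketched with a sliding window of events. This is an in-memory illustration only; the class name `FallbackRateMonitor` is hypothetical, and a production router would more likely derive the rate from its metrics store.

```python
from collections import deque


class FallbackRateMonitor:
    """Tracks fallback events for one primary model over a rolling window."""

    def __init__(self, window_seconds: float = 3600.0, threshold: float = 0.01):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._events = deque()  # items are (timestamp, was_fallback)

    def record(self, timestamp: float, was_fallback: bool) -> None:
        self._events.append((timestamp, was_fallback))
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        # Drop events that have aged out of the rolling window.
        while self._events and self._events[0][0] <= now - self.window_seconds:
            self._events.popleft()

    def fallback_rate(self, now: float) -> float:
        self._evict(now)
        if not self._events:
            return 0.0
        fallbacks = sum(1 for _, was_fallback in self._events if was_fallback)
        return fallbacks / len(self._events)

    def should_alert(self, now: float) -> bool:
        return self.fallback_rate(now) > self.threshold


monitor = FallbackRateMonitor()
for second in range(200):
    monitor.record(float(second), was_fallback=(second < 3))
print(monitor.fallback_rate(199.0), monitor.should_alert(199.0))
```

With 3 fallbacks in 200 calls the rate is 1.5%, which exceeds the 1% threshold and would raise the alert.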


4a. API Reference

LiteLLM — Multi-Model Router Configuration (Python)

from litellm import Router

# Define the model portfolio in code (production: load from versioned YAML)
model_list = [
    # Tier 0 — standard, fast, cheap
    {"model_name": "tier-0-standard", "litellm_params": {
        "model": "claude-3-5-haiku-20241022", "api_key": "os.environ/ANTHROPIC_API_KEY"}},
    # Tier 1 — mid-tier general purpose
    {"model_name": "tier-1-midtier", "litellm_params": {
        "model": "claude-3-5-sonnet-20241022", "api_key": "os.environ/ANTHROPIC_API_KEY"}},
    # Tier 2 — reasoning model with extended thinking
    {"model_name": "tier-2-reasoning", "litellm_params": {
        "model": "claude-3-7-sonnet-20250219", "api_key": "os.environ/ANTHROPIC_API_KEY",
        "thinking": {"type": "enabled", "budget_tokens": 12000}}},
    # Tier 3 — sovereign/on-prem (data residency enforcement)
    {"model_name": "tier-3-sovereign", "litellm_params": {
        "model": "bedrock/meta.llama3-70b-instruct-v1:0",
        "aws_region_name": "ap-southeast-2"}},  # Sydney region for AU data residency
]

router = Router(model_list=model_list, fallbacks=[
    {"tier-2-reasoning": ["tier-1-midtier"]},   # reasoning tier fallback on outage
    {"tier-1-midtier": ["tier-0-standard"]},    # mid-tier fallback
])

# Non-sovereign tiers, ordered cheapest to most capable
TIERS = ("tier-0-standard", "tier-1-midtier", "tier-2-reasoning")

def route_request(query: str, complexity_score: float, data_class: str,
                  latency_req_ms: int, budget_pct_used: float):
    """Select a tier and return the LiteLLM completion response.

    budget_pct_used is a fraction in the range 0-1 (0.9 means 90% consumed).
    """
    messages = [{"role": "user", "content": query}]
    # 1. Compliance routing takes absolute precedence
    if data_class in ("PROTECTED", "CONFIDENTIAL"):
        return router.completion(model="tier-3-sovereign", messages=messages)
    # 2. Complexity-based selection (index into TIERS)
    if complexity_score >= 0.75:
        tier = 2
    elif complexity_score >= 0.45:
        tier = 1
    else:
        tier = 0
    # 3. Latency constraint overrides quality selection. Simplification: any
    #    sub-2-second caller gets the fastest tier; a fuller version would
    #    compare the requirement with each model's measured P95 latency.
    if latency_req_ms < 2000:
        tier = 0
    # 4. Cost budget enforcer: downgrade exactly one tier once more than 90%
    #    of the period budget is consumed
    if budget_pct_used > 0.9:
        tier = max(tier - 1, 0)
    return router.completion(model=TIERS[tier], messages=messages)

OpenAI o3 vs GPT-4o-mini — Routing Decision in Code

import logging

from openai import OpenAI

logger = logging.getLogger(__name__)
client = OpenAI()  # reads OPENAI_API_KEY from the environment

def select_model_and_call(query: str, complexity_score: float) -> str:
    messages = [{"role": "user", "content": query}]
    if complexity_score >= 0.75:
        # Tier 2: o3 with high reasoning effort
        response = client.chat.completions.create(
            model="o3", reasoning_effort="high", messages=messages)
    elif complexity_score >= 0.45:
        # Tier 1: GPT-4o, capable but no extended thinking
        response = client.chat.completions.create(model="gpt-4o", messages=messages)
    else:
        # Tier 0: GPT-4o-mini, fast, cheap, good enough for simple queries
        response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    # Log actual reasoning tokens for cost attribution (0 for non-reasoning models)
    details = response.usage.completion_tokens_details
    reasoning_tokens = (details.reasoning_tokens or 0) if details else 0
    logger.info("model=%s reasoning_tokens=%d", response.model, reasoning_tokens)
    return response.choices[0].message.content
# Cost comparison (Jun 2026): GPT-4o-mini $0.15/M input vs o3 $10/M input + $40/M output.
# A 500-token query at Tier 0 costs AU$0.0001; at Tier 2 (o3 high, 10K reasoning) costs AU$0.62.
# The router's ROI is realised by keeping ≥ 60% of traffic at Tier 0.
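The arithmetic behind the cost comparison can be reproduced in a few lines. This sketch works in US dollars using the per-million-token prices quoted in the comment, counts input cost only for Tier 0 (no output price is quoted for GPT-4o-mini), and assumes the 10K reasoning tokens are billed at the o3 output rate. The AU$ figures in the comment additionally depend on an exchange rate that the text does not state.

```python
def query_cost_usd(input_tokens: int, output_tokens: int,
                   input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of one call given token counts and prices per 1M tokens."""
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


# Tier 0: 500 input tokens at $0.15/M (output tokens ignored, see assumptions)
tier0_cost = query_cost_usd(500, 0, 0.15, 0.0)

# Tier 2: 500 input tokens at $10/M plus 10,000 reasoning tokens at $40/M
tier2_cost = query_cost_usd(500, 10_000, 10.0, 40.0)

print(f"Tier 0: ${tier0_cost:.6f}  Tier 2: ${tier2_cost:.3f}")
print(f"Tier 2 costs about {tier2_cost / tier0_cost:,.0f}x Tier 0 per query")
```

Under these assumptions the reasoning tokens, not the prompt, dominate the Tier 2 cost, which is why keeping simple traffic off the reasoning tier matters more than trimming prompt length.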

5. Architecture Diagram

flowchart TD
    subgraph Intake["Request Intake"]
        A[AI Request and Envelope]
        B[Router Middleware]
    end
    subgraph Rules["Routing Rules Engine"]
        C[Complexity Classifier]
        D[Latency Constraint Check]
        E[Compliance Rule Check]
        F[Cost Budget Enforcer]
    end
    subgraph Portfolio["Model Portfolio"]
        G[Standard Tier Models]
        H[Mid-Tier Models]
        I[Reasoning Tier Models]
        J[On-Prem Sovereign Models]
    end
    subgraph Observability["Observability"]
        K[Routing Decision Log]
        L[Quality Dashboard]
    end
    A --> B
    B --> C
    C --> D
    D --> E
    E --> F
    F --> G
    F --> H
    F --> I
    E -->|classified data| J
    G --> K
    H --> K
    I --> K
    J --> K
    K --> L

6. Components

| Component | Responsibility | Technology Examples |
| --- | --- | --- |
| Router Middleware | Orchestrates all routing rule checks; produces final model selection decision | LiteLLM router, Portkey AI gateway, custom FastAPI or Express middleware |
| Complexity Classifier | Scores query 0–1 for reasoning depth required | Rule engine + optional DistilBERT classifier (see EAAPL-RSN001) |
| Compliance Rule Engine | Maps data-classification tags to permitted model pool; enforces data residency | OPA policy engine, custom rule table, Azure AI Foundry content filtering |
| Cost Budget Enforcer | Tracks spend per tenant/feature; triggers tier downgrade near budget cap | Redis counters + budget table, AWS Budgets API integration |
| Model Portfolio Registry | Versioned catalogue of available models with tier, provider, capability flags, and SLA | YAML config in git, AWS AppConfig, Azure App Configuration |
| Quality Measurement Service | LLM-as-judge or task-metric evaluator producing quality signal per model per query class | Braintrust, Langfuse, custom evaluator Lambda |
| Fallback Chain Manager | Handles provider errors; retries on next tier; logs fallback events | LiteLLM fallback config, custom circuit breaker |

7. Implementation Steps

Step 1: Audit and Catalogue Current Model Usage

Before building the router, audit every AI API call in the organisation: which model, which feature, which query type, what monthly cost, what quality signal exists. Group calls into 5–10 query categories. For each category, identify the minimum model tier that meets the quality bar (using existing quality signals or running a rapid A/B evaluation). This audit produces the initial routing table — the router's configuration before any ML classifier is added. Start with this rule-based table; it will outperform intuition immediately.
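One way to turn the audit into an initial routing table is to pick, for each query category, the lowest tier whose measured quality clears the bar. The sketch below assumes the audit yields `(category, tier, quality)` tuples; the category names, scores, and the 0.90 quality bar are invented for illustration.

```python
from collections import defaultdict

QUALITY_BAR = 0.90  # assumed threshold; set per organisation

# (query category, tier evaluated, measured quality score); illustrative values
audit_results = [
    ("faq", 0, 0.95), ("faq", 1, 0.97),
    ("multi_doc_synthesis", 0, 0.71), ("multi_doc_synthesis", 1, 0.92),
    ("legal_interpretation", 0, 0.55), ("legal_interpretation", 1, 0.82),
    ("legal_interpretation", 2, 0.94),
]


def build_routing_table(results, quality_bar, default_tier=2):
    """Map each category to the minimum tier meeting the quality bar.

    Categories where no evaluated tier passes fall back to default_tier.
    """
    passing = defaultdict(list)
    categories = set()
    for category, tier, quality in results:
        categories.add(category)
        if quality >= quality_bar:
            passing[category].append(tier)
    return {
        category: min(passing[category]) if passing[category] else default_tier
        for category in sorted(categories)
    }


routing_table = build_routing_table(audit_results, QUALITY_BAR)
print(routing_table)
```

Defaulting unresolved categories to the highest tier is a conservative choice made here for the sketch; it trades cost for quality until an evaluation exists.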

Step 2: Build the Model Portfolio Registry

Define the model portfolio as a versioned configuration file. For each model, record: provider name, model identifier, tier (0=standard, 1=mid, 2=reasoning, 3=sovereign), maximum context window, per-token cost for input/output/thinking, supported features (tool use, vision, extended thinking), P95 latency from internal benchmarks, and data-residency region. Build a Model Adapter interface that normalises the API call shape across providers — this is the abstraction that makes model swaps transparent to application code. Test the adapter against all portfolio models before wiring the router.
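A registry entry and adapter interface along these lines might look like the following sketch. The class and field names are assumptions chosen to mirror the attributes listed above, and every identifier, price, and latency in `PORTFOLIO` is a placeholder rather than a real model's data.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class ModelEntry:
    """One row of the model portfolio registry (hypothetical schema)."""
    provider: str
    model_id: str
    tier: int                             # 0=standard, 1=mid, 2=reasoning, 3=sovereign
    max_context_tokens: int
    input_cost_per_m: float               # per 1M input tokens
    output_cost_per_m: float              # per 1M output tokens
    thinking_cost_per_m: Optional[float]  # None if no extended thinking
    features: frozenset                   # e.g. {"tool_use", "vision"}
    p95_latency_ms: int                   # from internal benchmarks
    residency_region: str


class ModelAdapter(Protocol):
    """Uniform call shape that hides provider-specific APIs."""
    def complete(self, entry: ModelEntry, prompt: str) -> str: ...


class EchoAdapter:
    """Stand-in adapter for wiring tests; a real one would call the provider SDK."""
    def complete(self, entry: ModelEntry, prompt: str) -> str:
        return f"[{entry.model_id}] {prompt}"


# Placeholder entries, invented for illustration only.
PORTFOLIO = [
    ModelEntry("provider-a", "example-standard", 0, 128_000, 0.20, 0.80, None,
               frozenset({"tool_use"}), 600, "ap-southeast-2"),
    ModelEntry("provider-b", "example-reasoning", 2, 200_000, 12.0, 48.0, 48.0,
               frozenset({"tool_use", "extended_thinking"}), 9000, "us-east-1"),
]


def models_in_tier(portfolio, tier):
    """Return every registry entry assigned to the given tier."""
    return [model for model in portfolio if model.tier == tier]


print(EchoAdapter().complete(PORTFOLIO[0], "ping"))
```

Because application code depends only on `ModelAdapter`, swapping a model becomes a registry change rather than a code change.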

Step 3: Implement Compliance Routing Rules First

Before complexity routing, implement and validate compliance rules; these are the highest-stakes routing decisions. Define a data-classification taxonomy (PUBLIC, INTERNAL, CONFIDENTIAL, PROTECTED) and a mapping from classification to permitted model pool. Implement the classifier as a combination of structural signals (presence of Government PROTECTED markers, PII entity detection, tenant-level policy) and explicit caller annotation. Test with at least 50 labelled samples per classification tier as a smoke test. Any miscategorisation that routes PROTECTED data to an external API is a P0 incident, so the metric that matters is recall on the PROTECTED class (the share of truly PROTECTED queries the rule engine catches), which must exceed 99.9%. Fifty samples cannot demonstrate a 99.9% target; that takes a labelled PROTECTED set in the thousands.
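The classification-to-pool mapping can be sketched as a lookup that fails closed. The pool contents below are assumptions: they reuse the tier aliases from the API Reference, send CONFIDENTIAL and PROTECTED to the sovereign tier only, and treat a missing or unrecognised classification as sovereign-only.

```python
ALL_TIERS = frozenset({
    "tier-0-standard", "tier-1-midtier", "tier-2-reasoning", "tier-3-sovereign",
})
SOVEREIGN_ONLY = frozenset({"tier-3-sovereign"})

# Assumed policy table; the real mapping needs CISO and compliance sign-off.
PERMITTED_POOLS = {
    "PUBLIC": ALL_TIERS,
    "INTERNAL": ALL_TIERS,
    "CONFIDENTIAL": SOVEREIGN_ONLY,
    "PROTECTED": SOVEREIGN_ONLY,
}


def permitted_models(data_class):
    """Return the permitted pool; unknown or missing tags fail closed."""
    if data_class is None:
        return SOVEREIGN_ONLY
    return PERMITTED_POOLS.get(data_class.upper(), SOVEREIGN_ONLY)


def enforce(selected_model, data_class):
    """Keep the selected model if permitted, otherwise force the sovereign tier."""
    if selected_model in permitted_models(data_class):
        return selected_model
    return "tier-3-sovereign"


print(enforce("tier-2-reasoning", "PROTECTED"))  # tier-3-sovereign
print(enforce("tier-1-midtier", "PUBLIC"))       # tier-1-midtier
```

Failing closed means a classifier outage raises cost and latency instead of leaking regulated data, which is the cheaper of the two failure modes.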

Routing Tier Calibration Reference Matrix

Reference routing decisions by query type, based on production deployments across Australian enterprise and government sectors. AU$ costs calculated at Jun 2026 rates.

| Query Type | Default Tier | Model Example | Cost per Query (AU$) | Compliance Override |
| --- | --- | --- | --- | --- |
| FAQ / product information lookup | Tier 0 (standard) | Claude 3.5 Haiku / GPT-4o-mini | AU$0.001–0.003 | None |
| Summarisation / extraction from single document | Tier 0–1 | Claude 3.5 Haiku or Sonnet | AU$0.003–0.015 | PROTECTED doc → Tier 3 sovereign |
| Multi-document synthesis / comparison | Tier 1 (mid) | Claude 3.5 Sonnet / GPT-4o | AU$0.015–0.08 | CONFIDENTIAL content → Tier 3 sovereign |
| Financial analysis / modelling | Tier 2 (reasoning) | Claude 3.7 / o3 medium | AU$0.28–0.56 | PII present → Tier 3 if data residency policy applies |
| Legal / regulatory interpretation | Tier 2 (reasoning) | Claude 3.7 / o3 high | AU$0.56–1.12 | Always Tier 3 if PROTECTED classification |
| Code architecture / security review | Tier 1–2 | Claude 3.7 low or mid budget | AU$0.15–0.45 | Source code classified CONFIDENTIAL → Tier 3 sovereign |
| Classified government brief / PROTECTED data | Tier 3 (sovereign) | AWS GovCloud Bedrock / on-prem | AU$0.05–0.25 (GPU amortised) | Mandatory regardless of complexity |

Step 4: Deploy with Shadow Mode and Calibrate

Deploy the router in shadow mode: all traffic still goes to the existing hardcoded model, but the router logs what it would have selected. After two weeks, compare shadow decisions to actual decisions. Where the router would have selected a cheaper tier and quality metrics suggest no regression, begin migrating those query categories to the router decision. Increase routed traffic 10% per week per category, monitoring quality and cost. Full migration typically completes in six weeks per query category. Establish a monthly routing calibration cadence: review the routing table, quality metrics, and cost outcomes.
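The weekly 10% ramp can be sketched with deterministic hash bucketing, so that a request admitted at 10% stays admitted at 20%. The function names and the use of SHA-256 over `category:request_id` are assumptions for illustration; any stable hash would serve.

```python
import hashlib


def in_rollout(request_id: str, category: str, rollout_pct: int) -> bool:
    """Deterministically place a request in one of 100 buckets for its category."""
    digest = hashlib.sha256(f"{category}:{request_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < rollout_pct


def choose_model(request_id: str, category: str, legacy_model: str,
                 router_model: str, rollout_pct: int):
    """Return (model to call, router's decision to log for shadow comparison)."""
    if in_rollout(request_id, category, rollout_pct):
        live_model = router_model
    else:
        live_model = legacy_model
    return live_model, router_model


# Pure shadow mode: the router's choice is logged but never served.
print(choose_model("req-42", "faq", "legacy-model", "tier-0-standard", 0))
```

At `rollout_pct=0` this is pure shadow mode, and raising the percentage week by week migrates a stable, growing subset of traffic without reshuffling which requests are routed.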


8. Security Considerations

OWASP LLM Top 10 Mapping

| OWASP ID | Threat | Mitigation |
| --- | --- | --- |
| LLM01: Prompt Injection | Injected instructions claim "this is a high-priority task" to force routing to expensive reasoning model | Routing decisions based on structural query signals and authenticated caller-provided metadata; never on query text claims |
| LLM06: Sensitive Information Disclosure | Mis-routed PROTECTED query sent to external API | Compliance rule engine runs before any external API call; classification errors trigger P0 alert and automatic fallback to sovereign model |
| LLM07: Insecure Plugin Design | Application bypasses router by calling model provider directly | All model API keys held in router layer; application credentials only valid for router endpoint |
| LLM04: Model Denial of Service | Burst traffic forces all requests to reasoning tier, exhausting per-minute token quota | Per-model rate limiting in router; queue-based smoothing for burst; cost budget enforcer triggers tier downgrade before quota exhaustion |

9. Governance Artefacts

  • Model portfolio registry (version-controlled; every model addition requires capability and compliance validation)
  • Routing table document with each rule, its quality justification, and the compliance requirement it satisfies
  • Data-classification taxonomy and mapping to permitted model pools (signed off by CISO and compliance team)
  • Monthly routing calibration report with quality and cost outcomes per query category
  • Fallback event log with root cause and resolution per provider outage
  • Shadow mode comparison report for each new query category before live routing migration

10. SLOs

| SLO | Target | Measurement |
| --- | --- | --- |
| Routing decision latency | P99 < 50ms | Router middleware execution time excluding model call |
| Compliance mis-routing rate | 0% (zero PROTECTED queries to non-compliant model) | Monthly audit of routing log for PROTECTED-classified queries |
| Cost reduction vs baseline | > 50% vs single high-tier model at same query volume | 1 - (monthly API cost / baseline cost) |
| Fallback rate per primary model | < 1% per rolling hour | Fallback events / total routed calls per model per hour |
| Quality regression rate per category | < 2% queries below quality threshold post-migration | Quality measurement service per-category weekly report |

11. Cost Model

| Cost Driver | Estimate | Notes |
| --- | --- | --- |
| Standard tier (Tier 0) routed queries | $0.15–0.60 per 1M input tokens | GPT-4o-mini, Claude 3.5 Haiku, Gemini 1.5 Flash; target for 60–70% of query volume |
| Mid-tier (Tier 1) routed queries | $2–15 per 1M input tokens | GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro; target for 25–35% of volume |
| Reasoning tier (Tier 2) routed queries | $15–60 per 1M input tokens plus thinking tokens | o3, Claude 3.7 extended thinking; target for 3–10% of volume |
| Sovereign/on-prem tier (Tier 3) | $0.50–5/M (GPU amortised) or contracted sovereign cloud rate | Volume driven by compliance rules, not complexity |
| Router middleware compute | $1–10 per 1M requests | Lambda/container cost; negligible vs model API cost |
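To see how a tier mix translates into a blended input-token rate, the sketch below weights each tier's price by its share of volume. It assumes a 65/30/5 split (inside the target ranges above), uses the low end of each quoted price range, treats every query as the same size, and ignores output and thinking tokens, so the results are indicative only.

```python
def blended_price_per_m(mix: dict, price_per_m: dict) -> float:
    """Volume-weighted price per 1M input tokens across tiers."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("tier shares must sum to 1")
    return sum(share * price_per_m[tier] for tier, share in mix.items())


mix = {"tier0": 0.65, "tier1": 0.30, "tier2": 0.05}     # assumed traffic split
prices = {"tier0": 0.15, "tier1": 2.00, "tier2": 15.0}  # low end of each range

blended = blended_price_per_m(mix, prices)
reduction_vs_all_tier1 = 1 - blended / prices["tier1"]
reduction_vs_all_tier2 = 1 - blended / prices["tier2"]

print(f"blended: ${blended:.4f} per 1M input tokens")
print(f"reduction vs all-Tier-1 baseline: {reduction_vs_all_tier1:.1%}")
print(f"reduction vs all-Tier-2 baseline: {reduction_vs_all_tier2:.1%}")
```

The measured saving depends heavily on which single model is taken as the baseline, so the baseline should be recorded alongside any reported cost-reduction figure.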

12. Trade-off Analysis

| Dimension | Benefit | Trade-off |
| --- | --- | --- |
| Cost | 50–70% reduction vs single high-tier model across all workloads | Router infrastructure build cost; ongoing calibration time |
| Quality | Reasoning-tier accuracy on hard queries; fast standard-tier on simple queries | Quality degradation if routing thresholds mis-calibrated; monitoring required |
| Operational resilience | Multi-provider fallback chains; no single-provider dependency | More providers to contract, monitor, and integrate |
| Compliance | Compliance routing rule ensures data-residency requirements met automatically | Classification accuracy is critical; rule maintenance is ongoing |
| Developer experience | Model selection abstracted from application code; no code change for model upgrades | Teams lose direct model control; router config is a new dependency |

13. Failure Modes

| Failure | Trigger | Recovery |
| --- | --- | --- |
| Routing table stale after product change | New query type introduced without router config update | Default to mid-tier for unclassified queries; alert on unclassified rate > 5%; monthly review catches gaps |
| Compliance classifier failure | PII detector offline; classification API timeout | Fail-safe: default all unclassified queries to on-prem sovereign model until classifier recovers; P1 alert |
| Cost enforcer bug causes all traffic to Tier 0 | Off-by-one in budget counter logic | Quality dashboard detects quality drop; automated quality gate alerts; rollback to previous router version |
| Provider rate limit cascade | Tier 0 provider throttles; all traffic falls back to Tier 1, which also throttles | Queue-based backpressure; shed non-critical traffic; escalate to provider; capacity increase |
| Shadow mode routing divergence hidden | Shadow mode logs never reviewed; systematic routing errors not caught before live migration | Shadow mode reports mandatory before live migration sign-off; reports distributed to governance team weekly |

14. Regulatory Mapping

| Regulation | Requirement | How Pattern Addresses It |
| --- | --- | --- |
| EU AI Act Article 9 (Risk Management) | Risk management must include controls on AI system behaviour and outputs | Compliance routing rules prevent regulated data from reaching non-compliant models; routing audit log supports risk management documentation; routing table reviewed monthly against updated risk register |
| EU AI Act Article 13 (Transparency) | High-risk AI systems must be transparent enough for deployers to interpret the system's output and use it appropriately | Routing decision log records model selected, tier, and routing rule that drove the decision for every request; a deployer or competent authority can reconstruct which model capability was applied to any specific high-risk output |
| NIST AI RMF GOVERN 1 | "Policies, processes, procedures, and practices across the organisation related to the mapping, measuring, and managing of AI risks are in place" | Model portfolio registry, routing table, compliance routing rule set, and monthly calibration procedure constitute the organisational policies and practices; shadow mode reports and routing decision logs are the evidence of practice enforcement |
| ISO/IEC 42001 Clause 8.4 | AI system operation must be controlled and monitored | Routing decision log, quality dashboard, fallback event log, and monthly calibration report are the monitoring and control records; every routing table change requires governance sign-off |
| APRA CPS 230 §21 | Critical operations must have defined RTOs/RPOs; AI system failures must not breach critical operation SLAs | Fallback chain manager ensures routing continues to function when primary model tier is unavailable; fallback to Tier 0 maintains service continuity within the RTO even if Tier 2 reasoning is unavailable; fallback rate alert (>1% per rolling hour) triggers capacity review before RTO breach |
| APRA CPS 234 | Material information assets must be protected; third-party providers assessed | Compliance routing rule enforces data-residency controls; model portfolio registry documents provider risk assessments; PROTECTED-class queries never transit external API providers |

15. Reference Implementations

AWS

Deploy router as an AWS Lambda fronting Amazon Bedrock. Bedrock natively supports multiple model providers (Anthropic, Meta, Mistral, Amazon Nova) enabling provider-agnostic routing. Compliance routing to AWS GovCloud Bedrock endpoint for PROTECTED data. Cost budget enforcer using DynamoDB atomic counters. Quality signals via CloudWatch custom metrics + Bedrock model evaluation jobs. Model portfolio in AWS AppConfig with automatic Lambda reload on config change.

Azure

Implement as Azure API Management policy calling Azure AI Foundry for model routing. Azure AI Foundry's deployment model supports routing between OpenAI, Meta, and Mistral models in a single API surface. Data classification via Azure Purview labels propagated in the request metadata. Compliance routing to Azure Government Cloud for sovereign requirements. Cost tracking via Azure Cost Management; budget enforcement via Azure Budgets alerts + Logic App downgrade trigger.

On-Premises / Private Cloud

Deploy LiteLLM proxy with custom router strategy implementing all four routing checks. Model portfolio includes vLLM-served open-weight models (DeepSeek-R1 for reasoning, Llama 3 70B for mid-tier, Llama 3 8B for standard) plus on-prem Mistral for sovereign routing. OPA sidecar for compliance rules. Redis for cost budget counters. Prometheus + Grafana for routing decision metrics and quality signals.


16. Related Patterns

  • EAAPL-RSN001: Extended Thinking Gate (a specialised routing gate this pattern subsumes for the reasoning tier)
  • EAAPL-RSN002: Think Budget Allocation (once routed to reasoning tier, budget allocation governs thinking-token usage)
  • EAAPL-RSN003: Reasoning-then-Act (agentic planning calls route through this pattern before the reasoning model is invoked)
  • EAAPL-SEC003: Data Classification and Residency (the compliance routing rule depends on this pattern's classification output)
  • EAAPL-OBS001: LLM Observability (quality signals feeding router calibration are produced by this pattern)

17. Maturity Assessment

| Dimension | Level (1–5) | Notes |
| --- | --- | --- |
| Pattern stability | 4 | Multi-tier model routing is a well-established cloud architecture concept; AI-specific compliance routing is newer but stable |
| Tooling availability | 4 | LiteLLM, Portkey, AWS Bedrock, Azure AI Foundry all support multi-model routing natively |
| Reference implementations | 3 | AWS and Azure implementations documented; on-premises open-weight routing is emerging |
| Regulatory acceptance | 3 | Compliance routing satisfies data-residency requirements; quality calibration documentation satisfies AI Act risk management expectations |

18. Revision History

| Version | Date | Change |
| --- | --- | --- |
| 1.0 | 2026-06-14 | Initial release |