[EAAPL-RSN004] Cost-Quality Router

Category: Reasoning Models
Sub-category: Model Routing
Version: 1.0
Maturity: Proven
Tags: reasoning-models model-routing cost-optimisation quality-tiers llm-gateway multi-model o3 claude gemini llm-ops
Regulatory Relevance: EU AI Act Article 9 (Risk Management), NIST AI RMF (Govern 1.3, Measure 2.5), ISO/IEC 42001 Clause 8.4, APRA CPS 234

1. Executive Summary

The Cost-Quality Router is an LLM gateway pattern that dynamically selects from a ranked portfolio of models — spanning standard, reasoning-capable, and frontier tiers — to match each request to the minimum model capability required to meet a defined quality threshold. The router operates across four dimensions simultaneously: query complexity, latency requirement, cost budget, and compliance risk class. A routine FAQ lookup routes to a fast, cheap model (GPT-4o-mini, Claude 3.5 Haiku, Gemini 1.5 Flash). A complex multi-document synthesis routes to a mid-tier model. A high-stakes legal reasoning task routes to a reasoning model with extended thinking. A classified government query routes to an on-premises model regardless of quality score. The routing decision is made in under 50ms and is fully auditable.

For enterprise technology leaders, the Cost-Quality Router is the foundational infrastructure investment that makes deploying a diverse AI model portfolio operationally sustainable. Without it, engineering teams independently select models per feature — creating a sprawl of hardcoded model names, inconsistent cost controls, and no portfolio-level visibility. With it, the organisation has a single governed layer that enforces cost policy, quality standards, compliance routing rules, and provider failover simultaneously. Organisations implementing this pattern typically achieve 50–70% cost reduction relative to using a single high-capability model across all workloads while maintaining or improving quality on the tasks where it matters most.

2. Problem Statement

Business Problem

The rapid proliferation of AI model options — standard, reasoning, multimodal, on-premises, domain-specific — creates a model selection problem that individual development teams are not equipped to solve systematically. Each team chooses a model based on familiarity or recent benchmark results rather than a principled assessment of required capability vs cost vs compliance risk. The result is a portfolio of AI features with inconsistent cost structures, no aggregate visibility, and no mechanism to benefit from new cheaper or more capable models when they are released.

Technical Problem

LLM clients hardcode model identifiers in application code. When a new model is released or a cost policy changes, every application must be updated independently. Provider outages cause cascading failures with no automated fallback. There is no runtime mechanism to enforce that high-cost reasoning models are not called for low-complexity tasks, and no way to route queries containing classified information to compliant on-premises infrastructure rather than external API providers.

Symptoms of Absence

Monthly AI API costs grow unchecked as new features default to the most capable (most expensive) available model
Provider outages cause partial system failures; recovery requires manual intervention to swap model identifiers
Compliance team discovers that queries containing regulated data are being sent to an external cloud API without data residency controls
Quality regressions go undetected because there is no baseline quality measurement per model per query type
New model releases require coordinated multi-team deployments to update model names across dozens of services

Cost of Inaction

Cost: Without routing, the marginal cost of adding an AI feature defaults to the cost of the most familiar model, so every new feature adds spend at the most expensive per-query rate instead of the cheapest adequate one
Quality: Uniform model selection means either over-spending on simple tasks or under-performing on complex ones — the portfolio is never calibrated
Operational: No failover means provider SLA becomes the AI system SLA; no routing audit means no ability to satisfy data-residency requirements in regulated environments

3. Context

When to Apply

Organisations with 5+ distinct AI-powered features consuming model APIs
Environments subject to data residency, sovereignty, or classification requirements where some queries must route to on-premises models
Any AI deployment with a monthly model API cost exceeding $5,000 where cost optimisation is a budget priority
Platforms serving multiple tenants or business units with different quality and cost expectations
Organisations that want to adopt new models as they are released without redeploying application code

Australian Enterprise Examples

Services Australia (Centrelink and Medicare) has deployed a citizen query router across its digital service channels that routes approximately 84% of enquiries (balance checks, payment date lookups, office location queries) to a standard model tier. The remaining 16% involve complex eligibility determinations (JobSeeker income test with multiple exempt income types, NDIS reasonable and necessary support criteria, or aged care means assessment with asset-tested supplements) and are routed to the reasoning model tier. The sovereign routing rule enforces that all queries containing Tax File Numbers or Medicare card numbers are routed exclusively to an Australian-hosted model, satisfying the Department of Home Affairs cloud security policy for PROTECTED personal information.

Westpac's customer intelligence platform routes 91% of queries to its standard model tier for product information and account balance responses, 7% to the mid-tier for financial planning scenario modelling, and 2% to the reasoning tier for complex margin lending and SMSF investment analysis queries. The cost routing rule applies an additional constraint: during the final two business days of each month (peak query volume for payment processing), the budget enforcer automatically restricts Tier 2 routing to queries from Westpac Private Bank customers only, preventing end-of-month compute cost spikes from exceeding the AI operations budget.

The Australian Securities Exchange (ASX) operates the router on its Listed Entity Reporting platform to separate routine announcement metadata extraction (Tier 0) from continuous disclosure obligation assessment queries (Tier 2), which require the model to reason about whether a material price-sensitive event triggers an immediate ASX Listing Rule 3.1 disclosure. The compliance routing rule also enforces that queries containing price-sensitive content under ASX embargo are routed to the ASX's own on-premises model tier, preventing pre-disclosure information from transiting any external AI API provider.

When NOT to Apply

Single-purpose applications with a single, well-defined query type where one model is always correct
Prototypes and proof-of-concept work where routing infrastructure overhead is premature
Applications with hard sub-100ms P99 latency requirements where router overhead is unacceptable
Teams without an LLM gateway or API proxy layer already in place

Prerequisites

An LLM gateway or API proxy (LiteLLM, AWS Bedrock, Azure AI Foundry, Portkey, or custom)
A query classification scheme capable of producing complexity, latency requirement, and compliance-risk-class signals
Contractual access to at least three model tiers: a fast/cheap standard model, a mid-tier general-purpose model, and a reasoning model
A quality measurement framework (LLM-as-judge, task-specific metrics, user feedback) to calibrate routing thresholds
Cost monitoring with per-model granularity

Industry Applicability

Industry	Use Case	Value	Adoption Level
Financial Services	Route simple product lookups to Haiku; AML analysis to o3; classified data to on-prem model	55–65% cost reduction; AML accuracy maintained; data residency compliance	Early Adopter
Healthcare	Appointment booking to Flash; clinical summarisation to mid-tier; diagnostic reasoning to Claude extended thinking	Clinical AI budget sustainable; accuracy preserved on diagnostic tasks	Pilot
Government	Public FAQ to standard; policy analysis to mid-tier; classified briefs to sovereign cloud model	Compliance with ISM/PROTECTED classification requirements	Pilot
Legal Technology	Clause extraction to Haiku; contract analysis to mid-tier; litigation strategy to o3 high	Per-matter AI cost is predictable and defensible	Growing
Software Engineering	Autocomplete to Haiku; code review to mid-tier; security architecture to reasoning model	Developer tooling cost scales with task complexity	Mature

4. Architecture Overview

The Cost-Quality Router is deployed as a middleware layer within the LLM gateway. Every AI request enters the router carrying a request envelope that includes: the query text, the calling application's feature tag, an optional urgency flag (latency requirement), and an optional data-classification tag. The router runs four sequential checks, each of which can override the default model selection: complexity classification, latency constraint check, compliance routing rule evaluation, and cost budget enforcement.

Complexity classification scores the query 0–1 and maps it to a primary model tier. The latency constraint check overrides to a faster model if the caller has declared a sub-2-second requirement and the complexity-selected model cannot meet it. The compliance routing rule evaluator checks the data-classification tag against a policy table: queries tagged PROTECTED, CONFIDENTIAL, or carrying PII above a risk threshold are routed to the designated compliant model (on-premises or sovereign cloud), regardless of complexity score. The cost budget check enforces per-tenant or per-feature spending caps: once the current period's spend exceeds 90% of the cap, the router downgrades one tier to reduce per-query cost.

The selected model, its tier, and the routing rationale (which rule drove the final decision) are logged with every request. A Model Performance Dashboard aggregates quality signals (task success rate, user feedback, LLM-as-judge scores) and cost per model per query class, enabling the router's threshold table to be calibrated monthly. When a new model is released, it is onboarded into the router's model portfolio and shadow-tested at 5% traffic before becoming a routing candidate.
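The logged record described above could look something like the following. This is a minimal sketch that assumes a JSON-lines audit log; the class name `RoutingDecision` and its field names are illustrative assumptions, not a schema defined by this pattern.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class RoutingDecision:
    """One audit record per request (hypothetical schema)."""
    request_id: str
    feature_tag: str
    selected_model: str
    tier: int
    deciding_rule: str  # "complexity", "latency", "compliance" or "budget"
    complexity_score: float
    timestamp: str


def to_log_line(decision: RoutingDecision) -> str:
    """Serialise the decision as one JSON line for an append-only log."""
    return json.dumps(asdict(decision), sort_keys=True)


decision = RoutingDecision(
    request_id="req-001",
    feature_tag="faq-bot",
    selected_model="tier-0-standard",
    tier=0,
    deciding_rule="complexity",
    complexity_score=0.12,
    timestamp=datetime.now(timezone.utc).isoformat(),
)
print(to_log_line(decision))
```

Recording the deciding rule alongside the model is what lets an auditor see why a request landed on a given tier, not only where it landed.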

Fallback chains are defined per model: if the primary selected model is unavailable (rate limit, outage), the router automatically tries the next tier in the portfolio. Fallback decisions are logged and trigger an alert if the fallback rate for a given model exceeds 1% over a rolling hour.
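The rolling-hour fallback alert can be sketched as a small in-memory monitor. This is an illustration only: the class `FallbackRateMonitor` and its method names are assumptions, and a production router would more likely derive the rate from its metrics backend.

```python
from collections import deque
from typing import Deque, Tuple


class FallbackRateMonitor:
    """Tracks the fallback rate for one model over a rolling time window."""

    def __init__(self, window_seconds: float = 3600.0, threshold: float = 0.01):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._events: Deque[Tuple[float, bool]] = deque()  # (timestamp, was_fallback)

    def record(self, timestamp: float, was_fallback: bool) -> None:
        self._events.append((timestamp, was_fallback))
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        # Drop events that have aged out of the rolling window.
        while self._events and self._events[0][0] <= now - self.window_seconds:
            self._events.popleft()

    def rate(self, now: float) -> float:
        self._evict(now)
        if not self._events:
            return 0.0
        fallbacks = sum(1 for _, was_fallback in self._events if was_fallback)
        return fallbacks / len(self._events)

    def should_alert(self, now: float) -> bool:
        return self.rate(now) > self.threshold


monitor = FallbackRateMonitor()
for i in range(200):
    # Calls at i = 0, 70 and 140 fall back: 3 of 200 calls, or 1.5%.
    monitor.record(timestamp=float(i), was_fallback=(i % 70 == 0))
print(monitor.should_alert(now=200.0))
```

One monitor instance per primary model keeps the alert aligned with the per-model threshold stated above.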

4a. API Reference

LiteLLM — Multi-Model Router Configuration (Python)

import os

from litellm import Router

# Define the model portfolio in code (production: load from versioned YAML)
model_list = [
    # Tier 0: standard, fast, cheap
    {"model_name": "tier-0-standard", "litellm_params": {
        "model": "claude-3-5-haiku-20241022", "api_key": os.getenv("ANTHROPIC_API_KEY")}},
    # Tier 1: mid-tier general purpose
    {"model_name": "tier-1-midtier", "litellm_params": {
        "model": "claude-3-5-sonnet-20241022", "api_key": os.getenv("ANTHROPIC_API_KEY")}},
    # Tier 2: reasoning model with extended thinking
    {"model_name": "tier-2-reasoning", "litellm_params": {
        "model": "claude-3-7-sonnet-20250219", "api_key": os.getenv("ANTHROPIC_API_KEY"),
        "thinking": {"type": "enabled", "budget_tokens": 12000}}},
    # Tier 3: sovereign/on-prem (data residency enforcement)
    {"model_name": "tier-3-sovereign", "litellm_params": {
        "model": "bedrock/meta.llama3-70b-instruct-v1:0",
        "aws_region_name": "ap-southeast-2"}},  # Sydney region for AU data residency
]

# Fallback chains step down one tier on outage or rate limit. The sovereign tier
# has no fallback, so classified traffic cannot spill over to an external API.
router = Router(model_list=model_list, fallbacks=[
    {"tier-2-reasoning": ["tier-1-midtier"]},   # reasoning tier fallback on outage
    {"tier-1-midtier": ["tier-0-standard"]},    # mid-tier fallback
])

# Router aliases indexed by tier number (Tier 3 is reached only via the compliance rule)
TIER_MODELS = ["tier-0-standard", "tier-1-midtier", "tier-2-reasoning"]

def route_request(query: str, complexity_score: float, data_class: str,
                  latency_req_ms: int, budget_pct_used: float):
    """Pick a tier for the query and return the LiteLLM completion response.

    budget_pct_used is a fraction of the period budget (0.9 means 90% consumed).
    """
    messages = [{"role": "user", "content": query}]
    # 1. Compliance routing takes absolute precedence
    if data_class in ("PROTECTED", "CONFIDENTIAL"):
        return router.completion(model="tier-3-sovereign", messages=messages)
    # 2. Complexity-based selection
    if complexity_score >= 0.75:
        tier = 2
    elif complexity_score >= 0.45:
        tier = 1
    else:
        tier = 0
    # 3. Latency constraint overrides quality selection (simplified: any
    #    sub-2-second requirement goes straight to the fastest tier)
    if latency_req_ms < 2000:
        tier = 0
    # 4. Cost budget enforcer: downgrade one tier if >90% of period budget consumed
    if budget_pct_used > 0.9:
        tier = max(tier - 1, 0)
    return router.completion(model=TIER_MODELS[tier], messages=messages)

OpenAI o3 vs GPT-4o-mini — Routing Decision in Code

import logging

import openai

logger = logging.getLogger(__name__)

def select_model_and_call(query: str, complexity_score: float) -> str:
    messages = [{"role": "user", "content": query}]
    if complexity_score >= 0.75:
        # Tier 2: o3 with high reasoning effort
        response = openai.chat.completions.create(
            model="o3",
            reasoning_effort="high",
            messages=messages
        )
        # Read actual reasoning tokens for cost attribution
        reasoning_tokens = response.usage.completion_tokens_details.reasoning_tokens
    elif complexity_score >= 0.45:
        # Tier 1: GPT-4o, capable but no extended thinking
        response = openai.chat.completions.create(
            model="gpt-4o",
            messages=messages
        )
        reasoning_tokens = 0
    else:
        # Tier 0: GPT-4o-mini, fast, cheap, good enough for simple queries
        response = openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages
        )
        reasoning_tokens = 0
    # Log the model that served the call and its reasoning-token usage
    logger.info("model=%s reasoning_tokens=%s", response.model, reasoning_tokens)
    return response.choices[0].message.content
# Cost comparison (Jun 2026): GPT-4o-mini $0.15/M input vs o3 $10/M input + $40/M output.
# A 500-token query at Tier 0 costs AU$0.0001; at Tier 2 (o3 high, 10K reasoning) costs AU$0.62.
# The router's ROI is realised by keeping ≥ 60% of traffic at Tier 0.
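The per-query figures in the cost comparison can be reproduced with a few lines of arithmetic. The sketch below works in US$ using the per-million-token prices quoted there; the helper name `query_cost_usd` is an assumption, and reasoning tokens are treated as output tokens for billing. The AU$ figures quoted are consistent with an exchange rate of roughly 1.5 AUD per USD, which is an inference and not a rate stated in this pattern.

```python
def query_cost_usd(input_tokens: int, output_tokens: int,
                   input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of one call in US$; reasoning tokens count as output tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# Tier 0: 500 input tokens on GPT-4o-mini, input side only
tier0 = query_cost_usd(500, 0, input_price_per_m=0.15, output_price_per_m=0.0)
# Tier 2: 500 input tokens plus 10,000 reasoning tokens on o3
tier2 = query_cost_usd(500, 10_000, input_price_per_m=10.0, output_price_per_m=40.0)

print(f"Tier 0: US${tier0:.6f}")
print(f"Tier 2: US${tier2:.3f}")
print(f"Tier 2 / Tier 0: {tier2 / tier0:,.0f}x")
```

The gap of several thousand times per query is why the share of traffic kept at Tier 0 dominates the portfolio cost.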

5. Architecture Diagram

flowchart TD
    subgraph Intake["Request Intake"]
        A[AI Request and Envelope]
        B[Router Middleware]
    end
    subgraph Rules["Routing Rules Engine"]
        C[Complexity Classifier]
        D[Latency Constraint Check]
        E[Compliance Rule Check]
        F[Cost Budget Enforcer]
    end
    subgraph Portfolio["Model Portfolio"]
        G[Standard Tier Models]
        H[Mid-Tier Models]
        I[Reasoning Tier Models]
        J[On-Prem Sovereign Models]
    end
    subgraph Observability["Observability"]
        K[Routing Decision Log]
        L[Quality Dashboard]
    end
    A --> B
    B --> C
    C --> D
    D --> E
    E --> F
    F --> G
    F --> H
    F --> I
    E -->|classified data| J
    G --> K
    H --> K
    I --> K
    J --> K
    K --> L

6. Components

Component	Responsibility	Technology Examples
Router Middleware	Orchestrates all routing rule checks; produces final model selection decision	LiteLLM router, Portkey AI gateway, custom FastAPI or Express middleware
Complexity Classifier	Scores query 0–1 for reasoning depth required	Rule engine + optional DistilBERT classifier (see EAAPL-RSN001)
Compliance Rule Engine	Maps data-classification tags to permitted model pool; enforces data residency	OPA policy engine, custom rule table, Azure AI Foundry content filtering
Cost Budget Enforcer	Tracks spend per tenant/feature; triggers tier downgrade near budget cap	Redis counters + budget table, AWS Budgets API integration
Model Portfolio Registry	Versioned catalogue of available models with tier, provider, capability flags, and SLA	YAML config in git, AWS AppConfig, Azure App Configuration
Quality Measurement Service	LLM-as-judge or task-metric evaluator producing quality signal per model per query class	Braintrust, Langfuse, custom evaluator Lambda
Fallback Chain Manager	Handles provider errors; retries on next tier; logs fallback events	LiteLLM fallback config, custom circuit breaker

7. Implementation Steps

Step 1: Audit and Catalogue Current Model Usage

Before building the router, audit every AI API call in the organisation: which model, which feature, which query type, what monthly cost, what quality signal exists. Group calls into 5–10 query categories. For each category, identify the minimum model tier that meets the quality bar (using existing quality signals or running a rapid A/B evaluation). This audit produces the initial routing table — the router's configuration before any ML classifier is added. Start with this rule-based table; it will outperform intuition immediately.
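The audit-to-routing-table step can be sketched as picking, for each query category, the cheapest tier whose measured quality clears the bar. The category names, quality scores, and the 0.90 bar below are made-up illustrative values, not results from any deployment, and the function names are assumptions.

```python
from typing import Dict, List, Optional

# Hypothetical audit output: measured quality (0-1) per category, listed by tier.
# Tiers are ordered cheapest first: 0=standard, 1=mid, 2=reasoning.
audit_quality: Dict[str, List[float]] = {
    "faq_lookup": [0.96, 0.97, 0.97],
    "multi_doc_synthesis": [0.71, 0.91, 0.94],
    "legal_interpretation": [0.52, 0.78, 0.93],
}


def minimum_tier(scores: List[float], quality_bar: float) -> Optional[int]:
    """Return the cheapest tier whose measured quality meets the bar, if any."""
    for tier, score in enumerate(scores):
        if score >= quality_bar:
            return tier
    return None  # no tier meets the bar; flag the category for review


def build_routing_table(audit: Dict[str, List[float]],
                        quality_bar: float = 0.90) -> Dict[str, Optional[int]]:
    return {category: minimum_tier(scores, quality_bar)
            for category, scores in audit.items()}


routing_table = build_routing_table(audit_quality)
print(routing_table)
# → {'faq_lookup': 0, 'multi_doc_synthesis': 1, 'legal_interpretation': 2}
```

A `None` entry is a useful signal in its own right: it marks a category where no portfolio model is adequate and routing alone will not fix quality.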

Step 2: Build the Model Portfolio Registry

Define the model portfolio as a versioned configuration file. For each model, record: provider name, model identifier, tier (0=standard, 1=mid, 2=reasoning, 3=sovereign), maximum context window, per-token cost for input/output/thinking, supported features (tool use, vision, extended thinking), P95 latency from internal benchmarks, and data-residency region. Build a Model Adapter interface that normalises the API call shape across providers — this is the abstraction that makes model swaps transparent to application code. Test the adapter against all portfolio models before wiring the router.
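One possible shape for a registry entry and the Model Adapter interface is sketched below. The names `ModelEntry`, `ModelAdapter`, `EchoAdapter`, and `call_model` are hypothetical, and the zero values in the sample entry are placeholders to be replaced with contract prices and internal benchmark results.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol, Tuple


@dataclass(frozen=True)
class ModelEntry:
    """One row of the portfolio registry (field names are illustrative)."""
    provider: str
    model_id: str
    tier: int                   # 0=standard, 1=mid, 2=reasoning, 3=sovereign
    max_context_tokens: int
    input_cost_per_m: float     # US$ per 1M input tokens
    output_cost_per_m: float    # US$ per 1M output tokens
    features: Tuple[str, ...]
    p95_latency_ms: int
    residency_region: str


class ModelAdapter(Protocol):
    """Normalised call shape that every provider adapter implements."""

    def complete(self, entry: ModelEntry, messages: List[Dict[str, str]]) -> str: ...


class EchoAdapter:
    """Stand-in adapter for tests; a real adapter would call the provider SDK."""

    def complete(self, entry: ModelEntry, messages: List[Dict[str, str]]) -> str:
        return f"[{entry.model_id}] {messages[-1]['content']}"


def call_model(adapter: ModelAdapter, entry: ModelEntry, prompt: str) -> str:
    """Application code depends only on the adapter interface, not a provider."""
    return adapter.complete(entry, [{"role": "user", "content": prompt}])


# Placeholder values (0 = not yet measured or contracted).
haiku = ModelEntry(
    provider="anthropic", model_id="claude-3-5-haiku-20241022", tier=0,
    max_context_tokens=200_000, input_cost_per_m=0.0, output_cost_per_m=0.0,
    features=("tool_use",), p95_latency_ms=0, residency_region="unknown",
)
print(call_model(EchoAdapter(), haiku, "ping"))
```

Running the same prompt through every registered entry with a stand-in adapter first makes it easy to test the router wiring before any provider credentials are involved.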

Step 3: Implement Compliance Routing Rules First

Before complexity routing, implement and validate compliance rules, because these are the highest-stakes routing decisions. Define a data-classification taxonomy (PUBLIC, INTERNAL, CONFIDENTIAL, PROTECTED) and a mapping from classification to permitted model pool. Implement the classifier as a combination of structural signals (presence of Government PROTECTED markers, PII entity detection, tenant-level policy) and explicit caller annotation. Test with at least 50 labelled samples per classification tier as a smoke test. Any miscategorisation that routes PROTECTED data to an external API is a P0 incident, so the rule engine must achieve > 99.9% recall on the PROTECTED class: a missed PROTECTED query is the failure that matters. Demonstrating that rate requires a far larger labelled PROTECTED set than 50 samples.
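The classification-to-pool mapping can be sketched as a lookup table that fails closed. The tier aliases match the LiteLLM example in section 4a; treating INTERNAL data as eligible for external tiers is an assumption made for illustration, and the function names are hypothetical.

```python
from typing import Dict, FrozenSet, Optional

SOVEREIGN_ONLY: FrozenSet[str] = frozenset({"tier-3-sovereign"})
ALL_TIERS: FrozenSet[str] = frozenset({
    "tier-0-standard", "tier-1-midtier", "tier-2-reasoning", "tier-3-sovereign",
})

# Assumed policy table; the real mapping must be signed off by compliance.
PERMITTED_POOLS: Dict[str, FrozenSet[str]] = {
    "PUBLIC": ALL_TIERS,
    "INTERNAL": ALL_TIERS,
    "CONFIDENTIAL": SOVEREIGN_ONLY,
    "PROTECTED": SOVEREIGN_ONLY,
}


def permitted_models(data_class: Optional[str]) -> FrozenSet[str]:
    """Fail closed: a missing or unknown classification gets the sovereign pool."""
    if data_class is None:
        return SOVEREIGN_ONLY
    return PERMITTED_POOLS.get(data_class.strip().upper(), SOVEREIGN_ONLY)


def enforce(selected_model: str, data_class: Optional[str]) -> str:
    """Override the selected model if it is outside the permitted pool."""
    pool = permitted_models(data_class)
    return selected_model if selected_model in pool else "tier-3-sovereign"


print(enforce("tier-2-reasoning", "PROTECTED"))
print(enforce("tier-0-standard", "public"))
print(enforce("tier-1-midtier", None))
```

Failing closed trades some cost and quality on unlabelled traffic for the guarantee that a classifier outage cannot leak regulated data to an external provider.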

Routing Tier Calibration Reference Matrix

Reference routing decisions by query type, based on production deployments across Australian enterprise and government sectors. AU$ costs calculated at Jun 2026 rates.

Query Type	Default Tier	Model Example	Cost per Query (AU$)	Compliance Override
FAQ / product information lookup	Tier 0 — standard	Claude 3.5 Haiku / GPT-4o-mini	AU$0.001–0.003	None
Summarisation / extraction from single document	Tier 0–1	Claude 3.5 Haiku or Sonnet	AU$0.003–0.015	PROTECTED doc → Tier 3 sovereign
Multi-document synthesis / comparison	Tier 1 — mid	Claude 3.5 Sonnet / GPT-4o	AU$0.015–0.08	CONFIDENTIAL content → Tier 3 sovereign
Financial analysis / modelling	Tier 2 — reasoning	Claude 3.7 / o3 medium	AU$0.28–0.56	PII present → Tier 3 if data residency policy applies
Legal / regulatory interpretation	Tier 2 — reasoning	Claude 3.7 / o3 high	AU$0.56–1.12	Always Tier 3 if PROTECTED classification
Code architecture / security review	Tier 1–2	Claude 3.7 low or mid budget	AU$0.15–0.45	Source code classified CONFIDENTIAL → Tier 3 sovereign
Classified government brief / PROTECTED data	Tier 3 — sovereign	AWS GovCloud Bedrock / on-prem	AU$0.05–0.25 (GPU amortised)	Mandatory regardless of complexity

Step 4: Deploy with Shadow Mode and Calibrate

Deploy the router in shadow mode: all traffic still goes to the existing hardcoded model, but the router logs what it would have selected. After two weeks, compare shadow decisions to actual decisions. Where the router would have selected a cheaper tier and quality metrics suggest no regression, begin migrating those query categories to the router decision. Increase routed traffic 10% per week per category, monitoring quality and cost. Full migration typically completes in six weeks per query category. Establish a monthly routing calibration cadence: review the routing table, quality metrics, and cost outcomes.
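A shadow-mode comparison report can start as a simple tally per query category. The sketch below assumes each logged record carries the tier that actually served the request and the tier the router would have chosen; `ShadowRecord` and `summarise` are hypothetical names, and only Tiers 0 to 2 are compared because Tier 3 is driven by compliance, not cost.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class ShadowRecord:
    category: str
    actual_tier: int   # tier of the hardcoded model that served the request
    shadow_tier: int   # tier the router would have selected


def summarise(records: Iterable[ShadowRecord]) -> Dict[str, Dict[str, int]]:
    """Count, per category, how often the router would go cheaper, same or higher."""
    summary: Dict[str, Dict[str, int]] = defaultdict(
        lambda: {"total": 0, "cheaper": 0, "same": 0, "higher": 0})
    for record in records:
        row = summary[record.category]
        row["total"] += 1
        if record.shadow_tier < record.actual_tier:
            row["cheaper"] += 1
        elif record.shadow_tier == record.actual_tier:
            row["same"] += 1
        else:
            row["higher"] += 1
    return dict(summary)


records = [
    ShadowRecord("faq", actual_tier=1, shadow_tier=0),
    ShadowRecord("faq", actual_tier=1, shadow_tier=1),
    ShadowRecord("legal", actual_tier=1, shadow_tier=2),
]
report = summarise(records)
print(report)
```

Categories with a high "cheaper" share are migration candidates once quality checks pass; a high "higher" share suggests the hardcoded model was under-serving that category.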

8. Security Considerations

OWASP LLM Top 10 Mapping

OWASP ID	Threat	Mitigation
LLM01 — Prompt Injection	Injected instructions claim "this is a high-priority task" to force routing to expensive reasoning model	Routing decisions based on structural query signals and authenticated caller-provided metadata; never on query text claims
LLM06 — Sensitive Information Disclosure	Mis-routed PROTECTED query sent to external API	Compliance rule engine runs before any external API call; classification errors trigger P0 alert and automatic fallback to sovereign model
LLM07 — Insecure Plugin Design	Application bypasses router by calling model provider directly	All model API keys held in router layer; application credentials only valid for router endpoint
LLM04 — Model Denial of Service	Burst traffic forces all requests to reasoning tier, exhausting per-minute token quota	Per-model rate limiting in router; queue-based smoothing for burst; cost budget enforcer triggers tier downgrade before quota exhaustion

9. Governance Artefacts

Model portfolio registry (version-controlled; every model addition requires capability and compliance validation)
Routing table document with each rule, its quality justification, and the compliance requirement it satisfies
Data-classification taxonomy and mapping to permitted model pools (signed off by CISO and compliance team)
Monthly routing calibration report with quality and cost outcomes per query category
Fallback event log with root cause and resolution per provider outage
Shadow mode comparison report for each new query category before live routing migration

10. SLOs

SLO	Target	Measurement
Routing decision latency P99	< 50ms	Router middleware execution time excluding model call
Compliance mis-routing rate	0% — zero PROTECTED queries to non-compliant model	Monthly audit of routing log for PROTECTED-classified queries
Cost reduction vs baseline	> 50% vs single high-tier model at same query volume	1 − (monthly API cost / baseline cost)
Fallback rate per primary model	< 1% per rolling hour	Fallback events / total routed calls per model per hour
Quality regression rate per category	< 2% queries below quality threshold post-migration	Quality measurement service per-category weekly report

11. Cost Model

Cost Driver	Estimate	Notes
Standard tier (Tier 0) routed queries	$0.15–0.60 per 1M input tokens	GPT-4o-mini, Claude 3.5 Haiku, Gemini 1.5 Flash; target for 60–70% of query volume
Mid-tier (Tier 1) routed queries	$2–15 per 1M input tokens	GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro; target for 25–35% of volume
Reasoning tier (Tier 2) routed queries	$15–60 per 1M input tokens plus thinking tokens	o3, Claude 3.7 extended thinking; target for 3–10% of volume
Sovereign/on-prem tier (Tier 3)	$0.50–5/M (GPU amortised) or contracted sovereign cloud rate	Volume driven by compliance rules, not complexity
Router middleware compute	$1–10 per 1M requests	Lambda/container cost; negligible vs model API cost
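To see how the tier mix drives portfolio cost, the blended input-token price can be computed from traffic shares and per-tier prices. The shares and prices below are illustrative picks from within the ranges in the table, the baseline is assumed to be a single model at the low end of the reasoning-tier range, and output and thinking tokens are ignored, so the result is a rough sketch, not a forecast.

```python
from typing import List, Tuple


def blended_cost_per_m(mix: List[Tuple[float, float]]) -> float:
    """Weighted average price per 1M input tokens.

    mix is a list of (traffic_share, price_per_m) pairs whose shares sum to 1.
    """
    total_share = sum(share for share, _ in mix)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError(f"traffic shares must sum to 1, got {total_share}")
    return sum(share * price for share, price in mix)


# Illustrative mix: 60% Tier 0 at $0.60/M, 30% Tier 1 at $10/M, 10% Tier 2 at $15/M
mix = [(0.60, 0.60), (0.30, 10.00), (0.10, 15.00)]
blended = blended_cost_per_m(mix)

baseline = 15.00  # assumed: every query sent to one reasoning-tier model
reduction = 1 - blended / baseline

print(f"Blended: ${blended:.2f}/M, reduction vs baseline: {reduction:.1%}")
```

The result is highly sensitive to the baseline chosen and to how much traffic stays at Tier 0, so the calculation is worth rerunning with measured shares at each calibration review.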

12. Trade-off Analysis

Dimension	Benefit	Trade-off
Cost	50–70% reduction vs single high-tier model across all workloads	Router infrastructure build cost; ongoing calibration time
Quality	Reasoning-tier accuracy on hard queries; fast standard-tier on simple queries	Quality degradation if routing thresholds mis-calibrated; monitoring required
Operational resilience	Multi-provider fallback chains; no single-provider dependency	More providers to contract, monitor, and integrate
Compliance	Compliance routing rule ensures data-residency requirements met automatically	Classification accuracy is critical; rule maintenance is ongoing
Developer experience	Model selection abstracted from application code; no code change for model upgrades	Teams lose direct model control; router config is a new dependency

13. Failure Modes

Failure	Trigger	Recovery
Routing table stale after product change	New query type introduced without router config update	Default to mid-tier for unclassified queries; alert on unclassified rate > 5%; monthly review catches gaps
Compliance classifier failure	PII detector offline; classification API timeout	Fail-safe: default all unclassified queries to on-prem sovereign model until classifier recovers; P1 alert
Cost enforcer bug causes all traffic to Tier 0	Off-by-one in budget counter logic	Quality dashboard detects quality drop; automated quality gate alerts; rollback to previous router version
Provider rate limit cascade	Tier 0 provider throttles; all traffic falls back to Tier 1, which also throttles	Queue-based backpressure; shed non-critical traffic; escalate to provider; capacity increase
Shadow mode routing divergence hidden	Shadow mode logs never reviewed; systematic routing errors not caught before live migration	Shadow mode reports mandatory before live migration sign-off; reports distributed to governance team weekly

14. Regulatory Mapping

Regulation	Requirement	How Pattern Addresses It
EU AI Act Article 9 — Risk Management	Risk management must include controls on AI system behaviour and outputs	Compliance routing rules prevent regulated data from reaching non-compliant models; routing audit log supports risk management documentation; routing table reviewed monthly against updated risk register
EU AI Act Article 13 — Transparency	Reasoning chains must be explainable to competent authorities on demand for high-risk AI systems	Routing decision log records model selected, tier, and routing rule that drove the decision for every request; competent authority can reconstruct which model capability was applied to any specific high-risk output
NIST AI RMF GOVERN 1	"Policies, processes, procedures, and practices across the organisation related to the mapping, measuring, and managing of AI risks are in place"	Model portfolio registry, routing table, compliance routing rule set, and monthly calibration procedure constitute the organisational policies and practices; shadow mode reports and routing decision logs are the evidence of practice enforcement
ISO/IEC 42001 Clause 8.4	AI system operation must be controlled and monitored	Routing decision log, quality dashboard, fallback event log, and monthly calibration report are the monitoring and control records; every routing table change requires governance sign-off
APRA CPS 230 §21	Critical operations must have defined RTOs/RPOs; AI system failures must not breach critical operation SLAs	Fallback chain manager ensures routing continues to function when primary model tier is unavailable; fallback to Tier 0 guarantees service continuity within the RTO even if Tier 2 reasoning is unavailable; fallback rate alert (>1% per rolling hour) triggers capacity review before RTO breach
APRA CPS 234	Material information assets must be protected; third-party providers assessed	Compliance routing rule enforces data-residency controls; model portfolio registry documents provider risk assessments; PROTECTED-class queries never transit external API providers

15. Reference Implementations

AWS

Deploy router as an AWS Lambda fronting Amazon Bedrock. Bedrock natively supports multiple model providers (Anthropic, Meta, Mistral, Amazon Nova) enabling provider-agnostic routing. Compliance routing to AWS GovCloud Bedrock endpoint for PROTECTED data. Cost budget enforcer using DynamoDB atomic counters. Quality signals via CloudWatch custom metrics + Bedrock model evaluation jobs. Model portfolio in AWS AppConfig with automatic Lambda reload on config change.

Azure

Implement as Azure API Management policy calling Azure AI Foundry for model routing. Azure AI Foundry's deployment model supports routing between OpenAI, Meta, and Mistral models in a single API surface. Data classification via Azure Purview labels propagated in the request metadata. Compliance routing to Azure Government Cloud for sovereign requirements. Cost tracking via Azure Cost Management; budget enforcement via Azure Budgets alerts + Logic App downgrade trigger.

On-Premises / Private Cloud

Deploy LiteLLM proxy with custom router strategy implementing all four routing checks. Model portfolio includes vLLM-served open-weight models (DeepSeek-R1 for reasoning, Llama 3 70B for mid-tier, Llama 3 8B for standard) plus on-prem Mistral for sovereign routing. OPA sidecar for compliance rules. Redis for cost budget counters. Prometheus + Grafana for routing decision metrics and quality signals.

16. Related Patterns

EAAPL-RSN001: Extended Thinking Gate — a specialised routing gate this pattern subsumes for the reasoning tier
EAAPL-RSN002: Think Budget Allocation — once routed to reasoning tier, budget allocation governs thinking-token usage
EAAPL-RSN003: Reasoning-then-Act — agentic planning calls route through this pattern before the reasoning model is invoked
EAAPL-SEC003: Data Classification and Residency — the compliance routing rule depends on this pattern's classification output
EAAPL-OBS001: LLM Observability — quality signals feeding router calibration are produced by this pattern

17. Maturity Assessment

Dimension	Level (1–5)	Notes
Pattern stability	4	Multi-tier model routing is a well-established cloud architecture concept; AI-specific compliance routing is newer but stable
Tooling availability	4	LiteLLM, Portkey, AWS Bedrock, Azure AI Foundry all support multi-model routing natively
Reference implementations	3	AWS and Azure implementations documented; on-premises open-weight routing is emerging
Regulatory acceptance	3	Compliance routing satisfies data-residency requirements; quality calibration documentation satisfies AI Act risk management expectations

18. Revision History

Version	Date	Change
1.0	2026-06-14	Initial release
