EAAPL: Enterprise AI Architecture Pattern Library


[EAAPL-PLT003] Model Routing

Category: Platform Engineering
Sub-category: Traffic Management
Version: 1.2
Maturity: Proven
Tags: model-routing, intelligent-routing, cost-based-routing, latency-routing, capability-routing, shadow-routing, a-b-routing, fallback, routing-rules-as-code
Regulatory Relevance: EU AI Act Article 9 (Risk Management), ISO 42001, NIST AI RMF MAP 2.1


1. Executive Summary

The Model Routing pattern establishes intelligent, policy-driven dispatch of AI inference requests to the optimal model from a pool of candidates. As organisations operate multiple model providers and tiers—frontier models for complex reasoning, mid-tier models for standard tasks, specialist models for domain-specific workloads—the routing layer translates business intent (minimise cost, maximise quality, meet latency SLO) into per-request model selection decisions without burdening product teams with this logic.

The commercial impact is significant: organisations that implement tiered routing consistently report a 30–50% reduction in model API spend by directing simple tasks to cheaper models while reserving frontier compute for genuinely complex requests. In addition, shadow routing enables risk-free evaluation of candidate models on production traffic, and fallback routing maintains availability when individual providers degrade. Routing rules expressed as code integrate with GitOps workflows, giving governance teams an auditable, reviewable process for every routing policy change.


2. Problem Statement

Business Problem

Organisations pay frontier model prices for tasks that could be handled by models costing 10–20× less. There is no systematic mechanism to evaluate new model versions without exposing production traffic to risk. When a model provider has an outage, AI features fail rather than failing over to an available alternative.

Technical Problem

Routing logic is hardcoded in product team applications: each team selects a specific model endpoint and implements its own fallback logic. When routing strategy needs to change (e.g., switch primary model, adjust fallback order, enable cost-based routing), each team must make independent code changes. There is no A/B framework for comparing model quality systematically.

Symptoms

  • 100% of AI requests going to the single most expensive model regardless of task complexity
  • New model evaluation requiring full production deployment with rollback risk
  • Model provider outage causing complete AI feature failure rather than graceful failover
  • No mechanism to compare quality of two models on the same production traffic
  • Teams spending engineering time implementing and maintaining per-team fallback logic

Cost of Inaction

  • Unnecessary model API costs of 30–50% above optimal routing
  • Model evaluation cycles of 4–8 weeks due to lack of production traffic comparison tooling
  • Provider outage MTTR of hours instead of minutes due to hardcoded model selection
  • Inability to demonstrate model governance to auditors (no audit trail of routing decisions)

3. Context

When to Apply

  • Organisation operates ≥2 model providers or model tiers simultaneously
  • Cost optimisation of AI spend is a priority
  • Availability requirements demand provider failover capability
  • Model evaluation and comparison is a recurring operational need
  • Platform team centralises model access (see EAAPL-PLT001)

When NOT to Apply

  • Single model, single provider with no plans for multi-provider: routing overhead not warranted
  • Models are fundamentally incompatible in output format such that failover would break consuming applications
  • Ultra-low latency requirements (<100ms total) where routing overhead is prohibitive (use direct integration)

Prerequisites

  • AI API Gateway (EAAPL-PLT002) as the host for routing logic
  • Model Registry with capability cards per model (PLT001 Layer 2)
  • Multiple model provider credentials managed in Secrets Manager
  • Observability infrastructure for routing decision logging and model performance metrics
  • Response schema normalisation across providers (or application tolerance for schema variation)

Industry Applicability

Industry | Applicability | Routing Strategy Priority
Financial Services | High | Capability-based (accuracy critical); fallback for availability
Healthcare | High | Capability-based (clinical accuracy); cost-based for administrative tasks
Media / Content | Very High | Cost-based routing dominant; high volume, variable complexity
E-commerce | High | Latency-based for customer-facing; cost-based for batch enrichment
Technology / SaaS | Very High | Multi-strategy; A/B routing for model evaluation is core practice
Government | Medium | Capability and data-residency routing; complex policy rules

4. Architecture Overview

The Model Routing layer sits within or immediately behind the AI API Gateway and executes per-request model selection before the upstream proxy forwards the call. The routing decision is deterministic given the same input context, routing configuration, and endpoint state (circuit breaker status and latency metrics), which makes it reproducible and auditable. The routing configuration is stored as code in a Git repository, enabling GitOps workflows for policy changes.

Intent Classification is the first stage of routing logic. The incoming request carries signals that inform routing: the declared use case tag in the request metadata (e.g., use-case: summarisation), the consumer's team namespace (which may have team-level routing overrides), the estimated complexity of the request (derived from prompt length, presence of structured data, declared reasoning requirement), and any explicit model hint from the consumer (which is subject to policy gating). Intent classification can be as simple as a rule lookup against the use-case tag or as sophisticated as a lightweight classifier that scores request complexity in <10ms.
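A rule-based classifier of the kind described above can be sketched in a few lines. This is a minimal illustration, not the pattern's reference implementation: the taxonomy contents, the `RequestContext` fields, and the 8,000-token threshold are all assumptions chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical use-case taxonomy: declared tag -> baseline complexity.
USE_CASE_BASELINE = {
    "summarisation": "LOW",
    "classification": "LOW",
    "code-generation": "MEDIUM",
    "multi-step-reasoning": "HIGH",
}
LEVELS = ["LOW", "MEDIUM", "HIGH"]


@dataclass(frozen=True)
class RequestContext:
    use_case: str
    team: str
    prompt_tokens: int
    has_structured_data: bool = False
    requires_reasoning: bool = False


def classify_complexity(ctx: RequestContext) -> str:
    """Return LOW, MEDIUM or HIGH from the use-case tag plus simple signals."""
    # Unknown tags start at MEDIUM so they are not under-served by default.
    level = LEVELS.index(USE_CASE_BASELINE.get(ctx.use_case, "MEDIUM"))
    if ctx.prompt_tokens > 8_000:  # assumed threshold for a "long" prompt
        level += 1
    if ctx.has_structured_data:
        level += 1
    if ctx.requires_reasoning:  # a declared reasoning requirement forces HIGH
        level = len(LEVELS) - 1
    return LEVELS[min(level, len(LEVELS) - 1)]
```

Because the function is a pure lookup plus a few comparisons, it comfortably fits a sub-millisecond budget; an ML classifier would replace only the body of `classify_complexity`.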

Routing Strategy Evaluation applies the configured strategy for the consumer/use-case combination. Four primary strategies are defined:

Cost-based routing assigns a cost tier to each request (low/medium/high) based on complexity signals and routes to the cheapest model within that tier that meets the quality threshold. Cost tiers map to model families: low-cost (GPT-4o-mini, Claude Haiku, Gemini Flash), mid-cost (GPT-4o, Claude Sonnet), high-cost (o1, Claude Opus, Gemini Ultra). The quality threshold per tier is expressed as a minimum benchmark score on the organisation's evaluation dataset.
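The selection rule in the paragraph above (cheapest model in the tier that clears the quality threshold) might look like the following sketch. Model names, prices, and benchmark scores are placeholders invented for illustration; they are not figures for any real model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelCard:
    name: str
    tier: str                       # "low", "mid" or "high"
    cost_per_million_tokens: float  # hypothetical price
    benchmark_score: float          # score on the organisation's evaluation set


# Hypothetical catalogue and per-tier quality floors.
CATALOGUE = [
    ModelCard("efficiency-a", "low", 0.50, 0.74),
    ModelCard("efficiency-b", "low", 0.30, 0.66),  # cheapest, but below the floor
    ModelCard("mid-a", "mid", 3.00, 0.85),
    ModelCard("frontier-a", "high", 15.00, 0.93),
]
QUALITY_FLOOR = {"low": 0.70, "mid": 0.80, "high": 0.90}


def select_cost_based(tier: str, catalogue=CATALOGUE) -> Optional[ModelCard]:
    """Return the cheapest model in the tier that meets the tier's quality floor."""
    floor = QUALITY_FLOOR[tier]
    eligible = [m for m in catalogue if m.tier == tier and m.benchmark_score >= floor]
    if not eligible:
        return None
    return min(eligible, key=lambda m: m.cost_per_million_tokens)
```

Note that the cheapest model overall (`efficiency-b`) is skipped because it falls below the assumed quality floor, which is the behaviour the threshold exists to enforce.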

Latency-based routing selects the model with the lowest current P90 latency from real-time metrics. This is particularly valuable for interactive user-facing features where model quality differences are marginal but latency differences are perceived. The latency metric is maintained as a sliding 5-minute window per provider endpoint.
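One way to maintain the sliding window and pick the fastest endpoint is sketched below. The class and method names are hypothetical, timestamps are passed in explicitly to keep the example testable, and the nearest-rank percentile is one of several reasonable P90 definitions.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 300  # the 5-minute window described above


class LatencyWindow:
    """Per-endpoint sliding window of (timestamp, latency_ms) samples."""

    def __init__(self, window_seconds: float = WINDOW_SECONDS):
        self.window = window_seconds
        self.samples = defaultdict(deque)

    def record(self, endpoint: str, latency_ms: float, now: float) -> None:
        self.samples[endpoint].append((now, latency_ms))

    def p90(self, endpoint: str, now: float):
        q = self.samples[endpoint]
        while q and q[0][0] < now - self.window:  # evict samples older than the window
            q.popleft()
        if not q:
            return None
        ordered = sorted(latency for _, latency in q)
        rank = (9 * len(ordered) + 9) // 10  # integer ceil(0.9 * n), nearest-rank
        return ordered[rank - 1]

    def fastest(self, endpoints, now: float):
        """Return the endpoint with the lowest current P90, or None if no data."""
        scored = [(self.p90(e, now), e) for e in endpoints]
        scored = [(p, e) for p, e in scored if p is not None]
        return min(scored)[1] if scored else None
```

In production the samples would more likely come from the metrics collector (for example a Prometheus histogram) than from an in-process deque; the in-memory version only illustrates the selection logic.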

Capability-based routing matches the request's declared requirements against model capability cards in the registry. A request requiring 128K+ context routes only to models with sufficient context windows; a request requiring tool use routes only to models with function-calling capability; a request requiring structured JSON output routes to models with reliable JSON mode. Capability routing is essentially a filter, often combined with cost or latency routing for final selection.
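Treated as a filter, capability routing reduces to a set comparison against each capability card. The sketch below assumes a simplified card with a context window and a set of feature flags; real registry cards will carry more fields, and the names here are illustrative.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CapabilityCard:
    name: str
    context_window: int
    features: frozenset = field(default_factory=frozenset)


@dataclass(frozen=True)
class Requirements:
    min_context: int = 0
    features: frozenset = field(default_factory=frozenset)


def filter_capable(cards, req: Requirements):
    """Keep only models whose card satisfies every declared requirement."""
    return [
        card for card in cards
        if card.context_window >= req.min_context
        and req.features <= card.features  # required features are a subset
    ]


# Hypothetical registry contents.
CARDS = [
    CapabilityCard("efficiency-a", 32_000, frozenset({"json-mode"})),
    CapabilityCard("mid-a", 128_000, frozenset({"json-mode", "tool-use"})),
    CapabilityCard("frontier-a", 200_000, frozenset({"json-mode", "tool-use"})),
]
```

The surviving list is then handed to the cost or latency strategy for final selection; an empty list corresponds to the no-capable-model error in the error flow.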

Fallback routing defines an ordered preference list for a given model alias. When the primary model's circuit breaker is open or the provider returns persistent errors, the router advances to the next candidate. The fallback chain is explicit and version-controlled, not implicit.
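An explicit fallback chain can be resolved with a simple ordered walk. In this sketch the alias, the model names, and the `is_available` callback (which would normally consult circuit breaker state) are assumptions for illustration.

```python
# Hypothetical, version-controlled fallback chains keyed by model alias.
FALLBACK_CHAINS = {
    "chat-default": ["primary-model", "secondary-model", "tertiary-model"],
}


class RoutingExhausted(Exception):
    """Raised when no candidate in the chain is available."""


def resolve_alias(alias: str, is_available, chains=FALLBACK_CHAINS) -> str:
    """Return the first available model in the alias's ordered fallback chain."""
    for candidate in chains[alias]:
        if is_available(candidate):
            return candidate
    raise RoutingExhausted(alias)
```

Because only models listed in the chain can ever be returned, the chain doubles as the approved substitution list.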

A/B and Shadow Routing are layered on top of the primary strategy. A/B routing sends a configurable percentage of traffic to a candidate model, comparing outputs against the primary on the organisation's quality metrics. Shadow routing duplicates requests to a candidate model asynchronously without serving its response to the consumer; this enables zero-risk production traffic evaluation. Both mechanisms write routing experiment metadata to the Evaluation Framework (EAAPL-PLT008) for analysis.
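The two mechanisms can be sketched as follows. The A/B assignment here uses a hash of the experiment and request IDs rather than a random draw, which is an assumption made so that the same request always lands in the same variant; the shadow helper assumes async provider clients passed in as `call_primary` and `call_shadow`, and a `record` callback standing in for the Evaluation Framework client.

```python
import asyncio
import hashlib


def assign_variant(experiment_id: str, request_id: str, candidate_percent: float) -> str:
    """Deterministically map a request to 'primary' or 'candidate'."""
    digest = hashlib.sha256(f"{experiment_id}:{request_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999, 0.01% steps
    return "candidate" if bucket < candidate_percent * 100 else "primary"


async def serve_with_shadow(request, call_primary, call_shadow, record):
    """Serve the primary response; evaluate the shadow model on the side."""

    async def _shadow():
        try:
            record(await call_shadow(request))
        except Exception as exc:  # shadow failures must never reach the consumer
            record(exc)

    shadow_task = asyncio.create_task(_shadow())
    response = await call_primary(request)
    return response, shadow_task
```

The shadow task is returned so the caller can keep a reference to it (or await it in tests); the consumer only ever receives the primary response.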

Circuit Breaker Integration makes routing resilient. Each model endpoint has an associated circuit breaker tracking success rate and latency over a rolling window. When a circuit opens, the router excludes that endpoint from selection for the duration of the open window (configurable, typically 60 seconds). After the open window, a half-open state tests with a single request. This means the router inherently implements provider failover without a separate failover mechanism.
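The closed/open/half-open cycle can be illustrated with a small state machine. This sketch simplifies the description above in one respect: it trips on a count of consecutive failures rather than on success rate and latency over a rolling window. The class name, threshold, and explicit `now` arguments are assumptions for the example.

```python
class CircuitBreaker:
    """Minimal per-endpoint breaker: closed -> open -> half-open -> closed."""

    def __init__(self, failure_threshold: int = 5, open_seconds: float = 60.0):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0
        self.probe_in_flight = False

    def allow(self, now: float) -> bool:
        """Return True if the router may send a request to this endpoint."""
        if self.state == "open" and now - self.opened_at >= self.open_seconds:
            self.state = "half-open"  # open window elapsed: permit one probe
            self.probe_in_flight = False
        if self.state == "closed":
            return True
        if self.state == "half-open" and not self.probe_in_flight:
            self.probe_in_flight = True
            return True
        return False

    def record_success(self) -> None:
        self.state = "closed"
        self.failures = 0
        self.probe_in_flight = False

    def record_failure(self, now: float) -> None:
        self.failures += 1
        # A failed probe re-opens immediately; otherwise trip at the threshold.
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = now
```

With several router replicas, this state would live in the shared store (for example Redis) rather than in process memory, so that all replicas exclude the same endpoints.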

Routing Rules as Code is a first-class governance principle. All routing configuration—strategy assignments per use case and consumer, fallback chains, A/B experiment configurations, capability requirements, cost tier thresholds—is expressed in a structured configuration format (YAML/JSON) stored in the platform's Git repository. Changes go through pull request review with platform team approval and are applied to the routing engine via a configuration deployment pipeline. Every routing configuration version is recorded in the audit log alongside the routing decisions it produced.
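A pull request check for routing rules could validate the parsed configuration before it is merged. The schema below (a `routes` list with `strategy`, `fallback_chain`, and an optional `ab_experiment`) is a hypothetical shape for the YAML/JSON file, shown here as an already-parsed Python dict; the pattern itself does not prescribe field names.

```python
KNOWN_STRATEGIES = {"static", "cost-based", "latency-based", "capability-based"}


def validate_rules(rules: dict, registry_models: set) -> list:
    """Return a list of human-readable errors; an empty list means valid."""
    errors = []
    for i, rule in enumerate(rules.get("routes", [])):
        where = f"routes[{i}]"
        strategy = rule.get("strategy")
        if strategy not in KNOWN_STRATEGIES:
            errors.append(f"{where}: unknown strategy {strategy!r}")
        chain = rule.get("fallback_chain", [])
        if not chain:
            errors.append(f"{where}: fallback_chain must not be empty")
        for model in chain:
            if model not in registry_models:  # only registry models are eligible
                errors.append(f"{where}: model {model!r} is not in the registry")
        percent = rule.get("ab_experiment", {}).get("candidate_percent", 0)
        if not 0 <= percent <= 100:
            errors.append(f"{where}: candidate_percent must be between 0 and 100")
    return errors


# Example of a parsed rules file (hypothetical content).
EXAMPLE_RULES = {
    "routes": [
        {
            "team": "team-marketing",
            "use_case": "summarisation",
            "strategy": "cost-based",
            "fallback_chain": ["efficiency-a", "mid-a"],
            "ab_experiment": {"candidate": "efficiency-b", "candidate_percent": 5},
        }
    ]
}
```

Running a check like this in CI means a malformed rule fails review rather than reaching the routing engine.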


5. Architecture Diagram

flowchart TD
    subgraph Request["Request + Config"]
        A[Incoming Request]
        B[Routing Rules GitOps]
        C[Model Registry]
    end
    subgraph Router["Model Router"]
        D[Intent Classifier]
        E[Strategy Engine]
        F{Circuit Breaker}
    end
    subgraph Models["Model Endpoints"]
        G[Frontier Tier]
        H[Mid-Cost Tier]
        I[Efficiency Tier]
    end
    A --> D
    B --> E
    C --> E
    D --> E
    E --> F
    F -->|primary| G
    F -->|cost route| H
    F -->|efficiency| I
    E --> J[(Routing Audit Log)]
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#fef9c3,stroke:#eab308
    style C fill:#fef9c3,stroke:#eab308
    style D fill:#f0fdf4,stroke:#22c55e
    style E fill:#f0fdf4,stroke:#22c55e
    style F fill:#f3e8ff,stroke:#a855f7
    style G fill:#dbeafe,stroke:#3b82f6
    style H fill:#dbeafe,stroke:#3b82f6
    style I fill:#dbeafe,stroke:#3b82f6
    style J fill:#fef9c3,stroke:#eab308

6. Components

Component | Type | Responsibility | Technology Options | Criticality
Intent Classifier | Service | Estimate request complexity; extract use-case signals | Rule-based lookup, lightweight ML classifier (DistilBERT), regex patterns | High
Routing Strategy Engine | Service | Apply configured strategy to produce ranked model list | Custom rule engine, LiteLLM router, Envoy route configuration | Critical
Circuit Breaker State Store | Service | Maintain per-endpoint health state (closed/open/half-open) | Redis, in-memory (single instance), Resilience4j | Critical
A/B Traffic Splitter | Service | Distribute traffic according to experiment configuration | Custom weighted random, LaunchDarkly, feature flag service | Medium
Shadow Router | Service | Duplicate requests to shadow model asynchronously | Async task queue (Celery, asyncio), Kafka producer | Medium
Routing Rules Store | Configuration | Version-controlled routing configuration | Git repository + ConfigMap (Kubernetes), Consul K/V | High
Real-Time Metrics Collector | Service | Maintain sliding window of model performance metrics | Prometheus, in-memory metrics cache with TTL | High
Model Registry Client | Service | Query model capability cards for capability-based routing | gRPC/HTTP client to Model Registry service | High
Routing Decision Logger | Service | Write routing decision record to audit log | Async writer to Kafka/OpenTelemetry | High
Evaluation Integration | Service | Publish A/B results to Evaluation Framework | REST/event client to PLT008 | Medium

7. Data Flow

Primary Flow — Cost-Based Routing Request

Step | Actor | Action | Output
1 | Incoming Request | Arrive at router with use-case tag summarisation and consumer team team-marketing | Request context with metadata
2 | Intent Classifier | Look up summarisation in use-case taxonomy; estimate complexity as LOW from prompt token count | Complexity: LOW; Use case: summarisation
3 | Routing Strategy Selector | Look up team-marketing + summarisation in routing rules; find strategy: cost-based | Strategy: cost-based
4 | Cost-Based Strategy | Map LOW complexity to Tier 3 efficiency models; retrieve list: [Claude Haiku, GPT-4o-mini] | Candidate list: [Claude Haiku, GPT-4o-mini]
5 | Circuit Breaker Check | Check circuit state for Claude Haiku (CLOSED) and GPT-4o-mini (CLOSED) | Both available
6 | Final Selector | Select Claude Haiku (primary preference in rules); check A/B config: no active experiment for this consumer | Selected: Claude Haiku endpoint
7 | Routing Decision Log | Emit routing record: {request_id, strategy, candidates, selected, reason, timestamp} | Audit log record written
8 | Upstream Proxy | Forward request to Claude Haiku endpoint | Model response
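Steps 3 to 7 of the primary flow can be condensed into one function that returns the routing record. This is an illustrative sketch: the rule and tier tables are hypothetical in-memory stand-ins for the routing rules store, the default strategy name and reason codes are assumptions, and the model names are the ones used in the table above.

```python
from datetime import datetime, timezone

# Hypothetical in-memory stand-ins for the routing rules and tier mapping.
ROUTING_RULES = {("team-marketing", "summarisation"): "cost-based"}
TIER_CANDIDATES = {"LOW": ["Claude Haiku", "GPT-4o-mini"]}
DEFAULT_STRATEGY = "static"  # assumed global default


def route(request_id, team, use_case, complexity, circuit_state):
    """Select a model and return the routing record emitted at step 7."""
    strategy = ROUTING_RULES.get((team, use_case), DEFAULT_STRATEGY)  # step 3
    candidates = TIER_CANDIDATES.get(complexity, [])                  # step 4
    available = [m for m in candidates
                 if circuit_state.get(m, "CLOSED") == "CLOSED"]       # step 5
    selected = available[0] if available else None                    # step 6
    if selected is None:
        reason = "routing-exhausted"
    elif selected == candidates[0]:
        reason = "primary-preference"
    else:
        reason = "fallback"
    return {                                                          # step 7
        "request_id": request_id,
        "strategy": strategy,
        "candidates": candidates,
        "selected": selected,
        "reason": reason,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
```

For the request in the table, `route("req-1", "team-marketing", "summarisation", "LOW", {})` selects Claude Haiku with reason `primary-preference`; opening that model's circuit moves the selection to GPT-4o-mini with reason `fallback`.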

Error Flow

Error Condition | Detection | Action | Consumer Impact
Primary model circuit open | Circuit breaker state check at step 5 | Advance to next candidate in fallback chain | Transparent; higher cost model may be used
All candidates circuit open | All candidates unavailable at step 5 | Return 503 with routing-exhausted code; trigger incident alert | Service degraded; no AI response
Capability mismatch (no capable model available) | Capability filter produces empty list | Return 422 with no-capable-model code | Consumer must adjust request parameters
Routing rules not found for use case | Strategy selector miss | Apply default strategy (configured globally) | Potential non-optimal routing; logs warning
Intent classification timeout | 10ms classification budget exceeded | Apply default routing strategy without classification | Routing proceeds; log classification timeout
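The fatal and non-fatal branches of the error flow might be expressed as follows. The exception classes, the response body shape, and the warning code are assumptions; only the status codes and the `routing-exhausted` and `no-capable-model` codes come from the table above.

```python
class RoutingError(Exception):
    status = 500
    code = "routing-error"


class RoutingExhausted(RoutingError):
    status = 503
    code = "routing-exhausted"


class NoCapableModel(RoutingError):
    status = 422
    code = "no-capable-model"


def to_http_response(exc: RoutingError):
    """Translate a fatal routing error into (status, body) for the gateway."""
    return exc.status, {"error": exc.code, "detail": str(exc)}


def pick_strategy(rules: dict, team: str, use_case: str, default: str = "static"):
    """Non-fatal path: fall back to the global default and report a warning."""
    strategy = rules.get((team, use_case))
    if strategy is None:
        return default, ["routing-rule-miss"]
    return strategy, []
```

Keeping rule misses and classification timeouts on the warning path means the consumer still gets a response, while the warning feeds the rule miss rate monitored by the platform team.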

8. Security Considerations

Authentication and Authorisation

  • Model selection may not be manipulated by consumer input beyond the declared use-case tag; raw model names in consumer requests are validated against authorised models for that consumer
  • Team-level routing overrides require platform team approval; they are stored in the version-controlled routing rules, not consumer-controllable at request time
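The hint validation described above can be sketched as a lookup against a per-consumer allow-list. Whether an unauthorised hint is rejected or silently ignored is a policy choice the pattern leaves open; this example assumes it is ignored and flagged for audit, and the team and model names are placeholders.

```python
# Hypothetical allow-list derived from the version-controlled routing rules.
AUTHORISED_MODELS = {
    "team-marketing": {"efficiency-a", "mid-a"},
    "team-research": {"efficiency-a", "mid-a", "frontier-a"},
}


def resolve_model_hint(team: str, hint, authorised=AUTHORISED_MODELS):
    """Return (accepted_hint, audit_flag); unauthorised hints are dropped."""
    if hint is None:
        return None, None
    if hint in authorised.get(team, set()):
        return hint, None
    # The hint is ignored, so routing falls back to platform policy.
    return None, "unauthorised-model-hint"
```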

Secrets Management

  • Model provider credentials for each endpoint are retrieved from Secrets Manager at routing decision time; credentials are not embedded in routing rules
  • Shadow routing uses separate credentials with read-only scoping where possible, to prevent the shadow model from being used for mutations

Data Classification and Encryption

  • Routing decisions involving RESTRICTED or CONFIDENTIAL data are logged with the classification label for audit trail completeness
  • Shadow requests must be subject to the same data classification and policy enforcement as primary requests

Auditability

  • Every routing decision is logged with: strategy applied, candidates considered, circuit breaker states, selected endpoint, reason code, any experiment configuration active
  • Routing configuration changes are version-controlled and auditable as Git commits with author, timestamp, and review record

OWASP LLM Top 10 Controls

OWASP LLM Risk | Routing-Layer Control
LLM01 Prompt Injection | Routing does not modify prompts; injection risk handled at gateway layer
LLM04 Model DoS | Circuit breaker prevents failed model from absorbing continued traffic
LLM05 Supply Chain | Only models in the approved registry are eligible routing targets
LLM09 Overreliance | Routing logs which model produced each response; enables per-model quality monitoring

9. Governance Considerations

Responsible AI

  • Routing rules must not route high-risk AI use cases to models without a completed Model Risk Card
  • A/B experiments involving high-risk use cases require explicit Governance Board approval before activation
  • Shadow routing results feed into model evaluation decisions that are recorded in the Evaluation Framework

Model Risk Management

  • The routing fallback chain defines the approved substitution hierarchy; arbitrary model substitution is not permitted
  • When a new model is added to the registry and routing rules, a Model Risk Card delta review is required comparing the new model to existing candidates
  • Routing telemetry (which model served which volume of requests) is a key input to the quarterly model risk review

Governance Artefacts

Artefact | Owner | Cadence | Location
routing-rules.yaml | Platform Team | Per change via PR | Git repository
A/B experiment registry | Platform Team + Model Owner | Per experiment | Evaluation Framework
Fallback chain approval records | Platform Governance Board | Per change | GRC system / Git PR comments
Routing telemetry report | Platform Team | Monthly | Observability dashboard
Model substitution impact assessment | Risk Team | Per fallback chain change | Model Registry

10. Operational Considerations

Monitoring

Signal | Source | Alert Threshold | Owner
Fallback activation rate | Routing decision log | >5% of requests using non-primary model | Platform On-Call
Circuit breaker state changes | Circuit breaker events | Any circuit opening | Platform On-Call + Model Owner
Intent classification error rate | Intent classifier metrics | >1% classification errors | Platform Team
Routing rule miss rate | Routing engine logs | >0.1% requests hitting default fallback | Platform Team
A/B experiment quality delta | Evaluation Framework | Statistically significant quality degradation in B variant | Platform Team + Product Owner

SLOs

SLO | Target | Window
Routing decision latency | P99 <15ms (overhead beyond gateway) | Rolling 7 days
Routing availability (decisions produced) | 99.99% | Rolling 30 days
Fallback success rate | >99% of requests served even when primary unavailable | Rolling 30 days
Circuit breaker false positive rate | <0.1% circuits opened without actual provider failure | Rolling 30 days

Logging

  • Routing decisions logged as structured JSON with correlation to the gateway request ID
  • Circuit breaker state transitions logged separately for operational analysis
  • A/B experiment decisions include experiment ID and variant for analysis join

Incident Response

Incident | Detection | Response | RTO
Routing engine crash | Health check failure; 100% routing errors | Kubernetes pod restart; DNS failover to secondary | 2 min
All circuits open (full blackout) | Zero successful upstream calls | Activate static fallback responses; page platform + engineering leadership | 5 min
Routing misconfiguration deployed | Fallback rate spike after deployment | Rollback routing-rules.yaml via GitOps; circuit breakers reset | 10 min

Disaster Recovery

Component | RPO | RTO | Strategy
Routing engine (stateless) | 0 | 2 min | Multi-replica; pod auto-restart
Routing rules config | 0 | 5 min | Git-backed; ConfigMap reload
Circuit breaker state (Redis) | 5 min | 2 min | Redis Sentinel; acceptable brief stale state
Routing decision audit log | <1 min | 10 min | Kafka replication + S3 cross-region

11. Cost Considerations

Cost Drivers

Driver | Description | Relative Weight
Routing engine compute | Stateless; minimal CPU; scales with request count | Very Low
Intent classifier inference | If ML-based, adds per-request compute | Low
Circuit breaker state (Redis) | Small memory footprint | Very Low
Cost savings from tier routing | Negative cost: 30–50% reduction in model API spend | Dominant positive ROI

Optimisations

  • Most valuable optimisation: aggressive Tier 3 routing for high-volume, low-complexity tasks (summarisation, classification, entity extraction)
  • Intent classifier should be rule-based for speed (latency budget <5ms) unless complexity estimation materially improves routing quality
  • Cache routing decisions for identical consumer + use-case combinations with short TTL (1 minute) to reduce routing computation
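The short-TTL decision cache from the last bullet could be as small as the sketch below. The class name and the `invalidate_model` hook are assumptions; the hook is included because a cached decision may otherwise keep pointing at an endpoint whose circuit has just opened.

```python
class RoutingDecisionCache:
    """Cache of (consumer, use_case) -> selected model with a short TTL."""

    def __init__(self, ttl_seconds: float = 60.0):
        self.ttl = ttl_seconds
        self._entries = {}  # key -> (expires_at, model)

    def get(self, consumer: str, use_case: str, now: float):
        key = (consumer, use_case)
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, model = entry
        if now >= expires_at:  # expired: drop the entry and force re-evaluation
            del self._entries[key]
            return None
        return model

    def put(self, consumer: str, use_case: str, model: str, now: float) -> None:
        self._entries[(consumer, use_case)] = (now + self.ttl, model)

    def invalidate_model(self, model: str) -> None:
        """Drop every cached decision for a model, e.g. when its circuit opens."""
        self._entries = {k: v for k, v in self._entries.items() if v[1] != model}
```

Caching by consumer and use case only makes sense when the strategy does not depend on per-request signals such as prompt length; where it does, the complexity level would need to be part of the key.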

Indicative Cost Range

Scale | Monthly Routing Infra Cost | Notes
Any scale | $100–$500/month | Routing engine is minimal compute; ROI is entirely from model cost savings
Cost savings at medium scale (10M tokens/day) | -$3,000–$8,000/month | From tier routing directing 60% of traffic to Tier 3 models
Cost savings at large scale (100M tokens/day) | -$30,000–$80,000/month | Tier routing ROI dominates; dedicated cost optimisation team warranted
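The arithmetic behind a tier-routing savings estimate is simple enough to sketch. The function below compares sending all traffic to a frontier model against routing a share of it to an efficiency model; the per-million-token prices in the usage note are hypothetical and are not intended to reproduce the figures in the table.

```python
def monthly_tier_routing_savings(
    tokens_per_day: float,
    frontier_price_per_m: float,    # price per million tokens (hypothetical)
    efficiency_price_per_m: float,  # price per million tokens (hypothetical)
    share_to_efficiency: float,     # fraction of traffic routed to the cheap tier
    days: int = 30,
) -> float:
    """Return the monthly spend avoided by routing a share of traffic down-tier."""
    monthly_millions = tokens_per_day * days / 1_000_000
    baseline = monthly_millions * frontier_price_per_m
    blended_price = ((1 - share_to_efficiency) * frontier_price_per_m
                     + share_to_efficiency * efficiency_price_per_m)
    routed = monthly_millions * blended_price
    return baseline - routed
```

With assumed prices of $10 and $1 per million tokens, 10M tokens/day, and 60% of traffic routed down-tier, the baseline is 300 × $10 = $3,000 and the routed spend is 300 × $4.60 = $1,380, a saving of $1,620 per month. Actual savings depend on real prices and on the input/output token mix.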

12. Trade-Off Analysis

Routing Strategy Options

Strategy | Description | Pros | Cons | Best For
Static Routing | Fixed model per use-case; no dynamic selection | Simplest; predictable; easy to audit | No cost optimisation; no failover | Initial deployment; highly regulated use cases
Cost-Based Routing | Route to cheapest model meeting quality threshold | 30–50% cost reduction | Requires quality benchmarks; threshold tuning effort | High-volume, mixed-complexity workloads
Capability-Based Routing | Filter by capability; then cost or latency within capable set | Accurate capability matching; prevents capability-mismatch errors | Requires maintained capability cards in registry | Multi-model deployments with specialised models
ML-Based Routing | Classify request complexity with ML model; route accordingly | Most accurate tier assignment | Adds latency; ML model requires training and maintenance | Very high volume where marginal accuracy gains justify overhead

Intent Classification Options

Option | Latency | Accuracy | Maintenance | Best For
Rule-based (use-case tag lookup) | <1ms | Depends on caller discipline | Low | Structured internal API with disciplined callers
Regex + heuristics on prompt | 1–5ms | Moderate | Low-Medium | General purpose with structured prompts
Lightweight ML classifier | 5–15ms | High | Medium | High-volume workloads where routing accuracy has large cost impact

Architectural Tensions

Tension | Option A | Option B | Resolution
Routing transparency vs. complexity | Expose routing decision to consumers | Black box | Include X-Model-Used header in response; audit log accessible to consumers for own requests
Routing speed vs. accuracy | Rule-based (fast, less accurate) | ML classifier (slower, more accurate) | Rule-based default; ML opt-in for high-volume use cases where ROI justifies latency
Consumer control vs. platform governance | Allow consumers to specify exact model | Platform controls all routing | Allow model family hints; platform selects within family; override audited
Failover quality vs. consistency | Always fail over to available model | Return error if preferred model unavailable | Fail-over default for availability; consumer can opt for fail-fast if consistency required

13. Failure Modes

Failure | Likelihood | Impact | Detection | Recovery
Intent classifier crash | Medium | Medium: all requests use default routing | Classifier health check; default routing rate spikes | Restart classifier; default routing adequate in interim
Routing rules desync (ConfigMap stale) | Low | Medium: requests using outdated routing policy | Rules version mismatch alert | Force ConfigMap reload; GitOps pipeline re-applies
Circuit breaker stuck open (false positive) | Low | Medium: model excluded despite being healthy | Provider health check succeeds while circuit open | Manual circuit reset; post-incident investigation
A/B experiment misconfiguration (100% to B) | Low | High: all traffic to unvalidated model | Traffic split monitoring alert | Rollback experiment config; route to primary
Model capability card stale in registry | Medium | Low-Medium: capability routing sends to incapable model | Capability mismatch error from model | Update registry; add error handler for capability mismatch

Cascading Scenario

  • Mass circuit opening storm: Under a broad cloud provider degradation, multiple circuits open simultaneously. The router falls back to the next tier for all requests. If the fallback tier is also degraded (same cloud region), the cascade proceeds through all fallback candidates and the router returns 503 for all requests. Mitigation: fallback chains must span cloud providers or include on-premises/alternative-region endpoints.
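The mitigation above lends itself to an automated check on every fallback chain change. The sketch below treats a (provider, region) pair as a failure domain and requires a chain to span at least two; the endpoint names and domain mapping are hypothetical.

```python
# Hypothetical mapping of model endpoints to (provider, region) failure domains.
ENDPOINT_DOMAINS = {
    "primary-model": ("cloud-a", "region-1"),
    "secondary-model": ("cloud-a", "region-1"),
    "tertiary-model": ("cloud-b", "region-2"),
    "onprem-model": ("on-premises", "dc-1"),
}


def failure_domains(chain, domains=ENDPOINT_DOMAINS):
    """Return the distinct (provider, region) pairs a fallback chain spans."""
    return {domains[model] for model in chain}


def chain_is_resilient(chain, domains=ENDPOINT_DOMAINS, minimum: int = 2) -> bool:
    """True if the chain spans at least `minimum` independent failure domains."""
    return len(failure_domains(chain, domains)) >= minimum
```

A chain of `primary-model` and `secondary-model` fails the check because both sit in the same provider region, which is exactly the configuration that produces the cascade described above.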

14. Regulatory Considerations

EU AI Act Article 9

  • Routing decisions must be recorded to demonstrate that the risk management system controls which models process which use cases; the routing audit log satisfies this requirement
  • High-risk AI systems must not fall back automatically to lower-quality or unapproved models unless human oversight is configured for that fallback

NIST AI RMF MAP 2.1

  • The routing configuration explicitly documents the intended deployment context for each model, satisfying MAP 2.1's requirement to document AI deployment context

Audit and Record-Keeping

  • Routing decision logs must be retained for the same period as the AI system's operational records (typically 7 years for regulated decisions)
  • Routing configuration Git history constitutes an auditable record of every routing policy change with author and approval

15. Reference Implementations

AWS

Component | AWS Service
Routing engine | LiteLLM Proxy on ECS, or custom Lambda function
Circuit breaker state | ElastiCache Redis
Routing rules | SSM Parameter Store or S3 config object
Intent classifier | Lambda + custom rules, or SageMaker endpoint (ML-based)
Model endpoints | Bedrock (Claude, Llama, Titan), SageMaker endpoints for self-hosted

Azure

Component | Azure Service
Routing engine | APIM with AI routing policies, or custom AKS deployment
Circuit breaker | APIM native circuit breaker policy
Routing rules | App Configuration
Model endpoints | Azure OpenAI multiple deployments

GCP

Component | GCP Service
Routing engine | Cloud Run service with LiteLLM or custom Python
Circuit breaker | Custom Redis-backed on Memorystore
Model endpoints | Vertex AI multiple model deployments

On-Premises

Component | Technology
Routing engine | LiteLLM Proxy or custom Python/Go service
Circuit breaker | Resilience4j (Java) or custom Redis-backed
Routing rules | Consul K/V or Git-synced ConfigMap
Model endpoints | vLLM serving multiple models on GPU cluster

16. Related Patterns

Pattern ID | Name | Relationship
EAAPL-PLT001 | Enterprise AI Platform | Parent: routing is a core capability of the platform
EAAPL-PLT002 | AI API Gateway | Host: routing executes within or behind the gateway
EAAPL-PLT004 | LLM Cost Control | Complementary: cost-based routing is primary cost control lever
EAAPL-PLT008 | AI Experiment Tracking | Dependency: A/B and shadow routing results feed experiment tracking
EAAPL-INT007 | AI Circuit Breaker | Component: circuit breaker is embedded within routing

17. Maturity Assessment

Overall Maturity: Proven

Model routing is production-proven across dozens of enterprise deployments. LiteLLM and Kong AI Gateway provide mature implementations. ML-based intent classification is still an emerging practice; rule-based routing is the proven approach.

Scoring Matrix

Dimension | Score (1–5) | Rationale
Pattern Completeness | 5 | All sections documented
Implementation Evidence | 4 | Core routing proven; ML-based intent classification less so
Tooling Stability | 4 | LiteLLM router mature; ML classification tooling evolving
Regulatory Alignment | 4 | Audit logging mapped; specific regulatory requirements vary by use case
Cost ROI Evidence | 5 | Consistent 30–50% cost reduction reported across multiple deployments

18. Revision History

Version | Date | Author | Changes
1.0 | 2024-03-10 | EAAPL Working Group | Initial publication
1.1 | 2024-09-15 | EAAPL Working Group | Added A/B and shadow routing sections; ML-based intent classification
1.2 | 2025-06-12 | EAAPL Working Group | Cost savings data updated; cascading failure scenario added; GCP reference added