EAAPL: Enterprise AI Architecture Pattern Library
EAAPL-MDL002 — Shadow Model Deployment

Attribute | Value
Pattern ID | EAAPL-MDL002
Name | Shadow Model Deployment
Maturity | Proven
Complexity | High
Tags | model-risk, observability, high-availability, high-complexity
Last Reviewed | 2026-06-12
Owner | Enterprise AI Architecture Practice

1. Executive Summary

Shadow model deployment allows an organisation to validate a new AI model under real production conditions (full traffic load, real user inputs, live context) without exposing users to the new model's outputs. Production traffic is mirrored asynchronously to the shadow model; the shadow computes a response, which is stored and compared to the production response but never served to the user. This addresses the principal risk of model upgrades: discovering that a new model behaves differently only after users experience it. For CIOs, shadow deployment is a mandatory risk control before promoting any model upgrade in a regulated or customer-facing context. For CTOs, it provides statistically grounded promotion criteria based on real traffic rather than offline benchmarks. For risk officers, it is the evidentiary record that demonstrates the organisation validated model behaviour before promoting a change.

The pattern is high-complexity because it requires an asynchronous traffic mirroring infrastructure, a shadow response storage layer, and a statistical comparison pipeline, none of which exist in most organisations by default. The investment is justified for any model with material business impact: customer-facing recommendation, credit decisioning, fraud detection, medical triage support, or automated content moderation.


2. Problem Statement

2.1 Business Problem

Organisations upgrade AI models periodically to improve quality, reduce cost, or address emerging risks. Conventional practice is to test on an offline dataset, then deploy to production. The offline dataset is typically stale: it does not represent the current distribution of real user inputs, seasonal patterns, or adversarial inputs. Models that pass offline benchmarks can therefore still fail in production, and the business discovers the failure through customer complaints, revenue impact, or regulatory action.

2.2 Technical Problem

Offline evaluation cannot capture the full complexity of production traffic. Real production requests carry user-specific context, system state, upstream service responses, and time-sensitive signals that are absent from a fixed benchmark dataset. A model that performs identically to its predecessor on a benchmark dataset may perform materially differently on the long tail of real production inputs.

2.3 Symptoms

  • Model upgrades cause unexpected quality regressions discovered by customer feedback, not internal monitoring.
  • Post-upgrade error rates spike before detection — mean time to detection is hours, not minutes.
  • There is no statistical basis for the decision to promote a new model ("it looked good in testing").
  • Rollbacks were required for more than 30% of model promotions in the past year.
  • The organisation cannot demonstrate to a regulator that it validated model behaviour before deployment.

2.4 Cost of Inaction

Category | Indicative Impact
Quality Risk | Model regression discovered in production affects all users; rollback takes 5–30 minutes during which users are impacted
Regulatory | EU AI Act Article 9 risk management obligation not met without pre-production validation evidence
Reputational | Public incident caused by model regression damages brand; recovery requires customer communications
Financial | Customer churn from degraded experience; revenue loss during incident; incident investigation cost

3. Context

3.1 When to Apply

  • Before promoting any model version in a customer-facing, regulated, or high-stakes context.
  • When offline benchmark datasets may not represent current production traffic distribution.
  • When the new model represents a MINOR or MAJOR version change (per EAAPL-MDL001 schema).
  • When rollback risk is high and the cost of a production incident exceeds shadow infrastructure cost.

3.2 When NOT to Apply

  • PATCH version changes (quantisation, minor optimisation) where behaviour change is expected to be negligible — use regression testing instead.
  • Models serving internal tooling with no customer or regulatory impact.
  • Contexts where traffic volume is so low that statistical comparison is meaningless (< 1,000 requests/day — use canary instead).
  • Stateful write-heavy models where shadow execution risks side effects (see Section 4.4).

3.3 Prerequisites

Prerequisite | Detail
Traffic mirroring capability | Load balancer or service mesh capable of async request duplication
Shadow response store | High-throughput, schema-flexible storage for shadow + production response pairs
Comparison analysis pipeline | Automated pipeline running daily statistical comparison of shadow vs production
Model versioning (EAAPL-MDL001) | Both production and shadow models must be versioned and registered
Promotion criteria definition | Measurable, pre-agreed criteria for shadow-to-production promotion

3.4 Industry Applicability

Industry | Applicability | Primary Driver
Financial Services | Critical | APRA CPS 230 change management; credit/fraud model validation
Healthcare | Critical | Patient safety; clinical decision support validation
E-commerce / Retail | High | Revenue-impacting recommendation engine upgrades
Media / Content | High | Content moderation model upgrades affecting policy enforcement
Government | High | Service delivery quality; citizen-facing AI accountability
Technology Platforms | Medium | API quality guarantees to downstream consumers

4. Architecture Overview

4.1 Traffic Mirroring Architecture

Production traffic is mirrored asynchronously to the shadow model using a request duplication layer placed at the load balancer or service mesh level. The critical design principle is that the mirroring is asynchronous and non-blocking: the production request path is unaffected by any shadow-side processing. If the shadow model is slow or fails, the production response is never delayed or affected.

The request duplicator captures the full request payload — including headers, authentication context (anonymised), timestamp, and all inference inputs — and enqueues a copy to a shadow inference queue. The shadow model consumer reads from this queue and processes requests at its own pace. Because shadow processing is decoupled from production request latency, the shadow model can be run on lower-priority compute — scheduled spot instances, off-peak batch processing — without affecting production SLOs.
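The non-blocking property described above can be illustrated with a small sketch. This is not a production duplicator (in practice the load balancer or service mesh does this work); the `ShadowMirror` class, the `handle_request` function, and the in-process `asyncio.Queue` are all hypothetical stand-ins for the mirroring layer and the shadow inference queue.

```python
import asyncio


class ShadowMirror:
    """Hypothetical fire-and-forget duplicator backed by a bounded queue."""

    def __init__(self, max_depth: int = 10_000) -> None:
        self.queue: asyncio.Queue = asyncio.Queue(maxsize=max_depth)
        self.dropped = 0

    def mirror(self, request_id: str, payload: dict) -> bool:
        """Enqueue a copy without waiting; drop the copy if the queue is full."""
        try:
            self.queue.put_nowait({"request_id": request_id, "payload": dict(payload)})
            return True
        except asyncio.QueueFull:
            # Shadow load is shed; the production path is never blocked.
            self.dropped += 1
            return False


async def handle_request(mirror, production_model, request_id, payload):
    """Serve the production response; mirroring cannot delay or fail it."""
    mirror.mirror(request_id, payload)      # synchronous, non-blocking enqueue
    return await production_model(payload)  # only production is awaited
```

The design choice worth noting is that the production handler never awaits anything on the shadow side: a full queue costs shadow completeness, not production latency.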

4.2 Shadow Response Storage

Every shadow inference produces a response pair: the shadow model's response and the corresponding production model's response (retrieved from the production response log by matching a request correlation ID). These pairs are stored in a shadow comparison store — a document or columnar database optimised for the comparison analysis pipeline. The store retains: request ID, timestamp, request payload hash (not cleartext for privacy), production response, shadow response, and all computed quality metrics for both responses.

Retention policy for shadow response pairs: 90 days online, then purged (unless subject to regulatory retention). The shadow store must be scoped as a non-production system: real user input data subject to privacy regulations must be anonymised or pseudonymised before storage.
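As a rough illustration of the record described above, the sketch below joins a shadow response to its production counterpart by correlation ID and stores a payload hash instead of cleartext. The `ResponsePair` fields, `build_pair`, and the dictionary standing in for the production response log are assumptions for illustration. A plain SHA-256 is used only to keep the example short; low-entropy payloads may need a keyed hash or a fuller pseudonymisation step.

```python
import hashlib
import json
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional

ONLINE_RETENTION = timedelta(days=90)  # online retention stated in the text


def payload_hash(payload: dict) -> str:
    """Hash a canonical JSON form so key order does not change the digest."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass
class ResponsePair:
    request_id: str
    timestamp: datetime
    payload_sha256: str
    production_response: str
    shadow_response: str
    metrics: dict = field(default_factory=dict)

    def past_retention(self, now: datetime) -> bool:
        return now - self.timestamp > ONLINE_RETENTION


def build_pair(request_id, payload, shadow_response, production_log,
               now: Optional[datetime] = None) -> Optional[ResponsePair]:
    """Join on the correlation ID; return None if no production match exists."""
    production_response = production_log.get(request_id)
    if production_response is None:
        return None
    now = now or datetime.now(timezone.utc)
    return ResponsePair(request_id, now, payload_hash(payload),
                        production_response, shadow_response)
```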

4.3 Comparison Analysis Pipeline

A daily analysis pipeline processes accumulated shadow/production pairs and produces a comparison report. The pipeline computes:

  • Quality metrics for both models: accuracy, BLEU/ROUGE/BERTScore for generation tasks, calibration for classification.
  • Latency distribution (p50, p95, p99) for shadow vs production.
  • Error rate comparison.
  • Cost per inference comparison.
  • Safety check results (does the shadow model generate any content that the production model would not?).
  • Disagreement rate: the proportion of requests where the two models produce materially different outputs.

The report is published to the model governance dashboard and stored in the Model Register against the shadow version.
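Two of these measures, the disagreement rate and the latency percentiles, can be sketched in a few lines. The pair layout (`prod`, `shadow`, `prod_ms`, `shadow_ms`) and the `materially_different` callback are assumptions; what counts as a material difference is task-specific and is not defined by this pattern. The percentile uses the nearest-rank method, which is one of several common definitions.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (0 < pct <= 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def compare(pairs, materially_different):
    """Summarise shadow vs production for one batch of response pairs."""
    if not pairs:
        raise ValueError("no response pairs to compare")
    disagreements = sum(
        1 for p in pairs if materially_different(p["prod"], p["shadow"])
    )
    summary = {
        "pairs": len(pairs),
        "disagreement_rate": disagreements / len(pairs),
    }
    for model in ("prod", "shadow"):
        latencies = [p[f"{model}_ms"] for p in pairs]
        for pct in (50, 95, 99):
            summary[f"{model}_p{pct}_ms"] = percentile(latencies, pct)
    return summary
```

For a classifier, `materially_different` might be simple inequality of labels; for generated text it would more plausibly be a similarity threshold.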

4.4 Handling Stateful Operations in Shadow

Shadow models must operate in read-only mode. They must not write to any production database, send notifications, invoke external APIs, or modify any shared state. Shadow inference is computation-only. For models that normally invoke tools or external systems, the shadow request processor must use a stubbed tool layer that records the intended tool calls without executing them. This is enforced by infrastructure — the shadow model's service account has no write permissions on production systems.
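A stubbed tool layer of the kind described can be very small. The sketch below is a hypothetical illustration: `StubToolLayer` and its `call` method are invented names, and the placeholder return value is an assumption about what the shadow model needs in order to continue. It complements, and does not replace, the infrastructure-level restriction on the shadow service account.

```python
class StubToolLayer:
    """Hypothetical stub: records intended tool calls and never executes them."""

    def __init__(self, canned_results=None):
        self.recorded_calls = []
        self._canned_results = canned_results or {}

    def call(self, tool_name, **arguments):
        # Record the intent for the audit trail and the comparison report.
        self.recorded_calls.append({"tool": tool_name, "arguments": arguments})
        # Return a placeholder so the shadow model can finish its response.
        return self._canned_results.get(tool_name, {"status": "stubbed"})
```

For example, `StubToolLayer().call("send_notification", user_id="u1")` returns `{"status": "stubbed"}` and leaves one entry in `recorded_calls`; no notification is sent.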

4.5 Shadow Duration Guidelines

Shadow duration is determined by model risk tier:

  • Low-risk internal models: minimum of 1 week with at least 10,000 shadow requests.
  • Medium-risk customer-facing models: minimum of 2 weeks with at least 50,000 shadow requests.
  • High-risk regulated models (credit, medical, fraud): minimum of 4 weeks with at least 100,000 shadow requests and explicit sign-off from the risk function.

These are minimums: shadow should continue until promotion criteria are met, regardless of calendar time.

4.6 Promotion Criteria

Promotion from shadow to production (via canary release per EAAPL-MDL003) requires all of the following:

  • Shadow quality score meets or exceeds production by the margin defined at version registration.
  • Shadow p99 latency within 20% of production p99.
  • Shadow error rate does not exceed production error rate.
  • Shadow safety check passes (zero content safety violations).
  • Minimum shadow duration met.
  • Comparison report reviewed and approved by model owner and, for high-risk models, AI Governance.
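The duration minimums in Section 4.5 and the criteria above can be expressed as a checklist function. This is a sketch under stated assumptions: the `ShadowSummary` fields are hypothetical, "within 20%" is read as shadow p99 no more than 1.2 times production p99, and the function only reports unmet criteria. It does not approve anything; the promotion decision itself remains a human one.

```python
from dataclasses import dataclass

# (minimum days, minimum shadow requests) per risk tier, from Section 4.5
MINIMUMS = {"low": (7, 10_000), "medium": (14, 50_000), "high": (28, 100_000)}


@dataclass
class ShadowSummary:
    risk_tier: str
    days_in_shadow: int
    shadow_requests: int
    quality_delta: float      # shadow quality minus production quality
    required_margin: float    # margin agreed at version registration
    shadow_p99_ms: float
    prod_p99_ms: float
    shadow_error_rate: float
    prod_error_rate: float
    safety_violations: int
    owner_approved: bool
    governance_approved: bool = False
    risk_signoff: bool = False


def unmet_criteria(s: ShadowSummary) -> list:
    """Return the names of promotion criteria that are not yet satisfied."""
    min_days, min_requests = MINIMUMS[s.risk_tier]
    high_risk_ok = s.risk_tier != "high" or (s.governance_approved and s.risk_signoff)
    checks = {
        "quality margin": s.quality_delta >= s.required_margin,
        "p99 latency within 20% of production": s.shadow_p99_ms <= 1.2 * s.prod_p99_ms,
        "error rate not above production": s.shadow_error_rate <= s.prod_error_rate,
        "zero safety violations": s.safety_violations == 0,
        "minimum shadow duration": (s.days_in_shadow >= min_days
                                    and s.shadow_requests >= min_requests),
        "approvals": s.owner_approved and high_risk_ok,
    }
    return [name for name, ok in checks.items() if not ok]
```

An empty list means the report is ready for review, not that promotion is automatic.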


5. Architecture Diagram

ARCHITECTURE DIAGRAM
flowchart TD
    subgraph Traffic["Traffic Layer"]
        A[User Request]
        B[Load Balancer]
    end
    subgraph Models["Model Serving"]
        C[Production Model]
        D[Shadow Inference Queue]
        E[Shadow Model]
    end
    subgraph Analysis["Comparison and Governance"]
        F[(Shadow Response Store)]
        G[Comparison Pipeline]
        H{Promotion Decision}
    end
    A --> B
    B -->|sync primary| C
    B -->|async mirror| D
    D --> E
    C --> F
    E --> F
    F --> G
    G --> H
    H -->|criteria met| I[Canary Release]
    H -->|criteria not met| J[Extend Shadow Period]
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#f0fdf4,stroke:#22c55e
    style C fill:#d1fae5,stroke:#10b981
    style D fill:#fef9c3,stroke:#eab308
    style E fill:#dbeafe,stroke:#3b82f6
    style F fill:#fef9c3,stroke:#eab308
    style G fill:#f0fdf4,stroke:#22c55e
    style H fill:#f3e8ff,stroke:#a855f7
    style I fill:#d1fae5,stroke:#10b981
    style J fill:#fee2e2,stroke:#ef4444

6. Components

Component | Type | Responsibility | Technology Options | Criticality
Request Duplicator | Infrastructure | Asynchronously mirrors production requests to shadow queue; zero production latency impact | Envoy request mirroring, Amazon VPC Traffic Mirroring, Nginx mirror, Istio | Critical
Shadow Inference Queue | Messaging | Decouples shadow processing from production path; buffers during shadow compute spikes | AWS SQS, Azure Service Bus, GCP Pub/Sub, Kafka | High
Shadow Model Serving | Inference | Runs shadow model version against mirrored requests | Same inference infrastructure as production; lower-priority compute | High
Stub Tool Layer | Safety Guard | Intercepts tool calls from shadow model; records intent without executing | Custom middleware; feature flag that disables external calls | Critical
Shadow Response Store | Data Store | Stores request/response pairs for comparison analysis | DynamoDB, BigQuery, Snowflake, PostgreSQL | High
Comparison Analysis Pipeline | Batch Compute | Runs daily statistical comparison; produces comparison report | Apache Spark, AWS Glue, dbt + SQL, custom Python pipeline | High
Model Governance Dashboard | Observability | Presents comparison results; supports promotion decision workflow | Grafana, custom React dashboard, Looker | Medium

7. Data Flow

7.1 Primary Flow

Step | Actor | Action | Output
1 | User | Sends inference request | Request received at load balancer
2 | Load Balancer | Routes request to production model; asynchronously mirrors to shadow queue | Production request dispatched; shadow message enqueued
3 | Production Model | Processes request; returns response | Production response served to user; logged with request ID
4 | Shadow Queue Consumer | Reads mirrored request; invokes shadow model | Shadow inference job initiated
5 | Shadow Model | Processes mirrored request via stub tool layer | Shadow response computed; tool calls recorded, not executed
6 | Shadow Response Writer | Writes shadow response + matching production response to shadow store | Response pair persisted with correlation ID
7 | Comparison Pipeline | Daily run: reads all new pairs; computes metrics; generates report | Comparison report published to governance dashboard
8 | Model Owner / Governance | Reviews report against promotion criteria | Promotion approved or shadow period extended

7.2 Error Flow

Error Scenario | Detection | Recovery Action
Shadow model inference failure | Error rate monitor on shadow consumer | Log error; skip pair; alert on sustained failure rate > 5%
Shadow queue backpressure | Queue depth monitor | Scale shadow consumer; shed shadow load (production unaffected)
Stub tool layer bypass (shadow writes) | Audit log alert on unexpected write attempt | Halt shadow processing; security investigation; version quarantined
Comparison pipeline failure | Pipeline health monitor | Retry pipeline; alert after 2 consecutive daily failures
Response pair storage at capacity | Storage utilisation alert | Age out pairs beyond retention window; scale storage

8. Security Considerations

8.1 Controls Summary

Domain | Control
Authentication | Shadow model service account isolated from production service account; no shared credentials
Authorisation | Shadow model service account has read-only access to inference inputs; no write access to any production system
Secrets | Shadow model uses same secrets manager as production; keys scoped per model version
Classification | Shadow response store classified at same level as production data; user request payloads anonymised before storage
Encryption | Shadow store encrypted at rest (AES-256) and in transit (TLS 1.3)
Auditability | All shadow inference attempts logged; any tool call attempt (stub or real) logged to audit trail

8.2 OWASP LLM Top 10 Relevance

OWASP LLM Risk | Relevance | Mitigation
LLM01 Prompt Injection | High | Shadow model processes real production inputs including potentially adversarial content; must run in isolated sandbox
LLM02 Insecure Output Handling | Medium | Shadow responses are stored, not served, but must still be sanitised before display in comparison dashboard
LLM03 Training Data Poisoning | Low | Shadow model is a pre-trained/fine-tuned candidate; poisoning risk addressed in training pipeline (EAAPL-MDL006)
LLM04 Model Denial of Service | Medium | Shadow queue acts as a buffer, but sustained high volume can exhaust shadow compute budget
LLM05 Supply Chain Vulnerabilities | Medium | Shadow model shares supply chain with production; validated by same provenance check
LLM06 Sensitive Information Disclosure | High | Request payloads contain real user data; pseudonymisation before shadow store is mandatory
LLM07 Insecure Plugin Design | High | Stub tool layer is the primary control; any bypass allows shadow model to take real-world action
LLM08 Excessive Agency | High | Stub tool layer prevents shadow from executing any action; this is the central security control
LLM09 Overreliance | Low | Shadow is internal validation tooling; overreliance not applicable
LLM10 Model Theft | Medium | Shadow response store contains model outputs at scale; store access controls limit use of stored outputs for model extraction

9. Governance Considerations

9.1 Responsible AI

Shadow testing must include fairness analysis in the comparison report: do the shadow model's outputs diverge from production in ways that are disproportionate across demographic subgroups? Any fairness regression detected in shadow is a blocking criterion for promotion, regardless of overall quality metrics.
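One simple way to start on this analysis is to break the disagreement rate down by subgroup and look at the spread. The sketch below is illustrative only: the `group` field, the use of plain inequality as the disagreement test, and the idea of summarising with a maximum gap are assumptions, and a disparity in disagreement is a signal for further review, not a complete fairness assessment.

```python
from collections import defaultdict


def disagreement_by_group(pairs):
    """Share of requests per subgroup where shadow and production differ."""
    totals = defaultdict(int)
    differing = defaultdict(int)
    for p in pairs:
        totals[p["group"]] += 1
        differing[p["group"]] += int(p["prod"] != p["shadow"])
    return {group: differing[group] / totals[group] for group in totals}


def max_gap(rates):
    """Largest difference in disagreement rate between any two subgroups."""
    return max(rates.values()) - min(rates.values())
```

A large gap between subgroups suggests the shadow model's behaviour change is concentrated on particular users and warrants investigation before promotion.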

9.2 Model Risk Management

Shadow deployment is the pre-production validation stage of the MRM lifecycle. The comparison report constitutes evidence for model validation. For APRA-regulated entities, the comparison report is part of the model governance record and must be retained.

9.3 Human Approval Gates

Promotion from shadow to canary is a human decision. The comparison report informs the decision but does not automate it. The model owner must explicitly approve promotion. For high-risk models, the AI Governance function countersigns. Automated promotion without human review is not permitted.

9.4 Governance Artefacts

Artefact | Owner | Frequency | Location
Shadow Comparison Report | Model Owner | Daily during shadow | Model Register + governance dashboard
Shadow Period Summary | AI Governance | At promotion decision | Model governance record
Stub Tool Layer Audit Log | Security Operations | Continuous | SIEM
Privacy Impact Assessment | Privacy Officer | Per shadow deployment | Privacy register

10. Operational Considerations

10.1 SLOs

SLO | Target | Measurement Method
Shadow queue lag behind production | < 60 seconds | Queue consumer lag metric
Shadow processing error rate | < 1% | Error counter on shadow consumer
Comparison report publication latency | < 2 hours after midnight | Pipeline completion timestamp
Response pair storage availability | 99.9% | Storage health check

10.2 Monitoring and Logging

Key metrics to monitor continuously during shadow period: shadow queue depth (alert if > 10,000 unprocessed), shadow consumer error rate (alert if > 1%), shadow model latency p99 (informational — not blocking production), daily comparison report publication (alert if missing), stub tool layer bypass attempts (alert immediately — P1).

10.3 Incident Response

Two incident classes specific to shadow deployment: (1) Shadow production interference — if any shadow operation writes to or calls a production system, halt shadow immediately; security investigation; version quarantined until investigation complete. (2) Shadow queue saturation impacting production — theoretically impossible if mirroring is purely async; if observed, circuit-breaker drops shadow traffic; P1 incident.

10.4 Disaster Recovery

Scenario | RPO | RTO | Recovery Procedure
Shadow store data loss | 24h | 4 hours | Restart shadow period; production unaffected
Shadow consumer failure | N/A | 1 hour | Restart consumer; process queued messages; production unaffected
Comparison pipeline failure | N/A | 2 hours | Retry pipeline run; extend shadow period if report missing

10.5 Capacity Planning

Shadow infrastructure processes the same volume as production but asynchronously. Size shadow compute at 30–50% of production inference capacity (the queue provides elasticity). The shadow response store grows at: (average response size) × (daily request volume) × (retention days). For a service with 100,000 requests/day at 2 KB average response size and 90-day retention: ~18 GB, counting one response per request; storing the production and shadow responses as a pair roughly doubles this. Plan at 5× for safety margins and comparison metadata.
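The sizing arithmetic can be reproduced directly. The helper below is a hypothetical convenience function; it uses decimal units (1 GB = 1,000,000 KB) and counts one response per request, as the worked example in the text does.

```python
def shadow_store_gb(requests_per_day, avg_response_kb, retention_days,
                    safety_factor=1.0):
    """Estimated store size in decimal GB for one response per request."""
    total_kb = requests_per_day * avg_response_kb * retention_days * safety_factor
    return total_kb / 1_000_000


baseline_gb = shadow_store_gb(100_000, 2, 90)                   # 18.0
planned_gb = shadow_store_gb(100_000, 2, 90, safety_factor=5)   # 90.0
```

With the figures from the text, the baseline is 18 GB and the 5× planning figure is 90 GB.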


11. Cost Considerations

11.1 Cost Drivers

Driver | Description | Relative Impact
Shadow inference compute | Running shadow model at production traffic volume | High
Shadow response storage | Storing 90 days of response pairs at production volume | Medium
Comparison pipeline compute | Daily batch analysis of accumulated pairs | Low
Queue infrastructure | Managed queue service at production message volume | Low
Engineering time | Setting up and maintaining shadow infrastructure per model | High

11.2 Scaling Risks

Shadow inference compute scales linearly with production traffic: a traffic spike that doubles production volume also doubles shadow compute cost. Mitigation: the shadow consumer operates with a configurable maximum throughput; excess shadow requests are shed (shadow completeness reduces, but production is unaffected). Monitor shadow completeness: if it falls below 80%, extend the shadow duration.
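A throughput cap with a completeness measure might look like the sketch below. `ThroughputCap` and `should_extend` are hypothetical names, the fixed per-window counter is a deliberate simplification of real rate limiting, and completeness is taken here to mean accepted requests divided by all mirrored requests.

```python
class ThroughputCap:
    """Hypothetical per-window cap: admit up to a limit, shed the rest."""

    def __init__(self, max_per_window: int) -> None:
        self.max_per_window = max_per_window
        self.accepted = 0
        self.shed = 0

    def admit(self) -> bool:
        if self.accepted < self.max_per_window:
            self.accepted += 1
            return True
        self.shed += 1  # dropped shadow request; production is unaffected
        return False

    def completeness(self) -> float:
        total = self.accepted + self.shed
        return self.accepted / total if total else 1.0


def should_extend(completeness: float, threshold: float = 0.80) -> bool:
    """True when completeness is below the 80% threshold from the text."""
    return completeness < threshold
```

If 100 requests arrive in a window capped at 70, completeness is 0.7 and the shadow period should be extended.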

11.3 Optimisations

  • Use spot/preemptible instances for shadow inference (shadow is delay-tolerant).
  • Process shadow requests in micro-batches for GPU efficiency (batch size 8–32 depending on model).
  • Use columnar compression on shadow response store (response text compresses 5–10×).
  • Skip shadow for PATCH version changes; run only comparison analysis on a sampled offline subset.

11.4 Indicative Cost Range

Traffic Volume | Monthly Shadow Cost (Inference Only) | Assumptions
Low (< 100K req/day) | $500–$2,000 | Spot GPU instances; 4-week shadow; small LLM
Medium (100K–1M req/day) | $2,000–$15,000 | Managed GPU cluster; spot pricing; auto-scaling
High (> 1M req/day) | $15,000–$80,000 | Dedicated GPU fleet; storage at scale

12. Trade-Off Analysis

12.1 Shadow vs Alternative Validation Approaches

Approach | Quality Signal | Production Impact | Cost | Regulatory Evidence | Best For
Shadow deployment (this pattern) | High: real traffic | None | High | Strong | High-risk, regulated, customer-facing models
Canary release (EAAPL-MDL003) | High: real traffic + outcomes | User-visible risk | Medium | Strong | Medium-risk models with low rollback cost
Offline A/B on held-out set | Medium: static dataset | None | Low | Moderate | Research validation; pre-shadow gate
Manual QA on sampled requests | Low: human review | None | Medium | Weak | Small models, low volume, low risk

12.2 Architectural Tensions

Tension | Description | Resolution
Privacy vs Signal Quality | Using real user data maximises signal, but storage of real user inputs raises privacy risk | Pseudonymise at capture; store only input hash + model outputs; purge promptly after comparison
Shadow Completeness vs Cost | Full shadow coverage is ideal; cost may require sampling | Stratified sampling: ensure all input types represented; priority to long-tail and edge cases
Read-Only Constraint vs Realism | Shadow model cannot replicate stateful model behaviours (e.g., personalisation that writes state) | Shadow tests stateless inference quality only; stateful behaviour validated separately via integration tests

13. Failure Modes

Failure | Likelihood | Impact | Detection | Recovery
Shadow model writes to production system | Very Low | Critical | Audit log alert on unexpected write | Halt shadow; quarantine version; security investigation
Comparison report produces false positive | Medium | High | Manual review catches inconsistency | Re-run pipeline with corrected metrics; extend shadow period
Shadow queue memory leak causes host OOM | Low | Medium | Container memory alert | Restart consumer; process queued messages from checkpoint
Privacy breach: real PII in shadow store | Low | Critical | Data classification scan alert | Halt shadow; purge affected data; notify privacy officer
Stub bypass allows shadow notification to user | Very Low | High | User complaint; audit log | Halt shadow; user apology; investigate stub implementation

13.1 Cascading Failure Scenarios

If the shadow queue grows unbounded (consumer failure during high-traffic period), the queue infrastructure may exhaust storage. If queue infrastructure is shared with production messaging systems, this can cascade into production messaging failures. Mitigation: shadow queue is isolated from all production messaging infrastructure; maximum queue depth is bounded; when maximum is reached, new shadow messages are dropped (production unaffected).


14. Regulatory Considerations

Regulation / Framework | Relevant Clause | How This Pattern Addresses It
EU AI Act (2024/1689) | Article 9 (Risk Management System): pre-deployment testing requirement for high-risk AI | Shadow deployment constitutes mandatory pre-deployment validation; comparison report is evidence
EU AI Act (2024/1689) | Article 10 (Data Governance): training and validation data quality | Shadow uses real production distribution to validate beyond training data
ISO 42001:2023 | Clause 8.4 (AI system lifecycle: verification and validation) | Shadow comparison report constitutes validation evidence per Clause 8.4
NIST AI RMF (2023) | MANAGE 2.2 (Mechanisms for test, evaluation, validation, verification) | Shadow is the primary TEVV mechanism for model upgrades
APRA CPS 230 (2025) | Paragraph 52 (Change management: testing) | Shadow constitutes pre-change testing; comparison report is test evidence
Privacy Act 1988 (Cth) | APP 3 (Collection of solicited personal information) / APP 11 (Security) | Shadow captures real user data; it must fall within privacy notice scope and be secured at the classification level of production data

15. Reference Implementations

15.1 AWS

  • Traffic Mirroring: Envoy proxy deployed on ECS/EKS with request mirroring; or Amazon VPC Traffic Mirroring (packet-level).
  • Shadow Queue: Amazon SQS FIFO queue with shadow inference Lambda consumer.
  • Shadow Inference: SageMaker Endpoint (separate endpoint per shadow version); spot instance backed.
  • Shadow Store: Amazon DynamoDB (response pairs); S3 for bulk comparison data.
  • Comparison Pipeline: AWS Glue job (daily); results to S3 + QuickSight dashboard.

15.2 Azure

  • Traffic Mirroring: Azure API Management with request duplication policy; or Azure Service Mesh (Istio on AKS).
  • Shadow Queue: Azure Service Bus Premium (isolation from production).
  • Shadow Inference: Azure Machine Learning managed endpoint (shadow version); spot-backed compute cluster.
  • Shadow Store: Azure Cosmos DB (response pairs); Azure Blob for comparison data.
  • Comparison Pipeline: Azure Synapse Analytics (daily pipeline); Power BI dashboard.

15.3 GCP

  • Traffic Mirroring: Cloud Load Balancing with request mirroring; or Istio on GKE.
  • Shadow Queue: Cloud Pub/Sub (dedicated topic for shadow).
  • Shadow Inference: Vertex AI Endpoint (shadow version); preemptible GPU nodes.
  • Shadow Store: BigQuery (response pairs, columnar for analysis efficiency).
  • Comparison Pipeline: BigQuery ML + Dataflow; Looker dashboard.

15.4 On-Premises / Hybrid

  • Traffic Mirroring: Nginx mirroring directive; Envoy proxy sidecar in Kubernetes.
  • Shadow Queue: Apache Kafka (dedicated topic, separate consumer group).
  • Shadow Inference: Kubernetes Job on dedicated GPU node pool (lower-priority node affinity).
  • Shadow Store: PostgreSQL + TimescaleDB; columnar compression for response pairs.
  • Comparison Pipeline: Apache Spark on-cluster; Grafana dashboard.

16. Related Patterns

Pattern ID | Pattern Name | Relationship Type | Description
EAAPL-MDL001 | Model Versioning | Prerequisite | Shadow deployment operates on specific versioned model artefacts
EAAPL-MDL003 | Canary Model Release | Next Step | Successful shadow completion gates entry to canary release
EAAPL-MDL004 | Model Rollback | Sibling | If shadow reveals production regression, rollback pattern is applied to current prod version
EAAPL-MDL008 | Model Access Governance | Dependency | Shadow model access is governed by same access tiers as production

17. Maturity Assessment

Overall Maturity: Proven

Dimension | Score (1–5) | Rationale
Industry Adoption | 4 | Shadow/dark launch is established in software; LLM-specific shadow is newer
Tooling Availability | 3 | Traffic mirroring is mature; LLM shadow comparison pipelines require custom build
Standards Alignment | 4 | Directly supports EU AI Act Article 9 and ISO 42001 Clause 8.4
Implementation Complexity | 4 (high) | Requires async infrastructure, privacy controls, and statistical analysis pipeline
Regulatory Acceptance | 4 | Shadow evidence is accepted as pre-deployment validation by EU AI Act supervisors

18. Revision History

Version | Date | Author | Summary of Changes
1.0 | 2026-06-12 | Enterprise AI Architecture Practice | Initial publication