EAAPL-MDL002 — Shadow Model Deployment
| Attribute | Value |
| --- | --- |
| Pattern ID | EAAPL-MDL002 |
| Name | Shadow Model Deployment |
| Maturity | Proven |
| Complexity | High |
| Tags | model-risk observability high-availability high-complexity |
| Last Reviewed | 2026-06-12 |
| Owner | Enterprise AI Architecture Practice |
1. Executive Summary
Shadow model deployment allows an organisation to validate a new AI model under real production conditions (full traffic load, real user inputs, live context) without exposing users to the new model's outputs. Production traffic is mirrored asynchronously to the shadow model; the shadow computes a response, which is stored and compared to the production response but never served to the user. This removes the principal risk of model upgrades: discovering that a new model behaves differently only after users experience it.
For CIOs, shadow deployment is a mandatory risk control before promoting any model upgrade in a regulated or customer-facing context. For CTOs, it provides statistically grounded promotion criteria based on real traffic rather than offline benchmarks. For risk officers, it is the evidentiary record that demonstrates the organisation validated model behaviour before promoting a change.
The pattern is high-complexity because it requires an asynchronous traffic mirroring infrastructure, a shadow response storage layer, and a statistical comparison pipeline, none of which exist in most organisations by default. The investment is justified for any model with material business impact: customer-facing recommendation, credit decisioning, fraud detection, medical triage support, or automated content moderation.
2. Problem Statement
2.1 Business Problem
Organisations upgrade AI models periodically to improve quality, reduce cost, or address emerging risks. Conventional practice is to test on an offline dataset, then deploy to production. The offline dataset is stale by construction: it does not represent the current distribution of real user inputs, seasonal patterns, or adversarial inputs. Models that pass offline benchmarks can therefore still fail in production, and the business discovers the failure through customer complaints, revenue impact, or regulatory action.
2.2 Technical Problem
Offline evaluation cannot capture the full complexity of production traffic. Real production requests carry user-specific context, system state, upstream service responses, and time-sensitive signals that are absent from a fixed benchmark dataset. A model that performs identically to its predecessor on a benchmark dataset may perform materially differently on the long tail of real production inputs.
2.3 Symptoms
- Model upgrades cause unexpected quality regressions discovered by customer feedback, not internal monitoring.
- Post-upgrade error rates spike before detection — mean time to detection is hours, not minutes.
- There is no statistical basis for the decision to promote a new model ("it looked good in testing").
- Rollbacks were required for more than 30% of model promotions in the past year.
- The organisation cannot demonstrate to a regulator that it validated model behaviour before deployment.
2.4 Cost of Inaction
| Category | Indicative Impact |
| --- | --- |
| Quality Risk | Model regression discovered in production affects all users; rollback takes 5–30 minutes during which users are impacted |
| Regulatory | EU AI Act Article 9 risk management obligation not met without pre-production validation evidence |
| Reputational | Public incident caused by model regression damages brand; recovery requires customer communications |
| Financial | Customer churn from degraded experience; revenue loss during incident; incident investigation cost |
3. Context
3.1 When to Apply
- Before promoting any model version in a customer-facing, regulated, or high-stakes context.
- When offline benchmark datasets may not represent current production traffic distribution.
- When the new model represents a MINOR or MAJOR version change (per EAAPL-MDL001 schema).
- When rollback risk is high and the cost of a production incident exceeds shadow infrastructure cost.
3.2 When NOT to Apply
- PATCH version changes (quantisation, minor optimisation) where behaviour change is expected to be negligible — use regression testing instead.
- Models serving internal tooling with no customer or regulatory impact.
- Contexts where traffic volume is so low that statistical comparison is meaningless (< 1,000 requests/day — use canary instead).
- Stateful write-heavy models where shadow execution risks side effects (see Section 4.4).
3.3 Prerequisites
| Prerequisite | Detail |
| --- | --- |
| Traffic mirroring capability | Load balancer or service mesh capable of async request duplication |
| Shadow response store | High-throughput, schema-flexible storage for shadow + production response pairs |
| Comparison analysis pipeline | Automated pipeline running daily statistical comparison of shadow vs production |
| Model versioning (EAAPL-MDL001) | Both production and shadow models must be versioned and registered |
| Promotion criteria definition | Measurable, pre-agreed criteria for shadow-to-production promotion |
3.4 Industry Applicability
| Industry | Applicability | Primary Driver |
| --- | --- | --- |
| Financial Services | Critical | APRA CPS 230 change management; credit/fraud model validation |
| Healthcare | Critical | Patient safety; clinical decision support validation |
| E-commerce / Retail | High | Revenue-impacting recommendation engine upgrades |
| Media / Content | High | Content moderation model upgrades affecting policy enforcement |
| Government | High | Service delivery quality; citizen-facing AI accountability |
| Technology Platforms | Medium | API quality guarantees to downstream consumers |
4. Architecture Overview
4.1 Traffic Mirroring Architecture
Production traffic is mirrored asynchronously to the shadow model using a request duplication layer placed at the load balancer or service mesh level. The critical design principle is that the mirroring is asynchronous and non-blocking: the production request path is unaffected by any shadow-side processing. If the shadow model is slow or fails, the production response is never delayed or affected.
The request duplicator captures the full request payload — including headers, authentication context (anonymised), timestamp, and all inference inputs — and enqueues a copy to a shadow inference queue. The shadow model consumer reads from this queue and processes requests at its own pace. Because shadow processing is decoupled from production request latency, the shadow model can be run on lower-priority compute — scheduled spot instances, off-peak batch processing — without affecting production SLOs.
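The non-blocking property described above can be illustrated with a minimal in-process sketch. It is not a substitute for mirroring at the load balancer or service mesh; `ShadowMirror`, `handle_request`, and the `max_depth` parameter are hypothetical names introduced here, and a bounded standard-library queue stands in for the shadow inference queue.

```python
import queue


class ShadowMirror:
    """Hypothetical request duplicator backed by a bounded in-process queue."""

    def __init__(self, max_depth):
        self.queue = queue.Queue(maxsize=max_depth)
        self.dropped = 0  # mirrored requests shed because the queue was full

    def mirror(self, request):
        """Enqueue a copy of the request; never block the caller."""
        try:
            self.queue.put_nowait(dict(request))  # shallow copy of the payload
            return True
        except queue.Full:
            self.dropped += 1  # shed shadow load rather than wait
            return False


def handle_request(request, production_model, mirror):
    """Serve from production; shadow-side problems must not reach the user."""
    try:
        mirror.mirror(request)
    except Exception:
        pass  # any mirroring error is swallowed on the production path
    return production_model(request)
```

The design choice worth noting is `put_nowait`: when the shadow side falls behind, the copy is dropped and counted, and the production response is returned as normal.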
4.2 Shadow Response Storage
Every shadow inference produces a response pair: the shadow model's response and the corresponding production model's response (retrieved from the production response log by matching a request correlation ID). These pairs are stored in a shadow comparison store — a document or columnar database optimised for the comparison analysis pipeline. The store retains: request ID, timestamp, request payload hash (not cleartext for privacy), production response, shadow response, and all computed quality metrics for both responses.
Retention policy for shadow response pairs: 90 days online, then purged (unless subject to regulatory retention). The shadow store must be scoped as a non-production system: real user input data subject to privacy regulations must be anonymised or pseudonymised before storage.
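As a rough illustration of the stored record, the sketch below builds a response pair that carries a keyed hash of the request payload instead of the cleartext. The field names, the `hash_payload` helper, and the use of HMAC-SHA256 with a secret key are assumptions made for this example; the pattern itself only requires that the payload is hashed and that user data is anonymised or pseudonymised.

```python
import hashlib
import hmac
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ResponsePair:
    request_id: str           # correlation ID shared with the production log
    timestamp: str
    payload_hash: str         # keyed hash only; cleartext payload is not stored
    production_response: str
    shadow_response: str


def hash_payload(payload, key):
    """Keyed hash of a canonical JSON form, so key order does not change the hash."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hmac.new(key, canonical.encode("utf-8"), hashlib.sha256).hexdigest()


def build_pair(request_id, timestamp, payload, production_response, shadow_response, key):
    return ResponsePair(
        request_id=request_id,
        timestamp=timestamp,
        payload_hash=hash_payload(payload, key),
        production_response=production_response,
        shadow_response=shadow_response,
    )
```

A keyed hash is used here because a plain hash of short or guessable inputs can be reversed by enumeration; whether that level of protection is sufficient is a question for the privacy impact assessment.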
4.3 Comparison Analysis Pipeline
A daily analysis pipeline processes accumulated shadow/production pairs and produces a comparison report. The pipeline computes:
- Quality metrics for both models: accuracy, BLEU/ROUGE/BERTScore for generation tasks, calibration for classification.
- Latency distribution (p50, p95, p99) for shadow vs production.
- Error rate comparison.
- Cost per inference comparison.
- Safety check results (does the shadow model generate any content that the production model would not?).
- Disagreement rate: the proportion of requests where the two models produce materially different outputs.
The report is published to the model governance dashboard and stored in the Model Register against the shadow version.
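Two of these metrics, the latency percentiles and the disagreement rate, can be sketched in a few lines. This is an illustrative sketch only: the nearest-rank percentile method and the `normalised_differs` rule (case and surrounding whitespace treated as immaterial) are assumptions, and a real pipeline would define "materially different" per task.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile; pct is a whole number between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(latencies_ms):
    """p50, p95 and p99 for one model's observed latencies."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}


def normalised_differs(production, shadow):
    # Assumption: case and surrounding whitespace are not material differences.
    return production.strip().lower() != shadow.strip().lower()


def disagreement_rate(pairs, differs=normalised_differs):
    """Share of (production, shadow) output pairs judged materially different."""
    if not pairs:
        raise ValueError("no response pairs to compare")
    return sum(1 for prod, shadow in pairs if differs(prod, shadow)) / len(pairs)
```

Passing a task-specific `differs` function (for example, a label comparison for classifiers or a similarity threshold for generated text) keeps the rate meaningful across model types.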
4.4 Handling Stateful Operations in Shadow
Shadow models must operate in read-only mode. They must not write to any production database, send notifications, invoke external APIs, or modify any shared state. Shadow inference is computation-only. For models that normally invoke tools or external systems, the shadow request processor must use a stubbed tool layer that records the intended tool calls without executing them. This is enforced by infrastructure — the shadow model's service account has no write permissions on production systems.
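A stubbed tool layer can be as simple as an object that records each intended call and returns a placeholder result. The sketch below is a hypothetical illustration; `StubToolLayer`, its `call` method, and the shape of the placeholder result are assumptions, and it complements rather than replaces the infrastructure-level control (a service account with no write permissions).

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class StubToolLayer:
    """Records the tool calls a shadow model intends to make; executes none."""

    recorded_calls: list = field(default_factory=list)

    def call(self, tool_name: str, **arguments: Any) -> dict:
        # Record intent for later comparison and audit.
        self.recorded_calls.append({"tool": tool_name, "arguments": arguments})
        # Return a placeholder so the shadow model can continue its reasoning.
        return {"status": "stubbed", "tool": tool_name}
```

For example, `StubToolLayer().call("send_notification", user_id="u1")` returns the placeholder and leaves one entry in `recorded_calls`, with no notification sent.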
4.5 Shadow Duration Guidelines
Shadow duration is determined by model risk tier:
- Low-risk internal models require a minimum of 1 week with at least 10,000 shadow requests.
- Medium-risk customer-facing models require a minimum of 2 weeks with at least 50,000 shadow requests.
- High-risk regulated models (credit, medical, fraud) require a minimum of 4 weeks with at least 100,000 shadow requests and explicit sign-off from the risk function.
These are minimums: shadow should continue until promotion criteria are met, regardless of calendar time.
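The tier minimums above translate directly into a lookup table. In this sketch the tier keys (`"low"`, `"medium"`, `"high"`) and the function name are hypothetical; the day and request thresholds are the ones stated in this section, with weeks expressed as days.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShadowMinimum:
    min_days: int
    min_requests: int
    needs_risk_sign_off: bool


# Thresholds taken from the shadow duration guidelines (1, 2 and 4 weeks).
SHADOW_MINIMUMS = {
    "low": ShadowMinimum(7, 10_000, False),
    "medium": ShadowMinimum(14, 50_000, False),
    "high": ShadowMinimum(28, 100_000, True),
}


def minimum_duration_met(tier, elapsed_days, shadow_requests, risk_signed_off=False):
    """True only when every minimum for the tier is satisfied."""
    m = SHADOW_MINIMUMS[tier]
    return (
        elapsed_days >= m.min_days
        and shadow_requests >= m.min_requests
        and (risk_signed_off or not m.needs_risk_sign_off)
    )
```

Meeting the minimum is necessary but not sufficient: the check says nothing about whether the promotion criteria themselves are satisfied.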
4.6 Promotion Criteria
Promotion from shadow to production (via canary release per EAAPL-MDL003) requires all of the following:
- Shadow quality score meets or exceeds production by the margin defined at version registration.
- Shadow p99 latency within 20% of production p99.
- Shadow error rate does not exceed production error rate.
- Shadow safety check passes (zero content safety violations).
- Minimum shadow duration met.
- Comparison report reviewed and approved by model owner and, for high-risk models, AI Governance.
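The measurable criteria can be checked mechanically before the report goes to reviewers. The sketch below is one possible reading, not a definitive rule set: `ShadowSummary` and `unmet_criteria` are hypothetical names, "within 20%" is interpreted as shadow p99 no more than 1.2 times production p99, and the human review criterion is deliberately left out because it cannot be automated.

```python
from dataclasses import dataclass


@dataclass
class ShadowSummary:
    shadow_quality: float
    production_quality: float
    required_margin: float       # margin defined at version registration
    shadow_p99_ms: float
    production_p99_ms: float
    shadow_error_rate: float
    production_error_rate: float
    safety_violations: int
    minimum_duration_met: bool


def unmet_criteria(s):
    """Return the names of automated criteria that are not yet satisfied."""
    unmet = []
    if s.shadow_quality < s.production_quality + s.required_margin:
        unmet.append("quality")
    if s.shadow_p99_ms > 1.2 * s.production_p99_ms:  # assumed reading of "within 20%"
        unmet.append("latency")
    if s.shadow_error_rate > s.production_error_rate:
        unmet.append("error_rate")
    if s.safety_violations > 0:
        unmet.append("safety")
    if not s.minimum_duration_met:
        unmet.append("duration")
    return unmet
```

An empty list means the report is ready for review; it does not itself approve promotion.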
5. Architecture Diagram
flowchart TD
subgraph Traffic["Traffic Layer"]
A[User Request]
B[Load Balancer]
end
subgraph Models["Model Serving"]
C[Production Model]
D[Shadow Inference Queue]
E[Shadow Model]
end
subgraph Analysis["Comparison and Governance"]
F[(Shadow Response Store)]
G[Comparison Pipeline]
H{Promotion Decision}
end
A --> B
B -->|sync primary| C
B -->|async mirror| D
D --> E
C --> F
E --> F
F --> G
G --> H
H -->|criteria met| I[Canary Release]
H -->|criteria not met| J[Extend Shadow Period]
style A fill:#dbeafe,stroke:#3b82f6
style B fill:#f0fdf4,stroke:#22c55e
style C fill:#d1fae5,stroke:#10b981
style D fill:#fef9c3,stroke:#eab308
style E fill:#dbeafe,stroke:#3b82f6
style F fill:#fef9c3,stroke:#eab308
style G fill:#f0fdf4,stroke:#22c55e
style H fill:#f3e8ff,stroke:#a855f7
style I fill:#d1fae5,stroke:#10b981
style J fill:#fee2e2,stroke:#ef4444
6. Components
| Component | Type | Responsibility | Technology Options | Criticality |
| --- | --- | --- | --- | --- |
| Request Duplicator | Infrastructure | Asynchronously mirrors production requests to shadow queue; zero production latency impact | Envoy request mirroring (request_mirror_policies), Nginx mirror directive, Istio traffic mirroring | Critical |
| Shadow Inference Queue | Messaging | Decouples shadow processing from production path; buffers during shadow compute spikes | AWS SQS, Azure Service Bus, GCP Pub/Sub, Kafka | High |
| Shadow Model Serving | Inference | Runs shadow model version against mirrored requests | Same inference infrastructure as production; lower-priority compute | High |
| Stub Tool Layer | Safety Guard | Intercepts tool calls from shadow model; records intent without executing | Custom middleware; feature flag that disables external calls | Critical |
| Shadow Response Store | Data Store | Stores request/response pairs for comparison analysis | DynamoDB, BigQuery, Snowflake, PostgreSQL | High |
| Comparison Analysis Pipeline | Batch Compute | Runs daily statistical comparison; produces comparison report | Apache Spark, AWS Glue, dbt + SQL, custom Python pipeline | High |
| Model Governance Dashboard | Observability | Presents comparison results; supports promotion decision workflow | Grafana, custom React dashboard, Looker | Medium |
7. Data Flow
7.1 Primary Flow
| Step | Actor | Action | Output |
| --- | --- | --- | --- |
| 1 | User | Sends inference request | Request received at load balancer |
| 2 | Load Balancer | Routes request to production model; asynchronously mirrors to shadow queue | Production request dispatched; shadow message enqueued |
| 3 | Production Model | Processes request; returns response | Production response served to user; logged with request ID |
| 4 | Shadow Queue Consumer | Reads mirrored request; invokes shadow model | Shadow inference job initiated |
| 5 | Shadow Model | Processes mirrored request via stub tool layer | Shadow response computed; tool calls recorded not executed |
| 6 | Shadow Response Writer | Writes shadow response + matching production response to shadow store | Response pair persisted with correlation ID |
| 7 | Comparison Pipeline | Daily run: reads all new pairs; computes metrics; generates report | Comparison report published to governance dashboard |
| 8 | Model Owner / Governance | Reviews report against promotion criteria | Promotion approved or shadow period extended |
7.2 Error Flow
| Error Scenario | Detection | Recovery Action |
| --- | --- | --- |
| Shadow model inference failure | Error rate monitor on shadow consumer | Log error; skip pair; alert on sustained failure rate > 5% |
| Shadow queue backpressure | Queue depth monitor | Scale shadow consumer; shed shadow load (production unaffected) |
| Stub tool layer bypass (shadow writes) | Audit log alert on unexpected write attempt | Halt shadow processing; security investigation; version quarantined |
| Comparison pipeline failure | Pipeline health monitor | Retry pipeline; alert after 2 consecutive daily failures |
| Response pair storage at capacity | Storage utilisation alert | Age out pairs beyond retention window; scale storage |
8. Security Considerations
8.1 Controls Summary
| Domain | Control |
| --- | --- |
| Authentication | Shadow model service account isolated from production service account; no shared credentials |
| Authorisation | Shadow model service account has read-only access to inference inputs; no write access to any production system |
| Secrets | Shadow model uses same secrets manager as production; keys scoped per model version |
| Classification | Shadow response store classified at same level as production data; user request payloads anonymised before storage |
| Encryption | Shadow store encrypted at rest (AES-256) and in transit (TLS 1.3) |
| Auditability | All shadow inference attempts logged; any tool call attempt (stub or real) logged to audit trail |
8.2 OWASP LLM Top 10 Relevance
| OWASP LLM Risk | Relevance | Mitigation |
| --- | --- | --- |
| LLM01 Prompt Injection | High | Shadow model processes real production inputs including potentially adversarial content; must run in isolated sandbox |
| LLM02 Insecure Output Handling | Medium | Shadow responses are stored not served, but must still be sanitised before display in comparison dashboard |
| LLM03 Training Data Poisoning | Low | Shadow model is a pre-trained/fine-tuned candidate; poisoning risk addressed in training pipeline (EAAPL-MDL006) |
| LLM04 Model Denial of Service | Medium | Shadow queue acts as a buffer; but sustained high volume can exhaust shadow compute budget |
| LLM05 Supply Chain Vulnerabilities | Medium | Shadow model shares supply chain with production; validated by same provenance check |
| LLM06 Sensitive Information Disclosure | High | Request payloads contain real user data; pseudonymisation before shadow store is mandatory |
| LLM07 Insecure Plugin Design | High | Stub tool layer is the primary control; any bypass allows shadow model to take real-world action |
| LLM08 Excessive Agency | High | Stub tool layer prevents shadow from executing any action; this is the central security control |
| LLM09 Overreliance | Low | Shadow is internal validation tooling; overreliance not applicable |
| LLM10 Model Theft | Medium | Shadow response store contains model outputs at scale; store access controls prevent inference reversal |
9. Governance Considerations
9.1 Responsible AI
Shadow testing must include fairness analysis in the comparison report: do the shadow model's outputs diverge from production in ways that are disproportionate across demographic subgroups? Any fairness regression detected in shadow is a blocking criterion for promotion, regardless of overall quality metrics.
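One simple way to look for disproportionate divergence is to compute the disagreement rate per subgroup and inspect the spread. The sketch below is illustrative only: the record shape `(group, production_output, shadow_output)`, exact-match comparison, and the max-minus-min gap are all assumptions, and a real fairness analysis would use task-appropriate metrics and significance testing.

```python
from collections import defaultdict


def disagreement_by_group(records):
    """Disagreement rate per subgroup from (group, production, shadow) records."""
    totals = defaultdict(int)
    differing = defaultdict(int)
    for group, production_output, shadow_output in records:
        totals[group] += 1
        if production_output != shadow_output:
            differing[group] += 1
    return {group: differing[group] / totals[group] for group in totals}


def max_rate_gap(rates):
    """Largest difference in disagreement rate between any two subgroups."""
    return max(rates.values()) - min(rates.values())
```

A large gap does not by itself prove a fairness regression, but it identifies the subgroups whose diverging outputs should be reviewed before promotion.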
9.2 Model Risk Management
Shadow deployment is the pre-production validation stage of the MRM lifecycle. The comparison report constitutes evidence for model validation. For APRA-regulated entities, the comparison report is part of the model governance record and must be retained.
9.3 Human Approval Gates
Promotion from shadow to canary is a human decision. The comparison report informs the decision but does not automate it. The model owner must explicitly approve promotion. For high-risk models, the AI Governance function countersigns. Automated promotion without human review is not permitted.
9.4 Governance Artefacts
| Artefact | Owner | Frequency | Location |
| --- | --- | --- | --- |
| Shadow Comparison Report | Model Owner | Daily during shadow | Model Register + governance dashboard |
| Shadow Period Summary | AI Governance | At promotion decision | Model governance record |
| Stub Tool Layer Audit Log | Security Operations | Continuous | SIEM |
| Privacy Impact Assessment | Privacy Officer | Per shadow deployment | Privacy register |
10. Operational Considerations
10.1 SLOs
| SLO | Target | Measurement Method |
| --- | --- | --- |
| Shadow queue lag behind production | < 60 seconds | Queue consumer lag metric |
| Shadow processing error rate | < 1% | Error counter on shadow consumer |
| Comparison report publication latency | < 2 hours after midnight | Pipeline completion timestamp |
| Response pair storage availability | 99.9% | Storage health check |
10.2 Monitoring and Logging
Key metrics to monitor continuously during shadow period: shadow queue depth (alert if > 10,000 unprocessed), shadow consumer error rate (alert if > 1%), shadow model latency p99 (informational — not blocking production), daily comparison report publication (alert if missing), stub tool layer bypass attempts (alert immediately — P1).
10.3 Incident Response
Two incident classes specific to shadow deployment: (1) Shadow production interference — if any shadow operation writes to or calls a production system, halt shadow immediately; security investigation; version quarantined until investigation complete. (2) Shadow queue saturation impacting production — theoretically impossible if mirroring is purely async; if observed, circuit-breaker drops shadow traffic; P1 incident.
10.4 Disaster Recovery
| Scenario | RPO | RTO | Recovery Procedure |
| --- | --- | --- | --- |
| Shadow store data loss | 24h | 4 hours | Restart shadow period; production unaffected |
| Shadow consumer failure | N/A | 1 hour | Restart consumer; process queued messages; production unaffected |
| Comparison pipeline failure | N/A | 2 hours | Retry pipeline run; extend shadow period if report missing |
10.5 Capacity Planning
Shadow infrastructure processes the same volume as production but asynchronously. Size shadow compute at 30–50% of production inference capacity (queue provides elasticity). Shadow response store grows at: (average response size) × (daily request volume) × (retention days). For a service with 100,000 requests/day at 2KB average response size and 90-day retention: ~18 GB. Plan at 5× for safety margins and comparison metadata.
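The storage arithmetic above can be reproduced as a short calculation. The function name and the decimal interpretation of units (2 KB = 2,000 bytes, 1 GB = 10^9 bytes) are assumptions for this sketch; with binary units the figures shift slightly.

```python
def shadow_store_bytes(avg_response_bytes, requests_per_day, retention_days, safety_factor=1):
    """Storage estimate: response size x daily volume x retention x safety factor."""
    return avg_response_bytes * requests_per_day * retention_days * safety_factor


GB = 10**9  # decimal gigabyte (assumption)

base = shadow_store_bytes(2_000, 100_000, 90)        # worked example from the text
planned = shadow_store_bytes(2_000, 100_000, 90, 5)  # with the 5x planning margin

print(base / GB, planned / GB)  # 18.0 90.0
```

The 5× margin brings the planning figure for the worked example to about 90 GB.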
11. Cost Considerations
11.1 Cost Drivers
| Driver | Description | Relative Impact |
| --- | --- | --- |
| Shadow inference compute | Running shadow model at production traffic volume | High |
| Shadow response storage | Storing 90 days of response pairs at production volume | Medium |
| Comparison pipeline compute | Daily batch analysis of accumulated pairs | Low |
| Queue infrastructure | Managed queue service at production message volume | Low |
| Engineering time | Setting up and maintaining shadow infrastructure per model | High |
11.2 Scaling Risks
Shadow inference compute scales linearly with production traffic, so a spike that doubles production traffic also doubles shadow compute demand. Mitigation: the shadow consumer operates with a configurable maximum throughput, and excess shadow requests are shed (shadow completeness falls, but production is unaffected). Monitor shadow completeness, the fraction of mirrored requests that the shadow model actually processed: if it falls below 80%, extend the shadow duration.
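A throughput cap with a completeness counter might look like the following sketch. `ShadowThrottle`, the fixed one-second window, and the caller-supplied clock value are assumptions chosen to keep the example small and deterministic; a production consumer would more likely rely on the rate-limiting features of its queue or serving platform.

```python
class ShadowThrottle:
    """Caps shadow requests per one-second window and tracks completeness."""

    def __init__(self, max_per_second):
        self.max_per_second = max_per_second
        self.window = None   # whole second currently being counted
        self.count = 0       # requests admitted in the current window
        self.offered = 0     # all mirrored requests seen
        self.accepted = 0    # mirrored requests admitted for shadow inference

    def admit(self, now_seconds):
        """Return True to process the request, False to shed it."""
        window = int(now_seconds)
        if window != self.window:
            self.window, self.count = window, 0
        self.offered += 1
        if self.count >= self.max_per_second:
            return False
        self.count += 1
        self.accepted += 1
        return True

    def completeness(self):
        """Fraction of mirrored requests that were admitted."""
        return self.accepted / self.offered if self.offered else 1.0
```

Comparing `completeness()` against the 80% threshold gives the signal for extending the shadow period.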
11.3 Optimisations
- Use spot/preemptible instances for shadow inference (shadow is delay-tolerant).
- Process shadow requests in micro-batches for GPU efficiency (batch size 8–32 depending on model).
- Use columnar compression on shadow response store (response text compresses 5–10×).
- Skip shadow for PATCH version changes; run only comparison analysis on a sampled offline subset.
11.4 Indicative Cost Range
| Traffic Volume | Monthly Shadow Cost (Inference Only) | Assumptions |
| --- | --- | --- |
| Low (< 100K req/day) | $500–$2,000 | Spot GPU instances; 4-week shadow; small LLM |
| Medium (100K–1M req/day) | $2,000–$15,000 | Managed GPU cluster; spot pricing; auto-scaling |
| High (> 1M req/day) | $15,000–$80,000 | Dedicated GPU fleet; storage at scale |
12. Trade-Off Analysis
12.1 Shadow vs Alternative Validation Approaches
| Approach | Quality Signal | Production Impact | Cost | Regulatory Evidence | Best For |
| --- | --- | --- | --- | --- | --- |
| Shadow deployment (this pattern) | High (real traffic) | None | High | Strong | High-risk, regulated, customer-facing models |
| Canary release (EAAPL-MDL003) | High (real traffic + outcomes) | User-visible risk | Medium | Strong | Medium-risk models with low rollback cost |
| Offline A/B on held-out set | Medium (static dataset) | None | Low | Moderate | Research validation; pre-shadow gate |
| Manual QA on sampled requests | Low (human review) | None | Medium | Weak | Small models, low volume, low risk |
12.2 Architectural Tensions
| Tension | Description | Resolution |
| --- | --- | --- |
| Privacy vs Signal Quality | Using real user data maximises signal; but storage of real user inputs raises privacy risk | Pseudonymise at capture; store only input hash + model outputs; purge promptly after comparison |
| Shadow Completeness vs Cost | Full shadow coverage is ideal; cost may require sampling | Stratified sampling: ensure all input types represented; priority to long-tail and edge cases |
| Read-Only Constraint vs Realism | Shadow model cannot replicate stateful model behaviours (e.g., personalisation that writes state) | Shadow tests stateless inference quality only; stateful behaviour validated separately via integration tests |
13. Failure Modes
| Failure | Likelihood | Impact | Detection | Recovery |
| --- | --- | --- | --- | --- |
| Shadow model writes to production system | Very Low | Critical | Audit log alert on unexpected write | Halt shadow; quarantine version; security investigation |
| Comparison report produces false positive | Medium | High | Manual review catches inconsistency | Re-run pipeline with corrected metrics; extend shadow period |
| Shadow queue memory leak causes host OOM | Low | Medium | Container memory alert | Restart consumer; process queued messages from checkpoint |
| Privacy breach: real PII in shadow store | Low | Critical | Data classification scan alert | Halt shadow; purge affected data; notify privacy officer |
| Stub bypass allows shadow notification to user | Very Low | High | User complaint; audit log | Halt shadow; user apology; investigate stub implementation |
13.1 Cascading Failure Scenarios
If the shadow queue grows unbounded (consumer failure during high-traffic period), the queue infrastructure may exhaust storage. If queue infrastructure is shared with production messaging systems, this can cascade into production messaging failures. Mitigation: shadow queue is isolated from all production messaging infrastructure; maximum queue depth is bounded; when maximum is reached, new shadow messages are dropped (production unaffected).
14. Regulatory Considerations
| Regulation / Framework | Relevant Clause | How This Pattern Addresses It |
| --- | --- | --- |
| EU AI Act (2024/1689) | Article 9 (Risk Management System): pre-deployment testing requirement for high-risk AI | Shadow deployment constitutes mandatory pre-deployment validation; comparison report is evidence |
| EU AI Act (2024/1689) | Article 10 (Data Governance): training and validation data quality | Shadow uses real production distribution to validate beyond training data |
| ISO 42001:2023 | Clause 8.4 (AI system lifecycle: verification and validation) | Shadow comparison report constitutes validation evidence per Clause 8.4 |
| NIST AI RMF (2023) | MANAGE 2.2 (Mechanisms for test, evaluation, validation, verification) | Shadow is the primary TEVV mechanism for model upgrades |
| APRA CPS 230 (2025) | Paragraph 52 (Change management: testing) | Shadow constitutes pre-change testing; comparison report is test evidence |
| Privacy Act 1988 (Cth) | APP 3 (Collection of solicited personal information) / APP 11 (Security) | Shadow captures real user data; it must fall within privacy notice scope and be secured at the classification level of production data |
15. Reference Implementations
15.1 AWS
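- Traffic Mirroring: Envoy proxy deployed on ECS/EKS with request mirroring. Application Load Balancer has no native request-mirroring feature, and VPC Traffic Mirroring copies packets at the network-interface level rather than application requests.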
- Shadow Queue: Amazon SQS FIFO queue with shadow inference Lambda consumer.
- Shadow Inference: SageMaker Endpoint (separate endpoint per shadow version); spot instance backed.
- Shadow Store: Amazon DynamoDB (response pairs); S3 for bulk comparison data.
- Comparison Pipeline: AWS Glue job (daily); results to S3 + QuickSight dashboard.
15.2 Azure
- Traffic Mirroring: Azure API Management with request duplication policy; or Azure Service Mesh (Istio on AKS).
- Shadow Queue: Azure Service Bus Premium (isolation from production).
- Shadow Inference: Azure Machine Learning managed endpoint (shadow version); spot-backed compute cluster.
- Shadow Store: Azure Cosmos DB (response pairs); Azure Blob for comparison data.
- Comparison Pipeline: Azure Synapse Analytics (daily pipeline); Power BI dashboard.
15.3 GCP
- Traffic Mirroring: Cloud Load Balancing with request mirroring; or Istio on GKE.
- Shadow Queue: Cloud Pub/Sub (dedicated topic for shadow).
- Shadow Inference: Vertex AI Endpoint (shadow version); preemptible GPU nodes.
- Shadow Store: BigQuery (response pairs, columnar for analysis efficiency).
- Comparison Pipeline: BigQuery ML + Dataflow; Looker dashboard.
15.4 On-Premises / Hybrid
- Traffic Mirroring: Nginx mirroring directive; Envoy proxy sidecar in Kubernetes.
- Shadow Queue: Apache Kafka (dedicated topic, separate consumer group).
- Shadow Inference: Kubernetes Job on dedicated GPU node pool (lower-priority node affinity).
- Shadow Store: PostgreSQL + TimescaleDB; columnar compression for response pairs.
- Comparison Pipeline: Apache Spark on-cluster; Grafana dashboard.
16. Related Patterns
| Pattern ID | Pattern Name | Relationship Type | Description |
| --- | --- | --- | --- |
| EAAPL-MDL001 | Model Versioning | Prerequisite | Shadow deployment operates on specific versioned model artefacts |
| EAAPL-MDL003 | Canary Model Release | Next Step | Successful shadow completion gates entry to canary release |
| EAAPL-MDL004 | Model Rollback | Sibling | If shadow reveals production regression, rollback pattern is applied to current prod version |
| EAAPL-MDL008 | Model Access Governance | Dependency | Shadow model access is governed by same access tiers as production |
17. Maturity Assessment
Overall Maturity: Proven
| Dimension | Score (1–5) | Rationale |
| --- | --- | --- |
| Industry Adoption | 4 | Shadow/dark launch is established in software; LLM-specific shadow is newer |
| Tooling Availability | 3 | Traffic mirroring is mature; LLM shadow comparison pipelines require custom build |
| Standards Alignment | 4 | Directly supports EU AI Act Article 9 and ISO 42001 Clause 8.4 |
| Implementation Complexity | 4 (high) | Requires async infrastructure, privacy controls, and statistical analysis pipeline |
| Regulatory Acceptance | 4 | Shadow evidence is accepted as pre-deployment validation by EU AI Act supervisors |
18. Revision History
| Version | Date | Author | Summary of Changes |
| --- | --- | --- | --- |
| 1.0 | 2026-06-12 | Enterprise AI Architecture Practice | Initial publication |