
EAAPL-MDL004 — Model Rollback

🧮 Model Management · 🏭 Field-tested in AU

| Attribute | Value |
| --- | --- |
| Pattern ID | EAAPL-MDL004 |
| Name | Model Rollback |
| Maturity | Proven |
| Complexity | Medium |
| Tags | model-risk, disaster-recovery, high-availability, medium-complexity |
| Last Reviewed | 2026-06-12 |
| Owner | Enterprise AI Architecture Practice |

1. Executive Summary

Model rollback is the capability to revert production AI model serving to a previously approved, known-good model version within a defined time target, typically less than five minutes from trigger to full, verified traffic on the previous version. It is the safety net for every model deployment pattern: canary release (EAAPL-MDL003) relies on automated rollback for its blast-radius control, and shadow deployment (EAAPL-MDL002) is only justified by the assurance that a bad promotion can be quickly reversed. Without a tested, rehearsed rollback capability, model upgrades carry unacceptable risk: a defective model serving 100% of production traffic cannot be recovered without operational chaos.

For CIOs, rollback is a non-negotiable resilience capability that should appear in Business Continuity Plans for AI-dependent services. For CTOs, it is the technical prerequisite that makes model iteration velocity safe: teams can deploy confidently when they know recovery is fast and rehearsed. For risk officers, tested rollback capability supports compliance with APRA CPS 230 Paragraph 42 (incident management) and EU AI Act Article 9 risk management requirements.

The pattern covers not just the traffic shift mechanism but also state management during rollback, consumer notification, and the mandatory post-rollback investigation process.


2. Problem Statement

2.1 Business Problem

When a new model version causes a production incident — quality regression, increased error rate, safety violation, or regulatory non-compliance — every second of delay in restoration has customer impact. Organisations without a defined rollback procedure improvise under pressure, making mistakes that extend the incident. Executive teams cannot provide accurate time-to-resolution estimates because no target exists.

2.2 Technical Problem

Model serving infrastructure is stateful: there may be in-flight requests processing under the new model, cached responses that need invalidation, session state tied to the new model version, database records created by the new model, and downstream consumers that have cached the new model's endpoint or schema. A naive traffic switch reverts serving but does not address any of these complications, leaving the system in an inconsistent state.

2.3 Symptoms

  • The organisation has no defined rollback procedure for model upgrades — it would be "figured out if needed."
  • The last model rollback took 45+ minutes and required escalation to senior engineers.
  • Rollback procedures exist on paper but have never been tested in a non-emergency context.
  • State management during rollback is unclear — in-flight requests during the switch are silently dropped or completed with mixed model versions.

2.4 Cost of Inaction

| Category | Indicative Impact |
| --- | --- |
| Availability | Extended incident duration (30–120 min improvised vs < 5 min rehearsed rollback) |
| Customer Impact | Every additional minute of degraded AI service has measurable NPS and churn impact |
| Regulatory | Inability to respond to a model safety violation within a defined window is a regulatory breach under EU AI Act Article 9 |
| Reputational | Protracted visible incidents generate media attention; fast recovery is invisible |

3. Context

3.1 When to Apply

  • Any production deployment of an AI model that has business, customer, or regulatory significance.
  • As a companion to canary release (EAAPL-MDL003) — rollback is the automatic response to canary threshold breach.
  • Organisations that have defined RTO targets for AI model serving.
  • Regulated environments where the organisation must demonstrate the capability to rapidly cease or revert AI operation.

3.2 When NOT to Apply

  • Models embedded in batch pipelines where "rollback" means reprocessing a batch (handled separately as a data pipeline concern).
  • Models deployed in embedded/edge devices where over-the-air updates are the recovery mechanism (different latency class).
  • Training pipelines — rollback in this context refers to reverting to a previous training run, not serving.

3.3 Prerequisites

| Prerequisite | Detail |
| --- | --- |
| Model versioning (EAAPL-MDL001) | Previous version artefact must be registered and retrievable |
| Last-known-good version designation | Model Register must designate the current known-good version at all times |
| Traffic routing infrastructure | Capable of instant weight changes (< 60 seconds for routing updates) |
| Rollback automation | Automated system that can execute rollback without manual infrastructure steps |
| Rollback runbook (tested) | Written and rehearsed procedure; last test < 90 days ago |

3.4 Industry Applicability

| Industry | Applicability | Primary Driver |
| --- | --- | --- |
| Financial Services | Critical | Regulatory obligation to cease defective algorithmic operations |
| Healthcare | Critical | Patient safety; immediate cessation of harmful AI recommendation |
| Government | Critical | Accountability for citizen-facing AI; auditability |
| Technology Platforms | High | SLA obligations to enterprise API consumers |
| E-commerce | High | Revenue protection during model incidents |
| Media / Content | High | Content moderation continuity during model change incidents |

4. Architecture Overview

4.1 Rollback Trigger Conditions

Rollback is triggered by two paths: automatic and manual.

Automatic triggers are metric-driven and fire without human intervention:

  • (a) Error rate exceeds the rollback threshold defined at version registration (typically production error rate + 0.5 pp) for two consecutive 5-minute evaluation windows.
  • (b) Latency p99 exceeds the rollback threshold (typically production p99 + 50%) for two consecutive windows.
  • (c) Any content safety violation is detected in production model outputs.
  • (d) Any security alert indicates that the model is producing outputs consistent with a prompt injection attack.
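The automatic trigger rules can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the names `Window`, `Thresholds`, and `should_auto_rollback` are hypothetical, and the margins simply encode the "typical" values quoted above (baseline + 0.5 pp, baseline p99 + 50%).

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Window:
    """Metrics for one 5-minute evaluation window (hypothetical shape)."""
    error_rate_pct: float        # error rate, in percent
    p99_latency_ms: float
    safety_violations: int = 0
    injection_alerts: int = 0


@dataclass(frozen=True)
class Thresholds:
    """Rollback thresholds derived from the production baseline."""
    baseline_error_rate_pct: float
    baseline_p99_ms: float
    error_margin_pp: float = 0.5   # baseline + 0.5 percentage points
    latency_margin: float = 0.5    # baseline + 50%

    @property
    def max_error_rate_pct(self) -> float:
        return self.baseline_error_rate_pct + self.error_margin_pp

    @property
    def max_p99_ms(self) -> float:
        return self.baseline_p99_ms * (1 + self.latency_margin)


def should_auto_rollback(windows: Sequence[Window], t: Thresholds) -> Optional[str]:
    """Return the trigger reason, or None if no automatic trigger fires.

    `windows` is ordered oldest to newest.
    """
    if not windows:
        return None
    latest = windows[-1]
    # (c) and (d): a single event in the latest window is enough.
    if latest.safety_violations > 0:
        return "content_safety_violation"
    if latest.injection_alerts > 0:
        return "prompt_injection_alert"
    # (a) and (b): the two most recent windows must both breach.
    if len(windows) >= 2:
        last_two = windows[-2:]
        if all(w.error_rate_pct > t.max_error_rate_pct for w in last_two):
            return "error_rate"
        if all(w.p99_latency_ms > t.max_p99_ms for w in last_two):
            return "latency_p99"
    return None
```

A single breaching window returns `None` here; in practice that case would page on-call rather than trigger rollback, as noted in the trade-off analysis.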

Manual triggers are initiated by an authorised human: model owner, platform on-call, product team lead, or AI Governance. Manual triggers are appropriate for: product team feedback indicating unexpected quality regression not captured by automated metrics; regulatory concern raised by compliance team; upstream model vendor safety advisory; any other situation where a human has direct evidence of a model problem not yet reflected in metrics.

All trigger events — automatic and manual — are logged to the immutable audit trail with: trigger type, triggering metric or human identity, current model version, target rollback version, and timestamp.
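One way to make such a trail tamper-evident is to chain each entry to the hash of the previous one. The sketch below assumes that approach; hash chaining is an illustrative technique chosen here, not something this pattern mandates, and `TriggerEvent` and `AuditTrail` are hypothetical names. A production system would normally rely on a write-once store instead of an in-memory list.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class TriggerEvent:
    """The fields listed above for every trigger event."""
    trigger_type: str       # "automatic" or "manual"
    triggered_by: str       # triggering metric or human identity
    current_version: str
    target_version: str
    timestamp: str          # ISO 8601, UTC


def _digest(body: dict) -> str:
    # Canonical JSON so the same content always yields the same hash.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


class AuditTrail:
    """Append-only log; each entry stores the hash of its predecessor."""

    GENESIS = "0" * 64

    def __init__(self) -> None:
        self._entries: List[dict] = []

    def append(self, event: TriggerEvent) -> dict:
        prev_hash = self._entries[-1]["hash"] if self._entries else self.GENESIS
        body = {"event": asdict(event), "prev_hash": prev_hash}
        entry = {**body, "hash": _digest(body)}
        self._entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute every hash; any edited or reordered entry breaks the chain."""
        prev_hash = self.GENESIS
        for entry in self._entries:
            body = {"event": entry["event"], "prev_hash": entry["prev_hash"]}
            if entry["prev_hash"] != prev_hash or entry["hash"] != _digest(body):
                return False
            prev_hash = entry["hash"]
        return True
```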

4.2 Rollback Execution Procedure

Step 1 — Traffic shift initiation (< 1 minute): The rollback automation retrieves the designated last-known-good version from the Model Register. It updates the traffic router to direct 100% of traffic to the previous version. The current (defective) version receives 0% traffic. This step is fully automated and requires no manual infrastructure action.

Step 2 — In-flight request drain (< 2 minutes): In-flight requests that reached the defective model before the traffic shift are allowed to complete within a configurable drain window (default 30 seconds). After drain timeout, remaining in-flight connections are closed with a retriable error response. Consumers should implement retry with idempotency keys (see EAAPL-INF policy on idempotency).

Step 3 — Serving verification (< 2 minutes): The rollback automation queries the health endpoint of the previous version model and verifies a sample inference returns a valid response. Only after verification does it emit a "rollback complete" event.

Step 4 — Notification (immediate, parallel with verification): On rollback initiation, automated notifications are sent to: model owner, platform on-call, AI Governance (for high-risk models), registered downstream consumers. Notification includes: version rolled back from, version rolled back to, trigger type, and incident reference number.

Total target: < 5 minutes from trigger to 100% of traffic on previous version with verification.
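The procedure can be sketched as a small orchestrator that times each step against its budget. This is a simplified illustration under stated assumptions: the step callables, `notify`, and `RollbackResult` are hypothetical, and notification is shown sequentially here although the procedure above runs it in parallel with verification.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Step budgets in seconds, taken from the step targets above.
STEP_BUDGETS = {"traffic_shift": 60, "drain": 120, "verify": 120}
TOTAL_BUDGET = 300  # the 5-minute end-to-end target


@dataclass
class RollbackResult:
    completed: bool
    step_seconds: Dict[str, float] = field(default_factory=dict)
    breached: List[str] = field(default_factory=list)

    @property
    def total_seconds(self) -> float:
        return sum(self.step_seconds.values())


def run_rollback(
    steps: Dict[str, Callable[[], bool]],
    notify: Callable[[str], None],
    clock: Callable[[], float] = time.monotonic,
) -> RollbackResult:
    """Run traffic shift, drain, and verification in order, timing each one."""
    result = RollbackResult(completed=False)
    notify("rollback_initiated")  # Step 4 begins at initiation
    for name in ("traffic_shift", "drain", "verify"):
        started = clock()
        ok = steps[name]()
        elapsed = clock() - started
        result.step_seconds[name] = elapsed
        if elapsed > STEP_BUDGETS[name]:
            result.breached.append(name)
        if not ok:
            notify(f"rollback_failed:{name}")
            return result
    if result.total_seconds > TOTAL_BUDGET:
        result.breached.append("total")
    result.completed = True
    notify("rollback_complete")  # emitted only after verification succeeds
    return result
```

Injecting `clock` keeps the timing logic testable in a rollback drill without waiting for real time to pass.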

4.3 State Management During Rollback

State management is the most complex aspect of model rollback. Four categories of state require handling:

Cached responses: If a response cache exists downstream, entries produced by the defective version must be invalidated. The rollback automation publishes a cache invalidation event keyed by the version ID. Downstream caches that honour this event purge defective-version entries.

Session state: Users whose sessions were assigned to the defective version (per canary sticky sessions) must be migrated to the previous version. Session assignments in the distributed cache are updated by the rollback automation — the migration is seamless to the user.

Database records created by the new version: This is the hardest category. If the new model version wrote records to a production database with a format or schema unique to the new version, rollback to the previous serving version does not automatically roll back the database records. The model deployment process must document what database writes the new version performed, and the rollback runbook must include the appropriate data remediation step (which may range from "no action required" to a targeted data migration). For models with significant database impact, the deployment decision must account for rollback data complexity.

Downstream consumer caches: Consumers who have cached the new model's responses or endpoints receive rollback notification and must implement their own invalidation. The rollback notification event includes sufficient context for consumers to identify potentially stale data.
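The first two categories, cached responses and session state, lend themselves to a short sketch. It assumes every cache entry records the model version that produced it; `VersionTaggedCache` and `migrate_sessions` are hypothetical in-memory stand-ins for a real response cache and distributed session store.

```python
from typing import Dict, Optional, Tuple


class VersionTaggedCache:
    """In-memory stand-in for a response cache that tags entries by model version."""

    def __init__(self) -> None:
        # cache key -> (version_id, response)
        self._store: Dict[str, Tuple[str, str]] = {}

    def put(self, key: str, version_id: str, response: str) -> None:
        self._store[key] = (version_id, response)

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        return entry[1] if entry else None

    def invalidate_version(self, version_id: str) -> int:
        """Purge every entry produced by `version_id`; return the purge count."""
        stale = [k for k, (v, _) in self._store.items() if v == version_id]
        for k in stale:
            del self._store[k]
        return len(stale)


def migrate_sessions(assignments: Dict[str, str], defective: str, previous: str) -> int:
    """Reassign sessions pinned to the defective version; return how many moved."""
    moved = 0
    for session_id, version in assignments.items():
        if version == defective:
            assignments[session_id] = previous  # value update only, keys unchanged
            moved += 1
    return moved
```

Entries written by the previous version survive the purge, so the cache stays warm for the version that is about to take 100% of traffic.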

4.4 Post-Rollback Investigation

A rollback event mandates a root cause analysis (RCA). The RCA must be initiated within 4 hours of rollback completion and delivered within 5 business days. The RCA documents: what went wrong (technical root cause), why it was not caught by pre-deployment testing (shadow/canary gap analysis), what the customer impact was (users affected × duration × severity), and what changes will prevent recurrence (process, tooling, or test coverage). The RCA is stored in the incident management system and reviewed by AI Governance. The defective model version is flagged in the Model Register as "rollback-required" — it cannot be re-deployed without addressing the RCA findings and producing a new approved version.
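The two RCA deadlines are simple date arithmetic. The sketch below assumes business days are Monday to Friday with no public-holiday calendar, and that delivery is due at the end of the fifth business day; both are assumptions, and `rca_deadlines` is a hypothetical helper.

```python
from datetime import datetime, time, timedelta, timezone
from typing import Tuple


def rca_deadlines(rollback_completed_at: datetime) -> Tuple[datetime, datetime]:
    """Return (initiate_by, deliver_by) for the post-rollback RCA."""
    initiate_by = rollback_completed_at + timedelta(hours=4)

    # Walk forward one calendar day at a time, counting Monday to Friday only.
    day = rollback_completed_at.date()
    remaining = 5
    while remaining > 0:
        day += timedelta(days=1)
        if day.weekday() < 5:  # 0 = Monday ... 4 = Friday
            remaining -= 1

    deliver_by = datetime.combine(
        day, time(23, 59, 59), tzinfo=rollback_completed_at.tzinfo
    )
    return initiate_by, deliver_by


# Example: rollback completed on Friday 12 June 2026 at 10:00 UTC.
example = datetime(2026, 6, 12, 10, 0, tzinfo=timezone.utc)
initiate_by, deliver_by = rca_deadlines(example)
```

For that example the RCA must start by 14:00 UTC the same day and be delivered by Friday 19 June 2026.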


5. Architecture Diagram

flowchart TD
    subgraph Trigger["Rollback Triggers"]
        A[Canary Monitor Alert]
        B[Manual Human Trigger]
    end
    subgraph Execution["Rollback Execution"]
        C[Rollback Automation]
        D[(Model Register)]
        E[Traffic Router]
    end
    subgraph Recovery["State Recovery"]
        F[Cache Invalidation]
        G[Previous Model]
        H[Notification Service]
    end
    A --> C
    B --> C
    C --> D
    D -->|last-known-good version| C
    C -->|0% current, 100% prev| E
    E --> G
    C --> F
    G -->|health verified| H
    H --> I[Incident + RCA]
    style A fill:#fee2e2,stroke:#ef4444
    style B fill:#fee2e2,stroke:#ef4444
    style C fill:#f0fdf4,stroke:#22c55e
    style D fill:#fef9c3,stroke:#eab308
    style E fill:#f0fdf4,stroke:#22c55e
    style F fill:#f0fdf4,stroke:#22c55e
    style G fill:#dbeafe,stroke:#3b82f6
    style H fill:#f0fdf4,stroke:#22c55e
    style I fill:#d1fae5,stroke:#10b981

6. Components

| Component | Type | Responsibility | Technology Options | Criticality |
| --- | --- | --- | --- | --- |
| Rollback Automation | Automation | Executes rollback procedure: traffic shift, drain, verification, notification | Custom Lambda/Cloud Function, Argo Rollouts, custom Kubernetes operator | Critical |
| Traffic Router | Infrastructure | Shifts traffic weights between model versions; must update in < 60 seconds | AWS ALB, Istio, NGINX, Envoy | Critical |
| Model Register | Platform Service | Provides last-known-good version reference; records rollback events | MLflow, custom registry, Vertex AI | Critical |
| Response Cache | Data Store | Caches model responses; must support version-keyed invalidation events | Redis, Memcached, Varnish, CloudFront | High |
| Notification Service | Integration | Sends rollback notifications to owners, governance, consumers | PagerDuty, OpsGenie, Slack, email | High |
| Drain Manager | Infrastructure | Manages in-flight request completion during traffic shift | Load balancer connection draining, custom graceful shutdown | High |
| Incident Management System | Governance | Records rollback event; tracks RCA; stores findings | PagerDuty, ServiceNow, Jira | Medium |

7. Data Flow

7.1 Primary Flow

| Step | Actor | Action | Output |
| --- | --- | --- | --- |
| 1 | Monitor / Human | Initiates rollback (automatic threshold breach or manual trigger) | Rollback trigger event with trigger type, model version |
| 2 | Rollback Automation | Queries Model Register for last-known-good version | Previous version artefact reference and health endpoint |
| 3 | Rollback Automation | Updates traffic router: 100% to previous version, 0% to current | Routing configuration updated; current version in drain |
| 4 | Drain Manager | Allows in-flight requests to complete (30-second window) | All in-flight requests resolved; current version idle |
| 5 | Rollback Automation | Publishes cache invalidation event for current version | Downstream caches begin purging current-version entries |
| 6 | Rollback Automation | Verifies previous version health and sample inference | Health confirmed; rollback-complete event emitted |
| 7 | Notification Service | Sends rollback notification to all registered parties | Notifications delivered; incident created |
| 8 | Model Register | Records rollback event: from version, to version, trigger, timestamp | Immutable audit log entry |
| 9 | Model Owner | Initiates RCA within 4 hours; delivers within 5 business days | RCA document; Model Register updated with rollback-required flag |

7.2 Error Flow

| Error Scenario | Detection | Recovery Action |
| --- | --- | --- |
| Previous version health check fails | Health probe returns unhealthy | Attempt one prior version; alert P1; manual investigation |
| Traffic router does not update in < 60 seconds | Timeout on routing API call | Retry 3×; escalate to infrastructure P1; manual routing override |
| Cache invalidation not honoured | Stale cache serves defective responses | Force flush via admin API; extend incident duration estimate |
| Rollback automation itself fails | Automation health monitor | Manual execution of rollback runbook; automation fix as P1 follow-up |
| Drain timeout exceeded with stuck requests | Drain timer alarm | Force-close remaining connections; accept retriable error for affected users |
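Two of these recovery actions, retrying the router update and falling back one version further, can be sketched as follows. This is illustrative only: `update_route`, `is_healthy`, and `escalate` are hypothetical callables, and "Retry 3×" is read here as one initial attempt plus three retries.

```python
from typing import Callable, Optional, Sequence


def shift_traffic(
    update_route: Callable[[], bool],
    escalate: Callable[[str], None],
    retries: int = 3,
) -> bool:
    """Try the routing update, retrying on failure or timeout, then escalate."""
    for _ in range(1 + retries):
        try:
            if update_route():
                return True
        except TimeoutError:
            pass  # treated the same as a failed update; try again
    escalate("P1: traffic router update failed; manual routing override required")
    return False


def pick_healthy_version(
    candidates: Sequence[str],
    is_healthy: Callable[[str], bool],
    escalate: Callable[[str], None],
) -> Optional[str]:
    """Check last-known-good first, then exactly one prior version.

    `candidates` is ordered newest to oldest. Returns None when both
    fail, at which point manual investigation takes over.
    """
    for version in candidates[:2]:
        if is_healthy(version):
            return version
        escalate(f"P1: health check failed for {version}")
    return None
```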

8. Security Considerations

8.1 Controls Summary

| Domain | Control |
| --- | --- |
| Authentication | Rollback automation service account with narrow scope: traffic router write, cache invalidation write; no model data access |
| Authorisation | Manual rollback requires authentication of requester; RBAC limits to model owner, platform on-call, product team lead, AI Governance |
| Secrets | Rollback automation uses short-lived OIDC tokens; no persistent credentials |
| Classification | Rollback audit logs contain model version IDs and trigger information; classified INTERNAL |
| Encryption | All API calls during rollback execution use TLS 1.3; audit log encrypted at rest |
| Auditability | Every rollback step logged with operator identity (or "automation"), timestamp, and outcome |
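The authorisation control for manual triggers can be sketched as a role check. The role identifiers below are hypothetical spellings of the authorised roles named in section 4.1, and `authorise_manual_rollback` is an illustrative helper, not part of any real RBAC product.

```python
from typing import Iterable

# Hypothetical role identifiers for the authorised humans named in section 4.1.
AUTHORISED_ROLES = frozenset(
    {"model_owner", "platform_on_call", "product_team_lead", "ai_governance"}
)


class NotAuthorised(Exception):
    """Raised when a manual rollback request must be rejected."""


def authorise_manual_rollback(
    identity: str, roles: Iterable[str], authenticated: bool
) -> str:
    """Return the role under which the rollback is authorised, for the audit log."""
    if not authenticated or not identity:
        raise NotAuthorised("requester must be authenticated")
    granted = AUTHORISED_ROLES.intersection(roles)
    if not granted:
        raise NotAuthorised(f"{identity} holds no role permitted to trigger rollback")
    # Deterministic choice when the requester holds several permitted roles.
    return sorted(granted)[0]
```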

8.2 OWASP LLM Top 10 Relevance

| OWASP LLM Risk | Relevance | Mitigation |
| --- | --- | --- |
| LLM01 Prompt Injection | Medium | If a prompt injection attack triggers a safety violation that causes a rollback, the rollback is the correct response; the pattern supports this |
| LLM02 Insecure Output Handling | Low | Rollback is a serving control, not an output processing control |
| LLM03 Training Data Poisoning | Low | Rollback addresses deployment-time failures; poisoning is a training-time concern |
| LLM04 Model Denial of Service | Medium | A DoS attack that triggers latency rollback threshold is a valid rollback use case; pattern supports this |
| LLM05 Supply Chain Vulnerabilities | Medium | If a supply chain compromise is detected in a deployed model, rollback is the immediate response |
| LLM06 Sensitive Information Disclosure | Low | Rollback does not address already-disclosed information; notification to affected parties is a separate incident response step |
| LLM07 Insecure Plugin Design | Low | Rollback is a serving control |
| LLM08 Excessive Agency | Medium | The rollback automation itself must not have excessive agency; it executes a defined, bounded procedure only |
| LLM09 Overreliance | Low | Rollback is a technical control, not a behavioural pattern |
| LLM10 Model Theft | Low | Rollback does not address model theft; artefact access controls are the relevant control |

9. Governance Considerations

9.1 Responsible AI

A rollback event is evidence that the model risk management process identified and responded to a defect. The RCA must address whether the defect had differential impact on any demographic subgroup and whether any users were harmed by the defective model before rollback. If harm occurred, the incident escalation process includes user notification per the Privacy Act and any applicable financial services regulation.

9.2 Model Risk Management

Rollback events are material MRM events. The defective version is flagged in the Model Register. Three rollback events for any model within 12 months triggers a model risk review: the model's validation process, deployment process, and monitoring coverage are all examined.
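The review trigger is a sliding-window count. A minimal sketch, assuming "12 months" is approximated as 365 days and that rollback timestamps are available per model; `model_risk_review_required` is a hypothetical name.

```python
from datetime import datetime, timedelta
from typing import Iterable

REVIEW_THRESHOLD = 3                    # rollback events
REVIEW_WINDOW = timedelta(days=365)     # "12 months", approximated


def model_risk_review_required(rollback_times: Iterable[datetime], now: datetime) -> bool:
    """True when at least three rollbacks fall inside the trailing window."""
    recent = [t for t in rollback_times if timedelta(0) <= now - t <= REVIEW_WINDOW]
    return len(recent) >= REVIEW_THRESHOLD
```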

9.3 Human Approval Gates

Automatic rollback does not require human approval — speed is the priority. Re-enabling a rolled-back version (re-deploying or re-initiating canary) always requires human approval with RCA evidence.

9.4 Governance Artefacts

| Artefact | Owner | Frequency | Location |
| --- | --- | --- | --- |
| Rollback Audit Log Entry | Rollback Automation | Per rollback | Immutable audit log |
| Rollback Notification Record | Notification Service | Per rollback | Incident management system |
| Root Cause Analysis | Model Owner | Per rollback | Incident management system + Model Register |
| Model Risk Register Update | AI Governance | Per rollback | Risk management system |

10. Operational Considerations

10.1 SLOs

| SLO | Target | Measurement Method |
| --- | --- | --- |
| Traffic shift from trigger to previous version | < 2 minutes | Rollback automation timing |
| Full rollback completion (with verification) | < 5 minutes | Rollback automation timing end-to-end |
| Rollback notification delivery | < 3 minutes | Notification service delivery receipt |
| RCA initiation | < 4 hours post-rollback | Incident management system timestamp |
| Rollback runbook test frequency | ≥ quarterly | Rollback drill calendar record |

10.2 Monitoring and Logging

Rollback capability itself must be monitored: the automation service must have a health monitor; the ability to reach the traffic router API must be tested hourly (synthetic probe); the Model Register must confirm a last-known-good version is designated at all times (alerting if none designated). If any of these fail, a P1 incident is raised: the rollback capability is compromised.
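A readiness probe over these three checks might look like the sketch below. The check names and the `rollback_readiness` helper are hypothetical; each check is any zero-argument callable returning `True` when healthy, and a check that raises is treated as failed.

```python
from typing import Callable, Dict, List


def rollback_readiness(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run each readiness check; return the names of those that failed.

    A non-empty result means the rollback capability is compromised
    and a P1 incident should be raised.
    """
    failures: List[str] = []
    for name, check in checks.items():
        try:
            healthy = bool(check())
        except Exception:
            healthy = False  # an unreachable dependency counts as a failure
        if not healthy:
            failures.append(name)
    return failures
```

The same function can back both the hourly synthetic probe and the pre-deployment gate, since it takes the checks as input.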

10.3 Incident Response

A model rollback event is itself a P2 incident by default (degradation detected and contained). It escalates to P1 if: the rollback takes longer than the 5-minute RTO target, the previous version is also degraded (cascading failure), or a content safety violation was the trigger (potential regulatory notification obligation). Incident management system creates an incident record automatically on rollback trigger.
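The escalation rule reduces to a small decision function. A minimal sketch; `incident_priority` and the trigger string are hypothetical names.

```python
RTO_SECONDS = 300  # the 5-minute rollback target


def incident_priority(
    rollback_seconds: float,
    previous_version_degraded: bool,
    trigger: str,
) -> str:
    """Return "P1" if any escalation condition holds, otherwise the default "P2"."""
    if rollback_seconds > RTO_SECONDS:
        return "P1"  # rollback missed the RTO target
    if previous_version_degraded:
        return "P1"  # cascading failure
    if trigger == "content_safety_violation":
        return "P1"  # potential regulatory notification obligation
    return "P2"
```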

10.4 Disaster Recovery

| Scenario | RPO | RTO | Recovery Procedure |
| --- | --- | --- | --- |
| No previous version available to roll back to | N/A | Manual | Emergency: serve static fallback or disable AI feature; escalate |
| Rollback automation unavailable | N/A | 15 minutes | Execute manual rollback runbook; restore automation post-incident |
| Both current and previous versions degraded | N/A | Manual | Disable AI-dependent feature; serve non-AI fallback; P1 escalation |

10.5 Capacity Planning

Rollback is rare but must succeed at any scale. Pre-warm the previous version endpoint before any new version deployment: keep the previous version serving at minimum capacity (1–2 instances) so rollback does not require cold start. This adds a small ongoing cost ($100–$500/month for most model sizes) but eliminates cold-start delay from the rollback path.


11. Cost Considerations

11.1 Cost Drivers

| Driver | Description | Relative Impact |
| --- | --- | --- |
| Previous version standby compute | Keeping previous version endpoints warm at minimum capacity | Medium |
| Rollback automation maintenance | Engineering time to maintain and test rollback automation | Medium |
| Rollback drill execution | Quarterly drill requires engineer time and temporary dual-serving cost | Low |
| RCA process | Engineer time for root cause analysis and preventive action | Medium |

11.2 Scaling Risks

If rollback is triggered during a peak traffic event, the previous version must scale up rapidly. Auto-scaling of the previous version endpoint must be configured with aggressive scale-out policies. Pre-warming eliminates scale-out latency but adds baseline cost.

11.3 Optimisations

  • Use serverless inference (Lambda + container) for previous version standby: near-zero cost at idle and fast scale-out, provided cold-start latency for the model size is measured against the rollback RTO.
  • Keep previous version artefact in hot storage in the same region as production — eliminates artefact retrieval latency.
  • Automate rollback drill as part of quarterly chaos engineering programme — no additional scheduling cost.

11.4 Indicative Cost Range

| Organisation Scale | Monthly Rollback Capability Cost | Key Assumptions |
| --- | --- | --- |
| Small (1–5 models) | $100–$500 | Previous version warm standby; serverless where possible |
| Medium (5–20 models) | $500–$3,000 | Previous versions on shared compute pool; auto-scaling |
| Large (20+ models) | $3,000–$15,000 | Dedicated previous-version compute fleet; active monitoring |

12. Trade-Off Analysis

12.1 Rollback Depth Options

| Option | Speed | Risk Coverage | Cost | Complexity | Best For |
| --- | --- | --- | --- | --- | --- |
| One version back only (this pattern) | Fastest | High | Low | Low | Most organisations; sufficient for >95% of incidents |
| Two versions back capability | Fast | Very High | Medium | Medium | High-risk models; organisations with frequent rollbacks |
| Full version history replay | Medium | Complete | High | High | Compliance-critical models; forensic investigation |
| Blue/green (parallel full deployment) | Instant | Complete | Very High | High | Mission-critical services; zero-tolerance for rollback delay |

12.2 Architectural Tensions

| Tension | Description | Resolution |
| --- | --- | --- |
| Speed vs State Consistency | Fast rollback may leave state inconsistencies (DB records, cache entries from defective version) | Accept temporary inconsistency; runbook documents post-rollback cleanup steps per model type |
| Automation vs Human Judgment | Automatic rollback is fast but may be triggered by false positives | Two consecutive windows must breach threshold before auto-rollback; single-window breach pages on-call |
| Cost vs Recovery Speed | Keeping previous version warm reduces rollback RTO but adds baseline cost | Tier previous version standby by model criticality; critical models always warm, others cold |

13. Failure Modes

| Failure | Likelihood | Impact | Detection | Recovery |
| --- | --- | --- | --- | --- |
| Previous version cold start exceeds RTO | Medium | High | Rollback timing monitor | Pre-warm policy applied retroactively; P1 for current incident |
| Rollback automation has permissions bug | Low | Critical | Rollback drill failure | Manual runbook execution; fix automation as P1 follow-up |
| Model Register unavailable at rollback time | Low | Critical | Register health monitor | Rollback automation uses cached last-known-good version reference |
| Previous version also has undetected defect | Very Low | Critical | Post-rollback metric monitoring | Disable AI feature; serve fallback; emergency escalation |
| In-flight requests produce mixed outputs | Medium | Low | User-visible inconsistency complaints | Acceptable: document in incident report; users can retry |

13.1 Cascading Failure Scenarios

If a rollback is triggered while a canary is in progress (for example, at 50% of traffic), the rollback automation must revert to the last fully promoted version, not to any version that has only reached an intermediate canary stage. This requires the Model Register to designate the last fully promoted version as last-known-good. Mitigation: the last-known-good designation is updated only when a version completes promotion to 100% of traffic, never at an intermediate canary stage.
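The mitigation can be sketched as a register that only moves its last-known-good pointer on full promotion. This is a toy in-memory model under stated assumptions: `ModelRegister` and its methods are hypothetical and do not reflect the API of MLflow, Vertex AI, or any other registry.

```python
from typing import Dict


class ModelRegister:
    """Toy register that tracks last-known-good and rollback flags."""

    def __init__(self, initial_version: str) -> None:
        self.last_known_good = initial_version
        self.flags: Dict[str, str] = {}

    def flag_rollback_required(self, version: str) -> None:
        """Mark a defective version so it cannot become last-known-good."""
        self.flags[version] = "rollback-required"

    def record_canary_stage(self, version: str, traffic_pct: int) -> None:
        """Record canary progress; promote only at exactly 100% traffic."""
        if not 0 <= traffic_pct <= 100:
            raise ValueError("traffic_pct must be between 0 and 100")
        if traffic_pct < 100:
            return  # intermediate stage: last-known-good is unchanged
        if self.flags.get(version) == "rollback-required":
            return  # flagged versions need a new approved version instead
        self.last_known_good = version

    def rollback_target(self) -> str:
        return self.last_known_good
```

With this rule, a rollback fired at the 50% canary stage always resolves to the version that last served 100% of traffic.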


14. Regulatory Considerations

| Regulation / Framework | Relevant Clause | How This Pattern Addresses It |
| --- | --- | --- |
| EU AI Act (2024/1689) | Article 9 (Risk Management System): corrective actions for high-risk AI | Rollback is the primary corrective action capability; RTO target demonstrates readiness |
| EU AI Act (2024/1689) | Article 72 (Post-market monitoring) and Article 73 (Reporting of serious incidents) | Rollback is the immediate response; RCA is the post-incident obligation |
| ISO 42001:2023 | Clause 10.2 (Nonconformity and corrective action) | Rollback event triggers mandatory corrective action per Clause 10.2 |
| NIST AI RMF (2023) | MANAGE 4.1 (Incident response for AI systems) | Rollback procedure is the AI incident response mechanism |
| APRA CPS 230 (2025) | Paragraph 42 (Incident management) / Paragraph 52 (Change management) | Rollback is the incident management procedure for model changes; RTO target satisfies Para 42 |
| Privacy Act 1988 (Cth) | APP 11: if a defective model disclosed personal information, a notification obligation may arise | Rollback audit log enables post-rollback assessment of data exposure; supports notification decision |

15. Reference Implementations

15.1 AWS

  • Traffic Shift: ALB weighted target groups; Lambda function adjusts weights via SDK call.
  • Rollback Automation: AWS Step Functions state machine; triggered by CloudWatch Alarm.
  • Cache Invalidation: ElastiCache Redis flush by version key prefix; CloudFront cache invalidation API.
  • Previous Version Standby: SageMaker Endpoint (1 instance minimum); Lambda function (serverless option).
  • Notification: SNS topic → PagerDuty integration + Slack Lambda.
  • Audit Log: CloudWatch Logs (immutable with log group retention policy); CloudTrail.

15.2 Azure

  • Traffic Shift: Azure Application Gateway backend pool weight update via ARM API.
  • Rollback Automation: Azure Logic Apps or Azure Functions; triggered by Azure Monitor alert.
  • Cache Invalidation: Azure Cache for Redis flush; Azure CDN purge API.
  • Previous Version Standby: Azure ML managed endpoint (1 instance minimum); Azure Container Apps.
  • Notification: Azure Monitor action group → PagerDuty + Teams webhook.
  • Audit Log: Azure Monitor Diagnostic Settings → immutable Log Analytics Workspace.

15.3 GCP

  • Traffic Shift: Cloud Load Balancing backend service weight update via Cloud SDK.
  • Rollback Automation: Cloud Workflows; triggered by Cloud Monitoring alerting policy.
  • Cache Invalidation: Cloud Memorystore Redis flush; Cloud CDN cache invalidation API.
  • Previous Version Standby: Vertex AI Endpoint (min replicas 1); Cloud Run (serverless).
  • Notification: Cloud Monitoring → PubSub → Cloud Function → PagerDuty + Slack.
  • Audit Log: Cloud Audit Logs → BigQuery export (immutable via dataset lock).

15.4 On-Premises / Hybrid

  • Traffic Shift: NGINX upstream weight update via NGINX Plus API; Envoy xDS routing update.
  • Rollback Automation: Argo Rollouts automated analysis + rollback; custom Kubernetes controller.
  • Cache Invalidation: Redis FLUSHDB where each version uses a dedicated logical database (otherwise SCAN with UNLINK on the version key prefix); Varnish ban by version tag.
  • Previous Version Standby: Kubernetes Deployment with minimum replica set.
  • Notification: Alertmanager → PagerDuty webhook + Slack integration.
  • Audit Log: Elasticsearch with write-once index; Kafka event sourcing for rollback events.

16. Related Patterns

| Pattern ID | Pattern Name | Relationship Type | Description |
| --- | --- | --- | --- |
| EAAPL-MDL001 | Model Versioning | Prerequisite | Rollback targets a specific previous version by version ID from the Model Register |
| EAAPL-MDL002 | Shadow Model Deployment | Sibling | Shadow evidence reduces rollback probability; rollback is the recovery when shadow validation was insufficient |
| EAAPL-MDL003 | Canary Model Release | Sibling | Canary automatic rollback invokes this pattern; canary is the primary prevention mechanism |
| EAAPL-MDL008 | Model Access Governance | Dependency | Rollback automation service account is governed by model access governance |

17. Maturity Assessment

Overall Maturity: Proven

| Dimension | Score (1–5) | Rationale |
| --- | --- | --- |
| Industry Adoption | 4 | Blue/green and rollback patterns are standard in software; model-specific rollback is mature |
| Tooling Availability | 4 | Argo Rollouts, ALB, Istio all support instant traffic shifts; automation is straightforward |
| Standards Alignment | 5 | Directly addresses APRA CPS 230 Paragraph 42, EU AI Act Articles 72 and 73, ISO 42001 Clause 10.2 |
| Implementation Complexity | 3 (medium) | Traffic shift is simple; state management during rollback is complex and organisation-specific |
| Regulatory Acceptance | 5 | Tested rollback with defined RTO is explicitly required by APRA and implied by EU AI Act |

18. Revision History

| Version | Date | Author | Summary of Changes |
| --- | --- | --- | --- |
| 1.0 | 2026-06-12 | Enterprise AI Architecture Practice | Initial publication |