EAAPL-MDL004 — Model Rollback
| Attribute | Value |
|---|---|
| Pattern ID | EAAPL-MDL004 |
| Name | Model Rollback |
| Maturity | Proven |
| Complexity | Medium |
| Tags | model-risk disaster-recovery high-availability medium-complexity |
| Last Reviewed | 2026-06-12 |
| Owner | Enterprise AI Architecture Practice |
1. Executive Summary
Model rollback is the capability to revert production AI model serving to a previously approved, known-good model version within a defined time target, typically less than five minutes from decision to full traffic on the previous version. It is the safety net for every model deployment pattern: canary release (EAAPL-MDL003) relies on automated rollback for its blast-radius control, and shadow deployment (EAAPL-MDL002) is only justified by the assurance that a bad promotion can be quickly reversed. Without a tested, rehearsed rollback capability, model upgrades carry existential risk: a defective model serving 100% of production traffic cannot be recovered without operational chaos.

For CIOs, rollback is a non-negotiable resilience capability that should appear in Business Continuity Plans for AI-dependent services. For CTOs, it is the technical prerequisite that makes model iteration velocity safe: teams can deploy confidently when they know recovery is fast and rehearsed. For risk officers, tested rollback capability supports compliance with APRA CPS 230 Paragraph 42 (incident management) and EU AI Act Article 9 risk management requirements.

The pattern encompasses not just the traffic shift mechanism but also state management during rollback, consumer notification, and the mandatory post-rollback investigation process.
2. Problem Statement
2.1 Business Problem
When a new model version causes a production incident — quality regression, increased error rate, safety violation, or regulatory non-compliance — every second of delay in restoration has customer impact. Organisations without a defined rollback procedure improvise under pressure, making mistakes that extend the incident. Executive teams cannot provide accurate time-to-resolution estimates because no target exists.
2.2 Technical Problem
Model serving infrastructure is stateful: there may be in-flight requests processing under the new model, cached responses that need invalidation, session state tied to the new model version, database records created by the new model, and downstream consumers that have cached the new model's endpoint or schema. A naive traffic switch reverts serving but does not address any of these complications, leaving the system in an inconsistent state.
2.3 Symptoms
- The organisation has no defined rollback procedure for model upgrades — it would be "figured out if needed."
- The last model rollback took 45+ minutes and required escalation to senior engineers.
- Rollback procedures exist on paper but have never been tested in a non-emergency context.
- State management during rollback is unclear — in-flight requests during the switch are silently dropped or completed with mixed model versions.
2.4 Cost of Inaction
| Category | Indicative Impact |
|---|---|
| Availability | Extended incident duration (30–120 min improvised vs < 5 min rehearsed rollback) |
| Customer Impact | Every additional minute of degraded AI service has measurable NPS and churn impact |
| Regulatory | Inability to respond to a model safety violation within a defined window may constitute a breach of risk management obligations under EU AI Act Article 9 |
| Reputational | Protracted visible incidents generate media attention; fast recovery is invisible |
3. Context
3.1 When to Apply
- Any production deployment of an AI model that has business, customer, or regulatory significance.
- As a companion to canary release (EAAPL-MDL003) — rollback is the automatic response to canary threshold breach.
- Organisations that have defined RTO targets for AI model serving.
- Regulated environments where the organisation must demonstrate the capability to rapidly cease or revert AI operation.
3.2 When NOT to Apply
- Models embedded in batch pipelines where "rollback" means reprocessing a batch (handled separately as a data pipeline concern).
- Models deployed in embedded/edge devices where over-the-air updates are the recovery mechanism (different latency class).
- Training pipelines — rollback in this context refers to reverting to a previous training run, not serving.
3.3 Prerequisites
| Prerequisite | Detail |
|---|---|
| Model versioning (EAAPL-MDL001) | Previous version artefact must be registered and retrievable |
| Last-known-good version designation | Model Register must designate the current known-good version at all times |
| Traffic routing infrastructure | Capable of instant weight changes (< 60 seconds for routing updates) |
| Rollback automation | Automated system that can execute rollback without manual infrastructure steps |
| Rollback runbook (tested) | Written and rehearsed procedure; last test < 90 days ago |
3.4 Industry Applicability
| Industry | Applicability | Primary Driver |
|---|---|---|
| Financial Services | Critical | Regulatory obligation to cease defective algorithmic operations |
| Healthcare | Critical | Patient safety; immediate cessation of harmful AI recommendation |
| Government | Critical | Accountability for citizen-facing AI; auditability |
| Technology Platforms | High | SLA obligations to enterprise API consumers |
| E-commerce | High | Revenue protection during model incidents |
| Media / Content | High | Content moderation continuity during model change incidents |
4. Architecture Overview
4.1 Rollback Trigger Conditions
Rollback is triggered by two paths: automatic and manual.
Automatic triggers are metric-driven and fire without human intervention:
- Error rate exceeds the rollback threshold defined at version registration (typically production error rate + 0.5 percentage points) for two consecutive 5-minute evaluation windows.
- Latency p99 exceeds the rollback threshold (typically production p99 + 50%) for two consecutive windows.
- Any content safety violation is detected in production model outputs.
- Any security alert indicates that the model is producing outputs consistent with a prompt injection attack.
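
The automatic trigger rules can be expressed as a small evaluation function. The following is a minimal sketch, not a definitive implementation: the names `Thresholds`, `Window`, and `evaluate_trigger` are hypothetical, and it assumes that safety and security signals fire on a single occurrence while metric breaches require consecutive windows.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Thresholds:
    """Rollback thresholds derived from the production baseline (illustrative)."""
    baseline_error_rate: float       # fraction, e.g. 0.010 for 1.0%
    baseline_p99_ms: float
    error_margin: float = 0.005      # + 0.5 percentage points
    latency_factor: float = 1.5      # + 50%
    consecutive_windows: int = 2

    @property
    def max_error_rate(self) -> float:
        return self.baseline_error_rate + self.error_margin

    @property
    def max_p99_ms(self) -> float:
        return self.baseline_p99_ms * self.latency_factor


@dataclass(frozen=True)
class Window:
    """Metrics aggregated over one 5-minute evaluation window."""
    error_rate: float
    p99_ms: float
    safety_violations: int = 0
    injection_alerts: int = 0


def evaluate_trigger(windows: Sequence[Window], t: Thresholds) -> Optional[str]:
    """Return the automatic rollback reason, or None if no rule fires.

    `windows` is ordered oldest to newest.
    """
    if not windows:
        return None
    latest = windows[-1]
    # Assumption: safety and security signals fire on a single occurrence.
    if latest.safety_violations > 0:
        return "content_safety_violation"
    if latest.injection_alerts > 0:
        return "prompt_injection_alert"
    # Metric rules need every one of the most recent N windows to breach.
    if len(windows) < t.consecutive_windows:
        return None
    recent = windows[-t.consecutive_windows:]
    if all(w.error_rate > t.max_error_rate for w in recent):
        return "error_rate"
    if all(w.p99_ms > t.max_p99_ms for w in recent):
        return "latency_p99"
    return None
```

A single breaching window returns `None` here; paging on-call for that case would sit in the caller.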
Manual triggers are initiated by an authorised human: model owner, platform on-call, product team lead, or AI Governance. Manual triggers are appropriate for: product team feedback indicating unexpected quality regression not captured by automated metrics; regulatory concern raised by compliance team; upstream model vendor safety advisory; any other situation where a human has direct evidence of a model problem not yet reflected in metrics.
All trigger events — automatic and manual — are logged to the immutable audit trail with: trigger type, triggering metric or human identity, current model version, target rollback version, and timestamp.
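
One possible shape for the audit record is shown below, carrying exactly the fields listed above. This is a sketch under assumptions: the class name `RollbackTriggerEvent`, the helper names, and the JSON-lines serialisation are illustrative choices, and immutability of the stored log is the responsibility of the log store, not of this code.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class RollbackTriggerEvent:
    trigger_type: str       # "automatic" or "manual"
    triggered_by: str       # metric name, or identity of the human requester
    current_version: str
    target_version: str
    timestamp: str          # ISO 8601, UTC

    def __post_init__(self) -> None:
        if self.trigger_type not in ("automatic", "manual"):
            raise ValueError(f"unknown trigger type: {self.trigger_type}")
        if self.current_version == self.target_version:
            raise ValueError("target version must differ from current version")


def new_trigger_event(trigger_type: str, triggered_by: str,
                      current_version: str, target_version: str,
                      now: Optional[datetime] = None) -> RollbackTriggerEvent:
    """Build an event, stamping it with the current UTC time by default."""
    now = now or datetime.now(timezone.utc)
    return RollbackTriggerEvent(trigger_type, triggered_by,
                                current_version, target_version, now.isoformat())


def to_audit_line(event: RollbackTriggerEvent) -> str:
    """Serialise to one JSON line with stable key order for append-only storage."""
    return json.dumps(asdict(event), sort_keys=True)
```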
4.2 Rollback Execution Procedure
Step 1 — Traffic shift initiation (< 1 minute): The rollback automation retrieves the designated last-known-good version from the Model Register. It updates the traffic router to direct 100% of traffic to the previous version. The current (defective) version receives 0% traffic. This step is fully automated and requires no manual infrastructure action.
Step 2 — In-flight request drain (< 2 minutes): In-flight requests that reached the defective model before the traffic shift are allowed to complete within a configurable drain window (default 30 seconds). After drain timeout, remaining in-flight connections are closed with a retriable error response. Consumers should implement retry with idempotency keys (see EAAPL-INF policy on idempotency).
Step 3 — Serving verification (< 2 minutes): The rollback automation queries the health endpoint of the previous version model and verifies a sample inference returns a valid response. Only after verification does it emit a "rollback complete" event.
Step 4 — Notification (immediate, parallel with verification): On rollback initiation, automated notifications are sent to: model owner, platform on-call, AI Governance (for high-risk models), registered downstream consumers. Notification includes: version rolled back from, version rolled back to, trigger type, and incident reference number.
Total target: < 5 minutes from trigger to 100% of traffic on previous version with verification.
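
The four steps can be sketched as a single orchestration function. This is an illustration under assumptions rather than the pattern's reference implementation: `RollbackDeps` and its callables are hypothetical integration points that would wrap the real Model Register, traffic router, drain manager, and notification service, and notification is sent at initiation as one reading of Step 4.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RollbackDeps:
    """Hypothetical integration points; each wraps a real system in practice."""
    get_last_known_good: Callable[[], str]
    set_traffic: Callable[[str, int], None]     # (version, percent)
    drain: Callable[[str, float], int]          # (version, timeout_s) -> connections force-closed
    healthy: Callable[[str], bool]
    sample_inference_ok: Callable[[str], bool]
    notify: Callable[[str], None]


@dataclass
class RollbackResult:
    target_version: str
    completed: bool
    elapsed_s: float
    within_target: bool
    events: List[str] = field(default_factory=list)


def execute_rollback(current: str, deps: RollbackDeps,
                     drain_timeout_s: float = 30.0, target_s: float = 300.0,
                     clock: Callable[[], float] = time.monotonic) -> RollbackResult:
    start = clock()
    events: List[str] = []
    target = deps.get_last_known_good()

    # Step 4 (assumed to start at initiation): tell owners and consumers.
    deps.notify(f"rollback initiated: {current} -> {target}")
    events.append("notified")

    # Step 1: shift all traffic to the last-known-good version.
    deps.set_traffic(target, 100)
    deps.set_traffic(current, 0)
    events.append("traffic_shifted")

    # Step 2: let in-flight requests finish, then close what remains.
    force_closed = deps.drain(current, drain_timeout_s)
    events.append(f"drained(force_closed={force_closed})")

    # Step 3: emit completion only after health and sample inference pass.
    ok = deps.healthy(target) and deps.sample_inference_ok(target)
    events.append("rollback_complete" if ok else "verification_failed")

    elapsed = clock() - start
    return RollbackResult(target, ok, elapsed, elapsed < target_s, events)
```

Injecting `clock` and the dependency callables keeps the procedure testable in a quarterly drill without touching production routing.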
4.3 State Management During Rollback
State management is the most complex aspect of model rollback. Four categories of state require handling:
Cached responses: If a response cache exists downstream, entries produced by the defective version must be invalidated. The rollback automation publishes a cache invalidation event keyed by the version ID. Downstream caches that honour this event purge defective-version entries.
Session state: Users whose sessions were assigned to the defective version (per canary sticky sessions) must be migrated to the previous version. Session assignments in the distributed cache are updated by the rollback automation — the migration is seamless to the user.
Database records created by the new version: This is the hardest category. If the new model version wrote records to a production database with a format or schema unique to the new version, rollback to the previous serving version does not automatically rollback the database records. The model deployment process must document what database writes the new version performed, and the rollback runbook must include the appropriate data remediation step (which may range from "no action required" to a targeted data migration). For models with significant database impact, the deployment decision must account for rollback data complexity.
Downstream consumer caches: Consumers who have cached the new model's responses or endpoints receive rollback notification and must implement their own invalidation. The rollback notification event includes sufficient context for consumers to identify potentially stale data.
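
Two of these categories, cached responses and session state, lend themselves to a short illustration. The sketch below uses in-memory stand-ins and assumes each cache entry records the model version that produced it; `VersionedCache` and `migrate_sessions` are hypothetical names, and a production cache would apply the same idea through its own invalidation mechanism.

```python
from typing import Dict, Optional, Tuple


class VersionedCache:
    """In-memory stand-in for a response cache that tags entries by model version."""

    def __init__(self) -> None:
        self._entries: Dict[str, Tuple[str, str]] = {}   # key -> (version, response)

    def put(self, key: str, version: str, response: str) -> None:
        self._entries[key] = (version, response)

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        return entry[1] if entry else None

    def invalidate_version(self, version: str) -> int:
        """Purge every entry produced by `version`; return the number purged."""
        stale = [k for k, (v, _) in self._entries.items() if v == version]
        for k in stale:
            del self._entries[k]
        return len(stale)


def migrate_sessions(assignments: Dict[str, str], defective: str, target: str) -> int:
    """Reassign sessions pinned to the defective version; return the number moved."""
    moved = 0
    for session_id, version in assignments.items():
        if version == defective:
            assignments[session_id] = target   # value update only, keys unchanged
            moved += 1
    return moved
```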
4.4 Post-Rollback Investigation
A rollback event mandates a root cause analysis (RCA). The RCA must be initiated within 4 hours of rollback completion and delivered within 5 business days. The RCA documents: what went wrong (technical root cause), why it was not caught by pre-deployment testing (shadow/canary gap analysis), what the customer impact was (users affected × duration × severity), and what changes will prevent recurrence (process, tooling, or test coverage). The RCA is stored in the incident management system and reviewed by AI Governance. The defective model version is flagged in the Model Register as "rollback-required" — it cannot be re-deployed without addressing the RCA findings and producing a new approved version.
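
The two RCA deadlines can be computed mechanically from the rollback completion time. The sketch below assumes both deadlines are measured from rollback completion and that business days are Monday to Friday with no public holiday calendar; the function names are illustrative.

```python
from datetime import datetime, timedelta
from typing import Dict


def add_business_days(start: datetime, days: int) -> datetime:
    """Advance by `days` weekdays (Monday to Friday); holidays are not considered."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:    # 0 = Monday ... 4 = Friday
            remaining -= 1
    return current


def rca_deadlines(rollback_completed_at: datetime) -> Dict[str, datetime]:
    """Return the initiation and delivery deadlines for the RCA."""
    return {
        "initiate_by": rollback_completed_at + timedelta(hours=4),
        "deliver_by": add_business_days(rollback_completed_at, 5),
    }
```

For a rollback completed on Thursday 2026-06-11 at 10:00, this yields initiation by 14:00 the same day and delivery by Thursday 2026-06-18 at 10:00.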
5. Architecture Diagram
6. Components
| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| Rollback Automation | Automation | Executes rollback procedure: traffic shift, drain, verification, notification | Custom Lambda/Cloud Function, Argo Rollouts, custom Kubernetes operator | Critical |
| Traffic Router | Infrastructure | Shifts traffic weights between model versions; must update in < 60 seconds | AWS ALB, Istio, NGINX, Envoy | Critical |
| Model Register | Platform Service | Provides last-known-good version reference; records rollback events | MLflow, custom registry, Vertex AI | Critical |
| Response Cache | Data Store | Caches model responses; must support version-keyed invalidation events | Redis, Memcached, Varnish, CloudFront | High |
| Notification Service | Integration | Sends rollback notifications to owners, governance, consumers | PagerDuty, OpsGenie, Slack, email | High |
| Drain Manager | Infrastructure | Manages in-flight request completion during traffic shift | Load balancer connection draining, custom graceful shutdown | High |
| Incident Management System | Governance | Records rollback event; tracks RCA; stores findings | PagerDuty, ServiceNow, Jira | Medium |
7. Data Flow
7.1 Primary Flow
| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | Monitor / Human | Initiates rollback (automatic threshold breach or manual trigger) | Rollback trigger event with trigger type, model version |
| 2 | Rollback Automation | Queries Model Register for last-known-good version | Previous version artefact reference and health endpoint |
| 3 | Rollback Automation | Updates traffic router: 100% to previous version, 0% to current | Routing configuration updated; current version in drain |
| 4 | Drain Manager | Allows in-flight requests to complete (30-second window) | All in-flight requests resolved; current version idle |
| 5 | Rollback Automation | Publishes cache invalidation event for current version | Downstream caches begin purging current-version entries |
| 6 | Rollback Automation | Verifies previous version health and sample inference | Health confirmed; rollback-complete event emitted |
| 7 | Notification Service | Sends rollback notification to all registered parties | Notifications delivered; incident created |
| 8 | Model Register | Records rollback event: from version, to version, trigger, timestamp | Immutable audit log entry |
| 9 | Model Owner | Initiates RCA within 4 hours; delivers within 5 business days | RCA document; Model Register updated with rollback-required flag |
7.2 Error Flow
| Error Scenario | Detection | Recovery Action |
|---|---|---|
| Previous version health check fails | Health probe returns unhealthy | Attempt the next most recent approved version; raise P1 alert; manual investigation |
| Traffic router does not update in < 60 seconds | Timeout on routing API call | Retry 3×; escalate to infrastructure P1; manual routing override |
| Cache invalidation not honoured | Stale cache serves defective responses | Force flush via admin API; extend incident duration estimate |
| Rollback automation itself fails | Automation health monitor | Manual execution of rollback runbook; automation fix as P1 follow-up |
| Drain timeout exceeded with stuck requests | Drain timer alarm | Force-close remaining connections; accept retriable error for affected users |
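
The routing failure row in the table above (retry, then escalate, then manual override) can be sketched as follows. The code assumes "Retry 3×" means three attempts in total with a simple linear backoff; `update_routing_with_retry`, `RoutingUpdateFailed`, and the `escalate` callback are hypothetical names.

```python
import time
from typing import Callable, Optional


class RoutingUpdateFailed(Exception):
    """All automated attempts failed; a manual routing override is required."""


def update_routing_with_retry(update: Callable[[], None],
                              escalate: Callable[[str], None],
                              attempts: int = 3,
                              backoff_s: float = 2.0,
                              sleep: Callable[[float], None] = time.sleep) -> int:
    """Call `update` until it succeeds; return the attempt number that succeeded.

    After the final failed attempt, page the infrastructure on-call through
    `escalate` and raise so the caller can switch to the manual runbook.
    """
    last_error: Optional[Exception] = None
    for attempt in range(1, attempts + 1):
        try:
            update()
            return attempt
        except Exception as exc:      # a routing API timeout is the expected case
            last_error = exc
            if attempt < attempts:
                sleep(backoff_s * attempt)
    escalate(f"routing update failed after {attempts} attempts: {last_error}")
    raise RoutingUpdateFailed(str(last_error))
```

The backoff values should be sized so that all attempts still fit inside the 60-second routing budget.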
8. Security Considerations
8.1 Controls Summary
| Domain | Control |
|---|---|
| Authentication | Rollback automation service account with narrow scope: traffic router write, cache invalidation write; no model data access |
| Authorisation | Manual rollback requires authentication of requester; RBAC limits to model owner, platform on-call, AI Governance |
| Secrets | Rollback automation uses short-lived OIDC tokens; no persistent credentials |
| Classification | Rollback audit logs contain model version IDs and trigger information — INTERNAL classification |
| Encryption | All API calls during rollback execution use TLS 1.3; audit log encrypted at rest |
| Auditability | Every rollback step logged with operator identity (or "automation"), timestamp, and outcome |
8.2 OWASP LLM Top 10 Relevance
| OWASP LLM Risk | Relevance | Mitigation |
|---|---|---|
| LLM01 Prompt Injection | Medium | If a prompt injection attack triggers a safety violation that causes a rollback, the rollback is the correct response — the pattern supports this |
| LLM02 Insecure Output Handling | Low | Rollback is a serving control, not an output processing control |
| LLM03 Training Data Poisoning | Low | Rollback addresses deployment-time failures; poisoning is a training-time concern |
| LLM04 Model Denial of Service | Medium | A DoS attack that triggers latency rollback threshold is a valid rollback use case; pattern supports this |
| LLM05 Supply Chain Vulnerabilities | Medium | If a supply chain compromise is detected in a deployed model, rollback is the immediate response |
| LLM06 Sensitive Information Disclosure | Low | Rollback does not address already-disclosed information; notification to affected parties is a separate incident response step |
| LLM07 Insecure Plugin Design | Low | Rollback is a serving control |
| LLM08 Excessive Agency | Medium | The rollback automation itself must not have excessive agency — it executes a defined, bounded procedure only |
| LLM09 Overreliance | Low | Rollback is a technical control, not a behavioural pattern |
| LLM10 Model Theft | Low | Rollback does not address model theft; artefact access controls are the relevant control |
9. Governance Considerations
9.1 Responsible AI
A rollback event is evidence that the model risk management process identified and responded to a defect. The RCA must address whether the defect had differential impact on any demographic subgroup and whether any users were harmed by the defective model before rollback. If harm occurred, the incident escalation process includes user notification per the Privacy Act and any applicable financial services regulation.
9.2 Model Risk Management
Rollback events are material MRM events. The defective version is flagged in the Model Register. Three rollback events for any model within 12 months trigger a model risk review: the model's validation process, deployment process, and monitoring coverage are all examined.
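
The review threshold is a sliding-window count over rollback timestamps. A minimal sketch, assuming 12 months is approximated as 365 days and that `needs_risk_review` is a hypothetical helper fed from the rollback audit log:

```python
from datetime import datetime, timedelta
from typing import Iterable


def needs_risk_review(rollback_times: Iterable[datetime], now: datetime,
                      limit: int = 3, window_days: int = 365) -> bool:
    """True when at least `limit` rollbacks fall inside the trailing window."""
    cutoff = now - timedelta(days=window_days)
    recent = [t for t in rollback_times if cutoff <= t <= now]
    return len(recent) >= limit
```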
9.3 Human Approval Gates
Automatic rollback does not require human approval — speed is the priority. Re-enabling a rolled-back version (re-deploying or re-initiating canary) always requires human approval with RCA evidence.
9.4 Governance Artefacts
| Artefact | Owner | Frequency | Location |
|---|---|---|---|
| Rollback Audit Log Entry | Rollback Automation | Per rollback | Immutable audit log |
| Rollback Notification Record | Notification Service | Per rollback | Incident management system |
| Root Cause Analysis | Model Owner | Per rollback | Incident management system + Model Register |
| Model Risk Register Update | AI Governance | Per rollback | Risk management system |
10. Operational Considerations
10.1 SLOs
| SLO | Target | Measurement Method |
|---|---|---|
| Traffic shift from trigger to previous version | < 2 minutes | Rollback automation timing |
| Full rollback completion (with verification) | < 5 minutes | Rollback automation timing end-to-end |
| Rollback notification delivery | < 3 minutes | Notification service delivery receipt |
| RCA initiation | < 4 hours post-rollback | Incident management system timestamp |
| Rollback runbook test frequency | ≥ quarterly | Rollback drill calendar record |
10.2 Monitoring and Logging
Rollback capability itself must be monitored: the automation service must have a health monitor; the ability to reach the traffic router API must be tested hourly (synthetic probe); the Model Register must confirm a last-known-good version is designated at all times (alerting if none designated). If any of these fail, a P1 incident is raised: the rollback capability is compromised.
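
These three checks can be combined into one readiness probe that runs on a schedule. The sketch below is illustrative: the probe callables are hypothetical wrappers around the automation health endpoint, the traffic router API, and the Model Register, and a probe that raises is treated as a failure.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ReadinessReport:
    failures: List[str]

    @property
    def raise_p1(self) -> bool:
        """Any failed check means the rollback capability is compromised."""
        return bool(self.failures)


def check_rollback_readiness(automation_healthy: Callable[[], bool],
                             router_api_reachable: Callable[[], bool],
                             last_known_good: Callable[[], Optional[str]]) -> ReadinessReport:
    failures: List[str] = []
    probes = [("automation_health", automation_healthy),
              ("router_api_probe", router_api_reachable)]
    for name, probe in probes:
        try:
            ok = probe()
        except Exception:
            ok = False               # an erroring probe counts as a failed check
        if not ok:
            failures.append(name)
    try:
        version = last_known_good()
    except Exception:
        version = None
    if not version:
        failures.append("no_last_known_good_version")
    return ReadinessReport(failures)
```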
10.3 Incident Response
A model rollback event is itself a P2 incident by default (degradation detected and contained). It escalates to P1 if: the rollback takes longer than the 5-minute RTO target, the previous version is also degraded (cascading failure), or a content safety violation was the trigger (potential regulatory notification obligation). The incident management system creates an incident record automatically on rollback trigger.
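
The priority rule is simple enough to encode directly, which keeps incident creation consistent across automatic and manual rollbacks. A minimal sketch, with a hypothetical function name and trigger label:

```python
def classify_rollback_incident(elapsed_s: float,
                               previous_version_degraded: bool,
                               trigger: str,
                               rto_s: float = 300.0) -> str:
    """Return "P1" if any escalation condition holds, otherwise the default "P2"."""
    if elapsed_s > rto_s:
        return "P1"                  # rollback exceeded the 5-minute RTO target
    if previous_version_degraded:
        return "P1"                  # cascading failure
    if trigger == "content_safety_violation":
        return "P1"                  # potential regulatory notification obligation
    return "P2"
```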
10.4 Disaster Recovery
| Scenario | RPO | RTO | Recovery Procedure |
|---|---|---|---|
| No previous version available to roll back to | N/A | Manual | Emergency: serve static fallback or disable AI feature; escalate |
| Rollback automation unavailable | N/A | 15 minutes | Execute manual rollback runbook; restore automation post-incident |
| Both current and previous versions degraded | N/A | Manual | Disable AI-dependent feature; serve non-AI fallback; P1 escalation |
10.5 Capacity Planning
Rollback is rare but must succeed at any scale. Pre-warm the previous version endpoint before any new version deployment: keep the previous version serving at minimum capacity (1–2 instances) so rollback does not require cold start. This adds a small ongoing cost ($100–$500/month for most model sizes) but eliminates cold-start delay from the rollback path.
11. Cost Considerations
11.1 Cost Drivers
| Driver | Description | Relative Impact |
|---|---|---|
| Previous version standby compute | Keeping previous version endpoints warm at minimum capacity | Medium |
| Rollback automation maintenance | Engineering time to maintain and test rollback automation | Medium |
| Rollback drill execution | Quarterly drill requires engineer time and temporary dual-serving cost | Low |
| RCA process | Engineer time for root cause analysis and preventive action | Medium |
11.2 Scaling Risks
If rollback is triggered during a peak traffic event, the previous version must scale up rapidly. Auto-scaling of the previous version endpoint must be configured with aggressive scale-out policies. Pre-warming eliminates scale-out latency but adds baseline cost.
11.3 Optimisations
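- Use serverless inference (Lambda + container) for previous version standby on lower-criticality models: near-zero cost at idle, but cold-start latency applies on first invocation, so confirm in a drill that it fits within the rollback RTO.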
- Keep previous version artefact in hot storage in the same region as production — eliminates artefact retrieval latency.
- Automate rollback drill as part of quarterly chaos engineering programme — no additional scheduling cost.
11.4 Indicative Cost Range
| Organisation Scale | Monthly Rollback Capability Cost | Key Assumptions |
|---|---|---|
| Small (1–5 models) | $100–$500 | Previous version warm standby; serverless where possible |
| Medium (5–20 models) | $500–$3,000 | Previous versions on shared compute pool; auto-scaling |
| Large (20+ models) | $3,000–$15,000 | Dedicated previous-version compute fleet; active monitoring |
12. Trade-Off Analysis
12.1 Rollback Depth Options
| Option | Speed | Risk Coverage | Cost | Complexity | Best For |
|---|---|---|---|---|---|
| One version back only (this pattern) | Fastest | High | Low | Low | Most organisations; sufficient for >95% of incidents |
| Two versions back capability | Fast | Very High | Medium | Medium | High-risk models; organisations with frequent rollbacks |
| Full version history replay | Medium | Complete | High | High | Compliance-critical models; forensic investigation |
| Blue/green (parallel full deployment) | Instant | Complete | Very High | High | Mission-critical services; zero-tolerance for rollback delay |
12.2 Architectural Tensions
| Tension | Description | Resolution |
|---|---|---|
| Speed vs State Consistency | Fast rollback may leave state inconsistencies (DB records, cache entries from defective version) | Accept temporary inconsistency; runbook documents post-rollback cleanup steps per model type |
| Automation vs Human Judgment | Automatic rollback is fast but may be triggered by false positives | Two consecutive windows must breach threshold before auto-rollback; single-window breach pages on-call |
| Cost vs Recovery Speed | Keeping previous version warm reduces rollback RTO but adds baseline cost | Tier previous version standby by model criticality; critical models always warm, others cold |
13. Failure Modes
| Failure | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| Previous version cold start exceeds RTO | Medium | High | Rollback timing monitor | Pre-warm policy applied retroactively; P1 for current incident |
| Rollback automation has permissions bug | Low | Critical | Rollback drill failure | Manual runbook execution; fix automation as P1 follow-up |
| Model Register unavailable at rollback time | Low | Critical | Register health monitor | Rollback automation uses cached last-known-good version reference |
| Previous version also has undetected defect | Very Low | Critical | Post-rollback metric monitoring | Disable AI feature; serve fallback; emergency escalation |
| In-flight requests produce mixed outputs | Medium | Low | User-visible inconsistency complaints | Acceptable: document in incident report; users can retry |
13.1 Cascading Failure Scenarios
If a rollback is triggered while a canary is in progress at 50%, the rollback automation reverts to whatever version the Model Register designates as last-known-good. That designation must therefore point to the last fully promoted version, never to a version that is still mid-canary. Mitigation: the last-known-good version is updated only on completion of a full 100% promotion, not at any intermediate canary stage.
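
The mitigation amounts to an invariant on the Model Register: the last-known-good pointer moves only when a promotion completes. The sketch below illustrates that invariant with a hypothetical in-memory `ModelRegister`; a real register would persist this state and record each transition in the audit log.

```python
from typing import Optional


class ModelRegister:
    """Tracks the last-known-good version separately from any canary in flight."""

    def __init__(self, initial_version: str) -> None:
        self.last_known_good = initial_version
        self.canary: Optional[str] = None
        self.canary_weight = 0

    def start_canary(self, version: str) -> None:
        self.canary = version
        self.canary_weight = 0

    def set_canary_weight(self, percent: int) -> None:
        if self.canary is None:
            raise RuntimeError("no canary in progress")
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self.canary_weight = percent     # last_known_good is deliberately untouched

    def complete_promotion(self) -> None:
        """Only a completed 100% promotion moves the last-known-good pointer."""
        if self.canary is None or self.canary_weight != 100:
            raise RuntimeError("promotion requires a canary at 100% traffic")
        self.last_known_good = self.canary
        self.canary = None
        self.canary_weight = 0

    def rollback_target(self) -> str:
        return self.last_known_good
```

With `v6` promoted and `v7` at 50% canary, `rollback_target()` still returns `v6`; it returns `v7` only after `complete_promotion()` succeeds.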
14. Regulatory Considerations
| Regulation / Framework | Relevant Clause | How This Pattern Addresses It |
|---|---|---|
| EU AI Act (2024/1689) | Article 9 (Risk Management System) — corrective actions for high-risk AI | Rollback is the primary corrective action capability; RTO target demonstrates readiness |
| EU AI Act (2024/1689) | Article 72 (Post-market monitoring) and Article 73 (Reporting of serious incidents): obligation to address serious incidents | Rollback is the immediate response; RCA is the post-incident obligation |
| ISO 42001:2023 | Clause 10.2 (Nonconformity and corrective action) | Rollback event triggers mandatory corrective action per Clause 10.2 |
| NIST AI RMF (2023) | MANAGE 4.1 (Incident response for AI systems) | Rollback procedure is the AI incident response mechanism |
| APRA CPS 230 (2025) | Paragraph 42 (Incident management) / Paragraph 52 (Change management) | Rollback is the incident management procedure for model changes; RTO target satisfies Para 42 |
| Privacy Act 1988 (Cth) | APP 11 — if defective model disclosed personal information, notification obligation may arise | Rollback audit log enables post-rollback assessment of data exposure; supports notification decision |
15. Reference Implementations
15.1 AWS
- Traffic Shift: ALB weighted target groups; Lambda function adjusts weights via SDK call.
- Rollback Automation: AWS Step Functions state machine; triggered by CloudWatch Alarm.
- Cache Invalidation: ElastiCache Redis deletion of keys by version prefix (SCAN plus UNLINK, since Redis has no native flush-by-prefix command); CloudFront cache invalidation API.
- Previous Version Standby: SageMaker Endpoint (1 instance minimum); Lambda function (serverless option).
- Notification: SNS topic → PagerDuty integration + Slack Lambda.
- Audit Log: CloudWatch Logs (set a log group retention policy and restrict delete permissions; a retention policy alone controls expiry, not immutability); CloudTrail.
15.2 Azure
- Traffic Shift: Azure Application Gateway backend pool weight update via ARM API.
- Rollback Automation: Azure Logic Apps or Azure Functions; triggered by Azure Monitor alert.
- Cache Invalidation: Azure Cache for Redis flush; Azure CDN purge API.
- Previous Version Standby: Azure ML managed endpoint (1 instance minimum); Azure Container Apps.
- Notification: Azure Monitor action group → PagerDuty + Teams webhook.
- Audit Log: Azure Monitor Diagnostic Settings → immutable Log Analytics Workspace.
15.3 GCP
- Traffic Shift: Cloud Load Balancing backend service weight update via Cloud SDK.
- Rollback Automation: Cloud Workflows; triggered by Cloud Monitoring alerting policy.
- Cache Invalidation: Cloud Memorystore Redis flush; Cloud CDN cache invalidation API.
- Previous Version Standby: Vertex AI Endpoint (min replicas 1); Cloud Run (serverless).
- Notification: Cloud Monitoring → PubSub → Cloud Function → PagerDuty + Slack.
- Audit Log: Cloud Audit Logs → BigQuery export (immutable via dataset lock).
15.4 On-Premises / Hybrid
- Traffic Shift: NGINX upstream weight update via NGINX Plus API; Envoy xDS routing update.
- Rollback Automation: Argo Rollouts automated analysis + rollback; custom Kubernetes controller.
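- Cache Invalidation: Redis FLUSHDB where each version uses its own logical database (FLUSHDB clears the entire selected database), otherwise SCAN plus UNLINK on the version key prefix; Varnish ban by version tag.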
- Previous Version Standby: Kubernetes Deployment with minimum replica set.
- Notification: Alertmanager → PagerDuty webhook + Slack integration.
- Audit Log: Elasticsearch with write-once index; Kafka event sourcing for rollback events.
16. Related Patterns
| Pattern ID | Pattern Name | Relationship Type | Description |
|---|---|---|---|
| EAAPL-MDL001 | Model Versioning | Prerequisite | Rollback targets a specific previous version by version ID from the Model Register |
| EAAPL-MDL002 | Shadow Model Deployment | Sibling | Shadow evidence reduces rollback probability; rollback is the recovery when shadow validation was insufficient |
| EAAPL-MDL003 | Canary Model Release | Sibling | Canary automatic rollback invokes this pattern; canary is the primary prevention mechanism |
| EAAPL-MDL008 | Model Access Governance | Dependency | Rollback automation service account is governed by model access governance |
17. Maturity Assessment
Overall Maturity: Proven
| Dimension | Score (1–5) | Rationale |
|---|---|---|
| Industry Adoption | 4 | Blue/green and rollback patterns are standard in software; model-specific rollback is mature |
| Tooling Availability | 4 | Argo Rollouts, ALB, Istio all support instant traffic shifts; automation is straightforward |
| Standards Alignment | 5 | Directly addresses APRA CPS 230 Paragraph 42, EU AI Act Article 72, ISO 42001 Clause 10.2 |
| Implementation Complexity | 3 (medium) | Traffic shift is simple; state management during rollback is complex and organisation-specific |
| Regulatory Acceptance | 5 | Tested rollback with defined RTO is explicitly required by APRA and implied by EU AI Act |
18. Revision History
| Version | Date | Author | Summary of Changes |
|---|---|---|---|
| 1.0 | 2026-06-12 | Enterprise AI Architecture Practice | Initial publication |