[EAAPL-GOV008] AI Incident Management
Category: Governance / Operational Resilience
Sub-category: AI-Specific Incident Response
Version: 1.1
Maturity: Proven
Tags: incident-management, AI-incidents, hallucination, bias-incident, security-incident, MTTD, MTTR, regulatory-notification
Regulatory Relevance: APRA CPS230 §43–§46, APRA CPS234 §36–§37, EU AI Act Article 72, Privacy Act APP 11, NIST AI RMF MANAGE 3.1
1. Executive Summary
The AI Incident Management pattern implements a specialised incident management lifecycle for AI system failures. AI incidents differ fundamentally from conventional software incidents: their failure modes include hallucination, statistical bias drift, adversarial attack, data poisoning, and emergent behaviour. None of these appear in standard ITIL incident taxonomies, and standard ITSM runbooks do not address them.
The pattern extends enterprise incident management with an AI-specific incident taxonomy, AI-specific MTTD/MTTR targets, AI-specialist escalation paths, and regulatory notification workflows triggered by specific incident categories. It integrates with the AI Audit Trail (GOV007) for forensic evidence, the Model Bias Detection pipeline (GOV006) for bias incident detection, and the AI Policy Enforcement layer (GOV004) for security incident evidence.
For CIOs and CTOs, the critical strategic outcome is compliance with regulatory notification obligations. APRA CPS230 §43 requires notification of material incidents within 72 hours. EU AI Act Article 72 requires providers to report serious incidents to market surveillance authorities without undue delay. Without a structured AI incident management process, organisations risk missing notification windows, a governance failure that compounds the original incident.
2. Problem Statement
Business Problem
AI-specific failures are not recognised, categorised, or escalated appropriately through standard incident management processes. Hallucination incidents are logged as "system behaved unexpectedly" with no understanding of systemic risk. Bias incidents are not recognised as incidents at all. Regulatory notification obligations are missed because the incident's regulatory significance is not assessed.
Technical Problem
Standard ITSM tools (ServiceNow, Jira Service Management) do not have AI-specific incident categories, AI-specific root cause codes, or AI-specific resolution actions. AI incidents require forensic capability (audit trail queries), model behaviour analysis (fairness metrics, output sampling), and often model rollback or scope restriction — actions not in standard incident runbooks.
Symptoms
- AI failures treated as "application bugs" with standard P3 priority regardless of customer impact
- No post-incident review for AI-specific contributing factors
- Regulatory notification obligations missed for privacy breaches caused by AI output disclosure
- Bias incidents detected by media or regulators, not internal monitoring
- Model rollback performed without documented incident process
- No AI incident trend analysis informing model risk management
Cost of Inaction
- Regulatory: Failure to notify APRA of a material security incident within 72 hours (APRA CPS234 §36); failure to report serious AI incidents (EU AI Act Article 72)
- Legal: Undocumented AI incidents creating liability for unaddressed harms
- Reputational: Reactive, unstructured response to public AI failures amplifies reputational damage
- Operational: No learning from AI incidents; same failure modes recur
3. Context
When to Apply
- Any enterprise operating AI systems in production with customer, financial, or regulatory exposure
- APRA-regulated entities (mandatory for material operational incidents per CPS230)
- EU market participants operating high-risk AI (mandatory notification per EU AI Act Article 72)
- Any organisation with stated responsible AI commitments (accountability requires incident response)
When NOT to Apply
- Internal AI tools with no customer impact and no regulatory exposure (standard ITSM applies)
- Development and testing environments (internal development incidents use standard dev processes)
Prerequisites
- AI Audit Trail (GOV007) operational — forensic evidence source
- AI Model Register (GOV001) — MRID required for impact scoping
- AI Policy Enforcement (GOV004) — security incident evidence
- Enterprise ITSM platform — AI-specific incident taxonomy integrated into existing system
Industry Applicability
| Industry | Key AI Incident Types | Notification Obligation | SLA |
|---|---|---|---|
| Banking (AU) | Hallucination (advice), Bias (credit), Security (model theft) | APRA CPS230 §43: 72h | MTTD <1h, MTTR <24h for P1 |
| Insurance (AU) | Bias (pricing), Privacy (output disclosure), Availability | APRA CPS230 §43: 72h | MTTD <2h, MTTR <48h for P1 |
| Healthcare | Hallucination (clinical), Safety (harmful recommendation) | TGA reportable event | MTTD <30min, MTTR <4h |
| Financial Services (EU) | All types | EU AI Act Article 72: without undue delay, no later than 15 days (serious) | MTTD <1h |
| Government | Bias (services), Privacy, Availability | OAIC (privacy breach) | MTTD <4h |
4. Architecture Overview
The AI Incident Management pattern extends enterprise ITSM with three AI-specific architectural additions: an AI incident taxonomy that integrates with detection sources, an AI incident playbook library that guides responders through AI-specific resolution actions, and an automated regulatory notification workflow triggered by incident classification.
AI Incident Taxonomy. The pattern defines six primary AI incident categories that do not exist in standard ITSM taxonomies:
Hallucination Incident: An AI system generates factually incorrect, fabricated, or internally inconsistent output. Severity is determined by whether the output was customer-facing, whether it could cause direct harm (medical advice, legal advice, financial advice), and whether it was acted upon before detection. Root cause investigation must determine: was this a single anomalous output or a systematic pattern? Is the hallucination triggerable by specific input patterns (vulnerability)?
Bias Incident: Statistical evidence of discriminatory model behaviour. Detected via GOV006 bias detection alerts or by external complaint/report. Severity based on: protected attribute affected, decision type (consequential vs non-consequential), affected population size, duration of bias operation. Root cause categories: training data skew, feedback loop formation, distribution shift, feature correlation with protected attribute.
Security Incident (AI-specific): Adversarial attack on AI system (prompt injection, model inversion, membership inference, model extraction). Differentiated from conventional security incidents by the AI-specific attack vector and the potential for model-layer compromise that application-layer security controls do not detect.
Availability Incident: AI system unable to provide service within SLO. AI availability failures take two distinct forms: model serving infrastructure failure, which resembles a conventional outage, and model quality degradation, where the model appears available but produces degraded outputs. The latter is an AI-specific availability failure that standard health checks miss.
Data Poisoning Incident: Evidence or strong suspicion that training data was tampered with to produce targeted model behaviour. Requires forensic investigation of training data provenance. This is the rarest incident type but carries the most severe consequences.
Privacy Disclosure Incident: AI system discloses personal information that should not be accessible, via membership inference (confirming someone's data was in the training set), training data memorisation (reproducing personal information from training data), or policy enforcement failure. Triggers Privacy Act notifiable data breach assessment.
Graduated Severity Model. Each AI incident is assigned one of four severity levels, which determines response times and notification obligations:
- P1 (Critical): Imminent or actual harm to individuals; regulatory notification likely required; immediate containment required
- P2 (High): Significant risk of harm; potential regulatory notification; containment within 4 hours
- P3 (Medium): Limited harm; no immediate regulatory obligation; investigation within 24 hours
- P4 (Low): Minor anomaly; no harm; investigation within 72 hours
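A minimal sketch of how the six categories and four severity levels might be encoded so that detection, triage, and playbook tooling share one vocabulary. The enum names, the `suggest_severity` helper, and its boolean inputs are illustrative assumptions rather than part of the pattern. The response windows come from the severity list above, with the "immediate" containment for P1 modelled as a zero-length window.

```python
from datetime import timedelta
from enum import Enum


class IncidentCategory(Enum):
    HALLUCINATION = "Hallucination Incident"
    BIAS = "Bias Incident"
    SECURITY = "Security Incident (AI-specific)"
    AVAILABILITY = "Availability Incident"
    DATA_POISONING = "Data Poisoning Incident"
    PRIVACY_DISCLOSURE = "Privacy Disclosure Incident"


class Severity(Enum):
    P1 = "Critical"
    P2 = "High"
    P3 = "Medium"
    P4 = "Low"


# P1/P2 windows are for containment; P3/P4 windows are for investigation.
RESPONSE_WINDOW = {
    Severity.P1: timedelta(0),          # immediate containment
    Severity.P2: timedelta(hours=4),
    Severity.P3: timedelta(hours=24),
    Severity.P4: timedelta(hours=72),
}


def suggest_severity(actual_or_imminent_harm: bool,
                     significant_risk_of_harm: bool,
                     limited_harm: bool) -> Severity:
    """Hypothetical helper: return the highest severity whose condition holds."""
    if actual_or_imminent_harm:
        return Severity.P1
    if significant_risk_of_harm:
        return Severity.P2
    if limited_harm:
        return Severity.P3
    return Severity.P4
```

The helper only suggests a level; in the pattern, a responder confirms the classification before the incident is opened.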
AI-Specific Resolution Actions. The playbook library defines resolution actions specific to AI incidents:
- Model Rollback: Revert to prior approved model version (requires deployment token from GOV003)
- Scope Restriction: Limit model to lower-risk use cases while investigation proceeds
- Monitoring Enhancement: Increase monitoring frequency, log sampling, and alert sensitivity
- Emergency Human Override: Route all AI decisions to human review while model is investigated
- Input Filter: Deploy emergency input filter via GOV004 policy bundle update
- Third-Party Notification: Notify upstream model provider (for third-party model incidents)
5. Architecture Diagram
6. Components
| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| Detection Aggregator | Integration | Consolidates AI incident signals from all detection sources | Event bus (Kafka, SNS); webhook receiver | Critical |
| AI Incident Triage Bot | Automation | Auto-classifies incoming signals against AI incident taxonomy; suggests severity | LLM-assisted classification (irony: AI to manage AI incidents) + rules engine | High |
| AI Incident Taxonomy Classifier | Business Logic | Enforces six-category taxonomy; maps to severity model | Configurable rules engine; ITSM integration | Critical |
| AI Incident Playbook Engine | Workflow | Delivers appropriate playbook to responder based on incident type + severity | ServiceNow Playbooks, PagerDuty Runbooks, Confluence-integrated | High |
| Response Action Orchestrator | Automation | Executes automated response actions (model restriction, monitoring enhancement) | Custom API orchestrator + GOV001/GOV004 API integration | High |
| Forensic Query Interface | Integration | Provides structured queries against GOV007 audit trail for incident investigation | Search interface to GOV007 Search Index | Critical |
| Regulatory Notification Engine | Workflow | Assesses notification obligations; drafts notifications; manages timelines | Custom workflow + ITSM + document management | Critical |
| Post-Incident Review Template | Process | Structured PIR template for AI incidents | ITSM template; Confluence page template | High |
| AI Incident Pattern Database | Knowledge Base | Accumulates incident patterns to improve detection and playbooks | Knowledge base in ITSM; custom pattern store | Medium |
7. Data Flow
P1 AI Incident Response Flow
| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | Detection Source (e.g., GOV006 bias alert) | Emits incident signal | Signal in detection aggregator |
| 2 | Triage Bot | Auto-classifies as P1 Bias Incident; recommends severity | Draft incident record with classification |
| 3 | On-Call AI Incident Responder | Confirms classification; accepts incident ownership | Incident created in ITSM with P1 priority |
| 4 | Playbook Engine | Serves Bias Incident P1 playbook | Step-by-step response guidance |
| 5 | Responder | Executes immediate containment (scope restriction or human override via response orchestrator) | Model restriction applied; customer exposure limited |
| 6 | Regulatory Notification Engine | Assesses notification obligation automatically on classification | Notification required: APRA 72h window started |
| 7 | Responder (via Forensic Query Interface) | Queries GOV007 audit trail for affected decision population | Affected decision count, date range, customer population |
| 8 | Regulatory Notification Engine | Drafts APRA notification | Draft notification for Compliance Director review |
| 9 | Responder | Performs AI-assisted RCA using audit trail data + model metadata | Root cause identified; control gap documented |
| 10 | AI Incident Lead | Conducts PIR within 5 business days | PIR report; control improvement actions; GOV001 updated |
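Steps 1 to 3 of the flow above can be sketched as follows. This is an illustration under stated assumptions: the `DraftIncident` record, the `triage` and `confirm` functions, and the rule "consequential means P1, otherwise P3" are hypothetical. Only the links from GOV006 to bias incidents and from GOV004 to security incidents come from the pattern text.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed mapping from detection source to incident category.
SOURCE_TO_CATEGORY = {
    "GOV006": "Bias Incident",
    "GOV004": "Security Incident",
}


@dataclass
class DraftIncident:
    source: str
    category: str
    suggested_severity: str
    confirmed: bool = False          # set only by a human responder (step 3)
    owner: Optional[str] = None


def triage(source: str, consequential: bool) -> DraftIncident:
    """Step 2: build a draft record; the bot never opens the incident itself."""
    category = SOURCE_TO_CATEGORY.get(source, "Unclassified")
    # Assumed severity rule, for illustration only.
    severity = "P1" if consequential else "P3"
    return DraftIncident(source, category, severity)


def confirm(draft: DraftIncident, responder: str) -> DraftIncident:
    """Step 3: the on-call responder confirms and takes ownership."""
    draft.confirmed = True
    draft.owner = responder
    return draft
```

Keeping confirmation as a separate step mirrors the flow's split between automated classification and human acceptance of ownership.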
8. Security Considerations
Incident Responder Access
- P1/P2 incidents require On-Call AI Incident Responder (24/7 coverage) with elevated read access to audit trail
- Elevated access is time-limited (incident duration + 48 hours); automatically revoked
- All actions during incident response logged with incident reference
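The time-limited access rule above (incident duration plus 48 hours) can be expressed as a small expiry check. The function names and the convention that an open incident has no closure time are assumptions made for this sketch.

```python
from datetime import datetime, timedelta
from typing import Optional

GRACE_PERIOD = timedelta(hours=48)  # from the access rule above


def access_expiry(incident_closed_at: Optional[datetime]) -> Optional[datetime]:
    """Return when elevated access lapses, or None while the incident is open."""
    if incident_closed_at is None:
        return None
    return incident_closed_at + GRACE_PERIOD


def has_elevated_access(now: datetime,
                        incident_closed_at: Optional[datetime]) -> bool:
    """Access holds for the whole incident and for 48 hours after closure."""
    expiry = access_expiry(incident_closed_at)
    return expiry is None or now < expiry
```

An access-management job could evaluate `has_elevated_access` on a schedule and revoke the grant once it returns `False`.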
OWASP LLM Top 10 Mapping
| OWASP LLM Risk | Incident Category | Response Action |
|---|---|---|
| LLM01 Prompt Injection | Security Incident | Input filter emergency deployment |
| LLM03 Training Data Poisoning | Data Poisoning Incident | Model rollback; retraining audit |
| LLM06 Sensitive Information Disclosure | Privacy Disclosure Incident | Scope restriction; OAIC notification assessment |
| LLM08 Excessive Agency | Security / Hallucination Incident | Human override activation |
| LLM09 Overreliance | Hallucination / Availability Incident | User notification; human review requirement |
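The mapping table above lends itself to a lookup that a triage tool could consult. The sketch below copies the table rows into a dictionary; the `route` function and the short risk identifiers used as keys are illustrative choices.

```python
from typing import Tuple

# Rows copied from the OWASP LLM Top 10 mapping table above:
# risk id -> (incident category, response action).
OWASP_ROUTING = {
    "LLM01": ("Security Incident", "Input filter emergency deployment"),
    "LLM03": ("Data Poisoning Incident", "Model rollback; retraining audit"),
    "LLM06": ("Privacy Disclosure Incident",
              "Scope restriction; OAIC notification assessment"),
    "LLM08": ("Security / Hallucination Incident", "Human override activation"),
    "LLM09": ("Hallucination / Availability Incident",
              "User notification; human review requirement"),
}


def route(owasp_id: str) -> Tuple[str, str]:
    """Return (incident category, response action) for a mapped OWASP risk."""
    try:
        return OWASP_ROUTING[owasp_id]
    except KeyError:
        raise ValueError(f"No incident mapping defined for {owasp_id}") from None
```

Risks that the table does not map raise an error rather than defaulting to a category, so an unmapped signal is visible instead of silently misrouted.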
9. Governance Considerations
MTTD/MTTR Targets
| Incident Type | Severity | MTTD Target | MTTR Target | Regulatory Notification Window |
|---|---|---|---|---|
| Hallucination (customer-facing) | P1 | <30 minutes | <4 hours | APRA: 72h if material |
| Bias Incident (consequential) | P1 | <1 hour | <24 hours | APRA: 72h; ASIC: assess |
| Security (adversarial attack) | P1 | <15 minutes | <4 hours | APRA CPS234: 72h |
| Privacy Disclosure | P1 | <30 minutes | <4 hours | OAIC: NDB assessment within 30 days; notify as soon as practicable |
| Availability (quality degradation) | P2 | <1 hour | <8 hours | APRA: if critical operation |
| Data Poisoning | P1 | <2 hours | <48 hours | APRA: 72h |
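A sketch of how a single incident's timings might be checked against the targets in the table above. Two assumptions are made here: time-to-detect runs from the start of the failure to detection, and time-to-resolve runs from detection to resolution. The function and field names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict

# (MTTD target, MTTR target) per incident type, copied from the table above.
TARGETS = {
    "Hallucination (customer-facing)": (timedelta(minutes=30), timedelta(hours=4)),
    "Bias Incident (consequential)": (timedelta(hours=1), timedelta(hours=24)),
    "Security (adversarial attack)": (timedelta(minutes=15), timedelta(hours=4)),
    "Privacy Disclosure": (timedelta(minutes=30), timedelta(hours=4)),
    "Availability (quality degradation)": (timedelta(hours=1), timedelta(hours=8)),
    "Data Poisoning": (timedelta(hours=2), timedelta(hours=48)),
}


def check_targets(incident_type: str, started_at: datetime,
                  detected_at: datetime, resolved_at: datetime) -> Dict[str, bool]:
    """Compare one incident's detect and resolve times with its targets."""
    mttd_target, mttr_target = TARGETS[incident_type]
    time_to_detect = detected_at - started_at
    time_to_resolve = resolved_at - detected_at
    return {
        # Targets in the table are strict upper bounds ("<").
        "detect_within_target": time_to_detect < mttd_target,
        "resolve_within_target": time_to_resolve < mttr_target,
    }
```

MTTD and MTTR are means over many incidents, so a reporting job would average the per-incident durations before comparing them with the targets; the per-incident check is useful for flagging individual breaches.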
Governance Artefacts
| Artefact | Owner | Frequency | Regulatory Linkage |
|---|---|---|---|
| AI Incident Register | CISO | Continuous | APRA CPS230 §43 |
| Post-Incident Review Reports | AI Incident Lead | Per incident | APRA CPS230 §46 |
| Regulatory Notification Log | Compliance | Per notification | APRA CPS234 §37 |
| AI Incident Trend Analysis | AI Governance | Quarterly | ISO 42001 §10.2 |
10. Operational Considerations
SLOs
| SLO | Target | Measurement |
|---|---|---|
| P1 detection-to-responder notification | <5 minutes | Per P1 incident |
| P1 containment action completion | <1 hour | Per P1 incident |
| Regulatory notification draft ready | <4 hours from P1 classification | Per P1 incident |
| PIR completion | <5 business days | Per incident |
| 24/7 AI Incident Responder coverage | 100% | Monthly |
11. Cost Considerations
Indicative Cost Range
| Component | Annual Cost |
|---|---|
| AI Incident Responder on-call coverage (0.5 FTE burdened) | AUD $80,000 |
| ITSM AI incident taxonomy customisation | AUD $15,000 one-time |
| Regulatory notification tooling | AUD $10,000/yr |
| External incident response retainer (for critical incidents) | AUD $30,000–$80,000/yr |
| Total recurring annual | ~AUD $120,000–$170,000 |
| Total first year (including one-time customisation) | ~AUD $135,000–$185,000 |
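The totals can be reproduced from the line items in the table, keeping the one-time customisation cost separate from recurring costs. The variable names are illustrative.

```python
# Line items from the cost table, in AUD.
on_call_coverage = 80_000
notification_tooling = 10_000
retainer_low, retainer_high = 30_000, 80_000
one_time_customisation = 15_000

# Recurring costs exclude the one-time ITSM customisation.
recurring_low = on_call_coverage + notification_tooling + retainer_low
recurring_high = on_call_coverage + notification_tooling + retainer_high

# First-year cost adds the one-time customisation.
first_year_low = recurring_low + one_time_customisation
first_year_high = recurring_high + one_time_customisation

print(f"Recurring annual: AUD {recurring_low:,} to {recurring_high:,}")
print(f"First year:       AUD {first_year_low:,} to {first_year_high:,}")
```

The AUD $135,000–$185,000 range therefore corresponds to the first year; later years drop by the AUD $15,000 customisation cost.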
12. Trade-Off Analysis
Option Comparison
| Option | Description | Pros | Cons | Recommended For |
|---|---|---|---|---|
| A: AI-specialised ITSM extension (this pattern) | Existing ITSM + AI taxonomy + AI playbooks | Leverages existing ITSM investment; familiar to IT teams | Customisation effort; AI-specific training required | All regulated enterprises |
| B: Separate AI incident management tool | Dedicated AI ops platform (Arthur AI, Fiddler) | Purpose-built for AI; rich forensics | Siloed from enterprise ITSM; dual management overhead | Organisations without mature ITSM |
| C: Standard ITSM only | Use existing incident process unchanged | Zero additional cost; no change management | AI incidents mis-categorised; regulatory obligations missed | Never for regulated entities with customer-facing AI |
13. Failure Modes
| Failure | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| AI incident not recognised as such (logged as app bug) | High | High — AI-specific response not executed; notification missed | Review of all customer-facing system incidents for AI component | AI detection training for all incident responders; auto-flagging from GOV004/GOV006 |
| Regulatory notification window missed | Low | Critical — APRA enforcement | Notification workflow SLA monitor | Voluntary disclosure with explanation; legal counsel engagement |
| Forensic evidence unavailable (GOV007 unavailable during incident) | Low | High — investigation impeded | GOV007 availability monitoring | Offline investigation using secondary evidence; GOV007 recovery prioritised |
| Post-incident review not completed | Medium | Medium — learning not captured | PIR completion tracking | Mandatory PIR gate before incident closure |
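The "mandatory PIR gate before incident closure" in the last row can be sketched as a guard on the close operation. The `Incident` record, `PirGateError`, and `close_incident` are hypothetical names; a real ITSM platform would enforce this through its own workflow rules.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    incident_id: str
    pir_completed: bool = False
    status: str = "open"


class PirGateError(Exception):
    """Raised when closure is attempted before the PIR is complete."""


def close_incident(incident: Incident) -> Incident:
    """Close the incident only if its post-incident review is complete."""
    if not incident.pir_completed:
        raise PirGateError(
            f"{incident.incident_id}: PIR must be completed before closure"
        )
    incident.status = "closed"
    return incident
```

Raising an error, rather than closing with a warning, keeps incidents without a review visible in open-incident reports.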
14. Regulatory Considerations
APRA CPS230
- §43: Operational incidents with material impact must be notified to APRA within 72 hours. AI incidents causing disruption to critical operations are material incidents.
- §46: Post-incident reviews required for material incidents. PIR template implements this requirement.
APRA CPS234
- §36: Cybersecurity incidents must be notified to APRA within 72 hours. AI security incidents (prompt injection attacks, model theft) are CPS234 notifiable.
- §37: Summary of cybersecurity incidents must be submitted to APRA annually.
EU AI Act
- Article 72: Providers of high-risk AI systems must report serious incidents (death/serious harm attributable to AI, fundamental rights infringements, significant property damage) to market surveillance authorities without undue delay and no later than 15 days after becoming aware.
Privacy Act 1988 / Notifiable Data Breaches (NDB)
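- §26WH–§26WL: Privacy disclosure via AI (memorisation, membership inference) triggers NDB assessment, which must be completed within 30 days (§26WH). If the breach is likely to result in serious harm, the OAIC (§26WK) and affected individuals (§26WL) must be notified as soon as practicable. The incident management workflow includes NDB assessment as a mandatory step for privacy disclosure incidents.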
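A sketch of a deadline calculator that a notification workflow could use to start the clocks described in this section. The regime labels, function names, and the choice to measure every window from the moment the organisation becomes aware are assumptions. The durations are the outer limits stated above: 72 hours for APRA, 15 days for the EU AI Act, and the 30-day NDB window.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable

# Outer limits as listed in this section; regime labels are illustrative.
NOTIFICATION_WINDOWS = {
    "APRA CPS230": timedelta(hours=72),
    "APRA CPS234": timedelta(hours=72),
    "EU AI Act": timedelta(days=15),
    "Privacy Act NDB": timedelta(days=30),
}


def notification_deadlines(aware_at: datetime,
                           regimes: Iterable[str]) -> Dict[str, datetime]:
    """Return the latest permissible time for each applicable regime."""
    return {regime: aware_at + NOTIFICATION_WINDOWS[regime] for regime in regimes}


def earliest_deadline(deadlines: Dict[str, datetime]) -> str:
    """Name the regime whose deadline falls first, to drive SLA alerts."""
    return min(deadlines, key=deadlines.get)
```

Several of these obligations also require action "as soon as possible" or "without undue delay", so the computed times are backstops for SLA monitoring rather than targets.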
15. Reference Implementations
AWS
| Component | Service |
|---|---|
| Incident Detection | CloudWatch Alarms + EventBridge |
| ITSM Integration | ServiceNow + AWS Systems Manager OpsCenter |
| Forensic Query | OpenSearch (GOV007 index) |
| Notification Workflow | Step Functions + SES |
Azure
| Component | Service |
|---|---|
| Incident Detection | Azure Monitor + Event Grid |
| ITSM Integration | ServiceNow / Jira Service Management via Azure Logic Apps |
On-Premises
| Component | Technology |
|---|---|
| ITSM | ServiceNow with custom AI incident app |
| Playbooks | Confluence + ServiceNow Playbooks |
| Forensics | GOV007 Elasticsearch query interface |
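For the forensic step, a query against the GOV007 index might look like the sketch below, which builds an Elasticsearch-style request body for counting decisions made by one model in a date range. The field names `model_id`, `decision_timestamp`, and `customer_id` are hypothetical; the actual GOV007 index schema is not defined in this pattern.

```python
import json


def affected_decisions_query(mrid: str, start_iso: str, end_iso: str) -> dict:
    """Build a query body that scopes an incident to one model and time window."""
    return {
        "size": 0,                   # only counts are needed, not documents
        "track_total_hits": True,    # request an exact decision count
        "query": {
            "bool": {
                "filter": [
                    {"term": {"model_id": mrid}},
                    {"range": {"decision_timestamp": {
                        "gte": start_iso, "lte": end_iso}}},
                ]
            }
        },
        "aggs": {
            # Approximate count of distinct customers affected.
            "affected_customers": {"cardinality": {"field": "customer_id"}}
        },
    }


if __name__ == "__main__":
    body = affected_decisions_query(
        "MRID-EXAMPLE", "2025-01-01T00:00:00Z", "2025-01-31T23:59:59Z")
    print(json.dumps(body, indent=2))
```

The total hit count gives the affected decision count and the cardinality aggregation estimates the customer population, which are the two outputs the forensic investigation step needs.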
16. Related Patterns
| Pattern | Relationship | Dependency Direction |
|---|---|---|
| EAAPL-GOV006 Model Bias Detection | Detection source — bias alerts trigger incidents | GOV006 → GOV008 |
| EAAPL-GOV007 AI Audit Trail | Forensic evidence source | GOV008 → GOV007 |
| EAAPL-GOV004 AI Policy Enforcement | Security incident evidence | GOV004 → GOV008 |
| EAAPL-CMP001 APRA CPS230 | Satisfies — §43/§46 incident obligations | GOV008 → CMP001 |
| EAAPL-CMP002 APRA CPS234 | Satisfies — §36/§37 security incident obligations | GOV008 → CMP002 |
17. Maturity Assessment
Overall Maturity: Proven (Level 3)
| Dimension | Score (1–5) | Evidence |
|---|---|---|
| Taxonomy completeness | 5 | Six AI-specific incident categories fully defined |
| Regulatory notification workflow | 4 | Three jurisdictions covered; gap is automated notification submission |
| Playbook quality | 3 | Six playbook types outlined; depth of individual playbooks varies |
| MTTD/MTTR targets | 4 | Targets defined per incident type; measurement infrastructure required |
| Post-incident learning loop | 3 | Pattern database defined; automated pattern detection not yet implemented |
18. Revision History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2024-06-01 | EAAPL Working Group | Initial publication |
| 1.1 | 2025-04-01 | EAAPL Working Group | EU AI Act Article 72 notification mapping; APRA CPS234 §36/§37 alignment; data poisoning incident category |