EAAPL: Enterprise AI Architecture Pattern Library

AI Incident Management

AI Governance · APRA CPS230 · APRA CPS234 · Field-tested in AU

[EAAPL-GOV008] AI Incident Management

Category: Governance / Operational Resilience
Sub-category: AI-Specific Incident Response
Version: 1.1
Maturity: Proven
Tags: incident-management, AI-incidents, hallucination, bias-incident, security-incident, MTTD, MTTR, regulatory-notification
Regulatory Relevance: APRA CPS230 §43–§46, APRA CPS234 §36–§37, EU AI Act Article 72, Privacy Act APP 11, NIST AI RMF MANAGE 3.1


1. Executive Summary

The AI Incident Management pattern implements a specialised incident management lifecycle for AI system failures. It recognises that AI incidents have fundamentally different characteristics from conventional software incidents: failure modes include hallucination, statistical bias drift, adversarial attack, data poisoning, and emergent behaviour—none of which appear in standard ITIL incident taxonomies and none of which standard ITSM runbooks address.

The pattern extends enterprise incident management with an AI-specific incident taxonomy, AI-specific MTTD/MTTR targets, AI-specialist escalation paths, and regulatory notification workflows triggered by specific incident categories. It integrates with the AI Audit Trail (GOV007) for forensic evidence, the Model Bias Detection pipeline (GOV006) for bias incident detection, and the AI Policy Enforcement layer (GOV004) for security incident evidence.

For CIOs and CTOs, the critical strategic outcome is compliance with regulatory notification obligations. APRA CPS230 §43 requires material incidents to be notified to APRA within 72 hours. The EU AI Act Article 72 requires serious incidents to be reported to market surveillance authorities. Without a structured AI incident management process, organisations risk missing these notification windows, a governance failure that compounds the original incident.


2. Problem Statement

Business Problem

AI-specific failures are not recognised, categorised, or escalated appropriately through standard incident management processes. Hallucination incidents are logged as "system behaved unexpectedly" with no understanding of systemic risk. Bias incidents are not recognised as incidents at all. Regulatory notification obligations are missed because the incident's regulatory significance is not assessed.

Technical Problem

Standard ITSM tools (ServiceNow, Jira Service Management) do not have AI-specific incident categories, AI-specific root cause codes, or AI-specific resolution actions. AI incidents require forensic capability (audit trail queries), model behaviour analysis (fairness metrics, output sampling), and often model rollback or scope restriction — actions not in standard incident runbooks.

Symptoms

  • AI failures treated as "application bugs" with standard P3 priority regardless of customer impact
  • No post-incident review for AI-specific contributing factors
  • Regulatory notification obligations missed for privacy breaches caused by AI output disclosure
  • Bias incidents detected by media or regulators, not internal monitoring
  • Model rollback performed without documented incident process
  • No AI incident trend analysis informing model risk management

Cost of Inaction

  • Regulatory: Failure to notify APRA of a material security incident within 72 hours breaches APRA CPS234 §36; failure to report serious AI incidents breaches EU AI Act Article 72
  • Legal: Undocumented AI incidents creating liability for unaddressed harms
  • Reputational: Reactive, unstructured response to public AI failures amplifies reputational damage
  • Operational: No learning from AI incidents; same failure modes recur

3. Context

When to Apply

  • Any enterprise operating AI systems in production with customer, financial, or regulatory exposure
  • APRA-regulated entities (mandatory for material operational incidents per CPS230)
  • EU market participants operating high-risk AI (mandatory notification per EU AI Act Article 72)
  • Any organisation with stated responsible AI commitments (accountability requires incident response)

When NOT to Apply

  • Internal AI tools with no customer impact and no regulatory exposure (standard ITSM applies)
  • Development and testing environments (internal development incidents use standard dev processes)

Prerequisites

  • AI Audit Trail (GOV007) operational — forensic evidence source
  • AI Model Register (GOV001) — MRID required for impact scoping
  • AI Policy Enforcement (GOV004) — security incident evidence
  • Enterprise ITSM platform — AI-specific incident taxonomy integrated into existing system

Industry Applicability

| Industry | Key AI Incident Types | Notification Obligation | SLA |
|---|---|---|---|
| Banking (AU) | Hallucination (advice), Bias (credit), Security (model theft) | APRA CPS230 §43: 72h | MTTD <1h, MTTR <24h for P1 |
| Insurance (AU) | Bias (pricing), Privacy (output disclosure), Availability | APRA CPS230 §43: 72h | MTTD <2h, MTTR <48h for P1 |
| Healthcare | Hallucination (clinical), Safety (harmful recommendation) | TGA reportable event | MTTD <30min, MTTR <4h |
| Financial Services (EU) | All types | EU AI Act Article 72: without undue delay, no later than 15 days (serious incidents) | MTTD <1h |
| Government | Bias (services), Privacy, Availability | OAIC (privacy breach) | MTTD <4h |

4. Architecture Overview

The AI Incident Management pattern extends enterprise ITSM with three AI-specific architectural additions: an AI incident taxonomy that integrates with detection sources, an AI incident playbook library that guides responders through AI-specific resolution actions, and an automated regulatory notification workflow triggered by incident classification.

AI Incident Taxonomy. The pattern defines six primary AI incident categories that do not exist in standard ITSM taxonomies:

Hallucination Incident: An AI system generates factually incorrect, fabricated, or internally inconsistent output. Severity is determined by whether the output was customer-facing, whether it could cause direct harm (medical advice, legal advice, financial advice), and whether it was acted upon before detection. Root cause investigation must determine: was this a single anomalous output or a systematic pattern? Is the hallucination triggerable by specific input patterns (vulnerability)?

Bias Incident: Statistical evidence of discriminatory model behaviour. Detected via GOV006 bias detection alerts or by external complaint/report. Severity based on: protected attribute affected, decision type (consequential vs non-consequential), affected population size, duration of bias operation. Root cause categories: training data skew, feedback loop formation, distribution shift, feature correlation with protected attribute.

Security Incident (AI-specific): Adversarial attack on AI system (prompt injection, model inversion, membership inference, model extraction). Differentiated from conventional security incidents by the AI-specific attack vector and the potential for model-layer compromise that application-layer security controls do not detect.

Availability Incident: AI system unable to provide service within SLO. Although similar to conventional availability incidents, AI availability failures take two distinct forms: model serving infrastructure failure, and model quality degradation (the model appears available but is producing degraded outputs). The latter is an AI-specific availability failure that standard health checks miss.
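
The difference between an infrastructure failure and a quality degradation can be made explicit in a health check. The following is a minimal sketch, assuming a hypothetical rolling `quality_score` in [0, 1] (for example, derived from output sampling) and an illustrative `quality_floor` of 0.8; neither the names nor the threshold come from this pattern.

```python
from dataclasses import dataclass


@dataclass
class HealthSample:
    endpoint_up: bool      # result of a conventional liveness probe
    quality_score: float   # hypothetical rolling output-quality metric in [0, 1]


def classify_availability(sample: HealthSample, quality_floor: float = 0.8) -> str:
    """Separate serving failures from silent quality degradation."""
    if not sample.endpoint_up:
        return "infrastructure_failure"
    if sample.quality_score < quality_floor:
        # The endpoint answers, but outputs are degraded: a standard
        # health check would report this system as healthy.
        return "quality_degradation"
    return "healthy"
```

A liveness probe alone would only ever distinguish the first branch from the other two.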

Data Poisoning Incident: Evidence or strong suspicion that training data was tampered with to produce targeted model behaviour. Requires forensic investigation of training data provenance. This is the rarest incident type but carries the most severe consequences.

Privacy Disclosure Incident: AI system discloses personal information that should not be accessible, via membership inference (confirming someone's data was in the training set), training data memorisation (reproducing personal information from training data), or policy enforcement failure. Triggers Privacy Act notifiable data breach assessment.

Graduated Severity Model. Each AI incident is assigned one of four severity levels, which determine response times and notification obligations:

  • P1 (Critical): Imminent or actual harm to individuals; regulatory notification likely required; immediate containment required
  • P2 (High): Significant risk of harm; potential regulatory notification; containment within 4 hours
  • P3 (Medium): Limited harm; no immediate regulatory obligation; investigation within 24 hours
  • P4 (Low): Minor anomaly; no harm; investigation within 72 hours
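
The taxonomy and severity model above could be encoded as follows. This is a sketch only: the enum and function names are illustrative, and "immediate containment" for P1 is represented as a zero-length window, which is an assumption about how a team might model it.

```python
from datetime import datetime, timedelta
from enum import Enum


class IncidentCategory(Enum):
    HALLUCINATION = "hallucination"
    BIAS = "bias"
    SECURITY = "security"
    AVAILABILITY = "availability"
    DATA_POISONING = "data_poisoning"
    PRIVACY_DISCLOSURE = "privacy_disclosure"


class Severity(Enum):
    P1 = 1
    P2 = 2
    P3 = 3
    P4 = 4


# (required activity, time allowed), taken from the severity list above.
RESPONSE_WINDOW = {
    Severity.P1: ("containment", timedelta(0)),        # immediate
    Severity.P2: ("containment", timedelta(hours=4)),
    Severity.P3: ("investigation", timedelta(hours=24)),
    Severity.P4: ("investigation", timedelta(hours=72)),
}


def response_due(severity: Severity, opened_at: datetime) -> tuple:
    """Return the required activity and the time by which it is due."""
    activity, window = RESPONSE_WINDOW[severity]
    return activity, opened_at + window
```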

AI-Specific Resolution Actions. The playbook library defines resolution actions specific to AI incidents:

  • Model Rollback: Revert to prior approved model version (requires deployment token from GOV003)
  • Scope Restriction: Limit model to lower-risk use cases while investigation proceeds
  • Monitoring Enhancement: Increase monitoring frequency, log sampling, and alert sensitivity
  • Emergency Human Override: Route all AI decisions to human review while model is investigated
  • Input Filter: Deploy emergency input filter via GOV004 policy bundle update
  • Third-Party Notification: Notify upstream model provider (for third-party model incidents)

5. Architecture Diagram

flowchart TD
    subgraph Detection["Detection Sources"]
        A[Monitoring + Bias Alerts]
        B[Customer or Regulator Reports]
    end
    subgraph Triage["Triage and Classification"]
        C[AI Incident Classifier]
        D{Severity P1-P4}
    end
    subgraph Response["Response and Learning"]
        E[Playbook Execution]
        F[Forensic Investigation]
        G[Post-Incident Review]
    end
    H[Regulatory Notification]
    I[GOV001 Incident Record]
    A --> C
    B --> C
    C --> D
    D -->|P1-P2 containment| E
    D -->|P1 regulatory trigger| H
    E --> F
    F --> G
    G --> I
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#dbeafe,stroke:#3b82f6
    style C fill:#f0fdf4,stroke:#22c55e
    style D fill:#f3e8ff,stroke:#a855f7
    style E fill:#f0fdf4,stroke:#22c55e
    style F fill:#f0fdf4,stroke:#22c55e
    style G fill:#d1fae5,stroke:#10b981
    style H fill:#fee2e2,stroke:#ef4444
    style I fill:#fef9c3,stroke:#eab308

6. Components

| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| Detection Aggregator | Integration | Consolidates AI incident signals from all detection sources | Event bus (Kafka, SNS); webhook receiver | Critical |
| AI Incident Triage Bot | Automation | Auto-classifies incoming signals against AI incident taxonomy; suggests severity | LLM-assisted classification (irony: AI to manage AI incidents) + rules engine | High |
| AI Incident Taxonomy Classifier | Business Logic | Enforces six-category taxonomy; maps to severity model | Configurable rules engine; ITSM integration | Critical |
| AI Incident Playbook Engine | Workflow | Delivers appropriate playbook to responder based on incident type + severity | ServiceNow Playbooks, PagerDuty Runbooks, Confluence-integrated | High |
| Response Action Orchestrator | Automation | Executes automated response actions (model restriction, monitoring enhancement) | Custom API orchestrator + GOV001/GOV004 API integration | High |
| Forensic Query Interface | Integration | Provides structured queries against GOV007 audit trail for incident investigation | Search interface to GOV007 Search Index | Critical |
| Regulatory Notification Engine | Workflow | Assesses notification obligations; drafts notifications; manages timelines | Custom workflow + ITSM + document management | Critical |
| Post-Incident Review Template | Process | Structured PIR template for AI incidents | ITSM template; Confluence page template | High |
| AI Incident Pattern Database | Knowledge Base | Accumulates incident patterns to improve detection and playbooks | Knowledge base in ITSM; custom pattern store | Medium |

7. Data Flow

P1 AI Incident Response Flow

| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | Detection Source (e.g., GOV006 bias alert) | Emits incident signal | Signal in detection aggregator |
| 2 | Triage Bot | Auto-classifies as P1 Bias Incident; recommends severity | Draft incident record with classification |
| 3 | On-Call AI Incident Responder | Confirms classification; accepts incident ownership | Incident created in ITSM with P1 priority |
| 4 | Playbook Engine | Serves Bias Incident P1 playbook | Step-by-step response guidance |
| 5 | Responder | Executes immediate containment (scope restriction or human override via response orchestrator) | Model restriction applied; customer exposure limited |
| 6 | Regulatory Assessment | Automated assessment of notification obligation triggered | Notification required: APRA 72h window started |
| 7 | Forensic Investigation | Queries GOV007 audit trail for affected decision population | Affected decision count, date range, customer population |
| 8 | Notification Draft | Regulatory Notification Engine drafts APRA notification | Draft notification for Compliance Director review |
| 9 | Root Cause Analysis | AI-assisted RCA using audit trail data + model metadata | Root cause identified; control gap documented |
| 10 | Post-Incident Review | PIR conducted within 5 business days | PIR report; control improvement actions; GOV001 updated |
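
Steps 2 and 3 of this flow imply that the triage bot only proposes a classification and that a human responder must confirm it before an ITSM incident exists. A minimal sketch of that hand-off is shown below; the `DraftIncident` class and `create_itsm_record` function are hypothetical and do not correspond to any real ITSM API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DraftIncident:
    signal_source: str
    suggested_category: str
    suggested_severity: str
    confirmed_by: Optional[str] = None  # set when a responder accepts ownership

    def confirm(self, responder: str) -> "DraftIncident":
        self.confirmed_by = responder
        return self


def create_itsm_record(draft: DraftIncident) -> dict:
    """Create the incident record only after human confirmation (step 3)."""
    if draft.confirmed_by is None:
        raise ValueError("classification must be confirmed by a responder first")
    return {
        "source": draft.signal_source,
        "category": draft.suggested_category,
        "priority": draft.suggested_severity,
        "owner": draft.confirmed_by,
    }
```

Keeping the confirmation explicit means an incorrect automated classification cannot start a regulatory clock on its own.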

8. Security Considerations

Incident Responder Access

  • P1/P2 incidents require On-Call AI Incident Responder (24/7 coverage) with elevated read access to audit trail
  • Elevated access is time-limited (incident duration + 48 hours); automatically revoked
  • All actions during incident response logged with incident reference
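
The time-limited access rule (incident duration plus 48 hours) reduces to a simple check. The sketch below assumes the incident's closure timestamp is available and treats an open incident as having no closure time; the function name is illustrative.

```python
from datetime import datetime, timedelta
from typing import Optional

ACCESS_GRACE = timedelta(hours=48)  # grace period after the incident closes


def access_is_valid(now: datetime, incident_closed_at: Optional[datetime]) -> bool:
    """Elevated audit-trail access lasts for the incident plus 48 hours."""
    if incident_closed_at is None:
        return True  # incident still open
    return now <= incident_closed_at + ACCESS_GRACE
```

A scheduled job evaluating this check could drive the automatic revocation.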

OWASP LLM Top 10 Mapping

| OWASP LLM Risk | Incident Category | Response Action |
|---|---|---|
| LLM01 Prompt Injection | Security Incident | Input filter emergency deployment |
| LLM03 Training Data Poisoning | Data Poisoning Incident | Model rollback; retraining audit |
| LLM06 Sensitive Information Disclosure | Privacy Disclosure Incident | Scope restriction; OAIC notification assessment |
| LLM08 Excessive Agency | Security / Hallucination Incident | Human override activation |
| LLM09 Overreliance | Hallucination / Availability Incident | User notification; human review requirement |
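
A playbook engine could hold this mapping as a lookup table. The sketch below copies the rows above into a dictionary; the `playbook_for` helper is a hypothetical name.

```python
# OWASP LLM risk ID -> (incident category, first response action), from the table above.
OWASP_PLAYBOOK = {
    "LLM01": ("Security Incident", "Input filter emergency deployment"),
    "LLM03": ("Data Poisoning Incident", "Model rollback; retraining audit"),
    "LLM06": ("Privacy Disclosure Incident", "Scope restriction; OAIC notification assessment"),
    "LLM08": ("Security / Hallucination Incident", "Human override activation"),
    "LLM09": ("Hallucination / Availability Incident", "User notification; human review requirement"),
}


def playbook_for(risk_id: str) -> tuple:
    """Look up the incident category and response action for an OWASP risk ID."""
    key = risk_id.strip().upper()
    if key not in OWASP_PLAYBOOK:
        raise KeyError(f"no AI incident mapping defined for {risk_id!r}")
    return OWASP_PLAYBOOK[key]
```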

9. Governance Considerations

MTTD/MTTR Targets

| Incident Type | Severity | MTTD Target | MTTR Target | Regulatory Notification Window |
|---|---|---|---|---|
| Hallucination (customer-facing) | P1 | <30 minutes | <4 hours | APRA: 72h if material |
| Bias Incident (consequential) | P1 | <1 hour | <24 hours | APRA: 72h; ASIC: assess |
| Security (adversarial attack) | P1 | <15 minutes | <4 hours | APRA CPS234: 72h |
| Privacy Disclosure | P1 | <30 minutes | <4 hours | OAIC: 30 days (NDB) |
| Availability (quality degradation) | P2 | <1 hour | <8 hours | APRA: if critical operation |
| Data Poisoning | P1 | <2 hours | <48 hours | APRA: 72h |
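
Measuring an individual incident against these targets needs three timestamps. The sketch below assumes time-to-detect runs from occurrence to detection and time-to-resolve from detection to resolution; the pattern does not define the measurement points, so this is an assumption, as are the dictionary keys.

```python
from datetime import datetime, timedelta

# incident type -> (MTTD target, MTTR target), from the table above.
TARGETS = {
    "hallucination": (timedelta(minutes=30), timedelta(hours=4)),
    "bias": (timedelta(hours=1), timedelta(hours=24)),
    "security": (timedelta(minutes=15), timedelta(hours=4)),
    "privacy_disclosure": (timedelta(minutes=30), timedelta(hours=4)),
    "availability": (timedelta(hours=1), timedelta(hours=8)),
    "data_poisoning": (timedelta(hours=2), timedelta(hours=48)),
}


def check_targets(incident_type: str, occurred_at: datetime,
                  detected_at: datetime, resolved_at: datetime) -> dict:
    """Compare one incident's detection and resolution times with its targets."""
    mttd_target, mttr_target = TARGETS[incident_type]
    time_to_detect = detected_at - occurred_at
    time_to_resolve = resolved_at - detected_at  # assumed to start at detection
    return {
        "time_to_detect": time_to_detect,
        "time_to_resolve": time_to_resolve,
        "mttd_met": time_to_detect < mttd_target,
        "mttr_met": time_to_resolve < mttr_target,
    }
```

Averaging `time_to_detect` and `time_to_resolve` across incidents of one type gives the MTTD and MTTR figures used in trend analysis.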

Governance Artefacts

| Artefact | Owner | Frequency | Regulatory Linkage |
|---|---|---|---|
| AI Incident Register | CISO | Continuous | APRA CPS230 §43 |
| Post-Incident Review Reports | AI Incident Lead | Per incident | APRA CPS230 §46 |
| Regulatory Notification Log | Compliance | Per notification | APRA CPS234 §37 |
| AI Incident Trend Analysis | AI Governance | Quarterly | ISO 42001 §10.2 |

10. Operational Considerations

SLOs

| SLO | Target | Measurement |
|---|---|---|
| P1 detection-to-responder notification | <5 minutes | Per P1 incident |
| P1 containment action completion | <1 hour | Per P1 incident |
| Regulatory notification draft ready | <4 hours from P1 classification | Per P1 incident |
| PIR completion | <5 business days | Per incident |
| 24/7 AI Incident Responder coverage | 100% | Monthly |
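
The PIR SLO is expressed in business days, so its due date cannot be computed by simple date addition. The sketch below counts Monday to Friday only and ignores public holidays, which is a simplifying assumption; the function name is illustrative.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Return the date `days` business days after `start` (weekends skipped)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current
```

For an incident opened on Friday 4 April 2025, `add_business_days(date(2025, 4, 4), 5)` gives Friday 11 April 2025 as the latest PIR completion date.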

11. Cost Considerations

Indicative Cost Range

| Component | Cost |
|---|---|
| AI Incident Responder on-call coverage (0.5 FTE burdened) | AUD $80,000/yr |
| ITSM AI incident taxonomy customisation | AUD $15,000 one-time |
| Regulatory notification tooling | AUD $10,000/yr |
| External incident response retainer (for critical incidents) | AUD $30,000–$80,000/yr |
| Total (first year, including one-time customisation) | ~AUD $135,000–$185,000 |
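
The stated total of AUD $135,000–$185,000 reconciles with the line items only when the one-time AUD $15,000 customisation is included, so it is best read as a first-year figure. The arithmetic can be checked as follows; the constant and function names are illustrative.

```python
# Line items from the cost table (AUD).
ON_CALL = 80_000
TAXONOMY_ONE_TIME = 15_000
NOTIFICATION_TOOLING = 10_000
RETAINER_LOW, RETAINER_HIGH = 30_000, 80_000


def cost_range(include_one_time: bool) -> tuple:
    """Return the (low, high) cost range, with or without the one-time item."""
    base = ON_CALL + NOTIFICATION_TOOLING
    if include_one_time:
        base += TAXONOMY_ONE_TIME
    return base + RETAINER_LOW, base + RETAINER_HIGH
```

On these figures, recurring years without the one-time item come to AUD $120,000–$170,000.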

12. Trade-Off Analysis

Option Comparison

| Option | Description | Pros | Cons | Recommended For |
|---|---|---|---|---|
| A: AI-specialised ITSM extension (this pattern) | Existing ITSM + AI taxonomy + AI playbooks | Leverages existing ITSM investment; familiar to IT teams | Customisation effort; AI-specific training required | All regulated enterprises |
| B: Separate AI incident management tool | Dedicated AI ops platform (Arthur AI, Fiddler) | Purpose-built for AI; rich forensics | Siloed from enterprise ITSM; dual management overhead | Organisations without mature ITSM |
| C: Standard ITSM only | Use existing incident process unchanged | Zero additional cost; no change management | AI incidents mis-categorised; regulatory obligations missed | Never for regulated entities with customer-facing AI |

13. Failure Modes

| Failure | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| AI incident not recognised as such (logged as app bug) | High | High: AI-specific response not executed; notification missed | Review of all customer-facing system incidents for AI component | AI detection training for all incident responders; auto-flagging from GOV004/GOV006 |
| Regulatory notification window missed | Low | Critical: APRA enforcement | Notification workflow SLA monitor | Voluntary disclosure with explanation; legal counsel engagement |
| Forensic evidence unavailable (GOV007 unavailable during incident) | Low | High: investigation impeded | GOV007 availability monitoring | Offline investigation using secondary evidence; GOV007 recovery prioritised |
| Post-incident review not completed | Medium | Medium: learning not captured | PIR completion tracking | Mandatory PIR gate before incident closure |
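
The "mandatory PIR gate before incident closure" recovery control can be enforced in the closure step itself. A minimal sketch follows, assuming the incident is a dictionary with a hypothetical `pir_completed` flag.

```python
class IncidentClosureError(Exception):
    """Raised when an incident cannot yet be closed."""


def close_incident(incident: dict) -> dict:
    """Refuse closure until the post-incident review has been completed."""
    if not incident.get("pir_completed", False):
        raise IncidentClosureError(
            f"incident {incident.get('id', '<unknown>')} has no completed PIR"
        )
    # Return a new record rather than mutating the caller's copy.
    return {**incident, "status": "closed"}
```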

14. Regulatory Considerations

APRA CPS230

  • §43: Operational incidents with material impact must be notified to APRA as soon as possible and no later than 72 hours after the entity becomes aware of them. AI incidents that disrupt critical operations are material incidents.
  • §46: Post-incident reviews required for material incidents. PIR template implements this requirement.

APRA CPS234

  • §36: Material information security incidents must be notified to APRA as soon as possible and no later than 72 hours after the entity becomes aware of them. AI security incidents (prompt injection attacks, model theft) are CPS234 notifiable.
  • §37: Summary of cybersecurity incidents must be submitted to APRA annually.

EU AI Act

  • Article 72: Providers of high-risk AI systems must report serious incidents (death/serious harm attributable to AI, fundamental rights infringements, significant property damage) to market surveillance authorities without undue delay and no later than 15 days after becoming aware.

Privacy Act 1988 / Notifiable Data Breaches (NDB)

  • §26WK: Privacy disclosure via AI (memorisation, membership inference) triggers an NDB assessment, which should be completed within 30 days. If the breach is likely to cause serious harm, the OAIC and affected individuals must be notified. The incident management workflow includes the NDB assessment as a mandatory step for privacy disclosure incidents.
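
The windows cited in this section can be turned into concrete deadlines once the moment of awareness is recorded. The sketch below is illustrative only: the regime keys are hypothetical labels, the durations are those stated in this pattern, and whether a regime applies to a given incident remains a compliance decision, not something this code determines.

```python
from datetime import datetime, timedelta

# Windows as stated in this pattern; keys are hypothetical labels.
NOTIFICATION_WINDOWS = {
    "APRA_CPS230": timedelta(hours=72),
    "APRA_CPS234": timedelta(hours=72),
    "EU_AI_ACT": timedelta(days=15),
    "OAIC_NDB": timedelta(days=30),
}


def notification_deadlines(aware_at: datetime, regimes: list) -> dict:
    """Map each applicable regime to its latest permitted date."""
    return {regime: aware_at + NOTIFICATION_WINDOWS[regime] for regime in regimes}
```

An SLA monitor could alert when the current time approaches any returned deadline without a logged submission.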

15. Reference Implementations

AWS

| Component | Service |
|---|---|
| Incident Detection | CloudWatch Alarms + EventBridge |
| ITSM Integration | ServiceNow + AWS Systems Manager OpsCenter |
| Forensic Query | OpenSearch (GOV007 index) |
| Notification Workflow | Step Functions + SES |

Azure

| Component | Service |
|---|---|
| Incident Detection | Azure Monitor + Event Grid |
| ITSM Integration | ServiceNow / Jira Service Management via Azure Logic Apps |

On-Premises

| Component | Technology |
|---|---|
| ITSM | ServiceNow with custom AI incident app |
| Playbooks | Confluence + ServiceNow Playbooks |
| Forensics | GOV007 Elasticsearch query interface |

16. Related Patterns

| Pattern | Relationship | Dependency Direction |
|---|---|---|
| EAAPL-GOV006 Model Bias Detection | Detection source: bias alerts trigger incidents | GOV006 → GOV008 |
| EAAPL-GOV007 AI Audit Trail | Forensic evidence source | GOV008 → GOV007 |
| EAAPL-GOV004 AI Policy Enforcement | Security incident evidence | GOV004 → GOV008 |
| EAAPL-CMP001 APRA CPS230 | Satisfies: §43/§46 incident obligations | GOV008 → CMP001 |
| EAAPL-CMP002 APRA CPS234 | Satisfies: §36/§37 security incident obligations | GOV008 → CMP002 |

17. Maturity Assessment

Overall Maturity: Proven (Level 3)

| Dimension | Score (1–5) | Evidence |
|---|---|---|
| Taxonomy completeness | 5 | Six AI-specific incident categories fully defined |
| Regulatory notification workflow | 4 | Three jurisdictions covered; gap is automated notification submission |
| Playbook quality | 3 | Six playbook types outlined; depth of individual playbooks varies |
| MTTD/MTTR targets | 4 | Targets defined per incident type; measurement infrastructure required |
| Post-incident learning loop | 3 | Pattern database defined; automated pattern detection not yet implemented |

18. Revision History

| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2024-06-01 | EAAPL Working Group | Initial publication |
| 1.1 | 2025-04-01 | EAAPL Working Group | EU AI Act Article 72 notification mapping; APRA CPS234 §36/§37 alignment; data poisoning incident category |