EAAPL-GOV008 Proven

AI Incident Management

AI Governance, APRA CPS230, APRA CPS234, Field-tested in AU

Category: Governance / Operational Resilience
Sub-category: AI-Specific Incident Response
Version: 1.1
Maturity: Proven
Tags: incident-management, AI-incidents, hallucination, bias-incident, security-incident, MTTD, MTTR, regulatory-notification
Regulatory Relevance: APRA CPS230 §43–§46, APRA CPS234 §36–§37, EU AI Act Article 72, Privacy Act APP 11, NIST AI RMF MANAGE 3.1

1. Executive Summary

The AI Incident Management pattern implements a specialised incident management lifecycle for AI system failures. AI incidents differ fundamentally from conventional software incidents: their failure modes include hallucination, statistical bias drift, adversarial attack, data poisoning, and emergent behaviour. None of these appear in standard ITIL incident taxonomies, and standard ITSM runbooks do not address them.

The pattern extends enterprise incident management with an AI-specific incident taxonomy, AI-specific MTTD/MTTR targets, AI-specialist escalation paths, and regulatory notification workflows triggered by specific incident categories. It integrates with the AI Audit Trail (GOV007) for forensic evidence, the Model Bias Detection pipeline (GOV006) for bias incident detection, and the AI Policy Enforcement layer (GOV004) for security incident evidence.

For CIOs and CTOs, the critical strategic outcome is compliance with regulatory notification obligations. APRA CPS230 §43 requires notification of material incidents within 72 hours. EU AI Act Article 72 requires serious incidents to be reported to market surveillance authorities. Without a structured AI incident management process, organisations are likely to miss notification windows, a governance failure that compounds the original incident.

2. Problem Statement

Business Problem

AI-specific failures are not recognised, categorised, or escalated appropriately through standard incident management processes. Hallucination incidents are logged as "system behaved unexpectedly" with no understanding of systemic risk. Bias incidents are not recognised as incidents at all. Regulatory notification obligations are missed because the incident's regulatory significance is not assessed.

Technical Problem

Standard ITSM tools (ServiceNow, Jira Service Management) do not have AI-specific incident categories, AI-specific root cause codes, or AI-specific resolution actions. AI incidents require forensic capability (audit trail queries), model behaviour analysis (fairness metrics, output sampling), and often model rollback or scope restriction — actions not in standard incident runbooks.

Symptoms

AI failures treated as "application bugs" with standard P3 priority regardless of customer impact
No post-incident review for AI-specific contributing factors
Regulatory notification obligations missed for privacy breaches via AI disclosure
Bias incidents detected by media or regulators, not internal monitoring
Model rollback performed without documented incident process
No AI incident trend analysis informing model risk management

Cost of Inaction

Regulatory: APRA CPS234 §36: failure to notify APRA of material security incident within 72 hours. EU AI Act Article 72: failure to notify of serious AI incidents
Legal: Undocumented AI incidents creating liability for unaddressed harms
Reputational: Reactive, unstructured response to public AI failures amplifies reputational damage
Operational: No learning from AI incidents; same failure modes recur

3. Context

When to Apply

Any enterprise operating AI systems in production with customer, financial, or regulatory exposure
APRA-regulated entities (mandatory for material operational incidents per CPS230)
EU market participants operating high-risk AI (mandatory notification per EU AI Act Article 72)
Any organisation with stated responsible AI commitments (accountability requires incident response)

When NOT to Apply

Internal AI tools with no customer impact and no regulatory exposure (standard ITSM applies)
Development and testing environments (internal development incidents use standard dev processes)

Prerequisites

AI Audit Trail (GOV007) operational — forensic evidence source
AI Model Register (GOV001) — MRID required for impact scoping
AI Policy Enforcement (GOV004) — security incident evidence
Enterprise ITSM platform — AI-specific incident taxonomy integrated into existing system

Industry Applicability

Industry	Key AI Incident Types	Notification Obligation	MTTD / MTTR Target
Banking (AU)	Hallucination (advice), Bias (credit), Security (model theft)	APRA CPS230 §43: 72h	MTTD <1h, MTTR <24h for P1
Insurance (AU)	Bias (pricing), Privacy (output disclosure), Availability	APRA CPS230 §43: 72h	MTTD <2h, MTTR <48h for P1
Healthcare	Hallucination (clinical), Safety (harmful recommendation)	TGA reportable event	MTTD <30min, MTTR <4h
Financial Services (EU)	All types	EU AI Act Article 72: without undue delay, no later than 15 days (serious)	MTTD <1h
Government	Bias (services), Privacy, Availability	OAIC (privacy breach)	MTTD <4h

4. Architecture Overview

The AI Incident Management pattern extends enterprise ITSM with three AI-specific architectural additions: an AI incident taxonomy that integrates with detection sources, an AI incident playbook library that guides responders through AI-specific resolution actions, and an automated regulatory notification workflow triggered by incident classification.

AI Incident Taxonomy. The pattern defines six primary AI incident categories that do not exist in standard ITSM taxonomies:

Hallucination Incident: An AI system generates factually incorrect, fabricated, or internally inconsistent output. Severity is determined by whether the output was customer-facing, whether it could cause direct harm (medical advice, legal advice, financial advice), and whether it was acted upon before detection. Root cause investigation must determine two things: whether this was a single anomalous output or a systematic pattern, and whether specific input patterns can trigger the hallucination (which would make it an exploitable vulnerability).

Bias Incident: Statistical evidence of discriminatory model behaviour. Detected via GOV006 bias detection alerts or by external complaint/report. Severity based on: protected attribute affected, decision type (consequential vs non-consequential), affected population size, duration of bias operation. Root cause categories: training data skew, feedback loop formation, distribution shift, feature correlation with protected attribute.

Security Incident (AI-specific): Adversarial attack on AI system (prompt injection, model inversion, membership inference, model extraction). Differentiated from conventional security incidents by the AI-specific attack vector and the potential for model-layer compromise that application-layer security controls do not detect.

Availability Incident: AI system unable to provide service within SLO. AI availability failures take two forms: model serving infrastructure failure, which resembles a conventional availability incident, and model quality degradation, where the model appears available but produces degraded outputs. The latter is an AI-specific availability failure that standard health checks miss.

Data Poisoning Incident: Evidence or strong suspicion that training data was tampered with to produce targeted model behaviour. Requires forensic investigation of training data provenance. This is the rarest incident type but carries the most severe consequences.

Privacy Disclosure Incident: AI system discloses personal information that should not be accessible, via membership inference (confirming someone's data was in the training set), training data memorisation (reproducing personal information from training data), or policy enforcement failure. Triggers Privacy Act notifiable data breach assessment.
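
The six categories above can be represented as a closed enumeration so that every incident record carries exactly one taxonomy value. The following is a minimal sketch, not part of the pattern itself: the signal type names and the `Signal` record are hypothetical, and a real deployment would use whatever names its detection sources emit.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class IncidentCategory(Enum):
    """The six AI incident categories defined by this pattern."""
    HALLUCINATION = "hallucination"
    BIAS = "bias"
    SECURITY = "security"
    AVAILABILITY = "availability"
    DATA_POISONING = "data_poisoning"
    PRIVACY_DISCLOSURE = "privacy_disclosure"


# Hypothetical signal types, chosen for illustration only.
SIGNAL_TO_CATEGORY = {
    "fairness_threshold_breach": IncidentCategory.BIAS,
    "prompt_injection_detected": IncidentCategory.SECURITY,
    "model_extraction_suspected": IncidentCategory.SECURITY,
    "fabricated_output_reported": IncidentCategory.HALLUCINATION,
    "output_quality_degraded": IncidentCategory.AVAILABILITY,
    "training_data_tampering": IncidentCategory.DATA_POISONING,
    "personal_data_in_output": IncidentCategory.PRIVACY_DISCLOSURE,
}


@dataclass(frozen=True)
class Signal:
    source: str       # detection source, e.g. "GOV006"
    signal_type: str  # hypothetical signal name
    model_id: str     # MRID from the AI Model Register (GOV001)


def classify(signal: Signal) -> Optional[IncidentCategory]:
    """Return the taxonomy category, or None so that a human triages
    signals the rules do not recognise."""
    return SIGNAL_TO_CATEGORY.get(signal.signal_type)
```

Returning `None` for unknown signals, rather than guessing a category, keeps unrecognised events in front of a human responder.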

Graduated Severity Model. Each AI incident is assigned one of four severity levels, which determines response times and notification obligations:

P1 (Critical): Imminent or actual harm to individuals; regulatory notification likely required; immediate containment required
P2 (High): Significant risk of harm; potential regulatory notification; containment within 4 hours
P3 (Medium): Limited harm; no immediate regulatory obligation; investigation within 24 hours
P4 (Low): Minor anomaly; no harm; investigation within 72 hours
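
One way to make the severity levels above machine-checkable is a small lookup of response limits plus a rule that picks the highest applicable level. This is an illustrative sketch only: the `HarmAssessment` fields are hypothetical simplifications of the criteria above, and "immediate" for P1 is modelled here as a zero time limit.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import IntEnum


class Severity(IntEnum):
    P1 = 1  # Critical
    P2 = 2  # High
    P3 = 3  # Medium
    P4 = 4  # Low


# Time limits taken from the severity list above.
# P1 containment is "immediate", modelled as zero (an assumption).
RESPONSE_LIMIT = {
    Severity.P1: timedelta(0),
    Severity.P2: timedelta(hours=4),   # containment
    Severity.P3: timedelta(hours=24),  # investigation
    Severity.P4: timedelta(hours=72),  # investigation
}


@dataclass(frozen=True)
class HarmAssessment:
    """Hypothetical responder inputs; real criteria will be richer."""
    actual_or_imminent_harm: bool = False
    significant_risk_of_harm: bool = False
    limited_harm: bool = False


def assign_severity(assessment: HarmAssessment) -> Severity:
    """Pick the most severe level whose criterion is met."""
    if assessment.actual_or_imminent_harm:
        return Severity.P1
    if assessment.significant_risk_of_harm:
        return Severity.P2
    if assessment.limited_harm:
        return Severity.P3
    return Severity.P4
```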

AI-Specific Resolution Actions. The playbook library defines resolution actions specific to AI incidents:

Model Rollback: Revert to prior approved model version (requires deployment token from GOV003)
Scope Restriction: Limit model to lower-risk use cases while investigation proceeds
Monitoring Enhancement: Increase monitoring frequency, log sampling, and alert sensitivity
Emergency Human Override: Route all AI decisions to human review while model is investigated
Input Filter: Deploy emergency input filter via GOV004 policy bundle update
Third-Party Notification: Notify upstream model provider (for third-party model incidents)

5. Architecture Diagram

flowchart TD
    subgraph Detection["Detection Sources"]
        A[Monitoring + Bias Alerts]
        B[Customer or Regulator Reports]
    end
    subgraph Triage["Triage and Classification"]
        C[AI Incident Classifier]
        D{Severity P1-P4}
    end
    subgraph Response["Response and Learning"]
        E[Playbook Execution]
        F[Forensic Investigation]
        G[Post-Incident Review]
    end
    H[Regulatory Notification]
    I[GOV001 Incident Record]
    A --> C
    B --> C
    C --> D
    D -->|P1-P2 containment| E
    D -->|P1 regulatory trigger| H
    E --> F
    F --> G
    G --> I
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#dbeafe,stroke:#3b82f6
    style C fill:#f0fdf4,stroke:#22c55e
    style D fill:#f3e8ff,stroke:#a855f7
    style E fill:#f0fdf4,stroke:#22c55e
    style F fill:#f0fdf4,stroke:#22c55e
    style G fill:#d1fae5,stroke:#10b981
    style H fill:#fee2e2,stroke:#ef4444
    style I fill:#fef9c3,stroke:#eab308
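
The two outgoing edges of the severity decision in the diagram can be read as a routing rule. The sketch below encodes only those two edges; the step names are hypothetical identifiers, and the diagram does not show a path for P3 and P4, so the function returns an empty list for them rather than inventing one.

```python
from typing import List


def next_steps(severity: str) -> List[str]:
    """Route a classified incident along the diagram's edges.

    P1 and P2 go to playbook execution for containment.
    P1 additionally triggers the regulatory notification workflow.
    """
    steps: List[str] = []
    if severity in ("P1", "P2"):
        steps.append("playbook_execution")
    if severity == "P1":
        steps.append("regulatory_notification")
    return steps
```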

6. Components

Component	Type	Responsibility	Technology Options	Criticality
Detection Aggregator	Integration	Consolidates AI incident signals from all detection sources	Event bus (Kafka, SNS); webhook receiver	Critical
AI Incident Triage Bot	Automation	Auto-classifies incoming signals against AI incident taxonomy; suggests severity	LLM-assisted classification (irony: AI to manage AI incidents) + rules engine	High
AI Incident Taxonomy Classifier	Business Logic	Enforces six-category taxonomy; maps to severity model	Configurable rules engine; ITSM integration	Critical
AI Incident Playbook Engine	Workflow	Delivers appropriate playbook to responder based on incident type + severity	ServiceNow Playbooks, PagerDuty Runbooks, Confluence-integrated	High
Response Action Orchestrator	Automation	Executes automated response actions (model restriction, monitoring enhancement)	Custom API orchestrator + GOV001/GOV004 API integration	High
Forensic Query Interface	Integration	Provides structured queries against GOV007 audit trail for incident investigation	Search interface to GOV007 Search Index	Critical
Regulatory Notification Engine	Workflow	Assesses notification obligations; drafts notifications; manages timelines	Custom workflow + ITSM + document management	Critical
Post-Incident Review Template	Process	Structured PIR template for AI incidents	ITSM template; Confluence page template	High
AI Incident Pattern Database	Knowledge Base	Accumulates incident patterns to improve detection and playbooks	Knowledge base in ITSM; custom pattern store	Medium

7. Data Flow

P1 AI Incident Response Flow

Step	Actor	Action	Output
1	Detection Source (e.g., GOV006 bias alert)	Emits incident signal	Signal in detection aggregator
2	Triage Bot	Auto-classifies as P1 Bias Incident; recommends severity	Draft incident record with classification
3	On-Call AI Incident Responder	Confirms classification; accepts incident ownership	Incident created in ITSM with P1 priority
4	Playbook Engine	Serves Bias Incident P1 playbook	Step-by-step response guidance
5	Responder	Executes immediate containment (scope restriction or human override via response orchestrator)	Model restriction applied; customer exposure limited
6	Regulatory Assessment	Automated assessment of notification obligation triggered	Notification required: APRA 72h window started
7	Forensic Investigation	Queries GOV007 audit trail for affected decision population	Affected decision count, date range, customer population
8	Notification Draft	Regulatory Notification Engine drafts APRA notification	Draft notification for Compliance Director review
9	Root Cause Analysis	AI-assisted RCA using audit trail data + model metadata	Root cause identified; control gap documented
10	Post-Incident Review	PIR conducted within 5 business days	PIR report; control improvement actions; GOV001 updated

8. Security Considerations

Incident Responder Access

P1/P2 incidents require On-Call AI Incident Responder (24/7 coverage) with elevated read access to audit trail
Elevated access is time-limited (incident duration + 48 hours); automatically revoked
All actions during incident response logged with incident reference
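
The time-limited access rule above (incident duration plus 48 hours) reduces to a single comparison. A minimal sketch, assuming the incident's open and close timestamps are available from the ITSM record; the function and constant names are illustrative.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Grace period after incident closure, from the access rule above.
REVOCATION_GRACE = timedelta(hours=48)


def elevated_access_active(
    now: datetime,
    opened_at: datetime,
    closed_at: Optional[datetime] = None,
) -> bool:
    """True while a responder's elevated audit-trail access should hold.

    Access starts when the incident opens, continues while it is open,
    and lapses 48 hours after it closes.
    """
    if now < opened_at:
        return False
    if closed_at is None:
        return True
    return now < closed_at + REVOCATION_GRACE
```

A scheduled job could evaluate this check for every active grant and revoke those for which it returns `False`.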

OWASP LLM Top 10 Mapping

OWASP LLM Risk	Incident Category	Response Action
LLM01 Prompt Injection	Security Incident	Input filter emergency deployment
LLM03 Training Data Poisoning	Data Poisoning Incident	Model rollback; retraining audit
LLM06 Sensitive Information Disclosure	Privacy Disclosure Incident	Scope restriction; OAIC notification assessment
LLM08 Excessive Agency	Security / Hallucination Incident	Human override activation
LLM09 Overreliance	Hallucination / Availability Incident	User notification; human review requirement
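
The mapping table above can be held as a lookup so that a triage component resolves an OWASP risk identifier to candidate incident categories and a first response action. This is a sketch of one possible encoding; the `Route` type and `route_for` function are illustrative names, and unmapped identifiers are deliberately rejected rather than defaulted.

```python
from typing import NamedTuple, Tuple


class Route(NamedTuple):
    categories: Tuple[str, ...]  # candidate incident categories
    response_action: str         # first response action from the table


# Rows copied from the OWASP LLM Top 10 mapping table above.
OWASP_ROUTES = {
    "LLM01": Route(("Security",), "Input filter emergency deployment"),
    "LLM03": Route(("Data Poisoning",), "Model rollback; retraining audit"),
    "LLM06": Route(("Privacy Disclosure",),
                   "Scope restriction; OAIC notification assessment"),
    "LLM08": Route(("Security", "Hallucination"), "Human override activation"),
    "LLM09": Route(("Hallucination", "Availability"),
                   "User notification; human review requirement"),
}


def route_for(risk_id: str) -> Route:
    """Look up the route for an OWASP LLM risk ID such as 'LLM01'."""
    try:
        return OWASP_ROUTES[risk_id.upper()]
    except KeyError:
        raise LookupError(f"{risk_id} is not mapped; triage manually") from None
```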

9. Governance Considerations

MTTD/MTTR Targets

Incident Type	Severity	MTTD Target	MTTR Target	Regulatory Notification Window
Hallucination (customer-facing)	P1	<30 minutes	<4 hours	APRA: 72h if material
Bias Incident (consequential)	P1	<1 hour	<24 hours	APRA: 72h; ASIC: assess
Security (adversarial attack)	P1	<15 minutes	<4 hours	APRA CPS234: 72h
Privacy Disclosure	P1	<30 minutes	<4 hours	OAIC: 30-day assessment, then notify as soon as practicable (NDB)
Availability (quality degradation)	P2	<1 hour	<8 hours	APRA: if critical operation
Data Poisoning	P1	<2 hours	<48 hours	APRA: 72h
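
Measuring against the targets above requires agreed definitions. The sketch below assumes time to detect runs from incident start to detection and time to resolve runs from detection to resolution; the pattern does not define these, so treat both as assumptions. The `IncidentTimes` record and the dictionary keys are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Sequence, Tuple


@dataclass(frozen=True)
class IncidentTimes:
    incident_type: str
    started_at: datetime
    detected_at: datetime
    resolved_at: datetime


# (MTTD target, MTTR target) per incident type, from the table above.
TARGETS = {
    "hallucination": (timedelta(minutes=30), timedelta(hours=4)),
    "bias": (timedelta(hours=1), timedelta(hours=24)),
    "security": (timedelta(minutes=15), timedelta(hours=4)),
    "privacy_disclosure": (timedelta(minutes=30), timedelta(hours=4)),
    "availability": (timedelta(hours=1), timedelta(hours=8)),
    "data_poisoning": (timedelta(hours=2), timedelta(hours=48)),
}


def mean_times(incidents: Sequence[IncidentTimes]) -> Tuple[timedelta, timedelta]:
    """Return (MTTD, MTTR) over a non-empty set of incidents."""
    ttd = mean((i.detected_at - i.started_at).total_seconds() for i in incidents)
    ttr = mean((i.resolved_at - i.detected_at).total_seconds() for i in incidents)
    return timedelta(seconds=ttd), timedelta(seconds=ttr)


def meets_targets(incident_type: str, incidents: Sequence[IncidentTimes]) -> bool:
    """Compare observed means against the targets for one incident type."""
    mttd, mttr = mean_times(incidents)
    mttd_target, mttr_target = TARGETS[incident_type]
    return mttd < mttd_target and mttr < mttr_target
```

Because the targets are means, a single slow incident can be masked by several fast ones; tracking the worst case alongside the mean is a reasonable complement.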

Governance Artefacts

Artefact	Owner	Frequency	Regulatory Linkage
AI Incident Register	CISO	Continuous	APRA CPS230 §43
Post-Incident Review Reports	AI Incident Lead	Per incident	APRA CPS230 §46
Regulatory Notification Log	Compliance	Per notification	APRA CPS234 §37
AI Incident Trend Analysis	AI Governance	Quarterly	ISO 42001 §10.2

10. Operational Considerations

SLOs

SLO	Target	Measurement
P1 detection-to-responder notification	<5 minutes	Per P1 incident
P1 containment action completion	<1 hour	Per P1 incident
Regulatory notification draft ready	<4 hours from P1 classification	Per P1 incident
PIR completion	<5 business days	Per incident
24/7 AI Incident Responder coverage	100%	Monthly
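
The PIR completion target is expressed in business days, which needs a calendar rule to evaluate. A minimal sketch follows, assuming the count starts on the day after the incident is resolved and that only Saturdays and Sundays are skipped; public holidays are ignored here, and both function names are illustrative.

```python
from datetime import date, timedelta

PIR_BUSINESS_DAYS = 5  # from the SLO table above


def add_business_days(start: date, days: int) -> date:
    """Advance by the given number of weekdays (Monday to Friday)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0 is Monday, 5 and 6 are the weekend
            remaining -= 1
    return current


def pir_due(resolved_on: date) -> date:
    """Latest date by which the post-incident review should be complete."""
    return add_business_days(resolved_on, PIR_BUSINESS_DAYS)
```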

11. Cost Considerations

Indicative Cost Range

Component	Indicative Cost
AI Incident Responder on-call coverage (0.5 FTE burdened)	AUD $80,000/yr
ITSM AI incident taxonomy customisation	AUD $15,000 one-time
Regulatory notification tooling	AUD $10,000/yr
External incident response retainer (for critical incidents)	AUD $30,000–$80,000/yr
Total first year (including one-time customisation)	~AUD $135,000–$185,000
Recurring annual thereafter	~AUD $120,000–$170,000

12. Trade-Off Analysis

Option Comparison

Option	Description	Pros	Cons	Recommended For
A: AI-specialised ITSM extension (this pattern)	Existing ITSM + AI taxonomy + AI playbooks	Leverages existing ITSM investment; familiar to IT teams	Customisation effort; AI-specific training required	All regulated enterprises
B: Separate AI incident management tool	Dedicated AI ops platform (Arthur AI, Fiddler)	Purpose-built for AI; rich forensics	Siloed from enterprise ITSM; dual management overhead	Organisations without mature ITSM
C: Standard ITSM only	Use existing incident process unchanged	Zero additional cost; no change management	AI incidents mis-categorised; regulatory obligations missed	Never for regulated entities with customer-facing AI

13. Failure Modes

Failure	Likelihood	Impact	Detection	Recovery
AI incident not recognised as such (logged as app bug)	High	High — AI-specific response not executed; notification missed	Review of all customer-facing system incidents for AI component	AI detection training for all incident responders; auto-flagging from GOV004/GOV006
Regulatory notification window missed	Low	Critical — APRA enforcement	Notification workflow SLA monitor	Voluntary disclosure with explanation; legal counsel engagement
Forensic evidence unavailable (GOV007 unavailable during incident)	Low	High — investigation impeded	GOV007 availability monitoring	Offline investigation using secondary evidence; GOV007 recovery prioritised
Post-incident review not completed	Medium	Medium — learning not captured	PIR completion tracking	Mandatory PIR gate before incident closure

14. Regulatory Considerations

APRA CPS230

§43: Operational incidents with material impact must be notified to APRA within 72 hours. AI incidents causing disruption to critical operations are material incidents.
§46: Post-incident reviews required for material incidents. PIR template implements this requirement.

APRA CPS234

§36: Cybersecurity incidents must be notified to APRA within 72 hours. AI security incidents (prompt injection attacks, model theft) are CPS234 notifiable.
§37: Summary of cybersecurity incidents must be submitted to APRA annually.

EU AI Act

Article 72: Providers of high-risk AI systems must report serious incidents (death/serious harm attributable to AI, fundamental rights infringements, significant property damage) to market surveillance authorities without undue delay and no later than 15 days after becoming aware.

Privacy Act 1988 / Notifiable Data Breaches (NDB)

§26WK: Privacy disclosure via AI (memorisation, membership inference) triggers an NDB assessment, which must be completed within 30 days. If the breach is likely to result in serious harm, the OAIC and affected individuals must be notified as soon as practicable. The incident management workflow includes the NDB assessment as a mandatory step for privacy disclosure incidents.
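
The time windows cited in this section can be turned into concrete deadlines for a notification timeline monitor. The sketch below assumes every clock starts when the organisation becomes aware of the incident, which should be confirmed against each instrument; the regime keys and function names are illustrative, not official identifiers.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable, List

# Windows as cited in this section.
NOTIFICATION_WINDOWS = {
    "APRA_CPS230": timedelta(hours=72),
    "APRA_CPS234": timedelta(hours=72),
    "EU_AI_ACT": timedelta(days=15),  # serious incidents
    "OAIC_NDB": timedelta(days=30),   # 30-day NDB window cited above
}


def notification_deadlines(
    aware_at: datetime, regimes: Iterable[str]
) -> Dict[str, datetime]:
    """Deadline per regime, counted from the moment of awareness."""
    return {regime: aware_at + NOTIFICATION_WINDOWS[regime] for regime in regimes}


def overdue(now: datetime, aware_at: datetime, regimes: Iterable[str]) -> List[str]:
    """Regimes whose deadline has already passed, in alphabetical order."""
    deadlines = notification_deadlines(aware_at, regimes)
    return sorted(regime for regime, due in deadlines.items() if now > due)
```

Using timezone-aware timestamps throughout avoids ambiguity when an incident spans Australian and European reporting obligations.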

15. Reference Implementations

AWS

Component	Service
Incident Detection	CloudWatch Alarms + EventBridge
ITSM Integration	ServiceNow + AWS Systems Manager OpsCenter
Forensic Query	OpenSearch (GOV007 index)
Notification Workflow	Step Functions + SES

Azure

Component	Service
Incident Detection	Azure Monitor + Event Grid
ITSM Integration	ServiceNow / Jira Service Management via Azure Logic Apps

On-Premises

Component	Technology
ITSM	ServiceNow with custom AI incident app
Playbooks	Confluence + ServiceNow Playbooks
Forensics	GOV007 Elasticsearch query interface

16. Related Patterns

Pattern	Relationship	Dependency Direction
EAAPL-GOV006 Model Bias Detection	Detection source — bias alerts trigger incidents	GOV006 → GOV008
EAAPL-GOV007 AI Audit Trail	Forensic evidence source	GOV008 → GOV007
EAAPL-GOV004 AI Policy Enforcement	Security incident evidence	GOV004 → GOV008
EAAPL-CMP001 APRA CPS230	Satisfies — §43/§46 incident obligations	GOV008 → CMP001
EAAPL-CMP002 APRA CPS234	Satisfies — §36/§37 security incident obligations	GOV008 → CMP002

17. Maturity Assessment

Overall Maturity: Proven (Level 3)

Dimension	Score (1–5)	Evidence
Taxonomy completeness	5	Six AI-specific incident categories fully defined
Regulatory notification workflow	4	Three jurisdictions covered; gap is automated notification submission
Playbook quality	3	Six playbook types outlined; depth of individual playbooks varies
MTTD/MTTR targets	4	Targets defined per incident type; measurement infrastructure required
Post-incident learning loop	3	Pattern database defined; automated pattern detection not yet implemented

18. Revision History

Version	Date	Author	Changes
1.0	2024-06-01	EAAPL Working Group	Initial publication
1.1	2025-04-01	EAAPL Working Group	EU AI Act Article 72 notification mapping; APRA CPS234 §36/§37 alignment; data poisoning incident category
