EAAPL-OBS003 · Hallucination Detection

Observability & Monitoring · Field-tested in AU


Pattern ID: EAAPL-OBS003
Status: Emerging
Complexity: High
Tags: rag, llm, human-oversight, model-risk, high-complexity
Version: 1.0.0
Last Reviewed: 2026-06-12

1. Executive Summary

Large language models fabricate plausible-sounding content with confidence. In enterprise settings, a hallucinated legal citation, an invented financial figure, or a fabricated clinical guideline can trigger regulatory liability, customer harm, and reputational damage. Despite widespread awareness of this risk, fewer than 20% of enterprise AI deployments have systematic runtime detection — most rely on post-hoc user feedback, which captures only a fraction of hallucinations and arrives too late to prevent harm.

This pattern defines a layered, runtime hallucination detection architecture for production AI systems, with emphasis on RAG (retrieval-augmented generation) deployments where grounding verification is tractable. It covers: RAG grounding verification using NLI (Natural Language Inference) models and LLM-as-judge; source citation validation; factual consistency checking against retrieved context; confidence calibration monitoring; automated hallucination rate estimation using a hybrid human-labeled/ML-classifier approach; human-in-the-loop escalation for high-probability hallucinations; and post-delivery feedback collection to close the calibration loop. The outcome is a defensible claim that the organisation has implemented reasonable technical controls to detect and respond to AI hallucinations — a requirement emerging in EU AI Act Article 9 and APRA's AI Risk Management guidance.

Target Audience: CIO, CTO, AI Engineering Lead, Risk Officer, Chief Compliance Officer

Time to Implement: 10–16 weeks (high complexity; NLI model selection and calibration are the long poles)

2. Problem Statement

Business Problem

Organisations are deploying AI systems to answer customer questions, draft legal documents, generate financial summaries, and provide clinical guidance. When these systems hallucinate — and they do, at rates between 3% and 27% depending on the task and model — the business consequences range from customer complaints to regulatory sanctions to legal liability. The absence of runtime hallucination detection means the organisation cannot demonstrate due diligence when harm occurs.

Technical Problem

Hallucination is not a binary property and cannot be detected with a single mechanism. A response can be well-grounded in retrieved context yet still factually incorrect (the context itself was wrong). It can cite real sources that don't say what the response claims. It can include accurate general facts alongside fabricated specifics. Detecting hallucinations requires multiple complementary mechanisms operating at different levels of the output.

Symptoms

Customers or staff occasionally report AI outputs that contradict facts or cite non-existent sources
No metric exists for hallucination rate; the organisation cannot answer "what percentage of our AI outputs are hallucinated?"
After a hallucination incident, the system cannot determine when the hallucination began, how many users were affected, or whether similar outputs were delivered previously
AI system used for financial or clinical decisions with no confidence threshold or escalation trigger
User feedback (thumbs down) is the only hallucination signal, capturing < 5% of actual occurrences

Cost of Inaction

Undetected hallucinations in high-risk domains (legal, financial, clinical) create direct liability
Regulatory finding under EU AI Act Article 9 for high-risk AI systems lacking risk management controls
APRA model risk management finding for material AI models lacking output validation
Reputational damage when hallucinations become public — disproportionate media amplification
Customer trust erosion: users who encounter one hallucination permanently reduce AI usage

3. Context

When to Apply

RAG systems where retrieved context is available for grounding verification
AI systems making factual claims (not creative writing, brainstorming, or opinion generation)
High-risk AI applications (legal, financial, clinical, government)
Systems where hallucination rate > 1% has business or regulatory consequence
Prerequisite: EAAPL-OBS001 telemetry must provide the log stream for detection events

When NOT to Apply

Pure creative writing or brainstorming systems (no factual grounding to verify)
Systems generating code (different validation strategy: execution testing, not NLI)
Very low-stakes applications where hallucination consequence is negligible
Real-time conversational systems where < 100ms response budget makes synchronous NLI infeasible (use asynchronous streaming detection)

Prerequisites

Prerequisite	Required	Notes
RAG pipeline with retrievable source chunks	Strongly Recommended	Grounding verification requires access to the context the model used
NLI model or LLM-as-judge service	Required	Core detection mechanism
EAAPL-OBS001 AI Telemetry Infrastructure	Required	Log storage for detection events and calibration data
Human review workflow	Required	Escalation target; calibration labeling
Feedback collection mechanism	Required	Post-delivery signal for calibration

Industry Applicability

Industry	Applicability	Primary Driver
Financial Services	Critical	ASIC regulatory guidance, APRA model risk, financial advice liability
Healthcare	Critical	Clinical safety, TGA obligations for software as medical device
Legal Services	Critical	Professional liability, bar association obligations
Government	High	Public trust, mandatory accuracy obligations
Retail / E-Commerce	Medium	Product information accuracy, consumer law
Technology / SaaS	High	Contractual accuracy obligations, platform liability

4. Architecture Overview

The Hallucination Detection Architecture is a multi-layer system combining real-time output analysis, statistical quality monitoring, and human-in-the-loop review. The architecture acknowledges that no single detection mechanism achieves acceptable precision and recall across all hallucination types; a defence-in-depth approach is required.

Layer 1: RAG Grounding Verification

When a RAG pipeline retrieves source chunks and passes them to the model as context, the generated output can be evaluated for grounding in those chunks. The grounding verifier takes the generated response and the source context as inputs and asks: does the context support the claims made in the response? This is an NLI (Natural Language Inference) task with three labels: entailment (context supports claim), neutral (claim is not addressed by context), and contradiction (context contradicts claim).

Two implementation options are viable. NLI model: a fine-tuned cross-encoder model (e.g., DeBERTa-v3-large fine-tuned on NLI datasets like MultiNLI and ANLI) runs inference on response-chunk pairs. Sentence decomposition splits the response into individual factual claims before NLI evaluation. A response with any claim scoring contradiction with confidence > 0.8 is flagged. LLM-as-judge: a separate LLM call (using a different model from the generation model to avoid correlated errors) evaluates whether the response is grounded in the provided context, returning a structured verdict with confidence score and explanation. LLM-as-judge is more expensive but handles complex reasoning chains better.
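The NLI flagging rule described above can be sketched as follows. This is a minimal illustration, not a reference implementation: `split_claims` is a naive sentence splitter standing in for real claim decomposition, `NliScorer` is a hypothetical interface for whatever NLI model is deployed, and `toy_scorer` is a deliberately crude stand-in so the example runs without a model. Only the 0.8 contradiction threshold comes from the pattern text.

```python
import re
from typing import Callable, Dict, List

# Hypothetical scorer interface: (premise, hypothesis) -> label probabilities.
NliScorer = Callable[[str, str], Dict[str, float]]


def split_claims(response: str) -> List[str]:
    """Naive sentence split; a stand-in for real claim decomposition."""
    parts = re.split(r"(?<=[.!?])\s+", response.strip())
    return [p for p in parts if p]


def flag_response(response: str, chunks: List[str], score: NliScorer,
                  contradiction_threshold: float = 0.8) -> List[dict]:
    """Return every claim that some chunk contradicts above the threshold."""
    flagged = []
    for claim in split_claims(response):
        for chunk in chunks:
            probs = score(chunk, claim)
            if probs.get("contradiction", 0.0) > contradiction_threshold:
                flagged.append({"claim": claim, "chunk": chunk,
                                "contradiction": probs["contradiction"]})
                break  # one contradicting chunk is enough to flag the claim
    return flagged


def toy_scorer(premise: str, hypothesis: str) -> Dict[str, float]:
    """Toy stand-in for an NLI model: compares the numbers in both texts."""
    p_nums = set(re.findall(r"\d+(?:\.\d+)?", premise))
    h_nums = set(re.findall(r"\d+(?:\.\d+)?", hypothesis))
    if h_nums and p_nums and not (h_nums & p_nums):
        return {"entailment": 0.05, "neutral": 0.05, "contradiction": 0.90}
    if h_nums and h_nums <= p_nums:
        return {"entailment": 0.90, "neutral": 0.05, "contradiction": 0.05}
    return {"entailment": 0.10, "neutral": 0.85, "contradiction": 0.05}


flags = flag_response(
    "Revenue grew 12 percent. The firm was founded in Sydney.",
    ["Revenue grew 8 percent year on year."],
    toy_scorer,
)
print([f["claim"] for f in flags])
```

In a real deployment the scorer would wrap the chosen cross-encoder or an LLM-as-judge call; keeping it behind a function boundary makes it easier to swap one for the other.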

Layer 2: Source Citation Validation

When the model's response includes explicit source citations (document titles, URLs, or retrieved chunk IDs), a citation validator checks: (a) does the cited source exist in the retrieval corpus? (b) does the cited source contain content that supports the claim attributed to it? This catches a specific and common hallucination pattern: the model fabricates plausible-sounding source titles that don't exist, or correctly cites a real source but misrepresents what it says.
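Check (a), citation existence, is the simpler half and can be sketched as below. The inline `[source: <id>]` citation format and the `corpus` mapping from chunk ID to text are assumptions made for illustration; the pattern does not prescribe a citation syntax. Check (b), whether the cited source supports the claim, would typically be delegated to the grounding verifier and is not shown.

```python
import re
from typing import Dict, List

# Assumed citation syntax: "[source: <chunk-id>]". Adjust to the real format.
CITATION_PATTERN = re.compile(r"\[source:\s*([^\]]+)\]")


def validate_citations(response: str, corpus: Dict[str, str]) -> List[dict]:
    """List each cited source ID and whether it exists in the retrieval corpus."""
    results = []
    for match in CITATION_PATTERN.finditer(response):
        source_id = match.group(1).strip()
        results.append({"source_id": source_id, "exists": source_id in corpus})
    return results


corpus = {"policy-12": "Refunds are processed within 5 business days."}
response = ("Refunds take 5 days [source: policy-12]. "
            "Fees are waived [source: policy-99].")
for result in validate_citations(response, corpus):
    print(result)
```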

Layer 3: Factual Consistency Checking

For responses containing numerical claims, dates, names, or specific facts, an extraction and verification step checks these specific claims against retrieved context. Named entity recognition identifies claims of type NUMBER, DATE, PERSON, ORG, and STATISTIC in the response. Each extracted claim is then checked against the retrieved context using exact or fuzzy matching. Mismatches (e.g., response says "$2.3 billion", context says "$2.3 million") are flagged as high-confidence hallucinations.
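The million-versus-billion example can be reproduced with a small numeric check. This sketch covers only monetary or scaled numbers and uses a regular expression in place of the NER step; the `SCALES` table and function names are illustrative assumptions, and dates, names, and organisations would need separate handling.

```python
import math
import re
from typing import List

SCALES = {"thousand": 1e3, "million": 1e6, "billion": 1e9}
AMOUNT = re.compile(r"\$?(\d+(?:\.\d+)?)\s*(thousand|million|billion)?",
                    re.IGNORECASE)


def extract_amounts(text: str) -> List[float]:
    """Extract numbers, applying a scale word if one follows the number."""
    return [float(number) * SCALES.get(scale.lower(), 1.0)
            for number, scale in AMOUNT.findall(text)]


def find_mismatches(response: str, context: str) -> List[float]:
    """Return amounts in the response that match no amount in the context."""
    context_amounts = extract_amounts(context)
    return [amount for amount in extract_amounts(response)
            if not any(math.isclose(amount, c, rel_tol=1e-9)
                       for c in context_amounts)]


print(find_mismatches("Revenue was $2.3 billion.",
                      "Revenue was $2.3 million in FY24."))
```

Normalising to a common scale before comparing is what lets the check catch a scale-word error that string matching on "2.3" alone would miss.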

Layer 4: Confidence Calibration Monitoring

Model APIs often return logprob-based confidence scores, or the model can be prompted for self-assessed confidence. The confidence calibration monitor tracks the relationship between these scores and actual accuracy (as measured by human labels and NLI grounding scores). A well-calibrated model is correct about 90% of the time across the responses to which it assigns confidence 0.9. Calibration is measured using Expected Calibration Error (ECE) and visualised with calibration curves. Degrading calibration, where the model becomes confidently wrong more often, is a signal of model drift (see EAAPL-OBS005).
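For reference, ECE with equal-width bins is the sample-weighted average gap between mean confidence and observed accuracy in each bin. The sketch below assumes ten bins and a boolean correctness label per response (for example, from human review); both choices are illustrative rather than mandated by the pattern.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Equal-width-bin ECE: sum over bins of (bin share) * |confidence - accuracy|."""
    if len(confidences) != len(correct) or len(confidences) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


# Ten responses at confidence 0.9, only five of them correct.
print(expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5))
```

In this toy case the model claims 0.9 confidence but is right half the time, so the calibration gap is 0.4, well above the 0.15 review trigger listed in the SLO table.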

Layer 5: Hallucination Rate Metric

A statistically valid hallucination rate requires human-labeled ground truth. A sampling pipeline draws 1% of all responses for human review. Reviewers label each response as: grounded, partially grounded, or hallucinated. This labeled sample trains and continuously retrains an ML classifier that can then estimate hallucination probability on the full unlabeled output stream. The resulting hallucination rate metric (estimated percentage of outputs with hallucination) is reported on dashboards and used for SLO tracking.
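Because the rate is estimated from a sample, it is worth reporting an interval alongside the point estimate. The sketch below uses a Wilson score interval, which is a choice made here for illustration and not something the pattern specifies. It also assumes "partially grounded" labels are counted separately, so only "hallucinated" labels enter the numerator.

```python
import math
from typing import Tuple


def wilson_interval(hallucinated: int, reviewed: int,
                    z: float = 1.96) -> Tuple[float, float, float]:
    """Return (point estimate, lower bound, upper bound) for the sampled rate."""
    if reviewed <= 0 or not 0 <= hallucinated <= reviewed:
        raise ValueError("need reviewed > 0 and 0 <= hallucinated <= reviewed")
    p = hallucinated / reviewed
    denom = 1 + z * z / reviewed
    centre = (p + z * z / (2 * reviewed)) / denom
    half_width = z * math.sqrt(p * (1 - p) / reviewed
                               + z * z / (4 * reviewed * reviewed)) / denom
    return p, max(0.0, centre - half_width), min(1.0, centre + half_width)


# Example: 50 responses labeled hallucinated out of 1,000 reviewed.
rate, low, high = wilson_interval(50, 1000)
print(f"rate={rate:.3f}, 95% interval=({low:.3f}, {high:.3f})")
```

With roughly 1,000 labels the interval is about a point and a half either side of a 5% estimate, which is relevant when comparing against a 1% high-risk SLO.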

Human-in-the-Loop Escalation

When the grounding verifier or factual consistency checker assigns a hallucination probability above a configurable threshold (default: 0.7), the response is flagged for human review before delivery. The escalation workflow routes the flagged response to a human reviewer queue. The reviewer can approve delivery unchanged, approve a modified version, or reject delivery. For synchronous user-facing applications, this creates a user-perceptible delay, which is acceptable only when the risk of delivering an unverified response exceeds the cost of latency. For asynchronous applications (document generation, batch processing), escalation is always preferred over delivery of flagged responses.

5. Architecture Diagram

ARCHITECTURE DIAGRAM

flowchart TD
    subgraph Detection["Detection Layers"]
        A[RAG Response + Context]
        B[Grounding Verifier]
        C[Factual Consistency Checker]
    end
    subgraph Routing["Routing Decision"]
        D{Hallucination Probability}
        E[Human Review Queue]
    end
    subgraph Calibration["Feedback Loop"]
        F[1% Sampling Pipeline]
        G[Human Labeling]
        H[Classifier Retraining]
    end
    A --> B
    B --> C
    C --> D
    D -->|prob above 0.7| E
    D -->|prob below 0.3| I[Deliver to User]
    E --> I
    I --> F
    F --> G
    G --> H
    H --> D
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#f0fdf4,stroke:#22c55e
    style C fill:#f0fdf4,stroke:#22c55e
    style D fill:#f3e8ff,stroke:#a855f7
    style E fill:#fee2e2,stroke:#ef4444
    style F fill:#fef9c3,stroke:#eab308
    style G fill:#fef9c3,stroke:#eab308
    style H fill:#f0fdf4,stroke:#22c55e
    style I fill:#d1fae5,stroke:#10b981

6. Components

Component	Type	Responsibility	Technology Options	Criticality
Grounding Verifier	ML Inference Service	NLI-based grounding check of response against retrieved chunks	DeBERTa-v3-large NLI fine-tuned; LLM-as-judge (GPT-4o, Claude 3.5); sentence-transformers	Critical
Citation Validator	Service	Verify existence and accuracy of cited sources	Custom service querying vector index + retrieval corpus	High
Factual Consistency Checker	Service	Extract and verify numerical, date, named entity claims	spaCy NER + fuzzy matching against retrieved context; custom rules	High
Hallucination Classifier	ML Model	Estimate hallucination probability on unlabeled outputs	Trained on human-labeled sample; DistilBERT or similar; updated weekly	High
Human Review Queue	Workflow	Route high-risk responses to human reviewers; track decisions	Jira, ServiceNow, custom review UI; priority-sorted queue	Critical
Confidence Calibration Monitor	Batch Job	Track calibration curves; compute ECE; alert on degradation	Python (sklearn calibration_curve); scheduled analysis	Medium
Sampling Pipeline	Stream Processor	Draw 1% sample for human labeling; stratify by model, template, confidence	Kafka consumer with sampling; or scheduled SQL query	High
Human Labeling Interface	UI	Present response + context to reviewers; capture grounded/partial/hallucinated label	Custom React app; Label Studio; Prodigy	High
Hallucination Rate Dashboard	UI	Real-time and trend hallucination rate by model, template, use case	Grafana; Datadog; custom dashboard	Medium
Alert Manager	Integration	Alert on hallucination rate SLO breach; escalation routing	Alertmanager, PagerDuty; severity-based routing	High
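One way the Sampling Pipeline could draw its 1% sample is deterministic hashing of the request ID, so the same request is always in or out of the sample regardless of which consumer processes it. This is a sketch under that assumption; the function name is hypothetical, and the stratification by model, template, and confidence listed in the table is not shown.

```python
import hashlib


def in_sample(request_id: str, rate: float = 0.01) -> bool:
    """Map the request ID to a stable value in [0, 1) and compare it to the rate."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


selected = sum(in_sample(f"req-{i}") for i in range(100_000))
print(f"{selected} of 100000 requests selected")
```

Hash-based selection also makes the sample reproducible during audits, since membership can be recomputed from the requestId alone.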

7. Data Flow

Primary Flow

Step	Actor	Action	Output
1	RAG Pipeline	Retrieves context chunks; generates response; passes both to detection layer	Response text + retrieved_chunks[] + requestId
2	Grounding Verifier	Decomposes response into claims; runs NLI on each claim against each chunk	Grounding score per claim; overall grounding confidence; contradiction flags
3	Citation Validator	Extracts any explicit citations from response; verifies existence and content alignment	Citation validity score; flagged invalid or misrepresented citations
4	Factual Consistency Checker	Extracts entities (numbers, dates, names); checks against chunks	Consistency score; flagged mismatched facts
5	Hallucination Classifier	Combines grounding, citation, consistency signals; outputs hallucination probability	Hallucination probability score (0.0–1.0)
6	Routing Decision	Threshold logic: > 0.7 → human queue; 0.3–0.7 → flag + deliver; < 0.3 → deliver	Routing decision + detection log record
7	Human Reviewer (if escalated)	Receives response + context + detection flags; reviews; approves/rejects/modifies	Reviewer decision + correction if modified
8	Telemetry Logger	Records full detection event with requestId, scores, routing decision, reviewer outcome	Immutable detection event record
9	Calibration Pipeline	Samples 1% of all decisions for human labeling; uses labels to retrain classifier	Updated classifier weights; calibration metrics
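The threshold logic in step 6 can be written as a small pure function. The sketch below treats the boundary values 0.3 and 0.7 as belonging to the middle band, which is an assumption; the table does not say which side the boundaries fall on. The `Route` names are illustrative.

```python
from enum import Enum


class Route(Enum):
    HUMAN_REVIEW = "human_review"
    FLAG_AND_DELIVER = "flag_and_deliver"
    DELIVER = "deliver"


def route(probability: float, escalate_above: float = 0.7,
          deliver_below: float = 0.3) -> Route:
    """Apply the three-band routing rule to a hallucination probability."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be within [0.0, 1.0]")
    if probability > escalate_above:
        return Route.HUMAN_REVIEW
    if probability < deliver_below:
        return Route.DELIVER
    return Route.FLAG_AND_DELIVER


for p in (0.85, 0.50, 0.10):
    print(p, route(p).value)
```

Keeping the thresholds as parameters makes domain-specific settings (such as a stricter clinical threshold) a configuration change, subject to the sign-off process described under Governance.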

Error Flow

Error Scenario	Detection	Action	Recovery
NLI model service unavailable	Health check failure; grounding API timeout	Escalate all responses to human queue; alert P1	Restore NLI service; process queue backlog
Grounding verifier latency spike (> 500ms)	SLO breach alert	Increase timeout threshold; alert; switch to LLM-as-judge fallback	Investigate; scale NLI service
Human review queue overwhelmed (> 100 pending)	Queue depth metric alert	Alert to AI engineering; consider temporary threshold relaxation with risk acceptance	Increase reviewer capacity; temporary threshold adjustment requires risk sign-off
Hallucination classifier stale (last retrain > 30 days)	Retraining job failure; staleness alert	Alert to ML platform; run emergency retraining	Investigate labeling pipeline; manual retraining trigger
Retrieved chunks not available in detection layer	Missing field in request to detection service	Skip grounding verification; apply higher-risk flag; escalate to human review	Fix RAG pipeline to pass chunks to detection layer

8. Security Considerations

Authentication: Detection services authenticate to the main AI pipeline via service-to-service tokens. The human review queue authenticates reviewers via SSO with MFA. Access to hallucination detection logs requires separate authorisation from general telemetry logs.

Authorisation: Hallucination detection results are sensitive — they reveal AI system quality levels that may be commercially sensitive or relevant to litigation. Access restricted to AI engineering, risk management, and legal. Dashboard access for quality monitoring is broader.

Secrets Management: LLM-as-judge service API keys stored in secrets manager with separate rotation from primary generation keys. NLI model served internally requires no external API key.

Data Classification: Response text stored in detection logs is classified at least as Confidential. Detection events for high-risk domains (clinical, legal, financial) classified as Restricted. Reviewer comments on rejected responses classified as Privileged if written by legal reviewers.

Encryption: All detection data encrypted in transit (TLS 1.3) and at rest (AES-256). Human review queue messages encrypted end-to-end if containing sensitive response content.

Auditability: Every hallucination detection event is immutable. Reviewer decisions are immutable. The chain from requestId to detection event to reviewer decision is traceable for legal discovery.

OWASP LLM Top 10 Coverage

OWASP LLM Risk	Hallucination Detection Control	Implementation
LLM01 Prompt Injection	Injected instructions may produce hallucinations; detection surfaces them	Hallucinations triggered by injection flagged for security review
LLM02 Insecure Output Handling	Hallucinated outputs may contain injections or unsafe content	Output validation in detection layer; unsafe content flagged before delivery
LLM03 Training Data Poisoning	Poisoned training may systematically shift hallucination patterns	Calibration monitoring detects systematic accuracy degradation
LLM04 Model Denial of Service	Detection layer adds compute; may be targeted for overload	Detection service has independent rate limiting and circuit breaker
LLM05 Supply Chain Vulnerabilities	NLI model or LLM-as-judge may be compromised	Model integrity verification on deployment; vendor security review
LLM06 Sensitive Information Disclosure	Hallucinated content may fabricate sensitive details	Factual consistency check may detect fabricated sensitive info patterns
LLM07 Insecure Plugin Design	Tools may return data that models hallucinate over	Tool output included in retrieved context for grounding verification
LLM08 Excessive Agency	Hallucinated instructions in agentic chains can cause cascading errors	Agent output grounding check before passing to next agent step
LLM09 Overreliance	Hallucination detection directly addresses the overreliance risk	Confidence calibration monitoring; low-confidence responses flagged to users
LLM10 Model Theft	Out of scope for this pattern	Covered by EAAPL-OBS002 (Prompt Monitoring)

9. Governance Considerations

Responsible AI: Hallucination detection is a primary control in the AI harm prevention framework. The existence of systematic detection, human escalation, and correction workflows constitutes a defensible reasonable-care standard for AI-influenced decisions.

Model Risk Management: The hallucination rate metric is a Key Risk Indicator (KRI) for any material AI model. Persistent hallucination rate > 5% should trigger model risk review and may require model replacement or use-case restriction.

Human Approval: All responses with hallucination probability > 0.7 require human review before delivery. This threshold is reviewed quarterly. Domain-specific thresholds (e.g., clinical AI: > 0.3) may be stricter. Any threshold change requires risk officer sign-off.

Policy: Hallucination detection results must be retained as model performance records. Reviewer corrections and rejections are permanent model performance records. Hallucination rate KRI breaches must be reported to the AI risk committee within 2 business days.

Traceability: Every AI-influenced decision can be linked to its hallucination detection outcome, enabling post-hoc review of whether a specific harmful output was reviewed before delivery, and what controls were active.

Governance Artefacts

Artefact	Owner	Frequency	Format
Hallucination Rate KRI Report	AI Risk Officer	Weekly	Dashboard export + executive summary
Human Review Queue Statistics	AI Engineering	Daily	Automated report: volume, resolution time, rejection rate
Calibration Curve Report	ML Platform	Monthly	Calibration plot + ECE metric trend
Hallucination Classifier Audit	AI Governance	Quarterly	Classifier performance on holdout set + bias analysis
Escalation Threshold Review	AI Risk + Legal	Quarterly	Signed review document
High-Risk Hallucination Incident Log	Risk Management	Per incident	Incident record with timeline and remediation

10. Operational Considerations

Monitoring: The NLI model inference service is a critical dependency. Its latency (p99 must be < 200ms for synchronous path), availability (> 99.5%), and model version are all monitored. Hallucination rate trend alerts fire when the 24-hour rolling rate increases > 2 percentage points above the 7-day baseline.
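The trend alert can be expressed as a comparison between the last 24 hours and a pooled 7-day baseline. The sketch assumes counts are available as `(hallucinated, total)` pairs and that the baseline is volume-weighted across the prior seven days; the pattern does not state how the baseline is aggregated.

```python
from typing import Sequence, Tuple

Counts = Tuple[int, int]  # (estimated hallucinated responses, total responses)


def rate_pct(hallucinated: int, total: int) -> float:
    """Hallucination rate as a percentage; 0.0 when there is no traffic."""
    return 100.0 * hallucinated / total if total else 0.0


def trend_alert(last_24h: Counts, prior_7d: Sequence[Counts],
                max_increase_pts: float = 2.0) -> bool:
    """True when the 24-hour rate exceeds the 7-day baseline by more than the limit."""
    current = rate_pct(*last_24h)
    baseline = rate_pct(sum(h for h, _ in prior_7d),
                        sum(t for _, t in prior_7d))
    return current - baseline > max_increase_pts


baseline_days = [(40, 1000)] * 7  # 4% on each of the prior seven days
print(trend_alert((70, 1000), baseline_days))  # 7% today, 3 points above baseline
print(trend_alert((55, 1000), baseline_days))  # 5.5% today, 1.5 points above baseline
```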

Logging: Detection events are stored in a dedicated detection log store, separate from general AI telemetry, with stricter access controls. Reviewer decisions are stored in an immutable audit log.

Incident Response: Hallucination rate spike (> 10% in 1 hour) is a P1 incident triggering immediate review of recent model or prompt changes. All in-flight responses are escalated to human queue until rate returns to baseline.

Disaster Recovery: Human review queue must survive infrastructure failures. Queue is backed by durable message store (Kafka, SQS). NLI service failure activates "all-escalate" policy — all responses routed to human queue. This is sustainable for up to 4 hours before queue overwhelm.

Capacity Planning: Human reviewer capacity is the binding constraint. At a 1% sampling rate, 100K daily requests generate 1,000 items for human labeling per day; at a 5% hallucination rate, about 50 of those are expected to be labeled as hallucinated. At the 0.7 escalation threshold, approximately 1–3% of requests (1,000–3,000 per day at this volume) require reviewer intervention. Reviewer capacity must be planned accordingly.
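As a worked check of the sizing arithmetic: 1% of 100K requests is 1,000 sampled responses per day, of which about 50 would be expected to be hallucinated at a 5% rate, and a 1–3% escalation rate adds 1,000–3,000 escalated items. The sketch below uses a 2% escalation rate as an assumed midpoint and the 5 minutes per review figure from the cost section; the function and field names are illustrative.

```python
from typing import Dict


def reviewer_load(daily_requests: int, sample_rate: float = 0.01,
                  hallucination_rate: float = 0.05,
                  escalation_rate: float = 0.02,
                  minutes_per_review: float = 5.0) -> Dict[str, float]:
    """Estimate daily human-review volume and effort from the pattern's ratios."""
    sampled = daily_requests * sample_rate
    escalated = daily_requests * escalation_rate
    return {
        "sampled_for_labeling": sampled,
        "expected_hallucinated_in_sample": sampled * hallucination_rate,
        "escalated_for_review": escalated,
        "review_hours": (sampled + escalated) * minutes_per_review / 60.0,
    }


for name, value in reviewer_load(100_000).items():
    print(f"{name}: {value:.0f}")
```

Under these assumptions the combined labeling and escalation workload is on the order of 250 reviewer-hours per day at 100K requests, which is why reviewer capacity is described as the binding constraint.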

SLO Table

SLO	Target	Measurement	Alert Threshold
Hallucination rate (estimated)	< 5% for general use cases; < 1% for high-risk	Classifier estimate on full output stream	Sustained breach for 30 minutes
Grounding verifier latency	< 200ms p99	NLI inference latency histogram	> 500ms for 5 minutes
Human review resolution time	< 4 hours p90 for P1 escalations	Queue item creation to resolution timestamp	> 8 hours for P1
Calibration ECE	< 0.10 (10% calibration error)	Monthly calibration report	ECE > 0.15 triggers model review

Disaster Recovery Table

Component	RTO	RPO	Recovery Approach
NLI Inference Service	10 minutes	N/A (stateless)	Auto-scale; all-escalate policy during outage
Human Review Queue	5 minutes	Near-zero	Durable message store (Kafka / SQS) with replication
Detection Log Store	30 minutes	1 hour	Replicated storage; write-ahead log
Hallucination Classifier	60 minutes	Last checkpoint	Load previous checkpoint; retrain when pipeline recovers

11. Cost Considerations

Cost Drivers

Driver	Description	Relative Cost
NLI model inference (synchronous)	Compute-intensive; scales with response length and chunk count	High
LLM-as-judge (if used)	Additional LLM call per response evaluation; can double inference cost	Very High
Human reviewer time	1% sample at 5 minutes/review = 1,000 reviews/day (about 83 reviewer-hours) per 100K requests	High (human labour)
Detection log storage	Full response text + context + detection signals; large records	Medium
Classifier retraining compute	Weekly retraining job; scales with labeled data volume	Low

Scaling Risks: At very high volumes (> 1M requests/day), synchronous NLI inference may become cost-prohibitive. At scale, asynchronous detection with post-delivery flagging and correction workflow is the cost-sustainable approach. The trade-off is accepting some harmful delivery before human review catches it — only acceptable with a low-risk AI use case.

Optimisations:

Self-hosting the NLI model (DeBERTa) instead of using LLM-as-judge reduces cost by 50–90x
Selective activation: apply full grounding verification only to responses with confidence < 0.9 or containing factual claim signals (numbers, dates, named entities)
Cache NLI results for identical context-response pairs (rare but possible in templated outputs)
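Selective activation can be sketched as a cheap pre-filter in front of the grounding verifier. The regular expression below is a crude stand-in for "factual claim signals": it detects digits and month names only, and named-entity signals would need an NER step that is not shown. The 0.9 cutoff comes from the list above; the function name is hypothetical.

```python
import re

# Crude proxy for factual-claim signals: any digit or an English month name.
FACT_SIGNAL = re.compile(
    r"\d|\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\b"
)


def needs_full_check(response: str, confidence: float,
                     confidence_cutoff: float = 0.9) -> bool:
    """Run full grounding verification for low confidence or factual signals."""
    return confidence < confidence_cutoff or bool(FACT_SIGNAL.search(response))


print(needs_full_check("Thanks for your question.", 0.95))
print(needs_full_check("The fee is 40 dollars.", 0.99))
print(needs_full_check("Happy to help.", 0.50))
```

A filter like this trades some recall for cost, so its miss rate should be checked against the human-labeled sample before it is relied on in high-risk domains.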

Indicative Cost Range

Scale	AI Requests/Day	Estimated Hallucination Detection Cost/Month
Small	10,000	$500–$1,500 (human review dominates)
Medium	500,000	$3,000–$8,000
Large	5,000,000	$15,000–$40,000 (NLI compute dominates)
Enterprise	50,000,000+	$50,000–$200,000 (requires asynchronous architecture)

12. Trade-Off Analysis

Approach Comparison

Approach	Pros	Cons	Best For
Synchronous NLI grounding verification	Blocks hallucinated responses before delivery; defensible due diligence	Adds 50–200ms latency; NLI service becomes critical dependency	High-risk AI (clinical, legal, financial)
Asynchronous post-delivery detection + correction	No latency impact; scalable at high volume	Hallucinations delivered before detected; correction requires recall/notification workflow	Lower-risk use cases; very high volume systems
LLM-as-judge	High accuracy on complex reasoning; handles nuanced grounding; configurable criteria	10–100x cost of NLI model; correlated errors with generation model possible	Premium use cases with high accuracy requirement and budget
Human review only (no automated detection)	Highest accuracy; no false positives	Scales poorly; expensive; slow; samples < 1% of actual volume	Ultra-high-risk decisions where automation is not trusted

Architectural Tensions

Tension	Description	Resolution
Latency vs. Safety	Synchronous detection adds p99 latency to every response	Use async detection for lower-risk paths; sync for high-risk only; optimise NLI serving
Precision vs. Recall	High precision (few false positives) means some hallucinations pass; high recall floods human queue	Tune threshold by domain: 0.3 for clinical, 0.7 for general; monitor precision-recall trade-off quarterly
Cost vs. Coverage	Full NLI check on every response is expensive	Apply risk-tiered coverage: full check for high-risk domains, classifier-only for low-risk
Automation vs. Human Trust	Organisations may distrust automated detection; always want human review	Establish calibration data showing automated detection accuracy; earn trust with evidence

13. Failure Modes

Failure	Likelihood	Impact	Detection	Recovery
NLI model mis-scores complex multi-hop reasoning	Medium	High (hallucination passes)	Calibration monitoring; human review audit	Fine-tune NLI model on domain-specific examples
Human review queue overwhelm during incident	Medium	High (hallucinations delivered)	Queue depth metric; review time SLO breach	Emergency reviewer surge; temporary threshold raise with risk acceptance
Classifier concept drift (stale model)	Medium	Medium (inaccurate rate estimates)	Calibration ECE alert; holdout accuracy monitoring	Trigger emergency retraining
Retrieved chunks not passed to detection layer	Low	Critical (grounding verification blind)	Missing field alert in detection service	Fix data contract; audit all RAG pipeline integrations
LLM-as-judge inherits generation model errors	Medium	Medium (correlated failure)	Use different model family as judge	Multi-model ensemble for high-stakes decisions

Cascading Scenarios

Scenario 1: NLI service latency spike → timeout → all-escalate policy activates → human queue overwhelmed → reviewers approve without reading → hallucinations delivered at scale. Mitigation: reviewer capacity must be sized for 100% escalation for 1 hour; NLI SLO enforced.
Scenario 2: Classifier becomes stale → hallucination rate underestimated → SLO shows green → management increases AI deployment → actual hallucination rate high but undetected. Mitigation: classifier staleness alert; mandatory retraining schedule.

14. Regulatory Considerations

Regulation	Clause	Requirement	Hallucination Detection Implementation
EU AI Act	Article 9.2 (Risk Management)	High-risk AI must implement risk management measures including identification and analysis of known risks	Hallucination is a known risk; detection system implements technical control
EU AI Act	Article 9.5 (Testing)	High-risk AI systems must be tested to identify appropriate risk management measures	Continuous monitoring and human labeling = ongoing testing
EU AI Act	Article 14 (Human Oversight)	High-risk AI systems must allow human oversight; humans must be able to intervene	Human review queue directly implements Article 14 override capability
EU AI Act	Article 13 (Transparency)	High-risk AI users must be informed of capabilities and limitations	Hallucination detection enables informed disclosure of accuracy rates
APRA CPG 234	Paragraph 43 (Model Risk)	Material models require validation including ongoing performance monitoring	Hallucination rate KRI is a core model performance metric
Privacy Act 1988 (AU)	APP 3 (Collection)	Only collect information reasonably necessary	Human review records of responses must meet necessity test
ISO/IEC 42001	Clause 8.4 (AI System Operation)	Operational procedures for AI systems must include monitoring and intervention	Human review workflow documents operational intervention procedure
NIST AI RMF	MANAGE 2.4	Residual risks of AI systems monitored and managed	Hallucination rate KRI feeds residual risk tracking

15. Reference Implementations

AWS

NLI Inference: SageMaker endpoint hosting DeBERTa-v3-large NLI model; auto-scaling
LLM-as-Judge: Amazon Bedrock (Claude 3.5 Haiku for cost efficiency at scale)
Human Review Queue: Amazon SQS + custom review UI on React/Next.js; reviewer auth via Cognito
Detection Logs: Amazon DynamoDB (per-request detection events); Amazon S3 for archive
Classifier Retraining: SageMaker Training Job triggered by EventBridge weekly schedule
Dashboards: Amazon QuickSight; CloudWatch custom metrics
Alerts: CloudWatch Alarms → SNS → PagerDuty

Azure

NLI Inference: Azure Machine Learning Managed Endpoint; Azure Container Instances for NLI model
LLM-as-Judge: Azure OpenAI Service (GPT-4o)
Human Review Queue: Azure Service Bus + Azure Static Web App review UI
Detection Logs: Azure Cosmos DB; Azure Blob Storage archive
Classifier Retraining: Azure ML Pipelines scheduled job
Dashboards: Azure Monitor Workbooks; Power BI
Alerts: Azure Monitor Alerts → Logic Apps → Teams / PagerDuty

GCP

NLI Inference: Vertex AI Prediction endpoint with DeBERTa-v3 model
LLM-as-Judge: Vertex AI Gemini 1.5 Flash (cost-optimised judge)
Human Review Queue: Cloud Tasks + Cloud Run review UI
Detection Logs: Firestore (real-time); BigQuery (analytics)
Classifier Retraining: Vertex AI Pipelines
Dashboards: Looker; Cloud Monitoring
Alerts: Cloud Monitoring Alerting → PagerDuty

On-Premises

NLI Inference: Self-hosted DeBERTa-v3-large on GPU cluster (Triton Inference Server)
LLM-as-Judge: Self-hosted Llama 3.1 70B with structured output prompting
Human Review Queue: Apache Kafka + custom React review application
Detection Logs: PostgreSQL (events); ClickHouse (analytics); MinIO (archive)
Classifier Retraining: MLflow + custom training script on GPU cluster
Dashboards: Grafana
Alerts: Alertmanager → OpsGenie / PagerDuty

16. Related Patterns

Pattern ID	Pattern Name	Relationship	Notes
EAAPL-OBS001	AI Telemetry Architecture	Foundation	Log and trace infrastructure required; requestId linkage
EAAPL-OBS002	Prompt Monitoring	Sibling	Prompt-side controls; this pattern covers output-side controls
EAAPL-OBS004	AI Incident Management	Depends On	Hallucination rate spike is a defined incident type in OBS004
EAAPL-OBS005	Model Drift Detection	Sibling	Confidence calibration degradation detected here feeds drift monitoring
EAAPL-OBS008	AI Performance Benchmarking	Sibling	Offline benchmark hallucination rate complemented by online detection rate

17. Maturity Assessment

Overall Maturity: Emerging

Dimension	Score (1–5)	Rationale
Adoption Breadth	2	Fewer than 20% of enterprise AI deployments have systematic runtime detection
Tooling Ecosystem	3	NLI models mature; hallucination-specific evaluation frameworks (RAGAS, TruLens) maturing rapidly
Operational Runbook Coverage	2	Runbooks are organisation-specific; no widely adopted standard
Regulatory Evidence	3	EU AI Act Article 14 creates explicit demand; APRA guidance emerging
Cost Predictability	2	NLI inference cost at scale still being benchmarked; LLM-as-judge cost is high and variable
Team Skill Availability	2	NLI fine-tuning and calibration skills are specialised; limited talent pool

18. Revision History

Version	Date	Author	Changes
1.0.0	2026-06-12	EAAPL Working Group	Initial publication
