EAAPL-OBS003 · Hallucination Detection
Pattern ID: EAAPL-OBS003
Status: Emerging
Complexity: High
Tags: rag llm human-oversight model-risk high-complexity
Version: 1.0.0
Last Reviewed: 2026-06-12
1. Executive Summary
Large language models fabricate plausible-sounding content with confidence. In enterprise settings, a hallucinated legal citation, an invented financial figure, or a fabricated clinical guideline can trigger regulatory liability, customer harm, and reputational damage. Despite widespread awareness of this risk, fewer than 20% of enterprise AI deployments have systematic runtime detection — most rely on post-hoc user feedback, which captures only a fraction of hallucinations and arrives too late to prevent harm.
This pattern defines a layered, runtime hallucination detection architecture for production AI systems, with emphasis on RAG (retrieval-augmented generation) deployments where grounding verification is tractable. It covers: RAG grounding verification using NLI (Natural Language Inference) models and LLM-as-judge; source citation validation; factual consistency checking against retrieved context; confidence calibration monitoring; automated hallucination rate estimation using a hybrid human-labeled/ML-classifier approach; human-in-the-loop escalation for high-probability hallucinations; and post-delivery feedback collection to close the calibration loop. The outcome is a defensible claim that the organisation has implemented reasonable technical controls to detect and respond to AI hallucinations — a requirement emerging in EU AI Act Article 9 and APRA's AI Risk Management guidance.
Target Audience: CIO, CTO, AI Engineering Lead, Risk Officer, Chief Compliance Officer
Time to Implement: 10–16 weeks (high complexity; NLI model selection and calibration are the long poles)
2. Problem Statement
Business Problem
Organisations are deploying AI systems to answer customer questions, draft legal documents, generate financial summaries, and provide clinical guidance. When these systems hallucinate — and they do, at rates between 3% and 27% depending on the task and model — the business consequences range from customer complaints to regulatory sanctions to legal liability. The absence of runtime hallucination detection means the organisation cannot demonstrate due diligence when harm occurs.
Technical Problem
Hallucination is not a binary property and cannot be detected with a single mechanism. A response can be well-grounded in retrieved context yet still factually incorrect (the context itself was wrong). It can cite real sources that don't say what the response claims. It can include accurate general facts alongside fabricated specifics. Detecting hallucinations requires multiple complementary mechanisms operating at different levels of the output.
Symptoms
- Customers or staff occasionally report AI outputs that contradict facts or cite non-existent sources
- No metric exists for hallucination rate; the organisation cannot answer "what percentage of our AI outputs are hallucinated?"
- After a hallucination incident, the system cannot determine when the hallucination began, how many users were affected, or whether similar outputs were delivered previously
- AI systems are used for financial or clinical decisions with no confidence threshold or escalation trigger
- User feedback (thumbs down) is the only hallucination signal, capturing < 5% of actual occurrences
Cost of Inaction
- Undetected hallucinations in high-risk domains (legal, financial, clinical) create direct liability
- Regulatory finding under EU AI Act Article 9 for high-risk AI systems lacking risk management controls
- APRA model risk management finding for material AI models lacking output validation
- Reputational damage when hallucinations become public — disproportionate media amplification
- Customer trust erosion: users who encounter even one hallucination may durably reduce their AI usage
3. Context
When to Apply
- RAG systems where retrieved context is available for grounding verification
- AI systems making factual claims (not creative writing, brainstorming, or opinion generation)
- High-risk AI applications (legal, financial, clinical, government)
- Systems where hallucination rate > 1% has business or regulatory consequence
- Prerequisite: EAAPL-OBS001 telemetry must provide the log stream for detection events
When NOT to Apply
- Pure creative writing or brainstorming systems (no factual grounding to verify)
- Systems generating code (different validation strategy: execution testing, not NLI)
- Very low-stakes applications where hallucination consequence is negligible
- Real-time conversational systems where a < 100ms response budget makes synchronous NLI infeasible (the synchronous path does not apply; use asynchronous streaming detection instead)
Prerequisites
| Prerequisite | Required | Notes |
|---|---|---|
| RAG pipeline with retrievable source chunks | Strongly Recommended | Grounding verification requires access to the context the model used |
| NLI model or LLM-as-judge service | Required | Core detection mechanism |
| EAAPL-OBS001 AI Telemetry Infrastructure | Required | Log storage for detection events and calibration data |
| Human review workflow | Required | Escalation target; calibration labeling |
| Feedback collection mechanism | Required | Post-delivery signal for calibration |
Industry Applicability
| Industry | Applicability | Primary Driver |
|---|---|---|
| Financial Services | Critical | ASIC regulatory guidance, APRA model risk, financial advice liability |
| Healthcare | Critical | Clinical safety, TGA obligations for software as medical device |
| Legal Services | Critical | Professional liability, bar association obligations |
| Government | High | Public trust, mandatory accuracy obligations |
| Retail / E-Commerce | Medium | Product information accuracy, consumer law |
| Technology / SaaS | High | Contractual accuracy obligations, platform liability |
4. Architecture Overview
The Hallucination Detection Architecture is a multi-layer system combining real-time output analysis, statistical quality monitoring, and human-in-the-loop review. The architecture acknowledges that no single detection mechanism achieves acceptable precision and recall across all hallucination types; a defence-in-depth approach is required.
Layer 1: RAG Grounding Verification
When a RAG pipeline retrieves source chunks and passes them to the model as context, the generated output can be evaluated for grounding in those chunks. The grounding verifier takes the generated response and the source context as inputs and asks: does the context support the claims made in the response? This is an NLI (Natural Language Inference) task with three labels: entailment (context supports claim), neutral (claim is not addressed by context), and contradiction (context contradicts claim).
Two implementation options are viable:
- NLI model: a fine-tuned cross-encoder (e.g., DeBERTa-v3-large fine-tuned on NLI datasets such as MultiNLI and ANLI) runs inference on response-chunk pairs. Sentence decomposition splits the response into individual factual claims before NLI evaluation. A response is flagged if any claim is scored as a contradiction with confidence > 0.8.
- LLM-as-judge: a separate LLM call (using a different model from the generation model to avoid correlated errors) evaluates whether the response is grounded in the provided context and returns a structured verdict with a confidence score and explanation. LLM-as-judge is more expensive but handles complex reasoning chains better.
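
The claim-level flagging rule for the NLI option can be sketched as follows. This is a minimal illustration, not a reference implementation: the `NliScorer` callable, the naive regex sentence splitter, and the `stub_scorer` stand-in are all assumptions introduced here; in practice the scorer would wrap whichever NLI model or judge service is deployed.

```python
import re
from typing import Callable, Dict, List

# Hypothetical scorer type: (premise, hypothesis) -> label probabilities.
NliScorer = Callable[[str, str], Dict[str, float]]


def split_claims(response: str) -> List[str]:
    """Naive sentence decomposition: split after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", response.strip())
    return [p for p in parts if p]


def find_contradictions(response: str, chunks: List[str], score: NliScorer,
                        threshold: float = 0.8) -> List[dict]:
    """Return every (claim, chunk) pair whose contradiction score exceeds the threshold."""
    flagged = []
    for claim in split_claims(response):
        for idx, chunk in enumerate(chunks):
            p = score(chunk, claim).get("contradiction", 0.0)
            if p > threshold:
                flagged.append({"claim": claim, "chunk_index": idx, "contradiction": p})
    return flagged


def stub_scorer(premise: str, hypothesis: str) -> Dict[str, float]:
    """Toy stand-in for a real NLI model, for demonstration only."""
    if "million" in premise and "billion" in hypothesis:
        return {"entailment": 0.02, "neutral": 0.03, "contradiction": 0.95}
    return {"entailment": 0.90, "neutral": 0.08, "contradiction": 0.02}


flags = find_contradictions(
    "Revenue was $2.3 billion. The report was published in 2023.",
    ["Revenue was $2.3 million in the period."],
    stub_scorer,
)
```

A response is treated as flagged when `flags` is non-empty. A production splitter would need to handle abbreviations and multi-clause sentences, which this regex does not.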
Layer 2: Source Citation Validation
When the model's response includes explicit source citations (document titles, URLs, or retrieved chunk IDs), a citation validator checks: (a) does the cited source exist in the retrieval corpus? (b) does the cited source contain content that supports the claim attributed to it? This catches a specific and common hallucination pattern: the model fabricates plausible-sounding source titles that don't exist, or correctly cites a real source but misrepresents what it says.
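
The two checks (a) and (b) can be sketched as a small classifier over extracted citations. The citation record shape, the `supports` callable, and the substring-based `toy_supports` check are assumptions made for illustration; a real deployment would likely reuse the grounding verifier for check (b).

```python
from typing import Callable, Dict, List


def validate_citations(citations: List[dict], corpus: Dict[str, str],
                       supports: Callable[[str, str], bool]) -> List[dict]:
    """Classify each citation as valid, missing_source, or misrepresented.

    citations: [{"source_id": ..., "claim": ...}]  (hypothetical record shape)
    corpus:    source_id -> source text from the retrieval corpus
    supports:  hypothetical check that the source text supports the claim
    """
    results = []
    for citation in citations:
        text = corpus.get(citation["source_id"])
        if text is None:
            status = "missing_source"   # check (a): cited source does not exist
        elif not supports(text, citation["claim"]):
            status = "misrepresented"   # check (b): source exists but does not support the claim
        else:
            status = "valid"
        results.append({**citation, "status": status})
    return results


def toy_supports(source_text: str, claim: str) -> bool:
    """Toy support check (case-insensitive substring match), for demonstration only."""
    return claim.lower() in source_text.lower()


corpus = {"doc-1": "The refund window is 30 days."}
results = validate_citations(
    [
        {"source_id": "doc-1", "claim": "refund window is 30 days"},
        {"source_id": "doc-1", "claim": "refund window is 90 days"},
        {"source_id": "doc-9", "claim": "refunds are instant"},
    ],
    corpus,
    toy_supports,
)
```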
Layer 3: Factual Consistency Checking
For responses containing numerical claims, dates, names, or specific facts, an extraction and verification step checks these specific claims against retrieved context. Named entity recognition identifies claims of type NUMBER, DATE, PERSON, ORG, and STATISTIC in the response. Each extracted claim is then checked against the retrieved context using exact or fuzzy matching. Mismatches (e.g., response says "$2.3 billion", context says "$2.3 million") are flagged as high-confidence hallucinations.
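
A minimal sketch of the numeric part of this check is shown below, using a regular expression in place of a full NER pipeline (an assumption made to keep the example self-contained). It normalises magnitude words so that "$2.3 billion" and "$2.3 million" compare as different values. It does not handle thousands separators, currencies, or dates as typed entities.

```python
import math
import re
from typing import List

_SCALE = {"thousand": 1e3, "million": 1e6, "billion": 1e9, "trillion": 1e12}
_NUM = re.compile(r"\$?(\d+(?:\.\d+)?)\s*(thousand|million|billion|trillion)?",
                  re.IGNORECASE)


def extract_amounts(text: str) -> List[float]:
    """Extract numbers and apply any magnitude word that follows them."""
    amounts = []
    for digits, scale in _NUM.findall(text):
        amounts.append(float(digits) * _SCALE.get(scale.lower(), 1.0))
    return amounts


def find_mismatches(response: str, context: str, rel_tol: float = 1e-9) -> List[float]:
    """Return numeric values in the response that have no match in the context."""
    context_amounts = extract_amounts(context)
    return [
        amount for amount in extract_amounts(response)
        if not any(math.isclose(amount, c, rel_tol=rel_tol) for c in context_amounts)
    ]


mismatches = find_mismatches(
    "Revenue reached $2.3 billion in 2023.",
    "Revenue reached $2.3 million in 2023.",
)
```

Here `mismatches` contains only the billion-scale value; the year 2023 appears in both texts and is not flagged.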
Layer 4: Confidence Calibration Monitoring
Model APIs often return logprob-based confidence scores or can be prompted for self-assessed confidence. The confidence calibration monitor tracks the relationship between these scores and actual accuracy (as measured by human labels and NLI grounding scores). A model is well calibrated when, among the responses it assigns confidence 0.9, about 90% are correct. Calibration is measured using Expected Calibration Error (ECE) and visualised with calibration curves. Degrading calibration, where the model becomes confidently wrong more often, is a signal of model drift (see EAAPL-OBS005).
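
For reference, ECE with equal-width bins can be computed as in the sketch below: responses are grouped by confidence, and the gap between mean confidence and observed accuracy in each bin is weighted by the bin's share of samples. The bin count of 10 and the pure-Python implementation are choices made for this illustration.

```python
from typing import List, Sequence


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """ECE over equal-width confidence bins in [0, 1]."""
    if len(confidences) != len(correct) or len(confidences) == 0:
        raise ValueError("confidences and correct must be non-empty and equal length")
    bins: List[list] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 falls in the last bin
        bins[idx].append((conf, 1.0 if ok else 0.0))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(o for _, o in members) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


# Calibrated: confidence 0.9 and 9 of 10 correct.
calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False])
# Overconfident: confidence 0.95 but only half correct.
overconfident = expected_calibration_error([0.95] * 4, [True, True, False, False])
```

The first case yields an ECE of effectively zero, the second a large gap between stated confidence and accuracy.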
Layer 5: Hallucination Rate Metric
A statistically valid hallucination rate requires human-labeled ground truth. A sampling pipeline draws 1% of all responses for human review. Reviewers label each response as: grounded, partially grounded, or hallucinated. This labeled sample trains and continuously retrains an ML classifier that can then estimate hallucination probability on the full unlabeled output stream. The resulting hallucination rate metric (estimated percentage of outputs with hallucination) is reported on dashboards and used for SLO tracking.
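
Because the rate is estimated from a sample, it helps to report an interval alongside the point estimate. The sketch below uses the Wilson score interval, which is a choice made for this example rather than something the pattern prescribes; how "partially grounded" labels are counted is also left to the organisation.

```python
import math
from typing import Tuple


def wilson_interval(hallucinated: int, sample_size: int,
                    z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 for roughly 95% coverage)."""
    if sample_size <= 0 or not 0 <= hallucinated <= sample_size:
        raise ValueError("need 0 <= hallucinated <= sample_size and sample_size > 0")
    p = hallucinated / sample_size
    denom = 1.0 + z * z / sample_size
    centre = (p + z * z / (2 * sample_size)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / sample_size + z * z / (4 * sample_size ** 2)
    ) / denom
    return centre - half_width, centre + half_width


# Example: 50 responses labeled hallucinated in a reviewed sample of 1,000.
low, high = wilson_interval(50, 1000)
```

A narrow interval requires a reasonably large labeled sample, which is one reason to stratify and track sample size per model and template.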
Human-in-the-Loop Escalation
When the grounding verifier or factual consistency checker assigns a hallucination probability above a configurable threshold (default: 0.7), the response is flagged for human review before delivery. The escalation workflow routes the flagged response to a human reviewer queue. The reviewer can approve delivery as-is, approve a modified version, or reject delivery. For synchronous user-facing applications, this creates a user-perceptible delay, which is acceptable only when the risk of delivering an unverified response exceeds the cost of latency. For asynchronous applications (document generation, batch processing), escalation is always preferred over delivery of flagged responses.
5. Architecture Diagram
6. Components
| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| Grounding Verifier | ML Inference Service | NLI-based grounding check of response against retrieved chunks | DeBERTa-v3-large NLI fine-tuned; LLM-as-judge (GPT-4o, Claude 3.5); sentence-transformers | Critical |
| Citation Validator | Service | Verify existence and accuracy of cited sources | Custom service querying vector index + retrieval corpus | High |
| Factual Consistency Checker | Service | Extract and verify numerical, date, named entity claims | spaCy NER + fuzzy matching against retrieved context; custom rules | High |
| Hallucination Classifier | ML Model | Estimate hallucination probability on unlabeled outputs | Trained on human-labeled sample; DistilBERT or similar; updated weekly | High |
| Human Review Queue | Workflow | Route high-risk responses to human reviewers; track decisions | Jira, ServiceNow, custom review UI; priority-sorted queue | Critical |
| Confidence Calibration Monitor | Batch Job | Track calibration curves; compute ECE; alert on degradation | Python (sklearn calibration_curve); scheduled analysis | Medium |
| Sampling Pipeline | Stream Processor | Draw 1% sample for human labeling; stratify by model, template, confidence | Kafka consumer with sampling; or scheduled SQL query | High |
| Human Labeling Interface | UI | Present response + context to reviewers; capture grounded/partial/hallucinated label | Custom React app; Label Studio; Prodigy | High |
| Hallucination Rate Dashboard | UI | Real-time and trend hallucination rate by model, template, use case | Grafana; Datadog; custom dashboard | Medium |
| Alert Manager | Integration | Alert on hallucination rate SLO breach; escalation routing | Alertmanager, PagerDuty; severity-based routing | High |
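
One way to implement the Sampling Pipeline's 1% draw is deterministic hash-based sampling on the request identifier, sketched below. This approach is an assumption of the example, not a stated design of the pattern; its appeal is that the same request always gets the same decision, which makes the sample reproducible across consumers. Stratification can be approximated by passing a different `rate` per model, template, or confidence band.

```python
import hashlib


def in_sample(request_id: str, rate: float = 0.01) -> bool:
    """Deterministically select roughly `rate` of requests based on a hash of the ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64  # uniform value in [0, 1)
    return bucket < rate


# Count how many of 20,000 synthetic request IDs fall into a 1% sample.
selected = sum(in_sample(f"req-{i}") for i in range(20000))
```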
7. Data Flow
Primary Flow
| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | RAG Pipeline | Retrieves context chunks; generates response; passes both to detection layer | Response text + retrieved_chunks[] + requestId |
| 2 | Grounding Verifier | Decomposes response into claims; runs NLI on each claim against each chunk | Grounding score per claim; overall grounding confidence; contradiction flags |
| 3 | Citation Validator | Extracts any explicit citations from response; verifies existence and content alignment | Citation validity score; flagged invalid or misrepresented citations |
| 4 | Factual Consistency Checker | Extracts entities (numbers, dates, names); checks against chunks | Consistency score; flagged mismatched facts |
| 5 | Hallucination Classifier | Combines grounding, citation, consistency signals; outputs hallucination probability | Hallucination probability score (0.0–1.0) |
| 6 | Routing Decision | Threshold logic: > 0.7 → human queue; 0.3–0.7 → flag + deliver; < 0.3 → deliver | Routing decision + detection log record |
| 7 | Human Reviewer (if escalated) | Receives response + context + detection flags; reviews; approves/rejects/modifies | Reviewer decision + correction if modified |
| 8 | Telemetry Logger | Records full detection event with requestId, scores, routing decision, reviewer outcome | Immutable detection event record |
| 9 | Calibration Pipeline | Samples 1% of all decisions for human labeling; uses labels to retrain classifier | Updated classifier weights; calibration metrics |
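
The threshold logic in step 6 can be sketched as a single function. The route names are hypothetical labels; the boundary handling (exactly 0.7 and exactly 0.3 both fall in the flag-and-deliver band) is one reading of the ranges in the table.

```python
def route(probability: float, escalate_above: float = 0.7,
          flag_above: float = 0.3) -> str:
    """Map a hallucination probability to a routing decision."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be within [0, 1]")
    if probability > escalate_above:
        return "human_queue"
    if probability >= flag_above:
        return "flag_and_deliver"
    return "deliver"


decisions = [route(p) for p in (0.85, 0.50, 0.10)]
# A stricter domain-specific threshold, e.g. for clinical use cases.
clinical = route(0.35, escalate_above=0.3)
```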
Error Flow
| Error Scenario | Detection | Action | Recovery |
|---|---|---|---|
| NLI model service unavailable | Health check failure; grounding API timeout | Escalate all responses to human queue; alert P1 | Restore NLI service; process queue backlog |
| Grounding verifier latency spike (> 500ms) | SLO breach alert | Increase timeout threshold; alert; switch to LLM-as-judge fallback | Investigate; scale NLI service |
| Human review queue overwhelmed (> 100 pending) | Queue depth metric alert | Alert to AI engineering; consider temporary threshold relaxation with risk acceptance | Increase reviewer capacity; temporary threshold adjustment requires risk sign-off |
| Hallucination classifier stale (last retrain > 30 days) | Retraining job failure; staleness alert | Alert to ML platform; run emergency retraining | Investigate labeling pipeline; manual retraining trigger |
| Retrieved chunks not available in detection layer | Missing field in request to detection service | Skip grounding verification; apply higher-risk flag; escalate to human review | Fix RAG pipeline to pass chunks to detection layer |
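
Two of the rows above (verifier unavailable, chunks missing) share a fail-closed behaviour: when grounding cannot be checked, the response goes to human review. A minimal sketch follows, assuming a hypothetical `verify` callable that raises `TimeoutError` or `ConnectionError` when the NLI service is unreachable; the field names in the returned record are illustrative only.

```python
from typing import Callable, List, Optional


def detect_with_fallback(response: str, chunks: Optional[List[str]],
                         verify: Callable[[str, List[str]], float]) -> dict:
    """Run grounding verification, failing closed to the human queue on errors."""
    if not chunks:
        # Retrieved chunks were not passed through: grounding cannot be checked.
        return {"route": "human_queue", "reason": "missing_chunks", "probability": None}
    try:
        probability = verify(response, chunks)
    except (TimeoutError, ConnectionError):
        # Verifier unavailable: all-escalate policy.
        return {"route": "human_queue", "reason": "verifier_unavailable", "probability": None}
    # Normal path: leave the decision to the usual threshold logic.
    return {"route": "threshold_routing", "reason": "ok", "probability": probability}


normal = detect_with_fallback("The limit is 30 days.", ["The limit is 30 days."],
                              lambda response, chunks: 0.1)
no_chunks = detect_with_fallback("The limit is 30 days.", None,
                                 lambda response, chunks: 0.1)
```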
8. Security Considerations
Authentication: Detection services authenticate to the main AI pipeline via service-to-service tokens. The human review queue authenticates reviewers via SSO with MFA. Access to hallucination detection logs requires separate authorisation from general telemetry logs.
Authorisation: Hallucination detection results are sensitive — they reveal AI system quality levels that may be commercially sensitive or relevant to litigation. Access restricted to AI engineering, risk management, and legal. Dashboard access for quality monitoring is broader.
Secrets Management: LLM-as-judge service API keys stored in secrets manager with separate rotation from primary generation keys. NLI model served internally requires no external API key.
Data Classification: Response text stored in detection logs is classified at least as Confidential. Detection events for high-risk domains (clinical, legal, financial) classified as Restricted. Reviewer comments on rejected responses classified as Privileged if written by legal reviewers.
Encryption: All detection data encrypted in transit (TLS 1.3) and at rest (AES-256). Human review queue messages encrypted end-to-end if containing sensitive response content.
Auditability: Every hallucination detection event is immutable. Reviewer decisions are immutable. The chain from requestId to detection event to reviewer decision is traceable for legal discovery.
OWASP LLM Top 10 Coverage
| OWASP LLM Risk | Hallucination Detection Control | Implementation |
|---|---|---|
| LLM01 Prompt Injection | Injected instructions may produce hallucinations; detection surfaces them | Hallucinations triggered by injection flagged for security review |
| LLM02 Insecure Output Handling | Hallucinated outputs may contain injections or unsafe content | Output validation in detection layer; unsafe content flagged before delivery |
| LLM03 Training Data Poisoning | Poisoned training may systematically shift hallucination patterns | Calibration monitoring detects systematic accuracy degradation |
| LLM04 Model Denial of Service | Detection layer adds compute; may be targeted for overload | Detection service has independent rate limiting and circuit breaker |
| LLM05 Supply Chain Vulnerabilities | NLI model or LLM-as-judge may be compromised | Model integrity verification on deployment; vendor security review |
| LLM06 Sensitive Information Disclosure | Hallucinated content may fabricate sensitive details | Factual consistency check may detect fabricated sensitive info patterns |
| LLM07 Insecure Plugin Design | Tools may return data that models hallucinate over | Tool output included in retrieved context for grounding verification |
| LLM08 Excessive Agency | Hallucinated instructions in agentic chains can cause cascading errors | Agent output grounding check before passing to next agent step |
| LLM09 Overreliance | Hallucination detection directly addresses the overreliance risk | Confidence calibration monitoring; low-confidence responses flagged to users |
| LLM10 Model Theft | Out of scope for this pattern | Covered by EAAPL-OBS002 (Prompt Monitoring) |
9. Governance Considerations
Responsible AI: Hallucination detection is a primary control in the AI harm prevention framework. The existence of systematic detection, human escalation, and correction workflows constitutes a defensible reasonable-care standard for AI-influenced decisions.
Model Risk Management: The hallucination rate metric is a Key Risk Indicator (KRI) for any material AI model. Persistent hallucination rate > 5% should trigger model risk review and may require model replacement or use-case restriction.
Human Approval: All responses with hallucination probability > 0.7 require human review before delivery. This threshold is reviewed quarterly. Domain-specific thresholds (e.g., clinical AI: > 0.3) may be stricter. Any threshold change requires risk officer sign-off.
Policy: Hallucination detection results must be retained as model performance records. Reviewer corrections and rejections are permanent model performance records. Hallucination rate KRI breaches must be reported to the AI risk committee within 2 business days.
Traceability: Every AI-influenced decision can be linked to its hallucination detection outcome, enabling post-hoc review of whether a specific harmful output was reviewed before delivery, and what controls were active.
Governance Artefacts
| Artefact | Owner | Frequency | Format |
|---|---|---|---|
| Hallucination Rate KRI Report | AI Risk Officer | Weekly | Dashboard export + executive summary |
| Human Review Queue Statistics | AI Engineering | Daily | Automated report: volume, resolution time, rejection rate |
| Calibration Curve Report | ML Platform | Monthly | Calibration plot + ECE metric trend |
| Hallucination Classifier Audit | AI Governance | Quarterly | Classifier performance on holdout set + bias analysis |
| Escalation Threshold Review | AI Risk + Legal | Quarterly | Signed review document |
| High-Risk Hallucination Incident Log | Risk Management | Per incident | Incident record with timeline and remediation |
10. Operational Considerations
Monitoring: The NLI model inference service is a critical dependency. Its latency (p99 must be < 200ms for synchronous path), availability (> 99.5%), and model version are all monitored. Hallucination rate trend alerts fire when the 24-hour rolling rate increases > 2 percentage points above the 7-day baseline.
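
The trend alert condition can be expressed directly, as in the sketch below. Rates are assumed to be fractions (0.05 for 5%), and the difference is converted to percentage points before comparison; the function name and signature are illustrative.

```python
def rate_trend_alert(rate_24h: float, baseline_7d: float,
                     max_increase_pp: float = 2.0) -> bool:
    """True when the 24-hour rate exceeds the 7-day baseline by more than
    max_increase_pp percentage points. Rates are fractions, e.g. 0.05 for 5%."""
    return (rate_24h - baseline_7d) * 100.0 > max_increase_pp


fires = rate_trend_alert(rate_24h=0.08, baseline_7d=0.05)   # 3 points above baseline
quiet = rate_trend_alert(rate_24h=0.06, baseline_7d=0.05)   # 1 point above baseline
```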
Logging: Detection events are stored in a dedicated detection log store, separate from general AI telemetry, with stricter access controls. Reviewer decisions are stored in an immutable audit log.
Incident Response: Hallucination rate spike (> 10% in 1 hour) is a P1 incident triggering immediate review of recent model or prompt changes. All in-flight responses are escalated to human queue until rate returns to baseline.
Disaster Recovery: Human review queue must survive infrastructure failures. Queue is backed by durable message store (Kafka, SQS). NLI service failure activates "all-escalate" policy — all responses routed to human queue. This is sustainable for up to 4 hours before queue overwhelm.
Capacity Planning: Human reviewer capacity is the binding constraint. At a 1% sampling rate, 100K daily requests generate 1,000 items for human labeling per day; at a 5% hallucination rate, about 50 of those are expected to be labeled hallucinated. At the 0.7 escalation threshold, approximately 1–3% of requests (1,000–3,000 per day at this volume) require reviewer intervention. Reviewer capacity must be planned accordingly.
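
The reviewer workload arithmetic can be sketched as follows. The labeling sample scales with the sampling rate and escalations scale with the escalation rate. The 2% escalation rate is a midpoint picked for illustration, and applying the same 5 minutes per item to both labeling and escalation reviews is an assumption.

```python
def reviewer_load(daily_requests: int, sampling_rate: float = 0.01,
                  escalation_rate: float = 0.02,
                  hallucination_rate: float = 0.05,
                  minutes_per_review: float = 5.0) -> dict:
    """Estimate daily human review volume and effort."""
    labeling_items = daily_requests * sampling_rate
    escalations = daily_requests * escalation_rate
    return {
        "labeling_items": labeling_items,
        "expected_hallucinated_in_sample": labeling_items * hallucination_rate,
        "escalations": escalations,
        "reviewer_hours": (labeling_items + escalations) * minutes_per_review / 60.0,
    }


load = reviewer_load(100_000)
```

At 100K requests per day these assumptions give 1,000 labeling items, about 50 expected hallucinated items within that sample, 2,000 escalations, and 250 reviewer-hours per day.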
SLO Table
| SLO | Target | Measurement | Alert Threshold |
|---|---|---|---|
| Hallucination rate (estimated) | < 5% for general use cases; < 1% for high-risk | Classifier estimate on full output stream | Sustained breach for 30 minutes |
| Grounding verifier latency | < 200ms p99 | NLI inference latency histogram | > 500ms for 5 minutes |
| Human review resolution time | < 4 hours p90 for P1 escalations | Queue item creation to resolution timestamp | > 8 hours for P1 |
| Calibration ECE | < 0.10 (10% calibration error) | Monthly calibration report | ECE > 0.15 triggers model review |
Disaster Recovery Table
| Component | RTO | RPO | Recovery Approach |
|---|---|---|---|
| NLI Inference Service | 10 minutes | N/A (stateless) | Auto-scale; all-escalate policy during outage |
| Human Review Queue | 5 minutes | Near-zero | Durable message store (Kafka / SQS) with replication |
| Detection Log Store | 30 minutes | 1 hour | Replicated storage; write-ahead log |
| Hallucination Classifier | 60 minutes | Last checkpoint | Load previous checkpoint; retrain when pipeline recovers |
11. Cost Considerations
Cost Drivers
| Driver | Description | Relative Cost |
|---|---|---|
| NLI model inference (synchronous) | Compute-intensive; scales with response length and chunk count | High |
| LLM-as-judge (if used) | Additional LLM call per response evaluation; can double inference cost | Very High |
| Human reviewer time | 1% sample = 1,000 reviews/day per 100K requests; at 5 minutes/review, roughly 83 reviewer-hours/day | High (human labour) |
| Detection log storage | Full response text + context + detection signals; large records | Medium |
| Classifier retraining compute | Weekly retraining job; scales with labeled data volume | Low |
Scaling Risks: At very high volumes (> 1M requests/day), synchronous NLI inference may become cost-prohibitive. At scale, asynchronous detection with post-delivery flagging and correction workflow is the cost-sustainable approach. The trade-off is accepting some harmful delivery before human review catches it — only acceptable with a low-risk AI use case.
Optimisations:
- Self-hosted NLI model (DeBERTa) vs. LLM-as-judge reduces cost by 50–90x
- Selective activation: apply full grounding verification only to responses with confidence < 0.9 or containing factual claim signals (numbers, dates, named entities)
- Cache NLI results for identical context-response pairs (rare but possible in templated outputs)
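
The selective activation and caching optimisations can be sketched as below. Treating any digit as a "factual claim signal" is a deliberately crude proxy chosen for this example; the `has_named_entities` flag is a hypothetical input that an upstream NER step would supply.

```python
import hashlib
import re

_FACT_SIGNAL = re.compile(r"\d")  # crude proxy: any digit suggests a number or date


def needs_full_verification(response: str, confidence: float,
                            confidence_floor: float = 0.9,
                            has_named_entities: bool = False) -> bool:
    """Run full grounding verification only for low-confidence or fact-bearing responses."""
    return (
        confidence < confidence_floor
        or has_named_entities
        or bool(_FACT_SIGNAL.search(response))
    )


def cache_key(context: str, response: str) -> str:
    """Stable key for caching NLI results of an identical context-response pair."""
    h = hashlib.sha256()
    h.update(context.encode("utf-8"))
    h.update(b"\x00")  # separator so ("ab", "c") and ("a", "bc") hash differently
    h.update(response.encode("utf-8"))
    return h.hexdigest()


skip = needs_full_verification("Thanks for reaching out, happy to help.", 0.95)
check = needs_full_verification("Your balance is 42 dollars.", 0.95)
```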
Indicative Cost Range
| Scale | AI Requests/Day | Estimated Hallucination Detection Cost/Month |
|---|---|---|
| Small | 10,000 | $500–$1,500 (human review dominates) |
| Medium | 500,000 | $3,000–$8,000 |
| Large | 5,000,000 | $15,000–$40,000 (NLI compute dominates) |
| Enterprise | 50,000,000+ | $50,000–$200,000 (requires asynchronous architecture) |
12. Trade-Off Analysis
Approach Comparison
| Approach | Pros | Cons | Best For |
|---|---|---|---|
| Synchronous NLI grounding verification | Blocks hallucinated responses before delivery; defensible due diligence | Adds 50–200ms latency; NLI service becomes critical dependency | High-risk AI (clinical, legal, financial) |
| Asynchronous post-delivery detection + correction | No latency impact; scalable at high volume | Hallucinations delivered before detected; correction requires recall/notification workflow | Lower-risk use cases; very high volume systems |
| LLM-as-judge | High accuracy on complex reasoning; handles nuanced grounding; configurable criteria | 10–100x cost of NLI model; correlated errors with generation model possible | Premium use cases with high accuracy requirement and budget |
| Human review only (no automated detection) | Highest accuracy; no false positives | Scales poorly; expensive; slow; samples < 1% of actual volume | Ultra-high-risk decisions where automation is not trusted |
Architectural Tensions
| Tension | Description | Resolution |
|---|---|---|
| Latency vs. Safety | Synchronous detection adds p99 latency to every response | Use async detection for lower-risk paths; sync for high-risk only; optimise NLI serving |
| Precision vs. Recall | High precision (few false positives) means some hallucinations pass; high recall floods human queue | Tune threshold by domain: 0.3 for clinical, 0.7 for general; monitor precision-recall trade-off quarterly |
| Cost vs. Coverage | Full NLI check on every response is expensive | Apply risk-tiered coverage: full check for high-risk domains, classifier-only for low-risk |
| Automation vs. Human Trust | Organisations may distrust automated detection; always want human review | Establish calibration data showing automated detection accuracy; earn trust with evidence |
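
The precision-recall tension can be made concrete by evaluating candidate thresholds against human-labeled data, as in this sketch. The scores and labels below are made-up illustrative values, not measurements; a label of `True` means the reviewer judged the response hallucinated.

```python
from typing import Sequence, Tuple


def precision_recall(scores: Sequence[float], labels: Sequence[bool],
                     threshold: float) -> Tuple[float, float]:
    """Precision and recall of 'score > threshold' as a hallucination flag."""
    tp = sum(1 for s, l in zip(scores, labels) if s > threshold and l)
    fp = sum(1 for s, l in zip(scores, labels) if s > threshold and not l)
    fn = sum(1 for s, l in zip(scores, labels) if s <= threshold and l)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall


scores = [0.9, 0.8, 0.5, 0.4, 0.2]          # illustrative classifier outputs
labels = [True, True, True, False, False]   # illustrative reviewer labels

general = precision_recall(scores, labels, threshold=0.7)
clinical = precision_recall(scores, labels, threshold=0.3)
```

On this toy data the 0.7 threshold flags fewer responses with no false positives but misses one hallucination, while the 0.3 threshold catches all three at the cost of one false positive sent to the human queue.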
13. Failure Modes
| Failure | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| NLI model mis-scores complex multi-hop reasoning | Medium | High (hallucination passes) | Calibration monitoring; human review audit | Fine-tune NLI model on domain-specific examples |
| Human review queue overwhelm during incident | Medium | High (hallucinations delivered) | Queue depth metric; review time SLO breach | Emergency reviewer surge; temporary threshold raise with risk acceptance |
| Classifier concept drift (stale model) | Medium | Medium (inaccurate rate estimates) | Calibration ECE alert; holdout accuracy monitoring | Trigger emergency retraining |
| Retrieved chunks not passed to detection layer | Low | Critical (grounding verification blind) | Missing field alert in detection service | Fix data contract; audit all RAG pipeline integrations |
| LLM-as-judge inherits generation model errors | Medium | Medium (correlated failure) | Use different model family as judge | Multi-model ensemble for high-stakes decisions |
Cascading Scenarios
- Scenario 1: NLI service latency spike → timeout → all-escalate policy activates → human queue overwhelmed → reviewers approve without reading → hallucinations delivered at scale. Mitigation: reviewer capacity must be sized for 100% escalation for 1 hour; NLI SLO enforced.
- Scenario 2: Classifier becomes stale → hallucination rate underestimated → SLO shows green → management increases AI deployment → actual hallucination rate high but undetected. Mitigation: classifier staleness alert; mandatory retraining schedule.
14. Regulatory Considerations
| Regulation | Clause | Requirement | Hallucination Detection Implementation |
|---|---|---|---|
| EU AI Act | Article 9.2 (Risk Management) | High-risk AI must implement risk management measures including identification and analysis of known risks | Hallucination is a known risk; detection system implements technical control |
| EU AI Act | Article 9.6 (Testing) | High-risk AI systems must be tested to identify appropriate risk management measures | Continuous monitoring and human labeling = ongoing testing |
| EU AI Act | Article 14 (Human Oversight) | High-risk AI systems must allow human oversight; humans must be able to intervene | Human review queue directly implements Article 14 override capability |
| EU AI Act | Article 13 (Transparency) | High-risk AI users must be informed of capabilities and limitations | Hallucination detection enables informed disclosure of accuracy rates |
| APRA CPG 234 | Paragraph 43 (Model Risk) | Material models require validation including ongoing performance monitoring | Hallucination rate KRI is a core model performance metric |
| Privacy Act 1988 (AU) | APP 3 (Collection) | Only collect information reasonably necessary | Human review records of responses must meet necessity test |
| ISO/IEC 42001 | Clause 8.4 (AI System Operation) | Operational procedures for AI systems must include monitoring and intervention | Human review workflow documents operational intervention procedure |
| NIST AI RMF | MANAGE 2.4 | Residual risks of AI systems monitored and managed | Hallucination rate KRI feeds residual risk tracking |
15. Reference Implementations
AWS
- NLI Inference: SageMaker endpoint hosting DeBERTa-v3-large NLI model; auto-scaling
- LLM-as-Judge: Amazon Bedrock (Claude 3.5 Haiku for cost efficiency at scale)
- Human Review Queue: Amazon SQS + custom review UI on React/Next.js; reviewer auth via Cognito
- Detection Logs: Amazon DynamoDB (per-request detection events); Amazon S3 for archive
- Classifier Retraining: SageMaker Training Job triggered by EventBridge weekly schedule
- Dashboards: Amazon QuickSight; CloudWatch custom metrics
- Alerts: CloudWatch Alarms → SNS → PagerDuty
Azure
- NLI Inference: Azure Machine Learning Managed Endpoint; Azure Container Instances for NLI model
- LLM-as-Judge: Azure OpenAI Service (GPT-4o)
- Human Review Queue: Azure Service Bus + Azure Static Web App review UI
- Detection Logs: Azure Cosmos DB; Azure Blob Storage archive
- Classifier Retraining: Azure ML Pipelines scheduled job
- Dashboards: Azure Monitor Workbooks; Power BI
- Alerts: Azure Monitor Alerts → Logic Apps → Teams / PagerDuty
GCP
- NLI Inference: Vertex AI Prediction endpoint with DeBERTa-v3 model
- LLM-as-Judge: Vertex AI Gemini 1.5 Flash (cost-optimised judge)
- Human Review Queue: Cloud Tasks + Cloud Run review UI
- Detection Logs: Firestore (real-time); BigQuery (analytics)
- Classifier Retraining: Vertex AI Pipelines
- Dashboards: Looker; Cloud Monitoring
- Alerts: Cloud Monitoring Alerting → PagerDuty
On-Premises
- NLI Inference: Self-hosted DeBERTa-v3-large on GPU cluster (Triton Inference Server)
- LLM-as-Judge: Self-hosted Llama 3.1 70B with structured output prompting
- Human Review Queue: Apache Kafka + custom React review application
- Detection Logs: PostgreSQL (events); ClickHouse (analytics); MinIO (archive)
- Classifier Retraining: MLflow + custom training script on GPU cluster
- Dashboards: Grafana
- Alerts: Alertmanager → OpsGenie / PagerDuty
16. Related Patterns
| Pattern ID | Pattern Name | Relationship | Notes |
|---|---|---|---|
| EAAPL-OBS001 | AI Telemetry Architecture | Foundation | Log and trace infrastructure required; requestId linkage |
| EAAPL-OBS002 | Prompt Monitoring | Sibling | Prompt-side controls; this pattern covers output-side controls |
| EAAPL-OBS004 | AI Incident Management | Depends On | Hallucination rate spike is a defined incident type in OBS004 |
| EAAPL-OBS005 | Model Drift Detection | Sibling | Confidence calibration degradation detected here feeds drift monitoring |
| EAAPL-OBS008 | AI Performance Benchmarking | Sibling | Offline benchmark hallucination rate complemented by online detection rate |
17. Maturity Assessment
Overall Maturity: Emerging
| Dimension | Score (1–5) | Rationale |
|---|---|---|
| Adoption Breadth | 2 | Fewer than 20% of enterprise AI deployments have systematic runtime detection |
| Tooling Ecosystem | 3 | NLI models mature; hallucination-specific evaluation frameworks (RAGAS, TruLens) maturing rapidly |
| Operational Runbook Coverage | 2 | Runbooks are organisation-specific; no widely adopted standard |
| Regulatory Evidence | 3 | EU AI Act Article 14 creates explicit demand; APRA guidance emerging |
| Cost Predictability | 2 | NLI inference cost at scale still being benchmarked; LLM-as-judge cost is high and variable |
| Team Skill Availability | 2 | NLI fine-tuning and calibration skills are specialised; limited talent pool |
18. Revision History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0.0 | 2026-06-12 | EAAPL Working Group | Initial publication |