EAAPL: Enterprise AI Architecture Pattern Library

EAAPL-OBS003 · Hallucination Detection

📊 Observability & Monitoring · 🏭 Field-tested in AU

Pattern ID: EAAPL-OBS003
Status: Emerging
Complexity: High
Tags: rag, llm, human-oversight, model-risk, high-complexity
Version: 1.0.0
Last Reviewed: 2026-06-12


1. Executive Summary

Large language models fabricate plausible-sounding content with confidence. In enterprise settings, a hallucinated legal citation, an invented financial figure, or a fabricated clinical guideline can trigger regulatory liability, customer harm, and reputational damage. Despite widespread awareness of this risk, fewer than 20% of enterprise AI deployments have systematic runtime detection — most rely on post-hoc user feedback, which captures only a fraction of hallucinations and arrives too late to prevent harm.

This pattern defines a layered, runtime hallucination detection architecture for production AI systems, with emphasis on RAG (retrieval-augmented generation) deployments where grounding verification is tractable. It covers: RAG grounding verification using NLI (Natural Language Inference) models and LLM-as-judge; source citation validation; factual consistency checking against retrieved context; confidence calibration monitoring; automated hallucination rate estimation using a hybrid human-labeled/ML-classifier approach; human-in-the-loop escalation for high-probability hallucinations; and post-delivery feedback collection to close the calibration loop. The outcome is a defensible claim that the organisation has implemented reasonable technical controls to detect and respond to AI hallucinations — a requirement emerging in EU AI Act Article 9 and APRA's AI Risk Management guidance.

Target Audience: CIO, CTO, AI Engineering Lead, Risk Officer, Chief Compliance Officer

Time to Implement: 10–16 weeks (high complexity; NLI model selection and calibration are the long poles)


2. Problem Statement

Business Problem

Organisations are deploying AI systems to answer customer questions, draft legal documents, generate financial summaries, and provide clinical guidance. When these systems hallucinate — and they do, at rates between 3% and 27% depending on the task and model — the business consequences range from customer complaints to regulatory sanctions to legal liability. The absence of runtime hallucination detection means the organisation cannot demonstrate due diligence when harm occurs.

Technical Problem

Hallucination is not a binary property and cannot be detected with a single mechanism. A response can be well-grounded in retrieved context yet still factually incorrect (the context itself was wrong). It can cite real sources that don't say what the response claims. It can include accurate general facts alongside fabricated specifics. Detecting hallucinations requires multiple complementary mechanisms operating at different levels of the output.

Symptoms

  • Customers or staff occasionally report AI outputs that contradict facts or cite non-existent sources
  • No metric exists for hallucination rate; the organisation cannot answer "what percentage of our AI outputs are hallucinated?"
  • After a hallucination incident, the system cannot determine when the hallucination began, how many users were affected, or whether similar outputs were delivered previously
  • AI system used for financial or clinical decisions with no confidence threshold or escalation trigger
  • User feedback (thumbs down) is the only hallucination signal, capturing < 5% of actual occurrences

Cost of Inaction

  • Undetected hallucinations in high-risk domains (legal, financial, clinical) create direct liability
  • Regulatory finding under EU AI Act Article 9 for high-risk AI systems lacking risk management controls
  • APRA model risk management finding for material AI models lacking output validation
  • Reputational damage when hallucinations become public — disproportionate media amplification
  • Customer trust erosion: users who encounter one hallucination permanently reduce AI usage

3. Context

When to Apply

  • RAG systems where retrieved context is available for grounding verification
  • AI systems making factual claims (not creative writing, brainstorming, or opinion generation)
  • High-risk AI applications (legal, financial, clinical, government)
  • Systems where hallucination rate > 1% has business or regulatory consequence
  • Prerequisite: EAAPL-OBS001 telemetry must provide the log stream for detection events

When NOT to Apply

  • Pure creative writing or brainstorming systems (no factual grounding to verify)
  • Systems generating code (different validation strategy: execution testing, not NLI)
  • Very low-stakes applications where hallucination consequence is negligible
  • Real-time conversational systems where < 100ms response budget makes synchronous NLI infeasible (use asynchronous streaming detection)

Prerequisites

| Prerequisite | Required | Notes |
|---|---|---|
| RAG pipeline with retrievable source chunks | Strongly Recommended | Grounding verification requires access to the context the model used |
| NLI model or LLM-as-judge service | Required | Core detection mechanism |
| EAAPL-OBS001 AI Telemetry Infrastructure | Required | Log storage for detection events and calibration data |
| Human review workflow | Required | Escalation target; calibration labeling |
| Feedback collection mechanism | Required | Post-delivery signal for calibration |

Industry Applicability

| Industry | Applicability | Primary Driver |
|---|---|---|
| Financial Services | Critical | ASIC regulatory guidance, APRA model risk, financial advice liability |
| Healthcare | Critical | Clinical safety, TGA obligations for software as medical device |
| Legal Services | Critical | Professional liability, bar association obligations |
| Government | High | Public trust, mandatory accuracy obligations |
| Retail / E-Commerce | Medium | Product information accuracy, consumer law |
| Technology / SaaS | High | Contractual accuracy obligations, platform liability |

4. Architecture Overview

The Hallucination Detection Architecture is a multi-layer system combining real-time output analysis, statistical quality monitoring, and human-in-the-loop review. The architecture acknowledges that no single detection mechanism achieves acceptable precision and recall across all hallucination types; a defence-in-depth approach is required.

Layer 1: RAG Grounding Verification

When a RAG pipeline retrieves source chunks and passes them to the model as context, the generated output can be evaluated for grounding in those chunks. The grounding verifier takes the generated response and the source context as inputs and asks: does the context support the claims made in the response? This is an NLI (Natural Language Inference) task with three labels: entailment (context supports claim), neutral (claim is not addressed by context), and contradiction (context contradicts claim).

Two implementation options are viable:

  • NLI model: a fine-tuned cross-encoder model (e.g., DeBERTa-v3-large fine-tuned on NLI datasets like MultiNLI and ANLI) runs inference on response-chunk pairs. Sentence decomposition splits the response into individual factual claims before NLI evaluation. A response with any claim scoring contradiction with confidence > 0.8 is flagged.
  • LLM-as-judge: a separate LLM call (using a different model from the generation model to avoid correlated errors) evaluates whether the response is grounded in the provided context, returning a structured verdict with confidence score and explanation. LLM-as-judge is more expensive but handles complex reasoning chains better.
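The claim-level flagging rule can be sketched as follows. This is a minimal illustration, not a reference implementation: `split_claims`, `verify_grounding` and `toy_nli` are hypothetical names, the sentence splitter is a naive stand-in for real claim decomposition, and `toy_nli` is a placeholder where a real deployment would call an NLI model such as the DeBERTa cross-encoder mentioned above.

```python
import re
from typing import Callable, Dict, List

# Assumption: an NLI scorer maps (premise, hypothesis) to label probabilities.
NliFn = Callable[[str, str], Dict[str, float]]


def split_claims(response: str) -> List[str]:
    """Naive sentence split standing in for real claim decomposition."""
    parts = re.split(r"(?<=[.!?])\s+", response.strip())
    return [p for p in parts if p]


def verify_grounding(response: str, chunks: List[str], nli: NliFn,
                     contradiction_threshold: float = 0.8) -> dict:
    """Score every claim against every chunk; flag on any strong contradiction."""
    results = []
    for claim in split_claims(response):
        scores = [nli(chunk, claim) for chunk in chunks]
        results.append({
            "claim": claim,
            # Best supporting chunk and strongest contradicting chunk.
            "entailment": max((s["entailment"] for s in scores), default=0.0),
            "contradiction": max((s["contradiction"] for s in scores), default=0.0),
        })
    flagged = any(r["contradiction"] > contradiction_threshold for r in results)
    return {"claims": results, "flagged": flagged}


def toy_nli(premise: str, hypothesis: str) -> Dict[str, float]:
    """Toy placeholder for a real NLI model (illustration only)."""
    p, h = premise.lower(), hypothesis.lower().rstrip(".")
    if h in p:
        return {"entailment": 0.95, "neutral": 0.04, "contradiction": 0.01}
    p_nums, h_nums = set(re.findall(r"\d+", p)), set(re.findall(r"\d+", h))
    if h_nums and p_nums and not h_nums <= p_nums:
        return {"entailment": 0.05, "neutral": 0.05, "contradiction": 0.90}
    return {"entailment": 0.10, "neutral": 0.85, "contradiction": 0.05}


chunks = ["The refund window is 30 days from delivery."]
response = "The refund window is 30 days from delivery. Refunds take 45 days."
result = verify_grounding(response, chunks, toy_nli)
print(result["flagged"])
```

Taking the maximum contradiction score across chunks is one possible aggregation; a production verifier might instead weight chunks by retrieval rank.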

Layer 2: Source Citation Validation

When the model's response includes explicit source citations (document titles, URLs, or retrieved chunk IDs), a citation validator checks: (a) does the cited source exist in the retrieval corpus? (b) does the cited source contain content that supports the claim attributed to it? This catches a specific and common hallucination pattern: the model fabricates plausible-sounding source titles that don't exist, or correctly cites a real source but misrepresents what it says.
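Checks (a) and (b) can be sketched as a two-step lookup. The citation shape (`source_id`, `claim`), the in-memory `corpus` dictionary and the `keyword_support` heuristic are all assumptions for illustration; in practice the support check would more likely reuse the NLI grounding verifier from Layer 1.

```python
import re
from typing import Callable, Dict, List


def keyword_support(source_text: str, claim: str) -> bool:
    """Toy support check: every significant claim token appears in the source."""
    tokens = re.findall(r"[a-z0-9]+", claim.lower())
    significant = {t for t in tokens if len(t) > 3 or t.isdigit()}
    text = source_text.lower()
    return bool(significant) and all(t in text for t in significant)


def validate_citations(citations: List[dict], corpus: Dict[str, str],
                       supports: Callable[[str, str], bool]) -> List[dict]:
    """Label each citation as valid, missing_source, or misrepresented."""
    findings = []
    for citation in citations:
        text = corpus.get(citation["source_id"])
        if text is None:
            status = "missing_source"      # check (a): source does not exist
        elif not supports(text, citation["claim"]):
            status = "misrepresented"      # check (b): source does not say this
        else:
            status = "valid"
        findings.append({**citation, "status": status})
    return findings


corpus = {"doc-1": "Leave policy: employees accrue 20 days of annual leave."}
citations = [
    {"source_id": "doc-1", "claim": "employees accrue 20 days"},
    {"source_id": "doc-1", "claim": "employees accrue 30 days of sick leave"},
    {"source_id": "doc-9", "claim": "contractors accrue leave"},
]
findings = validate_citations(citations, corpus, keyword_support)
print([f["status"] for f in findings])
```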

Layer 3: Factual Consistency Checking

For responses containing numerical claims, dates, names, or specific facts, an extraction and verification step checks these specific claims against retrieved context. Named entity recognition identifies claims of type NUMBER, DATE, PERSON, ORG, and STATISTIC in the response. Each extracted claim is then checked against the retrieved context using exact or fuzzy matching. Mismatches (e.g., response says "$2.3 billion", context says "$2.3 million") are flagged as high-confidence hallucinations.
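The million/billion mismatch above can be caught by normalising amounts before comparison. The sketch below uses a regular expression as a simplified stand-in for the NER step and handles numeric claims only; `extract_amounts` and `find_numeric_mismatches` are hypothetical helper names.

```python
import re

_SCALE = {"thousand": 1e3, "million": 1e6, "billion": 1e9, "trillion": 1e12}
_AMOUNT = re.compile(
    r"\$?(\d+(?:\.\d+)?)\s*(thousand|million|billion|trillion)?",
    re.IGNORECASE,
)


def extract_amounts(text: str) -> set:
    """Return every number in the text, scaled by any magnitude word after it."""
    amounts = set()
    for number, scale in _AMOUNT.findall(text):
        amounts.add(float(number) * _SCALE.get(scale.lower(), 1.0))
    return amounts


def find_numeric_mismatches(response: str, context: str) -> list:
    """Numbers asserted in the response that never appear in the context."""
    context_amounts = extract_amounts(context)
    return sorted(v for v in extract_amounts(response) if v not in context_amounts)


context = "Revenue reached $2.3 million in 2023."
response = "Revenue was $2.3 billion in 2023."
mismatches = find_numeric_mismatches(response, context)
print(mismatches)
```

Exact matching after normalisation is deliberately strict. A fuzzy variant would need a tolerance policy, for example to accept rounding, and that policy is a domain decision.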

Layer 4: Confidence Calibration Monitoring

Model APIs often return logprob-based confidence scores or can be prompted for self-assessed confidence. The confidence calibration monitor tracks the relationship between these scores and actual accuracy (as measured by human labels and NLI grounding scores). A well-calibrated model is correct about 90% of the time on the responses to which it assigns confidence 0.9. Calibration is measured using Expected Calibration Error (ECE) and visualised with calibration curves. Degrading calibration, where the model becomes confidently wrong more often, is a signal of model drift (see EAAPL-OBS005).
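ECE is commonly computed by bucketing responses by confidence and taking the sample-weighted average gap between mean confidence and observed accuracy in each bucket. The sketch below assumes equal-width bins and a list of per-response correctness labels; the function name and the choice of ten bins are illustrative.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean of |mean confidence - accuracy| over equal-width bins."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need equally sized, non-empty inputs")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin rather than overflowing.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_conf - accuracy)
    return ece


# Ten responses at confidence 0.9: nine correct is well calibrated, five is not.
well_calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False])
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)
print(round(well_calibrated, 3), round(overconfident, 3))
```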

Layer 5: Hallucination Rate Metric

A statistically valid hallucination rate requires human-labeled ground truth. A sampling pipeline draws 1% of all responses for human review. Reviewers label each response as: grounded, partially grounded, or hallucinated. This labeled sample trains and continuously retrains an ML classifier that can then estimate hallucination probability on the full unlabeled output stream. The resulting hallucination rate metric (estimated percentage of outputs with hallucination) is reported on dashboards and used for SLO tracking.
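Because the rate is estimated from a sample, it is worth reporting an interval alongside the point estimate. The sketch below uses the Wilson score interval, which is one standard choice and not something this pattern prescribes. Counting only "hallucinated" labels and ignoring "partially grounded" ones is an assumption here.

```python
import math


def wilson_interval(hallucinated: int, sampled: int, z: float = 1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if sampled <= 0 or not 0 <= hallucinated <= sampled:
        raise ValueError("need 0 <= hallucinated <= sampled and sampled > 0")
    p = hallucinated / sampled
    denom = 1 + z * z / sampled
    centre = (p + z * z / (2 * sampled)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / sampled + z * z / (4 * sampled * sampled)
    ) / denom
    return centre - half_width, centre + half_width


# Example: 50 of 1,000 sampled responses were labeled hallucinated.
low, high = wilson_interval(50, 1000)
print(f"{low:.3f} to {high:.3f}")
```

With 50 hallucinated responses in a sample of 1,000, the interval runs from roughly 3.8% to 6.5%. A point estimate of 5% from a sample this size therefore cannot reliably distinguish a 4% system from a 6% one.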

Human-in-the-Loop Escalation

When the grounding verifier or factual consistency checker assigns a hallucination probability above a configurable threshold (default: 0.7), the response is flagged for human review before delivery. The escalation workflow routes the flagged response to a human reviewer queue. The reviewer can approve delivery as-is, approve a modified version, or reject delivery. For synchronous user-facing applications, this creates a user-perceptible delay, which is acceptable only when the risk of delivering an unverified response exceeds the cost of latency. For asynchronous applications (document generation, batch processing), escalation is always preferred over delivery of flagged responses.


5. Architecture Diagram

flowchart TD
    subgraph Detection["Detection Layers"]
        A[RAG Response + Context]
        B[Grounding Verifier]
        C[Factual Consistency Checker]
    end
    subgraph Routing["Routing Decision"]
        D{Hallucination Probability}
        E[Human Review Queue]
    end
    subgraph Calibration["Feedback Loop"]
        F[1% Sampling Pipeline]
        G[Human Labeling]
        H[Classifier Retraining]
    end
    A --> B
    B --> C
    C --> D
    D -->|prob above 0.7| E
    D -->|prob below 0.3| I[Deliver to User]
    E --> I
    I --> F
    F --> G
    G --> H
    H --> D
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#f0fdf4,stroke:#22c55e
    style C fill:#f0fdf4,stroke:#22c55e
    style D fill:#f3e8ff,stroke:#a855f7
    style E fill:#fee2e2,stroke:#ef4444
    style F fill:#fef9c3,stroke:#eab308
    style G fill:#fef9c3,stroke:#eab308
    style H fill:#f0fdf4,stroke:#22c55e
    style I fill:#d1fae5,stroke:#10b981

6. Components

| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| Grounding Verifier | ML Inference Service | NLI-based grounding check of response against retrieved chunks | DeBERTa-v3-large NLI fine-tuned; LLM-as-judge (GPT-4o, Claude 3.5); sentence-transformers | Critical |
| Citation Validator | Service | Verify existence and accuracy of cited sources | Custom service querying vector index + retrieval corpus | High |
| Factual Consistency Checker | Service | Extract and verify numerical, date, named entity claims | spaCy NER + fuzzy matching against retrieved context; custom rules | High |
| Hallucination Classifier | ML Model | Estimate hallucination probability on unlabeled outputs | Trained on human-labeled sample; DistilBERT or similar; updated weekly | High |
| Human Review Queue | Workflow | Route high-risk responses to human reviewers; track decisions | Jira, ServiceNow, custom review UI; priority-sorted queue | Critical |
| Confidence Calibration Monitor | Batch Job | Track calibration curves; compute ECE; alert on degradation | Python (sklearn calibration_curve); scheduled analysis | Medium |
| Sampling Pipeline | Stream Processor | Draw 1% sample for human labeling; stratify by model, template, confidence | Kafka consumer with sampling; or scheduled SQL query | High |
| Human Labeling Interface | UI | Present response + context to reviewers; capture grounded/partial/hallucinated label | Custom React app; Label Studio; Prodigy | High |
| Hallucination Rate Dashboard | UI | Real-time and trend hallucination rate by model, template, use case | Grafana; Datadog; custom dashboard | Medium |
| Alert Manager | Integration | Alert on hallucination rate SLO breach; escalation routing | Alertmanager, PagerDuty; severity-based routing | High |

7. Data Flow

Primary Flow

| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | RAG Pipeline | Retrieves context chunks; generates response; passes both to detection layer | Response text + retrieved_chunks[] + requestId |
| 2 | Grounding Verifier | Decomposes response into claims; runs NLI on each claim against each chunk | Grounding score per claim; overall grounding confidence; contradiction flags |
| 3 | Citation Validator | Extracts any explicit citations from response; verifies existence and content alignment | Citation validity score; flagged invalid or misrepresented citations |
| 4 | Factual Consistency Checker | Extracts entities (numbers, dates, names); checks against chunks | Consistency score; flagged mismatched facts |
| 5 | Hallucination Classifier | Combines grounding, citation, consistency signals; outputs hallucination probability | Hallucination probability score (0.0–1.0) |
| 6 | Routing Decision | Threshold logic: > 0.7 → human queue; 0.3–0.7 → flag + deliver; < 0.3 → deliver | Routing decision + detection log record |
| 7 | Human Reviewer (if escalated) | Receives response + context + detection flags; reviews; approves/rejects/modifies | Reviewer decision + correction if modified |
| 8 | Telemetry Logger | Records full detection event with requestId, scores, routing decision, reviewer outcome | Immutable detection event record |
| 9 | Calibration Pipeline | Samples 1% of all decisions for human labeling; uses labels to retrain classifier | Updated classifier weights; calibration metrics |
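The threshold logic in step 6 can be sketched as a small pure function. The `Route` enum and parameter names are hypothetical. The defaults mirror the 0.7 and 0.3 values in the table, and a stricter domain can lower `escalate_above`.

```python
from enum import Enum


class Route(Enum):
    DELIVER = "deliver"
    FLAG_AND_DELIVER = "flag_and_deliver"
    HUMAN_REVIEW = "human_review"


def route(probability: float, escalate_above: float = 0.7,
          deliver_below: float = 0.3) -> Route:
    """Map a hallucination probability to one of the three routing outcomes."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be within [0, 1]")
    if deliver_below > escalate_above:
        raise ValueError("deliver_below must not exceed escalate_above")
    if probability > escalate_above:
        return Route.HUMAN_REVIEW
    if probability < deliver_below:
        return Route.DELIVER
    # Middle band, boundaries included: deliver but mark the response.
    return Route.FLAG_AND_DELIVER


print(route(0.85), route(0.5), route(0.1))
# A stricter domain threshold: escalate anything above 0.3.
print(route(0.35, escalate_above=0.3))
```

Keeping the routing rule free of I/O makes threshold changes easy to unit test and to record alongside the risk sign-off.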

Error Flow

| Error Scenario | Detection | Action | Recovery |
|---|---|---|---|
| NLI model service unavailable | Health check failure; grounding API timeout | Escalate all responses to human queue; alert P1 | Restore NLI service; process queue backlog |
| Grounding verifier latency spike (> 500ms) | SLO breach alert | Increase timeout threshold; alert; switch to LLM-as-judge fallback | Investigate; scale NLI service |
| Human review queue overwhelmed (> 100 pending) | Queue depth metric alert | Alert to AI engineering; consider temporary threshold relaxation with risk acceptance | Increase reviewer capacity; temporary threshold adjustment requires risk sign-off |
| Hallucination classifier stale (last retrain > 30 days) | Retraining job failure; staleness alert | Alert to ML platform; run emergency retraining | Investigate labeling pipeline; manual retraining trigger |
| Retrieved chunks not available in detection layer | Missing field in request to detection service | Skip grounding verification; apply higher-risk flag; escalate to human review | Fix RAG pipeline to pass chunks to detection layer |
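Two of these rows describe a fail-closed policy: when the verifier cannot run, the response goes to human review instead of being delivered unchecked. A minimal sketch follows, assuming the verifier is a callable that returns a probability and signals outages with `TimeoutError` or `ConnectionError`; the function and field names are hypothetical.

```python
from typing import Callable, List


def score_or_escalate(response: str, chunks: List[str],
                      verifier: Callable[[str, List[str]], float]) -> dict:
    """Return a score when possible; otherwise fail closed to human review."""
    if not chunks:
        # Grounding cannot be verified without the retrieved context.
        return {"route": "human_review", "reason": "missing_chunks"}
    try:
        probability = verifier(response, chunks)
    except (TimeoutError, ConnectionError) as exc:
        # "All-escalate" policy while the NLI service is unavailable.
        return {"route": "human_review",
                "reason": f"verifier_unavailable:{type(exc).__name__}"}
    return {"route": "scored", "probability": probability}


def failing_verifier(response: str, chunks: List[str]) -> float:
    raise TimeoutError("NLI service timed out")


print(score_or_escalate("Some answer.", ["Some context."], failing_verifier))
```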

8. Security Considerations

Authentication: Detection services authenticate to the main AI pipeline via service-to-service tokens. The human review queue authenticates reviewers via SSO with MFA. Access to hallucination detection logs requires separate authorisation from general telemetry logs.

Authorisation: Hallucination detection results are sensitive — they reveal AI system quality levels that may be commercially sensitive or relevant to litigation. Access restricted to AI engineering, risk management, and legal. Dashboard access for quality monitoring is broader.

Secrets Management: LLM-as-judge service API keys stored in secrets manager with separate rotation from primary generation keys. NLI model served internally requires no external API key.

Data Classification: Response text stored in detection logs is classified at least as Confidential. Detection events for high-risk domains (clinical, legal, financial) classified as Restricted. Reviewer comments on rejected responses classified as Privileged if written by legal reviewers.

Encryption: All detection data encrypted in transit (TLS 1.3) and at rest (AES-256). Human review queue messages encrypted end-to-end if containing sensitive response content.

Auditability: Every hallucination detection event is immutable. Reviewer decisions are immutable. The chain from requestId to detection event to reviewer decision is traceable for legal discovery.

OWASP LLM Top 10 Coverage

| OWASP LLM Risk | Hallucination Detection Control | Implementation |
|---|---|---|
| LLM01 Prompt Injection | Injected instructions may produce hallucinations; detection surfaces them | Hallucinations triggered by injection flagged for security review |
| LLM02 Insecure Output Handling | Hallucinated outputs may contain injections or unsafe content | Output validation in detection layer; unsafe content flagged before delivery |
| LLM03 Training Data Poisoning | Poisoned training may systematically shift hallucination patterns | Calibration monitoring detects systematic accuracy degradation |
| LLM04 Model Denial of Service | Detection layer adds compute; may be targeted for overload | Detection service has independent rate limiting and circuit breaker |
| LLM05 Supply Chain Vulnerabilities | NLI model or LLM-as-judge may be compromised | Model integrity verification on deployment; vendor security review |
| LLM06 Sensitive Information Disclosure | Hallucinated content may fabricate sensitive details | Factual consistency check may detect fabricated sensitive info patterns |
| LLM07 Insecure Plugin Design | Tools may return data that models hallucinate over | Tool output included in retrieved context for grounding verification |
| LLM08 Excessive Agency | Hallucinated instructions in agentic chains can cause cascading errors | Agent output grounding check before passing to next agent step |
| LLM09 Overreliance | Hallucination detection directly addresses the overreliance risk | Confidence calibration monitoring; low-confidence responses flagged to users |
| LLM10 Model Theft | Out of scope for this pattern | Covered by EAAPL-OBS002 (Prompt Monitoring) |

9. Governance Considerations

Responsible AI: Hallucination detection is a primary control in the AI harm prevention framework. The existence of systematic detection, human escalation, and correction workflows constitutes a defensible reasonable-care standard for AI-influenced decisions.

Model Risk Management: The hallucination rate metric is a Key Risk Indicator (KRI) for any material AI model. Persistent hallucination rate > 5% should trigger model risk review and may require model replacement or use-case restriction.

Human Approval: All responses with hallucination probability > 0.7 require human review before delivery. This threshold is reviewed quarterly. Domain-specific thresholds (e.g., clinical AI: > 0.3) may be stricter. Any threshold change requires risk officer sign-off.

Policy: Hallucination detection results must be retained as model performance records. Reviewer corrections and rejections are permanent model performance records. Hallucination rate KRI breaches must be reported to the AI risk committee within 2 business days.

Traceability: Every AI-influenced decision can be linked to its hallucination detection outcome, enabling post-hoc review of whether a specific harmful output was reviewed before delivery, and what controls were active.

Governance Artefacts

| Artefact | Owner | Frequency | Format |
|---|---|---|---|
| Hallucination Rate KRI Report | AI Risk Officer | Weekly | Dashboard export + executive summary |
| Human Review Queue Statistics | AI Engineering | Daily | Automated report: volume, resolution time, rejection rate |
| Calibration Curve Report | ML Platform | Monthly | Calibration plot + ECE metric trend |
| Hallucination Classifier Audit | AI Governance | Quarterly | Classifier performance on holdout set + bias analysis |
| Escalation Threshold Review | AI Risk + Legal | Quarterly | Signed review document |
| High-Risk Hallucination Incident Log | Risk Management | Per incident | Incident record with timeline and remediation |

10. Operational Considerations

Monitoring: The NLI model inference service is a critical dependency. Its latency (p99 must be < 200ms for synchronous path), availability (> 99.5%), and model version are all monitored. Hallucination rate trend alerts fire when the 24-hour rolling rate increases > 2 percentage points above the 7-day baseline.
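The trend alert compares two rolling windows. The sketch below assumes hourly buckets of `(flagged, total)` counts ordered oldest to newest, and a 7-day baseline that includes the most recent 24 hours. Both are illustrative choices, as are the function names.

```python
def rolling_rate(buckets):
    """Hallucination rate over a list of (flagged, total) hourly buckets."""
    flagged = sum(f for f, _ in buckets)
    total = sum(t for _, t in buckets)
    return flagged / total if total else 0.0


def trend_alert(hourly, max_increase_pp=2.0):
    """Fire when the 24-hour rate exceeds the 7-day baseline by more than N points."""
    recent = rolling_rate(hourly[-24:])
    baseline = rolling_rate(hourly[-168:])
    fired = (recent - baseline) * 100 > max_increase_pp
    return fired, recent, baseline


# Six steady days at 4%, then one day at 8%.
spiking = [(4, 100)] * 144 + [(8, 100)] * 24
steady = [(4, 100)] * 168
print(trend_alert(spiking)[0], trend_alert(steady)[0])
```

Excluding the latest 24 hours from the baseline would make the alert more sensitive, at the cost of more noise after low-traffic days.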

Logging: Detection events are stored in a dedicated detection log store, separate from general AI telemetry, with stricter access controls. Reviewer decisions are stored in an immutable audit log.

Incident Response: Hallucination rate spike (> 10% in 1 hour) is a P1 incident triggering immediate review of recent model or prompt changes. All in-flight responses are escalated to human queue until rate returns to baseline.

Disaster Recovery: Human review queue must survive infrastructure failures. Queue is backed by durable message store (Kafka, SQS). NLI service failure activates "all-escalate" policy — all responses routed to human queue. This is sustainable for up to 4 hours before queue overwhelm.

Capacity Planning: Human reviewer capacity is the binding constraint. At a 1% sampling rate, 100K daily requests generate 1,000 items for human labeling per day; at a 5% hallucination rate, roughly 50 of those will be labeled as hallucinated. At the 0.7 escalation threshold, approximately 1–3% of requests (1,000–3,000 per day at the same volume) require reviewer intervention. Reviewer capacity must be planned accordingly.
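The reviewer-load arithmetic can be made explicit. A 1% sample of 100K requests is 1,000 items, of which about 50 would be hallucinated at a 5% rate. The sketch below assumes a 2% escalation rate (the midpoint of the 1–3% range) and the same 5 minutes per review for labeling and escalations; both figures are placeholders to replace with measured values.

```python
def reviewer_load(daily_requests: int, sampling_rate: float = 0.01,
                  escalation_rate: float = 0.02,
                  minutes_per_review: float = 5.0) -> dict:
    """Estimate daily review items and reviewer-hours under stated assumptions."""
    labeling_items = daily_requests * sampling_rate
    escalated_items = daily_requests * escalation_rate
    reviewer_hours = (labeling_items + escalated_items) * minutes_per_review / 60
    return {
        "labeling_items": labeling_items,
        "escalated_items": escalated_items,
        "reviewer_hours": reviewer_hours,
    }


load = reviewer_load(100_000)
print(load)
```

Under these assumptions the escalation queue, not the calibration sample, accounts for most of the reviewer time.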

SLO Table

| SLO | Target | Measurement | Alert Threshold |
|---|---|---|---|
| Hallucination rate (estimated) | < 5% for general use cases; < 1% for high-risk | Classifier estimate on full output stream | Sustained breach for 30 minutes |
| Grounding verifier latency | < 200ms p99 | NLI inference latency histogram | > 500ms for 5 minutes |
| Human review resolution time | < 4 hours p90 for P1 escalations | Queue item creation to resolution timestamp | > 8 hours for P1 |
| Calibration ECE | < 0.10 (10% calibration error) | Monthly calibration report | ECE > 0.15 triggers model review |

Disaster Recovery Table

| Component | RTO | RPO | Recovery Approach |
|---|---|---|---|
| NLI Inference Service | 10 minutes | N/A (stateless) | Auto-scale; all-escalate policy during outage |
| Human Review Queue | 5 minutes | Near-zero | Durable message store (Kafka / SQS) with replication |
| Detection Log Store | 30 minutes | 1 hour | Replicated storage; write-ahead log |
| Hallucination Classifier | 60 minutes | Last checkpoint | Load previous checkpoint; retrain when pipeline recovers |

11. Cost Considerations

Cost Drivers

| Driver | Description | Relative Cost |
|---|---|---|
| NLI model inference (synchronous) | Compute-intensive; scales with response length and chunk count | High |
| LLM-as-judge (if used) | Additional LLM call per response evaluation; can double inference cost | Very High |
| Human reviewer time | 1% sample = 1,000 reviews/day per 100K requests; at 5 minutes/review, about 83 reviewer-hours/day | High (human labour) |
| Detection log storage | Full response text + context + detection signals; large records | Medium |
| Classifier retraining compute | Weekly retraining job; scales with labeled data volume | Low |

Scaling Risks: At very high volumes (> 1M requests/day), synchronous NLI inference may become cost-prohibitive. At scale, asynchronous detection with post-delivery flagging and correction workflow is the cost-sustainable approach. The trade-off is accepting some harmful delivery before human review catches it — only acceptable with a low-risk AI use case.

Optimisations:

  • Self-hosted NLI model (DeBERTa) vs. LLM-as-judge reduces cost by 50–90x
  • Selective activation: apply full grounding verification only to responses with confidence < 0.9 or containing factual claim signals (numbers, dates, named entities)
  • Cache NLI results for identical context-response pairs (rare but possible in templated outputs)
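The selective-activation optimisation amounts to a cheap gate in front of the expensive check. The sketch below treats digits, currency or percent signs, and month names as the "factual claim signals". It is a regex approximation that does not cover named entities, which would need the NER step; the function name and the 0.9 cutoff parameter are illustrative.

```python
import re

# Rough proxy for factual-claim signals: digits, currency/percent, month names.
_FACT_SIGNAL = re.compile(
    r"\d|[$€£%]|\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\b"
)


def needs_full_check(response: str, confidence: float,
                     confidence_cutoff: float = 0.9) -> bool:
    """Run full grounding verification for low confidence or factual signals."""
    return confidence < confidence_cutoff or bool(_FACT_SIGNAL.search(response))


print(needs_full_check("Our office opens at 9am.", 0.95))
print(needs_full_check("Happy to help with that.", 0.95))
print(needs_full_check("Happy to help with that.", 0.60))
```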

Indicative Cost Range

| Scale | AI Requests/Day | Estimated Hallucination Detection Cost/Month |
|---|---|---|
| Small | 10,000 | $500–$1,500 (human review dominates) |
| Medium | 500,000 | $3,000–$8,000 |
| Large | 5,000,000 | $15,000–$40,000 (NLI compute dominates) |
| Enterprise | 50,000,000+ | $50,000–$200,000 (requires asynchronous architecture) |

12. Trade-Off Analysis

Approach Comparison

| Approach | Pros | Cons | Best For |
|---|---|---|---|
| Synchronous NLI grounding verification | Blocks hallucinated responses before delivery; defensible due diligence | Adds 50–200ms latency; NLI service becomes critical dependency | High-risk AI (clinical, legal, financial) |
| Asynchronous post-delivery detection + correction | No latency impact; scalable at high volume | Hallucinations delivered before detected; correction requires recall/notification workflow | Lower-risk use cases; very high volume systems |
| LLM-as-judge | High accuracy on complex reasoning; handles nuanced grounding; configurable criteria | 10–100x cost of NLI model; correlated errors with generation model possible | Premium use cases with high accuracy requirement and budget |
| Human review only (no automated detection) | Highest accuracy; no false positives | Scales poorly; expensive; slow; samples < 1% of actual volume | Ultra-high-risk decisions where automation is not trusted |

Architectural Tensions

| Tension | Description | Resolution |
|---|---|---|
| Latency vs. Safety | Synchronous detection adds p99 latency to every response | Use async detection for lower-risk paths; sync for high-risk only; optimise NLI serving |
| Precision vs. Recall | High precision (few false positives) means some hallucinations pass; high recall floods human queue | Tune threshold by domain: 0.3 for clinical, 0.7 for general; monitor precision-recall trade-off quarterly |
| Cost vs. Coverage | Full NLI check on every response is expensive | Apply risk-tiered coverage: full check for high-risk domains, classifier-only for low-risk |
| Automation vs. Human Trust | Organisations may distrust automated detection; always want human review | Establish calibration data showing automated detection accuracy; earn trust with evidence |
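The precision-recall tension can be quantified on the human-labeled sample by sweeping the escalation threshold. The sketch below assumes classifier scores paired with boolean "hallucinated" labels and uses made-up numbers purely to show the mechanics.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when responses scoring above threshold are escalated."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s > threshold and y)
    fp = sum(1 for s, y in pairs if s > threshold and not y)
    fn = sum(1 for s, y in pairs if s <= threshold and y)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Illustrative data only: classifier scores and human labels (True = hallucinated).
scores = [0.9, 0.8, 0.5, 0.4, 0.2, 0.1]
labels = [True, True, True, False, False, False]
for threshold in (0.7, 0.3):
    print(threshold, precision_recall(scores, labels, threshold))
```

On this toy data the 0.7 threshold escalates only true hallucinations but misses one of the three. The 0.3 threshold catches all three and adds one false escalation to the queue.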

13. Failure Modes

| Failure | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| NLI model mis-scores complex multi-hop reasoning | Medium | High (hallucination passes) | Calibration monitoring; human review audit | Fine-tune NLI model on domain-specific examples |
| Human review queue overwhelm during incident | Medium | High (hallucinations delivered) | Queue depth metric; review time SLO breach | Emergency reviewer surge; temporary threshold raise with risk acceptance |
| Classifier concept drift (stale model) | Medium | Medium (inaccurate rate estimates) | Calibration ECE alert; holdout accuracy monitoring | Trigger emergency retraining |
| Retrieved chunks not passed to detection layer | Low | Critical (grounding verification blind) | Missing field alert in detection service | Fix data contract; audit all RAG pipeline integrations |
| LLM-as-judge inherits generation model errors | Medium | Medium (correlated failure) | Use different model family as judge | Multi-model ensemble for high-stakes decisions |

Cascading Scenarios

  • Scenario 1: NLI service latency spike → timeout → all-escalate policy activates → human queue overwhelmed → reviewers approve without reading → hallucinations delivered at scale. Mitigation: reviewer capacity must be sized for 100% escalation for 1 hour; NLI SLO enforced.
  • Scenario 2: Classifier becomes stale → hallucination rate underestimated → SLO shows green → management increases AI deployment → actual hallucination rate high but undetected. Mitigation: classifier staleness alert; mandatory retraining schedule.

14. Regulatory Considerations

| Regulation | Clause | Requirement | Hallucination Detection Implementation |
|---|---|---|---|
| EU AI Act | Article 9.2 (Risk Management) | High-risk AI must implement risk management measures including identification and analysis of known risks | Hallucination is a known risk; detection system implements technical control |
| EU AI Act | Article 9.5 (Testing) | High-risk AI systems must be tested to identify appropriate risk management measures | Continuous monitoring and human labeling = ongoing testing |
| EU AI Act | Article 14 (Human Oversight) | High-risk AI systems must allow human oversight; humans must be able to intervene | Human review queue directly implements Article 14 override capability |
| EU AI Act | Article 13 (Transparency) | High-risk AI users must be informed of capabilities and limitations | Hallucination detection enables informed disclosure of accuracy rates |
| APRA CPG 234 | Paragraph 43 (Model Risk) | Material models require validation including ongoing performance monitoring | Hallucination rate KRI is a core model performance metric |
| Privacy Act 1988 (AU) | APP 3 (Collection) | Only collect information reasonably necessary | Human review records of responses must meet necessity test |
| ISO/IEC 42001 | Clause 8.4 (AI System Operation) | Operational procedures for AI systems must include monitoring and intervention | Human review workflow documents operational intervention procedure |
| NIST AI RMF | MANAGE 2.4 | Residual risks of AI systems monitored and managed | Hallucination rate KRI feeds residual risk tracking |

15. Reference Implementations

AWS

  • NLI Inference: SageMaker endpoint hosting DeBERTa-v3-large NLI model; auto-scaling
  • LLM-as-Judge: Amazon Bedrock (Claude 3.5 Haiku for cost efficiency at scale)
  • Human Review Queue: Amazon SQS + custom review UI on React/Next.js; reviewer auth via Cognito
  • Detection Logs: Amazon DynamoDB (per-request detection events); Amazon S3 for archive
  • Classifier Retraining: SageMaker Training Job triggered by EventBridge weekly schedule
  • Dashboards: Amazon QuickSight; CloudWatch custom metrics
  • Alerts: CloudWatch Alarms → SNS → PagerDuty

Azure

  • NLI Inference: Azure Machine Learning Managed Endpoint; Azure Container Instances for NLI model
  • LLM-as-Judge: Azure OpenAI Service (GPT-4o)
  • Human Review Queue: Azure Service Bus + Azure Static Web App review UI
  • Detection Logs: Azure Cosmos DB; Azure Blob Storage archive
  • Classifier Retraining: Azure ML Pipelines scheduled job
  • Dashboards: Azure Monitor Workbooks; Power BI
  • Alerts: Azure Monitor Alerts → Logic Apps → Teams / PagerDuty

GCP

  • NLI Inference: Vertex AI Prediction endpoint with DeBERTa-v3 model
  • LLM-as-Judge: Vertex AI Gemini 1.5 Flash (cost-optimised judge)
  • Human Review Queue: Cloud Tasks + Cloud Run review UI
  • Detection Logs: Firestore (real-time); BigQuery (analytics)
  • Classifier Retraining: Vertex AI Pipelines
  • Dashboards: Looker; Cloud Monitoring
  • Alerts: Cloud Monitoring Alerting → PagerDuty

On-Premises

  • NLI Inference: Self-hosted DeBERTa-v3-large on GPU cluster (Triton Inference Server)
  • LLM-as-Judge: Self-hosted Llama 3.1 70B with structured output prompting
  • Human Review Queue: Apache Kafka + custom React review application
  • Detection Logs: PostgreSQL (events); ClickHouse (analytics); MinIO (archive)
  • Classifier Retraining: MLflow + custom training script on GPU cluster
  • Dashboards: Grafana
  • Alerts: Alertmanager → OpsGenie / PagerDuty

16. Related Patterns

| Pattern ID | Pattern Name | Relationship | Notes |
|---|---|---|---|
| EAAPL-OBS001 | AI Telemetry Architecture | Foundation | Log and trace infrastructure required; requestId linkage |
| EAAPL-OBS002 | Prompt Monitoring | Sibling | Prompt-side controls; this pattern covers output-side controls |
| EAAPL-OBS004 | AI Incident Management | Depends On | Hallucination rate spike is a defined incident type in OBS004 |
| EAAPL-OBS005 | Model Drift Detection | Sibling | Confidence calibration degradation detected here feeds drift monitoring |
| EAAPL-OBS008 | AI Performance Benchmarking | Sibling | Offline benchmark hallucination rate complemented by online detection rate |

17. Maturity Assessment

Overall Maturity: Emerging

| Dimension | Score (1–5) | Rationale |
|---|---|---|
| Adoption Breadth | 2 | Fewer than 20% of enterprise AI deployments have systematic runtime detection |
| Tooling Ecosystem | 3 | NLI models mature; hallucination-specific evaluation frameworks (RAGAS, TruLens) maturing rapidly |
| Operational Runbook Coverage | 2 | Runbooks are organisation-specific; no widely adopted standard |
| Regulatory Evidence | 3 | EU AI Act Article 14 creates explicit demand; APRA guidance emerging |
| Cost Predictability | 2 | NLI inference cost at scale still being benchmarked; LLM-as-judge cost is high and variable |
| Team Skill Availability | 2 | NLI fine-tuning and calibration skills are specialised; limited talent pool |

18. Revision History

| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0.0 | 2026-06-12 | EAAPL Working Group | Initial publication |