EAAPL: Enterprise AI Architecture Pattern Library

LLM Evaluation Pipeline


[EAAPL-OBS006] LLM Evaluation Pipeline

Category: Observability & Monitoring
Sub-category: LLM Quality
Version: 1.0
Maturity: Proven
Tags: llm-evaluation, hallucination-detection, faithfulness, judge-llm, ci-cd-gate, quality-metrics, ragas, continuous-evaluation
Regulatory Relevance: EU AI Act Article 9 & 12, APRA CPS 230, ISO/IEC 42001 Clause 9.1, NIST AI RMF MEASURE 2.5


1. Executive Summary

Deploying an LLM update to production without a structured evaluation gate is equivalent to deploying software without a test suite. Unlike functional software, LLMs can regress on quality dimensions — hallucination rate, faithfulness to source documents, answer relevance, and toxic output rate — without any change to the application code. The regression is invisible until users complain or a regulatory audit surfaces it. The LLM Evaluation Pipeline closes this gap by embedding automated, multi-dimensional quality scoring into the CI/CD pipeline, creating a quality gate that blocks model updates from reaching production when scores fall below defined thresholds.

This pattern defines an automated evaluation architecture that combines three complementary evaluation modes: LLM-as-judge (a capable judge model scores responses on a rubric), deterministic checks (exact match, citation grounding, factual constraint assertions), and human spot-check workflows (sampled responses routed to human reviewers for ground-truth calibration). Evaluation runs against versioned golden datasets, and results are stored as time-series quality metrics enabling trend analysis across model versions and prompt versions. The outcome is a quality-gated deployment workflow with full traceability: every model or prompt update carries a signed evaluation report showing which metrics passed, which were borderline, and which gating criteria were applied.


2. Problem Statement

Business Problem

Organisations upgrading LLM providers or model versions have no systematic mechanism to verify that quality is maintained or improved. Teams rely on manual spot-checks by engineers — a process that does not scale, is not reproducible, and produces no auditable evidence. Quality regressions reach production and are discovered by customers or, in regulated industries, by auditors.

Technical Problem

LLM quality is multidimensional: a response can be fluent and relevant but factually wrong (hallucination), or accurate but harmful, or correct but not grounded in the provided context (faithfulness failure). No single metric captures all dimensions. Existing software testing frameworks (unit tests, integration tests) cannot evaluate probabilistic text outputs. Evaluation requires purpose-built infrastructure with embedding-based semantic similarity, judge LLM scoring, and statistical aggregation over representative sample sets.

Symptoms of Absence

  • Model provider silently updates a model version; hallucination rate increases 15%; discovered in customer complaints three weeks later
  • Prompt engineers iterate on prompt templates with no objective before/after quality comparison
  • CI/CD pipeline has no LLM-specific gate; model changes ship through the same pipeline as database migrations
  • Quality scores are collected ad hoc in spreadsheets with no historical trend visibility
  • Human review is the only evaluation mechanism; reviewer fatigue and inconsistency produce unreliable signals

Cost of Inaction

  • Quality: Hallucination rate regressions compound trust damage; users who receive wrong answers do not return
  • Compliance: EU AI Act Article 9 requires ongoing risk management including performance monitoring; absence of systematic evaluation is a documented deficiency
  • Operational: Manual evaluation cannot scale with model update frequency; engineering teams spend 20–40% of AI sprint time on manual quality checks that a pipeline would automate

3. Context

When to Apply

  • Any production LLM system that receives model version updates (including provider-side silent updates)
  • RAG (Retrieval-Augmented Generation) systems where faithfulness to retrieved context is a correctness requirement
  • LLM systems in regulated industries where quality evidence is required for audit
  • Teams iterating on prompt templates at a cadence faster than manual review can sustain
  • Systems where hallucination or toxic output carries legal or safety consequences

When NOT to Apply

  • Proof-of-concept or internal demo systems with no production users
  • LLM applications where outputs are entirely creative/generative with no correctness criteria (pure creative writing assistants)
  • Organisations without a golden dataset or capacity to build one — evaluation against low-quality reference data produces misleading scores

Prerequisites

  • Versioned golden dataset with input prompts, expected outputs or reference documents, and ground-truth labels
  • Access to a capable judge LLM (GPT-4o, Claude Sonnet, or equivalent) with a stable API
  • CI/CD pipeline that can be extended with a quality gate step
  • Metrics storage backend (from EAAPL-OBS001 or equivalent) for evaluation result time-series
  • Human reviewer pool or annotation platform for spot-check calibration

Industry Applicability

| Industry | Use Case | Value | Adoption Level |
|---|---|---|---|
| Financial Services | Validate LLM advice and document summarisation against regulatory standards | Prevents hallucinated financial guidance reaching customers | High |
| Healthcare | Score clinical AI responses for factual accuracy and safety before deployment | Reduces risk of harmful clinical misinformation | Emerging |
| Legal Services | Evaluate case summary and contract review LLM for citation faithfulness | Prevents professional liability from hallucinated legal citations | Emerging |
| Technology / SaaS | Gate model upgrades in AI-powered products (code assistants, chat agents) | Maintains product quality through model churn | Proven |
| Government | Validate citizen-facing AI responses for factual accuracy and policy compliance | Enables auditable AI quality evidence for ministerial accountability | Emerging |

4. Architecture Overview

The LLM Evaluation Pipeline is triggered by two events: a CI/CD deployment pipeline stage (triggered on every model version bump or prompt template change) and a scheduled production monitoring job (running on a rolling sample of live traffic). Both paths share the same evaluation engine but differ in data source and disposition: the CI/CD path gates deployment; the production monitoring path feeds quality dashboards and drift alerts.

The evaluation engine accepts a batch of (input, output, reference) triples and applies a stack of three scorers:

  • LLM judge: a separate, independently versioned judge LLM is given a structured rubric and asked to score each response on faithfulness (0–1), relevance (0–1), coherence (0–1), and, for RAG systems, groundedness (whether claims are supported by the provided context). The judge prompt is itself versioned and pinned so that the judge cannot silently change behaviour.
  • Deterministic checks: exact-match checks for extractive tasks, regex assertions for required output structure, forbidden-token checks for safety constraints, and citation-grounding checks that verify every factual claim maps to a cited source chunk.
  • Human spot-check: a sampled 2–5% of responses is routed to a human reviewer queue (implemented in an annotation tool such as Label Studio or Argilla), which collects binary pass/fail labels used to detect and correct drift in judge LLM scoring.

Results from all scorers are aggregated into a per-run evaluation report. The report includes per-metric mean scores, standard deviations, pass rates against thresholds, sample-level detail rows, and a binary gate decision (PASS / FAIL / BORDERLINE). The gate decision is computed by comparing per-metric means against configured thresholds defined in a quality policy YAML file that lives in the application repository and is version-controlled alongside the model configuration. Borderline results trigger a mandatory human review before promotion proceeds.
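The aggregation and gate logic described above can be sketched roughly as follows. This is a minimal illustration, not the pattern's reference implementation: the `POLICY` dictionary stands in for the quality policy YAML, and the `borderline_margin` field is an assumption, since the text does not define how close to a threshold a score must be to count as BORDERLINE.

```python
from statistics import mean, pstdev

# Hypothetical policy values, standing in for the quality policy YAML file.
# "borderline_margin" is an assumed field, not defined by the pattern.
POLICY = {
    "faithfulness": {"threshold": 0.80, "borderline_margin": 0.03},
    "relevance": {"threshold": 0.75, "borderline_margin": 0.03},
}


def aggregate(samples):
    """Compute per-metric mean, standard deviation, and pass rate."""
    report = {}
    for metric, rule in POLICY.items():
        values = [row[metric] for row in samples]
        passed = sum(v >= rule["threshold"] for v in values)
        report[metric] = {
            "mean": mean(values),
            "std": pstdev(values),
            "pass_rate": passed / len(values),
        }
    return report


def gate_decision(report):
    """FAIL if any mean is clearly below its threshold, BORDERLINE if any
    mean sits within the margin around its threshold, otherwise PASS."""
    decision = "PASS"
    for metric, rule in POLICY.items():
        score = report[metric]["mean"]
        if score < rule["threshold"] - rule["borderline_margin"]:
            return "FAIL"
        if score < rule["threshold"] + rule["borderline_margin"]:
            decision = "BORDERLINE"
    return decision


samples = [
    {"faithfulness": 0.90, "relevance": 0.90},
    {"faithfulness": 0.90, "relevance": 0.80},
]
print(gate_decision(aggregate(samples)))
```

Treating a score just above the threshold as BORDERLINE is a deliberately conservative choice in this sketch; a team could equally restrict BORDERLINE to scores just below the threshold.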

Evaluation results are stored as structured records in the metrics backend, enabling time-series queries such as: "Show me the trend in faithfulness score for model GPT-4o-mini across the last 10 prompt versions." This historical view is the primary instrument for detecting gradual quality decay — a pattern that CI/CD gates alone (which only compare adjacent versions) cannot detect.
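To illustrate why the historical view matters, the sketch below stores per-run results in an in-memory SQLite table and runs the kind of trend query quoted above. The table name `eval_runs`, its columns, and the scores are all made up for the example; a real deployment would query the metrics backend instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE eval_runs (
           run_id TEXT, model_version TEXT, prompt_version INTEGER,
           metric TEXT, mean_score REAL)"""
)
# Illustrative data: each prompt version loses only 0.01-0.02 faithfulness.
scores = [0.88, 0.87, 0.86, 0.84, 0.83]
conn.executemany(
    "INSERT INTO eval_runs VALUES (?, ?, ?, ?, ?)",
    [(f"run-{v}", "gpt-4o-mini", v, "faithfulness", s)
     for v, s in enumerate(scores, start=1)],
)


def faithfulness_trend(conn, model, last_n=10):
    """Return (prompt_version, mean_score) for the last N prompt versions,
    ordered oldest to newest."""
    cur = conn.execute(
        """SELECT prompt_version, mean_score FROM (
               SELECT prompt_version, mean_score FROM eval_runs
               WHERE model_version = ? AND metric = 'faithfulness'
               ORDER BY prompt_version DESC LIMIT ?)
           ORDER BY prompt_version ASC""",
        (model, last_n),
    )
    return cur.fetchall()


trend = faithfulness_trend(conn, "gpt-4o-mini")
adjacent_drops = [a[1] - b[1] for a, b in zip(trend, trend[1:])]
total_drop = trend[0][1] - trend[-1][1]
```

In this made-up series no single step drops by more than about 0.02, so a gate that compares only adjacent versions could pass every change, while the cumulative drop of about 0.05 is visible only in the trend.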


5. Architecture Diagram

flowchart TD
    subgraph Trigger["Trigger Sources"]
        A[CI/CD Pipeline Stage]
        B[Scheduled Production Sampler]
    end
    subgraph EvalEngine["Evaluation Engine"]
        C[Input/Output/Reference Batcher]
        D[Judge LLM Scorer]
        E[Deterministic Checker]
        F[Human Spot-Check Queue]
    end
    subgraph Aggregation["Result Aggregation"]
        G[Score Aggregator]
        H{Gate Decision}
    end
    subgraph Disposition["Disposition"]
        I[Block Deployment]
        J[Promote to Production]
        K[Quality Metrics Store]
    end
    A --> C
    B --> C
    C --> D
    C --> E
    C --> F
    D --> G
    E --> G
    F --> G
    G --> H
    H -->|fail| I
    H -->|pass| J
    H -->|borderline| F
    G --> K

6. Components

| Component | Responsibility | Technology Examples |
|---|---|---|
| Golden Dataset Store | Versioned repository of (input, reference, label) triples used as evaluation ground truth | DVC + S3, Hugging Face Datasets, PostgreSQL with version tagging |
| Judge LLM Client | Calls a judge LLM with a structured rubric prompt; parses structured score output; handles judge API failures gracefully | OpenAI GPT-4o, Anthropic Claude Sonnet, self-hosted Llama-3-70B via vLLM |
| Deterministic Checker | Runs regex, exact-match, forbidden-token, and citation-grounding assertions against each output | Custom Python assertions, DeepEval deterministic scorers, Pytest-based harness |
| Human Annotation Queue | Routes sampled responses to human reviewers; collects binary labels; feeds calibration loop | Label Studio, Argilla, Prodigy, AWS Augmented AI |
| Score Aggregator | Collects per-sample scores from all scorers; computes per-metric statistics; applies threshold policy | Python/Pandas aggregation, MLflow evaluation module, Ragas framework |
| Quality Gate Controller | Reads aggregated scores against policy YAML; emits PASS/FAIL/BORDERLINE gate decision to CI/CD | Custom gate script, GitHub Actions step, Jenkins quality gate plugin |
| Evaluation Metrics Backend | Time-series storage of per-run evaluation results; supports trend queries across model and prompt versions | PostgreSQL with TimescaleDB, InfluxDB, MLflow tracking server |
| Quality Dashboard | Visualises per-metric trend lines, version comparisons, and threshold breach history | Grafana, MLflow UI, Evidence.dev, custom React dashboard |

7. Implementation Steps

Step 1: Build and Version the Golden Dataset

Construct a golden dataset of at minimum 200 representative (input, reference) pairs covering the key use cases, edge cases, and known difficult examples from production. Tag each example with a topic category and difficulty level. Store the dataset in a version-controlled store (DVC + S3 or Hugging Face Datasets). Establish a dataset refresh cadence (quarterly minimum) with a process for incorporating production examples that exposed quality issues. The golden dataset is the evaluation system's ground truth — its quality determines the reliability of every metric derived from it.
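One possible shape for a golden dataset record, with a content fingerprint that can be pinned alongside the dataset version, is sketched below. The `GoldenExample` fields and both helper functions are hypothetical names chosen for illustration; tools such as DVC track file hashes on their own, so this only shows the idea.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GoldenExample:
    example_id: str
    input: str
    reference: str
    category: str    # topic category tag
    difficulty: str  # e.g. "easy", "medium", "hard"


def dataset_fingerprint(examples):
    """SHA-256 over canonical JSON of all examples, sorted by id, so the
    same content always yields the same fingerprint regardless of order."""
    canonical = json.dumps(
        [asdict(e) for e in sorted(examples, key=lambda e: e.example_id)],
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def validate(examples, minimum=200):
    """Return a list of problems; an empty list means the dataset is usable."""
    problems = []
    if len(examples) < minimum:
        problems.append(f"only {len(examples)} examples, need {minimum}")
    ids = [e.example_id for e in examples]
    if len(set(ids)) != len(ids):
        problems.append("duplicate example_id values")
    if any(not e.input.strip() or not e.reference.strip() for e in examples):
        problems.append("empty input or reference")
    return problems
```

Recording the fingerprint in each evaluation report makes it possible to tell whether a score change came from the model or from a dataset change.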

Step 2: Implement the Judge LLM Scorer with a Pinned Rubric

Define a structured judge rubric as a versioned prompt template. The rubric must specify: the scoring dimensions (faithfulness, relevance, coherence, groundedness for RAG), the scale (0.0–1.0), the output format (JSON with per-dimension scores and a brief rationale), and examples for each score level (few-shot examples in the judge prompt). Pin the judge model version explicitly (e.g., gpt-4o-2024-11-20 not gpt-4o). Implement the judge client with retry logic, structured output parsing, and fallback behaviour for judge API failures. Validate the judge's reliability by comparing its scores against human labels on a calibration set; target Spearman correlation > 0.75 before using the judge in a gate.
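The calibration check can be computed without extra dependencies. The sketch below implements Spearman rank correlation as Pearson correlation over average ranks (so tied values, which are unavoidable with binary human labels, are handled); the judge scores and human labels are invented for illustration.

```python
from statistics import mean


def average_ranks(values):
    """Rank values 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors.
    Raises ZeroDivisionError if either input is constant."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up calibration set: judge scores vs. binary human pass/fail labels.
judge_scores = [0.9, 0.8, 0.3, 0.2]
human_labels = [1, 1, 0, 0]
rho = spearman(judge_scores, human_labels)
judge_ready_for_gate = rho > 0.75
```

A four-item set is far too small for a real calibration decision; it is used here only to keep the arithmetic easy to follow.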

Step 3: Wire Deterministic Checkers and CI/CD Integration

Implement deterministic checks as a test suite that runs against each output in the batch: exact-match assertions for extractive tasks, forbidden-token lists for safety constraints, and citation-grounding checks for RAG outputs (every factual sentence must map to a retrieved chunk with embedding similarity > 0.85). Integrate the full evaluation run as a CI/CD pipeline stage triggered on pull requests that change model configuration, prompt templates, or model version pins. The stage must run the evaluation batch, compute aggregate scores, compare them against the quality policy thresholds, and emit a PASS, FAIL, or BORDERLINE status that gates the merge or deployment; a BORDERLINE status holds promotion until a human review is recorded.
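A citation-grounding check along these lines might look like the sketch below. The `toy_embed` function is a bag-of-words stand-in so the example runs without a model; a real pipeline would pass in an embedding model call instead, and the 0.85 threshold would need recalibrating for that model. The naive sentence splitter is also an assumption.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in embedding (word counts). Replace with a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def ungrounded_sentences(answer, chunks, embed=toy_embed, threshold=0.85):
    """Return answer sentences whose best similarity to any retrieved chunk
    does not exceed the threshold."""
    chunk_vecs = [embed(c) for c in chunks]
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    return [
        s
        for s in sentences
        if max((cosine(embed(s), c) for c in chunk_vecs), default=0.0) <= threshold
    ]


chunks = ["The refund window is 30 days."]
answer = "The refund window is 30 days. Shipping is always free."
failures = ungrounded_sentences(answer, chunks)
```

In a CI stage, a non-empty `failures` list for any sample would count against the deterministic pass rate.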

Step 4: Deploy Production Monitoring and Calibration Loop

Run the evaluation pipeline on a rolling 2–5% sample of live production traffic on a daily schedule. Store all evaluation results in the metrics backend with a consistent schema (run_id, model_version, prompt_version, timestamp, per-metric scores, gate_decision). Configure dashboard alerts for metric trend regressions (e.g., rolling 7-day faithfulness score drops > 5% from the 30-day baseline). Route a fraction of production-sampled evaluations to human reviewers to maintain calibration between judge LLM scores and ground truth. Run quarterly judge calibration checks and update the judge prompt if correlation with human labels drifts below 0.70.
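The sampling and alerting rules in this step can be sketched as below. Two assumptions are made that the text leaves open: the "5% drop" is read as a relative drop, and the 30-day baseline is the trailing 30 days including the most recent 7. Hash-based sampling is one option among several; it is chosen here because it gives the same decision for the same request ID on every run.

```python
import hashlib
from statistics import mean


def in_sample(request_id, rate=0.02):
    """Deterministically select roughly `rate` of requests by hashing the ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < rate


def drift_alert(daily_scores, recent_days=7, baseline_days=30, max_drop=0.05):
    """daily_scores: daily mean scores, oldest first.
    Returns (alert, relative_drop) comparing the recent window to the baseline."""
    if len(daily_scores) < baseline_days:
        return False, 0.0  # not enough history to form a baseline
    baseline = mean(daily_scores[-baseline_days:])
    recent = mean(daily_scores[-recent_days:])
    drop = (baseline - recent) / baseline
    return drop > max_drop, drop


# Illustrative history: 23 stable days followed by a 7-day regression.
history = [0.90] * 23 + [0.80] * 7
alert, drop = drift_alert(history)
```

With this made-up history the relative drop is roughly 8.7%, which exceeds the 5% limit and would fire the alert.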


8. Security Considerations

OWASP LLM Top 10 Mapping

| OWASP ID | Threat | Mitigation |
|---|---|---|
| LLM01 Prompt Injection | Adversarial inputs in the golden dataset or production sample could manipulate the judge LLM into producing inflated scores | Sanitise inputs before judge submission; run judge in a restricted system prompt context; monitor for anomalous score distributions |
| LLM02 Insecure Output Handling | Judge LLM output parsed as structured JSON could contain injected code if output parsing is insufficiently strict | Use schema-validated JSON parsing (Pydantic, jsonschema); reject malformed judge responses; never eval() judge output |
| LLM06 Sensitive Information Disclosure | Golden dataset may contain PII from production examples used to construct evaluation cases | Apply the same PII scrubbing pipeline used in EAAPL-OBS001 to all evaluation data; restrict golden dataset access to ML engineering |
| LLM09 Overreliance | Automated evaluation scores may create false confidence, causing teams to skip human review of borderline or high-stakes outputs | Enforce mandatory human review for BORDERLINE gate decisions; publish calibration metrics so teams understand judge reliability limits |
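The LLM02 mitigation can be illustrated with a strict parser for judge output. The table names Pydantic and jsonschema; the sketch below uses only the standard library to stay self-contained, and assumes a RAG system whose judge returns exactly the four rubric dimensions plus a `rationale` string. The exception name and key set are illustrative.

```python
import json

DIMENSIONS = ("faithfulness", "relevance", "coherence", "groundedness")


class JudgeOutputError(ValueError):
    """Raised when judge output does not match the expected schema."""


def parse_judge_output(raw):
    """Parse and validate judge output; reject anything unexpected."""
    try:
        data = json.loads(raw)  # json.loads only parses; it never executes code
    except json.JSONDecodeError as exc:
        raise JudgeOutputError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise JudgeOutputError("top-level value must be an object")
    expected = set(DIMENSIONS) | {"rationale"}
    if set(data) != expected:
        raise JudgeOutputError(f"key mismatch: {sorted(set(data) ^ expected)}")
    for dim in DIMENSIONS:
        score = data[dim]
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
        if not is_number or not 0.0 <= score <= 1.0:
            raise JudgeOutputError(f"{dim} must be a number in [0, 1]")
    if not isinstance(data["rationale"], str):
        raise JudgeOutputError("rationale must be a string")
    return data
```

Rejected responses should be counted as judge failures rather than silently dropped, so that a spike in malformed output shows up in the availability metric.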

9. Governance Artefacts

  • Quality Policy YAML file (version-controlled in application repository): defines per-metric gate thresholds, evaluation dataset version pin, judge model version pin, and borderline review SLA
  • Evaluation Run Report (generated per CI/CD run and per scheduled production run): signed artefact containing run metadata, per-metric aggregate scores, per-sample detail, gate decision, and reviewer identity for any human review steps
  • Judge Calibration Report (quarterly): Spearman correlation of judge scores vs. human labels; drift analysis vs. previous quarter; recommendation to update judge prompt if needed
  • Golden Dataset Changelog: log of dataset additions, removals, and version bumps with rationale; required to explain evaluation score changes attributable to dataset evolution
  • Quality Trend Dashboard (live): time-series of per-metric scores across model versions and prompt versions; visible to product, engineering, and compliance stakeholders
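The pattern calls for a signed Evaluation Run Report but does not say how it is signed. As one possible approach, the sketch below attaches an HMAC-SHA256 signature over a canonical JSON serialisation; the function names and report fields are illustrative, and an organisation may prefer asymmetric signatures issued by its key management service.

```python
import hashlib
import hmac
import json


def _canonical(report):
    """Serialise deterministically so the same report always signs the same."""
    return json.dumps(report, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_report(report, key):
    """Wrap the report with an HMAC-SHA256 signature."""
    signature = hmac.new(key, _canonical(report), hashlib.sha256).hexdigest()
    return {"report": report, "signature": signature}


def verify_report(signed, key):
    """Return True only if the report content matches its signature."""
    expected = hmac.new(key, _canonical(signed["report"]), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signed["signature"])
```

Any later edit to the report body, such as changing the gate decision, invalidates the signature, which is what makes the artefact usable as audit evidence.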

10. SLOs

| SLO | Target | Measurement |
|---|---|---|
| CI/CD evaluation run duration | < 15 minutes for a 200-sample golden dataset batch | Wall-clock time from stage trigger to gate decision |
| Judge LLM scorer availability | > 99.5% of evaluation runs complete without judge API failure | Failed judge calls / total judge calls per week |
| Human spot-check review SLA | Borderline gate decisions reviewed within 4 business hours | Time from BORDERLINE flag to human decision recorded |
| Faithfulness score threshold (RAG systems) | > 0.80 mean score across golden dataset | Per-run aggregated faithfulness score in evaluation report |
| Hallucination gate pass rate | > 95% of outputs score above hallucination threshold on deterministic checks | Per-run deterministic check pass rate |
| Evaluation latency (CI gate) | < 90 s per 100-sample batch | P99 pipeline duration |
| Drift alert MTTD (Mean Time to Detect) | < 24 hours | Time from regression onset to alert firing |

11. Cost Model

| Cost Driver | Estimate | Notes |
|---|---|---|
| Judge LLM API calls (CI/CD runs) | $0.10–$2.00 per evaluation run | Depends on golden dataset size (200–1000 samples) and judge model; GPT-4o at $5/1M input tokens, 200 samples at ~500 tokens each = ~$0.50 |
| Judge LLM API calls (production monitoring) | $50–$500/month | Scales with production volume sampled; 2% of 100K daily requests = 2K samples/day at ~$0.003/sample |
| Human annotation platform | $0–$500/month | Label Studio open-source (free) to Argilla Cloud or Scale AI for enterprise annotation |
| Evaluation metrics storage | $20–$100/month | Structured records in PostgreSQL/TimescaleDB; small volume relative to telemetry |
| Compute for evaluation worker | $50–$200/month | Containerised evaluation runner on ECS/Cloud Run; bursty workload suits serverless |
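The judge cost figures in the table follow from simple arithmetic, reproduced below using only the numbers quoted in the Notes column. The function name is illustrative, output tokens are ignored as in the table, and a 30-day month is assumed.

```python
def judge_cost_usd(samples, tokens_per_sample, usd_per_million_input_tokens):
    """Input-token cost of one evaluation run (output tokens not counted)."""
    return samples * tokens_per_sample * usd_per_million_input_tokens / 1_000_000


# CI/CD run: 200 samples at ~500 tokens each, $5 per 1M input tokens.
ci_run_cost = judge_cost_usd(200, 500, 5.0)

# Production monitoring: 2% of 100K daily requests at ~$0.003 per sample.
daily_samples = 100_000 * 2 // 100
monthly_monitoring_cost = daily_samples * 0.003 * 30

print(f"CI run: ${ci_run_cost:.2f}, monitoring: ${monthly_monitoring_cost:.0f}/month")
```

Under these assumptions a CI run costs about $0.50 and production monitoring about $180 per month, both inside the ranges given in the table.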

12. Trade-off Analysis

| Dimension | Benefit | Trade-off |
|---|---|---|
| Judge LLM scoring | Captures semantic quality dimensions no deterministic check can measure; scales to any output type | Judge LLM itself can hallucinate or apply rubric inconsistently; requires calibration and is not free |
| Golden dataset gating | Reproducible, versioned, objective gate; enables before/after comparison across versions | Golden dataset quality is a hard ceiling on evaluation quality; stale or unrepresentative datasets produce misleading gates |
| CI/CD integration | Catches regressions before they reach production; shifts quality left into the development workflow | Adds latency to deployment pipeline; requires engineering discipline to not skip the gate under schedule pressure |
| Human spot-check calibration | Provides ground-truth anchor for judge reliability; surfaces rubric drift | Slow, expensive, and creates reviewer bottleneck; cannot scale to full production volume |
| Automated PASS/FAIL gate | Removes human bottleneck from routine model upgrades | Binary gate can block beneficial changes for borderline-but-acceptable score dips; threshold calibration requires ongoing attention |

13. Failure Modes

| Failure | Trigger | Recovery |
|---|---|---|
| Judge LLM API unavailable during CI/CD gate | Provider outage or rate limit at evaluation time | Implement retry with exponential backoff; fall back to deterministic-only gate with a human review flag; do not fail open and skip the gate |
| Golden dataset staleness causes misleading gate pass | Production distribution shifts away from golden dataset examples; gate passes on out-of-distribution regressions | Quarterly dataset refresh process; production monitoring catches what CI/CD gate misses; add production-sampled hard examples to dataset after every production quality incident |
| Judge prompt drift changes score distribution | Judge model provider silently updates model; rubric interpretation shifts; historical scores become non-comparable | Pin judge model version explicitly; run calibration check on each judge model update; maintain calibration corpus for trend adjustment |
| Gate threshold too strict blocks all deployments | Threshold set at launch but model quality at P80 never stably clears it | Threshold calibration review process; thresholds must be set based on achieved baseline scores, not aspirational targets |
| Human review bottleneck under BORDERLINE volume | Many borderline decisions queue simultaneously; 4-hour SLA breached | Pre-allocate reviewer capacity for deployment windows; escalation path to senior AI engineer for time-sensitive deployments |
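The recovery path for the first failure mode (retry, then fall back without failing open) could be sketched as follows. `JudgeUnavailable` is a hypothetical exception a judge client might raise, and mapping the fallback to a BORDERLINE decision is an assumption about how the "human review flag" is surfaced.

```python
import time


class JudgeUnavailable(Exception):
    """Hypothetical error raised by a judge client on outage or rate limit."""


def call_with_backoff(call, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Retry `call`, waiting base_delay * 2**attempt between failed attempts."""
    for attempt in range(attempts):
        try:
            return call()
        except JudgeUnavailable:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


def run_gate(judge_gate, deterministic_pass, sleep=time.sleep):
    """Combine judge and deterministic results; never skip the gate."""
    try:
        judge_decision = call_with_backoff(judge_gate, sleep=sleep)
    except JudgeUnavailable:
        # Deterministic-only fallback: a pass still requires human review.
        decision = "BORDERLINE" if deterministic_pass else "FAIL"
        return {
            "decision": decision,
            "judge_scored": False,
            "needs_human_review": deterministic_pass,
        }
    decision = judge_decision if deterministic_pass else "FAIL"
    return {
        "decision": decision,
        "judge_scored": True,
        "needs_human_review": decision == "BORDERLINE",
    }
```

The `sleep` parameter is injectable so that tests can run without real delays; in production the default `time.sleep` applies.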

14. Regulatory Mapping

| Regulation | Requirement | How Pattern Addresses It |
|---|---|---|
| EU AI Act Article 9 | High-risk AI systems must implement a risk management system including continuous performance monitoring | Evaluation pipeline provides systematic, documented quality measurement at every deployment; evaluation reports are auditable evidence |
| EU AI Act Article 12 | High-risk AI systems must maintain logs enabling post-market monitoring and investigation | Evaluation run reports and per-sample score records provide the evidence layer required for Article 12 compliance |
| APRA CPS 230 | Material models used in financial services require model risk management including validation and ongoing monitoring | Quality gate thresholds and trend dashboards satisfy model monitoring requirements; evaluation reports satisfy periodic validation documentation |
| APRA CPS 230 §21 | AI systems classified as critical operations require monitoring that demonstrates the system is operating within defined performance parameters | The evaluation pipeline produces the evidence artefact (evaluation scorecard with rolling baseline) that satisfies the 'regular testing of operational resilience' requirement |
| APRA CPS 234 §36 | Material changes to AI system behaviour may constitute a 'material information security incident' or 'material service provider change' requiring APRA notification within 72 hours | Detection capability provided by this pattern is the prerequisite for meeting that 72-hour notification timeline; evaluation results surface material behavioural changes (prompt drift, model version change, significant accuracy regression) as soon as they occur |
| ISO/IEC 42001 Clause 9.1 | AI management system must define methods for monitoring, measurement, analysis, and evaluation of AI performance | This pattern directly implements the measurement and analysis obligation; evaluation reports are the required evidence artefacts |
| NIST AI RMF MEASURE 2.5 | AI systems must be tested against appropriate metrics to evaluate trustworthiness | Multi-dimensional scoring (faithfulness, relevance, toxicity) covers the full metric spectrum required by MEASURE 2.5 |

15. Reference Implementations

AWS

  • Evaluation Runner: AWS Lambda or ECS Task triggered from CodePipeline quality gate stage
  • Judge LLM: Amazon Bedrock (Anthropic Claude via Bedrock API) with Guardrails for judge prompt protection
  • Deterministic Checks: pytest-based test suite in CodeBuild step
  • Human Annotation: Amazon Augmented AI (A2I) for human spot-check workflow
  • Metrics Storage: Amazon RDS PostgreSQL + Amazon Managed Grafana for dashboards
  • Golden Dataset: S3 + DVC (Data Version Control) integration

Azure

  • Evaluation Runner: Azure Container Apps Job triggered from Azure DevOps pipeline
  • Judge LLM: Azure OpenAI GPT-4o with Prompt Shields enabled on judge prompts
  • Deterministic Checks: pytest in Azure DevOps task
  • Human Annotation: Azure Machine Learning Data Labelling
  • Metrics Storage: Azure Database for PostgreSQL + Azure Managed Grafana
  • Golden Dataset: Azure Blob Storage + MLflow dataset tracking

On-Premises

  • Evaluation Runner: Containerised Python worker on Kubernetes CronJob and CI pipeline hook
  • Judge LLM: Self-hosted Llama-3-70B or Mistral-Large via vLLM; reduces cost for high-volume evaluation
  • Deterministic Checks: DeepEval framework or custom pytest harness
  • Human Annotation: Label Studio (open-source, self-hosted)
  • Metrics Storage: PostgreSQL + TimescaleDB + Grafana
  • Golden Dataset: DVC + MinIO object storage

16. Related Patterns

  • EAAPL-OBS001 AI Telemetry Architecture: provides the metrics backend and structured log schema in which evaluation results are stored
  • EAAPL-OBS003 Hallucination Detection: real-time hallucination detection at inference time; this pattern provides the batch evaluation gate that validates hallucination rate before deployment
  • EAAPL-OBS007 Prompt Drift Detection: drift monitoring for production prompt performance; this pattern provides the CI/CD gate, and OBS007 provides the production monitoring layer
  • EAAPL-OBS008 A/B Model Evaluation: canary deployment pattern that complements this gate-based pattern; use OBS006 to qualify a challenger model before OBS008 routes live traffic to it
  • EAAPL-OBS005 Model Drift Detection: population stability index drift on input/output distributions; evaluation pipeline scores are the quality signal that drift monitoring observes

17. Maturity Assessment

| Dimension | Level | Notes |
|---|---|---|
| Adoption Breadth | 3 (Emerging-Proven) | Systematic LLM evaluation pipelines are established at AI-native companies; slower adoption in traditional enterprises where model risk management processes predate LLM-specific tooling |
| Tooling Ecosystem | 4 (Proven) | Ragas, DeepEval, MLflow LLM evaluation, and ROUGE/BLEU libraries are mature; LLM-as-judge pattern is well-documented and widely validated |
| Regulatory Evidence | 4 (Proven) | EU AI Act conformance assessments now routinely cite systematic evaluation as a required control; pattern aligns with published AI Act technical documentation |
| Cost Predictability | 3 (Moderate) | Judge LLM costs scale predictably with dataset size; production monitoring costs depend on traffic volume and sampling rate, which are controllable |

18. Revision History

| Version | Date | Change |
|---|---|---|
| 1.0 | 2026-06-14 | Initial release |