[EAAPL-OBS006] LLM Evaluation Pipeline

Category: Observability & Monitoring
Sub-category: LLM Quality
Version: 1.0
Maturity: Proven
Tags: llm-evaluation, hallucination-detection, faithfulness, judge-llm, ci-cd-gate, quality-metrics, ragas, continuous-evaluation
Regulatory Relevance: EU AI Act Article 9 & 12, APRA CPS 230, ISO/IEC 42001 Clause 9.1, NIST AI RMF MEASURE 2.5

1. Executive Summary

Deploying an LLM update to production without a structured evaluation gate is equivalent to deploying software without a test suite. Unlike conventional software, LLMs can regress on quality dimensions (hallucination rate, faithfulness to source documents, answer relevance, and toxic output rate) without any change to the application code. The regression stays invisible until users complain or a regulatory audit surfaces it. The LLM Evaluation Pipeline closes this gap by embedding automated, multi-dimensional quality scoring into the CI/CD pipeline, creating a quality gate that blocks model updates from reaching production when scores fall below defined thresholds.

This pattern defines an automated evaluation architecture that combines three complementary evaluation modes: LLM-as-judge (a capable judge model scores responses on a rubric), deterministic checks (exact match, citation grounding, factual constraint assertions), and human spot-check workflows (sampled responses routed to human reviewers for ground-truth calibration). Evaluation runs against versioned golden datasets, and results are stored as time-series quality metrics enabling trend analysis across model versions and prompt versions. The outcome is a quality-gated deployment workflow with full traceability: every model or prompt update carries a signed evaluation report showing which metrics passed, which were borderline, and which gating criteria were applied.

2. Problem Statement

Business Problem

Organisations upgrading LLM providers or model versions have no systematic mechanism to verify that quality is maintained or improved. Teams rely on manual spot-checks by engineers — a process that does not scale, is not reproducible, and produces no auditable evidence. Quality regressions reach production and are discovered by customers or, in regulated industries, by auditors.

Technical Problem

LLM quality is multidimensional: a response can be fluent and relevant but factually wrong (hallucination), or accurate but harmful, or correct but not grounded in the provided context (faithfulness failure). No single metric captures all dimensions. Existing software testing frameworks (unit tests, integration tests) cannot evaluate probabilistic text outputs. Evaluation requires purpose-built infrastructure with embedding-based semantic similarity, judge LLM scoring, and statistical aggregation over representative sample sets.

Symptoms of Absence

Model provider silently updates a model version; hallucination rate increases 15%; discovered in customer complaints three weeks later
Prompt engineers iterate on prompt templates with no objective before/after quality comparison
CI/CD pipeline has no LLM-specific gate; model changes ship through the same pipeline as database migrations
Quality scores are collected ad hoc in spreadsheets with no historical trend visibility
Human review is the only evaluation mechanism; reviewer fatigue and inconsistency produce unreliable signals

Cost of Inaction

Quality: Hallucination rate regressions compound trust damage; users who receive wrong answers do not return
Compliance: EU AI Act Article 9 requires ongoing risk management including performance monitoring; absence of systematic evaluation is a documented deficiency
Operational: Manual evaluation cannot scale with model update frequency; engineering teams spend 20–40% of AI sprint time on manual quality checks that a pipeline would automate

3. Context

When to Apply

Any production LLM system that receives model version updates (including provider-side silent updates)
RAG (Retrieval-Augmented Generation) systems where faithfulness to retrieved context is a correctness requirement
LLM systems in regulated industries where quality evidence is required for audit
Teams iterating on prompt templates at a cadence faster than manual review can sustain
Systems where hallucination or toxic output carries legal or safety consequences

When NOT to Apply

Proof-of-concept or internal demo systems with no production users
LLM applications where outputs are entirely creative/generative with no correctness criteria (pure creative writing assistants)
Organisations without a golden dataset or capacity to build one — evaluation against low-quality reference data produces misleading scores

Prerequisites

Versioned golden dataset with input prompts, expected outputs or reference documents, and ground-truth labels
Access to a capable judge LLM (GPT-4o, Claude Sonnet, or equivalent) with a stable API
CI/CD pipeline that can be extended with a quality gate step
Metrics storage backend (from EAAPL-OBS001 or equivalent) for evaluation result time-series
Human reviewer pool or annotation platform for spot-check calibration

Industry Applicability

Industry	Use Case	Value	Adoption Level
Financial Services	Validate LLM advice and document summarisation against regulatory standards	Prevents hallucinated financial guidance reaching customers	High
Healthcare	Score clinical AI responses for factual accuracy and safety before deployment	Reduces risk of harmful clinical misinformation	Emerging
Legal Services	Evaluate case summary and contract review LLM for citation faithfulness	Prevents professional liability from hallucinated legal citations	Emerging
Technology / SaaS	Gate model upgrades in AI-powered products (code assistants, chat agents)	Maintains product quality through model churn	Proven
Government	Validate citizen-facing AI responses for factual accuracy and policy compliance	Enables auditable AI quality evidence for ministerial accountability	Emerging

4. Architecture Overview

The LLM Evaluation Pipeline is triggered by two events: a CI/CD deployment pipeline stage (triggered on every model version bump or prompt template change) and a scheduled production monitoring job (running on a rolling sample of live traffic). Both paths share the same evaluation engine but differ in data source and disposition: the CI/CD path gates deployment; the production monitoring path feeds quality dashboards and drift alerts.

The evaluation engine accepts a batch of (input, output, reference) triples and applies a multi-scorer stack.

The first scorer is the LLM judge: a separate, independently versioned judge LLM is given a structured rubric and asked to score each response on faithfulness (0–1), relevance (0–1), coherence (0–1), and, for RAG systems, groundedness (whether claims are supported by the provided context). The judge LLM prompt is itself versioned and pinned to prevent the judge from silently changing behaviour.

The second scorer is deterministic: exact-match checks for extractive tasks, regex assertions for required output structure, forbidden-token checks for safety constraints, and citation-grounding checks that verify every factual claim maps to a cited source chunk.

The third scorer is the human spot-check workflow, which routes a sampled 2–5% of responses to a human reviewer queue (implemented in an annotation tool such as Label Studio or Argilla) and collects binary pass/fail labels that are used to detect and correct drift in judge LLM scoring.
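
The spot-check sampling could be sketched as follows. This is only an illustration: the pattern does not say how the 2–5% sample is drawn, so the hash-based selection, the function name `route_to_human_review`, and the `response_id` and `sample_rate` parameters are all assumptions.

```python
import hashlib


def route_to_human_review(response_id: str, sample_rate: float = 0.03) -> bool:
    """Return True if this response falls in the human spot-check sample.

    Hash-based selection is an assumption of this sketch: it makes the
    decision deterministic, so re-running the pipeline on the same
    responses selects the same subset for review.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(response_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash onto the unit interval.
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate
```

A deterministic sample has one practical advantage over random sampling: a reviewer's label can be matched to the same response when the evaluation run is repeated.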

Results from all scorers are aggregated into a per-run evaluation report. The report includes per-metric mean scores, standard deviations, pass rates against thresholds, sample-level detail rows, and a three-way gate decision (PASS / FAIL / BORDERLINE). The gate decision is computed by comparing per-metric means against configured thresholds defined in a quality policy YAML file that lives in the application repository and is version-controlled alongside the model configuration. Borderline results trigger a mandatory human review before promotion proceeds.
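
A minimal sketch of the gate decision logic, under stated assumptions. The pattern does not define how close to a threshold a score must be to count as BORDERLINE, so the `borderline_margin` field is hypothetical, as are the `POLICY` dictionary (standing in for the parsed policy YAML) and the relevance threshold. The 0.80 faithfulness threshold is the one listed in the SLO table.

```python
from statistics import mean

# Hypothetical policy: in practice this would be loaded from the quality
# policy YAML file. The "borderline_margin" field is an assumption.
POLICY = {
    "faithfulness": {"threshold": 0.80, "borderline_margin": 0.02},
    "relevance": {"threshold": 0.75, "borderline_margin": 0.02},
}


def gate_decision(scores: dict[str, list[float]], policy: dict = POLICY) -> str:
    """Compare per-metric means to thresholds; return PASS, FAIL, or BORDERLINE."""
    decision = "PASS"
    for metric, rule in policy.items():
        if not scores.get(metric):
            return "FAIL"  # a missing or empty metric must never pass the gate
        metric_mean = mean(scores[metric])
        if metric_mean < rule["threshold"] - rule["borderline_margin"]:
            return "FAIL"
        if metric_mean < rule["threshold"]:
            decision = "BORDERLINE"  # keep checking: a later metric may still fail
    return decision
```

Any single failing metric fails the run, and a run is BORDERLINE only when no metric fails but at least one sits just under its threshold.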

Evaluation results are stored as structured records in the metrics backend, enabling time-series queries such as: "Show me the trend in faithfulness score for model GPT-4o-mini across the last 10 prompt versions." This historical view is the primary instrument for detecting gradual quality decay — a pattern that CI/CD gates alone (which only compare adjacent versions) cannot detect.
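
The trend query quoted above might look like this in practice. The sketch uses an in-memory SQLite database purely as a stand-in for the metrics backend; the table name `eval_runs`, the integer `prompt_version` column, and the function names are assumptions rather than part of the pattern.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory stand-in for the evaluation metrics backend."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE eval_runs (
               run_id TEXT PRIMARY KEY,
               model_version TEXT NOT NULL,
               prompt_version INTEGER NOT NULL,
               run_timestamp TEXT NOT NULL,
               faithfulness REAL NOT NULL,
               gate_decision TEXT NOT NULL
           )"""
    )
    return conn


def faithfulness_trend(conn: sqlite3.Connection, model_version: str, last_n: int = 10):
    """Mean faithfulness per prompt version for the newest `last_n` versions, oldest first."""
    rows = conn.execute(
        """SELECT prompt_version, AVG(faithfulness)
           FROM eval_runs
           WHERE model_version = ?
           GROUP BY prompt_version
           ORDER BY prompt_version DESC
           LIMIT ?""",
        (model_version, last_n),
    ).fetchall()
    return rows[::-1]  # reverse so the series reads oldest to newest
```

Grouping by prompt version rather than by run keeps the series readable when the same version is evaluated several times.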

5. Architecture Diagram

flowchart TD
    subgraph Trigger["Trigger Sources"]
        A[CI/CD Pipeline Stage]
        B[Scheduled Production Sampler]
    end
    subgraph EvalEngine["Evaluation Engine"]
        C[Input/Output/Reference Batcher]
        D[Judge LLM Scorer]
        E[Deterministic Checker]
        F[Human Spot-Check Queue]
    end
    subgraph Aggregation["Result Aggregation"]
        G[Score Aggregator]
        H{Gate Decision}
    end
    subgraph Disposition["Disposition"]
        I[Block Deployment]
        J[Promote to Production]
        K[Quality Metrics Store]
    end
    A --> C
    B --> C
    C --> D
    C --> E
    C --> F
    D --> G
    E --> G
    F --> G
    G --> H
    H -->|fail| I
    H -->|pass| J
    H -->|borderline| F
    G --> K

6. Components

Component	Responsibility	Technology Examples
Golden Dataset Store	Versioned repository of (input, reference, label) triples used as evaluation ground truth	DVC + S3, Hugging Face Datasets, PostgreSQL with version tagging
Judge LLM Client	Calls a judge LLM with a structured rubric prompt; parses structured score output; handles judge API failures gracefully	OpenAI GPT-4o, Anthropic Claude Sonnet, self-hosted Llama-3-70B via vLLM
Deterministic Checker	Runs regex, exact-match, forbidden-token, and citation-grounding assertions against each output	Custom Python assertions, DeepEval deterministic scorers, Pytest-based harness
Human Annotation Queue	Routes sampled responses to human reviewers; collects binary labels; feeds calibration loop	Label Studio, Argilla, Prodigy, AWS Augmented AI
Score Aggregator	Collects per-sample scores from all scorers; computes per-metric statistics; applies threshold policy	Python/Pandas aggregation, MLflow evaluation module, Ragas framework
Quality Gate Controller	Reads aggregated scores against policy YAML; emits PASS/FAIL/BORDERLINE gate decision to CI/CD	Custom gate script, GitHub Actions step, Jenkins quality gate plugin
Evaluation Metrics Backend	Time-series storage of per-run evaluation results; supports trend queries across model and prompt versions	PostgreSQL with TimescaleDB, InfluxDB, MLflow tracking server
Quality Dashboard	Visualises per-metric trend lines, version comparisons, and threshold breach history	Grafana, MLflow UI, Evidence.dev, custom React dashboard

7. Implementation Steps

Step 1: Build and Version the Golden Dataset

Construct a golden dataset of at minimum 200 representative (input, reference) pairs covering the key use cases, edge cases, and known difficult examples from production. Tag each example with a topic category and difficulty level. Store the dataset in a version-controlled store (DVC + S3 or Hugging Face Datasets). Establish a dataset refresh cadence (quarterly minimum) with a process for incorporating production examples that exposed quality issues. The golden dataset is the evaluation system's ground truth — its quality determines the reliability of every metric derived from it.

Step 2: Implement the Judge LLM Scorer with a Pinned Rubric

Define a structured judge rubric as a versioned prompt template. The rubric must specify: the scoring dimensions (faithfulness, relevance, coherence, groundedness for RAG), the scale (0.0–1.0), the output format (JSON with per-dimension scores and a brief rationale), and examples for each score level (few-shot examples in the judge prompt). Pin the judge model version explicitly (e.g., gpt-4o-2024-11-20 not gpt-4o). Implement the judge client with retry logic, structured output parsing, and fallback behaviour for judge API failures. Validate the judge's reliability by comparing its scores against human labels on a calibration set; target Spearman correlation > 0.75 before using the judge in a gate.
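
The calibration check can be illustrated with a small pure-Python Spearman computation. This is a sketch only; a statistics library would normally be used instead. The function names are hypothetical, and the 0.75 default is the target stated above.

```python
def average_ranks(values: list[float]) -> list[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(judge_scores: list[float], human_labels: list[float]) -> float:
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    if len(judge_scores) != len(human_labels) or len(judge_scores) < 2:
        raise ValueError("need two equally sized samples with at least 2 items")
    rx, ry = average_ranks(judge_scores), average_ranks(human_labels)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one side is constant")
    return cov / (var_x * var_y) ** 0.5


def judge_is_gate_ready(judge_scores, human_labels, minimum: float = 0.75) -> bool:
    """True when the judge agrees with human labels strongly enough to gate on."""
    return spearman(judge_scores, human_labels) > minimum
```

Tie handling matters here because binary human labels produce many tied ranks; averaging the tied ranks is the standard treatment.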

Step 3: Wire Deterministic Checkers and CI/CD Integration

Implement deterministic checks as a test suite that runs against each output in the batch: exact-match assertions for extractive tasks, forbidden-token lists for safety constraints, and citation-grounding checks for RAG outputs (every factual sentence must map to a retrieved chunk with embedding similarity > 0.85). Integrate the full evaluation run as a CI/CD pipeline stage triggered on pull requests that change model configuration, prompt templates, or model version pins. The stage must run the evaluation batch, compute aggregate scores, compare them against the quality policy thresholds, and emit a PASS, FAIL, or BORDERLINE status that gates the merge or deployment.
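
The deterministic checks might be sketched as plain functions like the ones below. The forbidden-token list, the whitespace and case normalisation in `exact_match`, and all function names are assumptions. The grounding check takes precomputed embedding vectors, because the pattern does not name an embedding model.

```python
import math
import re

FORBIDDEN_TOKENS = {"guaranteed returns", "risk-free"}  # hypothetical list


def exact_match(output: str, reference: str) -> bool:
    """Exact match after trimming and lowercasing (a choice made in this sketch)."""
    return output.strip().lower() == reference.strip().lower()


def matches_structure(output: str, pattern: str) -> bool:
    """Regex assertion: the whole output must match the required structure."""
    return re.fullmatch(pattern, output, flags=re.DOTALL) is not None


def has_forbidden_token(output: str, forbidden=FORBIDDEN_TOKENS) -> bool:
    """Case-insensitive substring check against the forbidden-token list."""
    text = output.lower()
    return any(token in text for token in forbidden)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_grounded(sentence_vec, chunk_vecs, threshold: float = 0.85) -> bool:
    """A sentence is grounded if at least one retrieved chunk is similar enough."""
    return any(cosine(sentence_vec, chunk) > threshold for chunk in chunk_vecs)
```

Each function returns a plain boolean, so the checks can be wrapped directly in pytest assertions and aggregated into a per-run pass rate.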

Step 4: Deploy Production Monitoring and Calibration Loop

Run the evaluation pipeline on a rolling 2–5% sample of live production traffic on a daily schedule. Store all evaluation results in the metrics backend with a consistent schema (run_id, model_version, prompt_version, timestamp, per-metric scores, gate_decision). Configure dashboard alerts for metric trend regressions (e.g., rolling 7-day faithfulness score drops > 5% from the 30-day baseline). Route a fraction of production-sampled evaluations to human reviewers to maintain calibration between judge LLM scores and ground truth. Run quarterly judge calibration checks and update the judge prompt if correlation with human labels drifts below 0.70.
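
The example alert rule could be expressed roughly as follows. Two interpretations are assumed here because the text leaves them open: the 5% drop is treated as relative to the baseline, and the 30-day baseline window includes the most recent 7 days. The function and parameter names are hypothetical.

```python
from statistics import mean


def faithfulness_drift_alert(
    daily_scores: list[float],
    recent_days: int = 7,
    baseline_days: int = 30,
    max_drop: float = 0.05,
) -> bool:
    """Return True when the recent mean has dropped too far below the baseline.

    `daily_scores` holds one mean faithfulness score per day, oldest first.
    """
    if len(daily_scores) < baseline_days:
        return False  # not enough history to form a baseline
    baseline = mean(daily_scores[-baseline_days:])
    recent = mean(daily_scores[-recent_days:])
    if baseline <= 0:
        return False  # avoid dividing by zero on a degenerate baseline
    relative_drop = (baseline - recent) / baseline
    return relative_drop > max_drop
```

For example, 23 days at 0.90 followed by 7 days at 0.80 gives a baseline of about 0.877 and a relative drop of about 8.7%, which would fire the alert.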

8. Security Considerations

OWASP LLM Top 10 Mapping

OWASP ID	Threat	Mitigation
LLM01 Prompt Injection	Adversarial inputs in the golden dataset or production sample could manipulate the judge LLM into producing inflated scores	Sanitise inputs before judge submission; run judge in a restricted system prompt context; monitor for anomalous score distributions
LLM02 Insecure Output Handling	Judge LLM output parsed as structured JSON could contain injected code if output parsing is insufficiently strict	Use schema-validated JSON parsing (Pydantic, jsonschema); reject malformed judge responses; never eval() judge output
LLM06 Sensitive Information Disclosure	Golden dataset may contain PII from production examples used to construct evaluation cases	Apply the same PII scrubbing pipeline used in EAAPL-OBS001 to all evaluation data; restrict golden dataset access to ML engineering
LLM09 Overreliance	Automated evaluation scores may create false confidence, causing teams to skip human review of borderline or high-stakes outputs	Enforce mandatory human review for BORDERLINE gate decisions; publish calibration metrics so teams understand judge reliability limits
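
The LLM02 mitigation can be illustrated with a strict parser for judge output. The table recommends Pydantic or jsonschema; this sketch uses only the standard library to stay self-contained, and the field names and the `parse_judge_response` function are assumptions about the rubric's JSON format.

```python
import json

DIMENSIONS = ("faithfulness", "relevance", "coherence")  # assumed rubric fields


def parse_judge_response(raw: str) -> dict[str, float]:
    """Parse judge output strictly; raise ValueError on anything unexpected."""
    try:
        data = json.loads(raw)  # json.loads never executes code, unlike eval()
    except json.JSONDecodeError as exc:
        raise ValueError("judge response is not valid JSON") from exc
    if not isinstance(data, dict):
        raise ValueError("judge response must be a JSON object")
    unexpected = set(data) - set(DIMENSIONS) - {"rationale"}
    if unexpected:
        raise ValueError(f"unexpected fields: {sorted(unexpected)}")
    scores = {}
    for name in DIMENSIONS:
        value = data.get(name)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{name} must be a number")
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be within 0.0 and 1.0")
        scores[name] = float(value)
    return scores
```

Rejecting unknown fields and out-of-range values means a manipulated or malformed judge response surfaces as an error instead of silently entering the aggregate scores.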

9. Governance Artefacts

Quality Policy YAML file (version-controlled in application repository): defines per-metric gate thresholds, evaluation dataset version pin, judge model version pin, and borderline review SLA
Evaluation Run Report (generated per CI/CD run and per scheduled production run): signed artefact containing run metadata, per-metric aggregate scores, per-sample detail, gate decision, and reviewer identity for any human review steps
Judge Calibration Report (quarterly): Spearman correlation of judge scores vs. human labels; drift analysis vs. previous quarter; recommendation to update judge prompt if needed
Golden Dataset Changelog: log of dataset additions, removals, and version bumps with rationale; required to explain evaluation score changes attributable to dataset evolution
Quality Trend Dashboard (live): time-series of per-metric scores across model versions and prompt versions; visible to product, engineering, and compliance stakeholders
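
The pattern calls the Evaluation Run Report a signed artefact but does not specify a signing scheme. As one possible approach, the sketch below signs a canonical JSON encoding of the report with HMAC-SHA256. The scheme, the function names, and the report fields in the usage note are all assumptions, and key management is out of scope.

```python
import hashlib
import hmac
import json


def sign_report(report: dict, key: bytes) -> str:
    """Return a hex HMAC-SHA256 signature over a canonical JSON encoding."""
    # Sorting keys and fixing separators makes the encoding independent of
    # dictionary insertion order, so equal reports get equal signatures.
    canonical = json.dumps(report, sort_keys=True, separators=(",", ":"))
    return hmac.new(key, canonical.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_report(report: dict, signature: str, key: bytes) -> bool:
    """Check a signature using a constant-time comparison."""
    return hmac.compare_digest(sign_report(report, key), signature)
```

A report such as `{"run_id": "r1", "gate_decision": "PASS"}` would be signed when the run completes and verified before promotion. An asymmetric signature would be the more natural choice where auditors must verify reports without holding the signing key.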

10. SLOs

SLO	Target	Measurement
CI/CD evaluation run duration	< 15 minutes for a 200-sample golden dataset batch	Wall-clock time from stage trigger to gate decision
Judge LLM scorer availability	> 99.5% of evaluation runs complete without judge API failure	Failed judge calls / total judge calls per week
Human spot-check review SLA	Borderline gate decisions reviewed within 4 business hours	Time from BORDERLINE flag to human decision recorded
Faithfulness score threshold (RAG systems)	> 0.80 mean score across golden dataset	Per-run aggregated faithfulness score in evaluation report
Hallucination gate pass rate	> 95% of outputs score above hallucination threshold on deterministic checks	Per-run deterministic check pass rate
Evaluation latency (CI gate)	< 90 seconds per 100-sample batch	P99 pipeline duration
Drift alert MTTD (Mean Time to Detect)	< 24 hours	Time from regression onset to alert firing

11. Cost Model

Cost Driver	Estimate	Notes
Judge LLM API calls (CI/CD runs)	$0.10–$2.00 per evaluation run	Depends on golden dataset size (200–1000 samples) and judge model; GPT-4o at $5/1M input tokens, 200 samples at ~500 tokens each = ~$0.50
Judge LLM API calls (production monitoring)	$50–$500/month	Scales with production volume sampled; 2% of 100K daily requests = 2K samples/day at ~$0.003/sample
Human annotation platform	$0–$500/month	Label Studio open-source (free) to Argilla Cloud or Scale AI for enterprise annotation
Evaluation metrics storage	$20–$100/month	Structured records in PostgreSQL/TimescaleDB; small volume relative to telemetry
Compute for evaluation worker	$50–$200/month	Containerised evaluation runner on ECS/Cloud Run; bursty workload suits serverless
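
The judge cost estimates in the table can be reproduced with simple arithmetic. The sketch below uses only the figures quoted in the Notes column and, like that column, counts input tokens only; the function names and the 30-day month are assumptions.

```python
def judge_run_cost(samples: int, tokens_per_sample: int, usd_per_million_tokens: float) -> float:
    """Input-token cost of one evaluation run, in USD."""
    return samples * tokens_per_sample * usd_per_million_tokens / 1_000_000


def monthly_monitoring_cost(
    daily_requests: int, sample_rate: float, usd_per_sample: float, days: int = 30
) -> float:
    """Judge cost of production monitoring over `days` days, in USD."""
    return daily_requests * sample_rate * usd_per_sample * days


# 200 samples x 500 tokens x $5 per 1M tokens = $0.50 per CI/CD run
ci_run_usd = judge_run_cost(200, 500, 5.0)

# 2% of 100K daily requests = 2K samples/day; at $0.003 each that is about
# $6/day, or roughly $180 over a 30-day month
monitoring_usd = monthly_monitoring_cost(100_000, 0.02, 0.003)
```

Both results fall inside the ranges given in the Estimate column. Judge output tokens (scores plus rationale) would add to these figures.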

12. Trade-off Analysis

Dimension	Benefit	Trade-off
Judge LLM scoring	Captures semantic quality dimensions no deterministic check can measure; scales to any output type	Judge LLM itself can hallucinate or apply rubric inconsistently; requires calibration and is not free
Golden dataset gating	Reproducible, versioned, objective gate; enables before/after comparison across versions	Golden dataset quality is a hard ceiling on evaluation quality; stale or unrepresentative datasets produce misleading gates
CI/CD integration	Catches regressions before they reach production; shifts quality left into the development workflow	Adds latency to deployment pipeline; requires engineering discipline to not skip the gate under schedule pressure
Human spot-check calibration	Provides ground-truth anchor for judge reliability; surfaces rubric drift	Slow, expensive, and creates reviewer bottleneck; cannot scale to full production volume
Automated PASS/FAIL gate	Removes human bottleneck from routine model upgrades	Binary gate can block beneficial changes for borderline-but-acceptable score dips; threshold calibration requires ongoing attention

13. Failure Modes

Failure	Trigger	Recovery
Judge LLM API unavailable during CI/CD gate	Provider outage or rate limit at evaluation time	Implement retry with exponential backoff; fall back to deterministic-only gate with a human review flag; do not fail open and skip the gate
Golden dataset staleness causes misleading gate pass	Production distribution shifts away from golden dataset examples; gate passes on out-of-distribution regressions	Quarterly dataset refresh process; production monitoring catches what CI/CD gate misses; add production-sampled hard examples to dataset after every production quality incident
Judge prompt drift changes score distribution	Judge model provider silently updates model; rubric interpretation shifts; historical scores become non-comparable	Pin judge model version explicitly; run calibration check on each judge model update; maintain calibration corpus for trend adjustment
Gate threshold too strict blocks all deployments	Threshold set at launch but model quality at P80 never stably clears it	Threshold calibration review process; thresholds must be set based on achieved baseline scores, not aspirational targets
Human review bottleneck under BORDERLINE volume	Many borderline decisions queue simultaneously; 4-hour SLA breached	Pre-allocate reviewer capacity for deployment windows; escalation path to senior AI engineer for time-sensitive deployments
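
The recovery for a judge outage (retry with exponential backoff, then fall back to a deterministic-only gate with a human review flag) might be sketched as follows. The `JudgeUnavailable` exception, the function names, and the choice to report the fallback as BORDERLINE are assumptions; the table only requires that the gate never fails open.

```python
import time


class JudgeUnavailable(Exception):
    """Raised by the (hypothetical) judge client on outage or rate limiting."""


def call_with_backoff(call, max_attempts: int = 4, base_delay: float = 1.0, sleep=time.sleep):
    """Retry `call` with exponential backoff; re-raise after the last attempt."""
    for attempt in range(max_attempts):
        try:
            return call()
        except JudgeUnavailable:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2**attempt)  # 1s, 2s, 4s, ... between attempts


def run_gate(judge_call, deterministic_passed: bool, sleep=time.sleep) -> dict:
    """Never fail open: a judge outage downgrades to a flagged deterministic-only gate."""
    try:
        judge_passed = call_with_backoff(judge_call, sleep=sleep)
    except JudgeUnavailable:
        decision = "BORDERLINE" if deterministic_passed else "FAIL"
        return {"decision": decision, "judge_scored": False, "needs_human_review": True}
    passed = bool(judge_passed) and deterministic_passed
    return {
        "decision": "PASS" if passed else "FAIL",
        "judge_scored": True,
        "needs_human_review": False,
    }
```

The `sleep` parameter exists so the backoff can be exercised in tests without real delays.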

14. Regulatory Mapping

Regulation	Requirement	How Pattern Addresses It
EU AI Act Article 9	High-risk AI systems must implement a risk management system including continuous performance monitoring	Evaluation pipeline provides systematic, documented quality measurement at every deployment; evaluation reports are auditable evidence
EU AI Act Article 12	High-risk AI systems must maintain logs enabling post-market monitoring and investigation	Evaluation run reports and per-sample score records provide the evidence layer required for Article 12 compliance
APRA CPS 230	Material models used in financial services require model risk management including validation and ongoing monitoring	Quality gate thresholds and trend dashboards satisfy model monitoring requirements; evaluation reports satisfy periodic validation documentation
APRA CPS 230 §21	AI systems classified as critical operations require monitoring that demonstrates the system is operating within defined performance parameters	The evaluation pipeline produces the evidence artefact (evaluation scorecard with rolling baseline) that satisfies the 'regular testing of operational resilience' requirement
APRA CPS 234 §36	Material changes to AI system behaviour may constitute a 'material information security incident' or 'material service provider change' requiring APRA notification within 72 hours	Detection capability provided by this pattern is the prerequisite for meeting that 72-hour notification timeline; evaluation results surface material behavioural changes (prompt drift, model version change, significant accuracy regression) as soon as they occur
ISO/IEC 42001 Clause 9.1	AI management system must define methods for monitoring, measurement, analysis, and evaluation of AI performance	This pattern directly implements the measurement and analysis obligation; evaluation reports are the required evidence artefacts
NIST AI RMF MEASURE 2.5	AI systems must be tested against appropriate metrics to evaluate trustworthiness	Multi-dimensional scoring (faithfulness, relevance, toxicity) supplies documented test metrics of the kind MEASURE 2.5 calls for

15. Reference Implementations

AWS

Evaluation Runner: AWS Lambda or ECS Task triggered from CodePipeline quality gate stage
Judge LLM: Amazon Bedrock (Anthropic Claude via Bedrock API) with Guardrails for judge prompt protection
Deterministic Checks: pytest-based test suite in CodeBuild step
Human Annotation: Amazon Augmented AI (A2I) for human spot-check workflow
Metrics Storage: Amazon RDS PostgreSQL + Amazon Managed Grafana for dashboards
Golden Dataset: S3 + DVC (Data Version Control, an open-source tool, using S3 as the remote)

Azure

Evaluation Runner: Azure Container Apps Job triggered from Azure DevOps pipeline
Judge LLM: Azure OpenAI GPT-4o with Prompt Shields enabled on judge prompts
Deterministic Checks: pytest in Azure DevOps task
Human Annotation: Azure Machine Learning Data Labelling
Metrics Storage: Azure Database for PostgreSQL + Azure Managed Grafana
Golden Dataset: Azure Blob Storage + MLflow dataset tracking

On-Premises

Evaluation Runner: Containerised Python worker on Kubernetes CronJob and CI pipeline hook
Judge LLM: Self-hosted Llama-3-70B or Mistral-Large via vLLM; reduces cost for high-volume evaluation
Deterministic Checks: DeepEval framework or custom pytest harness
Human Annotation: Label Studio (open-source, self-hosted)
Metrics Storage: PostgreSQL + TimescaleDB + Grafana
Golden Dataset: DVC + MinIO object storage

16. Related Patterns

EAAPL-OBS001 AI Telemetry Architecture: provides the metrics backend and structured log schema in which evaluation results are stored
EAAPL-OBS003 Hallucination Detection: real-time hallucination detection at inference time; this pattern provides the batch evaluation gate that validates hallucination rate before deployment
EAAPL-OBS005 Model Drift Detection: population stability index drift on input/output distributions; evaluation pipeline scores are the quality signal that drift monitoring observes
EAAPL-OBS007 Prompt Drift Detection: drift monitoring for production prompt performance; this pattern provides the CI/CD gate, and OBS007 provides the production monitoring layer
EAAPL-OBS008 A/B Model Evaluation: canary deployment pattern that complements this gate-based pattern; use OBS006 to qualify a challenger model before OBS008 routes live traffic to it

17. Maturity Assessment

Dimension	Level	Notes
Adoption Breadth	3 — Emerging-Proven	Systematic LLM evaluation pipelines are established at AI-native companies; slower adoption in traditional enterprises where model risk management processes predate LLM-specific tooling
Tooling Ecosystem	4 — Proven	Ragas, DeepEval, MLflow LLM evaluation, and ROUGE/BLEU libraries are mature; LLM-as-judge pattern is well-documented and widely validated
Regulatory Evidence	4 — Proven	EU AI Act conformance assessments now routinely cite systematic evaluation as a required control; pattern aligns with published AI Act technical documentation
Cost Predictability	3 — Moderate	Judge LLM costs scale predictably with dataset size; production monitoring costs depend on traffic volume and sampling rate, which are controllable

18. Revision History

Version	Date	Change
1.0	2026-06-14	Initial release
