EAAPL: Enterprise AI Architecture Pattern Library

[EAAPL-RAG001] Enterprise Retrieval-Augmented Generation

Category: Artificial Intelligence / Retrieval-Augmented Generation
Sub-category: Foundational RAG Architecture
Version: 2.1
Maturity: Mature
Tags: rag, retrieval, embeddings, vector-search, llm, grounding, citation, enterprise
Regulatory Relevance: APRA CPS 234, EU AI Act Article 13 (Transparency), ISO/IEC 42001, NIST AI RMF (Govern 1.1, Map 1.1)


1. Executive Summary

Retrieval-Augmented Generation (RAG) is the foundational architecture pattern that grounds Large Language Model (LLM) responses in verifiable enterprise knowledge. Rather than relying solely on parametric knowledge baked into model weights, RAG dynamically retrieves relevant documents at inference time, assembles them into a context window, and instructs the LLM to generate answers grounded exclusively in retrieved evidence.

For enterprise CIOs and CTOs, RAG directly addresses three business-critical concerns: accuracy (answers are anchored to current, authoritative sources rather than potentially stale training data), auditability (every claim can be traced to a source document for compliance and regulatory purposes), and control (the knowledge base is governed by the enterprise, not by a third-party model provider). RAG enables AI-powered enterprise search, internal knowledge assistants, customer service automation, and regulatory document Q&A without the cost and risk of fine-tuning proprietary models. When implemented correctly, RAG can reduce hallucination rates by 60–80% compared to prompt-only LLM usage, and it provides the citation infrastructure required to satisfy model explainability mandates in the APRA CPS 234, EU AI Act, and ISO/IEC 42001 frameworks.


2. Problem Statement

Business Problem

Enterprise knowledge is locked in unstructured repositories: SharePoint libraries, Confluence wikis, PDF policy archives, email threads, and ERP exports. Employees spend an average of 20% of their working week searching for information (McKinsey Global Institute). LLMs offer natural-language access to this knowledge, but they generate plausible-sounding yet factually incorrect answers (hallucinations) at a rate that is unacceptable for regulated industries.

Technical Problem

Standard LLM prompting cannot access documents outside the model's training window, cannot cite sources for claims, cannot reflect updates made after the model's training cutoff, and cannot respect per-user access controls on confidential documents. Context windows are finite; naively injecting entire document corpora is computationally prohibitive and degrades generation quality.

Symptoms of the Absence of this Pattern

  • Help-desk chatbots that confidently cite policy sections that do not exist or have been superseded
  • Internal search returning keyword-matched results with no synthesis or relevance ranking
  • Compliance teams unable to audit the provenance of AI-generated regulatory summaries
  • Knowledge workers spending >30 minutes constructing answers from multiple source documents
  • Model answers that vary unpredictably across repeated identical queries

Cost of Inaction

  • Regulatory exposure: ungrounded AI outputs used in decision-making violate EU AI Act Article 13 and APRA CPS 234 requirements for explainability
  • Operational cost: manual document synthesis at scale costs $150–$400 per knowledge-worker hour
  • Risk of reputational damage from hallucinated answers in customer-facing applications
  • Inability to retire legacy knowledge portal investments without a viable AI-powered replacement

3. Context

When to Apply

  • Enterprise Q&A systems over internal policy, procedure, or product documentation
  • Customer service automation requiring grounded, citable answers
  • Regulatory and compliance document interrogation
  • Code generation assistants that reference internal SDK and API documentation
  • Research synthesis across large document corpora (legal discovery, clinical guidelines, engineering standards)
  • Any LLM use case where answer provenance and auditability are required

When NOT to Apply

  • Tasks requiring real-time external data not yet ingested (use Streaming RAG, EAAPL-RAG006)
  • Multi-hop reasoning across structured relational data (use Graph RAG, EAAPL-RAG009, or SQL-generation patterns)
  • Use cases where the knowledge corpus is smaller than the context window (direct context injection is simpler)
  • Creative generation tasks where factual grounding is not required
  • Highly latency-sensitive applications (<100ms P99) where vector search overhead is unacceptable

Prerequisites

  • A defined and governed knowledge corpus (documents, wikis, structured exports)
  • An embedding model appropriate to the corpus language and domain
  • A vector database provisioned and accessible from the inference runtime
  • An LLM with sufficient context window to accommodate retrieved passages plus the user query
  • A document ingestion pipeline with scheduling and delta-update capability
  • Logging infrastructure capable of recording retrieval decisions and LLM inputs/outputs

Industry Applicability

| Industry | Primary Use Case | Criticality | Regulatory Consideration |
|---|---|---|---|
| Financial Services | Policy Q&A, compliance manuals, product disclosure | Mission-critical | APRA CPS234, MiFID II, Basel III documentation |
| Healthcare | Clinical guideline retrieval, formulary assistance | Mission-critical | TGA, AHPRA, HIPAA, clinical liability |
| Government | Legislation interpretation, service eligibility | High | FOI, Privacy Act 1988, APS values |
| Legal | Case law research, contract clause retrieval | High | Legal professional privilege, confidentiality |
| Retail/FMCG | Product knowledge bases, supplier documentation | Medium | ACCC consumer guarantees, product liability |
| Technology | Internal developer documentation, runbook Q&A | Medium | SOC2, ISO 27001 |
| Higher Education | Academic policy, research corpus search | Medium | Copyright Act, FERPA equivalents |

4. Architecture Overview

Enterprise RAG decomposes into two distinct temporal phases: an offline ingestion pipeline and an online retrieval-generation pipeline. Understanding the separation of these phases is critical to operating the system correctly at enterprise scale.

Offline Ingestion Pipeline

The ingestion pipeline transforms raw enterprise documents into a searchable vector index. This phase runs continuously or on a schedule and must be treated as a production data pipeline with monitoring, alerting, and schema versioning.

Document acquisition draws from multiple source connectors (SharePoint, Confluence, S3, SFTP, database exports). Each connector must capture not only document content but also metadata: document ID, version, owner, classification level, effective date, and expiry date. Metadata is as important as content for enterprise use cases — it drives filtering, citation generation, and access control enforcement.

Chunking is among the most consequential architectural decisions in any RAG system. The goal is to produce semantically coherent text units that are large enough to contain useful context but small enough to remain topically focused. Three strategies apply at enterprise scale: fixed-size chunking (split by token count, typically 256–512 tokens, with 10–20% overlap) is operationally simple and predictable; semantic chunking (split at natural paragraph or section boundaries) preserves document structure and is preferred for narrative documents such as policy manuals; hierarchical chunking (maintain parent-child relationships between summary chunks and detail chunks) enables retrieval at multiple granularities and is optimal for long technical documents. For regulated environments, hierarchical chunking with section-level metadata (clause number, effective date) is recommended because it enables citation at the regulatory clause level.
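The fixed-size strategy described above can be sketched in a few lines. This is a minimal illustration, assuming the document is already tokenised; the whitespace-style token list and the function name `chunk_fixed_size` are assumptions for the example, and a production pipeline would use the embedding model's own tokenizer.

```python
from typing import List


def chunk_fixed_size(
    tokens: List[str], chunk_size: int = 512, overlap_ratio: float = 0.1
) -> List[List[str]]:
    """Split a token list into fixed-size windows that overlap by overlap_ratio."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Synthetic tokens stand in for a real tokenizer's output (assumption).
tokens = [f"w{i}" for i in range(1000)]
chunks = chunk_fixed_size(tokens, chunk_size=512, overlap_ratio=0.1)
```

With 512-token windows and 10% overlap, each window shares its last 51 tokens with the start of the next one, so a sentence that straddles a boundary still appears whole in at least one chunk.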

Embedding converts each chunk into a dense vector representation using an embedding model. Model selection has long-term consequences: changing the embedding model requires re-embedding the entire corpus. For English-language enterprise corpora, text-embedding-3-large (OpenAI), textembedding-gecko (Google), or bge-large-en-v1.5 (BAAI, self-hostable) are strong choices. For multilingual corpora, multilingual-e5-large or bge-m3 are preferred. The embedding model must be evaluated on a domain-representative benchmark before production selection.

Vector storage persists embeddings alongside the full chunk text and metadata in a vector database. The vector index (typically HNSW — Hierarchical Navigable Small World) enables approximate nearest-neighbour search in milliseconds across tens of millions of vectors. Index construction parameters (ef_construction, M) directly affect recall/latency trade-offs and must be tuned per corpus.

Online Retrieval-Generation Pipeline

At inference time, the user query traverses a multi-stage pipeline before the LLM generates a response.

Query processing applies transformations that materially improve retrieval quality: query expansion (generating alternative phrasings of the question), HyDE (Hypothetical Document Embedding — generating a hypothetical answer and embedding it to find similar real documents), and query decomposition (splitting compound questions into atomic sub-queries). These transformations add 50–150ms latency but improve top-5 recall by 15–30% in empirical benchmarks.

Retrieval executes the vector similarity search against the index, returning the top-K chunks (K typically 5–20) ranked by cosine similarity. Pre-retrieval metadata filtering (by document class, department, effective date) reduces the search space and enforces access control at the vector layer.
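The retrieval step can be illustrated with a brute-force version: filter on metadata first, then rank the survivors by cosine similarity. This is a sketch only. Exact search stands in for the HNSW approximate search a vector database would run, and the record fields (`chunk_id`, `vector`, `doc_class`) are hypothetical names chosen for the example.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_top_k(
    query_vec: Sequence[float],
    index: List[Dict],
    allowed_classes: Set[str],
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Apply the metadata filter BEFORE scoring, then return the top-k chunks."""
    candidates = [row for row in index if row["doc_class"] in allowed_classes]
    scored = [(row["chunk_id"], cosine(query_vec, row["vector"])) for row in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy two-dimensional index (real embeddings have hundreds of dimensions).
index = [
    {"chunk_id": "hr-001", "vector": [1.0, 0.0], "doc_class": "hr"},
    {"chunk_id": "fin-001", "vector": [0.9, 0.1], "doc_class": "finance"},
    {"chunk_id": "hr-002", "vector": [0.0, 1.0], "doc_class": "hr"},
]
results = filtered_top_k([1.0, 0.0], index, allowed_classes={"hr"}, k=2)
```

Because the filter runs before scoring, the finance chunk never enters the candidate set even though it is the second-closest vector to the query.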

Re-ranking applies a cross-encoder model to re-score the top-K retrieved chunks against the original query with higher precision than the bi-encoder embedding model. Cross-encoders (e.g., cross-encoder/ms-marco-MiniLM-L-12-v2) do not scale to full-index search but are highly effective on the top-K set.
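The two-stage shape (wide retrieval, then precise re-scoring of a small set) can be sketched with a pluggable pairwise scorer. The `overlap_score` function below is a toy stand-in, not a cross-encoder: in practice `score_fn` would wrap a cross-encoder model call. All function and field names are assumptions for the example.

```python
from typing import Callable, Dict, List


def overlap_score(query: str, passage: str) -> float:
    """Toy relevance score: fraction of query words present in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    return len(query_words & passage_words) / len(query_words) if query_words else 0.0


def rerank(
    query: str,
    candidates: List[Dict[str, str]],
    score_fn: Callable[[str, str], float],
    top_n: int = 5,
) -> List[Dict[str, str]]:
    """Score every (query, passage) pair and keep the top_n highest-scoring chunks."""
    scored = [(score_fn(query, chunk["text"]), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable for tied scores
    return [chunk for _, chunk in scored[:top_n]]


candidates = [
    {"chunk_id": "a", "text": "office opening hours and parking"},
    {"chunk_id": "b", "text": "annual leave entitlement is twenty days"},
    {"chunk_id": "c", "text": "leave requests need manager approval"},
]
top = rerank("annual leave entitlement", candidates, overlap_score, top_n=2)
```

The scorer sees the query and the passage together, which is what makes this stage more precise, and also why it is only affordable on the top-K set rather than the full index.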

Context assembly constructs the final prompt by ordering retrieved chunks (relevance-first or document-structure-first depending on the task), injecting system instructions, and appending the user query. The context budget (the maximum number of prompt tokens that fit in the LLM's context window after reserving room for the response) must be monitored and enforced.
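A minimal sketch of budget-aware context assembly follows. It assumes a crude one-token-per-word estimate (a real system would count tokens with the LLM's tokenizer), and the `[chunk:ID]` marker format and the function names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def count_words(text: str) -> int:
    """Crude token estimate (assumption): one token per whitespace-separated word."""
    return len(text.split())


def assemble_prompt(
    system_prompt: str,
    query: str,
    ranked_chunks: List[Dict[str, str]],
    context_window: int,
    reserve: int,
    count_tokens: Callable[[str], int] = count_words,
) -> Tuple[str, List[str]]:
    """Pack chunks in rank order until the budget (context_window - reserve) is reached."""
    budget = context_window - reserve
    question = f"Question: {query}"
    used = count_tokens(system_prompt) + count_tokens(question)
    if used > budget:
        raise ValueError("system prompt and query alone exceed the context budget")
    parts: List[str] = []
    included: List[str] = []
    for chunk in ranked_chunks:
        block = f"[chunk:{chunk['chunk_id']}] {chunk['text']}"
        cost = count_tokens(block)
        if used + cost > budget:
            break  # stop at the first chunk that does not fit, preserving rank order
        parts.append(block)
        included.append(chunk["chunk_id"])
        used += cost
    return "\n\n".join([system_prompt, *parts, question]), included


prompt, included = assemble_prompt(
    system_prompt="Answer only from the context and cite chunk IDs.",
    query="How many leave days?",
    ranked_chunks=[
        {"chunk_id": "a", "text": "Annual leave is twenty days per year."},
        {"chunk_id": "b", "text": "Unused leave carries over for one year."},
        {"chunk_id": "c", "text": "Parental leave is covered in policy HR-12."},
    ],
    context_window=40,
    reserve=10,
)
```

Returning the list of included chunk IDs alongside the prompt lets the caller log exactly which evidence reached the model, which the citation validation step depends on.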

Generation invokes the LLM with the assembled context. The system prompt must explicitly instruct the model to answer only from the provided context and to include citations. Post-generation, citations are extracted and validated against the retrieved chunk set to detect hallucinated references.
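Post-generation citation validation can be sketched as a set-membership check: any cited ID that was not in the retrieved chunk set is treated as hallucinated and stripped. The `[chunk:ID]` marker syntax is an assumption for the example; the actual format depends on what the system prompt asks the model to emit.

```python
import re
from typing import List, Set, Tuple

# Assumed marker format: [chunk:<id>], optionally preceded by whitespace.
CITATION = re.compile(r"\s*\[chunk:([A-Za-z0-9_-]+)\]")


def validate_citations(answer: str, retrieved_ids: Set[str]) -> Tuple[str, List[str], List[str]]:
    """Return (cleaned_answer, valid_ids, hallucinated_ids)."""
    valid: List[str] = []
    hallucinated: List[str] = []

    def _check(match: "re.Match[str]") -> str:
        chunk_id = match.group(1)
        if chunk_id in retrieved_ids:
            if chunk_id not in valid:
                valid.append(chunk_id)
            return match.group(0)  # keep verified citations untouched
        hallucinated.append(chunk_id)
        return ""  # strip citations that point outside the retrieved set

    return CITATION.sub(_check, answer), valid, hallucinated


answer = "Annual leave is 20 days [chunk:hr-001]. Carry-over is allowed [chunk:hr-999]."
cleaned, valid, hallucinated = validate_citations(answer, {"hr-001", "hr-002"})
```

The hallucinated list is what would feed a hallucination counter and review queue. Note that this check only confirms the cited chunk was retrieved, not that the chunk actually supports the claim; that requires a separate faithfulness evaluation.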

The full pipeline must be instrumented end-to-end. Every query, the retrieved chunk IDs, the assembled context hash, the LLM response, and latency at each stage must be logged to enable quality monitoring, debugging, and audit trail maintenance.
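One possible shape for the per-query log record is sketched below, using only the standard library. The field names are illustrative assumptions rather than a prescribed schema. The user ID is hashed here on the assumption that identifiers should be masked in logs.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Dict, List


def sha256_hex(text: str) -> str:
    """Hex-encoded SHA-256 digest of a UTF-8 string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_audit_record(
    correlation_id: str,
    user_id: str,
    query: str,
    chunk_ids: List[str],
    context: str,
    response: str,
    model_version: str,
    latency_ms: Dict[str, float],
) -> Dict:
    """Assemble one structured log record for a single query."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "correlation_id": correlation_id,
        "user_id_hash": sha256_hex(user_id),
        "query": query,
        "retrieved_chunk_ids": list(chunk_ids),
        "context_sha256": sha256_hex(context),
        "response_sha256": sha256_hex(response),
        "llm_model_version": model_version,
        "latency_ms": dict(latency_ms),
    }


record = build_audit_record(
    correlation_id="req-0001",
    user_id="alice@example.com",
    query="How many leave days?",
    chunk_ids=["hr-001", "hr-002"],
    context="assembled prompt text",
    response="Annual leave is 20 days [chunk:hr-001].",
    model_version="example-model-v1",  # placeholder value
    latency_ms={"retrieval": 42.0, "rerank": 18.5, "generation": 950.0},
)
log_line = json.dumps(record, sort_keys=True)  # one JSON object per line
```

Hashing the assembled context instead of storing it keeps the log compact while still allowing an auditor to confirm that a stored prompt matches what was sent.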


5. Architecture Diagram

flowchart TD
    subgraph Ingestion["Offline Ingestion"]
        A[Source Connectors]
        B[Chunk + Embed]
        C[Vector Store]
    end
    subgraph Retrieval["Online Retrieval"]
        D[User Query]
        E[Query Processor]
        F[Vector Search + Rerank]
    end
    subgraph Generation["Generation + Observability"]
        G[LLM + Context]
        H[Citation Validator]
        I[Quality Monitor]
    end
    A --> B --> C
    D --> E -->|filtered ANN search| C
    C --> F --> G --> H --> D
    G --> I
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#f0fdf4,stroke:#22c55e
    style C fill:#fef9c3,stroke:#eab308
    style D fill:#dbeafe,stroke:#3b82f6
    style E fill:#f0fdf4,stroke:#22c55e
    style F fill:#f0fdf4,stroke:#22c55e
    style G fill:#fef9c3,stroke:#eab308
    style H fill:#d1fae5,stroke:#10b981
    style I fill:#fef9c3,stroke:#eab308

6. Components

| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| Source Connectors | Integration | Pull documents from enterprise repositories on schedule or event trigger | Microsoft Graph API, Confluence REST API, S3 Event Notifications, custom JDBC connectors | High |
| Metadata Extractor | Data Processing | Parse and normalise document metadata; assign classification labels | Apache Tika, AWS Textract, Azure Document Intelligence, custom NLP pipeline | High |
| Chunking Engine | Data Processing | Segment documents into semantically coherent, appropriately-sized chunks | LangChain text splitters, LlamaIndex node parsers, custom Python chunkers | High |
| Embedding Model | ML Inference | Convert text chunks to dense vector representations | OpenAI text-embedding-3-large, Google textembedding-gecko, BAAI bge-large-en-v1.5, Cohere embed-v3 | Critical |
| Vector Database | Storage | Store and index embedding vectors; serve ANN queries | Pinecone, Weaviate, Qdrant, pgvector, OpenSearch k-NN, Chroma | Critical |
| Document Store | Storage | Persist full chunk text and metadata for context assembly | Amazon S3, Azure Blob Storage, Google Cloud Storage, PostgreSQL | High |
| Query Processor | Inference | Enrich and expand user queries before retrieval | LangChain, LlamaIndex, custom Python with LLM call | Medium |
| ACL Filter | Security | Enforce document-level access control before vector search | Custom middleware using identity provider claims; RBAC policy engine | Critical |
| Cross-Encoder Re-ranker | ML Inference | Re-rank top-K retrieved chunks with higher precision | Cohere Rerank, ms-marco cross-encoders, Voyage AI rerank | High |
| Context Assembler | Orchestration | Order chunks, enforce token budget, construct final prompt | LangChain, LlamaIndex, custom orchestration | High |
| LLM | ML Inference | Generate grounded natural-language response from context | OpenAI GPT-4o, Anthropic Claude 3.5, Google Gemini 1.5, Azure OpenAI, self-hosted Llama 3 | Critical |
| Citation Extractor | Post-processing | Extract and validate source references in generated output | Regex + structured output parsing, LLM-based extraction | High |
| Observability Layer | Operations | Log all pipeline stages; monitor quality and latency metrics | Datadog, Grafana + Prometheus, Langfuse, Arize AI | High |

7. Data Flow

Primary Flow

| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | Source Connector | Poll or receive webhook from document repository; fetch new/modified documents | Raw document bytes + source metadata |
| 2 | Metadata Extractor | Parse document format; extract title, author, classification, dates, section structure | Structured metadata record per document |
| 3 | Chunking Engine | Apply chunking strategy; assign chunk ID, parent document ID, position index | Ordered list of text chunks with metadata |
| 4 | Embedding Model | Generate dense vector for each chunk | (chunk_id, vector[1536], chunk_text, metadata) tuple |
| 5 | Vector Database | Upsert vector with metadata payload; rebuild/update HNSW index | Persisted vector index entry; confirmation receipt |
| 6 | Document Store | Persist full chunk text and metadata | Durable record accessible by chunk_id |
| 7 | User / Application | Submit natural-language query via API | Query string + user identity context |
| 8 | Query Processor | Expand query; optionally generate HyDE document | Enhanced query representation(s) |
| 9 | ACL Filter | Resolve user's permitted document classes from identity provider | Allowlist of namespace/metadata filters for vector search |
| 10 | Vector Database | Execute ANN search with metadata filters; return top-K (k=20) chunks | Ranked list of (chunk_id, score, metadata) |
| 11 | Document Store | Fetch full chunk text for top-K chunk IDs | Chunk text + source metadata for each candidate |
| 12 | Cross-Encoder Re-ranker | Score each chunk against original query; re-order by cross-encoder score | Re-ranked top-N (N=5) chunks |
| 13 | Context Assembler | Order chunks; prepend system prompt; enforce token budget (≤context_window − reserve) | Assembled prompt string |
| 14 | LLM | Generate response conditioned on assembled context | Raw response text with in-line citation markers |
| 15 | Citation Extractor | Parse citation markers; validate each against retrieved chunk IDs | Structured response: answer + verified citations |
| 16 | Observability Layer | Log query, chunk IDs, context hash, response, latency per stage | Audit log record; metrics increment |
| 17 | User / Application | Receive grounded answer with clickable source citations | End-user response |

Error Flow

| Error Condition | Detection Point | Recovery Action |
|---|---|---|
| Embedding model unavailable | Step 4 (ingestion) or Step 8 (query) | Retry with exponential backoff; fall back to cached embeddings for known queries |
| Vector database query timeout | Step 10 | Retry up to 3 times; degrade to keyword search fallback; surface "reduced quality" indicator to user |
| Zero results returned after ACL filter | Step 10 | Return "No accessible documents found"; do NOT fall through to unfiltered search |
| LLM rate limit or timeout | Step 14 | Queue retry with jitter; return partial response with "generation pending" status |
| Citation validation failure (hallucinated source) | Step 15 | Strip hallucinated citation from response; increment hallucination counter; flag for review |
| Document ingestion failure | Step 2 | Dead-letter queue; alert pipeline operator; document remains on previous version in index |
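The retry and degrade behaviour for a vector database timeout can be sketched as follows. This is an illustration under stated assumptions: timeouts surface as Python's built-in `TimeoutError`, the two search callables are supplied by the caller, and the function names are hypothetical.

```python
import random
import time
from typing import Callable, List, Tuple


def retry_with_backoff(
    operation: Callable[[], List[str]],
    max_attempts: int = 3,
    base_delay: float = 0.2,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Call operation, retrying on TimeoutError with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted; let the caller decide how to degrade
            sleep(random.uniform(0, base_delay * 2 ** attempt))
    raise RuntimeError("max_attempts must be at least 1")


def search_with_fallback(
    vector_search: Callable[[], List[str]],
    keyword_search: Callable[[], List[str]],
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[List[str], bool]:
    """Return (results, degraded); degraded=True means the keyword fallback was used."""
    try:
        return retry_with_backoff(vector_search, sleep=sleep), False
    except TimeoutError:
        return keyword_search(), True


def always_times_out() -> List[str]:
    raise TimeoutError("vector database did not respond")


# A no-op sleep keeps the example instantaneous.
results, degraded = search_with_fallback(
    always_times_out, lambda: ["kw-hit-1"], sleep=lambda seconds: None
)
```

The `degraded` flag is what would drive the "reduced quality" indicator shown to the user. The keyword fallback must apply the same ACL filter as the vector search; it must never become a route to unfiltered results.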

8. Security Considerations

Authentication and Authorisation

  • All API endpoints require OAuth 2.0 / OIDC tokens from the enterprise identity provider (Entra ID, Okta, Ping)
  • User identity claims are forwarded through the entire pipeline and recorded in audit logs
  • Vector search is scoped by user-identity-derived metadata filters before execution — retrieval never returns documents the user cannot access
  • Service-to-service calls between pipeline components use mTLS with short-lived certificates

Secrets Management

  • Embedding model API keys stored in HashiCorp Vault or cloud-native secrets manager (AWS Secrets Manager, Azure Key Vault)
  • LLM API keys rotated on a 90-day schedule; rotation must not require pipeline restart
  • Database credentials never hardcoded; injected at runtime via environment variable from secrets manager

Data Classification

  • Source document classification labels (OFFICIAL, SENSITIVE, PROTECTED, etc.) are preserved as metadata through the chunking and embedding pipeline
  • Retrieved chunks inherit the classification of their parent document
  • The assembled context window classification is the maximum of all included chunks
  • LLM response is tagged with the classification of the highest-classified source included in context
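The "maximum of all included chunks" rule reduces to a lookup in an ordered list. The ordering below (OFFICIAL < SENSITIVE < PROTECTED) is an assumption for illustration; the enterprise's own classification scheme should be substituted.

```python
from typing import Iterable, List

# Assumed ordering from least to most restrictive.
CLASSIFICATION_ORDER: List[str] = ["OFFICIAL", "SENSITIVE", "PROTECTED"]


def highest_classification(labels: Iterable[str]) -> str:
    """Return the most restrictive label; unknown labels raise ValueError (fail closed)."""
    labels = list(labels)
    if not labels:
        raise ValueError("no classification labels supplied")
    return max(labels, key=CLASSIFICATION_ORDER.index)


# Classification of an assembled context built from three chunks.
context_label = highest_classification(["OFFICIAL", "PROTECTED", "SENSITIVE"])
```

Raising on an unrecognised label, rather than defaulting to the lowest level, keeps an unlabelled or mislabelled chunk from silently downgrading the response classification.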

Encryption

  • Vectors and chunk text at rest: AES-256 encryption in vector database and document store
  • Data in transit: TLS 1.3 minimum between all components
  • Highly sensitive corpora: consider field-level encryption of metadata; evaluate format-preserving encryption for PII fields

Auditability

  • Immutable audit log: every query, user ID, retrieved chunk IDs, context hash (SHA-256), LLM model version, and response hash
  • Audit logs shipped to tamper-evident log store (WORM S3 bucket, Splunk, Azure Sentinel)
  • Audit log retention: minimum 7 years for regulated industries

OWASP LLM Top 10 Mitigations

| OWASP LLM Risk | Applicability | Mitigation in this Pattern |
|---|---|---|
| LLM01: Prompt Injection | High | System prompt hardened; retrieved content treated as data, not instructions; input sanitisation before embedding |
| LLM02: Insecure Output Handling | High | Structured output parsing; citation validation; no execution of LLM-generated code in this pattern |
| LLM03: Training Data Poisoning | Medium | Not directly applicable post-training; mitigated by corpus quality gates (EAAPL-KNW006) |
| LLM04: Model Denial of Service | High | Rate limiting per user/tenant; query complexity limits; context window budget enforcement |
| LLM05: Supply Chain Vulnerabilities | Medium | Embedding and LLM model versions pinned; SBOM maintained; provider SLA reviewed |
| LLM06: Sensitive Information Disclosure | Critical | ACL pre-filter prevents retrieval; PII redaction post-retrieval; output scanning for classification leakage |
| LLM07: Insecure Plugin Design | Low | No plugin execution in foundational RAG; applicable in Agentic RAG (EAAPL-RAG007) |
| LLM08: Excessive Agency | Low | RAG is read-only; no write actions available to LLM in this pattern |
| LLM09: Overreliance | High | Confidence scores surfaced to users; citations presented for independent verification |
| LLM10: Model Theft | Medium | LLM accessed via API only; model weights not exposed; fine-tuned models stored in private registries |

9. Governance Considerations

Responsible AI

  • RAG answers must always present source citations to enable human verification
  • Confidence scoring should be implemented; low-confidence answers must be flagged
  • Sensitive-topic classifiers (medical advice, legal advice, financial advice) should trigger "consult a professional" disclaimers
  • Demographic bias monitoring on retrieval: ensure corpus is not systematically missing content relevant to specific user groups

Model Risk Management

  • Embedding model versioning: the corpus must be re-embedded when the embedding model is upgraded; mixing embeddings from different models across document batches degrades retrieval quality, because vectors produced by different models are not comparable within one index
  • LLM model versioning: changes to the generation model require regression testing against a held-out QA benchmark
  • Hallucination rate KPI tracked as a model risk indicator; threshold breach triggers review gate

Human Approval Gates

  • Corpus ingestion of Tier 1 (Critical) documents requires human review approval before the document is made retrievable
  • Significant changes to system prompt (which governs LLM behaviour) require a change approval process
  • Quarterly human review of a random sample (n ≥ 100) of query/response pairs for quality and safety

Governance Artefacts

| Artefact | Owner | Frequency | Purpose |
|---|---|---|---|
| Corpus Inventory | Knowledge Manager | Continuous (automated) | Track which documents are in the index, their versions, and owners |
| Embedding Model Card | ML Engineer | Per model version | Document model capabilities, limitations, evaluation benchmarks |
| RAG Quality Scorecard | AI Operations | Weekly | Track retrieval recall, precision, hallucination rate, answer faithfulness |
| Audit Log Export | Compliance | Monthly | Regulatory evidence of access controls and output traceability |
| Responsible AI Assessment | AI Governance Board | Quarterly | Bias, fairness, and explainability review |
| Data Lineage Record | Data Governance | Per ingestion run | Document-to-chunk-to-vector lineage for every item in the corpus |

10. Operational Considerations

Monitoring

| Metric | Type | Collection Method | Alert Threshold |
|---|---|---|---|
| Retrieval Latency P99 | Latency | OpenTelemetry trace | > 500ms |
| End-to-end Query Latency P99 | Latency | OpenTelemetry trace | > 3000ms |
| Embedding Model Availability | Availability | Synthetic probe every 60s | < 99.5% over 5 min |
| Vector DB Query Success Rate | Availability | API response monitoring | < 99.9% over 5 min |
| Hallucination Rate (weekly sample) | Quality | Manual review + LLM-as-judge | > 5% of sampled queries |
| Answer Faithfulness Score | Quality | Automated RAGAS evaluation | < 0.75 average |
| Index Freshness (hours since last update) | Freshness | Ingestion pipeline heartbeat | > 24 hours for Tier 1 docs |
| Context Budget Utilisation | Resource | Per-query logging | > 95% (approaching window limit) |

Service Level Objectives

| SLO | Target | Measurement Window |
|---|---|---|
| Query Response Time P95 | ≤ 2 seconds | Rolling 7-day |
| Query Response Time P99 | ≤ 4 seconds | Rolling 7-day |
| Pipeline Availability | ≥ 99.9% | Monthly |
| Ingestion Pipeline SLA (document available within N hours of publish) | ≤ 4 hours (Tier 1), ≤ 24 hours (Tier 2) | Per document |
| Answer Faithfulness (RAGAS) | ≥ 0.80 | Weekly evaluation |

Logging

  • Structured JSON logs for every pipeline stage
  • Correlation ID propagated through entire query lifecycle
  • PII fields in logs must be masked (hash user IDs, redact query text for high-classification corpora)
  • Log retention: 90 days hot (searchable), 7 years cold (compliance archive)

Incident Response

| Incident Type | Detection | Severity | Response |
|---|---|---|---|
| Hallucinated citation in high-stakes answer | Citation validator alert / user report | P1 | Immediate rollback of affected system prompt; manual review of last 24h queries |
| Cross-tenant data leakage | ACL audit log anomaly | P0 | Immediate service suspension; security team activation; regulatory notification |
| Vector DB unavailability | Synthetic probe | P1 | Fail over to read replica; page on-call SRE; degrade to keyword search |
| Ingestion pipeline stall | Freshness SLO breach | P2 | Restart pipeline; alert knowledge manager; communicate staleness to users |

Disaster Recovery

| Component | RTO | RPO | DR Strategy |
|---|---|---|---|
| Vector Database | 1 hour | 1 hour | Cross-region replica; daily snapshot to object storage |
| Document Store | 30 minutes | 0 (versioned) | Multi-region S3 replication; versioning enabled |
| Ingestion Pipeline | 4 hours | N/A (re-runnable) | Infrastructure-as-code re-deploy; idempotent re-ingestion |
| LLM API | 15 minutes | N/A | Multi-provider fallback (primary + secondary LLM provider) |

11. Cost Considerations

Cost Drivers

| Cost Driver | Unit | Approximate Cost | Scaling Behaviour |
|---|---|---|---|
| Embedding model (batch ingestion) | Per million tokens | $0.02–$0.13 (OpenAI/Google) | Linear with corpus size; one-time then incremental |
| Embedding model (query time) | Per million tokens | $0.02–$0.13 | Linear with query volume |
| Vector database hosting | Per million vectors/month | $70–$200 (managed); $20–$80 (self-hosted) | Sub-linear with sharding |
| LLM generation | Per million tokens (input+output) | $2–$15 (GPT-4o class) | Linear with query volume × context length |
| Cross-encoder re-ranking | Per million tokens | $1–$3 (Cohere) | Linear with query volume × K |
| Object storage (document store) | Per TB/month | $20–$25 | Linear with corpus size |
| Compute (orchestration, ingestion workers) | Per vCPU-hour | $0.05–$0.15 | Bursty during ingestion; steady-state low |

Scaling Risks

  • Context length growth: as users discover RAG and ask more complex queries, context tokens per query creep upward, driving LLM cost non-linearly
  • Re-embedding cost spike: at $0.02–$0.13 per million tokens, a mandatory embedding model upgrade on a 100B-token corpus costs $2,000–$13,000 in API fees (a 100M-token corpus costs only $2–$13) and requires careful planning
  • Vector index rebuild: adding new metadata fields requires a full index rebuild, causing temporary retrieval degradation
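The re-embedding estimate is a single multiplication of corpus size by the per-million-token price. The sketch below applies the $0.02–$0.13 range from the cost-driver table to two corpus sizes; the function name is hypothetical, and the figures cover API fees only (not compute, index rebuild, or engineering time).

```python
def embedding_cost_usd(corpus_tokens: int, price_per_million_usd: float) -> float:
    """API cost of embedding corpus_tokens at a given price per million tokens."""
    return corpus_tokens / 1_000_000 * price_per_million_usd


LOW_PRICE, HIGH_PRICE = 0.02, 0.13  # USD per million tokens, from the cost-driver table

for label, tokens in [("100M tokens", 100_000_000), ("100B tokens", 100_000_000_000)]:
    low = embedding_cost_usd(tokens, LOW_PRICE)
    high = embedding_cost_usd(tokens, HIGH_PRICE)
    print(f"{label}: ${low:,.0f} to ${high:,.0f}")
```

At these prices a 100M-token corpus re-embeds for about $2 to $13, while a 100B-token corpus costs about $2,000 to $13,000, so the spike only becomes material at very large corpus sizes or with more expensive models.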

Cost Optimisations

  • Use tiered embedding: cheap embedding model for initial retrieval, expensive model only for re-ranking candidates
  • Implement semantic caching (cache responses for near-duplicate queries using embedding similarity)
  • Batch embedding during off-peak hours to take advantage of batch API discounts (50% on OpenAI)
  • Right-size the LLM: use a smaller/cheaper model for low-stakes queries, route to premium model only for complex or high-classification queries
  • Compress stored vectors using scalar quantisation (INT8) to reduce storage by 4× with <2% recall degradation
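The 4× storage figure for INT8 quantisation follows from replacing each 4-byte float32 component with a 1-byte integer. A minimal sketch of symmetric per-vector scalar quantisation is shown below; it is one common scheme and not necessarily what a given vector database implements, and the function names are assumptions.

```python
from array import array
from typing import List, Tuple


def quantize_int8(vector: List[float]) -> Tuple[array, float]:
    """Symmetric scalar quantisation: map floats to integers in [-127, 127]."""
    max_abs = max((abs(x) for x in vector), default=0.0)
    scale = max_abs / 127 if max_abs else 1.0
    codes = array("b", (round(x / scale) for x in vector))  # 'b' = signed 1-byte int
    return codes, scale


def dequantize(codes: array, scale: float) -> List[float]:
    """Approximate reconstruction of the original floats."""
    return [code * scale for code in codes]


vector = [0.12, -0.5, 0.33, 0.9]
codes, scale = quantize_int8(vector)
restored = dequantize(codes, scale)

float32_bytes = array("f", vector).itemsize * len(vector)  # 4 bytes per component
int8_bytes = codes.itemsize * len(codes)                   # 1 byte per component
max_error = max(abs(a - b) for a, b in zip(vector, restored))
```

Each component is reconstructed to within half a quantisation step (`scale / 2`). The per-vector scale factor adds a few bytes of overhead, which is negligible at typical embedding dimensions. The effect on recall depends on the corpus and should be measured rather than assumed.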

Indicative Cost Range

| Deployment Scale | Monthly Cost Range | Notes |
|---|---|---|
| Small (< 1M vectors, < 10K queries/day) | $500 – $2,000 | Startup or departmental deployment |
| Medium (1M–10M vectors, 10K–100K queries/day) | $2,000 – $15,000 | Enterprise divisional deployment |
| Large (> 10M vectors, > 100K queries/day) | $15,000 – $80,000 | Enterprise-wide deployment; optimisation critical |

12. Trade-Off Analysis

Chunking Strategy Comparison

| Option | Recall Quality | Operational Complexity | Citation Granularity | Recommended For |
|---|---|---|---|---|
| Fixed-size chunking (512 tokens, 10% overlap) | Moderate | Low | Low (mid-paragraph boundaries) | Initial deployments; homogeneous corpora |
| Semantic chunking (paragraph/section boundaries) | High | Medium | High (section-level) | Policy/procedure documents; structured reports |
| Hierarchical chunking (summary + detail chunks) | Very High | High | Very High (clause-level) | Regulated documents; long technical specifications |
| Sentence-level chunking | Low (context fragmentation) | Low | Very High | Not recommended for enterprise RAG |

Embedding Model Comparison

| Option | Quality (MTEB) | Cost | Hosting | Lock-in Risk |
|---|---|---|---|---|
| OpenAI text-embedding-3-large | Highest | $0.13/M tokens | Cloud API | High (OpenAI dependency) |
| Google textembedding-gecko-004 | High | $0.025/M tokens | Cloud API | High (GCP dependency) |
| BAAI bge-large-en-v1.5 | High | Compute cost only | Self-hosted | None |
| Cohere embed-v3 | High | $0.10/M tokens | Cloud API | Medium |

Architectural Tensions

| Tension | Option A | Option B | Recommended Resolution |
|---|---|---|---|
| Freshness vs. Ingestion Cost | Real-time ingestion (high cost) | Batch nightly ingestion (stale) | Risk-tiered: Tier 1 docs hourly, Tier 2 daily |
| Retrieval Depth (high K) vs. Latency | K=50 for high recall | K=5 for low latency | K=20 + cross-encoder re-rank to N=5 |
| Open-source self-hosting vs. Managed services | Lower ongoing cost, full control | Higher managed cost, faster time-to-value | Managed for initial deployment; migrate to self-hosted at >$5K/month savings threshold |
| Context richness vs. Context window cost | Large context (high accuracy) | Small context (low cost) | Adaptive context: scale K with query complexity score |

13. Failure Modes

| Failure Mode | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| Hallucinated citation (LLM invents source) | Medium | High | Citation validator comparing generated refs against retrieved chunk IDs | Strip invalid citation; log for model quality review |
| Index staleness (document updated but old version retrieved) | Medium | Medium | Freshness monitoring; version mismatch detection | Re-trigger ingestion for affected document; surface version warning to user |
| Embedding drift (new documents in a different semantic space) | Low | Medium | Retrieval quality metric degradation over time | Re-embed affected documents; monitor RAGAS faithfulness |
| ACL filter bypass (misconfiguration) | Low | Critical | Anomaly detection on retrieval patterns; classification label mismatches in outputs | Immediate service suspension; full ACL audit |
| Cross-encoder timeout causing degraded ranking | Medium | Low | P99 latency alert | Serve top-K from vector search without re-ranking; log degradation |
| Context window overflow (truncated context) | Medium | High | Token count monitoring per request | Reduce K; prioritise by re-rank score; alert when budget > 90% |
| LLM generates answer outside provided context | Medium | High | Faithfulness scoring via RAGAS or LLM-as-judge | Tighten system prompt; consider output classifier |
| Vector database corruption | Very Low | Critical | Data integrity checksums; retrieval anomaly detection | Restore from last snapshot; re-ingest since snapshot timestamp |

Cascading Failure Scenarios

  • Embedding model API outage during peak query period: Query processor cannot embed queries → vector search cannot execute → entire RAG pipeline fails. Mitigation: implement query-embedding caching for recently seen queries; maintain a keyword search fallback with explicit quality degradation notice.
  • ACL metadata missing from newly ingested documents: Documents ingest without access control metadata → ACL filter passes all requests → data leakage. Mitigation: mandatory ACL metadata validation before ingestion completion; reject documents lacking classification metadata.
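The ingestion-time gate described in the second scenario can be sketched as a validation step that fails closed. The required field names (`classification`, `allowed_groups`) and the function names are hypothetical and would follow the enterprise's own metadata schema.

```python
from typing import Dict, List, Tuple

# Hypothetical required fields; adapt to the corpus metadata schema.
REQUIRED_ACL_FIELDS = ("classification", "allowed_groups")


class MissingAclMetadata(ValueError):
    """Raised when a document lacks the metadata needed for access control."""


def validate_acl_metadata(metadata: Dict) -> Dict:
    """Return metadata unchanged if complete; otherwise raise (fail closed)."""
    missing = [field for field in REQUIRED_ACL_FIELDS if not metadata.get(field)]
    if missing:
        doc_id = metadata.get("document_id", "<unknown>")
        raise MissingAclMetadata(f"document {doc_id} rejected: missing {', '.join(missing)}")
    return metadata


def gate_ingestion(documents: List[Dict]) -> Tuple[List[Dict], List[str]]:
    """Split a batch into accepted documents and rejection reasons."""
    accepted: List[Dict] = []
    rejected: List[str] = []
    for doc in documents:
        try:
            accepted.append(validate_acl_metadata(doc))
        except MissingAclMetadata as error:
            rejected.append(str(error))  # route to dead-letter queue in practice
    return accepted, rejected


accepted, rejected = gate_ingestion([
    {"document_id": "pol-1", "classification": "OFFICIAL", "allowed_groups": ["all-staff"]},
    {"document_id": "pol-2", "classification": "PROTECTED", "allowed_groups": []},
    {"document_id": "pol-3"},
])
```

An empty `allowed_groups` list is treated the same as a missing field, so a document can never become retrievable with an access list that matches nobody or, worse, is interpreted downstream as "no restriction".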

14. Regulatory Considerations

| Regulation | Requirement | RAG Pattern Response |
|---|---|---|
| APRA CPS 230 (Operational Resilience) | Critical service continuity; third-party risk for cloud LLM | DR plan per component; LLM provider assessed as material service provider; multi-provider fallback |
| APRA CPS 234 (Information Security) | Information asset classification; access control | ACL-aware retrieval; classification labels preserved; encrypted at rest and in transit |
| Privacy Act 1988 (Australia) | Minimum necessary data collection; right to erasure | PII detection before corpus ingestion; erasure procedure deletes chunk + vector + source document |
| EU AI Act Article 13 | Transparency: outputs must be interpretable, and users must know they are interacting with AI (the latter duty sits in Article 50) | UI disclosure: "Answers generated by AI based on [source]"; citation of source documents |
| EU AI Act Article 14 | Human oversight for high-risk AI systems | Human review gate for high-stakes RAG answers (medical, legal, financial advice) |
| ISO/IEC 42001 (AI Management System) | Risk management; accountability; transparency | Corpus inventory; model card; quality scorecard; audit logs as required artefacts |
| NIST AI RMF (Govern 1.1, Map 1.1) | Document AI system context and intended use | System card documenting intended use, limitations, and risk mitigations |
| GDPR Article 22 | No solely automated decisions affecting individuals | Human-in-the-loop for consequential decisions informed by RAG outputs |

15. Reference Implementations

AWS

  • Source connectors: Amazon Kendra (managed) or custom Lambda + EventBridge
  • Chunking & embedding: AWS Lambda (Python) + Amazon Titan Text Embeddings V2 (via Amazon Bedrock)
  • Vector store: Amazon OpenSearch Service with k-NN plugin, or Amazon Aurora PostgreSQL with pgvector
  • Document store: Amazon S3 with S3 Versioning
  • LLM: Amazon Bedrock (Claude 3.5 Sonnet, Llama 3)
  • Orchestration: AWS Step Functions + LangChain on Lambda
  • Observability: Amazon CloudWatch + AWS X-Ray + Langfuse

Azure

  • Source connectors: Azure Logic Apps + Microsoft Graph connector
  • Chunking & embedding: Azure Functions + Azure OpenAI Service (text-embedding-3-large)
  • Vector store: Azure AI Search (with vector search mode)
  • Document store: Azure Blob Storage
  • LLM: Azure OpenAI Service (GPT-4o)
  • Orchestration: Azure AI Studio Prompt Flow
  • Observability: Azure Monitor + Application Insights + Azure AI Content Safety

GCP

  • Source connectors: Cloud Run jobs + Pub/Sub for event-driven ingestion
  • Chunking & embedding: Cloud Run + Vertex AI Embeddings (textembedding-gecko)
  • Vector store: Vertex AI Vector Search (formerly Matching Engine) or AlloyDB for PostgreSQL with pgvector
  • Document store: Google Cloud Storage
  • LLM: Vertex AI (Gemini 1.5 Pro)
  • Orchestration: Vertex AI Agent Builder or LangChain on Cloud Run
  • Observability: Cloud Monitoring + Cloud Trace + Vertex AI Model Monitoring

On-Premises / Air-Gapped

  • Source connectors: Custom Python connectors + Apache NiFi
  • Chunking & embedding: GPU inference server (e.g. NVIDIA A10) + BAAI bge-large-en-v1.5
  • Vector store: Weaviate or Qdrant self-hosted on Kubernetes
  • Document store: MinIO (S3-compatible object storage)
  • LLM: vLLM serving Llama 3.1 70B or Mistral Large on GPU cluster
  • Orchestration: LangChain / LlamaIndex on Kubernetes
  • Observability: Prometheus + Grafana + Langfuse self-hosted
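Each reference implementation above fills the same seven component roles. One way to make that explicit in deployment configuration is a typed record, sketched here for the Azure and on-premises stacks. The `ReferenceStack` type and its field names are assumptions made for illustration; the values are abbreviated from the lists above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ReferenceStack:
    """One entry per component role; omitting a role fails at construction."""
    source_connectors: str
    chunking_and_embedding: str
    vector_store: str
    document_store: str
    llm: str
    orchestration: str
    observability: str


AZURE = ReferenceStack(
    source_connectors="Azure Logic Apps + Microsoft Graph connector",
    chunking_and_embedding="Azure Functions + Azure OpenAI Service (text-embedding-3-large)",
    vector_store="Azure AI Search",
    document_store="Azure Blob Storage",
    llm="Azure OpenAI Service (GPT-4o)",
    orchestration="Azure AI Studio Prompt Flow",
    observability="Azure Monitor + Application Insights",
)

ON_PREM = ReferenceStack(
    source_connectors="Custom Python connectors + Apache NiFi",
    chunking_and_embedding="GPU inference server + BAAI bge-large-en-v1.5",
    vector_store="Weaviate or Qdrant on Kubernetes",
    document_store="MinIO",
    llm="vLLM serving Llama 3.1 70B or Mistral Large",
    orchestration="LangChain / LlamaIndex on Kubernetes",
    observability="Prometheus + Grafana + Langfuse",
)


def describe(stack: ReferenceStack) -> list[str]:
    """Render a stack as 'role: component' lines, e.g. for a system card."""
    return [f"{f.name}: {getattr(stack, f.name)}" for f in fields(stack)]
```

Because every field is required, a stack definition that forgets a role (for example, observability) raises a `TypeError` when it is constructed instead of surfacing later as a gap in production.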

16. Related Patterns

Pattern ID | Pattern Name | Relationship
EAAPL-RAG002 | Multi-Source RAG | Extends RAG001 to heterogeneous source types; inherits all foundational components
EAAPL-RAG003 | Secure RAG | Extends RAG001 with enterprise ACL enforcement; recommended overlay for any regulated deployment
EAAPL-RAG004 | Federated RAG | Extends RAG001 to distributed knowledge bases; replaces centralised vector store
EAAPL-RAG005 | Hybrid RAG | Extends RAG001 retrieval layer with BM25 + RRF; drop-in upgrade to retrieval component
EAAPL-RAG006 | Streaming RAG | Extends RAG001 ingestion pipeline for real-time data sources
EAAPL-RAG007 | Agentic RAG | Wraps RAG001 in an AI agent loop for multi-hop retrieval
EAAPL-RAG008 | Multimodal RAG | Extends RAG001 embedding and retrieval for non-text modalities
EAAPL-RAG009 | Graph RAG | Replaces/augments vector retrieval with knowledge graph traversal
EAAPL-RAG010 | Contextual RAG with Metadata Filtering | Extends RAG001 with richer metadata schema and filter composition
EAAPL-KNW003 | AI Knowledge Corpus Management | Governs the document corpus that RAG001 indexes
EAAPL-KNW004 | Vector Database Management | Governs operational management of the vector store used by RAG001
EAAPL-KNW006 | Corpus Quality Assurance | Provides quality gates for corpus ingested into RAG001
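For readers unfamiliar with the RRF step mentioned under EAAPL-RAG005, reciprocal rank fusion merges several ranked lists by summing 1 / (k + rank) for each document across the lists. The sketch below is a generic illustration of that formula; how RAG005 itself configures fusion is not stated here, the function name is hypothetical, and k = 60 is the commonly used default rather than a value taken from this library.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document ids into one list, best first.

    rankings: iterable of lists, each ordered from most to least relevant
              (for example, one from BM25 and one from vector search).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by id for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["a", "b", "c"]
vector_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

Documents that appear near the top of both lists accumulate the highest scores, so RRF needs only ranks and avoids having to normalise BM25 scores against cosine similarities.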

17. Maturity Assessment

Overall Maturity: Mature — Enterprise RAG is widely deployed across regulated industries; tooling is production-grade; best practices are documented; failure modes are well-understood.

Dimension | Score (1–5) | Rationale
Technology Readiness | 5 | All components (vector DBs, embedding models, LLM APIs) are GA and production-proven
Tooling Ecosystem | 5 | LangChain, LlamaIndex, LlamaHub, Haystack, and cloud-native RAG services are mature
Operational Guidance | 4 | RAGAS evaluation framework, hallucination benchmarks, and SRE practices are established but evolving
Security & Compliance Guidance | 4 | ACL-aware retrieval and audit patterns are documented; regulatory mapping is still being formalised by standards bodies
Scalability Evidence | 4 | Production deployments at 100M+ vector scale are documented; optimisation at extreme scale requires expertise
Cost Predictability | 3 | LLM token costs are volatile; embedding model pricing changes frequently; cost modelling is an ongoing effort
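Because the Cost Predictability score reflects volatile token and embedding prices, a cost model is easier to maintain if prices are inputs rather than constants. The sketch below is one hypothetical way to do that: `Prices`, `cost_per_query`, and `monthly_cost_range` are illustrative names, the price figures in the usage lines are placeholders and not real vendor rates, and the model ignores system-prompt tokens, re-ranking, and infrastructure costs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    """Unit prices per 1,000 tokens, in any single currency."""
    llm_input_per_1k: float
    llm_output_per_1k: float
    embedding_per_1k: float


def cost_per_query(prices, query_tokens, context_tokens, output_tokens):
    """Estimate the cost of one RAG query: embed the query, then generate."""
    embedding = query_tokens / 1000 * prices.embedding_per_1k
    llm_input = (query_tokens + context_tokens) / 1000 * prices.llm_input_per_1k
    llm_output = output_tokens / 1000 * prices.llm_output_per_1k
    return embedding + llm_input + llm_output


def monthly_cost_range(prices, queries_per_month, query_tokens,
                       context_tokens, output_tokens, price_swing=0.25):
    """Return (low, expected, high) monthly cost for an assumed price swing."""
    expected = queries_per_month * cost_per_query(
        prices, query_tokens, context_tokens, output_tokens)
    return expected * (1 - price_swing), expected, expected * (1 + price_swing)


# Placeholder prices purely for illustration; substitute current vendor rates.
placeholder = Prices(llm_input_per_1k=1.0, llm_output_per_1k=2.0,
                     embedding_per_1k=0.1)
low, expected, high = monthly_cost_range(placeholder, 1000, 100, 900, 500)
```

Reporting a range instead of a single figure keeps the budget conversation honest about pricing volatility; the `price_swing` parameter is itself an assumption that should be revisited as vendor pricing changes.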

18. Revision History

Version | Date | Author | Changes
1.0 | 2024-01-15 | EAAPL Working Group | Initial publication
1.1 | 2024-04-20 | EAAPL Working Group | Added HyDE query expansion; updated OWASP LLM Top 10 to 2024 edition
2.0 | 2024-09-10 | EAAPL Working Group | Major revision: hierarchical chunking strategy added; cross-encoder re-ranking formalised; regulatory section updated for EU AI Act final text
2.1 | 2025-02-28 | EAAPL Working Group | Updated cost tables; added GCP Vertex AI reference implementation; expanded failure modes with cascading scenarios