EAAPL: Enterprise AI Architecture Pattern Library

[EAAPL-RAG006] Streaming Retrieval-Augmented Generation

Category: Artificial Intelligence / Retrieval-Augmented Generation
Sub-category: Real-Time and Event-Driven RAG
Version: 1.1
Maturity: Proven
Tags: rag, streaming, real-time, event-driven, freshness, kafka, time-windowed-retrieval, change-data-capture
Regulatory Relevance: APRA CPS 230 (operational resilience for real-time systems), MiFID II (real-time data obligations in financial services), ISO/IEC 42001 Section 8.4


1. Executive Summary

Streaming RAG extends the foundational RAG architecture to continuously ingest and index data from real-time event streams, enabling language model responses to be grounded in current, up-to-the-minute knowledge. While standard RAG operates on a periodically refreshed static corpus, Streaming RAG treats the knowledge index as a continuously evolving data product where events (new documents, document updates, market data changes, operational alerts) are reflected in retrievable context within seconds to minutes of occurrence.

For business leaders in time-sensitive domains (financial trading, real-time customer operations, industrial IoT, news and media), the difference between a RAG system that knows what happened yesterday and one that knows what happened five minutes ago can determine whether the AI assistant is fit for operational use. A customer service AI grounded only in yesterday's product data will give incorrect answers about today's promotional pricing. A risk management AI that cannot access this morning's regulatory alerts may miss a material compliance event. Streaming RAG closes this gap by making knowledge freshness a first-class architectural concern, traded off explicitly against retrieval quality and operational complexity.


2. Problem Statement

Business Problem

Standard batch-ingested RAG systems hold knowledge that is hours or days stale. For many enterprise use cases this is acceptable; for others it is not. Customer service assistants need to know today's pricing, promotions, and service status. Operations assistants need to know the current status of incidents, deployments, and SLA breaches. Financial services assistants need to know this morning's rate changes and regulatory updates. A RAG system that cannot reflect these changes in real time is operationally unusable for these use cases.

Technical Problem

The batch ingestion architecture of standard RAG introduces inherent latency between when a document is created or updated and when it becomes retrievable. Document crawling, format parsing, chunking, embedding, and index upsert are sequential pipeline steps that typically take minutes to hours per document in a batch schedule. Event-driven and streaming systems demand a different architecture: an ingestion pipeline triggered by events rather than schedules, optimised for low-latency single-document processing rather than high-throughput batch processing.

Additionally, real-time data creates a freshness/quality tension: a document ingested seconds ago has not undergone the quality validation, editorial review, or metadata enrichment that a batch-ingested document has. The retrieval strategy must account for this — freshness and quality are both relevant signals for ranking, and the appropriate balance depends on the use case.

Symptoms

  • AI assistant gives incorrect answers because its knowledge is stale (e.g., cites a policy that was updated this morning)
  • Customer service AI provides yesterday's pricing to customers who have received today's promotional offer
  • Operations runbook AI cannot answer questions about the current production incident because the incident report was just created
  • News and media AI continues to reference an old version of a story after a material retraction was published

Cost of Inaction

  • Customer complaints and churn from incorrect information delivery in real-time service channels
  • Regulatory exposure: financial services AI providing stale rate or compliance data
  • Operational risk: incident response AI grounded in stale runbooks during a live incident
  • Trust erosion: users who notice knowledge staleness lose confidence in the AI assistant and revert to manual lookup

3. Context

When to Apply

  • AI assistants for customer-facing channels where product, pricing, or service data changes frequently (daily or faster)
  • Operations and incident response assistants where runbook and incident data must be current
  • Financial services AI where market data, rate sheets, and regulatory alerts must be reflected within minutes
  • News, media, and content platforms where articles, corrections, and editorial updates are continuous
  • IoT and industrial operations where equipment status, alerts, and sensor thresholds change continuously

When NOT to Apply

  • Knowledge corpora that change less frequently than once per day (standard batch ingestion is simpler and less expensive)
  • Use cases where answer freshness is not a user requirement (historical research, legal document review)
  • Environments lacking a mature event streaming infrastructure (Kafka, Kinesis) — the operational overhead of deploying streaming infrastructure solely for RAG is rarely justified
  • Use cases where raw streaming data quality is insufficient for grounded AI generation (malformed events, incomplete records)

Prerequisites

  • An event streaming platform (Apache Kafka, Amazon Kinesis, Azure Event Hubs, Google Pub/Sub) producing document change events
  • Change Data Capture (CDC) tooling for database-originated changes (Debezium, AWS DMS)
  • A vector database that supports low-latency upsert operations (Weaviate, Qdrant, Pinecone — all support real-time upsert)
  • An embedding inference service with low-latency API (< 100ms per chunk) to avoid ingestion lag
  • Freshness metadata (created_at, updated_at timestamps) indexed alongside vectors

Industry Applicability

| Industry | Event Source | Acceptable Freshness Lag | Primary Use Case |
|---|---|---|---|
| Financial Services | Market data feeds, regulatory alert APIs | Seconds to minutes | Rate and compliance queries |
| E-commerce / Retail | Product catalogue events, inventory CDC | Minutes | Pricing and availability Q&A |
| Operations / DevOps | Incident management webhooks, monitoring alerts | Seconds | Live incident runbook assistance |
| Healthcare | Clinical note CDC, pharmacy formulary updates | Minutes to hours | Clinical decision support |
| News / Media | CMS publish webhooks | Seconds | News article Q&A and summarisation |
| IoT / Manufacturing | Sensor alert streams, equipment status CDC | Seconds | Equipment status and troubleshooting |

4. Architecture Overview

Streaming RAG introduces an event-driven ingestion pipeline that runs continuously alongside (or instead of) the batch ingestion pipeline. The two pipelines share the same vector database but differ in their triggering, throughput optimisation, and quality guarantees.

Event-Driven Ingestion Pipeline

The event-driven pipeline is triggered by document change events rather than a schedule. Events are consumed from a streaming platform (Kafka topic, Kinesis stream, or event hub). Each event carries the document content (or a reference to it) and a change type (CREATE, UPDATE, DELETE). The pipeline processes events through the standard chunking → embedding → upsert sequence but is optimised for low per-event latency rather than high batch throughput.

Key design decisions for the event-driven pipeline:

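  • Exactly-once semantics: document CREATE and UPDATE events must take effect exactly once to avoid duplicate vectors. Kafka transactions do not span an external vector database, so in practice this is achieved by committing offsets only after the upsert succeeds (at-least-once delivery) and making the write idempotent: upsert, never insert, using deterministic chunk IDs derived from the document ID.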
  • DELETE propagation: when a document is deleted, all its chunks must be deleted from the vector index. This requires a chunk-to-document mapping in a metadata store (the document store serves this purpose).
  • Partial document updates: if only a section of a document changes, a partial update strategy re-chunks and re-embeds only the changed sections, significantly reducing embedding compute cost for large documents.
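  • Event ordering: out-of-order events (for example, an UPDATE processed before its CREATE when events for one document are spread across partitions) must be handled. Keying events by document ID keeps each document's events in a single partition and therefore in order. Idempotent upserts make replays safe but do not by themselves stop an older event from overwriting a newer one; version numbers in event payloads allow stale writes to be detected and discarded.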
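
The design decisions above can be sketched with an in-memory stand-in for the vector index. This is a minimal illustration, not a definitive implementation: `ChunkIndex`, `apply`, and the integer `version` field are hypothetical names, and a real deployment would issue the equivalent upsert and delete calls against its vector database and document store.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ChunkIndex:
    """In-memory stand-in for a vector index plus chunk-to-document mapping."""
    vectors: dict = field(default_factory=dict)       # chunk_id -> chunk text (a real index stores vectors)
    doc_chunks: dict = field(default_factory=dict)    # doc_id -> set of chunk_ids
    doc_versions: dict = field(default_factory=dict)  # doc_id -> highest version applied (kept after DELETE)

    @staticmethod
    def chunk_id(doc_id: str, position: int) -> str:
        # Deterministic ID: replaying the same event overwrites rather than duplicates.
        return hashlib.sha256(f"{doc_id}:{position}".encode()).hexdigest()[:16]

    def apply(self, event_type: str, doc_id: str, version: int, chunks=()) -> bool:
        """Apply a CREATE/UPDATE/DELETE event; return False if it was stale and discarded."""
        if version <= self.doc_versions.get(doc_id, -1):
            return False  # stale or replayed event: leave the index untouched
        self.doc_versions[doc_id] = version
        # Remove every chunk of the previous version (this is also DELETE propagation).
        for cid in self.doc_chunks.pop(doc_id, set()):
            self.vectors.pop(cid, None)
        if event_type in ("CREATE", "UPDATE"):
            ids = set()
            for position, text in enumerate(chunks):
                cid = self.chunk_id(doc_id, position)
                self.vectors[cid] = text
                ids.add(cid)
            self.doc_chunks[doc_id] = ids
        return True
```

Because the version is retained after a DELETE, a late-arriving older CREATE or UPDATE is rejected instead of resurrecting the document.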

Freshness-Aware Retrieval

The retrieval layer must incorporate document freshness as a ranking signal alongside semantic relevance. Two approaches are common:

  1. Hard time-window filtering: pre-retrieval metadata filter restricts results to documents updated within a configurable time window (e.g., last 24 hours for market data; last 7 days for policies). This is simple and guarantees freshness but may exclude relevant older documents.
  2. Freshness-decay scoring: the retrieval score is multiplied by a time-decay function exp(-λ × age_in_days) where λ controls the decay rate. Freshness is a soft preference, not a hard cut-off. Documents updated in the last hour score near-full relevance; documents updated a month ago have discounted relevance.

The appropriate strategy depends on the use case: real-time market data use cases benefit from hard time-window filtering; policy and procedure use cases (where a 3-year-old policy is still valid) benefit from freshness-decay scoring.
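
Both strategies are small enough to sketch directly. The function names below are illustrative, and `half_life_to_lambda` is an added convenience (not part of the pattern) for choosing λ from a more intuitive half-life: the score halves every `ln(2) / λ` days.

```python
import math
from datetime import datetime, timedelta, timezone


def decay_score(relevance: float, updated_at: datetime, now: datetime, lam: float) -> float:
    """Freshness-decay scoring: relevance * exp(-lam * age_in_days)."""
    age_days = max((now - updated_at).total_seconds(), 0.0) / 86400.0
    return relevance * math.exp(-lam * age_days)


def within_window(updated_at: datetime, now: datetime, window: timedelta) -> bool:
    """Hard time-window filter: keep only documents updated inside the window."""
    return now - updated_at <= window


def half_life_to_lambda(half_life_days: float) -> float:
    """Return lam such that the decayed score halves every half_life_days."""
    return math.log(2) / half_life_days


now = datetime(2025, 3, 15, tzinfo=timezone.utc)
lam = half_life_to_lambda(30)  # assumed 30-day half-life, for illustration only
fresh = decay_score(0.80, now - timedelta(hours=1), now, lam)
month_old = decay_score(0.90, now - timedelta(days=30), now, lam)
```

With a 30-day half-life, the month-old chunk's 0.90 relevance is discounted to 0.45, so the hour-old chunk with relevance 0.80 ranks above it.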

Index Consistency During Real-Time Upserts

Real-time vector upserts create a brief window of index inconsistency while the HNSW index is being updated. Most enterprise vector databases handle concurrent reads and writes gracefully (MVCC-style), but very high-throughput streaming scenarios may require index partitioning strategies (e.g., separate "hot" index for recent documents, "cold" index for stable historical documents) with a merge retrieval strategy. The hot/cold partition approach also enables different HNSW tuning parameters: the hot index is tuned for high write throughput; the cold index is tuned for high recall.
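
A merge retrieval step over the two partitions might look like the sketch below. It assumes both indexes use the same embedding model and distance metric, so that scores are directly comparable, and that a chunk present in both indexes should be taken from the hot index because that copy is the more recent one. `merge_results` is a hypothetical helper, not an API of any vector database.

```python
import heapq


def merge_results(hot, cold, top_k=5):
    """Merge (chunk_id, score) hits from the hot and cold indexes.

    When the same chunk_id appears in both, the hot-index hit wins,
    regardless of score, because it reflects the latest version.
    """
    best = {}
    for chunk_id, score in cold:
        best[chunk_id] = (score, "cold")
    for chunk_id, score in hot:
        best[chunk_id] = (score, "hot")  # overwrite any cold copy
    ranked = heapq.nlargest(top_k, best.items(), key=lambda item: item[1][0])
    return [(chunk_id, score, source) for chunk_id, (score, source) in ranked]
```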

Quality Considerations for Streaming Data

Raw event data is often lower quality than batch-processed documents: events may be incomplete, may contain duplicate or near-duplicate content, or may represent intermediate states (a draft document, an unreviewed news post). A lightweight quality filter at the event consumer should enforce minimum quality thresholds (minimum document length, no-duplicate check via content hash, optional classification check) before triggering embedding and indexing.
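
As a rough sketch of such a filter, the class below applies a minimum-length check and an exact-duplicate check via a content hash. The threshold of 50 characters and the class name are assumptions for illustration; a production filter would keep the seen-hash set in a bounded or shared store rather than in process memory.

```python
import hashlib


class QualityFilter:
    """Lightweight pre-embedding filter: minimum length plus content-hash deduplication."""

    def __init__(self, min_chars: int = 50):
        self.min_chars = min_chars
        self.seen_hashes = set()

    def check(self, content: str) -> str:
        """Return "ok", "too_short", or "duplicate"."""
        # Normalise whitespace and case so trivially different copies hash the same.
        normalised = " ".join(content.split()).lower()
        if len(normalised) < self.min_chars:
            return "too_short"
        digest = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        if digest in self.seen_hashes:
            return "duplicate"
        self.seen_hashes.add(digest)
        return "ok"
```

Only events that return "ok" would proceed to chunking and embedding; the other outcomes can be logged for the dead-letter or review path.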


5. Architecture Diagram

ARCHITECTURE DIAGRAM
flowchart TD
    subgraph Events["Event Sources"]
        A[Kafka / Kinesis Stream]
        B[Batch Ingestion]
    end
    subgraph Ingestion["Streaming Ingestion"]
        C[Stream Consumer]
        D[Hot Vector Index]
        E[Cold Vector Index]
    end
    subgraph Query["Freshness-Aware Query"]
        F[User Query]
        G[Freshness Filter]
        H[Merge + Rerank]
        I[LLM Generation]
    end
    A -->|exactly-once upsert| C
    B --> E
    C --> D
    D --> H
    E --> H
    F --> G --> H --> I --> F
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#dbeafe,stroke:#3b82f6
    style C fill:#f0fdf4,stroke:#22c55e
    style D fill:#fef9c3,stroke:#eab308
    style E fill:#fef9c3,stroke:#eab308
    style F fill:#dbeafe,stroke:#3b82f6
    style G fill:#f0fdf4,stroke:#22c55e
    style H fill:#f0fdf4,stroke:#22c55e
    style I fill:#d1fae5,stroke:#10b981

6. Components

| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| Event Streaming Platform | Infrastructure | Durable, ordered event log for document change events | Apache Kafka, Amazon Kinesis, Azure Event Hubs, Google Pub/Sub | Critical |
| CDC Connector | Integration | Capture row-level changes from databases as events | Debezium, AWS DMS, Azure Data Factory CDC mode | High |
| Stream Consumer | Application | Consume events; commit offsets with exactly-once semantics | Kafka Consumer (Python), Faust, Apache Flink | Critical |
| Quality Filter | Data Quality | Reject malformed, duplicate, or below-threshold events | Custom Python; great_expectations; pydantic validators | High |
| Real-Time Chunker | Data Processing | Chunk document on arrival; support partial update chunking | LangChain splitters; custom partial-update diff algorithm | High |
| Embedding Service | ML Inference | Low-latency embedding API (< 100ms per chunk) | OpenAI embeddings API, Google Vertex AI embeddings, self-hosted sentence-transformers | Critical |
| Hot Vector Index | Storage | Receive real-time upserts; optimised for write throughput | Weaviate (live index), Qdrant (real-time segment), Pinecone upsert | Critical |
| Cold Vector Index | Storage | Stable archival index for historical documents; optimised for recall | pgvector, OpenSearch, Pinecone (separate index) | High |
| Merge Retrieval Layer | Orchestration | Query both hot and cold indexes; combine and deduplicate results | Custom Python; LangChain custom retriever | High |
| Freshness Decay Scorer | Algorithm | Apply time-decay function to re-weight retrieved chunks by age | Custom Python (configurable λ parameter) | Medium |
| As-Of Timestamp Tracker | Observability | Record and surface the knowledge cutoff timestamp for each response | Custom metadata; surfaced in response header | High |

7. Data Flow

Primary Flow

| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | Source System | Publish document change event (CREATE/UPDATE/DELETE) | Kafka message: {event_type, doc_id, content, metadata, timestamp} |
| 2 | Stream Consumer | Consume event; validate schema; deduplicate via content hash | Validated event or SKIP (duplicate) |
| 3 | Quality Filter | Check minimum length, classification, and formatting | Quality-passed event |
| 4 | Real-Time Chunker | Chunk document; for UPDATE events, diff against last-known chunk set | New/modified chunks with doc_id and metadata |
| 5 | Embedding Service | Embed each chunk (< 100ms per chunk target) | Chunk vectors |
| 6 | Hot Vector Index | Idempotent upsert: if chunk_id exists, replace; else insert | Vector index updated; stale chunks removed |
| 7 | Hot Index Aging Job | Periodic job: move documents older than 7 days from hot to cold index | Cold index updated; hot index cleaned |
| 8 | User | Submit query | Query string |
| 9 | Query Processor | Expand query | Enhanced query |
| 10 | Freshness Strategy | Determine retrieval strategy: hard window or decay parameter | Filter expression or decay λ |
| 11 | Merge Retrieval | Query both hot and cold indexes; combine results with freshness weighting | Merged, freshness-weighted candidate list |
| 12 | Re-ranker | Score candidates with a cross-encoder; rank on the combined freshness × relevance score | Top-N re-ranked candidates |
| 13 | Context Assembler | Assemble prompt; include "as-of" timestamp in system context | Assembled prompt with freshness indicator |
| 14 | LLM | Generate answer | Response |
| 15 | Response Delivery | Return answer with "Knowledge current as of [timestamp]" indicator | Final response |
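
Step 4, the diff against the last-known chunk set, can be approximated by comparing content hashes so that only changed chunks reach the embedding service. This is one possible approach under stated assumptions, not the pattern's prescribed algorithm: `diff_chunks` is a hypothetical helper, and `previous` is assumed to be a mapping from content hash to chunk ID loaded from the document store.

```python
import hashlib


def chunk_hash(text: str) -> str:
    """Content hash used to recognise unchanged chunks across document versions."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def diff_chunks(previous: dict, new_chunks: list):
    """Compare the indexed chunk set with the chunks of the updated document.

    previous:   content hash -> chunk_id for chunks already in the index
    new_chunks: chunk texts produced by re-chunking the updated document
    Returns (to_embed, to_delete): texts that need embedding, chunk_ids to remove.
    """
    new_by_hash = {chunk_hash(text): text for text in new_chunks}
    to_embed = [text for h, text in new_by_hash.items() if h not in previous]
    to_delete = [cid for h, cid in previous.items() if h not in new_by_hash]
    return to_embed, to_delete
```

For a large policy document in which one section changes, only that section's chunks appear in `to_embed`; unchanged chunks keep their existing vectors.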

Error Flow

| Error Condition | Detection | Recovery |
|---|---|---|
| Embedding service latency spike (> 500ms) | Per-event latency monitoring | Apply back-pressure (accept temporarily higher consumer lag); alert; route to backup embedding service |
| Consumer lag growing (events accumulating) | Kafka consumer group lag monitoring | Scale consumer instances; alert on lag threshold breach |
| Hot index write conflict | Vector DB error response | Retry with exponential backoff (3 retries); dead-letter queue |
| Stale DELETE event arriving after new CREATE | Version check in event metadata | Reject stale DELETE using version comparison; log discarded event |
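
The write-conflict recovery (three retries with exponential backoff, then a dead-letter queue) can be sketched as follows. The function name, the 0.5-second base delay, and the list used as a dead-letter queue are illustrative assumptions; the `sleep` parameter exists only so the delay can be stubbed out in tests.

```python
import time


def upsert_with_retry(upsert, event, dead_letters, max_retries=3, base_delay=0.5, sleep=time.sleep):
    """Call upsert(event); on failure retry with exponential backoff, then dead-letter.

    Returns True if the upsert eventually succeeded, False if the event was dead-lettered.
    """
    for attempt in range(max_retries + 1):
        try:
            upsert(event)
            return True
        except Exception as exc:  # a real consumer would catch the client's specific error type
            if attempt == max_retries:
                dead_letters.append({"event": event, "error": str(exc)})
                return False
            sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s with the defaults
```

Because upserts are idempotent, retrying an upsert that partially succeeded is safe.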

8. Security Considerations

Real-Time ACL Update Propagation

In streaming RAG, documents can be ingested faster than ACL metadata is normalised. The stream consumer must never index a document event that lacks ACL metadata, because an empty ACL would make the document accessible to everyone. Instead, such events are held in a "pending ACL" queue and indexed only once their ACL metadata is resolved.
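
A minimal sketch of the pending-ACL gate follows. The `AclGate` class and the `acl` and `doc_id` event fields are hypothetical; in a real pipeline the pending queue would be a durable topic or table rather than an in-memory deque.

```python
from collections import deque


class AclGate:
    """Holds back events without ACL metadata until the ACL is resolved."""

    def __init__(self):
        self.pending = deque()  # events waiting for ACL metadata
        self.indexable = []     # events cleared for chunking and embedding

    def submit(self, event: dict) -> str:
        if event.get("acl"):    # a missing or empty ACL is never treated as "public"
            self.indexable.append(event)
            return "indexable"
        self.pending.append(event)
        return "pending_acl"

    def resolve(self, doc_id: str, acl: list) -> int:
        """Attach a resolved, non-empty ACL to pending events for doc_id and release them."""
        released, still_pending = 0, deque()
        for event in self.pending:
            if event["doc_id"] == doc_id and acl:
                self.indexable.append({**event, "acl": list(acl)})
                released += 1
            else:
                still_pending.append(event)
        self.pending = still_pending
        return released
```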

Event Stream Security

The event streaming platform must enforce topic-level access control (Kafka ACLs or IAM policies for managed services). The embedding service API key must be stored in a secrets manager and rotated regularly. All event traffic should be encrypted in transit with TLS; payloads containing highly sensitive content should additionally be protected with message-level envelope encryption.

OWASP LLM Top 10 Mitigations

| OWASP LLM Risk | Streaming-Specific Concern | Mitigation |
|---|---|---|
| LLM01: Prompt Injection | Malicious content injected into a high-priority event stream to influence AI responses | Content sanitisation in quality filter; treat all event content as untrusted |
| LLM04: Model Denial of Service | Event flood (high-rate update events) causes embedding service overload | Rate limiting at event consumer; embedding service auto-scaling with circuit breaker |
| LLM09: Overreliance | User assumes response reflects very recent events; system may still have lag | Surface "as-of" timestamp in every response; document the ingestion lag SLA |
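
The "as-of" mitigation for LLM09 could be implemented along the following lines. This sketch assumes the pipeline tracks `last_indexed_at`, the UTC timestamp of the most recently indexed event; the function name and the returned fields are illustrative rather than part of the pattern.

```python
from datetime import datetime, timedelta, timezone


def with_freshness_indicator(answer: str, last_indexed_at: datetime,
                             now: datetime, max_lag: timedelta) -> dict:
    """Attach a "Knowledge current as of" indicator and flag a freshness SLA breach.

    Both datetimes are assumed to be timezone-aware UTC values.
    """
    lag = now - last_indexed_at
    return {
        "answer": answer,
        "as_of": last_indexed_at.isoformat(),
        "footer": f"Knowledge current as of {last_indexed_at:%Y-%m-%d %H:%M} UTC",
        "sla_breached": lag > max_lag,
    }
```

When `sla_breached` is true, the delivery layer can add a warning to the response instead of letting the user assume the answer reflects the latest events.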

9. Governance Considerations

Freshness SLA Governance

The ingestion lag SLA (time from event publication to document retrievability) must be formally agreed and monitored. For each source type, the SLA should be documented in the Source Inventory (EAAPL-KNW003). Breach of the freshness SLA for Tier 1 sources must trigger an automated alert.

Governance Artefacts

| Artefact | Owner | Frequency | Purpose |
|---|---|---|---|
| Ingestion Lag SLA | Data Governance | Per source type; reviewed quarterly | Agreed freshness commitments per source |
| Consumer Lag Dashboard | AI Operations | Continuous | Real-time view of ingestion pipeline health |
| Stale Response Incident Log | AI Operations | Per incident | Track cases where stale knowledge led to incorrect responses |
| Hot/Cold Index Merge Audit | Data Engineering | Weekly | Verify documents correctly migrated from hot to cold index |

10. Operational Considerations

Monitoring

| Metric | Alert Threshold | Notes |
|---|---|---|
| Event consumer lag (messages behind) | > 1,000 messages (Tier 1) | Scale consumer; alert ops team |
| Embedding API P99 latency | > 200ms | Check embedding service; scale if needed |
| Hot index write latency P99 | > 50ms | Check vector DB cluster health |
| As-of timestamp lag (latest indexed doc age) | > 5 min (Tier 1 sources) | Freshness SLA breach; alert |
| Dead-letter queue depth | > 10 messages | Failed events require investigation |

Service Level Objectives

| SLO | Target | Measurement |
|---|---|---|
| End-to-end ingestion lag (event → retrievable) | ≤ 60 seconds (Tier 1), ≤ 5 minutes (Tier 2) | Synthetic event latency probe |
| Hot index availability | ≥ 99.95% | Health check |
| Consumer availability (no lag growth) | ≥ 99.9% | Lag monitoring |

Disaster Recovery

| Component | RTO | RPO | DR Strategy |
|---|---|---|---|
| Event Streaming Platform | 15 min (managed) | 0 (durable log) | Managed service with multi-AZ; retain messages for 7 days |
| Hot Vector Index | 30 min | 1 hour | Snapshot to cold storage; replay events from Kafka from last snapshot offset |
| Stream Consumer | 5 min | 0 (Kafka offset replay) | Auto-scaling group; Kubernetes pod restart |

11. Cost Considerations

Cost Drivers

| Cost Driver | Notes | Optimisation |
|---|---|---|
| Embedding API (real-time, per-event) | Real-time calls are more expensive than batch (no batch discount) | Batch micro-events (group events within a 5-second window) |
| Kafka / streaming platform | Managed streaming services ($0.02–$0.10 per GB throughput) | Right-size retention period; use tiered storage for cold Kafka data |
| Hot index write throughput | High write-throughput vector DB configuration costs more | Tune hot index for write throughput; migrate to cold index promptly |
| Embedding for deleted documents | Deletion events do not require embedding but do require chunk lookup | Maintain chunk-to-document mapping for O(1) deletion |
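
The micro-batching optimisation (grouping events within a 5-second window so they can be embedded in one request) can be sketched as a generator. It assumes events arrive as `(timestamp_seconds, payload)` pairs sorted by timestamp; `micro_batches` is a hypothetical helper, and a live consumer would also need a timer to flush the last batch when no further event arrives.

```python
def micro_batches(events, window_seconds=5.0):
    """Group (timestamp_seconds, payload) events into batches spanning less than window_seconds.

    A new batch starts when an event arrives window_seconds or more after
    the first event of the current batch.
    """
    batch, batch_start = [], None
    for timestamp, payload in events:
        if batch and timestamp - batch_start >= window_seconds:
            yield batch
            batch = []
        if not batch:
            batch_start = timestamp
        batch.append(payload)
    if batch:
        yield batch
```

The window adds up to its own length to the ingestion lag, so it should be chosen against the freshness SLO for the source tier.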

Indicative Cost Range

| Event Volume | Incremental Cost vs. Batch RAG | Notes |
|---|---|---|
| < 1,000 events/day | +$200–$500/month | Low-volume streaming; streaming infra fixed cost dominates |
| 1,000–50,000 events/day | +$500–$2,000/month | Moderate streaming; embedding cost becomes significant |
| > 50,000 events/day | +$2,000–$10,000/month | High-volume; batch micro-event grouping essential |

12. Trade-Off Analysis

Freshness vs. Retrieval Quality

| Approach | Freshness | Retrieval Quality Risk | Complexity |
|---|---|---|---|
| Hard time-window filter | Guaranteed (only recent docs) | High (may miss relevant older docs) | Low |
| Freshness-decay scoring | Soft preference for recent docs | Low (older docs still retrievable) | Medium |
| Hot/cold index merge | Balanced (recent = hot; historical = cold) | Low | High |

Event Granularity for Streaming

| Granularity | Ingestion Cost | Index Churn | Recommended For |
|---|---|---|---|
| Full document on every update | High (re-embed all chunks) | High | Small documents (< 1,000 tokens) |
| Section-level diff (changed sections only) | Low | Low | Large documents (policies, manuals) |
| Event field payload (structured data update) | Very Low | Very Low | Database record changes (pricing, inventory) |

Architectural Tensions

| Tension | Trade-off | Recommendation |
|---|---|---|
| Real-time freshness vs. quality validation | Instant ingestion: fresh but unvalidated; delayed ingestion: validated but stale | Ingest immediately; tag as "unreviewed" in metadata; surface review status to users |
| Hot index write throughput vs. recall | High write throughput: lower HNSW ef_construction → lower recall; rebuild pauses index | Separate hot/cold indexes with different HNSW parameters |

13. Failure Modes

| Failure Mode | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| Embedding service rate limit during event surge | Medium | High | Consumer lag + embedding error rate | Pre-provision embedding service quota; circuit breaker routes excess to batch queue |
| Out-of-order DELETE before CREATE | Low | Medium | Version check in consumer | DELETE is a no-op on the index if the document is not present; record the DELETE's version as a tombstone so the late-arriving CREATE is discarded as stale instead of resurrecting the document |
| Hot index capacity exhaustion | Medium | High | Index storage utilisation monitoring | Automated migration of old docs to cold index; alert at 80% capacity |
| Duplicate events causing duplicate vectors | Medium | Low | Content hash deduplication in quality filter | Idempotent upsert prevents duplicate vectors even if filter fails |
| Consumer reprocessing on offset reset causes full re-index | Low | Medium | Consumer offset monitoring | Idempotent upserts make reprocessing safe; monitor for unexpected full-corpus re-index |

14. Regulatory Considerations

| Regulation | Requirement | Streaming RAG Response |
|---|---|---|
| MiFID II (EU) | Financial data must be current; obligations around data timestamping | As-of timestamp surfaced in every response; ingestion lag SLA ≤ 5 minutes for rate data |
| APRA CPS 230 (Australia) | Operational resilience; recovery time for critical services | Streaming platform multi-AZ; Kafka message retention enables replay on outage |
| Privacy Act 1988 (Australia) | Real-time CDC may capture personal information in database changes | CDC PII filter applied before Kafka topic ingestion; personal data changes not ingested to RAG |

15. Reference Implementations

AWS

  • CDC: AWS DMS → Amazon Kinesis Data Streams
  • Streaming consumer: AWS Lambda (Kinesis trigger) + Amazon Bedrock Titan Embeddings
  • Hot index: Amazon OpenSearch Service (real-time upsert)
  • Cold index: Amazon Aurora pgvector
  • Freshness decay: Lambda post-processor

Azure

  • CDC: Azure Data Factory CDC → Azure Event Hubs
  • Streaming consumer: Azure Functions (Event Hub trigger) + Azure OpenAI embeddings
  • Hot/cold index: Azure AI Search (hot namespace + cold namespace)
  • Merge layer: Azure Function combining hot/cold results

GCP

  • CDC: Datastream → Google Pub/Sub
  • Streaming consumer: Cloud Run (Pub/Sub push) + Vertex AI embeddings
  • Hot index: Vertex AI Vector Search (real-time upsert supported)
  • Cold index: AlloyDB pgvector

16. Related Patterns

| Pattern ID | Pattern Name | Relationship |
|---|---|---|
| EAAPL-RAG001 | Enterprise RAG | Foundation; RAG006 replaces or supplements the ingestion pipeline component |
| EAAPL-RAG003 | Secure RAG | ACL enforcement applied at streaming ingestion; mandatory overlay |
| EAAPL-RAG007 | Agentic RAG | Agentic loops may trigger targeted real-time indexing for discovered sources |
| EAAPL-KNW003 | AI Knowledge Corpus Management | Corpus management policies apply to streaming corpora; freshness SLA is a corpus management concern |

17. Maturity Assessment

Overall Maturity: Proven — Event-driven RAG ingestion is deployed in production at financial services and e-commerce scale; freshness-aware retrieval patterns are well-understood; hot/cold index partitioning is operationally mature.

| Dimension | Score (1–5) | Rationale |
|---|---|---|
| Technology Readiness | 4 | All components production-ready; partial-update chunking is the least mature component |
| Tooling Ecosystem | 4 | Kafka, Kinesis, and vector DB upsert are mature; Flink/Faust for stream processing are production-grade |
| Operational Guidance | 3 | Consumer lag management and hot/cold migration tuning require Kafka and vector DB expertise |
| Security & Compliance | 3 | CDC PII filtering and ACL propagation in streaming are complex; less tooling than batch equivalents |
| Scalability Evidence | 4 | Financial services deployments at millions of events/day are documented |
| Cost Predictability | 3 | Real-time embedding costs are variable; event volume spikes can cause cost overruns |

18. Revision History

| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2024-05-01 | EAAPL Working Group | Initial publication |
| 1.1 | 2025-03-15 | EAAPL Working Group | Hot/cold index partitioning formalised; partial-update chunking strategy added; freshness-decay formula documented |