EAAPL-RAG006Proven

Streaming Retrieval-Augmented Generation

Retrieval-Augmented GenerationAPRA CPS230ISO/IEC 42001

[EAAPL-RAG006] Streaming Retrieval-Augmented Generation

Category: Artificial Intelligence / Retrieval-Augmented Generation Sub-category: Real-Time and Event-Driven RAG Version: 1.1 Maturity: Proven Tags: rag streaming real-time event-driven freshness kafka time-windowed-retrieval change-data-capture Regulatory Relevance: APRA CPS230 (operational resilience for real-time systems), MiFID II (real-time data obligations in financial services), ISO/IEC 42001 Section 8.4

1. Executive Summary

Streaming RAG extends the foundational RAG architecture to continuously ingest and index data from real-time event streams, enabling language model responses to be grounded in current, up-to-the-minute knowledge. While standard RAG operates on a periodically refreshed static corpus, Streaming RAG treats the knowledge index as a continuously evolving data product where events (new documents, document updates, market data changes, operational alerts) are reflected in retrievable context within seconds to minutes of occurrence.

For business leaders in time-sensitive domains — financial trading, real-time customer operations, industrial IoT, news and media — the distinction between a RAG system that knows what happened yesterday versus one that knows what happened five minutes ago can determine whether the AI assistant is fit for operational use. A customer service AI grounded only in yesterday's product data will give incorrect answers about today's promotional pricing. A risk management AI that cannot access this morning's regulatory alerts may miss a material compliance event. Streaming RAG closes this gap by making knowledge freshness a first-class architectural concern, traded off explicitly against retrieval quality and operational complexity.

2. Problem Statement

Business Problem

Standard batch-ingested RAG systems have knowledge that is hours or days stale. For many enterprise use cases this is acceptable; for many others it is not. Customer service assistants need to know today's pricing, promotions, and service status. Operations assistants need to know the current status of incidents, deployments, and SLA breaches. Financial services assistants need to know this morning's rate changes and regulatory updates. A RAG system that cannot reflect these changes in real-time is operationally unusable for these use cases.

Technical Problem

The batch ingestion architecture of standard RAG introduces inherent latency between when a document is created or updated and when it becomes retrievable. Document crawling, format parsing, chunking, embedding, and index upsert are sequential pipeline steps that typically take minutes to hours per document in a batch schedule. Event-driven and streaming systems demand a different architecture: an ingestion pipeline triggered by events rather than schedules, optimised for low-latency single-document processing rather than high-throughput batch processing.

Additionally, real-time data creates a freshness/quality tension: a document ingested seconds ago has not undergone the quality validation, editorial review, or metadata enrichment that a batch-ingested document has. The retrieval strategy must account for this — freshness and quality are both relevant signals for ranking, and the appropriate balance depends on the use case.

Symptoms

AI assistant gives incorrect answers because its knowledge is stale (e.g., cites a policy that was updated this morning)
Customer service AI provides yesterday's pricing to customers who have received today's promotional offer
Operations runbook AI cannot answer questions about the current production incident because the incident report was just created
News and media AI continues to reference an old version of a story after a material retraction was published

Cost of Inaction

Customer complaints and churn from incorrect information delivery in real-time service channels
Regulatory exposure: financial services AI providing stale rate or compliance data
Operational risk: incident response AI grounded in stale runbooks during a live incident
Trust erosion: users who notice knowledge staleness lose confidence in the AI assistant and revert to manual lookup

3. Context

When to Apply

AI assistants for customer-facing channels where product, pricing, or service data changes frequently (daily or faster)
Operations and incident response assistants where runbook and incident data must be current
Financial services AI where market data, rate sheets, and regulatory alerts must be reflected within minutes
News, media, and content platforms where articles, corrections, and editorial updates are continuous
IoT and industrial operations where equipment status, alerts, and sensor thresholds change continuously

When NOT to Apply

Knowledge corpora that change less frequently than once per day (standard batch ingestion is simpler and less expensive)
Use cases where answer freshness is not a user requirement (historical research, legal document review)
Environments lacking a mature event streaming infrastructure (Kafka, Kinesis) — the operational overhead of deploying streaming infrastructure solely for RAG is rarely justified
Use cases where raw streaming data quality is insufficient for grounded AI generation (malformed events, incomplete records)

Prerequisites

An event streaming platform (Apache Kafka, Amazon Kinesis, Azure Event Hubs, Google Pub/Sub) producing document change events
Change Data Capture (CDC) tooling for database-originated changes (Debezium, AWS DMS)
A vector database that supports low-latency upsert operations (Weaviate, Qdrant, Pinecone — all support real-time upsert)
An embedding inference service with low-latency API (< 100ms per chunk) to avoid ingestion lag
Freshness metadata (created_at, updated_at timestamps) that must be indexed alongside vectors

Industry Applicability

Industry	Event Source	Acceptable Freshness Lag	Primary Use Case
Financial Services	Market data feeds, regulatory alert APIs	Seconds to minutes	Rate and compliance queries
E-commerce / Retail	Product catalogue events, inventory CDC	Minutes	Pricing and availability Q&A
Operations / DevOps	Incident management webhooks, monitoring alerts	Seconds	Live incident runbook assistance
Healthcare	Clinical note CDC, pharmacy formulary updates	Minutes to hours	Clinical decision support
News / Media	CMS publish webhooks	Seconds	News article Q&A and summarisation
IoT / Manufacturing	Sensor alert streams, equipment status CDC	Seconds	Equipment status and troubleshooting

4. Architecture Overview

Streaming RAG introduces an event-driven ingestion pipeline that runs continuously alongside (or instead of) the batch ingestion pipeline. The two pipelines share the same vector database but differ in their triggering, throughput optimisation, and quality guarantees.

Event-Driven Ingestion Pipeline

The event-driven pipeline is triggered by document change events rather than a schedule. Events are consumed from a streaming platform (Kafka topic, Kinesis stream, or event hub). Each event carries the document content (or a reference to it) and a change type (CREATE, UPDATE, DELETE). The pipeline processes events through the standard chunking → embedding → upsert sequence but is optimised for low per-event latency rather than high batch throughput.

Key design decisions for the event-driven pipeline:

Exactly-once semantics: document CREATE and UPDATE events must be processed exactly once to avoid duplicate vectors. Use idempotent upserts (upsert by document ID, not insert) and transactional offset commits in Kafka.
DELETE propagation: when a document is deleted, all its chunks must be deleted from the vector index. This requires a chunk-to-document mapping in a metadata store (the document store serves this purpose).
Partial document updates: if only a section of a document changes, a partial update strategy re-chunks and re-embeds only the changed sections, significantly reducing embedding compute cost for large documents.
Event ordering: out-of-order events (UPDATE before CREATE due to partition skew) must be handled. Idempotent upserts naturally handle this; version numbers in event payloads allow stale-write detection.

Freshness-Aware Retrieval

The retrieval layer must incorporate document freshness as a ranking signal alongside semantic relevance. Two approaches are common:

Hard time-window filtering: pre-retrieval metadata filter restricts results to documents updated within a configurable time window (e.g., last 24 hours for market data; last 7 days for policies). This is simple and guarantees freshness but may exclude relevant older documents.
Freshness-decay scoring: the retrieval score is multiplied by a time-decay function exp(-λ × age_in_days) where λ controls the decay rate. Freshness is a soft preference, not a hard cut-off. Documents updated in the last hour score near-full relevance; documents updated a month ago have discounted relevance.

The appropriate strategy depends on the use case: real-time market data use cases benefit from hard time-window filtering; policy and procedure use cases (where a 3-year-old policy is still valid) benefit from freshness-decay scoring.

Index Consistency During Real-Time Upserts

Real-time vector upserts create a brief window of index inconsistency while the HNSW index is being updated. Most enterprise vector databases handle concurrent reads and writes gracefully (MVCC-style), but very high-throughput streaming scenarios may require index partitioning strategies (e.g., separate "hot" index for recent documents, "cold" index for stable historical documents) with a merge retrieval strategy. The hot/cold partition approach also enables different HNSW tuning parameters: the hot index is tuned for high write throughput; the cold index is tuned for high recall.

Quality Considerations for Streaming Data

Raw event data is often lower quality than batch-processed documents: events may be incomplete, may contain duplicate or near-duplicate content, or may represent intermediate states (a draft document, an unreviewed news post). A lightweight quality filter at the event consumer should enforce minimum quality thresholds (minimum document length, no-duplicate check via content hash, optional classification check) before triggering embedding and indexing.

5. Architecture Diagram

ARCHITECTURE DIAGRAM

flowchart TD subgraph Events["Event Sources"] A[Kafka / Kinesis Stream] B[Batch Ingestion] end subgraph Ingestion["Streaming Ingestion"] C[Stream Consumer] D[Hot Vector Index] E[Cold Vector Index] end subgraph Query["Freshness-Aware Query"] F[User Query] G[Freshness Filter] H[Merge + Rerank] I[LLM Generation] end A -->|exactly-once upsert| C B --> E C --> D D --> H E --> H F --> G --> H --> I --> F style A fill:#dbeafe,stroke:#3b82f6 style B fill:#dbeafe,stroke:#3b82f6 style C fill:#f0fdf4,stroke:#22c55e style D fill:#fef9c3,stroke:#eab308 style E fill:#fef9c3,stroke:#eab308 style F fill:#dbeafe,stroke:#3b82f6 style G fill:#f0fdf4,stroke:#22c55e style H fill:#f0fdf4,stroke:#22c55e style I fill:#d1fae5,stroke:#10b981

6. Components

Component	Type	Responsibility	Technology Options	Criticality
Event Streaming Platform	Infrastructure	Durable, ordered event log for document change events	Apache Kafka, Amazon Kinesis, Azure Event Hubs, Google Pub/Sub	Critical
CDC Connector	Integration	Capture row-level changes from databases as events	Debezium, AWS DMS, Azure Data Factory CDC mode	High
Stream Consumer	Application	Consume events; commit offsets with exactly-once semantics	Kafka Consumer (Python), Faust, Apache Flink	Critical
Quality Filter	Data Quality	Reject malformed, duplicate, or below-threshold events	Custom Python; great_expectations; pydantic validators	High
Real-Time Chunker	Data Processing	Chunk document on arrival; support partial update chunking	LangChain splitters; custom partial-update diff algorithm	High
Embedding Service	ML Inference	Low-latency embedding API (< 100ms per chunk)	OpenAI embeddings API, Google Vertex AI embeddings, self-hosted sentence-transformers	Critical
Hot Vector Index	Storage	Receive real-time upserts; optimised for write throughput	Weaviate (live index), Qdrant (real-time segment), Pinecone upsert	Critical
Cold Vector Index	Storage	Stable archival index for historical documents; optimised for recall	pgvector, OpenSearch, Pinecone (separate index)	High
Merge Retrieval Layer	Orchestration	Query both hot and cold indexes; combine and deduplicate results	Custom Python; LangChain custom retriever	High
Freshness Decay Scorer	Algorithm	Apply time-decay function to re-weight retrieved chunks by age	Custom Python (configurable λ parameter)	Medium
As-Of Timestamp Tracker	Observability	Record and surface the knowledge cutoff timestamp for each response	Custom metadata; surfaced in response header	High

7. Data Flow

Primary Flow

Step	Actor	Action	Output
1	Source System	Publish document change event (CREATE/UPDATE/DELETE)	Kafka message: `{event_type, doc_id, content, metadata, timestamp}`
2	Stream Consumer	Consume event; validate schema; deduplicate via content hash	Validated event or SKIP (duplicate)
3	Quality Filter	Check minimum length, classification, and formatting	Quality-passed event
4	Real-Time Chunker	Chunk document; for UPDATE events, diff against last-known chunk set	New/modified chunks with doc_id and metadata
5	Embedding Service	Embed each chunk (< 100ms per chunk target)	Chunk vectors
6	Hot Vector Index	Idempotent upsert: if chunk_id exists, replace; else insert	Vector index updated; stale chunks removed
7	Hot Index Aging Job	Periodic job: move documents older than 7 days from hot to cold index	Cold index updated; hot index cleaned
8	User	Submit query	Query string
9	Query Processor	Expand query	Enhanced query
10	Freshness Strategy	Determine retrieval strategy: hard window or decay parameter	Filter expression or decay λ
11	Merge Retrieval	Query both hot and cold indexes; combine results with freshness weighting	Merged, freshness-weighted candidate list
12	Re-ranker	Apply cross-encoder on freshness × relevance combined score	Top-N re-ranked candidates
13	Context Assembler	Assemble prompt; include "as-of" timestamp in system context	Assembled prompt with freshness indicator
14	LLM	Generate answer	Response
15	Response Delivery	Return answer with "Knowledge current as of [timestamp]" indicator	Final response

Error Flow

Error Condition	Detection	Recovery
Embedding service latency spike (> 500ms)	Per-event latency monitoring	Increase consumer lag; alert; route to backup embedding service
Consumer lag growing (events accumulating)	Kafka consumer group lag monitoring	Scale consumer instances; alert on lag threshold breach
Hot index write conflict	Vector DB error response	Retry with exponential backoff (3 retries); dead-letter queue
Stale DELETE event arriving after new CREATE	Version check in event metadata	Reject stale DELETE using version comparison; log discarded event

8. Security Considerations

Real-Time ACL Update Propagation

In streaming RAG, documents can be ingested faster than ACL metadata is normalised. The stream consumer must reject document events that lack ACL metadata rather than ingesting with empty ACLs (which would make the document accessible to everyone). A document with no ACL metadata is ingested into a "pending ACL" queue and indexed only once ACL metadata is resolved.

Event Stream Security

The event streaming platform must enforce topic-level access control (Kafka ACLs or IAM policies for managed services). The embedding service API key must be stored in a secrets manager and rotated regularly. Event payloads containing highly sensitive content should be encrypted in transit with envelope encryption.

OWASP LLM Top 10 Mitigations

OWASP LLM Risk	Streaming-Specific Concern	Mitigation
LLM01: Prompt Injection	Malicious content injected into a high-priority event stream to influence AI responses	Content sanitisation in quality filter; treat all event content as untrusted
LLM04: Model Denial of Service	Event flood (high-rate update events) causes embedding service overload	Rate limiting at event consumer; embedding service auto-scaling with circuit breaker
LLM09: Overreliance	User assumes response reflects very recent events; system may still have lag	Surface "as-of" timestamp in every response; document the ingestion lag SLA

9. Governance Considerations

Freshness SLA Governance

The ingestion lag SLA (time from event publication to document retrievability) must be formally agreed and monitored. For each source type, the SLA should be documented in the Source Inventory (EAAPL-KNW003). Breach of the freshness SLA for Tier 1 sources must trigger an automated alert.

Governance Artefacts

Artefact	Owner	Frequency	Purpose
Ingestion Lag SLA	Data Governance	Per source type; reviewed quarterly	Agreed freshness commitments per source
Consumer Lag Dashboard	AI Operations	Continuous	Real-time view of ingestion pipeline health
Stale Response Incident Log	AI Operations	Per incident	Track cases where stale knowledge led to incorrect responses
Hot/Cold Index Merge Audit	Data Engineering	Weekly	Verify documents correctly migrated from hot to cold index

10. Operational Considerations

Monitoring

Metric	Alert Threshold	Notes
Event consumer lag (messages behind)	> 1,000 messages (Tier 1)	Scale consumer; alert ops team
Embedding API P99 latency	> 200ms	Check embedding service; scale if needed
Hot index write latency P99	> 50ms	Check vector DB cluster health
As-of timestamp lag (latest indexed doc age)	> 5 min (Tier 1 sources)	Freshness SLA breach; alert
Dead-letter queue depth	> 10 messages	Failed events require investigation

Service Level Objectives

SLO	Target	Measurement
End-to-end ingestion lag (event → retrievable)	≤ 60 seconds (Tier 1), ≤ 5 minutes (Tier 2)	Synthetic event latency probe
Hot index availability	≥ 99.95%	Health check
Consumer availability (no lag growth)	≥ 99.9%	Lag monitoring

Disaster Recovery

Component	RTO	RPO	DR Strategy
Event Streaming Platform	15 min (managed)	0 (durable log)	Managed service with multi-AZ; retain messages for 7 days
Hot Vector Index	30 min	1 hour	Snapshot to cold storage; replay events from Kafka from last snapshot offset
Stream Consumer	5 min	0 (Kafka offset replay)	Auto-scaling group; Kubernetes pod restart

11. Cost Considerations

Cost Drivers

Cost Driver	Notes	Optimisation
Embedding API (real-time, per-event)	Real-time calls are more expensive than batch (no batch discount)	Batch micro-events (group events within a 5-second window)
Kafka / streaming platform	Managed streaming services ($0.02–$0.10 per GB throughput)	Right-size retention period; use tiered storage for cold Kafka data
Hot index write throughput	High write-throughput vector DB configuration costs more	Tune hot index for write throughput; migrate to cold index promptly
Embedding for deleted documents	Deletion events do not require embedding but do require chunk lookup	Maintain chunk-to-document mapping for O(1) deletion

Indicative Cost Range

Event Volume	Incremental Cost vs. Batch RAG	Notes
< 1,000 events/day	+$200–$500/month	Low-volume streaming; streaming infra fixed cost dominates
1,000–50,000 events/day	+$500–$2,000/month	Moderate streaming; embedding cost becomes significant
> 50,000 events/day	+$2,000–$10,000/month	High-volume; batch micro-event grouping essential

12. Trade-Off Analysis

Freshness vs. Retrieval Quality

Approach	Freshness	Retrieval Quality Risk	Complexity
Hard time-window filter	Guaranteed (only recent docs)	High (may miss relevant older docs)	Low
Freshness-decay scoring	Soft preference for recent docs	Low (older docs still retrievable)	Medium
Hot/cold index merge	Balanced (recent = hot; historical = cold)	Low	High

Event Granularity for Streaming

Granularity	Ingestion Cost	Index Churn	Recommended For
Full document on every update	High (re-embed all chunks)	High	Small documents (< 1,000 tokens)
Section-level diff (changed sections only)	Low	Low	Large documents (policies, manuals)
Event field payload (structured data update)	Very Low	Very Low	Database record changes (pricing, inventory)

Architectural Tensions

Tension	Trade-off	Recommendation
Real-time freshness vs. quality validation	Instant ingestion: fresh but unvalidated; delayed ingestion: validated but stale	Ingest immediately; tag as "unreviewed" in metadata; surface review status to users
Hot index write throughput vs. recall	High write throughput: lower HNSW ef_construction → lower recall; rebuild pauses index	Separate hot/cold indexes with different HNSW parameters

13. Failure Modes

Failure Mode	Likelihood	Impact	Detection	Recovery
Embedding service rate limit during event surge	Medium	High	Consumer lag + embedding error rate	Pre-provision embedding service quota; circuit breaker routes excess to batch queue
Out-of-order DELETE before CREATE	Low	Medium	Version check in consumer	Idempotent upsert handles gracefully; DELETE is no-op if doc not present
Hot index capacity exhaustion	Medium	High	Index storage utilisation monitoring	Automated migration of old docs to cold index; alert at 80% capacity
Duplicate events causing duplicate vectors	Medium	Low	Content hash deduplication in quality filter	Idempotent upsert prevents duplicate vectors even if filter fails
Consumer reprocessing on offset reset causes full re-index	Low	Medium	Consumer offset monitoring	Idempotent upserts make reprocessing safe; monitor for unexpected full-corpus re-index

14. Regulatory Considerations

Regulation	Requirement	Streaming RAG Response
MiFID II (EU)	Financial data must be current; obligations around data timestamping	As-of timestamp surfaced in every response; ingestion lag SLA ≤ 5 minutes for rate data
APRA CPS 230	Operational resilience; recovery time for critical services	Streaming platform multi-AZ; Kafka message retention enables replay on outage
Privacy Act 1988	Real-time CDC may capture personal information in database changes	CDC PII filter applied before Kafka topic ingestion; personal data changes not ingested to RAG

15. Reference Implementations

AWS

CDC: AWS DMS → Amazon Kinesis Data Streams
Streaming consumer: AWS Lambda (Kinesis trigger) + Amazon Bedrock Titan Embeddings
Hot index: Amazon OpenSearch Service (real-time upsert)
Cold index: Amazon Aurora pgvector
Freshness decay: Lambda post-processor

Azure

CDC: Azure Data Factory CDC → Azure Event Hubs
Streaming consumer: Azure Functions (Event Hub trigger) + Azure OpenAI embeddings
Hot/cold index: Azure AI Search (hot namespace + cold namespace)
Merge layer: Azure Function combining hot/cold results

GCP

CDC: Datastream → Google Pub/Sub
Streaming consumer: Cloud Run (Pub/Sub push) + Vertex AI embeddings
Hot index: Vertex AI Vector Search (real-time upsert supported)
Cold index: AlloyDB pgvector

Pattern ID	Pattern Name	Relationship
EAAPL-RAG001	Enterprise RAG	Foundation; RAG006 replaces the ingestion pipeline component
EAAPL-RAG003	Secure RAG	ACL enforcement applied at streaming ingestion; mandatory overlay
EAAPL-RAG007	Agentic RAG	Agentic loops may trigger targeted real-time indexing for discovered sources
EAAPL-KNW003	AI Knowledge Corpus Management	Corpus management policies apply to streaming corpora; freshness SLA is a corpus management concern

17. Maturity Assessment

Overall Maturity: Proven — Event-driven RAG ingestion is deployed in production at financial services and e-commerce scale; freshness-aware retrieval patterns are well-understood; hot/cold index partitioning is operationally mature.

Dimension	Score (1–5)	Rationale
Technology Readiness	4	All components production-ready; partial-update chunking is the least mature component
Tooling Ecosystem	4	Kafka, Kinesis, and vector DB upsert are mature; Flink/Faust for stream processing are production-grade
Operational Guidance	3	Consumer lag management and hot/cold migration tuning require Kafka and vector DB expertise
Security & Compliance	3	CDC PII filtering and ACL propagation in streaming are complex; less tooling than batch equivalents
Scalability Evidence	4	Financial services deployments at millions of events/day are documented
Cost Predictability	3	Real-time embedding costs are variable; event volume spikes can cause cost overruns

18. Revision History

Version	Date	Author	Changes
1.0	2024-05-01	EAAPL Working Group	Initial publication
1.1	2025-03-15	EAAPL Working Group	Hot/cold index partitioning formalised; partial-update chunking strategy added; freshness-decay formula documented

Track this pattern for APRA/ASIC review

← Back to Library More Retrieval-Augmented Generation →