[EAAPL-RAG006] Streaming Retrieval-Augmented Generation
Category: Artificial Intelligence / Retrieval-Augmented Generation
Sub-category: Real-Time and Event-Driven RAG
Version: 1.1
Maturity: Proven
Tags: rag streaming real-time event-driven freshness kafka time-windowed-retrieval change-data-capture
Regulatory Relevance: APRA CPS230 (operational resilience for real-time systems), MiFID II (real-time data obligations in financial services), ISO/IEC 42001 Section 8.4
1. Executive Summary
Streaming RAG extends the foundational RAG architecture to continuously ingest and index data from real-time event streams, enabling language model responses to be grounded in current, up-to-the-minute knowledge. While standard RAG operates on a periodically refreshed static corpus, Streaming RAG treats the knowledge index as a continuously evolving data product where events (new documents, document updates, market data changes, operational alerts) are reflected in retrievable context within seconds to minutes of occurrence.
For business leaders in time-sensitive domains — financial trading, real-time customer operations, industrial IoT, news and media — the distinction between a RAG system that knows what happened yesterday versus one that knows what happened five minutes ago can determine whether the AI assistant is fit for operational use. A customer service AI grounded only in yesterday's product data will give incorrect answers about today's promotional pricing. A risk management AI that cannot access this morning's regulatory alerts may miss a material compliance event. Streaming RAG closes this gap by making knowledge freshness a first-class architectural concern, traded off explicitly against retrieval quality and operational complexity.
2. Problem Statement
Business Problem
Standard batch-ingested RAG systems have knowledge that is hours or days stale. For many enterprise use cases this is acceptable; for many others it is not. Customer service assistants need to know today's pricing, promotions, and service status. Operations assistants need to know the current status of incidents, deployments, and SLA breaches. Financial services assistants need to know this morning's rate changes and regulatory updates. A RAG system that cannot reflect these changes in real-time is operationally unusable for these use cases.
Technical Problem
The batch ingestion architecture of standard RAG introduces inherent latency between when a document is created or updated and when it becomes retrievable. Document crawling, format parsing, chunking, embedding, and index upsert are sequential pipeline steps that typically take minutes to hours per document in a batch schedule. Event-driven and streaming systems demand a different architecture: an ingestion pipeline triggered by events rather than schedules, optimised for low-latency single-document processing rather than high-throughput batch processing.
Additionally, real-time data creates a freshness/quality tension: a document ingested seconds ago has not undergone the quality validation, editorial review, or metadata enrichment that a batch-ingested document has. The retrieval strategy must account for this — freshness and quality are both relevant signals for ranking, and the appropriate balance depends on the use case.
Symptoms
- AI assistant gives incorrect answers because its knowledge is stale (e.g., cites a policy that was updated this morning)
- Customer service AI provides yesterday's pricing to customers who have received today's promotional offer
- Operations runbook AI cannot answer questions about the current production incident because the incident report was just created
- News and media AI continues to reference an old version of a story after a material retraction was published
Cost of Inaction
- Customer complaints and churn from incorrect information delivery in real-time service channels
- Regulatory exposure: financial services AI providing stale rate or compliance data
- Operational risk: incident response AI grounded in stale runbooks during a live incident
- Trust erosion: users who notice knowledge staleness lose confidence in the AI assistant and revert to manual lookup
3. Context
When to Apply
- AI assistants for customer-facing channels where product, pricing, or service data changes frequently (daily or faster)
- Operations and incident response assistants where runbook and incident data must be current
- Financial services AI where market data, rate sheets, and regulatory alerts must be reflected within minutes
- News, media, and content platforms where articles, corrections, and editorial updates are continuous
- IoT and industrial operations where equipment status, alerts, and sensor thresholds change continuously
When NOT to Apply
- Knowledge corpora that change less frequently than once per day (standard batch ingestion is simpler and less expensive)
- Use cases where answer freshness is not a user requirement (historical research, legal document review)
- Environments lacking a mature event streaming infrastructure (Kafka, Kinesis) — the operational overhead of deploying streaming infrastructure solely for RAG is rarely justified
- Use cases where raw streaming data quality is insufficient for grounded AI generation (malformed events, incomplete records)
Prerequisites
- An event streaming platform (Apache Kafka, Amazon Kinesis, Azure Event Hubs, Google Pub/Sub) producing document change events
- Change Data Capture (CDC) tooling for database-originated changes (Debezium, AWS DMS)
- A vector database that supports low-latency upsert operations (Weaviate, Qdrant, Pinecone — all support real-time upsert)
- An embedding inference service with low-latency API (< 100ms per chunk) to avoid ingestion lag
- Freshness metadata (created_at, updated_at timestamps) that must be indexed alongside vectors
Industry Applicability
| Industry |
Event Source |
Acceptable Freshness Lag |
Primary Use Case |
| Financial Services |
Market data feeds, regulatory alert APIs |
Seconds to minutes |
Rate and compliance queries |
| E-commerce / Retail |
Product catalogue events, inventory CDC |
Minutes |
Pricing and availability Q&A |
| Operations / DevOps |
Incident management webhooks, monitoring alerts |
Seconds |
Live incident runbook assistance |
| Healthcare |
Clinical note CDC, pharmacy formulary updates |
Minutes to hours |
Clinical decision support |
| News / Media |
CMS publish webhooks |
Seconds |
News article Q&A and summarisation |
| IoT / Manufacturing |
Sensor alert streams, equipment status CDC |
Seconds |
Equipment status and troubleshooting |
4. Architecture Overview
Streaming RAG introduces an event-driven ingestion pipeline that runs continuously alongside (or instead of) the batch ingestion pipeline. The two pipelines share the same vector database but differ in their triggering, throughput optimisation, and quality guarantees.
Event-Driven Ingestion Pipeline
The event-driven pipeline is triggered by document change events rather than a schedule. Events are consumed from a streaming platform (Kafka topic, Kinesis stream, or event hub). Each event carries the document content (or a reference to it) and a change type (CREATE, UPDATE, DELETE). The pipeline processes events through the standard chunking → embedding → upsert sequence but is optimised for low per-event latency rather than high batch throughput.
Key design decisions for the event-driven pipeline:
- Exactly-once semantics: document CREATE and UPDATE events must be processed exactly once to avoid duplicate vectors. Use idempotent upserts (upsert by document ID, not insert) and transactional offset commits in Kafka.
- DELETE propagation: when a document is deleted, all its chunks must be deleted from the vector index. This requires a chunk-to-document mapping in a metadata store (the document store serves this purpose).
- Partial document updates: if only a section of a document changes, a partial update strategy re-chunks and re-embeds only the changed sections, significantly reducing embedding compute cost for large documents.
- Event ordering: out-of-order events (UPDATE before CREATE due to partition skew) must be handled. Idempotent upserts naturally handle this; version numbers in event payloads allow stale-write detection.
Freshness-Aware Retrieval
The retrieval layer must incorporate document freshness as a ranking signal alongside semantic relevance. Two approaches are common:
- Hard time-window filtering: pre-retrieval metadata filter restricts results to documents updated within a configurable time window (e.g., last 24 hours for market data; last 7 days for policies). This is simple and guarantees freshness but may exclude relevant older documents.
- Freshness-decay scoring: the retrieval score is multiplied by a time-decay function
exp(-λ × age_in_days) where λ controls the decay rate. Freshness is a soft preference, not a hard cut-off. Documents updated in the last hour score near-full relevance; documents updated a month ago have discounted relevance.
The appropriate strategy depends on the use case: real-time market data use cases benefit from hard time-window filtering; policy and procedure use cases (where a 3-year-old policy is still valid) benefit from freshness-decay scoring.
Index Consistency During Real-Time Upserts
Real-time vector upserts create a brief window of index inconsistency while the HNSW index is being updated. Most enterprise vector databases handle concurrent reads and writes gracefully (MVCC-style), but very high-throughput streaming scenarios may require index partitioning strategies (e.g., separate "hot" index for recent documents, "cold" index for stable historical documents) with a merge retrieval strategy. The hot/cold partition approach also enables different HNSW tuning parameters: the hot index is tuned for high write throughput; the cold index is tuned for high recall.
Quality Considerations for Streaming Data
Raw event data is often lower quality than batch-processed documents: events may be incomplete, may contain duplicate or near-duplicate content, or may represent intermediate states (a draft document, an unreviewed news post). A lightweight quality filter at the event consumer should enforce minimum quality thresholds (minimum document length, no-duplicate check via content hash, optional classification check) before triggering embedding and indexing.
5. Architecture Diagram
flowchart TD
subgraph Events["Event Sources"]
A[Kafka / Kinesis Stream]
B[Batch Ingestion]
end
subgraph Ingestion["Streaming Ingestion"]
C[Stream Consumer]
D[Hot Vector Index]
E[Cold Vector Index]
end
subgraph Query["Freshness-Aware Query"]
F[User Query]
G[Freshness Filter]
H[Merge + Rerank]
I[LLM Generation]
end
A -->|exactly-once upsert| C
B --> E
C --> D
D --> H
E --> H
F --> G --> H --> I --> F
style A fill:#dbeafe,stroke:#3b82f6
style B fill:#dbeafe,stroke:#3b82f6
style C fill:#f0fdf4,stroke:#22c55e
style D fill:#fef9c3,stroke:#eab308
style E fill:#fef9c3,stroke:#eab308
style F fill:#dbeafe,stroke:#3b82f6
style G fill:#f0fdf4,stroke:#22c55e
style H fill:#f0fdf4,stroke:#22c55e
style I fill:#d1fae5,stroke:#10b981
6. Components
| Component |
Type |
Responsibility |
Technology Options |
Criticality |
| Event Streaming Platform |
Infrastructure |
Durable, ordered event log for document change events |
Apache Kafka, Amazon Kinesis, Azure Event Hubs, Google Pub/Sub |
Critical |
| CDC Connector |
Integration |
Capture row-level changes from databases as events |
Debezium, AWS DMS, Azure Data Factory CDC mode |
High |
| Stream Consumer |
Application |
Consume events; commit offsets with exactly-once semantics |
Kafka Consumer (Python), Faust, Apache Flink |
Critical |
| Quality Filter |
Data Quality |
Reject malformed, duplicate, or below-threshold events |
Custom Python; great_expectations; pydantic validators |
High |
| Real-Time Chunker |
Data Processing |
Chunk document on arrival; support partial update chunking |
LangChain splitters; custom partial-update diff algorithm |
High |
| Embedding Service |
ML Inference |
Low-latency embedding API (< 100ms per chunk) |
OpenAI embeddings API, Google Vertex AI embeddings, self-hosted sentence-transformers |
Critical |
| Hot Vector Index |
Storage |
Receive real-time upserts; optimised for write throughput |
Weaviate (live index), Qdrant (real-time segment), Pinecone upsert |
Critical |
| Cold Vector Index |
Storage |
Stable archival index for historical documents; optimised for recall |
pgvector, OpenSearch, Pinecone (separate index) |
High |
| Merge Retrieval Layer |
Orchestration |
Query both hot and cold indexes; combine and deduplicate results |
Custom Python; LangChain custom retriever |
High |
| Freshness Decay Scorer |
Algorithm |
Apply time-decay function to re-weight retrieved chunks by age |
Custom Python (configurable λ parameter) |
Medium |
| As-Of Timestamp Tracker |
Observability |
Record and surface the knowledge cutoff timestamp for each response |
Custom metadata; surfaced in response header |
High |
7. Data Flow
Primary Flow
| Step |
Actor |
Action |
Output |
| 1 |
Source System |
Publish document change event (CREATE/UPDATE/DELETE) |
Kafka message: {event_type, doc_id, content, metadata, timestamp} |
| 2 |
Stream Consumer |
Consume event; validate schema; deduplicate via content hash |
Validated event or SKIP (duplicate) |
| 3 |
Quality Filter |
Check minimum length, classification, and formatting |
Quality-passed event |
| 4 |
Real-Time Chunker |
Chunk document; for UPDATE events, diff against last-known chunk set |
New/modified chunks with doc_id and metadata |
| 5 |
Embedding Service |
Embed each chunk (< 100ms per chunk target) |
Chunk vectors |
| 6 |
Hot Vector Index |
Idempotent upsert: if chunk_id exists, replace; else insert |
Vector index updated; stale chunks removed |
| 7 |
Hot Index Aging Job |
Periodic job: move documents older than 7 days from hot to cold index |
Cold index updated; hot index cleaned |
| 8 |
User |
Submit query |
Query string |
| 9 |
Query Processor |
Expand query |
Enhanced query |
| 10 |
Freshness Strategy |
Determine retrieval strategy: hard window or decay parameter |
Filter expression or decay λ |
| 11 |
Merge Retrieval |
Query both hot and cold indexes; combine results with freshness weighting |
Merged, freshness-weighted candidate list |
| 12 |
Re-ranker |
Apply cross-encoder on freshness × relevance combined score |
Top-N re-ranked candidates |
| 13 |
Context Assembler |
Assemble prompt; include "as-of" timestamp in system context |
Assembled prompt with freshness indicator |
| 14 |
LLM |
Generate answer |
Response |
| 15 |
Response Delivery |
Return answer with "Knowledge current as of [timestamp]" indicator |
Final response |
Error Flow
| Error Condition |
Detection |
Recovery |
| Embedding service latency spike (> 500ms) |
Per-event latency monitoring |
Increase consumer lag; alert; route to backup embedding service |
| Consumer lag growing (events accumulating) |
Kafka consumer group lag monitoring |
Scale consumer instances; alert on lag threshold breach |
| Hot index write conflict |
Vector DB error response |
Retry with exponential backoff (3 retries); dead-letter queue |
| Stale DELETE event arriving after new CREATE |
Version check in event metadata |
Reject stale DELETE using version comparison; log discarded event |
8. Security Considerations
Real-Time ACL Update Propagation
In streaming RAG, documents can be ingested faster than ACL metadata is normalised. The stream consumer must reject document events that lack ACL metadata rather than ingesting with empty ACLs (which would make the document accessible to everyone). A document with no ACL metadata is ingested into a "pending ACL" queue and indexed only once ACL metadata is resolved.
Event Stream Security
The event streaming platform must enforce topic-level access control (Kafka ACLs or IAM policies for managed services). The embedding service API key must be stored in a secrets manager and rotated regularly. Event payloads containing highly sensitive content should be encrypted in transit with envelope encryption.
OWASP LLM Top 10 Mitigations
| OWASP LLM Risk |
Streaming-Specific Concern |
Mitigation |
| LLM01: Prompt Injection |
Malicious content injected into a high-priority event stream to influence AI responses |
Content sanitisation in quality filter; treat all event content as untrusted |
| LLM04: Model Denial of Service |
Event flood (high-rate update events) causes embedding service overload |
Rate limiting at event consumer; embedding service auto-scaling with circuit breaker |
| LLM09: Overreliance |
User assumes response reflects very recent events; system may still have lag |
Surface "as-of" timestamp in every response; document the ingestion lag SLA |
9. Governance Considerations
Freshness SLA Governance
The ingestion lag SLA (time from event publication to document retrievability) must be formally agreed and monitored. For each source type, the SLA should be documented in the Source Inventory (EAAPL-KNW003). Breach of the freshness SLA for Tier 1 sources must trigger an automated alert.
Governance Artefacts
| Artefact |
Owner |
Frequency |
Purpose |
| Ingestion Lag SLA |
Data Governance |
Per source type; reviewed quarterly |
Agreed freshness commitments per source |
| Consumer Lag Dashboard |
AI Operations |
Continuous |
Real-time view of ingestion pipeline health |
| Stale Response Incident Log |
AI Operations |
Per incident |
Track cases where stale knowledge led to incorrect responses |
| Hot/Cold Index Merge Audit |
Data Engineering |
Weekly |
Verify documents correctly migrated from hot to cold index |
10. Operational Considerations
Monitoring
| Metric |
Alert Threshold |
Notes |
| Event consumer lag (messages behind) |
> 1,000 messages (Tier 1) |
Scale consumer; alert ops team |
| Embedding API P99 latency |
> 200ms |
Check embedding service; scale if needed |
| Hot index write latency P99 |
> 50ms |
Check vector DB cluster health |
| As-of timestamp lag (latest indexed doc age) |
> 5 min (Tier 1 sources) |
Freshness SLA breach; alert |
| Dead-letter queue depth |
> 10 messages |
Failed events require investigation |
Service Level Objectives
| SLO |
Target |
Measurement |
| End-to-end ingestion lag (event → retrievable) |
≤ 60 seconds (Tier 1), ≤ 5 minutes (Tier 2) |
Synthetic event latency probe |
| Hot index availability |
≥ 99.95% |
Health check |
| Consumer availability (no lag growth) |
≥ 99.9% |
Lag monitoring |
Disaster Recovery
| Component |
RTO |
RPO |
DR Strategy |
| Event Streaming Platform |
15 min (managed) |
0 (durable log) |
Managed service with multi-AZ; retain messages for 7 days |
| Hot Vector Index |
30 min |
1 hour |
Snapshot to cold storage; replay events from Kafka from last snapshot offset |
| Stream Consumer |
5 min |
0 (Kafka offset replay) |
Auto-scaling group; Kubernetes pod restart |
11. Cost Considerations
Cost Drivers
| Cost Driver |
Notes |
Optimisation |
| Embedding API (real-time, per-event) |
Real-time calls are more expensive than batch (no batch discount) |
Batch micro-events (group events within a 5-second window) |
| Kafka / streaming platform |
Managed streaming services ($0.02–$0.10 per GB throughput) |
Right-size retention period; use tiered storage for cold Kafka data |
| Hot index write throughput |
High write-throughput vector DB configuration costs more |
Tune hot index for write throughput; migrate to cold index promptly |
| Embedding for deleted documents |
Deletion events do not require embedding but do require chunk lookup |
Maintain chunk-to-document mapping for O(1) deletion |
Indicative Cost Range
| Event Volume |
Incremental Cost vs. Batch RAG |
Notes |
| < 1,000 events/day |
+$200–$500/month |
Low-volume streaming; streaming infra fixed cost dominates |
| 1,000–50,000 events/day |
+$500–$2,000/month |
Moderate streaming; embedding cost becomes significant |
| > 50,000 events/day |
+$2,000–$10,000/month |
High-volume; batch micro-event grouping essential |
12. Trade-Off Analysis
Freshness vs. Retrieval Quality
| Approach |
Freshness |
Retrieval Quality Risk |
Complexity |
| Hard time-window filter |
Guaranteed (only recent docs) |
High (may miss relevant older docs) |
Low |
| Freshness-decay scoring |
Soft preference for recent docs |
Low (older docs still retrievable) |
Medium |
| Hot/cold index merge |
Balanced (recent = hot; historical = cold) |
Low |
High |
Event Granularity for Streaming
| Granularity |
Ingestion Cost |
Index Churn |
Recommended For |
| Full document on every update |
High (re-embed all chunks) |
High |
Small documents (< 1,000 tokens) |
| Section-level diff (changed sections only) |
Low |
Low |
Large documents (policies, manuals) |
| Event field payload (structured data update) |
Very Low |
Very Low |
Database record changes (pricing, inventory) |
Architectural Tensions
| Tension |
Trade-off |
Recommendation |
| Real-time freshness vs. quality validation |
Instant ingestion: fresh but unvalidated; delayed ingestion: validated but stale |
Ingest immediately; tag as "unreviewed" in metadata; surface review status to users |
| Hot index write throughput vs. recall |
High write throughput: lower HNSW ef_construction → lower recall; rebuild pauses index |
Separate hot/cold indexes with different HNSW parameters |
13. Failure Modes
| Failure Mode |
Likelihood |
Impact |
Detection |
Recovery |
| Embedding service rate limit during event surge |
Medium |
High |
Consumer lag + embedding error rate |
Pre-provision embedding service quota; circuit breaker routes excess to batch queue |
| Out-of-order DELETE before CREATE |
Low |
Medium |
Version check in consumer |
Idempotent upsert handles gracefully; DELETE is no-op if doc not present |
| Hot index capacity exhaustion |
Medium |
High |
Index storage utilisation monitoring |
Automated migration of old docs to cold index; alert at 80% capacity |
| Duplicate events causing duplicate vectors |
Medium |
Low |
Content hash deduplication in quality filter |
Idempotent upsert prevents duplicate vectors even if filter fails |
| Consumer reprocessing on offset reset causes full re-index |
Low |
Medium |
Consumer offset monitoring |
Idempotent upserts make reprocessing safe; monitor for unexpected full-corpus re-index |
14. Regulatory Considerations
| Regulation |
Requirement |
Streaming RAG Response |
| MiFID II (EU) |
Financial data must be current; obligations around data timestamping |
As-of timestamp surfaced in every response; ingestion lag SLA ≤ 5 minutes for rate data |
| APRA CPS 230 |
Operational resilience; recovery time for critical services |
Streaming platform multi-AZ; Kafka message retention enables replay on outage |
| Privacy Act 1988 |
Real-time CDC may capture personal information in database changes |
CDC PII filter applied before Kafka topic ingestion; personal data changes not ingested to RAG |
15. Reference Implementations
AWS
- CDC: AWS DMS → Amazon Kinesis Data Streams
- Streaming consumer: AWS Lambda (Kinesis trigger) + Amazon Bedrock Titan Embeddings
- Hot index: Amazon OpenSearch Service (real-time upsert)
- Cold index: Amazon Aurora pgvector
- Freshness decay: Lambda post-processor
Azure
- CDC: Azure Data Factory CDC → Azure Event Hubs
- Streaming consumer: Azure Functions (Event Hub trigger) + Azure OpenAI embeddings
- Hot/cold index: Azure AI Search (hot namespace + cold namespace)
- Merge layer: Azure Function combining hot/cold results
GCP
- CDC: Datastream → Google Pub/Sub
- Streaming consumer: Cloud Run (Pub/Sub push) + Vertex AI embeddings
- Hot index: Vertex AI Vector Search (real-time upsert supported)
- Cold index: AlloyDB pgvector
| Pattern ID |
Pattern Name |
Relationship |
| EAAPL-RAG001 |
Enterprise RAG |
Foundation; RAG006 replaces the ingestion pipeline component |
| EAAPL-RAG003 |
Secure RAG |
ACL enforcement applied at streaming ingestion; mandatory overlay |
| EAAPL-RAG007 |
Agentic RAG |
Agentic loops may trigger targeted real-time indexing for discovered sources |
| EAAPL-KNW003 |
AI Knowledge Corpus Management |
Corpus management policies apply to streaming corpora; freshness SLA is a corpus management concern |
17. Maturity Assessment
Overall Maturity: Proven — Event-driven RAG ingestion is deployed in production at financial services and e-commerce scale; freshness-aware retrieval patterns are well-understood; hot/cold index partitioning is operationally mature.
| Dimension |
Score (1–5) |
Rationale |
| Technology Readiness |
4 |
All components production-ready; partial-update chunking is the least mature component |
| Tooling Ecosystem |
4 |
Kafka, Kinesis, and vector DB upsert are mature; Flink/Faust for stream processing are production-grade |
| Operational Guidance |
3 |
Consumer lag management and hot/cold migration tuning require Kafka and vector DB expertise |
| Security & Compliance |
3 |
CDC PII filtering and ACL propagation in streaming are complex; less tooling than batch equivalents |
| Scalability Evidence |
4 |
Financial services deployments at millions of events/day are documented |
| Cost Predictability |
3 |
Real-time embedding costs are variable; event volume spikes can cause cost overruns |
18. Revision History
| Version |
Date |
Author |
Changes |
| 1.0 |
2024-05-01 |
EAAPL Working Group |
Initial publication |
| 1.1 |
2025-03-15 |
EAAPL Working Group |
Hot/cold index partitioning formalised; partial-update chunking strategy added; freshness-decay formula documented |