Proven

EAAPL-INT004 — Real-Time AI Stream Processing

Tags: streaming real-time high-availability high-complexity Status: Proven | Version: 1.0 | Domain: Integration

1. Executive Summary

Real-Time AI Stream Processing applies AI inference directly to streaming data pipelines, enabling intelligent classification, anomaly detection, and contextual analysis at the speed of the data flow. Where batch AI processing applies intelligence after data has accumulated, stream processing applies intelligence as each event arrives — enabling real-time fraud detection, live sentiment analysis on customer interactions, and continuous compliance monitoring across operational data flows.

The pattern addresses the architectural tension inherent in AI stream processing: AI inference is orders of magnitude slower than typical stream processing operations. A fraud detection model may take 50–500ms to evaluate a single transaction; a stream processing engine designed for microsecond throughput creates a fundamental impedance mismatch. This pattern describes the architectural strategies to bridge that gap — stateless per-event inference, stateful context window accumulation, micro-batch grouping, and SLO-aligned architecture selection — enabling enterprises to get AI-quality intelligence at data-stream speed.

For CIOs and CTOs, the business case is clear: AI inference running 500ms after an event arrives prevents fraud, enables personalisation, and catches compliance violations. AI inference running 24 hours later, in a batch, is too late for any of those use cases.

2. Problem Statement

Business Problem

Business processes generate data that contains actionable signals in real time: a fraudulent transaction occurring right now, a customer expressing anger right now, a compliance violation happening right now. Extracting those signals after the fact — in overnight batch jobs — means the opportunity to act has passed. The fraud has completed. The customer has churned. The regulatory breach has occurred.

Technical Problem

AI inference and stream processing operate at incompatible throughput and latency profiles. Stream processing engines (Kafka, Flink, Spark) are designed to process millions of events per second. AI inference — even the fastest embedding or classification models — processes hundreds to thousands of events per second per worker. Naively inserting AI inference into a streaming pipeline creates a processing bottleneck that backs up the entire stream.

Symptoms

Fraud detection models run overnight on the previous day's transaction data; intraday fraud is caught hours or days after occurrence.
Customer sentiment analysis is available in weekly reports rather than during active service interactions.
Compliance monitoring relies on manual sampling of transaction logs rather than continuous automated coverage.
Real-time data pipelines exist but bypass AI inference due to latency concerns, then feed AI inference separately in batch.

Cost of Inaction

Fraud losses: Payment fraud detected in real-time vs. overnight has a 60–80% difference in recoverable loss rate. For a mid-size bank processing $10B in annual transactions, this is a $50M–$150M annual impact.
Customer experience: Sentiment-triggered escalation during a live service call reduces customer churn at a rate of 15–30% for escalated cases resolved positively.
Compliance: Continuous compliance monitoring eliminates the "discovery lag" — the time between a compliance violation occurring and it being detected in batch review. AUSTRAC, FCA, and SEC have all levied significant fines for institutions where compliance monitoring operated in batch rather than real-time.

3. Context

When to Apply

The business value of AI insights is time-sensitive — actionability decays rapidly after event occurrence.
The data source is a stream or event log (Kafka, Kinesis, Pub/Sub, IoT data, financial market feeds, clickstream, log streams).
The SLO for AI inference can be expressed in seconds or sub-seconds (not minutes or hours).
AI inference cost per event is justified by the value of real-time action.

When NOT to Apply

Batch-compatible use cases: document classification, overnight risk reporting, periodic data enrichment — use EAAPL-INT005 instead.
Ultra-low-latency requirements (< 10ms) — no AI inference technology currently meets this SLO; use rule-based approaches or pre-computed scores.
AI models requiring large context windows that cannot be populated from available stream data.
Event streams with extreme volume (> 1M events/second) where AI inference cost is prohibitive — apply selective sampling or tiered processing.

Prerequisites

A mature streaming platform (Kafka, Kinesis, Pub/Sub, Azure Event Hubs).
An AI inference serving infrastructure capable of meeting the target SLO (model serving, batching, GPU access if needed).
A stream processing framework (Flink, Spark Structured Streaming, Kafka Streams).
Observability platform supporting event-level latency tracing.

Industry Applicability

Industry	Applicability	Use Case	Latency SLO
Financial Services	Very High	Real-time fraud scoring, AML transaction monitoring, market abuse detection	< 200ms (fraud), < 5s (AML)
Telecommunications	High	Network anomaly detection, churn signal detection from usage patterns	< 1s (anomaly), < 10s (churn)
eCommerce / Retail	High	Real-time personalisation, cart abandonment detection, dynamic pricing signals	< 500ms (personalisation)
Healthcare	Medium	Continuous patient monitoring (ICU), real-time adverse event detection	< 5s (clinical)
Cybersecurity	Very High	Threat detection from log streams, lateral movement detection, SIEM enrichment	< 500ms
Manufacturing	Medium	Predictive quality control from sensor streams, equipment anomaly detection	< 2s

4. Architecture Overview

Real-Time AI Stream Processing requires selecting an architecture topology matched to the inference SLO, the nature of the AI task (stateless vs. stateful), and the throughput characteristics of the input stream. Four architectures are defined.

Architecture 1 — Stateless Per-Event Inference (Kafka Streams). Each event is classified independently without reference to prior events. The stream processor reads from the input topic, invokes AI inference on the event payload, and writes the enriched event to the output topic. Kafka Streams handles partition assignment, offset management, and consumer group coordination. AI inference is called synchronously within the stream processor topology — the event moves forward only when inference is complete. This architecture suits high-throughput classification tasks where each event carries sufficient context: spam classification, document categorisation, content moderation. The critical design decision is the AI inference call: a blocking HTTP call within Kafka Streams is fatal for throughput. The inference call must be batched within the processor to amortise HTTP overhead across multiple events.

Architecture 2 — Stateful Context Window Inference (Flink). Many AI tasks require context accumulated across multiple events: sentiment trend analysis across a customer service interaction, fraud pattern detection across a user's transaction sequence, log anomaly detection requiring baseline comparison. Flink's stateful processing model maintains per-key state (per customer, per session, per device) across events. The AI inference is triggered when sufficient context has accumulated (time window, event count, or state change threshold). Flink manages state checkpointing, exactly-once semantics, and fault recovery. The AI inference is invoked asynchronously using Flink's AsyncIO operator — the event processing topology does not block waiting for the inference result; other events continue processing while the async inference call is in flight. This is the highest-quality AI stream processing architecture but requires the most operational maturity to run in production.

Architecture 3 — Micro-Batch Inference (Spark Structured Streaming). Events are accumulated into micro-batches of configurable size (N events or T seconds, whichever comes first). The entire micro-batch is submitted to the AI model as a batch inference call, amortising per-call overhead and often qualifying for batch pricing tiers. Spark manages the micro-batch trigger, partition management, and output writing. This architecture accepts a latency floor equal to the micro-batch trigger interval (typically 1–30 seconds) in exchange for significantly higher throughput and lower cost per event. It is appropriate where near-real-time (not sub-second) intelligence is sufficient.

Architecture 4 — Tiered Processing (Lambda Architecture for AI). High-throughput streams benefit from tiered processing: a fast, cheap rule-based or small-model tier processes all events at stream speed, flagging events for the expensive deep AI inference tier. Only flagged events (typically 1–10% of total volume) receive full AI inference. The fast tier uses lightweight classifiers, regex patterns, or pre-computed embedding similarity; the deep tier uses LLMs or complex models. This reduces AI inference cost by 10–100× while maintaining deep analysis quality for the cases that matter most.

Exactly-Once Semantics for AI Inference. For regulated use cases (fraud detection, AML, compliance monitoring), duplicate AI inference on the same event is not merely inefficient — it is a compliance risk if different inference results are produced for the same event. Exactly-once semantics require: (a) idempotency key on each event (unique event ID); (b) inference result deduplication at the output sink (check event ID before writing); (c) transactional output where the inference result and the consumer offset commit are atomic.

Watermarking and Late-Arriving Events. Real-time streams always include late-arriving events — network delays, mobile client buffering, out-of-order message delivery. The stream processor must define a watermark — the maximum expected event delay — and a policy for events arriving after the watermark. For AI inference, late events that fall outside a completed inference window must be reprocessed against a reopened window or rejected with logging. For fraud detection, an event arriving 5 seconds late is still relevant; an event arriving 5 hours late may not be.

Model Serving Infrastructure. The AI inference serving layer must be sized for stream throughput. GPU-based model serving (Triton Inference Server, vLLM, TorchServe) provides highest throughput for neural models. Horizontal scaling of model serving replicas, with load balancing across the stream processing topology, ensures no single inference bottleneck. Serving latency (p99) must be < 70% of the inference SLO to leave headroom for stream processing overhead.

5. Architecture Diagram

ARCHITECTURE DIAGRAM

flowchart TD subgraph Sources["Stream Sources"] A[Event Streams] B[Streaming Platform Topics] end subgraph Processing["Tiered AI Processing"] C[Fast Tier Classifier] D[Deep AI Inference Worker] E[Model Serving Cluster] end subgraph Output["Output and Consumers"] F[Enriched Output Topics] G[Downstream Consumers] H[Stream DLQ] end A --> B B --> C C -->|flagged events| D D --> E E -->|inference result| D D -->|enriched event| F D -->|failure| H F --> G style A fill:#dbeafe,stroke:#3b82f6 style B fill:#dbeafe,stroke:#3b82f6 style C fill:#f0fdf4,stroke:#22c55e style D fill:#f0fdf4,stroke:#22c55e style E fill:#f0fdf4,stroke:#22c55e style F fill:#fef9c3,stroke:#eab308 style G fill:#d1fae5,stroke:#10b981 style H fill:#fee2e2,stroke:#ef4444

6. Components

Component	Type	Responsibility	Technology Options	Criticality
Streaming Platform	Infrastructure	Durable event log, partitioned topics, consumer group coordination	Apache Kafka, Amazon Kinesis, Azure Event Hubs, Google Pub/Sub	Critical
Fast Tier Processor	Service	Lightweight classification; triage events for deep AI tier	Kafka Streams (stateless), Flink (stateful), AWS Lambda Kinesis consumer	High
Flink Job (Stateful)	Processing	Stateful per-key context accumulation; async AI inference via AsyncIO operator	Apache Flink 1.18+, AWS Kinesis Data Analytics, Confluent Flink	High
Spark Structured Streaming	Processing	Micro-batch accumulation and batch AI inference submission	Apache Spark 3.5+, Azure HDInsight, Databricks	High
AI Model Serving Cluster	Infrastructure	Serve AI inference requests at streaming throughput SLO	NVIDIA Triton Inference Server, vLLM, TorchServe, Ollama (dev only)	Critical
Inference Cache	Service	Semantic deduplication of near-identical inference requests	Redis with vector similarity, custom bloom filter	High
Idempotency Store	Service	Deduplicate inference results on event ID to enforce exactly-once output semantics	Redis, DynamoDB, PostgreSQL with unique constraint	Critical (regulated)
Flink State Backend	Storage	Persistent state for stateful inference context per key	RocksDB (embedded), S3 for checkpoints	High
Stream DLQ	Infrastructure	Capture events where inference failed after max retries with full context	Kafka DLQ topic, SQS DLQ	High
Watermark Manager	Component	Define event time watermarks; handle late arrivals per use case policy	Flink Watermark Strategies, Spark event-time windows	Medium

7. Data Flow

Primary Flow — Stateful Flink with AsyncIO

Step	Actor	Action	Output
1	Source System	Publishes event to raw topic with event_id, event_time, partition key (customer_id/device_id)	Event persisted in Kafka partition
2	Fast Tier Classifier	Reads event; applies lightweight classifier or rule engine	Event tagged: ROUTE_TO_DEEP_AI or PASS_THROUGH
3	Flink AsyncIO Operator	Receives ROUTE_TO_DEEP_AI events; non-blocking async call to AI inference serving	Async inference request dispatched
4	AI Model Serving	Receives event (or accumulated context window); executes inference	Inference result with confidence, model_version, latency
5	Flink Output	Receives async result; updates Flink state; emits enriched event to output topic	Enriched event: original payload + ai_result + ai_confidence + ai_model_version
6	Idempotency Store	Checks event_id before writing to output topic	Write (new event_id) or discard (duplicate)
7	Output Topic	Enriched event available for consumer applications	Consumers read enriched event
8	Flink Checkpoint	Periodic state checkpoint to durable storage	State recoverable after failure

Error Flow

Step	Error Condition	Detection	Recovery
3	AI inference serving unavailable	AsyncIO timeout or connection failure	Retry twice with backoff; after max retries, route event to DLQ with error context
3	AI inference SLO exceeded (latency > target)	AsyncIO timeout per configured threshold	Log SLO breach; emit event with `ai_slo_breach: true`; consumer can choose to act on partial result
4	Model returns error (invalid input)	Non-2xx HTTP response from serving	Log specific error; route event to DLQ for manual investigation
5	Flink state backend failure	Checkpoint fails; state unavailable	Flink restores from last successful checkpoint; events between checkpoint and failure reprocessed — idempotency store prevents duplicate output
6	Output topic unavailable	Kafka write failure	Flink retries write; if persistent failure, job pauses — events accumulate in Kafka (durable); alert fires

8. Security Considerations

Authentication and Authorisation

Stream processors authenticate to Kafka using mTLS or SASL/SCRAM service accounts.
AI inference serving cluster accessible only from stream processor network segment; not exposed externally.
Consumer applications read from output topics using topic ACLs — read access only to subscribed topics.

Secrets Management

AI model serving API keys (for external model providers) stored in secrets manager; injected at job startup.
Kafka service account credentials rotated on 90-day schedule; Flink jobs support dynamic secret refresh without job restart via Flink's Secrets Provider interface.

Data Classification

Events containing PII (transaction data, customer interaction data) classified at ingestion.
PII events processed by AI inference must stay within the data residency boundary for which PII processing has been approved.
Inference results containing inferred sensitive attributes (credit risk score, health condition flags) classified at the highest sensitivity level and access-controlled on output topics.

Encryption

All event data in transit encrypted (TLS 1.3 for Kafka, mTLS for inference serving).
Flink state backend (RocksDB) encrypted at rest using filesystem encryption.
Inference results on output topics encrypted at rest (broker-level encryption).

Auditability

Every inference event logged with: event_id, partition_key, inference_model, result, confidence, latency, processing_timestamp.
Inference SLO breaches logged as separate audit events — enables compliance reporting on AI processing timeliness.
For regulated use cases, inference audit log retained for the same period as the underlying transaction record.

OWASP LLM Top 10 Mitigations

OWASP LLM Risk	Relevance	Mitigation in This Pattern
LLM01 — Prompt Injection	Medium	Stream events are structured data, not free text; field extraction before prompt construction; free-text fields (customer messages) sanitised
LLM02 — Insecure Output Handling	High	AI inference result parsed from structured response; enriched event schema validated before writing to output topic
LLM03 — Training Data Poisoning	Low	Inference only (not training); stream data is not used to update model weights in this pattern
LLM04 — Model Denial of Service	High	Per-consumer-group rate limits on inference serving; exactly-once semantics prevent retry amplification storms
LLM05 — Supply Chain Vulnerabilities	Medium	Model serving infrastructure uses pinned container image versions; inference SDK versions pinned; SBOM per job release
LLM06 — Sensitive Information Disclosure	High	PII in event payloads masked before including in LLM prompts for customer interaction analysis; transaction amounts may be binned rather than exact
LLM07 — Insecure Plugin Design	Low	Stream processing pattern does not use function-calling or plugins; inference is classification/scoring only
LLM08 — Excessive Agency	Low	Stream processors emit enriched events; action on those events is taken by consumer applications, not the processing pipeline
LLM09 — Overreliance	High	Confidence score in every inference result; consumers configured with minimum confidence thresholds below which AI result is not acted upon automatically
LLM10 — Model Theft	Medium	Model serving cluster in private network; no external access; for proprietary fine-tuned models, access-controlled inference endpoint with auth

9. Governance Considerations

Responsible AI

Real-time AI inference on customer data requires documented basis for processing under privacy law (AU Privacy Act, GDPR, EU AI Act).
Confidence score monitoring enables detection of model quality degradation on live traffic — governance team reviews weekly confidence score distributions.
Outcome feedback loops: consumer applications should publish outcome events (fraud confirmed/cleared, churn detected/averted) back to a feedback topic so model performance can be measured against ground truth.

Model Risk Management

Real-time fraud scoring models are high-risk AI systems; initial validation, regular backtesting, and champion/challenger deployment are required.
Model version tracked in every inference result event; enables retrospective performance analysis by model version using event archive.
Champion/challenger: Flink topology supports parallel consumer groups reading the same input topic — shadow mode sends 10% of traffic to challenger model for comparison before promotion.

Human Approval Gates

High-value automated actions (block a transaction > $10,000, suspend a customer account) triggered by AI stream inference require human review queue before execution.
Real-time AI scores are available immediately for human review interfaces; automated action requires meeting a high-confidence threshold AND a human review window (typically 15–60 seconds for fraud).

Policy and Traceability

Every automated action taken based on a real-time AI inference result must be traceable back to: the AI model version, the event_id, the inference confidence, and the downstream consumer that took the action.

Governance Artefacts

Artefact	Owner	Update Frequency	Storage Location
Model Risk Assessment (Real-Time Models)	Model Risk Team	Per model version change	MRM register
Inference SLO Compliance Report	Platform Engineering	Monthly	Observability platform
Confidence Score Distribution Report	AI Governance	Weekly	Data platform
Privacy Impact Assessment (Stream AI)	Privacy Officer	Per new data type in stream	Privacy register
Champion/Challenger Comparison Report	Data Science	Per model evaluation cycle	ML platform
Outcome Feedback Report	Data Science	Monthly	ML platform

10. Operational Considerations

Monitoring and SLOs

SLO	Target	Measurement	Alert Threshold
End-to-end stream AI latency (p99)	Use-case specific (see table below)	Event_time to inference result in output topic	> 2× SLO target
AI inference serving latency (p99)	< 70% of end-to-end SLO	Serving endpoint latency metric	> 80% of SLO target
Stream processing lag	< 1000 events	Consumer group lag metric	> 5000 events
Inference error rate	< 0.5%	Errors / total inference requests	> 2% in any 15-min window
DLQ growth rate	0 net new (steady state)	DLQ message count delta	Any sustained growth
Flink checkpoint success rate	100%	Flink checkpoint metrics	Any checkpoint failure

Use-Case SLO Reference:

Use Case	p99 Latency SLO	Architecture Recommended
Payment fraud detection	< 200ms	Stateless (Architecture 1) with GPU serving
AML transaction monitoring	< 5s	Stateful Flink (Architecture 2)
Customer interaction sentiment	< 2s	Stateful Flink (Architecture 2)
Content moderation	< 1s	Stateless (Architecture 1)
Document stream classification	< 30s	Micro-batch Spark (Architecture 3)

Logging

Flink task manager logs: job ID, task name, event processing rate, state size, checkpoint duration.
Inference serving logs: request ID, model, input size, latency, result category, confidence.
DLQ events: full event payload, error reason, retry count, last failure timestamp.

Incident Response

AI inference serving degradation: stream processor consumer lag accumulates; auto-scaler adds serving replicas; SLO breach events logged; alert fires within 60 seconds.
Flink job failure: Flink restarts from last checkpoint; events between checkpoint and failure reprocessed; idempotency store prevents duplicate output.
Model quality degradation: confidence score distribution alert fires; AI governance team reviews; champion/challenger evaluation triggered; model rollback if warranted.

Disaster Recovery

Scenario	RTO	RPO	Recovery Procedure
Flink job failure	3 minutes	0 (checkpoint-based recovery)	Flink automatic restart from last checkpoint; events replayed from Kafka offset
Inference serving cluster failure	5 minutes	0	Kubernetes replacement pods; stream processing continues when serving recovers
Kafka broker node failure	10 minutes	0 (replicated partitions)	Kafka partition leader election; stream processors reconnect automatically
Flink state backend corruption	30 minutes	Up to last checkpoint interval	Restore state from S3 checkpoint; reprocess events from checkpoint offset

Capacity Planning

AI inference serving: (peak events/second × AI fraction) / (inference serving throughput per replica) = minimum replicas; add 50% headroom.
Flink parallelism: number of Kafka partitions = maximum Flink parallelism; start at Kafka partition count, scale based on consumer lag.
State backend memory: (average state size per key) × (number of active keys) × (1.5 overhead factor) = RocksDB memory budget.

11. Cost Considerations

Cost Drivers

Cost Driver	Description	Typical Proportion
AI Model Serving Compute (GPU)	GPU instance costs for inference serving; dominant cost driver for neural models	40–65%
Streaming Platform	Kafka/Kinesis partition-hours, storage, data transfer	15–25%
Stream Processing Compute	Flink/Spark worker node compute; scales with throughput	10–20%
Flink State Storage	RocksDB memory; S3 for checkpoint storage	3–8%
Audit and Output Storage	Output topic retention, audit log storage	3–7%

Scaling Risks

GPU serving costs are fixed per instance — paying for GPU capacity 24/7 even during low-traffic hours is expensive. Auto-scaling with scale-to-zero (where SLO allows cold start) or scheduled scaling for predictable traffic patterns reduces waste.
Stateful Flink jobs with large per-key state (long context windows for millions of keys) can exhaust RocksDB memory leading to job failure; state TTL configuration and state size monitoring are mandatory.

Cost Optimisations

Tiered processing (Architecture 4): route only flagged events (1–10% of volume) to expensive deep AI inference; apply cheap classifiers to all events.
Micro-batching (Architecture 3): amortise per-call overhead; batch inference pricing often 30–50% cheaper than per-request pricing.
Inference result caching: deduplicate semantically identical events within a short time window before calling the model.
Spot/preemptible GPU instances for batch inference tier; on-demand only for real-time latency-sensitive tier.

Indicative Cost Range

Scale	Monthly Streaming Platform	AI Serving Compute	Total Monthly
Small (100K events/day, fraud detection)	$500–$1,500	$2,000–$8,000	$2,500–$9,500
Medium (10M events/day, multi-use-case)	$5,000–$15,000	$20,000–$60,000	$25,000–$75,000
Large (1B+ events/day, financial services)	$40,000–$100,000	$150,000–$500,000	$190,000–$600,000

12. Trade-Off Analysis

Architectural Options Comparison

Option	Latency	Throughput	AI Quality	Complexity	Cost	Best For
Architecture 1 — Stateless (Kafka Streams)	< 500ms	High	Medium (no context)	Low	Medium	Per-event classification; content moderation; spam filtering
Architecture 2 — Stateful (Flink AsyncIO)	500ms–5s	Medium	High (context window)	Very High	High	Fraud detection; sentiment analysis; AML monitoring
Architecture 3 — Micro-Batch (Spark)	5s–60s	Very High	Medium-High	Medium	Low-Medium	Document stream classification; batch-tolerant enrichment
Architecture 4 — Tiered Processing	200ms–5s	Very High	High (for flagged events)	High	Low	High-volume streams where deep AI on all events is cost-prohibitive

Architectural Tensions

Tension	Trade-Off	Resolution
Latency vs. Context Quality	Stateless per-event inference is fastest but has no context; stateful window inference is higher quality but adds accumulation latency	Match architecture to use case SLO: fraud = stateless; sentiment trend = stateful
Exactly-once vs. Throughput	Exactly-once semantics add 20–40% overhead from idempotency store lookups	Apply exactly-once only in regulated use cases (fraud, AML, compliance); use at-least-once for analytics enrichment
AI Cost vs. Coverage	Deep AI on all events is costly; but fast-tier routing may miss edge cases	Tune fast-tier false-negative rate; accept small miss rate for cost savings; periodic full-coverage sampling for model evaluation

13. Failure Modes

Failure	Likelihood	Impact	Detection	Recovery
AI inference serving GPU OOM	Medium	High — serving cluster restarts; stream lag accumulates	GPU memory utilisation alert; HTTP 500 from serving	Auto-scaling adds serving replicas; stream catches up from lag
Flink state backend exceeds memory	Medium	Critical — Flink job fails; requires full restart	RocksDB metrics; JVM heap alert	Restart from last checkpoint; review state TTL configuration
Late-arriving events after watermark	High	Low-Medium — events not included in correct inference window	Watermark late-event counter metric	Per-use-case policy: reopen window, route to DLQ, or discard with logging
Model quality cliff after version update	Medium	High — incorrect inferences at scale before detection	Confidence score distribution alert; confusion matrix degradation	Rollback to previous model version via serving configuration; Flink job restart not required
Kafka consumer offset corruption	Very Low	High — events may be reprocessed or skipped	Consumer group monitoring; unusual lag patterns	Reset offset to last known-good position; idempotency store handles duplicate reprocessing
AI serving SLO breach during traffic spike	High	Medium — events processed late; stream lag grows	p99 latency exceeds SLO alert	Auto-scaling; micro-batching temporarily increases batch size to absorb volume

Cascading Failure Scenarios

Flink state size explosion + no TTL + model quality event: Confidence score distribution shifts → governance team launches investigation → triggers re-processing of all events from last 24h → Flink state doubles in size → RocksDB OOM → Flink fails → stream events accumulate in Kafka → backpressure builds → producer events delayed → business operations degraded. Mitigation: state TTL mandatory; reprocessing jobs run on separate Flink cluster, not production job.
Inference serving cold start + real-time fraud SLO: Spot GPU instances reclaimed during peak traffic → new GPU instances take 5–10 minutes to start → during cold start, all fraud inference exceeds SLO → fraud decisions default to "allow" (graceful degradation) → fraud ring exploits window. Mitigation: minimum 2 on-demand GPU instances for fraud SLO; spot instances supplemental only.

14. Regulatory Considerations

APRA CPS 230 — Operational Risk

Clause 36: Real-time AI inference in payment fraud detection is a critical business service; the stream processing infrastructure must be included in the Business Continuity Plan with documented RTO/RPO matching the fraud detection SLO.
Clause 52: AI model providers and managed streaming platforms (Kafka SaaS, Kinesis) are material third-party service providers under CPS 230.

APRA CPS 234 — Information Security

Clause 15: mTLS between all stream processing components, encrypted state backends, and topic ACLs address the proportional information security control requirement for streaming PII.

AUSTRAC AML/CTF Act 2006 (AU) / FATF Recommendations

Real-time AML monitoring via stream AI must produce the equivalent of a transaction monitoring system alert; the inference result event must be structured to feed the AUSTRAC suspicious matter report workflow with full audit trail.

EU AI Act (2024)

Article 6 (High-Risk): AI systems used for fraud detection in financial services are high-risk AI systems under Annex III; this pattern must include the conformity assessment, logging, and human oversight requirements.
Article 12 (Record-keeping): Inference audit log with 5-year retention (GDPR financial data minimum) satisfies the logging obligation for high-risk AI.
Article 9 (Risk Management): Exactly-once semantics, confidence thresholds, and human review queues are required risk management measures for high-risk AI systems.

ISO 42001 — AI Management System

Clause 8.3 (AI System Design): Architecture selection (stateless vs. stateful vs. micro-batch) must be documented as an AI system design decision with justification against the SLO and use-case requirements.

NIST AI RMF (2023)

MEASURE 2.6: Performance metrics (inference latency p99, confidence score distribution, SLO compliance rate) constitute the ongoing performance measurement required under NIST AI RMF.
MANAGE 4.1: Incident response procedures for model quality degradation and stream processing failure implement the AI risk incident response requirements.

15. Reference Implementations

AWS

Streaming Platform: Amazon Kinesis Data Streams or Amazon MSK (Kafka-compatible)
Stateful Processing: Amazon Kinesis Data Analytics (Apache Flink managed)
Micro-Batch Processing: AWS Glue Streaming ETL (Spark)
AI Inference Serving: Amazon SageMaker Real-Time Inference with auto-scaling; or Amazon Bedrock for LLM inference
Idempotency Store: Amazon DynamoDB (conditional writes for deduplication)
State Backend: S3 for Flink checkpoint storage
Audit Logger: Kinesis Firehose → S3 → Athena for compliance queries

Azure

Streaming Platform: Azure Event Hubs (Kafka surface) or Azure Service Bus Premium
Stateful Processing: Azure HDInsight (Flink) or Azure Stream Analytics (built-in ML scoring)
Micro-Batch Processing: Azure Databricks Structured Streaming
AI Inference Serving: Azure Machine Learning Real-Time Endpoints or Azure OpenAI Service
Idempotency Store: Azure Cosmos DB (conditional upsert)
Audit Logger: Azure Event Hubs Capture → ADLS Gen2 → Synapse Analytics

GCP

Streaming Platform: Google Cloud Pub/Sub
Stateful Processing: Google Cloud Dataflow (Apache Beam with Flink runner)
Micro-Batch Processing: Google Cloud Dataproc (Spark)
AI Inference Serving: Vertex AI Online Prediction or Cloud Run with vLLM
Idempotency Store: Cloud Firestore or Cloud Bigtable (row-level transactions)
Audit Logger: Pub/Sub → BigQuery Streaming Insert for real-time query

On-Premises / Private Cloud

Streaming Platform: Apache Kafka (Strimzi Operator on Kubernetes)
Stateful Processing: Apache Flink (Flink Kubernetes Operator)
Micro-Batch Processing: Apache Spark on Kubernetes (Spark Operator)
AI Inference Serving: NVIDIA Triton Inference Server on GPU nodes
Idempotency Store: PostgreSQL with unique constraint on event_id
Audit Logger: Fluentd → Elasticsearch with 7-year retention policy

Pattern	Relationship	Notes
EAAPL-INT001 — Enterprise AI Service Bus	Specialises	Real-time stream processing is a high-throughput consumer topology for the AI Service Bus
EAAPL-INT005 — Batch AI Processing	Complementary	Batch processing handles the non-real-time tier; together they form a complete Lambda architecture for AI
EAAPL-INT007 — AI Circuit Breaker	Enables	Circuit breaker wraps the AI inference serving calls within stream processing topologies
EAAPL-INT008 — Bidirectional AI Sync	Complementary	Inference results written to output topics may feed the sync pattern to update enterprise data stores

17. Maturity Assessment

Overall Maturity: Proven

Dimension	Score (1–5)	Justification
Architectural Completeness	5	Four architecture topologies; exactly-once semantics; watermarking; model serving; all failure modes covered
Operational Readiness	4	Comprehensive SLOs; Flink operational maturity requires specialist skill not universally available
Security Coverage	5	mTLS, PII masking, topic ACLs, OWASP LLM Top 10 addressed
Governance Coverage	5	Model risk management, outcome feedback loops, champion/challenger, regulatory citations all included
Cost Predictability	3	GPU serving costs are volatile; traffic-driven AI costs are inherently variable
Implementation Complexity	2	Very high — requires streaming platform expertise + AI serving expertise + stream processing expertise simultaneously
Industry Validation	5	Production deployments at tier-1 banks, telecommunications providers, and payment networks globally

18. Revision History

Version	Date	Author	Changes
1.0	2026-06-12	EAAPL Working Group	Initial publication — integration patterns series

Track this pattern for APRA/ASIC review

← Back to Library More AI Integration →

EAAPL-INT004 — Real-Time AI Stream Processing

EAAPL-INT004 — Real-Time AI Stream Processing

1. Executive Summary

2. Problem Statement

Business Problem

Technical Problem

Symptoms

Cost of Inaction

3. Context

When to Apply

When NOT to Apply

Prerequisites

Industry Applicability

4. Architecture Overview

5. Architecture Diagram

6. Components

7. Data Flow

Primary Flow — Stateful Flink with AsyncIO

Error Flow

8. Security Considerations

Authentication and Authorisation

Secrets Management

Data Classification

Encryption

Auditability

OWASP LLM Top 10 Mitigations

9. Governance Considerations

Responsible AI

Model Risk Management

Human Approval Gates

Policy and Traceability

Governance Artefacts

10. Operational Considerations

Monitoring and SLOs

Logging

Incident Response

Disaster Recovery

Capacity Planning

11. Cost Considerations

Cost Drivers

Scaling Risks

Cost Optimisations

Indicative Cost Range

12. Trade-Off Analysis

Architectural Options Comparison

Architectural Tensions

13. Failure Modes

Cascading Failure Scenarios

14. Regulatory Considerations

APRA CPS 230 — Operational Risk

APRA CPS 234 — Information Security

AUSTRAC AML/CTF Act 2006 (AU) / FATF Recommendations

EU AI Act (2024)

ISO 42001 — AI Management System

NIST AI RMF (2023)

15. Reference Implementations

AWS

Azure

GCP

On-Premises / Private Cloud

16. Related Patterns

17. Maturity Assessment

18. Revision History