EAAPL-INT004 — Real-Time AI Stream Processing
Tags: streaming real-time high-availability high-complexity
Status: Proven | Version: 1.0 | Domain: Integration
1. Executive Summary
Real-Time AI Stream Processing applies AI inference directly to streaming data pipelines, enabling intelligent classification, anomaly detection, and contextual analysis at the speed of the data flow. Where batch AI processing applies intelligence after data has accumulated, stream processing applies intelligence as each event arrives — enabling real-time fraud detection, live sentiment analysis on customer interactions, and continuous compliance monitoring across operational data flows.
The pattern addresses the architectural tension inherent in AI stream processing: AI inference is orders of magnitude slower than typical stream processing operations. A fraud detection model may take 50–500ms to evaluate a single transaction, while a stream processing engine handles each event in microseconds, which creates a fundamental impedance mismatch. This pattern describes the architectural strategies to bridge that gap (stateless per-event inference, stateful context window accumulation, micro-batch grouping, and SLO-aligned architecture selection), enabling enterprises to get AI-quality intelligence at data-stream speed.
For CIOs and CTOs, the business case is clear: AI inference running 500ms after an event arrives prevents fraud, enables personalisation, and catches compliance violations. AI inference running 24 hours later, in a batch, is too late for any of those use cases.
2. Problem Statement
Business Problem
Business processes generate data that contains actionable signals in real time: a fraudulent transaction occurring right now, a customer expressing anger right now, a compliance violation happening right now. Extracting those signals after the fact — in overnight batch jobs — means the opportunity to act has passed. The fraud has completed. The customer has churned. The regulatory breach has occurred.
Technical Problem
AI inference and stream processing operate at incompatible throughput and latency profiles. Stream processing engines (Kafka, Flink, Spark) are designed to process millions of events per second. AI inference — even the fastest embedding or classification models — processes hundreds to thousands of events per second per worker. Naively inserting AI inference into a streaming pipeline creates a processing bottleneck that backs up the entire stream.
Symptoms
- Fraud detection models run overnight on the previous day's transaction data; intraday fraud is caught hours or days after occurrence.
- Customer sentiment analysis is available in weekly reports rather than during active service interactions.
- Compliance monitoring relies on manual sampling of transaction logs rather than continuous automated coverage.
- Real-time data pipelines exist but bypass AI inference due to latency concerns, then feed AI inference separately in batch.
Cost of Inaction
- Fraud losses: Detecting payment fraud in real time rather than overnight changes the recoverable loss rate by 60–80%. For a mid-size bank processing $10B in annual transactions, this is a $50M–$150M annual impact.
- Customer experience: Sentiment-triggered escalation during a live service call reduces customer churn by 15–30% for escalated cases that are resolved positively.
- Compliance: Continuous compliance monitoring eliminates the "discovery lag" — the time between a compliance violation occurring and it being detected in batch review. AUSTRAC, FCA, and SEC have all levied significant fines for institutions where compliance monitoring operated in batch rather than real-time.
3. Context
When to Apply
- The business value of AI insights is time-sensitive — actionability decays rapidly after event occurrence.
- The data source is a stream or event log (Kafka, Kinesis, Pub/Sub, IoT data, financial market feeds, clickstream, log streams).
- The SLO for AI inference can be expressed in seconds or sub-seconds (not minutes or hours).
- AI inference cost per event is justified by the value of real-time action.
When NOT to Apply
- Batch-compatible use cases: document classification, overnight risk reporting, periodic data enrichment — use EAAPL-INT005 instead.
- Ultra-low-latency requirements (< 10ms): LLM and large neural-model inference behind a network call generally cannot meet this SLO; use rule-based approaches, lightweight in-process models, or pre-computed scores.
- AI models requiring large context windows that cannot be populated from available stream data.
- Event streams with extreme volume (> 1M events/second) where AI inference cost is prohibitive — apply selective sampling or tiered processing.
Prerequisites
- A mature streaming platform (Kafka, Kinesis, Pub/Sub, Azure Event Hubs).
- An AI inference serving infrastructure capable of meeting the target SLO (model serving, batching, GPU access if needed).
- A stream processing framework (Flink, Spark Structured Streaming, Kafka Streams).
- Observability platform supporting event-level latency tracing.
Industry Applicability
| Industry | Applicability | Use Case | Latency SLO |
|---|---|---|---|
| Financial Services | Very High | Real-time fraud scoring, AML transaction monitoring, market abuse detection | < 200ms (fraud), < 5s (AML) |
| Telecommunications | High | Network anomaly detection, churn signal detection from usage patterns | < 1s (anomaly), < 10s (churn) |
| eCommerce / Retail | High | Real-time personalisation, cart abandonment detection, dynamic pricing signals | < 500ms (personalisation) |
| Healthcare | Medium | Continuous patient monitoring (ICU), real-time adverse event detection | < 5s (clinical) |
| Cybersecurity | Very High | Threat detection from log streams, lateral movement detection, SIEM enrichment | < 500ms |
| Manufacturing | Medium | Predictive quality control from sensor streams, equipment anomaly detection | < 2s |
4. Architecture Overview
Real-Time AI Stream Processing requires selecting an architecture topology matched to the inference SLO, the nature of the AI task (stateless vs. stateful), and the throughput characteristics of the input stream. Four architectures are defined.
Architecture 1: Stateless Per-Event Inference (Kafka Streams). Each event is classified independently without reference to prior events. The stream processor reads from the input topic, invokes AI inference on the event payload, and writes the enriched event to the output topic. Kafka Streams handles partition assignment, offset management, and consumer group coordination. AI inference is called from within the stream processor topology, and an event moves forward only when its inference is complete. This architecture suits high-throughput classification tasks where each event carries sufficient context: spam classification, document categorisation, content moderation. The critical design decision is how the inference call is made: a blocking HTTP call per event stalls the stream thread and is fatal for throughput, so inference calls must be batched within the processor to amortise HTTP overhead across multiple events.
Architecture 2 — Stateful Context Window Inference (Flink). Many AI tasks require context accumulated across multiple events: sentiment trend analysis across a customer service interaction, fraud pattern detection across a user's transaction sequence, log anomaly detection requiring baseline comparison. Flink's stateful processing model maintains per-key state (per customer, per session, per device) across events. The AI inference is triggered when sufficient context has accumulated (time window, event count, or state change threshold). Flink manages state checkpointing, exactly-once semantics, and fault recovery. The AI inference is invoked asynchronously using Flink's AsyncIO operator — the event processing topology does not block waiting for the inference result; other events continue processing while the async inference call is in flight. This is the highest-quality AI stream processing architecture but requires the most operational maturity to run in production.
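The non-blocking idea behind Architecture 2 can be illustrated outside Flink. The following is a minimal sketch in plain Python `asyncio`, not Flink's AsyncIO API: it keeps a per-key context window and dispatches inference without blocking event intake. `WINDOW_SIZE`, `MAX_IN_FLIGHT`, and `fake_inference` are hypothetical names and values chosen for illustration.

```python
import asyncio
from collections import defaultdict, deque

WINDOW_SIZE = 3      # assumed: events of context required per key
MAX_IN_FLIGHT = 2    # assumed: cap on concurrent inference calls


async def fake_inference(context):
    """Stand-in for a model-serving call; scores by context length."""
    await asyncio.sleep(0.01)
    return {"score": len(context) / WINDOW_SIZE}


async def process(events):
    # Per-key state, analogous to keyed state in a stream processor.
    state = defaultdict(lambda: deque(maxlen=WINDOW_SIZE))
    limiter = asyncio.Semaphore(MAX_IN_FLIGHT)
    tasks = []

    async def infer(key, context):
        async with limiter:  # bound the number of in-flight requests
            result = await fake_inference(context)
        return key, result

    for key, payload in events:
        state[key].append(payload)
        if len(state[key]) == WINDOW_SIZE:  # enough context accumulated
            # Dispatch without waiting; intake of later events continues.
            tasks.append(asyncio.create_task(infer(key, list(state[key]))))
    return await asyncio.gather(*tasks)


events = [("c1", "a"), ("c2", "x"), ("c1", "b"), ("c1", "c"), ("c2", "y")]
results = asyncio.run(process(events))
```

Only key `c1` reaches the window size in this input, so a single inference is dispatched. A production job would additionally need checkpointed state and timeouts, which this sketch omits.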
Architecture 3 — Micro-Batch Inference (Spark Structured Streaming). Events are accumulated into micro-batches of configurable size (N events or T seconds, whichever comes first). The entire micro-batch is submitted to the AI model as a batch inference call, amortising per-call overhead and often qualifying for batch pricing tiers. Spark manages the micro-batch trigger, partition management, and output writing. This architecture accepts a latency floor equal to the micro-batch trigger interval (typically 1–30 seconds) in exchange for significantly higher throughput and lower cost per event. It is appropriate where near-real-time (not sub-second) intelligence is sufficient.
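The "N events or T seconds, whichever comes first" trigger can be sketched independently of Spark. The class below is an illustrative assumption, not Spark's trigger mechanism; `MicroBatcher`, `infer_batch`, and the injected `clock` are hypothetical names.

```python
import time


class MicroBatcher:
    """Flush when max_size events are buffered or max_wait_s has elapsed."""

    def __init__(self, max_size, max_wait_s, infer_batch, clock=time.monotonic):
        self.max_size = max_size
        self.max_wait_s = max_wait_s
        self.infer_batch = infer_batch  # callable: list of events -> list of results
        self.clock = clock
        self.buffer = []
        self.first_arrival = None

    def add(self, event):
        if not self.buffer:
            self.first_arrival = self.clock()  # start the timer on first event
        self.buffer.append(event)
        return self._maybe_flush()

    def tick(self):
        """Call periodically so a quiet stream still flushes on time."""
        return self._maybe_flush()

    def _maybe_flush(self):
        if not self.buffer:
            return []
        full = len(self.buffer) >= self.max_size
        stale = self.clock() - self.first_arrival >= self.max_wait_s
        if not (full or stale):
            return []
        batch, self.buffer = self.buffer, []
        return self.infer_batch(batch)  # one call for the whole batch


# Usage with a fake clock and a stand-in "model" that upper-cases its input.
now = [0.0]
batcher = MicroBatcher(3, 5.0, lambda b: [e.upper() for e in b], clock=lambda: now[0])
out_size = [batcher.add("a"), batcher.add("b"), batcher.add("c")]  # size trigger
batcher.add("d")
now[0] = 6.0
out_time = batcher.tick()  # time trigger
```

The injected clock keeps the example deterministic; the latency floor described above corresponds to `max_wait_s`.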
Architecture 4 — Tiered Processing (Lambda Architecture for AI). High-throughput streams benefit from tiered processing: a fast, cheap rule-based or small-model tier processes all events at stream speed, flagging events for the expensive deep AI inference tier. Only flagged events (typically 1–10% of total volume) receive full AI inference. The fast tier uses lightweight classifiers, regex patterns, or pre-computed embedding similarity; the deep tier uses LLMs or complex models. This reduces AI inference cost by 10–100× while maintaining deep analysis quality for the cases that matter most.
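A minimal sketch of the fast tier, assuming a rule-based triage on transaction events. The regex, the amount threshold, and the event fields `amount` and `memo` are hypothetical; the route tags match those used in the data flow section.

```python
import re

# Assumed triage rules; a real fast tier would be tuned per use case.
SUSPICIOUS = re.compile(r"gift ?card|wire transfer|crypto", re.IGNORECASE)
AMOUNT_THRESHOLD = 5_000


def fast_tier(event):
    """Cheap triage applied to every event; returns a route tag."""
    flagged = (
        event["amount"] >= AMOUNT_THRESHOLD
        or bool(SUSPICIOUS.search(event.get("memo", "")))
    )
    return "ROUTE_TO_DEEP_AI" if flagged else "PASS_THROUGH"


events = [
    {"amount": 20, "memo": "coffee"},
    {"amount": 9_000, "memo": "rent"},
    {"amount": 50, "memo": "Gift card top-up"},
    {"amount": 12},
]
routes = [fast_tier(e) for e in events]
deep_fraction = routes.count("ROUTE_TO_DEEP_AI") / len(routes)
```

Tracking `deep_fraction` over time is one way to check that the fast tier stays within the intended share of traffic sent to the deep tier.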
Exactly-Once Semantics for AI Inference. For regulated use cases (fraud detection, AML, compliance monitoring), duplicate AI inference on the same event is not merely inefficient — it is a compliance risk if different inference results are produced for the same event. Exactly-once semantics require: (a) idempotency key on each event (unique event ID); (b) inference result deduplication at the output sink (check event ID before writing); (c) transactional output where the inference result and the consumer offset commit are atomic.
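Requirements (a) and (b) can be sketched with an in-memory stand-in for the idempotency store. `IdempotentSink` is a hypothetical name; the dictionary stands in for Redis, DynamoDB, or a table with a unique constraint.

```python
class IdempotentSink:
    """Writes each event_id at most once; replays are discarded."""

    def __init__(self):
        self.seen = {}    # event_id -> first stored result (idempotency store)
        self.output = []  # stands in for the output topic

    def write(self, event_id, result):
        """Return True if written, False if event_id was already recorded."""
        if event_id in self.seen:
            return False  # duplicate: keep the first result, drop this one
        self.seen[event_id] = result
        self.output.append((event_id, result))
        return True


sink = IdempotentSink()
first = sink.write("e1", "fraud")
replay = sink.write("e1", "clear")  # reprocessed event with a different result
```

The replayed event is dropped, so the output holds only the first result for `e1`. The sketch does not cover requirement (c): in a real deployment the check-and-write must be a single atomic operation (for example a conditional write) coordinated with the offset commit.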
Watermarking and Late-Arriving Events. Real-time streams always include late-arriving events — network delays, mobile client buffering, out-of-order message delivery. The stream processor must define a watermark — the maximum expected event delay — and a policy for events arriving after the watermark. For AI inference, late events that fall outside a completed inference window must be reprocessed against a reopened window or rejected with logging. For fraud detection, an event arriving 5 seconds late is still relevant; an event arriving 5 hours late may not be.
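One common way to derive a watermark is to subtract an allowed lateness from the highest event time seen so far. The sketch below assumes that approach; `WatermarkTracker` is a hypothetical class and is not the Flink or Spark watermark API.

```python
class WatermarkTracker:
    """Watermark = highest event time seen so far minus allowed lateness."""

    def __init__(self, allowed_lateness_s):
        self.allowed_lateness_s = allowed_lateness_s
        self.max_event_time = float("-inf")

    def observe(self, event_time):
        """Classify the event against the current watermark, then advance it."""
        watermark = self.max_event_time - self.allowed_lateness_s
        status = "late" if event_time < watermark else "on_time"
        self.max_event_time = max(self.max_event_time, event_time)
        return status


tracker = WatermarkTracker(allowed_lateness_s=5)
statuses = [tracker.observe(t) for t in (100, 97, 110, 104)]
```

The event at time 97 is out of order but within the 5-second allowance; the event at 104 arrives after the watermark has advanced to 105 and is classified as late, at which point the per-use-case policy (reopen, reject, or log) applies.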
Model Serving Infrastructure. The AI inference serving layer must be sized for stream throughput. GPU-based model serving (Triton Inference Server, vLLM, TorchServe) provides highest throughput for neural models. Horizontal scaling of model serving replicas, with load balancing across the stream processing topology, ensures no single inference bottleneck. Serving latency (p99) must be < 70% of the inference SLO to leave headroom for stream processing overhead.
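The 70% headroom rule can be checked from measured latencies. This is a minimal sketch using the nearest-rank percentile method; the function names and the sample data are illustrative assumptions.

```python
def p99(samples_ms):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(samples_ms)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n)
    return ordered[rank - 1]


def within_serving_budget(samples_ms, slo_ms, fraction=0.70):
    """True if serving p99 leaves headroom below fraction * SLO."""
    return p99(samples_ms) < fraction * slo_ms


samples = list(range(1, 101))  # synthetic latencies: 1ms .. 100ms
ok_200 = within_serving_budget(samples, slo_ms=200)  # budget is 140ms
ok_120 = within_serving_budget(samples, slo_ms=120)  # budget is 84ms
```

With these synthetic samples the p99 is 99ms, which fits a 200ms SLO but not a 120ms one.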
5. Architecture Diagram
flowchart TD
subgraph Sources["Stream Sources"]
A[Event Streams]
B[Streaming Platform Topics]
end
subgraph Processing["Tiered AI Processing"]
C[Fast Tier Classifier]
D[Deep AI Inference Worker]
E[Model Serving Cluster]
end
subgraph Output["Output and Consumers"]
F[Enriched Output Topics]
G[Downstream Consumers]
H[Stream DLQ]
end
A --> B
B --> C
C -->|flagged events| D
D --> E
E -->|inference result| D
D -->|enriched event| F
D -->|failure| H
F --> G
style A fill:#dbeafe,stroke:#3b82f6
style B fill:#dbeafe,stroke:#3b82f6
style C fill:#f0fdf4,stroke:#22c55e
style D fill:#f0fdf4,stroke:#22c55e
style E fill:#f0fdf4,stroke:#22c55e
style F fill:#fef9c3,stroke:#eab308
style G fill:#d1fae5,stroke:#10b981
style H fill:#fee2e2,stroke:#ef4444
6. Components
| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| Streaming Platform | Infrastructure | Durable event log, partitioned topics, consumer group coordination | Apache Kafka, Amazon Kinesis, Azure Event Hubs, Google Pub/Sub | Critical |
| Fast Tier Processor | Service | Lightweight classification; triage events for deep AI tier | Kafka Streams (stateless), Flink (stateful), AWS Lambda Kinesis consumer | High |
| Flink Job (Stateful) | Processing | Stateful per-key context accumulation; async AI inference via AsyncIO operator | Apache Flink 1.18+, Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics), Confluent Flink | High |
| Spark Structured Streaming | Processing | Micro-batch accumulation and batch AI inference submission | Apache Spark 3.5+, Azure HDInsight, Databricks | High |
| AI Model Serving Cluster | Infrastructure | Serve AI inference requests at streaming throughput SLO | NVIDIA Triton Inference Server, vLLM, TorchServe, Ollama (dev only) | Critical |
| Inference Cache | Service | Semantic deduplication of near-identical inference requests | Redis with vector similarity, custom bloom filter | High |
| Idempotency Store | Service | Deduplicate inference results on event ID to enforce exactly-once output semantics | Redis, DynamoDB, PostgreSQL with unique constraint | Critical (regulated) |
| Flink State Backend | Storage | Persistent state for stateful inference context per key | RocksDB (embedded), S3 for checkpoints | High |
| Stream DLQ | Infrastructure | Capture events where inference failed after max retries with full context | Kafka DLQ topic, SQS DLQ | High |
| Watermark Manager | Component | Define event time watermarks; handle late arrivals per use case policy | Flink Watermark Strategies, Spark event-time windows | Medium |
7. Data Flow
Primary Flow — Stateful Flink with AsyncIO
| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | Source System | Publishes event to raw topic with event_id, event_time, partition key (customer_id/device_id) | Event persisted in Kafka partition |
| 2 | Fast Tier Classifier | Reads event; applies lightweight classifier or rule engine | Event tagged: ROUTE_TO_DEEP_AI or PASS_THROUGH |
| 3 | Flink AsyncIO Operator | Receives ROUTE_TO_DEEP_AI events; non-blocking async call to AI inference serving | Async inference request dispatched |
| 4 | AI Model Serving | Receives event (or accumulated context window); executes inference | Inference result with confidence, model_version, latency |
| 5 | Flink Output | Receives async result; updates Flink state; builds enriched event for the output topic | Enriched event: original payload + ai_result + ai_confidence + ai_model_version |
| 6 | Idempotency Store | Checks event_id before the enriched event is written to the output topic | Write (new event_id) or discard (duplicate) |
| 7 | Output Topic | Enriched event available for consumer applications | Consumers read enriched event |
| 8 | Flink Checkpoint | Periodic state checkpoint to durable storage | State recoverable after failure |
Error Flow
| Step | Error Condition | Detection | Recovery |
|---|---|---|---|
| 3 | AI inference serving unavailable | AsyncIO timeout or connection failure | Retry twice with backoff; after max retries, route event to DLQ with error context |
| 3 | AI inference SLO exceeded (latency > target) | AsyncIO timeout per configured threshold | Log SLO breach; emit event with ai_slo_breach: true; consumer can choose to act on partial result |
| 4 | Model returns error (invalid input) | Non-2xx HTTP response from serving | Log specific error; route event to DLQ for manual investigation |
| 5 | Flink state backend failure | Checkpoint fails; state unavailable | Flink restores from last successful checkpoint; events between checkpoint and failure are reprocessed; idempotency store prevents duplicate output |
| 6 | Output topic unavailable | Kafka write failure | Flink retries write; if the failure persists, the job pauses; events accumulate in Kafka (durable); alert fires |
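The first recovery row (retry twice with backoff, then route to the DLQ) can be sketched as follows. This is an illustrative assumption rather than Flink's retry mechanism; `infer_with_retry`, the exponential delay schedule, and the DLQ record fields are hypothetical.

```python
import time


def infer_with_retry(event, infer, dlq, max_retries=2, base_delay_s=0.1,
                     sleep=time.sleep):
    """Try once, then retry up to max_retries times with exponential backoff."""
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            return infer(event)
        except ConnectionError as exc:
            last_error = exc
            if attempt < max_retries:
                sleep(base_delay_s * 2 ** attempt)  # 0.1s, then 0.2s
    # All attempts failed: keep the event and its error context in the DLQ.
    dlq.append({"event": event, "error": str(last_error),
                "retry_count": max_retries})
    return None


calls = {"n": 0}


def flaky(event):
    """Fails twice, then succeeds (simulated serving outage)."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("serving unavailable")
    return {"score": 0.9}


def always_down(event):
    raise ConnectionError("serving unavailable")


dlq, delays = [], []
ok = infer_with_retry({"event_id": "e1"}, flaky, dlq, sleep=delays.append)
failed = infer_with_retry({"event_id": "e2"}, always_down, dlq, sleep=delays.append)
```

Passing `delays.append` as `sleep` records the backoff schedule instead of waiting, which keeps the example fast to run.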
8. Security Considerations
Authentication and Authorisation
- Stream processors authenticate to Kafka using mTLS or SASL/SCRAM service accounts.
- AI inference serving cluster accessible only from stream processor network segment; not exposed externally.
- Consumer applications read from output topics using topic ACLs — read access only to subscribed topics.
Secrets Management
- AI model serving API keys (for external model providers) stored in secrets manager; injected at job startup.
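- Kafka service account credentials rotated on a 90-day schedule; refreshing credentials without a job restart relies on connector or deployment platform support (for example, re-reading a mounted secret) rather than a dedicated Flink secrets interface, and should be verified per job.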
Data Classification
- Events containing PII (transaction data, customer interaction data) classified at ingestion.
- PII events processed by AI inference must stay within the data residency boundary for which PII processing has been approved.
- Inference results containing inferred sensitive attributes (credit risk score, health condition flags) classified at the highest sensitivity level and access-controlled on output topics.
Encryption
- All event data in transit encrypted (TLS 1.3 for Kafka, mTLS for inference serving).
- Flink state backend (RocksDB) encrypted at rest using filesystem encryption.
- Inference results on output topics encrypted at rest (broker-level encryption).
Auditability
- Every inference event logged with: event_id, partition_key, inference_model, result, confidence, latency, processing_timestamp.
- Inference SLO breaches logged as separate audit events — enables compliance reporting on AI processing timeliness.
- For regulated use cases, inference audit log retained for the same period as the underlying transaction record.
OWASP LLM Top 10 Mitigations
| OWASP LLM Risk | Relevance | Mitigation in This Pattern |
|---|---|---|
| LLM01: Prompt Injection | Medium | Stream events are structured data, not free text; field extraction before prompt construction; free-text fields (customer messages) sanitised |
| LLM02: Insecure Output Handling | High | AI inference result parsed from structured response; enriched event schema validated before writing to output topic |
| LLM03: Training Data Poisoning | Low | Inference only (not training); stream data is not used to update model weights in this pattern |
| LLM04: Model Denial of Service | High | Per-consumer-group rate limits on inference serving; exactly-once semantics prevent retry amplification storms |
| LLM05: Supply Chain Vulnerabilities | Medium | Model serving infrastructure uses pinned container image versions; inference SDK versions pinned; SBOM per job release |
| LLM06: Sensitive Information Disclosure | High | PII in event payloads masked before including in LLM prompts for customer interaction analysis; transaction amounts may be binned rather than exact |
| LLM07: Insecure Plugin Design | Low | Stream processing pattern does not use function-calling or plugins; inference is classification/scoring only |
| LLM08: Excessive Agency | Low | Stream processors emit enriched events; action on those events is taken by consumer applications, not the processing pipeline |
| LLM09: Overreliance | High | Confidence score in every inference result; consumers configured with minimum confidence thresholds below which AI result is not acted upon automatically |
| LLM10: Model Theft | Medium | Model serving cluster in private network; no external access; for proprietary fine-tuned models, access-controlled inference endpoint with auth |
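The LLM06 mitigation (mask PII, bin amounts) might look like the following before prompt construction. The patterns and bin edges are illustrative assumptions only; production masking would normally rely on a vetted PII detection service rather than two regular expressions.

```python
import re

# Assumed, deliberately simple patterns for illustration.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
CARD = re.compile(r"\b\d{13,19}\b")


def mask_text(text):
    """Replace email addresses and card-like digit runs with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return CARD.sub("[CARD]", text)


def bin_amount(amount):
    """Map an exact amount to a coarse bucket (assumed bin edges)."""
    for upper, label in [(100, "<100"), (1_000, "100-999"), (10_000, "1000-9999")]:
        if amount < upper:
            return label
    return ">=10000"


masked = mask_text("Refund to jo@example.com on card 4111111111111111 please")
bucket = bin_amount(250)
```

Binning trades precision for privacy, so bin edges should be chosen with the model's input requirements in mind.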
9. Governance Considerations
Responsible AI
- Real-time AI inference on customer data requires a documented basis for processing under applicable privacy and AI regulation (AU Privacy Act, GDPR, EU AI Act).
- Confidence score monitoring enables detection of model quality degradation on live traffic — governance team reviews weekly confidence score distributions.
- Outcome feedback loops: consumer applications should publish outcome events (fraud confirmed/cleared, churn detected/averted) back to a feedback topic so model performance can be measured against ground truth.
Model Risk Management
- Real-time fraud scoring models are high-risk AI systems; initial validation, regular backtesting, and champion/challenger deployment are required.
- Model version tracked in every inference result event; enables retrospective performance analysis by model version using event archive.
- Champion/challenger: Flink topology supports parallel consumer groups reading the same input topic — shadow mode sends 10% of traffic to challenger model for comparison before promotion.
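Selecting the 10% shadow sample deterministically keeps the challenger's traffic stable across replays. A minimal sketch, assuming selection by hashing the event ID; `in_challenger_sample` is a hypothetical helper.

```python
import hashlib


def in_challenger_sample(event_id, percent=10):
    """Deterministic sampling: the same event_id always gets the same answer."""
    digest = hashlib.sha256(event_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return bucket < percent


ids = [f"evt-{i}" for i in range(10_000)]
share = sum(in_challenger_sample(i) for i in ids) / len(ids)
```

Because the decision depends only on the event ID, a reprocessed event is routed the same way on replay, which keeps champion and challenger results comparable.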
Human Approval Gates
- High-value automated actions (block a transaction > $10,000, suspend a customer account) triggered by AI stream inference require human review queue before execution.
- Real-time AI scores are available immediately for human review interfaces; automated action requires meeting a high-confidence threshold AND a human review window (typically 15–60 seconds for fraud).
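One possible reading of these gates as a decision function is sketched below. The $10,000 threshold comes from the text; `MIN_CONFIDENCE`, the function name, and the returned labels are assumptions for illustration.

```python
HIGH_VALUE_THRESHOLD = 10_000  # from the text: block a transaction > $10,000
MIN_CONFIDENCE = 0.95          # assumed high-confidence threshold


def decide(amount, confidence, predicted_fraud):
    """Return the action a consumer application would take on an AI score."""
    if not predicted_fraud:
        return "ALLOW"
    if confidence < MIN_CONFIDENCE:
        return "HUMAN_REVIEW"  # too uncertain to act automatically
    if amount > HIGH_VALUE_THRESHOLD:
        return "HUMAN_REVIEW"  # high-value action needs a reviewer
    return "AUTO_BLOCK"


decisions = [
    decide(50, 0.99, True),
    decide(50_000, 0.99, True),
    decide(50, 0.60, True),
    decide(50_000, 0.99, False),
]
```

The review window (15–60 seconds for fraud) would be enforced by the review queue itself and is not modelled here.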
Policy and Traceability
- Every automated action taken based on a real-time AI inference result must be traceable back to: the AI model version, the event_id, the inference confidence, and the downstream consumer that took the action.
Governance Artefacts
| Artefact | Owner | Update Frequency | Storage Location |
|---|---|---|---|
| Model Risk Assessment (Real-Time Models) | Model Risk Team | Per model version change | MRM register |
| Inference SLO Compliance Report | Platform Engineering | Monthly | Observability platform |
| Confidence Score Distribution Report | AI Governance | Weekly | Data platform |
| Privacy Impact Assessment (Stream AI) | Privacy Officer | Per new data type in stream | Privacy register |
| Champion/Challenger Comparison Report | Data Science | Per model evaluation cycle | ML platform |
| Outcome Feedback Report | Data Science | Monthly | ML platform |
10. Operational Considerations
Monitoring and SLOs
| SLO | Target | Measurement | Alert Threshold |
|---|---|---|---|
| End-to-end stream AI latency (p99) | Use-case specific (see table below) | Event_time to inference result in output topic | > 2× SLO target |
| AI inference serving latency (p99) | < 70% of end-to-end SLO | Serving endpoint latency metric | > 80% of SLO target |
| Stream processing lag | < 1000 events | Consumer group lag metric | > 5000 events |
| Inference error rate | < 0.5% | Errors / total inference requests | > 2% in any 15-min window |
| DLQ growth rate | 0 net new (steady state) | DLQ message count delta | Any sustained growth |
| Flink checkpoint success rate | 100% | Flink checkpoint metrics | Any checkpoint failure |
Use-Case SLO Reference:
| Use Case | p99 Latency SLO | Architecture Recommended |
|---|---|---|
| Payment fraud detection | < 200ms | Stateless (Architecture 1) with GPU serving |
| AML transaction monitoring | < 5s | Stateful Flink (Architecture 2) |
| Customer interaction sentiment | < 2s | Stateful Flink (Architecture 2) |
| Content moderation | < 1s | Stateless (Architecture 1) |
| Document stream classification | < 30s | Micro-batch Spark (Architecture 3) |
Logging
- Flink task manager logs: job ID, task name, event processing rate, state size, checkpoint duration.
- Inference serving logs: request ID, model, input size, latency, result category, confidence.
- DLQ events: full event payload, error reason, retry count, last failure timestamp.
Incident Response
- AI inference serving degradation: stream processor consumer lag accumulates; auto-scaler adds serving replicas; SLO breach events logged; alert fires within 60 seconds.
- Flink job failure: Flink restarts from last checkpoint; events between checkpoint and failure reprocessed; idempotency store prevents duplicate output.
- Model quality degradation: confidence score distribution alert fires; AI governance team reviews; champion/challenger evaluation triggered; model rollback if warranted.
Disaster Recovery
| Scenario | RTO | RPO | Recovery Procedure |
|---|---|---|---|
| Flink job failure | 3 minutes | 0 (checkpoint-based recovery) | Flink automatic restart from last checkpoint; events replayed from Kafka offset |
| Inference serving cluster failure | 5 minutes | 0 | Kubernetes replacement pods; stream processing continues when serving recovers |
| Kafka broker node failure | 10 minutes | 0 (replicated partitions) | Kafka partition leader election; stream processors reconnect automatically |
| Flink state backend corruption | 30 minutes | Up to last checkpoint interval | Restore state from S3 checkpoint; reprocess events from checkpoint offset |
Capacity Planning
- AI inference serving: (peak events/second × AI fraction) / (inference serving throughput per replica) = minimum replicas; add 50% headroom.
- Flink parallelism: the number of Kafka partitions caps the useful parallelism of the Flink source; start at the Kafka partition count and scale based on consumer lag.
- State backend memory: (average state size per key) × (number of active keys) × (1.5 overhead factor) = RocksDB memory budget.
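The serving-replica and state-budget formulas above can be worked through with sample numbers. The input values below (20,000 events/second peak, 5% routed to AI, 200 inferences/second per replica, 4 KiB of state per key, 2 million active keys) are assumptions for illustration only.

```python
import math


def min_serving_replicas(peak_eps, ai_fraction, per_replica_eps, headroom=0.5):
    """(peak events/s x AI fraction) / per-replica throughput, plus headroom."""
    base = peak_eps * ai_fraction / per_replica_eps
    return math.ceil(base * (1 + headroom))


def rocksdb_budget_bytes(avg_state_bytes, active_keys, overhead=1.5):
    """Average state per key x active keys x overhead factor."""
    return avg_state_bytes * active_keys * overhead


replicas = min_serving_replicas(peak_eps=20_000, ai_fraction=0.05,
                                per_replica_eps=200)
budget_bytes = rocksdb_budget_bytes(4_096, 2_000_000)
budget_gib = budget_bytes / 2**30
```

With these inputs the base requirement is 5 replicas, which becomes 8 after 50% headroom and rounding up; the state budget comes to roughly 11.4 GiB.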
11. Cost Considerations
Cost Drivers
| Cost Driver | Description | Typical Proportion |
|---|---|---|
| AI Model Serving Compute (GPU) | GPU instance costs for inference serving; dominant cost driver for neural models | 40–65% |
| Streaming Platform | Kafka/Kinesis partition-hours, storage, data transfer | 15–25% |
| Stream Processing Compute | Flink/Spark worker node compute; scales with throughput | 10–20% |
| Flink State Storage | RocksDB memory; S3 for checkpoint storage | 3–8% |
| Audit and Output Storage | Output topic retention, audit log storage | 3–7% |
Scaling Risks
- GPU serving costs are fixed per instance — paying for GPU capacity 24/7 even during low-traffic hours is expensive. Auto-scaling with scale-to-zero (where SLO allows cold start) or scheduled scaling for predictable traffic patterns reduces waste.
- Stateful Flink jobs with large per-key state (long context windows for millions of keys) can exhaust RocksDB memory leading to job failure; state TTL configuration and state size monitoring are mandatory.
Cost Optimisations
- Tiered processing (Architecture 4): route only flagged events (1–10% of volume) to expensive deep AI inference; apply cheap classifiers to all events.
- Micro-batching (Architecture 3): amortise per-call overhead; batch inference pricing often 30–50% cheaper than per-request pricing.
- Inference result caching: deduplicate semantically identical events within a short time window before calling the model.
- Spot/preemptible GPU instances for batch inference tier; on-demand only for real-time latency-sensitive tier.
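Inference result caching can be sketched with a TTL cache keyed on normalised text. Note the simplification: this deduplicates exact matches after whitespace and case normalisation, which is weaker than the semantic similarity matching mentioned in the components table. `InferenceCache` and its methods are hypothetical names.

```python
import hashlib
import time


class InferenceCache:
    """TTL cache keyed on a hash of the normalised input text."""

    def __init__(self, ttl_s, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self.entries = {}  # key -> (expires_at, result)

    @staticmethod
    def key_for(text):
        normalised = " ".join(text.lower().split())  # case and whitespace
        return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

    def get_or_infer(self, text, infer):
        key, now = self.key_for(text), self.clock()
        hit = self.entries.get(key)
        if hit and hit[0] > now:
            return hit[1]  # fresh cached result; model is not called
        result = infer(text)
        self.entries[key] = (now + self.ttl_s, result)
        return result


now = [0.0]
calls = []


def model(text):
    """Stand-in model that records each call and returns the call count."""
    calls.append(text)
    return len(calls)


cache = InferenceCache(ttl_s=60, clock=lambda: now[0])
first = cache.get_or_infer("Card  DECLINED again", model)
second = cache.get_or_infer("card declined again", model)  # cache hit
now[0] = 61.0
third = cache.get_or_infer("card declined again", model)   # entry expired
```

The short TTL bounds how long a stale result can be served, which matters when model versions change.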
Indicative Cost Range
| Scale | Monthly Streaming Platform | AI Serving Compute | Total Monthly |
|---|---|---|---|
| Small (100K events/day, fraud detection) | $500–$1,500 | $2,000–$8,000 | $2,500–$9,500 |
| Medium (10M events/day, multi-use-case) | $5,000–$15,000 | $20,000–$60,000 | $25,000–$75,000 |
| Large (1B+ events/day, financial services) | $40,000–$100,000 | $150,000–$500,000 | $190,000–$600,000 |
12. Trade-Off Analysis
Architectural Options Comparison
| Option | Latency | Throughput | AI Quality | Complexity | Cost | Best For |
|---|---|---|---|---|---|---|
| Architecture 1: Stateless (Kafka Streams) | < 500ms | High | Medium (no context) | Low | Medium | Per-event classification; content moderation; spam filtering |
| Architecture 2: Stateful (Flink AsyncIO) | 500ms–5s | Medium | High (context window) | Very High | High | Fraud pattern detection across transaction sequences; sentiment trend analysis; AML monitoring |
| Architecture 3: Micro-Batch (Spark) | 5s–60s | Very High | Medium-High | Medium | Low-Medium | Document stream classification; batch-tolerant enrichment |
| Architecture 4: Tiered Processing | 200ms–5s | Very High | High (for flagged events) | High | Low | High-volume streams where deep AI on all events is cost-prohibitive |
Architectural Tensions
| Tension | Trade-Off | Resolution |
|---|---|---|
| Latency vs. Context Quality | Stateless per-event inference is fastest but has no context; stateful window inference is higher quality but adds accumulation latency | Match architecture to use case SLO: fraud = stateless; sentiment trend = stateful |
| Exactly-once vs. Throughput | Exactly-once semantics add 20–40% overhead from idempotency store lookups | Apply exactly-once only in regulated use cases (fraud, AML, compliance); use at-least-once for analytics enrichment |
| AI Cost vs. Coverage | Deep AI on all events is costly; but fast-tier routing may miss edge cases | Tune fast-tier false-negative rate; accept small miss rate for cost savings; periodic full-coverage sampling for model evaluation |
13. Failure Modes
| Failure | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| AI inference serving GPU OOM | Medium | High: serving cluster restarts; stream lag accumulates | GPU memory utilisation alert; HTTP 500 from serving | Auto-scaling adds serving replicas; stream catches up from lag |
| Flink state backend exceeds memory | Medium | Critical: Flink job fails; requires full restart | RocksDB metrics; JVM heap alert | Restart from last checkpoint; review state TTL configuration |
| Late-arriving events after watermark | High | Low-Medium: events not included in correct inference window | Watermark late-event counter metric | Per-use-case policy: reopen window, route to DLQ, or discard with logging |
| Model quality cliff after version update | Medium | High: incorrect inferences at scale before detection | Confidence score distribution alert; confusion matrix degradation | Rollback to previous model version via serving configuration; Flink job restart not required |
| Kafka consumer offset corruption | Very Low | High: events may be reprocessed or skipped | Consumer group monitoring; unusual lag patterns | Reset offset to last known-good position; idempotency store handles duplicate reprocessing |
| AI serving SLO breach during traffic spike | High | Medium: events processed late; stream lag grows | p99 latency exceeds SLO alert | Auto-scaling; micro-batching temporarily increases batch size to absorb volume |
Cascading Failure Scenarios
- Flink state size explosion + no TTL + model quality event: Confidence score distribution shifts → governance team launches investigation → triggers re-processing of all events from last 24h → Flink state doubles in size → RocksDB OOM → Flink fails → stream events accumulate in Kafka → backpressure builds → producer events delayed → business operations degraded. Mitigation: state TTL mandatory; reprocessing jobs run on separate Flink cluster, not production job.
- Inference serving cold start + real-time fraud SLO: Spot GPU instances reclaimed during peak traffic → new GPU instances take 5–10 minutes to start → during cold start, all fraud inference exceeds SLO → fraud decisions default to "allow" (graceful degradation) → fraud ring exploits window. Mitigation: minimum 2 on-demand GPU instances for fraud SLO; spot instances supplemental only.
14. Regulatory Considerations
APRA CPS 230 — Operational Risk
- Clause 36: Real-time AI inference in payment fraud detection is a critical business service; the stream processing infrastructure must be included in the Business Continuity Plan with documented RTO/RPO matching the fraud detection SLO.
- Clause 52: AI model providers and managed streaming platforms (Kafka SaaS, Kinesis) are material third-party service providers under CPS 230.
APRA CPS 234 — Information Security
- Clause 15: mTLS between all stream processing components, encrypted state backends, and topic ACLs address the proportional information security control requirement for streaming PII.
AUSTRAC AML/CTF Act 2006 (AU) / FATF Recommendations
- Real-time AML monitoring via stream AI must produce the equivalent of a transaction monitoring system alert; the inference result event must be structured to feed the AUSTRAC suspicious matter report workflow with full audit trail.
EU AI Act (2024)
- Article 6 (High-Risk): Annex III lists creditworthiness assessment of natural persons as high-risk but explicitly excepts AI systems used to detect financial fraud, so each stream AI use case should be classified individually; where a use case is high-risk, this pattern must include the conformity assessment, logging, and human oversight requirements.
- Article 12 (Record-keeping): The inference audit log supports the logging obligation for high-risk AI. The 5-year retention period reflects typical financial record-keeping rules rather than GDPR, which sets no fixed minimum; confirm the period against the applicable regime.
- Article 9 (Risk Management): Exactly-once semantics, confidence thresholds, and human review queues serve as risk management measures for high-risk AI systems.
ISO 42001 — AI Management System
- Clause 8.3 (AI System Design): Architecture selection (stateless vs. stateful vs. micro-batch) must be documented as an AI system design decision with justification against the SLO and use-case requirements.
NIST AI RMF (2023)
- MEASURE 2.6: Performance metrics (inference latency p99, confidence score distribution, SLO compliance rate) constitute the ongoing performance measurement required under NIST AI RMF.
- MANAGE 4.1: Incident response procedures for model quality degradation and stream processing failure implement the AI risk incident response requirements.
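The three performance metrics named under MEASURE 2.6 can be computed from a window of inference samples roughly as follows. This is a sketch under stated assumptions: the function names are hypothetical, the percentile uses the nearest-rank definition (other definitions interpolate and give slightly different values), and confidence scores are assumed to lie in [0, 1].

```python
import math
from typing import List, Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples <= it."""
    if not values:
        raise ValueError("percentile requires at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def slo_compliance_rate(latencies_ms: Sequence[float], slo_ms: float) -> float:
    """Fraction of inferences that completed within the latency SLO."""
    if not latencies_ms:
        raise ValueError("compliance rate requires at least one sample")
    within = sum(1 for latency in latencies_ms if latency <= slo_ms)
    return within / len(latencies_ms)


def confidence_histogram(scores: Sequence[float], bins: int = 10) -> List[int]:
    """Counts of confidence scores per equal-width bin over [0, 1]."""
    counts = [0] * bins
    for score in scores:
        index = min(int(score * bins), bins - 1)  # a score of 1.0 falls in the last bin
        counts[index] += 1
    return counts
```

Tracking the histogram over time, rather than only a mean confidence, makes a shift in the score distribution visible before it shows up as a change in alert volume.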
15. Reference Implementations
AWS
- Streaming Platform: Amazon Kinesis Data Streams or Amazon MSK (Kafka-compatible)
- Stateful Processing: Amazon Managed Service for Apache Flink (formerly Amazon Kinesis Data Analytics)
- Micro-Batch Processing: AWS Glue Streaming ETL (Spark)
- AI Inference Serving: Amazon SageMaker Real-Time Inference with auto-scaling; or Amazon Bedrock for LLM inference
- Idempotency Store: Amazon DynamoDB (conditional writes for deduplication)
- State Backend: S3 for Flink checkpoint storage
- Audit Logger: Amazon Data Firehose (formerly Kinesis Data Firehose) → S3 → Athena for compliance queries
Azure
- Streaming Platform: Azure Event Hubs (Kafka-compatible endpoint) or Azure Service Bus Premium
- Stateful Processing: Azure HDInsight (Flink) or Azure Stream Analytics (built-in ML scoring)
- Micro-Batch Processing: Azure Databricks Structured Streaming
- AI Inference Serving: Azure Machine Learning Real-Time Endpoints or Azure OpenAI Service
- Idempotency Store: Azure Cosmos DB (conditional upsert)
- Audit Logger: Azure Event Hubs Capture → ADLS Gen2 → Synapse Analytics
GCP
- Streaming Platform: Google Cloud Pub/Sub
- Stateful Processing: Google Cloud Dataflow (managed runner for Apache Beam pipelines)
- Micro-Batch Processing: Google Cloud Dataproc (Spark)
- AI Inference Serving: Vertex AI Online Prediction or Cloud Run with vLLM
- Idempotency Store: Cloud Firestore or Cloud Bigtable (row-level transactions)
- Audit Logger: Pub/Sub → BigQuery Streaming Insert for real-time query
On-Premises / Private Cloud
- Streaming Platform: Apache Kafka (Strimzi Operator on Kubernetes)
- Stateful Processing: Apache Flink (Flink Kubernetes Operator)
- Micro-Batch Processing: Apache Spark on Kubernetes (Spark Operator)
- AI Inference Serving: NVIDIA Triton Inference Server on GPU nodes
- Idempotency Store: PostgreSQL with unique constraint on event_id
- Audit Logger: Fluentd → Elasticsearch with 7-year retention policy
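The idempotency store entries above share one idea: let the store reject a second write for the same event ID. The sketch below illustrates the unique-constraint variant. It uses Python's built-in `sqlite3` purely so the example runs without a server; the table name `processed_events` and the helper names are hypothetical, and a PostgreSQL deployment would express the same check with a unique constraint on `event_id` and its own client library.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the deduplication table; event_id is the uniqueness key."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_events ("
        " event_id TEXT PRIMARY KEY,"
        " result TEXT NOT NULL)"
    )
    return conn


def record_once(conn: sqlite3.Connection, event_id: str, result: str) -> bool:
    """Return True if this call recorded the event, False if it was a duplicate."""
    try:
        with conn:  # commits on success, rolls back if the insert fails
            conn.execute(
                "INSERT INTO processed_events (event_id, result) VALUES (?, ?)",
                (event_id, result),
            )
        return True
    except sqlite3.IntegrityError:
        # The uniqueness constraint fired: the event was already processed.
        return False


if __name__ == "__main__":
    store = open_store()
    print(record_once(store, "txn-42", "allow"))  # first delivery is recorded
    print(record_once(store, "txn-42", "allow"))  # redelivery is rejected
```

Relying on the constraint, rather than a read followed by a write, avoids the race in which two consumers both see "not yet processed" and both emit a result.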
16. Related Patterns
| Pattern | Relationship | Notes |
|---|---|---|
| EAAPL-INT001 — Enterprise AI Service Bus | Specialises | Real-time stream processing is a high-throughput consumer topology for the AI Service Bus |
| EAAPL-INT005 — Batch AI Processing | Complementary | Batch processing handles the non-real-time tier; together they form a complete Lambda architecture for AI |
| EAAPL-INT007 — AI Circuit Breaker | Enables | Circuit breaker wraps the AI inference serving calls within stream processing topologies |
| EAAPL-INT008 — Bidirectional AI Sync | Complementary | Inference results written to output topics may feed the sync pattern to update enterprise data stores |
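To illustrate the EAAPL-INT007 relationship, the sketch below shows a circuit breaker wrapped around an inference call inside a stream operator. It is a deliberately minimal assumption-laden example, not the EAAPL-INT007 design: the class name, the thresholds, and the injectable `clock` parameter are all hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("inference circuit is open")
            # Cool-down elapsed: let one trial call through (half-open state).
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

In a stream topology the operator would catch `CircuitOpenError` and route the event to whatever fallback path the use case defines, instead of waiting on an inference endpoint that is already failing.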
17. Maturity Assessment
Overall Maturity: Proven
| Dimension | Score (1–5) | Justification |
|---|---|---|
| Architectural Completeness | 5 | Four architecture topologies; exactly-once semantics; watermarking; model serving; all failure modes covered |
| Operational Readiness | 4 | Comprehensive SLOs; Flink operational maturity requires specialist skill not universally available |
| Security Coverage | 5 | mTLS, PII masking, topic ACLs, OWASP LLM Top 10 addressed |
| Governance Coverage | 5 | Model risk management, outcome feedback loops, champion/challenger, regulatory citations all included |
| Cost Predictability | 3 | GPU serving costs are volatile; traffic-driven AI costs are inherently variable |
| Implementation Complexity | 2 | Very high: requires streaming platform expertise + AI serving expertise + stream processing expertise simultaneously |
| Industry Validation | 5 | Production deployments at tier-1 banks, telecommunications providers, and payment networks globally |
18. Revision History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2026-06-12 | EAAPL Working Group | Initial publication, integration patterns series |