EAAPLEnterprise AI Architecture Pattern Library
EAAPLLibraryData ArchitectureEAAPL-DAT008
EAAPL-DAT008Proven
⇄ Compare

Real-Time Feature Engineering

🗄️ Data ArchitectureEU AI ActISO/IEC 42001

[EAAPL-DAT008] Real-Time Feature Engineering

Category: Data Architecture
Sub-category: Feature Engineering / Online ML Serving
Version: 1.3
Maturity: Proven
Tags: feature-engineering, feature-store, real-time, Kafka, Flink, Redis, point-in-time-correctness, feature-drift
Regulatory Relevance: EU AI Act Article 10, APRA CPS 234 (operational resilience), ISO 42001 §8.4


1. Executive Summary

AI models that make real-time decisions — fraud detection, personalised recommendations, credit pre-screening — require features computed from the most current available data. Batch feature engineering, where features are pre-computed overnight, introduces staleness that degrades model performance in time-sensitive use cases: fraud features computed 8 hours ago miss the most recent spending patterns that indicate fraud.

This pattern defines a production real-time feature engineering architecture that delivers features within milliseconds to inference endpoints, with freshness SLAs as tight as 10 seconds from event to serving. It covers the streaming pipeline (Kafka/Flink for feature computation), the dual-layer feature store (Redis for online serving, object store for offline training), point-in-time correctness for training data consistency, feature freshness SLA management, and feature drift detection.

Organisations that implement this pattern achieve fraud detection precision improvements of 15–25% through fresher features, and reduce time-to-feature from hours to seconds for real-time AI use cases.

Target audience: ML Platform leads, Data Engineering leads, Enterprise Architects.


2. Problem Statement

Business Problem

Real-time AI decisions are only as good as the data behind them. Stale features cause models to miss current context: a customer who spent $5,000 in the last 10 minutes looks identical to a normal customer on yesterday's features. Business consequences include missed fraud, poor real-time personalisation, and incorrect risk assessments.

Technical Problem

  • Batch feature pipelines (nightly ETL) produce features that are 8–24 hours stale by inference time.
  • Ad hoc feature computation at inference time introduces latency (50–500ms per lookup) and creates unmanaged technical debt.
  • Features computed differently for training (batch) vs. serving (real-time) introduce training-serving skew — one of the top causes of production model underperformance.
  • Point-in-time correctness is violated: training data uses features computed at the current time, not the time of the historical event, leaking future information into training.
  • Feature pipelines are duplicated across teams; no reuse; inconsistent feature definitions.

Symptoms

  • Fraud detection model AUC degrades during peak transaction periods (features are stale).
  • Model performs well in offline evaluation but poorly in production (training-serving skew).
  • Multiple teams computing the same features with slightly different logic.
  • Feature pipeline failures cause model fallback to default values with no alerting.
  • Training labels point-in-time valid but features are current-time (future data leakage detected in post-hoc audit).

Cost of Inaction

Dimension Impact
Model quality 15–40% degradation in time-sensitive use cases from feature staleness
Fraud / risk Missed fraud events during staleness window; direct financial loss
Engineering Feature duplication across teams; multiply estimated at 2–3× engineering waste
Compliance APRA operational resilience: feature pipeline failure = AI system degradation

3. Context

When to Apply

  • AI models making real-time decisions where feature freshness affects quality (fraud, recommendations, pricing, content ranking).
  • Models where latency budget for inference is <100ms total.
  • Organisations with >3 ML teams benefiting from shared feature definitions.
  • Use cases requiring training-serving consistency (training-serving skew is a documented issue).

When NOT to Apply

  • Batch inference where features can be pre-computed (nightly batch is sufficient).
  • Very simple AI with 1–3 features that can be trivially computed at request time.
  • Organisation at early AI maturity where a shared feature store adds operational complexity beyond team capability.

Prerequisites

Prerequisite Minimum Viable Preferred
Event streaming infrastructure Kafka (basic) Kafka + Schema Registry + Kafka Streams
Stream processing Flink (basic) Apache Flink + Flink SQL
Online serving store Redis (standalone) Redis Enterprise Cluster
ML platform MLflow Feast / Tecton / Vertex AI Feature Store
Feature ownership model Informal Formal feature ownership with SLA commitments

Industry Applicability

Industry Applicability Driver
Financial Services / Fintech Critical Real-time fraud; credit pre-screening; risk pricing
E-commerce High Real-time personalisation; dynamic pricing; recommendation
Ride-sharing / Delivery High Dynamic pricing; driver-rider matching; ETA prediction
Telecommunications High Real-time churn intervention; network anomaly
Gaming Medium Real-time player behaviour; churn prediction
Healthcare Medium Real-time clinical risk; patient monitoring

4. Architecture Overview

Design Philosophy

The real-time feature engineering architecture solves the fundamental tension between feature freshness and serving reliability by maintaining two feature stores simultaneously — an online store for low-latency serving and an offline store for training data — with a shared computation layer ensuring they produce identical feature values.

Stream-First, Batch-Fallback Design. Features are computed from streaming events (Kafka) using Flink streaming jobs. This provides freshness from seconds (aggregated windows) to sub-second (lookup features). For features that cannot be computed from streams (complex aggregations requiring full historical data), a batch top-up runs at configurable intervals (hourly, daily) to pre-compute the batch component, which is then merged with the streaming-computed component in the feature store.

Dual-Layer Feature Store. The architecture maintains two feature store layers with different characteristics:

  • Online store (Redis): Serves features at inference time with <10ms p99 latency. Stores only the most recent feature values per entity (customer_id, account_id, etc.). Redis sorted sets serve time-windowed aggregations. Optimised for point-in-time latest reads.
  • Offline store (object store + columnar format): Stores the full history of feature values with timestamps. Used for training dataset creation. Enables point-in-time correct feature retrieval: given a historical event at time T, retrieve the feature value that would have been available at time T — not the current value.

Training-Serving Consistency via Shared Computation Logic. The most critical architectural property: the Flink streaming job and the offline feature pipeline must use identical computation logic. This is achieved through a shared Feature Transformation Library — a versioned Python/Java library containing the feature computation logic, consumed by both the Flink streaming job and the batch offline pipeline. When feature logic changes, both pipelines are updated together, eliminating training-serving skew by construction.

Point-in-Time Correctness. When creating a training dataset, features must be retrieved at the time of the label event, not the current time. The offline store's timestamp index enables point-in-time queries: for a fraud label at 2024-03-15 14:32:00, retrieve the feature values that existed at that exact time. This requires the offline store to retain historical feature snapshots (not just the latest value) — a storage cost trade-off managed through tiered retention (recent history at full fidelity; older history at reduced fidelity or summarised).

Feature Freshness SLA Management. Each feature has a declared freshness SLA — the maximum acceptable age of a feature value when served at inference time. The feature store monitoring layer continuously tracks feature ages and alerts when staleness exceeds SLA. Inference services receive feature freshness metadata alongside feature values, enabling them to downgrade confidence or trigger fallback behaviour when features are stale beyond SLA.

Feature Drift Detection. Feature values are monitored for distribution drift using Population Stability Index (PSI) computed over rolling windows. A feature whose distribution shifts significantly (PSI > 0.1) may indicate upstream data pipeline issues, concept drift, or business context changes. Drift alerts are routed to the feature owner for investigation.


5. Architecture Diagram

ARCHITECTURE DIAGRAM
flowchart TD subgraph Input["Event Sources"] A[Streaming Events] B[Batch Sources] end subgraph Compute["Feature Computation"] C[Flink Streaming Pipeline] D[Batch Top-Up Pipeline] end subgraph Store["Dual-Layer Feature Store"] E[(Online Store Redis)] F[(Offline Store Parquet)] end subgraph Serving["Model Serving"] G{Freshness Check} H[Inference Service] I[Training Dataset Builder] end A --> C B --> D C -->|real-time writes| E C -->|time-indexed writes| F D -->|batch writes| E D -->|history writes| F E --> G G -->|fresh| H G -->|stale| H F --> I style A fill:#dbeafe,stroke:#3b82f6 style B fill:#dbeafe,stroke:#3b82f6 style C fill:#f0fdf4,stroke:#22c55e style D fill:#f0fdf4,stroke:#22c55e style E fill:#fef9c3,stroke:#eab308 style F fill:#fef9c3,stroke:#eab308 style G fill:#f3e8ff,stroke:#a855f7 style H fill:#d1fae5,stroke:#10b981 style I fill:#d1fae5,stroke:#10b981

6. Components

Component Type Responsibility Technology Options Criticality
Kafka Event Bus Messaging Ingests operational events; provides ordered, partitioned stream per entity Apache Kafka, AWS MSK, Confluent Cloud Critical
Flink Streaming Job Processing Computes real-time features from event streams; sliding window aggregations Apache Flink, AWS Kinesis Data Analytics, Azure Stream Analytics Critical
Feature Transform Library Library Shared versioned feature computation logic; consumed by both stream and batch pipelines Python library (pip), Java library (Maven), dbt macros Critical
Online Feature Store Storage + Serving Low-latency feature serving for inference; latest feature values per entity Redis Enterprise, DynamoDB, Aerospike, Feast online store Critical
Offline Feature Store Storage Historical time-indexed feature values; point-in-time correct training data creation Parquet on S3/GCS/ADLS, Delta Lake, Apache Hudi Critical
Feature Registry Metadata Store Defines feature schemas, ownership, freshness SLA, lineage, approved use cases Feast Registry, Tecton, Vertex AI Feature Store Registry, custom PostgreSQL High
Batch Top-Up Pipeline Processing Computes batch-only or computationally heavy features on schedule Apache Spark, dbt, AWS Glue High
Feature Retrieval Client Library Batch entity lookup from online store for inference; handles missing features Feast Python SDK, Tecton SDK, custom Redis client Critical
Point-in-Time Query Engine Processing Retrieves feature values at historical timestamp for training data creation Feast historical retrieval, custom SQL with time-travel, Delta Lake time travel High
Feature Monitoring Service Processing Tracks freshness, drift (PSI), null rate per feature; fires alerts Custom Python + Grafana, Evidently AI feature monitoring, WhyLabs High

7. Data Flow

Primary Flow

Step Actor Action Output
1 Operational system Emits business event to Kafka topic Kafka event record
2 Flink streaming job Consumes event; applies Feature Transform Library logic; computes aggregations over sliding windows Feature value update
3 Online store writer Writes updated feature values to Redis with timestamp Redis HSET per entity ID
4 Offline store writer Appends feature values with event timestamp to Parquet/Delta Time-indexed feature row
5 Inference service Receives prediction request with entity ID Feature retrieval request
6 Feature Retrieval Client Batch-reads feature values for entity from Redis Feature vector with freshness metadata
7 Freshness Checker Validates each feature age against SLA Pass (fresh) or degraded mode
8 Inference service Passes feature vector to model; returns prediction Prediction with confidence score
9 Training data builder Point-in-time query: for label events at time T, retrieve feature values at T Point-in-time correct training dataset
10 Feature Monitor Continuously computes PSI, null rate, freshness age per feature Drift alerts; staleness alerts

Error Flow

Error Condition Trigger Response Recovery
Flink job failure Stream processing crash Online store stops updating; features become stale; freshness alert raised Flink job restarted; replay from Kafka retention window; freshness alert until caught up
Redis unavailable Cache failure Inference service falls back to cached last-known value or model default Redis failover (sentinel/cluster); alert; degrade predictions rather than error
Feature freshness SLA breach Feature age > SLA Inference service receives stale flag; confidence reduced or fallback triggered Investigate Flink job; Kafka lag; upstream event source
Training-serving skew detected Model performance drops in production vs. offline eval Feature Transform Library version audit; verify stream and batch use same version Roll back feature library version; retrain on corrected features
Point-in-time query failure Offline store missing historical window Training dataset creation fails Increase offline store retention window; backfill missing history from source

8. Security Considerations

Authentication & Authorisation

  • Redis online store access requires authentication; separate read-only credentials for inference services vs. read-write for feature pipelines.
  • Feature Registry access control: feature ownership declaration restricts writes; reads are broadly accessible within organisation.

Secrets Management

  • Redis credentials, Kafka credentials stored in secrets manager; rotated quarterly.

Data Classification

  • Online feature store contains derived features from personal data; classified as Confidential minimum.
  • Offline feature store contains historical personal data features; strict access control; encryption.

Encryption

  • Redis at rest encrypted (Redis Enterprise AES-256); TLS for Redis connections.
  • Offline feature store (S3/GCS) encrypted at rest; in transit TLS 1.3.

Auditability

  • Feature access at inference time logged per entity ID (for right-to-explanation requests).
  • Feature schema changes logged in Feature Registry with version history.

OWASP LLM Top 10 Mapping

OWASP LLM Risk Relevance Mitigation
LLM04: Model Denial of Service Feature store overloaded by high-QPS inference could degrade Redis cluster horizontal scaling; rate limiting on feature retrieval API
LLM02: Insecure Output Handling Stale or incorrect features cause model to produce wrong outputs Freshness checking; feature validation before inference
LLM06: Sensitive Information Disclosure Feature store access exposes derived personal data Access control on feature store; encryption; access logging

9. Governance Considerations

Responsible AI

  • Feature definitions must document what personal data they are derived from; subject to privacy review.
  • Features based on protected attributes (age, ethnicity, gender, postcode as proxy) require explicit bias review before use in consequential AI.

Model Risk Management

  • Feature freshness SLA is a model risk parameter: breach triggers model risk review.
  • Training-serving skew detection is a model risk control; regular skew audits (quarterly) required.

Human Approval Checkpoints

  • New feature requiring personal data processing: Privacy Officer review.
  • Feature based on protected attribute proxy: Bias Review Board sign-off.
  • Freshness SLA reduction (making SLA tighter): capacity review by ML Platform.

Governance Artefacts

Artefact Owner Cadence Purpose
Feature Registry Entry Feature Owner Per feature version Schema, SLA, lineage, data source declaration, approved uses
Freshness SLA Compliance Report ML Platform Weekly Per-feature compliance with declared freshness SLA
Drift Report Feature Monitor Daily PSI and null rate per feature; trend over 30 days
Training-Serving Skew Audit ML Platform Quarterly Feature value comparison: training snapshot vs. current production feature values

10. Operational Considerations

Monitoring

Metric Alert Threshold Tooling
Online store write lag (event to Redis) >freshness SLA Flink job lag + Redis write latency
Online store read latency (p99) >10ms Redis metrics + Grafana
Feature null rate >2% for any feature Feature Monitor
Feature PSI >0.1 warning; >0.25 block training Feature Monitor
Kafka consumer group lag >10,000 messages Kafka metrics
Flink job backpressure >50% sustained for 5 min Flink monitoring

SLOs

SLO Target Measurement
Event to online store (end-to-end freshness) <30 seconds (standard); <10 seconds (high-priority) Event timestamp to Redis write timestamp
Online store read latency (p99) <10ms Redis latency histogram
Online store availability 99.99% Health check
Training dataset creation (point-in-time query) <4 hours for 12-month history, 10M entities Query execution time

Capacity Planning

  • Redis: size based on peak entity count × feature count × average feature value size × replication factor.
  • Flink: size based on peak event throughput × window size complexity.
  • Offline store: size based on entity count × feature count × retention period × snapshot frequency.

Disaster Recovery

Component RTO RPO Strategy
Online Store (Redis) 5 minutes 1 minute Redis Sentinel/Cluster failover; persistence enabled
Flink Streaming Job 10 minutes Kafka retention window Flink checkpoint recovery; Kafka replay
Offline Store 4 hours 24 hours Cross-region replication; incremental backup

11. Cost Considerations

Cost Drivers

Cost Driver Typical Range Notes
Redis Enterprise (online store) $1,000–$15,000/month Scales with data size × replication factor
Flink compute $500–$8,000/month Scales with event throughput × window complexity
Kafka / MSK $300–$5,000/month Scales with throughput × retention period
Offline store (S3 + query) $200–$3,000/month Scales with entity × feature × history
Managed feature store (Tecton, Vertex AI FS) $3,000–$30,000/month Includes all components; simpler ops
Feature Monitor $200–$2,000/month Custom or Evidently AI / WhyLabs

Optimisations

  • Cache frequently accessed entity feature vectors at inference service level (TTL = freshness SLA) to reduce Redis load.
  • Use Redis Cluster hash slots to partition entities across nodes; avoids hot-key issues for high-velocity entities.
  • Compress feature values in Redis using MessagePack or CBOR; typically 3–5× storage reduction.
  • Tier offline store: recent 90 days at full fidelity; older at reduced granularity.

Indicative Cost Range

Scale Monthly Cost Basis
Small (<1M entities, <50 features, <10K events/sec) $2,000–$8,000 Redis standalone + Flink on Kubernetes + S3
Medium (10M entities, 200 features, 100K events/sec) $10,000–$40,000 Redis Enterprise Cluster + managed Flink + managed Kafka
Large (100M+ entities, 500+ features, 1M+ events/sec) $40,000–$200,000 Multi-region feature store + managed streaming platform

12. Trade-Off Analysis

Option Comparison

Option Pros Cons Recommended When
A: Full real-time feature pipeline (this pattern) Freshest features; training-serving consistency; shared features High operational complexity; significant infrastructure cost Fraud, real-time pricing, personalisation; >3 ML teams
B: Managed feature store (Tecton, Vertex AI FS) Reduced ops; built-in monitoring; enterprise support High licence cost; vendor lock-in Large organisation; ML platform team lacks streaming expertise
C: Batch feature computation (nightly ETL) Simple; low cost; well-understood Features 8–24 hours stale; misses real-time signals Batch inference only; freshness not a business requirement
D: Compute features at inference time (ad hoc) Zero infrastructure; feature always fresh High per-request latency (50–500ms DB queries); no reuse; creates technical debt Simple models; 1–3 features; no team sharing required

Architectural Tensions

Tension Trade-Off Resolution
Freshness vs. cost Fresher features require more streaming compute and Redis memory Declare freshness SLA per feature; only high-priority features use <30s freshness
Streaming vs. batch consistency Streaming and batch paths must produce identical feature values Shared Feature Transform Library consumed by both; skew testing on schedule
Online store size vs. latency More features = more Redis memory = higher cost; but fewer features = worse model Feature importance analysis; only features with significant model impact in online store
Operational complexity vs. velocity Full feature store platform has high setup cost Start with managed platform (Feast hosted) or Vertex AI FS; build custom when scale justifies

13. Failure Modes

Failure Likelihood Impact Detection Recovery
Training-serving skew (feature logic divergence) Medium High — model underperforms silently in production Skew audit; post-deployment performance monitoring Roll back Feature Transform Library version; retrain on corrected features
Hot key in Redis (all requests for same entity) Medium Medium — Redis latency spikes for affected entity Redis slow log; per-key latency monitoring Implement local inference-side cache; Redis cluster rebalancing
Kafka consumer lag spike Medium Medium — features become stale; SLA breach Kafka consumer lag metrics Scale Flink parallelism; increase Kafka partition count
Point-in-time query produces future-leaked features Low Critical — model trained with data leakage Post-hoc leakage detection in training data quality checks Rebuild offline store with correct timestamp indexing; retrain
Redis cluster split-brain Very Low High — inconsistent feature values Redis cluster health monitoring Redis Cluster automatic failover; read from primary only during split

14. Regulatory Considerations

Regulation Requirement Pattern Response
EU AI Act Article 10 Training data representativeness and freshness Freshness SLAs and point-in-time correct training address data currency requirements
APRA CPS 230 Operational resilience of AI systems Online store HA (99.99% SLO); Flink checkpoint recovery; DR targets defined
Privacy Act (Australia) APP 6 Personal data used for primary purpose Feature Registry declares data sources; purpose-limited feature access
GDPR Article 22 Right to explanation for automated decisions Feature access audit log per entity enables feature-level explanation

15. Reference Implementations

AWS

Component AWS Service
Event bus Amazon MSK (Managed Kafka)
Stream processing Amazon Kinesis Data Analytics (Flink)
Online store Amazon ElastiCache (Redis Cluster)
Offline store S3 + AWS Glue Catalog (Parquet/Delta)
Feature registry Amazon SageMaker Feature Store
Point-in-time query SageMaker Feature Store offline API

Azure

Component Azure Service
Event bus Azure Event Hubs (Kafka compatible)
Stream processing Azure Stream Analytics + HDInsight Flink
Online store Azure Cache for Redis Enterprise
Offline store ADLS Gen2 + Delta Lake
Feature registry Azure ML Feature Store (preview)

GCP

Component GCP Service
Event bus Cloud Pub/Sub + Dataflow (Apache Beam)
Stream processing Cloud Dataflow (Flink runner)
Online store Vertex AI Feature Store (online serving)
Offline store Vertex AI Feature Store (offline) + GCS
Feature registry Vertex AI Feature Store

On-Premises

Component Technology
Event bus Apache Kafka on Kubernetes
Stream processing Apache Flink on Kubernetes
Online store Redis Enterprise on Kubernetes
Offline store MinIO + Delta Lake
Feature registry Feast (self-hosted on Kubernetes)

Pattern ID Relationship Notes
AI Data Mesh Integration EAAPL-DAT001 Complements Online feature serving is a specialised data product
Data Quality for AI EAAPL-DAT002 Depends on Feature null rate and drift checks are quality dimensions
Data Lineage for AI EAAPL-DAT003 Complements Feature computation events captured in lineage graph
AI Training Data Governance EAAPL-DAT007 Depends on Feature definitions registered in training data governance framework
Shadow Model Deployment EAAPL-MDL002 Complements Shadow mode uses same feature pipeline; validates feature parity
Model Versioning EAAPL-MDL001 Complements Feature store version linked to model version

17. Maturity Assessment

Overall Maturity: Proven — Real-time feature engineering is production-proven at scale (Netflix, Uber, LinkedIn). Managed feature stores (Vertex AI, SageMaker) are GA. Operational complexity remains high for self-managed deployments.

Dimension Score (1–5) Notes
Architectural clarity 5 Well-defined dual-store architecture; point-in-time correctness well-understood
Tooling maturity 4 Redis, Kafka, Flink mature; feature store managed services maturing
Regulatory alignment 4 Freshness SLAs address EU AI Act data currency; APRA resilience covered
Operational complexity 2 High; requires streaming + Redis + feature store expertise
Cost efficiency 3 Significant infrastructure cost; justified for high-value real-time use cases
Security 4 Access controls, encryption, audit logging well-defined

18. Revision History

Version Date Author Changes
1.0 2023-08-01 EAAPL Working Group Initial publication
1.1 2024-01-15 EAAPL Working Group Added point-in-time correctness detail; training-serving skew section
1.2 2024-07-01 EAAPL Working Group Added managed feature store options; GCP Vertex AI FS reference
1.3 2025-03-01 EAAPL Working Group Updated cost ranges; added feature drift detection section
← Back to LibraryMore Data Architecture