EAAPL: Enterprise AI Architecture Pattern Library


[EAAPL-PLT009] Feature Store Integration

Category: Platform Engineering
Sub-category: ML Infrastructure / Data Engineering
Version: 1.1
Maturity: Proven
Tags: feature-store, feature-serving, online-inference, offline-training, feature-pipeline, point-in-time, feature-monitoring, training-serving-skew
Regulatory Relevance: EU AI Act Article 10 (Data Governance), ISO 42001 Clause 6, NIST AI RMF MAP 3.5


1. Executive Summary

Feature stores solve a deceptively simple problem: when an ML model needs a feature during inference, how does it get the right value, freshly computed, at low latency? And when training a new model, how does it get the exact same feature values that would have been available at prediction time in the past—preventing the data leakage that invalidates backtests and production evaluations?

The Feature Store Integration pattern establishes a shared infrastructure layer that decouples feature computation from feature consumption, enabling features to be computed once and reused across models, teams, and use cases. The online store serves low-latency feature retrieval for real-time inference; the offline store enables point-in-time correct training data generation. Feature pipelines manage computation and freshness; feature monitoring detects drift that would degrade model performance before it reaches users. For enterprises with multiple ML models consuming overlapping signals, the feature store is the difference between duplicated, inconsistent feature computation and a shared, governed, quality-assured data layer.


2. Problem Statement

Business Problem

Multiple ML models within the same organisation compute the same features independently, consuming redundant engineering effort and producing inconsistent values (e.g., "30-day spend" computed differently for fraud, recommendation, and credit risk models). Business decisions made on these models are implicitly inconsistent. When models are retrained, the historical features used for training may not match what would have been available at prediction time, leading to overoptimistic evaluation metrics and production performance gaps.

Technical Problem

Online inference requires feature values available in <10ms at the model API boundary; this requires a pre-computed, low-latency store. Training requires point-in-time correct historical feature values to avoid look-ahead bias. Without a feature store, teams either accept this bias or build expensive, fragile point-in-time joins from raw data. Feature pipelines are duplicated across teams with no shared infrastructure.

Symptoms

  • Same feature (e.g., customer 30-day transaction count) computed differently in 3 different model codebases
  • Production model performance consistently below offline evaluation metrics (training-serving skew)
  • Feature pipeline failures causing model inference to serve stale or missing features
  • No visibility into when a feature was last updated or what its current distribution is
  • Training datasets built from current feature values rather than the values available at the historical prediction time

Cost of Inaction

  • Training-serving skew causing production models to underperform their offline evaluation by 5–20%
  • 30–50% of ML engineering time spent on feature engineering that duplicates existing work
  • Model regressions caused by feature drift that goes undetected for weeks
  • Regulatory audits unable to reproduce model predictions due to no record of feature values at decision time

3. Context

When to Apply

  • Organisation has ≥2 ML models sharing overlapping input features
  • Real-time inference latency requirements (<50ms) demand pre-computed feature values
  • Training pipelines require point-in-time correct historical data
  • Feature reuse across teams is a stated engineering goal
  • Model performance monitoring requires feature drift detection

When NOT to Apply

  • Single simple model with unique features: feature store overhead not warranted
  • LLM-only organisation with no traditional ML models: most LLM use cases don't benefit from traditional feature stores (embeddings have their own infrastructure path)
  • Research experiments: use pandas and raw data; migrate to feature store when productionising

Prerequisites

  • Operational data sources (databases, event streams) producing features
  • Feature computation infrastructure (Spark, Flink, or dbt for offline; streaming processor for online)
  • Online store infrastructure (Redis or equivalent <10ms lookup)
  • Offline store infrastructure (data warehouse or object storage for point-in-time joins)
  • ML model serving infrastructure that can retrieve features at inference time

Industry Applicability

| Industry | Applicability | Key Use Case |
|---|---|---|
| Financial Services | Very High | Credit risk, fraud detection, CLV, trading signals |
| E-commerce / Retail | Very High | Personalisation, recommendation, dynamic pricing |
| Technology / SaaS | High | User behaviour, churn prediction, abuse detection |
| Healthcare | High | Risk stratification, readmission prediction |
| Telecommunications | High | Churn, network anomaly, usage prediction |
| Media / Streaming | High | Content recommendation, engagement prediction |

4. Architecture Overview

The feature store architecture is defined by the separation between its online and offline paths, each serving a different consumer with different latency and freshness characteristics.

The Online Store is a low-latency key-value store containing pre-computed feature values, indexed by entity key (e.g., customer_id, product_id, session_id). Lookup latency must be <10ms at P99 to be compatible with real-time inference SLAs. The online store is populated by the feature materialisation pipeline, which computes features from source data and writes them on a schedule (for batch features) or in near-real-time (for streaming features). Redis is the canonical technology for the online store; its GET operation with a compound key (entity_type:entity_id:feature_set) delivers sub-millisecond lookup at scale.

The online store does not store feature history—only the current value for each entity. This makes it fast and cheap. When a model is called for inference, the feature serving layer assembles the feature vector by looking up all required features for the request's entity IDs from the online store, combining them with request-time context (features that cannot be pre-computed because they depend on the current request), and passing the assembled feature vector to the model.
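The assembly step described above can be sketched as follows. This is a minimal illustration, not a reference implementation: a plain dictionary stands in for the online store so the example runs without Redis, and the function names, feature values, and the `request_channel` context field are hypothetical.

```python
from typing import Any

# Stand-in for the online store, keyed by "entity_type:entity_id:feature_set".
# Assumption: in production this would be a Redis (or equivalent) client.
# All values below are illustrative only.
ONLINE_STORE = {
    "customer:12345:spend_features": {"customer_30d_spend": 412.50},
    "customer:12345:risk_features": {"customer_risk_score": 0.17},
    "product:P789:engagement": {"product_view_count_7d": 38},
}


def make_key(entity_type: str, entity_id: str, feature_set: str) -> str:
    """Build the compound key entity_type:entity_id:feature_set."""
    return f"{entity_type}:{entity_id}:{feature_set}"


def assemble_feature_vector(lookups, request_context) -> dict[str, Any]:
    """Merge pre-computed features with request-time context.

    `lookups` is a list of (entity_type, entity_id, feature_set) tuples.
    """
    vector: dict[str, Any] = {}
    for entity_type, entity_id, feature_set in lookups:
        key = make_key(entity_type, entity_id, feature_set)
        # A missing key contributes no features; miss handling is out of scope here.
        vector.update(ONLINE_STORE.get(key, {}))
    vector.update(request_context)  # request-time features are added last
    return vector
```

Keeping only the current value per key is what lets the lookup stay a single keyed read per feature set; anything that depends on the request itself is merged in afterwards.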

The Offline Store serves training data generation and batch inference. Unlike the online store, the offline store retains historical feature values—specifically, the feature value that was current at any given point in time. This enables point-in-time correct training data generation: given a set of training examples with timestamps, retrieve the feature values that were available just before each timestamp. This prevents look-ahead bias (using future data to predict the past), which is the most common source of training-serving skew. The offline store is implemented as a time-partitioned table in a data warehouse (BigQuery, Redshift, Snowflake) or as Parquet files in object storage, with a time dimension on every feature record.
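A point-in-time lookup can be sketched in a few lines. The example below is an illustration under stated assumptions: feature history is held in memory as a time-sorted list per entity, timestamps are plain integers, and `value_as_of` and `build_training_rows` are hypothetical names. A warehouse implementation would express the same rule as an as-of join.

```python
from bisect import bisect_left

# Offline store stand-in: per entity, (feature_timestamp, value) sorted by time.
# Integer timestamps (e.g. epoch seconds) keep the sketch simple.
FEATURE_HISTORY = {
    "c1": [(100, 10.0), (200, 25.0), (300, 40.0)],
}


def value_as_of(entity_id, event_ts):
    """Return the latest value recorded strictly before event_ts, else None."""
    history = FEATURE_HISTORY.get(entity_id, [])
    timestamps = [ts for ts, _ in history]
    # bisect_left gives the number of records with timestamp < event_ts.
    idx = bisect_left(timestamps, event_ts)
    return history[idx - 1][1] if idx > 0 else None


def build_training_rows(examples):
    """Attach the point-in-time value to each (entity_id, event_ts, label)."""
    return [(e, ts, value_as_of(e, ts), label) for e, ts, label in examples]
```

The strict "before" comparison is the design choice that matters: a feature record stamped at exactly the event time is treated as not yet available, which errs on the side of excluding future data.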

Feature Pipelines compute and refresh feature values from source data. Batch pipelines run on a schedule (hourly, daily) using Spark or dbt and write to both the offline store (appending the new time-partitioned record) and the online store (overwriting the current value). Streaming pipelines consume event streams (Kafka, Kinesis) and compute features in near-real-time using Flink or Spark Streaming, writing to the online store with low latency. The choice between batch and streaming for a feature depends on its staleness tolerance: fraud detection features need values that are only seconds old; monthly customer metrics can be refreshed daily.
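The dual write performed by a batch pipeline can be sketched as below. This is a simplified illustration: both stores are in-memory stand-ins, and `materialise` is a hypothetical name rather than the API of any particular framework.

```python
# Append-only history: (entity_id, feature_name, value, computed_at).
offline_store = []
# Current value only: (entity_id, feature_name) -> value.
online_store = {}


def materialise(batch, computed_at):
    """Write one batch run to both stores.

    `batch` is an iterable of (entity_id, feature_name, value) tuples.
    """
    for entity_id, feature_name, value in batch:
        # Offline: append a new timestamped record so history is preserved.
        offline_store.append((entity_id, feature_name, value, computed_at))
        # Online: overwrite, because only the current value is served.
        online_store[(entity_id, feature_name)] = value
```

After two runs for the same entity, the offline store holds both records while the online store holds only the later value.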

Feature Registry is the metadata layer for the feature store. It records: feature name, description, data type, computation logic (the transformation that produces the feature), data source, update frequency, entity type, business owner, and deprecation status. The feature registry is the discovery mechanism that enables engineers to find existing features before building new ones. It also serves as the configuration source for the feature materialisation pipeline and the feature serving layer.
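One way to picture a registry entry is as a typed record carrying the metadata fields listed above. The sketch below is an assumption-laden illustration: the class names, field names, and substring search are hypothetical and do not mirror the schema of Feast, Tecton, or any other product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureDefinition:
    name: str
    description: str
    dtype: str
    computation: str       # transformation logic, e.g. SQL or a code reference
    source: str
    update_frequency: str
    entity_type: str
    owner: str
    deprecated: bool = False


class FeatureRegistry:
    def __init__(self):
        self._features = {}

    def register(self, feature):
        """Add a feature; duplicate names are rejected to force reuse."""
        if feature.name in self._features:
            raise ValueError(f"feature already registered: {feature.name}")
        self._features[feature.name] = feature

    def search(self, term):
        """Case-insensitive match on name or description, skipping deprecated."""
        term = term.lower()
        return [
            f for f in self._features.values()
            if not f.deprecated
            and (term in f.name.lower() or term in f.description.lower())
        ]
```

Rejecting duplicate names at registration time is one simple way to push engineers towards discovering an existing feature before building a new one.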

Feature Monitoring is the operational quality layer. For each feature, monitoring tracks: distribution statistics (mean, std, percentile distribution) on a rolling basis, freshness (time since last update vs. configured threshold), null rate (unexpected nulls indicate pipeline failures), and drift (statistical distance between the current distribution and the training-time distribution, using measures like PSI or Jensen-Shannon divergence). Alerts on feature drift enable proactive model retraining before production performance degrades significantly.
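As an illustration of the drift measure, the Population Stability Index can be computed from binned counts as the sum over bins of (actual share minus expected share) times the natural log of their ratio. The sketch below assumes both distributions have already been binned with the same edges; the function name and the `eps` floor for empty bins are choices made for this example.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index over pre-binned counts (identical bins)."""
    if len(expected_counts) != len(actual_counts):
        raise ValueError("both distributions must use the same bins")
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        # eps guards against log(0) and division by zero for empty bins.
        e_pct = max(e / e_total, eps)
        a_pct = max(a / a_total, eps)
        total += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return total
```

Here `expected_counts` would come from the training-time distribution and `actual_counts` from a recent serving window; identical distributions give a PSI of zero, and larger values indicate a larger shift.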


5. Architecture Diagram

ARCHITECTURE DIAGRAM
flowchart TD
    subgraph Sources["Data Sources"]
        A[Operational Databases]
        B[Event Streams]
    end
    subgraph Pipelines["Feature Pipelines"]
        C[Batch Pipeline]
        D[Streaming Pipeline]
    end
    subgraph Store["Feature Store"]
        E[(Online Store Redis)]
        F[(Offline Store Point-in-Time)]
        G[Feature Registry]
    end
    subgraph Consumers["Consumers"]
        H[Real-Time Inference]
        I[Model Training]
    end
    A --> C
    B --> D
    C --> E
    C --> F
    D --> E
    G --> C
    G --> D
    E --> H
    F --> I
    style A fill:#dbeafe,stroke:#3b82f6
    style B fill:#dbeafe,stroke:#3b82f6
    style C fill:#f0fdf4,stroke:#22c55e
    style D fill:#f0fdf4,stroke:#22c55e
    style E fill:#fef9c3,stroke:#eab308
    style F fill:#fef9c3,stroke:#eab308
    style G fill:#fef9c3,stroke:#eab308
    style H fill:#d1fae5,stroke:#10b981
    style I fill:#d1fae5,stroke:#10b981

6. Components

| Component | Type | Responsibility | Technology Options | Criticality |
|---|---|---|---|---|
| Online Store | Infrastructure | Sub-10ms feature lookup by entity key | Redis, DynamoDB, Bigtable, Cassandra | Critical |
| Offline Store | Infrastructure | Point-in-time correct historical feature retrieval | BigQuery, Redshift, Snowflake, Hive, Parquet on S3 | Critical |
| Feature Registry | Service | Metadata catalogue for all features | Feast (open source), Tecton, Hopsworks, custom DB | High |
| Batch Feature Pipeline | Service | Compute and materialise batch features | Apache Spark, dbt + Airflow, dbt Cloud | Critical |
| Streaming Feature Pipeline | Service | Compute and materialise near-real-time features | Apache Flink, Spark Structured Streaming, Kafka Streams | High |
| Feature Materialisation Orchestrator | Service | Schedule and coordinate pipeline execution | Apache Airflow, Prefect, Dagster | High |
| Feature Server | Service | Assemble multi-feature vectors for inference requests | Feast Feature Server, Tecton Online Serving, custom FastAPI | Critical |
| Point-in-Time Join Engine | Service | Generate point-in-time correct training datasets | Feast point-in-time join, custom SQL | High |
| Feature Monitor | Service | Track distribution, drift, freshness, null rate | Evidently AI, WhyLogs, Great Expectations, custom | High |
| Feature Discovery UI | Service | Search and explore feature registry | Feast UI, Tecton portal, DataHub, custom | Medium |

7. Data Flow

Primary Flow — Real-Time Inference with Feature Store

| Step | Actor | Action | Output |
|---|---|---|---|
| 1 | Model API | Receive inference request with entity IDs (customer_id: 12345, product_id: P789) | Entity IDs extracted |
| 2 | Feature Server | Look up required features from feature registry for this model version | Required feature list: [customer_30d_spend, customer_risk_score, product_view_count_7d] |
| 3 | Feature Server | Batch lookup: MGET customer:12345:spend_features, customer:12345:risk_features, product:P789:engagement | Feature values retrieved from Redis in <5ms |
| 4 | Feature Server | Combine pre-computed features with request-time context (e.g., current timestamp, request channel) | Complete feature vector assembled |
| 5 | Model Inference | Pass feature vector to model; receive prediction | Prediction |
| 6 | Feature Monitor | Log feature values and prediction for drift monitoring | Monitoring record |

Error Flow

| Error | Detection | Response |
|---|---|---|
| Feature missing from online store (entity not materialised) | Redis miss | Return feature default value or null; log missing feature; alert if rate >1% |
| Stale feature (pipeline hasn't run) | Freshness monitor | Log staleness; serve stale value with staleness metadata; alert pipeline operator |
| Online store unavailable | Feature server health check | Serve null features or use fallback model without feature enrichment; alert |
| Feature schema mismatch (pipeline produced wrong type) | Feature monitor type check | Reject feature batch write; alert pipeline owner; serve last-known-good value |
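The missing-feature and stale-feature responses in the error flow can be sketched as a single lookup function. This is an illustrative sketch only: the store is a dictionary mapping a key to a (value, updated_at) pair, and the function name, status strings, and metadata shape are assumptions made for the example. Logging and alerting are left to the caller.

```python
def serve_feature(store, key, default, now, max_age_seconds):
    """Return (value, metadata) for one feature lookup.

    `store` maps key -> (value, updated_at), with times in epoch seconds.
    """
    record = store.get(key)
    if record is None:
        # Missing: serve the configured default and flag it for logging.
        return default, {"status": "missing"}
    value, updated_at = record
    age = now - updated_at
    if age > max_age_seconds:
        # Stale: still serve the value, but attach staleness metadata.
        return value, {"status": "stale", "age_seconds": age}
    return value, {"status": "ok"}
```

Returning metadata alongside the value lets the caller decide whether to alert, fall back to a simpler model, or proceed, without the lookup itself raising.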

8. Security Considerations

  • Feature data may contain derived personal information (spending patterns, risk scores, health indicators); access to the online store must be restricted to authorised model serving infrastructure
  • The offline store contains historical PII-derived features; access requires the same data classification controls as the source data
  • Entity keys in the online store must not leak information about underlying entities; compound keys should use opaque IDs (UUIDs), not readable identifiers

OWASP LLM Controls

| OWASP LLM Risk | Feature Store Control |
|---|---|
| LLM03 Training Data Poisoning | Feature registry enforces approved computation logic; point-in-time joins prevent future-data contamination |
| LLM09 Overreliance | Feature monitoring detects when input data quality degrades, which would degrade model predictions |

9. Governance Considerations

Data Governance

  • Every feature must have a registered owner responsible for pipeline health and data quality
  • Features derived from personal information must document the legal basis and retention policy in the feature registry
  • Deprecated features must be retained in the registry with deprecation date and migration guidance; never silently deleted

Model Risk

  • Point-in-time join methodology must be validated and documented as part of the model development process; incorrect point-in-time logic is a model risk event
  • Feature drift alerts must be routed to the model owner, not just the platform team; the model owner is accountable for model performance

Governance Artefacts

| Artefact | Owner | Cadence | Location |
|---|---|---|---|
| Feature registry | Feature Owner + Data Team | Continuous | Feature registry service |
| Feature lineage documentation | Data Engineering | Per feature | Feature registry |
| Feature monitoring thresholds | Feature Owner | Quarterly review | Monitoring configuration |
| Privacy impact for PII-derived features | Privacy Team | Per feature with PII | Privacy register |
| Feature drift incident log | Model Owner | Per incident | Incident management |

10. Operational Considerations

Monitoring

| Signal | Source | Alert Threshold | Owner |
|---|---|---|---|
| Online store cache miss rate | Feature server metrics | >5% miss (entities not materialised) | Feature Owner |
| Feature pipeline SLA miss | Pipeline orchestrator | Any pipeline overdue by >2× schedule interval | Feature Owner + Data Eng |
| Feature distribution drift (PSI) | Feature monitor | PSI > 0.2 (significant drift) | Model Owner |
| Online store P99 latency | Feature server metrics | >20ms P99 | Platform On-Call |

SLOs

| SLO | Target | Window |
|---|---|---|
| Online feature retrieval P99 latency | <10ms | Rolling 7 days |
| Feature freshness (batch features) | <2× schedule interval | Per feature |
| Feature pipeline success rate | >99.5% | Rolling 30 days |
| Online store availability | 99.9% | Rolling 30 days |
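To make the latency SLO concrete, the sketch below computes a P99 from raw latency samples using the nearest-rank method and compares it with the 10ms target. The function names are hypothetical, and nearest-rank is only one of several percentile definitions; a metrics backend may use a different one.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def meets_latency_slo(samples_ms, target_ms=10.0):
    """True when the P99 of the samples is below the target."""
    return percentile_nearest_rank(samples_ms, 99) < target_ms
```

With 100 samples, a single slow request leaves the P99 unaffected, while two slow requests push it over the target, which is why P99 is more forgiving than a maximum and stricter than a mean.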

Disaster Recovery

| Component | RPO | RTO | Strategy |
|---|---|---|---|
| Online store (Redis) | 1 hour | 5 min | Redis Sentinel + persistence; rebuild from offline store |
| Offline store | <1 hour | 30 min | Data warehouse replication |
| Feature pipelines | N/A (stateless) | 15 min | Redeploy from IaC; re-run pipeline to catch up |

11. Cost Considerations

Cost Drivers

| Driver | Description | Relative Weight |
|---|---|---|
| Online store (Redis) memory | Proportional to entity count × feature vector size | Medium-High |
| Batch computation (Spark) | Proportional to data volume and feature count | Medium |
| Offline store (data warehouse) | Storage + query compute for training data generation | Medium |
| Streaming computation (Flink) | Always-on for streaming features | Medium |
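The online store memory driver is simple arithmetic, sketched below. The 8 bytes per value and the per-entity overhead parameter are assumptions for illustration; real key-value stores add per-key overhead that depends on the encoding and data structure used, so measured usage will be higher than the raw payload.

```python
def online_store_bytes(entity_count, features_per_entity,
                       bytes_per_value=8, overhead_per_entity=0):
    """Raw payload estimate: entity count x feature vector size (+ overhead)."""
    vector_bytes = features_per_entity * bytes_per_value
    return entity_count * (vector_bytes + overhead_per_entity)
```

For example, 1M entities with 10 numeric features at 8 bytes each come to 1,000,000 × 10 × 8 = 80,000,000 bytes (80 MB) of raw payload before any store overhead.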

Indicative Cost Range

| Scale | Monthly Feature Store Infra Cost |
|---|---|
| Small (1M entities, 10 features) | $500–$2,000 |
| Medium (100M entities, 50 features) | $5,000–$20,000 |
| Large (1B+ entities, 200+ features) | $30,000–$100,000+ |

12. Trade-Off Analysis

Feature Store Architecture Options

| Option | Description | Pros | Cons | Best For |
|---|---|---|---|---|
| Open Source (Feast) | Self-managed Feast with Redis + data warehouse | Full control; no vendor lock-in; community support | High operational overhead; less out-of-box tooling | Strong engineering team; cloud-agnostic |
| Managed (Tecton, Hopsworks) | SaaS feature store with managed pipelines | Low ops overhead; strong tooling | Vendor lock-in; cost at scale | Organisations prioritising velocity over cost optimisation |
| Cloud-Native (Vertex AI Feature Store, AWS SageMaker Feature Store) | Cloud provider native | Deep integration with cloud ML stack | Tied to cloud provider; variable feature richness | Orgs committed to single cloud |

Online Store Technology Options

| Option | Latency | Cost | Scalability | Best For |
|---|---|---|---|---|
| Redis | <1ms | Medium | High (cluster) | Most deployments; canonical choice |
| DynamoDB | 1–5ms | Variable (high at scale) | Very High | AWS-native; serverless operations |
| Bigtable | 1–5ms | High | Extremely High | Google Cloud; very large entity counts |

Architectural Tensions

| Tension | Option A | Option B | Resolution |
|---|---|---|---|
| Feature freshness vs. computation cost | Streaming (fresh) | Batch (cheap) | Feature-level decision based on staleness tolerance; most features are batch |
| Centralised feature store vs. team-owned features | Platform team owns all features | Teams own their features in shared store | Teams own features in shared store with platform managing infrastructure |
| Online store size vs. cost | Store all features for all entities | Store only high-usage features | Tiered: hot features in Redis; warm features in DynamoDB; cold in offline only |
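The tiered resolution for online store size can be sketched as a lookup that tries the hot tier before the warm tier. Both tiers are dictionaries here, and the function name and tier labels are hypothetical; in the arrangement described above the hot tier would be Redis and the warm tier DynamoDB, with cold features available only offline.

```python
def tiered_lookup(key, hot, warm):
    """Return (value, tier); tier is 'hot', 'warm', or 'miss'."""
    if key in hot:
        return hot[key], "hot"
    if key in warm:
        return warm[key], "warm"
    # Cold features are not served online at all in this scheme.
    return None, "miss"
```

Returning the tier alongside the value makes it straightforward to track how often requests fall through to the slower store, which informs which features belong in the hot tier.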

13. Failure Modes

| Failure | Likelihood | Impact | Detection | Recovery |
|---|---|---|---|---|
| Online store memory exhaustion (Redis OOM) | Medium | High: feature serving fails | Redis memory metrics | LRU eviction; increase Redis memory; audit feature set for unused features |
| Batch pipeline failure (features stale) | Medium | High: model consuming stale features | Pipeline SLA monitor; freshness alert | Re-run pipeline; serve stale with staleness flag; alert model owner |
| Training-serving skew (wrong PIT logic) | Low | Critical: model production performance << offline eval | Production vs offline metric gap | Audit PIT join logic; retrain with corrected data; model risk event |
| Feature leakage (future data in training) | Low | Critical: optimistic backtests; poor production performance | PIT join timestamp validation | Audit all PIT joins; retrain affected models |
| Feature drift undetected | Medium | High: gradual model degradation | Production metric monitoring | Improve drift monitoring coverage; lower alert thresholds |
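The PIT join timestamp validation named as the detection for feature leakage can be as simple as the check below. It is a sketch under assumptions: each training row is a dictionary carrying a `feature_ts` and a `label_ts`, and both field names are hypothetical.

```python
def find_leaky_rows(rows):
    """Return rows whose feature timestamp is not strictly before the label's."""
    return [r for r in rows if r["feature_ts"] >= r["label_ts"]]
```

Running this over a generated training set before each training job, and failing the job when the result is non-empty, turns a silent leakage problem into an explicit pipeline error.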

14. Regulatory Considerations

EU AI Act Article 10 (Data Governance)

  • Feature computation logic must be documented (in feature registry) as part of the training data governance requirements for high-risk AI systems
  • Point-in-time join methodology must be documented to demonstrate absence of data leakage in training data

Privacy Act / GDPR

  • PII-derived features (spending patterns, health indicators) must have a documented legal basis in the feature registry
  • Data subject deletion requests must propagate to the online store (delete entity's feature values) and be documented in the offline store (mark as deleted rather than hard delete, to preserve training data integrity)
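The deletion propagation described in the last bullet can be sketched as follows. This is an illustration only: the online store is a dictionary keyed by (entity_id, feature_set), offline records are dictionaries with an optional `deleted_at` field, and all names are hypothetical. Whether marking rather than erasing offline records satisfies a given legal obligation is a question for the privacy team, not something this sketch settles.

```python
def handle_deletion_request(entity_id, online_store, offline_rows, deleted_at):
    """Delete the entity online; mark its offline records as deleted.

    Returns (online_keys_removed, offline_rows_marked).
    """
    # Collect keys first so the dict is not mutated while iterating.
    to_remove = [key for key in online_store if key[0] == entity_id]
    for key in to_remove:
        del online_store[key]

    marked = 0
    for row in offline_rows:
        if row["entity_id"] == entity_id and row.get("deleted_at") is None:
            row["deleted_at"] = deleted_at  # soft delete keeps history intact
            marked += 1
    return len(to_remove), marked
```

Training data generation would then need to filter out rows where `deleted_at` is set, otherwise the marker has no practical effect.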

NIST AI RMF MAP 3.5

  • Feature monitoring and drift detection implement MAP 3.5's requirement for ongoing monitoring of AI system inputs

15. Reference Implementations

AWS

| Component | AWS Service |
|---|---|
| Online store | Amazon ElastiCache for Redis or DynamoDB |
| Offline store | Amazon Redshift or S3 Parquet |
| Feature registry | Amazon SageMaker Feature Store (metadata) |
| Batch pipeline | AWS Glue / EMR (Spark) |
| Streaming pipeline | Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics) |
| Orchestration | Amazon MWAA (Managed Airflow) |

GCP

| Component | GCP Service |
|---|---|
| Online store | Vertex AI Feature Store (Online) or Memorystore |
| Offline store | Vertex AI Feature Store (Offline) or BigQuery |
| Batch pipeline | Dataflow or BigQuery ML |
| Streaming pipeline | Dataflow (Apache Beam) |

On-Premises / Open Source

| Component | Technology |
|---|---|
| Feature store framework | Feast (open source) |
| Online store | Redis Enterprise |
| Offline store | Apache Hive or Delta Lake on MinIO |
| Batch pipeline | Apache Spark + Apache Airflow |
| Streaming pipeline | Apache Flink |

16. Related Patterns

| Pattern ID | Name | Relationship |
|---|---|---|
| EAAPL-PLT001 | Enterprise AI Platform | Parent: feature store is a platform ML infrastructure component |
| EAAPL-PLT008 | AI Experiment Tracking | Complementary: training datasets generated via feature store feed experiment tracking |
| EAAPL-INT004 | Real-Time AI Stream Processing | Integration: streaming feature pipelines share infrastructure with real-time inference |
| EAAPL-INT005 | Batch AI Processing | Integration: batch feature pipelines share scheduling infrastructure |

17. Maturity Assessment

Overall Maturity: Proven

Feature stores are production-proven at major technology and financial services companies. Open-source tooling (Feast) and managed services (Tecton, SageMaker Feature Store) are both mature. Point-in-time joins are well-understood. Feature monitoring is less standardised.

Scoring Matrix

| Dimension | Score (1–5) | Rationale |
|---|---|---|
| Pattern Completeness | 5 | All sections documented |
| Implementation Evidence | 5 | Deployed at Netflix, Uber, LinkedIn, major banks at scale |
| Tooling Maturity | 4 | Feast/Tecton/SageMaker mature; feature monitoring less so |
| Regulatory Alignment | 4 | EU AI Act Article 10 mapping; privacy patterns documented |
| Operational Complexity | High | Requires data engineering expertise; streaming pipelines operationally demanding |

18. Revision History

| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2024-09-01 | EAAPL Working Group | Initial publication |
| 1.1 | 2025-06-12 | EAAPL Working Group | Feature monitoring section expanded; Privacy Act data deletion patterns added; Vertex AI Feature Store reference updated |