Feature Stores and Training-Serving Consistency
Most production ML regressions are not caused by model architecture. They are caused by feature mismatch: training saw one definition, serving used another.
Feature stores exist to solve this systematically.
What a Feature Store Should Solve
A feature platform should provide:
- shared feature definitions
- point-in-time correct training datasets
- low-latency online feature retrieval
- lineage and ownership metadata
- quality/freshness monitoring
If it only stores feature tables but does not enforce contracts, it is not solving the core problem.
The Real Problem: Training-Serving Skew
Training-serving skew appears when:
- code paths differ between offline and online transforms
- timestamp semantics are inconsistent
- categorical encoding dictionaries diverge
- missing-value handling differs
Symptoms:
- strong offline metrics
- weak or unstable production behavior
Skew is a systems issue, not a model-tuning issue.
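The divergent-code-path failure mode above can be shown in a few lines. This is an illustrative sketch with a hypothetical feature (`days_since_last_order`): the offline and online paths impute missing values differently, so the model sees different inputs for the same user at the same moment.

```python
def offline_transform(days_since_last_order):
    # Offline pipeline: missing history imputed with the training-set median.
    return days_since_last_order if days_since_last_order is not None else 14.0

def online_transform(days_since_last_order):
    # Online service: missing history silently mapped to 0 ("brand new user").
    return days_since_last_order if days_since_last_order is not None else 0.0

# Same user, same request, different feature value -> training-serving skew.
assert offline_transform(None) != online_transform(None)
```

No amount of retraining fixes this; only unifying the transform definition does.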
Offline vs Online Feature Planes
Offline Store
Used for:
- training datasets
- backfills
- large scans
Optimized for throughput and historical correctness.
Online Store
Used for:
- request-time inference
- low-latency keyed lookups
Optimized for availability and latency.
Both planes must use the same feature definitions.
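One way to enforce shared definitions is to register each transform once and have both planes resolve it from the same place. A minimal sketch, assuming a hypothetical in-process registry (real platforms persist this in a catalog):

```python
FEATURE_REGISTRY = {}

def feature(name):
    # Decorator that records a transform under a canonical feature name.
    def register(fn):
        FEATURE_REGISTRY[name] = fn
        return fn
    return register

@feature("order_count_7d")
def order_count_7d(order_ages_days):
    # Count of orders in the trailing 7-day window (ages in days).
    return sum(1 for age in order_ages_days if age < 7)

# Offline backfill and online lookup both resolve the same definition,
# so they cannot drift apart.
batch_value = FEATURE_REGISTRY["order_count_7d"]([1, 3, 10])
online_value = FEATURE_REGISTRY["order_count_7d"]([1, 3, 10])
assert batch_value == online_value == 2
```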
Point-in-Time Correctness
This is the most critical concept.
A training row for event time t may only include feature values available at or before t.
Without this rule, future information leaks into training and inflates evaluation.
Point-in-time joins are non-negotiable for trustworthy model performance.
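The rule can be implemented directly: for each label row at event time t, take the latest feature value with timestamp at or before t. A pure-Python sketch (no feature-store library assumed), using small tuples in place of real tables:

```python
def point_in_time_join(label_rows, feature_rows):
    # label_rows: list of (entity, event_time)
    # feature_rows: list of (entity, ts, value), sorted by ts ascending.
    joined = []
    for entity, t in label_rows:
        # Only values observable at or before t are eligible.
        candidates = [v for (e, ts, v) in feature_rows if e == entity and ts <= t]
        joined.append((entity, t, candidates[-1] if candidates else None))
    return joined

features = [("u1", 1, 0.2), ("u1", 5, 0.9)]  # 0.9 only exists from t=5 onward
labels = [("u1", 3), ("u1", 6)]

# At t=3 only the t=1 value is visible; joining 0.9 there would be leakage.
assert point_in_time_join(labels, features) == [("u1", 3, 0.2), ("u1", 6, 0.9)]
```

Production systems do this at scale with as-of joins, but the correctness condition is exactly the one shown.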
Feature Definition Contract
Each production feature should include:
- semantic definition
- entity keys
- timestamp semantics
- transformation logic reference
- owner and SLA
- allowed null/default behavior
Think of features as APIs. Undocumented features create silent compatibility failures.
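One way to make the contract above machine-checkable is to express it as a typed record. A sketch using a hypothetical FeatureContract dataclass; the field values are illustrative:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class FeatureContract:
    name: str
    description: str              # semantic definition
    entity_keys: Tuple[str, ...]  # e.g. ("user_id",)
    timestamp_field: str          # which timestamp governs point-in-time joins
    transform_ref: str            # pointer to the canonical transformation code
    owner: str                    # paged on quality incidents
    freshness_sla_minutes: int
    default_on_null: Optional[float] = None  # allowed null/default behavior

sessions_7d = FeatureContract(
    name="sessions_7d",
    description="Distinct sessions in trailing 7 days, UTC day boundaries",
    entity_keys=("user_id",),
    timestamp_field="event_ts_utc",
    transform_ref="features/engagement.py::sessions_7d",
    owner="growth-ml@example.com",
    freshness_sla_minutes=90,
    default_on_null=0.0,
)
```

With contracts as data, registration can reject features that omit an owner, an SLA, or timestamp semantics.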
Feature Quality Monitoring
Monitor feature health continuously:
- null/empty rates
- range violations
- distribution drift
- freshness lag
- online lookup miss rates
Feature quality incidents should page owners before model quality incidents escalate.
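Two of the checks above (null rate and freshness lag) reduce to simple threshold functions. A minimal sketch; the thresholds are illustrative defaults, not recommendations:

```python
def null_rate(values):
    # Fraction of missing values in a recent sample of the feature.
    return sum(v is None for v in values) / len(values)

def freshness_lag_minutes(now_ts, last_update_ts):
    # Seconds since the last successful materialization, in minutes.
    return (now_ts - last_update_ts) / 60

def feature_healthy(values, now_ts, last_update_ts,
                    max_null_rate=0.05, max_lag_minutes=60):
    return (null_rate(values) <= max_null_rate
            and freshness_lag_minutes(now_ts, last_update_ts) <= max_lag_minutes)

values = [1.0, 2.0, None, 3.0]  # 25% nulls -> breaches the 5% threshold
assert not feature_healthy(values, now_ts=7200, last_update_ts=3600)
```

Distribution drift needs a statistical test rather than a threshold, but it plugs into the same health gate.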
Materialization Patterns
Common strategies:
- batch materialization for slow-moving aggregates
- streaming updates for near-real-time signals
- hybrid approach for mixed latency requirements
Design for graceful degradation when a feature source is delayed.
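Graceful degradation can be as simple as checking staleness at lookup time and falling back to the contract's declared default instead of serving stale data. A sketch with a plain dict standing in for the online store; all names are illustrative:

```python
def get_online_feature(store, key, now_ts, sla_seconds, default):
    # Return (value, status); fall back to `default` when the record is
    # missing or older than its freshness SLA.
    record = store.get(key)
    if record is None or now_ts - record["ts"] > sla_seconds:
        return default, "degraded"
    return record["value"], "fresh"

store = {"u1": {"value": 4, "ts": 100}}
assert get_online_feature(store, "u1", now_ts=150, sla_seconds=300, default=0) == (4, "fresh")
assert get_online_feature(store, "u1", now_ts=900, sla_seconds=300, default=0) == (0, "degraded")
```

Surfacing the "degraded" status in telemetry also feeds the freshness monitoring described above.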
Governance at Scale
As feature count grows, governance matters more. Needed controls:
- naming conventions
- discovery catalog
- deprecation lifecycle
- access controls for sensitive attributes
- usage telemetry (to remove unused features)
Ungoverned feature growth becomes platform debt.
Example Failure Scenario
A churn model is trained on sessions_7d computed nightly over UTC day boundaries.
The serving pipeline computes the same metric over local-timezone day boundaries and drops late-arriving events.
Result:
- score drift
- threshold misbehavior
- retention campaign misallocation
Root cause is feature contract mismatch, not model retraining frequency.
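The timezone half of this mismatch is easy to demonstrate: a trailing 7-day window anchored at UTC midnight versus one anchored at local (UTC-8) midnight can count a different number of sessions for the same user at the same instant. A small sketch with synthetic timestamps:

```python
from datetime import datetime, timedelta, timezone

def sessions_7d(events, now, tz):
    # Trailing 7-day window anchored at the most recent midnight in `tz`.
    local_midnight = now.astimezone(tz).replace(hour=0, minute=0,
                                                second=0, microsecond=0)
    window_start = local_midnight - timedelta(days=7)
    return sum(1 for e in events if e >= window_start)

events = [datetime(2024, 2, 29, 20, 0, tzinfo=timezone.utc)]
now = datetime(2024, 3, 8, 1, 0, tzinfo=timezone.utc)

utc = timezone.utc
local = timezone(timedelta(hours=-8))

# The event falls outside the UTC-anchored window but inside the
# local-anchored one: same user, same instant, different feature value.
assert sessions_7d(events, now, utc) != sessions_7d(events, now, local)
```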
Common Mistakes
- duplicating transformation logic across teams
- no point-in-time join guarantees
- missing owner/SLA for critical features
- no freshness and drift alerts
- no versioning of feature definitions
Adoption Strategy
- centralize the most critical features first
- enforce definition and ownership metadata
- add point-in-time dataset generation tooling
- integrate online serving parity checks
- scale governance with catalog + policy automation
Start with high-value features, not full migration of everything.
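The parity-check step above can start as a scheduled job that samples entities, fetches the same feature from both planes, and reports divergence. A sketch with plain dicts standing in for the offline and online store clients; the names are hypothetical:

```python
def parity_mismatches(offline, online, keys, tolerance=1e-6):
    # Return the sampled keys whose offline and online values disagree
    # (or are missing from either plane).
    mismatches = []
    for key in keys:
        off, on = offline.get(key), online.get(key)
        if off is None or on is None or abs(off - on) > tolerance:
            mismatches.append(key)
    return mismatches

offline = {"u1": 3.0, "u2": 5.0}
online = {"u1": 3.0, "u2": 4.0}  # u2 diverged between planes
assert parity_mismatches(offline, online, ["u1", "u2"]) == ["u2"]
```

A nonzero mismatch rate is an early skew signal, caught before it shows up as unexplained production metric decay.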
Key Takeaways
- Feature stores are reliability infrastructure for ML systems.
- Point-in-time correctness is the cornerstone of valid training data.
- Training-serving consistency requires shared contracts, not just shared storage.
- Governance, monitoring, and ownership are essential for long-term platform health.