Feature Stores and Training-Serving Consistency

Most production ML regressions are not caused by model architecture. They are caused by feature mismatch: training saw one definition, serving used another.

Feature stores exist to solve this systematically.


What a Feature Store Should Solve

A feature platform should provide:

  • shared feature definitions
  • point-in-time correct training datasets
  • low-latency online feature retrieval
  • lineage and ownership metadata
  • quality/freshness monitoring

If it only stores feature tables but does not enforce contracts, it is not solving the core problem.


The Real Problem: Training-Serving Skew

Training-serving skew appears when:

  • code paths differ between offline and online transforms
  • timestamp semantics are inconsistent
  • categorical encoding dictionaries diverge
  • missing-value handling differs

Symptoms:

  • strong offline metrics
  • weak or unstable production behavior

Skew is a systems issue, not a model-tuning issue.
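
A cheap way to surface skew early is a parity check: run a sample of entities through both the offline transform and the online path and compare results. A minimal sketch in Python, where the two transform functions are hypothetical stand-ins for real pipeline code:

    # Parity check sketch. offline_sessions_7d / online_sessions_7d are hypothetical
    # stand-ins for the real offline and online transform code paths.

    def offline_sessions_7d(events):
        # Offline path: a missing event list counts as zero sessions.
        return len(events) if events is not None else 0

    def online_sessions_7d(events):
        # Online path: a missing event list falls back to a sentinel -- a divergence.
        return len(events) if events is not None else -1

    def check_parity(entities_to_events, tolerance=0.0):
        mismatches = []
        for entity_id, events in entities_to_events.items():
            off = offline_sessions_7d(events)
            on = online_sessions_7d(events)
            if abs(off - on) > tolerance:
                mismatches.append((entity_id, off, on))
        return mismatches

    sample = {"user_1": [1, 2, 3], "user_2": None}   # user_2 has no events
    print(check_parity(sample))                      # [('user_2', 0, -1)]

Running such a check on a daily sample of live traffic catches divergence in missing-value handling or encoding long before offline metrics would.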


Offline vs Online Feature Planes

Offline Store

Used for:

  • training datasets
  • backfills
  • large scans

Optimized for throughput and historical correctness.

Online Store

Used for:

  • request-time inference
  • low-latency keyed lookups

Optimized for availability and latency.

Both planes must use the same feature definitions.
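
One practical way to enforce that is to define each transformation once and call it from both planes. A minimal sketch, assuming hypothetical row and key layouts:

    from datetime import datetime, timedelta, timezone

    # One shared transform, imported by both the offline backfill job and the online service.
    def avg_order_value_30d(orders, as_of):
        """Mean order amount over the 30 days ending at `as_of`; 0.0 with no orders."""
        cutoff = as_of - timedelta(days=30)
        window = [o["amount"] for o in orders if cutoff < o["ts"] <= as_of]
        return sum(window) / len(window) if window else 0.0

    # Offline plane: historical values evaluated at each training row's event time.
    def backfill(training_rows, orders_by_user):
        return [
            {**row, "avg_order_value_30d": avg_order_value_30d(
                orders_by_user.get(row["user_id"], []), row["event_time"])}
            for row in training_rows
        ]

    # Online plane: the same function evaluated at request time
    # (or read from a table the backfill job materialized).
    def serve(user_id, orders_by_user):
        return avg_order_value_30d(orders_by_user.get(user_id, []), datetime.now(timezone.utc))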


Point-in-Time Correctness

This is the most critical concept. A training row for event time t may only include feature values available at or before t.

Without this rule, future information leaks into training and inflates evaluation.

Point-in-time joins are non-negotiable for trustworthy model performance.
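
As a concrete illustration, pandas' merge_asof performs a backward as-of join that only picks up feature values stamped at or before each training event. The column names here are illustrative:

    import pandas as pd

    # Label events: one row per (entity, event_time) we want to train on.
    labels = pd.DataFrame({
        "user_id": [1, 2, 1],
        "event_time": pd.to_datetime(["2024-03-01", "2024-03-05", "2024-03-10"]),
        "churned": [0, 0, 1],
    })

    # Feature values stamped with the time they became available.
    features = pd.DataFrame({
        "user_id": [1, 2, 1],
        "feature_time": pd.to_datetime(["2024-02-25", "2024-03-06", "2024-03-08"]),
        "sessions_7d": [12, 9, 3],
    })

    # Point-in-time join: for each label row, take the latest feature value
    # whose timestamp is <= event_time, so later values never leak into training.
    train = pd.merge_asof(
        labels.sort_values("event_time"),
        features.sort_values("feature_time"),
        left_on="event_time",
        right_on="feature_time",
        by="user_id",
        direction="backward",
    )
    print(train[["user_id", "event_time", "sessions_7d", "churned"]])
    # user 2's 2024-03-05 row gets NaN: its only feature value arrives on 03-06.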


Feature Definition Contract

Each production feature should include:

  • semantic definition
  • entity keys
  • timestamp semantics
  • transformation logic reference
  • owner and SLA
  • allowed null/default behavior

Think of features as APIs. Undocumented features create silent compatibility failures.
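
One lightweight way to make the contract explicit is a schema object that every registered feature must populate. The fields below mirror the list above; the names are illustrative rather than any particular feature-store API:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class FeatureContract:
        name: str                      # e.g. "user.sessions_7d"
        description: str               # semantic definition in plain language
        entity_keys: tuple             # e.g. ("user_id",)
        event_timestamp_column: str    # timestamp semantics: when the value became true
        transform_ref: str             # pointer to the transformation code and version
        owner: str                     # team or alias that gets paged
        freshness_sla_minutes: int     # maximum acceptable staleness online
        default_on_missing: object     # allowed null/default behavior
        version: int = 1

    sessions_7d = FeatureContract(
        name="user.sessions_7d",
        description="Distinct sessions in the trailing 7 days, UTC day boundaries.",
        entity_keys=("user_id",),
        event_timestamp_column="event_time",
        transform_ref="features/user_activity.py::sessions_7d",
        owner="growth-ml-oncall",
        freshness_sla_minutes=120,
        default_on_missing=0,
    )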


Feature Quality Monitoring

Monitor feature health continuously:

  • null/empty rates
  • range violations
  • distribution drift
  • freshness lag
  • online lookup miss rates

Feature quality incidents should page owners before model quality incidents escalate.
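
A minimal batch health check over one feature might compute signals like these; the thresholds are placeholders to tune per feature:

    from datetime import datetime, timezone

    def feature_health(values, last_updated, min_val, max_val,
                       max_null_rate=0.01, max_staleness_minutes=120):
        """Simple health signals: null rate, range violations, freshness lag.
        `values` is a list of feature values (None for nulls); `last_updated`
        is a timezone-aware datetime of the latest materialization."""
        now = datetime.now(timezone.utc)
        total = len(values)
        nulls = sum(1 for v in values if v is None)
        non_null = [v for v in values if v is not None]
        out_of_range = sum(1 for v in non_null if not (min_val <= v <= max_val))
        staleness_minutes = (now - last_updated).total_seconds() / 60
        return {
            "null_rate_ok": nulls / total <= max_null_rate,
            "range_ok": out_of_range == 0,
            "fresh": staleness_minutes <= max_staleness_minutes,
            "null_rate": nulls / total,
            "staleness_minutes": staleness_minutes,
        }

Distribution drift and online lookup miss rates need reference windows and serving logs, but the pattern is the same: compute the signal, compare it to a per-feature threshold, and page the owner on breach.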


Materialization Patterns

Common strategies:

  • batch materialization for slow-moving aggregates
  • streaming updates for near-real-time signals
  • hybrid approach for mixed latency requirements

Design for graceful degradation when a feature source is delayed.
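
Degradation can be as simple as checking freshness on read and falling back to the feature's declared default when a source is delayed. A sketch, assuming the online store returns each value with its materialization timestamp:

    from datetime import datetime, timedelta, timezone

    def read_with_fallback(store, key, default, max_staleness=timedelta(hours=2)):
        """Return (value, status); fall back to `default` when the value is missing or stale."""
        record = store.get(key)  # hypothetical keyed lookup: {"value": ..., "as_of": datetime}
        now = datetime.now(timezone.utc)
        if record is None or now - record["as_of"] > max_staleness:
            return default, "fallback"
        return record["value"], "fresh"

    # Toy in-memory "online store" keyed by user_id.
    online_store = {
        "user_1": {"value": 12, "as_of": datetime.now(timezone.utc) - timedelta(minutes=30)},
        "user_2": {"value": 9,  "as_of": datetime.now(timezone.utc) - timedelta(hours=6)},
    }
    print(read_with_fallback(online_store, "user_1", default=0))  # (12, 'fresh')
    print(read_with_fallback(online_store, "user_2", default=0))  # (0, 'fallback')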


Governance at Scale

As feature count grows, governance matters more. Needed controls:

  • naming conventions
  • discovery catalog
  • deprecation lifecycle
  • access controls for sensitive attributes
  • usage telemetry (to remove unused features)

Ungoverned feature growth becomes platform debt.
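
Naming conventions, at least, are cheap to enforce mechanically at registration time. The pattern below (domain.name_window) is just one illustrative convention:

    import re

    # Illustrative convention: <domain>.<name>_<window>, e.g. "user.sessions_7d".
    FEATURE_NAME = re.compile(r"^[a-z]+\.[a-z_]+_(1d|7d|30d|lifetime)$")

    def validate_feature_name(name: str) -> bool:
        return bool(FEATURE_NAME.match(name))

    assert validate_feature_name("user.sessions_7d")
    assert not validate_feature_name("Sessions7Days")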


Example Failure Scenario

A churn model is trained on sessions_7d computed nightly in UTC. The serving pipeline computes the same metric in the local timezone and excludes late-arriving events.

Result:

  • score drift
  • threshold misbehavior
  • retention campaign misallocation

Root cause is feature contract mismatch, not model retraining frequency.
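
The mismatch is easy to reproduce: the same event yields different sessions_7d values when the window is anchored to UTC midnight versus local midnight. A toy sketch:

    from datetime import datetime, timedelta, timezone

    def sessions_7d(events, as_of):
        cutoff = as_of - timedelta(days=7)
        return sum(1 for e in events if cutoff < e <= as_of)

    # A single session at 02:00 UTC on March 2.
    events = [datetime(2024, 3, 2, 2, 0, tzinfo=timezone.utc)]

    # Training job: nightly snapshot anchored at UTC midnight on March 9.
    utc_anchor = datetime(2024, 3, 9, tzinfo=timezone.utc)

    # Serving path: window anchored at local midnight (UTC-5), i.e. 05:00 UTC.
    local_anchor = datetime(2024, 3, 9, tzinfo=timezone(timedelta(hours=-5)))

    print(sessions_7d(events, utc_anchor))    # 1 -> the event is inside the UTC window
    print(sessions_7d(events, local_anchor))  # 0 -> the same event falls outside the local window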


Common Mistakes

  1. duplicating transformation logic across teams
  2. no point-in-time join guarantees
  3. missing owner/SLA for critical features
  4. no freshness and drift alerts
  5. no versioning of feature definitions


Adoption Strategy

  1. centralize top critical features first
  2. enforce definition and ownership metadata
  3. add point-in-time dataset generation tooling
  4. integrate online serving parity checks
  5. scale governance with catalog + policy automation

Start with high-value features, not full migration of everything.


Key Takeaways

  • Feature stores are reliability infrastructure for ML systems.
  • Point-in-time correctness is the cornerstone of valid training data.
  • Training-serving consistency requires shared contracts, not just shared storage.
  • Governance, monitoring, and ownership are essential for long-term platform health.