Dimensionality Reduction with PCA, UMAP, and t-SNE

High-dimensional data introduces noise, sparsity, and computational cost. Dimensionality reduction can improve model stability, speed, and interpretability when used correctly.


Why Reduce Dimensions?

Main reasons:

  • denoise correlated or weak features
  • reduce training and inference cost
  • improve visualization for exploratory analysis
  • mitigate the curse of dimensionality in distance-based methods

Dimensionality reduction is a means to better downstream performance, not a goal by itself.


PCA: Linear Workhorse

PCA finds orthogonal directions (principal components) that capture maximum variance.

Strengths:

  • fast and stable
  • deterministic given the same preprocessing (up to component sign flips across implementations)
  • useful preprocessing for linear and distance-based models

Limitations:

  • linear assumption
  • variance-maximizing directions may not align with task label signal

Use explained-variance curves to choose component count pragmatically.
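As a minimal sketch of that choice (plain NumPy, assuming rows are samples and features have comparable scales), the explained-variance ratio follows directly from the singular values of the centered data:

```python
import numpy as np

def pca_explained_variance(X):
    """Return the explained-variance ratio of each principal component."""
    Xc = X - X.mean(axis=0)                      # center each feature
    _, s, _ = np.linalg.svd(Xc, full_matrices=False)
    var = s ** 2 / (X.shape[0] - 1)              # per-component variance
    return var / var.sum()

# Toy data: two nearly identical features plus one independent noise feature.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([base,
               base + 0.05 * rng.normal(size=(200, 1)),
               rng.normal(size=(200, 1))])
ratios = pca_explained_variance(X)
# The first component dominates because two features are almost redundant.
```

A pragmatic rule is then to keep the smallest number of components whose cumulative ratio crosses a chosen threshold (0.9 or 0.95 are common, but the threshold itself should be validated downstream).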


t-SNE: Visualization-Focused Technique

t-SNE preserves local neighborhoods for 2D/3D plots. It is excellent for visual cluster exploration, not for faithful global geometry.

Important cautions:

  • distances between far clusters may be misleading
  • layout changes with perplexity and seed
  • not suitable as a production feature transformation

Treat t-SNE as an exploratory visualization tool.
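The seed sensitivity above is easy to demonstrate. A small sketch (assuming scikit-learn is available) runs t-SNE twice on the same two-blob dataset with different seeds; the blobs separate both times, but the absolute coordinates differ, which is why embeddings should be compared by neighborhood overlap rather than by raw positions:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Two well-separated Gaussian blobs; 40 points supports a perplexity of 5.
X = np.vstack([rng.normal(0, 0.3, size=(20, 5)),
               rng.normal(3, 0.3, size=(20, 5))])

# Same data, same perplexity, different seeds: local cluster structure
# agrees, but the layout (rotation, placement) does not.
emb_a = TSNE(n_components=2, perplexity=5, init="random",
             random_state=0).fit_transform(X)
emb_b = TSNE(n_components=2, perplexity=5, init="random",
             random_state=1).fit_transform(X)
```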


UMAP: Modern Nonlinear Embedding

UMAP often preserves local structure better at scale and can retain more global organization than t-SNE in practice.

Key parameters:

  • n_neighbors: local vs global emphasis
  • min_dist: compactness of embedding clusters

UMAP is useful for both visualization and some downstream embedding workflows, but still needs validation.


Supervised vs Unsupervised Reduction

Some methods can use labels (supervised variants) to preserve class-separating directions. This may improve downstream task performance but risks overfitting if evaluation is weak.

Always fit reducers inside training folds only to avoid leakage.
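One way to enforce this (a sketch assuming scikit-learn) is to put the scaler and reducer in a Pipeline, so cross-validation refits both on each training fold and no test-fold statistics leak into the transformation:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# StandardScaler and PCA are refit inside every training fold,
# so the reduction never sees held-out data.
pipe = make_pipeline(StandardScaler(),
                     PCA(n_components=2),
                     LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
```

Fitting the scaler or PCA on the full dataset first, then cross-validating only the classifier, would give optimistically biased scores.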


Practical Workflow

  1. standardize numeric features
  2. run PCA baseline and downstream model evaluation
  3. explore UMAP/t-SNE for structure diagnosis
  4. compare model performance with and without reduction
  5. monitor stability across seeds and time slices

If the reduced representation does not improve quality or efficiency, skip it.
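Step 5 can be checked mechanically. A minimal sketch (plain NumPy; `neighborhood_overlap` is a hypothetical helper defined here, not a library function) scores how well two embeddings agree on each point's k nearest neighbors, which is a more meaningful stability measure than comparing raw coordinates:

```python
import numpy as np

def neighborhood_overlap(emb_a, emb_b, k=10):
    """Mean fraction of shared k-nearest neighbors between two embeddings."""
    def knn(E):
        d = np.linalg.norm(E[:, None, :] - E[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)          # exclude each point itself
        return np.argsort(d, axis=1)[:, :k]
    na, nb = knn(emb_a), knn(emb_b)
    shared = [len(set(a) & set(b)) / k for a, b in zip(na, nb)]
    return float(np.mean(shared))

# An embedding compared with itself scores 1.0; unrelated embeddings
# score near chance level (roughly k / (n - 1)).
rng = np.random.default_rng(0)
E = rng.normal(size=(50, 2))
F = rng.normal(size=(50, 2))
```

Running the same reducer with several seeds (and over different time slices of the data) and tracking this overlap gives a concrete stability number to monitor.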


Common Mistakes

  1. fitting PCA/UMAP on the full dataset before the train/test split
  2. selecting embedding by visual appeal only
  3. overinterpreting t-SNE distances
  4. ignoring reproducibility settings

Key Takeaways

  • PCA is a strong first choice for robust linear compression
  • UMAP/t-SNE are powerful for structure exploration, with interpretation limits
  • validate dimensionality reduction by downstream metrics and stability