Concept: PCA and SVD

Core Intuition

PCA identifies the most meaningful basis to re-express data by maximizing the signal-to-noise ratio (SNR): $\mathrm{SNR} = \sigma^2_{\mathrm{signal}} / \sigma^2_{\mathrm{noise}}$.

To maximize the SNR:
  • Maximize signal: measured by variance magnitude; directions of large variance encode interesting structure.
  • Minimize redundancy (noise): measured by covariance magnitude; large covariance between two features means they largely duplicate each other.

Mathematical Foundation

Setup

Let the observed data be

$$X \in \mathbb{R}^{n \times d},$$

where each row represents an example on $d$-dimensional features:

$$X = \begin{bmatrix} x_{11} & \cdots & x_{1d} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{nd} \end{bmatrix}.$$

For notational simplicity, we also denote each row by $x_i \in \mathbb{R}^d$, so $X = [x_1, \dots, x_n]^\top$. Each column

$$X_{\cdot j} = (x_{1j}, \dots, x_{nj})^\top$$

is normalized to zero mean ($\frac{1}{n}\sum_i x_{ij} = 0$) and unit variance ($\frac{1}{n-1}\sum_i x_{ij}^2 = 1$).

The covariance matrix of $X$ is

$$C_X = \frac{1}{n-1} X^\top X \in \mathbb{R}^{d \times d}.$$

Goal: find an orthonormal $P \in \mathbb{R}^{d \times d}$ such that the covariance of $Y = XP$ is a diagonal matrix,

$$C_Y = \frac{1}{n-1} Y^\top Y = \operatorname{diag}(\lambda_1, \dots, \lambda_d).$$

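A minimal NumPy sketch of this setup on synthetic data (the names X and C_X are only illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))          # n = 200 examples, d = 5 features

# Standardize each column to zero mean and unit variance (ddof=1 matches 1/(n-1)).
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

n = X.shape[0]
C_X = X.T @ X / (n - 1)                # d x d covariance (here: correlation) matrix

# Sanity check against NumPy's own covariance routine.
assert np.allclose(C_X, np.cov(X, rowvar=False))
```
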
PCA Assumptions

  • Linear transformation: let $Y = XP$, so each new feature is a linear combination of the original features.
  • Large variance encodes important structure.
  • Principal components are orthogonal.

The last assumption is an intuitive simplification that makes PCA solvable with standard linear-algebra decompositions.

Eigendecomposition

Suppose $Y = XP$ for an orthonormal matrix $P$. Its covariance matrix is

$$C_Y = \frac{1}{n-1} Y^\top Y = \frac{1}{n-1} P^\top X^\top X P = P^\top C_X P.$$

Since $C_X$ is symmetric, it admits the spectral decomposition:

$$C_X = V \Lambda V^\top,$$

where $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_d)$ with $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d \ge 0$, and $V = [v_1, \dots, v_d]$ has orthonormal columns ($V^\top V = V V^\top = I$).

Sketch of proof: From the spectral decomposition, for all $i$:

$$C_X v_i = \lambda_i v_i.$$

Then $C_X V = V \Lambda$, which implies $V^\top C_X V = V^\top V \Lambda = \Lambda$.

Choosing $P = V$:

$$C_Y = P^\top C_X P = V^\top C_X V = \Lambda.$$

The eigenvalue $\lambda_i$ is the variance of the $i$-th projected feature $Y_{\cdot i} = X v_i$ (the $i$-th principal component).
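
A short numerical check of this diagonalization on synthetic standardized data (np.linalg.eigh is used because $C_X$ is symmetric):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardized data
n = X.shape[0]
C_X = X.T @ X / (n - 1)

# Spectral decomposition of the symmetric covariance matrix.
eigvals, V = np.linalg.eigh(C_X)                   # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]           # lambda_1 >= ... >= lambda_d

Y = X @ V                                          # projected data, one component per column
C_Y = Y.T @ Y / (n - 1)

assert np.allclose(C_Y, np.diag(eigvals), atol=1e-8)   # C_Y is (numerically) diagonal
assert np.allclose(Y.var(axis=0, ddof=1), eigvals)     # lambda_i = variance of column i of Y
```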

Alternative derivation via the Lagrangian: PCA finds a unit vector $v$ (with $v^\top v = 1$) maximizing the variance of the projections of all examples onto $v$:

$$\operatorname{Var}(v) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i^\top v)^2 = v^\top C_X v.$$

Why does the variance take this form? If the angle between $x_i$ and $v$ is $\theta$, the projection of $x_i$ onto $v$ is

$$\operatorname{proj}_v(x_i) = \frac{x_i^\top v}{v^\top v}\, v = (x_i^\top v)\, v,$$

since $v$ is a unit vector. The length of this projection is

$$\|(x_i^\top v)\, v\| = |x_i^\top v| = \|x_i\|\, |\cos\theta|,$$

since $\|v\| = 1$. So the projection distance from the origin is $x_i^\top v$, and because every feature has zero mean, the variance of these projections is simply their mean square, $\frac{1}{n-1}\sum_i (x_i^\top v)^2 = v^\top C_X v$.

The optimization problem is then

$$\max_{v}\; v^\top C_X v \quad \text{subject to} \quad v^\top v = 1.$$

Using Lagrange multipliers:

$$\mathcal{L}(v, \lambda) = v^\top C_X v - \lambda\,(v^\top v - 1).$$

Taking the partial derivative with respect to $v$ and setting it to zero:

$$\frac{\partial \mathcal{L}}{\partial v} = 2\, C_X v - 2\lambda v = 0.$$

Hence $C_X v = \lambda v$, recovering the eigenvector equation. At the optimum the objective equals $v^\top C_X v = \lambda$, so the maximizer is the eigenvector with the largest eigenvalue.
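
A quick sanity check of this result on synthetic data: the top eigenvector of $C_X$ satisfies the eigenvector equation, and no random unit vector beats its projected variance (illustrative sketch only):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
C_X = X.T @ X / (X.shape[0] - 1)

eigvals, V = np.linalg.eigh(C_X)
v1 = V[:, -1]                                # eigenvector of the largest eigenvalue

def projected_variance(v):
    # Variance of the projections x_i^T v (the data already has zero mean).
    return float(v @ C_X @ v)

# Stationarity condition C_X v = lambda v holds for v1 ...
assert np.allclose(C_X @ v1, eigvals[-1] * v1)

# ... and no random unit vector attains a larger projected variance.
for _ in range(1000):
    w = rng.normal(size=5)
    w /= np.linalg.norm(w)
    assert projected_variance(w) <= projected_variance(v1) + 1e-12
```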

SVD

Recall $C_X = \frac{1}{n-1} X^\top X = V \Lambda V^\top$, and let $\hat{X} \equiv \frac{1}{\sqrt{n-1}} X$, so that $C_X = \hat{X}^\top \hat{X}$. Define:

  • Singular values: $\sigma_i = \sqrt{\lambda_i}$ for $i = 1, \dots, r$, where $r = \operatorname{rank}(X)$ (so $\lambda_i > 0$).
  • Left singular vectors: $u_i = \frac{1}{\sigma_i}\, \hat{X} v_i$, with $u_i \in \mathbb{R}^n$ and $v_i$ the $i$-th eigenvector of $C_X$.

Then:

  • (1) $\{u_i\}$ is orthonormal: $u_i^\top u_j = \delta_{ij}$.
  • (2) $\|\hat{X} v_i\| = \sigma_i$.

Sketch of proof:

The first result follows from $u_i^\top u_j = \frac{1}{\sigma_i \sigma_j}\, v_i^\top \hat{X}^\top \hat{X} v_j = \frac{\lambda_j}{\sigma_i \sigma_j}\, v_i^\top v_j = \delta_{ij}$. The second follows similarly from $\|\hat{X} v_i\|^2 = v_i^\top C_X v_i = \lambda_i = \sigma_i^2$.

By rewriting the definition of $u_i$:

$$\hat{X} v_i = \sigma_i u_i.$$

That is, the normalized data matrix $\hat{X}$ multiplied by an eigenvector $v_i$ of $C_X$ equals the scalar $\sigma_i$ times the unit vector $u_i$.
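
A sketch verifying these relations numerically, under the $\hat{X} = X / \sqrt{n-1}$ convention used above:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
n = X.shape[0]
Xhat = X / np.sqrt(n - 1)                       # normalized data matrix
C_X = Xhat.T @ Xhat                             # covariance matrix

eigvals, V = np.linalg.eigh(C_X)
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]

sigma = np.sqrt(eigvals)                        # singular values sigma_i = sqrt(lambda_i)
U = (Xhat @ V) / sigma                          # columns u_i = Xhat v_i / sigma_i

assert np.allclose(U.T @ U, np.eye(5), atol=1e-8)   # {u_i} is orthonormal
assert np.allclose(Xhat @ V, U * sigma)              # Xhat v_i = sigma_i u_i
```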

Matrix Version of SVD

Constructing $\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_r)$, and stacking the columns $U = [u_1, \dots, u_r]$, $V = [v_1, \dots, v_r]$, gives the (compact) SVD:

$$\hat{X} = U \Sigma V^\top.$$

Any matrix $A = U \Sigma V^\top$ decomposes into:

  • Rotation: $V^\top$ maps the input onto the right singular directions.
  • Stretch: $\Sigma$ scales each direction by $\sigma_i$.
  • Second rotation: $U$ maps the result onto the left singular directions.
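
A tiny illustration of this rotate, stretch, rotate reading on an arbitrary 2x2 example matrix:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])                  # any matrix works; this one is just an example
U, s, Vt = np.linalg.svd(A)

x = np.array([1.0, -2.0])
step1 = Vt @ x                              # rotate into the right-singular basis
step2 = s * step1                           # stretch each axis by sigma_i
step3 = U @ step2                           # rotate into the left-singular basis

assert np.allclose(step3, A @ x)            # same result as applying A directly
```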

Connection Between PCA and SVD

Squared singular values equal PCA variances: $\lambda_i = \sigma_i^2$, where $\sigma_i$ are the singular values of $\hat{X} = X / \sqrt{n-1}$. Equivalently, if the SVD is taken of $X$ itself, $\lambda_i = \sigma_i^2 / (n-1)$.
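
A numerical check of this connection; note that np.linalg.svd applied to the raw standardized $X$ returns singular values whose squares must be divided by $n-1$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
n = X.shape[0]

# Eigenvalues of the covariance matrix (PCA variances), sorted descending.
lam = np.sort(np.linalg.eigvalsh(X.T @ X / (n - 1)))[::-1]

# Singular values of the standardized data matrix X.
s = np.linalg.svd(X, compute_uv=False)

assert np.allclose(lam, s**2 / (n - 1))                                        # SVD of X
assert np.allclose(lam, np.linalg.svd(X / np.sqrt(n - 1), compute_uv=False)**2)  # SVD of Xhat
```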

How Many Principal Components to Use

Total variance: $\sum_{i=1}^{d} \lambda_i = \operatorname{tr}(C_X)$, which equals $d$ when every feature is standardized to unit variance.

Scree plot: plot $\lambda_i$ (or the explained-variance ratio $\lambda_i / \sum_j \lambda_j$) against the component index and choose the number of principal components at the elbow, where the curve flattens.

[Figure: scree plot of eigenvalues $\lambda_i$ versus component index]
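
A sketch of selecting $k$ from the explained-variance ratios (the 90% threshold below is an arbitrary illustration; in practice, read the elbow off the scree plot):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
n = X.shape[0]

lam = np.sort(np.linalg.eigvalsh(X.T @ X / (n - 1)))[::-1]   # PCA variances, descending

explained_ratio = lam / lam.sum()               # fraction of total variance per component
cumulative = np.cumsum(explained_ratio)

# One simple rule: smallest k whose components retain 90% of the total variance.
k = int(np.searchsorted(cumulative, 0.90) + 1)
print(explained_ratio.round(3), k)

# A scree plot is just lam (or explained_ratio) against the component index 1..d.
```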

Reduced Rank Approximation by SVD

From the SVD (here of $X$ itself; the $1/\sqrt{n-1}$ scaling does not change which rank-$k$ matrix is optimal):

$$X = U \Sigma V^\top = \sum_{i=1}^{r} \sigma_i u_i v_i^\top.$$

Let $k \le r$. The reduced rank-$k$ least-squares approximation is

$$X_k = \sum_{i=1}^{k} \sigma_i u_i v_i^\top,$$

which minimizes the Frobenius norm

$$\|X - B\|_F^2 = \sum_{i,j} (X_{ij} - B_{ij})^2$$

over all matrices $B$ of rank no greater than $k$ (Eckart-Young theorem).
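
A sketch of the rank-$k$ truncation, checking that the Frobenius error matches the tail $\sqrt{\sum_{i > k} \sigma_i^2}$ predicted by Eckart-Young:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 5
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]          # best rank-k approximation of X

err = np.linalg.norm(X - X_k, 'fro')
assert np.allclose(err, np.sqrt(np.sum(s[k:]**2)))   # Eckart-Young error formula
```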

PCA Algorithm Summary

  1. Standardize each feature: $x_{ij} \leftarrow (x_{ij} - \mu_j) / s_j$, where $\mu_j$ and $s_j$ are the column mean and standard deviation.
  2. Compute the eigenvectors $v_1, \dots, v_d$ of $C_X = \frac{1}{n-1} X^\top X$ (or, equivalently, the SVD of $X$).
  3. The $j$-th principal component is $X v_j$; its explained variance is $\lambda_j$.
  4. Select the number of components via a scree plot (elbow on $\lambda_j$); the total variance is $\sum_{j=1}^{d} \lambda_j$ (see the sketch below).
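
Putting the steps together, a minimal end-to-end sketch (the function name pca and its return values are illustrative, not a library API):

```python
import numpy as np

def pca(X, k):
    """Return the top-k principal component scores, loadings, and all variances."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # step 1: standardize
    C = X.T @ X / (n - 1)                              # step 2: covariance matrix
    lam, V = np.linalg.eigh(C)
    order = np.argsort(lam)[::-1]
    lam, V = lam[order], V[:, order]                   # step 3: sort by explained variance
    scores = X @ V[:, :k]                              # step 4: project onto top-k directions
    return scores, V[:, :k], lam

rng = np.random.default_rng(0)
scores, components, variances = pca(rng.normal(size=(100, 6)), k=2)
print(scores.shape, variances.round(3))
```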

Key Equation

$$\hat{X} = U \Sigma V^\top \quad\Longleftrightarrow\quad C_X = \hat{X}^\top \hat{X} = V \Sigma^2 V^\top = V \Lambda V^\top.$$

Analogy

PCA is like finding the natural axes of an ellipse fitted to a point cloud: the major axis captures the most spread (highest variance), the minor axis captures the least. SVD is the same operation written in terms of the data matrix directly rather than its covariance.

Insights

  • SVD and PCA are two views of the same decomposition: $\hat{X} = U \Sigma V^\top$ and $C_X = V \Lambda V^\top$ with $\lambda_i = \sigma_i^2$.
  • The rank-$k$ truncated SVD is the optimal least-squares (Frobenius-norm) approximation to $X$ (Eckart-Young theorem).
  • PCA is scale-sensitive: normalization is mandatory before computing the covariance.

Pitfalls

  • PCA assumes linear structure; nonlinear geometry requires Kernel PCA or manifold methods.
  • PCA is sensitive to outliers since covariance is not robust.

Connections

Implementation Notes

  • Prefer np.linalg.svd on $X$ over explicitly forming and eigendecomposing $X^\top X$: forming $X^\top X$ squares the condition number and hurts numerical stability.
  • For large $n$ and $d$, use randomized SVD (e.g., sklearn.utils.extmath.randomized_svd) to avoid the $O(n d \min(n, d))$ cost of a full SVD.
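
A sketch of the randomized route, assuming scikit-learn is installed (the shapes and n_components=10 are arbitrary for illustration):

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 300))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Approximate the top-10 singular triplets without computing a full SVD.
U, s, Vt = randomized_svd(X, n_components=10, random_state=0)

# Top-10 PCA variances (up to small approximation error).
variances = s**2 / (X.shape[0] - 1)
print(variances.round(3))
```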
