Concept: PCA and SVD
Core Intuition
PCA identifies the most meaningful basis to re-express data by maximizing the signal-to-noise ratio (SNR): $\mathrm{SNR} = \sigma^2_{\text{signal}} / \sigma^2_{\text{noise}}$.

- Maximize signal: measured by variance magnitude; large variance encodes interesting structure.
- Minimize redundancy (noise): measured by covariance magnitude.
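A small numpy sketch of this intuition (the two-feature toy data and seed are illustrative, not from the text): in a covariance matrix, the diagonal entries are the variances (signal) and the off-diagonal entries are the covariances (redundancy).

```python
import numpy as np

# Toy data: two strongly correlated features (a hypothetical point cloud).
rng = np.random.default_rng(0)
z = rng.normal(size=500)
X = np.column_stack([z + 0.1 * rng.normal(size=500),
                     2 * z + 0.1 * rng.normal(size=500)])
X -= X.mean(axis=0)

C = X.T @ X / (len(X) - 1)   # covariance matrix
# Diagonal: variances ("signal"); off-diagonal: covariances ("redundancy").
print(np.diag(C))
print(C[0, 1])               # large covariance -> the features are redundant
```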
Mathematical Foundation
Setup
Let the observed data be
$$X \in \mathbb{R}^{n \times p},$$
where each row represents an example with $p$-dimensional features:
$$X = \begin{bmatrix} x_1^\top \\ \vdots \\ x_n^\top \end{bmatrix}.$$
For notational simplicity, we denote each row by $x_i \in \mathbb{R}^p$. Each column $X_j$
is normalized to zero mean and unit variance.
The covariance matrix of $X$ is
$$C_X = \frac{1}{n-1} X^\top X.$$
Goal: find an orthogonal matrix $V$ such that the covariance of $Y = XV$ is a diagonal matrix,
$$C_Y = \frac{1}{n-1} Y^\top Y = \Lambda.$$
PCA Assumptions
- Linear transformation: let $Y = XV$, where the columns of $V$ form the new basis.
- Large variance encodes important structure.
- Principal components are orthogonal.
The last assumption provides an intuitive simplification that makes PCA soluble with linear algebra decomposition techniques.
Eigendecomposition
Suppose $Y = XV$. Its covariance matrix is
$$C_Y = \frac{1}{n-1} Y^\top Y = \frac{1}{n-1} V^\top X^\top X V = V^\top C_X V.$$
Since $C_X$ is symmetric, it admits the spectral decomposition
$$C_X = V \Lambda V^\top,$$
where $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_p)$ with $\lambda_1 \ge \dots \ge \lambda_p \ge 0$, and $V = [v_1, \dots, v_p]$ has orthonormal columns ($V^\top V = I$).
Sketch of proof: From the spectral decomposition, for all $j$:
$$C_X v_j = \lambda_j v_j.$$
Then $C_X V = V \Lambda$, which implies $V^\top C_X V = \Lambda$.
Choosing this $V$ as the transformation:
$$C_Y = V^\top C_X V = \Lambda.$$
The eigenvalue $\lambda_j$ is the variance of the $j$-th projected feature $y_j = X v_j$.
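The diagonalization can be checked numerically. A sketch with `np.linalg.eigh` (the random correlated data and seed are illustrative): the covariance of the projected data $Y = XV$ should come out diagonal, with the eigenvalues on the diagonal.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))  # correlated features
X -= X.mean(axis=0)

C = X.T @ X / (len(X) - 1)       # covariance C_X
lam, V = np.linalg.eigh(C)       # eigh: ascending eigenvalues, orthonormal V
lam, V = lam[::-1], V[:, ::-1]   # reorder so lambda_1 >= ... >= lambda_p

Y = X @ V                        # projected data
C_Y = Y.T @ Y / (len(Y) - 1)
# C_Y is diagonal with the eigenvalues (projected variances) on the diagonal.
print(np.allclose(C_Y, np.diag(lam)))
```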
Alternative derivation via Lagrangian: PCA finds a unit vector $w$ (with $\|w\| = 1$) maximizing the variance of the projection of all examples onto $w$:
$$\frac{1}{n-1} \sum_{i=1}^{n} (x_i^\top w)^2 = w^\top C_X w.$$
Why does the variance take this form? If the angle between $x_i$ and $w$ is $\theta$, the projection of $x_i$ onto $w$ is
$$(x_i^\top w)\, w,$$
since $w$ is a unit vector. The length of this projection is
$$|x_i^\top w| = \|x_i\| \, |\cos\theta|,$$
since $\|w\| = 1$. So the signed projection distance from the origin is $x_i^\top w$.
The optimization problem is then
$$\max_{w} \; w^\top C_X w \quad \text{s.t.} \quad w^\top w = 1.$$
Using Lagrange multipliers:
$$L(w, \lambda) = w^\top C_X w - \lambda (w^\top w - 1).$$
Taking the partial derivative with respect to $w$ and setting it to zero:
$$\frac{\partial L}{\partial w} = 2 C_X w - 2 \lambda w = 0.$$
Hence $C_X w = \lambda w$, recovering the eigenvector equation.
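The fixed point $C_X w = \lambda w$ can also be reached numerically. A sketch using power iteration (not named in the text above, but a standard route to the top eigenvector): repeatedly applying $C_X$ and renormalizing converges to the unit $w$ that maximizes $w^\top C_X w$.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3)) @ np.diag([3.0, 1.0, 0.3])  # illustrative data
X -= X.mean(axis=0)
C = X.T @ X / (len(X) - 1)

# Power iteration: apply C and renormalize; converges to the top eigenvector.
w = rng.normal(size=3)
for _ in range(200):
    w = C @ w
    w /= np.linalg.norm(w)

lam_top = w @ C @ w              # Rayleigh quotient = the maximized variance
print(lam_top, np.linalg.eigvalsh(C).max())
```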
SVD
Recall $C_X v_j = \lambda_j v_j$, so $X^\top X v_j = (n-1)\lambda_j v_j$. Define:
- Singular values: $\sigma_j = \sqrt{(n-1)\lambda_j}$ for each $\lambda_j > 0$.
- Left singular vectors: $u_j = \frac{1}{\sigma_j} X v_j$ with $u_j \in \mathbb{R}^n$.
Then:
- (1) $\{u_j\}$ is orthonormal: $u_i^\top u_j = \delta_{ij}$.
- (2) $\|X v_j\| = \sigma_j$.
Sketch of proof:
The first result follows from
$$u_i^\top u_j = \frac{1}{\sigma_i \sigma_j} v_i^\top X^\top X v_j = \frac{\sigma_j^2}{\sigma_i \sigma_j} v_i^\top v_j = \delta_{ij}.$$
The second follows similarly.
By rewriting the definition of $u_j$:
$$X v_j = \sigma_j u_j.$$
That is, $X$ multiplied by an eigenvector of $X^\top X$ equals the scalar $\sigma_j$ times the unit vector $u_j$.
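Both claims are easy to verify numerically. A sketch under the definitions above ($\sigma_j = \sqrt{(n-1)\lambda_j}$, $u_j = X v_j / \sigma_j$; the random data and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
X -= X.mean(axis=0)

lam, V = np.linalg.eigh(X.T @ X / (len(X) - 1))
lam, V = lam[::-1], V[:, ::-1]               # descending eigenvalues
sigma = np.sqrt((len(X) - 1) * lam)          # sigma_j = sqrt((n-1) lambda_j)

U = (X @ V) / sigma                          # u_j = X v_j / sigma_j, columnwise
print(np.allclose(U.T @ U, np.eye(4)))       # (1) {u_j} is orthonormal
print(np.allclose(np.linalg.norm(X @ V, axis=0), sigma))  # (2) ||X v_j|| = sigma_j
```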
Matrix Version of SVD
Constructing $\Sigma \in \mathbb{R}^{n \times p}$ with diagonal entries $\Sigma_{jj} = \sigma_j$ (zeros elsewhere), and stacking the columns $U = [u_1, \dots, u_n]$ (completed to an orthonormal basis of $\mathbb{R}^n$), $V = [v_1, \dots, v_p]$:
$$X = U \Sigma V^\top.$$
Any matrix decomposes into:
- Rotation: $V^\top$
- Stretch: $\Sigma$
- Second rotation: $U$
Connection Between PCA and SVD
Squared singular values give the PCA variances up to the normalization factor: $\lambda_j = \sigma_j^2 / (n-1)$.
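This connection can be confirmed directly by comparing `np.linalg.svd` on $X$ with the eigenvalues of the covariance matrix (a sketch on illustrative random data):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
X -= X.mean(axis=0)

s = np.linalg.svd(X, compute_uv=False)                   # singular values of X
lam = np.linalg.eigvalsh(X.T @ X / (len(X) - 1))[::-1]   # PCA variances, descending

# lambda_j = sigma_j^2 / (n - 1)
print(np.allclose(s**2 / (len(X) - 1), lam))
```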
How Many Principal Components to Use
Total variance: $\sum_{j=1}^{p} \lambda_j = \operatorname{tr}(C_X)$.
Scree plot: choose the number of principal components by the elbow method.
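The quantities behind a scree plot can be sketched as follows (the data, with a few dominant directions, is illustrative): the per-component explained-variance ratio $\lambda_j / \sum_k \lambda_k$ and its cumulative sum are what one inspects for an elbow.

```python
import numpy as np

rng = np.random.default_rng(5)
# Illustrative data with a few dominant directions.
X = rng.normal(size=(500, 6)) @ np.diag([5.0, 3.0, 1.0, 0.2, 0.1, 0.05])
X -= X.mean(axis=0)

lam = np.linalg.eigvalsh(X.T @ X / (len(X) - 1))[::-1]   # descending variances
ratio = lam / lam.sum()          # explained-variance ratio per component
cumulative = np.cumsum(ratio)
print(np.round(ratio, 3))        # drops sharply after the dominant components
print(np.round(cumulative, 3))   # pick k where the curve flattens (elbow)
```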

Reduced Rank Approximation by SVD
From SVD:
$$X = U \Sigma V^\top = \sum_{j=1}^{r} \sigma_j u_j v_j^\top.$$
Let $k \le r$. The reduced rank-$k$ least-squares approximation is
$$X_k = \sum_{j=1}^{k} \sigma_j u_j v_j^\top,$$
which minimizes the Frobenius norm
$$\|X - B\|_F^2 = \sum_{i,j} (X_{ij} - B_{ij})^2$$
over all matrices $B$ of rank no greater than $k$ (Eckart-Young theorem).
PCA Algorithm Summary
- Standardize each feature: $X_{ij} \leftarrow (X_{ij} - \mu_j) / s_j$.
- Compute the eigenvectors $v_1, \dots, v_p$ of $C_X = \frac{1}{n-1} X^\top X$.
- The $j$-th principal component is $X v_j$; its explained-variance ratio is $\lambda_j / \sum_{k=1}^{p} \lambda_k$.
- Select the number of components via scree plot (elbow on $\lambda_j$); total variance is $\sum_{j=1}^{p} \lambda_j$.
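The steps above can be sketched end to end (a minimal implementation on illustrative data with one shared latent factor, so the first component dominates):

```python
import numpy as np

def pca(X, k):
    """Sketch of the summary above: standardize, eigendecompose, project."""
    mu, sd = X.mean(axis=0), X.std(axis=0, ddof=1)
    Z = (X - mu) / sd                      # standardize each feature
    C = Z.T @ Z / (len(Z) - 1)             # covariance matrix C_X
    lam, V = np.linalg.eigh(C)
    lam, V = lam[::-1], V[:, ::-1]         # descending variances
    return Z @ V[:, :k], lam[:k] / lam.sum()

rng = np.random.default_rng(7)
base = rng.normal(size=(200, 1))                       # shared latent factor
X = base @ np.ones((1, 5)) + 0.3 * rng.normal(size=(200, 5))
scores, ratio = pca(X, 2)
print(scores.shape, np.round(ratio, 2))   # first ratio is large: one dominant axis
```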
Key Equation
$$X = U \Sigma V^\top \quad\Longleftrightarrow\quad C_X = V \Lambda V^\top, \qquad \lambda_j = \frac{\sigma_j^2}{n-1}.$$
Analogy
PCA is like finding the natural axes of an ellipse fitted to a point cloud: the major axis captures the most spread (highest variance), the minor axis captures the least. SVD is the same operation written in terms of the data matrix directly rather than its covariance.
Insights
- SVD and PCA are two views of the same decomposition: $X = U \Sigma V^\top \iff C_X = V \Lambda V^\top$.
- Rank- SVD is the optimal least-squares approximation to (Eckart-Young theorem).
- PCA is scale-sensitive: normalization is mandatory before computing the covariance.
Pitfalls
- PCA assumes linear structure; nonlinear geometry requires Kernel PCA or manifold methods.
- PCA is sensitive to outliers since covariance is not robust.
Connections
- Correspondence Analysis: generalized SVD applied to categorical count data.
- Kernel PCA todo
- Eckart-Young Theorem todo
Implementation Notes
- Prefer `np.linalg.svd` over explicit eigendecomposition for numerical stability.
- For large $n$ or $p$, use randomized SVD (e.g., `sklearn.utils.extmath.randomized_svd`) to avoid the cost of a full SVD.