Concept: PCA and SVD
Core Intuition
PCA identifies the most meaningful basis to re-express data by maximizing the signal-to-noise ratio (SNR): $\mathrm{SNR} = \sigma^2_{\text{signal}} / \sigma^2_{\text{noise}}$.

- Maximize signal: measured by variance magnitude; large variance encodes interesting structure.
- Minimize redundancy (noise): measured by covariance magnitude.
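A small numpy sketch of this intuition (the two-feature toy data and seed are illustrative, not from the text): in a covariance matrix, the diagonal entries are the variances (signal) and the off-diagonal entries are the covariances (redundancy).

```python
import numpy as np

# Toy data: two strongly correlated features (a hypothetical point cloud).
rng = np.random.default_rng(0)
z = rng.normal(size=500)
X = np.column_stack([z + 0.1 * rng.normal(size=500),
                     2 * z + 0.1 * rng.normal(size=500)])
X -= X.mean(axis=0)

C = X.T @ X / (len(X) - 1)   # covariance matrix
# Diagonal: variances ("signal"); off-diagonal: covariances ("redundancy").
print(np.diag(C))
print(C[0, 1])               # large covariance -> the features are redundant
```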
Mathematical Foundation
Setup
Let the observed data be
$$X \in \mathbb{R}^{n \times p},$$
where each row represents an example with $p$-dimensional features:
$$X = \begin{bmatrix} x_1^\top \\ \vdots \\ x_n^\top \end{bmatrix}.$$
For notational simplicity, we denote each row by $x_i \in \mathbb{R}^p$. Each column $X_j$
is normalized to zero mean and unit variance.
The covariance matrix of $X$ is
$$C_X = \frac{1}{n-1} X^\top X.$$
Goal: find an orthogonal matrix $V$ such that the covariance of $Y = XV$ is a diagonal matrix,
$$C_Y = \frac{1}{n-1} Y^\top Y = \Lambda.$$
PCA Assumptions
- Linear transformation: let $Y = XV$, where the columns of $V$ form the new basis.
- Large variance encodes important structure.
- Principal components are orthogonal.
The last assumption provides an intuitive simplification that makes PCA soluble with linear algebra decomposition techniques.
Eigendecomposition
Suppose $Y = XV$. Its covariance matrix is
$$C_Y = \frac{1}{n-1} Y^\top Y = \frac{1}{n-1} V^\top X^\top X V = V^\top C_X V.$$
Since $C_X$ is symmetric, it admits the spectral decomposition
$$C_X = V \Lambda V^\top,$$
where $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_p)$ with $\lambda_1 \ge \dots \ge \lambda_p \ge 0$, and $V = [v_1, \dots, v_p]$ has orthonormal columns ($V^\top V = I$).
Sketch of proof: From the spectral decomposition, for all $j$:
$$C_X v_j = \lambda_j v_j.$$
Then $C_X V = V \Lambda$, which implies $V^\top C_X V = \Lambda$.
Choosing this $V$ as the transformation:
$$C_Y = V^\top C_X V = \Lambda.$$
The eigenvalue $\lambda_j$ is the variance of the $j$-th projected feature $y_j = X v_j$.
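The diagonalization can be checked numerically. A sketch with `np.linalg.eigh` (the random correlated data and seed are illustrative): the covariance of the projected data $Y = XV$ should come out diagonal, with the eigenvalues on the diagonal.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))  # correlated features
X -= X.mean(axis=0)

C = X.T @ X / (len(X) - 1)       # covariance C_X
lam, V = np.linalg.eigh(C)       # eigh: ascending eigenvalues, orthonormal V
lam, V = lam[::-1], V[:, ::-1]   # reorder so lambda_1 >= ... >= lambda_p

Y = X @ V                        # projected data
C_Y = Y.T @ Y / (len(Y) - 1)
# C_Y is diagonal with the eigenvalues (projected variances) on the diagonal.
print(np.allclose(C_Y, np.diag(lam)))
```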
Alternative derivation via Lagrangian: PCA finds a unit vector $w$ (with $\|w\| = 1$) maximizing the variance of the projection of all examples onto $w$:
$$\frac{1}{n-1} \sum_{i=1}^{n} (x_i^\top w)^2 = w^\top C_X w.$$
Why does the variance take this form? If the angle between $x_i$ and $w$ is $\theta$, the projection of $x_i$ onto $w$ is
$$(x_i^\top w)\, w,$$
since $w$ is a unit vector. The length of this projection is
$$|x_i^\top w| = \|x_i\| \, |\cos\theta|,$$
since $\|w\| = 1$. So the signed projection distance from the origin is $x_i^\top w$.
The optimization problem is then
$$\max_{w} \; w^\top C_X w \quad \text{s.t.} \quad w^\top w = 1.$$
Using Lagrange multipliers:
$$L(w, \lambda) = w^\top C_X w - \lambda (w^\top w - 1).$$
Taking the partial derivative with respect to $w$ and setting it to zero:
$$\frac{\partial L}{\partial w} = 2 C_X w - 2 \lambda w = 0.$$
Hence $C_X w = \lambda w$, recovering the eigenvector equation.
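The fixed point $C_X w = \lambda w$ can also be reached numerically. A sketch using power iteration (not named in the text above, but a standard route to the top eigenvector): repeatedly applying $C_X$ and renormalizing converges to the unit $w$ that maximizes $w^\top C_X w$.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3)) @ np.diag([3.0, 1.0, 0.3])  # illustrative data
X -= X.mean(axis=0)
C = X.T @ X / (len(X) - 1)

# Power iteration: apply C and renormalize; converges to the top eigenvector.
w = rng.normal(size=3)
for _ in range(200):
    w = C @ w
    w /= np.linalg.norm(w)

lam_top = w @ C @ w              # Rayleigh quotient = the maximized variance
print(lam_top, np.linalg.eigvalsh(C).max())
```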
SVD
Recall $C_X v_j = \lambda_j v_j$, so $X^\top X v_j = (n-1)\lambda_j v_j$. Define:
- Singular values: $\sigma_j = \sqrt{(n-1)\lambda_j}$ for each $\lambda_j > 0$.
- Left singular vectors: $u_j = \frac{1}{\sigma_j} X v_j$ with $u_j \in \mathbb{R}^n$.
Then:
- (1) $\{u_j\}$ is orthonormal: $u_i^\top u_j = \delta_{ij}$.
- (2) $\|X v_j\| = \sigma_j$.
Sketch of proof:
The first result follows from
$$u_i^\top u_j = \frac{1}{\sigma_i \sigma_j} v_i^\top X^\top X v_j = \frac{\sigma_j^2}{\sigma_i \sigma_j} v_i^\top v_j = \delta_{ij}.$$
The second follows similarly.
By rewriting the definition of $u_j$:
$$X v_j = \sigma_j u_j.$$
That is, $X$ multiplied by an eigenvector of $X^\top X$ equals the scalar $\sigma_j$ times the unit vector $u_j$.
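Both claims are easy to verify numerically. A sketch under the definitions above ($\sigma_j = \sqrt{(n-1)\lambda_j}$, $u_j = X v_j / \sigma_j$; the random data and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
X -= X.mean(axis=0)

lam, V = np.linalg.eigh(X.T @ X / (len(X) - 1))
lam, V = lam[::-1], V[:, ::-1]               # descending eigenvalues
sigma = np.sqrt((len(X) - 1) * lam)          # sigma_j = sqrt((n-1) lambda_j)

U = (X @ V) / sigma                          # u_j = X v_j / sigma_j, columnwise
print(np.allclose(U.T @ U, np.eye(4)))       # (1) {u_j} is orthonormal
print(np.allclose(np.linalg.norm(X @ V, axis=0), sigma))  # (2) ||X v_j|| = sigma_j
```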
Matrix Version of SVD
Constructing $\Sigma \in \mathbb{R}^{n \times p}$ with diagonal entries $\Sigma_{jj} = \sigma_j$ (zeros elsewhere), and stacking the columns $U = [u_1, \dots, u_n]$ (completed to an orthonormal basis of $\mathbb{R}^n$), $V = [v_1, \dots, v_p]$:
$$X = U \Sigma V^\top.$$
Any matrix decomposes into:
- Rotation: $V^\top$
- Stretch: $\Sigma$
- Second rotation: $U$
Connection Between PCA and SVD
Squared singular values give the PCA variances up to the normalization factor: $\lambda_j = \sigma_j^2 / (n-1)$.
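This connection can be confirmed directly by comparing `np.linalg.svd` on $X$ with the eigenvalues of the covariance matrix (a sketch on illustrative random data):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
X -= X.mean(axis=0)

s = np.linalg.svd(X, compute_uv=False)                   # singular values of X
lam = np.linalg.eigvalsh(X.T @ X / (len(X) - 1))[::-1]   # PCA variances, descending

# lambda_j = sigma_j^2 / (n - 1)
print(np.allclose(s**2 / (len(X) - 1), lam))
```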
How Many Principal Components to Use
Total variance: $\sum_{j=1}^{p} \lambda_j = \operatorname{tr}(C_X)$.
Scree plot: choose the number of principal components by the elbow method.
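The quantities behind a scree plot can be sketched as follows (the data, with a few dominant directions, is illustrative): the per-component explained-variance ratio $\lambda_j / \sum_k \lambda_k$ and its cumulative sum are what one inspects for an elbow.

```python
import numpy as np

rng = np.random.default_rng(5)
# Illustrative data with a few dominant directions.
X = rng.normal(size=(500, 6)) @ np.diag([5.0, 3.0, 1.0, 0.2, 0.1, 0.05])
X -= X.mean(axis=0)

lam = np.linalg.eigvalsh(X.T @ X / (len(X) - 1))[::-1]   # descending variances
ratio = lam / lam.sum()          # explained-variance ratio per component
cumulative = np.cumsum(ratio)
print(np.round(ratio, 3))        # drops sharply after the dominant components
print(np.round(cumulative, 3))   # pick k where the curve flattens (elbow)
```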

Reduced Rank Approximation by SVD
From SVD:
$$X = U \Sigma V^\top = \sum_{j=1}^{r} \sigma_j u_j v_j^\top.$$
Let $k \le r$. The reduced rank-$k$ least-squares approximation is
$$X_k = \sum_{j=1}^{k} \sigma_j u_j v_j^\top,$$
which minimizes the Frobenius norm
$$\|X - B\|_F^2 = \sum_{i,j} (X_{ij} - B_{ij})^2$$
over all matrices $B$ of rank no greater than $k$ (Eckart-Young theorem).
PCA Algorithm Summary
- Standardize each feature: $X_{ij} \leftarrow (X_{ij} - \mu_j) / s_j$.
- Compute the eigenvectors $v_1, \dots, v_p$ of $C_X = \frac{1}{n-1} X^\top X$.
- The $j$-th principal component is $X v_j$; its explained-variance ratio is $\lambda_j / \sum_{k=1}^{p} \lambda_k$.
- Select the number of components via scree plot (elbow on $\lambda_j$); total variance is $\sum_{j=1}^{p} \lambda_j$.
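The steps above can be sketched end to end (a minimal implementation on illustrative data with one shared latent factor, so the first component dominates):

```python
import numpy as np

def pca(X, k):
    """Sketch of the summary above: standardize, eigendecompose, project."""
    mu, sd = X.mean(axis=0), X.std(axis=0, ddof=1)
    Z = (X - mu) / sd                      # standardize each feature
    C = Z.T @ Z / (len(Z) - 1)             # covariance matrix C_X
    lam, V = np.linalg.eigh(C)
    lam, V = lam[::-1], V[:, ::-1]         # descending variances
    return Z @ V[:, :k], lam[:k] / lam.sum()

rng = np.random.default_rng(7)
base = rng.normal(size=(200, 1))                       # shared latent factor
X = base @ np.ones((1, 5)) + 0.3 * rng.normal(size=(200, 5))
scores, ratio = pca(X, 2)
print(scores.shape, np.round(ratio, 2))   # first ratio is large: one dominant axis
```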
Key Equation
$$X = U \Sigma V^\top \quad\Longleftrightarrow\quad C_X = V \Lambda V^\top, \qquad \lambda_j = \frac{\sigma_j^2}{n-1}.$$
Analogy
PCA is like finding the natural axes of an ellipse fitted to a point cloud: the major axis captures the most spread (highest variance), the minor axis captures the least. SVD is the same operation written in terms of the data matrix directly rather than its covariance.
Insights
- SVD and PCA are two views of the same decomposition: $X = U \Sigma V^\top \iff C_X = V \Lambda V^\top$.
- Rank- SVD is the optimal least-squares approximation to (Eckart-Young theorem).
- PCA is scale-sensitive: normalization is mandatory before computing the covariance.
Pitfalls
- PCA assumes linear structure; nonlinear geometry requires Kernel PCA or manifold methods.
- PCA is sensitive to outliers since covariance is not robust.
Connections
- Correspondence Analysis: generalized SVD applied to categorical count data.
- Kernel PCA todo
- Eckart-Young Theorem todo
Implementation Notes
- Prefer `np.linalg.svd` over explicit eigendecomposition for numerical stability.
- For large $n$ or $p$, use randomized SVD (e.g., `sklearn.utils.extmath.randomized_svd`) to avoid the cost of a full SVD.