
PCA

Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms correlated features into a new set of uncorrelated variables called Principal Components (PCs).

The Data Matrix as Random Variables

Note: PCA is an unsupervised method, so the ‘data matrix’ here plays the same role as the ‘feature matrix’ in a supervised setting.

Imagine we have a real data matrix $X$ with $n$ rows and $d$ columns (shape $n\times d$), where each row $i\in[n]$ is a sample from an experiment (a sampling process), and each column $j\in[d]$ is a feature, which can be seen as a (one-dimensional) random variable observed over the $n$ samples.

What does this mean?

Recall that a random variable is a function that maps the sample space to the real numbers.

\[Z: \Omega \longrightarrow \mathbb{R}\]

In this case, each row is a sample from the sample space, $X_i \in \Omega$, and each column maps rows to real numbers (the entries of the data matrix):

\[Z_j\left(X_i\right)\in\mathbb{R}\]

The entries in the data matrix describe the behavior of each random variable (column). With infinitely many samples, we recover the distribution of each random variable.

An $n \times d$ data matrix is multiple realizations of $d$ random variables; the sample size $n$ can be large or small. What matters is that the samples are governed by the underlying random variables. The $j$-th column then collects the realizations of the random variable $Z_j$.

Each random variable has a mean and a variance. If two random variables are dependent, we quantify their linear relationship using covariance.

Consider an $n \times d$ centered data matrix $X$. The covariance matrix of these $d$ features is given by

\[C=\frac{1}{n-1} X^T X\]

Note: we use $\frac{1}{n-1}$ (Bessel’s correction) for the unbiased sample covariance; some references use $\frac{1}{n}$ instead.

It has shape $d\times d$: entry $(j, k)$ is the covariance between features $j$ and $k$ across all samples.
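As a quick sanity check, here is a minimal NumPy sketch (variable names and data are my own) computing this covariance matrix from a raw data matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(100, 5))        # n = 100 samples, d = 5 features

X = X_raw - X_raw.mean(axis=0)           # center each column (feature)
C = X.T @ X / (X.shape[0] - 1)           # d x d sample covariance (Bessel-corrected)

# np.cov uses the same 1/(n-1) convention; rowvar=False treats columns as features
assert np.allclose(C, np.cov(X_raw, rowvar=False))
```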

Principal Component Analysis (PCA)

PCA finds orthogonal directions $\mathbf{w}_1, \ldots, \mathbf{w}_d$ that maximise the variance of the projected data. These directions are the eigenvectors of $C$, and the corresponding eigenvalues give the variance captured by each principal component:

\[C \mathbf{w}_i=\lambda_i \mathbf{w}_i, \quad \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d \geq 0\]

PCA reduces to computing the eigendecomposition of $C$.
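A minimal sketch of this eigendecomposition route in NumPy (my own setup, assuming a centered data matrix as above):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X -= X.mean(axis=0)                       # center the data
C = X.T @ X / (X.shape[0] - 1)            # covariance matrix

eigvals, eigvecs = np.linalg.eigh(C)      # eigh: for symmetric matrices, ascending order
order = np.argsort(eigvals)[::-1]         # reorder so lambda_1 >= ... >= lambda_d
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
# Column i of eigvecs is the principal direction w_i;
# eigvals[i] is the variance captured along it.
```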

Singular Value Decomposition (SVD)

This is the method used to implement PCA in practice.

Decompose data using SVD:

\[X=U \Sigma V^T\]

The columns of $V$ are the eigenvectors of $X^T X$; $V$ is the matrix of principal axes.

The connection to PCA is direct: since $C = \frac{1}{n-1} X^T X$ and $X = U \Sigma V^T$, we have $C = \frac{1}{n-1} V \Sigma^2 V^T$. So the right singular vectors of $X$ are the eigenvectors of $C$, and the eigenvalues of $C$ are $\lambda_i = \sigma_i^2 / (n-1)$. SVD of $X$ gives us the PCA decomposition directly.
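A short numerical check of this correspondence (a sketch with my own random data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X -= X.mean(axis=0)
n = X.shape[0]

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U @ np.diag(s) @ Vt

C = X.T @ X / (n - 1)
eigvals = np.sort(np.linalg.eigvalsh(C))[::-1]     # descending eigenvalues of C

# Eigenvalues of C equal sigma_i^2 / (n - 1).
assert np.allclose(eigvals, s**2 / (n - 1))
```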

In PCA, each column of $V$ is a direction in the feature space $\mathbb{R}^d$. These directions are the principal axes along which the data $X$ has maximum variance.

When we calculate $X V$, we take each row of $X$ (a sample, i.e. an observation in $\mathbb{R}^d$) and express it in the new coordinate system defined by the columns of $V$. This is precisely the principal component projection $T = XV$:

\[T=X V=U \Sigma V^T V=U \Sigma\]

The matrix $T$ is called the score matrix; it contains the principal component scores.

Principal component scores are the coordinates of each observation in the new (principal component) basis. The first column of $T$ is the coordinate along the first principal component (the direction of greatest variance).
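A quick check that the two expressions for the score matrix agree (same NumPy conventions as in the sketches above):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X -= X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
T = X @ Vt.T                      # scores: project samples onto the principal axes
assert np.allclose(T, U * s)      # U * s scales column i of U by sigma_i, i.e. U @ diag(s)
```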

For dimension reduction, we consider the matrix $V_L$ that contains the first $L$ columns of $V$. The first column of $V$ is the direction along which the data have maximal variance; the second column is the direction of second-largest variance, perpendicular to the first.

The truncated score matrix is

\[T_L=X V_L\]

$T_L$ has a lower dimension than the original data matrix $X$; this is what we expect from PCA. In machine learning, we can use it to train the model. For test dataset $X_{test}$, we project using the same $V_L$ from the training data: $X_{test}V_L$. (Important: $V_L$ is computed from the training set only, not recomputed on the test set. This is a common source of bugs.)
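A minimal sketch of this train/test workflow (the data and names are placeholders): both the centering statistics and $V_L$ come from the training set.

```python
import numpy as np

rng = np.random.default_rng(1)
X_train = rng.normal(size=(80, 10))    # placeholder training data
X_test = rng.normal(size=(20, 10))     # placeholder test data
L = 3                                  # number of components to keep

mean = X_train.mean(axis=0)            # centering statistics from training data only
Xc = X_train - mean

_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
V_L = Vt[:L].T                         # first L principal axes, shape (d, L)

T_L_train = Xc @ V_L                   # training scores, shape (n_train, L)
T_L_test = (X_test - mean) @ V_L       # reuse the training mean and V_L for test data
```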

Why does it work?

PCA maps data to its maximal variance direction. But why use the largest variance direction?

If the data are not pure noise, the covariance spectrum contains both large and small eigenvalues, with a large eigenvalue corresponding to a direction of large variance.

Assumption: PCA implicitly assumes that signal lives in high-variance directions and noise lives in low-variance directions. Our true data live in a lower dimensional subspace. This is a strong assumption that fails when, e.g., a low-variance feature is highly predictive of the label.

PCA drops low-variance directions and keeps high-variance ones. Within the kept directions, PCA also mixes (linearly combines) the original features into principal components, making them orthogonal.

Note: when PCA drops low-variance dimensions, it is only dropping directions that appear less informative in terms of $X$; there is no guarantee about how those dimensions relate to the label $y$. For supervised dimensionality reduction that accounts for labels, see Linear Discriminant Analysis (LDA).

If features are nearly independent, the covariance matrix is approximately diagonal. Each eigenvalue then roughly equals the corresponding feature’s variance, and each principal component aligns with an individual feature. In this case, PCA contributes little (but independence is a good property).

Conversely, PCA is most useful when features are highly correlated, because it can compress redundant information into fewer components. When the covariance matrix has significant off-diagonal entries, PCA merges linearly dependent features (for example, if feature 3 is roughly $2 \times \text{feature 1} + 3 \times \text{feature 2}$) into a principal component. It cannot capture nonlinear relationships.
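A small illustration of this compression, reusing the example above (feature 3 is roughly $2 \times$ feature 1 $+ 3 \times$ feature 2; the noise scale is my choice):

```python
import numpy as np

rng = np.random.default_rng(2)
f1 = rng.normal(size=500)
f2 = rng.normal(size=500)
f3 = 2 * f1 + 3 * f2 + 0.01 * rng.normal(size=500)   # nearly linearly dependent

X = np.column_stack([f1, f2, f3])
X -= X.mean(axis=0)

_, s, _ = np.linalg.svd(X, full_matrices=False)
print(s**2 / np.sum(s**2))   # variance ratios: first two components carry ~all variance
```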



