Tag Archives: Dimensionality Reduction

PCA like a Mathematician

Principal Component Analysis, or PCA, is a fundamental dimensionality reduction technique used in Machine Learning. The general goal is to transform a {d}-dimensional problem into a {k}-dimensional one, where {k} is smaller than {d}. This can be used, for instance, to plot high-dimensional data on a 2D or 3D plot, or to reduce the number of features in an effort to eliminate random noise in a machine learning problem (made precise below). In this post, we will discuss PCA in mathematical terms. The material is adapted from a first course in Linear Algebra, the principle of maximum likelihood estimation, and the Eckart-Young theorem from 1936.

{\smallskip}

Example: Suppose we are presented with {m} samples {x_1 , \ldots , x_m}, each a 100-dimensional vector. After inspection, we realize that each data point is of the form {c_i z}, where {c_i} is a real number and {z} is a fixed 100-dimensional vector. Thus, to describe each {x_i}, we simply need to specify the single number {c_i}, rather than the 100 coordinates of {x_i}. This effectively reduces the dimensionality of our data from 100 to 1. {\spadesuit} {\smallskip}
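In code, this hypothetical situation is easy to detect with the SVD (introduced formally below); here is a minimal numpy sketch, where the sample count of 20 and the random data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical rank-1 data: each 100-dimensional sample x_i is c_i * z.
z = rng.standard_normal(100)
c = rng.standard_normal(20)        # one coefficient c_i per sample
X = np.outer(c, z)                 # shape (20, 100), but effectively 1-dimensional

# The SVD exposes the hidden structure: only one singular value is nonzero,
# so a single number per sample suffices to describe the data.
singular_values = np.linalg.svd(X, compute_uv=False)
numerical_rank = int(np.sum(singular_values > 1e-8))
print(numerical_rank)  # 1
```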

The goal of PCA is to provide a methodical way of accomplishing this in general. There are a few issues to overcome.

  • The data may not be of the form {c_i z}, but rather {c_i z_1 + d_i z_2}, or even higher dimensional.
  • The data might only be approximately of the form {c_i z}, i.e. of the form {c_i z + e_i }, where {e_i} is a small error term.
  • We may not know, a priori, the form of the data. How do we determine the best {k}-dimensional representation of the data?

Let us address these issues below.

1. PCA in Mathematical Terms

Suppose we are presented with a matrix {X} of size {m \times d}, where each row is a sample and each column is a feature. We denote the rows of {X} by {x_1 , \ldots , x_m}. We suppose further, without loss of generality after centering and scaling, that each column of {X} has zero mean and unit variance.
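Concretely, the centering and scaling assumed here can be sketched as follows (the raw data below is a placeholder):

```python
import numpy as np

rng = np.random.default_rng(1)
X = 3.0 * rng.standard_normal((50, 5)) + 7.0   # raw data: nonzero mean, non-unit variance

# Standardize each column (feature) to zero mean and unit variance,
# matching the assumption made on X above.
X = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X.mean(axis=0), 0.0))  # True
print(np.allclose(X.std(axis=0), 1.0))   # True
```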

Our goal is to find the best approximation of the form

\displaystyle x_i \approx z_i^T W ,

for all {i = 1 , \ldots , m}, where {W} is a matrix of size {k \times d} and {z_i} is a vector of size {k}. Thus, each {x_i} is generated from lower dimensional data, via the {z_i}.

We suppose, more precisely, that

\displaystyle x_i =z_i^T W + \epsilon_i,

where {\epsilon_i \sim N(0, I)} is an error term. That is, each coordinate of {\epsilon_i} is normally distributed with mean zero and variance 1. From here, we plan to derive PCA. One can wonder whether the assumption on {\epsilon_i} is reasonable, which should be addressed on an individual basis. The assumption that {x_i} is generated by lower dimensional data is sometimes referred to as a latent representation, and the interested reader may consult this blog post for applications of this idea to generative models, such as image generation.

From this assumption, how do we find {W} and the {z_i}? The answer comes from the principle of Maximum Likelihood Estimation (MLE). Utilizing the formula for the probability density function of a multivariate normal distribution, for a fixed {W} and {z_1 , \ldots , z_m}, the probability density of observing {x_1 , \ldots , x_m} is given by

\displaystyle P(x_1 , \ldots , x_m | W, z_1 , \ldots , z_m) = \prod_{i=1}^m \frac{1}{(2 \pi )^{d/2}} \exp \left( - \frac{1}{2 } \| x_i - z_i^TW \|^2 \right).

The norm in the above equation is the classic Euclidean norm. The principle of MLE states that we should choose {W} and the {z_i} to maximize this probability. It is typically easier to maximize a sum rather than a product; for instance, in Calculus, the derivative of a sum is simpler to compute than that of a product. A classic mathematical trick that converts a product into a sum is to take logs. As the logarithm is a monotonically increasing function, maximizing the log of a function is equivalent to maximizing the function itself. Thus, we consider the log likelihood

\displaystyle \log P(x_1 , \ldots , x_m | W, z_1 , \ldots , z_m) = - \frac{md}{2} \log (2 \pi) - \frac{1}{2} \sum_{i=1}^m \| x_i - z_i^T W \|^2.

As our goal is to maximize the above, we can equivalently minimize the sum

\displaystyle \sum_{i=1}^m \| x_i - z_i^T W \|^2,

after removing the constant term that does not depend on {W} or the {z_i} and scaling the remaining summand. We can rewrite the above as

\displaystyle  \| X - ZW\|_{F}^2,

where {\| \cdot \|_F} denotes the Frobenius norm, {X} is the matrix of size {m \times d} whose rows are {x_1 , \ldots , x_m}, {Z} is the matrix of size {m \times k} whose rows are {z_1 , \ldots , z_m}, and {W} is, as above, of size {k \times d}.
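The identification of the sum of squared errors with a squared Frobenius norm can be sanity-checked numerically; the dimensions below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(2)
m, d, k = 8, 5, 2
X = rng.standard_normal((m, d))   # data matrix, rows x_i
Z = rng.standard_normal((m, k))   # latent matrix, rows z_i
W = rng.standard_normal((k, d))

# Sum of the squared row-wise errors ||x_i - z_i^T W||^2 ...
row_sum = sum(np.linalg.norm(X[i] - Z[i] @ W) ** 2 for i in range(m))
# ... equals the squared Frobenius norm ||X - ZW||_F^2.
frob = np.linalg.norm(X - Z @ W, 'fro') ** 2
print(np.isclose(row_sum, frob))  # True
```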

2. The Eckart-Young Theorem

The solution to this problem comes from the Eckart-Young theorem. Unfortunately, the original paper is behind a paywall, but we will provide the ideas here. Recall the rank of a matrix is the dimension of its column space.

Let us also recall that the Singular Value Decomposition (SVD) of a matrix {X} of size {m \times d} is given by

\displaystyle X = U \Sigma V^T,

where {U} is an {m \times m} orthogonal matrix, {\Sigma} is an {m \times d} rectangular diagonal matrix, and {V} is a {d \times d} orthogonal matrix. The diagonal entries of {\Sigma} are called the singular values of {X}. It turns out that the columns of {U} are the eigenvectors of {X X^T} and the columns of {V} are the eigenvectors of {X^T X}.
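These facts about the SVD are easy to verify numerically; a small sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((6, 4))

U, S, Vt = np.linalg.svd(X)   # X = U @ Sigma @ V^T, singular values in S

# The eigenvalues of X^T X are the squared singular values of X
# (np.linalg.eigh returns them in ascending order, so we reverse).
eigvals = np.linalg.eigh(X.T @ X)[0][::-1]
print(np.allclose(eigvals, S ** 2))  # True
```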

{\smallskip}

Theorem (Eckart-Young): Let {X} be any matrix of size {m \times d} and let {k \leq d}. Then the minimization problem

\displaystyle \min_{X_k} \|X - X_k\|_{F}^2 \ \ {\rm s.t.} \ \ {\rm rank}(X_k) \leq k,

is solved by

\displaystyle X_k = U \Sigma_{k}V^T,

where

\displaystyle X = U \Sigma V^T,

is the Singular Value Decomposition of {X} and {\Sigma_k} is the {m \times d} matrix that is zero except that its first {k} columns agree with those of {\Sigma}. {\spadesuit} {\smallskip}

There is a lot to unpack here:

  • How does this solve our original problem?
  • Why is this theorem true?

Let us address these in order. First note that {{\rm rank}(ZW) \leq {\rm rank}(Z) \leq k}, as {Z} has {k} columns. Thus, we were seeking the best rank {k} approximation to {X} above. Here we may take the {z_i} to be the rows of the {m \times k} matrix formed by the first {k} columns of {U \Sigma}, and {W} to be the first {k} rows of {V^T}, so that {ZW = U \Sigma_k V^T = X_k}. This nicely generalizes our original one-dimensional example, where now

\displaystyle x_i \approx \sum_{j=1}^k \sigma_j u_{ij} v_j,

where {\sigma_j} is the {j}th singular value of {X}, {u_{ij}} is the {(i,j)} entry of {U}, and {v_j} is the {j}th column of {V}.

{\smallskip}

Proof of Eckart-Young: By the SVD, it is enough to find a matrix {X_k} of rank at most {k} that minimizes

\displaystyle ||U \Sigma V^T - X_k||_{F}^2.

Since the Frobenius norm is invariant under orthogonal transformations, we may equivalently minimize

\displaystyle ||\Sigma - U^T X_k V||_{F}^2.

But {U^T X_k V} has rank at most {k}, since multiplying by the invertible matrices {U^T} and {V} does not change the rank of {X_k}. Since {\Sigma} is a diagonal matrix, it is straightforward (or see here) to see that the best rank {k} approximation to {\Sigma} is obtained by keeping the first {k} columns of {\Sigma} and setting the rest to zero, that is, by {\Sigma_k}. Thus,

\displaystyle \Sigma_k = U^T X_k V,

and the result follows. {\spadesuit}
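The theorem translates directly into a few lines of numpy: truncate the SVD to obtain the best rank-{k} approximation. The comparison against a random rank-{k} factorization below is purely an illustrative assumption:

```python
import numpy as np

def best_rank_k(X, k):
    """Best rank-k approximation of X in Frobenius norm, per Eckart-Young."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] @ np.diag(S[:k]) @ Vt[:k]

rng = np.random.default_rng(4)
X = rng.standard_normal((10, 6))
k = 2
Xk = best_rank_k(X, k)

# The squared approximation error is the tail of the squared singular values ...
S = np.linalg.svd(X, compute_uv=False)
print(np.isclose(np.linalg.norm(X - Xk, 'fro') ** 2, np.sum(S[k:] ** 2)))  # True

# ... and no other rank-k factorization ZW can do better.
Z = rng.standard_normal((10, k))
W = np.linalg.lstsq(Z, X, rcond=None)[0]   # best W for this fixed random Z
print(np.linalg.norm(X - Xk, 'fro') <= np.linalg.norm(X - Z @ W, 'fro'))  # True
```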

3. A Nice Aftereffect

Thanks to the singular value decomposition, we are equipped with an algorithm for reducing the dimension of non-training data. To see this, note that we may compute the latent variables via

\displaystyle Z = X V_k,

where {V_k} is the matrix of size {d \times k} whose columns are the first {k} columns of {V}. Thus, given a new data point {x}, we may compute its latent representation via

\displaystyle z = x^T V_k.

This is a nice feature of PCA that is not afforded by some other dimensionality reduction techniques, such as t-SNE. It is what makes the transform method available in sklearn’s PCA implementation, allowing PCA to be used in the feature engineering portion of a machine learning pipeline.
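For example, the latent map {z = x^T V_k} agrees with sklearn’s transform method, up to the arbitrary sign of each singular vector; a small sketch, assuming the data has already been centered:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
X = rng.standard_normal((30, 8))
X -= X.mean(axis=0)                  # PCA centers the data, so center it up front

k = 3
Vt = np.linalg.svd(X, full_matrices=False)[2]
Vk = Vt[:k].T                        # d x k: first k columns of V

Z_manual = X @ Vk                    # latent representation z_i = x_i^T V_k
Z_sklearn = PCA(n_components=k).fit(X).transform(X)

# The two agree up to a sign flip per component, since each singular
# vector is only determined up to sign.
print(np.allclose(np.abs(Z_manual), np.abs(Z_sklearn)))  # True
```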

4. An Alternate Perspective

The more classical way to approach PCA is to search for a unit vector {v \in \mathbb{R}^d} satisfying

\displaystyle v = \arg \max_{\|u\| = 1} \| Xu\|.

Thus, {v} is the vector that maximizes the variance of the data projected onto the line generated by {v}. However, the MLE approach above is more natural from a machine learning perspective.

That being said, I answered a question on math stackexchange a while back that provides a derivation of PCA from this perspective. I was curious whether the Lagrange multiplier approach, outlined in that post, could be used to derive the Eckart-Young theorem.

To see how to approach this, suppose we want to prove Eckhart-Young for {k=1}, i.e. we would like to find a rank 1 matrix {X_1}, that minimizes

\displaystyle ||X - X_1||_{F}^2.

As {X_1} is rank 1, it can be written as {uv^T}, where {u} and {v} are vectors. Thus, we are looking for {u} and {v} that minimize

\displaystyle ||X - uv^T||_{F}^2.

This is an optimization problem in {u} and {v}. Let us take the gradients with respect to {u} and {v} and set them equal to zero. By the definition of the Frobenius norm and the multivariate chain rule, we have

\displaystyle \nabla_u ||X - uv^T||_{F}^2 = -2 (X - uv^T)v = 0,

and

\displaystyle \nabla_v ||X - uv^T||_{F}^2 = -2 (X - uv^T)^T u = 0.

Thus,

\displaystyle Xv = u (v^Tv), \ \ \ X^T u = v (u^T u).

These are precisely the equations satisfied by a singular pair of {X}: normalizing {\hat u = u/\|u\|} and {\hat v = v/\|v\|}, they become {X \hat v = \sigma \hat u} and {X^T \hat u = \sigma \hat v} with {\sigma = \|u\| \|v\|}.
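These stationarity equations suggest an alternating scheme: fix {v} and solve for {u}, then fix {u} and solve for {v}. Here is a sketch of that iteration (the iteration count is an arbitrary choice), which converges to the top singular pair for generic starting points:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((12, 7))

# Alternate between the two stationarity conditions:
#   u = Xv / (v^T v)   and   v = X^T u / (u^T u).
v = rng.standard_normal(7)
for _ in range(200):
    u = X @ v / (v @ v)
    v = X.T @ u / (u @ u)

# The recovered pair yields the top singular value sigma_1 = ||Xv|| / ||v||.
sigma1 = np.linalg.norm(X @ v) / np.linalg.norm(v)
print(np.isclose(sigma1, np.linalg.svd(X, compute_uv=False)[0]))  # True
```

This is essentially the power method applied to {X^T X}, specialized to the rank-1 case.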

I’d like to see a cleaned-up version of this argument in the general case.