Tag Archives: Dimensionality Reduction

PCA like a Mathematician

Principal Component Analysis, or PCA, is a fundamental dimensionality reduction technique used in Machine Learning. The general goal is to transform a {d}-dimensional problem into a {k}-dimensional one, where {k} is smaller than {d}. This can be used, for instance, to plot high-dimensional data on a 2D or 3D plot, or to reduce the number of features in an effort to eliminate random noise in a machine learning problem (made precise below). In this post, we will discuss PCA in mathematical terms. The material is adapted from a first course in Linear Algebra, the principle of maximum likelihood estimation, and the Eckart-Young theorem from 1936.

{\smallskip}

Example: Suppose we are presented with {m} samples {x_1 , \ldots , x_m}, each a 100-dimensional vector. After inspection, we realize that each data point is of the form {c_i z}, where {c_i} is a real number and {z} is a fixed 100-dimensional vector. Thus, to describe each {x_i}, we simply need to specify the single number {c_i}, rather than the 100 coordinates of {x_i}. This effectively reduces the dimensionality of our data from 100 to 1. {\spadesuit} {\smallskip}
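In code, this hypothetical situation is easy to detect with the SVD (introduced formally below); here is a minimal numpy sketch, where the sample count of 20 and the random data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical rank-1 data: each 100-dimensional sample x_i is c_i * z.
z = rng.standard_normal(100)
c = rng.standard_normal(20)        # one coefficient c_i per sample
X = np.outer(c, z)                 # shape (20, 100), but effectively 1-dimensional

# The SVD exposes the hidden structure: only one singular value is nonzero,
# so a single number per sample suffices to describe the data.
singular_values = np.linalg.svd(X, compute_uv=False)
numerical_rank = int(np.sum(singular_values > 1e-8))
print(numerical_rank)  # 1
```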

The goal of PCA is to provide a methodical way of accomplishing this in general. There are a few issues to overcome.

  • The data may not be of the form {c_i z}, but rather {c_i z_1 + d_i z_2}, or even higher dimensional.
  • The data might only be approximately of the form {c_i z}, i.e. of the form {c_i z + e_i }, where {e_i} is a small error term.
  • We may not know, a priori, the form of the data. How do we determine the best {k}-dimensional representation of the data?

Let us address these issues below.

1. PCA in Mathematical Terms

Suppose we are presented with a matrix {X} of size {m \times d}, where each row is a sample and each column is a feature. We denote the rows of {X} by {x_1 , \ldots , x_m}. We suppose further, without loss of generality after centering and scaling, that each column of {X} has zero mean and unit variance.
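Concretely, the centering and scaling assumed here can be sketched as follows (the raw data below is a placeholder):

```python
import numpy as np

rng = np.random.default_rng(1)
X = 3.0 * rng.standard_normal((50, 5)) + 7.0   # raw data: nonzero mean, non-unit variance

# Standardize each column (feature) to zero mean and unit variance,
# matching the assumption made on X above.
X = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X.mean(axis=0), 0.0))  # True
print(np.allclose(X.std(axis=0), 1.0))   # True
```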

Our goal is to find the best approximation of the form

\displaystyle x_i \approx z_i^T W ,

for all {i = 1 , \ldots , m}, where {W} is a matrix of size {k \times d} and {z_i} is a vector of size {k}. Thus, each {x_i} is generated from lower dimensional data, via the {z_i}.

We suppose, more precisely, that

\displaystyle x_i =z_i^T W + \epsilon_i,

where {\epsilon_i \sim N(0, I)} is an error term. That is, each coordinate of {\epsilon_i} is normally distributed with mean zero and variance 1. From here, we plan to derive PCA. One can wonder whether the assumption on {\epsilon_i} is reasonable, which should be addressed on an individual basis. The assumption that {x_i} is generated by lower dimensional data is sometimes referred to as a latent representation, and the interested reader may consult this blog post for applications of this idea to generative models, such as image generation.

From this assumption, how do we find {W} and the {z_i}? The answer comes from the principle of Maximum Likelihood Estimation (MLE). Utilizing the formula for the probability density function of a multivariate normal distribution, for a fixed {W} and {z_1 , \ldots , z_m}, the probability density of observing {x_1 , \ldots , x_m} is given by

\displaystyle P(x_1 , \ldots , x_m | W, z_1 , \ldots , z_m) = \prod_{i=1}^m \frac{1}{(2 \pi )^{d/2}} \exp \left( - \frac{1}{2 } \| x_i - z_i^TW \|^2 \right).

The norm in the above equation is the classic Euclidean norm. The principle of MLE states that we should choose {W} and the {z_i} to maximize this probability. It is typically easier to maximize a sum rather than a product; for instance, in Calculus, the derivative of a sum is simpler to compute than that of a product. A classic mathematical trick that converts a product into a sum is to take logs. As the logarithm is a monotonically increasing function, maximizing the log of a function is equivalent to maximizing the function itself. Thus, we consider the log likelihood

\displaystyle \log P(x_1 , \ldots , x_m | W, z_1 , \ldots , z_m) = - \frac{md}{2} \log (2 \pi) - \frac{1}{2} \sum_{i=1}^m \| x_i - z_i^T W \|^2.

As our goal is to maximize the above, we can equivalently minimize the sum

\displaystyle \sum_{i=1}^m \| x_i - z_i^T W \|^2,

after removing the constant term that does not depend on {W} or the {z_i} and scaling the remaining summand. We can rewrite the above as

\displaystyle  \| X - ZW\|_{F}^2,

where {\| \cdot \|_F} denotes the Frobenius norm, {X} is the matrix of size {m \times d} whose rows are {x_1 , \ldots , x_m}, {Z} is the matrix of size {m \times k} whose rows are {z_1 , \ldots , z_m}, and {W} is, as above, of size {k \times d}.
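The identification of the sum of squared errors with a squared Frobenius norm can be sanity-checked numerically; the dimensions below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(2)
m, d, k = 8, 5, 2
X = rng.standard_normal((m, d))   # data matrix, rows x_i
Z = rng.standard_normal((m, k))   # latent matrix, rows z_i
W = rng.standard_normal((k, d))

# Sum of the squared row-wise errors ||x_i - z_i^T W||^2 ...
row_sum = sum(np.linalg.norm(X[i] - Z[i] @ W) ** 2 for i in range(m))
# ... equals the squared Frobenius norm ||X - ZW||_F^2.
frob = np.linalg.norm(X - Z @ W, 'fro') ** 2
print(np.isclose(row_sum, frob))  # True
```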

2. The Eckart-Young Theorem

The solution to this problem comes from the Eckart-Young theorem. Unfortunately, the original paper is behind a paywall, but we will provide the ideas here. Recall the rank of a matrix is the dimension of its column space.

Let us also recall that the Singular Value Decomposition (SVD) of a matrix {X} of size {m \times d} is given by

\displaystyle X = U \Sigma V^T,

where {U} is an {m \times m} orthogonal matrix, {\Sigma} is an {m \times d} rectangular diagonal matrix, and {V} is a {d \times d} orthogonal matrix. The diagonal entries of {\Sigma} are called the singular values of {X}. It turns out that the columns of {U} are the eigenvectors of {X X^T} and the columns of {V} are the eigenvectors of {X^T X}.
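These facts about the SVD are easy to verify numerically; a small sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((6, 4))

U, S, Vt = np.linalg.svd(X)   # X = U @ Sigma @ V^T, singular values in S

# The eigenvalues of X^T X are the squared singular values of X
# (np.linalg.eigh returns them in ascending order, so we reverse).
eigvals = np.linalg.eigh(X.T @ X)[0][::-1]
print(np.allclose(eigvals, S ** 2))  # True
```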

{\smallskip}

Theorem (Eckart-Young): Let {X} be any matrix of size {m \times d} and let {k \leq d}. Then the minimization problem

\displaystyle \min_{X_k} \|X - X_k\|_{F}^2 \ \ {\rm s.t.} \ \ {\rm rank}(X_k) \leq k,

is solved by

\displaystyle X_k = U \Sigma_{k}V^T,

where

\displaystyle X = U \Sigma V^T,

is the Singular Value Decomposition of {X} and {\Sigma_k} is the {m \times d} matrix that is zero except that its first {k} columns agree with those of {\Sigma}. {\spadesuit} {\smallskip}

There is a lot to unpack here:

  • How does this solve our original problem?
  • Why is this theorem true?

Let us address these in order. First note that {{\rm rank}(ZW) \leq {\rm rank}(Z) \leq k}, as {Z} has {k} columns. Thus, we were seeking the best rank {k} approximation to {X} above. Here we may take the {z_i} to be the rows of the {m \times k} matrix formed by the first {k} columns of {U \Sigma}, and {W} to be the first {k} rows of {V^T}, so that {ZW = U \Sigma_k V^T = X_k}. This nicely generalizes our original one-dimensional example, where now

\displaystyle x_i \approx \sum_{j=1}^k \sigma_j u_{ij} v_j,

where {\sigma_j} is the {j}th singular value of {X}, {u_{ij}} is the {(i,j)} entry of {U}, and {v_j} is the {j}th column of {V}.

{\smallskip}

Proof of Eckart-Young: By the SVD, it is enough to find a matrix {X_k} of rank at most {k} that minimizes

\displaystyle ||U \Sigma V^T - X_k||_{F}^2.

Since the Frobenius norm is invariant under orthogonal transformations, we may equivalently minimize

\displaystyle ||\Sigma - U^T X_k V||_{F}^2.

But {U^T X_k V} has rank at most {k}, since multiplying by the invertible matrices {U^T} and {V} does not change the rank of {X_k}. Since {\Sigma} is a diagonal matrix, it is straightforward (or see here) to see that the best rank {k} approximation to {\Sigma} is obtained by keeping the first {k} columns of {\Sigma} and setting the rest to zero, that is, by {\Sigma_k}. Thus,

\displaystyle \Sigma_k = U^T X_k V,

and the result follows. {\spadesuit}
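The theorem translates directly into a few lines of numpy: truncate the SVD to obtain the best rank-{k} approximation. The comparison against a random rank-{k} factorization below is purely an illustrative assumption:

```python
import numpy as np

def best_rank_k(X, k):
    """Best rank-k approximation of X in Frobenius norm, per Eckart-Young."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] @ np.diag(S[:k]) @ Vt[:k]

rng = np.random.default_rng(4)
X = rng.standard_normal((10, 6))
k = 2
Xk = best_rank_k(X, k)

# The squared approximation error is the tail of the squared singular values ...
S = np.linalg.svd(X, compute_uv=False)
print(np.isclose(np.linalg.norm(X - Xk, 'fro') ** 2, np.sum(S[k:] ** 2)))  # True

# ... and no other rank-k factorization ZW can do better.
Z = rng.standard_normal((10, k))
W = np.linalg.lstsq(Z, X, rcond=None)[0]   # best W for this fixed random Z
print(np.linalg.norm(X - Xk, 'fro') <= np.linalg.norm(X - Z @ W, 'fro'))  # True
```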

3. A Nice Aftereffect

Thanks to the singular value decomposition, we are equipped with an algorithm for reducing the dimension of non-training data. To see this, note that we may compute the latent variables via

\displaystyle Z = X V_k,

where {V_k} is the matrix of size {d \times k} whose columns are the first {k} columns of {V}. Thus, given a new data point {x}, we may compute its latent representation via

\displaystyle z = x^T V_k.

This is a nice feature of PCA that is not afforded by some other dimensionality reduction techniques, such as t-SNE. It is what makes the transform method available in sklearn’s PCA implementation, allowing PCA to be used in the feature engineering portion of a machine learning pipeline.
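For example, the latent map {z = x^T V_k} agrees with sklearn’s transform method, up to the arbitrary sign of each singular vector; a small sketch, assuming the data has already been centered:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
X = rng.standard_normal((30, 8))
X -= X.mean(axis=0)                  # PCA centers the data, so center it up front

k = 3
Vt = np.linalg.svd(X, full_matrices=False)[2]
Vk = Vt[:k].T                        # d x k: first k columns of V

Z_manual = X @ Vk                    # latent representation z_i = x_i^T V_k
Z_sklearn = PCA(n_components=k).fit(X).transform(X)

# The two agree up to a sign flip per component, since each singular
# vector is only determined up to sign.
print(np.allclose(np.abs(Z_manual), np.abs(Z_sklearn)))  # True
```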

4. An Alternate Perspective

The more classical way to approach PCA is to search for a unit vector {v \in \mathbb{R}^d} satisfying

\displaystyle v = \arg \max_{\|u\| = 1} \| Xu\|.

Thus, {v} is the vector that maximizes the variance of the data projected onto the line generated by {v}. However, the MLE approach above is more natural from a machine learning perspective.

That being said, I answered a question on math stackexchange a while back that provides a derivation of PCA from this perspective. I was curious whether the Lagrange multiplier approach, outlined in that post, could be used to derive the Eckart-Young theorem.

To see how to approach this, suppose we want to prove Eckhart-Young for {k=1}, i.e. we would like to find a rank 1 matrix {X_1}, that minimizes

\displaystyle ||X - X_1||_{F}^2.

As {X_1} is rank 1, it can be written as {uv^T}, where {u} and {v} are vectors. Thus, we are looking for {u} and {v} that minimize

\displaystyle ||X - uv^T||_{F}^2.

This is an optimization problem in {u} and {v}. Let us take the gradients with respect to {u} and {v} and set them equal to zero. By the definition of the Frobenius norm and the multivariate chain rule, we have

\displaystyle \nabla_u ||X - uv^T||_{F}^2 = -2 (X - uv^T)v = 0,

and

\displaystyle \nabla_v ||X - uv^T||_{F}^2 = -2 (X - uv^T)^T u = 0.

Thus,

\displaystyle Xv = u (v^Tv), \ \ \ X^T u = v (u^T u).

These are precisely the equations satisfied by a singular pair of {X}: normalizing {\hat u = u/\|u\|} and {\hat v = v/\|v\|}, they become {X \hat v = \sigma \hat u} and {X^T \hat u = \sigma \hat v} with {\sigma = \|u\| \|v\|}.
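These stationarity equations suggest an alternating scheme: fix {v} and solve for {u}, then fix {u} and solve for {v}. Here is a sketch of that iteration (the iteration count is an arbitrary choice), which converges to the top singular pair for generic starting points:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((12, 7))

# Alternate between the two stationarity conditions:
#   u = Xv / (v^T v)   and   v = X^T u / (u^T u).
v = rng.standard_normal(7)
for _ in range(200):
    u = X @ v / (v @ v)
    v = X.T @ u / (u @ u)

# The recovered pair yields the top singular value sigma_1 = ||Xv|| / ||v||.
sigma1 = np.linalg.norm(X @ v) / np.linalg.norm(v)
print(np.isclose(sigma1, np.linalg.svd(X, compute_uv=False)[0]))  # True
```

This is essentially the power method applied to {X^T X}, specialized to the rank-1 case.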

I’d like to see a cleaned-up version of this argument in the general case.