
PCA like a Mathematician

Principal Component Analysis, or PCA, is a fundamental dimensionality reduction technique used in Machine Learning. The general goal is to transform a {d}-dimensional problem into a {k}-dimensional one, where {k} is smaller than {d}. This can be used, for instance, to plot high dimensional data on a 2D or 3D plot, or to reduce the number of features in an effort to eliminate random noise in a machine learning problem (made precise below). In this post, we will discuss PCA in mathematical terms. The material is adapted from a first course in Linear Algebra, the principle of maximum likelihood estimation, and the Eckart-Young theorem from 1936.


Example: Suppose we are presented with {m} samples {x_1 , \ldots , x_m}, where each is a 100-dimensional vector. After inspection, we realize that each data point is of the form {c_i z}, where {c_i} is a real number and {z} is a fixed 100-dimensional vector. Thus, to describe each {x_i}, we simply need to specify {c_i}, rather than the 100 coordinates of {x_i}. This effectively reduces the dimensionality of our data from 100 to 1. {\spadesuit}
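To make this concrete, here is a minimal numpy sketch (my addition, not from the original post) that generates data of exactly this form and recovers the one-dimensional description via the SVD; the names c, z, and X are made up for illustration.

import numpy as np

rng = np.random.default_rng(0)
c = rng.normal(size=50)             # one number per sample
z = rng.normal(size=100)            # a fixed 100-dimensional direction
X = np.outer(c, z)                  # 50 samples, each of the form c_i * z

U, S, Vt = np.linalg.svd(X, full_matrices=False)
print(np.linalg.matrix_rank(X))     # 1: the data is one-dimensional
print(np.allclose(X, S[0] * np.outer(U[:, 0], Vt[0])))  # True: one component suffices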

The goal of PCA is to provide a methodical way of accomplishing this in general. There are a few issues to overcome.

  • The data may not be of the form {c_i z}, but rather {c_i z_1 + d_i z_2}, or even higher dimensional.
  • The data might only be approximately of the form {c_i z}, i.e. of the form {c_i z + e_i }, where {e_i} is a small error term.
  • We may not know, a priori, the form of the data. How do we determine the best {k}-dimensional representation of the data?

Let us address these issues below.

1. PCA in Mathematical Terms

Suppose we are presented with a matrix {X} of size {m \times d}, where each row is a sample and each column is a feature. We denote the rows of {X} by {x_1 , \ldots , x_m}. We suppose further, without loss of generality after scaling, that each column of {X} has zero mean and unit variance.

Our goal is to find the best approximation of the form

\displaystyle x_i \approx z_i^T W ,

for all {i = 1 , \ldots , m}, where {W} is a matrix of size {k \times d} and {z_i} is a vector of size {k}. Thus, each {x_i} is generated from lower dimensional data, via the {z_i}.

We suppose, more precisely, that

\displaystyle x_i =z_i^T W + \epsilon_i,

where {\epsilon_i \sim N(0, I)} is an error term. That is, each coordinate of {\epsilon_i} is normally distributed with mean zero and variance 1. From here, we plan to derive PCA. One can wonder whether the assumption on {\epsilon_i} is reasonable, which can be addressed on a case-by-case basis. The assumption that {x_i} is generated by lower dimensional data is sometimes referred to as a latent representation, and the interested reader may consult this blog post for applications of this idea to generative models, such as image generation.
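As a quick illustration of the model (a sketch I am adding, not from the original post; the sizes m, d, k are arbitrary), one can generate synthetic data satisfying these assumptions as follows.

import numpy as np

rng = np.random.default_rng(0)
m, d, k = 500, 100, 3

Z = rng.normal(size=(m, k))    # latent vectors z_1, ..., z_m
W = rng.normal(size=(k, d))    # the k x d matrix W
E = rng.normal(size=(m, d))    # noise: each coordinate is N(0, 1)
X = Z @ W + E                  # rows are x_i = z_i^T W + eps_i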

From this assumption, how do we find {W} and the {z_i}? The answer comes from the principle of Maximum Likelihood Estimation (MLE). Utilizing the formula for the probability density function of a multivariate normal distribution, for fixed {W} and {z_1 , \ldots , z_m}, the likelihood of observing {x_1 , \ldots , x_m} is given by

\displaystyle P(x_1 , \ldots , x_m | W, z_1 , \ldots , z_m) = \prod_{i=1}^m \frac{1}{(2 \pi )^{d/2}} \exp \left( - \frac{1}{2 } \| x_i - z_i^TW \|^2 \right).

The norm in the above equation is the classic Euclidean norm. The principle of MLE states that we should choose {W} and the {z_i} to maximize this likelihood. It is typically easier to maximize a sum rather than a product. For instance, in Calculus, it is much easier to take the derivative of a sum than a product, as the product rule for derivatives is more complex than the sum rule. A classic mathematical trick that converts a product into a sum is to take logs. As the logarithm is a monotonically increasing function, maximizing the log of a function is equivalent to maximizing the function itself. Thus, we consider the log likelihood

\displaystyle \log P(x_1 , \ldots , x_m | W, z_1 , \ldots , z_m) = - \frac{md}{2} \log (2 \pi) - \frac{1}{2} \sum_{i=1}^m \| x_i - z_i^T W \|^2.

As our goal is to maximize the above, we can equivalently minimize the sum

\displaystyle \sum_{i=1}^m \| x_i - z_i^T W \|^2,

after removing the constant term, which does not depend on {W} or the {z_i}, and dropping the overall factor of {1/2}. We can rewrite the above as

\displaystyle  || X - ZW||_{F}^2,

where {X} is the matrix of size {m \times d} whose rows are {x_1 , \ldots , x_m}, {Z} is the matrix of size {m \times k} whose rows are {z_1 , \ldots , z_m}, and {W} is, as above, the matrix of size {k \times d}.
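To spell out the last step (a small addition for clarity): the {i}th row of {X - ZW} is exactly {x_i - z_i^T W}, and the squared Frobenius norm of a matrix is the sum of the squared Euclidean norms of its rows, so

\displaystyle \sum_{i=1}^m \| x_i - z_i^T W \|^2 = || X - ZW||_{F}^2.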

2. The Eckart-Young Theorem

The solution to this problem comes from the Eckart-Young theorem. Unfortunately, the original paper is behind a paywall, but we will provide the ideas here. Recall the rank of a matrix is the dimension of its column space.

Let us also recall that the Singular Value Decomposition (SVD) of a matrix {X} of size {m \times d} is given by

\displaystyle X = U \Sigma V^T,

where {U} is an {m \times m} orthogonal matrix, {\Sigma} is an {m \times d} diagonal matrix, and {V} is a {d \times d} orthogonal matrix. The diagonal entries of {\Sigma} are called the singular values of {X}. It turns out that the columns of {U} are the eigenvectors of {X X^T} and the columns of {V} are the eigenvectors of {X^T X}.
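As a quick sanity check of these facts (a sketch I am adding; the matrix below is random and purely illustrative), one can compare the squared singular values of {X} with the eigenvalues of {X^T X}:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))

U, S, Vt = np.linalg.svd(X)              # full SVD: U is 6 x 6, Vt is 4 x 4
evals = np.linalg.eigvalsh(X.T @ X)      # eigenvalues of X^T X, in ascending order

# The squared singular values coincide with the eigenvalues of X^T X.
print(np.allclose(np.sort(S**2), evals))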


Theorem (Eckart-Young): Let {X} be any matrix of size {m \times d} and let {k \leq d}. Then the minimization problem

\displaystyle \min_{X_k} ||X - X_k||_{F}^2 \ \ {\rm s.t.} \ \ {\rm rank}(X_k) \leq k,

is solved by

\displaystyle X_k = U \Sigma_{k}V^T,

where

\displaystyle X = U \Sigma V^T,

is the Singular Value Decomposition of {X} and {\Sigma_k} is the {m\times d} matrix that agrees with {\Sigma} in its first {k} columns and is zero elsewhere; that is, it keeps the {k} largest singular values and sets the rest to zero. {\spadesuit}

There is a lot to unpack here:

  • How does this solve our original problem?
  • Why is this theorem true?

Let us address these in order. First note that {{\rm rank}(ZW) \leq {\rm rank}(Z) \leq k}, as {Z} has {k} columns. Thus, above we were seeking an approximation to {X} of rank at most {k}. Here we may take {W = V^T} and the {z_i} to be the rows of {U\Sigma_k}; since {\Sigma_k} zeroes out all but the first {k} columns, each {z_i} has at most {k} nonzero coordinates (discarding the zero coordinates and the corresponding rows of {V^T} recovers the {k \times d} form of {W}). This nicely generalizes our original one-dimensional example, as now

\displaystyle x_i \approx \sum_{j=1}^k \sigma_j U_{ij} \, v_j,

where {\sigma_j} is the {j}th singular value of {X} and {v_j} is the {j}th column of {V}.


Proof of Eckart-Young: By the SVD, it is enough to find {X_k}, of rank at most {k}, that minimizes

\displaystyle ||U \Sigma V^T - X_k||_{F}^2.

Since the Frobenius norm is invariant under orthogonal transformations, we may equivalently minimize

\displaystyle ||\Sigma - U^T X_k V||_{F}^2.

But {U^T X_k V} also has rank at most {k}, since multiplying by the invertible matrices {U^T} and {V} does not change the rank. Since {\Sigma} is a diagonal matrix with nonnegative entries in decreasing order, it is straightforward (or see here) to see that the best approximation to {\Sigma} of rank at most {k} is obtained by keeping its {k} largest diagonal entries and setting the rest to zero, that is, by {\Sigma_k}. Thus,

\displaystyle \Sigma_k = U^T X_k V,

and the result follows. {\spadesuit}
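As a numerical sanity check (my addition, not part of the original argument), the following numpy sketch compares the truncated SVD with another low rank approximation of a random matrix; the sizes and the random competitor are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
m, d, k = 30, 20, 5
X = rng.normal(size=(m, d))

# Best rank-k approximation from the truncated SVD.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
X_k = (U[:, :k] * S[:k]) @ Vt[:k]

# A competing approximation of rank at most k: project the rows of X
# onto a random k-dimensional subspace.
Q, _ = np.linalg.qr(rng.normal(size=(d, k)))   # orthonormal basis of a random subspace
X_rand = X @ Q @ Q.T

print(np.linalg.norm(X - X_k, 'fro') <= np.linalg.norm(X - X_rand, 'fro'))  # True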

3. A Nice After Effect

Thanks to the singular value decomposition, we are equipped with an algorithm for reducing the dimension of new, non-training data. To see this, note that the latent variables may be computed via

\displaystyle Z = X V_k,

where {V_k} is the matrix of size {d \times k} whose columns are the first {k} columns of {V}. Thus, given a new data point {x}, scaled in the same way as the training data, we may compute its latent representation via

\displaystyle z = x^T V_k.

This is a nice feature of PCA which is not afforded by some other dimensionality reduction techniques, such as t-SNE. It is what makes the transform method available in sklearn’s PCA implementation, and it allows PCA to be used in the feature engineering portion of a machine learning pipeline.
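For instance, here is a minimal sketch of this workflow with scikit-learn (my example; the synthetic data and the choice of 2 components are arbitrary):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 10))
X_new = rng.normal(size=(5, 10))

# Scaling followed by PCA, as described above.
pipe = make_pipeline(StandardScaler(), PCA(n_components=2))
pipe.fit(X_train)                 # learns the scaling and the matrix V_k
Z_new = pipe.transform(X_new)     # latent representation of unseen data
print(Z_new.shape)                # (5, 2)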

4. An Alternate Perspective

The more classical way to approach PCA is to search for a unit vector {v \in \mathbb{R}^d} satisfying

\displaystyle v = \arg \max_{\|u\| = 1} || Xu||.

Thus, {v} is the direction that maximizes the variance of the data projected onto the line generated by {v} (recall the columns of {X} have mean zero). However, the MLE approach above is more natural from a machine learning perspective.
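To connect this with the SVD (a short step I am adding for completeness): since {|| Xu||^2 = u^T X^T X u}, we have

\displaystyle \max_{\|u\| = 1} || Xu||^2 = \max_{\|u\| = 1} u^T X^T X u = \sigma_1^2,

and the maximum is attained at a top unit eigenvector of {X^T X}, i.e. the first column of {V} in the SVD {X = U \Sigma V^T}.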

That being said, I answered a question on math stackexchange a while back that provides a derivation of PCA from this perspective. I was curious whether the Lagrange multiplier approach outlined in that post could be used to derive the Eckart-Young theorem.

To see how to approach this, suppose we want to prove Eckart-Young for {k=1}, i.e. we would like to find a rank 1 matrix {X_1} that minimizes

\displaystyle ||X - X_1||_{F}^2.

As {X_1} is rank 1, it can be written as {uv^T}, where {u} and {v} are vectors. Thus, we are looking for {u} and {v} that minimize

\displaystyle ||X - uv^T||_{F}^2.

This is an optimization problem in {u} and {v}. Let us take a gradient with respect to {u} and {v} and set it equal to zero. By the definition of the Frobenius norm, and the multivariate chain rule, we have

\displaystyle \nabla_u ||X - uv^T||_{F}^2 = -2 (X - uv^T)v = 0,

and

\displaystyle \nabla_v ||X - uv^T||_{F}^2 = -2 (X - uv^T)^T u = 0.

Thus,

\displaystyle Xv = (v^Tv) \, u, \ \ \ X^T u = (u^T u) \, v.

Up to normalization, these are precisely the singular vector equations for {X}.
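To make "up to normalization" explicit (a short computation I am adding): writing {\hat{u} = u / \|u\|}, {\hat{v} = v / \|v\|} and {\sigma = \|u\| \, \|v\|}, the two equations above become

\displaystyle X \hat{v} = \sigma \hat{u}, \ \ \ X^T \hat{u} = \sigma \hat{v},

which is exactly the relation satisfied by a singular value {\sigma} of {X} and its corresponding singular vectors {\hat{u}, \hat{v}}.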

I’d like to see a cleaned up version of this argument, in the general case.

Singular Value Decomposition and PCA

Principal Component Analysis (PCA) is a popular technique in machine learning for dimension reduction. It can be derived from the Singular Value Decomposition (SVD), which we will discuss in this post. We will cover the math, an example in python, and finally some intuition.

The Math

SVD asserts that any m \times d matrix A can be written as

A = U\Sigma V^T,

where

  • \Sigma is m \times d and is rectangular diagonal, that is \Sigma_{ij} = 0 for i \neq j.
  • U and V are orthogonal, that is U^T U = I and V^T V = I.

This allows us to derive PCA. We are given a data matrix, X, where each of the m rows is a data point with d features.

First, we center each column of X so that it has mean zero; that is, we subtract the column means \mu_1 , \ldots , \mu_d from the respective columns. For instance, if

X = \left[\begin{matrix}1 & 2\\3 & 4\\5 & 6\end{matrix}\right] ,

Then we would translate the first column by -3 and the second column by -4. Next we divide each column by its standard deviation, \sigma_1 , \ldots , \sigma_d. Thus in our example we divide both the first and the second column by \sqrt{8/3} (the population standard deviation of each column, as computed by np.std). Let us call this new scaled matrix A, to align with our notation above. Then we can perform PCA by computing the SVD

A = U\Sigma V^T

and computing the m \times d matrix

AV.

For PCA with k components, the data point in the j^{\rm th} row of X is transformed to the first k entries of the j^{\rm th} row of AV.

Put another way, given a data point x, we compute PCA in two steps:

z = \left( (x_1-\mu_1)/\sigma_1 , \ldots , (x_d-\mu_d)/\sigma_d \right),

and then

(z \cdot v_1 , \cdots , z \cdot v_k)

where v_1 , \ldots , v_k are the first k columns of V.

Python Example

We will show how to code up PCA using only numpy, plus torchvision (only to retrieve some classical datasets) and matplotlib/seaborn (for plotting). First we start with some imports.

import torchvision.datasets as datasets
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns 

Then we define a PCA function that does the work we did in the previous section. The data is inputted as a numpy array. We utilize the np.linalg.svd function for SVD and specify full_matrices=False so that the dimensions of U and V^T align with our computations above.

def pca(data, k=2):
    # Flatten each sample into a row vector.
    data = data.reshape((data.shape[0], -1))
    # Center and scale each column (clip the std to avoid division by zero).
    mu = np.mean(data, axis=0)
    std = np.std(data, axis=0).clip(min=1e-10).reshape(1, -1)
    A = (data - mu) / std
    # SVD of the scaled data; the rows of Vt are the principal directions.
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    V = np.transpose(Vt)
    # Project onto the principal directions and keep the first k coordinates.
    tmp = np.dot(A, V)
    return tmp[:, :k], V

We can now load the MNIST, Fashion MNIST, and CIFAR-100 datasets from torchvision.

mnist_trainset = datasets.MNIST(root='./data', train=True, download=True, transform=None)
np_mnist = mnist_trainset.data.numpy()/255.0

fmnist_trainset = datasets.FashionMNIST(root='./data', train=True, download=True, transform=None)
np_fmnist = fmnist_trainset.data.numpy()/255.0

cifar = datasets.CIFAR100(root='./data', train=True, download=True, transform=None)
cifar_np = cifar.data/255.0

We can now perform PCA on these datasets.

pca_mnist,V_mnist = pca(np_mnist)
pca_fmnist , V_fmnist = pca(np_fmnist)
pca_cifar , V_cifar = pca(cifar_np)

We can see how they divide the datasets into their classes. The following code visualizes the first twelve principal components of MNIST as images.

fig, axs = plt.subplots(3, 4, figsize=(15, 6))

for i in range(3):
    for j in range(4):
        N = i*4 + j + 1  # 1-based index of the principal component
        # Shift and scale the component entries for display as a grayscale image.
        axs[i, j].imshow(.5*V_mnist[:, N-1].reshape((28, 28)) + .5, cmap="gray")
        if N == 1:
            axs[i, j].set_title(f"{N}st Principal Component")
        elif N == 2:
            axs[i, j].set_title(f"{N}nd Principal Component")
        elif N == 3:
            axs[i, j].set_title(f"{N}rd Principal Component")
        else:
            axs[i, j].set_title(f"{N}th Principal Component")
        axs[i, j].axis('off')

We can also see what pixels are activated for each principal component. For instance, we have the following for the MNIST data set

and the Fashion MNIST

You are encouraged to think about how I came about these images!

Intuition

There are two steps to PCA:

  • Scaling the columns of the data matrix
  • Computing the SVD and transforming the data

Let’s walk through these steps with a small example to see what’s going on. We generate a bunch of points in 2-dimensional space that lie close to the line y = 2x, where x is between 0 and 1.

import numpy as np 

X = np.zeros((1_000,2))
X[:,0] = np.random.uniform(0,1,1_000)
X[:,1] = 2*X[:,0] + np.random.normal(0,0.1,1_000)
fig = sns.scatterplot(x = X[:,0],y = X[:,1])

Now we perform the scaling step.

Z = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
fig = sns.scatterplot(x=Z[:, 0], y=Z[:, 1])

You can see, first of all, that this shifted the data so that the center point is now (0,0), instead of the original (0.5, 1). Also, through scaling, the data now lies near the line y = x, rather than y = 2x.

The scaling step (in particular, the division by the standard deviation) makes it so we do not favor a feature simply because its coordinates are on average larger.

We can now compute V.

Vt = np.linalg.svd(Z, full_matrices=False)[-1]
V = np.transpose(Vt)

This gives (up to sign and small numerical error)

V=  \begin{bmatrix} -0.70710678 & -0.70710678 \\ -0.70710678 & 0.70710678 \end{bmatrix}

Notice the first column of V, v_1, is a scalar multiple of

(1,1),

and the second column, v_2, is a scalar multiple of

(-1,1).

Here is what is going on. In our data set, after scaling, the data points lie on the line y=x, up to a small error. PCA is a way of automating the process of finding this y=x line. Indeed, the line is given by the vector v_1. When we compute v_1 \cdot x for a data point x, we learn where x lies along the line y = x. Indeed, for a point u = (a , a) we have

u \cdot (1,1)  = 2 a,

so this dot product precisely locates x on the line y = x.
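A quick way to see this numerically (a check I am adding, reusing Z and V from the snippets above): reconstructing the scaled data from its first principal component alone already recovers it up to a small error.

# Project onto the first principal direction and map back to 2D.
Z1 = (Z @ V[:, :1]) @ V[:, :1].T

# Relative reconstruction error; small, since the data is nearly one-dimensional.
print(np.linalg.norm(Z - Z1) / np.linalg.norm(Z))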

In higher dimensions, the same general principle holds. PCA allows us to find the line (or plane, or higher-dimensional subspace) that best describes our data. Indeed, let’s look at the previous example in 3 dimensions. We just add some small random noise in the additional coordinate:

X = np.zeros((1_000,3))
X[:,0] = np.random.uniform(0,1,1_000)
X[:,1] = 2*X[:,0] + np.random.normal(0,0.1,1_000)
X[:,2] = .1*np.random.uniform(0,1,1_000)
Z = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
Vt = np.linalg.svd(Z, full_matrices=False)[-1]
V = np.transpose(Vt)
print(V)

Here we find that the first column of V is, roughly, a scalar multiple of

(1,1,0).

Thus PCA still identifies the important relation between the first and second coordinates (after scaling) and disregards the remaining noise.
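One more check I find helpful (my addition, reusing Z from the snippet above): the singular values tell us how much variance each component explains, and the third component contributes essentially nothing.

S = np.linalg.svd(Z, full_matrices=False)[1]

# Fraction of total variance explained by each principal component.
explained = S**2 / np.sum(S**2)
print(explained)   # the last entry is close to zero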

Limitations of PCA

One major limitation of PCA is that it is only capable of capturing linear relations between the features. This could be handled by combining feature engineering with PCA, though that presents its own set of challenges. Another way to overcome it is to invoke non-linear methods of dimension reduction, e.g. t-SNE or UMAP. In the following figure, I ran t-SNE on the MNIST data set (first using PCA to reduce the number of features to 25). We can see it performs much better than PCA at separating the digits.

Furthermore, PCA can lead to unpredictable outcomes that depend on the distribution of the data collected, as studied in this paper of Elhaik, which I learned about from The Batch. This highlights that one should be careful when making inferences from PCA alone.