Principal Component Analysis (PCA) is a popular technique in machine learning for dimension reduction. It can be derived from Singular Value Decomposition (SVD) which we will discuss in this post. We will cover the math, an example in python, and finally some intuition.

## The Math

SVD asserts that *any* matrix can be written as

where

- is and is rectangular diagonal, that is for .
- and are orthogonal, that is and .

This allows us to derive PCA. First, we are given a data matrix, , that is a matrix where each of the rows are data points with features.

First, we translate each column of so that each column has mean zero. That is we translate each column by , respectively. For instance if

Then we would translate the first column by -3 and the second column by -4. Then we divide each column by their standard deviations, . Thus in our example we divide the first column and second column by . Let us call this new scaled matrix to align with our notation above. Then we can perform PCA by performing SVD

and computing the $m \times d$ matrix

For PCA with components, the data point in the row of is transformed to the first entries of the row of .

Put another way, given a data point , we compute PCA in two steps:

and then

where are the first columns of .

## Python Example

We will show how to code up PCA using only numpy and torchvision (only to retrieve some classical datasets). First we start with some imports.

import torchvision.datasets as datasets import numpy as np import matplotlib.pyplot as plt import seaborn as sns

Then we define a PCA function that does the work we did in the previous section. The data is inputted as a numpy array. We utilize the np.linalg.svd function for SVD and specify full_matrices=False so that the dimensions of and align with our computations above.

def pca(data,k=2): data = data.reshape((data.shape[0], -1)) mu = np.mean(data, axis=0) std = np.std(data, axis=0).clip(min=1e-10).reshape(1,-1) A = (data - mu) / std U, S, Vt = np.linalg.svd(A, full_matrices=False) V = np.transpose(Vt) tmp = np.dot(A, V) return tmp[:,:k],V

We can now load in the MNIST, Fashion MNIST, and CIFAR datasets from torchvision datasets.

mnist = datasets.MNIST(root='./data', train=True, download=True, transform=None) np_mnist = mnist_trainset.data.numpy()/255.0 fmnist_trainset = datasets.FashionMNIST(root='./data', train=True, download=True, transform=None) fmnist_numpy = fmnist_trainset.data.numpy()/255.0 cifar= datasets.CIFAR100(root='./data', train=True, download=True, transform=None) cifar_np = cifar.data /255.0

We can now perform PCA on these datasets.

pca_mnist,V_mnist = pca(np_mnist) pca_fmnist , V_fmnist = pca(np_fmnist) pca_cifar , V_cifar = pca(cifar_np)

We can see how they divide the datasets into their classes. With the following code.

fig, axs = plt.subplots(3, 4, figsize=(15, 6)) for i in range(3): for j in range(4): N = i*4+j axs[i, j].imshow(.5*V_mnist[:,N].reshape((28,28))+.5, cmap="gray") N +=1 if N == 1: axs[i, j].set_title(f"{N}st Principal Component") elif N == 2: axs[i, j].set_title(f"{N}nd Principal Component") elif N == 3: axs[i, j].set_title(f"{N}rd Principal Component") else: axs[i, j].set_title(f"{N}th Principal Component") axs[i, j].axis('off')

We can also see what pixels are activated for each principal component. For instance we have the following for the MNIST data set

and the Fashion MNIST

You are encouraged to think about how I came about these images!

## Intuition

There are two steps to PCA:

- Scaling the columns of the data matrix
- Computing the SVD and transforming the data

Let’s see these steps with a text example to see what’s going on. We generate a bunch of points in 2 dimensional space that lie close to the line , where is between 0 and 1.

import numpy as np X = np.zeros((1_000,2)) X[:,0] = np.random.uniform(0,1,1_000) X[:,1] = 2*X[:,0] + np.random.normal(0,0.1,1_000) fig = sns.scatterplot(x = X[:,0],y = X[:,1])

Now we perform the scaling step.

Z = (X - np.mean(X, axis=0)) / np.std(X, axis=0) fig = sns.scatterplot(x=Z[:, 0], y=Z[:, 1])

You can see first of all this shifted the data so that now the center point is , instead of the original . Also, through scaling, the data now lies near the line , rather than .

The scaling step (in particular division by the standard deviation) makes it so we do not favor a feature simply because the coordinates are on average larger.

We can now compute .

Vt = np.linalg.svd(Z, full_matrices=False)[-1] V = np.transpose(Vt)

This gives

Notice the first column of , is a scalar multiple times

and the second column, is a scalar multiple of

.

Here is what is going on. In our data set, after scaling, data points lie on the line , up to a small error. PCA is a way of automating the process of finding this line. Indeed, this is given by the vector . When we compute , for a data point , we are getting information on the whereabouts of lies on the line . Indeed, for a point we have

so this dot product is able to precisely compute where is on the line .

In higher dimensions, the same general principle holds. PCA allows us to find the line (or plane, or higher dimensional space) that best describes our data. Indeed let’s look at the previous example in 3-dimensions. We just add some small random noise to the additional coordinate

X = np.zeros((1_000,3)) X[:,0] = np.random.uniform(0,1,1_000) X[:,1] = 2*X[:,0] + np.random.normal(0,0.1,1_000) X[:,2] = .1*np.random.uniform(0,1,1_000) Z = (X - np.mean(X, axis=0)) / np.std(X, axis=0) Vt = np.linalg.svd(Z, full_matrices=False)[-1] V = np.transpose(Vt) print(V)

Here we find that the first column of is, roughly, a scalar multiple times

Thus PCA still identifies the important relation between the first and second coordinate (after scaling) and disregards the other noise.

## Limitations of PCA

One major limitation of PCA is that it’s only capable of capturing linear relations between the features. This could be handled by combining feature engineering with PCA, though this presents its own set of challenges. One way to overcome this is to invoke non-linear methods of dimension reduction, i.e. t-SNE or UMAP. In the following figure, I ran t-SNE on the MNIST data set (first using PCA to reduce the number of features to 25). We can see it performs much better than PCA at identifying digits.

Furthermore, PCA can lead to unpredictable outcomes that depend on the distribution of the data collected, as studied in this paper of Elhaik which I learned about from the Batch. This highlights that one should be careful making inferences from PCA alone.