Tag Archives: PCA

Singular Value Decomposition and PCA

Principal Component Analysis (PCA) is a popular technique in machine learning for dimension reduction. It can be derived from Singular Value Decomposition (SVD) which we will discuss in this post. We will cover the math, an example in python, and finally some intuition.

The Math

SVD asserts that any m \times d matrix A can be written as

A = U\Sigma V^T,


  • \Sigma is m \times d and is rectangular diagonal, that is \Sigma_{ij} = 0 for i \neq j.
  • U and V are orthogonal, that is U^T U = I and V^T V = I.

This allows us to derive PCA. First, we are given a data matrix, X, that is a matrix where each of the m rows are data points with d features.

First, we translate each column of X so that each column has mean zero. That is we translate each column by \mu_1 , \ldots , \mu_d, respectively. For instance if

X = \left[\begin{matrix}1 & 2\\3 & 4\\5 & 6\end{matrix}\right] ,

Then we would translate the first column by -3 and the second column by -4. Then we divide each column by their standard deviations, \sigma_1 , \ldots , \sigma_d. Thus in our example we divide the first column and second column by \sqrt{8}. Let us call this new scaled matrix A to align with our notation above. Then we can perform PCA by performing SVD

A = U\Sigma V^T

and computing the $m \times d$ matrix


For PCA with k components, the data point in the j^{\rm th} row of X is transformed to the first k entries of the j^{\rm th} row of AV.

Put another way, given a data point x, we compute PCA in two steps:

z = (x_1-\mu_1)/\sigma_1 , \cdots , (x_d-\mu_d)/\sigma_d,

and then

(z \cdot v_1 , \cdots , z \cdot v_k)

where v_1 , \ldots , v_k are the first k columns of V.

Python Example

We will show how to code up PCA using only numpy and torchvision (only to retrieve some classical datasets). First we start with some imports.

import torchvision.datasets as datasets
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns 

Then we define a PCA function that does the work we did in the previous section. The data is inputted as a numpy array. We utilize the np.linalg.svd function for SVD and specify full_matrices=False so that the dimensions of U and V^T align with our computations above.

def pca(data,k=2):
    data = data.reshape((data.shape[0], -1))
    mu = np.mean(data, axis=0)
    std = np.std(data, axis=0).clip(min=1e-10).reshape(1,-1)
    A = (data - mu) / std
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    V = np.transpose(Vt)
    tmp = np.dot(A, V)
    return tmp[:,:k],V

We can now load in the MNIST, Fashion MNIST, and CIFAR datasets from torchvision datasets.

mnist = datasets.MNIST(root='./data', train=True, download=True, transform=None)
np_mnist = mnist_trainset.data.numpy()/255.0

fmnist_trainset = datasets.FashionMNIST(root='./data', train=True, download=True, transform=None)
fmnist_numpy = fmnist_trainset.data.numpy()/255.0

cifar= datasets.CIFAR100(root='./data', train=True, download=True, transform=None)
cifar_np = cifar.data /255.0

We can now perform PCA on these datasets.

pca_mnist,V_mnist = pca(np_mnist)
pca_fmnist , V_fmnist = pca(np_fmnist)
pca_cifar , V_cifar = pca(cifar_np)

We can see how they divide the datasets into their classes. With the following code.

fig, axs = plt.subplots(3, 4, figsize=(15, 6))

for i in range(3):
    for j in range(4):
        N = i*4+j 
        axs[i, j].imshow(.5*V_mnist[:,N].reshape((28,28))+.5, cmap="gray")
        N +=1 
        if N == 1:
            axs[i, j].set_title(f"{N}st Principal Component")
        elif N == 2:
            axs[i, j].set_title(f"{N}nd Principal Component")
        elif N == 3:
            axs[i, j].set_title(f"{N}rd Principal Component")
            axs[i, j].set_title(f"{N}th Principal Component")
        axs[i, j].axis('off')

We can also see what pixels are activated for each principal component. For instance we have the following for the MNIST data set

and the Fashion MNIST

You are encouraged to think about how I came about these images!


There are two steps to PCA:

  • Scaling the columns of the data matrix
  • Computing the SVD and transforming the data

Let’s see these steps with a text example to see what’s going on. We generate a bunch of points in 2 dimensional space that lie close to the line y =2x, where x is between 0 and 1.

import numpy as np 

X = np.zeros((1_000,2))
X[:,0] = np.random.uniform(0,1,1_000)
X[:,1] = 2*X[:,0] + np.random.normal(0,0.1,1_000)
fig = sns.scatterplot(x = X[:,0],y = X[:,1])

Now we perform the scaling step.

Z = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
fig = sns.scatterplot(x=Z[:, 0], y=Z[:, 1])

You can see first of all this shifted the data so that now the center point is (0,0), instead of the original (.5,1). Also, through scaling, the data now lies near the line y = x, rather than y = 2x.

The scaling step (in particular division by the standard deviation) makes it so we do not favor a feature simply because the coordinates are on average larger.

We can now compute V.

Vt = np.linalg.svd(Z, full_matrices=False)[-1]
V = np.transpose(Vt)

This gives

V=  \begin{bmatrix} -0.70710678 & -0.70710678 \\ -0.70710678 & 0.70710678 \end{bmatrix}

Notice the first column of V, v_1 is a scalar multiple times


and the second column, v_2 is a scalar multiple of


Here is what is going on. In our data set, after scaling, data points lie on the line y=x, up to a small error. PCA is a way of automating the process of finding this y=x line. Indeed, this is given by the vector v_1. When we compute v_1 \cdot x, for a data point x, we are getting information on the whereabouts of x lies on the line y = x. Indeed, for a point u = (a , a) we have

u \cdot (1,1)  = 2 a,

so this dot product is able to precisely compute where x is on the line y = x.

In higher dimensions, the same general principle holds. PCA allows us to find the line (or plane, or higher dimensional space) that best describes our data. Indeed let’s look at the previous example in 3-dimensions. We just add some small random noise to the additional coordinate

X = np.zeros((1_000,3))
X[:,0] = np.random.uniform(0,1,1_000)
X[:,1] = 2*X[:,0] + np.random.normal(0,0.1,1_000)
X[:,2] = .1*np.random.uniform(0,1,1_000)
Z = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
Vt = np.linalg.svd(Z, full_matrices=False)[-1]
V = np.transpose(Vt)

Here we find that the first column of V is, roughly, a scalar multiple times


Thus PCA still identifies the important relation between the first and second coordinate (after scaling) and disregards the other noise.

Limitations of PCA

One major limitation of PCA is that it’s only capable of capturing linear relations between the features. This could be handled by combining feature engineering with PCA, though this presents its own set of challenges. One way to overcome this is to invoke non-linear methods of dimension reduction, i.e. t-SNE or UMAP. In the following figure, I ran t-SNE on the MNIST data set (first using PCA to reduce the number of features to 25). We can see it performs much better than PCA at identifying digits.

Furthermore, PCA can lead to unpredictable outcomes that depend on the distribution of the data collected, as studied in this paper of Elhaik which I learned about from the Batch. This highlights that one should be careful making inferences from PCA alone.