Principal Component Analysis (PCA) is a popular technique in machine learning for dimension reduction. It can be derived from the Singular Value Decomposition (SVD), which we will discuss in this post. We will cover the math, an example in Python, and finally some intuition.
The Math
SVD asserts that any $m \times d$ matrix $A$ can be written as

$$A = U \Sigma V^T,$$

where $\Sigma$ is $m \times d$ and is rectangular diagonal, that is $\Sigma_{ij} = 0$ for $i \neq j$. $U$ and $V$ are orthogonal, that is $U^T U = I$ and $V^T V = I$.
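If you want to convince yourself of these facts numerically, here is a minimal numpy sketch (a quick check of my own, on a small random matrix, not part of the derivation):

import numpy as np

# Verify the SVD properties on a random 5 x 3 matrix
A = np.random.randn(5, 3)
U, S, Vt = np.linalg.svd(A)          # full SVD: U is 5x5, Vt is 3x3
Sigma = np.zeros((5, 3))
Sigma[:3, :3] = np.diag(S)           # rectangular diagonal Sigma

print(np.allclose(A, U @ Sigma @ Vt))      # A = U Sigma V^T
print(np.allclose(U.T @ U, np.eye(5)))     # U is orthogonal
print(np.allclose(Vt @ Vt.T, np.eye(3)))   # V is orthogonal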
This allows us to derive PCA. First, we are given a data matrix $X$, that is, an $m \times d$ matrix where each of the $m$ rows is a data point with $d$ features.
First, we translate each column of $X$ so that each column has mean zero. That is, we translate the columns by $-\mu_1, \dots, -\mu_d$, respectively, where $\mu_j$ is the mean of the $j$-th column. For instance, if the columns of $X$ have means $3$ and $4$, then we would translate the first column by $-3$ and the second column by $-4$. Then we divide each column by its standard deviation, $\sigma_1, \dots, \sigma_d$; in our example we would divide the first and second columns by $\sigma_1$ and $\sigma_2$, respectively. Let us call this new scaled matrix $A$ to align with our notation above. Then we can perform PCA by performing the SVD

$$A = U \Sigma V^T$$

and computing the $m \times d$ matrix

$$AV = U\Sigma.$$

For PCA with $k$ components, the data point in the $i$-th row of $X$ is transformed to the first $k$ entries of the $i$-th row of $AV$.
Put another way, given a data point $x = (x_1, \dots, x_d)$, we compute PCA in two steps: first

$$z = \left( \frac{x_1 - \mu_1}{\sigma_1}, \dots, \frac{x_d - \mu_d}{\sigma_d} \right)$$

and then

$$z V_k,$$

where $V_k$ is the $d \times k$ matrix formed by the first $k$ columns of $V$.
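To make sure the two descriptions agree, here is a small numpy sketch (another check of my own, on randomly generated data) comparing the per-point recipe with the first $k$ columns of $AV$:

import numpy as np

m, d, k = 50, 4, 2
X = np.random.randn(m, d)

# Scale the data as described above
mu = X.mean(axis=0)
sigma = X.std(axis=0)
A = (X - mu) / sigma

U, S, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T
V_k = V[:, :k]                              # first k columns of V

matrix_form = (A @ V)[:, :k]                # first k entries of each row of AV
pointwise = np.array([((x - mu) / sigma) @ V_k for x in X])

print(np.allclose(matrix_form, pointwise))  # True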
Python Example
We will show how to code up PCA using only numpy and torchvision (the latter only to retrieve some classical datasets). First, we start with some imports.
import torchvision.datasets as datasets
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Then we define a PCA function that does the work we did in the previous section. The data is passed in as a numpy array. We utilize the np.linalg.svd function for the SVD and specify full_matrices=False so that the dimensions of $U$ and $\Sigma$ align with our computations above.
def pca(data, k=2):
    # Flatten each data point into a single feature vector
    data = data.reshape((data.shape[0], -1))
    # Scale: subtract the column means and divide by the column standard deviations
    # (the clip avoids division by zero for constant features, e.g. always-black pixels)
    mu = np.mean(data, axis=0)
    std = np.std(data, axis=0).clip(min=1e-10).reshape(1, -1)
    A = (data - mu) / std
    # Thin SVD of the scaled data matrix
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    V = np.transpose(Vt)
    # Transform the data and keep the first k components
    tmp = np.dot(A, V)
    return tmp[:, :k], V
We can now load the MNIST, Fashion MNIST, and CIFAR-100 datasets from torchvision.datasets.
mnist = datasets.MNIST(root='./data', train=True, download=True, transform=None)
np_mnist = mnist.data.numpy() / 255.0
fmnist = datasets.FashionMNIST(root='./data', train=True, download=True, transform=None)
np_fmnist = fmnist.data.numpy() / 255.0
cifar = datasets.CIFAR100(root='./data', train=True, download=True, transform=None)
cifar_np = cifar.data / 255.0
We can now perform PCA on these datasets.
pca_mnist, V_mnist = pca(np_mnist)
pca_fmnist, V_fmnist = pca(np_fmnist)
pca_cifar, V_cifar = pca(cifar_np)
We can see how they divide the datasets into their classes. The following code also visualizes the first twelve principal components of MNIST as images.
fig, axs = plt.subplots(3, 4, figsize=(15, 6))
for i in range(3):
    for j in range(4):
        N = i*4 + j
        # Display the (N+1)-th principal component as a 28x28 image
        axs[i, j].imshow(.5*V_mnist[:, N].reshape((28, 28)) + .5, cmap="gray")
        N += 1
        if N == 1:
            axs[i, j].set_title(f"{N}st Principal Component")
        elif N == 2:
            axs[i, j].set_title(f"{N}nd Principal Component")
        elif N == 3:
            axs[i, j].set_title(f"{N}rd Principal Component")
        else:
            axs[i, j].set_title(f"{N}th Principal Component")
        axs[i, j].axis('off')
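For the class-separation scatter plot itself, a minimal sketch along the following lines could be used, assuming the MNIST labels are available as mnist.targets (as they are in torchvision):

# Hedged sketch: scatter the 2-D PCA projection of MNIST, colored by digit label
labels = mnist.targets.numpy().astype(str)
fig = sns.scatterplot(x=pca_mnist[:, 0], y=pca_mnist[:, 1],
                      hue=labels, palette="tab10", s=5, linewidth=0)
fig.set(xlabel="First principal component", ylabel="Second principal component")
plt.show()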



We can also see what pixels are activated for each principal component. For instance we have the following for the MNIST data set

and the Fashion MNIST

You are encouraged to think about how I came about these images!
Intuition
There are two steps to PCA:
- Scaling the columns of the data matrix
- Computing the SVD and transforming the data
Let’s see these steps with a toy example to see what’s going on. We generate a bunch of points in 2-dimensional space that lie close to the line $y = 2x$, where $x$ is between 0 and 1.
import numpy as np

X = np.zeros((1_000, 2))
X[:, 0] = np.random.uniform(0, 1, 1_000)
X[:, 1] = 2*X[:, 0] + np.random.normal(0, 0.1, 1_000)
fig = sns.scatterplot(x=X[:, 0], y=X[:, 1])

Now we perform the scaling step.
Z = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
fig = sns.scatterplot(x=Z[:, 0], y=Z[:, 1])

You can see, first of all, that this shifted the data so that the center point is now $(0, 0)$, instead of the original $(0.5, 1)$. Also, through scaling, the data now lies near the line $y = x$, rather than $y = 2x$.
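A quick numerical check of these two claims (my own addition, run on the same Z as above):

# The scaled data is centered at (0, 0) and has slope roughly 1
print(np.round(Z.mean(axis=0), 3))                 # approximately [0. 0.]
slope, intercept = np.polyfit(Z[:, 0], Z[:, 1], 1)
print(round(slope, 2), round(intercept, 2))        # approximately 1.0 and 0.0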
The scaling step (in particular division by the standard deviation) makes it so we do not favor a feature simply because the coordinates are on average larger.
We can now compute $V$.
Vt = np.linalg.svd(Z, full_matrices=False)[-1]
V = np.transpose(Vt)
This gives, approximately (and up to a choice of signs),

$$V \approx \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} \approx \begin{pmatrix} 0.71 & 0.71 \\ 0.71 & -0.71 \end{pmatrix}.$$

Notice the first column of $V$ is a scalar multiple of $\begin{pmatrix} 1 \\ 1 \end{pmatrix}$ and the second column is a scalar multiple of $\begin{pmatrix} 1 \\ -1 \end{pmatrix}$.
Here is what is going on. In our data set, after scaling, the data points lie on the line $y = x$, up to a small error. PCA is a way of automating the process of finding this line. Indeed, this line is given by the vector $(1, 1)$. When we compute $z \cdot (1, 1)$ for a data point $z$, we are getting information on whereabouts $z$ lies on the line $y = x$. Indeed, for a point $z = (t, t)$ we have

$$z \cdot (1, 1) = 2t,$$

so this dot product is able to precisely compute where $z$ is on the line $y = x$.
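We can check this numerically on the toy data (a small sketch of my own): the first PCA coordinate of each point should be, up to sign, proportional to $z_1 + z_2$, i.e. to how far the point sits along the line.

first_pc = Z @ V[:, 0]           # project each scaled point onto the first column of V
along_line = Z[:, 0] + Z[:, 1]   # position along the line y = x (up to a constant factor)
print(np.round(np.corrcoef(first_pc, along_line)[0, 1], 4))   # close to +1 or -1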
In higher dimensions, the same general principle holds. PCA allows us to find the line (or plane, or higher-dimensional space) that best describes our data. Indeed, let’s look at the previous example in 3 dimensions. We just add some small random noise in the additional third coordinate:
X = np.zeros((1_000, 3))
X[:, 0] = np.random.uniform(0, 1, 1_000)
X[:, 1] = 2*X[:, 0] + np.random.normal(0, 0.1, 1_000)
X[:, 2] = .1*np.random.uniform(0, 1, 1_000)
Z = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
Vt = np.linalg.svd(Z, full_matrices=False)[-1]
V = np.transpose(Vt)
print(V)
Here we find that the first column of $V$ is, roughly, a scalar multiple of $(1, 1, 0)$. Thus PCA still identifies the important relation between the first and second coordinates (after scaling) and disregards the noise in the third.
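As a quick check of this claim (again, a sketch of my own), we can fix the sign ambiguity of the SVD and print the leading column:

v1 = V[:, 0] * np.sign(V[0, 0])   # normalize the sign of the first column
print(np.round(v1, 2))            # roughly [0.71 0.71 0.  ], i.e. a multiple of (1, 1, 0)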
Limitations of PCA
One major limitation of PCA is that it is only capable of capturing linear relations between the features. This could be handled by combining feature engineering with PCA, though that presents its own set of challenges. Another way to overcome this is to use non-linear methods of dimension reduction, e.g. t-SNE or UMAP. In the following figure, I ran t-SNE on the MNIST data set (first using PCA to reduce the number of features to 25). We can see it performs much better than PCA at separating the digits.
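The post does not say which t-SNE implementation was used; as a hedged sketch, scikit-learn's sklearn.manifold.TSNE run on the 25-dimensional PCA projection is one way such a figure could be produced (the use of scikit-learn here is my assumption):

from sklearn.manifold import TSNE   # assumption: scikit-learn is installed

# Reduce MNIST to 25 dimensions with our pca(), then embed into 2-D with t-SNE
# (this can take a while on the full 60,000 images)
pca_mnist_25, _ = pca(np_mnist, k=25)
embedding = TSNE(n_components=2, init="pca", random_state=0).fit_transform(pca_mnist_25)

labels = mnist.targets.numpy().astype(str)
fig = sns.scatterplot(x=embedding[:, 0], y=embedding[:, 1],
                      hue=labels, palette="tab10", s=5, linewidth=0)
plt.show()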

Furthermore, PCA can lead to unpredictable outcomes that depend on the distribution of the data collected, as studied in this paper of Elhaik, which I learned about from The Batch. This highlights that one should be careful when making inferences from PCA alone.