Example: Suppose we are presented with samples $x_1, \dots, x_n$, where each $x_i$ is a 100-dimensional vector. After inspection, we realize that each data point is of the form $x_i = z_i v$, where $z_i$ is a real number and $v$ is a fixed 100-dimensional vector. Thus, to describe each $x_i$, we simply need to specify the single number $z_i$, rather than the 100 coordinates of $x_i$. This effectively reduces the dimensionality of our data from 100 to 1.
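To make this concrete, here is a minimal numpy sketch (the specific arrays and the choice of $v$ are mine, purely for illustration) that builds samples of the form $x_i = z_i v$ and recovers the single number $z_i$ describing each one:

```python
import numpy as np

rng = np.random.default_rng(0)

# A fixed unit direction v in R^100 and one scalar z_i per sample.
v = rng.normal(size=100)
v /= np.linalg.norm(v)
z = rng.normal(size=500)            # 500 samples, one number each

# Each 100-dimensional sample is x_i = z_i * v.
X = np.outer(z, v)                  # shape (500, 100)

# Knowing v, each sample is summarized by a single number: z_i = <x_i, v>.
z_recovered = X @ v
print(np.allclose(z_recovered, z))  # True: 100 coordinates reduced to 1
```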
The goal of PCA is to provide a methodical way of accomplishing this in general. There are a few issues to overcome.
- The data may not be of the form $x_i = z_i v$, but rather $x_i = z_{i,1} v_1 + z_{i,2} v_2$, or even higher dimensional.
- The data might only be approximately of the form $x_i = z_i v$, i.e. of the form $x_i = z_i v + \epsilon_i$, where $\epsilon_i$ is a small error term.
- We may not know, a priori, the form of the data. How do we determine the best $k$-dimensional representation of the data?
Let us address these issues below.
1. PCA in Mathematical Terms
Suppose we are presented with a matrix $X$ of size $n \times d$, where each row is a sample and each column is a feature. We denote the rows of $X$ by $x_1, \dots, x_n \in \mathbb{R}^d$. We suppose further, without loss of generality after scaling, that each column of $X$ has zero mean and unit variance.
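In code, this standardization step is just the following (a sketch assuming a plain numpy array with samples as rows; the function name is mine):

```python
import numpy as np

def standardize(X):
    """Center each column to zero mean and scale it to unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)
```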
Our goal is to find the best approximation of the form
$$x_i \approx z_{i,1} v_1 + \cdots + z_{i,k} v_k,$$
where $v_1, \dots, v_k \in \mathbb{R}^d$ and the $z_{i,j}$ are scalars. We suppose, more precisely, that
$$x_i = \sum_{j=1}^{k} z_{i,j} v_j + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2 I).$$
From this assumption, how do we find the $v_j$ and the $z_{i,j}$? The answer comes from the principle of Maximum Likelihood Estimation (MLE). Utilizing the formula for the probability density function of a multivariate normal distribution, for fixed $v_1, \dots, v_k$ and $z_{i,j}$, the probability of observing $x_1, \dots, x_n$ is given by
$$\prod_{i=1}^{n} \frac{1}{(2\pi\sigma^2)^{d/2}} \exp\left( -\frac{\big\| x_i - \sum_{j=1}^{k} z_{i,j} v_j \big\|^2}{2\sigma^2} \right).$$
The norm in the above equation is the classic Euclidean norm. The principle of MLE states that we should choose the $v_j$ and the $z_{i,j}$ to maximize this probability. It is typically easier to maximize a sum rather than a product. For instance, in calculus, it is much easier to take the derivative of a sum than a product, as the product rule for derivatives is more complex than the sum rule. A classic mathematical trick that converts a product into a sum is to take logs. As the logarithm is a monotonically increasing function, maximizing the log of a function is equivalent to maximizing the function itself. Thus, we consider the log likelihood
$$\sum_{i=1}^{n} \left( -\frac{d}{2}\log(2\pi\sigma^2) - \frac{\big\| x_i - \sum_{j=1}^{k} z_{i,j} v_j \big\|^2}{2\sigma^2} \right).$$
As our goal is to maximize the above, we can equivalently minimize the sum
$$\sum_{i=1}^{n} \Big\| x_i - \sum_{j=1}^{k} z_{i,j} v_j \Big\|^2,$$
after removing the constant term that does not depend on the $v_j$ or the $z_{i,j}$ and scaling the remaining summand. We can rewrite the above as
$$\| X - Z V \|_F^2,$$
where $Z$ is the matrix of size $n \times k$ whose rows are $z_i = (z_{i,1}, \dots, z_{i,k})$, $V$ is the matrix of size $k \times d$ whose rows are $v_1, \dots, v_k$, and $X$ is the matrix of size $n \times d$. Here $\| \cdot \|_F$ denotes the Frobenius norm, i.e. the square root of the sum of the squared entries of a matrix.
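As a quick sanity check (a sketch with made-up shapes and random matrices, not anything from the derivation above), the matrix form of the objective agrees with the per-sample sum of squared errors:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 200, 100, 5

X = rng.normal(size=(n, d))   # data matrix, rows are the samples x_i
Z = rng.normal(size=(n, k))   # latent coordinates, rows are the z_i
V = rng.normal(size=(k, d))   # rows are the directions v_1, ..., v_k

# Sum over samples of ||x_i - sum_j z_{ij} v_j||^2 ...
per_sample = sum(np.linalg.norm(X[i] - Z[i] @ V) ** 2 for i in range(n))
# ... equals the squared Frobenius norm ||X - Z V||_F^2.
frobenius = np.linalg.norm(X - Z @ V, "fro") ** 2
print(np.isclose(per_sample, frobenius))  # True
```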
2. The Eckart-Young Theorem
The solution to this problem comes from the Eckart-Young theorem. Unfortunately, the original paper is behind a paywall, but we will provide the ideas here. Recall the rank of a matrix is the dimension of its column space.
Let us also recall that the Singular Value Decomposition (SVD) of a matrix $A$ of size $n \times d$ is given by
$$A = U \Sigma W^T,$$
where $U$ is an $n \times n$ orthogonal matrix, $W$ is a $d \times d$ orthogonal matrix, and $\Sigma$ is an $n \times d$ matrix that is zero except for the singular values $\sigma_1 \ge \sigma_2 \ge \cdots \ge 0$ on its diagonal.
Theorem: (Eckart-Young) Let $A$ be any matrix of size $n \times d$. Let $A_k = U \Sigma_k W^T$, where $\Sigma_k$ is obtained from $\Sigma$ by keeping only the $k$ largest singular values and setting the rest to zero. Then
$$\|A - A_k\|_F \le \|A - B\|_F \quad \text{for every matrix } B \text{ of rank at most } k.$$
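As a quick numerical illustration (a sketch; the shapes and the random comparison matrices are mine), truncating the SVD gives a rank-$k$ approximation whose Frobenius error is no worse than that of other rank-$k$ candidates:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 50, 30, 3

A = rng.normal(size=(n, d))
U, s, Wt = np.linalg.svd(A, full_matrices=False)

# Best rank-k approximation per Eckart-Young: keep only the k largest singular values.
A_k = (U[:, :k] * s[:k]) @ Wt[:k, :]
best_err = np.linalg.norm(A - A_k, "fro")

# Compare against a few random rank-k matrices: none should beat A_k.
for _ in range(5):
    B = rng.normal(size=(n, k)) @ rng.normal(size=(k, d))  # rank <= k
    assert np.linalg.norm(A - B, "fro") >= best_err
print("truncated SVD error:", best_err)
```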
There is a lot to unpack here:
- How does this solve our original problem?
- Why is this theorem true?
Let us address these in order. First note that $\operatorname{rank}(ZV) \le k$, as $Z$ has $k$ columns. Thus, we were seeking a rank $k$ approximation to $X$ above. Applying the theorem with $A = X$, we may take $V = W_k^T$, the first $k$ rows of $W^T$, and $Z$ the first $k$ columns of $U \Sigma$ (the first $k$ columns of $U$, scaled by $\sigma_1, \dots, \sigma_k$); the rows of $Z$ are precisely the $z_i$ from above. We also have $\Sigma_k$ taking a particularly simple form, a mostly zero matrix. This nicely generalizes our original one-dimensional example, where for $k = 1$
$$x_i \approx \sigma_1 u_{i,1} w_1,$$
so each sample is described by the single number $z_i = \sigma_1 u_{i,1}$ together with the common direction $v = w_1$.
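Putting this together in numpy (a sketch; the data here is random and the shapes are mine, but reading $Z$ and $V$ off the SVD is exactly the construction above):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, k = 300, 100, 2

# Standardized data matrix (zero-mean, unit-variance columns), as assumed earlier.
X = rng.normal(size=(n, d))
X = (X - X.mean(axis=0)) / X.std(axis=0)

U, s, Wt = np.linalg.svd(X, full_matrices=False)

V = Wt[:k, :]            # rows are the principal directions v_1, ..., v_k
Z = U[:, :k] * s[:k]     # rows are the latent coordinates z_i

# Z V is the best rank-k approximation of X in Frobenius norm;
# the squared error is the sum of the discarded squared singular values.
print(np.linalg.norm(X - Z @ V, "fro") ** 2)
print(np.sum(s[k:] ** 2))
```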
Proof of Eckart-Young
By SVD, and since the Frobenius norm is unchanged by multiplication by orthogonal matrices ($\|A - B\|_F = \|\Sigma - U^T B W\|_F$, and $B$ has rank at most $k$ exactly when $U^T B W$ does), it is enough to find the matrix $B$ of rank at most $k$ that minimizes
$$\|\Sigma - B\|_F.$$
3. A Nice After Effect
Thanks to the singular value decomposition, we are equipped with an algorithm for reducing the dimension of non-training data. To see this, note that the rows of $V = W_k^T$ are orthonormal, so for a new sample $x_{\text{new}} \in \mathbb{R}^d$ we may compute its latent variables via
$$z_{\text{new}} = x_{\text{new}} V^T.$$
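A minimal sketch of applying this to held-out data (the variable names and standardization details are mine):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, k = 300, 100, 2

# "Training" data with standardized columns.
X = rng.normal(size=(n, d))
mu, sd = X.mean(axis=0), X.std(axis=0)
X = (X - mu) / sd

U, s, Wt = np.linalg.svd(X, full_matrices=False)
V = Wt[:k, :]                        # principal directions learned from the training data

# A new, unseen sample: apply the same standardization, then project onto the rows of V.
x_new = rng.normal(size=d)
z_new = ((x_new - mu) / sd) @ V.T    # its k latent coordinates
print(z_new.shape)                   # (2,)
```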
4. An Alternate Perspective
The more classical way to approach PCA is to search for a vector $v$ satisfying
$$v = \operatorname*{arg\,max}_{\|w\| = 1} \|Xw\|^2,$$
i.e. a unit direction along which the projected data has maximal variance.
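The two viewpoints agree: the maximizer is the top right singular vector of $X$, equivalently the leading eigenvector of $X^T X$. A quick numerical check (a sketch with random data, not from the post):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 20))
X -= X.mean(axis=0)

# SVD view: the top right singular vector of X.
_, _, Wt = np.linalg.svd(X, full_matrices=False)
v_svd = Wt[0]

# Classical view: the unit eigenvector of X^T X with the largest eigenvalue,
# which maximizes ||Xv||^2 over unit vectors (Rayleigh quotient).
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
v_eig = eigvecs[:, -1]

# They agree up to sign.
print(min(np.linalg.norm(v_svd - v_eig), np.linalg.norm(v_svd + v_eig)))  # ~0
```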
That being said, I answered a question on math stackexchange a while back that provides a derivation of PCA from this perspective. I was curious whether the Lagrange multiplier approach outlined in that post could be used to derive the Eckart-Young theorem.
To see how to approach this, suppose we want to prove Eckart-Young for $k = 1$, i.e. we would like to find a rank 1 matrix $B = z v^T$, with $z \in \mathbb{R}^n$ and $v \in \mathbb{R}^d$, that minimizes
$$\|X - z v^T\|_F^2.$$
I’d like to see a cleaned up version of this argument, in the general case.
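For the $k = 1$ case, here is one way such an argument might begin (my own sketch, not necessarily the one from the linked post): for a fixed unit vector $v$, minimizing over $z$ first gives
$$\min_{z} \|X - z v^T\|_F^2 = \sum_{i=1}^{n} \min_{z_i} \|x_i - z_i v\|^2 = \sum_{i=1}^{n} \left( \|x_i\|^2 - \langle x_i, v \rangle^2 \right) = \|X\|_F^2 - \|Xv\|^2,$$
so minimizing over rank 1 matrices is the same as maximizing $\|Xv\|^2$ over unit vectors $v$, which is exactly the classical problem above and is amenable to Lagrange multipliers.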







