Category Archives: Machine Learning

AUC and Cross-entropy

In this post we will see that Log loss and AUC for binary classification problems are closely related.

Suppose we have a set of probabilities that are meant to predict whether a true label is 0 or 1. That is, let {\widehat{y_k}, y_k \in [0,1]} be the predicted and true labels for the {k}-th example, {k = 1, \ldots , m}.

Recall the AUC is the area under the ROC curve. We will use another interpretation, which is derived on the above wiki: AUC is the probability that a randomly chosen positive example is ranked higher than a randomly chosen negative example.

Note the larger that AUC is, the better the predictions are. For comparison of AUC to log loss, we will convert AUC to a loss function. We do so by studying {1-{\rm AUC}}.

Let {P} be the number of positive examples and {N} be the number of negative samples. Thus,

\displaystyle 1 - AUC = \frac{1}{PN}\sum_{p : y_p = 1} \sum_{n : y_n = 0} f_{AUC}(y_p,y_n),

where {f_{AUC}(x,y) = 1} if {y>x} and {0} otherwise.

We introduce another quantity:

\displaystyle I = \frac{1}{PN} \sum_{p : y_p = 1} \sum_{n : y_n = 0} \log(\widehat{y_p}) \cdot \log(1-\widehat{y_n}).

This quantity has a similar shape to the first quantity, except that {f_{AUC}(x,y)} is replaced by {\log(x) \cdot \log(1-y)}. Note that these two functions are quite similar. The latter is a smooth variant of the first, that penalizes much harsher for misclassifications that are poor. Put another way, the latter quantity penalizes harshly for confidently wrong predictions.

The above equation factors and is equal to

\displaystyle I = \left( \frac{1}{P} \sum_{p : y_p = 1} \log(\widehat{y_p}) \right) \left( \frac{1}{N} \sum_{n : y_n = 0} \log(1-\widehat{y_n}) \right).

Let us compare this to the log loss:

\displaystyle {\rm LogLoss} = \frac{1}{P +N} \sum_{p : y_p = 1} \log(\widehat{y_p}) + \frac{1}{P + N} \sum_{n : y_n = 0} \log(1-\widehat{y_n}).

The intermediate expression and LogLoss are connected by the AM-GM inequality. For simplicity, let us suppose {P = N}. Then, the AM-GM inequality implies

\displaystyle  \sqrt{I} \leq {\rm LogLoss}.

As {I} is similar to {1 - AUC}, we have that {1-{\rm AUC}} is roughly a lower bound on the log loss. Let us make some assumptions

  • Balanced data: {P \approx N}.
  • The predictions are not very wrong too often, i.e. {\widehat{y_p}} is not too often close to zero.
  • Predictions have similar errors for positive and negative, that is the log loss over positive and negative examples is similar.

Then following through the above argument, we find that

\displaystyle 1 - AUC \approx \sqrt{{\rm LogLoss}}.

The main new addition to what is above is to understand that equality in the AM-GM inequality is achieved when the two terms are equal.

Thus, we understand a couple things:

  • We expect that {1 - AUC \approx \sqrt{{\rm LogLoss}}}.
  • If this is not the case, then we can infer that the assumptions above are not met in our particular use-case.

PCA like a Mathematician

Principal Component Analysis, or PCA, is a fundamental dimensionality reduction technique using in Machine Learning. The general goal is to transform a {d}-dimensional problem into a {k}-dimensional one, where {k} is smaller than {d}. This can be used, for instance, to plot large dimensional data on a 2D or 3D plot, or to reduce the number of features in an effort to eliminate random noise in a machine learning problem (made precise below). In this post, we will discuss PCA in mathematical terms. This is adopted from a first course in Linear Algebra, the principle of maximum likelihood estimation, and the Eckhart-Young theorem from 1936.

{\smallskip}

Example: Suppose we are presented with {m} samples {x_1 , \ldots x_{m}}, where each is a 100 dimensional vector. After inspection, we realize that each data point is of the form {c_i z}, where {c_i} is a real number and {z} is a 100-dimensional vector. Thus, to describe each {x_i}, we simply need to specify {c_i}, rather than the 100 coordinates of {x_i}. This effectively reduces the dimensionality of our data from 100 to 1. {\spadesuit} {\smallskip}

The goal of PCA is to provide a methodical way of accomplishing this in general. There are a few issues to overcome.

  • The data may not be of the form {c_i z}, but rather {c_i z_1 + d_i z_2}, or even higher dimensional.
  • The data might only be approximately of the form {c_i z}, i.e. of the form {c_i z + e_i }, where {e_i} is a small error term.
  • We may not know, a priori, the form of the data. How do we determine the best {k}-dimensional representation of the data?

Let us address these issues below.

1. PCA in Mathematical Terms

Suppose we are presented with a matrix {X} of size {m \times d}, where each row is a sample and each column is a feature. We denote the rows of {X} are given by {x_1 , \ldots , x_m}. We suppose further, without loss of generality after scaling, that each column of {X} has zero mean and unit variance.

Our goal is to find the best approximation of the form

\displaystyle x_i \approx z_i^T W ,

for all {i = 1 , \ldots , m}, where {W} is a matrix of size {k \times d} and {z_i} is a vector of size {k}. Thus, each {x_i} is generated from lower dimensional data, via the {z_i}.

We suppose, more precisely, that

\displaystyle x_i =z_i^T W + \epsilon_i,

where {\epsilon_i \sim N(0, I)} is a error term. That is each coordinate of {\epsilon_i} is normally distributed with mean zero and variance 1. From here, we plan to derive PCA. One can wonder if the assumption on {\epsilon_i} is reasonable, which can be addressed on an individual basis. The assumption that {x_i} is generated by lower dimensional data is sometimes referred to as a latent representation, and the interested reader may consult this blog post for applications of this idea into generative models, such as image generation.

From this assumption, how do we find {W} and the {z_i}? The answer comes from the principal of Maximum Likelihood Estimation (MLE). Utilizing the formula for the probability density function of a multivariate normal distribution, for a fixed {W}, and {z_1 , \ldots , z_m}, the probability of observing {x_1 , \ldots , x_m} is given by

\displaystyle P(x_1 , \ldots , x_m | W, z_1 , \ldots , z_m) = \prod_{i=1}^m \frac{1}{(2 \pi )^{d/2}} \exp \left( - \frac{1}{2 } \| x_i - z_i^TW \|^2 \right).

The norm in the above equation is the classic euclidean norm. The principal of MLE states that we should choose {W} and the {z_i} to maximize this probability. It is typically easier to maximize a sum rather than a product. For instance, in Calculus, it is much easier to take the derivative of a sum than a product, as the product rule of derivatives is more complex than the sum rule. A classic mathematical trick that converts a product into a sum is to take logs. As the logarithm is a monotonically increasing function, maximizing the log of a function is equivalent to maximizing the function itself. Thus, we consider the log likelihood

\displaystyle \log P(x_1 , \ldots , x_m | W, z_1 , \ldots , z_m) = - \frac{m}{2} \log (2 \pi) - \frac{1}{2} \sum_{i=1}^m \| x_i - z_i^T W \|^2.

As our goal is to maximize the above, we can equivalently minimize the sum

\displaystyle \sum_{i=1}^m \| x_i - z_i^T W \|^2,

after removing the constant term that does not depend on {W} or the {z_i} and scaling the remaining summand. We can rewrite the above as

\displaystyle  || X - ZW||^2,

where {X} is the matrix of size {m \times d} whose rows are {x_1 , \ldots , x_m}, {Z} is the matrix of size {m \times k} whose rows are {z_1 , \ldots , z_m}, and {W} is the matrix of size {d \times k}.

2. The Eckhart-Young Theorem

The solution to this problem comes from the Eckart-Young theorem. Unfortunately, the original paper is behind a paywall, but we will provide the ideas here. Recall the rank of a matrix is the dimension of its column space.

Let us also recall the Singular Value Decomposition (SVD) of a matrix {X} of size {m \times d} is given by

\displaystyle X = U \Sigma V^T,

where {U} is an {m \times m} orthogonal matrix, {\Sigma} is a {m \times d} diagonal matrix, and {V} is a {d \times d} orthogonal matrix. The diagonal entries of {\Sigma} are called the singular values of {X}. It turns out that the columns of {U} are the eigenvectors of {X X^T} and the columns of {V} are the eigenvectors of {X^T X}.

{\smallskip}

Theorem: (Eckhart-Young) Let {X} be any matrix of size {m \times d}. Let {k \leq d}. Then

\displaystyle \min_{X_k} ||X - X_k||_{F}^2 \ \ {\rm s.t.} \ \ {\rm rank}(X_k) \leq k,

is given by

\displaystyle X_k = U \Sigma_{k}V^T,

where

\displaystyle X = U \Sigma V^T,

is the Singular Value Decomposition of {X} and {\Sigma_k} is {m\times d} matrix which has is zero except the first {k} columns are the same as {\Sigma}. {\spadesuit} {\smallskip}

There is a lot to unpack here:

  • How does this solve our original problem?
  • Why is this theorem true?

Let us address these in order. First note that {{\rm rank}(ZW) \leq {\rm rank}(Z) \leq k}, as {Z} has {k} columns. Thus, we were seeking a rank {k} approximation to {X} above. Here the rows of {U\Sigma_k} are precisely {z_1 , \ldots , z_m} from above. We also have {W} taking a particularly simple form, of a mostly zero matrix. This has the nice generalization of our original one dimensional example, where

\displaystyle x_i = \sum_{i=1}^k z_i \sigma_i,

where {\sigma_i} is the {i}th singular value of {X}.

{\smallskip}

Proof of Eckhart Young By SVD, it is enough to find {X_k} that minimizes

\displaystyle ||U \Sigma V^T - X_k||_{F}^2.

Since the Frobenius norm is invariant under orthogonal transformations, we may equivalently minimize

\displaystyle ||\Sigma - U^T X_k V||_{F}^2.

But {U^T X_k V} is a rank {k} matrix, as it is the product of two rank {k} matrices. Since {\Sigma} is a diagonal matrix, it is straightforward (or see here) to see that the best rank {k} approximation to {\Sigma} is to keep the first {k} columns of {\Sigma} and set the rest to zero, via {\Sigma_k}. Thus,

\displaystyle \Sigma_k = U^T X_k V,

and the result follows. {\spadesuit}

3. A Nice After Effect

Thanks to the singular value decomposition, we are equipped with an algorithm for reducing the dimension on non-training data. To see this, we may compute the latent variables, via

\displaystyle X_k = X V_k,

where {V_k} is the matrix of size {d \times k} whose columns are the first {k} columns of {V}. Thus, given a new data point {x}, we may compute the latent representation via

\displaystyle z = x^T V_k.

This is a nice feature of PCA, which is not afforded by other dimensionality reduction techniques, such as t-SNE. This allows the transform method to be available in sklearn’s PCA implementation, which affords the opportunity for PCA to be used in the feature engineering portion of a machine learning pipeline.

4. An Alternate Perspective

The more classical way to approach PCA is to search for a vector {v \in \mathbb{R}^d} satisfying

\displaystyle v = \arg \max_{u \in \mathbb{R}^d} || Xu||.

Thus, {v} is the vector that maximizes the variance of the data projected onto the line generated by {v}. However, the MLE approach above is more natural from a machine learning perspective.

That being said, I answered a question on math stackexchange awhile back that provides a derivation of PCA from this perspective. I was curious if the Lagrange Multiplier, outlined in that post, approach could be used to derive the Eckhart-Young theorem.

To see how to approach this, suppose we want to prove Eckhart-Young for {k=1}, i.e. we would like to find a rank 1 matrix {X_1}, that minimizes

\displaystyle ||X - X_1||_{F}^2.

As {X_1} is rank 1, it can be written as {uv^T}, where {u} and {v} are vectors. Thus, we are looking for {u} and {v} that minimize

\displaystyle ||X - uv^T||_{F}^2.

This is an optimization problem in {u} and {v}. Let us take a gradient with respect to {u} and {v} and set it equal to zero. By the definition of the Frobenius norm, and the multivariate chain rule, we have

\displaystyle \nabla_u ||X - uv^T||_{F}^2 = -2 (X - uv^T)v = 0,

and

\displaystyle \nabla_v ||X - uv^T||_{F}^2 = -2 (X - uv^T)^T u = 0.

Thus,

\displaystyle Xv = uv^Tv, \ \ \ X^T u = v u^T u.

These are precisely the equations satisfied by the SVD of {X}.

I’d like to see a cleaned up version of this argument, in the general case.

Temperature in Natural Language Processing

In Machine Learning, and in particular Generative AI, temperature is a useful hyperparameter for tuning model outputs.

In this post, we will discuss the following.

  • Temperature is a parameter developers can use to alter outputs from Large Language Models
  • With a higher temperature we get more creative outputs.
  • Why changing the temperature is useful.

Let’s start with an example (the code I used is at the end of this post). Consider the following prompt, taken from Google’s Minerva paper.

A line parallel to y = 4x + 6 passes through (5, 10). What is the y-coordinate of the point where this line crosses the y-axis?

I asked ChatGPT four times to answer this prompt, similar to what was done in their paper. Here is what I got:

  • Since a line parallel to y=4x+6 has the same slope, the desired line has slope 4. Therefore, the equation of the line is of the form y=4x+b. Plugging in the point (5, 10), we have 10=4(5)+b, so b=10-4(5)=-10. Therefore, the y-intercept is \boxed{-10}.
  • Since the line is parallel to y = 4x + 6, which has a slope of 4, the line must have a slope of 4. The equation of the line is then y - 10 = 4(x - 5) = 4x - 20. Setting x = 0, we find y = -20 + 10 = \boxed{-10}.
  • The given line has slope 4, so a line parallel to it also has slope 4. If a line with slope 4 passes through (5,10), then it is of the form y-10=4(x-5). Expanding the right side gives y-10=4x-20, so y=4x-10. Setting x=0, we find that the line crosses the y-axis at \boxed{-10}.
  • Since a line parallel to y=4x+6 will have the same slope, this line will have slope equal to 4. Therefore, its y-intercept has y-coordinate \boxed{-14}.

The first three answers are correct, while the last answer is wrong. Actually the final answer is correct all the way up to the 4 in 14. What happened under the hood is that likely ChatGPT was nearly sure the 4 should be a 0, but still was hedging its bets that it could in fact be a 4. In that particular instance, it randomly chose the 4.

To understand temperature, we first have to understand a bit about how large language models work. For instance, the G in GPT stands for generative. In practice, this means that given some text, the Large Language model will try to predict the next letter (or more precisely token, see this previous blog post). For a model like GPT, each letter will be assigned a probability as to how likely it is to be the next character. For instance, in the example above, the model likely assigned a large probability to the character 0 and a small, but positive, one to the character 4. It’s at this point the concept of temperature is useful.

Now that we have a bunch of probabilities assigned to each character, we have to define a methodical way of choosing the next character. Do we just assign the character with highest probability?

Unfortunately, there is no “one size fits all” solution to this problem, which is why we introduce the notion of temperature. Let’s look at an example. Suppose we are choosing between two characters to output next. Suppose further that our model overwhelming thinks the first character is the best choice. We plot how the temperature affects our choice of character in this example.

Here, for a low temperature (i.e. \theta close to 0), the model outputs the first character nearly all the time. But as the temperature grows larger, the model outputs each character around half the time.

Thus as we decrease the temperature, we get closer to the model that only outputs the highest probability character. As we increase the temperature, we get closer to the model that chooses each character randomly and uniformly.

And just for completeness, I will mention that the temperature 0 response I got from ChatGPT in the above example was in alignment with the three correct answers.

Why is Temperature Useful?

The argument made in the aforementioned Minerva paper was that by increasing the temperature, we can have the model generate a variety of outputs. From this variety of outputs, we may pick the “best” one. How we choose the best one can vary, but what they did is just take the most popular one.

This allows us to explore the probability space of answer generated by the generative model in order to make a more informed decision at which one to proceed with.

The Code

Here is the Python code I used to generate the example above. First I created a .env file in the same directory with my Open AI API key (fill out with your API key)

OPENAI_API_KEY=

Then I used Langchain (though this is not really required for such a simple example) as follows:

import os 
from langchain.chat_models import ChatOpenAI
from langchain.chains import LLMChain
from langchain import PromptTemplate

OPENAI_API_KEY = os.environ.get("OPEN_API_KEY")
query = "A line parallel to $y = 4x + 6$ passes through $(5, 10)$. What is the $y$-coordinate of the point where this line crosses the $y$-axis?"

prompt = PromptTemplate.from_template(query )
prompt.format()
llm = ChatOpenAI(temperature = 1)

chain = LLMChain(llm=llm, prompt=prompt)

responses = [chain.run({}) for _ in range(5)]

Singular Value Decomposition and PCA

Principal Component Analysis (PCA) is a popular technique in machine learning for dimension reduction. It can be derived from Singular Value Decomposition (SVD) which we will discuss in this post. We will cover the math, an example in python, and finally some intuition.

The Math

SVD asserts that any m \times d matrix A can be written as

A = U\Sigma V^T,

where

  • \Sigma is m \times d and is rectangular diagonal, that is \Sigma_{ij} = 0 for i \neq j.
  • U and V are orthogonal, that is U^T U = I and V^T V = I.

This allows us to derive PCA. First, we are given a data matrix, X, that is a matrix where each of the m rows are data points with d features.

First, we translate each column of X so that each column has mean zero. That is we translate each column by \mu_1 , \ldots , \mu_d, respectively. For instance if

X = \left[\begin{matrix}1 & 2\\3 & 4\\5 & 6\end{matrix}\right] ,

Then we would translate the first column by -3 and the second column by -4. Then we divide each column by their standard deviations, \sigma_1 , \ldots , \sigma_d. Thus in our example we divide the first column and second column by \sqrt{8}. Let us call this new scaled matrix A to align with our notation above. Then we can perform PCA by performing SVD

A = U\Sigma V^T

and computing the $m \times d$ matrix

AV.

For PCA with k components, the data point in the j^{\rm th} row of X is transformed to the first k entries of the j^{\rm th} row of AV.

Put another way, given a data point x, we compute PCA in two steps:

z = (x_1-\mu_1)/\sigma_1 , \cdots , (x_d-\mu_d)/\sigma_d,

and then

(z \cdot v_1 , \cdots , z \cdot v_k)

where v_1 , \ldots , v_k are the first k columns of V.

Python Example

We will show how to code up PCA using only numpy and torchvision (only to retrieve some classical datasets). First we start with some imports.

import torchvision.datasets as datasets
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns 

Then we define a PCA function that does the work we did in the previous section. The data is inputted as a numpy array. We utilize the np.linalg.svd function for SVD and specify full_matrices=False so that the dimensions of U and V^T align with our computations above.

def pca(data,k=2):
    data = data.reshape((data.shape[0], -1))
    mu = np.mean(data, axis=0)
    std = np.std(data, axis=0).clip(min=1e-10).reshape(1,-1)
    A = (data - mu) / std
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    V = np.transpose(Vt)
    tmp = np.dot(A, V)
    return tmp[:,:k],V

We can now load in the MNIST, Fashion MNIST, and CIFAR datasets from torchvision datasets.

mnist = datasets.MNIST(root='./data', train=True, download=True, transform=None)
np_mnist = mnist_trainset.data.numpy()/255.0

fmnist_trainset = datasets.FashionMNIST(root='./data', train=True, download=True, transform=None)
fmnist_numpy = fmnist_trainset.data.numpy()/255.0

cifar= datasets.CIFAR100(root='./data', train=True, download=True, transform=None)
cifar_np = cifar.data /255.0

We can now perform PCA on these datasets.

pca_mnist,V_mnist = pca(np_mnist)
pca_fmnist , V_fmnist = pca(np_fmnist)
pca_cifar , V_cifar = pca(cifar_np)

We can see how they divide the datasets into their classes. With the following code.

fig, axs = plt.subplots(3, 4, figsize=(15, 6))

for i in range(3):
    for j in range(4):
        N = i*4+j 
        axs[i, j].imshow(.5*V_mnist[:,N].reshape((28,28))+.5, cmap="gray")
        N +=1 
        if N == 1:
            axs[i, j].set_title(f"{N}st Principal Component")
        elif N == 2:
            axs[i, j].set_title(f"{N}nd Principal Component")
        elif N == 3:
            axs[i, j].set_title(f"{N}rd Principal Component")
        else:
            axs[i, j].set_title(f"{N}th Principal Component")
        axs[i, j].axis('off')

We can also see what pixels are activated for each principal component. For instance we have the following for the MNIST data set

and the Fashion MNIST

You are encouraged to think about how I came about these images!

Intuition

There are two steps to PCA:

  • Scaling the columns of the data matrix
  • Computing the SVD and transforming the data

Let’s see these steps with a text example to see what’s going on. We generate a bunch of points in 2 dimensional space that lie close to the line y =2x, where x is between 0 and 1.

import numpy as np 

X = np.zeros((1_000,2))
X[:,0] = np.random.uniform(0,1,1_000)
X[:,1] = 2*X[:,0] + np.random.normal(0,0.1,1_000)
fig = sns.scatterplot(x = X[:,0],y = X[:,1])

Now we perform the scaling step.

Z = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
fig = sns.scatterplot(x=Z[:, 0], y=Z[:, 1])

You can see first of all this shifted the data so that now the center point is (0,0), instead of the original (.5,1). Also, through scaling, the data now lies near the line y = x, rather than y = 2x.

The scaling step (in particular division by the standard deviation) makes it so we do not favor a feature simply because the coordinates are on average larger.

We can now compute V.

Vt = np.linalg.svd(Z, full_matrices=False)[-1]
V = np.transpose(Vt)

This gives

V=  \begin{bmatrix} -0.70710678 & -0.70710678 \\ -0.70710678 & 0.70710678 \end{bmatrix}

Notice the first column of V, v_1 is a scalar multiple times

(1,1).

and the second column, v_2 is a scalar multiple of

(-1,1).

Here is what is going on. In our data set, after scaling, data points lie on the line y=x, up to a small error. PCA is a way of automating the process of finding this y=x line. Indeed, this is given by the vector v_1. When we compute v_1 \cdot x, for a data point x, we are getting information on the whereabouts of x lies on the line y = x. Indeed, for a point u = (a , a) we have

u \cdot (1,1)  = 2 a,

so this dot product is able to precisely compute where x is on the line y = x.

In higher dimensions, the same general principle holds. PCA allows us to find the line (or plane, or higher dimensional space) that best describes our data. Indeed let’s look at the previous example in 3-dimensions. We just add some small random noise to the additional coordinate

X = np.zeros((1_000,3))
X[:,0] = np.random.uniform(0,1,1_000)
X[:,1] = 2*X[:,0] + np.random.normal(0,0.1,1_000)
X[:,2] = .1*np.random.uniform(0,1,1_000)
Z = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
Vt = np.linalg.svd(Z, full_matrices=False)[-1]
V = np.transpose(Vt)
print(V)

Here we find that the first column of V is, roughly, a scalar multiple times

(1,1,0).

Thus PCA still identifies the important relation between the first and second coordinate (after scaling) and disregards the other noise.

Limitations of PCA

One major limitation of PCA is that it’s only capable of capturing linear relations between the features. This could be handled by combining feature engineering with PCA, though this presents its own set of challenges. One way to overcome this is to invoke non-linear methods of dimension reduction, i.e. t-SNE or UMAP. In the following figure, I ran t-SNE on the MNIST data set (first using PCA to reduce the number of features to 25). We can see it performs much better than PCA at identifying digits.

Furthermore, PCA can lead to unpredictable outcomes that depend on the distribution of the data collected, as studied in this paper of Elhaik which I learned about from the Batch. This highlights that one should be careful making inferences from PCA alone.

The Square Root Cancellation Heuristic

In the first equation of the popular Attention is all you need paper (see also this blog post), the authors write

{\rm Attention}(Q,K,V) = {\rm softmax} \left( \frac{QK^T}{\sqrt{d_k}} V\right).

In this post we are going to discuss where the \sqrt{d_k} comes from, leading us to some classical Probability Theory. We will first talk about the math with some examples and then quickly make the connection.

The principle of square root cancellation also appears in the Batch Norm Paper, Neural Network weight initialization (see also Xavier Uniform), and elsewhere.

The Math

Let v be a $d$-dimensional vector with entries that are either +1 or -1 Then we may write

v= (v_1 , \ldots , v_d).

We will be interested in the simple question: what is the sum of the coordinates? For a fixed vector, we can easily compute this via

v_1 + \cdots + v_d.

The largest this sum can be is d and the smallest it can be is -d. However, what we would like to know the average behavior of this sum

Since the vector has both +1‘s and -1‘s, there will likely be cancellation in the sum. That is, we expect the sum will not be either d or -d. It turns out that as d gets large, there will typically be a lot of cancellation.

Let’s first run an experiment in python. We let d = 500 and compute the sum of 100,000 random vectors each with \pm 1 entries.

import numpy as np 
import seaborn as sns 

d = 500; sums = []

for _ in range(100_000):
    v = np.random.choice([-1, 1], size=d)
    sums.append(np.sum(v))

_ = sns.histplot(data = sums)

We see that most of the sums are quite a bit smaller than 500, and barely any are smaller than -75 or larger than 75.

This is the so-called square root cancellation. If we take a random vector of dimension d with randomly generated \pm 1 entries, we expect the order of the sum to be \sqrt{d}. Here \sqrt{500} \approx 22.36, and it is a general phenomenon that most of the sums will lie between, say, 4\sqrt{d} and \sqrt{d}. This general principle goes much deeper than \pm 1 vectors and turns out to be very well-studied concept in several areas of mathematics (see this post for Number Theory or this post for Probability). In fact, the famous Riemann Hypothesis can be reformulated as showing a certain sum exhibits square root cancellation.

The square root cancellation heuristic asserts that “most” vectors exhibit square root cancellation. Put another way, if we come across a vector in practice, we’d expect there to be square root cancellation as in the above example.

It turns out that the example we will discuss below generalizes considerably. The only thing that really matters is that each coordinate has mean zero (which is easily remedied by translation of a fixed constant). There are also some technical assumptions that the coordinates of v are not too large, but this is never an issue for vectors that only take on finitely many values. The motivated reader can consult the Lindeberg-Lévy Central Limit Theorem for more along this direction.

Before going any deeper, let’s explain what’s going on in the aforementioned Attention is All you Need paper.

Attention is All You Need Example

In the above equation, we have

\frac{QK^T}{\sqrt{d_k}} .

Each entries of QK^T is the dot product a row of Q and a row of K:

q \cdot k = q_1 k_1 + \cdots + q_{d_k} k_{d_k}.

It turns out that this is the sum of d_k numbers, and by the square root cancellation principle, we expect the sum to be of order \sqrt{d_k}. Hence dividing through by this number allows the sum to be, on average, of size comparable to 1 (rather than the much larger \sqrt{d_k}.

Keeping the sums from being too large helps us avoid the problem of exploding gradients in deep learning.

Back to the Math

Recall we had a vector v of d dimensions,

v = (v_1 , \ldots , v_d),

and we are interesting in what the sum is, on average. To make the question more precise, we assume each coordinate is independently generated randomly with mean zero. The key assumption is independence, that is the generation of one coordinate does not impact the others. The mean zero assumption can be obtained by shifting the values (for instance instead of recording a six sided dice roll, record the dice roll minus 3.5).

Here are some examples in python that you are welcome to check

v1 = np.random.choice([1,2,3,4,5,6],size=500) - 3.5
v2 = np.random.exponential(1,size=500) - 1 
v3 = np.random.normal(0,1,size=500) 

For such a vector, and d a bit large (say over 30), we expect

|v_1 + \cdots + v_d| \approx \sqrt{v_1^2 + \cdots + v_d^2}.

Specializing to the case where the coordinates are \pm 1, we recover the \sqrt{d} from earlier. Now why should we expect that the sum is of size \sqrt{d}?

One way to see this is to explicitly compute the expected value of

\left( v_1^2 + \cdots + v_d^2 \right),

which is precisely $d$. This computation can be done by expanding the square and using that the covariance of two different coordinates is 0. We will put the classical computation at the end of the post for those interested.

It is a common theme in this area that working with the square of the sum is theoretically much easier than working with the sum itself.

The Central Limit Theorem

It is worth reiterating that the square root cancellation heuristic generalizes much further than the \pm 1 example, but let’s continue our focus there. The astute reader will notice that the above histplot looked like the normal distribution. In fact, it is. If we consider the sum

\frac{v_1 + \cdots + v_d}{\sqrt{d}},

the histplot will be normal with mean zero and variance. Put another way,

v_1 + \cdots + v_d \approx \sqrt{d},

and we can quantify the \approx precisely. Thus if we have a sum of d numbers (where we expect the numbers to be on average 0), we can divide the sum by \sqrt{d} in order to make the sum \approx 1.

The Variance Computation

We will show

E[ \left( v_1^2 + \cdots + v_d^2 \right) ] = \sum_{j} E[v_j^2]

Indeed,

E[ \left( v_1^2 + \cdots + v_d^2 \right) ]= E[\sum_{ i,j} v_i v_j ] =\sum_{ i,j} E[v_i v_j ] ,

using FOIL from algebra and linearity of expectation. It is thus enough to show,

\sum_{ i,j} E[v_i v_j ]  = 0,

for every i \neq j. But this follows from independence.