In this post we are going to discuss where the comes from, leading us to some classical Probability Theory. We will first talk about the math with some examples and then quickly make the connection.
Let be a $d$-dimensional vector with entries that are either or Then we may write
We will be interested in the simple question: what is the sum of the coordinates? For a fixed vector, we can easily compute this via
The largest this sum can be is and the smallest it can be is . However, what we would like to know the average behavior of this sum
Since the vector has both ‘s and ‘s, there will likely be cancellation in the sum. That is, we expect the sum will not be either or . It turns out that as gets large, there will typically be a lot of cancellation.
Let’s first run an experiment in python. We let and compute the sum of 100,000 random vectors each with entries.
import numpy as np import seaborn as sns d = 500; sums =  for _ in range(100_000): v = np.random.choice([-1, 1], size=d) sums.append(np.sum(v)) _ = sns.histplot(data = sums)
We see that most of the sums are quite a bit smaller than 500, and barely any are smaller than or larger than 75.
This is the so-called square root cancellation. If we take a random vector of dimension with randomly generated entries, we expect the order of the sum to be . Here , and it is a general phenomenon that most of the sums will lie between, say, and . This general principle goes much deeper than vectors and turns out to be very well-studied concept in several areas of mathematics (see this post for Number Theory or this post for Probability). In fact, the famous Riemann Hypothesis can be reformulated as showing a certain sum exhibits square root cancellation.
The square root cancellation heuristic asserts that “most” vectors exhibit square root cancellation. Put another way, if we come across a vector in practice, we’d expect there to be square root cancellation as in the above example.
It turns out that the example we will discuss below generalizes considerably. The only thing that really matters is that each coordinate has mean zero (which is easily remedied by translation of a fixed constant). There are also some technical assumptions that the coordinates of are not too large, but this is never an issue for vectors that only take on finitely many values. The motivated reader can consult the Lindeberg-Lévy Central Limit Theorem for more along this direction.
Before going any deeper, let’s explain what’s going on in the aforementioned Attention is All you Need paper.
Attention is All You Need Example
In the above equation, we have
Each entries of is the dot product a row of and a row of :
It turns out that this is the sum of numbers, and by the square root cancellation principle, we expect the sum to be of order . Hence dividing through by this number allows the sum to be, on average, of size comparable to 1 (rather than the much larger .
Keeping the sums from being too large helps us avoid the problem of exploding gradients in deep learning.
Back to the Math
Recall we had a vector of dimensions,
and we are interesting in what the sum is, on average. To make the question more precise, we assume each coordinate is independently generated randomly with mean zero. The key assumption is independence, that is the generation of one coordinate does not impact the others. The mean zero assumption can be obtained by shifting the values (for instance instead of recording a six sided dice roll, record the dice roll minus 3.5).
Here are some examples in python that you are welcome to check
v1 = np.random.choice([1,2,3,4,5,6],size=500) - 3.5 v2 = np.random.exponential(1,size=500) - 1 v3 = np.random.normal(0,1,size=500)
For such a vector, and a bit large (say over 30), we expect
Specializing to the case where the coordinates are , we recover the from earlier. Now why should we expect that the sum is of size ?
One way to see this is to explicitly compute the expected value of
which is precisely $d$. This computation can be done by expanding the square and using that the covariance of two different coordinates is 0. We will put the classical computation at the end of the post for those interested.
It is a common theme in this area that working with the square of the sum is theoretically much easier than working with the sum itself.
The Central Limit Theorem
It is worth reiterating that the square root cancellation heuristic generalizes much further than the example, but let’s continue our focus there. The astute reader will notice that the above histplot looked like the normal distribution. In fact, it is. If we consider the sum
the histplot will be normal with mean zero and variance. Put another way,
and we can quantify the precisely. Thus if we have a sum of numbers (where we expect the numbers to be on average 0), we can divide the sum by in order to make the sum .
The Variance Computation
We will show
for every . But this follows from independence.