# Temperature in Natural Language Processing

In Machine Learning, and in particular Generative AI, temperature is a useful hyperparameter for tuning model outputs.

In this post, we will discuss the following.

• Temperature is a parameter developers can use to alter outputs from Large Language Models
• With a higher temperature we get more creative outputs.
• Why changing the temperature is useful.

Let’s start with an example (the code I used is at the end of this post). Consider the following prompt, taken from Google’s Minerva paper.

A line parallel to $y = 4x + 6$ passes through $(5, 10)$. What is the $y$-coordinate of the point where this line crosses the $y$-axis?

I asked ChatGPT four times to answer this prompt, similar to what was done in their paper. Here is what I got:

• Since a line parallel to $y=4x+6$ has the same slope, the desired line has slope 4. Therefore, the equation of the line is of the form $y=4x+b$. Plugging in the point $(5, 10)$, we have $10=4(5)+b$, so $b=10-4(5)=-10$. Therefore, the $y$-intercept is $\boxed{-10}$.
• Since the line is parallel to $y = 4x + 6$, which has a slope of 4, the line must have a slope of 4. The equation of the line is then $y - 10 = 4(x - 5) = 4x - 20$. Setting $x = 0$, we find $y = -20 + 10 = \boxed{-10}$.
• The given line has slope 4, so a line parallel to it also has slope 4. If a line with slope 4 passes through $(5,10)$, then it is of the form $y-10=4(x-5)$. Expanding the right side gives $y-10=4x-20$, so $y=4x-10$. Setting $x=0$, we find that the line crosses the $y$-axis at $\boxed{-10}$.
• Since a line parallel to $y=4x+6$ will have the same slope, this line will have slope equal to 4. Therefore, its $y$-intercept has $y$-coordinate $\boxed{-14}$.

The first three answers are correct, while the last answer is wrong. Actually the final answer is correct all the way up to the 4 in 14. What happened under the hood is that likely ChatGPT was nearly sure the 4 should be a 0, but still was hedging its bets that it could in fact be a 4. In that particular instance, it randomly chose the 4.

To understand temperature, we first have to understand a bit about how large language models work. For instance, the G in GPT stands for generative. In practice, this means that given some text, the Large Language model will try to predict the next letter (or more precisely token, see this previous blog post). For a model like GPT, each letter will be assigned a probability as to how likely it is to be the next character. For instance, in the example above, the model likely assigned a large probability to the character 0 and a small, but positive, one to the character 4. It’s at this point the concept of temperature is useful.

Now that we have a bunch of probabilities assigned to each character, we have to define a methodical way of choosing the next character. Do we just assign the character with highest probability?

Unfortunately, there is no “one size fits all” solution to this problem, which is why we introduce the notion of temperature. Let’s look at an example. Suppose we are choosing between two characters to output next. Suppose further that our model overwhelming thinks the first character is the best choice. We plot how the temperature affects our choice of character in this example.

Here, for a low temperature (i.e. $theta$ close to 0), the model outputs the first character nearly all the time. But as the temperature grows larger, the model outputs each character around half the time.

Thus as we decrease the temperature, we get closer to the model that only outputs the highest probability character. As we increase the temperature, we get closer to the model that chooses each character randomly and uniformly.

And just for completeness, I will mention that the temperature 0 response I got from ChatGPT in the above example was in alignment with the three correct answers.

## Why is Temperature Useful?

The argument made in the aforementioned Minerva paper was that by increasing the temperature, we can have the model generate a variety of outputs. From this variety of outputs, we may pick the “best” one. How we choose the best one can vary, but what they did is just take the most popular one.

This allows us to explore the probability space of answer generated by the generative model in order to make a more informed decision at which one to proceed with.

## The Code

Here is the Python code I used to generate the example above. First I created a .env file in the same directory with my Open AI API key (fill out with your API key)

OPENAI_API_KEY=

Then I used Langchain (though this is not really required for such a simple example) as follows:

import os
from langchain.chat_models import ChatOpenAI
from langchain.chains import LLMChain
from langchain import PromptTemplate

OPENAI_API_KEY = os.environ.get("OPEN_API_KEY")
query = "A line parallel to $y = 4x + 6$ passes through $(5, 10)$. What is the $y$-coordinate of the point where this line crosses the $y$-axis?"

prompt = PromptTemplate.from_template(query )
prompt.format()
llm = ChatOpenAI(temperature = 1)

chain = LLMChain(llm=llm, prompt=prompt)

responses = [chain.run({}) for _ in range(5)]

# Singular Value Decomposition and PCA

Principal Component Analysis (PCA) is a popular technique in machine learning for dimension reduction. It can be derived from Singular Value Decomposition (SVD) which we will discuss in this post. We will cover the math, an example in python, and finally some intuition.

## The Math

SVD asserts that any $m \times d$ matrix $A$ can be written as

$A = U\Sigma V^T,$

where

• $\Sigma$ is $m \times d$ and is rectangular diagonal, that is $\Sigma_{ij} = 0$ for $i \neq j$.
• $U$ and $V$ are orthogonal, that is $U^T U = I$ and $V^T V = I$.

This allows us to derive PCA. First, we are given a data matrix, $X$, that is a matrix where each of the $m$ rows are data points with $d$ features.

First, we translate each column of $X$ so that each column has mean zero. That is we translate each column by $\mu_1 , \ldots , \mu_d$, respectively. For instance if

$X = \left[\begin{matrix}1 & 2\\3 & 4\\5 & 6\end{matrix}\right] ,$

Then we would translate the first column by -3 and the second column by -4. Then we divide each column by their standard deviations, $\sigma_1 , \ldots , \sigma_d$. Thus in our example we divide the first column and second column by $\sqrt{8}$. Let us call this new scaled matrix $A$ to align with our notation above. Then we can perform PCA by performing SVD

$A = U\Sigma V^T$

and computing the $m \times d$ matrix

$AV.$

For PCA with $k$ components, the data point in the $j^{\rm th}$ row of $X$ is transformed to the first $k$ entries of the $j^{\rm th}$ row of $AV$.

Put another way, given a data point $x$, we compute PCA in two steps:

$z = (x_1-\mu_1)/\sigma_1 , \cdots , (x_d-\mu_d)/\sigma_d,$

and then

$(z \cdot v_1 , \cdots , z \cdot v_k)$

where $v_1 , \ldots , v_k$ are the first $k$ columns of $V$.

## Python Example

We will show how to code up PCA using only numpy and torchvision (only to retrieve some classical datasets). First we start with some imports.

import torchvision.datasets as datasets
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 

Then we define a PCA function that does the work we did in the previous section. The data is inputted as a numpy array. We utilize the np.linalg.svd function for SVD and specify full_matrices=False so that the dimensions of $U$ and $V^T$ align with our computations above.

def pca(data,k=2):
data = data.reshape((data.shape[0], -1))
mu = np.mean(data, axis=0)
std = np.std(data, axis=0).clip(min=1e-10).reshape(1,-1)
A = (data - mu) / std
U, S, Vt = np.linalg.svd(A, full_matrices=False)
V = np.transpose(Vt)
tmp = np.dot(A, V)
return tmp[:,:k],V

We can now load in the MNIST, Fashion MNIST, and CIFAR datasets from torchvision datasets.

mnist = datasets.MNIST(root='./data', train=True, download=True, transform=None)
np_mnist = mnist_trainset.data.numpy()/255.0

fmnist_numpy = fmnist_trainset.data.numpy()/255.0

cifar_np = cifar.data /255.0

We can now perform PCA on these datasets.

pca_mnist,V_mnist = pca(np_mnist)
pca_fmnist , V_fmnist = pca(np_fmnist)
pca_cifar , V_cifar = pca(cifar_np)

We can see how they divide the datasets into their classes. With the following code.

fig, axs = plt.subplots(3, 4, figsize=(15, 6))

for i in range(3):
for j in range(4):
N = i*4+j
axs[i, j].imshow(.5*V_mnist[:,N].reshape((28,28))+.5, cmap="gray")
N +=1
if N == 1:
axs[i, j].set_title(f"{N}st Principal Component")
elif N == 2:
axs[i, j].set_title(f"{N}nd Principal Component")
elif N == 3:
axs[i, j].set_title(f"{N}rd Principal Component")
else:
axs[i, j].set_title(f"{N}th Principal Component")
axs[i, j].axis('off')


We can also see what pixels are activated for each principal component. For instance we have the following for the MNIST data set

and the Fashion MNIST

You are encouraged to think about how I came about these images!

## Intuition

There are two steps to PCA:

• Scaling the columns of the data matrix
• Computing the SVD and transforming the data

Let’s see these steps with a text example to see what’s going on. We generate a bunch of points in 2 dimensional space that lie close to the line $y =2x$, where $x$ is between 0 and 1.

import numpy as np

X = np.zeros((1_000,2))
X[:,0] = np.random.uniform(0,1,1_000)
X[:,1] = 2*X[:,0] + np.random.normal(0,0.1,1_000)
fig = sns.scatterplot(x = X[:,0],y = X[:,1])

Now we perform the scaling step.

Z = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
fig = sns.scatterplot(x=Z[:, 0], y=Z[:, 1])

You can see first of all this shifted the data so that now the center point is $(0,0)$, instead of the original $(.5,1)$. Also, through scaling, the data now lies near the line $y = x$, rather than $y = 2x$.

The scaling step (in particular division by the standard deviation) makes it so we do not favor a feature simply because the coordinates are on average larger.

We can now compute $V$.

Vt = np.linalg.svd(Z, full_matrices=False)[-1]
V = np.transpose(Vt)

This gives

$V= \begin{bmatrix} -0.70710678 & -0.70710678 \\ -0.70710678 & 0.70710678 \end{bmatrix}$

Notice the first column of $V$, $v_1$ is a scalar multiple times

$(1,1).$

and the second column, $v_2$ is a scalar multiple of

$(-1,1)$.

Here is what is going on. In our data set, after scaling, data points lie on the line $y=x$, up to a small error. PCA is a way of automating the process of finding this $y=x$ line. Indeed, this is given by the vector $v_1$. When we compute $v_1 \cdot x$, for a data point $x$, we are getting information on the whereabouts of $x$ lies on the line $y = x$. Indeed, for a point $u = (a , a)$ we have

$u \cdot (1,1) = 2 a,$

so this dot product is able to precisely compute where $x$ is on the line $y = x$.

In higher dimensions, the same general principle holds. PCA allows us to find the line (or plane, or higher dimensional space) that best describes our data. Indeed let’s look at the previous example in 3-dimensions. We just add some small random noise to the additional coordinate

X = np.zeros((1_000,3))
X[:,0] = np.random.uniform(0,1,1_000)
X[:,1] = 2*X[:,0] + np.random.normal(0,0.1,1_000)
X[:,2] = .1*np.random.uniform(0,1,1_000)
Z = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
Vt = np.linalg.svd(Z, full_matrices=False)[-1]
V = np.transpose(Vt)
print(V)

Here we find that the first column of $V$ is, roughly, a scalar multiple times

$(1,1,0).$

Thus PCA still identifies the important relation between the first and second coordinate (after scaling) and disregards the other noise.

## Limitations of PCA

One major limitation of PCA is that it’s only capable of capturing linear relations between the features. This could be handled by combining feature engineering with PCA, though this presents its own set of challenges. One way to overcome this is to invoke non-linear methods of dimension reduction, i.e. t-SNE or UMAP. In the following figure, I ran t-SNE on the MNIST data set (first using PCA to reduce the number of features to 25). We can see it performs much better than PCA at identifying digits.

Furthermore, PCA can lead to unpredictable outcomes that depend on the distribution of the data collected, as studied in this paper of Elhaik which I learned about from the Batch. This highlights that one should be careful making inferences from PCA alone.

# The Square Root Cancellation Heuristic

In the first equation of the popular Attention is all you need paper (see also this blog post), the authors write

${\rm Attention}(Q,K,V) = {\rm softmax} \left( \frac{QK^T}{\sqrt{d_k}} V\right).$

In this post we are going to discuss where the $\sqrt{d_k}$ comes from, leading us to some classical Probability Theory. We will first talk about the math with some examples and then quickly make the connection.

The principle of square root cancellation also appears in the Batch Norm Paper, Neural Network weight initialization (see also Xavier Uniform), and elsewhere.

## The Math

Let $v$ be a $d$-dimensional vector with entries that are either $+1$ or $-1$ Then we may write

$v= (v_1 , \ldots , v_d).$

We will be interested in the simple question: what is the sum of the coordinates? For a fixed vector, we can easily compute this via

$v_1 + \cdots + v_d.$

The largest this sum can be is $d$ and the smallest it can be is $-d$. However, what we would like to know the average behavior of this sum

Since the vector has both $+1$‘s and $-1$‘s, there will likely be cancellation in the sum. That is, we expect the sum will not be either $d$ or $-d$. It turns out that as $d$ gets large, there will typically be a lot of cancellation.

Let’s first run an experiment in python. We let $d = 500$ and compute the sum of 100,000 random vectors each with $\pm 1$ entries.

import numpy as np
import seaborn as sns

d = 500; sums = []

for _ in range(100_000):
v = np.random.choice([-1, 1], size=d)
sums.append(np.sum(v))

_ = sns.histplot(data = sums)

We see that most of the sums are quite a bit smaller than 500, and barely any are smaller than $-75$ or larger than 75.

This is the so-called square root cancellation. If we take a random vector of dimension $d$ with randomly generated $\pm 1$ entries, we expect the order of the sum to be $\sqrt{d}$. Here $\sqrt{500} \approx 22.36$, and it is a general phenomenon that most of the sums will lie between, say, $4\sqrt{d}$ and $\sqrt{d}$. This general principle goes much deeper than $\pm 1$ vectors and turns out to be very well-studied concept in several areas of mathematics (see this post for Number Theory or this post for Probability). In fact, the famous Riemann Hypothesis can be reformulated as showing a certain sum exhibits square root cancellation.

The square root cancellation heuristic asserts that “most” vectors exhibit square root cancellation. Put another way, if we come across a vector in practice, we’d expect there to be square root cancellation as in the above example.

It turns out that the example we will discuss below generalizes considerably. The only thing that really matters is that each coordinate has mean zero (which is easily remedied by translation of a fixed constant). There are also some technical assumptions that the coordinates of $v$ are not too large, but this is never an issue for vectors that only take on finitely many values. The motivated reader can consult the Lindeberg-Lévy Central Limit Theorem for more along this direction.

Before going any deeper, let’s explain what’s going on in the aforementioned Attention is All you Need paper.

## Attention is All You Need Example

In the above equation, we have

$\frac{QK^T}{\sqrt{d_k}} .$

Each entries of $QK^T$ is the dot product a row of $Q$ and a row of $K$:

$q \cdot k = q_1 k_1 + \cdots + q_{d_k} k_{d_k}.$

It turns out that this is the sum of $d_k$ numbers, and by the square root cancellation principle, we expect the sum to be of order $\sqrt{d_k}$. Hence dividing through by this number allows the sum to be, on average, of size comparable to 1 (rather than the much larger $\sqrt{d_k}$.

Keeping the sums from being too large helps us avoid the problem of exploding gradients in deep learning.

## Back to the Math

Recall we had a vector $v$ of $d$ dimensions,

$v = (v_1 , \ldots , v_d),$

and we are interesting in what the sum is, on average. To make the question more precise, we assume each coordinate is independently generated randomly with mean zero. The key assumption is independence, that is the generation of one coordinate does not impact the others. The mean zero assumption can be obtained by shifting the values (for instance instead of recording a six sided dice roll, record the dice roll minus 3.5).

Here are some examples in python that you are welcome to check

v1 = np.random.choice([1,2,3,4,5,6],size=500) - 3.5
v2 = np.random.exponential(1,size=500) - 1
v3 = np.random.normal(0,1,size=500) 

For such a vector, and $d$ a bit large (say over 30), we expect

$|v_1 + \cdots + v_d| \approx \sqrt{v_1^2 + \cdots + v_d^2}.$

Specializing to the case where the coordinates are $\pm 1$, we recover the $\sqrt{d}$ from earlier. Now why should we expect that the sum is of size $\sqrt{d}$?

One way to see this is to explicitly compute the expected value of

$\left( v_1^2 + \cdots + v_d^2 \right),$

which is precisely $d$. This computation can be done by expanding the square and using that the covariance of two different coordinates is 0. We will put the classical computation at the end of the post for those interested.

It is a common theme in this area that working with the square of the sum is theoretically much easier than working with the sum itself.

## The Central Limit Theorem

It is worth reiterating that the square root cancellation heuristic generalizes much further than the $\pm 1$ example, but let’s continue our focus there. The astute reader will notice that the above histplot looked like the normal distribution. In fact, it is. If we consider the sum

$\frac{v_1 + \cdots + v_d}{\sqrt{d}},$

the histplot will be normal with mean zero and variance. Put another way,

$v_1 + \cdots + v_d \approx \sqrt{d},$

and we can quantify the $\approx$ precisely. Thus if we have a sum of $d$ numbers (where we expect the numbers to be on average 0), we can divide the sum by $\sqrt{d}$ in order to make the sum $\approx 1$.

## The Variance Computation

We will show

$E[ \left( v_1^2 + \cdots + v_d^2 \right) ] = \sum_{j} E[v_j^2]$

Indeed,

$E[ \left( v_1^2 + \cdots + v_d^2 \right) ]= E[\sum_{ i,j} v_i v_j ] =\sum_{ i,j} E[v_i v_j ] ,$

using FOIL from algebra and linearity of expectation. It is thus enough to show,

$\sum_{ i,j} E[v_i v_j ] = 0,$

for every $i \neq j$. But this follows from independence.