Temperature in Natural Language Processing

In Machine Learning, and in particular Generative AI, temperature is a useful hyperparameter for tuning model outputs.

In this post, we will discuss the following.

• What temperature is: a parameter developers can use to alter the outputs of large language models.
• How a higher temperature gives more varied, more "creative" outputs.
• Why changing the temperature is useful.

Let’s start with an example (the code I used is at the end of this post). Consider the following prompt, taken from Google’s Minerva paper.

A line parallel to $y = 4x + 6$ passes through $(5, 10)$. What is the $y$-coordinate of the point where this line crosses the $y$-axis?

I asked ChatGPT four times to answer this prompt, similar to what was done in their paper. Here is what I got:

• Since a line parallel to $y=4x+6$ has the same slope, the desired line has slope 4. Therefore, the equation of the line is of the form $y=4x+b$. Plugging in the point $(5, 10)$, we have $10=4(5)+b$, so $b=10-4(5)=-10$. Therefore, the $y$-intercept is $\boxed{-10}$.
• Since the line is parallel to $y = 4x + 6$, which has a slope of 4, the line must have a slope of 4. The equation of the line is then $y - 10 = 4(x - 5) = 4x - 20$. Setting $x = 0$, we find $y = -20 + 10 = \boxed{-10}$.
• The given line has slope 4, so a line parallel to it also has slope 4. If a line with slope 4 passes through $(5,10)$, then it is of the form $y-10=4(x-5)$. Expanding the right side gives $y-10=4x-20$, so $y=4x-10$. Setting $x=0$, we find that the line crosses the $y$-axis at $\boxed{-10}$.
• Since a line parallel to $y=4x+6$ will have the same slope, this line will have slope equal to 4. Therefore, its $y$-intercept has $y$-coordinate $\boxed{-14}$.

The first three answers are correct, while the last one is wrong. In fact, the last answer is correct all the way up to the 4 in $-14$. What likely happened under the hood is that ChatGPT was nearly sure the digit should be a 0, but still gave some probability to it being a 4. In that particular instance, the random draw landed on the 4.

To understand temperature, we first have to understand a bit about how large language models work. For instance, the G in GPT stands for generative. In practice, this means that given some text, the large language model will try to predict the next letter (or more precisely, the next token; see this previous blog post). For a model like GPT, each candidate character is assigned a probability of being the next character. For instance, in the example above, the model likely assigned a large probability to the character 0 and a small, but positive, one to the character 4. It’s at this point that the concept of temperature becomes useful.

Now that we have a probability assigned to each character, we have to define a methodical way of choosing the next character. Do we just pick the character with the highest probability?
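To make the question concrete, here is a minimal sketch of the two extremes: always taking the most likely character, versus drawing a character at random according to its probability. The probabilities (0.95 for "0", 0.05 for "4") and the function names are made up for illustration; they are not the values ChatGPT actually used.

```python
import random

# Hypothetical next-character probabilities (illustrative, not from a real model).
next_token_probs = {"0": 0.95, "4": 0.05}


def greedy_choice(probs):
    """Always pick the single most likely character."""
    return max(probs, key=probs.get)


def sampled_choice(probs, rng):
    """Pick a character at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the run is repeatable
greedy = greedy_choice(next_token_probs)
samples = [sampled_choice(next_token_probs, rng) for _ in range(1000)]
share_of_4 = samples.count("4") / len(samples)

print("greedy pick:", greedy)
print("share of sampled 4s:", share_of_4)
```

The greedy pick is always "0", while the sampled version occasionally returns "4", which is the kind of slip seen in the fourth answer above.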

Unfortunately, there is no “one size fits all” solution to this problem, which is why we introduce the notion of temperature. Let’s look at an example. Suppose we are choosing between two characters to output next. Suppose further that our model overwhelmingly thinks the first character is the best choice. We plot how the temperature affects our choice of character in this example.

Here, for a low temperature (i.e. $\theta$ close to 0), the model outputs the first character nearly all the time. But as the temperature grows larger, the model outputs each character around half the time.

Thus as we decrease the temperature, we get closer to the model that only outputs the highest probability character. As we increase the temperature, we get closer to the model that chooses each character randomly and uniformly.
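This behavior can be reproduced numerically. The sketch below assumes the standard softmax-with-temperature definition, $p_i = e^{z_i/\theta} / \sum_j e^{z_j/\theta}$, where the $z_i$ are the raw scores (logits) the model gives each character; this post does not state the exact formula a given model uses, so treat it as an assumption. The logits `[3.0, 0.0]` and the function name are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    if temperature <= 0:
        # Temperature 0 is usually treated as "always take the argmax".
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores: the model strongly prefers the first character.
logits = [3.0, 0.0]

low = softmax_with_temperature(logits, 0.1)
mid = softmax_with_temperature(logits, 1.0)
high = softmax_with_temperature(logits, 100.0)

for name, probs in [("0.1", low), ("1.0", mid), ("100.0", high)]:
    print(f"temperature {name}: P(first) = {probs[0]:.4f}")
```

At a temperature of 0.1 the first character gets essentially all of the probability, at 1.0 it gets about 95%, and at 100 the split is close to even.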

And just for completeness, I will mention that the temperature 0 response I got from ChatGPT in the above example was in alignment with the three correct answers.

Why is Temperature Useful?

The argument made in the aforementioned Minerva paper was that by increasing the temperature, we can have the model generate a variety of outputs. From this variety of outputs, we may pick the “best” one. How we choose the best one can vary, but what they did was simply take the most common one.

This allows us to explore the space of answers the generative model can produce, and so make a more informed decision about which one to proceed with.
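As a rough illustration of picking the most common answer, the sketch below takes a majority vote over the four ChatGPT responses from the start of this post (abbreviated here). The helper names and the choice to read the answer from `\boxed{...}` are my assumptions for this example, not the Minerva authors' code.

```python
import re
from collections import Counter


def extract_boxed(text):
    """Return the contents of the last boxed answer in text, or None."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1] if matches else None


def majority_answer(responses):
    """Return the most common boxed answer across all responses."""
    answers = [a for a in map(extract_boxed, responses) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


# Abbreviated versions of the four responses shown earlier.
responses = [
    r"... Therefore, the $y$-intercept is $\boxed{-10}$.",
    r"... Setting $x = 0$, we find $y = -20 + 10 = \boxed{-10}$.",
    r"... the line crosses the $y$-axis at $\boxed{-10}$.",
    r"... its $y$-intercept has $y$-coordinate $\boxed{-14}$.",
]

best = majority_answer(responses)
print(best)
```

Three of the four responses agree on -10, so the vote discards the one sample that slipped to -14.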

The Code

Here is the Python code I used to generate the example above. First I created a .env file in the same directory containing my OpenAI API key (fill in your own key):

`OPENAI_API_KEY=`

Then I used LangChain (though this is not really required for such a simple example) as follows:

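```python
import os

from dotenv import load_dotenv  # assumes the python-dotenv package is installed
from langchain.chat_models import ChatOpenAI
from langchain.chains import LLMChain
from langchain import PromptTemplate

# Load the variables from the .env file into the environment.
load_dotenv()
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
```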
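```python
# The prompt from the example above.
query = "A line parallel to $y = 4x + 6$ passes through $(5, 10)$. What is the $y$-coordinate of the point where this line crosses the $y$-axis?"

# The prompt has no input variables, so the template is used as-is.
prompt = PromptTemplate.from_template(query)

# ChatOpenAI reads OPENAI_API_KEY from the environment.
llm = ChatOpenAI(temperature=1)

chain = LLMChain(llm=llm, prompt=prompt)

# Sample several independent responses at temperature 1.
responses = [chain.run({}) for _ in range(5)]
```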