Category Archives: Artificial Intelligence

Temperature in Natural Language Processing

In Machine Learning, and in particular Generative AI, temperature is a useful hyperparameter for tuning model outputs.

In this post, we will discuss the following.

  • Temperature is a parameter developers can use to alter outputs from Large Language Models.
  • With a higher temperature we get more creative outputs.
  • Why changing the temperature is useful.

Let’s start with an example (the code I used is at the end of this post). Consider the following prompt, taken from Google’s Minerva paper.

A line parallel to y = 4x + 6 passes through (5, 10). What is the y-coordinate of the point where this line crosses the y-axis?

I asked ChatGPT four times to answer this prompt, similar to what was done in their paper. Here is what I got:

  • Since a line parallel to y=4x+6 has the same slope, the desired line has slope 4. Therefore, the equation of the line is of the form y=4x+b. Plugging in the point (5, 10), we have 10=4(5)+b, so b=10-4(5)=-10. Therefore, the y-intercept is \boxed{-10}.
  • Since the line is parallel to y = 4x + 6, which has a slope of 4, the line must have a slope of 4. The equation of the line is then y - 10 = 4(x - 5) = 4x - 20. Setting x = 0, we find y = -20 + 10 = \boxed{-10}.
  • The given line has slope 4, so a line parallel to it also has slope 4. If a line with slope 4 passes through (5,10), then it is of the form y-10=4(x-5). Expanding the right side gives y-10=4x-20, so y=4x-10. Setting x=0, we find that the line crosses the y-axis at \boxed{-10}.
  • Since a line parallel to y=4x+6 will have the same slope, this line will have slope equal to 4. Therefore, its y-intercept has y-coordinate \boxed{-14}.

The first three answers are correct, while the last answer is wrong. In fact, the last answer is correct all the way up to the 4 in 14. What likely happened under the hood is that ChatGPT was nearly sure the 4 should be a 0, but it still assigned some probability to a 4. In that particular instance, it randomly chose the 4.

To understand temperature, we first have to understand a bit about how large language models work. For instance, the G in GPT stands for generative. In practice, this means that given some text, the large language model will try to predict the next letter (or more precisely token, see this previous blog post). For a model like GPT, each letter is assigned a probability of being the next character. For instance, in the example above, the model likely assigned a large probability to the character 0 and a small, but positive, one to the character 4. It’s at this point the concept of temperature is useful.

Now that we have a probability assigned to each character, we have to define a methodical way of choosing the next character. Do we just choose the character with the highest probability?

Unfortunately, there is no “one size fits all” solution to this problem, which is why we introduce the notion of temperature. Let’s look at an example. Suppose we are choosing between two characters to output next. Suppose further that our model overwhelmingly thinks the first character is the best choice. We plot how the temperature affects our choice of character in this example.

Here, for a low temperature (i.e. theta close to 0), the model outputs the first character nearly all the time. But as the temperature grows larger, the model outputs each character around half the time.

Thus as we decrease the temperature, we get closer to the model that only outputs the highest probability character. As we increase the temperature, we get closer to the model that chooses each character randomly and uniformly.
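To make this concrete, here is a minimal sketch of one common way temperature is applied: divide the log-probabilities by the temperature and renormalize. I am assuming this softmax-style rule for illustration; the function name apply_temperature and the probabilities 0.95 and 0.05 are made up, and the exact formula behind the plot above may differ.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a probability distribution by a temperature > 0."""
    # Divide each log-probability by the temperature, then renormalize.
    scaled = [math.log(p) / temperature for p in probs]
    top = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical model output: the first character is strongly preferred.
probs = [0.95, 0.05]
for t in (0.1, 1.0, 10.0):
    print(t, [round(p, 3) for p in apply_temperature(probs, t)])
```

With a small temperature nearly all of the mass lands on the first character, a temperature of 1 leaves the probabilities unchanged, and a large temperature pushes both toward one half.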

And just for completeness, I will mention that the temperature 0 response I got from ChatGPT in the above example was in alignment with the three correct answers.

Why is Temperature Useful?

The argument made in the aforementioned Minerva paper was that by increasing the temperature, we can have the model generate a variety of outputs. From this variety of outputs, we may pick the “best” one. How we choose the best one can vary, but what they did is just take the most popular one.

This allows us to explore the probability space of answers generated by the model in order to make a more informed decision about which one to proceed with.
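As a rough sketch of the “most popular answer” idea, the snippet below pulls the \boxed{...} value out of each sampled response and returns the most common one. The helper names extract_boxed and majority_vote are my own, and the responses are shortened versions of the four samples from the start of the post.

```python
import re
from collections import Counter

def extract_boxed(answer):
    """Return the contents of the last \\boxed{...} in a response, or None."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", answer)
    return matches[-1] if matches else None

def majority_vote(responses):
    """Pick the most common boxed answer among the sampled responses."""
    finals = [a for a in map(extract_boxed, responses) if a is not None]
    return Counter(finals).most_common(1)[0][0]

# Shortened versions of the four sampled responses from the example.
samples = [
    r"... Therefore, the y-intercept is \boxed{-10}.",
    r"... we find y = -20 + 10 = \boxed{-10}.",
    r"... crosses the y-axis at \boxed{-10}.",
    r"... has y-coordinate \boxed{-14}.",
]
print(majority_vote(samples))  # -10
```

Here the single wrong sample is outvoted by the three correct ones.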

The Code

Here is the Python code I used to generate the example above. First I created a .env file in the same directory with my OpenAI API key (fill out with your API key).


Then I used Langchain (though this is not really required for such a simple example) as follows:

import os
from langchain.chat_models import ChatOpenAI
from langchain.chains import LLMChain
from langchain import PromptTemplate

# ChatOpenAI picks up the key from the OPENAI_API_KEY environment variable.
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
query = "A line parallel to $y = 4x + 6$ passes through $(5, 10)$. What is the $y$-coordinate of the point where this line crosses the $y$-axis?"

prompt = PromptTemplate.from_template(query)
llm = ChatOpenAI(temperature=1)

chain = LLMChain(llm=llm, prompt=prompt)

# Sample five responses; the prompt has no input variables, hence the empty dict.
responses = [chain.run({}) for _ in range(5)]

How Does ChatGPT read?

How would ChatGPT read the infamous “Hello, World!”? Does it see each character sequentially:

H e l l o , W o r l d !

Or maybe it sees each word as well as the punctuation:

Hello , World !

By the end of this post we will have a full understanding of this. On the way, we will learn about Unicode, UTF-8, and byte pair encoding (BPE).

In order to understand how ChatGPT sees data, we have to understand the data on which it is trained. The majority of the data used to train GPT-3 comes from the Common Crawl dataset, which is text scraped from the internet. Thus we turn our attention to understanding how text is encoded on the web.

Code Points

In order for our computers to store and transfer text, we need a way of converting characters (i.e. elements of an alphabet, punctuation, etc.) to bits. Thanks to binary numbers, it is enough to convert characters to integers (though encoding schemes like the popular UTF-8 provide a more complex and efficient conversion of code points to bits, as we will see later).

Thus we first turn our attention to mapping characters to integers, which is called a character encoding. This leads us to Unicode.

A Brief History of Unicode

One of the earliest widely used character encodings was ASCII (pronounced as-kee), which stands for the American Standard Code for Information Interchange. One key problem with it is already evident from the name: what if non-Americans would like to exchange information?

ASCII provides code points for 128 characters, including the English alphabet and common punctuation. ASCII is typically sufficient for sending English messages. You can get the ASCII encoding of the letter A (and vice versa) in Python with the following built-in functions.

print(ord("A")) #ASCII code point of A
print(chr(65)) #character of code point 65

In addition to the aforementioned symbols, there are also code points that correspond to non-printable information, which can cause some confusion.

ASCII contains most of the characters you will need if your goal is to communicate in English, and was widely adopted in the 1960s. However, ASCII cannot support languages with different alphabets, accented characters, emojis, and more.

Thus a group of people set out to create a more inclusive standard for representing text that was also backwards compatible with the already widely adopted ASCII. After several iterations, Unicode is now the widely adopted standard. It is supported by a variety of blue chip companies, as can be seen from their members page.

What is Unicode?

Unicode is a way to convert nearly 150,000 characters to integers. For instance, here is a nice list of the integer to character conversions. You can input Unicode directly into HTML via &# followed by the decimal representation and a semicolon. For instance,

<p> &#70000; </p>
<p> &#129312; </p>

render as 𑅰 and 🤠, respectively.
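The same correspondence can be checked in Python with the built-in chr and ord, which work for any Unicode code point and not just ASCII. This is only a small illustration; the variable names are mine.

```python
cowboy = "🤠"
code_point = ord(cowboy)   # decimal code point of the cowboy emoji

print(chr(70000))          # the character behind the HTML reference &#70000;
print(code_point)          # decimal value to put after &# in HTML
print(hex(code_point))     # the same code point in hexadecimal
```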

You can also directly write unicode on your local machine by following a tutorial (Mac, Windows, and Linux).

Thus Unicode extends ASCII to accommodate nearly all desired written text with nearly 150,000 characters assigned a code point. It turns out this encoding plays a large part in encoding text in the web and consequently the training of ChatGPT. But before we see this connection, we have to discuss UTF-8.


While Unicode is accommodating in terms of encoded characters, it is not terribly efficient. For instance, if you plan to write in mostly ASCII, it would make sense to make those characters require a smaller amount of space to encode. This is exactly the purpose that the UTF-8 encoding serves.

Recall that to store and send text, one needs to convert to bits. In practice, we work with bytes, each of which is just 8 bits. As 8 bits gives 2^8 = 256 possibilities, all of ASCII can be represented by 1 byte (with room to spare). UTF-8 is an attempt to convert the Unicode code points to bytes in an efficient manner.

A byte can be represented by two hexadecimal digits. For instance, the numbers 0-20 are written in hexadecimal as:

0 1 2 3 4 5 6 7 8 9 A B C D E F 10 11 12 13 14

So we can represent the character H, which is 72 in ASCII, by 48 in hexadecimal. UTF-8 converts each Unicode code point to 1, 2, 3, or 4 bytes. For instance, the UTF-8 encoding of

Hello 🤠

is

48 65 6c 6c 6f 20 f0 9f a4 a0

Note that the first 5 bytes correspond to H e l l o, the sixth byte (20) is the space, and the last 4 correspond to 🤠. UTF-8 is set up in such a way that it is clear the last 4 bytes are all part of one character.
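You can reproduce these bytes with Python's built-in str.encode. The short sketch below prints the hexadecimal bytes and then the number of bytes UTF-8 uses for each character.

```python
text = "Hello 🤠"
encoded = text.encode("utf-8")
print(encoded.hex(" "))  # bytes as space-separated hex pairs

# Number of bytes UTF-8 uses for each character.
for char in text:
    print(repr(char), len(char.encode("utf-8")))

# Decoding the bytes recovers the original text.
assert encoded.decode("utf-8") == text
```

The ASCII characters take one byte each, while the emoji takes four.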

Back to ChatGPT

Unicode has the advantage of being able to methodically encode nearly all text on the web to integers. An integer turns out to be perfect for inputting into a machine learning model. However, inputting raw Unicode, which consists of nearly 150,000 characters, would be inefficient and beyond current computational power. For instance, the encoding used for ChatGPT that we will discuss below has 100,261 tokens. Thus it is convenient to have a clever way of converting text to integers that goes beyond Unicode.

Byte Pair Encoding (BPE) is a preprocessing step that allows us to identify subwords that appear often in the text. The starting point of BPE, as the name suggests, is bytes. We start by encoding every single byte to an integer 0-255, which we call a token. Thus any Unicode text can be written as a sequence of tokens via the UTF-8 encoding. For instance, the Hello 🤠 above can be tokenized to

72 101 108 108 111 32 240 159 164 160

However, we can make this more efficient by adding additional tokens. For instance, the word “to” appears quite often in English text. However, it is currently encoded as

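83 78

Note that 83 and 78 are not the raw byte values of t and o (116 and 111); they appear to be the ids tiktoken assigns to those two single-byte tokens. What we can do is create a new token for the word “to” so that instead of using 2 tokens for this common word, we only use one (this is the “pair” in BPE). Using tiktoken, released by OpenAI, we can see that this is exactly what was done.

# Requires tiktoken: pip install tiktoken
import tiktoken
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")  # gpt-3.5-turbo is the ChatGPT model
print(enc.encode("to"))  # token id(s) for the string "to"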

Running this code, we see that the integer 998 is reserved for to.

Byte Pair Encoding

So how exactly is the byte pair encoding performed? We will give a brief explanation and note that the details can be found in ~20 lines of Python code in Algorithm 1 of this paper of Sennrich, Haddow, and Birch, and are also explained in Section 2 of the GPT-2 paper.

We start by taking a smallish sample of our text data. We then convert the text to bytes via the UTF-8 encoding. After this we see which pair of bytes appears the most often and assign a new token to that pair. We can see the first pair with the following Python code (continued from above).

print([x for x in enc.decode_bytes([256])])

The result is 32 32, which corresponds to two consecutive spaces. BPE then repeats this process, with the possibility of joining the newly created token to any other token. In fact, this is repeated over 100,000 times!

The first non-space join is that of i and n to form “in” (token 258).
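To illustrate the loop described above, here is a toy sketch of BPE training on raw UTF-8 bytes. It is a simplification and not the tokenizer OpenAI actually trained: it skips the pre-splitting of text into words, runs on a tiny made-up sentence, and the function names (most_frequent_pair, merge_pair, train_bpe) are my own.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the most common adjacent pair of tokens."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair, new_token):
    """Replace every non-overlapping occurrence of `pair` with `new_token`."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(new_token)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges starting from the UTF-8 bytes of `text`."""
    tokens = list(text.encode("utf-8"))  # initial tokens are the bytes 0-255
    merges = {}
    for step in range(num_merges):
        if len(tokens) < 2:
            break
        pair = most_frequent_pair(tokens)
        new_token = 256 + step  # next unused integer
        merges[pair] = new_token
        tokens = merge_pair(tokens, pair, new_token)
    return tokens, merges

tokens, merges = train_bpe("in the inn, the inner pin", num_merges=3)
print(merges)
print(tokens)
```

In this toy sentence the pair (105, 110), i.e. the bytes for i and n, is the most frequent, so it becomes token 256. Later merges can combine that new token with its neighbors, which is how longer subwords emerge.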

It is worth mentioning that BPE is not the only method of tokenizing. For instance, Google’s Bard uses SentencePiece.

One Issue

It is well known that not every byte sequence is valid UTF-8. Thus, it is possible in theory for ChatGPT to produce invalid UTF-8. Of course, this becomes increasingly rare as the model is trained more and more. In fact, the decoder provided by tiktoken has a kwarg to specify how to address this exact issue.
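Here is a small standard-library illustration of the problem, using Python's bytes.decode and its errors argument rather than tiktoken itself. I am assuming the tiktoken kwarg plays a similar role. Cutting the four-byte emoji in half leaves a fragment that is not valid UTF-8.

```python
cowboy = "🤠".encode("utf-8")   # four bytes: f0 9f a4 a0
truncated = cowboy[:2]          # cut the character in half

try:
    truncated.decode("utf-8")   # strict decoding rejects the fragment
except UnicodeDecodeError as error:
    print("invalid UTF-8:", error.reason)

# errors="replace" substitutes U+FFFD for the undecodable bytes instead.
print(truncated.decode("utf-8", errors="replace"))
```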


To see how ChatGPT is trained, we first have to understand the data. The data is scraped from the web, which led us to Unicode and the UTF-8 encoding. Unicode covers nearly 150,000 characters, and feeding it to a model directly is inefficient. This motivates looking at a compression technique, i.e. the Byte Pair Encoding.

Using AI to Write Math

Unfortunately (or perhaps, fortunately), we are still far from the days where we can ask a computer to write proofs for us. However, there are tools available today that can concretely assist with writing mathematics.

I made a video on the topic, with blog post below.

While I left research mathematics some time ago, I still find myself typing up some math from time to time. For me this has become a bit easier with GitHub Copilot.

I am now using VSCode as my text editor to write LaTeX. VSCode is a free and powerful IDE used by millions of software engineers. With it, one can access GitHub Copilot (for $100/year).

GitHub Copilot saves quite a bit of time when writing LaTeX. Let’s see it in action suggesting useful LaTeX code for a matrix.

The lighter text “\begin{pmatrix}” is what is suggested by GitHub Copilot. You can simply hit tab to accept the suggestion or keep typing to reject it. Let’s keep accepting:

This eventually gives the final result.

As you can see, it quite accurately gives the LaTeX code for a matrix, and only takes about 3 seconds in real time to do so.

Funnily enough, it suggests some nonsense afterwards, about A having integer entries and some other things.

VSCode has a lot of flexibility in itself. There are many extensions that are useful. For instance, to work with LaTeX, you’ll at the very least need a LaTeX and PDF viewer extension. There is also Git integration, which allows you to store your work remotely with the click of a couple of buttons.

Another reason to get familiar with VSCode is that it is a powerful tool for programming. That way, if you do ever need to do some programming, you will already have a leg up on getting started!
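For reference, LaTeX code for a matrix of this kind looks like the following. The 3x3 size and the entries are an illustrative assumption on my part, not the exact code Copilot produced.

```latex
\[
A = \begin{pmatrix}
a_{11} & a_{12} & a_{13} \\
a_{21} & a_{22} & a_{23} \\
a_{31} & a_{32} & a_{33}
\end{pmatrix}
\]
```

Typing out the repeated rows by hand is exactly the kind of mechanical work where a suggestion engine saves time.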

I started a YouTube Channel

My First Video

I am happy to announce that I just posted my first video to a new YouTube Channel. The result is an interview with Xavier Garcia about how ChatGPT works.

What’s Next?

I’ll continue to post videos surrounding Machine Learning, Data Science, and perhaps elsewhere. I plan to do more interviews, individual videos about relevant topics, as well as some educational material.

The Process

I was already meeting with Xavier regularly to discuss Machine Learning topics, and so I thought to myself that other people might benefit from our discussions.

Our discussion for the video was perhaps a bit different from usual, as we were a bit less likely to say things “off the cuff.” I’ll be looking to make this easier going forward, as I think a lot of valuable insight lies here.

There was/is a learning curve in making videos, and I’ll definitely be looking to improve the quality as time goes forward. From zero to published video took a couple months (of course not doing this full time). Also, I got lucky and found a very helpful video editor which made things a lot easier. Things like sound quality, video quality, and lighting are not things I’ve had to worry about too much in the past. I didn’t invest too much in equipment, just trying to make use of what I had.

Can Chat-GPT Do Math?

Chat-GPT is an impressive new AI chatbot released by OpenAI. Impressive applications of it can be found all over the internet. But can it do math?

By math, we do not mean simply perform computations. Its own design ensures that there will be computational problems it will be unable to solve. What I am more interested in is if it can solve problems that require some mathematical reasoning.

To choose our problems, we use the MMLU dataset. Galactica, a large language model recently released by Meta AI, has achieved some good results on this dataset. Their findings are in section 5.3 of their paper.

In what follows, we’ll pick a few problems and see how Chat-GPT handles them. If it fails, we will try to reword the question in hopes of achieving some partial results.

We note that the MMLU dataset has multiple choice questions, but we just present the questions to chat-GPT. We choose questions only from the abstract algebra section of the MMLU. Afterwards, we’ll prod chat-GPT for intuition and a proof. We’ll end by suggesting a few applications of chat-GPT for mathematicians.

Question 1

Me: The inverse of -i in the multiplicative group, {1, -1, i , -i } is

Chat-GPT: The inverse of -i in the multiplicative group {1, -1, i, -i} is i. The inverse of an element x in a group is an element y such that xy=yx=1, where 1 is the identity element of the group. In this group, 1 is the identity element, so the inverse of an element x is an element y such that x*y=1.

For example, the inverse of i is -i, because i*(-i)=1. Similarly, the inverse of -i is i, because (-i)*i=1.

Remark: Chat-GPT answers this question perfectly and gives an excellent explanation. To be fair, this is an easier question.
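This one is easy to double check by brute force with Python's built-in complex numbers. The sketch below is my own check, not part of the chat; the helper name inverse is made up.

```python
group = [1, -1, 1j, -1j]

def inverse(x, elements):
    """Return the element y of `elements` with x * y == 1."""
    return next(y for y in elements if x * y == 1)

for x in group:
    print(x, "has inverse", inverse(x, group))
```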

Question 2

Me: How many homomorphisms are there of Z into Z_2? Please write in latex format.

Chat-GPT: There are two homomorphisms of \mathbb{Z} into \mathbb{Z}_2. They are defined as follows:

f_1:\mathbb{Z}\rightarrow\mathbb{Z}_2 such that f_1(n)=n\pmod{2} for all n\in\mathbb{Z}

f_2:\mathbb{Z}\rightarrow\mathbb{Z}_2 such that f_2(n)=0 for all n\in\mathbb{Z}

Remark: Again, Chat-GPT is spot on. However, I asked this question a few times and it once said 1, and another time said two but gave a false justification:

There are only two homomorphisms of Z (the integers) into Z2 (the integers modulo 2).

One homomorphism sends every integer to 0, and the other homomorphism sends every integer to 1.
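To see why that justification fails, we can test the additive property f(a + b) = f(a) + f(b) mod 2 on a range of integers. This is a finite sanity check written for illustration, not a proof, and the name is_additive is my own.

```python
def is_additive(f, bound=20):
    """Check f(a + b) == f(a) + f(b) (mod 2) for all |a|, |b| <= bound."""
    values = range(-bound, bound + 1)
    return all(f(a + b) == (f(a) + f(b)) % 2 for a in values for b in values)

candidates = {
    "n mod 2": lambda n: n % 2,
    "constant 0": lambda n: 0,
    "constant 1": lambda n: 1,
}
for name, f in candidates.items():
    print(name, is_additive(f))
```

The first two maps pass, while the constant 1 map fails: a homomorphism must send 0 to 0, and more generally it is determined by where it sends 1.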

Question 3

Me: Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the index of <p> in S_5. Please write your answer in latex.


Chat-GPT: The index of <p> in S_5 is the number of elements in S_5 that come before <p> in lexicographic order.

To find the index of <p>, we must first determine the cycle type of p. The cycle type of a permutation is a list of the lengths of its disjoint cycles, listed in decreasing order.

The cycle type of p is [2,2], since p consists of a 2-cycle and a 2-cycle.

Now we can use the cycle index of S_5 to determine the number of permutations in S_5 with cycle type [2,2].

The cycle index of S_5 is given by:

Z(S_5) = \frac{1}{5!} \sum_{i=1}^{5} a_i p_i^{5}

where a_i is the number of permutations in S_5 with i cycles, and p_i is a variable representing a cycle of length i.

Substituting the cycle type [2,2] into the cycle index gives:

\frac{1}{5!} (p_2^2 + p_2^2)

This simplifies to:

\frac{1}{120} (p_2^2 + p_2^2) = \frac{1}{60} (2p_2^2) = \frac{1}{30} p_2^2

The coefficient of p_2^2 in this expression is \frac{1}{30}, so the number of permutations in S_5 with cycle type [2,2] is \frac{1}{30}.

Since <p> is the only permutation in S_5 with cycle type [2,2], it is the \frac{1}{30}-th permutation in S_5.

Therefore, the index of <p> in S_5 is \boxed{\frac{1}{30}}.

Remark: Whoa! This is very wrong! The very first sentence is completely wrong. Also, its final conclusion, 1/30, is not even an integer, which is certainly not right. Indeed, the index of a subgroup counts something and must be a positive integer.

We are able to salvage something though. Suppose we computed that the order of <p> is 5, using our algebra skills. Then we can prompt chat-GPT with the following.

Me: Now suppose H \leq S_5 is a subgroup of order 5. What is the index of H in S_5?

Chat-GPT: The index of a subgroup H in a group G is the number of left cosets of H in G. This is also equal to the number of right cosets of H in G.

Since the order of a subgroup divides the order of the group, the order of H must divide the order of S_5, which is 120. The only divisors of 120 that are less than 120 are 1, 2, 3, 4, 5, 6, 8, 10, 12, 15, 20, 24, 30, 40, and 120. Therefore, the only possible orders for H are 1, 2, 3, 4, 5, 6, 8, 10, 12, 15, 20, 24, 30, 40, and 120.

Since the order of H is 5, the index of H in S_5 is \frac{120}{5} = \boxed{24}.

Remark: Here chat-GPT answers perfectly with a great explanation. Perhaps it would be nice to mention that the main fact being used is Lagrange’s theorem, though this can easily be obtained by asking it “What theorem did you use above?”
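For completeness, here is a short sketch that computes the order of p directly and then the index via Lagrange’s theorem. The helper names are mine, and I compose right to left (apply (2, 3) first); with this p the other convention also gives a 5-cycle.

```python
from math import factorial

def cycle_to_map(cycle, n=5):
    """Turn a cycle such as (1, 2, 5, 4) into a dict on {1, ..., n}."""
    mapping = {k: k for k in range(1, n + 1)}
    for a, b in zip(cycle, cycle[1:] + cycle[:1]):
        mapping[a] = b
    return mapping

def compose(f, g):
    """Return the permutation 'apply g first, then f'."""
    return {k: f[g[k]] for k in g}

def order(perm):
    """Smallest m >= 1 such that perm^m is the identity."""
    identity = {k: k for k in perm}
    power, m = dict(perm), 1
    while power != identity:
        power, m = compose(perm, power), m + 1
    return m

p = compose(cycle_to_map((1, 2, 5, 4)), cycle_to_map((2, 3)))
print(order(p))                  # order of the cyclic subgroup <p>: 5
print(factorial(5) // order(p))  # index of <p> in S_5: 24
```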

But what’s the intuition?

What makes a research mathematician great is a certain “intuition” for problems. Let’s see if chat-GPT can explain any of that.

Me: Can you explain the intuition behind the Cauchy-Schwarz inequality? Please write any math formula in latex.

Chat-GPT: Sure! The Cauchy-Schwarz inequality is a useful inequality that arises in various areas of mathematics, such as vector algebra, inner product spaces, and probability theory. It states that for any two vectors x and y, the following inequality holds:

\left(\sum_{i=1}^n x_i y_i\right)^2 \le \left(\sum_{i=1}^n x_i^2\right)\left(\sum_{i=1}^n y_i^2\right)

Intuitively, the Cauchy-Schwarz inequality states that the inner product of two vectors x and y is “controlled” by the lengths of the vectors x and y. To see this, consider the special case where x and y are unit vectors (i.e., vectors of length 1). In this case, the inequality becomes:

\left(\sum_{i=1}^n x_i y_i\right)^2 \le \left(\sum_{i=1}^n x_i^2\right)\left(\sum_{i=1}^n y_i^2\right) = 1 \cdot 1 = 1

which means that the inner product of x and y is always less than or equal to 1. This is consistent with our intuition that the inner product of two unit vectors should be “small” if they are “orthogonal” (i.e., perpendicular) and “large” if they are “parallel” (i.e., pointing in the same direction).

I hope this helps! Let me know if you have any questions.

Remark: This is not too bad, but could certainly be improved. For instance, explaining when equality holds in Cauchy-Schwarz (when one vector is a multiple of the other) helps give a more precise meaning to the last paragraph. One can see the first few paragraphs of this blog post of Terence Tao for an explanation from one of the world’s leading mathematicians. Also, here is a math stack exchange on the topic.
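The equality case is easy to see numerically. In the sketch below (the vectors are arbitrary examples of mine), the gap between the two sides of the inequality is positive for a generic pair and exactly zero when one vector is a multiple of the other.

```python
def dot(x, y):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))

def cauchy_schwarz_gap(x, y):
    """Right-hand side minus left-hand side; nonnegative by Cauchy-Schwarz."""
    return dot(x, x) * dot(y, y) - dot(x, y) ** 2

x = [1, 2, 3]
print(cauchy_schwarz_gap(x, [4, -1, 2]))  # 230: strict inequality
print(cauchy_schwarz_gap(x, [2, 4, 6]))   # 0: the second vector is 2 * x
```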

Prove it.

Me: Can you prove it?

Chat-GPT: Sure! Here is a proof of the Cauchy-Schwarz inequality:

First, we can rewrite the inequality as:

\left(\sum_{i=1}^n x_i y_i\right)^2 - \left(\sum_{i=1}^n x_i^2\right)\left(\sum_{i=1}^n y_i^2\right) \le 0

Next, we can expand the left-hand side as follows:

\sum_{i=1}^n x_i^2 y_i^2 + \sum_{i \ne j} x_i x_j y_i y_j - \sum_{i=1}^n x_i^2 y_i^2 - \sum_{i \ne j} x_i^2 y_j^2 \le 0

This simplifies to:

\sum_{i \ne j} (x_i y_i)(x_j y_j) - \sum_{i \ne j} (x_i y_j)^2 \le 0

Now, we can apply the arithmetic-geometric mean inequality to each term on the right-hand side:

\sum_{i \ne j} (x_i y_i)(x_j y_j) - \sum_{i \ne j} [(x_i y_j)^2]^{\frac{1}{2}}[(x_i y_j)^2]^{\frac{1}{2}} \le 0

This simplifies to:

\sum_{i \ne j} (x_i y_i)(x_j y_j) - \left(\sum_{i \ne j} [(x_i y_j)^2]^{\frac{1}{2}}\right)^2 \le 0

Finally, we can apply the arithmetic-geometric mean inequality again to the term on the right-hand side:

\sum_{i \ne j} (x_i y_i)(x_j y_j) - \left[\prod_{i \ne j} [(x_i y_j)^2]^{\frac{1}{2}}\right]^{\frac{2}{n-1}} \le 0

This simplifies to:

\sum_{i \ne j} (x_i y_i)(x_j y_j) - \prod_{i \ne j} (x_i y_j) \le 0

And we are done! This completes the proof of the Cauchy-Schwarz inequality.

I hope this helps! Let me know if you have any questions.

Remark: Unfortunately, this proof is not correct. Both the proof method and the details (such as its application of AM-GM) are severely flawed.
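For comparison, here is a sketch of one standard correct argument (the discriminant proof). It is not something Chat-GPT produced, and I am assuming real vectors as in the statement above. If every x_i is 0 the inequality is trivial, so assume otherwise.

```latex
\begin{align*}
0 &\le \sum_{i=1}^n (t x_i + y_i)^2
   = t^2 \sum_{i=1}^n x_i^2 + 2t \sum_{i=1}^n x_i y_i + \sum_{i=1}^n y_i^2
   \quad \text{for all } t \in \mathbb{R}, \\
\intertext{so this quadratic in $t$ has at most one real root, and its discriminant is nonpositive:}
&\left(2 \sum_{i=1}^n x_i y_i\right)^2
   - 4 \left(\sum_{i=1}^n x_i^2\right)\left(\sum_{i=1}^n y_i^2\right) \le 0 .
\end{align*}
```

Dividing by 4 gives the inequality, and equality forces the quadratic to have a root t, which means y_i = -t x_i for every i, matching the equality case mentioned earlier.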


To be honest, Chat-GPT is quite impressive for a model that is not tailored for mathematics in any way. I think it could already, in its current format, help a mathematician to prepare lecture notes or even easier parts of a paper. Perhaps its greatest utility would come in automating certain parts of grants and job applications.