H e l l o , W o r l d !
Or maybe it sees each word as well as the punctuation:
Hello , World !
In order to understand how ChatGPT sees data, we have to understand the data on which it is trained. The majority of the data used to train GPT-3 comes from the Common Crawl dataset, which is text scraped from the internet. Thus we turn our attention to understanding how text is encoded on the web.
In order for our computers to store and transfer text, we need a way of converting characters (i.e. elements of an alphabet, punctuation, etc.) to bits. Thanks to binary numbers, it is enough to convert characters to integers, which are called code points (though encoding schemes like the popular UTF-8 provide a more complex and efficient conversion of code points to bits, as we will see later).
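As a small illustration of the character to integer to bits chain described above, here is a minimal sketch. The helper name to_bits is made up for this example, and it pads to 8 bits, which only covers characters whose code point is below 256.

```python
def to_bits(ch: str) -> str:
    """Return the code point of a single character as a binary string.

    Hypothetical helper for illustration: pads to 8 bits, so code points
    of 256 or more simply produce a longer string.
    """
    return format(ord(ch), "08b")


for ch in "Hi!":
    # character -> integer (code point) -> bits
    print(ch, ord(ch), to_bits(ch))
```

For example, H has code point 72, which is 01001000 in binary.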
A Brief History of Unicode
One of the earliest widely used character encodings was ASCII (pronounced like as-kee), which stands for the American Standard Code for Information Interchange. One key problem with it is already evident from the name: what if non-Americans would like to exchange information?
ASCII provides code points for 128 characters, including the English alphabet and common punctuation. ASCII is typically sufficient for sending English messages. You can get the ASCII code point of the letter A (and vice versa) in Python with the following built-in functions.
print(ord("A"))  # ASCII code point of A
print(chr(65))   # character with code point 65
In addition to the aforementioned symbols, there are also code points that correspond to non-printable information, which can cause some confusion.
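To make the non-printable code points concrete, the sketch below lists them using Python's str.isprintable. Using isprintable as the criterion is my choice for this example, not something taken from the ASCII standard itself.

```python
# Collect the ASCII code points that Python does not consider printable.
non_printable = [i for i in range(128) if not chr(i).isprintable()]

print(len(non_printable))        # how many of the 128 code points are non-printable
print(non_printable[:5])         # the first few control codes
print(repr(chr(9)), repr(chr(10)))  # tab and newline, shown as escape sequences
```

These are the control codes 0 to 31 plus 127 (delete), which include familiar ones such as tab and newline.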
ASCII contains most of the characters you will need if your goal is to communicate in English, and was widely adopted in the 1960s. However, ASCII cannot support languages with different alphabets, accented characters, emojis, and more.
Thus a group of people set out to create a more inclusive standard for representing text that was also backwards compatible with the already widely adopted ASCII. After several iterations, Unicode is now the widely adopted standard. It is supported by a variety of blue chip companies, as can be seen from their members page.
What is Unicode?
Unicode is a way to convert nearly 150,000 characters to integers. For instance, here is a nice list of the integer to character conversions. You can input Unicode directly into HTML via &# followed by the decimal representation of the code point and a semicolon. For instance,
<p>&#70000;</p> <p>&#129312;</p>
renders as 𑅰 and 🤠, respectively.
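The same conversion can be checked from Python. This is a minimal sketch using only the standard library; the helper name to_entity is made up for this example.

```python
import html


def to_entity(ch: str) -> str:
    """Build the decimal HTML entity for a single character (illustrative helper)."""
    return f"&#{ord(ch)};"


print(to_entity("🤠"))             # &#129312;
print(html.unescape("&#129312;"))  # 🤠
print(hex(ord("🤠")))              # 0x1f920, the usual U+1F920 notation
```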
Thus Unicode extends ASCII to accommodate nearly all desired written text with nearly 150,000 characters assigned a code point. It turns out this encoding plays a large part in encoding text on the web and consequently the training of ChatGPT. But before we see this connection, we have to discuss UTF-8.
While Unicode is accommodating in terms of encoded characters, it is not terribly efficient. For instance, if you plan to write in mostly ASCII, it would make sense to make those characters require a smaller amount of space to encode. This is exactly the purpose that the UTF-8 encoding serves.
Recall that to store and send text, one needs to convert it to bits. In practice, we work with bytes, where a byte is just 8 bits. As 8 bits give 2^8 = 256 possibilities, all of ASCII can be represented by 1 byte (with room to spare). UTF-8 is an attempt to convert the Unicode code points to bytes in an efficient manner.
A byte can be represented by two hexadecimal digits. For instance, the numbers 0-20 in hexadecimal are given by:
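A quick way to see the efficiency argument is to count how many bytes UTF-8 spends on different characters. The sample characters below are my own picks for illustration.

```python
# Characters written as escapes so the exact code points are unambiguous.
samples = ["A", "\u00e9", "\u20ac", "\U0001F920"]  # A, é, €, 🤠

byte_lengths = [len(ch.encode("utf-8")) for ch in samples]

print(2 ** 8)        # 256 values fit in one byte
print(byte_lengths)  # [1, 2, 3, 4]

# Every ASCII character takes exactly one byte in UTF-8.
ascii_fits = all(len(chr(i).encode("utf-8")) == 1 for i in range(128))
print(ascii_fits)    # True
```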
0 1 2 3 4 5 6 7 8 9 A B C D E F 10 11 12 13 14
So we can represent the character H, which is 72 in ASCII, by 48 in hexadecimal. UTF-8 converts each Unicode code point to 1, 2, 3, or 4 bytes. For instance, the UTF-8 encoding of Hello 🤠 is
48 65 6c 6c 6f 20 f0 9f a4 a0
Note that the first 5 bytes correspond to H e l l o and the sixth (20) to the space, while the last 4 correspond to 🤠. UTF-8 is set up in such a way that it is clear the last 4 bytes are all part of one character.
Back to ChatGPT
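The bytes above can be reproduced in Python, and printing the last four in binary shows how UTF-8 marks them as one character: the leading byte starts with 11110 (announcing a 4-byte sequence) and each following byte starts with 10. The helper is_continuation is a made-up name for this sketch.

```python
text = "Hello 🤠"
encoded = text.encode("utf-8")

# Two hex digits per byte.
print(" ".join(f"{b:02x}" for b in encoded))  # 48 65 6c 6c 6f 20 f0 9f a4 a0


def is_continuation(byte: int) -> bool:
    """True if the byte has the form 10xxxxxx (illustrative helper)."""
    return (byte & 0b11000000) == 0b10000000


# The last four bytes encode the emoji.
for byte in encoded[-4:]:
    print(f"{byte:02x}", f"{byte:08b}", is_continuation(byte))
```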
Unicode has the advantage of being able to methodically encode nearly all text on the web to integers. An integer turns out to be perfect for inputting into a machine learning model. However, treating each of the nearly 150,000 Unicode characters as its own input would be inefficient, since even common words would be spread over many integers. For comparison, the encoding used for ChatGPT that we will discuss below has 100,261 tokens. Thus it is convenient to have a clever way of converting text to integers that goes beyond Unicode.
Byte Pair Encoding (BPE) is a preprocessing step that allows us to identify subwords that appear often in the text. The starting point of BPE, as the name suggests, is bytes. We start by encoding every single byte to an integer 0-255, which we call a token. Thus any Unicode text can be written as a sequence of tokens via the UTF-8 encoding. For instance, the Hello 🤠 above can be tokenized to
72 101 108 108 111 32 240 159 164 160
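This byte-level starting point needs nothing beyond the standard library. The sketch below only reproduces the raw UTF-8 bytes as integers; it does not call tiktoken, whose own numbering of single-byte tokens may differ.

```python
text = "Hello 🤠"

# Each UTF-8 byte becomes one integer in the range 0-255.
byte_tokens = list(text.encode("utf-8"))
print(byte_tokens)  # [72, 101, 108, 108, 111, 32, 240, 159, 164, 160]

# The mapping is lossless: the integers decode back to the original text.
round_trip = bytes(byte_tokens).decode("utf-8")
print(round_trip == text)  # True
```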
However, we can make this more efficient by adding additional tokens. For instance, the word “to” appears quite often in English text. However, it is currently encoded as the two tokens 116 111 (one for t and one for o).
What we can do is create a new token for the word “to” so that instead of using 2 tokens for this common word, we only use one (this is the “pair” in BPE). Using tiktoken, released by OpenAI, we can see that this is exactly what was done.
# requires tiktoken: pip install tiktoken
import tiktoken
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")  # gpt-3.5-turbo is the model behind ChatGPT
print(enc.encode("to"))
Running this code, we see that the integer 998 is reserved for “to”.
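To see what replacing a pair by a single token does to a sequence, here is a minimal sketch in plain Python. The function merge_pair and the new token id 256 are assumptions made for illustration; they are not tiktoken's API, and the real id for “to” is the 998 found above.

```python
def merge_pair(tokens, pair, new_token):
    """Replace every non-overlapping occurrence of `pair` with `new_token`.

    Hypothetical helper for illustration, scanning left to right.
    """
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(new_token)
            i += 2  # skip both members of the pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged


byte_tokens = list("go to town".encode("utf-8"))
print(len(byte_tokens))  # 10 tokens before merging

# 116 is "t", 111 is "o"; 256 is an arbitrary unused id chosen for this sketch.
merged_tokens = merge_pair(byte_tokens, (116, 111), 256)
print(merged_tokens)     # [103, 111, 32, 256, 32, 256, 119, 110]
```

The two occurrences of “to” (including the one inside “town”) each shrink from two tokens to one.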
Byte Pair Encoding
So how exactly is the byte pair encoding performed? We will give a brief explanation and note that the details can be found in ~20 lines of Python code in Algorithm 1 of this paper of Sennrich, Haddow, and Birch, and are also explained in Section 2 of the GPT-2 paper.
We start by taking a smallish sample of our text data. We then convert the text to bytes via the UTF-8 encoding. After this we see which pair of bytes appears the most often and assign a new token to that pair. We can see the first pair with the following Python code (continued from above).
print(list(enc.decode_bytes([256])))  # bytes behind token 256, the first token after the 256 single-byte tokens
The result is 32 32, which corresponds to two consecutive spaces. BPE then repeats this process, with the possibility of joining the newly created token to any other token. In fact, this is repeated over 100,000 times!
So the first join merges byte 32 with itself, that is, two consecutive spaces. The first non-space join is that of i and n to form “in” (token 258).
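The repeated count-and-merge loop described above can be sketched in a few lines. This is a simplified illustration under my own assumptions, not the code used to build the ChatGPT tokenizer: it works on one flat byte sequence, breaks ties arbitrarily, and skips the text pre-splitting that production tokenizers apply. The names most_frequent_pair, merge_pair, and train_bpe are made up for this example.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Return the most common adjacent pair, or None if there are no pairs."""
    counts = Counter(zip(tokens, tokens[1:]))
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def merge_pair(tokens, pair, new_token):
    """Replace non-overlapping occurrences of `pair` with `new_token`."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(new_token)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges, starting from raw UTF-8 bytes."""
    tokens = list(text.encode("utf-8"))
    merges = {}  # maps (left, right) -> new token id
    for step in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        new_token = 256 + step  # ids 0-255 are taken by the single bytes
        merges[pair] = new_token
        tokens = merge_pair(tokens, pair, new_token)
    return tokens, merges


tokens, merges = train_bpe("to to to to", num_merges=1)
print(merges)  # {(116, 111): 256}
print(tokens)  # [256, 32, 256, 32, 256, 32, 256]
```

Because a newly created token can itself be part of the next most frequent pair, running more merges builds longer and longer subwords.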
It is well known that not every byte sequence is valid UTF-8. Thus, it is possible in theory for ChatGPT to produce invalid UTF-8. Of course, this becomes increasingly rare as the model is trained more and more. In fact, the decoder provided by tiktoken has a kwarg to specify how to address this exact issue.
To see how ChatGPT is trained, we first have to understand the data. The data is scraped from the web, which led us to the UTF-8 encoding. The underlying Unicode standard assigns code points to nearly 150,000 characters, and feeding these to a model directly is inefficient. This motivates looking at a compression technique, i.e. the Byte Pair Encoding.
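The failure mode is easy to reproduce with the standard library alone: cut an emoji's bytes short and try to decode. The sketch below uses Python's built-in bytes.decode and its errors argument; it does not call tiktoken.

```python
valid = "Hello 🤠".encode("utf-8")
truncated = valid[:-2]  # drop the last two bytes of the 4-byte emoji

# Strict decoding (the default) refuses invalid input.
try:
    truncated.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print(strict_ok)  # False

# Lenient decoding substitutes the replacement character U+FFFD instead.
lenient = truncated.decode("utf-8", errors="replace")
print(lenient)
```

A model that emits only part of a multi-byte character produces exactly this kind of truncated sequence, which is why a decoder needs a policy for it.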