The (near) Future of AI in Math

I believe that the way mathematicians conduct research will be fundamentally different in the near future. Below, I’ll describe my perspective, developed from working as a mathematician and also from building AI solutions in big tech. Whilst AI will play a pivotal role, I believe the key next step is the adoption of software best-practices.

Current State of Mathematical Research

This is the current lifecycle of a research paper:

  1. Brainstorm to strategize for ideas to explore
  2. Convert Idea into a concrete strategy
  3. Convince oneself that strategy will work
  4. Work on the implementation details
  5. Edit and verify correctness of paper
  6. Submit Paper to a journal
  7. Third Party reviews paper for correctness/value

To an outsider, it may come as a surprise that it is not uncommon for step (7) to be the most painful step in this journey. Reviews can take years and are somewhat subjective. Reviewers are unpaid and need to balance this duty with other competing goals. This is especially painful for early-career mathematicians. A postdoc can be awaiting a career-changing decision whilst a paper sits idly in an anonymous reviewer’s mailbox.

Verifying correctness of a paper is a thorny issue. Even after self and peer review, it is not uncommon to find papers with unfixable mathematical flaws. There are many stories of top-tier mathematicians with such errors.

Computer Formalization of Mathematical Proofs

These problems can largely be solved with software engineering. Formal theorem provers, like Lean, provide a way to confirm that proofs are correct. Thus, if new papers were written in Lean, there would be no need for humans to check the validity of the proof. This may halve the work for the writers of the paper and decrease the work of a reviewer — assuming they actually do check validity — by an order of magnitude.

This sounds great, but in practice there are serious challenges. First off, writing a formal proof in Lean is currently quite challenging on its own. Mathematicians typically skip details that other experts can infer. Computers do not allow such a luxury. Every tiny step must be justified. Whilst this is a challenge that is immediately apparent when starting to use Lean (see my previous blog post), I do not see it as a long-term obstacle. The use of computers allows for automation and tooling, which will ease the burden in the long run.

Tooling

Good tooling makes software development quicker, easier and less error prone — and even more enjoyable. For instance, a good IDE can allow developers to catch and avoid bugs quickly and seamlessly navigate large code bases. Lean also offers tooling (called tactics) that allows developers to go from a statement that is “almost” proved to fully proved.
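
For instance, here is a minimal sketch (assuming a Lean 4 project with Mathlib available) of a goal that a single tactic can close, sparing the author from justifying each small algebraic step by hand.

import Mathlib.Tactic

-- `ring` closes routine algebraic identities in one step.
example (a b : ℕ) : (a + b) ^ 2 = a ^ 2 + 2 * a * b + b ^ 2 := by
  ring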

Another promising direction for good tooling is AI. Tools like GitHub Copilot and Claude Code provide useful autocomplete and can even generate full code from structured requirements. Currently these tools have limited effectiveness for Lean-specific tasks — likely due to the limited amount of available data for good Lean proofs. However, I do not see this as a long-term blocker.

There is a virtuous cycle: as the tooling gets better, more theorems will be converted to Lean, further improving the available data.

Code Migration

In order for computer formalization to work at scale, we need to formalize existing math. This is a monumental challenge. Think of this as a code migration, where:

  1. Legacy code (i.e. math papers) is not formal, with incomplete — and sometimes incorrect — implementations
  2. Contributions have been made from hundreds of thousands of developers over the course of 2000 years.
  3. The subject matter experts typically have no experience in software engineering
  4. Research communities are largely siloed, with projects usually having less than 5 team members

Code migrations at much smaller scales are already incredibly challenging. Currently, what exists is one large repository (a monorepo) that hosts all the formalized code. Inside this repository, we can find various folders containing proofs from different areas of mathematics. The monorepo approach is known to work for large-scale systems, such as those at Meta or Google.

The Vision

We wonder which of the steps

  1. Brainstorm to strategize for ideas to explore
  2. Convert Idea into a concrete strategy
  3. Convince oneself that strategy will work
  4. Work on the implementation details
  5. Edit and verify correctness of paper
  6. Submit Paper to a journal
  7. Third Party reviews paper for correctness/value

can be transformed by adopting the advances in software engineering and AI. Already, mathematicians are using ChatGPT to brainstorm research ideas and even come up with proofs of simple lemmas. The latter can easily save hours of time.

Converting an idea to a concrete strategy and convincing oneself it will work is a trickier task. This usually involves deep domain expertise if one wants to be at the cutting edge of research.

Writing the implementation details and checking them is a long process. This usually involves breaking a large problem into many sub-problems and then diligently verifying the correctness of every sub-problem. While not quite ready now, this is where good tooling could play a pivotal role. I believe a system that takes the statement of a lemma as input and (usually) provides a correct formal proof as output should be within reach. We have some early signs of this, both through ChatGPT and also through AI helpers within the Lean code base itself. Nevertheless, there is much room for improvement. For a deep dive, Terry Tao posted a series of YouTube videos working through the same lemma with several different AI helpers.

Once we have formally verified the paper, all that is left is for the reviewer to provide feedback and decide whether the paper will be accepted. This usually involves understanding the novelty and impact of a paper. For domain experts, this is a quick process. Furthermore, it is something researchers should already be doing, as it is important to have a mental map of the research in one’s area.

Conclusion

Mathematics will inevitably adopt the tooling from software engineering and AI. Painful steps in mathematical research will be largely replaced, such as verification and certain implementation details. AI and software tooling will be a force multiplier, allowing individual mathematicians to accomplish much more than they can today.

AUC and Cross-entropy

In this post we will see that Log loss and AUC for binary classification problems are closely related.

Suppose we have a set of probabilities that are meant to predict whether a true label is 0 or 1. That is, let {\widehat{y_k} \in [0,1]} be the predicted probability and {y_k \in \{0,1\}} the true label for the {k}-th example, {k = 1, \ldots , m}.

Recall the AUC is the area under the ROC curve. We will use another interpretation, which is derived on the above wiki: AUC is the probability that a randomly chosen positive example is ranked higher than a randomly chosen negative example.

Note the larger that AUC is, the better the predictions are. For comparison of AUC to log loss, we will convert AUC to a loss function. We do so by studying {1-{\rm AUC}}.

Let {P} be the number of positive examples and {N} be the number of negative samples. Thus,

\displaystyle 1 - AUC = \frac{1}{PN}\sum_{p : y_p = 1} \sum_{n : y_n = 0} f_{AUC}(\widehat{y_p},\widehat{y_n}),

where {f_{AUC}(x,y) = 1} if {y>x} and {0} otherwise.

We introduce another quantity:

\displaystyle I = \frac{1}{PN} \sum_{p : y_p = 1} \sum_{n : y_n = 0} \log(\widehat{y_p}) \cdot \log(1-\widehat{y_n}).

This quantity has a similar shape to the first quantity, except that {f_{AUC}(x,y)} is replaced by {\log(x) \cdot \log(1-y)}. Note that these two functions are quite similar. The latter is a smooth variant of the former that penalizes poor misclassifications much more harshly. Put another way, the latter quantity penalizes confidently wrong predictions harshly.

The above equation factors and is equal to

\displaystyle I = \left( \frac{1}{P} \sum_{p : y_p = 1} \log(\widehat{y_p}) \right) \left( \frac{1}{N} \sum_{n : y_n = 0} \log(1-\widehat{y_n}) \right).

Let us compare this to the log loss:

\displaystyle {\rm LogLoss} = -\frac{1}{P +N} \sum_{p : y_p = 1} \log(\widehat{y_p}) - \frac{1}{P + N} \sum_{n : y_n = 0} \log(1-\widehat{y_n}).

The intermediate expression and LogLoss are connected by the AM-GM inequality (note that both factors defining {I} are averages of non-positive numbers, so {I \geq 0}, while the log loss is non-negative). For simplicity, let us suppose {P = N}. Then, the AM-GM inequality implies

\displaystyle  \sqrt{I} \leq {\rm LogLoss}.

As {I} is similar to {1 - AUC}, we have that {\sqrt{1-{\rm AUC}}} is roughly a lower bound on the log loss. Let us make some assumptions

  • Balanced data: {P \approx N}.
  • The predictions are not very wrong too often, i.e. {\widehat{y_p}} is not too often close to zero.
  • Predictions have similar errors for positive and negative, that is the log loss over positive and negative examples is similar.

Then following through the above argument, we find that

\displaystyle \sqrt{1 - AUC} \approx {\rm LogLoss}.

The main new addition to what is above is to understand that equality in the AM-GM inequality is achieved when the two terms are equal.

Thus, we understand a couple things:

  • We expect that {\sqrt{1 - AUC} \approx {\rm LogLoss}}.
  • If this is not the case, then we can infer that the assumptions above are not met in our particular use-case.
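
As a sanity check, here is a small numerical experiment (a sketch assuming scikit-learn is installed; the synthetic labels and score model are purely illustrative). How closely the two quantities track each other depends on how well the assumptions above hold.

import numpy as np
from sklearn.metrics import roc_auc_score, log_loss

rng = np.random.default_rng(0)
m = 10_000
y_true = rng.integers(0, 2, size=m)                   # roughly balanced labels
scores = (2 * y_true - 1) + rng.normal(0, 1, size=m)  # informative but noisy scores
y_pred = 1 / (1 + np.exp(-scores))                    # map scores to probabilities

auc = roc_auc_score(y_true, y_pred)
ll = log_loss(y_true, y_pred)
print("sqrt(1 - AUC):", np.sqrt(1 - auc))
print("LogLoss      :", ll)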

PCA like a Mathematician

Principal Component Analysis, or PCA, is a fundamental dimensionality reduction technique used in Machine Learning. The general goal is to transform a {d}-dimensional problem into a {k}-dimensional one, where {k} is smaller than {d}. This can be used, for instance, to plot high-dimensional data on a 2D or 3D plot, or to reduce the number of features in an effort to eliminate random noise in a machine learning problem (made precise below). In this post, we will discuss PCA in mathematical terms. This is adapted from a first course in Linear Algebra, the principle of maximum likelihood estimation, and the Eckart-Young theorem from 1936.


Example: Suppose we are presented with {m} samples {x_1 , \ldots , x_{m}}, where each is a 100-dimensional vector. After inspection, we realize that each data point is of the form {c_i z}, where {c_i} is a real number and {z} is a 100-dimensional vector. Thus, to describe each {x_i}, we simply need to specify {c_i}, rather than the 100 coordinates of {x_i}. This effectively reduces the dimensionality of our data from 100 to 1. {\spadesuit}

The goal of PCA is to provide a methodical way of accomplishing this in general. There are a few issues to overcome.

  • The data may not be of the form {c_i z}, but rather {c_i z_1 + d_i z_2}, or even higher dimensional.
  • The data might only be approximately of the form {c_i z}, i.e. of the form {c_i z + e_i }, where {e_i} is a small error term.
  • We may not know, a priori, the form of the data. How do we determine the best {k}-dimensional representation of the data?

Let us address these issues below.

1. PCA in Mathematical Terms

Suppose we are presented with a matrix {X} of size {m \times d}, where each row is a sample and each column is a feature. We denote the rows of {X} by {x_1 , \ldots , x_m}. We suppose further, without loss of generality after scaling, that each column of {X} has zero mean and unit variance.

Our goal is to find the best approximation of the form

\displaystyle x_i \approx z_i^T W ,

for all {i = 1 , \ldots , m}, where {W} is a matrix of size {k \times d} and {z_i} is a vector of size {k}. Thus, each {x_i} is generated from lower dimensional data, via the {z_i}.

We suppose, more precisely, that

\displaystyle x_i =z_i^T W + \epsilon_i,

where {\epsilon_i \sim N(0, I)} is an error term. That is, each coordinate of {\epsilon_i} is normally distributed with mean zero and variance 1. From here, we plan to derive PCA. One can wonder whether the assumption on {\epsilon_i} is reasonable, which can be addressed on an individual basis. The assumption that {x_i} is generated by lower dimensional data is sometimes referred to as a latent representation, and the interested reader may consult this blog post for applications of this idea to generative models, such as image generation.

From this assumption, how do we find {W} and the {z_i}? The answer comes from the principle of Maximum Likelihood Estimation (MLE). Utilizing the formula for the probability density function of a multivariate normal distribution, for a fixed {W} and {z_1 , \ldots , z_m}, the probability of observing {x_1 , \ldots , x_m} is given by

\displaystyle P(x_1 , \ldots , x_m | W, z_1 , \ldots , z_m) = \prod_{i=1}^m \frac{1}{(2 \pi )^{d/2}} \exp \left( - \frac{1}{2 } \| x_i - z_i^TW \|^2 \right).

The norm in the above equation is the classic Euclidean norm. The principle of MLE states that we should choose {W} and the {z_i} to maximize this probability. It is typically easier to maximize a sum rather than a product. For instance, in Calculus, it is much easier to take the derivative of a sum than a product, as the product rule of derivatives is more complex than the sum rule. A classic mathematical trick that converts a product into a sum is to take logs. As the logarithm is a monotonically increasing function, maximizing the log of a function is equivalent to maximizing the function itself. Thus, we consider the log likelihood

\displaystyle \log P(x_1 , \ldots , x_m | W, z_1 , \ldots , z_m) = - \frac{md}{2} \log (2 \pi) - \frac{1}{2} \sum_{i=1}^m \| x_i - z_i^T W \|^2.

As our goal is to maximize the above, we can equivalently minimize the sum

\displaystyle \sum_{i=1}^m \| x_i - z_i^T W \|^2,

after removing the constant term that does not depend on {W} or the {z_i} and scaling the remaining summand. We can rewrite the above as

\displaystyle  \| X - ZW\|_F^2,

where {X} is the matrix of size {m \times d} whose rows are {x_1 , \ldots , x_m}, {Z} is the matrix of size {m \times k} whose rows are {z_1 , \ldots , z_m}, and {W} is the matrix of size {k \times d} as above.

2. The Eckart-Young Theorem

The solution to this problem comes from the Eckart-Young theorem. Unfortunately, the original paper is behind a paywall, but we will provide the ideas here. Recall the rank of a matrix is the dimension of its column space.

Let us also recall the Singular Value Decomposition (SVD) of a matrix {X} of size {m \times d} is given by

\displaystyle X = U \Sigma V^T,

where {U} is an {m \times m} orthogonal matrix, {\Sigma} is a {m \times d} diagonal matrix, and {V} is a {d \times d} orthogonal matrix. The diagonal entries of {\Sigma} are called the singular values of {X}. It turns out that the columns of {U} are the eigenvectors of {X X^T} and the columns of {V} are the eigenvectors of {X^T X}.


Theorem: (Eckart-Young) Let {X} be any matrix of size {m \times d} and let {k \leq d}. Then

\displaystyle \min_{X_k} ||X - X_k||_{F}^2 \ \ {\rm s.t.} \ \ {\rm rank}(X_k) \leq k,

is attained at

\displaystyle X_k = U \Sigma_{k}V^T,

where

\displaystyle X = U \Sigma V^T,

is the Singular Value Decomposition of {X} and {\Sigma_k} is the {m\times d} matrix which is zero except that its first {k} columns are the same as those of {\Sigma}. {\spadesuit}

There is a lot to unpack here:

  • How does this solve our original problem?
  • Why is this theorem true?

Let us address these in order. First note that {{\rm rank}(ZW) \leq {\rm rank}(Z) \leq k}, as {Z} has {k} columns. Thus, we were seeking the best rank {k} approximation to {X} above. Here we may take the rows of {U\Sigma_k}, restricted to their first {k} entries, as the {z_1 , \ldots , z_m} from above, and take {W} to be the first {k} rows of {V^T}, which is a particularly simple (orthonormal) matrix. This nicely generalizes our original one-dimensional example, since now

\displaystyle x_i \approx \sum_{j=1}^k \sigma_j u_{ij} v_j,

where {\sigma_j} is the {j}th singular value of {X}, {u_{ij}} is the {(i,j)} entry of {U}, and {v_j} is the {j}th column of {V}.


Proof of Eckart-Young: By the SVD, it is enough to find {X_k} that minimizes

\displaystyle ||U \Sigma V^T - X_k||_{F}^2.

Since the Frobenius norm is invariant under orthogonal transformations, we may equivalently minimize

\displaystyle ||\Sigma - U^T X_k V||_{F}^2.

But {U^T X_k V} has rank at most {k}, since multiplying by the orthogonal matrices {U^T} and {V} does not change the rank of {X_k}. Since {\Sigma} is a diagonal matrix with its singular values in decreasing order, it is straightforward (or see here) to see that the best rank {k} approximation to {\Sigma} is obtained by keeping the first {k} diagonal entries and setting the rest to zero, that is {\Sigma_k}. Thus,

\displaystyle \Sigma_k = U^T X_k V,

and the result follows. {\spadesuit}

3. A Nice After Effect

Thanks to the singular value decomposition, we are equipped with an algorithm for reducing the dimension of non-training data. To see this, note that we may compute the latent variables via

\displaystyle Z = X V_k,

where {V_k} is the matrix of size {d \times k} whose columns are the first {k} columns of {V}. Thus, given a new data point {x}, we may compute the latent representation via

\displaystyle z = x^T V_k.

This is a nice feature of PCA, which is not afforded by other dimensionality reduction techniques, such as t-SNE. This allows the transform method to be available in sklearn’s PCA implementation, which affords the opportunity for PCA to be used in the feature engineering portion of a machine learning pipeline.
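
Here is a minimal numpy sketch of this fit/transform split (the data and variable names are illustrative): we learn the column statistics and {V_k} on training data, then reuse them to transform a held-out point.

import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))   # m = 200 samples, d = 5 features
x_new = rng.normal(size=(1, 5))       # a held-out data point
k = 2

# Standardize using the training data statistics only.
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
A = (X_train - mu) / sigma

# Principal directions from the SVD of the standardized training data.
_, _, Vt = np.linalg.svd(A, full_matrices=False)
V_k = Vt[:k].T                        # d x k

Z_train = A @ V_k                     # latent representation of the training data
z_new = ((x_new - mu) / sigma) @ V_k  # transform the new point with the same V_k
print(Z_train.shape, z_new.shape)     # (200, 2) (1, 2)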

4. An Alternate Perspective

The more classical way to approach PCA is to search for a unit vector {v \in \mathbb{R}^d} satisfying

\displaystyle v = \arg \max_{\|u\| = 1} \| Xu\|.

Thus, {v} is the vector that maximizes the variance of the data projected onto the line generated by {v}. However, the MLE approach above is more natural from a machine learning perspective.

That being said, I answered a question on math stackexchange awhile back that provides a derivation of PCA from this perspective. I was curious whether the Lagrange multiplier approach, outlined in that post, could be used to derive the Eckart-Young theorem.

To see how to approach this, suppose we want to prove Eckart-Young for {k=1}, i.e. we would like to find a rank 1 matrix {X_1} that minimizes

\displaystyle ||X - X_1||_{F}^2.

As {X_1} is rank 1, it can be written as {uv^T}, where {u} and {v} are vectors. Thus, we are looking for {u} and {v} that minimize

\displaystyle ||X - uv^T||_{F}^2.

This is an optimization problem in {u} and {v}. Let us take the gradients with respect to {u} and {v} and set them equal to zero. By the definition of the Frobenius norm and the multivariate chain rule, we have

\displaystyle \nabla_u ||X - uv^T||_{F}^2 = -2 (X - uv^T)v = 0,

and

\displaystyle \nabla_v ||X - uv^T||_{F}^2 = -2 (X - uv^T)^T u = 0.

Thus,

\displaystyle Xv = uv^Tv, \ \ \ X^T u = v u^T u.

These are precisely the equations satisfied by the SVD of {X}.
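
Here is a quick numerical sanity check of these equations (a sketch on random data, not a proof): take {u = \sigma_1 u_1} and {v = v_1} from the SVD and verify both stationarity conditions.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
U, S, Vt = np.linalg.svd(X, full_matrices=False)

u = S[0] * U[:, 0]   # sigma_1 * u_1
v = Vt[0]            # v_1 (unit norm)

print(np.allclose(X @ v, u * (v @ v)))    # X v = u (v^T v)
print(np.allclose(X.T @ u, v * (u @ u)))  # X^T u = v (u^T u)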

I’d like to see a cleaned up version of this argument, in the general case.

Breaking Down Problems for LLMs

We all have some problem we are trying to solve. First off, if you can convert the problem to text (or, increasingly, PDFs, images, video, etc.), then you might be able to use AI for some or all of the solution.

Let’s focus on text for now. In this case, we have Large Language Models (LLMs), like ChatGPT, at our disposal.

Now, simply inputting your entire problem into the LLM may not work well, or even at all. I talk about this at length in a recent YouTube video, focusing on a feature Amazon recently released.

Simply put, LLMs are fantastic at these sorts of tasks:

  • Summarize this
  • Answer this Question based on this context
  • Sentiment Analysis
  • Translate this
  • Generate 10 Ideas for this
  • Write Code to complete this
  • See if there are any errors in this text
  • Etc.

Thus if your original problem can be made easier to solve by solving one of these sub-problems, then LLMs might be a good tool for the job.

How one prompts the LLM makes a huge difference in the results. See another recent video of mine where I discuss some common tricks.

Once we have a sub-problem in mind, it is sometimes the case that vanilla LLMs are not enough for the job. In that case, we can soup them up with additional tools and techniques.

Thus we have a lot of tricks up our sleeves to make use of LLMs in practice.

Amazon’s Customer Summary Feature

In a recent YouTube Video I discuss how a cool feature from Amazon could be implemented with Large Language Models.

The “Customer Says” feature allows one to:

  1. See the sentiment of a product along different categories
  2. Click on each category to see some sample reviews.

In the video above, I explain how this feature can be broken down into a bunch of sub-problems, each of which can be solved with LLMs.

Carnival Of Mathematics

In this somewhat different post, I am hosting the long-running Carnival of Mathematics. First I’ll talk about 223 (the issue number) and then I’ll round up some mathematical posts from December 2023.

It’s primetime we talk about 223. First of all, it is a lucky prime (it is unknown whether there are infinitely many lucky primes). Writing the number 223 as a sum of fifth powers requires 37 terms, more than any other number. This is an instance of Waring’s problem, which has a rich history going back to Diophantus nearly 2000 years ago. Also, 223 is the number of permutations on 6 elements that have a strong fixed point (see below for Python code).

And now for what happened last month.

  • Quanta brings current research mathematics to a general audience. Solutions to long-standing problems happen all the time in math, and this video highlights some from the past year. The one I know the most about is at the end. The “Dense Sets have 3-term Arithmetic Progressions” problem has been the subject of intense study for over 70 years. Multiple Fields Medal winners over several generations have studied the problem, and a big breakthrough was made by two computer scientists (including a PhD student from my alma mater).
  • Quanta also posted their Year in Math. Particularly interesting to me is Andrew Granville’s (my former Mentor) discussion of the intersection of computation and mathematics.
  • Here you can find a “Theorem of the Day,” where the author impressively keeps posting interesting theorems from mathematics on a daily basis.
  • An interesting story of a Math SAT problem with no correct answer.
  • John Cook’s explanation of a Calculus trick. See if you can figure out how his trick is a manifestation of the fact that, for a right triangle with angle theta, sin(\theta) only depends on the ratio of the non-hypotenuse sides.
  • If you are interested in some drawings with math and a little bit of humor, you can find that here.
  • Another article from Quanta highlighting a very recent breakthrough. Interesting to me is how quickly (3 weeks!) the result was formalized into computers. What this means is that a group of people got together and input all of the results into a computer, which then tested the validity of the results in an automated way. This process of inputting the results into the computer can be quite painful, and I am impressed by the speed at which it was done. You can check out my previous blog post for my experience with it. Incidentally, one of the classical results I helped formalize in that blog post was used to formalize the recent result. The referee and review process for math papers is particularly painful for mathematicians. It is not uncommon to see mathematicians submit a paper and wait years for the decision to accept or reject to come in. Meanwhile, especially for younger mathematicians, their career hangs in the balance as they await these decisions. Applying software engineering best practices to this process, as done above, would I believe go a long way toward alleviating some of this anguish.

Here is the Python code for the strong fixed point verification.

from itertools import permutations
perms = permutations([1,2,3,4,5,6])

def fixed_points(perm):
    """
    Input: a permutation written as a tuple, i.e. (1,3,2) for 1->1, 2->3, 3->2
    Output: a list of the indices of the fixed points of the permutation
    """
    return [ind for ind,p_ind in enumerate(perm) if ind+1 == p_ind]

def is_strong_fixed_point(ind,perm):
    """
    Input: An index and a permutation
    Output: True if the index is a strong fixed point of the permutation, else False
    See here: https://oeis.org/A006932
    """
    for lesser_ind in range(0,ind):
        if perm[lesser_ind] > perm[ind]:
            return False
    for greater_ind in range(ind+1,len(perm)):
        if perm[greater_ind] < perm[ind]:
            return False
    return True

counter = 0
for perm in perms:
    for fixed_point in fixed_points(perm):
        if is_strong_fixed_point(fixed_point,perm):
            counter += 1
            break
print(counter)

Temperature in Natural Language Processing

In Machine Learning, and in particular Generative AI, temperature is a useful hyperparameter for tuning model outputs.

In this post, we will discuss the following.

  • Temperature is a parameter developers can use to alter outputs from Large Language Models
  • With a higher temperature we get more creative outputs.
  • Why changing the temperature is useful.

Let’s start with an example (the code I used is at the end of this post). Consider the following prompt, taken from Google’s Minerva paper.

A line parallel to y = 4x + 6 passes through (5, 10). What is the y-coordinate of the point where this line crosses the y-axis?

I asked ChatGPT four times to answer this prompt, similar to what was done in their paper. Here is what I got:

  • Since a line parallel to y=4x+6 has the same slope, the desired line has slope 4. Therefore, the equation of the line is of the form y=4x+b. Plugging in the point (5, 10), we have 10=4(5)+b, so b=10-4(5)=-10. Therefore, the y-intercept is \boxed{-10}.
  • Since the line is parallel to y = 4x + 6, which has a slope of 4, the line must have a slope of 4. The equation of the line is then y - 10 = 4(x - 5) = 4x - 20. Setting x = 0, we find y = -20 + 10 = \boxed{-10}.
  • The given line has slope 4, so a line parallel to it also has slope 4. If a line with slope 4 passes through (5,10), then it is of the form y-10=4(x-5). Expanding the right side gives y-10=4x-20, so y=4x-10. Setting x=0, we find that the line crosses the y-axis at \boxed{-10}.
  • Since a line parallel to y=4x+6 will have the same slope, this line will have slope equal to 4. Therefore, its y-intercept has y-coordinate \boxed{-14}.

The first three answers are correct, while the last answer is wrong. Actually, the final answer is correct all the way up to the 4 in 14. What likely happened under the hood is that ChatGPT was nearly sure the 4 should be a 0, but was still hedging its bets that it could in fact be a 4. In that particular instance, it randomly chose the 4.

To understand temperature, we first have to understand a bit about how large language models work. For instance, the G in GPT stands for generative. In practice, this means that given some text, the Large Language Model will try to predict the next letter (or, more precisely, token; see this previous blog post). For a model like GPT, each candidate character is assigned a probability as to how likely it is to be the next character. For instance, in the example above, the model likely assigned a large probability to the character 0 and a small, but positive, one to the character 4. It is at this point that the concept of temperature becomes useful.

Now that we have a bunch of probabilities assigned to each character, we have to define a methodical way of choosing the next character. Do we just output the character with the highest probability?

Unfortunately, there is no “one size fits all” solution to this problem, which is why we introduce the notion of temperature. Let’s look at an example. Suppose we are choosing between two characters to output next. Suppose further that our model overwhelmingly thinks the first character is the best choice. We plot how the temperature affects our choice of character in this example.

Here, for a low temperature (i.e. \theta close to 0), the model outputs the first character nearly all the time. But as the temperature grows larger, the model outputs each character around half the time.

Thus as we decrease the temperature, we get closer to the model that only outputs the highest probability character. As we increase the temperature, we get closer to the model that chooses each character randomly and uniformly.
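
Concretely, the standard recipe is to divide the model’s raw scores (logits) by the temperature T before applying softmax. Here is a small sketch for the two-character example above (the logits are made up for illustration, and this is the textbook formula rather than OpenAI’s internal code).

import numpy as np

def softmax_with_temperature(logits, temperature):
    """Convert raw logits into sampling probabilities at a given temperature."""
    scaled = np.array(logits, dtype=float) / temperature
    scaled -= scaled.max()            # subtract the max for numerical stability
    weights = np.exp(scaled)
    return weights / weights.sum()

logits = [5.0, 1.0]  # the model strongly prefers the first character
for T in [0.1, 1.0, 10.0]:
    print(T, softmax_with_temperature(logits, T))
# Low T puts nearly all the probability on the first character;
# high T pushes the distribution toward 50/50.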

And just for completeness, I will mention that the temperature 0 response I got from ChatGPT in the above example was in alignment with the three correct answers.

Why is Temperature Useful?

The argument made in the aforementioned Minerva paper was that by increasing the temperature, we can have the model generate a variety of outputs. From this variety of outputs, we may pick the “best” one. How we choose the best one can vary; what they did was take the most popular one.

This allows us to explore the probability space of answers generated by the generative model in order to make a more informed decision about which one to proceed with.

The Code

Here is the Python code I used to generate the example above. First, I created a .env file in the same directory with my OpenAI API key (fill it out with your API key):

OPENAI_API_KEY=

Then I used Langchain (though this is not really required for such a simple example) as follows:

import os
from dotenv import load_dotenv  # assumes python-dotenv is installed, to read the .env file
from langchain.chat_models import ChatOpenAI
from langchain.chains import LLMChain
from langchain import PromptTemplate

load_dotenv()  # load OPENAI_API_KEY from the .env file into the environment
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
query = "A line parallel to $y = 4x + 6$ passes through $(5, 10)$. What is the $y$-coordinate of the point where this line crosses the $y$-axis?"

prompt = PromptTemplate.from_template(query)
prompt.format()
llm = ChatOpenAI(temperature=1, openai_api_key=OPENAI_API_KEY)

chain = LLMChain(llm=llm, prompt=prompt)

responses = [chain.run({}) for _ in range(5)]

Singular Value Decomposition and PCA

Principal Component Analysis (PCA) is a popular technique in machine learning for dimension reduction. It can be derived from Singular Value Decomposition (SVD) which we will discuss in this post. We will cover the math, an example in python, and finally some intuition.

The Math

SVD asserts that any m \times d matrix A can be written as

A = U\Sigma V^T,

where

  • \Sigma is m \times d and is rectangular diagonal, that is \Sigma_{ij} = 0 for i \neq j.
  • U and V are orthogonal, that is U^T U = I and V^T V = I.

This allows us to derive PCA. First, we are given a data matrix, X, that is, a matrix where each of the m rows is a data point with d features.

First, we translate each column of X so that each column has mean zero. That is, we subtract from each column its mean, \mu_1 , \ldots , \mu_d, respectively. For instance, if

X = \left[\begin{matrix}1 & 2\\3 & 4\\5 & 6\end{matrix}\right] ,

Then we would translate the first column by -3 and the second column by -4. Then we divide each column by its standard deviation, \sigma_1 , \ldots , \sigma_d. Thus in our example we divide the first column and second column by \sqrt{8/3} (using the population standard deviation, as np.std does by default). Let us call this new scaled matrix A to align with our notation above. Then we can perform PCA by performing SVD

A = U\Sigma V^T

and computing the m \times d matrix

AV.

For PCA with k components, the data point in the j^{\rm th} row of X is transformed to the first k entries of the j^{\rm th} row of AV.

Put another way, given a data point x, we compute PCA in two steps:

z = \left( (x_1-\mu_1)/\sigma_1 , \cdots , (x_d-\mu_d)/\sigma_d \right),

and then

(z \cdot v_1 , \cdots , z \cdot v_k)

where v_1 , \ldots , v_k are the first k columns of V.

Python Example

We will show how to code up PCA using only numpy and torchvision (only to retrieve some classical datasets). First we start with some imports.

import torchvision.datasets as datasets
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns 

Then we define a PCA function that does the work we did in the previous section. The data is inputted as a numpy array. We utilize the np.linalg.svd function for SVD and specify full_matrices=False so that the dimensions of U and V^T align with our computations above.

def pca(data,k=2):
    data = data.reshape((data.shape[0], -1))
    mu = np.mean(data, axis=0)
    std = np.std(data, axis=0).clip(min=1e-10).reshape(1,-1)
    A = (data - mu) / std
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    V = np.transpose(Vt)
    tmp = np.dot(A, V)
    return tmp[:,:k],V

We can now load in the MNIST, Fashion MNIST, and CIFAR datasets from torchvision datasets.

mnist_trainset = datasets.MNIST(root='./data', train=True, download=True, transform=None)
np_mnist = mnist_trainset.data.numpy()/255.0

fmnist_trainset = datasets.FashionMNIST(root='./data', train=True, download=True, transform=None)
np_fmnist = fmnist_trainset.data.numpy()/255.0

cifar = datasets.CIFAR100(root='./data', train=True, download=True, transform=None)
cifar_np = cifar.data/255.0

We can now perform PCA on these datasets.

pca_mnist,V_mnist = pca(np_mnist)
pca_fmnist , V_fmnist = pca(np_fmnist)
pca_cifar , V_cifar = pca(cifar_np)

We can see how they divide the datasets into their classes with the following code.

fig, axs = plt.subplots(3, 4, figsize=(15, 6))

for i in range(3):
    for j in range(4):
        N = i*4+j 
        axs[i, j].imshow(.5*V_mnist[:,N].reshape((28,28))+.5, cmap="gray")
        N +=1 
        if N == 1:
            axs[i, j].set_title(f"{N}st Principal Component")
        elif N == 2:
            axs[i, j].set_title(f"{N}nd Principal Component")
        elif N == 3:
            axs[i, j].set_title(f"{N}rd Principal Component")
        else:
            axs[i, j].set_title(f"{N}th Principal Component")
        axs[i, j].axis('off')

We can also see what pixels are activated for each principal component. For instance we have the following for the MNIST data set

and the Fashion MNIST

You are encouraged to think about how I came about these images!

Intuition

There are two steps to PCA:

  • Scaling the columns of the data matrix
  • Computing the SVD and transforming the data

Let’s walk through these steps with a toy example to see what’s going on. We generate a bunch of points in 2-dimensional space that lie close to the line y = 2x, where x is between 0 and 1.

import numpy as np 

X = np.zeros((1_000,2))
X[:,0] = np.random.uniform(0,1,1_000)
X[:,1] = 2*X[:,0] + np.random.normal(0,0.1,1_000)
fig = sns.scatterplot(x = X[:,0],y = X[:,1])

Now we perform the scaling step.

Z = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
fig = sns.scatterplot(x=Z[:, 0], y=Z[:, 1])

You can see, first of all, that this shifted the data so that the center point is now (0,0), instead of the original (.5, 1). Also, through scaling, the data now lies near the line y = x, rather than y = 2x.

The scaling step (in particular division by the standard deviation) makes it so we do not favor a feature simply because the coordinates are on average larger.

We can now compute V.

Vt = np.linalg.svd(Z, full_matrices=False)[-1]
V = np.transpose(Vt)

This gives

V=  \begin{bmatrix} -0.70710678 & -0.70710678 \\ -0.70710678 & 0.70710678 \end{bmatrix}

Notice the first column of V, v_1, is a scalar multiple of

(1,1).

and the second column, v_2, is a scalar multiple of

(-1,1).

Here is what is going on. In our data set, after scaling, the data points lie on the line y=x, up to a small error. PCA is a way of automating the process of finding this y=x line. Indeed, the line is given by the vector v_1. When we compute v_1 \cdot x, for a data point x, we are getting information on where x lies on the line y = x. Indeed, for a point u = (a , a) we have

u \cdot (1,1)  = 2 a,

so this dot product is able to precisely compute where x is on the line y = x.

In higher dimensions, the same general principle holds. PCA allows us to find the line (or plane, or higher dimensional space) that best describes our data. Indeed, let’s look at the previous example in 3 dimensions. We just add a third coordinate consisting of small random noise.

X = np.zeros((1_000,3))
X[:,0] = np.random.uniform(0,1,1_000)
X[:,1] = 2*X[:,0] + np.random.normal(0,0.1,1_000)
X[:,2] = .1*np.random.uniform(0,1,1_000)
Z = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
Vt = np.linalg.svd(Z, full_matrices=False)[-1]
V = np.transpose(Vt)
print(V)

Here we find that the first column of V is, roughly, a scalar multiple times

(1,1,0).

Thus PCA still identifies the important relation between the first and second coordinate (after scaling) and disregards the other noise.

Limitations of PCA

One major limitation of PCA is that it is only capable of capturing linear relations between the features. This could be handled by combining feature engineering with PCA, though this presents its own set of challenges. Another way to overcome this is to invoke non-linear methods of dimension reduction, e.g. t-SNE or UMAP. In the following figure, I ran t-SNE on the MNIST data set (first using PCA to reduce the number of features to 25). We can see it performs much better than PCA at identifying digits.
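
For reference, here is a rough sketch of how such a figure can be produced (assuming scikit-learn is installed, and reusing np_mnist and mnist_trainset from the Python example above): PCA down to 25 components, then t-SNE down to 2.

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

X = np_mnist.reshape((np_mnist.shape[0], -1))[:5_000]   # subsample for speed
labels = mnist_trainset.targets.numpy()[:5_000]

X_25 = PCA(n_components=25).fit_transform(X)
X_2d = TSNE(n_components=2).fit_transform(X_25)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, s=2, cmap="tab10")
plt.colorbar()
plt.show()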

Furthermore, PCA can lead to unpredictable outcomes that depend on the distribution of the data collected, as studied in this paper of Elhaik, which I learned about from The Batch. This highlights that one should be careful making inferences from PCA alone.

The Square Root Cancellation Heuristic

In the first equation of the popular Attention is all you need paper (see also this blog post), the authors write

{\rm Attention}(Q,K,V) = {\rm softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right) V.

In this post we are going to discuss where the \sqrt{d_k} comes from, leading us to some classical Probability Theory. We will first talk about the math with some examples and then quickly make the connection.

The principle of square root cancellation also appears in the Batch Norm Paper, Neural Network weight initialization (see also Xavier Uniform), and elsewhere.

The Math

Let v be a d-dimensional vector with entries that are either +1 or -1. Then we may write

v= (v_1 , \ldots , v_d).

We will be interested in the simple question: what is the sum of the coordinates? For a fixed vector, we can easily compute this via

v_1 + \cdots + v_d.

The largest this sum can be is d and the smallest it can be is -d. However, we would like to know the average behavior of this sum.

Since the vector has both +1‘s and -1‘s, there will likely be cancellation in the sum. That is, we expect the sum to be far from both d and -d. It turns out that as d gets large, there will typically be a lot of cancellation.

Let’s first run an experiment in python. We let d = 500 and compute the sum of 100,000 random vectors each with \pm 1 entries.

import numpy as np 
import seaborn as sns 

d = 500; sums = []

for _ in range(100_000):
    v = np.random.choice([-1, 1], size=d)
    sums.append(np.sum(v))

_ = sns.histplot(data = sums)

We see that most of the sums are quite a bit smaller than 500, and barely any are smaller than -75 or larger than 75.

This is the so-called square root cancellation. If we take a random vector of dimension d with randomly generated \pm 1 entries, we expect the order of the sum to be \sqrt{d}. Here \sqrt{500} \approx 22.36, and it is a general phenomenon that most of the sums will lie between, say, -4\sqrt{d} and 4\sqrt{d}. This general principle goes much deeper than \pm 1 vectors and turns out to be a very well-studied concept in several areas of mathematics (see this post for Number Theory or this post for Probability). In fact, the famous Riemann Hypothesis can be reformulated as showing that a certain sum exhibits square root cancellation.

The square root cancellation heuristic asserts that “most” vectors exhibit square root cancellation. Put another way, if we come across a vector in practice, we’d expect there to be square root cancellation as in the above example.

It turns out that the example we will discuss below generalizes considerably. The only thing that really matters is that each coordinate has mean zero (which is easily remedied by translating by a fixed constant). There are also some technical assumptions that the coordinates of v are not too large, but this is never an issue for vectors that only take on finitely many values. The motivated reader can consult the Lindeberg-Lévy Central Limit Theorem for more along this direction.

Before going any deeper, let’s explain what’s going on in the aforementioned Attention is All you Need paper.

Attention is All You Need Example

In the above equation, we have

\frac{QK^T}{\sqrt{d_k}} .

Each entry of QK^T is the dot product of a row of Q and a row of K:

q \cdot k = q_1 k_1 + \cdots + q_{d_k} k_{d_k}.

It turns out that this is a sum of d_k numbers, and by the square root cancellation principle, we expect the sum to be of order \sqrt{d_k}. Hence dividing through by this number allows the sum to be, on average, of size comparable to 1 (rather than the much larger \sqrt{d_k}).

Keeping the sums from being too large helps us avoid the problem of exploding gradients in deep learning.
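
Here is a small numpy illustration of the heuristic (a sketch, not the paper’s code): for q and k with independent mean-zero, unit-variance entries, the dot product q \cdot k is typically of size about \sqrt{d_k}, and dividing by \sqrt{d_k} brings it back to size about 1.

import numpy as np

rng = np.random.default_rng(0)
d_k = 512
dots = np.array([rng.normal(size=d_k) @ rng.normal(size=d_k) for _ in range(10_000)])

print(np.std(dots))                 # roughly sqrt(d_k), about 22.6
print(np.std(dots / np.sqrt(d_k)))  # roughly 1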

Back to the Math

Recall we had a vector v of d dimensions,

v = (v_1 , \ldots , v_d),

and we are interested in what the sum is, on average. To make the question more precise, we assume each coordinate is independently generated at random with mean zero. The key assumption is independence, that is, the generation of one coordinate does not impact the others. The mean zero assumption can be obtained by shifting the values (for instance, instead of recording a six-sided dice roll, record the dice roll minus 3.5).

Here are some examples in python that you are welcome to check

v1 = np.random.choice([1,2,3,4,5,6],size=500) - 3.5
v2 = np.random.exponential(1,size=500) - 1 
v3 = np.random.normal(0,1,size=500) 

For such a vector, and d a bit large (say over 30), we expect

|v_1 + \cdots + v_d| \approx \sqrt{v_1^2 + \cdots + v_d^2}.

Specializing to the case where the coordinates are \pm 1, we recover the \sqrt{d} from earlier. Now why should we expect that the sum is of size \sqrt{d}?

One way to see this is to explicitly compute the expected value of

\left( v_1 + \cdots + v_d \right)^2,

which is precisely d. This computation can be done by expanding the square and using that the covariance of two different coordinates is 0. We will put the classical computation at the end of the post for those interested.

It is a common theme in this area that working with the square of the sum is theoretically much easier than working with the sum itself.

The Central Limit Theorem

It is worth reiterating that the square root cancellation heuristic generalizes much further than the \pm 1 example, but let’s continue our focus there. The astute reader will notice that the above histplot looked like the normal distribution. In fact, it is. If we consider the sum

\frac{v_1 + \cdots + v_d}{\sqrt{d}},

the histplot will be approximately normal with mean zero and variance one. Put another way,

v_1 + \cdots + v_d \approx \sqrt{d},

and we can quantify the \approx precisely. Thus if we have a sum of d numbers (where we expect the numbers to be on average 0), we can divide the sum by \sqrt{d} in order to make the sum \approx 1.

The Variance Computation

We will show

E[ \left( v_1 + \cdots + v_d \right)^2 ] = \sum_{j} E[v_j^2]

Indeed,

E[ \left( v_1 + \cdots + v_d \right)^2 ]= E[\sum_{ i,j} v_i v_j ] =\sum_{ i,j} E[v_i v_j ] ,

using FOIL from algebra and linearity of expectation. It is thus enough to show

E[v_i v_j ] = 0,

for every i \neq j. But this follows from independence and the mean zero assumption, since then E[v_i v_j] = E[v_i]E[v_j] = 0.
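
Here is a quick numerical check of this computation for the \pm 1 case: the average of (v_1 + \cdots + v_d)^2 over many random vectors should be close to d.

import numpy as np

rng = np.random.default_rng(0)
d = 500
sums_squared = [rng.choice([-1, 1], size=d).sum() ** 2 for _ in range(100_000)]
print(np.mean(sums_squared))   # close to d = 500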

How Does ChatGPT read?

How would ChatGPT read the infamous “Hello, World!”? Does it see each character, sequentially:

H e l l o , W o r l d !

Or maybe it sees each word as well as the punctuation:

Hello , World !

By the end of this post we will have a full understanding of this. Along the way, we will learn about Unicode, UTF-8, and byte pair encoding (BPE).

In order to understand how ChatGPT sees data, we have to understand the data on which it is trained. The majority of the data used to train GPT-3 comes from the Common Crawl dataset, which is text scraped from the internet. Thus we turn our attention to understanding how text is encoded on the web.

Code Points

In order for our computers to store and transfer text, we need a way of converting characters (i.e. elements of an alphabet, punctuation, etc.) to bits. Thanks to binary numbers, it is enough to convert characters to integers (though encoding schemes like the popular UTF-8 provide a more complex and efficient conversion of code points to bits, as we will see later).

Thus we first turn our attention to mapping characters to integers, known as a character encoding. This leads us to Unicode.

A Brief History of Unicode

The earliest character encoding was ASCII (pronounced like as-kee), which stands for the American Standard Code for Information Interchange. One key problem with it is already evident from the name... what if non-Americans would like to exchange information?

ASCII provides code points for 128 characters, including the English alphabet and common punctuation. ASCII is typically sufficient for sending English messages. You can get the ASCII code point of the letter A (and go back the other way) in python with the following built-in functions.

print(ord("A")) #ASCII code point of A
print(chr(65)) #character of code point 65

In addition to the aforementioned symbols, there are also code points that correspond to non-printable information, which can cause some confusion.

ASCII contains most of the characters you will need if your goal is to communicate in English, and it was widely adopted in the 1960s. However, ASCII cannot support languages with different alphabets, accented characters, emojis, and more.

Thus a group of people set out to create a more inclusive standard for representing text that was also backwards compatible with the already widely adopted ASCII. After several iterations, Unicode is now the widely adopted standard. It is supported by a variety of blue chip companies, as can be seen from its members page.

What is Unicode?

Unicode is a way to convert nearly 150,000 characters to integers. For instance, here is a nice list of the integer to character conversions. You can input Unicode directly into HTML via &# followed by the decimal representation of the code point. For instance,

<p>&#70000</p>
<p>&#129312</p>

render as 𑅰 and 🤠, respectively.

You can also directly write unicode on your local machine by following a tutorial (Mac, Windows, and Linux).

Thus Unicode extends ASCII to accommodate nearly all desired written text, with nearly 150,000 characters assigned a code point. It turns out this encoding plays a large part in how text is encoded on the web, and consequently in the training of ChatGPT. But before we see this connection, we have to discuss UTF-8.

UTF-8

While Unicode is accommodating in terms of the characters it can encode, storing every code point with the same fixed number of bytes is not terribly efficient. For instance, if you plan to write mostly ASCII, it would make sense for those characters to require a smaller amount of space to encode. This is exactly the purpose that the UTF-8 encoding serves.

Recall that to store and send text, one needs to convert to bits. In practice, we work with bytes, which is just 8 bits. As 8 bits gives 2^8 = 256 possibilities, all of ASCII can be represented by 1 byte (with room to spare). UTF-8 is an attempt to convert the Unicode code points to bytes in an efficient manner.

A byte can be represented by two hexadecimal digits. For instance, the numbers 0-20 are given in hexadecimal by:

0 1 2 3 4 5 6 7 8 9 A B C D E F 10 11 12 13 14

So we can represent the character H, which is 72 in ASCII, by 48 in hexadecimal. UTF-8 converts Unicode code points to either 1,2,3 or 4 bytes. For instance, the UTF-8 encoding of

Hello 🤠

is

48 65 6c 6c 6f 20 f0 9f a4 a0

Note that the first 5 bytes correspond to H e l l o, the 20 is the space, and the last 4 correspond to 🤠. UTF-8 is set up in such a way that it is clear that the last 4 bytes are all part of one character.
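
You can reproduce these bytes in Python, where str.encode produces the UTF-8 bytes and bytes.hex displays them in hexadecimal (the separator argument to hex requires Python 3.8+).

text = "Hello 🤠"
print(text.encode("utf-8").hex(" "))   # 48 65 6c 6c 6f 20 f0 9f a4 a0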

Back to ChatGPT

Unicode has the advantage of being able to methodically encode nearly all text on the web as integers. An integer turns out to be perfect for inputting into a machine learning model. However, inputting raw Unicode, one code point per character, would be inefficient: the sequences would be long and the vocabulary would contain nearly 150,000 characters. For comparison, the encoding used for ChatGPT that we will discuss below has 100,261 tokens, but each token can cover a frequently occurring chunk of text rather than a single character. Thus it is convenient to have a clever way of converting text to integers that goes beyond Unicode.

Byte Pair Encoding (BPE) is a preprocessing step that allows us to identify subwords that appear often in the text. The starting point of BPE, as the name suggests, is bytes. We start by encoding every single byte to an integer 0-255, which we call a token. Thus any Unicode text can be written as a sequence of tokens via the UTF-8 encoding. For instance, the Hello 🤠 above can be tokenized to

72 101 108 108 111 32 240 159 164 160

However, we can make this more efficient by adding additional tokens. For instance, the word “to” appears quite often in English text. However, it is currently encoded as

83 78

What we can do is create a new token for the word to, so that instead of using 2 tokens for this common word, we only use one (this is the “pair” in BPE). Using tiktoken, released by OpenAI, we can see that this is exactly what was done.

#will need to install tiktoken: pip install tiktoken
import tiktoken 
enc = tiktoken.encoding_for_model("gpt-3.5-turbo") #gpt-3.5-turbo - ChatGPT
enc.encode('to')

Running this code, we see that the integer 998 is reserved for to.
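
Continuing from the snippet above, encoding and decoding are inverses, and you can inspect the chunk of text behind each token id (the exact ids you see depend on the tokenizer, so none are asserted here).

tokens = enc.encode("Hello, World!")
print(tokens)                              # a short list of token ids
print(enc.decode(tokens))                  # Hello, World!
print([enc.decode([t]) for t in tokens])   # the text chunk behind each token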

Byte Pair Encoding

So how exactly is the byte pair encoding performed? We will give a brief explanation; the details can be found in ~20 lines of python code in Algorithm 1 of this paper of Sennrich, Haddow, and Birch, and are also explained in Section 2 of the GPT-2 paper.

We start by taking a smallish sample of our text data. We then convert the text to bytes via the UTF-8 encoding. After this we see which pair of bytes appears most often and assign a new token to that pair. We can see the first pair with the following python code (continued from above).

print([x for x in enc.decode_bytes([256])])

The result is 32 32, which corresponds to two consecutive spaces. BPE then repeats this process, with the possibility of joining the newly created token to any other token. In fact, this is repeated over 100,000 times!

In fact, the first non-space join is that of i and n to form “in” (token 258).

It is worth mentioning that BPE is not the only method of tokenizing. For instance, Google’s Bard uses SentencePiece.

One Issue

It is well known that not every byte sequence is valid UTF-8. Thus, it is possible in theory for ChatGPT to produce invalid UTF-8. Of course, this becomes increasingly rare as the model is trained more and more. In fact, the decoder provided by tiktoken has a kwarg to specify how to handle this exact issue.

Recap

To see how ChatGPT is trained, we first have to understand the data. The data is scraped from the web, which led us to Unicode and the UTF-8 encoding. Unicode assigns code points to nearly 150,000 characters, which is inefficient to feed to a model directly. This motivates looking at a compression technique, i.e. the Byte Pair Encoding.