Category Archives: Mathematics

The (near) Future of AI in Math

I believe that the way mathematicians conduct research will be fundamentally different in the near future. Below, I’ll describe my perspective, developed from working as a mathematician and also in big tech building AI solutions. Whilst AI will play a pivotal role, I believe the key next step is the adoption of software best practices.

Current State of Mathematical Research

This is the current lifecycle of a research paper:

  1. Brainstorm to strategize for ideas to explore
  2. Convert Idea into a concrete strategy
  3. Convince oneself that strategy will work
  4. Work on the implementation details
  5. Edit and verify correctness of paper
  6. Submit Paper to a journal
  7. Third Party reviews paper for correctness/value

To an outsider, it may come as a surprise that it is not uncommon for step (7) to be the most painful step in this journey. Reviews can take years and are somewhat subjective. Reviewers are unpaid and must balance this duty with other competing demands. This is especially painful for early-career mathematicians: a postdoc can be awaiting a career-changing decision whilst a paper sits idly in an anonymous reviewer’s mailbox.

Verifying correctness of a paper is a thorny issue. Even after self and peer review, it is not uncommon to find papers with unfixable mathematical flaws. There are many stories of top-tier mathematicians with such errors.

Computer Formalization of Mathematical Proofs

These problems can largely be solved with software engineering. Formal theorem provers, like Lean, provide a way to confirm that proofs are correct. Thus, if new papers were written in Lean, there would be no need for humans to check the validity of the proof. This may halve the work for the authors of a paper and decrease the work of a reviewer — assuming they actually do check validity — by an order of magnitude.

This sounds great, but in practice there are serious challenges. First off, writing a formal proof in Lean is currently quite challenging on its own. Mathematicians typically skip details that other experts can infer. Computers do not allow such a luxury: every tiny step must be justified. Whilst this challenge is immediately apparent when starting to use Lean (see my previous blog post), I do not see it as a long-term obstacle. The use of computers allows for automation and tooling, which will ease the burden in the long run.
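
To make this concrete, here is a toy example of my own in Lean 4 (not taken from any actual paper): even the “obvious” fact that {a \leq a + b} for natural numbers must be justified by citing an explicit lemma.

-- Every step needs an explicit justification; here the library lemma Nat.le_add_right supplies it.
example (a b : Nat) : a ≤ a + b := Nat.le_add_right a b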

Tooling

Good tooling makes software development quicker, easier and less error prone — and even more enjoyable. For instance, a good IDE allows developers to catch and avoid bugs quickly and to seamlessly navigate large code bases. Lean also offers tooling (called tactics) that allows developers to go from a statement that is “almost” proved to fully proved.
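
As a small sketch (assuming Lean 4 with Mathlib), a single tactic can close a goal that would otherwise require spelling out routine algebra step by step:

import Mathlib.Tactic

-- `ring` proves any identity valid in a commutative ring, taking the goal from
-- "almost proved" (a routine algebraic identity) to fully proved in one step.
example (x y : Int) : (x + y) ^ 2 = x ^ 2 + 2 * x * y + y ^ 2 := by
  ring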

Another promising direction for good tooling is AI. Tools like GitHub Copilot and Claude Code provide useful autocomplete and can even generate full code from structured requirements. Currently these tools have limited effectiveness for Lean-specific tasks — likely due to the limited data available for good Lean proofs. However, I do not see this as a long-term blocker.

There is a virtuous cycle: as the tooling gets better, more theorems will be converted to Lean, further improving the available data.

Code Migration

In order for computer formalization to work at scale, we need to formalize existing math. This is a monumental challenge. Think of this as a code migration, where:

  1. Legacy code (i.e. math papers) is informal, with incomplete — and sometimes incorrect — implementations
  2. Contributions have been made from hundreds of thousands of developers over the course of 2000 years.
  3. The subject matter experts typically have no experience in software engineering
  4. Research communities are largely siloed, with projects usually having less than 5 team members

Code migrations at much smaller scales are already incredibly challenging. Currently, what exists is one large repository (a monorepo) that hosts all of the formalized code. Inside this repository, we can find various folders containing proofs from different areas of mathematics. The monorepo approach is known to work for large-scale systems, such as at Meta or Google.

The Vision

We wonder which of the steps

  1. Brainstorm to strategize for ideas to explore
  2. Convert Idea into a concrete strategy
  3. Convince oneself that strategy will work
  4. Work on the implementation details
  5. Edit and verify correctness of paper
  6. Submit Paper to a journal
  7. Third Party reviews paper for correctness/value

can be transformed by adopting the advances in software engineering and AI. Already, mathematicians are using ChatGPT to brainstorm research ideas and even come up with proofs of simple lemmas. The latter can easily save hours of time.

Converting an idea to a concrete strategy and convincing oneself it will work is a trickier task. This usually involves deep domain expertise if one wants to be at the cutting edge of research.

Writing the implementation details and checking them is a long process. This usually involves breaking a large problem into many sub-problems and then diligently verifying the correctness of every sub-problem. While not quite ready now, this is where good tooling could play a pivotal role. I believe a system that takes the statement of a lemma as input and (usually) provides a correct formal proof as output should be within reach. We have some early signs of this, both through ChatGPT and also through AI helpers within the Lean code base itself. Nevertheless, there is much room for improvement. For a deep dive, Terry Tao posted a series of YouTube videos working through the same lemma with several different AI helpers.

Once we have formally verified the paper, all that is left is for the reviewer to provide feedback and decide whether the paper will be accepted. This usually involves understanding the novelty and impact of the paper. For domain experts, this is a quick process. Furthermore, this is something researchers should already be doing, as it is important to have a mental map of the research in one’s area.

Conclusion

Mathematics will inevitably adopt the tooling from software engineering and AI. Painful steps in mathematical research will be largely replaced, such as verification and certain implementation details. AI and software tooling will be a force multiplier, allowing individual mathematicians to accomplish much more than they can today.

AUC and Cross-entropy

In this post we will see that log loss and AUC for binary classification problems are closely related.

Suppose we have a set of probabilities that are meant to predict whether a true label is 0 or 1. That is, for the {k}-th example, {k = 1, \ldots , m}, let {\widehat{y_k} \in [0,1]} be the predicted probability and {y_k \in \{0,1\}} the true label.

Recall the AUC is the area under the ROC curve. We will use another interpretation, which is derived on the above wiki: AUC is the probability that a randomly chosen positive example is ranked higher than a randomly chosen negative example.

Note the larger that AUC is, the better the predictions are. For comparison of AUC to log loss, we will convert AUC to a loss function. We do so by studying {1-{\rm AUC}}.

Let {P} be the number of positive examples and {N} the number of negative examples. Thus,

\displaystyle 1 - AUC = \frac{1}{PN}\sum_{p : y_p = 1} \sum_{n : y_n = 0} f_{AUC}(\widehat{y_p},\widehat{y_n}),

where {f_{AUC}(x,y) = 1} if {y>x} and {0} otherwise (ties are ignored for simplicity).
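
As a quick sanity check (a minimal sketch with made-up data, assuming scikit-learn and no ties in the scores), the pairwise double sum above agrees with one minus the area under the ROC curve:

import numpy as np
from sklearn.metrics import roc_auc_score

# Toy data: y holds the true labels, y_hat the predicted probabilities.
y = np.array([1, 1, 1, 0, 0])
y_hat = np.array([0.9, 0.4, 0.7, 0.2, 0.6])

pos = y_hat[y == 1]
neg = y_hat[y == 0]

# 1 - AUC via the pairwise definition above.
one_minus_auc = np.mean([n > p for p in pos for n in neg])

print(one_minus_auc)                # 0.1666...
print(1 - roc_auc_score(y, y_hat))  # matches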

We introduce another quantity:

\displaystyle I = \frac{1}{PN} \sum_{p : y_p = 1} \sum_{n : y_n = 0} \log(\widehat{y_p}) \cdot \log(1-\widehat{y_n}).

This quantity has a similar shape to the first, except that {f_{AUC}(x,y)} is replaced by {\log(x) \cdot \log(1-y)}. These two functions behave similarly: the latter is a smooth variant of the former that penalizes poor misclassifications much more harshly. Put another way, the latter quantity penalizes confidently wrong predictions severely.

The above equation factors and is equal to

\displaystyle I = \left( \frac{1}{P} \sum_{p : y_p = 1} \log(\widehat{y_p}) \right) \left( \frac{1}{N} \sum_{n : y_n = 0} \log(1-\widehat{y_n}) \right).

Let us compare this to the log loss:

\displaystyle {\rm LogLoss} = -\frac{1}{P +N} \sum_{p : y_p = 1} \log(\widehat{y_p}) - \frac{1}{P + N} \sum_{n : y_n = 0} \log(1-\widehat{y_n}).

The intermediate expression and LogLoss are connected by the AM-GM inequality. For simplicity, let us suppose {P = N}. Then, the AM-GM inequality implies

\displaystyle  \sqrt{I} \leq {\rm LogLoss}.
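
To spell out this step (a small addition, using only the definitions above), write

\displaystyle a = -\frac{1}{P} \sum_{p : y_p = 1} \log(\widehat{y_p}), \qquad b = -\frac{1}{N} \sum_{n : y_n = 0} \log(1-\widehat{y_n}),

both of which are nonnegative. Then {I = ab} and, for {P = N}, {{\rm LogLoss} = (a+b)/2}, so the AM-GM inequality {\sqrt{ab} \leq (a+b)/2} is exactly the bound above, with equality precisely when {a = b}.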

As {I} is similar to {1 - AUC}, we have, roughly, that {\sqrt{1-{\rm AUC}}} is a lower bound on the log loss. Let us make some assumptions:

  • Balanced data: {P \approx N}.
  • The predictions are rarely badly wrong, i.e. {\widehat{y_p}} is rarely close to zero and {\widehat{y_n}} is rarely close to one.
  • Predictions have similar errors for positive and negative examples, that is, the log loss over the positive examples is similar to the log loss over the negative examples.

Then, following through the above argument, we find that

\displaystyle \sqrt{1 - AUC} \approx {\rm LogLoss}.

The main new ingredient beyond the inequality above is that equality in the AM-GM inequality is achieved when the two terms {a} and {b} are equal, which is precisely the last assumption.

Thus, we understand a couple of things:

  • We expect that {\sqrt{1 - AUC} \approx {\rm LogLoss}}.
  • If this is not the case, then we can infer that the assumptions above are not met in our particular use-case.
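
To close, here is a minimal numerical sketch (synthetic, balanced data; assuming scikit-learn) of the AM-GM bound {\sqrt{I} \leq {\rm LogLoss}} and the comparison with {1 - AUC}:

import numpy as np
from sklearn.metrics import log_loss, roc_auc_score

rng = np.random.default_rng(0)

# Balanced synthetic data with noisy but informative probability predictions.
y = np.repeat([0, 1], 500)
y_hat = np.clip(0.5 * y + 0.25 + 0.2 * rng.standard_normal(y.size), 0.01, 0.99)

a = -np.mean(np.log(y_hat[y == 1]))      # per-class loss on the positives
b = -np.mean(np.log(1 - y_hat[y == 0]))  # per-class loss on the negatives

I = a * b
print(np.sqrt(I), log_loss(y, y_hat))    # sqrt(I) <= LogLoss, close when a and b are similar
print(1 - roc_auc_score(y, y_hat))       # compare with 1 - AUC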

PCA like a Mathematician

Principal Component Analysis, or PCA, is a fundamental dimensionality reduction technique used in Machine Learning. The general goal is to transform a {d}-dimensional problem into a {k}-dimensional one, where {k} is smaller than {d}. This can be used, for instance, to plot high-dimensional data on a 2D or 3D plot, or to reduce the number of features in an effort to eliminate random noise in a machine learning problem (made precise below). In this post, we will discuss PCA in mathematical terms. The treatment draws on a first course in Linear Algebra, the principle of maximum likelihood estimation, and the Eckart-Young theorem from 1936.

Example: Suppose we are presented with {m} samples {x_1 , \ldots , x_{m}}, where each is a 100-dimensional vector. After inspection, we realize that each data point is of the form {c_i z}, where {c_i} is a real number and {z} is a fixed 100-dimensional vector. Thus, to describe each {x_i}, we simply need to specify {c_i}, rather than the 100 coordinates of {x_i}. This effectively reduces the dimensionality of our data from 100 to 1. {\spadesuit}

The goal of PCA is to provide a methodical way of accomplishing this in general. There are a few issues to overcome.

  • The data may not be of the form {c_i z}, but rather {c_i z_1 + d_i z_2}, or even higher dimensional.
  • The data might only be approximately of the form {c_i z}, i.e. of the form {c_i z + e_i }, where {e_i} is a small error term.
  • We may not know, a priori, the form of the data. How do we determine the best {k}-dimensional representation of the data?

Let us address these issues below.

1. PCA in Mathematical Terms

Suppose we are presented with a matrix {X} of size {m \times d}, where each row is a sample and each column is a feature. We denote the rows of {X} by {x_1 , \ldots , x_m}. We suppose further, without loss of generality after centering and scaling, that each column of {X} has zero mean and unit variance.

Our goal is to find the best approximation of the form

\displaystyle x_i \approx z_i^T W ,

for all {i = 1 , \ldots , m}, where {W} is a matrix of size {k \times d} and {z_i} is a vector of size {k}. Thus, each {x_i} is generated from lower dimensional data, via the {z_i}.

We suppose, more precisely, that

\displaystyle x_i =z_i^T W + \epsilon_i,

where {\epsilon_i \sim N(0, I)} is an error term. That is, each coordinate of {\epsilon_i} is normally distributed with mean zero and variance 1. From here, we plan to derive PCA. One can wonder whether the assumption on {\epsilon_i} is reasonable; this should be addressed on a case-by-case basis. The assumption that {x_i} is generated by lower dimensional data is sometimes referred to as a latent representation, and the interested reader may consult this blog post for applications of this idea to generative models, such as image generation.

From this assumption, how do we find {W} and the {z_i}? The answer comes from the principle of Maximum Likelihood Estimation (MLE). Using the formula for the probability density function of a multivariate normal distribution, for a fixed {W} and {z_1 , \ldots , z_m}, the probability (density) of observing {x_1 , \ldots , x_m} is given by

\displaystyle P(x_1 , \ldots , x_m | W, z_1 , \ldots , z_m) = \prod_{i=1}^m \frac{1}{(2 \pi )^{d/2}} \exp \left( - \frac{1}{2 } \| x_i - z_i^TW \|^2 \right).

The norm in the above equation is the classic Euclidean norm. The principle of MLE states that we should choose {W} and the {z_i} to maximize this probability. It is typically easier to maximize a sum rather than a product. For instance, in Calculus, it is much easier to take the derivative of a sum than of a product, as the product rule for derivatives is more complex than the sum rule. A classic mathematical trick that converts a product into a sum is to take logs. As the logarithm is a monotonically increasing function, maximizing the log of a function is equivalent to maximizing the function itself. Thus, we consider the log likelihood

\displaystyle \log P(x_1 , \ldots , x_m | W, z_1 , \ldots , z_m) = - \frac{md}{2} \log (2 \pi) - \frac{1}{2} \sum_{i=1}^m \| x_i - z_i^T W \|^2.

As our goal is to maximize the above, we can equivalently minimize the sum

\displaystyle \sum_{i=1}^m \| x_i - z_i^T W \|^2,

after removing the constant term that does not depend on {W} or the {z_i} and scaling the remaining summand. We can rewrite the above as

\displaystyle  \| X - ZW\|_{F}^2,

where {X} is the matrix of size {m \times d} whose rows are {x_1 , \ldots , x_m}, {Z} is the matrix of size {m \times k} whose rows are {z_1 , \ldots , z_m}, and {W} is, as above, the matrix of size {k \times d}.

2. The Eckart-Young Theorem

The solution to this problem comes from the Eckart-Young theorem. Unfortunately, the original paper is behind a paywall, but we will provide the ideas here. Recall the rank of a matrix is the dimension of its column space.

Let us also recall that the Singular Value Decomposition (SVD) of a matrix {X} of size {m \times d} is given by

\displaystyle X = U \Sigma V^T,

where {U} is an {m \times m} orthogonal matrix, {\Sigma} is an {m \times d} diagonal matrix, and {V} is a {d \times d} orthogonal matrix. The diagonal entries of {\Sigma} are called the singular values of {X}. It turns out that the columns of {U} are the eigenvectors of {X X^T} and the columns of {V} are the eigenvectors of {X^T X}.
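
As a quick sketch with toy sizes of my own choosing (note that numpy returns {V^T} rather than {V}):

import numpy as np

m, d = 6, 4
X = np.random.default_rng(0).standard_normal((m, d))
U, s, Vt = np.linalg.svd(X, full_matrices=True)

print(U.shape, Vt.shape)               # (6, 6) and (4, 4): the orthogonal factors
Sigma = np.zeros((m, d))
np.fill_diagonal(Sigma, s)             # the m x d "diagonal" matrix of singular values
print(np.allclose(X, U @ Sigma @ Vt))  # True: X = U Sigma V^T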

Theorem: (Eckart-Young) Let {X} be any matrix of size {m \times d} and let {k \leq d}. Then

\displaystyle \min_{X_k} ||X - X_k||_{F}^2 \ \ {\rm s.t.} \ \ {\rm rank}(X_k) \leq k,

is attained at

\displaystyle X_k = U \Sigma_{k}V^T,

where

\displaystyle X = U \Sigma V^T,

is the Singular Value Decomposition of {X} and {\Sigma_k} is the {m\times d} matrix which is zero except that its first {k} columns agree with those of {\Sigma}. {\spadesuit}

There is a lot to unpack here:

  • How does this solve our original problem?
  • Why is this theorem true?

Let us address these in order. First note that {{\rm rank}(ZW) \leq {\rm rank}(Z) \leq k}, as {Z} has {k} columns, so we were indeed seeking a rank {k} approximation to {X} above. Here we can take {Z = U\Sigma_k}, whose columns beyond the first {k} are zero, and {W = V^T}; the rows of {U\Sigma_k} then play the role of {z_1 , \ldots , z_m} from above. This nicely generalizes our original one-dimensional example, since

\displaystyle x_i \approx \sum_{j=1}^k \sigma_j u_{ij} v_j ,

where {\sigma_j} is the {j}-th singular value of {X}, {u_{ij}} is the {(i,j)} entry of {U}, and {v_j} is the {j}-th column of {V}.

Proof of Eckart-Young: By the SVD, it is enough to find {X_k}, of rank at most {k}, that minimizes

\displaystyle ||U \Sigma V^T - X_k||_{F}^2.

Since the Frobenius norm is invariant under orthogonal transformations, we may equivalently minimize

\displaystyle ||\Sigma - U^T X_k V||_{F}^2.

But {U^T X_k V} has rank at most {k}, since multiplying by the orthogonal matrices {U^T} and {V} does not change the rank. Since {\Sigma} is a diagonal matrix, it is straightforward (or see here) to check that the best rank {k} approximation to {\Sigma} is obtained by keeping the first {k} columns of {\Sigma} and setting the rest to zero, i.e. by taking {\Sigma_k}. Thus,

\displaystyle \Sigma_k = U^T X_k V,

and the result follows. {\spadesuit}
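
As a small numerical sketch of the theorem (synthetic data and ranks of my own choosing), the truncated SVD beats an arbitrary rank-{k} guess in Frobenius norm:

import numpy as np

rng = np.random.default_rng(1)
m, d, k = 8, 6, 2
X = rng.standard_normal((m, d))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]   # best rank-k approximation, per the theorem

# Any other rank-k matrix gives a Frobenius error at least as large.
Y = rng.standard_normal((m, k)) @ rng.standard_normal((k, d))
print(np.linalg.norm(X - X_k, 'fro') <= np.linalg.norm(X - Y, 'fro'))  # True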

3. A Nice After Effect

Thanks to the singular value decomposition, we are equipped with an algorithm for reducing the dimension of data outside the training set. To see this, note that we may compute the latent variables via

\displaystyle Z = X V_k,

where {V_k} is the matrix of size {d \times k} whose columns are the first {k} columns of {V}. Thus, given a new data point {x}, we may compute the latent representation via

\displaystyle z = x^T V_k.

This is a nice feature of PCA that is not afforded by some other dimensionality reduction techniques, such as t-SNE. It is what makes the transform method available in sklearn’s PCA implementation, and it allows PCA to be used in the feature engineering portion of a machine learning pipeline.
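
Here is a minimal sketch of exactly this workflow (synthetic data; assuming scikit-learn): fit on training data, then transform unseen data with the same learned projection.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X_train = rng.standard_normal((100, 10))
X_new = rng.standard_normal((5, 10))

pca = PCA(n_components=2)
Z_train = pca.fit_transform(X_train)   # latent variables for the training data
Z_new = pca.transform(X_new)           # the same projection applied to unseen data

print(Z_train.shape, Z_new.shape)      # (100, 2) (5, 2)

Note that sklearn’s implementation centers the columns internally before projecting, matching the zero-mean assumption above.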

4. An Alternate Perspective

The more classical way to approach PCA is to search for a vector {v \in \mathbb{R}^d} satisfying

\displaystyle v = \arg \max_{\|u\| = 1} \| Xu\|.

Thus, {v} is the vector that maximizes the variance of the data projected onto the line generated by {v}. However, the MLE approach above is more natural from a machine learning perspective.

That being said, I answered a question on Math Stack Exchange a while back that provides a derivation of PCA from this perspective. I was curious whether the Lagrange multiplier approach outlined in that post could be used to derive the Eckart-Young theorem.

To see how to approach this, suppose we want to prove Eckart-Young for {k=1}, i.e. we would like to find a rank 1 matrix {X_1} that minimizes

\displaystyle ||X - X_1||_{F}^2.

As {X_1} is rank 1, it can be written as {uv^T}, where {u} and {v} are vectors. Thus, we are looking for {u} and {v} that minimize

\displaystyle ||X - uv^T||_{F}^2.

This is an optimization problem in {u} and {v}. Let us take the gradients with respect to {u} and {v} and set them equal to zero. By the definition of the Frobenius norm and the multivariate chain rule, we have

\displaystyle \nabla_u ||X - uv^T||_{F}^2 = -2 (X - uv^T)v = 0,

and

\displaystyle \nabla_v ||X - uv^T||_{F}^2 = -2 (X - uv^T)^T u = 0.

Thus,

\displaystyle Xv = uv^Tv, \ \ \ X^T u = v u^T u.

These are precisely the equations satisfied by the SVD of {X}.
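
As a quick numerical check (synthetic data; the normalization {u = \sigma_1 u_1}, {v = v_1} is one convenient choice), the leading singular pair satisfies these equations:

import numpy as np

X = np.random.default_rng(3).standard_normal((5, 4))
U, s, Vt = np.linalg.svd(X, full_matrices=False)

u = s[0] * U[:, 0]   # scaled leading left singular vector
v = Vt[0]            # unit leading right singular vector

print(np.allclose(X @ v, u * (v @ v)))    # X v = u (v^T v)
print(np.allclose(X.T @ u, v * (u @ u)))  # X^T u = v (u^T u)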

I’d like to see a cleaned up version of this argument, in the general case.

Carnival Of Mathematics

In this somewhat different post, I am hosting the long-running Carnival of Mathematics. First I’ll talk about 223 (the issue number) and then I’ll round up some mathematical posts from December 2023.

It’s primetime we talk about 223. First of all, it is a lucky prime, and it is unknown whether there are infinitely many of these. Writing 223 as a sum of fifth powers requires 37 terms, more than any other number. This is an instance of Waring’s problem, which has a rich history going back to Diophantus nearly 2000 years ago. Also, 223 is the number of permutations of 6 elements that have a strong fixed point (see below for Python code).

And now for what happened last month.

  • Quanta brings current research mathematics to a general audience. Solutions to long-standing problems happen all the time in math, and this video highlights some from the past year. The one I know the most about is at the end. The “Dense Sets have 3-term Arithmetic Progressions” problem has been the subject of intense study for over 70 years. Multiple Fields medal winners over several generations have studied the problem, and a big breakthrough was made by two computer scientists (including a PhD student from my alma mater).
  • Quanta also posted their Year in Math. Particularly interesting to me is Andrew Granville’s (my former Mentor) discussion of the intersection of computation and mathematics.
  • Here you can find a “Theorem of the Day,” where the author impressively keeps posting interesting theorems from mathematics on a daily basis.
  • An interesting story of a Math SAT problem with no correct answer.
  • John Cook’s explanation of a Calculus trick. See if you can figure out how his trick is a manifestation of the fact that, for a right triangle with angle theta, sin(\theta) only depends on the ratio of the non-hypotenuse sides.
  • If you are interested in some drawings with math and a little bit of humor, you can find that here.
  • Another article from Quanta highlighting a very recent breakthrough. Interesting to me is how quickly (3 weeks!) the result was formalized into computers. What this means is that a group of people got together and input all of the results into a computer, which then tested the validity of the results in an automated way. This process of inputting the results into the computer can be quite painful, and I am impressed by the speed at which it was done. You can check out my previous blog post for my experience with it. Incidentally, one of the classical results I helped formalize in that blog post was used to formalize the recent result. The referee and review process for math papers is particularly painful for mathematicians. It is not uncommon for mathematicians to submit a paper and wait years for the decision to accept or reject to come in. Meanwhile, especially for younger mathematicians, their careers hang in the balance as they await these decisions. Adopting software engineering best practices in this process, as done here, would, I believe, go a long way toward alleviating some of this anguish.

Here is the Python code for the strong fixed point verification.

from itertools import permutations
perms = permutations([1, 2, 3, 4, 5, 6])

def fixed_points(perm):
    """
    Input: a permutation written as a tuple, i.e. (1,3,2) for 1->1, 2->3, 3->2
    Output: a list of the indices of the fixed points of the permutation
    """
    return [ind for ind, p_ind in enumerate(perm) if ind + 1 == p_ind]

def is_strong_fixed_point(ind, perm):
    """
    Input: An index and a permutation
    Output: True if the index is a strong fixed point of the permutation, else False
    See here: https://oeis.org/A006932
    """
    for lesser_ind in range(0, ind):
        if perm[lesser_ind] > perm[ind]:
            return False
    for greater_ind in range(ind + 1, len(perm)):
        if perm[greater_ind] < perm[ind]:
            return False
    return True

counter = 0
for perm in perms:
    for fixed_point in fixed_points(perm):
        if is_strong_fixed_point(fixed_point, perm):
            counter += 1
            break
print(counter)

Can Chat-GPT Do Math?

Chat-GPT is an impressive new AI chatbot released by OpenAI. Striking applications of it can be found all over the internet. But can it do math?

By math, we do not mean simply performing computations. Its own design ensures that there will be computational problems it is unable to solve. What I am more interested in is whether it can solve problems that require some mathematical reasoning.

To choose our problems, we use the MMLU dataset. Galactica, a large language model recently released by Meta AI, has achieved some good results on this dataset. Their findings are in section 5.3 of their paper.

In what follows, we’ll pick a few problems and see how Chat-GPT handles them. If it fails, we will try to reword the question in hopes of achieving some partial results.

We note that the MMLU data set consists of multiple choice questions, but we present only the questions to Chat-GPT. We choose questions from the abstract algebra section of the MMLU. Afterwards, we’ll prod Chat-GPT for intuition and a proof. We’ll end by suggesting a few applications of Chat-GPT for mathematicians.

Question 1

Me: The inverse of -i in the multiplicative group, {1, -1, i , -i } is

Chat-GPT: The inverse of -i in the multiplicative group {1, -1, i, -i} is i. The inverse of an element x in a group is an element y such that xy=yx=1, where 1 is the identity element of the group. In this group, 1 is the identity element, so the inverse of an element x is an element y such that x*y=1.

For example, the inverse of i is -i, because i*(-i)=1. Similarly, the inverse of -i is i, because (-i)*i=1.

Remark: Chat-GPT answers this question perfectly and gives an excellent explanation. To be fair, this is an easier question.

Question 2

Me: How many homomorphisms are there of Z into Z_2? Please write in latex format.

Chat-GPT: There are two homomorphisms of \mathbb{Z} into \mathbb{Z}_2. They are defined as follows:

f_1:\mathbb{Z}\rightarrow\mathbb{Z}_2 such that f_1(n)=n\pmod{2} for all n\in\mathbb{Z}

f_2:\mathbb{Z}\rightarrow\mathbb{Z}_2 such that f_2(n)=0 for all n\in\mathbb{Z}

Remark: Again, Chat-GPT is spot on. However, I asked this question a few times and it once said one, and another time said two but gave a false justification:

There are only two homomorphisms of Z (the integers) into Z2 (the integers modulo 2).

One homomorphism sends every integer to 0, and the other homomorphism sends every integer to 1.

Question 3

Me: Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the index of <p> in S_5. Please write your answer in latex.

Chat-GPT:

The index of <p> in S_5 is the number of elements in S_5 that come before <p> in lexicographic order.

To find the index of <p>, we must first determine the cycle type of p. The cycle type of a permutation is a list of the lengths of its disjoint cycles, listed in decreasing order.

The cycle type of p is [2,2], since p consists of a 2-cycle and a 2-cycle.

Now we can use the cycle index of S_5 to determine the number of permutations in S_5 with cycle type [2,2].

The cycle index of S_5 is given by:

Z(S_5) = \frac{1}{5!} \sum_{i=1}^{5} a_i p_i^{5}

where a_i is the number of permutations in S_5 with i cycles, and p_i is a variable representing a cycle of length i.

Substituting the cycle type [2,2] into the cycle index gives:

\frac{1}{5!} (p_2^2 + p_2^2)

This simplifies to:

\frac{1}{120} (p_2^2 + p_2^2) = \frac{1}{60} (2p_2^2) = \frac{1}{30} p_2^2

The coefficient of p_2^2 in this expression is \frac{1}{30}, so the number of permutations in S_5 with cycle type [2,2] is \frac{1}{30}.

Since <p> is the only permutation in S_5 with cycle type [2,2], it is the \frac{1}{30}-th permutation in S_5.

Therefore, the index of <p> in S_5 is \boxed{\frac{1}{30}}.

Remark: Whoa! This is very wrong! The very first sentence is completely wrong. Also, its final conclusion, 1/30, is not even an integer, which is certainly not right. Indeed, the index of a subgroup counts something and must be a positive integer.

We are able to salvage something, though. Suppose we computed, using our algebra skills, that the order of <p> is 5. Then we can prompt Chat-GPT with the following.

Me: Now suppose H \leq S_5 is a subgroup of order 5. What is the index of H in S_5?

Chat-GPT: The index of a subgroup H in a group G is the number of left cosets of H in G. This is also equal to the number of right cosets of H in G.

Since the order of a subgroup divides the order of the group, the order of H must divide the order of S_5, which is 120. The only divisors of 120 that are less than 120 are 1, 2, 3, 4, 5, 6, 8, 10, 12, 15, 20, 24, 30, 40, and 120. Therefore, the only possible orders for H are 1, 2, 3, 4, 5, 6, 8, 10, 12, 15, 20, 24, 30, 40, and 120.

Since the order of H is 5, the index of H in S_5 is \frac{120}{5} = \boxed{24}.

Remark: Here Chat-GPT answers perfectly with a great explanation. Perhaps it would be nice to mention that the main fact being used is Lagrange’s theorem, though this can easily be obtained by asking it “What theorem did you use above?”
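
As an aside, the claimed order can be double-checked with a few lines of Python (a quick script of my own, using the usual right-to-left convention for composing the cycles):

from math import factorial

def cycle_to_map(cycle, n=5):
    # The permutation of {1,...,n} (as a dict) given by a single cycle.
    m = {i: i for i in range(1, n + 1)}
    for a, b in zip(cycle, cycle[1:] + cycle[:1]):
        m[a] = b
    return m

def compose(f, g):
    # (f o g)(x) = f(g(x)).
    return {x: f[g[x]] for x in g}

# p = (1 2 5 4)(2 3), applying (2 3) first.
p = compose(cycle_to_map((1, 2, 5, 4)), cycle_to_map((2, 3)))

q, order = dict(p), 1
while any(q[x] != x for x in q):
    q, order = compose(p, q), order + 1

print(order)                   # 5, the order of <p>
print(factorial(5) // order)   # 24, the index of <p> in S_5 by Lagrange's theorem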

But what’s the intuition?

What makes a research mathematician great is a certain “intuition” for problems. Let’s see if chat-GPT can explain any of that.

Me: Can you explain the intuition behind the Cauchy-Schwarz inequality? Please write any math formula in latex.

Chat-GPT: Sure! The Cauchy-Schwarz inequality is a useful inequality that arises in various areas of mathematics, such as vector algebra, inner product spaces, and probability theory. It states that for any two vectors x and y, the following inequality holds:

\left(\sum_{i=1}^n x_i y_i\right)^2 \le \left(\sum_{i=1}^n x_i^2\right)\left(\sum_{i=1}^n y_i^2\right)

Intuitively, the Cauchy-Schwarz inequality states that the inner product of two vectors x and y is “controlled” by the lengths of the vectors x and y. To see this, consider the special case where x and y are unit vectors (i.e., vectors of length 1). In this case, the inequality becomes:

\left(\sum_{i=1}^n x_i y_i\right)^2 \le \left(\sum_{i=1}^n x_i^2\right)\left(\sum_{i=1}^n y_i^2\right) = 1 \cdot 1 = 1

which means that the inner product of x and y is always less than or equal to 1. This is consistent with our intuition that the inner product of two unit vectors should be “small” if they are “orthogonal” (i.e., perpendicular) and “large” if they are “parallel” (i.e., pointing in the same direction).

I hope this helps! Let me know if you have any questions.

Remark: This is not too bad, but could certainly be improved. For instance, explaining when equality holds in Cauchy-Schwarz (when one vector is a multiple of the other) helps give a more precise meaning to the last paragraph. One can read the first few paragraphs of this blog post of Terence Tao to see an explanation from one of the world’s leading mathematicians. Also, here is a math stack exchange thread on the topic.

Prove it.

Me: Can you prove it?

Chat-GPT: Sure! Here is a proof of the Cauchy-Schwarz inequality:

First, we can rewrite the inequality as:

\left(\sum_{i=1}^n x_i y_i\right)^2 - \left(\sum_{i=1}^n x_i^2\right)\left(\sum_{i=1}^n y_i^2\right) \le 0

Next, we can expand the left-hand side as follows:

\sum_{i=1}^n x_i^2 y_i^2 + \sum_{i \ne j} x_i x_j y_i y_j - \sum_{i=1}^n x_i^2 y_i^2 - \sum_{i \ne j} x_i^2 y_j^2 \le 0

This simplifies to:

\sum_{i \ne j} (x_i y_i)(x_j y_j) - \sum_{i \ne j} (x_i y_j)^2 \le 0

Now, we can apply the arithmetic-geometric mean inequality to each term on the right-hand side:

\sum_{i \ne j} (x_i y_i)(x_j y_j) - \sum_{i \ne j} [(x_i y_j)^2]^{\frac{1}{2}}[(x_i y_j)^2]^{\frac{1}{2}} \le 0

This simplifies to:

\sum_{i \ne j} (x_i y_i)(x_j y_j) - \left(\sum_{i \ne j} [(x_i y_j)^2]^{\frac{1}{2}}\right)^2 \le 0

Finally, we can apply the arithmetic-geometric mean inequality again to the term on the right-hand side:

\sum_{i \ne j} (x_i y_i)(x_j y_j) - \left[\prod_{i \ne j} [(x_i y_j)^2]^{\frac{1}{2}}\right]^{\frac{2}{n-1}} \le 0

This simplifies to:

\sum_{i \ne j} (x_i y_i)(x_j y_j) - \prod_{i \ne j} (x_i y_j) \le 0

And we are done! This completes the proof of the Cauchy-Schwarz inequality.

I hope this helps! Let me know if you have any questions.

Remark: Unfortunately, this proof is not quite correct. Both the proof method and the details (i.e. its application of AM-GM) are severely flawed.

Conclusion

To be honest, Chat-GPT is quite impressive for a model that is not tailored for mathematics in any way. I think it could already, in its current format, help a mathematician prepare lecture notes or even the easier parts of a paper. Perhaps its greatest utility would come in automating certain parts of grants and job applications.