AUC and Cross-entropy

In this post we will see that Log loss and AUC for binary classification problems are closely related.

Suppose we have a set of probabilities that are meant to predict whether a true label is 0 or 1. That is, let {\widehat{y_k} \in [0,1]} be the predicted probability and {y_k \in \{0,1\}} the true label for the {k}-th example, {k = 1, \ldots , m}.

Recall that the AUC is the area under the ROC curve. We will use another interpretation, which is derived on the wiki page above: the AUC is the probability that a randomly chosen positive example is ranked higher (i.e. receives a larger predicted probability) than a randomly chosen negative example.

Note that the larger the AUC, the better the predictions. To compare AUC with log loss, we will convert AUC into a loss function by studying {1-{\rm AUC}}.

Let {P} be the number of positive examples and {N} be the number of negative examples. Thus,

\displaystyle 1 - {\rm AUC} = \frac{1}{PN}\sum_{p : y_p = 1} \sum_{n : y_n = 0} f_{AUC}(\widehat{y_p},\widehat{y_n}),

where {f_{AUC}(x,y) = 1} if {y>x} and {0} otherwise.
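Here is a minimal numpy sketch of this formula on some toy data (the data, the variable names, and the scikit-learn cross-check are only illustrative choices), computing {1-{\rm AUC}} by the pairwise double sum and comparing it with roc_auc_score:

```python
import numpy as np
from sklearn.metrics import roc_auc_score  # used only as a cross-check

# Toy data: y are the true labels, y_hat the predicted probabilities.
rng = np.random.default_rng(0)
y = np.array([1, 1, 1, 0, 0, 0, 0, 1, 0, 1])
scores = rng.normal(loc=2.0 * y - 1.0, scale=1.0)  # higher scores for positives, on average
y_hat = 1.0 / (1.0 + np.exp(-scores))              # squash the scores into (0, 1)

pos = y_hat[y == 1]  # predicted probabilities of the P positive examples
neg = y_hat[y == 0]  # predicted probabilities of the N negative examples

# 1 - AUC as the fraction of (positive, negative) pairs ranked the wrong way
# round, i.e. f_AUC(y_hat_p, y_hat_n) = 1 exactly when y_hat_n > y_hat_p.
# Ties are ignored, which is harmless here since the scores are distinct.
one_minus_auc = (pos[:, None] < neg[None, :]).mean()

print(one_minus_auc, 1 - roc_auc_score(y, y_hat))  # the two numbers agree
```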

We introduce another quantity:

\displaystyle I = \frac{1}{PN} \sum_{p : y_p = 1} \sum_{n : y_n = 0} \log(\widehat{y_p}) \cdot \log(1-\widehat{y_n}).

This quantity has a similar shape to the first quantity, except that {f_{AUC}(x,y)} is replaced by {\log(x) \cdot \log(1-y)}. The latter is a smooth surrogate for the former that penalizes confidently wrong predictions much more harshly.
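For instance, a correctly ordered, confident pair {(\widehat{y_p}, \widehat{y_n}) = (0.9, 0.1)} contributes {\log(0.9)\cdot\log(0.9) \approx 0.011}, while the confidently wrong pair {(0.1, 0.9)} contributes {\log(0.1)\cdot\log(0.1) \approx 5.3}.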

The double sum defining {I} factors, and is equal to

\displaystyle I = \left( \frac{1}{P} \sum_{p : y_p = 1} \log(\widehat{y_p}) \right) \left( \frac{1}{N} \sum_{n : y_n = 0} \log(1-\widehat{y_n}) \right).
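Continuing the toy snippet above (and reusing its pos and neg arrays), we can check this factorization numerically:

```python
# The double sum over all (positive, negative) pairs of log(y_hat_p) * log(1 - y_hat_n),
# versus its factored form as a product of two per-class averages.
log_pos = np.log(pos)                 # log(y_hat_p) for each positive example
log_one_minus_neg = np.log(1 - neg)   # log(1 - y_hat_n) for each negative example

I_double_sum = (log_pos[:, None] * log_one_minus_neg[None, :]).mean()
I_factored = log_pos.mean() * log_one_minus_neg.mean()

print(I_double_sum, I_factored)  # equal up to floating-point rounding
```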

Let us compare {I} to the log loss (the cross-entropy):

\displaystyle {\rm LogLoss} = -\frac{1}{P + N} \left( \sum_{p : y_p = 1} \log(\widehat{y_p}) + \sum_{n : y_n = 0} \log(1-\widehat{y_n}) \right).
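In code, again reusing the arrays from the earlier snippet, this formula can be computed directly and cross-checked against scikit-learn's log_loss:

```python
from sklearn.metrics import log_loss  # used only as a cross-check

# The log loss exactly as written above (note the overall minus sign),
# computed from the same toy y and y_hat as before.
P, N = pos.size, neg.size
manual_log_loss = -(np.log(pos).sum() + np.log(1 - neg).sum()) / (P + N)

print(manual_log_loss, log_loss(y, y_hat))  # the two numbers agree
```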

The quantity {I} and the log loss are connected by the AM-GM inequality. For simplicity, let us suppose {P = N}, and write {a = -\frac{1}{P} \sum_{p : y_p = 1} \log(\widehat{y_p})} and {b = -\frac{1}{N} \sum_{n : y_n = 0} \log(1-\widehat{y_n})}, both of which are nonnegative. Then {I = ab} and {{\rm LogLoss} = \frac{a+b}{2}}, so the AM-GM inequality {\sqrt{ab} \leq \frac{a+b}{2}} implies

\displaystyle  \sqrt{I} \leq {\rm LogLoss}.
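On the toy data above (which happens to have {P = N}), this step can be checked numerically, continuing the earlier snippets:

```python
# A numerical check of the AM-GM step on the same toy data (P == N here).
a = -np.log(pos).mean()       # average log loss over the positive examples
b = -np.log(1 - neg).mean()   # average log loss over the negative examples

print(np.isclose(a * b, I_factored))             # I = a * b
print(np.isclose((a + b) / 2, manual_log_loss))  # LogLoss = (a + b) / 2 when P == N
print(np.sqrt(a * b) <= (a + b) / 2)             # the AM-GM inequality, always True
```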

As {I} behaves like {1 - {\rm AUC}}, this says that {\sqrt{1-{\rm AUC}}} is roughly a lower bound on the log loss. Let us make some assumptions:

  • Balanced data: {P \approx N}.
  • The predictions are rarely confidently wrong, i.e. {\widehat{y_p}} is rarely close to zero and {\widehat{y_n}} is rarely close to one.
  • The errors are balanced across the two classes, that is, the average log loss over the positive examples is close to the average log loss over the negative examples (in the notation above, {a \approx b}).

Then following through the above argument, we find that

\displaystyle \sqrt{1 - {\rm AUC}} \approx {\rm LogLoss}.

The main new ingredient beyond the inequality above is that equality in the AM-GM inequality is achieved exactly when the two terms {a} and {b} are equal, which is what the third assumption provides.
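As a rough illustration (a sketch, not a proof), here is a larger simulation under the three assumptions: balanced labels, class-conditional Gaussian scores, and the calibrated sigmoid posterior for that particular score model. The two printed quantities come out on the same order of magnitude rather than exactly equal, which is all the heuristic promises.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, log_loss

# Balanced labels, a model that is rarely confidently wrong, and symmetric
# behaviour on the two classes -- the three assumptions above.
rng = np.random.default_rng(1)
m = 200_000
y = rng.integers(0, 2, size=m)                 # balanced 0/1 labels
s = rng.normal(loc=2.0 * y - 1.0, scale=1.0)   # class-conditional Gaussian scores
y_hat = 1.0 / (1.0 + np.exp(-2.0 * s))         # calibrated posterior for this score model

auc = roc_auc_score(y, y_hat)
ll = log_loss(y, y_hat)
print(np.sqrt(1.0 - auc), ll)  # same order of magnitude (roughly 0.28 vs 0.36 here)
```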

Thus, we learn a couple of things:

  • We expect that {\sqrt{1 - {\rm AUC}} \approx {\rm LogLoss}}.
  • If this is not the case, then we can infer that the assumptions above are not met in our particular use-case (the sketch below illustrates one such failure).
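Continuing the simulation above, here is a sketch of that kind of failure: the second assumption is violated by making a small fraction of the positive predictions confidently wrong.

```python
# Reusing y, y_hat and rng from the simulation above: corrupt 5% of the
# positive predictions so that they are confidently wrong.
y_hat_bad = y_hat.copy()
wrong = rng.choice(np.flatnonzero(y == 1), size=5_000, replace=False)
y_hat_bad[wrong] = 1e-8   # near-zero probability on some true positives

# The ranking (and hence the AUC) only degrades modestly, but the log loss is
# inflated by the near-zero probabilities, so the two quantities drift apart.
print(np.sqrt(1.0 - roc_auc_score(y, y_hat_bad)), log_loss(y, y_hat_bad))
```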
