Using “Evolved Notation” to derive the Hessian of cross-entropy loss

Yaroslav Bulatov
3 min read · Aug 30, 2019

I was recently reminded of a lesson learned at Stephen Boyd’s Convex Optimization class at Stanford a few years ago, back when Google was paying for random classes taken by employees.

The lesson is that you should rely on context to drop redundant information from mathematical notation. It may feel less precise, but a good choice of context makes things unambiguous while keeping derivations concise.

His concrete example was around positive definiteness. We can start by being explicit and writing something like this:
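for a symmetric matrix A, the full quantifier statement

x^\top A x > 0 \quad \text{for all } x \neq 0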

This is a bit verbose, so we could use a special operator to denote this:
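for instance, the curly inequality used for positive definiteness,

A \succ 0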

However, we can go one step further and use “Evolved Notation”:
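that is, drop the special symbol and write plainly

A > 0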

The reason we can do this is that > is an ordering operator, and in a typical setting, positive-definiteness is the canonical way of defining a partial order on the space of matrices.
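In this convention, a comparison between matrices such as A > B would simply be read as A - B \succ 0, i.e. the difference being positive definite.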

I was reminded of this when I needed to estimate the Hessian of cross-entropy loss, made the mistake of starting with “devolved” notation, and stumbled into cumbersome algebra. The “evolved” notation could instead go something like this:

Treat the logits theta as the parameter and the class x as a one-hot encoded random variable; then softmax can be written as the probability of x in exponential-family form:
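that is, something like

p(x; \theta) = \exp\big( \langle x, \theta \rangle - A(\theta) \big), \qquad A(\theta) = \log \sum_{i=1}^{k} e^{\theta_i}

where x ranges over the k one-hot vectors and A is the log-partition function (the logsumexp of the logits).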

The reason for the exponential-family form is that we can reuse known formulas for A’ and A’’ when differentiating the loss. The following is derived in many places, e.g. Wainwright’s survey:
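the standard moment identities for the log-partition function,

A'(\theta) = \nabla A(\theta) = \mathbb{E}[x], \qquad A''(\theta) = \nabla^2 A(\theta) = \mathrm{Cov}(x)

with the expectation and covariance taken under p(\cdot\,; \theta).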

With soft labels given by the empirical distribution q, we can write the cross-entropy loss as an expectation:
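roughly,

J(\theta) = \mathbb{E}_{x \sim q}\big[ -\log p(x; \theta) \big] = -\langle q, \theta \rangle + A(\theta)

using \mathbb{E}_q[x] = q for one-hot x.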

Having just one random variable and one parameter variable makes it unambiguous to drop the variable specifications from our definitions of functions, derivatives, and expectations.

Now differentiate J, apply the chain rule, and reuse the mean interpretation of A’ for the gradient. Differentiate again, and reuse the covariance interpretation of A’’ for the Hessian. You can skip most of the algebra by reasoning about what the mean and the covariance should be when the distribution consists of k one-hot vectors with probabilities p1…pk. After simplifying, we get the following for the gradient and Hessian of cross-entropy loss:
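with p = \mathrm{softmax}(\theta),

\nabla J = p - q, \qquad \nabla^2 J = \mathrm{diag}(p) - p p^\top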

For implementation, you have to come down from the tower of Evolved Notation© and fill in the details. In PyTorch we can double-check the result against the autodiff engine.

The check uses the “hessian” utility from Adam Paszke for autodiff and outer(p) for pp’ — colab.
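Here is a rough, self-contained sketch of the same check; it substitutes torch.autograd.functional.hessian for that utility, and the tensor names are illustrative.

import torch

# Sketch: compare the closed-form gradient/Hessian of cross-entropy
# w.r.t. the logits against PyTorch autodiff.
# Uses torch.autograd.functional.hessian in place of the original
# "hessian" utility; k, theta, q are illustrative names.

torch.manual_seed(0)
k = 5
theta = torch.randn(k, requires_grad=True)   # logits
q = torch.softmax(torch.randn(k), dim=0)     # soft labels (empirical distribution)

def cross_entropy(logits):
    # J(theta) = -<q, theta> + A(theta), with A(theta) = logsumexp(theta)
    return -q @ logits + torch.logsumexp(logits, dim=0)

p = torch.softmax(theta.detach(), dim=0)

grad_closed_form = p - q                              # p - q
hess_closed_form = torch.diag(p) - torch.outer(p, p)  # diag(p) - pp'

grad_autodiff, = torch.autograd.grad(cross_entropy(theta), theta)
hess_autodiff = torch.autograd.functional.hessian(cross_entropy, theta)

print(torch.allclose(grad_closed_form, grad_autodiff, atol=1e-6))  # expect True
print(torch.allclose(hess_closed_form, hess_autodiff, atol=1e-6))  # expect True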
