Suppose our problem is to minimize a quadratic y=cx2 by using gradient descent. Gradient descent has a learning rate parameter α, also known as step size, what value should we use? For the problem above, you can find that α=1/c takes you directly to the minimum regardless of starting position, hence it is the optimal step size in this setting. Meanwhile the largest convergent step size, ie, the value for which gradient fails to converge, is 2/c.

In two dimensions, you can see that the location of the starting point matters, but distance to the minimum doesn’t. Hence, we can restrict ourselves to to starting on the unit circle.

λ1x2+λ2y2

Since we can converge in a single step, limit ourselves to a single section defined by the starting point and 0. This gives us a 1-dimensional quadratic, whose curvature is between λ1 and λ2 depending on the starting point. Hence the “optimal” step size is between 1/λ1 and 1/λ2, and to get the average value we take expectation over possible starting points.

For instance take λ1=1 and λ2=0.5. Optimal step size is between 1 and 2, and ≈1.261 on average. You can get a formula in terms of λ1,λ2, but it’s not pretty (see notebook in the bottom).

In higher dimensions we have the following objective:

λ1x2+λ2y2+λ3z2+…

What should we use for coefficient values λi?

Roger Grosse looked at quadratic approximation of deep learning objectives, and found λi to be proportional to 1/i for some practical problems, see Section 3.5 of their Noisy Quadratic Model paper. Additionally, eigenvalues of some random Hermitian matrices decay similar to 1/i. Since the scale of the problem doesn’t matter, normalize the problem so that largest coefficient is =1, which makes the largest convergent step size=2.

We can use numerical integration to plot the optimal step size as a function of dimensionality. Since our largest coefficient is 1, largest convergent learning rate is 2. Meanwhile the average optimal step size approaches 2 for large number of dimensions.

Mathematica notebook