ICLR Optimization papers II
Critical Learning Periods in Deep Neural Networks
They observe a “critical period” during which interfering with learning can have a large effect. Blur the images for a couple of epochs and see what happens. The largest effect occurred when blur was introduced several epochs after training started. This suggests a “critical period” when neurons are most vulnerable to disruption. The idea is inspired by biology, where such critical periods have been observed in cat studies.
I found this observation interesting because this effect does not seem explainable by existing optimization literature. At some point, optimization researchers will have to step away from noisy quadratic/convex objectives, and study systems with more brain-like behaviors.
Their experiments were somewhat limited, on small models/datasets where the choice of hyper-parameters has a strong effect. It would be interesting to see whether this reproduces in a large-model setting.
Measuring and regularizing networks in function space
This paper addresses a common criticism of regularization penalties: we apply them to model parameters, which are an arbitrary parameterization of the underlying function space. For instance, if our model takes length measurements as input, changing some input units from millimeters to meters will affect the quality of the solution and the ease of optimization.
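To make this concrete, here is a minimal sketch (my own illustration, not from the paper) showing that an L2 penalty on parameters is not invariant to a units change in the inputs: ridge regression on the same data expressed in millimeters vs meters yields genuinely different fitted functions, because the penalty acts on weights whose scale depends on the parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical regression problem: inputs are lengths, target is linear in them.
X_mm = rng.uniform(100, 2000, size=(100, 1))    # lengths in millimeters
y = 0.003 * X_mm[:, 0] + rng.normal(0, 0.1, 100)

X_m = X_mm / 1000.0                             # the same data, now in meters

def ridge_weights(X, y, lam=1.0):
    """Closed-form ridge regression: (X^T X + lam*I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_mm = ridge_weights(X_mm, y)
w_m = ridge_weights(X_m, y)

# The underlying function being fit is identical, but the L2 penalty "sees"
# very different weight magnitudes, so the regularized predictions differ:
mismatch = float(np.max(np.abs(X_mm @ w_mm - X_m @ w_m)))
print("prediction mismatch:", mismatch)
```

An unregularized least-squares fit would give identical predictions in both unit systems; the discrepancy comes entirely from the parameter-space penalty.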
The general solution is to apply regularization penalties that are invariant to parameterization. For instance, the natural gradient penalizes large steps in KL-divergence.
In this paper, the authors penalize large steps in network outputs, i.e., the “L2 norm in function space”. I would expect this to give results similar to a KL-divergence penalty.
Another difference from natural gradient is that their method is about as cheap as gradient descent. Instead of building a local quadratic approximation like natural gradient does, they take a regular SGD step in the direction of descent, and then take a single correction step in the direction of a smaller function distance. This has linear complexity in the number of parameters, rather than the quadratic complexity of natural gradient.
This could be characterized as a “first-order” approximation to invariant gradient descent, while natural gradient is typically formulated as a “second-order” approximation.
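The two-step procedure above can be sketched as follows. This is my own toy reconstruction under simplifying assumptions (a linear model, squared function-space distance, and a hypothetical correction step size `lam`), not the paper's exact algorithm; each step costs one gradient, so the whole update stays linear in the number of parameters.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: linear model f(w, X) = X @ w with squared loss.
X = rng.normal(size=(50, 5))
y = rng.normal(size=50)
w = rng.normal(size=5)

lr, lam = 0.1, 0.5           # lam is an assumed correction step size

def predictions(w):
    return X @ w

def loss_grad(w):
    return X.T @ (X @ w - y) / len(y)

f_old = predictions(w)

# Step 1: ordinary SGD step in the direction of descent.
w_sgd = w - lr * loss_grad(w)

# Step 2: one correction step that shrinks the function-space distance
# ||f(w_sgd) - f(w_old)||^2, pulling the update toward small changes
# in *outputs* rather than small changes in parameters.
dist_grad = X.T @ (predictions(w_sgd) - f_old) / len(y)
w_new = w_sgd - lam * dist_grad
```

After the correction, the step moves the network's outputs less than the raw SGD step did, which is the invariance-motivated behavior the penalty is after.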
On the Convergence of a Class of Adam-Type Algorithms for Non-Convex Optimization
They provide analysis and a fix for Adam, which is known to fail to converge in certain degenerate cases. The fix, AdaFom, is to compute the gradient variance using an average over a fixed-size window, rather than an exponentially decaying average.
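A minimal sketch of the two second-moment estimators, implemented literally as described above (exponential moving average vs. plain average over a fixed-size window); the window length is an assumption for illustration:

```python
from collections import deque

def adam_second_moment(grads, beta2=0.999):
    """Adam: exponentially decaying average of squared gradients."""
    v, out = 0.0, []
    for g in grads:
        v = beta2 * v + (1 - beta2) * g ** 2
        out.append(v)
    return out

def windowed_second_moment(grads, window=10):
    """Windowed fix: plain average of the last `window` squared gradients."""
    buf, out = deque(maxlen=window), []
    for g in grads:
        buf.append(g ** 2)
        out.append(sum(buf) / len(buf))
    return out
```

The windowed average never weights a recent large gradient below 1/window, which is the kind of bounded forgetting the convergence analysis relies on.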
The convergence is linked to quantities they call A, B, and C. From their convergence proofs it follows that we should track the ratios B/A and C/A; both should go to zero if training converges. These seem like useful quantities to throw into your TensorBoard and monitor in addition to the loss.
Toward Understanding the Impact of Staleness in Distributed Machine Learning
In this paper they empirically evaluate the impact that gradient staleness has on optimization, and identify several trends.
An interesting ingredient of the paper is “gradient coherence”, which measures how much gradients point in the same direction. For history m=1, gradient coherence is simply a rescaled cosine of the angle between two subsequent gradients.
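A minimal sketch of this quantity, assuming coherence over history m is the average cosine similarity between each gradient and the gradient m steps earlier (the paper's exact definition differs by a rescaling):

```python
import numpy as np

def gradient_coherence(grads, m=1):
    """Average cosine similarity between gradients m steps apart."""
    grads = [np.asarray(g, dtype=float) for g in grads]
    cos = []
    for t in range(m, len(grads)):
        a, b = grads[t], grads[t - m]
        cos.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return float(np.mean(cos))
```

A value near 1 means the optimizer keeps stepping in the same direction; near 0 means subsequent gradients are close to orthogonal.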
They observed that higher coherence meant one could tolerate more staleness and larger batch sizes.
Intriguingly, their gradient coherence graphs grow over time, whereas in regular convex optimization, the coherence converges to zero.
Here’s a simulation of a 2-D noisy quadratic with two forms of gradient coherence plotted: the angle between subsequent gradient steps (m=1), and the angle between steps 5 apart (m=5).
Eventually the gradients become less and less correlated and subsequent steps are close to orthogonal.
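The simulation can be reproduced with a short script (my own sketch; the curvature matrix, learning rate, and noise scale are arbitrary choices standing in for minibatch noise):

```python
import numpy as np

rng = np.random.default_rng(2)

# Noisy gradient descent on a 2-D quadratic f(x) = 0.5 * x^T H x.
H = np.diag([1.0, 0.2])
x = np.array([5.0, 5.0])
lr, steps = 0.1, 2000

grads = []
for _ in range(steps):
    g = H @ x + rng.normal(scale=0.5, size=2)   # noisy gradient
    grads.append(g)
    x = x - lr * g

def coherence(grads, m):
    """Cosine between gradients m steps apart, one value per step."""
    return np.array([g1 @ g2 / (np.linalg.norm(g1) * np.linalg.norm(g2))
                     for g1, g2 in zip(grads[m:], grads[:-m])])

# Near the optimum the gradient is dominated by noise, so steps are
# close to orthogonal on average and coherence hovers near zero.
late_m1 = coherence(grads[-500:], m=1).mean()
late_m5 = coherence(grads[-500:], m=5).mean()
print(late_m1, late_m5)
```

Plotting `coherence(grads, 1)` and `coherence(grads, 5)` over training reproduces the decay toward zero described above.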
In contrast, in their experiments they observe coherence increasing over time, so gradients become more correlated as you approach the solution.
This suggests that there is something interesting going on that isn’t captured by simple models of optimization.
Perhaps this is related to a “compression” effect that’s known to happen during neural network optimization. This idea is explained by Sanjeev Arora in his ICML tutorial in the context of generalization bounds. As you train your neural network, most eigenvalues go to zero, so the network becomes “compressible” — you need only a few bits to reconstruct the weights. Since most networks are not compressible, SGD must be selecting from a small subset of final networks, and the small size of this set is what allows generalization.
To see how this connects with coherence, consider a simple quadratic model — at a random initialization the eigenvalue spectrum is roughly uniform, so gradient descent may look like a random walk, implying low coherence.
However, as training proceeds, layers in the network become more compressible and the optimization landscape loses some of its effective dimensionality. This makes the optimization valley narrower, and makes it more likely for gradient steps to point in the same direction.
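This intuition can be checked on a toy quadratic (my own sketch, with arbitrary curvature and noise choices): in an isotropic bowl near the optimum, noisy gradients are nearly uncorrelated, while in a narrow valley with one dominant remaining direction, subsequent gradients stay aligned.

```python
import numpy as np

rng = np.random.default_rng(3)

def run_sgd(H, x0, lr=0.05, steps=1000, noise=0.1):
    """Noisy gradient descent on f(x) = 0.5 x^T H x; returns the gradients."""
    x = np.array(x0, dtype=float)
    grads = []
    for _ in range(steps):
        g = H @ x + rng.normal(scale=noise, size=2)
        grads.append(g)
        x -= lr * g
    return grads

def mean_coherence(grads):
    cos = [a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
           for a, b in zip(grads[1:], grads[:-1])]
    return float(np.mean(cos))

# Isotropic bowl, started near the optimum: gradients are mostly noise.
flat = run_sgd(np.diag([1.0, 1.0]), [0.1, 0.1])

# Narrow valley with one dominant remaining direction: gradients keep
# pointing down the valley, so subsequent steps stay aligned.
valley = run_sgd(np.diag([10.0, 0.01]), [1.0, 100.0])

print(mean_coherence(flat), mean_coherence(valley[100:]))
```

The flat case gives coherence near zero; the valley case stays high long after the fast direction has converged, mirroring the rising-coherence curves in the paper.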
It’s still not clear to me why coherence improves robustness to stale gradients — I would expect stale gradients pointing in a similar direction to cause training to overshoot the objective, whereas stale gradients pointing in random directions would tend to cancel out.
- gradient coherence notebook