ICLR Optimization papers III

Yaroslav Bulatov
Jun 25, 2019


Self-Tuning Networks: Bilevel Optimization of Hyperparameters using Structured Best-Response Functions

(part I, part II)

Matthew MacKay, Paul Vicol, Jon Lorraine, David Duvenaud, Roger Grosse


One approach to hyper-parameter choice is to apply gradient descent in hyper-parameter space. For each setting of the hyper-parameters, you run your optimization to convergence, get the resulting loss, and then backprop through these steps to get the gradient. This is expensive.

What if instead you had a cheap-to-compute function, known as the “best-response function”, that gave you the final values of the network parameters without having to run the optimization procedure?

Obviously this function would have to be very approximate (or we would just use it in place of SGD itself), but perhaps it’s a good enough approximation of gradient descent in hyper-parameter space.

Inspiration for the form of the best-response function comes from a linear regression problem, where they observe that the best response is a simple function of the weight-decay hyper-parameter.

To simplify things further, they drop the sigmoid from this form and model the best-response function as a linear function of the hyper-parameters.

Now you can fit the parameters of the best-response function to statistics observed during optimization and do gradient descent in hyper-parameter space, which yields a non-constant hyper-parameter schedule. An encouraging observation was that these schedules transferred between optimization tasks, outperforming manually tuned constant schedules.
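For intuition, here is a toy sketch of the idea, not the paper's method: in ridge regression the best response to the weight-decay hyper-parameter is available in closed form, so gradient descent in hyper-parameter space (here via a finite-difference gradient of the validation loss) never has to run the inner optimization at all.

```python
import numpy as np

rng = np.random.default_rng(0)
X, Xv = rng.normal(size=(50, 5)), rng.normal(size=(30, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.5 * rng.normal(size=50)
yv = Xv @ w_true + 0.5 * rng.normal(size=30)

def best_response(lam):
    # exact best response for ridge regression:
    # argmin_w ||Xw - y||^2 + lam * ||w||^2
    return np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

def val_loss(lam):
    return np.mean((Xv @ best_response(lam) - yv) ** 2)

# gradient descent in hyper-parameter space, skipping the inner
# optimization entirely thanks to the best-response function
lam, lr, eps = 1.0, 0.2, 1e-4
for _ in range(200):
    g = (val_loss(lam + eps) - val_loss(lam - eps)) / (2 * eps)
    lam = max(lam - lr * g, 0.0)
```

The paper's contribution is doing the analogous thing for neural networks, where no closed form exists, by fitting an approximate best-response function during training.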

The “best-response” approach feels similar in spirit to hyper-networks and the “synthetic gradient” papers: taking an expensive computation and approximating it with something that seems too simple to ever work.

I’m curious if these schedules will be useful in the large data setting, for example, by outperforming a constant learning rate schedule.

Meta-Learning Update Rules for Unsupervised Representation Learning

Luke Metz, Niru Maheswaranathan, Brian Cheung, Jascha Sohl-Dickstein



They parameterize the “learning procedure” in a way that covers SGD but also other, hitherto-unknown learning methods, and then learn the parameters of this procedure by back-propagating the loss on target tasks through the steps of the procedure.

The result is a set of update-rule parameters that lets you train a network more robustly than SGD. For instance, they show that the resulting procedure can learn to fit a network with a step-function non-linearity, which is not possible with SGD.
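The flavor of the method can be sketched with a deliberately tiny stand-in (my sketch, not the paper's setup): here the “update rule” has just two parameters, a step size and a momentum coefficient, and is meta-trained by finite differences through the inner loop. The paper's parameterization is far richer, and its meta-objective is an unsupervised-representation one rather than this toy least-squares loss.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(20, 3))
b = rng.normal(size=20)

def inner_train(theta, steps=20):
    """Train with a parameterized update rule; return the final task loss."""
    w, v = np.zeros(3), np.zeros(3)
    for _ in range(steps):
        g = A.T @ (A @ w - b) / len(b)   # gradient of 0.5 * MSE
        v = theta[1] * v + g             # the parameterized update rule
        w = w - theta[0] * v
    return np.mean((A @ w - b) ** 2)

# meta-optimize the update-rule parameters through the inner loop
theta = np.array([0.05, 0.0])            # (step size, momentum)
for _ in range(200):
    meta_grad = np.zeros(2)
    for i in range(2):
        e = np.zeros(2); e[i] = 1e-4
        meta_grad[i] = (inner_train(theta + e) - inner_train(theta - e)) / 2e-4
    # clip to a region where the inner loop stays stable
    theta = np.clip(theta - 0.05 * meta_grad, [1e-3, 0.0], [0.5, 0.9])
```

Here meta-training just rediscovers momentum SGD with a better step size; the point of the paper is that a richer parameterization can discover update rules that behave qualitatively unlike SGD.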

It feels like this procedure could discover methods similar to those proposed in the “biologically plausible” learning literature, which also tries to do away with gradients.

Local SGD Converges Fast and Communicates Little

Sebastian U. Stich


What if you averaged parameter values instead of averaging gradients between workers? This would be the same for one step, but give you something different if you let workers go for several steps between averaging.
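A quick numerical check of that observation, using a one-dimensional quartic loss per worker so the dynamics are nonlinear (with a quadratic loss the two variants coincide at every step, not just the first): averaging parameters equals averaging gradients after one step, but diverges from it after several local steps.

```python
import numpy as np

c = np.array([1.0, -3.0])      # each worker's loss: f_i(w) = (w - c_i)^4 / 4
lr = 0.01

def grad(w, ci):
    return (w - ci) ** 3

# one step: averaging gradients == averaging locally updated parameters
w0 = 0.0
step_grad_avg = w0 - lr * np.mean(grad(w0, c))
step_param_avg = np.mean(w0 - lr * grad(w0, c))

# several local steps between averaging (local SGD): no longer the same
def local_steps(w, ci, k):
    for _ in range(k):
        w = w - lr * grad(w, ci)
    return w

w_local = np.mean([local_steps(w0, ci, 5) for ci in c])
w_sync = w0
for _ in range(5):
    w_sync = w_sync - lr * np.mean(grad(w_sync, c))
```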

In this work, they prove, for the finite-sum convex setting, that this method converges at the same rate as regular SGD, with communication delays of O(1/(batch size)). This is a mathematical follow-up to a paper from the same group that evaluates the method on the CIFAR dataset.

I was interested in this paper because I was trying to find something that would let me schedule batch sizes more efficiently. A minibatch-SGD step with a batch of size B can be viewed as B workers each performing a single SGD step before communicating, so allowing a larger communication delay should also allow you to use a larger batch size.

Batch size tuning was an important part of my “Imagenet in 18 minutes” project. Convergence improved when we started with smaller batches, and grew them over time. There’s a batch-size effect beyond what can be explained by linear learning rate scaling.

Following up on this work, I discovered the paper in the next section.

Communication trade-offs for synchronized distributed SGD with large step size

Kumar Kshitij Patel, Aymeric Dieuleveut


They show that for logistic regression, you can reduce the communication frequency as you get closer to the solution, i.e., the allowable delay grows in inverse proportion to the distance to the final result.

Notation for their bound:

- C = number of communications, indexed by t
- N_t = number of no-communication SGD steps after the t-th communication
- T = total number of examples (also the number of steps, since workers use batch_size = 1)
- P = number of workers
- ν = learning rate
- μ = strong convexity parameter
- M = upper bound on the third derivative of the loss

This is in line with the empirical observation that, as optimization approaches its tail end, you can increase the global batch size/number of machines.
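As a toy illustration of such a schedule (my sketch, not the paper's algorithm): local SGD on logistic regression with batch size 1, where each worker doubles its number of local steps N_t after every communication round.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 4))
w_star = np.array([1.0, -2.0, 0.5, 0.0])
y = (rng.random(400) < 1 / (1 + np.exp(-X @ w_star))).astype(float)

def grad(w, i):                          # single-example logistic gradient
    p = 1 / (1 + np.exp(-X[i] @ w))
    return (p - y[i]) * X[i]

def log_loss(w):
    p = 1 / (1 + np.exp(-X @ w))
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def local_sgd(schedule, P=4, lr=0.1):
    """After the t-th communication, each of P workers takes schedule[t]
    local batch-size-1 steps; then parameters are averaged."""
    workers = [np.zeros(4) for _ in range(P)]
    samples = iter(rng.integers(0, 400, size=10_000))
    for n_t in schedule:
        for p in range(P):
            for _ in range(n_t):
                workers[p] = workers[p] - lr * grad(workers[p], next(samples))
        avg = np.mean(workers, axis=0)   # communicate: average parameters
        workers = [avg.copy() for _ in range(P)]
    return workers[0]

w = local_sgd(schedule=[1, 2, 4, 8, 16, 32])   # growing N_t
```

The doubling schedule is only illustrative; the paper derives how fast N_t may actually grow while preserving the convergence rate.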