Yaroslav Bulatov

Feb 16

Gradient descent under harmonic eigenvalue decay

A better formatted version is on Ghost (due to LaTeX support). Consider using gradient descent to minimize a quadratic objective with Hessian $H$:

$$f(w) = \frac{1}{2}(w - w_*)^T H (w - w_*)$$

We can specialize by letting $H$ have $i$'th eigenvalue proportional to $\frac{1}{i}$. This decay was observed in some convolutional network problems and…

Optimization

4 min read
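
To make the setup concrete, here is a minimal sketch (my own, not code from the article): plain gradient descent on a quadratic whose Hessian is diagonal with eigenvalues 1/i. The dimension, step size, starting point, and step count are arbitrary illustrative choices.

```python
# Minimal sketch (not from the article): gradient descent on a quadratic
# whose Hessian has i-th eigenvalue 1/i (harmonic decay). Dimension, step
# size, starting point, and step count are arbitrary illustrative choices.
import jax.numpy as jnp

d = 1000
eigs = 1.0 / jnp.arange(1, d + 1)      # eigenvalues 1, 1/2, 1/3, ...
w_star = jnp.zeros(d)                  # minimizer
w = jnp.ones(d)                        # arbitrary starting point

def loss(w):
    # f(w) = 1/2 (w - w*)^T H (w - w*) with H = diag(eigs)
    return 0.5 * jnp.sum(eigs * (w - w_star) ** 2)

lr = 1.0 / eigs[0]                     # 1 / largest eigenvalue
for step in range(1001):
    if step % 250 == 0:
        print(step, float(loss(w)))
    w = w - lr * eigs * (w - w_star)   # gradient is H (w - w*)
```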

Jan 30

Critical batch-size and effective dimension in Ordinary Least Squares

Note: a better formatted version (due to lack of LaTeX support on Medium) is here. Why do we get diminishing returns with larger batch sizes? As you increase the mini-batch size, the estimate of the gradient gets more accurate. …

Optimization

5 min read
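
As a rough numerical illustration of that diminishing-returns effect (a sketch on synthetic data, not the article's analysis): compare mini-batch gradient estimates of a least-squares loss against the full-batch gradient as the batch size grows. The data, evaluation point, and batch sizes below are arbitrary.

```python
# Sketch on synthetic data (not the article's analysis): error of the
# mini-batch OLS gradient vs. batch size. The error shrinks roughly like
# 1/sqrt(b), so each doubling of the batch eventually buys very little.
import jax
import jax.numpy as jnp

n, d = 10_000, 20
kx, kw, ky = jax.random.split(jax.random.PRNGKey(0), 3)
X = jax.random.normal(kx, (n, d))
y = X @ jax.random.normal(kw, (d,)) + 0.1 * jax.random.normal(ky, (n,))
w = jnp.zeros(d)                                  # evaluate gradients here

full_grad = X.T @ (X @ w - y) / n                 # full-batch gradient

for b in [1, 4, 16, 64, 256, 1024]:
    errs = []
    for trial in range(50):
        key = jax.random.PRNGKey(1000 + trial)
        idx = jax.random.choice(key, n, (b,), replace=False)
        g = X[idx].T @ (X[idx] @ w - y[idx]) / b  # mini-batch gradient
        errs.append(jnp.linalg.norm(g - full_grad))
    print(b, float(jnp.mean(jnp.array(errs))))
```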

Dec 27, 2021

optimal learning rate for Gradient Descent on a high-dimensional quadratic

A better formatted version of this article is on Ghost, which has proper LaTeX support: https://machine-learning-etc.ghost.io/optimal-learning-rate-for-high-dimensional-quadratic/ Suppose our problem is to minimize a quadratic y = cx² using gradient descent. Gradient descent has a learning rate parameter α, also known as the step size; what value should we use? For the problem…

Machine Learning

2 min read
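
For the one-dimensional warm-up y = cx², the gradient step is x ← x − α·2cx = (1 − 2αc)x, so the iterates converge exactly when 0 < α < 1/c, and α = 1/(2c) reaches the minimum in a single step. A tiny numerical check (my own sketch; the value of c is arbitrary):

```python
# Sketch (not from the article): gradient descent on y = c*x^2 for several
# step sizes. alpha = 1/(2c) converges in one step; alpha >= 1/c diverges.
c = 3.0

def run(alpha, x=1.0, steps=10):
    for _ in range(steps):
        x = x - alpha * 2 * c * x      # x <- (1 - 2*alpha*c) * x
    return x

for alpha in [0.05, 1 / (2 * c), 0.9 / c, 1.1 / c]:
    print(f"alpha = {alpha:.3f}, x after 10 steps = {run(alpha):.3e}")
```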

Dec 15, 2021

How many matmuls are needed to compute Hessian-vector products?

Suppose you have a simple composition of d dense functions. Computing the Jacobian needs d matrix multiplications. What about computing a Hessian-vector product? You can calculate it manually by differentiating the function composition twice, grouping shared work into temporary messages, and then counting the number of matrix multiplications. One trick is…

2 min read
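
One common way to get the product without materializing the Hessian (a generic autodiff sketch, not the article's matmul-counting argument) is forward-over-reverse differentiation: take the JVP of the gradient. The toy layer sizes and readout below are arbitrary.

```python
# Generic sketch (not the article's counting argument): Hessian-vector
# product via forward-over-reverse autodiff, for a small composition of
# dense layers. Layer sizes and the toy readout are arbitrary choices.
import jax
import jax.numpy as jnp

def f(w, x):
    # composition of d dense functions (here d = 3 tanh layers)
    for W in w:
        x = jnp.tanh(W @ x)
    return jnp.sum(x ** 2)             # scalar output so a Hessian exists

def hvp(w, x, v):
    # H v = directional derivative of grad f along v (JVP of the gradient)
    return jax.jvp(lambda w_: jax.grad(f)(w_, x), (w,), (v,))[1]

keys = jax.random.split(jax.random.PRNGKey(0), 4)
w = [jax.random.normal(k, (5, 5)) for k in keys[:3]]
x = jax.random.normal(keys[3], (5,))
v = [jnp.ones((5, 5)) for _ in w]      # direction to multiply by

print([t.shape for t in hvp(w, x, v)])
```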

Jul 9, 2021

How to do matrix derivatives

Suppose you have the following scalar function of a matrix variable W. What’s the derivative with respect to the matrix W? Define the “matrix derivative” Df as “the thing that you subtract from your variable to go in the steepest descent direction”. I.e., your gradient descent update would use Df as follows:

Neural Networks

3 min read
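
Given that definition, the update subtracts a multiple of Df from W. As a sanity check on a made-up example (the article's function isn't shown in this preview): for f(W) = tr(WᵀAW) the matrix derivative is (A + Aᵀ)W, and autodiff returns the same array with the same shape as W.

```python
# Sanity check on a made-up example (the article's function is not shown in
# the preview): for f(W) = tr(W^T A W), Df = (A + A^T) W, which has the
# same shape as W and can be subtracted from it directly.
import jax
import jax.numpy as jnp

kA, kW = jax.random.split(jax.random.PRNGKey(0))
A = jax.random.normal(kA, (4, 4))
W = jax.random.normal(kW, (4, 3))

def f(W):
    return jnp.trace(W.T @ A @ W)

Df_closed_form = (A + A.T) @ W
Df_autodiff = jax.grad(f)(W)
print(jnp.allclose(Df_closed_form, Df_autodiff, atol=1e-5))

W_new = W - 0.1 * Df_autodiff          # steepest-descent style update
```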

Aug 30, 2019

Using “Evolved Notation” to derive the Hessian of cross-entropy loss

I was recently reminded of a lesson learned at Stephen Boyd’s Convex Optimization class at Stanford a few years ago, back when Google was paying for random classes taken by employees. The lesson is that you should rely on context to drop redundant information from mathematical notation. It may feel…

Machine Learning

3 min read
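
For reference, the standard result the title points at (stated here from the usual derivation, not necessarily in the post's notation): with p = softmax(z), the Hessian of the loss −log p_y with respect to the logits z is diag(p) − p pᵀ. A quick numerical check:

```python
# Quick check of the standard result (not necessarily the post's notation):
# for loss(z) = -log softmax(z)[y], the Hessian w.r.t. the logits z is
# diag(p) - p p^T with p = softmax(z).
import jax
import jax.numpy as jnp

z = jnp.array([1.0, -0.5, 2.0, 0.3])   # arbitrary logits
y = 2                                   # arbitrary target class

def loss(z):
    return -jax.nn.log_softmax(z)[y]

p = jax.nn.softmax(z)
H_closed_form = jnp.diag(p) - jnp.outer(p, p)
H_autodiff = jax.hessian(loss)(z)
print(jnp.allclose(H_closed_form, H_autodiff, atol=1e-5))
```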

Jul 21, 2019

Large-scale AI and sharing of models

Background: In “AI and Compute”, OpenAI reported that the cost of training AI models has been growing exponentially, with a doubling period of 3.5 months. At the current rate, in 4 years training the largest model will cost more than launching a rocket into orbit. If the trend continues, it would…

Machine Learning

3 min read

Jun 25, 2019

ICLR Optimization papers III

Self-Tuning Networks: Bilevel Optimization of Hyperparameters using Structured Best-Response Functions (part I, part II). Matthew MacKay, Paul Vicol, Jon Lorraine, David Duvenaud, Roger Grosse. https://arxiv.org/abs/1903.03088 One approach to hyper-parameter choice is to apply gradient descent in the hyper-parameter space. For each setting of hyper-parameters, you run your optimization to convergence, get the resulting loss, and then backprop through these steps to…

Machine Learning

4 min read
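
A stripped-down version of that "backprop through the inner optimization" idea (my sketch, unrolling a fixed number of inner steps rather than the paper's best-response approach): differentiate a validation loss through unrolled ridge-regression training steps to get a gradient with respect to the penalty λ. All sizes and constants below are arbitrary.

```python
# Sketch of gradient descent in hyper-parameter space (not the paper's
# self-tuning networks): unroll a few inner GD steps of ridge regression,
# then differentiate the validation loss w.r.t. the penalty lam.
import jax
import jax.numpy as jnp

kx, kv, kw = jax.random.split(jax.random.PRNGKey(0), 3)
Xtr, Xval = jax.random.normal(kx, (100, 10)), jax.random.normal(kv, (50, 10))
w_true = jax.random.normal(kw, (10,))
ytr, yval = Xtr @ w_true, Xval @ w_true

def val_loss(lam, inner_steps=50, inner_lr=0.01):
    w = jnp.zeros(10)
    for _ in range(inner_steps):                  # unrolled inner training
        grad = Xtr.T @ (Xtr @ w - ytr) / 100 + lam * w
        w = w - inner_lr * grad
    return jnp.mean((Xval @ w - yval) ** 2)

lam = 0.5
hypergrad = jax.grad(val_loss)(lam)               # backprop through the unroll
lam_new = lam - 0.1 * hypergrad                   # one outer step on lam
print(float(hypergrad), float(lam_new), float(val_loss(lam_new)))
```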

Jun 11, 2019

ICLR Optimization papers II

(part I, part III) Critical Learning Periods in Deep Neural Networks. https://arxiv.org/abs/1711.08856 They observe a “critical period” during which interfering with learning can have a large effect. Blur the images for a couple of epochs and see what happens. The largest effect happened if the blur was introduced several epochs after training started. This suggests a “critical period”…

Machine Learning

5 min read

May 24, 2019

ICLR optimization papers I: Fluctuation-Dissipation relations for SGD

(part II, part III) In this series of posts I will talk about optimization papers that caught my eye at ICLR 2019. The first post in the series is an overview of “Fluctuation-dissipation relations for stochastic gradient descent” by Sho Yaida. This paper uses beautifully simple math in order to…

Machine Learning

5 min read
