
Robust learning rate finder with Kalman smoothing

Kalman smoothing can be applied to the learning rate range test to produce smooth learning rate curves from which a learning rate can be chosen. Some example runs:

Figure: a handful of LR curves for various (dataset, batch size) combinations; datasets vary left to right, and batch size increases going down. The offset of each smoothed curve is only approximate, which is fine, since it doesn't affect the choice of learning rate.

I recently needed an automated way to choose a reasonable learning rate for a large number of (model, dataset, batch size) combinations. Over on the fast.ai forums, a few people had success using the learning rate range test originally proposed by Leslie Smith in Cyclical Learning Rates for Training Neural Networks. Sylvain Gugger and Jeremy Howard modified the ideas slightly when implementing it in fast.ai (v2 API: LRFinder). Sylvain wrote a blog post about the implementation.

By plotting and inspecting the learning curve output by the range test, you can roughly determine a good candidate learning rate; however, your eyes and visual cortex are doing a lot of work here that is difficult to automate. A naive algorithm is easily tripped up by the noisy nature of the learning curve, or by the unstable behaviour of the model as the learning rate gets too high. I have had some success addressing this, and have an implementation that is robust enough for my needs.

Assume that there is some true delta-loss associated with each time step, that this hidden variable evolves over time like a random walk, and that the per-step losses recorded during a range test are noisy measurements of it. Under this model, Kalman smoothing can be applied to the learning rate range test. To make the smoothing effective, it also helps to assume that the measurement noise is far larger than the process noise. Once the learning rate curve is smoothed, the actual choice of learning rate is very simple: just choose the learning rate at which the smoothed loss experiences its steepest drop.
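Concretely, the model is a one-dimensional linear-Gaussian state space: the hidden delta-loss \( \Delta_t \) follows \( \Delta_t = \Delta_{t-1} + w_t \) with \( w_t \sim \mathcal{N}(0, Q) \), and each observed per-step change in loss is \( z_t = \Delta_t + v_t \) with \( v_t \sim \mathcal{N}(0, R) \), where \( R \gg Q \). Below is a minimal sketch of this idea in NumPy: a scalar Kalman filter followed by a Rauch-Tung-Striebel backward pass, plus a picker that returns the learning rate at the most negative smoothed delta. This is not the implementation described above, and the variance values are illustrative assumptions.

```python
import numpy as np

def kalman_smooth(z, process_var=1e-6, measurement_var=1.0):
    """Smooth a 1-D series under a random-walk-plus-noise model.

    Forward Kalman filter, then a Rauch-Tung-Striebel backward pass.
    measurement_var >> process_var encodes the assumption that the
    observations are far noisier than the underlying delta-loss.
    """
    n = len(z)
    x_f = np.empty(n)  # filtered means
    p_f = np.empty(n)  # filtered variances
    x, p = z[0], measurement_var
    x_f[0], p_f[0] = x, p
    for t in range(1, n):
        p_pred = p + process_var           # predict (random walk: mean unchanged)
        k = p_pred / (p_pred + measurement_var)
        x = x + k * (z[t] - x)             # update with observation z[t]
        p = (1.0 - k) * p_pred
        x_f[t], p_f[t] = x, p
    x_s = x_f.copy()                       # backward (RTS) pass over the means
    for t in range(n - 2, -1, -1):
        c = p_f[t] / (p_f[t] + process_var)
        x_s[t] = x_f[t] + c * (x_s[t + 1] - x_f[t])
    return x_s

def choose_lr(lrs, losses):
    """Pick the learning rate with the steepest smoothed drop in loss."""
    deltas = np.diff(losses)               # noisy delta-loss observations
    smoothed = kalman_smooth(deltas)
    return lrs[1:][np.argmin(smoothed)]    # most negative smoothed delta
```

Cumulatively summing the smoothed deltas reconstructs a smooth loss curve up to an unknown constant, which is why the offset of the smoothed curves in the figure above is only approximate.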

There are a few ways things could be improved further. The main issue with the approach is that the measurement noise often doesn't look Gaussian: positive jumps in loss are more common than negative ones, and it seems reasonable that more sophisticated smoothing techniques that model this asymmetry would give better results.
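As one illustrative direction (not something from the original implementation): a crude way to cope with the heavy right tail is to clip unusually large positive deltas before smoothing, a stand-in for a proper heavy-tailed or asymmetric measurement model. The quantile threshold here is an arbitrary assumption.

```python
def clip_loss_spikes(deltas, q=0.9):
    """Winsorize the upper tail of the delta-loss observations.

    Positive loss spikes are over-represented relative to a Gaussian,
    so capping them keeps a few blow-ups from dragging the smoothed
    curve upward. The quantile q is an arbitrary, illustrative choice.
    """
    cap = np.quantile(deltas, q)
    return np.minimum(deltas, cap)

# Usage: smoothed = kalman_smooth(clip_loss_spikes(np.diff(losses)))
```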