Adam optimizer
Adam combines momentum and RMSProp into a single optimizer.
Momentum
The update step is not the raw gradient but an exponential moving average of the gradient. Averaging smooths out noisy gradients, so the updates keep moving quickly in the direction the gradients consistently point.
It helps to think of the parameters as multi-dimensional vectors, since the direction and magnitude of their gradients are what the optimizer manipulates.
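A tiny numeric illustration of the smoothing, with made-up gradient values (not from the course): the raw gradients flip sign, but the moving average stays pointed in the underlying (positive) direction.
mom = 0.9
noisy_grads = [1.0, -0.2, 1.1, 0.9, -0.1, 1.2]   # mostly positive, but noisy
avg = 0.0
for g in noisy_grads:
    avg = avg*mom + g*(1-mom)                    # exponential moving average
    print(f"grad={g:+.1f}  avg={avg:+.3f}")      # avg stays positive even when g flips sign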
# From Fastai course: https://github.com/fastai/course22p2/blob/master/nbs/12_accel_sgd.ipynb
# Assumes `import torch` and the `SGD` base class defined earlier in that notebook.
class Momentum(SGD):
    def __init__(self, params, lr, wd=0., mom=0.9):
        super().__init__(params, lr=lr, wd=wd)
        self.mom = mom

    def opt_step(self, p):
        # Exponential moving average of the gradient, used as the update direction.
        if not hasattr(p, 'grad_avg'): p.grad_avg = torch.zeros_like(p.grad)
        p.grad_avg = p.grad_avg*self.mom + p.grad*(1-self.mom)
        p -= self.lr * p.grad_avg
RMSProp
The update step is rescaled in magnitude so that parameters with small gradients still get sizeable updates; in effect, every parameter is pushed to take steps of roughly the same size. Because this resists the natural slowdown of update steps as gradients shrink, a learning rate schedule should be used to anneal the step size instead.
The scaling is done by maintaining an exponential moving average of the squared gradient (element-wise), an estimate of the gradient's second moment. (Why average the squares rather than the gradient magnitudes themselves? It can't be to avoid the square-root operation, since a square root is taken later anyway when scaling the gradient.) The parameter's gradient is divided by the square root of this average. Thus, if the gradient were constant for many steps, every element of the update step would have magnitude 1.0 before the learning rate is applied (ignoring eps).
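A quick numeric check of that last claim, ignoring eps (a sketch with made-up numbers, not course code):
import torch

g = torch.tensor([0.003, 2.5])   # a constant gradient: one tiny element, one large
sqr_avg = g**2                   # the moving average converges to g**2 when g never changes
step = g / sqr_avg.sqrt()        # -> tensor([1., 1.]): both elements get the same unit-sized step
print(step)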
# From Fastai course: https://github.com/fastai/course22p2/blob/master/nbs/12_accel_sgd.ipynb
class RMSProp(SGD):
    def __init__(self, params, lr, wd=0., sqr_mom=0.99, eps=1e-5):
        super().__init__(params, lr=lr, wd=wd)
        self.sqr_mom,self.eps = sqr_mom,eps

    def opt_step(self, p):
        # Exponential moving average of the squared gradient (element-wise).
        if not hasattr(p, 'sqr_avg'): p.sqr_avg = p.grad**2
        p.sqr_avg = p.sqr_avg*self.sqr_mom + p.grad**2*(1-self.sqr_mom)
        # Divide the gradient by the root of the average; eps guards against division by zero.
        p -= self.lr * p.grad / (p.sqr_avg.sqrt() + self.eps)
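Putting the two pieces together gives Adam. The following is a sketch rather than the course's implementation: it assumes the same SGD base class and torch import as above, and it adds the bias correction Adam applies because both averages start at zero. The attribute names and the per-parameter step counter are illustrative choices, not from the notebook.
class Adam(SGD):
    def __init__(self, params, lr, wd=0., mom=0.9, sqr_mom=0.99, eps=1e-5):
        super().__init__(params, lr=lr, wd=wd)
        self.mom,self.sqr_mom,self.eps = mom,sqr_mom,eps

    def opt_step(self, p):
        if not hasattr(p, 'grad_avg'):
            p.grad_avg = torch.zeros_like(p.grad)   # momentum piece: EMA of the gradient
            p.sqr_avg  = torch.zeros_like(p.grad)   # RMSProp piece: EMA of the squared gradient
            p.step_count = 0
        p.step_count += 1
        p.grad_avg = p.grad_avg*self.mom     + p.grad   *(1-self.mom)
        p.sqr_avg  = p.sqr_avg *self.sqr_mom + p.grad**2*(1-self.sqr_mom)
        # Debias: both averages start at zero, so early values underestimate the true moments.
        grad_avg = p.grad_avg / (1 - self.mom**p.step_count)
        sqr_avg  = p.sqr_avg  / (1 - self.sqr_mom**p.step_count)
        # Momentum direction, scaled by the RMSProp denominator.
        p -= self.lr * grad_avg / (sqr_avg.sqrt() + self.eps)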