
Layer norm

Can you remember the implementation for layer norm?

Layer norm

Visualization

From Brendan Bycroft's blog: each token has its own mean and variance computed, but all tokens share the learned offset and scale.

Karpathy has a compact implementation of layer norm.
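A minimal from-scratch sketch of the same idea (not his verbatim code), assuming a (B, T, C) activation tensor and normalising each token over the embedding dimension:

```python
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    """Sketch of layer norm over the last (embedding) dimension of a (B, T, C) tensor."""
    def __init__(self, ndim, eps=1e-5):
        super().__init__()
        self.eps = eps
        # One learned scale and one learned offset per embedding dimension,
        # shared by every token and every batch element.
        self.weight = nn.Parameter(torch.ones(ndim))
        self.bias = nn.Parameter(torch.zeros(ndim))

    def forward(self, x):
        # Each token gets its own mean and variance, computed over C.
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        x_hat = (x - mean) / torch.sqrt(var + self.eps)
        return self.weight * x_hat + self.bias
```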

Batch norm

For comparison, here is batch norm, where I think it's common for the input shape to be (B, C, H, W), meaning that each channel gets its own mean and variance:
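A sketch under that assumption: statistics are reduced over the batch and spatial dimensions so each channel keeps its own mean and variance, and the running_mean/running_var buffers hold the stored training statistics discussed further below. (This is a sketch, not PyTorch's exact BatchNorm2d; e.g. PyTorch uses the unbiased variance for its running estimate.)

```python
import torch
import torch.nn as nn

class BatchNorm2d(nn.Module):
    """Sketch of batch norm for (B, C, H, W) inputs."""
    def __init__(self, num_channels, eps=1e-5, momentum=0.1):
        super().__init__()
        self.eps = eps
        self.momentum = momentum
        # One learned scale and offset per channel.
        self.weight = nn.Parameter(torch.ones(num_channels))
        self.bias = nn.Parameter(torch.zeros(num_channels))
        # Running estimates of the activation statistics, used at inference time.
        self.register_buffer('running_mean', torch.zeros(num_channels))
        self.register_buffer('running_var', torch.ones(num_channels))

    def forward(self, x):
        # x: (B, C, H, W). Reduce over every dimension except the channel dim.
        if self.training:
            mean = x.mean(dim=(0, 2, 3))
            var = x.var(dim=(0, 2, 3), unbiased=False)
            with torch.no_grad():
                # Exponential moving averages of the batch statistics.
                self.running_mean.lerp_(mean, self.momentum)
                self.running_var.lerp_(var, self.momentum)
        else:
            mean, var = self.running_mean, self.running_var
        mean = mean[None, :, None, None]
        var = var[None, :, None, None]
        x_hat = (x - mean) / torch.sqrt(var + self.eps)
        return self.weight[None, :, None, None] * x_hat + self.bias[None, :, None, None]
```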


The norm names refer to the dimensions over which things are averaged, but this isn't really consistent across implementations. Some layer norms don't include the token dimension in the mean/std calculation, so maybe they should be named by the dimension they don't include (making it an "inv token norm"). Batch norm gives each channel its own learned shift and scale for the layer's activations, and normalises using the mean/std computed over all the other dimensions, so maybe batch norm could be called an "inv channel norm". A small illustration of which dimensions get reduced follows below.
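Here x_btc and x_bchw are hypothetical activation tensors, used only to show the reductions:

```python
import torch

x_btc = torch.randn(2, 5, 8)      # hypothetical (B, T, C) transformer activations
x_bchw = torch.randn(2, 8, 4, 4)  # hypothetical (B, C, H, W) conv activations

# "Layer norm" as implemented above: reduce over C only, so each (batch, token)
# position keeps its own statistics -- the token dimension is excluded.
mu_ln = x_btc.mean(dim=-1, keepdim=True)          # shape (2, 5, 1)

# Batch norm: reduce over every dimension except C, so each channel keeps
# its own statistics -- the channel dimension is excluded.
mu_bn = x_bchw.mean(dim=(0, 2, 3), keepdim=True)  # shape (1, 8, 1, 1)
```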

Batch norm also differs in its use of statistics gathered from the training data and stored for a second level of mean-variance shifting and scaling. Viewed this way, the learnable parameters in batch norm can be considered the residual shift and scale that remain desirable after normalising with the stored activation statistics.
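Continuing with the BatchNorm2d sketch above, the train/eval split is where those stored statistics come into play:

```python
import torch

bn = BatchNorm2d(num_channels=8)   # the sketch class from above
x = torch.randn(16, 8, 4, 4)

bn.train()
y_train = bn(x)   # normalised with this batch's statistics; running buffers get updated

bn.eval()
y_eval = bn(x)    # normalised with the stored running statistics instead;
                  # weight/bias then apply the residual shift and scale
```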