Layer norm
Can you remember the implementation for layer norm?
Layer norm
Visualization
From Brendan Bycroft's blog. The tokens have their own mean and variance computed, but they share the learned offset and scale.
Karpathy has the following implementation of Layer norm:
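The snippet below is a minimal sketch in the spirit of the LayerNorm1d module from Karpathy's makemore series (PyTorch assumed); treat it as a reconstruction rather than a verbatim copy:

```python
import torch

class LayerNorm1d:
    # Sketch in the style of Karpathy's makemore LayerNorm1d (not a verbatim copy).
    def __init__(self, dim, eps=1e-5):
        self.eps = eps
        # learned scale and offset, shared by all samples
        self.gamma = torch.ones(dim)
        self.beta = torch.zeros(dim)

    def __call__(self, x):
        # statistics are computed per sample, over the feature dimension
        xmean = x.mean(1, keepdim=True)   # per-sample mean
        xvar = x.var(1, keepdim=True)     # per-sample variance
        xhat = (x - xmean) / torch.sqrt(xvar + self.eps)  # normalize to zero mean, unit variance
        self.out = self.gamma * xhat + self.beta          # shift and scale with the learned parameters
        return self.out

    def parameters(self):
        return [self.gamma, self.beta]
```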
Batch norm
For comparison, here is batch norm:
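Again a sketch (PyTorch assumed), following the structure of the BatchNorm1d module from the same makemore series; the running-statistics bookkeeping is the key difference:

```python
import torch

class BatchNorm1d:
    # Sketch of a batch norm layer (not a verbatim copy of any particular implementation).
    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.eps = eps
        self.momentum = momentum
        self.training = True
        # learned parameters (trained with backprop)
        self.gamma = torch.ones(dim)
        self.beta = torch.zeros(dim)
        # buffers (updated with a running momentum average, not backprop)
        self.running_mean = torch.zeros(dim)
        self.running_var = torch.ones(dim)

    def __call__(self, x):
        if self.training:
            # statistics are computed per channel, over the batch dimension
            xmean = x.mean(0, keepdim=True)  # per-channel mean
            xvar = x.var(0, keepdim=True)    # per-channel variance
        else:
            # at inference, use the stored running statistics instead
            xmean = self.running_mean
            xvar = self.running_var
        xhat = (x - xmean) / torch.sqrt(xvar + self.eps)
        self.out = self.gamma * xhat + self.beta
        if self.training:
            # update the stored statistics with an exponential moving average
            with torch.no_grad():
                self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * xmean
                self.running_var = (1 - self.momentum) * self.running_var + self.momentum * xvar
        return self.out

    def parameters(self):
        return [self.gamma, self.beta]
```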
Layer norm "sample norm"
Layer norm might be better named sample norm; it works per-sample in the sense that it computes a single mean and variance over all activations of a single sample. Regardless of the dimensions of a layer's output, just two numbers (a mean and a variance) are used to shift and scale all activations of that sample. Having said that, during training there is some interaction between samples, since they all contribute gradients to the shared learned offset and scale.
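A quick check of this, assuming a hypothetical batch of 4 samples with 10 activations each: the layer norm statistics collapse to one mean and one variance per sample.

```python
import torch

x = torch.randn(4, 10)                 # 4 samples, 10 activations each
print(x.mean(1, keepdim=True).shape)   # torch.Size([4, 1]) -> one mean per sample
print(x.var(1, keepdim=True).shape)    # torch.Size([4, 1]) -> one variance per sample
```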
Batch norm
Batch norm isn't a great name either. Batch norm gives each channel its own private mean and variance, computed across the batch, with which to shift and scale the layer's activations. So maybe batch norm is better called layer norm.
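For contrast with the per-sample statistics above, the same hypothetical (4, 10) batch under batch norm yields one mean and one variance per channel.

```python
import torch

x = torch.randn(4, 10)                 # 4 samples, 10 channels
print(x.mean(0, keepdim=True).shape)   # torch.Size([1, 10]) -> one mean per channel
print(x.var(0, keepdim=True).shape)    # torch.Size([1, 10]) -> one variance per channel
```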
Batch norm also differs in its use of statistics that are gathered from the training data and stored (as running averages) for a second level of mean-variance shifting and scaling at inference time. In this way, the learnable parameters in batch norm can be seen as the residual shift and scale that remain desirable after normalizing with the stored activation statistics.
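One way to read that: at inference, the stored statistics and the learned parameters fold into a single affine transform per channel, so gamma and beta only supply whatever scale and shift remain on top of the statistics-based normalization. A sketch, reusing the running_mean/running_var buffers from the class above:

```python
import torch

def batchnorm_eval(x, running_mean, running_var, gamma, beta, eps=1e-5):
    # Inference-time batch norm written as one affine transform per channel.
    scale = gamma / torch.sqrt(running_var + eps)  # statistics-based scaling times learned scale
    shift = beta - scale * running_mean            # learned offset, corrected for the stored mean
    return x * scale + shift                       # same as gamma * (x - mean) / sqrt(var + eps) + beta
```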