MSE, L1, L2 and Huber Loss
...or why your intuitions about L1 and L2 as losses might be wrong, and why MSE and Huber loss are probably what you want.
Seeing as we are on Anki, let the front be some technicalities and more details be on the back.
Gradient ranges
Let \( y \) be the target and \( p \) be the model output, both vectors of length \( L \). Let \( y_i \) and \( p_i \) refer to the \( i \)-th elements of each. The symbol \( \mathcal{L} \) is used for loss. What can be said about the gradient \( \dv{\mathcal{L}}{p_i} \) of the loss with respect to one of the output elements? There is no batch dimension here, and we are concerned with per-sample loss.
- L1
\( \mathcal{L} = \sum_{i=1}^{L} |y_i - p_i| \)
The gradient \( \dv{\mathcal{L}}{p_i} \) can be one of the following values: [what?].
- MSE
\( \mathcal{L} = \frac{1}{L} \sum_{i=1}^{L} (y_i - p_i)^2 \)
Can a gradient \( \dv{\mathcal{L}}{p_i} \) take on any real value? [ yes/no ]
Does the gradient \( \dv{\mathcal{L}}{p_2} \) depend on the value of \( p_2 \)? [ yes/no ]
Does the gradient \( \dv{\mathcal{L}}{p_2} \) depend on the value of \( p_3 \)? [ yes/no ]
- L2
\( \mathcal{L} = \sqrt{\sum_{i=1}^{L} (y_i - p_i)^2} \)
Can a gradient \( \dv{\mathcal{L}}{p_i} \) take on any real value? [ yes/no ]
Does the gradient \( \dv{\mathcal{L}}{p_2} \) depend on the value of \( p_2 \)? [ yes/no ]
Does the gradient \( \dv{\mathcal{L}}{p_2} \) depend on the value of \( p_3 \)? [ yes/no ]
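These properties can also be checked numerically. Below is a minimal sketch, assuming PyTorch and arbitrary example values for \( y \) and \( p \), that prints \( \dv{\mathcal{L}}{p_i} \) for each of the three losses via autograd.

```python
import torch

# Arbitrary example target (single sample, L = 3).
y = torch.tensor([2.0, -1.0, 0.5])

def grad_of(loss_fn):
    """Return dL/dp for a fresh copy of p, so each loss is inspected independently."""
    p = torch.tensor([0.5, -1.5, 3.0], requires_grad=True)  # arbitrary model output
    loss_fn(p).backward()
    return p.grad

# L1: sum of absolute errors. Each gradient is -sign(y_i - p_i): only -1 or +1
# (and undefined, conventionally 0, exactly where y_i = p_i).
print(grad_of(lambda p: torch.sum(torch.abs(y - p))))

# MSE: mean of squared errors. Gradient is -2 (y_i - p_i) / L: any real value,
# and it depends only on p_i, not on the other output elements.
print(grad_of(lambda p: torch.mean((y - p) ** 2)))

# L2: Euclidean norm of the error. Gradient is -(y_i - p_i) / ||y - p||:
# bounded to [-1, 1], and it depends on every element through the norm.
print(grad_of(lambda p: torch.sqrt(torch.sum((y - p) ** 2))))
```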
Huber loss
Huber loss is a mix of MSE and L1.
Huber loss. Definition.
Let \( e = \begin{bmatrix} e_1 & e_2 & \dots & e_L \end{bmatrix} \) be the error vector, \( e = y - p \). The loss, as a vector before reduction, has elements given by:
\[ \mathcal{L}_i = \begin{cases} \frac{1}{2} e_i^2 & \text{if } |e_i| \le \delta \\ \delta \left( |e_i| - \frac{1}{2}\delta \right) & \text{otherwise} \end{cases} \]
The final loss can be either the sum or mean reduction of this vector.
This makes Huber loss quadratic for small errors and linear for large errors; the gradient signal is linear for small errors and constant at \( \pm\delta \) (so \( \pm 1 \) with the common default \( \delta = 1 \)) for large errors.
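Below is a minimal sketch of this piecewise definition, assuming PyTorch and an arbitrary error vector; the threshold \( \delta \) defaults to 1, matching the \( \pm 1 \) gradient clipping described above.

```python
import torch

def huber(e, delta=1.0):
    """Element-wise Huber loss of an error vector e = y - p."""
    small = e.abs() <= delta
    return torch.where(small, 0.5 * e ** 2, delta * (e.abs() - 0.5 * delta))

y = torch.tensor([0.0, 0.0, 0.0])
p = torch.tensor([0.3, -0.7, 5.0], requires_grad=True)  # arbitrary model output

loss = huber(y - p).sum()   # sum reduction of the element-wise loss vector
loss.backward()
print(p.grad)  # small errors: gradient is -(y_i - p_i); large errors: clipped to +/-delta
```

Recent PyTorch versions also provide this directly as torch.nn.HuberLoss.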
Quiz. MSE, L2 and L1.
How do the MSE, L2 and L1 gradients w.r.t. the two elements vary as \( \theta \) is varied? In other words, how do \( \dv{\mathcal{L}}{p_1} \) and \( \dv{\mathcal{L}}{p_2} \) vary with \( \theta \)?
1. Large target along first dimension

Ans ↓

2. Even larger target along first dimension

Ans ↓. The difference between 1. and 2. highlights the scale difference between MSE and L2, and how L1 isn't affected at all.

3. Small target along first dimension

Ans ↓. This is another case where the errors along each dimension are similar, and one dimension doesn't strongly affect the other.

4. Large target at 45°

Ans ↓. The two components in MSE are always 90° out of phase. This case shows the same is not true for L2.

5. Small target at 45°

Ans ↓
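The cases above can be explored numerically. The sketch below is built on an assumed setup: a fixed 2-D target \( y \) and a prediction swept around the unit circle, \( p(\theta) = (\cos\theta, \sin\theta) \); the target magnitudes are hypothetical stand-ins for whatever the original plots used.

```python
import numpy as np

def gradients(y, p):
    """Per-element gradients dL/dp for MSE, L2 and L1 in the 2-D case (L = 2)."""
    e = y - p
    mse = -2.0 * e / len(e)       # independent per element
    l2 = -e / np.linalg.norm(e)   # coupled through the norm, bounded to [-1, 1]
    l1 = -np.sign(e)              # only +/-1 (0 exactly at zero error)
    return mse, l2, l1

# Hypothetical targets for cases 1-5 above; the magnitudes are guesses.
targets = {
    "1. large target along first dim":  np.array([5.0, 0.0]),
    "2. even larger target":            np.array([50.0, 0.0]),
    "3. small target along first dim":  np.array([0.5, 0.0]),
    "4. large target at 45 degrees":    np.array([5.0, 5.0]),
    "5. small target at 45 degrees":    np.array([0.5, 0.5]),
}

thetas = np.linspace(0.0, 2.0 * np.pi, 200)
for name, y in targets.items():
    # Sweep the prediction around the unit circle and collect dL/dp_1 for each loss;
    # plotting these (and the p_2 components) against theta reproduces the curves.
    curves = np.array([[g[0] for g in gradients(y, np.array([np.cos(t), np.sin(t)]))]
                       for t in thetas])
    print(name, "dL/dp_1 ranges (MSE, L2, L1):", curves.min(axis=0), curves.max(axis=0))
```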

Error, loss and gradient. Summary.
If the target, \( y \), and model output, \( p \), are vectors of length \( L \), in a batch of size \( N \), then the error terms, the loss and the gradient for MSE, L1 and L2 are as shown below. The losses are mean reduced across the batch dimension, with \( e_{n,i} = y_{n,i} - p_{n,i} \) denoting the error of element \( i \) of sample \( n \).
- MSE
\[ \mathcal{L} = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{L} \sum_{i=1}^{L} e_{n,i}^2, \qquad \dv{\mathcal{L}}{p_{n,i}} = \frac{-2\, e_{n,i}}{N L} \]
- L1
\[ \mathcal{L} = \frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{L} |e_{n,i}|, \qquad \dv{\mathcal{L}}{p_{n,i}} = \frac{-\operatorname{sign}(e_{n,i})}{N} \]
- L2
\[ \mathcal{L} = \frac{1}{N} \sum_{n=1}^{N} \sqrt{\sum_{i=1}^{L} e_{n,i}^2}, \qquad \dv{\mathcal{L}}{p_{n,i}} = \frac{-e_{n,i}}{N \sqrt{\sum_{j=1}^{L} e_{n,j}^2}} \]
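As a check on these formulas, here is a minimal sketch assuming PyTorch and arbitrary random data; the comments give the analytic gradients that autograd should reproduce.

```python
import torch

N, L = 4, 3                        # arbitrary batch size and vector length
y = torch.randn(N, L)
p0 = torch.randn(N, L)             # arbitrary model outputs

def grad_of(per_sample_loss):
    """dL/dp for the mean over the batch of a per-sample loss."""
    p = p0.clone().requires_grad_(True)
    per_sample_loss(y, p).mean().backward()   # per-sample loss is a length-N vector
    return p.grad

g_mse = grad_of(lambda y, p: ((y - p) ** 2).mean(dim=1))        # -2 e_ni / (N L)
g_l1  = grad_of(lambda y, p: (y - p).abs().sum(dim=1))          # -sign(e_ni) / N
g_l2  = grad_of(lambda y, p: ((y - p) ** 2).sum(dim=1).sqrt())  # -e_ni / (N ||e_n||)
```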
MSE ≠ L2
Let \( e = \begin{bmatrix} e_1 & e_2 & \dots & e_L \end{bmatrix} \) be the error vector, \( e = y - p \). The first computation for both MSE and L2 is the element-wise square, \( e_i^2 \). MSE is then the mean of these squares,
\[ \mathcal{L}_{\text{MSE}} = \frac{1}{L} \sum_{i=1}^{L} e_i^2, \]
whereas L2 is the square root of their sum,
\[ \mathcal{L}_{\text{L2}} = \sqrt{\sum_{i=1}^{L} e_i^2}. \]
The L2 norm of the error vector is the value \( V \in \mathbb{R} \) that, when squared, equals the sum of element-wise squares: \( V^2 = \sum_{i=1}^{L} e_i^2 \). When \( V \) is large, a small change to it corresponds to a change of \( 2V \) in its square. This is basic derivatives: \( \dv{V}(V^2) = 2V \). So even if one of the components, \( e_1 \), of the error vector is large, making the derivative of its square large, \( \dv{e_1}(e_1^2) = 2e_1 \), the loss value \( V \) is at least as big as \( |e_1| \), and so the corresponding rate of change, \( \dv{V}{e_1} = \frac{2e_1}{2V} = \frac{e_1}{V} \), will never be more than 1 in magnitude.
If there is an error component much larger than the others, say it is \( e_1 \), it's also worth inspecting one of the small components, say \( e_2 \). Whereas \( \dv{\mathcal{L}}{e_1} \) was scaled down to roughly \( \pm 1 \) by the magnitude of \( e_1 \), \( \dv{\mathcal{L}}{e_2} = \frac{e_2}{V} \approx \frac{e_2}{|e_1|} \) is scaled down by that same magnitude, to a potentially very small value. This highlights a major difference between an L2-norm based loss and MSE: the component-wise gradients are dependent when using L2, and independent when using MSE.
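A quick numerical illustration of this dependence, assuming PyTorch and an arbitrary example where \( e_1 \) is much larger than \( e_2 \):

```python
import torch

# Arbitrary target with one dimension badly off: e_1 is large, e_2 is small.
y = torch.tensor([100.0, 1.0])

def grads(loss_fn):
    p = torch.zeros(2, requires_grad=True)   # model output at the origin
    loss_fn(y - p).backward()
    return p.grad

# MSE: each gradient depends only on its own error.
print(grads(lambda e: (e ** 2).mean()))      # -> [-100.0, -1.0]

# L2: both gradients are divided by ||e|| ~= 100, so the large one is clipped
# to about -1 and the small one is crushed to about -0.01.
print(grads(lambda e: (e ** 2).sum().sqrt()))
```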
L2 loss reduction
The loss formulations don't specify how to calculate the loss for a batch of samples. There are two options:
- Treat the batch of tensors as a single sample.
- Reduce via mean or sum over the batch dimension.
For MSE, treating the batch as a single sample leads to exactly the same calculation as reduction via mean. Why? Property of nested sums:
\[ \frac{1}{N_a}\sum_{a\in A} \left( \frac{1}{N_b} \sum_{b \in B} f(a,b) \right) = \frac{1}{N_a N_b} \sum_{(a,b) \in A \times B} f(a,b) \]
For the same reason, treating the batch as a single sample for L1 is exactly the same as reduction via sum. In either case, using sum/mean will simply differ by the constant factor of the batch size.
L2, in contrast, will result in very different calculations. If the whole batch is considered one large vector, then samples within the batch will affect each other's contribution to the loss in non-linear ways.
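The sketch below, assuming PyTorch and arbitrary random data, checks all three cases: flattening the batch matches mean reduction for MSE and sum reduction for L1, but not per-sample reduction for L2.

```python
import torch

N, L = 4, 3
y, p = torch.randn(N, L), torch.randn(N, L)
e2 = (y - p) ** 2                              # element-wise squared errors

# MSE: batch-as-one-sample == mean reduction over per-sample losses.
print(torch.allclose(e2.mean(), e2.mean(dim=1).mean()))        # True

# L1: batch-as-one-sample == sum reduction over per-sample losses.
abs_e = (y - p).abs()
print(torch.allclose(abs_e.sum(), abs_e.sum(dim=1).sum()))     # True

# L2: treating the batch as one long vector is NOT the same as reducing
# per-sample norms; the norm mixes samples non-linearly.
print(torch.allclose(e2.sum().sqrt(), e2.sum(dim=1).sqrt().sum()))  # False in general
```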
Focus on gradient landscape
I'd like to end by emphasizing gradients over loss. Backpropagation effectively never sees the loss; how massive or tiny a loss is doesn't affect how parameters are updated; the loss is thrown away once it's differentiated, and anyway, the objective function is often crafted so that the loss is more of a downstream effect of a desired gradient form. Losses like L1, L2 and MSE have behaviour that I think is considerably different from what my initial "they are all just distance metrics, with different weights to outliers" perspective would suggest: gradients from an L2 loss can be heavily scaled down if another dimension has a large error; MSE has independent gradients; L1 gradients can only take on three values. These are obvious characteristics once you focus on the gradients.