\( \newcommand{\matr}[1] {\mathbf{#1}} \newcommand{\vertbar} {\rule[-1ex]{0.5pt}{2.5ex}} \newcommand{\horzbar} {\rule[.5ex]{2.5ex}{0.5pt}} \newcommand{\E} {\mathrm{E}} \)
\( \newcommand{\cat}[1] {\mathrm{#1}} \newcommand{\catobj}[1] {\operatorname{Obj}(\mathrm{#1})} \newcommand{\cathom}[1] {\operatorname{Hom}_{\cat{#1}}} \newcommand{\multiBetaReduction}[0] {\twoheadrightarrow_{\beta}} \newcommand{\betaReduction}[0] {\rightarrow_{\beta}} \newcommand{\betaEq}[0] {=_{\beta}} \newcommand{\string}[1] {\texttt{"}\mathtt{#1}\texttt{"}} \newcommand{\symbolq}[1] {\texttt{`}\mathtt{#1}\texttt{'}} \newcommand{\groupMul}[1] { \cdot_{\small{#1}}} \newcommand{\groupAdd}[1] { +_{\small{#1}}} \newcommand{\inv}[1] {#1^{-1} } \newcommand{\bm}[1] { \boldsymbol{#1} } \require{physics} \require{ams} \require{mathtools} \)
Math and science::INF ML AI

Jensen's inequality

If \( f \) is a convex (smile) function and \( X \) is a random variable then:

\[ \mathbb{E}[f(X)] \ge f(\mathbb{E}[X]) \]

If \( f \) is strictly convex and \( \mathbb{E}[f(X)] = f(\mathbb{E}[X]) \), then the random variable \( X \) is a constant.
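As a quick sanity check, the inequality can be verified numerically. This is a minimal sketch; the convex function \( f(x) = x^2 \) and the outcomes and probabilities of \( X \) are illustrative assumptions, not part of the card:

```python
# Numeric check of Jensen's inequality for a discrete random variable.
f = lambda x: x ** 2          # a convex (smile) function
xs = [1.0, 2.0, 6.0]          # outcomes of X (illustrative)
ps = [0.2, 0.5, 0.3]          # their probabilities (sum to 1)

E_X = sum(p * x for p, x in zip(ps, xs))        # E[X] = 3.0
E_fX = sum(p * f(x) for p, x in zip(ps, xs))    # E[f(X)] = 13.0

# Jensen: E[f(X)] >= f(E[X])  (here 13.0 >= 9.0)
assert E_fX >= f(E_X)
```

Because \( f \) here is strictly convex, the gap closes only when all the probability mass sits on a single outcome.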


Intuition

A few ways of visualizing Jensen's inequality.

Sense of stretching

Interpolation across \( \mathbb{E}[X] \)

Regardless of the weightings (probabilities) of \( x_1 \) and \( x_2 \), their expectation \( \mathbb{E}[X] \) lies somewhere between \( x_1 \) and \( x_2 \). Mapping \( x_1 \) and \( x_2 \) through \( f \) and taking the expectation gives \( \mathbb{E}[f(X)] \), which lies somewhere on the chord between \( f(x_1) \) and \( f(x_2) \). If instead we first calculate \( \mathbb{E}[X] \) and then pass it through \( f \) to get \( f(\mathbb{E}[X]) \), this value is less than or equal to \( \mathbb{E}[f(X)] \), because the graph of a convex \( f \) lies on or below every one of its chords.
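For two points, this picture is nothing more than the definition of convexity. Writing \( P(X = x_1) = p \) and \( P(X = x_2) = 1 - p \):

\[ \mathbb{E}[f(X)] = p f(x_1) + (1-p) f(x_2) \ge f\big(p x_1 + (1-p) x_2\big) = f(\mathbb{E}[X]) \]

The general inequality extends this from two-point mixtures to arbitrary distributions.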

A similar visualization with more points (from Mark Reid's blog post):

A similar take (from Andrew Ng's notes for CS229):

Example

Q1: Three squares

Three squares have average area \( \bar{A} = 100\ \mathrm{m}^2 \). The average of the lengths of their sides is \( \bar{l} = 10\ \mathrm{m} \). What can be said about the size of the largest of the three squares?

A1:
Let \( X \) be the side length of a square chosen uniformly at random, so each of the three lengths \( l_1, l_2, l_3 \) has probability \( \frac{1}{3} \). Then the information that we have is:

* \( \mathbb{E}[X] = 10 \)
* \( \mathbb{E}[f(X)] = 100 \), where \( f(x) = x^2 \)

\( f \) is a strictly convex function and the equality \( \mathbb{E}[f(X)] = f(\mathbb{E}[X]) \) holds, so by Jensen's inequality, \( X \) must be a constant and all three side lengths must be equal. So the area of the largest square (and of every square) is \( 100\ \mathrm{m}^2 \).
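The same conclusion can be checked through the variance, since \( \operatorname{Var}(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2 \) and zero variance forces \( X \) to be constant. A small sketch (the unequal side lengths in the second check are an illustrative assumption):

```python
# Equal side lengths: E[X] = 10 and E[X^2] = 100 exactly, variance 0.
ls = [10.0, 10.0, 10.0]              # the only possibility
E_X = sum(ls) / 3                    # E[X]
E_X2 = sum(l ** 2 for l in ls) / 3   # E[X^2] = average area

assert E_X == 10.0 and E_X2 == 100.0
assert E_X2 - E_X ** 2 == 0.0        # Var(X) = 0  ->  X is constant

# Any unequal lengths with the same mean give average area > 100:
ls2 = [8.0, 10.0, 12.0]              # illustrative unequal lengths
assert sum(ls2) / 3 == 10.0
assert sum(l ** 2 for l in ls2) / 3 > 100.0
```

This mirrors the strict-convexity condition on the card: with \( f(x) = x^2 \), equality in Jensen's inequality holds only for a constant \( X \).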
