
Entropy of an ensemble

The entropy of an ensemble, \( X = (x, A_x, P_x) \), is defined to be the average Shannon information content over all outcomes:

\[ H(X) = \sum_{x \in A_x}P(x) \log \frac{1}{P(x)} \]
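A minimal Python sketch of this sum, assuming outcomes are represented simply by their probabilities (the `entropy` helper is a hypothetical name; zero-probability terms are skipped by the usual convention):

```python
import math

def entropy(probs, base=2):
    """Shannon entropy: sum over outcomes of P(x) * log(1/P(x)).

    Zero-probability terms are skipped, following the convention
    0 * log(1/0) = 0.
    """
    return sum(p * math.log(1 / p, base) for p in probs if p > 0)

# Example: a biased coin with P(heads) = 0.9, P(tails) = 0.1.
print(entropy([0.9, 0.1]))  # ~0.469 bits
```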

Properties of entropy:

  • \( H(X) \geq 0 \) with equality iff some outcome has probability 1.
  • Entropy is maximized when the outcome probabilities are uniform.
  • Entropy is at most the logarithm of the number of outcomes.

The last two points can be expressed as:

\[H(X) \leq \log(|A_x|) \text{, with equality iff all outcomes have probability } \frac{1}{|A_x|} \]

Proof on the back side.
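As a quick numerical check of the bound before the proof (a sketch with hypothetical names, using base-2 logs so entropy is in bits):

```python
import math

def entropy(probs, base=2):
    # Shannon entropy; zero-probability outcomes contribute nothing.
    return sum(p * math.log(1 / p, base) for p in probs if p > 0)

uniform = [0.25] * 4           # |A_x| = 4, all outcomes equiprobable
skewed = [0.7, 0.1, 0.1, 0.1]  # same alphabet size, non-uniform

print(entropy(uniform))   # 2.0 bits: hits the bound
print(entropy(skewed))    # ~1.357 bits: strictly below it
print(math.log2(4))       # log|A_x| = 2.0
```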


Proof that \( H(X) \leq \log(|A_x|) \):

First note that:

\[ \begin{align*} E[\frac{1}{P(X)}] &= \sum_{i \in A_x} P(i) \frac{1}{P(i)} \\ &= \sum_{i \in A_x} 1 \\ &= |A_x| \\ \end{align*} \]

Now then:

\[\begin{align*} H(X) &= \sum_{i \in A_x} P(i) \log\frac{1}{P(i)} \\ &= \sum_{i \in A_x}P(i)f(\frac{1}{P(i)}) \\ \end{align*}\]

where \( f(u) = \log(u) \) is a concave function. Then, using Jensen's inequality for concave functions, \( E[f(U)] \leq f(E[U]) \), we can say:

\[\begin{align*} H(X) &= E[f(\frac{1}{P(X)})] \\ &\leq f(E[\frac{1}{P(X)}]) \\ &= \log(|A_x|) \end{align*}\]
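A small numerical check of both steps, first that \( E[\frac{1}{P(X)}] = |A_x| \) and then the Jensen step, for an arbitrarily chosen distribution (a sketch; names are illustrative):

```python
import math

probs = [0.5, 0.25, 0.125, 0.125]  # an arbitrary distribution, |A_x| = 4

# E[1/P(X)] equals the number of outcomes, |A_x|.
mean_inv = sum(p * (1 / p) for p in probs)
print(mean_inv)  # 4.0

# Jensen's inequality: E[log(1/P(X))] <= log(E[1/P(X)]) since log is concave.
lhs = sum(p * math.log2(1 / p) for p in probs)  # this is H(X)
rhs = math.log2(mean_inv)                       # this is log|A_x|
print(lhs, rhs, lhs <= rhs)  # 1.75 2.0 True
```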

Perspectives on entropy:

  • If \( X \) is a random variable, we can create another random variable, say \( Y \), by applying a function to the outcomes: \( Y = f(X) \). What if this function depended not on the value of an outcome but on its probability? Since there is a function \( P \) that maps outcomes to probabilities, we can create the random variable \( Y = P(X) \). For Shannon information content and entropy, we create the variable \( Y = \log \frac{1}{P(X)} \). The entropy is then the expected value of this random variable (see the sketch below).
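A minimal sketch of this perspective, assuming a small hypothetical ensemble: the information content \( \log \frac{1}{P(X)} \) is treated as a random variable \( Y \), and its expectation, computed exactly and by sampling, recovers the entropy:

```python
import math
import random

# A small hypothetical ensemble: outcomes 'a', 'b', 'c' and their probabilities.
P = {"a": 0.5, "b": 0.25, "c": 0.25}

# Y = log2(1/P(X)): each outcome is mapped to its Shannon information
# content, which depends only on the outcome's probability.
def info_content(outcome):
    return math.log2(1 / P[outcome])

# Entropy as the expectation of Y, computed two ways.
# 1. Exactly, by summing over the alphabet.
exact = sum(P[x] * info_content(x) for x in P)

# 2. Empirically, as the sample mean of Y over draws of X.
samples = random.choices(list(P), weights=list(P.values()), k=100_000)
empirical = sum(info_content(x) for x in samples) / len(samples)

print(exact)      # 1.5 bits
print(empirical)  # ~1.5 bits
```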