\( \newcommand{\matr}[1] {\mathbf{#1}} \newcommand{\vertbar} {\rule[-1ex]{0.5pt}{2.5ex}} \newcommand{\horzbar} {\rule[.5ex]{2.5ex}{0.5pt}} \newcommand{\E} {\mathrm{E}} \)
\( \newcommand{\cat}[1] {\mathrm{#1}} \newcommand{\catobj}[1] {\operatorname{Obj}(\mathrm{#1})} \newcommand{\cathom}[1] {\operatorname{Hom}_{\cat{#1}}} \newcommand{\multiBetaReduction}[0] {\twoheadrightarrow_{\beta}} \newcommand{\betaReduction}[0] {\rightarrow_{\beta}} \newcommand{\betaEq}[0] {=_{\beta}} \newcommand{\string}[1] {\texttt{"}\mathtt{#1}\texttt{"}} \newcommand{\symbolq}[1] {\texttt{`}\mathtt{#1}\texttt{'}} \newcommand{\groupMul}[1] { \cdot_{\small{#1}}} \newcommand{\groupAdd}[1] { +_{\small{#1}}} \newcommand{\inv}[1] {#1^{-1} } \newcommand{\bm}[1] { \boldsymbol{#1} } \require{physics} \require{ams} \require{mathtools} \)
Math and science::INF ML AI

Negative log likelihood loss. A perspective.

Negative log likelihood loss is normally calculated as the negated mean of the log probabilities the model assigns to the observed data. This is:

\[ \text{loss} = -\frac{1}{N} \sum_{i=1}^{N} \log \mathcal{P}(\text{data}_i \mid \text{model\_out}_i) \]
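As a concrete sketch of this calculation (the function name and the toy probabilities below are illustrative, not taken from any particular library):

import numpy as np

def negative_log_likelihood(probs_of_observed):
    # probs_of_observed[i] is the probability the model assigned to the
    # data point that was actually observed at step i.
    return -np.mean(np.log(probs_of_observed))

# Toy example: four observations. The loss shrinks as the model puts
# more probability on what was actually seen.
print(negative_log_likelihood(np.array([0.9, 0.7, 0.2, 0.95])))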

As this mean is taken over many samples drawn from the data distribution, it approximates an expectation: the expected negative log probability that the model assigns to the data. Sound familiar? This is an approximation to the cross-entropy between the data distribution and the model, which has the same form as entropy:

\[ \text{entropy} = -\sum_{x} p(x) \log p(x) \]
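Spelling this out (writing \(p\) for the data distribution and \(q\) for the model's distribution; this notation is introduced here, not above):

\[ -\frac{1}{N} \sum_{i=1}^{N} \log q(x_i) \;\approx\; \E_{x \sim p}\!\left[ -\log q(x) \right] = -\sum_{x} p(x) \log q(x) = H(p) + D_{\mathrm{KL}}(p \parallel q) \]

So minimizing the loss can only reduce the \(D_{\mathrm{KL}}(p \parallel q)\) term; at the optimum the loss approaches the entropy \(H(p)\) of the data itself.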

Energy based

By minimizing this loss, we push the model to concentrate probability mass on the observed data points: ideally the distribution becomes as pointy as possible, giving weight to the data observed and little weight everywhere else. This mirrors the energy-based view of "pushing down" energy at the data, except that here high compatibility is expressed as high probability values rather than low energies.
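To make the energy-based picture concrete, one can assume a Gibbs parameterization of the model (this parameterization is an assumption added here, not something stated above):

\[ q(x) = \frac{e^{-E(x)}}{Z}, \qquad Z = \sum_{x'} e^{-E(x')}, \qquad -\log q(x) = E(x) + \log Z . \]

Minimizing the negative log likelihood then pushes down the energy \(E(x)\) at the observed data, while the shared \(\log Z\) term effectively pushes energy up elsewhere; stated in probability terms, mass is piled onto the data points.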