\( \newcommand{\matr}[1] {\mathbf{#1}} \newcommand{\vertbar} {\rule[-1ex]{0.5pt}{2.5ex}} \newcommand{\horzbar} {\rule[.5ex]{2.5ex}{0.5pt}} \newcommand{\E} {\mathrm{E}} \)

Motivating ELBO From Importance Sampling

This is the tl;dr of a longer (and not yet existing) post on variational auto-encoders.

Derivation idea

The evidence lower bound (ELBO) expression appears naturally when you try to sample the posterior distribution using an approximate distribution. I think this way of arriving at the evidence lower bound is intuitive and makes it clearer why each concession is made.

Importance sampling allows us to calculate the expectation:

\[ \E_{z \sim \mathrm{P}_z}[f(z)] \]

by instead calculating:

\[ \E_{z \sim \mathrm{Q}_z}\left[f(z) \frac{\mathrm{P}_z(z)}{\mathrm{Q}_z(z)}\right] \]
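
To make the identity concrete, here is a quick numerical sketch (numpy and scipy; the particular \( \mathrm{P} \), \( \mathrm{Q} \), and \( f \) are arbitrary toy choices of mine, nothing from the VAE setting yet). The only real requirement is that \( \mathrm{Q} \) covers the support of \( \mathrm{P} \); here \( \mathrm{Q} \) is deliberately wider.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy choices: P = N(0, 1), Q = N(1, 2), f(z) = z^2.
# The true value of E_{z~P}[f(z)] is Var(z) = 1.
f = lambda z: z ** 2

# Direct Monte Carlo estimate, sampling from P.
z_p = rng.normal(0.0, 1.0, size=100_000)
direct = f(z_p).mean()

# Importance-sampled estimate: sample from Q, reweight by P(z)/Q(z).
z_q = rng.normal(1.0, 2.0, size=100_000)
w = norm.pdf(z_q, 0.0, 1.0) / norm.pdf(z_q, 1.0, 2.0)
importance = (f(z_q) * w).mean()

print(direct, importance)  # both close to 1.0
```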

We use this idea for variational inference. In order to calculate:

\[ \E_{z \sim \mathrm{P}_{z|x_i}} \left[\mathrm{P}_{x | z}(x_i, z) \right] \]

we instead calculate:

\[ \E_{z \sim \mathrm{Q}_{z|x_i}} \left[\mathrm{P}_{x | z}(x_i, z) \frac{\mathrm{P}_{z|x}(z, x_i)}{\mathrm{Q}_{z|x}(z, x_i)} \right] \]
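
The same mechanics in the notation above, sketched in a hypothetical conjugate Gaussian model chosen so the posterior is known exactly: prior \( z \sim N(0, 1) \) and likelihood \( x \mid z \sim N(z, 1) \), giving posterior \( z \mid x_i \sim N(x_i/2, 1/2) \). (Needing the posterior density inside the weights is, of course, the sticking point in practice.)

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical model: prior z ~ N(0, 1), likelihood x|z ~ N(z, 1),
# so the posterior is z|x_i ~ N(x_i/2, 1/2).
x_i = 1.5
post_mean, post_std = x_i / 2, np.sqrt(0.5)

# Direct estimate: sample the true posterior P_{z|x_i}.
z = rng.normal(post_mean, post_std, size=100_000)
direct = norm.pdf(x_i, loc=z, scale=1.0).mean()

# Approximate-distribution estimate: sample Q_{z|x_i} (a deliberately
# poor stand-in, N(0, 2)) and reweight by P_{z|x} / Q_{z|x}.
z_q = rng.normal(0.0, 2.0, size=100_000)
w = norm.pdf(z_q, post_mean, post_std) / norm.pdf(z_q, 0.0, 2.0)
approx = (norm.pdf(x_i, loc=z_q, scale=1.0) * w).mean()

print(direct, approx)  # both estimate the same expectation
```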

Because we use maximum likelihood as our optimization objective, we are actually interested in:

\[ \begin{equation} \log \left( \E_{z \sim \mathrm{Q}_{z|x_i}} \left[\mathrm{P}_{x | z}(x_i, z) \frac{\mathrm{P}_{z|x}(z, x_i)}{\mathrm{Q}_{z|x}(z, x_i)} \right] \right) \end{equation} \]

This is a log-likelihood term for just the single data point \( x_i \); there is one such term for every data point. We can't wait for sampling to close in on the expectation inside the log; instead, we want a snappier online calculation. So we take an accuracy hit and instead calculate:

\[ \begin{equation} \E_{z \sim \mathrm{Q}_{z|x_i}} \left[ \log \left( \mathrm{P}_{x | z}(x_i, z) \frac{\mathrm{P}_{z|x}(z, x_i)}{\mathrm{Q}_{z|x}(z, x_i)} \right) \right] \end{equation} \]

So we are sampling to approximate the log, rather than taking the log of a completed approximation. This frees us to use the gradient of a single sample. Jensen's inequality assures us that this new term (2) is at most (1), so we can use it as a proxy when optimizing (1). We can rewrite this expression and arrive at:

\[ \E_{z \sim \mathrm{Q}_{z|x_i}} \left[ \log \left( \mathrm{P}_{x | z}(x_i, z) \right) + \log \left( \mathrm{P}_{z|x}(z, x_i) \right) - \log \left( \mathrm{Q}_{z|x}(z, x_i) \right) \right]\]

And as the expectation of a sum is the sum of expectations, and the last two terms together form a Kullback-Leibler divergence, we get:

\[ \E_{z \sim \mathrm{Q}_{z|x_i}} \left[ \log \left( \mathrm{P}_{x | z}(x_i, z) \right) \right] - D_{KL}\left( \mathrm{Q}_{z|x_i} \| \mathrm{P}_{z|x_i} \right) \]

These are the reconstruction and Kullback-Leibler divergence terms, and together they are the ELBO.
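
To spell out where the Kullback-Leibler term comes from (nothing new here, just its definition applied to the last two logs):

\[ \E_{z \sim \mathrm{Q}_{z|x_i}} \left[ \log \left( \mathrm{P}_{z|x}(z, x_i) \right) - \log \left( \mathrm{Q}_{z|x}(z, x_i) \right) \right] = - \E_{z \sim \mathrm{Q}_{z|x_i}} \left[ \log \frac{\mathrm{Q}_{z|x}(z, x_i)}{\mathrm{P}_{z|x}(z, x_i)} \right] = - D_{KL}\left( \mathrm{Q}_{z|x_i} \| \mathrm{P}_{z|x_i} \right) \]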

If we happened to choose a \( \mathrm{Q}_{z | x_i } \) distribution right on the mark, so that it equals \( \mathrm{P}_{z | x_i} \), the Kullback-Leibler term vanishes and we would be calculating:

\[ \E_{z \sim \mathrm{P}_{z|x_i}} \left[ \log \left( \mathrm{P}_{x | z}(x_i, z) \right) \right] \]

This isn't quite what we were after, which was:

\[ \log \left( \E_{z \sim \mathrm{P}_{z|x_i}} \left[ \mathrm{P}_{x | z}(x_i, z) \right] \right) \]
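
To see the gap concretely, here is a tiny simulation in the same kind of conjugate Gaussian toy model as before (my choice, purely because the posterior is available in closed form): the mean of the log lands below the log of the mean, exactly as Jensen promises.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy model: prior z ~ N(0, 1), likelihood x|z ~ N(z, 1),
# so the posterior is z|x_i ~ N(x_i/2, 1/2).
x_i = 1.5
z = rng.normal(x_i / 2, np.sqrt(0.5), size=100_000)

# P_{x|z}(x_i, z) evaluated at posterior samples.
likelihood = norm.pdf(x_i, loc=z, scale=1.0)

log_of_mean = np.log(likelihood.mean())   # what we were after
mean_of_log = np.log(likelihood).mean()   # what we actually compute

print(mean_of_log <= log_of_mean)  # True: the Jensen gap
print(mean_of_log, log_of_mean)
```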

But Jensen looks down fondly at us and tells us we did alright.