\( \newcommand{\matr}[1] {\mathbf{#1}} \newcommand{\vertbar} {\rule[-1ex]{0.5pt}{2.5ex}} \newcommand{\horzbar} {\rule[.5ex]{2.5ex}{0.5pt}} \newcommand{\E} {\mathrm{E}} \)

Motivating ELBO From Importance Sampling

Every derivation of the evidence lower bound I've seen has been unsatisfying in terms of motivation: they all seem to just move symbols around. For me, the most convincing way to arrive at the evidence lower bound is from the perspective of importance sampling.

Imagine you are trying to calculate an average, but each value contributing to the average requires carrying out importance sampling to estimate. Once the average is calculated, you will be maximizing it for some optimization task. When you can't afford the burden of nested importance sampling, a natural concession leads to the evidence lower bound.

Importance sampling allows us to calculate the expectation:

\[ \E_{z \sim \mathrm{P}_z}[f(z)] \]

by sampling from a proposal distribution \( \mathrm{Q}_z \) whose support covers that of \( \mathrm{P}_z \), and instead calculating:

\[ \E_{z \sim \mathrm{Q}_z}\left[f(z) \frac{\mathrm{P}_z(z)}{\mathrm{Q}_z(z)}\right] \]
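As a quick sanity check, here is a minimal sketch with an assumed toy setup (everything below is illustrative, not from the derivation): target \( \mathrm{P}_z = \mathcal{N}(0, 1) \), proposal \( \mathrm{Q}_z = \mathcal{N}(1, 1) \), and \( f(z) = z^2 \), so the true expectation is 1.

```python
import math
import random

def normal_pdf(z, mu, sigma):
    """Density of N(mu, sigma^2) at z."""
    return math.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

random.seed(0)
n = 200_000

# Draw from the proposal Q_z = N(1, 1) instead of the target P_z = N(0, 1),
# and reweight each sample by P_z(z) / Q_z(z).
samples = [random.gauss(1.0, 1.0) for _ in range(n)]
weights = [normal_pdf(z, 0.0, 1.0) / normal_pdf(z, 1.0, 1.0) for z in samples]

# Importance-sampled estimate of E_{z ~ P_z}[z^2]; the true value is 1.
estimate = sum(w * z ** 2 for w, z in zip(weights, samples)) / n
assert abs(estimate - 1.0) < 0.05
```

The reweighting corrects for sampling from the wrong distribution; the estimator stays unbiased as long as the proposal covers the target's support.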

We use this idea for variational inference. The quantity we care about for maximum likelihood is the evidence of a data point \( x_i \), which is an expectation over the prior:

\[ \mathrm{P}_x(x_i) = \E_{z \sim \mathrm{P}_z} \left[\mathrm{P}_{x | z}(x_i, z) \right] \]

Sampling \( z \) from the prior is wasteful, as most prior samples explain \( x_i \) poorly, so we instead calculate:

\[ \E_{z \sim \mathrm{Q}_{z|x_i}} \left[\mathrm{P}_{x | z}(x_i, z) \frac{\mathrm{P}_z(z)}{\mathrm{Q}_{z|x}(z, x_i)} \right] \]

where \( \mathrm{Q}_{z|x_i} \) is a proposal distribution conditioned on the data point. For reasons to do with using maximum likelihood as our optimization objective, we are actually interested in the log of that term:

\[ \begin{equation} \log \left( \E_{z \sim \mathrm{Q}_{z|x_i}} \left[\mathrm{P}_{x | z}(x_i, z) \frac{\mathrm{P}_z(z)}{\mathrm{Q}_{z|x}(z, x_i)} \right] \right) \end{equation} \]

This is the log-likelihood of the single data point \( x_i \), and there is one such term for every data point. We can't wait for sampling to close in on the expectation inside the log; we want a snappier online calculation. So we take an accuracy hit and instead calculate:

\[ \begin{equation} \E_{z \sim \mathrm{Q}_{z|x_i}} \left[ \log \left( \mathrm{P}_{x | z}(x_i, z) \frac{\mathrm{P}_z(z)}{\mathrm{Q}_{z|x}(z, x_i)} \right) \right] \end{equation} \]

So we are sampling to approximate the log, rather than taking the log of a completed approximation. This frees us to use gradients of a single sample. Jensen's inequality assures us that this new term (2) is at most (1), so we will use it as a proxy to maximize (1). We can rewrite this expression and arrive at:

\[ \E_{z \sim \mathrm{Q}_{z|x_i}} \left[ \log \left( \mathrm{P}_{x | z}(x_i, z) \right) + \log \left( \mathrm{P}_z(z) \right) - \log \left( \mathrm{Q}_{z|x}(z, x_i) \right) \right]\]

And as the expectation of a sum is the sum of expectations, we get:

\[ \E_{z \sim \mathrm{Q}_{z|x_i}} \left[ \log \left( \mathrm{P}_{x | z}(x_i, z) \right) \right] - D_{KL}(\small{\mathrm{Q}_{z|x} || \mathrm{P}_z} ) \]

This is the evidence lower bound (ELBO): a reconstruction term minus the Kullback-Leibler divergence from the approximate posterior \( \mathrm{Q}_{z|x} \) to the prior \( \mathrm{P}_z \).
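The bound and its gap can be checked numerically. Here is a minimal sketch under an assumed conjugate toy model (all specifics below are illustrative assumptions): prior \( z \sim \mathcal{N}(0, 1) \) and likelihood \( x \mid z \sim \mathcal{N}(z, 1) \), which gives evidence \( x \sim \mathcal{N}(0, 2) \) and posterior \( z \mid x \sim \mathcal{N}(x/2, 1/2) \), so every term has a closed form.

```python
import math

# Assumed conjugate toy model (for illustration only):
#   prior      z ~ N(0, 1)
#   likelihood x | z ~ N(z, 1)
# which gives evidence x ~ N(0, 2) and posterior z | x ~ N(x/2, 1/2).

def log_normal_pdf(v, mu, var):
    return -0.5 * math.log(2.0 * math.pi * var) - (v - mu) ** 2 / (2.0 * var)

def kl_normal(mu_q, var_q, mu_p, var_p):
    # Closed-form KL(N(mu_q, var_q) || N(mu_p, var_p)).
    return 0.5 * (math.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def elbo(x, mu_q, var_q):
    # E_Q[log P(x|z)] in closed form, using E_Q[(x - z)^2] = (x - mu_q)^2 + var_q.
    recon = -0.5 * math.log(2.0 * math.pi) - 0.5 * ((x - mu_q) ** 2 + var_q)
    return recon - kl_normal(mu_q, var_q, 0.0, 1.0)  # minus KL(Q || prior)

x = 1.7
log_evidence = log_normal_pdf(x, 0.0, 2.0)
bound = elbo(x, mu_q=0.3, var_q=0.8)      # an imperfect Q, chosen arbitrarily
gap = kl_normal(0.3, 0.8, x / 2.0, 0.5)   # KL(Q || true posterior)

assert bound <= log_evidence                      # the ELBO is a lower bound
assert abs((log_evidence - bound) - gap) < 1e-12  # the gap is exactly KL(Q || posterior)
```

The second assertion verifies the standard identity \( \log \mathrm{P}_x(x_i) = \mathrm{ELBO} + D_{KL}(\mathrm{Q}_{z|x} || \mathrm{P}_{z|x}) \).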

If we happened to choose a \( \mathrm{Q}_{z | x_i } \) distribution right on the mark, so that it equals the true posterior \( \mathrm{P}_{z | x_i} \), then Bayes' rule tells us the quantity inside the log is constant in \( z \):

\[ \mathrm{P}_{x | z}(x_i, z) \frac{\mathrm{P}_z(z)}{\mathrm{P}_{z|x}(z, x_i)} = \mathrm{P}_x(x_i) \]

Jensen's inequality is tight for a constant argument, so (2) equals (1) exactly and the ELBO reaches the log-likelihood \( \log \mathrm{P}_x(x_i) \). In general, the gap between (1) and (2) is precisely \( D_{KL}(\mathrm{Q}_{z|x} || \mathrm{P}_{z|x}) \), so the closer \( \mathrm{Q} \) gets to the true posterior, the tighter the bound. Jensen looks down fondly at us and tells us we did all right.
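This constancy is easy to check directly. A minimal sketch, again under an assumed conjugate toy model (prior \( z \sim \mathcal{N}(0, 1) \), likelihood \( x \mid z \sim \mathcal{N}(z, 1) \), hence evidence \( \mathcal{N}(0, 2) \) and posterior \( \mathcal{N}(x/2, 1/2) \); all specifics are illustrative):

```python
import math

def log_normal_pdf(v, mu, var):
    return -0.5 * math.log(2.0 * math.pi * var) - (v - mu) ** 2 / (2.0 * var)

# Assumed toy model: z ~ N(0, 1), x | z ~ N(z, 1),
# so evidence x ~ N(0, 2) and posterior z | x ~ N(x/2, 1/2).
x = 1.7
log_evidence = log_normal_pdf(x, 0.0, 2.0)

# log P(x|z) + log P(z) - log P(z|x) should not depend on z: it is log P(x).
for z in (-2.0, 0.0, 0.5, 3.1):
    log_ratio = (log_normal_pdf(x, z, 1.0)
                 + log_normal_pdf(z, 0.0, 1.0)
                 - log_normal_pdf(z, x / 2.0, 0.5))
    assert abs(log_ratio - log_evidence) < 1e-12
```

Whatever \( z \) we plug in, the ratio collapses to the evidence, which is why a perfect \( \mathrm{Q} \) makes the bound exact.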
