# Motivating ELBO From Importance Sampling

This is the tl;dr version of a longer (and not yet existing) post on variational auto-encoders.

## Derivation idea

The evidence lower bound (ELBO) expression appears naturally when you try to sample the posterior distribution with an approximate distribution. I think this way of arriving at the evidence lower bound is intuitive and reveals more about why concessions are being made.

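Importance sampling allows us to calculate an expectation under one distribution:

$\E_{z \sim \mathrm{P}_z}[f(z)]$

by sampling from a different distribution, $$\mathrm{Q}_z$$, and reweighting each sample:

$\E_{z \sim \mathrm{Q}_z}\left[f(z) \frac{\mathrm{P}_z(z)}{\mathrm{Q}_z(z)}\right]$

The two are equal as long as $$\mathrm{Q}_z(z) > 0$$ wherever $$f(z) \mathrm{P}_z(z) \neq 0$$.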
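To make the reweighting concrete, here is a minimal numerical sketch. The choices of $$\mathrm{P}_z$$ as a standard normal, $$\mathrm{Q}_z$$ as a wider, shifted normal, and $$f(z) = z^2$$ are assumptions made purely for illustration; the helper names are hypothetical.

```python
import numpy as np

def normal_pdf(z, mean, std):
    """Density of a univariate normal distribution."""
    return np.exp(-0.5 * ((z - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def importance_estimate(f, n_samples, rng):
    """Estimate E_{z~P}[f(z)] using samples drawn from Q only.

    Assumed for illustration: P = N(0, 1) and Q = N(1, 2^2).
    """
    z = rng.normal(loc=1.0, scale=2.0, size=n_samples)            # z ~ Q
    weights = normal_pdf(z, 0.0, 1.0) / normal_pdf(z, 1.0, 2.0)   # P(z) / Q(z)
    return np.mean(f(z) * weights)

rng = np.random.default_rng(0)
estimate = importance_estimate(lambda z: z ** 2, 200_000, rng)
print(estimate)  # should land near 1.0, the exact E[z^2] under N(0, 1)
```

No sample is ever drawn from P here; the ratio P(z) / Q(z) corrects for having sampled from the wrong distribution.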

We use this idea for variational inference. In order to calculate:

$\E_{z \sim \mathrm{P}_{z|x_i}} \left[\mathrm{P}_{x | z}(x_i, z) \right]$

we sample from an approximate distribution $$\mathrm{Q}_{z|x_i}$$ and instead calculate:

$\E_{z \sim \mathrm{Q}_{z|x_i}} \left[\mathrm{P}_{x | z}(x_i, z) \frac{\mathrm{P}_{z|x}(z, x_i)}{\mathrm{Q}_{z|x}(z, x_i)} \right]$

For reasons to do with using maximum-likelihood as our optimization objective, we are actually interested in:

$$\log \left( \E_{z \sim \mathrm{Q}_{z|x_i}} \left[\mathrm{P}_{x | z}(x_i, z) \frac{\mathrm{P}_{z|x}(z, x_i)}{\mathrm{Q}_{z|x}(z, x_i)} \right] \right) \tag{1}$$

This is the log-likelihood term for a single data point $$x_i$$; there is one such term for every data point. We can't wait for sampling to close in on the expectation inside the log; we want a snappier online calculation. So we take an accuracy hit and instead calculate:

$$\E_{z \sim \mathrm{Q}_{z|x_i}} \left[ \log \left( \mathrm{P}_{x | z}(x_i, z) \frac{\mathrm{P}_{z|x}(z, x_i)}{\mathrm{Q}_{z|x}(z, x_i)} \right) \right] \tag{2}$$

So we are sampling to approximate the log, rather than taking the log of a completed approximation. This frees us to use gradients of a single sample. Because the log is concave, Jensen's inequality assures us that this new term (2) is never greater than (1), so we will use it as a proxy to optimize (1). Since the log of a product is a sum of logs, we can rewrite (2) and arrive at:
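The gap that Jensen's inequality describes is easy to see numerically. The sketch below uses a log-normal random variable as a stand-in for the positive quantity inside the log; that choice is an assumption made only because both sides then have known exact values (0.5 and 0.0).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical positive quantity standing in for the term inside the log.
# w = exp(eps) with eps ~ N(0, 1), so E[w] = exp(0.5) and E[log w] = 0.
w = np.exp(rng.normal(size=500_000))

log_of_mean = np.log(np.mean(w))   # log of a completed approximation, as in (1)
mean_of_log = np.mean(np.log(w))   # sampling to approximate the log, as in (2)

print(log_of_mean, mean_of_log)    # the second is never larger than the first
```

Each individual `np.log(w[i])` is already an unbiased sample of the second quantity, which is what makes single-sample gradient steps possible; no single sample gives an unbiased estimate of the first.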

$\E_{z \sim \mathrm{Q}_{z|x_i}} \left[ \log \left( \mathrm{P}_{x | z}(x_i, z) \right) + \log \left( \mathrm{P}_{z|x}(z, x_i) \right) - \log \left( \mathrm{Q}_{z|x}(z, x_i) \right) \right]$

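The expectation of a sum is the sum of expectations, and the expectation under $$\mathrm{Q}$$ of $$\log \mathrm{P} - \log \mathrm{Q}$$ is, by definition, the negative of a Kullback-Leibler divergence, so we get:

$\E_{z \sim \mathrm{Q}_{z|x_i}} \left[ \log \left( \mathrm{P}_{x | z}(x_i, z) \right) \right] - D_{KL}(\small{\mathrm{Q}_{z|x} \| \mathrm{P}_{z|x}} )$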

Together, the expected log-likelihood term and the Kullback-Leibler divergence term make up the ELBO.
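As a sanity check on the sign of the divergence term: the expectation under Q of log P minus log Q is the negative of the KL divergence from Q to P, so the divergence enters the bound as a penalty. The sketch below checks this by Monte Carlo against the closed form for two univariate Gaussians. The particular Gaussians are hypothetical stand-ins for $$\mathrm{Q}_{z|x}$$ and $$\mathrm{P}_{z|x}$$, chosen only because their KL divergence has a simple closed form.

```python
import numpy as np

def normal_logpdf(z, mean, std):
    """Log-density of a univariate normal distribution."""
    return -0.5 * ((z - mean) / std) ** 2 - np.log(std) - 0.5 * np.log(2.0 * np.pi)

def gaussian_kl(mean_q, std_q, mean_p, std_p):
    """Closed-form KL(Q || P) for univariate normals Q and P."""
    return (np.log(std_p / std_q)
            + (std_q ** 2 + (mean_q - mean_p) ** 2) / (2.0 * std_p ** 2)
            - 0.5)

rng = np.random.default_rng(0)

# Assumed for illustration: Q = N(1, 0.5^2) and P = N(0, 1).
z = rng.normal(loc=1.0, scale=0.5, size=200_000)                        # z ~ Q
mc = np.mean(normal_logpdf(z, 0.0, 1.0) - normal_logpdf(z, 1.0, 0.5))   # E_Q[log P - log Q]
kl = gaussian_kl(1.0, 0.5, 0.0, 1.0)

print(mc, -kl)  # the two values should agree closely
```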

If we happened to choose a $$\mathrm{Q}_{z | x_i }$$ distribution right on the mark, so that it equals $$\mathrm{P}_{z | x_i}$$, the Kullback-Leibler divergence term would vanish and we would be calculating:

$\E_{z \sim \mathrm{P}_{z|x_i}} \left[ \log \left( \mathrm{P}_{x | z}(x_i, z) \right) \right]$

This isn't quite what we were after, which was:

$\log \left( \E_{z \sim \mathrm{P}_{z|x_i}} \left[ \mathrm{P}_{x | z}(x_i, z) \right] \right)$

But Jensen looks down fondly at us and tells us we did alright.