# Motivating ELBO From Importance Sampling

This is a tl;dr of a longer (and not yet written) post on variational auto-encoders.

## Derivation idea

The evidence lower bound (ELBO) expression appears naturally when you try to sample from the posterior distribution via an approximating distribution. I think this way of arriving at the ELBO is intuitive and reveals more about why its concessions are made.

Importance sampling allows us to calculate the expectation:

\[ \mathbb{E}_{z \sim \mathrm{P}} \left[ f(z) \right] \]

by instead calculating, with samples drawn from a more convenient distribution \( \mathrm{Q} \):

\[ \mathbb{E}_{z \sim \mathrm{Q}} \left[ \frac{\mathrm{P}(z)}{\mathrm{Q}(z)} \, f(z) \right] \]
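As a minimal sketch (target, proposal, and \( f \) all made up for illustration): we estimate \( \mathbb{E}_{z \sim \mathrm{P}}[z^2] = 1 \) for a standard normal \( \mathrm{P} \), while only ever sampling from a shifted, wider normal \( \mathrm{Q} \):

```python
import math
import random

random.seed(0)

def normal_pdf(z, mean, std):
    """Density of N(mean, std) at z."""
    return math.exp(-((z - mean) ** 2) / (2 * std * std)) / (std * math.sqrt(2 * math.pi))

# Target P = N(0, 1); we want E_P[f(z)] with f(z) = z^2, which is exactly 1.
# Proposal Q = N(1, 2): easy to sample, and its support covers P's.
n = 200_000
total = 0.0
for _ in range(n):
    z = random.gauss(1, 2)                              # z ~ Q
    weight = normal_pdf(z, 0, 1) / normal_pdf(z, 1, 2)  # P(z) / Q(z)
    total += weight * z * z                             # weight * f(z)
estimate = total / n
print(round(estimate, 2))  # close to the true value 1
```

The reweighting by \( \mathrm{P}(z) / \mathrm{Q}(z) \) is what corrects for sampling from the "wrong" distribution.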

We use this idea for variational inference. In order to calculate the evidence:

\[ \mathrm{P}(x_i) = \mathbb{E}_{z \sim \mathrm{P}_z} \left[ \mathrm{P}(x_i | z) \right] \]

we instead calculate, with \( z \) sampled from an approximate posterior \( \mathrm{Q}_{z | x_i} \):

\[ \mathbb{E}_{z \sim \mathrm{Q}_{z | x_i}} \left[ \frac{\mathrm{P}_z(z)}{\mathrm{Q}_{z | x_i}(z)} \, \mathrm{P}(x_i | z) \right] \]
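Here is a sketch of this estimator on a toy model (all distributions assumed for illustration): prior \( z \sim \mathcal{N}(0, 1) \), likelihood \( x | z \sim \mathcal{N}(z, 1) \), so the true evidence \( \mathrm{P}(x) \) is a \( \mathcal{N}(0, \sqrt{2}) \) density we can check against:

```python
import math
import random

random.seed(0)

def normal_pdf(v, mean, std):
    """Density of N(mean, std) at v."""
    return math.exp(-((v - mean) ** 2) / (2 * std * std)) / (std * math.sqrt(2 * math.pi))

# Toy model, made up for illustration: prior z ~ N(0, 1), likelihood x | z ~ N(z, 1).
# The true evidence is then P(x) = N(0, sqrt(2)) evaluated at x.
x_i = 1.0
true_evidence = normal_pdf(x_i, 0, math.sqrt(2))

# Approximate posterior Q(z | x_i): a guess centred near the true posterior mean x_i / 2.
n = 100_000
total = 0.0
for _ in range(n):
    z = random.gauss(x_i / 2, 1.0)                              # z ~ Q(z | x_i)
    weight = normal_pdf(z, 0, 1) / normal_pdf(z, x_i / 2, 1.0)  # P(z) / Q(z | x_i)
    total += weight * normal_pdf(x_i, z, 1)                     # weight * P(x_i | z)
estimate = total / n
print(round(estimate, 3), round(true_evidence, 3))
```

The better \( \mathrm{Q}_{z | x_i} \) matches the true posterior, the lower the variance of this estimator.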

For reasons to do with using maximum likelihood as our optimization objective, we are actually interested in the log of this quantity:

\[ \log \mathrm{P}(x_i) = \log \mathbb{E}_{z \sim \mathrm{Q}_{z | x_i}} \left[ \frac{\mathrm{P}_z(z)}{\mathrm{Q}_{z | x_i}(z)} \, \mathrm{P}(x_i | z) \right] \tag{1} \]

This is a log-likelihood term for just the single data point \( x_i \), and there is one such term for *every* data point. We can't wait for sampling to close in on the expectation inside the log; we want a snappier online calculation. So, we take an accuracy hit and instead calculate:

\[ \mathbb{E}_{z \sim \mathrm{Q}_{z | x_i}} \left[ \log \left( \frac{\mathrm{P}_z(z)}{\mathrm{Q}_{z | x_i}(z)} \, \mathrm{P}(x_i | z) \right) \right] \tag{2} \]

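As a quick numeric reminder of the direction of that accuracy hit: the log is concave, so the mean of logs never exceeds the log of the mean. A made-up two-point example:

```python
import math

# Jensen's inequality for the concave log: E[log X] <= log E[X].
# Tiny discrete check: X is 1 or 9 with equal probability.
values = [1.0, 9.0]
mean_of_log = sum(math.log(v) for v in values) / len(values)  # log 3
log_of_mean = math.log(sum(values) / len(values))             # log 5
print(mean_of_log <= log_of_mean)  # True
```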
So we are sampling to approximate the log, rather than taking the log of a completed approximation. This frees us to use gradients of a single sample. Jensen's inequality assures us that this new term (2) is at most (1), so we will use it as a proxy to optimize (1). We can rewrite this expression and arrive at:

\[ \mathbb{E}_{z \sim \mathrm{Q}_{z | x_i}} \left[ \log \mathrm{P}(x_i | z) + \log \frac{\mathrm{P}_z(z)}{\mathrm{Q}_{z | x_i}(z)} \right] \]

And as the expectation of a sum is the sum of expectations, we get:

\[ \mathbb{E}_{z \sim \mathrm{Q}_{z | x_i}} \left[ \log \mathrm{P}(x_i | z) \right] - \mathrm{KL} \left( \mathrm{Q}_{z | x_i} \,\|\, \mathrm{P}_z \right) \]

which are the reconstruction and Kullback-Leibler divergence terms; together they make up the ELBO.
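A discrete sanity check of this decomposition, with all probabilities made up: it also verifies the equivalent form \( \mathrm{ELBO} = \log \mathrm{P}(x_i) - \mathrm{KL}( \mathrm{Q}_{z | x_i} \,\|\, \mathrm{P}_{z | x_i} ) \), which makes the lower bound explicit:

```python
import math

# A two-valued latent z, with made-up numbers.
prior = [0.5, 0.5]   # P(z)
lik   = [0.2, 0.6]   # P(x_i | z) for one fixed data point x_i
q     = [0.4, 0.6]   # our approximate posterior Q(z | x_i)

evidence  = sum(p * l for p, l in zip(prior, lik))            # P(x_i)
posterior = [p * l / evidence for p, l in zip(prior, lik)]    # P(z | x_i)

reconstruction = sum(qz * math.log(l) for qz, l in zip(q, lik))
kl_to_prior    = sum(qz * math.log(qz / pz) for qz, pz in zip(q, prior))
elbo = reconstruction - kl_to_prior

# Equivalent form: ELBO = log P(x_i) - KL(Q || true posterior) <= log P(x_i)
kl_to_posterior = sum(qz * math.log(qz / pz) for qz, pz in zip(q, posterior))
print(abs(elbo - (math.log(evidence) - kl_to_posterior)) < 1e-12)  # True
```

Since the KL term to the true posterior is non-negative, the ELBO can only ever undershoot \( \log \mathrm{P}(x_i) \).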

If we happened to choose a \( \mathrm{Q}_{z | x_i} \) distribution right on the mark, so that it equals the true posterior \( \mathrm{P}_{z | x_i} \), then we would be calculating:

\[ \mathbb{E}_{z \sim \mathrm{P}_{z | x_i}} \left[ \log \left( \frac{\mathrm{P}_z(z)}{\mathrm{P}_{z | x_i}(z)} \, \mathrm{P}(x_i | z) \right) \right] \]
This isn't quite what we were after, which was:

\[ \log \mathbb{E}_{z \sim \mathrm{P}_{z | x_i}} \left[ \frac{\mathrm{P}_z(z)}{\mathrm{P}_{z | x_i}(z)} \, \mathrm{P}(x_i | z) \right] = \log \mathrm{P}(x_i) \]

But Jensen looks down fondly at us and tells us we did alright: by Bayes' rule, the ratio inside the expectation is the constant \( \mathrm{P}(x_i) \), so the inequality between (2) and (1) becomes an equality and both expressions equal \( \log \mathrm{P}(x_i) \).
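A tiny discrete check of that equality (numbers made up): with \( \mathrm{Q} \) set to the exact posterior, the ratio inside the log is constant and the ELBO lands exactly on \( \log \mathrm{P}(x_i) \):

```python
import math

prior = [0.5, 0.5]   # P(z), made-up numbers
lik   = [0.2, 0.6]   # P(x_i | z)

evidence = sum(p * l for p, l in zip(prior, lik))     # P(x_i)
q = [p * l / evidence for p, l in zip(prior, lik)]    # Q = the true posterior

# Each ratio P(z) * P(x_i | z) / Q(z) collapses to the constant P(x_i),
# so the expectation of the log equals the log of the expectation.
elbo = sum(qz * math.log(pz * l / qz) for qz, pz, l in zip(q, prior, lik))
print(abs(elbo - math.log(evidence)) < 1e-12)  # True
```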