Motivating ELBO From Importance Sampling
This is the tl;dr version of a longer (and not yet written) post on variational auto-encoders.
Derivation idea
The evidence lower bound (ELBO) appears naturally when you try to estimate the marginal likelihood by importance sampling, with an approximate posterior distribution as the proposal. I think this way of arriving at the evidence lower bound is intuitive and makes it clearer why each concession is being made.
Importance sampling allows us to calculate the expectation:
\[ \mathbb{E}_{z \sim \mathrm{P}}\!\left[ f(z) \right] \]
by instead calculating:
\[ \mathbb{E}_{z \sim \mathrm{Q}}\!\left[ f(z)\,\frac{\mathrm{P}(z)}{\mathrm{Q}(z)} \right], \]
where \( \mathrm{Q} \) is any distribution we can conveniently sample from.
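The identity behind this is just multiplying and dividing by the sampling density (assuming \( \mathrm{Q}(z) > 0 \) wherever \( \mathrm{P}(z) > 0 \)):
\[
\mathbb{E}_{z \sim \mathrm{P}}\!\left[ f(z) \right]
= \int f(z)\,\mathrm{P}(z)\,dz
= \int f(z)\,\frac{\mathrm{P}(z)}{\mathrm{Q}(z)}\,\mathrm{Q}(z)\,dz
= \mathbb{E}_{z \sim \mathrm{Q}}\!\left[ f(z)\,\frac{\mathrm{P}(z)}{\mathrm{Q}(z)} \right].
\]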
We use this idea for variational inference. In order to calculate the marginal likelihood of a single data point \( x_i \):
\[ \mathrm{P}_{x_i} \;=\; \mathbb{E}_{z \sim \mathrm{P}_{z}}\!\left[ \mathrm{P}_{x_i | z} \right], \]
we instead calculate:
\[ \mathrm{P}_{x_i} \;=\; \mathbb{E}_{z \sim \mathrm{Q}_{z | x_i}}\!\left[ \mathrm{P}_{x_i | z}\,\frac{\mathrm{P}_{z}}{\mathrm{Q}_{z | x_i}} \right], \]
where \( \mathrm{Q}_{z | x_i} \) is an approximate posterior that we can sample from.
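As a sanity check, here is a minimal NumPy/SciPy sketch of this importance-sampled marginal likelihood. The Gaussian toy model, the data point and the \( \mathrm{Q}_{z | x_i} \) parameters are all just illustrative choices:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy model (made up for illustration):
#   prior       P(z)     = N(0, 1)
#   likelihood  P(x | z) = N(z, 0.5^2)
# so the exact marginal is P(x) = N(0, 1 + 0.5^2), which we can compare against.
x_i, sigma = 1.3, 0.5

# An approximate posterior Q(z | x_i) we can sample from (parameters are arbitrary).
q_mean, q_std = 1.0, 0.8

z = rng.normal(q_mean, q_std, size=100_000)                   # z ~ Q(z | x_i)
weights = norm.pdf(z, 0.0, 1.0) / norm.pdf(z, q_mean, q_std)  # P(z) / Q(z | x_i)
p_xi = np.mean(norm.pdf(x_i, z, sigma) * weights)             # E_Q[ P(x_i|z) P(z) / Q ]

print(p_xi, norm.pdf(x_i, 0.0, np.sqrt(1.0 + sigma**2)))      # the two should agree
```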
For reasons to do with using maximum likelihood as our optimization objective, we are actually interested in:
\[ \log \mathrm{P}_{x_i} \;=\; \log\, \mathbb{E}_{z \sim \mathrm{Q}_{z | x_i}}\!\left[ \mathrm{P}_{x_i | z}\,\frac{\mathrm{P}_{z}}{\mathrm{Q}_{z | x_i}} \right]. \tag{1} \]
This is the log-likelihood term for just the single data point \( x_i \); there is one such term for every data point. We can't wait for sampling to close in on the expectation inside the log; instead we want a snappier, online calculation. So we take an accuracy hit and instead calculate:
\[ \mathbb{E}_{z \sim \mathrm{Q}_{z | x_i}}\!\left[ \log\!\left( \mathrm{P}_{x_i | z}\,\frac{\mathrm{P}_{z}}{\mathrm{Q}_{z | x_i}} \right) \right]. \tag{2} \]
So we are averaging the log of each sample, rather than taking the log of a completed approximation of the expectation. This frees us to take a gradient step from a single sample. Jensen's inequality assures us that this new term (2) is at most (1), so we will use it as a proxy when optimizing (1). We can rewrite this expression and arrive at:
\[ \mathbb{E}_{z \sim \mathrm{Q}_{z | x_i}}\!\left[ \log \mathrm{P}_{x_i | z} \;+\; \log \frac{\mathrm{P}_{z}}{\mathrm{Q}_{z | x_i}} \right]. \]
And as the expectation of a sum is the sum of expectations, we get:
\[ \mathbb{E}_{z \sim \mathrm{Q}_{z | x_i}}\!\left[ \log \mathrm{P}_{x_i | z} \right] \;+\; \mathbb{E}_{z \sim \mathrm{Q}_{z | x_i}}\!\left[ \log \frac{\mathrm{P}_{z}}{\mathrm{Q}_{z | x_i}} \right] \;=\; \mathbb{E}_{z \sim \mathrm{Q}_{z | x_i}}\!\left[ \log \mathrm{P}_{x_i | z} \right] \;-\; \mathrm{KL}\!\left( \mathrm{Q}_{z | x_i} \,\|\, \mathrm{P}_{z} \right). \]
These are the familiar ELBO terms: an expected reconstruction log-likelihood and a Kullback-Leibler divergence that pulls \( \mathrm{Q}_{z | x_i} \) towards the prior.
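Sticking with the same made-up toy model, here is a sketch of the gap between (1) and (2), and of the decomposed form with its closed-form Gaussian KL term (all distributions and numbers are illustrative choices):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Same toy model: P(z) = N(0, 1), P(x | z) = N(z, 0.5^2), Q(z | x_i) = N(1.0, 0.8^2).
x_i, sigma = 1.3, 0.5
q_mean, q_std = 1.0, 0.8

def log_weight(z):
    # log( P(x_i | z) P(z) / Q(z | x_i) ) for a sample z drawn from Q(z | x_i)
    return (norm.logpdf(x_i, z, sigma)
            + norm.logpdf(z, 0.0, 1.0)
            - norm.logpdf(z, q_mean, q_std))

z = rng.normal(q_mean, q_std, size=100_000)

# (1): log of a (well-converged) Monte Carlo average -- the log-likelihood log P(x_i).
log_of_mean = np.log(np.mean(np.exp(log_weight(z))))

# (2): average of the logs -- every single log_weight(z_k) is already an unbiased
# one-sample estimate of this, which is what makes the online calculation snappy.
mean_of_log = np.mean(log_weight(z))

# The same quantity in its decomposed form: expected reconstruction log-likelihood
# minus the closed-form KL divergence between the Gaussians Q(z | x_i) and P(z).
reconstruction = np.mean(norm.logpdf(x_i, z, sigma))
kl = 0.5 * (q_std**2 + q_mean**2 - 1.0 - 2.0 * np.log(q_std))
decomposed = reconstruction - kl

print(log_of_mean)              # (1)
print(mean_of_log, decomposed)  # (2) computed two ways; Jensen: (2) <= (1)
```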
If we happened to choose a \( \mathrm{Q}_{z | x_i} \) distribution right on the mark, so that it equals the true posterior \( \mathrm{P}_{z | x_i} \), then we would be calculating:
\[ \mathbb{E}_{z \sim \mathrm{P}_{z | x_i}}\!\left[ \log \mathrm{P}_{x_i | z} \right] \;-\; \mathrm{KL}\!\left( \mathrm{P}_{z | x_i} \,\|\, \mathrm{P}_{z} \right). \]
This isn't quite what we were after, which was:
\[ \log \mathrm{P}_{x_i} \;=\; \log\, \mathbb{E}_{z \sim \mathrm{P}_{z}}\!\left[ \mathrm{P}_{x_i | z} \right]. \]
But Jensen looks down fondly at us and tells us we did alright.
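In fact, for this perfect choice the bound is tight: Bayes' rule gives \( \mathrm{P}_{x_i | z}\,\mathrm{P}_{z} = \mathrm{P}_{z | x_i}\,\mathrm{P}_{x_i} \), so the ratio inside the log in (2) is the constant \( \mathrm{P}_{x_i} \) and
\[ \mathbb{E}_{z \sim \mathrm{P}_{z | x_i}}\!\left[ \log \frac{\mathrm{P}_{x_i | z}\,\mathrm{P}_{z}}{\mathrm{P}_{z | x_i}} \right] \;=\; \log \mathrm{P}_{x_i}. \]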