\( \newcommand{\matr}[1] {\mathbf{#1}} \newcommand{\vertbar} {\rule[-1ex]{0.5pt}{2.5ex}} \newcommand{\horzbar} {\rule[.5ex]{2.5ex}{0.5pt}} \newcommand{\E} {\mathrm{E}} \)
deepdream of a sidewalk

Posts

Inside Neural Network Training

Below are some videos showing how weights, activations, and gradients change as a network is trained. The videos were made to test the idea that layers closer to the input stabilize earlier than layers closer to the output. They suggest this hypothesis is wrong: quite often the updates to the last layer are the first to slow down, and the updates to the first layer are the last. Read more...
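As a rough illustration of the kind of per-layer measurement involved, here is a minimal sketch (assumed details: a small PyTorch MLP on stand-in data, not the post's actual model or code) that records how far each layer's weights move on every optimizer step; plotting these relative update sizes over time is one way to see which layers slow down first.

import torch
import torch.nn as nn

# A small stand-in MLP; the post's actual networks are not specified here.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(),
                      nn.Linear(256, 256), nn.ReLU(),
                      nn.Linear(256, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for step in range(1000):
    x, y = torch.randn(64, 784), torch.randint(0, 10, (64,))  # stand-in data
    before = {n: p.detach().clone() for n, p in model.named_parameters()}
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()
    # Relative size of this step's update, per parameter tensor.
    rel_update = {n: ((p.detach() - before[n]).norm() / before[n].norm()).item()
                  for n, p in model.named_parameters()}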

Motivating ELBO From Importance Sampling

This is a tl;dr of a longer (and not yet written) post on variational auto-encoders. Most derivations of the evidence lower bound expression are unconvincing, as they just move symbols around without much motivation. Notation: we have a list of \( N \) data points, \( X = (x_i)_{i=1}^N \). For each data point, we imagine there being a latent variable \( z_i \) that explains it. Latent variables live in some space such as \( \mathbb{R}^2 \). Read more...
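The punchline, sketched in one line (the proposal distribution \( q(z \mid x) \) is assumed notation, not taken from the post): importance sampling rewrites the evidence as an expectation under \( q \), and Jensen's inequality then gives the lower bound.

\[
\log p(x) = \log \E_{q(z \mid x)}\!\left[ \frac{p(x, z)}{q(z \mid x)} \right] \ge \E_{q(z \mid x)}\!\left[ \log \frac{p(x, z)}{q(z \mid x)} \right] = \mathrm{ELBO}(x).
\]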

Origin of Lebesgue Integration

This article follows the steps of Henri Lebesgue as he came upon his theory of integration. The story could be started earlier, but we don't lose too much by starting with Borel, Lebesgue's adviser, at the end of the 19th century. Borel and the measure of a set: Émile Borel was thinking about the problem of measure, that is, the problem of describing the size of things. Read more...

Visualizing a Perceptron

A lot of machine learning techniques can be viewed as an attempt to represent high-dimensional data in fewer dimensions without losing any important information. In a sense, it is lossy compression: the data is squeezed into something small and manageable before being passed to the next stage of processing. If our data consists of elements of \( \mathbb{R}^D \), we are trying to find interesting functions of the form: \[ f : \mathbb{R}^D \to \mathbb{R}^d \] where \( D \) is fixed, but \( d \) can be chosen freely. Read more...
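As a concrete instance (an illustrative sketch; the weight vector \( \matr{w} \) and bias \( b \) are generic notation, not necessarily the post's), a single perceptron is one such function with \( d = 1 \):

\[
f(x) = \sigma\!\left( \matr{w}^{\top} x + b \right), \qquad f : \mathbb{R}^D \to \mathbb{R},
\]

where \( \sigma \) is a step or sigmoid nonlinearity; the whole input is collapsed onto a single dimension, which is why it is interesting to visualize what \( \matr{w} \) keeps and what it discards.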

Visualizing Matrix Multiplication

Whenever I come across a matrix multiplication, my first attempt at visualizing it is to view the multiplication as: multiple objects, combined together, many times. Matrices usually carry a list of objects, with each object represented by a row or column of the matrix. Inspecting how matrices behave by looking at these objects can be an effective way to understand what an author is trying to communicate when they use matrices. Read more...
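For example (an illustrative sketch, not notation lifted from the post), a matrix-vector product combines the columns of \( \matr{A} \), one weighted copy per entry of \( x \); a full product \( \matr{A}\matr{B} \) repeats that combination once for each column of \( \matr{B} \):

\[
\matr{A} x =
\begin{bmatrix}
\vertbar & & \vertbar \\
a_1 & \cdots & a_n \\
\vertbar & & \vertbar
\end{bmatrix}
\begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}
= x_1 a_1 + \cdots + x_n a_n .
\]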

Matrix Mnemonics

Reading matrix notation is burdened by trivial things like rows being indexed before columns. An author can be trying to communicate something simple, yet the reader's cognitive load can be high as they unpack the matrix notation. Here, I'm experimenting with ways to make matrix notation more memorable. Indexing: there is no fundamental reason why rows should appear before columns in matrix indexing. It's just a convention to be remembered. Read more...
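For reference, the convention in question (a standard example, not taken from the post): in

\[
\matr{A} =
\begin{bmatrix}
a_{11} & a_{12} & a_{13} \\
a_{21} & a_{22} & a_{23}
\end{bmatrix},
\]

the entry \( a_{ij} \) lives in row \( i \) and column \( j \), and the size is stated the same way: \( \matr{A} \) is a \( 2 \times 3 \) matrix, rows first.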