Inside Neural Network Training
Below are some videos that show how weights, activations and gradients change as a network is trained.
The videos were made to test the idea that layers closer to the input stabilize earlier than layers closer to the output. The videos below suggest this hypothesis is wrong: quite often the updates to the last layer are the first to slow down, and the updates to the first layer are the last to do so. Some of these dynamics are hard to see in the videos; the details are much clearer in the source images (download link, 45 MB).
The code used to generate the videos can be found at https://github.com/kevindoran/dip/.
4 layer model
A convolutional neural network was given a noisy image as input and trained with a mean squared error loss to match an image of a black vertical stripe on a white background. The Adam optimizer was used.
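As a rough sketch of this setup (not the actual code from the repository linked above; the image size, channel counts, and learning rate are assumptions), the training loop might look like:

```python
import torch
import torch.nn as nn

# Sketch of the setup described above. The image size, channel counts and
# learning rate are illustrative assumptions; only the overall recipe (a fixed
# noisy input, MSE loss against a stripe target, the Adam optimizer) comes
# from the text.
H = W = 64
target = torch.ones(1, 1, H, W)                 # white background
target[:, :, :, W // 2 - 2 : W // 2 + 2] = 0.0  # black vertical stripe
noisy_input = torch.rand(1, 1, H, W)            # fixed noisy input image

model = nn.Sequential(                          # assumed 4-layer conv net
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 1, 3, padding=1),
)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(2000):
    optimizer.zero_grad()
    loss = loss_fn(model(noisy_input), target)
    loss.backward()
    optimizer.step()
```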
The neural network structure is:
The three separate videos below record, for each training step:
- the network weights and activations
- the gradients of the loss with respect to all weights and activations
- the size of the weight update applied
A grayscale colormap is used for activations, and the purple-to-green PiYG colormap is used for weights: weights are colored more purple as they get more negative and more green as they get more positive, with zero being white.
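As a small illustration of this coloring (the weight and activation values below are random stand-ins; only the colormap choices come from the text), a frame could be rendered roughly like this:

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize

# Random stand-ins for one layer's weights and activations.
weights = np.random.randn(8, 8)
activations = np.random.rand(8, 8)

# A symmetric value range centers PiYG on zero, which renders as white.
limit = np.abs(weights).max()
fig, (ax_w, ax_a) = plt.subplots(1, 2)
ax_w.imshow(weights, cmap="PiYG", norm=Normalize(vmin=-limit, vmax=limit))
ax_w.set_title("weights (PiYG)")
ax_a.imshow(activations, cmap="gray")
ax_a.set_title("activations (grayscale)")
fig.savefig("frame.png")
```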
1. Weights and activations
2. Gradients
3. Update magnitude
The next video displays the magnitude of the update vector applied to the network weights at each step of the gradient descent procedure.
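The post doesn't say exactly how this magnitude was computed; one way, continuing the training-loop sketch above (so `model` and `optimizer` are assumed to exist), is to snapshot the parameters around `optimizer.step()` and take the norm of the difference:

```python
import torch

# Continuing the earlier training-loop sketch: measure the L2 norm of the
# change applied to all weights by one optimizer step.
before = [p.detach().clone() for p in model.parameters()]
optimizer.step()
update_norm = torch.sqrt(sum(((p.detach() - b) ** 2).sum()
                             for p, b in zip(model.parameters(), before)))
print(update_norm.item())
```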
4 layer model (#2)
The model was modified by adding 2 channels (from 8 to 10) to the inner layers and changing the last layer's filter from 3x3 to 1x1. The structure is:
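A sketch of this modification, under the same assumptions as the earlier 4-layer sketch (only the 8-to-10 channel change and the 1x1 final filter come from the text):

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 10, 3, padding=1), nn.ReLU(),   # inner channels widened to 10
    nn.Conv2d(10, 10, 3, padding=1), nn.ReLU(),
    nn.Conv2d(10, 10, 3, padding=1), nn.ReLU(),
    nn.Conv2d(10, 1, 1),                         # last layer's filter is now 1x1
)
```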
1. Weights and activations
2. Gradients
3. Update magnitude
6 layer model
Extending the previous model to have 6 layers produces the results below. The structure is:
You can see that this network has excess capacity: the output image appears practically complete in the second-to-last layer's activations. With weight regularization in place, the network learns to simply ignore the remaining channel activations fed into the final layer.
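The form of the weight regularization isn't stated; a common choice, shown here purely as an assumption (and reusing the `model` from the earlier sketches), is L2 regularization via Adam's weight_decay argument:

```python
import torch

# Assumed form of the weight regularization mentioned above: L2 weight decay
# via Adam's weight_decay argument, which penalizes large weights and lets
# weights on ignored channels shrink toward zero.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```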