Chapter 3: Behind the Scenes of Training

In the last chapter, we took an analytical approach to training a neural network, but it’s helpful to take a step back and understand what the math is actually telling us.

What’s Backpropagation Actually Doing?

Backpropagation tends to be seen as a bit of a monster when learning about neural networks, but it turns out to be a surprisingly intuitive way of computing the gradient. If you look at an untrained neural network's prediction for a given input and compare it to the label, you'll find a mismatch.

[Figure: Backpropagation Connections]

Ideally, we want the prediction to match the label. To do this, we need to adjust the output of the neural network to more accurately match the desired label. This means increasing the output of the neurons that should fire and decreasing the ones that shouldn’t. We have three ways of adjusting a neuron’s output.

The Bias

The bias is the most straightforward way of adjusting a neuron since it doesn’t rely on or affect anything besides the neuron it corresponds to. If a neuron is supposed to be more active, we increase its bias. If it’s too active, we decrease it. How much we adjust it depends on how far off the neuron is from the desired activation.

$$\frac{ \partial C }{ \partial b^l } \;=\; \delta^l$$
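As a quick NumPy sketch of what this update looks like in practice (the error values here are made up for illustration):

```python
import numpy as np

# Assume a layer's error term delta^l has already been computed
# (how we get it is covered later in this chapter).
delta_l = np.array([0.5, -0.2, 0.1])

# The gradient of the cost with respect to each bias is delta itself,
# so gradient descent nudges each bias opposite to its neuron's error.
grad_b = delta_l
learning_rate = 0.1
b = np.zeros(3)
b_new = b - learning_rate * grad_b
print(b_new)  # [-0.05  0.02 -0.01]
```

The overactive neuron (positive error) gets its bias lowered, and the underactive one (negative error) gets its bias raised.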

Adjusting the Weights

Adjusting weights is a bit more involved, since their effect depends on the activations coming from the previous layer.

$$z^l_j = \mathbf{w}^l_j \cdot \mathbf{a}^{l-1} + b^l_j$$
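Concretely, a single neuron's weighted input is just a dot product plus a bias. A minimal sketch, with made-up activations and weights:

```python
import numpy as np

# Hypothetical previous-layer activations and one neuron's parameters.
a_prev = np.array([0.9, 0.1, 0.5])   # a^{l-1}
w_j = np.array([0.2, -0.4, 0.6])     # w^l_j
b_j = 0.05                           # b^l_j

# z^l_j = w^l_j . a^{l-1} + b^l_j
z_j = np.dot(w_j, a_prev) + b_j
print(z_j)  # 0.49
```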

We can increase the activation of an underactive neuron by strengthening the weights that excite it and dimming the weights that suppress it. Conversely, for an overactive neuron, we can dim the weights exciting it and strengthen the weights suppressing it to drive its activation down.

To increase the 0.4 activation, we want to decrease the negative weights and increase the positive weight.

[Figure: Backpropagation Weight Adjustment]

To decrease the 0.2 activation, we want to increase the negative weights and decrease the positive weight.

[Figure: Backpropagation Weight Adjustment]

Weights connected to large activations in the previous layer have a greater influence on a neuron’s output than weights connected to smaller activations (assuming the weights are of similar size). So, if a particular input neuron was highly active, and it contributed to an error, its corresponding weight should be adjusted more than one from a less active neuron.

$$\frac{ \partial C }{ \partial w^l_{jk} } \;=\; a^{l-1}_k \, \delta^l_j$$
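Computed for a whole layer at once, this gradient is the outer product of the layer's errors with the previous layer's activations. A small sketch with invented values:

```python
import numpy as np

# Error terms for layer l and activations from layer l-1 (made-up values).
delta_l = np.array([0.5, -0.2])       # delta^l, 2 neurons
a_prev = np.array([0.9, 0.1, 0.0])    # a^{l-1}, 3 neurons

# dC/dw^l_{jk} = a^{l-1}_k * delta^l_j, i.e. an outer product.
grad_w = np.outer(delta_l, a_prev)

# Weights fed by the highly active input (0.9) get the largest adjustment;
# weights fed by the silent input (0.0) don't change at all.
print(grad_w)
```

Notice how the column multiplied by the 0.0 activation is all zeros: a neuron that never fired can't be blamed for the error.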

The process of adjusting each weight proportionally to how much its input neuron “participated” causes connections with highly active neurons to get reinforced (or weakened) more strongly. In effect, neurons that activate together during training tend to strengthen their connection over time. In other words, neurons that fire together, wire together.

This principle, known in neuroscience as Hebbian learning, helps explain how neural networks learn patterns. When certain combinations of features consistently appear together in the training data, the weights connecting those features become stronger, making the network more likely to recognize similar patterns in the future.

Adjusting the Previous Layer’s Activations

So far, we’ve seen how to adjust a neuron’s bias and weights to improve the output. But the error in a neural network isn’t caused by one layer alone. In order to update the weights and biases properly, we need to know how each neuron in the previous layer contributed to the error.

Every neuron in the current layer has an opinion on what the previous layer should do. Neurons whose output needs to be increased want to increase the activations of neurons associated with positive weights and decrease those associated with negative weights.

[Figure: Backpropagation Weight Adjustment]

The size of an adjustment is proportional to the weight, since the weight determines how strongly an activation influences the next layer's neurons. A larger weight means the activation is proportionally more responsible for the next layer's error, so it needs a correspondingly larger adjustment to correct for it.

To reduce a neuron’s output, we do the opposite. We increase the activations coming through negative weights and decrease those coming through positive weights.

[Figure: Backpropagation Weight Adjustment 2]

Since every neuron has a slightly different opinion on how each neuron in the previous layer should change, we add these opinions together to figure out the ideal adjustment.

[Figure: Backpropagation Weight Adjustment 3]

This process of passing the blame backward is what we call backpropagating the error, and it is described by the following equation.

$$\delta^l \;=\; \left( (w^{l+1})^T \, \delta^{l+1} \right) \odot \sigma'(z^l)$$
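The equation translates almost line for line into NumPy. A minimal sketch, assuming a sigmoid activation and with all shapes and values invented for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # Derivative of the sigmoid: sigma'(z) = sigma(z) * (1 - sigma(z)).
    s = sigmoid(z)
    return s * (1.0 - s)

# Hypothetical next-layer quantities: 3 neurons in layer l+1, 2 in layer l.
w_next = np.array([[ 0.2, -0.5],
                   [ 0.7,  0.1],
                   [-0.3,  0.4]])          # w^{l+1}, shape (3, 2)
delta_next = np.array([0.1, -0.2, 0.05])   # delta^{l+1}
z_l = np.array([0.0, 1.0])                 # z^l

# delta^l = ((w^{l+1})^T delta^{l+1}) ⊙ sigma'(z^l)
delta_l = (w_next.T @ delta_next) * sigmoid_prime(z_l)
print(delta_l)
```

The transpose is what sends the error backward: each previous-layer neuron collects blame through the same weights it used to send its activation forward.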

Random Initialization

When backpropagation is updating a neuron’s weights and bias, it doesn’t consider what other neurons in the same layer are doing. If every parameter starts with the same value, every neuron in the layer is going to behave identically. This means they’ll receive the same updates since their gradients are identical, and they’ll end up learning the same features. Effectively, you end up with one neuron per layer.

To break this symmetry, we initialize the weights and biases with random values. These are typically sampled from a standard normal distribution. This ensures that each neuron starts off a little different, allowing them to specialize in different ways as training progresses.
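You can see the symmetry problem directly. In this sketch (layer sizes and inputs are arbitrary), identically initialized neurons all compute the same output, while randomly initialized ones differ from the very first step:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([0.2, 0.7, 0.1])  # an arbitrary input

# Symmetric start: every neuron in the layer has identical weights,
# so their outputs, gradients, and all future updates are identical too.
w_same = np.zeros((4, 3))
print(w_same @ x)  # every neuron outputs the same value

# Broken symmetry: weights sampled from a standard normal distribution,
# so each neuron computes something different from step one.
w_rand = rng.standard_normal((4, 3))
print(w_rand @ x)  # each neuron outputs something different
```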

There’s another reason random initialization matters: if all the weights and biases are set to zero, the network might not learn at all. Zero inputs lead to zero outputs, which can cause zero gradients, depending on the activation function.

What’s Gradient Descent Actually Doing?

When backpropagation processes a single training example, it computes adjustments that would make the network perform perfectly on just that one case. For instance, if we show the network a poorly written “3” that looks somewhat like an “8”, backpropagation might suggest changes that help distinguish that particular “3” from an “8”. But these changes might actually hurt the network’s ability to recognize clearer examples of “3”s or “8”s.

Computing the gradient for every training example and averaging over all of them coerces the neural network to learn more general patterns in the data in order to make better predictions. This averaging is one way of protecting against overfitting, the tendency for networks to memorize training examples rather than learn generalizable patterns.

However, we use a variation of gradient descent known as stochastic gradient descent (SGD), which introduces controlled randomness to keep computational cost low. It turns out that this randomness actually helps the network find better solutions: it helps the network escape saddle points more efficiently and generalize better. Because the gradient is constantly changing due to the randomness of the mini-batches, the network is discouraged from overfitting and becomes more robust to noise.
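Mini-batch SGD can be sketched on a toy one-parameter problem. Everything here, the data, the learning rate, and the batch size, is illustrative; the point is that each step averages gradients over a small random batch rather than the whole dataset:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy problem: fit w so that y ≈ w * x, with quadratic cost per example.
x = rng.standard_normal(1000)
y = 3.0 * x + 0.1 * rng.standard_normal(1000)  # true w = 3, plus noise

w = 0.0
learning_rate = 0.1
batch_size = 32

for epoch in range(20):
    # Shuffle the data and split it into random mini-batches.
    idx = rng.permutation(len(x))
    for start in range(0, len(x), batch_size):
        batch = idx[start:start + batch_size]
        # Average the per-example gradients of C = (w*x - y)^2 / 2
        # over just this mini-batch, then take one descent step.
        grad = np.mean((w * x[batch] - y[batch]) * x[batch])
        w -= learning_rate * grad

print(w)  # close to the true value of 3
```

Each mini-batch gives a noisy estimate of the full gradient, but on average the steps still point downhill.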

What is the Neural Network Learning?

One way to think of neural networks is as universal function approximators. This means that given a large enough neural network (enough layers and enough neurons per layer), we can theoretically approximate any function to whatever accuracy we want.

This is easiest to see with our point classifier. In the widgets for the point classifier, there was an underlying function that determined the color of each point.

$$ output = \begin{cases} \text{Blue if} & 4(x-5)^2 - 4(x-5)(y-5) + y^2 \lt 50 \\ \text{Red if} & 4(x-5)^2 - 4(x-5)(y-5) + y^2 \ge 50 \end{cases} $$
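This rule is simple enough to write down directly (the function name here is just for illustration):

```python
def true_color(x, y):
    """The underlying rule the point classifier is approximating."""
    value = 4 * (x - 5) ** 2 - 4 * (x - 5) * (y - 5) + y ** 2
    return "Blue" if value < 50 else "Red"

print(true_color(5, 5))   # value = 25, so "Blue"
print(true_color(10, 0))  # value = 200, so "Red"
```

The network never sees this formula; it only sees labeled points, and training nudges its weights until its own function traces out roughly the same boundary.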

The point classifier we had did pretty well at approximating this function. It took a point's coordinates on a Cartesian plane, $x$ and $y$, and predicted whether the point was red or blue.

The digit classifier you’ll develop in the next lab does something similar. It learns a highly complex function that takes an image (e.g., a 784-dimensional vector of pixel values) and outputs a prediction of which digit (0 through 9) the image represents. That answer is understandably unsatisfying. What does each layer do, and why does any neuron fire?

Thankfully, the MNIST dataset we're using has been studied extensively, and researchers have found a couple of things. The initial layers of a neural network tend to learn basic features from the pixel data, such as edges, corners, or simple curves that appear at various places in the image. In subsequent layers, the neural network combines these shapes to form more complex patterns, such as loops (common in '0', '6', '8', and '9'). Finally, the network once again combines these shapes to, ideally, classify the digit. Whether our network actually does this is something we'll explore much later.

This simple question goes extremely deep and turns out to be incredibly tough to answer. Neural networks are typically treated as black boxes: we don't know why they work, but they do. That isn't a useful understanding of neural networks, and there's an entire field called mechanistic interpretability that tries to answer the question properly. If you'd like to explore this more, I recommend reading this article by Neel Nanda of Google DeepMind.

Looking Forward

In section one, we’ve learned how to design and train a neural network. In the process of creating our neural network, many arbitrary decisions were made. In the next section, we’re going to take a look at each piece of the neural network, study it, and see how to improve it.

Neural Networks From Scratch

Prioritize understanding over memorization. Good luck!