Chapter 1: The Structure of Neural Networks

The Difficulty of Image Recognition

When first learning to code, you’re taught to break down programs into smaller, simpler parts that are more manageable. Then, after coding up each tiny piece, you can put them together in order to get a program that does what you originally intended.

However, there are some problems where this isn’t possible. Image recognition is a perfect example. It seems simple because our brains do it automatically, but we take for granted how easily we recognize the following numbers as 9 and 6:

[Figure: handwritten digits 9 and 6 from the MNIST dataset]

Actually writing a program that reads handwritten digits would be monstrously difficult. Why? Because handwriting is messy. Yet people can still effortlessly identify these digits while struggling to describe exactly how they do it. Without being able to clearly define the process, we can’t write a traditional step-by-step program to solve it.

Because of this, researchers turned to another source for inspiration: the human brain. While we can’t actually emulate the human brain yet, we can use it as a loose analogy to create something similar. In practice, this gives us extremely promising results.

Artificial Neurons

The brain is composed of biological neurons that are incredibly complex. In order to mimic their behavior, we can create simple artificial neurons to model them. We’ll refer to artificial neurons as neurons throughout the book.

Neurons take in multiple inputs that help them decide whether or not they should fire. Each input gets multiplied by a weight that signifies its influence. The weighted inputs are summed into a single value, and—if that sum exceeds the threshold—the neuron fires and outputs a one. Otherwise, it outputs a zero. We can describe a neuron algebraically as follows:

$$\text{output} = \begin{cases} 0 & \text{if } \sum\limits_{j} w_{j}a_{j} \leq \text{threshold} \\ 1 & \text{if } \sum\limits_{j} w_{j}a_{j} > \text{threshold} \end{cases}$$

The variable $a_j$ is the $j$-th input to the neuron, and $w_j$ is the corresponding weight. The summation, $\sum\limits_j w_j a_j$, is typically written as the dot product between the weights and inputs, $\overrightarrow{w} \cdot \overrightarrow{a}$.

$$\text{output} = \begin{cases} 0 & \text{if } \overrightarrow{w} \cdot \overrightarrow{a} \leq \text{threshold} \\ 1 & \text{if } \overrightarrow{w} \cdot \overrightarrow{a} > \text{threshold} \end{cases}$$
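To make this concrete, here’s a minimal sketch of such a neuron in Python. The inputs, weights, and threshold below are made-up values chosen purely for illustration:

```python
def neuron_output(inputs, weights, threshold):
    # Fire (output 1) only if the weighted sum of the inputs exceeds the threshold.
    weighted_sum = sum(w * a for w, a in zip(weights, inputs))
    return 1 if weighted_sum > threshold else 0

# Made-up example: two inputs with weights 0.8 and 0.4 and a threshold of 1.
print(neuron_output([1, 0], [0.8, 0.4], threshold=1))  # 0, since 0.8 <= 1
print(neuron_output([1, 1], [0.8, 0.4], threshold=1))  # 1, since 1.2 > 1
```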

The artificial neuron we’ve described is known as the Perceptron. It’s outdated, but it provides us a good starting point for understanding neurons and neural networks.

Understanding Neurons

A good heuristic for understanding neurons is that they give an answer (output)—yes (one) or no (zero)—based on the answers to previous questions (inputs). Some answers are important (higher weights), and some are useless (near-zero weights). Some answers make a yes more likely (positive weights), while others make a no more likely (negative weights).

A helpful, although unrealistic, example is deciding whether or not someone—let’s call her Jane—will hang out with her friends. The neuron below represents how Jane might make that decision.

[Figure: Jane’s decision neuron, with three weighted inputs (-10, 6, and 3) and a threshold of -2]

The -2 being blue may throw you off a bit, but remember, lower thresholds mean neurons are MORE likely to fire.

We can see that this neuron wants to fire even when the answer to all of these is no, because it has a threshold of -2. Since this neuron is supposed to represent Jane’s decision-making, we see that she has an inclination to go out with her friends.

$$(0 \cdot \textcolor{#FF4040}{-10}) + (0 \cdot \textcolor{#5EA3FF}{6}) + (0 \cdot \textcolor{#5EA3FF}{3}) = 0$$
$$\text{Output} = 1 \text{ because } 0 > \textcolor{#5EA3FF}{-2}$$

However, if Jane’s sick, she’s not willing to go out. At least not without a lot of convincing (hence the large negative weight).

$$(1 \cdot \textcolor{#FF4040}{-10}) + (0 \cdot \textcolor{#5EA3FF}{6}) + (0 \cdot \textcolor{#5EA3FF}{3}) = -10$$
$$\text{Output} = 0 \text{ because } -10 \leq \textcolor{#5EA3FF}{-2}$$

If Jane BOTH has the money AND enjoys the activity, she could be convinced to go regardless of being sick. If she doesn’t have the money or isn’t too fond of the activity, she won’t go.

$$(1 \cdot \textcolor{#FF4040}{-10}) + (1 \cdot \textcolor{#5EA3FF}{6}) + (1 \cdot \textcolor{#5EA3FF}{3}) = -1$$
$$\text{Output} = 1 \text{ because } -1 > \textcolor{#5EA3FF}{-2}$$

Keep in mind that the neuron we’ve described can have different weights and thresholds that would change its decision-making. If Jane weighed “Am I sick?” a bit more heavily, then, when she’s sick, nothing could convince her to go out with her friends.

Breaking Down Artificial Neurons

Almost all artificial neurons are made up of two parts: the weighted sum and the activation function.

We introduced thresholds earlier as a decision boundary for our neuron, but in modern neural networks, we typically use the bias term, bb, instead. This allows us to rewrite our artificial neuron as follows:

$$\text{Output} = \begin{cases} 0 & \text{if } \overrightarrow{w} \cdot \overrightarrow{a} + b \leq 0 \\ 1 & \text{if } \overrightarrow{w} \cdot \overrightarrow{a} + b > 0 \end{cases}$$

Note that $b = -\text{threshold}$. The main reason for using the bias over the threshold is that it works out much nicer algebraically.

The bias can be thought of as a neuron’s baseline excitability. A positive bias means the neuron is eager to fire even with minimal input, while a negative bias means it’s more reluctant to activate. If we revisit Jane’s example, her threshold of -2 would translate to a bias of +2 if we wanted to keep the same decision-making behavior.

The expression $\overrightarrow{w} \cdot \overrightarrow{a} + b$ is often referred to as the weighted input and is represented with the variable $z$. The weighted input is passed along to what’s known as an activation function. In our Perceptron example, we’ve been using what’s called the Heaviside step function:

$$f(z) = \begin{cases} 0 & \text{if } z \leq 0 \\ 1 & \text{if } z > 0 \end{cases}$$
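As a rough sketch, here’s the same Perceptron written in Python using a bias and the Heaviside step function, with Jane’s weights from earlier (her threshold of -2 becomes a bias of +2):

```python
def step(z):
    # Heaviside step function: 0 if z <= 0, 1 if z > 0.
    return 1 if z > 0 else 0

def perceptron(inputs, weights, bias):
    # Weighted input z = w . a + b, passed through the step activation.
    z = sum(w * a for w, a in zip(weights, inputs)) + bias
    return step(z)

# Jane's neuron: "Am I sick?", "Do I have the money?", "Do I enjoy the activity?"
weights = [-10, 6, 3]
bias = 2  # equivalent to a threshold of -2

print(perceptron([0, 0, 0], weights, bias))  # 1: she goes out by default
print(perceptron([1, 0, 0], weights, bias))  # 0: being sick keeps her home
print(perceptron([1, 1, 1], weights, bias))  # 1: money and a fun activity win out
```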

We’ll discuss activation functions much more in-depth in later chapters.

So far, we’ve seen that neurons can be used to make sophisticated decisions, but alone, they’re not that exciting. They still aren’t able to properly classify digits, but it shouldn’t seem too far-fetched that once we start connecting many of them together to form a neural network—or “brain”—we start getting some very promising results.

The Architecture of Neural Networks

Neural networks are typically composed of multiple layers where every neuron in a layer connects to all neurons in the layer before it and after it. The first layer of a neural network is known as the input layer, and it contains input neurons. These aren’t really neurons, because they don’t have any weights, biases, etc. They just pass along the values of the inputs we want to give the network. It’s just convention to draw them as neurons and refer to them as such.

The last layer is known as the output layer, and it contains output neurons that store the information we want to get from the network. The layers in between are called hidden layers. While the name sounds cool, it doesn’t mean anything other than that they’re neither input nor output layers.

[Figure: a feedforward neural network with an input layer, hidden layers, and an output layer]

This structure is known as a “Multilayer Perceptron,” or MLP; however, that name is used even for networks that don’t use Perceptrons. Because of that, it won’t be used in this book. Instead, these networks will be referred to as feedforward neural networks.

Feedforward neural networks get their name from the fact that each layer’s outputs feed forward into the next layer. This isn’t the only architecture that neural networks can take on, but it’s the only one we’ll focus on in this book.

Designing a Neural Network to Classify Points

In the upcoming lab, you’ll start designing a feedforward neural network to classify handwritten digits. Before tackling that challenge, it’s good practice to start with a simpler problem. Below is a graph with blue and red points. Our goal is to create a neural network that can accurately predict the color of the points on the graph.

To design this network, we first need to determine what inputs it should receive. Since we’re classifying points on a 2D graph, we need two input neurons to represent the x and y coordinates of each point.

The network then needs to process this information through one or more hidden layers. For our problem, one hidden layer with three neurons turns out to be sufficient. This particular design choice is somewhat arbitrary. Determining the optimal number and size of hidden layers is more art than science, and we’ll explore the topic in greater depth later.

Finally, we need output neurons to tell us the classification result. Since we have two possible outputs (red or blue), we’ll use two output neurons. You might wonder why we don’t just use a single output neuron, with 0 representing red and 1 representing blue. There are two good reasons:

  1. Using separate output neurons generalizes better to problems with more than two categories. For digit recognition, for example, we’ll use 10 output neurons (one for each digit) rather than encoding the digits in binary.
  2. More importantly, giving each category its own neuron produces better results in practice. Each output neuron can focus exclusively on identifying the features specific to its assigned category.

The resulting neural network looks like this:

[Figure: the resulting 2-3-2 neural network for classifying points]

Implementing a Neural Network

Before we can meaningfully see how to implement a neural network, we need to discuss the notation behind them. We’ll use $w^l_{jk}$ to represent the weight of the connection from the $k^{th}$ neuron in the $(l-1)^{th}$ layer to the $j^{th}$ neuron in the $l^{th}$ layer. The weight highlighted below would be written as $w^3_{21}$.

[Figure: the highlighted weight $w^3_{21}$, connecting the first neuron in layer 2 to the second neuron in layer 3]

The ordering of the subscripts might seem strange, but this is so that the notation matches our other variables. We use $a^l_j$ to represent the activation of the $j^{th}$ neuron in the $l^{th}$ layer. The same notation is used for the biases, $b^l_j$, and weighted sums, $z^l_j$. Lastly, $f$ is our activation function. Using this notation allows us to relate the activation of a single neuron in layer $l$ to the neurons in the previous layer.

$$z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j$$
$$a^l_j = f\left( z^l_j \right)$$

We can significantly clean this expression up by rewriting it in a way that focuses on the layers of the neural network. We can create a weight matrix $w^l$ whose entries are given by $w^l_{jk}$, where $j$ is the row index and $k$ is the column index. Similarly, the biases, weighted sums, and activations can be written as the vectors $b^l$, $z^l$, and $a^l$, respectively.

To apply the activation function, we also need to discuss vectorization. Vectorizing a function means applying it element-wise to the components of a vector (NumPy does this automatically for most array operations). Suppose we had the function $f(x) = x^2$. Then the vectorized $f$ has the effect of squaring each component of a vector passed to it.

$$f \left( \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \right) = \begin{bmatrix} f(1) \\ f(2) \\ f(3) \end{bmatrix} = \begin{bmatrix} 1 \\ 4 \\ 9 \end{bmatrix}$$
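In NumPy, for instance, this element-wise behavior comes for free as long as we write $f$ using array operations (a tiny sketch):

```python
import numpy as np

def f(x):
    return x ** 2  # squares each component when x is a NumPy array

print(f(np.array([1, 2, 3])))  # [1 4 9]
```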

This allows us to rewrite our original expressions using vectors and matrices, so that we can represent what’s going on layer by layer. This makes the equations much easier to understand while also giving a speed boost since many libraries optimize matrix computations.

$$z^l = w^l a^{l-1} + b^l$$
$$a^l = f\left( z^l \right)$$

Because the activations at each layer, $a^l$, depend on the activations of the previous layer, $a^{l-1}$, this forms a recursive structure. This means that each layer applies a transformation to the output of the previous layer. Viewing a neural network as a deeply nested composition of functions is a useful perspective that will come in handy.

We can generate a prediction by feeding an input into the neural network—denoted as $a^1$—and then, using the recursive definition, computing the activations of the following layers until we reach the output, $a^L$, where $L$ is the total number of layers.
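Putting these pieces together, here’s a minimal sketch of a feedforward pass for our 2-3-2 point classifier. The random initialization and the step activation are placeholders for illustration; we haven’t discussed how to choose good values yet:

```python
import numpy as np

def step(z):
    # Heaviside step function, applied element-wise.
    return (z > 0).astype(float)

# Layer sizes: 2 input neurons, 3 hidden neurons, 2 output neurons.
sizes = [2, 3, 2]

# w^l has shape (neurons in layer l, neurons in layer l-1); b^l is a column vector.
weights = [np.random.randn(j, k) for k, j in zip(sizes[:-1], sizes[1:])]
biases = [np.random.randn(j, 1) for j in sizes[1:]]

def feedforward(a, weights, biases, f=step):
    # Repeatedly apply a^l = f(w^l a^(l-1) + b^l) until we reach the output layer.
    for w, b in zip(weights, biases):
        a = f(w @ a + b)
    return a

point = np.array([[0.5], [-1.2]])           # a^1: the (x, y) coordinates of a point
print(feedforward(point, weights, biases))  # a^L: one activation per output neuron
```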

Using Our Neural Network

We have no way of knowing whether this network can actually classify these points accurately without testing it. We’re going to do that in the widget below. In it, you can adjust the weights and biases manually to get the network to correctly classify the points.

Don’t worry about being methodical. You don’t have to get all of the points classified correctly, but playing around with the widget for a bit should give you a better feel for how the network functions. The top slider for each neuron controls its bias, and the rest control its weights.

[Interactive widget: sliders for the bias and weights of each neuron (Neuron 1,1 through Neuron 2,2), with a running score out of 121 points]

While adjusting the sliders for the weights and biases, you likely ran into issues such as the decision boundary abruptly shifting at times. Ideally, what we’d want is a system where small changes in the weights and biases would cause small changes in the output of the neural network.

$$\text{small } \Delta \text{weights, small } \Delta \text{biases} \implies \text{small } \Delta \text{output}$$

Michael Nielsen, in his book “Neural Networks and Deep Learning,” puts it perfectly:

If it were true that a small change in a weight (or bias) causes only a small change in output, then we could use this fact to modify the weights and biases to get our network to behave more in the manner we want. For example, suppose the network was mistakenly classifying an image as an “8” when it should be a “9”. We could figure out how to make a small change in the weights and biases so the network gets a little closer to classifying the image as a “9”. And then we’d repeat this, changing the weights and biases over and over to produce better and better output. The network would be learning.

However, this isn’t the case with the Perceptron. Perceptrons can only ever output 0 (off) or 1 (on). That means that adjusting their weights or biases can cause their output to flip, and that flip can cascade, flipping other Perceptrons’ outputs in erratic ways. This problem is already apparent in our tiny neural network, and it only gets worse as networks get larger.

Sigmoid Neurons

To solve this problem, we’re going to introduce a different type of neuron called the Sigmoid Neuron. Just like the Perceptron, it calculates a weighted sum and passes it to an activation function. Instead of returning 0 or 1, the Sigmoid Neuron returns a value between 0 and 1. Its activation function is called the sigmoid, and it can be written as follows:

$$\sigma(z) = \frac{1}{1+e^{-z}}$$
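Here’s a quick sketch of the sigmoid in NumPy; like any vectorized function, it applies element-wise to a whole layer’s weighted inputs:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # approximately [0.0000454, 0.5, 0.9999546]
```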

If you’re wondering why this function was chosen, the reason is somewhat arbitrary. It’s mostly because the sigmoid is reminiscent of a smoothed-out step function. This makes it behave similarly to the Perceptron. Although it can’t take on the values 0 or 1, it can get extremely close given a large enough negative or positive value, respectively. In theory, you could create similar neural networks with either Perceptrons or Sigmoid Neurons. Sigmoid Neurons just happen to have much nicer properties to work with.

So how can we interpret the output of a Sigmoid Neuron? This is context-dependent and can change based on what you want the neural network to do. Oftentimes, it can be seen as a confidence score. If the output neurons corresponding to the red and blue labels output 0.97 and 0.3, respectively, then the neural network believes much more strongly that the point is red rather than blue, so we would classify the point as red.
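In code, classifying a point usually comes down to picking the label whose output neuron has the highest activation (a small sketch using the activations from the example above):

```python
import numpy as np

labels = ["red", "blue"]
output = np.array([0.97, 0.3])  # activations of the red and blue output neurons

print(labels[np.argmax(output)])  # "red"
```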

If you’re hesitant about the fact that the sigmoid can never be 0 or 1, keep in mind that real-world data is noisy, so absolute certainty is unrealistic. If you’re still hesitant, my response to you is: can you ever be truly certain of anything? (‘I think, therefore I am’ is not a valid answer.)

We will talk much more about activation functions in a later chapter. For now, just understand that the sigmoid isn’t inherently important, but its smoothness is. So now that we have this new activation function, let’s try the same tuning exercise we did before with it and see if it’s gotten any easier.

[Interactive widget: the same sliders as before, now controlling a network of Sigmoid Neurons, with a running score out of 121 points]

Tuning this neural network was probably much more intuitive than with Perceptrons. You still might have experienced some issues with large shifts. Those are just a consequence of using toggles. With more fine-grained controls, you would see that the shifts are smooth.

You might have observed some interesting phenomena, such as holes appearing out of nowhere. This isn’t a flaw. Neural networks are meant to be flexible, so that they can model complex behavior. Although this flexibility does cause issues at times, it’s generally beneficial, and there are ways of managing it.

As networks grow, it becomes exponentially harder to tune them by hand. Modern networks like ChatGPT can have trillions of parameters. This highlights a crucial question: how can we systematically find the optimal weights and biases without guessing?

Looking Forward

In the upcoming lab, you’re going to get a chance to design a neural network that can classify digits. It won’t perform much better than random guessing because its parameters won’t be tuned, and with thousands of weights and biases, tuning them by hand isn’t feasible. We’ll explore how to systematically update the weights and biases of a neural network by addressing what it means for a neural network to ‘learn.’


Prioritize understanding over memorization. Good luck!