Understanding Backpropagation

The fast and efficient computation of gradients is a defining feature of all modern machine learning frameworks such as PyTorch. The algorithm powering all of these frameworks and the modern deep learning revolution is known as backpropagation, and it's surprisingly easy to implement, so that's what we're going to be doing in this course.

"Backpropagation is also far more general than neural networks. Its application-independent name is reverse-mode differentiation, and it has been independently reinvented in fields ranging from weather forecasting to the analysis of numerical stability. We'll focus on the deep learning context, but the technique you're about to learn is a broadly useful tool for computing derivatives quickly." -Chris Olah

If the explanations in this chapter don't click the first time you read them, that's fine! It's recommended that you glean what you can from this chapter, and then come back to it again after reading the next chapter. Don't spend too much time trying to understand this on the first pass!

What Problem's Being Solved?

It would be easy to hand-wave away the purpose of this algorithm by saying that it lets us evaluate the gradient of massive functions with billions of parameters quickly and efficiently. But to truly appreciate backpropagation, it helps to look at where other approaches fail.

Numerical Differentiation via Finite Differences

A simple numerical approach we could try would be to use a modified version of the limit definition to approximate the gradient.

$$\frac{\partial f}{\partial x_i} \approx \frac{f(x + h e_i) - f(x)}{h}$$

Here, $x$ is the vector of all parameters, $x_i$ is an individual parameter, and $e_i$ is the $i$-th standard basis vector. With this method, the step size $h$ has to be chosen carefully. It needs to be small enough that the quotient approximates the derivative well, accurately enough for gradient-based optimization to settle at a minimum, but not so small that floating-point rounding error in the subtraction $f(x + h e_i) - f(x)$ dominates the result.
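The trade-off in choosing $h$ can be seen on a function whose derivative is known exactly. The following is a minimal sketch, assuming $f = \sin$ evaluated at $x = 1$ (a test case chosen here for illustration, not taken from the text), that compares a large, a moderate, and a very small step size.

```python
import math

def forward_difference(f, x, h):
    """Approximate f'(x) with the one-sided finite difference from the text."""
    return (f(x + h) - f(x)) / h

x0 = 1.0
exact = math.cos(x0)  # the true derivative of sin at x0

# Absolute error for a large, a moderate, and a tiny step size.
errors = {
    h: abs(forward_difference(math.sin, x0, h) - exact)
    for h in (1e-1, 1e-8, 1e-15)
}

for h, err in errors.items():
    print(f"h = {h:.0e}  error = {err:.2e}")
```

The moderate step gives the smallest error of the three. The large step suffers from truncation error, and the tiny step suffers from rounding error, because `x0 + h` and `x0` are nearly indistinguishable in double precision.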

While this approach is straightforward, it has a critical flaw: computational cost. Each partial derivative requires its own evaluation of $f(x + h e_i)$, so a function with $n$ parameters needs $n + 1$ evaluations of $f$ (one per parameter, plus the shared $f(x)$) to produce a single gradient. The number of evaluations grows linearly with the number of parameters, and each evaluation of $f$ can already be expensive because of the function's size.

Modern neural networks can be thought of as functions with millions, billions, or even trillions of parameters. At that scale, paying for one extra evaluation of the whole network per parameter, every time a gradient is needed, is far too expensive to be practical.
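The cost argument can be made concrete by counting function evaluations. This is a small sketch, not a recommended implementation: `sum_of_squares` is a hypothetical stand-in for an expensive model, and the `counter` dictionary exists only to track how often it is called.

```python
def finite_difference_gradient(f, x, h=1e-6):
    """Approximate the gradient of f at x, one parameter at a time."""
    base = f(x)                      # one evaluation shared by every partial
    grad = []
    for i in range(len(x)):
        shifted = list(x)
        shifted[i] += h              # x + h * e_i
        grad.append((f(shifted) - base) / h)
    return grad

counter = {"calls": 0}

def sum_of_squares(x):
    """Hypothetical stand-in for an expensive model; counts its own calls."""
    counter["calls"] += 1
    return sum(v * v for v in x)

params = [0.5, -1.0, 2.0, 3.0]
grad = finite_difference_gradient(sum_of_squares, params)
print(counter["calls"], grad)
```

With four parameters the function is evaluated five times. With a billion parameters it would be evaluated a billion and one times for each gradient.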

Symbolic Differentiation

Another technique we could try is known as symbolic differentiation. This is an automated version of manual differentiation, and it's done by using hard-coded mathematical rules to derive analytical expressions for the derivative.

Although this approach avoids the approximation error of finite differences and yields an exact expression for the gradient, it suffers from a problem known as expression swell. For deeply nested compositions of functions, such as neural networks, the derivative expression can grow far larger than the original function, exponentially so in the worst case. The growth comes from the way derivative rules, particularly the chain rule and the product rule, duplicate subexpressions each time they are applied.

$$\frac{d}{dx} f(g(x)) = f'(g(x)) \, g'(x)$$
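To see where the duplication comes from, it can help to apply the same rule to one more level of nesting. Here $u$ is a third function, introduced only for this illustration:

$$\frac{d}{dx} f(g(u(x))) = f'(g(u(x))) \, g'(u(x)) \, u'(x)$$

The inner expression $u(x)$ now appears in two separate factors, and $g(u(x))$ appears in full inside the first. Written out symbolically with no sharing between factors, every extra layer copies the layers beneath it again.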

Storing the massive expressions that result from symbolic differentiation takes up an enormous amount of memory, and evaluating them is extremely computationally expensive. It can even be slower than numerical differentiation for large expressions.
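Expression swell can be observed with even a toy differentiator. The sketch below is an illustration written for this purpose, not how a real computer algebra system works: expressions are nested tuples, the names `diff`, `size`, and `evaluate` are hypothetical, and no simplification or sharing of subexpressions is performed.

```python
def diff(e):
    """Differentiate an expression tree with respect to x, with no simplification."""
    kind = e[0]
    if kind == "x":
        return ("const", 1.0)
    if kind == "const":
        return ("const", 0.0)
    if kind == "add":
        return ("add", diff(e[1]), diff(e[2]))
    # Product rule: (a * b)' = a' * b + a * b'
    return ("add", ("mul", diff(e[1]), e[2]), ("mul", e[1], diff(e[2])))

def size(e):
    """Count the nodes in an expression tree."""
    return 1 if e[0] in ("x", "const") else 1 + size(e[1]) + size(e[2])

def evaluate(e, x):
    """Evaluate an expression tree at a given value of x."""
    kind = e[0]
    if kind == "x":
        return x
    if kind == "const":
        return e[1]
    a, b = evaluate(e[1], x), evaluate(e[2], x)
    return a + b if kind == "add" else a * b

# Build x * x * ... * x with 8 factors as a left-nested chain of products.
expr = ("x",)
for _ in range(7):
    expr = ("mul", expr, ("x",))

print(size(expr), size(diff(expr)))
```

The original expression has 15 nodes, while its unsimplified derivative has 85. In this toy case the growth is only quadratic in the number of factors, but it already shows the derivative outpacing the function it came from, because the product rule copies the whole left operand at every level.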

Backpropagation

The key insight into backpropagation comes from recognizing two things. First, complex functions are composed of many simpler functions. This structure forms a natural sequence of dependencies that can be exploited using the chain rule.

This is much easier to see with a concrete example. Take $x^2 + 3xy + 1$. Every operation in this expression (squaring, multiplying, adding) takes some inputs and produces an output. If we treat each operation and each intermediate value as a node, and draw arrows to show which nodes feed into which, we get what's called a computational graph. It's a clean way to lay out exactly how a complex expression decomposes into simpler steps.

You can see one for $x^2 + 3xy + 1$ in the widget below—each node is either an input, an intermediate result, or a final output, and the arrows trace the flow of data through the computation.
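The same graph can also be written down as plain data. This is one possible encoding, offered as a sketch: the intermediate node names `a`, `b`, `c`, `d` and the operation names are labels invented here, not part of the original expression.

```python
# Each entry is (node name, operation, names of the nodes feeding into it).
# The intermediate names a, b, c, d are illustrative labels.
graph = [
    ("a", "square", ["x"]),        # a = x^2
    ("b", "mul",    ["x", "y"]),   # b = x * y
    ("c", "scale3", ["b"]),        # c = 3 * b
    ("d", "add",    ["a", "c"]),   # d = a + c
    ("f", "add1",   ["d"]),        # f = d + 1
]

ops = {
    "square": lambda u: u * u,
    "mul":    lambda u, v: u * v,
    "scale3": lambda u: 3 * u,
    "add":    lambda u, v: u + v,
    "add1":   lambda u: u + 1,
}

def run_forward(graph, inputs):
    """Evaluate nodes in listed order, keeping every intermediate value."""
    values = dict(inputs)
    for name, op, parents in graph:
        values[name] = ops[op](*(values[p] for p in parents))
    return values

values = run_forward(graph, {"x": 2.0, "y": 5.0})
print(values)
```

Listing the nodes so that every node comes after its inputs means a single loop is enough to evaluate the whole expression, and the resulting dictionary holds every intermediate value, not just the final output.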

The second key insight is that this computational graph gives us a clear, structured way to apply the chain rule while avoiding expression swell. Instead of deriving one massive symbolic expression to compute the gradient, we can work backward through the graph one node at a time and reuse intermediate values that were already computed during a first pass.
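For the running example, label the intermediate values $a = x^2$, $b = xy$, and $c = 3b$, so that $f = a + c + 1$ (these labels are introduced here for illustration). Applying the chain rule along the paths of the graph gives:

$$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial a}\frac{\partial a}{\partial x} + \frac{\partial f}{\partial c}\frac{\partial c}{\partial b}\frac{\partial b}{\partial x} = 1 \cdot 2x + 1 \cdot 3 \cdot y = 2x + 3y$$

$$\frac{\partial f}{\partial y} = \frac{\partial f}{\partial c}\frac{\partial c}{\partial b}\frac{\partial b}{\partial y} = 1 \cdot 3 \cdot x = 3x$$

The product $\frac{\partial f}{\partial c}\frac{\partial c}{\partial b}$ appears in both results. Working backward through the graph computes it once, as the gradient at node $b$, and then reuses it for both inputs.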

This two-phase process is the essence of backpropagation. In the forward pass, we evaluate each node and store all of the intermediate values. In the backward pass, we start from the output and propagate gradients backward through the graph, computing the partial derivative of the output with respect to each node exactly once from the chain rule and the stored intermediate values.
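Written out by hand for $x^2 + 3xy + 1$, the two phases might look like the following. This is a sketch for this one expression only, with intermediate names (`a`, `b`, `c`) chosen here for illustration.

```python
def forward_backward(x, y):
    """Hand-written forward and backward pass for f = x^2 + 3xy + 1."""
    # Forward pass: evaluate each node and keep the intermediate values.
    a = x * x        # a = x^2
    b = x * y        # b = xy
    c = 3.0 * b      # c = 3b
    f = a + c + 1.0

    # Backward pass: start from df/df = 1 and move toward the inputs.
    df_df = 1.0
    df_da = df_df * 1.0          # f = a + c + 1
    df_dc = df_df * 1.0
    df_db = df_dc * 3.0          # c = 3b
    # x feeds two nodes (a and b), so its contributions are summed.
    df_dx = df_da * (2.0 * x) + df_db * y
    df_dy = df_db * x
    return f, df_dx, df_dy

print(forward_backward(2.0, 5.0))
```

At $x = 2$, $y = 5$ this returns the value 35 together with the partial derivatives 19 and 6, which agree with the analytic results $2x + 3y$ and $3x$.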

Because each node's gradient is computed only from its own local derivative and the gradients of the nodes it feeds into, no computations get repeated and we avoid expression swell. The full gradient of a scalar-valued function with $n$ parameters is computed in a single backward pass, at a cost that is a small constant multiple of one forward evaluation of the function, regardless of how large $n$ is.
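The same idea can be automated so that the graph is recorded while the expression is evaluated. The following is a deliberately tiny sketch supporting only addition and multiplication. The `Value` class and its attributes are hypothetical names chosen for this illustration, and it is not meant to stand in for the engine built in the next chapter.

```python
class Value:
    """A scalar that remembers how it was computed (illustrative sketch only)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # nodes this value was computed from
        self.local_grads = local_grads  # d(self)/d(parent) for each parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Order the nodes so each one is processed after everything that uses it.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += local * node.grad   # chain rule, accumulated

x, y = Value(2.0), Value(5.0)
f = x * x + Value(3.0) * x * y + Value(1.0)
f.backward()
print(f.data, x.grad, y.grad)
```

A single call to `backward` fills in the gradient for every input at once. Each node is visited one time in the backward sweep, so the work grows with the size of the graph and not with the number of parameters multiplied by the size of the graph, as it does for finite differences.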

A common reaction on first understanding backpropagation is "wait, that's just the chain rule—how did it take so long to discover?" But it was genuinely hard to see. When the algorithm was invented, it wasn't obvious that derivatives were even the right tool for training networks, and that only becomes obvious once you realize derivatives are cheap to compute—a circular dependency. Once the question is framed the right way, the hardest part is already behind you. -Chris Olah

Looking Forward

Now that we have a broad idea of how backpropagation works, in the next chapter we'll implement a backpropagation engine from scratch that'll be able to automatically compute the gradients for any function.
