Understanding Backpropagation

The fast and efficient computation of gradients is a defining feature of all modern machine learning frameworks such as PyTorch. The algorithm powering all of these frameworks and the modern deep learning revolution is known as backpropagation (or for those more savy, reverse mode automatic differentiation). It’s surprisingly easy to implement, so that’s what we’re going to be doing in this course.

If the explainations below don’t click at first, it’s recommended that you skip this section, read the next chapter, and then come back to it. Don’t spend too much time trying to understand this on the first pass!

What Problem’s Being Solved?

It would be easy to hand wave away the purpose of this algorithm by telling you that it allows us to evaluate the gradient of massive functions with billions of parameters quickly and efficiently. To truly appreciate backpropagation, it’s helpful to look at where other attempts fail.

Numerical Differentiation via Finite Differences

A simple numerical approach we could try would be to use a modified version of the limit definition to approximate the gradient.

fxif(x+hei)f(x)h \frac{ \partial f }{ \partial x_i } \approx \frac{f(x+he_i) - f(x)}{h}

Here, xx is the vector of all parameters, xix_i represents any individual parameter, and eie_i is the ithi^{th} basis vector. Using this method, we need to be careful when selecting a value for hh. It needs to be small enough to give us a good approximation that can eventually settle down at a minimum while not being so small that we start running into floating-point precision errors.

While this approach is straightforward, it has a critical flaw: computational cost. For every single partial derivative we want to compute, we need to evaluate the function to calculate f(x+hei)f(x+he_i). This means that as the number of parameters grows, the number of times the function, ff, needs to be evaluated grows linearly with it. Recall that ff can already by itself be computationally expensive due to its size.

Modern neural networks can be thought of as functions with million, billions, or even trillions of parameters. This approach fails early on as you try to use it at such scales.

Symbolic Differentiation

Another technique we could try is known as symbolic differentiation. This is an automated version of manual differentiation, and it’s done by using hard-coded mathematical rules to derive analytical expressions for the derivative.

Although this approach fixes the issues with floating-point precision and allows us to compute the gradient in one pass, it suffers from a problem known as expression swell. In cases where you’re dealing with deeply nested compositions of functions, such as neural networks, taking their derivative leads to exponentially massive expressions compared to the original function. This explosion in size is caused by how derivative rules—particularly the chain rule—have a tendency to expand functions.

ddxf(g(x))=f(g(x))g(x) \frac{d}{dx} f(g(x)) = f'(g(x))g'(x)

Storing the massive expressions that result from symbolic differentiation takes up an enormous amount of memory, and evaluating them is extremely computationally expensive. It can even be slower than numerical differentiation for large expressions.

Backpropagation

The key insight into backpropagation comes from recognizing two things. First, complex functions are composed of many simpler functions. This structure forms a natural sequence of dependencies that can be exploited using the chain rule.

This is much easier to understand if we take a look at a specific example such as x2+3xy+1x^2 + 3xy + 1. We can see that xx gets exponentiated, and also multiplied by 33 to get us x2x^2 and 3x3x respectively. We can do this with every operation and create a graph of dependencies as can be seen in the widget below.

×××++x---3---y---1------3x---3xy---x²+3xy---x²+3xy+1---

scroll to explore →

The second key insight is that this dependency graph gives us a clear, structured way to apply the chain rule while avoiding expression swell. Instead of deriving one massive symbolic expression to compute the gradient, we can work backward through the graph one node at a time and reuse intermediate values that were already computed during a first pass.

×××++x------3------y------1------------3x------3xy------x²+3xy------x²+3xy+1------

scroll to explore →

This two-phase process is the essence of backpropagation. In the forward pass, we evaluate each node while storing all of the intermediate values. In the backwards pass, we start from the output and propagate the gradients backward through the graph. We compute the partial derivative exactly once using the chain rule and the stored intermediate values.

Because every node’s gradient depends only on its immediate neighbors, no computations get repeated and we avoid expression swell. The full gradient of a function with nn parameters is computed in a single backward pass. The computational cost is comparable to just two forward evaluations of the function, regardless of how large nn is.

Looking Forward

Now that we have a broad idea of how backpropagation works, in the next chapter we’ll implement a backpropagation engine from scratch that’ll be able to automatically compute the gradients for any function.

Prioritize understanding over memorization. Good luck!

Impart is building the infrastructure for modern education. We help students take their learning into their own hands.

Impart

© 2026 Impart. All rights reserved.