Implementing the Backpropagation Engine

In the last chapter, we got an overview of how backpropagation exploits the chain rule to efficiently compute gradients by building a computational graph that avoids repeated computations. Now we’re going to write the code that automates this process for us so that by the end of this chapter, we’ll have a working backpropagation engine.

The Value Class

During backpropagation, we need to track the value of the numbers involved and how they were operated on so that we can eventually calculate the gradient. To support this, we’re going to create the skeleton of a class—which we’ll call Value—that’ll serve as the scaffolding for the rest of this project.

Starting off, we need to store the numbers that we’re working with, so they can be used to compute the gradient.

class Value:
    def __init__(self, data):
        self.data = data

We’ll also allow them to get operated on via operator overloading. We’re going to keep it simple for now by only allowing addition and multiplication. Each of these operations should return Value objects because, naturally, we’d expect Value(1) + Value(2) to give us Value(3).

Below, we’ve implemented the __add__ operation. The __mul__ operation is left as an exercise for you.

Ready

Graph Tracking

As we explained in the last chapter, backpropagation traverses through a computational graph to calculate the gradient. That means that when we perform an operation, we need to keep track of which inputs were involved in producing the result. So we need to extend our Value class to accomplish this goal.

We’re going to add two new fields to our class. The first, _prev, is a set of the Value objects that were used as inputs to produce the new value. The nodes (i.e. Values) stored here will be visited during our backward pass. We initialize it via the _children tuple passed into __init__. Our second field, _op, is used to store what operation was used to create the result (e.g. '+' or '*'), and it’s only ever used for debugging and visualizing what’s going on.

Now, we’re going to have you reimplement multiplication using this!

Ready

The Backward Pass

Now that our Value class has the information it needs to track the computation graph, we have everything we need to propagate gradients backward through it. The backward pass is where the actual calculus happens.

Gradients

If we’re going to compute the gradient for each node, we’ll also need somewhere to store it. So we’ll add a grad field that gets initialized to 0.0 in __init__. This field will store the partial derivative of the final output with respect to this particular node, $\frac{\partial L}{\partial \text{self}}$ .

class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0          # ← new
        self._prev = set(_children)
        self._op = _op
        self._backward = lambda: None  # ← new (explained below)

The `_backward` Closure

We’ll also need to give each Value object its own _backward function, whose purpose is to apply a step of the chain rule. This differs based on which operation was used. For the leaf Value objects we start with, it defaults to lambda: None since they are the final step in our backward pass.

For addition, the derivative of out with respect to each input is just 1, so gradient flows through unchanged. Given out = self + other, the chain rule gives us:

\frac{\partial L}{\partial \text{self}} = \frac{\partial L}{\partial \text{out}} \cdot \frac{\partial \text{out}}{\partial \text{self}} = \frac{\partial L}{\partial \text{out}} \text{ since } \frac{\partial \text{out}}{\partial \text{self}} = 1

\frac{\partial L}{\partial \text{other}} = \frac{\partial L}{\partial \text{out}} \cdot \frac{\partial \text{out}}{\partial \text{other}} = \frac{\partial L}{\partial \text{out}} \text{ since } \frac{\partial \text{out}}{\partial \text{other}} = 1

def __add__(self, other):
    out = Value(self.data + other.data, (self, other), '+')
    def _backward():
        self.grad  += out.grad   # d(out)/d(self)  = 1
        other.grad += out.grad   # d(out)/d(other) = 1
    out._backward = _backward
    return out

You may have noticed that we use += rather than = when accumulating gradients. This is because a node can appear more than once as an input to more than one operation. Using = would silently overwrite them, but using += ensures every contribution is counted along with the previous ones.

To give a concrete example, suppose you were computing b = a + a. When __add__ runs, both self and other are the same Python object—a. The backward pass then runs:

self.grad  += out.grad   # a.grad = out.grad
other.grad += out.grad   # a.grad = 2 * out.grad

With =, the second line overwrites the first and a.grad ends up as out.grad instead of 2 * out.grad.

For multiplication, the derivative of $a \cdot b$ with respect to $a$ is $b$ , and vice versa. Now try implementing the _backward closure for __mul__ yourself:

Ready

The `backward()` Method

A node’s _backward should only run after every downstream node has already run its _backward. This ensures that a node’s gradient gets fully accumulated before being propagated back to its inputs. Otherwise, we’d get incorrect results.

We guarantee this order with a topological sort in reverse: a topological ordering ensures each node appears after all of its inputs, so reversing it gives us the output-first traversal we need for the backward pass. For more information, see Appendix A.

Now that we have our nodes being called in the proper order, our backpropagation engine is almost complete. The last thing we need to do is to set every node’s gradient to be 0.0, except for the output node which gets set to 1.0. Although all Value objects have grad initialized to 0.0, this needs to get reset between calls to backward() to make sure gradients from the previous call won’t get accumulated with the new ones. This is because we use += and not = for the _backward() function. As for the output node’s gradient, it gets seeded to 1.0 because it’s the node we start from, and the derivative of a function with respect to itself is one — $\frac{\partial out}{\partial out}$ .

We’ve written build_topo for you. Now implement the remaining bit of backward() and we will have completed the backpropagation engine:

Ready

Looking Forward

This is all there is to create a simple working backpropagation engine. Any expressions built using + and * with our Value objects will have their gradient correctly and automatically calculated. Below is the full Value class implementation:

The ideas used here are the same as the ones behind PyTorch’s autograd, just at a much smaller scale. The key insight we’ve gone over is that complex gradients don’t need to be derived all at once. They can be broken down into primitive operations, and then we can let the chain rule do the rest as you walk backward through the graph.

In the upcoming chapter, we’re going to add a couple more operations to our engine and test it out to make sure it works!

Implementing the Backpropagation Engine

The Value Class

Graph Tracking

The Backward Pass

Gradients

The `_backward` Closure

The `backward()` Method

Looking Forward

Chapter 1

Understanding Backpropagation

Chapter 2

Implementing Backpropagation

Chapter 3

Testing Backpropagation

Impart

Implementing the Backpropagation Engine

The Value Class

Graph Tracking

The Backward Pass

Gradients

The _backward Closure

The backward() Method

Looking Forward

Chapter 1

Understanding Backpropagation

Chapter 2

Implementing Backpropagation

Chapter 3

Testing Backpropagation

Impart

The `_backward` Closure

The `backward()` Method