In the last chapter, we got an overview of how backpropagation exploits the chain rule to efficiently
compute gradients by building a computational graph that avoids repeated computations. Now we’re going
to write the code that automates this process for us so that by the end of this chapter, we’ll have a
working backpropagation engine.
The Value Class
During backpropagation, we need to track the value of the numbers involved and how they were operated
on so that we can eventually calculate the gradient. To support this, we’re going to create the skeleton
of a class—which we’ll call Value—that’ll serve as the scaffolding for the rest of this project.
Starting off, we need to store the numbers that we’re working with, so they can be used to compute the
gradient.
class Value: def __init__(self, data): self.data = data
We’ll also allow them to get operated on via operator overloading. We’re going to keep it simple for now
by only allowing addition and multiplication. Each of these operations should return Value objects because,
naturally, we’d expect Value(1) + Value(2) to give us Value(3).
Below, we’ve implemented the __add__ operation. The __mul__ operation is left as an exercise for you.
Ready
Graph Tracking
As we explained in the last chapter, backpropagation traverses through a computational graph to calculate
the gradient. That means that when we perform an operation, we need to keep track of which inputs were involved
in producing the result. So we need to extend our Value class to accomplish this goal.
We’re going to add two new fields to our class. The first, _prev, is a set of the Value objects that were
used as inputs to produce the new value. The nodes (i.e. Values) stored here will be visited during our backward
pass. We initialize it via the _children tuple passed into __init__. Our second field, _op, is used to store
what operation was used to create the result (e.g. '+' or '*'), and it’s only ever used for debugging and
visualizing what’s going on.
Now, we’re going to have you reimplement multiplication using this!
Ready
The Backward Pass
Now that our Value class has the information it needs to track the computation graph, we have
everything we need to propagate gradients backward through it. The backward pass is where the
actual calculus happens.
Gradients
If we’re going to compute the gradient for each node, we’ll also need somewhere to store it. So
we’ll add a grad field that gets initialized to 0.0 in __init__. This field will store the
partial derivative of the final output with respect to this particular node, ∂self∂L.
class Value: def __init__(self, data, _children=(), _op=''): self.data = data self.grad = 0.0 # ← new self._prev = set(_children) self._op = _op self._backward = lambda: None # ← new (explained below)
The _backward Closure
We’ll also need to give each Value object its own _backward function, whose purpose is to
apply a step of the chain rule. This differs based on which operation was used. For the leaf Value objects we start with, it defaults to lambda: None since they are the final step in our
backward pass.
For addition, the derivative of out with respect to each input is just 1, so gradient
flows through unchanged. Given out = self + other, the chain rule gives us:
∂self∂L=∂out∂L⋅∂self∂out=∂out∂L since ∂self∂out=1
∂other∂L=∂out∂L⋅∂other∂out=∂out∂L since ∂other∂out=1
You may have noticed that we use += rather than = when accumulating gradients. This is
because a node can appear more than once as an input to more than one operation. Using = would silently overwrite them, but using += ensures every contribution is counted along
with the previous ones.
To give a concrete example, suppose you were computing b = a + a. When __add__ runs,
both self and other are the same Python object—a. The backward pass then runs:
With =, the second line overwrites the first and a.grad ends up as out.grad instead
of 2 * out.grad.
For multiplication, the derivative of a⋅b with respect to a is b, and vice versa. Now try implementing the _backward closure for __mul__ yourself:
Ready
The backward() Method
A node’s _backward should only run after every downstream node has already run its _backward.
This ensures that a node’s gradient gets fully accumulated before being propagated back to its
inputs. Otherwise, we’d get incorrect results.
We guarantee this order with a topological sort in reverse: a topological ordering ensures
each node appears after all of its inputs, so reversing it gives us the output-first traversal
we need for the backward pass. For more information, see Appendix A.
Now that we have our nodes being called in the proper order, our backpropagation engine is almost
complete. The last thing we need to do is to set every node’s gradient to be 0.0, except for the
output node which gets set to 1.0. Although all Value objects have grad initialized to 0.0,
this needs to get reset between calls to backward() to make sure gradients from the previous call
won’t get accumulated with the new ones. This is because we use += and not = for the _backward() function. As for the output node’s gradient, it gets seeded to 1.0 because it’s the node we start
from, and the derivative of a function with respect to itself is one — ∂out∂out.
We’ve written build_topo for you. Now implement the remaining bit of backward() and we will
have completed the backpropagation engine:
Ready
Looking Forward
This is all there is to create a simple working backpropagation engine. Any expressions built
using + and * with our Value objects will have their gradient correctly and automatically
calculated. Below is the full Value class implementation:
The ideas used here are the same as the ones behind PyTorch’s autograd, just at a much smaller
scale. The key insight we’ve gone over is that complex gradients don’t need to be derived all at
once. They can be broken down into primitive operations, and then we can let the chain rule do
the rest as you walk backward through the graph.
In the upcoming chapter, we’re going to add a couple more operations to our engine and test it
out to make sure it works!