Testing Backpropagation

In the last chapter, we built a Value class that can compute gradients for any expression built from addition and multiplication. But we haven’t actually verified that it works. We could test it on a few hand-picked expressions whose gradients we already know, but that’s no fun. Backpropagation is famous for its role in training neural networks, so we’re going to test our tiny engine by seeing if we can train one.

You can skip this chapter if neural networks are new to you or don’t interest you. If you’d like to learn more, we have a Neural Networks from Scratch course that you can check out. We won’t go into the details of training neural networks in this chapter.

Extending the Value Class

Our engine currently only supports + and *. To run a neural network we need an activation function, and to compute a loss we need subtraction and exponentiation. Here are the four methods to add to the Value class:

import math

# The methods below belong inside the Value class body.

# Activation function
def sigmoid(self):
    t = 1 / (1 + math.exp(-self.data))
    out = Value(t, (self,), 'sigmoid')
    def _backward():
        self.grad += t * (1 - t) * out.grad
    out._backward = _backward
    return out

# Support for loss computation
def __neg__(self):
    return self * Value(-1)

def __sub__(self, other):
    return self + (-other)

def __pow__(self, exponent):
    assert isinstance(exponent, (int, float))
    out = Value(self.data ** exponent, (self,), f'**{exponent}')
    def _backward():
        self.grad += exponent * (self.data ** (exponent - 1)) * out.grad
    out._backward = _backward
    return out

Notice that __neg__ and __sub__ don’t need their own _backward functions. They reduce to operations we’ve already defined, so the correct gradients flow through automatically. Only sigmoid and __pow__ introduce genuinely new derivatives and therefore need their own backward logic.
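One way to convince yourself that the two new local derivatives are right is to compare them against central finite differences. The sketch below uses plain floats rather than the Value class, and the helper name numeric_derivative is made up for this example; it is not part of the engine.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def numeric_derivative(f, x, h=1e-5):
    """Central-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

x = 0.7  # arbitrary test point

# sigmoid: the backward pass uses t * (1 - t), where t = sigmoid(x)
t = sigmoid(x)
sigmoid_analytic = t * (1 - t)
sigmoid_numeric = numeric_derivative(sigmoid, x)

# power: the backward pass uses exponent * x ** (exponent - 1)
exponent = 3
pow_analytic = exponent * x ** (exponent - 1)
pow_numeric = numeric_derivative(lambda v: v ** exponent, x)

# Both differences should be tiny if the formulas are correct
print(abs(sigmoid_analytic - sigmoid_numeric))
print(abs(pow_analytic - pow_numeric))
```

If either difference were large, the matching _backward function would be the first place to look.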

Building the Neural Network

With those additions, we can build a small network entirely out of Value objects.

Neuron

A Neuron holds one weight per input and a bias. All of them are stored as Value objects, so backpropagation can compute their gradients and gradient descent can update them. During the forward pass, we compute a weighted sum starting from the bias, then apply the sigmoid to produce the final output.

import random

class Neuron:
    def __init__(self, nin):
        """Initializes one weight per input (random in [-1, 1])
        and a bias starting at zero."""
        self.w = [Value(random.uniform(-1, 1)) for _ in range(nin)]
        self.b = Value(0.0)

    def __call__(self, x):
        """Computes the weighted sum b + w1*x1 + w2*x2 + ...
        then squashes the result through sigmoid to (0, 1)."""
        act = self.b
        for wi, xi in zip(self.w, x):
            act = act + wi * xi
        return act.sigmoid()

    def parameters(self):
        """Returns all trainable parameters for this neuron."""
        return self.w + [self.b]
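To make the forward pass concrete, here is the same arithmetic with plain floats instead of Value objects. The weights and inputs are made-up numbers, chosen so that the weighted sum comes out to zero.

```python
import math

w = [0.5, -0.25]   # illustrative weights, not trained values
b = 0.0            # bias
x = [1.0, 2.0]     # illustrative input

# Weighted sum starting from the bias, as in Neuron.__call__
act = b
for wi, xi in zip(w, x):
    act = act + wi * xi

out = 1 / (1 + math.exp(-act))  # sigmoid squashes act into (0, 1)
print(act, out)  # 0.0 0.5
```

A weighted sum of zero lands exactly in the middle of the sigmoid's range, which is why the output is 0.5.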

Layer

A Layer is a list of independent neurons that all receive the same input. Each neuron produces one scalar output, so a layer with nout neurons returns a list of nout values.

class Layer:
    def __init__(self, nin, nout):
        """Creates nout independent neurons, each expecting nin
        inputs."""
        self.neurons = [Neuron(nin) for _ in range(nout)]

    def __call__(self, x):
        """Runs every neuron on the same input and returns
        their outputs as a list."""
        return [n(x) for n in self.neurons]

    def parameters(self):
        """Collects and flattens parameters from every neuron
        in this layer."""
        return [p for n in self.neurons for p in n.parameters()]

NeuralNetwork

A NeuralNetwork chains layers together: the output of one layer becomes the input of the next. The layers list specifies every layer size from input to output, and consecutive pairs define each Layer.

class NeuralNetwork:
    def __init__(self, layers):
        """Builds one Layer for each consecutive size pair in layers.
        e.g. NeuralNetwork([2, 4, 1]) creates layers of sizes 2→4 and 4→1."""
        self.layers = [Layer(layers[i], layers[i+1]) for i in range(len(layers) - 1)]

    def __call__(self, x):
        """Passes x through each layer in sequence. Unwraps
        the result to a scalar for single-output networks."""
        for layer in self.layers:
            x = layer(x)
        return x[0] if len(x) == 1 else x

    def parameters(self):
        """Collects and flattens parameters from every layer
        in the network."""
        return [p for layer in self.layers for p in layer.parameters()]

Every weight and bias is a Value, so every forward pass builds a fresh computation graph that our engine knows how to differentiate.
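A quick way to check that a network was wired as intended is to count its parameters. The helper below, count_parameters (a name made up for this sketch), works from the size list alone: each neuron in a layer owns nin weights plus one bias.

```python
def count_parameters(layers):
    """Expected number of weights and biases for a list of layer sizes."""
    total = 0
    # Consecutive pairs (nin, nout) define each layer
    for nin, nout in zip(layers, layers[1:]):
        total += nout * (nin + 1)  # nin weights + 1 bias per neuron
    return total

print(count_parameters([2, 4, 1]))  # 17
```

For a network built with the same sizes, len(model.parameters()) should give the same number.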

Training the Network

We’re going to create and train a tiny feedforward neural network to solve the XOR problem.

model = NeuralNetwork([2, 4, 1])

xs = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
ys = [0.0, 1.0, 1.0, 0.0]

for step in range(5000):
    # Forward pass — builds a new graph each step
    ypred = [model([Value(xi) for xi in x]) for x in xs]

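    # Sum of squared errors over the four examples
    loss = Value(0)
    for yp, yt in zip(ypred, ys):
        loss = loss + (yp - Value(yt))**2

    # Reset gradients: _backward accumulates with +=, so values
    # left over from the previous step would otherwise be added in
    for p in model.parameters():
        p.grad = 0.0

    loss.backward()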

    # Gradient descent
    for p in model.parameters():
        p.data -= 0.1 * p.grad

    print(f"step {step}: loss = {loss.data:.4f}")

Running this, you should see the loss decrease toward zero, which is good evidence that our engine is computing the gradients correctly. The exact curve depends on the random initial weights.
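The update in the loop, p.data -= 0.1 * p.grad, is ordinary gradient descent. A one-parameter sketch with plain floats shows the same rule in isolation. The function (w - 3)^2 and its hand-written derivative are stand-ins chosen for illustration; they are not part of the engine.

```python
w = 0.0              # single parameter, arbitrary starting point
learning_rate = 0.1  # same step size as the training loop

for step in range(100):
    loss = (w - 3.0) ** 2
    grad = 2.0 * (w - 3.0)      # d(loss)/dw, computed by hand
    w -= learning_rate * grad   # step against the gradient

print(w, loss)
```

Each step moves w a fraction of the way toward the minimum at 3, so the loss shrinks every iteration. In the network, backpropagation supplies grad for every parameter instead of a hand-derived formula.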

[Figure: Loss over training steps (x-axis: training step, 0 to 4999; y-axis: loss, 0.0 to 2.0)]

Conclusion

We built a backpropagation engine from scratch in about 50 lines of code. The engine records each expression as a computation graph and walks it in reverse topological order, applying the chain rule at every node to compute gradients efficiently. We confirmed that it works by training a neural network to compute the XOR operation.

The ideas used in this course map directly onto PyTorch’s autograd system. A Value object is similar to a tensor with requires_grad=True, and our .backward() method plays the same role as PyTorch’s: it walks the graph backward from the output and accumulates gradients into .grad.

From here, the gap between our engine and a real framework like PyTorch is mostly engineering. Tensors replace scalars so that operations run over entire batches at once, CUDA kernels push the computation onto GPUs, and a much larger library of operations covers everything from convolutions to attention. But the core ideas behind backpropagation in these libraries are the same ones behind what we’ve built.

Prioritize understanding over memorization. Good luck!

© 2026 Impart. All rights reserved.