Phase 4 · Session 13 · 60 min

How It Actually Learns

Big idea

Backpropagation is how neural networks compute gradients. It's the chain rule from calculus, applied with careful bookkeeping. You'll derive it for a 2-layer network, code it from scratch, and verify it matches the framework.

By the end, you'll be able to
  • Explain backpropagation as "blame, distributed backward via the chain rule"
  • Apply the chain rule to a simple composite function
  • Implement backprop for a 2-layer network in numpy
  • Use PyTorch's autograd to compute gradients automatically

The basketball coach

A basketball coach reviews tape: "Your release was rushed. Why? Because you got off-balance. Why? Because your pivot foot was wrong. Why? Because you didn't see the defender." Each step in the chain gets adjusted, and the player's whole shooting form improves.

Backpropagation does the same thing for a neural network. When the model makes a wrong prediction, you trace backward through the layers, figuring out which weights contributed how much to the mistake, and you nudge each one in proportion.

The forward pass and the mistake

The network pushes input forward, layer by layer, ending with a prediction and a cost:

input → layer 1 → layer 2 → … → output → compare to label → cost J

Now you want to update the weights to reduce J. You need ∂J/∂w for every weight w in the network.

For a single-layer model (linear or logistic regression), this was easy. You had a closed-form formula. In a deep network, weights in early layers affect the cost indirectly, through every layer that comes after. Untangling their contribution is the job of backprop.

The chain rule

The chain rule from calculus says: if y depends on z, and z depends on w, then:

dy/dw = (dy/dz) · (dz/dw)

The derivative through a chain is the product of derivatives along the chain.

Example. Let y = (3x + 2)². To find dy/dx using the chain rule:

  • Outer function: y = u², where u = 3x + 2. Derivative of y with respect to u is 2u.
  • Inner function: u = 3x + 2. Derivative with respect to x is 3.
  • So dy/dx = 2u · 3 = 6(3x + 2) = 18x + 12.

Check by expanding: y = 9x² + 12x + 4. So dy/dx = 18x + 12. ✓
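
You can sanity-check this with a quick finite-difference estimate, the same trick this session's activity uses to verify backprop itself:

def y(x):
    return (3*x + 2)**2

x0, h = 1.0, 1e-6
numerical = (y(x0 + h) - y(x0 - h)) / (2 * h)   # central difference
analytic = 18*x0 + 12                           # chain rule result from above
print(numerical, analytic)                      # both ≈ 30.0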

For neural networks. The cost J depends on the output, which depends on the last layer's weights, which depend on the previous layer's outputs, which depend on its weights, all the way back. The chain rule lets you compute ∂J/∂w for any w by multiplying derivatives along the chain.
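
Written out for the 2-layer network derived next, the chain from the cost back to the first layer's weights is:

∂J/∂W₁ = (∂J/∂ŷ)(∂ŷ/∂z₂)(∂z₂/∂a₁)(∂a₁/∂z₁)(∂z₁/∂W₁)

Backprop evaluates this product starting from the output end, reusing the partial products it has already computed for the later layers.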

Deriving backprop for a 2-layer network

This is the most rigorous derivation in the book. Take it slow.

Setup: a 2-layer network for binary classification.

z₁ = XW₁ + b₁    a₁ = ReLU(z₁)
z₂ = a₁W₂ + b₂    ŷ = σ(z₂)
J = log loss between ŷ and the labels y

Goal: compute ∂J/∂W₁, ∂J/∂b₁, ∂J/∂W₂, ∂J/∂b₂. You'll work backward from the output.

Step 1 — ∂J/∂z₂

The combination "log loss + sigmoid" has the beautiful property mentioned in Chapter 8:

∂J/∂z₂ = ŷ − y

This single line is doing the heavy lifting. The derivative of log loss with respect to ŷ has a 1/ŷ term. The derivative of sigmoid is σ'(z) = σ(z)(1 − σ(z)). When you chain them together, things cancel out cleanly. Define this shorthand:

δ₂ = ŷ − y

(δ is "delta," used to mean "the error signal at this layer.")
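
You can verify the cancellation numerically (a minimal check for a single example):

import numpy as np

def sigmoid(z): return 1 / (1 + np.exp(-z))

def J(z, y):  # log loss of sigmoid(z), one example
    y_hat = sigmoid(z)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

z, y_true, h = 0.7, 1.0, 1e-6
numerical = (J(z + h, y_true) - J(z - h, y_true)) / (2 * h)
print(numerical, sigmoid(z) - y_true)   # both ≈ -0.3318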

Step 2 — ∂J/∂W₂ and ∂J/∂b₂

z₂ = a₁W₂ + b₂. By the chain rule:

∂J/∂W₂ = (1/N) a₁ᵀδ₂    ∂J/∂b₂ = (1/N) Σ δ₂

(Dividing by N because you average over the dataset.)

Step 3 — propagate the error back to layer 1

You want δ₁ = ∂J/∂z₁. Using the chain rule, going through z₂ and a₁:

δ₁ = (δ₂W₂ᵀ) ⊙ ReLU'(z₁)

where ⊙ is element-wise multiplication and ReLU'(z) is 1 if z > 0, else 0.

Read this slowly:

  • δ₂W₂ᵀ is the error at layer 2, "pulled back" through W₂. It tells you how much each hidden activation a₁ affected J.
  • You multiply by ReLU'(z₁) to account for the activation function. If a hidden neuron was inactive (z ≤ 0, so its ReLU output was 0), it didn't contribute to anything, so its error signal is 0.
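
Here's the masking effect on a toy array (a minimal illustration; the numbers are made up):

import numpy as np

z1_toy = np.array([[0.8, -1.2, 0.3]])
incoming = np.array([[0.5, 0.5, 0.5]])    # stand-in for delta2 @ W2.T
mask = (z1_toy > 0).astype(float)         # ReLU'(z1): 1 where active, 0 where dead
print(incoming * mask)                    # [[0.5 0.  0.5]]: the dead neuron's error is zeroed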

Step 4 — ∂J/∂W₁ and ∂J/∂b₁

z₁ = XW₁ + b₁. Same pattern as layer 2, with δ₁ in place of δ₂:

∂J/∂W₁ = (1/N) Xᵀδ₁    ∂J/∂b₁ = (1/N) Σ δ₁

Done. Five lines of math give you all four gradients. The pattern is universal:

  1. Compute the output error δ.
  2. Use it to compute ∂J/∂W and ∂J/∂b for the output layer.
  3. Propagate the error back through W and the activation derivative to get the previous layer's δ.
  4. Use that to compute ∂J/∂W and ∂J/∂b for the previous layer.
  5. Repeat for as many layers as you have.
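
In code, that recipe collapses to one backward loop. Here's a sketch for any number of ReLU hidden layers (a minimal sketch, not the code below; Ws, zs, and activations are hypothetical lists of per-layer numpy arrays saved during the forward pass, with activations[0] = X):

def backward(delta, Ws, zs, activations, N):
    # delta starts as the output error (y_hat - y).
    grads = []
    for l in reversed(range(len(Ws))):
        dW = activations[l].T @ delta / N
        db = delta.mean(axis=0, keepdims=True)
        grads.append((dW, db))
        if l > 0:  # pull the error back through W and the activation derivative
            delta = (delta @ Ws[l].T) * (zs[l-1] > 0).astype(float)
    return grads[::-1]   # gradients ordered from layer 1 to the output layer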

The "chain of blame" metaphor

Make this less abstract. A relay team failed. The coach has to assign blame.

The last runner crossed the finish late. Most of the blame goes to them, if they ran their leg badly. But maybe they got the baton late. Then they're less to blame, and the previous runner is more.

The coach traces backward. Each runner's blame depends on:

  1. How much they slowed their own leg (analogous to the "local" derivative).
  2. How much they were already late starting (analogous to the "incoming" error from the next runner).

That's exactly backprop. For each weight, its blame is:

  1. How much it directly affected its layer (the local derivative).
  2. How much its layer affected the cost (the chained-back error δ).

Multiply these two for each weight. That's its gradient.

Backprop from scratch for a 2-layer network

Open Colab. The whole training loop, manual gradients.

import numpy as np
import matplotlib.pyplot as plt

# Data: spirals.
np.random.seed(42)
def make_spiral(n, classes=2, noise=0.2):
    X = np.zeros((n*classes, 2))
    y = np.zeros((n*classes, 1))
    for c in range(classes):
        ix = range(n*c, n*(c+1))
        r = np.linspace(0.0, 1, n)
        t = np.linspace(c*4, (c+1)*4, n) + np.random.randn(n)*noise
        X[ix] = np.c_[r*np.sin(t), r*np.cos(t)]
        y[ix] = c
    return X, y

X, y = make_spiral(100, 2)
N = len(X)

# Activation functions and their derivatives.
def relu(z): return np.maximum(0, z)
def relu_deriv(z): return (z > 0).astype(float)
def sigmoid(z): return 1 / (1 + np.exp(-z))

# Initialize weights. Small random values.
n_hidden = 16
W1 = np.random.randn(2, n_hidden) * 0.5
b1 = np.zeros((1, n_hidden))
W2 = np.random.randn(n_hidden, 1) * 0.5
b2 = np.zeros((1, 1))

learning_rate = 0.5
n_iterations = 5000
loss_history = []

for i in range(n_iterations):
    # ------- FORWARD PASS -------
    z1 = X @ W1 + b1                    # shape (N, n_hidden)
    a1 = relu(z1)                       # shape (N, n_hidden)
    z2 = a1 @ W2 + b2                   # shape (N, 1)
    y_hat = sigmoid(z2)                 # shape (N, 1)

    # Loss (clipped to avoid log(0)).
    eps = 1e-15
    y_hat_clip = np.clip(y_hat, eps, 1 - eps)
    loss = -np.mean(y * np.log(y_hat_clip) + (1 - y) * np.log(1 - y_hat_clip))
    loss_history.append(loss)

    # ------- BACKWARD PASS -------
    # Step 1: Output error.
    delta2 = (y_hat - y)                # shape (N, 1)

    # Step 2: Gradients for layer 2.
    dW2 = a1.T @ delta2 / N             # shape (n_hidden, 1)
    db2 = delta2.mean(axis=0, keepdims=True)

    # Step 3: Propagate error back to layer 1.
    delta1 = (delta2 @ W2.T) * relu_deriv(z1)   # shape (N, n_hidden)

    # Step 4: Gradients for layer 1.
    dW1 = X.T @ delta1 / N              # shape (2, n_hidden)
    db1 = delta1.mean(axis=0, keepdims=True)

    # ------- UPDATE -------
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2

    if i % 500 == 0:
        acc = ((y_hat > 0.5) == y).mean()
        print(f"Iter {i:5d}: loss={loss:.4f}, acc={acc:.3f}")

Output

The network learns to perfectly classify the spirals. With backprop coded by hand. Forty lines of numpy. You wrote the algorithm that trains every modern AI system, on a problem logistic regression cannot solve. You should feel something.

# Plot loss curve and decision boundary.
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].plot(loss_history)
axes[0].set_xlabel('Iteration'); axes[0].set_ylabel('Log loss')
axes[0].set_title('Training loss')

# Decision boundary.
xx, yy_grid = np.meshgrid(np.linspace(-1.5, 1.5, 200), np.linspace(-1.5, 1.5, 200))
grid = np.c_[xx.ravel(), yy_grid.ravel()]
z1g = grid @ W1 + b1
a1g = relu(z1g)
z2g = a1g @ W2 + b2
preds = sigmoid(z2g).reshape(xx.shape)

axes[1].contourf(xx, yy_grid, preds, levels=20, cmap='RdBu_r', alpha=0.6)
axes[1].scatter(X[y.flatten()==0, 0], X[y.flatten()==0, 1], color='orange', edgecolor='k')
axes[1].scatter(X[y.flatten()==1, 0], X[y.flatten()==1, 1], color='blue', edgecolor='k')
axes[1].set_title('Decision boundary')
plt.show()

The easy way — autograd

Frameworks like PyTorch handle backprop automatically.

import torch
import torch.nn as nn

# Convert data to tensors.
X_t = torch.tensor(X, dtype=torch.float32)
y_t = torch.tensor(y, dtype=torch.float32)

# Define the model: same architecture as your manual one.
model = nn.Sequential(
    nn.Linear(2, 16),
    nn.ReLU(),
    nn.Linear(16, 1),
    nn.Sigmoid()
)

# Define loss and optimizer.
loss_fn = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

# Training loop. Notice: no manual backprop!
for i in range(5000):
    y_hat = model(X_t)
    loss = loss_fn(y_hat, y_t)

    optimizer.zero_grad()
    loss.backward()           # autograd computes gradients
    optimizer.step()          # update weights

    if i % 500 == 0:
        acc = ((y_hat > 0.5).float() == y_t).float().mean().item()
        print(f"Iter {i:5d}: loss={loss.item():.4f}, acc={acc:.3f}")

Same result, far less code. loss.backward() ran the backprop you just wrote by hand. PyTorch built a "computation graph" during the forward pass and walked it backward to compute every gradient. This is called automatic differentiation, or autograd. You will probably never write backprop by hand again. But now you know what's happening when you call .backward().
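
Autograd handles the chain-rule example from earlier in this session just as well (a minimal check; the expected gradient is 18x + 12, so 30 at x = 1):

x = torch.tensor(1.0, requires_grad=True)
y = (3*x + 2)**2
y.backward()        # builds the graph forward, walks it backward
print(x.grad)       # tensor(30.)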

Why this is a big deal

Backpropagation was published in 1986 (Rumelhart, Hinton, Williams), with earlier roots. Before backprop, training networks with more than 1-2 hidden layers was practically impossible. Once backprop existed, deep networks were trainable, and the rest of the field became possible.

The core reverse-mode algorithm fits in a few hundred lines, though production frameworks wrap far more engineering around it. PyTorch, TensorFlow, and JAX all handle it for you: you write the forward pass, the framework computes the backward pass.

Vocabulary

Backpropagation (backprop): The algorithm for computing gradients in a neural network by working backward via the chain rule.
Forward pass: Push input through, compute predictions and cost.
Backward pass: Compute gradients, working from output back to input.
δ (delta): The error signal at a layer; ∂J/∂z for that layer.
Autograd / automatic differentiation: The system in modern frameworks that computes gradients automatically.
Activity: Verify backprop with numerical gradients · 20 min

A great way to confirm a backprop implementation is right: compare with numerical gradients (the slow but unambiguous way of computing derivatives).

def compute_loss(W1, b1, W2, b2):
    z1 = X @ W1 + b1
    a1 = relu(z1)
    z2 = a1 @ W2 + b2
    y_hat = sigmoid(z2)
    eps = 1e-15
    y_hat_clip = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat_clip) + (1 - y) * np.log(1 - y_hat_clip))

# Pick one weight in W1.
i, j = 0, 0
h = 1e-5

W1_plus = W1.copy(); W1_plus[i, j] += h
W1_minus = W1.copy(); W1_minus[i, j] -= h

numerical_grad = (compute_loss(W1_plus, b1, W2, b2) - compute_loss(W1_minus, b1, W2, b2)) / (2 * h)
print(f"Numerical gradient: {numerical_grad:.6f}")

# Run your backprop and read off dW1[i, j].
z1 = X @ W1 + b1
a1 = relu(z1)
z2 = a1 @ W2 + b2
y_hat = sigmoid(z2)
delta2 = (y_hat - y)
delta1 = (delta2 @ W2.T) * relu_deriv(z1)
dW1 = X.T @ delta1 / N

print(f"Backprop gradient:  {dW1[i, j]:.6f}")

The two numbers should match to roughly six decimal places. That's the whole test: compute the gradient two ways, and if they agree, the math is right.
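
For more confidence than a single entry gives, check a handful of random weights and report relative errors (a small extension of the same check; the variables carry over from the code above):

for _ in range(5):
    i, j = np.random.randint(2), np.random.randint(n_hidden)
    W1_plus = W1.copy();  W1_plus[i, j] += h
    W1_minus = W1.copy(); W1_minus[i, j] -= h
    num = (compute_loss(W1_plus, b1, W2, b2) - compute_loss(W1_minus, b1, W2, b2)) / (2 * h)
    rel_err = abs(num - dW1[i, j]) / max(abs(num) + abs(dW1[i, j]), 1e-12)
    print(f"W1[{i},{j}]: relative error = {rel_err:.2e}")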


Next up: Chapter 14 — Learning without a teacher

You now understand how a neural network learns: forward pass, compute cost, backprop the gradient, gradient descent update. That's the engine of all modern AI. Next: leave supervised learning behind. What can a model learn when there are no labels at all?
