Phase 2 · Session 06 · 60 min

Rolling Downhill

Big idea

Gradient descent is a simple, beautiful algorithm. Stand somewhere on the cost surface. Compute which way is downhill. Take a small step. Repeat. Done enough times, you find the minimum. You'll derive the gradient for MSE, code the algorithm from scratch, and watch your model train in real time.

By the end, you'll be able to
  • Explain gradient descent using the blindfolded hiker analogy
  • Derive the gradient of MSE with respect to w and b
  • Implement gradient descent in numpy from scratch
  • Predict what happens when the learning rate is too big or too small

The blindfolded hiker

Recall: the cost function is a surface. For linear regression, it's a 3D bowl. The lowest point is the best (w, b). You want to find it.

Imagine you're a hiker, blindfolded, standing somewhere on this hilly landscape. Your goal: get to the lowest point. You can't see anything. You can only feel the ground around your feet.

Your strategy:

  1. Feel the ground around your feet. Find the direction that slopes most steeply downhill.
  2. Take a small step in that direction.
  3. Repeat.

Eventually you reach the bottom of a valley. You can tell because the ground around you feels flat: no direction is downhill anymore.

That's gradient descent. The "feeling the slope" part is mathematically called the gradient. The "small step" part is the update. The size of the step is the learning rate.

For linear regression, the cost surface is a perfect bowl (convex), so this strategy is guaranteed to find the actual best (w, b). For more complex models (Phase 4), the surface gets bumpy, and gradient descent might land in a not-best valley. We'll deal with that later.
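
To see the whole loop in miniature before any formulas, here is a one-dimensional sketch that minimizes an illustrative cost J(w) = (w − 3)² rather than the MSE we derive below. The slope it uses, 2·(w − 3), is exactly the kind of derivative the next section refreshes.

# A minimal 1D version of the hiker's strategy (illustrative cost, not MSE).
def J(w):
    return (w - 3) ** 2              # a bowl with its bottom at w = 3

def slope(w):
    return 2 * (w - 3)               # derivative of (w - 3)**2

w = 0.0                              # start somewhere on the surface
learning_rate = 0.1
for step in range(25):
    w = w - learning_rate * slope(w) # take a small step downhill

print(w)   # close to 3, the bottom of the bowl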

A 60-second refresher on derivatives

To compute the gradient, you need derivatives. If you've taken calculus, you've seen this. If not, here's the quick version.

The derivative of a function tells you its slope at a point.

Consider f(x) = x². Its graph is a parabola.

  • At x = 1, the parabola is going up. The slope (the derivative) is positive.
  • At x = −1, it's going down. The derivative is negative.
  • At x = 0, it's flat at the bottom. The derivative is zero.

There's a formula for the derivative of x²: it's 2x. So at x = 1, the derivative is 2. At x = −1, it's −2. At x = 0, it's 0. Matches your intuition.
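
If you'd rather check this numerically than take the formula on faith, a central-difference approximation of the slope gives the same answers (a quick sanity check, not part of the main thread):

# Approximate the slope of x**2 at a few points with finite differences.
def f(x):
    return x ** 2

h = 1e-6
for x in [1.0, -1.0, 0.0]:
    slope = (f(x + h) - f(x - h)) / (2 * h)   # central difference
    print(x, slope)                           # approximately 2, -2, 0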

Three derivative rules you need:

  1. Power rule. The derivative of xⁿ is n·xⁿ⁻¹. So the derivative of x² is 2x, and the derivative of x³ is 3x².
  2. Constant rule. Derivative of a constant is 0.
  3. Chain rule. The derivative of f(g(x)) is f′(g(x)) · g′(x). The "outer" derivative times the "inner" derivative.

Partial derivatives are derivatives when the function has multiple variables. You hold all other variables constant and take the regular derivative with respect to one. Notation: ∂f/∂x ("partial dee f, partial dee x") means "derivative of f with respect to x, treating all other variables as constants."

Example: f(w, b) = w² + 3wb.

  • ∂f/∂w (treat b as constant): 2w + 3b
  • ∂f/∂b (treat w as constant): 3w
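
A quick numerical check of those two partials, using the same example function (the specific f here is just an illustration):

# Verify the partial derivatives of f(w, b) = w**2 + 3*w*b numerically.
def f(w, b):
    return w ** 2 + 3 * w * b

h = 1e-6
w0, b0 = 2.0, 1.0
df_dw = (f(w0 + h, b0) - f(w0 - h, b0)) / (2 * h)   # nudge w, hold b fixed
df_db = (f(w0, b0 + h) - f(w0, b0 - h)) / (2 * h)   # nudge b, hold w fixed
print(df_dw, 2 * w0 + 3 * b0)   # both approximately 7
print(df_db, 3 * w0)            # both approximately 6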

That's all the calculus you need. If this is brand new, it'll feel weird for one chapter and become natural by Chapter 7.

The gradient of MSE

Now apply derivatives to the cost function. Recall:

J(w, b) = (1/(2N)) · Σᵢ (ŷᵢ − yᵢ)²,  where ŷᵢ = w·xᵢ + b

(Using the 1/(2N) version because the 2 will cancel out cleanly. See the aside in Chapter 5.)

Take ∂J/∂w. Treat b as a constant. Use the chain rule on the square.

For each term inside the sum, the outer function is u² (where u = ŷᵢ − yᵢ = w·xᵢ + b − yᵢ). Its derivative is 2u. The inner function u has derivative xᵢ with respect to w (because the only w-containing piece is w·xᵢ).

So the derivative of one term with respect to w is:

2 · (ŷᵢ − yᵢ) · xᵢ

Sum over i, multiply by 1/(2N), and the 2 cancels with the 1/2:

∂J/∂w = (1/N) · Σᵢ (ŷᵢ − yᵢ) · xᵢ

Take ∂J/∂b. Same idea. The derivative of u with respect to b is 1, so:

∂J/∂b = (1/N) · Σᵢ (ŷᵢ − yᵢ)

These two formulas are the gradient of MSE. They tell you, for the current (w, b), which direction to nudge each parameter to increase J. You move the opposite way to decrease J.

Pause and look at these formulas. They have a beautiful structure.

  • ∂J/∂b is the average error (predicted minus actual). Big positive average error means b is too high; gradient is positive; you'll subtract from b. Big negative average error means b is too low; gradient is negative; you'll add to b. Self-correcting.
  • ∂J/∂w is the average error weighted by x. The factor xᵢ says "errors at large-x points should pull on w more than errors at small-x points." Makes sense: w is the slope, so high-leverage points (large x) should influence the slope more.

When the model is doing well, the average error is near zero, both gradients are small, and the steps are small. The algorithm naturally slows down as it approaches the answer. Aesthetically beautiful.
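
Before trusting these formulas in a training loop, it's worth checking them against a finite-difference approximation of J. A minimal sketch (the data and the starting point here are made up for the check):

import numpy as np

# Compare the analytic MSE gradients to finite differences of the 1/(2N) cost.
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

def cost(w, b):
    return ((w * X + b - y) ** 2).sum() / (2 * len(X))

w0, b0, h = 0.5, -1.0, 1e-6
errors = w0 * X + b0 - y
analytic_dw = (errors * X).mean()   # (1/N) * sum(error_i * x_i)
analytic_db = errors.mean()         # (1/N) * sum(error_i)
numeric_dw = (cost(w0 + h, b0) - cost(w0 - h, b0)) / (2 * h)
numeric_db = (cost(w0, b0 + h) - cost(w0, b0 - h)) / (2 * h)
print(analytic_dw, numeric_dw)   # should agree to several decimal places
print(analytic_db, numeric_db)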

The update rule

Now you have gradients. The gradient descent update rule:

w ← w − α · ∂J/∂w
b ← b − α · ∂J/∂b

α (alpha) is the learning rate, which controls how big a step to take.

In English: "subtract a small fraction of the gradient from each parameter." The minus sign is because you want to go opposite to the gradient (downhill, not uphill).

Repeat this update many times. Each iteration, the cost drops a bit. Eventually, you converge.
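
To see one update concretely, here is a single step with made-up numbers (nothing here comes from a real dataset):

# One gradient descent step by hand.
w, b = 0.0, 0.0
dJ_dw, dJ_db = -16.0, -5.0      # pretend these are the current gradients
alpha = 0.01                    # learning rate

w = w - alpha * dJ_dw           # 0.0 - 0.01 * (-16.0) = 0.16
b = b - alpha * dJ_db           # 0.0 - 0.01 * (-5.0)  = 0.05
print(w, b)                     # both parameters moved against their gradients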

The learning rate trade-off

The learning rate is the single most important knob in all of ML training.

  • Too small. Baby steps. The hiker eventually reaches the bottom but it might take centuries. Training is painfully slow.
  • Too big. Giant strides. The hiker overshoots the valley, climbs the other side, overshoots back. They oscillate forever, or worse, fly off the surface and the cost grows. This is called "diverging." A diverging model is a broken model.
  • Just right. Confident steps that get smaller as the gradient does. Smooth convergence in a reasonable number of steps.

In practice, you tune the learning rate by trying values like 0.1, 0.01, 0.001 and watching the cost curve.

Gradient descent from scratch

This is the most important code block in the book. Read it slowly, line by line.

import numpy as np
import matplotlib.pyplot as plt

# Some data with a clear linear trend plus noise.
np.random.seed(42)
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = 2.5 * X + 5 + np.random.randn(10) * 2   # true w=2.5, b=5

def compute_predictions(w, b, X):
    return w * X + b

def compute_cost(w, b, X, y):
    """MSE."""
    y_hat = compute_predictions(w, b, X)
    return ((y_hat - y) ** 2).mean()

def compute_gradients(w, b, X, y):
    """Return (dJ/dw, dJ/db) for MSE."""
    N = len(X)
    y_hat = compute_predictions(w, b, X)
    errors = y_hat - y
    dJ_dw = (errors * X).mean()
    dJ_db = errors.mean()
    return dJ_dw, dJ_db

# Initial guess.
w, b = 0.0, 0.0

# Hyperparameters.
learning_rate = 0.01
n_iterations = 1000

# Track the cost over iterations.
cost_history = []

for i in range(n_iterations):
    # 1. Compute gradients at current (w, b).
    dJ_dw, dJ_db = compute_gradients(w, b, X, y)

    # 2. Update parameters.
    w = w - learning_rate * dJ_dw
    b = b - learning_rate * dJ_db

    # 3. Track cost for plotting.
    cost_history.append(compute_cost(w, b, X, y))

    # 4. Print progress every 100 iterations.
    if i % 100 == 0:
        print(f"Iter {i:4d}: w={w:.3f}, b={b:.3f}, cost={cost_history[-1]:.3f}")

print(f"\nFinal: w={w:.3f}, b={b:.3f}")
print(f"True:  w=2.500, b=5.000")

Stop and stare. You started with w=0, b=0. After 1000 iterations of gradient descent, your model converged to w=2.52, b=4.70, very close to the true w=2.5, b=5.0. You just trained a model. You just ran the same algorithm that trains GPT-4, on a tiny dataset.

plt.plot(cost_history)
plt.xlabel('Iteration')
plt.ylabel('MSE')
plt.title('Cost dropping during gradient descent')
plt.show()

The plot shows a steep drop in the first 50 iterations, then a slow flattening as you approach the minimum. That curve is the universal shape of model training.
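
As an extra sanity check, you can compare the answer gradient descent found with NumPy's closed-form least-squares fit (possible here only because linear regression happens to have a closed-form solution):

# Closed-form least-squares fit for comparison (degree-1 polynomial).
w_exact, b_exact = np.polyfit(X, y, 1)
print(f"Closed form: w={w_exact:.3f}, b={b_exact:.3f}")
# Should be very close to the gradient descent result above.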

Play with the learning rate

def train(X, y, learning_rate, n_iterations=1000):
    """Run gradient descent and return final cost."""
    w, b = 0.0, 0.0
    for _ in range(n_iterations):
        dJ_dw, dJ_db = compute_gradients(w, b, X, y)
        w -= learning_rate * dJ_dw
        b -= learning_rate * dJ_db
    return compute_cost(w, b, X, y), w, b

for lr in [0.0001, 0.001, 0.01, 0.1, 0.5]:
    final_cost, final_w, final_b = train(X, y, lr)
    print(f"lr={lr:.4f}: w={final_w:.3f}, b={final_b:.3f}, final cost={final_cost:.3f}")

A live demonstration of the U-shape. Too small wastes time. Too big diverges. The right value is found by experimentation.
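
If you want to see those regimes rather than just read the final numbers, a small variation of train that records the cost history makes the picture obvious (a sketch reusing compute_gradients, compute_cost, X, and y from above):

# Record the cost curve for each learning rate to visualize the trade-off.
def train_with_history(X, y, learning_rate, n_iterations=200):
    w, b = 0.0, 0.0
    history = []
    for _ in range(n_iterations):
        dJ_dw, dJ_db = compute_gradients(w, b, X, y)
        w -= learning_rate * dJ_dw
        b -= learning_rate * dJ_db
        history.append(compute_cost(w, b, X, y))
    return history

for lr in [0.001, 0.01, 0.1]:
    plt.plot(train_with_history(X, y, lr), label=f"lr={lr}")
plt.yscale('log')   # the lr=0.1 curve blows up; a log scale keeps it on the chart
plt.xlabel('Iteration')
plt.ylabel('MSE')
plt.legend()
plt.show()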

Why this generalizes to everything

Here's the punchline that makes Chapter 6 the conceptual peak of Phase 2:

Every modern ML model is trained with gradient descent (or some variant of it).

ChatGPT, Claude, self-driving car perception, voice recognition, image generators, recommendation systems: gradient descent. The cost function changes. The model changes. The recipe is always:

  1. Make a prediction.
  2. Compute the cost.
  3. Compute the gradient.
  4. Take a small step.
  5. Repeat, billions of times.
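
Written as code, that recipe is the same loop you built above, just with the model-specific pieces pulled out. A generic sketch (the predict and gradients functions here are stand-ins; for a neural network you would swap those two and keep the loop):

import numpy as np

def predict(params, X):
    return params["w"] * X + params["b"]                     # model-specific

def gradients(params, X, y):
    errors = predict(params, X) - y
    return {"w": (errors * X).mean(), "b": errors.mean()}    # model-specific

# The loop itself never changes.
params = {"w": 0.0, "b": 0.0}
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * X + 1.0
learning_rate = 0.05
for _ in range(2000):
    grads = gradients(params, X, y)
    for name in params:
        params[name] -= learning_rate * grads[name]          # the small step
print(params)   # close to w=2, b=1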

The reason this is universal: it's the only known approach that works for models with billions of parameters. For 2 parameters, you could try many combinations and pick the best. For GPT-4 with hundreds of billions of parameters, that's hopeless. Gradient descent is the only way to train these things.

When you understand gradient descent, you've understood the engine of the entire field.

Vocabulary

Gradient: The vector of partial derivatives of the cost with respect to each parameter. Tells you which way to move parameters to increase cost; you move opposite.
Learning rate (α, or sometimes η): Step size for gradient descent.
Iteration / step: One round of: compute gradient, update parameters.
Convergence: When the cost has stopped dropping meaningfully.

Next up: Chapter 7 — Real-world regression

You now understand the heart of how machines learn. Every model in this book uses gradient descent. Next: linear regression on real data. Predicting house prices from multiple features at once. We'll generalize the equation to vectors, see why models care about the scale of features, and meet a real-world dataset.

The hiker's descent
Lab activity · ~10 min

Train a real model on a real cost surface. Move the learning rate; watch the hiker converge, oscillate, or diverge.