How Wrong Are We?
Before a model can learn, it needs a way to measure how wrong it is. The cost function is that measure. You'll derive Mean Squared Error from a simple question: "given a guessed line, how badly does it miss the data points?" By the end of this section, you'll be able to:
- Compute the error for a single prediction (predicted minus actual)
- Explain why we square errors instead of just adding them up
- Write the MSE formula from memory and explain each symbol
- Implement MSE in numpy and use it to compare two lines
The setup
Imagine two scatter plots side by side. Same dots. Each has a different line drawn through it. One is clearly worse than the other.
This line is bad. That line is good. But how bad is bad? Can you put a number on it?
The natural answer: measure the distance from each dot to the line. That's exactly what a cost function does.
The error of a single prediction
For each data point, the model has a prediction (ŷᵢ) and an actual value (yᵢ). The error for that point is:

errorᵢ = ŷᵢ − yᵢ
Visually, the error is the vertical gap between the dot and the line. Big gap means big error. Dots on the line have zero error.
A concrete example. Suppose your line predicts ŷ = 80 for some student, but the student actually got y = 75. The error is ŷ − y = 80 − 75 = +5. The model overshot by 5.
For another student, ŷ = 70 but y = 75. Error is 70 − 75 = −5. The model undershot by 5.
Both predictions were equally wrong (off by 5 units), but their errors have opposite signs. This will matter in a moment.
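If you want to see this in code, here's a tiny sketch using the two students above (the numbers are just those example values):

import numpy as np

# The two students from the example: predicted vs. actual scores.
y_hat = np.array([80, 70])   # model predictions
y = np.array([75, 75])       # actual scores
errors = y_hat - y           # predicted minus actual
print(errors)                # [ 5 -5]  same size of miss, opposite signs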
Why you can't just add the errors
Naive idea: total error = sum of all errors.

total error = Σᵢ (ŷᵢ − yᵢ)

(Σ is the Greek letter sigma. It means "add up." The little "i" underneath says "for each example, indexed by i.")
Why this fails. Suppose you have just two data points. Your line overshoots one by 5 and undershoots the other by 5. Errors: +5 and −5. Sum: 0.
The line looks "perfect" by this measure. It is not. It missed both points equally badly. Positives and negatives canceled out.
You need a fix.
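Here is that cancellation in code, a quick sketch using the same +5 and −5 errors:

import numpy as np

errors = np.array([5, -5])
print(errors.sum())          # 0  -- looks "perfect," yet both points were missed by 5
print(np.abs(errors).sum())  # 10 -- the misses didn't vanish, the signs just canceled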
Two options for fixing the sign problem
Option A: take the absolute value. |ŷᵢ − yᵢ|. Always positive.
This works mathematically. The cost function would be the average of the absolute errors. This is called Mean Absolute Error (MAE):

MAE = (1/n) Σᵢ |ŷᵢ − yᵢ|
Downside: |x| has a sharp corner at x = 0. The function is not smooth there. In Chapter 6, you'll need to take derivatives of the cost function, and derivatives of functions with sharp corners are problematic.
Option B: square the errors. (ŷᵢ − yᵢ)². Squaring makes any number non-negative.
Smooth (no sharp corners). Easy to take derivatives of. Has the side effect of punishing big errors more than small ones: an error of 10 contributes 100 to the cost, but an error of 1 contributes only 1. This is often what you want.
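To see the "big errors hurt more" effect in numbers, here's a small comparison sketch (the two error values are made up for illustration):

import numpy as np

errors = np.array([1.0, 10.0])   # one small miss, one big miss
mae = np.abs(errors).mean()      # (1 + 10) / 2  = 5.5
mse = (errors ** 2).mean()       # (1 + 100) / 2 = 50.5
print(mae, mse)                  # the big miss dominates the squared version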
Option B wins. MSE is the workhorse of regression. Lock in the formula:

MSE(w, b) = (1/n) Σᵢ (ŷᵢ − yᵢ)²,   where ŷᵢ = w·xᵢ + b and n is the number of data points.
Notice MSE is a function of w and b. Different (w, b) pairs give different MSEs. The whole game of training will be: find the (w, b) that minimizes MSE.
Working an example by hand
Make this concrete. Tiny dataset with 3 points:
| i | xᵢ (hours) | yᵢ (score) |
|---|---|---|
| 1 | 2 | 60 |
| 2 | 5 | 75 |
| 3 | 8 | 90 |
Try line 1: w = 5, b = 50.
| i | xᵢ | yᵢ | ŷᵢ = 5xᵢ + 50 | error | error² |
|---|---|---|---|---|---|
| 1 | 2 | 60 | 60 | 0 | 0 |
| 2 | 5 | 75 | 75 | 0 | 0 |
| 3 | 8 | 90 | 90 | 0 | 0 |
MSE = (1/3) × (0 + 0 + 0) = 0. This line goes through every point exactly. Best possible MSE.
Try line 2: w = 4, b = 55.
| i | xᵢ | yᵢ | ŷᵢ = 4xᵢ + 55 | error | error² |
|---|---|---|---|---|---|
| 1 | 2 | 60 | 63 | 3 | 9 |
| 2 | 5 | 75 | 75 | 0 | 0 |
| 3 | 8 | 90 | 87 | -3 | 9 |
MSE = (1/3) × (9 + 0 + 9) = 6. Worse than line 1.
Try line 3: w = 0, b = 75 (a flat line at the average).
| i | xᵢ | yᵢ | ŷᵢ | error | error² |
|---|---|---|---|---|---|
| 1 | 2 | 60 | 75 | 15 | 225 |
| 2 | 5 | 75 | 75 | 0 | 0 |
| 3 | 8 | 90 | 75 | -15 | 225 |
MSE = (1/3) × (225 + 0 + 225) = 150. Much worse.
The MSE numbers (0, 6, 150) capture your intuition: line 1 is the best fit, line 2 is decent, line 3 is bad. You've turned "fits well" into a number.
The cost function defines "best"
Here's the move that makes it all click:
Finding the best line is the same as finding the (w, b) with the lowest MSE.
Whichever (w, b) gives the smallest MSE is, by definition, the best fit. The cost function turned the fuzzy goal "fit well" into the precise math problem "minimize this number."
This is the central trick of all of machine learning. Every algorithm in this book does:
- Define a model with adjustable parameters.
- Define a cost function that measures wrongness.
- Adjust parameters to minimize the cost.
Linear regression, neural networks, ChatGPT, all of it. The model gets fancier, the cost function changes (you'll meet log loss in Chapter 8), the optimization gets cleverer, but the recipe is the same.
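To make the recipe concrete without jumping ahead to gradient descent, here's a deliberately naive sketch: it "adjusts parameters" by trying random (w, b) pairs and keeping whichever scores lowest on the cost. The search ranges and the iteration count are arbitrary choices for this illustration.

import numpy as np

rng = np.random.default_rng(0)
X = np.array([2, 5, 8])
y = np.array([60, 75, 90])

def cost(w, b):
    """Step 2 of the recipe: a cost function that measures wrongness (MSE)."""
    return ((w * X + b - y) ** 2).mean()

# Step 1: the model is y_hat = w*x + b. Step 3: adjust (w, b) to lower the cost.
best_w, best_b, best_cost = 0.0, 0.0, cost(0.0, 0.0)
for _ in range(10_000):
    w, b = rng.uniform(0, 10), rng.uniform(30, 70)   # try a random candidate
    c = cost(w, b)
    if c < best_cost:                                 # keep it only if it's less wrong
        best_w, best_b, best_cost = w, b, c

print(best_w, best_b, best_cost)   # should land close to w=5, b=50, where the cost is 0

Gradient descent will replace the random guessing with something far smarter, but the skeleton (model, cost, adjust) stays exactly the same.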
Implement MSE from scratch
Open Colab.
import numpy as np
# Three data points.
X = np.array([2, 5, 8])
y = np.array([60, 75, 90])
# A guessed line.
w = 4
b = 55
# Step 1: Compute predictions.
y_hat = w * X + b
print("Predictions:", y_hat)
# Step 2: Compute errors.
errors = y_hat - y
print("Errors:", errors)
# Step 3: Square them.
squared_errors = errors ** 2
print("Squared errors:", squared_errors)
# Step 4: Take the mean.
mse = squared_errors.mean()
print("MSE:", mse)
That's MSE: four short steps. Now wrap it as a reusable function:
def mse(w, b, X, y):
    """Return the mean squared error of line y = wx + b on data (X, y)."""
    y_hat = w * X + b
    return ((y_hat - y) ** 2).mean()
# Compare three lines from the worked example.
print("Line 1 (w=5, b=50):", mse(5, 50, X, y)) # 0.0
print("Line 2 (w=4, b=55):", mse(4, 55, X, y)) # 6.0
print("Line 3 (w=0, b=75):", mse(0, 75, X, y)) # 150.0
The numbers match the table you did by hand.
Visualize the cost surface
Here's the visual that makes the next chapter click. Compute MSE for many (w, b) pairs and plot the surface.
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
# A slightly noisy dataset, hard-coded so your numbers match these.
X = np.array([1, 2, 4, 6, 8, 10, 12])
y = np.array([55, 65, 70, 78, 90, 96, 113])
# Sweep over many w, b values.
w_range = np.linspace(0, 10, 50)
b_range = np.linspace(30, 70, 50)
W, B = np.meshgrid(w_range, b_range)
# Compute MSE for each (w, b) pair.
Z = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Z[i, j] = mse(W[i, j], B[i, j], X, y)
# Plot.
fig = plt.figure(figsize=(10, 6))
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(W, B, Z, cmap='viridis', alpha=0.7)
ax.set_xlabel('w (slope)')
ax.set_ylabel('b (intercept)')
ax.set_zlabel('MSE')
ax.set_title('Cost surface: MSE as a function of w and b')
plt.show()
The output: a smooth bowl. Every point on the floor (the w-b plane) corresponds to a different line, and the height of the surface above that point is its MSE.
The lowest point of the bowl is the best line. Finding it visually is easy. Finding it algorithmically is what gradient descent does. That's next chapter.
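If you want the grid's own answer in the meantime, you can read the lowest point straight off the Z array computed above; it's only as precise as the (w, b) values you swept:

# Find the grid cell with the smallest MSE (continues from the sweep above).
i, j = np.unravel_index(Z.argmin(), Z.shape)
print("Best w on the grid:", W[i, j])
print("Best b on the grid:", B[i, j])
print("MSE there:", Z[i, j])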
Vocabulary
- Error: the gap between a prediction and the actual value for one data point, ŷᵢ − yᵢ.
- Cost function: a single number that measures how wrong a model is on the whole dataset.
- Mean Squared Error (MSE): the average of the squared errors; the cost function used for regression in this book.
- Mean Absolute Error (MAE): the average of the absolute errors; an alternative that has a sharp corner at zero.
- Cost surface: the cost plotted as a function of the parameters (w, b); for linear regression with MSE, it's a smooth bowl.
Questions you might have
Why do some formulas have a 1/2 in front? Some textbooks (and Andrew Ng's Coursera course) write MSE with a 1/2 in front:

J(w, b) = (1/(2n)) Σᵢ (ŷᵢ − yᵢ)²

The 1/2 is a math convenience. When you take the derivative for gradient descent, the chain rule produces a factor of 2 from the squared term, and the 1/2 cancels it out, giving cleaner formulas. The 1/2 doesn't change which (w, b) is best; it just rescales the cost. You'll see it both ways in this book.
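Here's a small sketch showing that the rescaling changes nothing about which line wins. It reuses the mse function from earlier in this section; half_mse is just a name made up for this comparison.

import numpy as np

# The three-point dataset from the worked example.
X = np.array([2, 5, 8])
y = np.array([60, 75, 90])

def half_mse(w, b, X, y):
    """The 1/2 convention: exactly mse divided by 2."""
    return ((w * X + b - y) ** 2).mean() / 2

# Every cost is simply halved; the ranking of lines (and the best line) is unchanged.
print(mse(5, 50, X, y), half_mse(5, 50, X, y))   # 0.0    0.0
print(mse(4, 55, X, y), half_mse(4, 55, X, y))   # 6.0    3.0
print(mse(0, 75, X, y), half_mse(0, 75, X, y))   # 150.0  75.0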
You can score any line. The best line minimizes MSE. But how do you find the (w, b) with the lowest MSE? With infinite possible (w, b) combinations, you can't check them all. Next: gradient descent — the blindfolded hiker. We'll write it from scratch and watch the MSE drop with every step.