Margin
Phase 3 · Session 09 · 50 min

When Models Lie

Big idea

A model that fits training data perfectly might be useless on new data. The "pattern" it learned was specific to training quirks, not the real world. This is overfitting. You'll see it visually with a polynomial regression demo, code train/test splits, and derive L2 regularization.

By the end, you'll be able to
  • Explain overfitting using the "memorize vs understand" analogy
  • Code a train/test split correctly
  • Plot train and test error vs model complexity, and identify the sweet spot
  • Write the L2-regularized cost function and explain what λ does

Two students

Two students prepare for an exam.

Alex memorized the textbook. Page numbers, examples, even typos. They have 100% accuracy on every example in the book.

Sam read the textbook once. They didn't memorize anything. But they understood the concepts.

The exam has new problems not in the book. Who does better?

Sam, almost always. Alex's "knowledge" was specific to the book; the moment a slightly different question shows up, they're stuck. Memorization didn't transfer; understanding would have.

Alex is an overfitted model. Sam is a well-generalized one.

Memorization vs generalization

Your real goal in training is generalization: doing well on new examples you haven't seen. You don't actually care if the model gets the training examples right. You care about future examples.

But training only optimizes for training examples. So a sufficiently powerful model can simply memorize the training set, achieving zero training error and zero useful learning.

This is overfitting: the model learned the noise and quirks of training data instead of the underlying pattern.

The opposite is underfitting: the model is too simple to capture the actual pattern. It does poorly on both training and new data.

The sweet spot: a model complex enough to capture the real pattern, simple enough to not memorize noise.

Train / test split

How do you detect overfitting? Reserve some data the model has never seen.

The recipe:

  1. Take your dataset. Randomly shuffle.
  2. Split: 80% train, 20% test (or 70/30, or 90/10; doesn't matter much).
  3. Train only on the training set.
  4. Evaluate only on the test set. Test accuracy is your honest estimate of new-data performance.

This is non-negotiable. Reporting only training accuracy is the cardinal sin of beginner ML.

The test set must be held out from the start. Don't peek at it during training. Don't tune your model based on it. (If you do, the test set effectively becomes part of your training data.)

Train/test split in code

Open Colab.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Split: 80% train, 20% test. random_state for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Train: {len(X_train)} examples")
print(f"Test:  {len(X_test)} examples")

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)

print(f"Train accuracy: {train_acc:.3f}")
print(f"Test accuracy:  {test_acc:.3f}")

For iris (a relatively easy dataset), train and test accuracy should both be high (roughly 0.95 or above) and close together. A train accuracy far above test accuracy is the signature of overfitting.
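
If you want to see a large gap on purpose, one option (not part of the session code; the model choice, dataset sizes, and variable names below are illustrative) is to fit a very flexible model to labels that are pure noise:

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Random features, coin-flip labels: there is no real pattern to learn.
rng = np.random.default_rng(0)
X_noise = rng.normal(size=(200, 10))
y_noise = rng.integers(0, 2, size=200)

Xn_train, Xn_test, yn_train, yn_test = train_test_split(
    X_noise, y_noise, test_size=0.2, random_state=42
)

# An unconstrained decision tree has enough capacity to memorize any training set.
tree = DecisionTreeClassifier(random_state=0).fit(Xn_train, yn_train)
print(f"Train accuracy: {tree.score(Xn_train, yn_train):.3f}")  # 1.000: pure memorization
print(f"Test accuracy:  {tree.score(Xn_test, yn_test):.3f}")    # near 0.5: coin-flip on new data

Perfect training accuracy with chance-level test accuracy is overfitting in its purest form: the model learned nothing that transfers.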

The overfitting curve

Plot training and test error as a function of model complexity.

error
  │
  │\
  │ \                            ●●●●● test error
  │  \                       ●●●●
  │   \                  ●●●●
  │    \             ●●●
  │     \●●●●●●●●●●●●●
  │      ●●●●●●●●●
  │              ●●●●●●●●●●●●●●● train error
  │
  └─────────────────────────────── complexity
   simple                  complex

As complexity grows, training error monotonically drops (more capacity to fit). Test error drops, then rises. The valley of the test curve is the sweet spot.

Past the sweet spot, the model uses extra capacity to memorize training noise, and test error climbs. This U-shape is universal in ML.

The overfitting demo

Polynomial regression on noisy data. Watch overfitting happen live.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

# True relationship: y = sin(x), with noise.
np.random.seed(0)
n_samples = 30
X = np.sort(np.random.uniform(0, 2*np.pi, n_samples))
y_true = np.sin(X)
y = y_true + np.random.normal(0, 0.3, n_samples)

# Split: 70% train, 30% test.
indices = np.arange(n_samples)
np.random.shuffle(indices)
train_idx, test_idx = indices[:21], indices[21:]
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

# Try polynomial degrees 1 to 15. Fit each, record train and test MSE.
degrees = list(range(1, 16))
train_errors, test_errors = [], []

for d in degrees:
    model = make_pipeline(PolynomialFeatures(degree=d), LinearRegression())
    model.fit(X_train.reshape(-1, 1), y_train)
    train_errors.append(mean_squared_error(y_train, model.predict(X_train.reshape(-1, 1))))
    test_errors.append(mean_squared_error(y_test, model.predict(X_test.reshape(-1, 1))))

# Plot.
plt.plot(degrees, train_errors, label='Train MSE', marker='o')
plt.plot(degrees, test_errors, label='Test MSE', marker='s')
plt.xlabel('Polynomial degree')
plt.ylabel('MSE')
plt.yscale('log')
plt.legend()
plt.title('Train vs Test error: classic overfitting curve')
plt.show()

The plot shows train MSE dropping monotonically, test MSE dipping and then climbing. Sweet spot is around degree 3 or 4.
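
If you'd rather have the code point at the sweet spot than eyeball the plot, the minimum of the test-error curve gives it directly (continuing in the same notebook, reusing degrees and test_errors from the cell above):

# The degree with the lowest test MSE is the sweet spot for this particular split.
best_degree = degrees[int(np.argmin(test_errors))]
print(f"Lowest test MSE at degree {best_degree}")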

Visualize the actual fitted curves:

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, d in zip(axes, [1, 4, 15]):
    model = make_pipeline(PolynomialFeatures(degree=d), LinearRegression())
    model.fit(X_train.reshape(-1, 1), y_train)
    x_plot = np.linspace(0, 2*np.pi, 200)
    y_plot = model.predict(x_plot.reshape(-1, 1))
    ax.scatter(X_train, y_train, label='train', color='blue')
    ax.scatter(X_test, y_test, label='test', color='red')
    ax.plot(x_plot, y_plot, color='green', label=f'degree {d}')
    ax.plot(x_plot, np.sin(x_plot), color='gray', linestyle='--', label='true')
    ax.set_title(f'Degree {d}')
    ax.legend()
    ax.set_ylim(-2, 2)
plt.tight_layout()
plt.show()

Three subplots: degree 1 underfits (a straight line that misses the curvature); degree 4 fits well (close to the true sine curve); degree 15 wildly overfits (a snaking curve that threads through the training points but misses the test points and the true sine curve).

Regularization

How do you fight overfitting? Several tools. The most universal is regularization: add a penalty to the cost function for big weights.

The intuition. A model with large weights is "leaning hard" on its features. Tiny input changes produce big output changes. The function it represents wiggles a lot. Wiggle = memorization of training noise.

A model with small weights is smoother. Tiny input changes produce tiny output changes. Smooth functions tend to generalize better.

So: penalize big weights.
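
One way to make the "big weights wiggle" intuition concrete is to print the coefficient magnitudes of the degree-4 and degree-15 fits from the demo above (a sketch reusing X_train and y_train from that notebook; exact numbers vary with the seed, but the overfit model's weights typically come out far larger):

# Print the largest coefficient magnitude for each fit.
for d in [4, 15]:
    model = make_pipeline(PolynomialFeatures(degree=d), LinearRegression())
    model.fit(X_train.reshape(-1, 1), y_train)
    coefs = model.named_steps['linearregression'].coef_
    print(f"degree {d}: largest |weight| = {np.abs(coefs).max():.1f}")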

L2 regularization (also called Ridge regression). Add the sum of squared weights, scaled by λ, to the cost:

    J_reg(w, b) = J(w, b) + λ · Σ_j w_j²

(The bias b is usually not penalized, by convention. Only the weights.)

λ (lambda) is a knob:

  • λ = 0: no regularization. Original cost. Maximum overfitting potential.
  • λ very large: the model is forced to use tiny weights. Maximum smoothness, at the risk of underfitting.
  • λ in between: the sweet spot.

The new gradient. The gradient of the regularization term is just 2λ · w_j, one extra term per weight. So the update rule becomes:

    w_j := w_j - α · ( ∂J/∂w_j + 2λ · w_j )

where α is the learning rate. (The factor of 2 is often absorbed into λ.)

Each step now pulls the weights toward zero by a small amount. Big weights get tugged down harder; small weights barely change. This is sometimes called "weight decay."
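
Here is what that update looks like in plain numpy for a linear model, assuming a mean-squared-error cost (a minimal sketch; the function and variable names are mine, not part of the session code):

import numpy as np

def l2_gradient_step(w, b, X, y, alpha, lam):
    """One gradient-descent step with L2 regularization (weight decay)."""
    m = len(y)
    err = X @ w + b - y
    grad_w = (X.T @ err) / m + 2 * lam * w   # data gradient plus the regularization pull toward zero
    grad_b = err.mean()                       # the bias is not regularized
    return w - alpha * grad_w, b - alpha * grad_b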

L1 regularization (Lasso). Same idea, but uses the absolute value instead of the square:

    J_reg(w, b) = J(w, b) + λ · Σ_j |w_j|

L1 tends to drive some weights to exactly zero, which amounts to automatic feature selection. L2 keeps all the weights but makes them small. Both are useful; L2 is more common in deep learning.
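
To see the zeroing behavior concretely, here is a small self-contained sketch on synthetic data (the dataset, the alpha value, and the variable names are illustrative choices, not part of the session demo; sklearn's Lasso and Ridge both call the strength alpha):

import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 20 features, but only the first 3 actually matter.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 20))
y_syn = 3 * X_syn[:, 0] - 2 * X_syn[:, 1] + X_syn[:, 2] + rng.normal(0, 0.1, 200)

lasso = Lasso(alpha=0.1).fit(X_syn, y_syn)
ridge = Ridge(alpha=0.1).fit(X_syn, y_syn)

# L1 zeros out the irrelevant weights; L2 only shrinks them.
print("Lasso weights exactly zero:", int(np.sum(lasso.coef_ == 0)), "of 20")
print("Ridge weights exactly zero:", int(np.sum(ridge.coef_ == 0)), "of 20")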

Regularization in action

from sklearn.linear_model import Ridge

degrees_to_try = [4, 15]
lambdas = [0, 0.001, 0.1, 10]

fig, axes = plt.subplots(2, len(lambdas), figsize=(16, 8))
x_plot = np.linspace(0, 2*np.pi, 200)

for row, d in enumerate(degrees_to_try):
    for col, lam in enumerate(lambdas):
        # Ridge in sklearn uses alpha for what we called lambda.
        model = make_pipeline(PolynomialFeatures(degree=d), Ridge(alpha=lam))
        model.fit(X_train.reshape(-1, 1), y_train)
        y_plot = model.predict(x_plot.reshape(-1, 1))
        ax = axes[row][col]
        ax.scatter(X_train, y_train, color='blue', s=20)
        ax.plot(x_plot, y_plot, color='green')
        ax.plot(x_plot, np.sin(x_plot), color='gray', linestyle='--')
        ax.set_title(f'degree={d}, λ={lam}')
        ax.set_ylim(-2, 2)
plt.tight_layout()
plt.show()

A 2×4 grid. Top row: degree-4 polynomial. Bottom: degree-15 polynomial. Each column: a different λ.

In the bottom row, the wild degree-15 curve calms down dramatically as λ increases, eventually looking nearly identical to the well-behaved degree-4 curve. Regularization rescues the model from overfitting without changing the model's structure. That's the magic.
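
To put numbers on what the bottom row shows, you can stay in the same notebook and print the degree-15 model's test MSE at each λ (reusing X_train, X_test, lambdas, and the helpers from the cells above; exact values depend on the random seed):

# Quantify the bottom row: test MSE of the degree-15 model at each lambda.
for lam in lambdas:
    model = make_pipeline(PolynomialFeatures(degree=15), Ridge(alpha=lam))
    model.fit(X_train.reshape(-1, 1), y_train)
    mse = mean_squared_error(y_test, model.predict(X_test.reshape(-1, 1)))
    print(f"lambda={lam}: test MSE = {mse:.3f}")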

Vocabulary

Overfitting: Model fits training data well but fails on new data.
Underfitting: Model too simple to capture the real pattern.
Generalization: Performance on data the model hasn't seen.
Train/test split: Holding out data to evaluate honestly.
Regularization: Penalty in the cost function that discourages complex models.
L2 regularization (Ridge): Penalizes the sum of squared weights.
L1 regularization (Lasso): Penalizes the sum of absolute weights.
λ (lambda): The regularization strength. Big λ = simpler model.

Questions you might have

Next up: Chapter 10 — Mini-project: Titanic

You have everything you need to build a real ML project. Next: a hands-on mini-project — predicting who survived the Titanic, using logistic regression and everything you've learned. It's a famous Kaggle starter problem.

When Models Lie · Lab (in development)