Phase 2 · Session 07 · 50 min

Real-World Regression

Big idea

Real datasets have many features, not one. House prices depend on size, bedrooms, location, age — dozens of variables. Linear regression generalizes gracefully. The math becomes vector math. The code becomes a few extra lines. Two new traps appear: features at different scales, and features that don't actually help.

By the end, you'll be able to
  • Write a multivariate linear regression model in vector notation
  • Explain why feature scaling matters for gradient descent
  • Train a real multivariate model on a real dataset using sklearn
  • Inspect a model's learned weights and interpret them

From one feature to many

In previous chapters, you predicted from one feature:

  ŷ = wx + b

Now imagine three features:

  • x₁: square footage
  • x₂: number of bedrooms
  • x₃: distance to downtown (km)

The model:

  ŷ = w₁x₁ + w₂x₂ + w₃x₃ + b

Each feature gets its own weight. Each weight tells you how much that feature contributes. If w₁ = 200, every additional square foot adds $200 to the predicted price. If w₃ = −50,000, every additional kilometer from downtown subtracts $50,000.
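A quick worked example, with made-up values for the bedroom weight and the intercept: take w₁ = 200, w₂ = 10,000, w₃ = −50,000, b = 100,000, and a 2,000 sq ft, 3-bedroom house 5 km from downtown. Then

  ŷ = 200·2000 + 10,000·3 + (−50,000)·5 + 100,000 = 400,000 + 30,000 − 250,000 + 100,000 = $280,000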

For n features, in general:

  ŷ = w₁x₁ + w₂x₂ + … + wₙxₙ + b

Vector notation

Writing all those weights is annoying. Vector notation packs it up.

Let x be a vector of features (an n-dimensional vector), w a vector of weights. Then:

  ŷ = w · x + b

The dot is a dot product: multiply corresponding components and sum.

So w · x + b is exactly what you wrote above, just shorter. You'll see this notation everywhere in ML papers. It scales the same way to 10 features, 100, 10,000.
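To see that the dot product really is the written-out sum, here's a minimal numpy sketch reusing the made-up weights from the worked example above:

import numpy as np

w = np.array([200.0, 10_000.0, -50_000.0])   # hypothetical weights
b = 100_000.0
x = np.array([2000.0, 3.0, 5.0])             # sq ft, bedrooms, km to downtown

by_hand = w[0]*x[0] + w[1]*x[1] + w[2]*x[2] + b
by_dot = np.dot(w, x) + b                    # w · x + b
print(by_hand, by_dot)                       # both 280000.0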

The cost function generalizes too. With multiple features, MSE is still:

  J(w, b) = (1/m) Σᵢ (ŷᵢ − yᵢ)²

The only thing that changed: ŷᵢ is now w · xᵢ + b, where xᵢ is the feature vector for example i.
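As a minimal sketch (made-up numbers), here's that cost computed for a tiny batch; each row of X is one example's feature vector:

import numpy as np

X = np.array([[2000.0, 3.0, 5.0],      # example 1: sq ft, bedrooms, km
              [1200.0, 2.0, 2.0]])     # example 2
y = np.array([300_000.0, 250_000.0])   # true prices (made up)
w = np.array([200.0, 10_000.0, -50_000.0])
b = 100_000.0

y_hat = X @ w + b                      # w · x_i + b for every example at once
mse = np.mean((y_hat - y) ** 2)        # y_hat is [280000, 260000], mse = 2.5e8
print(y_hat, mse)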

Gradient descent with vectors

The update rule generalizes too. Now you have a gradient component for each weight:

  ∂J/∂wⱼ = (2/m) Σᵢ (ŷᵢ − yᵢ) xᵢⱼ

Where is "the j-th feature of the i-th example." Update each weight:

Same logic, more components. In code, this is just a numpy vector operation; no extra loop.
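Here's a minimal sketch of one vectorized update step, using the same made-up arrays as above (α is an arbitrarily small learning rate chosen for illustration):

import numpy as np

X = np.array([[2000.0, 3.0, 5.0],
              [1200.0, 2.0, 2.0]])
y = np.array([300_000.0, 250_000.0])
w = np.zeros(3)
b = 0.0
alpha = 1e-8                             # tiny, because these features are unscaled

y_hat = X @ w + b
error = y_hat - y                        # prediction error for every example
grad_w = (2 / len(y)) * (X.T @ error)    # one gradient component per weight, no loop
grad_b = (2 / len(y)) * error.sum()

w = w - alpha * grad_w                   # update every weight at once
b = b - alpha * grad_b

Note how small α has to be to keep the unscaled square-footage column from blowing up the step; that's exactly the problem the next section tackles.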

The scaling problem

Here's a subtle issue that bites every beginner.

In the housing example:

  • Square footage ranges from about 500 to 5000
  • Bedrooms range from 1 to 5
  • Distance ranges from 0 to 30

These are wildly different ranges. Gradient descent's "step" affects every parameter by the same learning rate. If the learning rate is right for the bedroom weight, it's wrong for the square-footage weight. The cost surface becomes a long, narrow valley instead of a round bowl, and gradient descent zigzags painfully.

The fix: feature scaling. Standardize each feature before training.

Two common methods:

Min-max scaling. Squash to [0, 1]:

  x′ = (x − min) / (max − min)

Standardization (z-score). Mean 0, standard deviation 1:

  x′ = (x − μ) / σ

where μ is the feature's mean and σ is its standard deviation.

Standardization is the more common choice in ML. After scaling, all features have similar ranges, the cost surface is a nice round bowl, and gradient descent rolls smoothly.
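A minimal sketch of standardization by hand in numpy (sklearn's StandardScaler, used below, computes the same thing):

import numpy as np

X = np.array([[3000.0, 4.0,  2.0],
              [1500.0, 2.0, 25.0],
              [ 800.0, 1.0, 10.0]])       # made-up rows: sq ft, bedrooms, km

mu = X.mean(axis=0)                       # per-feature mean μ
sigma = X.std(axis=0)                     # per-feature standard deviation σ
X_scaled = (X - mu) / sigma

print(X_scaled.mean(axis=0))              # ≈ 0 for every column
print(X_scaled.std(axis=0))               # ≈ 1 for every column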

Multivariate linear regression on California housing

The full pipeline. Open Colab.

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

# 1. Load the data.
data = fetch_california_housing(as_frame=True)
df = data.frame
print("Shape:", df.shape)        # (20640, 9)
print("Columns:", df.columns.tolist())

# Separate features (X) from target (y).
X = df.drop('MedHouseVal', axis=1).values
y = df['MedHouseVal'].values

# 2. Train/test split. 80/20.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Scale features. Fit scaler on train ONLY, then transform both.
# This avoids "leaking" test data information into training.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 4. Train.
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# 5. Evaluate.
y_pred_train = model.predict(X_train_scaled)
y_pred_test = model.predict(X_test_scaled)
print(f"Train MSE: {mean_squared_error(y_train, y_pred_train):.3f}")
print(f"Test MSE:  {mean_squared_error(y_test, y_pred_test):.3f}")

# 6. Inspect what the model learned.
feature_names = df.columns.tolist()[:-1]   # all except 'MedHouseVal'
for name, weight in zip(feature_names, model.coef_):
    print(f"  {name:12s}: weight = {weight:+.3f}")
print(f"  {'intercept':12s}: {model.intercept_:+.3f}")

What do these numbers mean? The test MSE is 0.556. The target is in units of $100k and MSE is in squared units, so a typical prediction is off by about √0.556 × $100k ≈ $75k. Not amazing, but reasonable for such a simple model.

The largest absolute weights tell you what the model relies on most:

  • MedInc (median income): big positive. Higher income areas → higher prices. Makes sense.
  • Latitude and Longitude: big negative. The north and east have lower prices in California. The model has learned California geography.
  • AveRooms: small negative. Surprising! More rooms per household, lower prices? This is probably because more rooms per household correlates with rural/lower-density areas, where prices are lower despite the size. This is the kind of weird thing a real model surfaces.
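If you want the notebook to rank the features itself, a small follow-up cell (reusing feature_names and model from the pipeline above) sorts the learned weights by absolute value:

# Rank features by the magnitude of their learned weight.
ranked = sorted(zip(feature_names, model.coef_), key=lambda fw: abs(fw[1]), reverse=True)
for name, weight in ranked:
    print(f"  {name:12s}: {weight:+.3f}")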

The scaling demo

To dramatize feature scaling, run the same model without scaling, using gradient descent.

# Without scaling: gradient descent struggles or diverges.
from sklearn.linear_model import SGDRegressor

# SGDRegressor uses gradient descent (with mini-batches).
# Without scaling, it'll struggle with this data.
print("Without scaling:")
model_unscaled = SGDRegressor(max_iter=1000, learning_rate='constant', eta0=0.001)
try:
    model_unscaled.fit(X_train, y_train)
    print(f"  Test MSE: {mean_squared_error(y_test, model_unscaled.predict(X_test)):.3f}")
except Exception as e:
    print(f"  Failed: {e}")

print("\nWith scaling:")
model_scaled = SGDRegressor(max_iter=1000, learning_rate='constant', eta0=0.001)
model_scaled.fit(X_train_scaled, y_train)
print(f"  Test MSE: {mean_squared_error(y_test, model_scaled.predict(X_test_scaled)):.3f}")

The unscaled version often produces astronomical MSE or NaN values. The scaled version converges to a reasonable result. Same algorithm, same data. The only difference is whether we standardized first. This is why feature scaling is one of the first things ML practitioners do on a new dataset.

More features aren't always better

Three things to watch for:

1. Useless features dilute signal. Adding "average outdoor temperature on the day the house was listed" just gives the model one more variable it has to learn to ignore, spending data and training effort that could have gone toward better features.

2. Highly correlated features confuse the model. Including "square feet" and "square meters" feeds in the same information twice. Which weight should be big? Either one could carry it, so the split between them is arbitrary and the individual weights stop being interpretable. (Term: multicollinearity. The sketch after this list makes it concrete.)

3. Some features actively mislead. A "year built" feature learned from 1990 data teaches the model that old houses are cheap; apply that model to 2026 data and it badly misprices historic properties.
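To make point 2 concrete, here's an illustrative sketch on synthetic data (not the housing set): duplicate a feature and the weight that used to mean "$ per square foot" gets split between the two copies.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
sqft = rng.uniform(500, 5000, size=200)
price = 200 * sqft + 50_000 + rng.normal(0, 10_000, size=200)

# One copy of the feature: the weight is interpretable.
m1 = LinearRegression().fit(sqft.reshape(-1, 1), price)
print(m1.coef_)    # ≈ [200], dollars per square foot

# Two identical copies: the fit is just as good, but the 200 gets split.
X2 = np.column_stack([sqft, sqft])
m2 = LinearRegression().fit(X2, price)
print(m2.coef_)    # ≈ [100, 100]; neither weight alone means "$ per square foot" now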

This is where feature engineering matters. Picking, transforming, and combining features is its own craft.

Vocabulary

Multivariate (or multiple) linear regression: Linear regression with more than one input feature.
Vector notation: Using w and x to package multiple weights/features.
Dot product: Multiply corresponding components and sum: w · x = Σⱼ wⱼxⱼ.
Feature scaling / standardization: Transforming features to similar numerical ranges before training.
Feature engineering: The art of crafting features for a model.
Activity: Housing price predictor · 30 min

Take the Colab notebook above and:

  1. Load the data, plot histograms of features.
  2. Train linear regression with one feature (MedInc). Print MSE.
  3. Add features one at a time. Watch MSE drop.
  4. Add a deliberately useless feature (random numbers). Watch what happens (MSE barely changes; the weight on the useless feature is near zero).
  5. Compare scaled vs unscaled training.
  6. Identify the most-important features by absolute weight.

Stretch challenge. Engineer a new feature: rooms_per_person = AveRooms / AveOccup. Does it help? It often does, because it captures something neither feature does individually.
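One way to try the stretch challenge, continuing in the same notebook (df, train_test_split, StandardScaler, LinearRegression, and mean_squared_error are already imported above); whether it actually helps is yours to check:

# Stretch challenge sketch: engineer rooms_per_person and retrain.
df2 = df.copy()
df2['rooms_per_person'] = df2['AveRooms'] / df2['AveOccup']

X_new = df2.drop('MedHouseVal', axis=1).values
y_new = df2['MedHouseVal'].values

X_tr, X_te, y_tr, y_te = train_test_split(X_new, y_new, test_size=0.2, random_state=42)
scaler2 = StandardScaler()
X_tr = scaler2.fit_transform(X_tr)
X_te = scaler2.transform(X_te)

model2 = LinearRegression().fit(X_tr, y_tr)
print("Test MSE with rooms_per_person:", mean_squared_error(y_te, model2.predict(X_te)))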


Next up: Chapter 8 — Cats or dogs?

Phase 2 is done. You've built a real model that predicts house prices from data. But what if the thing you're predicting isn't a number? What if it's a category, like "is this email spam"? That's classification. Next: a new model (logistic regression), a new cost function (log loss), and a new squish function (sigmoid). Same machinery, new toolkit.

Real-World Regression Lab · in development