Drawing Lines Through Dots
The simplest "learning" task: given dots on a graph, find the line that fits them best. Linear regression, derived as a direct extension of y = mx + b from algebra class. By the end, you'll be able to:
- Read a scatter plot and sketch a reasonable best-fit line
- Translate y = mx + b into ML notation (ŷ = wx + b)
- Compute predictions by hand for a small dataset
- Use numpy and scikit-learn to fit a line in code
A line is a function with two knobs
This is the bridge from algebra class to ML. Take it slowly.
Recall the equation of a line from algebra:

y = mx + b
- m is the slope: how steep the line is. If you increase x by 1, y goes up by m.
- b is the intercept: where the line crosses the y-axis. The value of y when x = 0.
In ML, you use this exact same equation, but you re-cast its meaning. You're going to use a line to make predictions. So:
- x is the input (the thing we know about an example, like hours studied)
- y is the true output for that example (the actual exam score)
- ŷ ("y-hat") is the model's predicted output (what the line says for this x)
- The line is your model: ŷ = mx + b
The line predicts ŷ for any input x. The slope m tells you how much ŷ changes per unit increase in x. The intercept b is the prediction when x = 0.
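To make the slope's meaning concrete, here is a minimal sketch. The values m = 5 and b = 50 are hypothetical, chosen for illustration:

```python
# Hypothetical line: m = 5, b = 50.
m, b = 5, 50
x = 8

print(m * x + b)        # prediction at x = 8  -> 90
print(m * (x + 1) + b)  # one more unit of x   -> 95, up by exactly m
```

Bump x by 1 and the prediction rises by exactly m. That is the slope's entire job.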
Notation switch (gentle)
ML papers and textbooks use w instead of m (for "weight") and keep b (for "bias," same as intercept). Same equation, different letters:

ŷ = wx + b
Why the rename? Because when you have multiple input features, you'll have multiple slopes. Calling them all "m" gets confusing. Calling them weights scales naturally. You'll see this in Chapter 7. From now on, w is the slope and b is the intercept.
Why "weight" and "bias"? "Weight" because it tells you how much weight this feature should get when computing the prediction. A big weight means this feature matters a lot. A small weight means it doesn't. "Bias" is older terminology from neural networks, and it just means "the constant that shifts the prediction up or down." Don't confuse it with bias-the-fairness-concept from Chapter 3. Same word, different meaning.
Parameters are what you learn
Lock this in: the model is the equation ŷ = wx + b. The parameters are the specific numbers w and b that make this line this particular line.
When you say "I trained a model," you mean: I found good values of w and b for the data. That's literally it for linear regression. Every concept that comes after (gradient descent, neural networks, transformers) is sophisticated machinery for finding good parameters.
For linear regression in 1D, there are only 2 parameters: w and b. For a neural network, there might be billions. The principle is identical.
Computing a prediction by hand
Given specific values of w and b, computing ŷ for any x is just arithmetic.
Example: suppose w = 5 and b = 50, and x = 8 (a student studies 8 hours):

ŷ = wx + b = 5 × 8 + 50 = 40 + 50 = 90

The model predicts a score of 90.
Try a few of these on paper. Vary w and b. Different (w, b) pairs give different predictions for the same x. The whole point of training is to pick the (w, b) that gives good predictions for all the points in your dataset, not just one.
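A quick sketch of that idea: the same input x pushed through several hypothetical (w, b) pairs gives different predictions.

```python
# Same input, different parameter settings, different predictions.
x = 8
for w, b in [(5, 50), (10, 0), (2, 70)]:
    print(f"w={w}, b={b} -> prediction {w * x + b}")  # 90, 80, 86
```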
Quick exercise: for each row of (w, b, x), compute ŷ.
| w | b | x | ŷ = ? |
|---|---|---|---|
| 3 | 10 | 4 | ? |
| 0.5 | 20 | 100 | ? |
| -2 | 50 | 5 | ? |
| 1 | 0 | 7 | ? |
(Answers: 22, 70, 40, 7.)
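If you want to check your answers in code, a quick sketch that evaluates each row of the table:

```python
# Verify the exercise table: y_hat = w * x + b for each (w, b, x) row.
rows = [(3, 10, 4), (0.5, 20, 100), (-2, 50, 5), (1, 0, 7)]
for w, b, x in rows:
    print(w * x + b)  # 22, 70.0, 40, 7
```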
Linear regression in three ways
You'll compute predictions three ways, in increasing levels of abstraction.
Way 1 — pure Python
Write the formula directly.
```python
# A single prediction by hand.
w = 5   # slope
b = 50  # intercept
x = 8   # input: hours studied

y_hat = w * x + b
print("Prediction:", y_hat)  # 90
```
Way 2 — multiple predictions with numpy
Real datasets have many inputs. You use arrays.
```python
import numpy as np

w = 5
b = 50

# A whole array of inputs.
x_array = np.array([1, 2, 4, 6, 8, 10, 12])

# Numpy applies the operation to every element. This is called "vectorization."
y_hat_array = w * x_array + b
print(y_hat_array)  # [ 55  60  70  80  90 100 110]
```
What just happened: numpy's array multiplication is element-wise. Adding 50 adds 50 to each element. You computed 7 predictions in one line.
Way 3 — plot the line
```python
import numpy as np
import matplotlib.pyplot as plt

# Some made-up data.
x_data = np.array([1, 2, 4, 6, 8, 10, 12])
y_data = np.array([55, 65, 70, 75, 90, 95, 115])  # actual exam scores

# Your guessed line.
w, b = 5, 50
x_line = np.linspace(0, 13, 100)  # 100 points from 0 to 13
y_line = w * x_line + b

plt.scatter(x_data, y_data, label='actual data')
plt.plot(x_line, y_line, color='red', label=f'line: y = {w}x + {b}')
plt.xlabel('Hours studied')
plt.ylabel('Exam score')
plt.legend()
plt.show()
```
The plot shows the dots and your guessed line. Some dots are above the line, some below. Your line is approximately right, but not perfect.
Pause and look at it. Could you do better? Drag the line in your imagination: slope up a bit, intercept down a bit. The question for the next chapter is: how do you measure how good a line is? Without that measurement, "better" has no meaning.
When linear regression makes sense
Linear regression assumes the relationship between x and y is roughly a straight line. That sounds restrictive but covers a lot of real data:
- Hours studied vs. exam score
- Square footage vs. house price
- Years of experience vs. salary
- Outdoor temperature vs. ice cream sales
It does not make sense when:
- The relationship is wildly non-linear (a thrown ball's trajectory, a parabola)
- The output is a category (spam vs not spam — classification, Phase 3)
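To see the non-linear failure mode concretely, here is a quick sketch fitting a line to a perfect parabola, using numpy's `np.polyfit` (degree 1) as the line-fitting tool:

```python
import numpy as np

# A parabola: y = x^2. No straight line can follow this shape.
x = np.array([-3, -2, -1, 0, 1, 2, 3], dtype=float)
y = x ** 2

# Best-fit line (degree-1 polynomial): returns (slope, intercept).
w, b = np.polyfit(x, y, 1)
print(round(w, 2), round(b, 2))  # slope 0.0: the "best" line is flat and misses the shape entirely
```

On symmetric parabola data, the best line is perfectly horizontal: the downward half cancels the upward half, and the fit captures nothing about the curve.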
Fit a line with scikit-learn
You've been computing predictions for guessed w and b. What if you let sklearn find the best w and b for you?
```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Your data, reshaped into the form sklearn expects.
# X must be 2D: (n_examples, n_features). Even with one feature, you need 2D.
X = np.array([[1], [2], [4], [6], [8], [10], [12]])
y = np.array([55, 65, 70, 75, 90, 95, 115])

# Create the model. Train it. Print the learned w and b.
model = LinearRegression()
model.fit(X, y)
print("Learned w:", model.coef_[0])    # the slope
print("Learned b:", model.intercept_)  # the intercept

# Predict for a new student.
x_new = np.array([[7]])  # 7 hours studied
y_pred = model.predict(x_new)
print(f"Predicted score for 7 hours: {y_pred[0]:.1f}")
```
Two-line training. For this data, the slope and intercept print out as roughly w ≈ 4.9 and b ≈ 50.6.
sklearn just found the "best" line. You don't yet know how it defined "best." You don't yet know how it found it. That's Chapters 5 and 6. But you can already use this in real projects.
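sklearn's answer isn't mystical, and you can cross-check it. For a single feature, numpy's `np.polyfit` with degree 1 fits a line by the same criterion sklearn uses (you'll meet that criterion formally next chapter). A sketch:

```python
import numpy as np

# The same data as above, as flat 1D arrays.
x = np.array([1, 2, 4, 6, 8, 10, 12], dtype=float)
y = np.array([55, 65, 70, 75, 90, 95, 115], dtype=float)

# Degree-1 fit: returns (slope, intercept).
w, b = np.polyfit(x, y, 1)
print(round(w, 2), round(b, 2))  # roughly 4.9 and 50.61
print(round(w * 7 + b, 1))       # prediction for 7 hours, roughly 84.9
```

The numbers should match sklearn's `model.coef_[0]` and `model.intercept_` to within floating-point noise.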
Vocabulary
- Model: the equation that turns an input into a prediction (here, the line ŷ = wx + b).
- Parameters: the specific numbers the model learns; for this model, w and b.
- Weight (w): the slope. How much the prediction changes per unit increase in the input.
- Bias (b): the intercept. The prediction when x = 0.
- ŷ ("y-hat"): the model's predicted output, as opposed to the true output y.
- Vectorization: applying an operation to every element of an array at once.
In Colab:
- Make a small dataset by hand: 10 rows, two columns, `hours_studied` and `exam_score`. Make up plausible numbers.
- Plot it as a scatter chart.
- Fit a `LinearRegression` from sklearn.
- Print the learned w and b.
- Predict scores for hypothetical study times: 0 hours, 5 hours, 20 hours.
- Stretch: add a row of bizarre data (say, 5 hours studied with a score of 5; or 15 hours with a score of 40). Refit the model. How much did w and b change? Why?
Today you drew a line through dots. Next, we make "best fit" precise. We define what it means for one line to be better than another, and derive the most important formula in introductory ML: Mean Squared Error.