Phase 1 · Session 03 · 45 min

Data Is Everything

Big idea

ML doesn't have opinions, it has data. If the data is biased, broken, or unrepresentative, the model's predictions will be too. "Garbage in, garbage out" isn't a slogan — it's the most common cause of ML failure in the real world.

By the end, you'll be able to
  • Explain "garbage in, garbage out" with a real example
  • Use pandas to inspect a dataset for problems
  • Identify three kinds of bias that can sneak into a dataset

The Amazon hiring AI story

In 2014, Amazon started building an internal ML tool to screen resumes. They fed it ten years of resumes from hires and the people they'd rated highly. The model was supposed to learn what a "good" Amazon engineer's resume looked like and rank new applicants automatically.

By 2015, the team noticed something disturbing. The model was systematically penalizing resumes that contained the word "women's," as in "women's chess club captain" or "women's college." It was downgrading graduates of two all-women's colleges. It had taught itself to prefer male candidates.

Why? The training data. Amazon's tech workforce in the previous decade had been heavily male. The model learned the pattern: successful Amazon engineers have these features, so male-coded features are predictive of success. The model wasn't sexist on purpose. It was doing what it was asked to do: find the pattern in the historical data. The historical data was sexist.

Amazon scrapped the project in 2018.

This story isn't unusual. The same problem has appeared in dozens of other ML systems:

  • Facial recognition with much higher error rates on darker-skinned faces (the training data was mostly light-skinned faces).
  • Healthcare risk-prediction algorithms that systematically underestimated Black patients' medical needs (the proxy used was "past spending," and Black patients had historically received less care).
  • Mortgage approval models that perpetuated decades of redlining baked into historical lending data.

Garbage in, garbage out

The metaphor: imagine learning a language only by reading shampoo bottles. You'd technically be "fluent." You'd be very good at predicting what comes after the word "for" ("smoother, healthier hair"). Try to have an actual conversation, and you'd sound bizarre. The data squeezed your language ability into something weird and narrow.

ML models are exactly like this. Their entire understanding of the world is whatever was in their training data.

The principle: a model is the average of its training data. This is a slight oversimplification, but it's the right intuition. There's no magic. There's no "common sense" hiding in the algorithm. The model only knows what it saw.
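
To make that concrete, here's a toy sketch (made-up numbers, not part of the chapter's lab code). The simplest possible "model" is a single constant prediction. Score it with squared error, and the best constant you can pick turns out to be exactly the average of the training labels; one junk value in the training set drags the prediction with it.

import numpy as np

# The best constant prediction under squared error is the training mean.
y_train = np.array([3.0, 5.0, 5.0, 7.0, 100.0])   # one junk value included

candidates = np.linspace(0, 100, 10_001)            # constants to try
errors = [np.mean((y_train - c) ** 2) for c in candidates]
best = candidates[int(np.argmin(errors))]

print("Best constant prediction:", best)           # 24.0
print("Training mean:", y_train.mean())            # also 24.0, dragged up by the junk value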

Three kinds of bias

1. Historical bias — the data reflects an unfair world.

Amazon's hiring example. Even with perfect data collection, if the historical world was biased, the patterns the model learns will be biased. If you train a "successful CEO" predictor on Fortune 500 CEOs from the past 50 years, you'll get a model that thinks CEOs are mostly white men, because historically, they were. The model didn't invent the bias. It faithfully reproduced reality.

2. Sampling bias — your data isn't representative.

If you train a face-detection model on photos scraped from Western websites in 2010, the dataset will be heavily skewed toward light-skinned faces. The model will detect those faces well and others badly. This isn't because the algorithm "doesn't like" some faces. It's because the model never saw enough of them to learn what they look like.
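
A small simulation shows the effect. This is a toy sketch with made-up numbers (nothing to do with real face data): the pattern that predicts the label is different for an underrepresented group B, the training sample is 95% group A, and the resulting model does fine on A while roughly coin-flipping on B.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_group(n, shift):
    # Features for one group; the label rule depends on where that group's
    # features are centred, so group B's pattern differs from group A's.
    X = rng.normal(loc=shift, scale=1.0, size=(n, 2))
    y = (X[:, 0] + X[:, 1] > 2 * shift).astype(int)
    return X, y

# Training data: 950 examples from group A, only 50 from group B.
X_a, y_a = make_group(950, shift=0.0)
X_b, y_b = make_group(50, shift=3.0)
model = LogisticRegression(max_iter=1000).fit(np.vstack([X_a, X_b]),
                                              np.concatenate([y_a, y_b]))

# Evaluate on fresh, equal-sized test sets for each group.
for name, shift in [("group A", 0.0), ("group B", 3.0)]:
    X_test, y_test = make_group(2000, shift)
    print(name, "accuracy:", round(model.score(X_test, y_test), 2))

The model isn't "against" group B; it simply never saw enough of group B to learn its pattern.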

3. Label bias — the labels themselves are subjective or wrong.

Imagine labeling tweets as "toxic" or "not toxic." Different labelers will draw the line in different places. If your labelers happen to share a particular cultural background, the model will inherit their definition and apply it to everyone. This is why content moderation ML is so hard. There's no universal definition of "toxic" to label against.
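
You can at least measure the problem. Teams typically check inter-annotator agreement before trusting labels; here's a quick sketch with made-up labels, using Cohen's kappa (a standard agreement score that corrects for chance).

from sklearn.metrics import cohen_kappa_score

# Two labelers rate the same ten comments as toxic (1) or not (0).
labeler_1 = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
labeler_2 = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

raw_agreement = sum(a == b for a, b in zip(labeler_1, labeler_2)) / len(labeler_1)
print("Raw agreement:", raw_agreement)                               # 0.7
print("Cohen's kappa:", round(cohen_kappa_score(labeler_1, labeler_2), 2))  # 0.4, weak once chance is removed

If your labelers can't agree with each other, the model can't learn anything better than their disagreement.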

Why this matters now

ML systems are making real decisions about real people. Loan approvals. Resume screening. Court bail recommendations. School admissions. Medical diagnoses. Insurance pricing. Each has had documented cases of ML bias causing real harm.

The takeaway isn't "ML is bad." It's:

  • Models are tools. Like any tool, they reflect the assumptions and limitations of the people who built them.
  • The data is the foundation. A clever algorithm trained on bad data is worse than a simple algorithm trained on good data. Always (a toy demonstration follows this list).
  • "The algorithm decided" is not an excuse. Someone chose the data, the labels, the success metric, and the deployment. That someone is responsible.

Vocabulary

Bias (ML/societal sense): A systematic skew in a model's behavior, usually inherited from skewed data. Note that in Chapter 4, "bias" will appear in a totally different mathematical sense (the b in y = mx + b). Same word, different meaning. Don't get confused.
Representative data: A dataset that proportionally reflects the real-world population the model will be used on.

Inspect a dataset like a journalist

Open Colab. You'll load the California housing dataset (a real public dataset; the median income feature has known socioeconomic correlations) and inspect it for problems.

import pandas as pd
from sklearn.datasets import fetch_california_housing

# Load the dataset.
data = fetch_california_housing(as_frame=True)
df = data.frame

# Step 1: How big is it? What columns does it have?
print("Shape:", df.shape)         # (rows, columns)
print("Columns:", df.columns.tolist())
print()

# Step 2: Sample of the data.
print(df.head())
print()

# Step 3: Summary statistics. Look for weird ranges.
print(df.describe())

Things to notice in the output (you can verify each one with the quick check after this list):

  • MedInc (median income in the block) ranges from about 0.5 to 15. Those aren't dollars; they're tens of thousands of dollars. Knowing the units of your features matters.
  • HouseAge has a max of 52. That's a hard cap, suggesting older houses got binned together, which means you can't tell a 52-year-old house from a 100-year-old one. Information loss baked into the dataset.
  • Population has a max of 35,682 in a single block. That's huge. Probably an outlier. Could distort a model.
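
To verify these observations in the same notebook (the 10,000 population cutoff below is an arbitrary threshold, chosen just to count the extreme blocks):

# Quick checks on the observations above.
print("Rows at the HouseAge cap of 52:", (df['HouseAge'] == 52).sum())
print("Blocks with Population over 10,000:", (df['Population'] > 10_000).sum())
print("MedInc range:", df['MedInc'].min(), "to", df['MedInc'].max())
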
# Step 4: Check for missing data.
print(df.isnull().sum())

This dataset has no missing values. Most real-world datasets do. To see what missing data looks like, deliberately introduce some:

# Make a copy and corrupt 10% of the AveRooms column.
import numpy as np
df_messy = df.copy()
random_indices = np.random.choice(df_messy.index, size=int(len(df_messy) * 0.1), replace=False)
df_messy.loc[random_indices, 'AveRooms'] = np.nan

print("Missing values per column after corruption:")
print(df_messy.isnull().sum())
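
What you do about missing values depends on the column and the question; the chapter doesn't cover fixes, but two common options are filling the gaps with a robust statistic like the median, or dropping the affected rows. A sketch, using the corrupted copy from above:

# Two common ways to handle the gaps (which is right depends on the data):
median_rooms = df_messy['AveRooms'].median()
df_filled = df_messy.fillna({'AveRooms': median_rooms})    # fill with the median
df_dropped = df_messy.dropna(subset=['AveRooms'])          # or drop those rows

print("Rows after dropping:", len(df_dropped), "of", len(df_messy))
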
# Step 5: Look for a possible proxy for protected attributes.
# This dataset doesn't have race or gender, but it has Latitude and Longitude.
# Location correlates strongly with race in California.
# A model that uses Latitude/Longitude is implicitly using race.

import matplotlib.pyplot as plt
plt.scatter(df['Longitude'], df['Latitude'], c=df['MedHouseVal'],
            cmap='RdYlGn', s=2, alpha=0.5)
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('California: house value by location')
plt.colorbar(label='Median house value (×$100k)')
plt.show()

The output shows the shape of California, with green dots concentrated along the coast (high values) and red dots in the central valley (low values). This map looks just like a map of California's racial and economic geography. A model trained on this data will use location to predict price, which means it'll inherit those patterns. Removing the explicit race column wouldn't help; latitude and longitude carry the same information.
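
One rough way to put a number on that claim (a sketch using a model type the course hasn't introduced yet; treat it as a black box for now): fit a model on nothing but Latitude and Longitude and see how much of the house-value variation it explains on held-out data.

# Rough proxy check: how well does location *alone* predict house value?
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X = df[['Latitude', 'Longitude']]
y = df['MedHouseVal']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

location_only = RandomForestRegressor(n_estimators=50, random_state=0)
location_only.fit(X_train, y_train)
print("R^2 using only Latitude/Longitude:", round(location_only.score(X_test, y_test), 2))

If location alone explains a large share of the variation, then every pattern correlated with location, including racial and economic segregation, is effectively available to the model, whether or not a "race" column exists.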

Next up: Chapter 4 — Drawing lines through dots

You've spent three chapters on ML at a high level. Next: your first real model. Linear regression — drawing a line through some dots. It's the foundation for almost everything that comes after. We derive the math from y = mx + b, write the code, and have a working predictor by the end of the chapter.

Data Is Everything · Lab (in development)