Phase 3 · Session 10 · 75 min

Mini-Project: Titanic

Big idea

Apply everything from Phases 1 to 3 to a real dataset. Titanic survival prediction is the famous Kaggle starter problem. It's small enough to fit in a notebook, complex enough to require thinking, and has been done by hundreds of thousands of people.

This chapter is mostly code. No new concepts.

By the end, you'll be able to:
  • Load a real CSV, clean it, and split it
  • Train and compare two or more classification models
  • Evaluate accuracy on a held-out test set
  • Engineer one or two features that help

The Titanic

In 1912, the RMS Titanic struck an iceberg and sank. Of the 2,224 passengers and crew, only 706 survived. Survival was not random. Women survived more than men. First-class passengers survived more than third-class. Children survived more than adults.

You'll train an ML model to predict, given basic information about a passenger, whether they survived. This is called the Titanic problem and it's the single most famous starter project in machine learning.

The dataset

The standard Kaggle Titanic dataset. About 900 rows. Columns:

  • Pclass: ticket class (1, 2, or 3)
  • Sex: male / female
  • Age: in years (some missing)
  • SibSp: number of siblings / spouses aboard
  • Parch: number of parents / children aboard
  • Fare: ticket price
  • Embarked: where they boarded (C, Q, S)
  • Survived: the label (0 or 1)

Full code walkthrough

Open Colab. Work through this end-to-end.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# 1. Load the data.
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)
print(df.shape)
df.head()

Look at the columns. Real-world data is messy: some values are missing, several columns are free text, and not everything will be useful for prediction.
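
One quick way to see the mess is to count the missing values per column. A quick check you can run here (a sketch, using the df loaded above):

# Count missing values in each column.
print(df.isnull().sum())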

# 2. Look at survival rates by group.
print("Overall survival rate:", df['Survived'].mean())
print()
print("Survival rate by sex:")
print(df.groupby('Sex')['Survived'].mean())
print()
print("Survival rate by class:")
print(df.groupby('Pclass')['Survived'].mean())

The output shows it clearly: women survived at ~74%, men at ~19%; first class at ~63%, third class at ~24%. The model has obvious patterns to find.
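
If you want to see both effects at once, a crosstab of survival rate by sex and class is an optional extra (a sketch, same data and idea as the group-bys above):

# Mean survival rate for each sex/class combination.
print(pd.crosstab(df['Sex'], df['Pclass'],
                  values=df['Survived'], aggfunc='mean').round(2))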

# 3. Visualize age distribution.
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
df[df['Survived']==1]['Age'].hist(ax=axes[0], bins=20, color='green', alpha=0.7)
axes[0].set_title('Age distribution: Survived')
df[df['Survived']==0]['Age'].hist(ax=axes[1], bins=20, color='red', alpha=0.7)
axes[1].set_title('Age distribution: Did not survive')
plt.show()

Children disproportionately survived. The "women and children first" policy is visible in the data.
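
If you want to put a number on it, here's a quick check (a sketch; the cutoff of 12 is an arbitrary choice for illustration, and rows with missing Age are simply excluded by the comparison):

# Compare survival for children under 12 with the overall rate.
children = df[df['Age'] < 12]
print(f"Children (<12) survival rate: {children['Survived'].mean():.3f}")
print(f"Overall survival rate:        {df['Survived'].mean():.3f}")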

# 4. Clean the data.
# - Age has missing values. Fill with median.
df['Age'] = df['Age'].fillna(df['Age'].median())

# - Embarked has 2 missing. Fill with most common.
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# - Drop columns we won't use (Name, Ticket, Cabin are messy).
df = df.drop(['Name', 'Ticket', 'Cabin', 'PassengerId'], axis=1)

# - Convert categorical to numeric.
df['Sex'] = (df['Sex'] == 'female').astype(int)         # 1 if female, 0 if male
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)  # one-hot encode

print("Cleaned:")
df.head()

# 5. Train/test split.
X = df.drop('Survived', axis=1).values.astype(float)
y = df['Survived'].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 6. Scale features.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 7. Train logistic regression.
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_scaled, y_train)
print(f"Logistic Regression train acc: {log_reg.score(X_train_scaled, y_train):.3f}")
print(f"Logistic Regression test acc:  {log_reg.score(X_test_scaled, y_test):.3f}")

# 8. Try random forest.
rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf.fit(X_train, y_train)   # random forest doesn't need scaled features
print(f"Random Forest train acc: {rf.score(X_train, y_train):.3f}")
print(f"Random Forest test acc:  {rf.score(X_test, y_test):.3f}")
Output
# 9. Inspect what logistic regression learned.
feature_names = df.drop('Survived', axis=1).columns.tolist()
for name, w in zip(feature_names, log_reg.coef_[0]):
    print(f"  {name:15s}: weight = {w:+.3f}")

What the weights mean:

  • Sex: large positive (being female strongly helps survival)
  • Pclass: large negative (higher class number = worse class = worse survival)
  • Age: small negative (older slightly worse)
  • Fare: small positive (paying more helps a little, even with Pclass already in the model)

The model is interpretable. Each weight tells a story.
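
One more way to read the weights, as a sketch: logistic regression works in log-odds, so exponentiating a weight gives the factor by which the odds of survival change when that (scaled) feature goes up by one standard deviation.

# e^(weight): >1 means the feature pushes toward survival, <1 against.
for name, w in zip(feature_names, log_reg.coef_[0]):
    print(f"  {name:15s}: odds multiplier = {np.exp(w):.2f}")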

Stretch — feature engineering

# Engineer new features.
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
df['IsAlone'] = (df['FamilySize'] == 1).astype(int)

# Re-do the train/test/scale/fit cycle.
X = df.drop('Survived', axis=1).values.astype(float)
y = df['Survived'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

log_reg2 = LogisticRegression(max_iter=1000)
log_reg2.fit(X_train_scaled, y_train)
print(f"With new features, test acc: {log_reg2.score(X_test_scaled, y_test):.3f}")

Often this nudges accuracy up by 1-2 points.
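
To complete the comparison, you can also refit the random forest on the engineered features. A sketch, reusing the new split and the same hyperparameters as step 8:

# Random forest on the engineered (unscaled) features.
rf2 = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf2.fit(X_train, y_train)
print(f"Random Forest with new features, test acc: {rf2.score(X_test, y_test):.3f}")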

Workflow summary

The general workflow:

  1. Explore. Load. Print shapes and columns. Group-by for survival rates by sex, class, age. Plot at least one histogram.
  2. Clean. Handle missing values. Convert categoricals (one-hot encode). Drop messy columns.
  3. Split. 80/20 train/test, with random_state=42.
  4. Train. Logistic regression first. Print train and test accuracy.
  5. Iterate. Try random forest. Engineer features. Tune max_depth for the random forest (a sketch follows this list).
  6. Reflect. What worked? What didn't? Why?
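
Here is a minimal sketch of the max_depth tuning from step 5, reusing the most recent split. Strictly you'd want a separate validation set for this, but with only a train/test split it still shows the pattern: too shallow underfits, too deep overfits.

# Try a few depths and watch the gap between train and test accuracy.
for depth in [2, 3, 5, 8, None]:
    model = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    print(f"max_depth={str(depth):4s}  "
          f"train acc: {model.score(X_train, y_train):.3f}  "
          f"test acc: {model.score(X_test, y_test):.3f}")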


Next up: Chapter 11 — Brains made of math

You've now built a real ML project end-to-end. Phase 4 starts next, and that's where things get cool. Neural networks — the technology behind every modern AI breakthrough. Linear regression is one neuron. The model behind ChatGPT stacks enormous numbers of them into layers, connected by hundreds of billions of weights. Same machinery, more layers.
