Phase 4 · Session 11 · 60 min

Brains Made of Math

Big idea

A neuron is a tiny decision-maker: weighted sum, bias, activation, output. One neuron is logistic regression. A neural network is many neurons wired into layers. You'll write the matrix math for a layer, build a 2-layer network from scratch in numpy, and see why depth lets networks solve problems linear models can't.

By the end, you'll be able to
  • Sketch a single neuron and label its parts
  • Write the matrix equation for a layer's forward pass
  • Implement a forward pass for a 2-layer network in numpy
  • Use TensorFlow Playground to build a network that solves a problem logistic regression can't

One neuron is logistic regression

Recall logistic regression from Chapter 8:

  ŷ = σ(w₁x₁ + w₂x₂ + ... + wₙxₙ + b)

Weighted sum, bias, squish through sigmoid. That structure is a neuron.

   x₁ ──w₁──┐
   x₂ ──w₂──┤
   x₃ ──w₃──┼──[sum + b]──[activation]──> output
   x₄ ──w₄──┘

Each input × its weight, all summed, plus a bias, through an activation function.
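To make that concrete, here's a minimal numpy sketch of one neuron. The inputs, weights, and bias are made-up numbers; the activation is sigmoid, as in logistic regression:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0, 0.1])   # four inputs
w = np.array([0.8, 0.2, -0.5, 1.5])   # one weight per input
b = 0.1                               # bias

z = w @ x + b        # weighted sum plus bias
print(sigmoid(z))    # one output, squashed to (0, 1)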

Generalization 1 — any activation function. The sigmoid was chosen for logistic regression because we wanted the output to be a probability. In a neural network's hidden layers, you don't need a probability; you just need some non-linearity. The most common modern choice is ReLU ("rectified linear unit"):

  ReLU(z) = max(0, z)

If z is positive, output z. If z is negative, output 0. Dead simple.

Why ReLU instead of sigmoid? Sigmoid's gradient vanishes for big |z| (you saw this in Chapter 8). ReLU's gradient is 1 for positive z (no vanishing) and 0 for negative z. Faster training, better convergence. Modern deep networks use ReLU almost everywhere except the final layer (where you might still use sigmoid for binary classification, or softmax for multi-class).
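You can check that claim numerically. A small sketch (sigmoid's derivative is σ(z)(1 − σ(z)); ReLU's is 1 for positive z, 0 for negative z):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

for z in [0.5, 2.0, 5.0, 10.0]:
    sig_grad = sigmoid(z) * (1 - sigmoid(z))  # shrinks toward 0 as |z| grows
    relu_grad = 1.0 if z > 0 else 0.0         # constant for positive z
    print(f"z={z:5.1f}  sigmoid grad={sig_grad:.6f}  ReLU grad={relu_grad}")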

Other activations exist (tanh, LeakyReLU, GELU, Swish), but ReLU and sigmoid cover most of what you need to know.

Generalization 2 — a neuron's output isn't always a probability. With ReLU, intermediate neurons' outputs can be any non-negative number. That's fine. They're just internal computations, not final predictions.

So a neuron is:

  output = activation(w·x + b)

Where activation is whatever function you pick (ReLU, sigmoid, tanh, etc.).

A layer is a set of parallel neurons

A layer is a group of neurons that all take the same inputs but compute different outputs (because they have different weights and biases).

Suppose your input has 3 features and you want a layer of 4 neurons. Each of the 4 neurons takes all 3 inputs, has its own 3 weights and 1 bias, applies activation, outputs one number. The layer outputs 4 numbers total.

Writing this neuron by neuron is tedious. Matrix notation packs it up.

The math. Let:

  • x be the input vector (length n_in)
  • W be a weight matrix of shape (n_out, n_in), where row j is the j-th neuron's weight vector
  • b be a bias vector of length n_out

Then the layer's pre-activation values are:

  z = Wx + b

And the layer's output is:

  a = activation(z)

where the activation is applied element-wise. If you have many input examples stacked into a matrix X of shape (N, n_in), the whole batch goes through at once:

  Z = XWᵀ + b
  A = activation(Z)

(Notation note: ᵀ is "transpose," flipping rows and columns. Some textbooks orient W differently to avoid the transpose. Don't worry about which convention; what matters is "the matrix multiplication that combines inputs with weights.")
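To watch the shapes at work, here's a sketch of the 4-neuron layer on 3-feature inputs from above, with random numbers standing in for learned weights:

import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, N = 3, 4, 5                # 3 features, 4 neurons, batch of 5

W = rng.standard_normal((n_out, n_in))  # row j = neuron j's weights
b = rng.standard_normal(n_out)          # one bias per neuron
X = rng.standard_normal((N, n_in))      # 5 examples, 3 features each

Z = X @ W.T + b            # the transpose in action; shape (N, n_out)
A = np.maximum(0, Z)       # ReLU applied element-wise
print(Z.shape, A.shape)    # (5, 4) (5, 4)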

Stacking layers

Take the output of one layer and feed it as the input to the next:

input  →  layer 1  →  layer 2  →  ...  →  layer L  →  output
(features) (n₁ neurons) (n₂ neurons)        (final layer)

Each layer has its own weight matrix and bias vector. The math:

  a⁽¹⁾ = activation(W⁽¹⁾ x + b⁽¹⁾)
  a⁽²⁾ = activation(W⁽²⁾ a⁽¹⁾ + b⁽²⁾)
  ...
  output = activation(W⁽ᴸ⁾ a⁽ᴸ⁻¹⁾ + b⁽ᴸ⁾)

Layers between input and output are hidden layers. They're not directly visible from outside. The whole stack is a neural network (or multi-layer perceptron, MLP).
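In code, stacking is just a loop. A minimal sketch, assuming the parameters are kept as a list of (W, b) pairs with each W oriented (n_in, n_out) so no transpose is needed (forward_all is a made-up name):

import numpy as np

def forward_all(X, layers, activation=lambda z: np.maximum(0, z)):
    """Feed a batch X through each (W, b) pair in turn."""
    a = X
    for W, b in layers:
        a = activation(a @ W + b)  # this layer's output is the next one's input
    return a                       # (in practice the final layer often gets a
                                   #  different activation, e.g. sigmoid)

# Example: 2 features -> 4 hidden neurons -> 1 output, random parameters.
rng = np.random.default_rng(1)
layers = [(rng.standard_normal((2, 4)), np.zeros(4)),
          (rng.standard_normal((4, 1)), np.zeros(1))]
print(forward_all(rng.standard_normal((10, 2)), layers).shape)  # (10, 1)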

Why depth helps — the Universal Approximation Theorem

Here's the magical fact. A network with even one hidden layer can approximate any continuous function on a bounded input region as closely as you like, given enough neurons. This is the Universal Approximation Theorem. It says that no matter how complex the relationship between inputs and outputs, there's a neural network that approximates it.

The intuition: each neuron can split the input space into two halves with a line. Combine many neurons in a layer, and you can carve the space into many regions. Stack layers, and each layer can build more complex shapes from the previous layer's regions. With enough layers and neurons, you can represent any pattern.

But: the theorem says possible, not easy. Training a network to actually find the right weights for a complex function is hard. You use gradient descent (with backpropagation, Chapter 13). The math is the same as before; just more parameters.

Empirical observation: deep is often better than wide. A network with 5 layers of 100 neurons typically outperforms a network with 1 layer of 500 neurons on hard problems, even with a comparable total parameter count. Reason: deep networks compose features. Each layer builds on the previous one's abstractions. You'll see this in Chapter 12.
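A back-of-the-envelope check of that comparison, assuming 100 input features and a single output (each fully connected layer has n_in × n_out weights plus n_out biases):

def dense_params(n_in, n_out):
    # Weights (n_in * n_out) plus one bias per output neuron.
    return n_in * n_out + n_out

# Deep: 100 inputs -> five layers of 100 -> 1 output.
deep = dense_params(100, 100) * 5 + dense_params(100, 1)
# Wide: 100 inputs -> one layer of 500 -> 1 output.
wide = dense_params(100, 500) + dense_params(500, 1)
print(deep, wide)  # 50601 vs. 51001: comparable budgets, very different shapes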

Build a forward pass from scratch

Open Colab. Build a 2-layer network: input → hidden layer (ReLU) → output (sigmoid).

import numpy as np
import matplotlib.pyplot as plt

# Activation functions.
def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Forward pass for a 2-layer network.
def forward(X, W1, b1, W2, b2):
    """
    X:  input matrix, shape (N, n_features)
    W1: hidden layer weights, shape (n_features, n_hidden)
    b1: hidden layer biases, shape (n_hidden,)
    W2: output layer weights, shape (n_hidden, 1)
    b2: output layer bias, shape (1,)
    Returns: predicted probabilities, shape (N, 1)

    Note: here each W is stored (n_in, n_out), the opposite orientation
    from the math above, so X @ W needs no transpose. Same computation.
    """
    z1 = X @ W1 + b1          # pre-activations of hidden layer
    a1 = relu(z1)             # hidden layer outputs
    z2 = a1 @ W2 + b2         # pre-activations of output
    a2 = sigmoid(z2)          # output probabilities
    return a2

# Try it with random weights on the spiral data.
np.random.seed(42)
n_per_class = 100

# Generate two interlocking spirals.
def make_spiral(n, classes=2, noise=0.2):
    X = np.zeros((n*classes, 2))
    y = np.zeros(n*classes, dtype=int)
    for c in range(classes):
        ix = range(n*c, n*(c+1))
        r = np.linspace(0.0, 1, n)
        t = np.linspace(c*4, (c+1)*4, n) + np.random.randn(n)*noise
        X[ix] = np.c_[r*np.sin(t), r*np.cos(t)]
        y[ix] = c
    return X, y

X, y = make_spiral(n_per_class, 2)

plt.scatter(X[y==0, 0], X[y==0, 1], color='orange', label='class 0')
plt.scatter(X[y==1, 0], X[y==1, 1], color='blue', label='class 1')
plt.title('Two-spiral dataset')
plt.legend(); plt.show()

The plot shows two interlocking spirals. No straight line can separate them.
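To confirm that, here's a quick baseline (a sketch using scikit-learn, which is preinstalled in Colab; logistic regression is a single sigmoid neuron, so it can only draw a straight boundary):

from sklearn.linear_model import LogisticRegression

baseline = LogisticRegression().fit(X, y)
print("Logistic regression accuracy:", baseline.score(X, y))
# Expect something close to coin-flip: a straight line can't follow the spirals.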

# Random weights for a 2-layer network: 2 inputs → 16 hidden → 1 output.
n_hidden = 16
W1 = np.random.randn(2, n_hidden) * 0.5
b1 = np.zeros(n_hidden)
W2 = np.random.randn(n_hidden, 1) * 0.5
b2 = np.zeros(1)

predictions = forward(X, W1, b1, W2, b2)
print("Predictions shape:", predictions.shape)
print("First 5 predictions:", predictions[:5].flatten())

Random weights, so the predictions are garbage (around 0.5 for everything). But the structure works: data flows in, predictions come out. Now you just need to train it.

Training a neural network requires computing gradients, which requires backpropagation (Chapter 13). For now, use TensorFlow / Keras to handle backprop automatically, and revisit doing it from scratch later.

Train a 2-layer network with Keras

import tensorflow as tf
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(2,)),
    keras.layers.Dense(16, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train.
history = model.fit(X, y, epochs=200, verbose=0)

# Final accuracy.
loss, acc = model.evaluate(X, y, verbose=0)
print(f"Training accuracy: {acc:.3f}")

# Plot the loss.
plt.plot(history.history['loss'])
plt.xlabel('Epoch'); plt.ylabel('Loss')
plt.title('Training loss')
plt.show()

Output: training accuracy in the high 90s. Loss curve smoothly drops.

# Plot the decision boundary.
xx, yy = np.meshgrid(np.linspace(-1.5, 1.5, 200), np.linspace(-1.5, 1.5, 200))
grid = np.c_[xx.ravel(), yy.ravel()]
preds = model.predict(grid, verbose=0).reshape(xx.shape)

plt.contourf(xx, yy, preds, levels=20, cmap='RdBu_r', alpha=0.6)
plt.scatter(X[y==0, 0], X[y==0, 1], color='orange', label='class 0', edgecolor='k')
plt.scatter(X[y==1, 0], X[y==1, 1], color='blue', label='class 1', edgecolor='k')
plt.legend(); plt.title('Decision boundary: a 2-layer network on spirals'); plt.show()

A beautiful spiral-shaped decision boundary. That's a neural network solving a problem logistic regression cannot. Same gradient descent, just two layers instead of one.
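The trained model's parameters are exactly the W and b arrays from the forward-pass math. A quick sanity check (Keras's get_weights returns each layer's kernel and bias in order):

for i, p in enumerate(model.get_weights()):
    print(f"parameter {i}: shape {p.shape}")
# Expected: (2, 16), (16,), (16, 1), (1,). That's 65 numbers in total,
# the same W1, b1, W2, b2 structure as the numpy version above.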

Vocabulary

Neuron / unit / node: one small computing unit, a weighted sum + bias + activation.
Activation function: the non-linearity at the end of each neuron. Sigmoid, ReLU, tanh.
Layer: a group of neurons taking the same inputs. Mathematically, a matrix multiplication plus an activation.
Hidden layer: a layer between input and output.
Neural network: stacked layers.
ReLU: max(0, z). The standard hidden-layer activation.
Multi-layer perceptron (MLP): a standard feedforward neural network.
Activity: TensorFlow Playground · 30 min

Open playground.tensorflow.org. This tool runs real neural networks in your browser, with live visualization of every neuron.

  1. Start with the "two clusters" dataset. 0 hidden layers. The network is logistic regression. Solves instantly with a straight boundary.
  2. Switch to the "circle" dataset. 0 hidden layers. The network can't separate. Loss stays high.
  3. Add 1 hidden layer of 4 neurons. Click play. Boundary curves. Hover over hidden neurons to see what each one is "looking at."
  4. Switch to spiral. 1 hidden layer of 4. Struggles. Add another hidden layer. Better. Add more neurons. Better still.
  5. Compete with yourself: solve spiral with the fewest total neurons. (Possible with 2 layers of 6-8 neurons.)

The aha moment. Watching the hidden neurons activate. Each first-layer neuron is its own little classifier, drawing a line through the input space. The next layer combines lines into curves. The next combines curves into more complex shapes. The network is building features hierarchically, automatically, from gradient descent. Nobody told it what features to use.

Next up: Chapter 12 — Going deep

Today you saw that adding layers helps. Tomorrow we ask why. Each layer learns more abstract features built from the previous layer's features. Edges become shapes. Shapes become objects. We'll train an image classifier and visualize what each layer "sees."

Brains Made of Math Lab · in development