Topic 12: Multilayer Perceptrons
Introduction
Linear Models and Their Limitations
- Linear models, such as linear regression and binary classifiers, cannot represent certain functions:
- Example: Linear regression cannot represent quadratic functions.
- Linear classifiers cannot represent the XOR function.
Addressing Limitations with Feature Maps
- One approach is to define features or basis functions. For instance:
- Linear regression can represent cubic polynomials using the feature map \(\psi(x) = (1, x, x^2, x^3)\) (see the sketch after this list).
- This approach is unsatisfactory for two reasons:
- Feature Engineering: Requires pre-specifying features, which involves significant engineering work.
- High Dimensionality: May require an excessively large number of features for certain target functions. For example, the number of features needed to represent all cubic polynomials grows cubically with the number of input dimensions.
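As a small illustration of the feature-map idea (not from the original notes; the data and names are made up), the following NumPy sketch fits a noisy cubic with ordinary least squares by hand-designing \(\psi(x) = (1, x, x^2, x^3)\):

```python
import numpy as np

def psi(x):
    """Hand-designed cubic feature map: psi(x) = (1, x, x^2, x^3)."""
    return np.stack([np.ones_like(x), x, x**2, x**3], axis=1)

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=100)
y = 0.5 * x**3 - x + 1 + 0.1 * rng.standard_normal(100)  # noisy cubic targets

Psi = psi(x)                                  # design matrix, shape (100, 4)
w, *_ = np.linalg.lstsq(Psi, y, rcond=None)   # linear regression in feature space
print(np.round(w, 2))                         # roughly (1, -1, 0, 0.5)
```

The model is still linear in the weights; all of the nonlinearity is baked into the fixed feature map, which is exactly the engineering burden the next section tries to remove.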
Neural Networks as a Solution
- Instead of pre-defined features, we connect simple processing units into a neural network:
- Each unit computes a linear function followed by a nonlinearity.
- In aggregate, these units can compute complex nonlinear functions.
- These networks are historically called multilayer perceptrons.
Multilayer Perceptrons (MLPs)
Neuron-Like Processing Unit
The general form of a neuron-like processing unit is:
\[ a = \phi\left( \sum_j w_j x_j + b \right) \]
where:
- \(x_j\): Inputs to the unit.
- \(w_j\): Weights.
- \(b\): Bias.
- \(\phi\): Nonlinear activation function.
- \(a\): Unit's activation.
Examples of activation functions (sketched in code after this list):
- Linear Regression: \(\phi(z) = z\).
- Binary Linear Classifiers: \(\phi\) is a hard threshold at zero.
- Logistic Regression: \(\phi(z) = \sigma(z) = \frac{1}{1 + e^{-z}}\).
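A minimal NumPy sketch of such a unit, using the three activation functions listed above (the inputs and weights are arbitrary illustrative values, not from the notes):

```python
import numpy as np

def unit(x, w, b, phi):
    """Neuron-like unit: a = phi(sum_j w_j x_j + b)."""
    return phi(np.dot(w, x) + b)

linear   = lambda z: z                          # linear regression
step     = lambda z: (z > 0).astype(float)      # hard threshold at zero
logistic = lambda z: 1.0 / (1.0 + np.exp(-z))   # sigma(z)

x = np.array([1.0, -2.0, 0.5])
w = np.array([0.3, 0.1, -0.4])
b = 0.2
for name, phi in [("linear", linear), ("step", step), ("logistic", logistic)]:
    print(name, unit(x, w, b, phi))
```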
Neural Networks
- Neural networks combine multiple neuron-like units to perform computations.
- Feed-forward neural networks: Computations are arranged sequentially in a directed graph without cycles.
- Recurrent neural networks: The graph can have cycles, allowing feedback; these are more complex.
Structure of MLPs
- Layers: Consist of input, output, and hidden layers.
- Input layer: Takes input features.
- Output layer: Produces outputs, one unit for each value the network outputs (e.g., single output for regression, \(K\) outputs for \(K\)-class classification).
- Hidden layers: Intermediate layers whose computations are learned during training.
- The units in these layers are known as input units, output units, and hidden units, respectively.
- Depth: Number of layers.
- Width: Number of units in a layer.
- “Deep learning” refers to training neural nets with many layers.
- Fully connected: Each unit in one layer connects to every unit in the next layer.
MLP Computations
To mathematically describe the computations (for an MLP with one hidden layer):
\[ h_i = \phi^{(1)}\!\left( \sum_j w^{(1)}_{ij} x_j + b^{(1)}_i \right) \]
\[ y_k = \phi^{(2)}\!\left( \sum_i w^{(2)}_{ki} h_i + b^{(2)}_k \right) \]
Note that we distinguish \(\phi^{(1)}\) and \(\phi^{(2)}\) because different layers may have different activation functions.
Vectorized Form
By representing all activations and weights as vectors and matrices, the layer computations become:
\[ \mathbf{h} = \phi^{(1)}\!\left( \mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)} \right), \qquad \mathbf{y} = \phi^{(2)}\!\left( \mathbf{W}^{(2)} \mathbf{h} + \mathbf{b}^{(2)} \right) \]
For all training examples at once, the computations are stored in matrices whose rows correspond to examples:
\[ \mathbf{H} = \phi^{(1)}\!\left( \mathbf{X} \mathbf{W}^{(1)\top} + \mathbf{1}\,\mathbf{b}^{(1)\top} \right), \qquad \mathbf{Y} = \phi^{(2)}\!\left( \mathbf{H} \mathbf{W}^{(2)\top} + \mathbf{1}\,\mathbf{b}^{(2)\top} \right) \]
This vectorized form allows efficient computations over the entire dataset, often implemented using matrix operations in libraries like NumPy.
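A minimal NumPy sketch of this vectorized forward pass, assuming rows of \(\mathbf{X}\) are training examples and weight matrices have shape (units out, units in); the shapes and names here are illustrative, not prescribed by the notes:

```python
import numpy as np

def forward(X, W1, b1, W2, b2, phi1, phi2):
    """Two-layer MLP forward pass over a whole batch (rows of X = examples)."""
    H = phi1(X @ W1.T + b1)   # hidden activations, shape (N, M)
    Y = phi2(H @ W2.T + b2)   # outputs, shape (N, K)
    return Y

sigmoid  = lambda z: 1.0 / (1.0 + np.exp(-z))
identity = lambda z: z

rng = np.random.default_rng(0)
N, D, M, K = 5, 3, 4, 2                 # examples, inputs, hidden units, outputs
X  = rng.standard_normal((N, D))
W1 = rng.standard_normal((M, D)); b1 = np.zeros(M)
W2 = rng.standard_normal((K, M)); b2 = np.zeros(K)
print(forward(X, W1, b1, W2, b2, sigmoid, identity).shape)  # (5, 2)
```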
Feature Learning
Neural nets can be thought of as a way of learning nonlinear feature mappings. We hope to learn a feature representation where the data become linearly separable.
Expressive Power
Linear models have limited expressive power and cannot represent functions like \(\text{XOR}\). Whether MLPs share this limitation depends on the activation function.
Linear networks
Deep linear networks are not more powerful than shallow ones. Using the linear activation function \(\phi(x) = x\) (and ignoring biases for simplicity), the network function can be written as:
\[ \mathbf{y} = \mathbf{W}^{(L)} \cdots \mathbf{W}^{(2)} \mathbf{W}^{(1)} \mathbf{x} \]
This is equivalent to a single linear layer with weights:
\[ \tilde{\mathbf{W}} = \mathbf{W}^{(L)} \cdots \mathbf{W}^{(2)} \mathbf{W}^{(1)} \]
Thus, deep linear networks have the same power as a single linear layer, i.e., a linear model.
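A quick numerical check of this collapse (illustrative only): composing several linear layers gives exactly the same function as multiplying their weight matrices into one.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))   # layer 1 weights
W2 = rng.standard_normal((5, 4))   # layer 2 weights
W3 = rng.standard_normal((2, 5))   # layer 3 weights
x  = rng.standard_normal(3)

deep    = W3 @ (W2 @ (W1 @ x))     # three linear layers, no nonlinearity
W_tilde = W3 @ W2 @ W1             # the equivalent single linear layer
print(np.allclose(deep, W_tilde @ x))  # True
```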
Universality
Nonlinear activation functions make MLPs universal: even shallow networks (a single hidden layer) can approximate arbitrary functions, provided they have enough hidden units.
Demonstration for Binary Inputs:
- With \(D\) binary inputs taking values \(\pm 1\), any function of the inputs can be specified by a truth table with \(2^D\) rows.
- The network uses \(2^D\) hidden units, each with weights and a bias chosen so that it activates only for one particular input vector (one row of the table).
- Because exactly one hidden unit is active for any input, the output weights can simply read off the desired output for that row (a version of this construction is sketched in code below).
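A sketch of this construction in NumPy, assuming \(\pm 1\) inputs and hard-threshold hidden units (the helper names are made up; XOR is used as the example truth table):

```python
import numpy as np
from itertools import product

step = lambda z: (z > 0).astype(float)

def truth_table_net(D, table):
    """Shallow net with 2^D hidden units; `table` maps +/-1 input tuples to outputs."""
    patterns = np.array(list(product([-1.0, 1.0], repeat=D)))  # 2^D rows
    W1 = patterns                                # hidden unit i recognizes pattern i
    b1 = -(D - 0.5) * np.ones(len(patterns))     # unit fires only on an exact match
    w2 = np.array([table[tuple(p)] for p in patterns])  # output weight = table entry

    def f(x):
        h = step(W1 @ x + b1)   # exactly one hidden unit is active
        return w2 @ h           # read off the output for the matching row
    return f

# Example: XOR on +/-1 inputs (output 1 exactly when the inputs differ).
xor_table = {(-1.0, -1.0): 0.0, (-1.0, 1.0): 1.0,
             ( 1.0, -1.0): 1.0, ( 1.0,  1.0): 0.0}
f = truth_table_net(2, xor_table)
for pattern, target in xor_table.items():
    assert f(np.array(pattern)) == target
```

Each hidden unit's incoming weights equal one input pattern, so its pre-activation is \(D\) on a match and at most \(D - 2\) otherwise; the bias \(-(D - 0.5)\) therefore lets exactly one unit fire.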
Compactness vs. Universality:
- Universality comes at a price: the network may need to be exponentially large (in the number of inputs) to represent some functions.
- Compactness is desirable for:
- Efficient computation of predictions.
- Training networks that generalize from limited examples, avoiding mere memorization.
Soft thresholds
The step function (hard threshold) activation is useful for manual weight design but unsuitable for training with gradient descent due to zero derivatives almost everywhere. This applies to both linear classifiers and MLPs.
Solution:
- Replace the hard threshold with a soft threshold (e.g., logistic function).
- This allows learning weights using gradient descent.
Impact on Expressive Power:
- No loss of expressive power occurs.
- A hard threshold can be approximated arbitrarily well by a soft threshold: scaling the weights and bias by a large factor makes the logistic steeper and steeper (see the illustration below).
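A small numerical illustration (assuming the logistic function): scaling the pre-activation by an increasingly large factor makes \(\sigma(cz)\) approach the hard threshold at zero.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
step    = lambda z: (z > 0).astype(float)

z = np.linspace(-1, 1, 9)
for c in (1, 10, 50):                        # scale weights/bias by c
    print(c, np.round(sigmoid(c * z), 3))    # approaches 0/1 as c grows
print("hard threshold:", step(z))
```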
The power of depth
Deep networks can represent some functions much more compactly than shallow networks. For example, consider the parity function on \(D\) binary-valued inputs:
\[ f(x_1, \dots, x_D) = \begin{cases} 1 & \text{if } \sum_j x_j \text{ is odd} \\ 0 & \text{otherwise} \end{cases} \]
- A shallow network requires exponential size to represent the parity function.
- A deep network can compute it with a size that is linear in the number of inputs (see the sketch below).
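An illustrative sketch of one such deep construction (assuming hard-threshold units and {0, 1} inputs; this is one possible construction, not necessarily the one the notes have in mind): chain XOR gates, each built from three threshold units, so the total unit count grows linearly with the number of inputs.

```python
import numpy as np
from itertools import product

step = lambda z: (z > 0).astype(float)

def xor_gate(a, b):
    """XOR of two {0,1} values from three hard-threshold units."""
    h_or  = step(a + b - 0.5)           # OR
    h_and = step(a + b - 1.5)           # AND
    return step(h_or - h_and - 0.5)     # OR and not AND = XOR

def parity(x):
    """Running XOR over the inputs: one gate (one extra layer) per additional input."""
    acc = x[0]
    for xi in x[1:]:
        acc = xor_gate(acc, xi)
    return acc

D = 4
for bits in product([0.0, 1.0], repeat=D):
    x = np.array(bits)
    assert parity(x) == float(int(x.sum()) % 2)
print("parity of", D, "bits with", 3 * (D - 1), "threshold units")
```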