
Topic 13: Backpropagation

Introduction

  • Shallow vs. Deep Models: Shallow models compute predictions as a linear function of the inputs, while deeper models can represent a much broader class of functions.
  • Backpropagation:
    • Central algorithm for training neural networks.
    • Computes partial derivatives of the loss function with respect to network parameters.
    • Used in gradient descent, similar to linear and logistic regression.

Relation to Chain Rule:

  • Backpropagation applies the Chain Rule from multivariate calculus with additional twists and challenges.

The Chain Rule revisited

Model and Loss Function

For univariate inputs and a single training example \((x, t)\):

\[ \begin{aligned} z & = wx + b\\ y &= \sigma(z) \\ \mathcal{L} &= \frac{1}{2}(y - t)^2 \end{aligned} \]

Regularizer

A regularizer encourages simpler explanations by penalizing large weights:

\[ \begin{aligned} \mathcal{R} & = \frac{1}{2}w^2 \\ \mathcal{L}_{\text{reg}} &= \mathcal{L} + \lambda \mathcal{R} \end{aligned} \]

\(\lambda\) is a hyperparameter; the larger it is, the more strongly the weights prefer to be close to zero.
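
To make the setup concrete, here is the forward computation for a single training example as a minimal NumPy sketch; the numeric values and the choice of the logistic function for \(\sigma\) are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed example values, chosen only for illustration
x, t = 1.5, 0.0           # input and target
w, b = 0.8, -0.3          # parameters
lam = 0.1                 # regularization strength (lambda)

z = w * x + b             # linear part
y = sigmoid(z)            # prediction
L = 0.5 * (y - t) ** 2    # squared-error loss
R = 0.5 * w ** 2          # regularizer
L_reg = L + lam * R       # regularized loss
```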

Goal

To perform gradient descent, compute the partial derivatives \(\partial \mathcal{L}_{\text{reg}} / \partial w\) and \(\partial \mathcal{L}_{\text{reg}} / \partial b\).

How you would have done it in calculus class

Recall that a partial derivative is computed just like a univariate derivative, treating the other variables as constants. The cost function can be expanded in terms of \(w\) and \(b\), and its derivatives computed by repeated application of the univariate Chain Rule.

\[ \mathcal{L}_{\text{reg}} = \frac{1}{2} \left( \sigma(wx + b) - t \right)^2 + \frac{\lambda}{2} w^2 \]

Derivatives

Using calculus, the derivatives are computed as follows:

\[ \begin{aligned} \frac{\partial \mathcal{L}_{\text{reg}}}{\partial w} &= \frac{\partial}{\partial w} \left[ \frac{1}{2} \left( \sigma(wx + b) - t \right)^2 + \frac{\lambda}{2} w^2 \right] \\ &= \frac{1}{2} \frac{\partial}{\partial w} \left( \sigma(wx + b) - t \right)^2 + \frac{\lambda}{2} \frac{\partial}{\partial w} w^2 \\ &= \left( \sigma(wx + b) - t \right) \frac{\partial}{\partial w} (\sigma(wx + b)-t) + \lambda w \\ &= \left( \sigma(wx + b) - t \right) \sigma'(wx + b) \frac{\partial}{\partial w} (wx + b) + \lambda w \\ &= \left( \sigma(wx + b) - t \right) \sigma'(wx + b) x + \lambda w \\ \frac{\partial \mathcal{L}_{\text{reg}}}{\partial b} &= \frac{\partial}{\partial b} \left[ \frac{1}{2} \left( \sigma(wx + b) - t \right)^2 + \frac{\lambda}{2} w^2 \right] \\ &= \frac{1}{2} \frac{\partial}{\partial b} \left( \sigma(wx + b) - t \right)^2 + \frac{\lambda}{2} \frac{\partial}{\partial b} w^2 \\ &= \left( \sigma(wx + b) - t \right) \frac{\partial}{\partial b} (\sigma(wx + b)-t) + 0 \\ &= \left( \sigma(wx + b) - t \right) \sigma'(wx + b) \frac{\partial}{\partial b} (wx + b) \\ &= \left( \sigma(wx + b) - t \right) \sigma'(wx + b) \end{aligned} \]
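
These closed-form expressions can be checked numerically. The sketch below assumes the logistic sigmoid, so that \(\sigma'(z) = \sigma(z)(1 - \sigma(z))\), and compares each derivative against a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_reg(w, b, x, t, lam):
    return 0.5 * (sigmoid(w * x + b) - t) ** 2 + 0.5 * lam * w ** 2

x, t, w, b, lam = 1.5, 0.0, 0.8, -0.3, 0.1   # assumed example values

# Closed-form derivatives from the derivation above
s = sigmoid(w * x + b)
dw = (s - t) * s * (1 - s) * x + lam * w
db = (s - t) * s * (1 - s)

# Finite-difference check
eps = 1e-6
num_dw = (loss_reg(w + eps, b, x, t, lam) - loss_reg(w - eps, b, x, t, lam)) / (2 * eps)
num_db = (loss_reg(w, b + eps, x, t, lam) - loss_reg(w, b - eps, x, t, lam)) / (2 * eps)
print(np.isclose(dw, num_dw), np.isclose(db, num_db))  # True True
```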

Observations

  1. The calculations are very cumbersome, requiring many terms to be copied from one line to the next, which increases the likelihood of mistakes. For realistic neural networks, this process becomes impractical.
  2. Many expressions are redundant. For example, the first three steps in both derivations are nearly identical.
  3. Repeated calculations occur, such as \(wx + b\) and \(\sigma(wx + b) - t\), which are computed multiple times. This inefficiency can be reduced by factoring out repeated expressions.

The idea behind backpropagation is to share these repeated computations wherever possible, making the calculations clean and modular.

Multivariable chain rule: the easy case

The univariate Chain Rule states:

\[ \frac{d}{dt} f(g(t)) = f'(g(t)) g'(t). \]

An infinitesimal change in \(t\) causes a change in \(g(t)\), which in turn causes a change in \(f(g(t))\). The rule extends to functions of several variables, such as \(f(x(t), y(t))\): when \(t\) changes slightly, both \(x\) and \(y\) change, and both contribute to the change in \(f\).

The multivariable Chain Rule is:

\[ \frac{d}{dt} f(x(t), y(t)) = \frac{\partial f}{\partial x} \frac{dx}{dt} + \frac{\partial f}{\partial y} \frac{dy}{dt}. \]

This rule combines the individual contributions of \(x\) and \(y\) to the total rate of change in \(f\).
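
A quick numerical sanity check of the rule, with assumed functions \(x(t) = t^2\), \(y(t) = \sin t\), and \(f(x, y) = xy\):

```python
import numpy as np

def f_of_t(t):
    # f(x(t), y(t)) with x(t) = t^2, y(t) = sin(t), and f(x, y) = x * y
    return (t ** 2) * np.sin(t)

t = 1.3
x, y = t ** 2, np.sin(t)

# Multivariable Chain Rule: df/dt = (df/dx)(dx/dt) + (df/dy)(dy/dt)
df_dx, df_dy = y, x              # partials of f(x, y) = x * y
dx_dt, dy_dt = 2 * t, np.cos(t)
chain = df_dx * dx_dt + df_dy * dy_dt

# Finite-difference comparison
eps = 1e-6
numeric = (f_of_t(t + eps) - f_of_t(t - eps)) / (2 * eps)
print(np.isclose(chain, numeric))  # True
```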

An alternative notation

To simplify derivative calculations, an alternative notation is introduced. Instead of writing derivatives explicitly, we define:

\[ \bar{v} \triangleq \frac{\partial \mathcal{L}}{\partial v}. \]

This notation emphasizes that \(\bar{v}\) represents a value we compute, rather than a purely mathematical expression. While nonstandard, it reduces clutter in equations.

Using this notation, the multivariable Chain Rule can be rewritten as:

\[ \bar{t} = \bar{x} \frac{dx}{dt} + \bar{y} \frac{dy}{dt}. \]

Here, \(dx/dt\) and \(dy/dt\) are evaluated algebraically to determine the formula for \(\bar{t}\), while \(\bar{x}\) and \(\bar{y}\) are values computed earlier in the process.

Using the computation graph

This section introduces backpropagation, also known as reverse mode automatic differentiation (autodiff).

Example Setup

For clarity, we revisit the example:

\[ z = wx + b, \quad y = \sigma(z), \quad \mathcal{L} = \frac{1}{2} (y - t)^2, \quad \mathcal{R} = \frac{1}{2} w^2, \quad \mathcal{L}_{\text{reg}} = \mathcal{L} + \lambda \mathcal{R}. \]

Computation Graph

The computation graph represents all computed values as nodes, with directed edges indicating dependencies between nodes. The goal of backpropagation is to compute derivatives \(\bar{w}\) and \(\bar{b}\) efficiently using the Chain Rule.

[Figure: computation graph for the regularized univariate example.]

To compute a derivative such as \(\bar{w}\), backpropagation starts with the result of the computation (here \(\mathcal{L}_{\text{reg}}\)) and works backward through the graph, hence the name.

Algorithm

Let \(v_1, \dots, v_N\) represent nodes in a topological order, where parents precede children. The algorithm consists of:

  1. Forward Pass: Compute all node values \(v_i\).
  2. Backward Pass: Compute derivatives \(\bar{v}_i\), starting with \(\bar{v}_N = 1\) (e.g., \(v_N = \mathcal{L}_{\text{reg}}\)).

The algorithm:

  • Forward: For \(i = 1, \dots, N\)
    • Compute \(v_i\) as a function of its parents \(\text{Pa}(v_i)\).
  • Backward: For \(i = N-1, \dots, 1\)
    • Compute:

      \[ \bar{v}_i = \sum_{j \in \text{Ch}(v_i)} \bar{v}_j \frac{\partial v_j}{\partial v_i} \]

Here, \(\text{Pa}(v_i)\) and \(\text{Ch}(v_i)\) denote the parents and children of \(v_i\), respectively.
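
As a sketch of how this translates into code: the Node class below and its fields are made up for illustration (this is not any particular autodiff library's API), but the two loops follow the algorithm directly. Accumulating each node's contribution into its parents is equivalent to summing over children as written above.

```python
class Node:
    def __init__(self, parents, forward_fn, local_grad_fns):
        self.parents = parents                # Pa(v): parent Nodes
        self.forward_fn = forward_fn          # computes v from parent values
        self.local_grad_fns = local_grad_fns  # one dv/dparent function per parent
        self.value = None
        self.bar = 0.0                        # accumulates dL/dv

def backprop(nodes):
    """`nodes` must be in topological order (parents before children)."""
    # Forward pass: compute every value from its parents.
    for v in nodes:
        v.value = v.forward_fn(*[p.value for p in v.parents])
        v.bar = 0.0
    # Backward pass: seed the output node, then scatter into parents.
    nodes[-1].bar = 1.0                       # v_N is the loss, so dL/dv_N = 1
    for v in reversed(nodes):
        parent_values = [p.value for p in v.parents]
        for parent, local_grad in zip(v.parents, v.local_grad_fns):
            # Each child v_j contributes bar(v_j) * dv_j/dv_i to bar(v_i).
            parent.bar += v.bar * local_grad(*parent_values)
    return nodes
```

Leaf nodes (inputs and parameters) have no parents and a forward function that simply returns their stored value; after the call, each parameter node's `bar` holds the desired derivative.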

Example Walkthrough

For the example:

\[ \begin{aligned} \bar{\mathcal{L}}_{\text{reg}} &= 1\\ \bar{\mathcal{R}} &= \bar{\mathcal{L}}_{\text{reg}} \frac{\partial \mathcal{L}_{\text{reg}}}{\partial \mathcal{R}} = \bar{\mathcal{L}}_{\text{reg}} \lambda \\ \bar{\mathcal{L}} &= \bar{\mathcal{L}}_{\text{reg}} \frac{\partial \mathcal{L}_{\text{reg}}}{\partial \mathcal{L}} = \bar{\mathcal{L}}_{\text{reg}} \\ \bar{y} &= \bar{\mathcal{L}} \frac{\partial \mathcal{L}}{\partial y} = \bar{\mathcal{L}} (y - t) \\ \bar{z} &= \bar{y} \frac{\partial y}{\partial z} = \bar{y} \sigma'(z) \\ \bar{w} &= \bar{z} \frac{\partial z}{\partial w} + \bar{\mathcal{R}} \frac{\partial \mathcal{R}}{\partial w} = \bar{z} x + \bar{\mathcal{R}} w \\ \bar{b} &= \bar{z} \frac{\partial z}{\partial b} = \bar{z} \end{aligned} \]

Final Computation

To summarize:

\[ \bar{\mathcal{L}}_{\text{reg}} = 1, \quad \bar{\mathcal{R}} = \bar{\mathcal{L}}_{\text{reg}}\lambda, \quad \bar{\mathcal{L}} = \bar{\mathcal{L}}_{\text{reg}}, \quad \bar{y} = \bar{\mathcal{L}}(y - t), \quad \bar{z} = \bar{y} \sigma'(z), \quad \bar{w} = \bar{z} x + \bar{\mathcal{R}} w, \quad \bar{b} = \bar{z} \]
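
Spelled out as code (a sketch that again assumes the logistic sigmoid and the example values used earlier), the backward pass reproduces the derivatives obtained the calculus-class way:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, t, w, b, lam = 1.5, 0.0, 0.8, -0.3, 0.1   # assumed example values

# Forward pass
z = w * x + b
y = sigmoid(z)

# Backward pass: one message per node, in reverse topological order
L_reg_bar = 1.0
R_bar = L_reg_bar * lam
L_bar = L_reg_bar
y_bar = L_bar * (y - t)
z_bar = y_bar * y * (1 - y)      # sigma'(z) = sigma(z) * (1 - sigma(z))
w_bar = z_bar * x + R_bar * w
b_bar = z_bar

# Agrees with the closed-form derivatives from the earlier derivation
s = sigmoid(w * x + b)
print(np.isclose(w_bar, (s - t) * s * (1 - s) * x + lam * w))  # True
print(np.isclose(b_bar, (s - t) * s * (1 - s)))                # True
```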

Advantages of Backpropagation

  1. No redundancy: Repeated computations are avoided.
  2. Modularity: The process breaks down into reusable components.
  3. Flexibility: Modifications to the loss function or other components only require localized changes.

This approach is significantly cleaner and more efficient than the naive method discussed earlier.

Backprop on a multilayer net

Backpropagation is now applied to a multilayer neural network to compute the loss derivatives.

Example Setup

We use a network with the following setup:

\[ \begin{aligned} z_i &= \sum_j w_{ij}^{(1)} x_j + b_i^{(1)} \\ h_i &= \sigma(z_i) \\ y_k &= \sum_i w_{ki}^{(2)} h_i + b_k^{(2)} \\ \mathcal{L} &= \frac{1}{2} \sum_k (y_k - t_k)^2 \end{aligned} \]

This represents a multilayer network with squared error loss for multiple output units.
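
Written directly from these indexed equations, the forward pass is a set of explicit sums. The sketch below assumes small layer sizes (3 inputs, 4 hidden units, 2 outputs) purely for illustration; the vectorized version appears later in this section.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x, t = rng.normal(size=3), rng.normal(size=2)      # input and targets
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)      # first-layer parameters
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)      # second-layer parameters

# Forward pass, written index-by-index to mirror the equations above
z = np.array([sum(W1[i, j] * x[j] for j in range(3)) + b1[i] for i in range(4)])
h = sigmoid(z)
y = np.array([sum(W2[k, i] * h[i] for i in range(4)) + b2[k] for k in range(2)])
L = 0.5 * sum((y[k] - t[k]) ** 2 for k in range(2))
```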

Computation Graph

The computation graph for this network can be drawn as before. When there are many nodes, it is clearer to work with vectorized computations, which greatly reduces the clutter.

[Figure: computation graph for the two-layer network.]

Derivation of Gradients

The backpropagation steps are:

  1. Starting from the loss:

    \[ \bar{\mathcal{L}} = 1, \quad \bar{y}_k = \bar{\mathcal{L}} (y_k - t_k) \]
  2. Gradients for the output layer parameters:

    \[ \bar{w}_{ki}^{(2)} = \bar{y}_k h_i, \quad \bar{b}_k^{(2)} = \bar{y}_k \]
  3. Propagation to hidden layer outputs:

    \[ \bar{h}_i = \sum_k \bar{y}_k w_{ki}^{(2)} \]
  4. Gradients for the hidden layer parameters:

    \[ \bar{z}_i = \bar{h}_i \sigma'(z_i), \quad \bar{w}_{ij}^{(1)} = \bar{z}_i x_j, \quad \bar{b}_i^{(1)} = \bar{z}_i \]

Key Focus

The derivation of \(\bar{h}_i\) is where the multivariable Chain Rule comes in: each hidden unit \(h_i\) feeds into every output \(y_k\), so its derivative sums one contribution per output. This is how the error signal propagates backward through the network.

Vectorized Form

For efficiency, the derivations can be rewritten in matrix form:

  • Forward Pass:

    \[ \begin{aligned} \mathbf{z} &= \mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)} \\ \mathbf{h} & = \sigma(\mathbf{z})\\ \mathbf{y} &= \mathbf{W}^{(2)} \mathbf{h} + \mathbf{b}^{(2)}\\ \mathcal{L} &= \frac{1}{2} \|\mathbf{t} - \mathbf{y}\|^2 \end{aligned} \]
  • Backward Pass:

    \[ \begin{aligned} \overline{\mathcal{L}} &= 1 \\ \overline{\mathbf{y}} &= \overline{\mathcal{L}} (\mathbf{y} - \mathbf{t}) \\ \overline{\mathbf{W}^{(2)}} &= \overline{\mathbf{y}} \mathbf{h}^\top \\ \overline{\mathbf{b}^{(2)}} &= \overline{\mathbf{y}} \\ \overline{\mathbf{h}} &= \mathbf{W}^{(2)\top} \overline{\mathbf{y}} \\ \overline{\mathbf{z}} &= \overline{\mathbf{h}} \circ \sigma'(\mathbf{z}) \\ \overline{\mathbf{W}^{(1)}} &= \overline{\mathbf{z}} \mathbf{x}^\top \\ \overline{\mathbf{b}^{(1)}} &= \overline{\mathbf{z}} \end{aligned} \]
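
These equations translate almost line for line into NumPy. The sketch below is a minimal implementation under assumed shapes (vectors stored as 1-D arrays, so the products \(\overline{\mathbf{y}}\,\mathbf{h}^\top\) and \(\overline{\mathbf{z}}\,\mathbf{x}^\top\) become `np.outer`), with a finite-difference check on one weight.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(W1, b1, W2, b2, x, t):
    z = W1 @ x + b1
    h = sigmoid(z)
    y = W2 @ h + b2
    L = 0.5 * np.sum((t - y) ** 2)
    return z, h, y, L

def backward(W1, b1, W2, b2, x, t):
    z, h, y, L = forward(W1, b1, W2, b2, x, t)
    y_bar = y - t                                  # L_bar = 1 folded in
    W2_bar = np.outer(y_bar, h)                    # y_bar h^T
    b2_bar = y_bar
    h_bar = W2.T @ y_bar
    z_bar = h_bar * sigmoid(z) * (1 - sigmoid(z))  # elementwise sigma'(z)
    W1_bar = np.outer(z_bar, x)
    b1_bar = z_bar
    return W1_bar, b1_bar, W2_bar, b2_bar

# Assumed sizes: 3 inputs, 4 hidden units, 2 outputs
rng = np.random.default_rng(0)
x, t = rng.normal(size=3), rng.normal(size=2)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
W1_bar, b1_bar, W2_bar, b2_bar = backward(W1, b1, W2, b2, x, t)

# Finite-difference check on a single entry of W1
eps = 1e-6
W1p, W1m = W1.copy(), W1.copy()
W1p[0, 0] += eps
W1m[0, 0] -= eps
numeric = (forward(W1p, b1, W2, b2, x, t)[-1] - forward(W1m, b1, W2, b2, x, t)[-1]) / (2 * eps)
print(np.isclose(W1_bar[0, 0], numeric))  # True
```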

Summary

Backpropagation for multilayer networks follows the same principles as for simpler networks. Using the computation graph and vectorized operations simplifies the process and avoids redundant calculations.