Topic 11: Linear Classifier¶
Overview¶
Binary Linear Classification¶
- classification: predict a discrete-valued target
- binary: predict a binary target \(t \in \{0, 1\}\)
- Training examples with \(t = 1\) are called positive examples, and training examples with \(t = 0\) are called negative examples.
- linear: model is a linear function of \(\mathbf{x}\), followed by a threshold:
\[ z = \mathbf{w}^T \mathbf{x} + b, \qquad y = \begin{cases} 1 & \text{if } z \geq r \\ 0 & \text{if } z < r \end{cases} \]
Some Simplifications¶
Eliminating the Threshold¶
- We can assume WLOG that the threshold \(r = 0\):
\[ \mathbf{w}^T \mathbf{x} + b \geq r \iff \mathbf{w}^T \mathbf{x} + \underbrace{b - r}_{\triangleq b'} \geq 0 \]
Eliminating the Bias¶
- Add a dummy feature \(x_0\) which always takes the value 1.
- The weight \(w_0=b\) is equivalent to a bias (same as linear regression).
Simplified Model¶
- Receive input \(\mathbf{x}\in \mathbb{R}^{D+1}\) with \(x_0=1\):
\[ z = \mathbf{w}^T \mathbf{x}, \qquad y = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{if } z < 0 \end{cases} \]
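As a minimal sketch (not part of the original notes), the simplified model can be written directly in NumPy; the `predict` function and the NOT-function weights below are illustrative assumptions:

```python
import numpy as np

def predict(w, x):
    """Binary linear classifier; x includes the dummy feature x_0 = 1."""
    z = w @ x                  # linear function of the input
    return 1 if z >= 0 else 0  # threshold at 0

# Illustrative weights implementing the NOT function on x_1
w = np.array([0.5, -1.0])
print(predict(w, np.array([1.0, 0.0])))  # x_1 = 0 -> 1
print(predict(w, np.array([1.0, 1.0])))  # x_1 = 1 -> 0
```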
Special Case of Neurons¶

The Geometric Picture¶

- Data space: training examples are represented as points.
- Weights (hypotheses) define half-spaces:
\[ H_+ = \{\mathbf{x} : \mathbf{w}^T \mathbf{x} \geq 0\}, \quad H_- = \{\mathbf{x} : \mathbf{w}^T \mathbf{x} < 0\} \]
- The boundaries of these half-spaces pass through the origin.
- Decision boundary:
\[ \{\mathbf{x} : \mathbf{w}^T \mathbf{x} = 0\} \]
- In 2-D, it's a line; in higher dimensions, it's a hyperplane.
- Linear separability:
- Data is linearly separable if training examples can be perfectly separated by a linear decision rule.

- Weight space: weights (hypotheses) \(\mathbf{w}\) are represented as points.
- Each training example \(\mathbf{x}\) defines a half-space that \(\mathbf{w}\) must lie in for correct classification: \(\mathbf{w}^T \mathbf{x} \geq 0\) if \(t = 1\), and \(\mathbf{w}^T \mathbf{x} < 0\) if \(t = 0\).
- For the NOT example:
- \(x_0 = 1, x_1 = 0, t = 1 \implies (w_0, w_1) \in \{\mathbf{w} : w_0 \geq 0\}\)
- \(x_0 = 1, x_1 = 1, t = 0 \implies (w_0, w_1) \in \{\mathbf{w} : w_0 + w_1 < 0\}\)
- The region satisfying all constraints is the feasible region:
- If this region is nonempty, the problem is feasible; otherwise, it is infeasible.
- The feasible set will always have a corner at the origin.
Perceptron Learning Rule¶
- If \(t = 1\) and \(z = \mathbf{w}^T \mathbf{x} > 0\):
- \(y = 1\), so no change is necessary.
- If \(t = 1\) and \(z < 0\):
- \(y = 0\), so we need to increase \(z\).
- Update:
\[ \mathbf{w}' \gets \mathbf{w} + \mathbf{x} \]
- Justification:
\[ \begin{aligned} \mathbf{w}'^T \mathbf{x} &= (\mathbf{w} + \mathbf{x})^T \mathbf{x} \\ &= \mathbf{w}^T \mathbf{x} + \mathbf{x}^T \mathbf{x} \\ &= \mathbf{w}^T \mathbf{x} + \|\mathbf{x}\|^2 \end{aligned} \]
- Symmetrically, if \(t = 0\) and \(z \geq 0\), we need to decrease \(z\), so we update \(\mathbf{w}' \gets \mathbf{w} - \mathbf{x}\).
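Below is a minimal NumPy sketch of the perceptron learning rule, assuming targets in \(\{0, 1\}\) and inputs that already include the dummy feature; the function name, epoch count, and NOT-function data are illustrative choices, not part of the original notes:

```python
import numpy as np

def perceptron(X, t, n_epochs=100):
    """Perceptron learning rule. X: (N, D+1) inputs with dummy feature, t: (N,) targets in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for x_i, t_i in zip(X, t):
            z = w @ x_i
            if t_i == 1 and z < 0:       # misclassified positive example: increase z
                w = w + x_i
            elif t_i == 0 and z >= 0:    # misclassified negative example: decrease z
                w = w - x_i
    return w

# NOT function: x = (1, x_1), t = NOT x_1
X = np.array([[1.0, 0.0], [1.0, 1.0]])
t = np.array([1, 0])
w = perceptron(X, t)
print(w, (X @ w >= 0).astype(int))  # predictions should match t
```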
The Limits of Linear Classifiers¶
Linear classifiers are limited in their ability to handle datasets that are not linearly separable. This means they cannot accurately classify data where classes cannot be divided by a straight line or hyperplane. For example, the XOR problem is a classic case where no linear boundary can separate the classes correctly. In such situations, linear classifiers fail to provide accurate predictions.
Convexity¶
A set \(S\) is convex if the line segment connecting any two points in \(S\) must lie within \(S\).
In the context of binary classification, there are two important sets that are always convex:
- In data space, the positive and negative regions are both convex. Both regions are half-spaces, and it should be visually obvious that half-spaces are convex. This implies that if inputs \(\mathbf{x}^{(1)},\cdots,\mathbf{x}^{(N)}\) are all in the positive region, then any weighted average must also be in the positive region. Similarly for the negative region.
- In weight space, the feasible region is convex. The rough mathematical argument is as follows. Each good region (the set of weights which correctly classify one data point) is convex because it’s a half-space. The feasible region is the intersection of all the good regions, so it must be convex because the intersection of convex sets is convex.
The XOR function is not linearly separable.
Since the positive region is convex, if we draw the line segment connecting the two positive examples \((0,1)\) and \((1,0)\), this entire line segment must be classified as positive. Similarly, if we draw the line segment connecting the two negative examples \((0, 0)\) and \((1, 1)\), the entire line segment must be classified as negative. But these two line segments intersect at \((0.5, 0.5)\), which means this point must be classified as both positive and negative, which is impossible. Therefore, XOR isn’t linearly separable.

Overcoming the Limitation of Linear Classifiers¶
- Feature Maps: Sometimes, we can overcome the limitation of linear classifiers using feature maps, similar to linear regression.
Example: XOR Problem¶
Given the feature map \(\phi_1(\mathbf{x}) = x_1\), \(\phi_2(\mathbf{x}) = x_2\), \(\phi_3(\mathbf{x}) = x_1 x_2\):
| \(x_1\) | \(x_2\) | \(\phi_1(\mathbf{x})\) | \(\phi_2(\mathbf{x})\) | \(\phi_3(\mathbf{x})\) | \(t\) |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 1 | 0 | 1 |
| 1 | 0 | 1 | 0 | 0 | 1 |
| 1 | 1 | 1 | 1 | 1 | 0 |
We can find a set of weights and a bias that correctly classify all the training examples; one such choice is \(w_1 = w_2 = 1\), \(w_3 = -2\), \(b = -0.5\) (a quick numerical check follows the list below).
- This transformed dataset is linearly separable.
- Limitation:
- This is not a general solution.
- It can be challenging to select effective basis functions.
- Solution: Use neural networks to learn nonlinear hypotheses directly.
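As a quick, illustrative check of the claim above (the weights are the example choice given earlier, not a unique solution):

```python
import numpy as np

# Check that the feature-mapped XOR data is linearly separable.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0, 1, 1, 0])

phi = np.column_stack([X[:, 0], X[:, 1], X[:, 0] * X[:, 1]])  # (phi_1, phi_2, phi_3)
w = np.array([1.0, 1.0, -2.0])   # one illustrative choice of weights
b = -0.5

y = (phi @ w + b >= 0).astype(int)
print(y)                      # [0 1 1 0]
print(np.array_equal(y, t))   # True: all four examples classified correctly
```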
Training a Classifier¶
- Define a model and a cost function, then minimize the cost using gradient descent.
- What to use as a loss function?
- We can’t just use the classification error itself, because the gradient is zero almost everywhere! Instead, we’ll define a surrogate loss function, i.e. an alternative loss function which is easier to optimize.
We are given a set of training examples \(\{(\mathbf{x}^{(i)},t^{(i)})\}^N_{i=1}\), where the \(\mathbf{x}^{(i)}\) are vector-valued inputs and the \(t^{(i)}\) are binary-valued targets. We would like to learn a binary linear classifier whose predictions match the targets as closely as possible.
How can we define a sensible learning criterion when the dataset isn’t linearly separable? One natural criterion is to minimize the number of misclassified training examples.
Attempt 1: 0-1 loss¶
- The 0-1 loss is \(\mathcal{L}_{0\text{-}1}(y, t) = \mathbb{I}[y \neq t]\). As always, the cost function is just the loss averaged over the training examples; in this case, that corresponds to the error rate, or fraction of misclassified examples.
- If we are on the boundary, the cost is discontinuous, which certainly isn’t any better.
- We certainly can’t optimize 0-1 loss with gradient descent.
Attempt 2: Linear Regression¶
- The model and loss are the same as in linear regression: \(y = \mathbf{w}^\top \mathbf{x} + b\) with \(\mathcal{L}_{SE}(y, t) = \frac{1}{2}(y - t)^2\). One obvious problem is that the predictions are real-valued rather than binary.
- But that’s OK, since we can just pick some scheme for binarizing them, such as thresholding at \(y = 1/2\). When we replace a loss function we trust with another one we trust less but which is easier to optimize, the replacement one is called a surrogate loss function.
- There’s still a problem:
- Correct prediction (\(y=1\)): Cost = \(0\)
- Wrong prediction (\(y=0\)): Cost = \(1/2\)
- Overconfident prediction (\(y=9\)): Cost = \(32\) (calculated as \(1/2(9-1)^2\))
- The key issue is that being overconfident (\(y=9\)) incurs a much higher cost (\(32\)) than being wrong (\(y=0\), cost of \(1/2\)). This means the learning algorithm will actively try to avoid making confident positive predictions, even when they might be appropriate, because the potential cost of being overconfident is so much higher than the cost of being conservative and predicting 0.
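A quick arithmetic check of these costs, assuming the squared error loss \(\frac{1}{2}(y - t)^2\) with \(t = 1\) (an illustration, not part of the original notes):

```python
# Squared error cost 0.5 * (y - t)^2 on a positive example (t = 1)
t = 1
for y in [1.0, 0.0, 9.0]:
    print(y, 0.5 * (y - t) ** 2)  # -> 0.0, 0.5, 32.0
```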
Attempt 3: Logistic Nonlinearity¶
The problem with linear regression is that the predictions were allowed to take arbitrary real values. But it makes no sense to predict anything smaller than 0 or larger than 1. Let’s fix this problem by applying a nonlinearity, or activation function, which squashes the predictions to be between 0 and 1. In particular, we’ll use something called the logistic function:
\[ \sigma(z) = \frac{1}{1 + e^{-z}} \]
This is a kind of sigmoidal, or S-shaped, function:

What’s important about this function is that it increases monotonically, with asymptotes at \(0\) and \(1\). (Plus, it’s smooth, so we can compute derivatives.)
We refine the model as follows:
\[ \begin{aligned} z &= \mathbf{w}^\top \mathbf{x} + b \\ y &= \sigma(z) \\ \mathcal{L}_{SE} &= \tfrac{1}{2} (y - t)^2 \end{aligned} \]
Notice that this model solves the problem we observed with linear regression. As the predictions get more and more confident on the correct answer, the loss continues to decrease.
To derive the gradient descent updates, we’ll need the partial derivatives of the cost function. We’ll do this by applying the Chain Rule twice: first to compute \(\frac{\partial \mathcal{L}_{SE}}{\partial z}\), and then again to compute \(\frac{\partial \mathcal{L}_{SE}}{\partial w_j}\). But first, let’s note the convenient fact that:
\[ \sigma'(z) = \sigma(z)\,(1 - \sigma(z)) \]
Now for the Chain Rule:
\[ \begin{aligned} \frac{\partial \mathcal{L}_{SE}}{\partial z} &= \frac{\partial \mathcal{L}_{SE}}{\partial y} \cdot \frac{\mathrm{d} y}{\mathrm{d} z} = (y - t)\, \sigma'(z) = (y - t)\, y (1 - y) \\ \frac{\partial \mathcal{L}_{SE}}{\partial w_j} &= \frac{\partial \mathcal{L}_{SE}}{\partial z} \cdot \frac{\partial z}{\partial w_j} = (y - t)\, y (1 - y)\, x_j \end{aligned} \]
- Problem with extreme misclassifications:
- When a training example is highly misclassified (e.g., predicting \(z = -5\) for \(t = 1\)), the gradient \(\frac{\partial \mathcal{L}_{SE}}{\partial z}\) becomes very small (\(\approx -0.0066\)).
- A smaller gradient means the update during training is minimal, even for large mistakes.
- This results in a weak gradient signal for badly misclassified examples.
- Issue with squared error loss in classification:
- Squared error loss doesn’t differentiate between bad and extremely bad predictions.
- Example: If \(t = 1\), predictions \(y = 0.01\) and \(y = 0.0001\) yield similar losses despite one being much worse.
- From a classification error perspective, these predictions are equivalent.
- From an optimization perspective, this is problematic since improving \(y\) from \(0.0001\) to \(0.01\) doesn't significantly affect the loss.
- A better loss function would provide feedback reflecting incremental improvements, helping the model learn more effectively.
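The weak gradient signal can be checked numerically; the sketch below (an illustration, with \(z = -5\) and \(t = 1\) as in the example above) evaluates \(\frac{\partial \mathcal{L}_{SE}}{\partial z} = (y - t)\, y (1 - y)\):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Gradient of the squared error loss w.r.t. z for a badly misclassified positive example
t = 1.0
z = -5.0
y = sigmoid(z)                   # ~ 0.0067
dL_dz = (y - t) * y * (1 - y)    # chain rule: (y - t) * sigma'(z)
print(y, dL_dz)                  # ~ 0.0067, ~ -0.0066: a very weak gradient signal
```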
Final Touch: Cross-entropy Loss¶
- Problem with squared-error loss:
- It treats predictions like \(y = 0.01\) and \(y = 0.00001\) as nearly equivalent for a positive example.
- This fails to distinguish between bad and extremely bad predictions.
- Cross-entropy loss (CE):
- Cross-entropy loss addresses this issue by making these predictions very different:
\[ \mathcal{L}_{CE}(y, t) = \begin{cases} -\log y & \text{if } t = 1 \\ -\log(1 - y) & \text{if } t = 0 \end{cases} \]
- For example, it assigns higher penalties for predictions closer to 0 when \(t = 1\).
- Rewriting cross-entropy loss:
- For simplicity, the loss can be rewritten as:
\[ \mathcal{L}_{CE}(y, t) = -t \log y - (1 - t) \log(1 - y) \]
- Key advantage:
- Cross-entropy provides a sizable gradient signal, even when predictions are far from correct.
- This ensures better optimization compared to squared-error loss, particularly for extreme misclassifications.
- Logistic Regression:
- Combining the logistic activation function with cross-entropy loss results in logistic regression:
\[ \begin{aligned} z &= \mathbf{w}^\top \mathbf{x} + b \\ y &= \sigma(z) \\ \mathcal{L}_{CE} &= -t \log y - (1 - t) \log (1 - y) \end{aligned} \]
- Derivative calculations:
- Mechanical approach for derivatives:
\[ \begin{aligned} \frac{\mathrm{d} \mathcal{L}_{CE}}{\mathrm{d} y} &= -\frac{t}{y} + \frac{1 - t}{1 - y} \\ \frac{\mathrm{d} \mathcal{L}_{CE}}{\mathrm{d} z} &= \frac{\mathrm{d} \mathcal{L}_{CE}}{\mathrm{d} y} \cdot \frac{\mathrm{d} y}{\mathrm{d} z} \\ &= \frac{\mathrm{d} \mathcal{L}_{CE}}{\mathrm{d} y} \cdot y(1 - y) \\ \frac{\partial \mathcal{L}_{CE}}{\partial w_j} &= \frac{\mathrm{d} \mathcal{L}_{CE}}{\mathrm{d} z} \cdot \frac{\partial z}{\partial w_j} \\ &= \frac{\mathrm{d} \mathcal{L}_{CE}}{\mathrm{d} z} \cdot x_j \end{aligned} \]
- Purpose:
- These derivative expressions are useful for implementing logistic regression in tools like NumPy.
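A minimal NumPy sketch of gradient descent on the cross-entropy loss using the mechanical derivatives above (an illustration, not a reference implementation; the toy NOT-function data, learning rate, and iteration count are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_gd(X, t, lr=0.1, n_iters=1000):
    """Gradient descent on the cross-entropy loss. X: (N, D) inputs, t: (N,) targets in {0, 1}."""
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(n_iters):
        z = X @ w + b
        y = sigmoid(z)
        dL_dy = -t / y + (1 - t) / (1 - y)   # dL_CE/dy
        dL_dz = dL_dy * y * (1 - y)          # chain rule through the logistic function
        w -= lr * (X.T @ dL_dz) / N          # dL_CE/dw_j = dL_CE/dz * x_j, averaged over examples
        b -= lr * dL_dz.mean()
    return w, b

# Toy example: learn the NOT function
X = np.array([[0.0], [1.0]])
t = np.array([1.0, 0.0])
w, b = logistic_regression_gd(X, t)
print(sigmoid(X @ w + b))   # predictions close to [1, 0]
```

Note that computing \(\frac{\mathrm{d} \mathcal{L}_{CE}}{\mathrm{d} y}\) explicitly can blow up when \(y\) is extremely close to 0 or 1, which motivates the numerically stable formulation in the next section.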
Logistic-cross-entropy Function¶
- Problem with numerical stability:
- When \(t = 1\) but \(z \ll 0\), \(y \approx 0\).
- Computing \(\log(0)\) in cross-entropy results in \(-\infty\).
- The gradient \(\frac{\mathrm{d} \mathcal{L}_{CE}}{\mathrm{d} y}\) becomes extremely large, causing numerical instability.
- Solution: Logistic-cross-entropy:
- Combine the logistic function and cross-entropy loss:
\[ \mathcal{L}_{LCE}(z, t) = \mathcal{L}_{CE}(\sigma(z), t) = t \log(1 + e^{-z}) + (1 - t) \log(1 + e^z) \]
- Implementation in NumPy:
- Use `np.logaddexp` for numerical stability: `E = t * np.logaddexp(0, -z) + (1 - t) * np.logaddexp(0, z)`
- Derivative calculation:
\[ \begin{aligned} \frac{\mathrm{d} \mathcal{L}_{LCE}}{\mathrm{d} z} &= \frac{\mathrm{d}}{\mathrm{d} z} \left[ t \log(1 + e^{-z}) + (1 - t) \log(1 + e^z) \right] \\ &= -t \cdot \frac{e^{-z}}{1 + e^{-z}} + (1 - t) \cdot \frac{e^z}{1 + e^z} \\ &= -t(1 - y) + (1 - t)y \\ &= y - t \end{aligned} \]
- Key observation:
- The derivative simplifies to \(y - t\), which matches the linear regression case, providing an intuitive interpretation:
- If \(y > t\), shift prediction negatively.
- If \(y < t\), shift prediction positively.
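A small sketch of the numerically stable loss and its gradient, assuming the `np.logaddexp` formulation above (the example logits and targets are illustrative):

```python
import numpy as np

def logistic_cross_entropy(z, t):
    """Numerically stable L_LCE(z, t) computed directly from the logits z."""
    return t * np.logaddexp(0, -z) + (1 - t) * np.logaddexp(0, z)

def logistic_cross_entropy_grad(z, t):
    """dL_LCE/dz = y - t, where y = sigma(z)."""
    y = 1.0 / (1.0 + np.exp(-z))
    return y - t

z = np.array([-5.0, 0.0, 5.0])
t = np.array([1.0, 1.0, 0.0])
print(logistic_cross_entropy(z, t))       # large loss for the badly misclassified first example
print(logistic_cross_entropy_grad(z, t))  # ~ [-0.993, -0.5, 0.993]: a sizable gradient signal
```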
Hinge Loss¶
- Hinge Loss:
- Hinge loss is defined as:
\[ \mathcal{L}_H(y, t) = \max(0, 1 - ty) \]
- Here, \(y\) is a real value and \(t \in \{-1, 1\}\).
- Provides an upper bound on 0-1 loss, meaning minimizing hinge loss minimizes classification errors.
- Support Vector Machine (SVM):
- A linear model using hinge loss:
\[ \begin{aligned} y &= \mathbf{w}^\top \mathbf{x} + b\\ \mathcal{L}_H &= \max(0, 1 - ty) \end{aligned} \]
- Key insights:
- Hinge loss emphasizes a margin of separation between classes.
- SVMs, though statistically motivated differently, behave similarly to logistic regression.
- Loss functions such as hinge and cross-entropy loss asymptotically behave similarly, despite differing smoothness.
- Practical note:
- Understanding various loss functions helps optimize machine learning models and derive efficient algorithms, like gradient descent updates for SVMs.
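A minimal sketch of gradient descent for the linear SVM objective above (an illustration; it uses a subgradient of the hinge loss where the loss is not differentiable, and the toy 1-D data, learning rate, and iteration count are assumptions):

```python
import numpy as np

def svm_gd(X, t, lr=0.1, n_iters=1000):
    """Gradient descent on the average hinge loss. X: (N, D), t: (N,) in {-1, +1}."""
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(n_iters):
        y = X @ w + b
        active = (t * y < 1).astype(float)   # examples inside the margin contribute a (sub)gradient
        dL_dy = -t * active                  # d/dy max(0, 1 - t*y) = -t when 1 - t*y > 0, else 0
        w -= lr * (X.T @ dL_dy) / N
        b -= lr * dL_dy.mean()
    return w, b

# Toy linearly separable data in 1-D
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
t = np.array([-1.0, -1.0, 1.0, 1.0])
w, b = svm_gd(X, t)
print(np.sign(X @ w + b))   # [-1, -1, 1, 1]
```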

Comparison of the loss functions considered so far
Multiclass classification¶
- Multiclass classification:
- Use a one-hot vector (also called one-of-K encoding) for targets:
\[ \mathbf{t} = \underbrace{(0, \dots, 0, 1, 0, \dots, 0)}_{\text{entry } k \text{ is 1}} \]
- Model design:
- Linear function:
\[ \mathbf{z} = \mathbf{W} \mathbf{x} + \mathbf{b} \]
where \(\mathbf{W}\) is a \(K \times D\) weight matrix, and \(\mathbf{b}\) is a \(K\)-dimensional bias vector.
- Activation function: Use the softmax function, a generalization of the logistic function:
\[ y_k = \text{softmax}(z_1, \dots, z_K)_k = \frac{e^{z_k}}{\sum_{k'} e^{z_{k'}}} \]
Outputs are nonnegative and sum to 1, interpretable as probabilities.
The inputs to the softmax are called the logits. Note that when one of the \(z_k\)’s is much larger than the others, the output of the softmax will be approximately the argmax, in the one-hot encoding. Hence, a more accurate name might be “soft-argmax.”
- Loss function:
- Generalized cross-entropy loss for multiple classes:
\[ \mathcal{L}_{CE}(\mathbf{y}, \mathbf{t}) = -\sum_{k=1}^K t_k \log y_k = -\mathbf{t}^\top \log \mathbf{y} \]
Here, \(\log \mathbf{y}\) represents the elementwise log. Note that only one of the \(t_k\)’s is \(1\) and the rest are \(0\), so the summation has the effect of picking the relevant entry of the vector \(\log \mathbf{y}\). Note that this loss function only makes sense for predictions which sum to \(1\); if you eliminate that constraint, you could trivially minimize the loss by making all the \(y_k\)’s large.
- Multiclass logistic regression (softmax regression):
- Model:
\[ \begin{aligned} \mathbf{z} &= \mathbf{W} \mathbf{x} + \mathbf{b} \\ \mathbf{y} &= \text{softmax}(\mathbf{z})\\ \mathcal{L}_{CE} &= -\mathbf{t}^\top \log \mathbf{y} \end{aligned} \]
- Softmax and Cross-Entropy:
- These two functions complement each other and are often combined into a single softmax-cross-entropy loss (\(\mathcal{L}_{SCE}\)) for numerical stability.
- Derivative of Softmax-Cross-Entropy:
- The derivative simplifies to:
\[ \frac{\partial \mathcal{L}_{SCE}}{\partial \mathbf{z}} = \mathbf{y} - \mathbf{t} \]
- Here:
- \(\mathbf{y}\) is the predicted probability vector from the softmax function.
- \(\mathbf{t}\) is the target one-hot encoded vector.
- Advantages:
- Softmax regression leverages this derivative for efficient optimization.
- It provides a robust and practical solution for multiclass classification problems.
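A minimal NumPy sketch of softmax regression using the \(\mathbf{y} - \mathbf{t}\) gradient above (an illustration; the toy three-class data, learning rate, and iteration count are assumptions):

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax with the max subtracted for numerical stability."""
    Z = Z - Z.max(axis=1, keepdims=True)
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

def softmax_regression_gd(X, T, K, lr=0.5, n_iters=1000):
    """Gradient descent on the softmax-cross-entropy loss.
    X: (N, D) inputs, T: (N, K) one-hot targets."""
    N, D = X.shape
    W, b = np.zeros((K, D)), np.zeros(K)
    for _ in range(n_iters):
        Y = softmax(X @ W.T + b)   # (N, K) predicted class probabilities
        dL_dZ = Y - T              # derivative of L_SCE w.r.t. the logits
        W -= lr * (dL_dZ.T @ X) / N
        b -= lr * dL_dZ.mean(axis=0)
    return W, b

# Toy 3-class problem with 2-D inputs: each point belongs to its own class
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
T = np.eye(3)
W, b = softmax_regression_gd(X, T, K=3)
print(softmax(X @ W.T + b).argmax(axis=1))  # [0, 1, 2]
```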
Convex Function¶
Convexity and Optimization¶
Convex Sets¶
A set \(S\) is convex if the line segment connecting any two points in \(S\) lies entirely within \(S\). Mathematically, for \(\mathbf{x}_0, \mathbf{x}_1 \in S\):
\[ \lambda \mathbf{x}_0 + (1 - \lambda) \mathbf{x}_1 \in S \quad \text{for all } 0 \leq \lambda \leq 1 \]
Convex Functions¶
A function \(f\) is convex if for any \(\mathbf{x}_0, \mathbf{x}_1\) in the domain of \(f\):
\[ f\big(\lambda \mathbf{x}_0 + (1 - \lambda) \mathbf{x}_1\big) \leq \lambda f(\mathbf{x}_0) + (1 - \lambda) f(\mathbf{x}_1) \quad \text{for all } 0 \leq \lambda \leq 1 \]
Graphical Interpretation¶

- The line segment connecting any two points on the graph of \(f\) must lie above the graph of \(f\).
- The set of points lying above the graph of \(f\) must form a convex set.
- Intuitively, convex functions are "bowl-shaped."
Importance of Convexity in Optimization¶
- Global Minima: All critical points of a convex function are global minima. Setting the derivatives to zero solves the optimization problem.
- Gradient Descent: For convex functions, gradient descent (with a suitable learning rate) converges to the global minimum.
Non-Convex Problems¶
- Loss functions such as squared error, hinge loss, and logistic regression objectives are convex, but other functions, like the 0–1 loss, are not.
- Convex optimization is a well-researched area, focusing on minimizing convex functions over convex sets.
- Many real-world problems, including training deep neural networks, are inherently non-convex, even though convex loss functions provide significant advantages in optimization.
Takeaways¶
While convexity is crucial for optimization, practical problems often involve non-convex functions. Nonetheless, convex loss functions remain advantageous for simplifying and solving optimization challenges.
Gradient Checking with Finite Differences¶
Definition of the Partial Derivative¶
The partial derivative of a function is defined as:
\[ \frac{\partial}{\partial x_i} f(\mathbf{x}) = \lim_{h \to 0} \frac{f(\mathbf{x} + h \mathbf{e}_i) - f(\mathbf{x})}{h} \]
where \(\mathbf{e}_i\) is the \(i\)-th standard basis vector.
Finite Differences Method¶
Derivatives can be checked numerically by substituting a small value of \(h\), such as \(10^{-10}\). This is known as the method of finite differences.
The two-sided definition of the partial derivative is preferred due to its higher accuracy:
\[ \frac{\partial}{\partial x_i} f(\mathbf{x}) = \lim_{h \to 0} \frac{f(\mathbf{x} + h \mathbf{e}_i) - f(\mathbf{x} - h \mathbf{e}_i)}{2h} \]
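A minimal finite-difference gradient check, here applied to the logistic-cross-entropy gradient derived earlier (an illustration; the step size \(h = 10^{-6}\), the random data, and the function names are assumptions):

```python
import numpy as np

def loss(w, X, t):
    """Average logistic-cross-entropy loss as a function of the weights."""
    z = X @ w
    return np.mean(t * np.logaddexp(0, -z) + (1 - t) * np.logaddexp(0, z))

def grad(w, X, t):
    """Analytic gradient: the average of (y - t) * x over the training examples."""
    y = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (y - t) / len(t)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
t = rng.integers(0, 2, size=20).astype(float)
w = rng.normal(size=3)

analytic = grad(w, X, t)
h = 1e-6
for i in range(len(w)):
    e_i = np.zeros_like(w)
    e_i[i] = 1.0
    numeric = (loss(w + h * e_i, X, t) - loss(w - h * e_i, X, t)) / (2 * h)  # two-sided estimate
    print(i, numeric, analytic[i])  # the two columns should agree to several decimal places
```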
Importance of Gradient Checking¶
- Gradient checking is crucial in machine learning to ensure the correctness of gradient computations.
- Algorithms may appear to perform well even with incorrect gradient computations, which can tempt you to skip checking them.
- Correct gradients are essential, especially for achieving the highest accuracy.
- Wrong gradients may still produce apparent improvements because errors in different parts of the computation can partially compensate for one another, which makes them hard to notice.
- When derivatives are implemented by hand, gradient checking ensures the algorithm functions correctly.