Topic 10: Linear Regression

Linear Regression

1. Introduction

Linear regression is our first machine learning algorithm, and it illustrates several key concepts that recur throughout the course.

Core Characteristics

  • Predicts scalar-valued targets (like stock prices)
  • Uses linear functions for prediction
  • Belongs to supervised learning paradigm (requires labeled training data)

Key Course Themes Illustrated

  1. Mathematical optimization:
    • Formulating machine learning as optimization
    • Converting practical problems into mathematical forms
    • Defining objective functions
  2. Vector-based thinking:
    • Representing data points as vectors
    • Treating model parameters as vectors
    • Utilizing vector operations for efficiency
  3. Optimization strategies:
    • Deriving closed-form solutions
    • Implementing gradient descent
    • Understanding trade-offs between methods
  4. Linear algebra implementation:
    • Using matrix operations
    • Vectorizing computations
    • Leveraging efficient implementations
  5. Feature enhancement:
    • Using basis functions
    • Transforming inputs for better representation
    • Making linear models more powerful
  6. Generalization analysis:
    • Understanding overfitting and underfitting
    • Analyzing model performance
    • Evaluating on new data

Learning Objectives

Students should be able to:

  1. Understand and derive the linear regression objective function
  2. Derive both closed-form and gradient descent solutions
  3. Implement solutions using matrix operations
  4. Write efficient Python implementations
  5. Apply feature maps for nonlinear relationships
  6. Understand generalization, overfitting, and underfitting
  7. Measure generalization performance practically

2. Problem Setup

2.1 Model Definition

The linear regression model is defined mathematically as:

\[ y = \sum_j w_j x_j + b \]

Key components:

  • \(w_j\): Weight coefficients
  • \(x_j\): Input features
  • \(b\): Bias term (intercept)
  • \(y\): Predicted output

Visual Representation

  • Data space: Shows \(t\) vs \(x\) with linear fits
  • Weight space: Shows \((w,b)\) pairs
  • Connection: \(w\) represents slope, \(b\) represents y-intercept
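As a concrete illustration, here is a minimal NumPy sketch of the prediction defined above for a single example (the variable values are made up for the example):

```python
import numpy as np

def predict(x, w, b):
    """Compute y = sum_j w_j * x_j + b for one example."""
    return np.dot(w, x) + b

# Example: 3 features, illustrative values
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.2, 0.1])
b = 1.0
print(predict(x, w, b))  # 0.5*1 - 0.2*2 + 0.1*3 + 1 = 1.4
```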

2.2 Loss Function

Squared error loss is used:

\[ L(y,t) = \frac{1}{2}(y-t)^2 \]

Properties:

  • Small when prediction (\(y\)) is close to target (\(t\))
  • Large when prediction differs significantly
  • Factor of \(\frac{1}{2}\) simplifies derivative calculations
  • \((y-t)\) termed as residual
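A one-line sketch of this loss (the function name is illustrative):

```python
def squared_error_loss(y, t):
    """Squared error loss L(y, t) = 0.5 * (y - t)**2; (y - t) is the residual."""
    return 0.5 * (y - t) ** 2
```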

2.3 Cost Function

Full optimization objective:

\[ E(w_1,...,w_D,b) = \frac{1}{N}\sum_{i=1}^N L(y^{(i)}, t^{(i)}) \]

Expanded form:

\[ E(w_1,...,w_D,b) = \frac{1}{2N}\sum_{i=1}^N (y^{(i)} - t^{(i)})^2 \]

Complete form with model:

\[ E(w_1,...,w_D,b) = \frac{1}{2N}\sum_{i=1}^N (\sum_j w_j x_j^{(i)} + b - t^{(i)})^2 \]
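A minimal sketch of this cost, written as an explicit loop to mirror the summation (vectorized versions appear in Section 4):

```python
import numpy as np

def cost(X, t, w, b):
    """Average squared-error cost E = 1/(2N) * sum_i (y^(i) - t^(i))^2."""
    N = X.shape[0]
    total = 0.0
    for i in range(N):
        y_i = np.dot(w, X[i]) + b          # model prediction for example i
        total += 0.5 * (y_i - t[i]) ** 2   # squared-error loss for example i
    return total / N
```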

3. Optimization Solutions

3.1 Partial Derivatives

For weights:

\[ \frac{\partial E}{\partial w_j} = \frac{1}{N}\sum_{i=1}^N x_j^{(i)}(\sum_{j'} w_{j'} x_{j'}^{(i)} + b - t^{(i)}) \]

Simplified to:

\[ \frac{\partial E}{\partial w_j} = \frac{1}{N}\sum_{i=1}^N x_j^{(i)}(y^{(i)} - t^{(i)}) \]

For bias:

\[ \frac{\partial E}{\partial b} = \frac{1}{N}\sum_{i=1}^N (\sum_{j'} w_{j'} x_{j'}^{(i)} + b - t^{(i)}) \]

Simplified to:

\[ \frac{\partial E}{\partial b} = \frac{1}{N}\sum_{i=1}^N (y^{(i)} - t^{(i)}) \]
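A minimal sketch that computes these partial derivatives with explicit loops over the training examples (names are illustrative):

```python
import numpy as np

def gradients(X, t, w, b):
    """Compute dE/dw_j and dE/db using the simplified expressions above."""
    N, D = X.shape
    dE_dw = np.zeros(D)
    dE_db = 0.0
    for i in range(N):
        residual = np.dot(w, X[i]) + b - t[i]   # y^(i) - t^(i)
        dE_dw += X[i] * residual                # accumulates x_j^(i) * residual
        dE_db += residual
    return dE_dw / N, dE_db / N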

3.2 Direct Solution Method

System of linear equations:

\[ \sum_{j'=1}^D A_{jj'} w_{j'} - c_j = 0 \quad \forall j \in \{1,...,D\} \]

Where:

  • \(A_{jj'} = \frac{1}{N}\sum_{i=1}^N x_j^{(i)} x_{j'}^{(i)}\)
  • \(c_j = \frac{1}{N}\sum_{i=1}^N x_j^{(i)} t^{(i)}\)

Implementation steps:

  1. Compute coefficients \(A_{jj'}\) and \(c_j\)
  2. Solve linear system using linear algebra library
  3. Extract optimal weights
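A sketch of these steps in NumPy, assuming the bias has been absorbed as a constant all-ones feature so the system involves only the weights:

```python
import numpy as np

def direct_solution(X, t):
    """Solve the linear system A w = c from Section 3.2.
    Assumes the bias is absorbed as a constant feature (column of ones in X)."""
    N = X.shape[0]
    A = X.T @ X / N          # A[j, j'] = (1/N) sum_i x_j^(i) x_j'^(i)
    c = X.T @ t / N          # c[j]     = (1/N) sum_i x_j^(i) t^(i)
    return np.linalg.solve(A, c)

# Usage: prepend a column of ones so one weight plays the role of b
# X_aug = np.hstack([np.ones((X.shape[0], 1)), X])
# w = direct_solution(X_aug, t)
```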

3.3 Gradient Descent Method

The gradient is the direction of steepest ascent (i.e., the direction of fastest increase) of a function.

The entries of the gradient vector are the partial derivatives with respect to each of the variables:

\[ \frac{\partial E}{\partial \mathbf{w}} = \begin{pmatrix}\frac{\partial E}{\partial w_1} \\\vdots \\\frac{\partial E}{\partial w_D}\end{pmatrix} \]

Basic update rule:

\[ \mathbf{w} \leftarrow \mathbf{w} - \alpha \frac{\partial E}{\partial \mathbf{w}} \]

Component-wise updates:

\[ w_j \leftarrow w_j - \alpha \frac{\partial E}{\partial w_j} \]

Full update equation:

\[ w_j \leftarrow w_j - \alpha \frac{1}{N}\sum_{i=1}^N x_j^{(i)}(y^{(i)} - t^{(i)}) \]

Key aspects:

  • \(\alpha\): Learning rate (typically small, e.g., 0.01 or 0.001)
  • Iterative process
  • Converges to solution gradually
  • Fixed points occur where \(\frac{\partial E}{\partial w} = 0\)
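Putting the update rule together, a minimal gradient descent sketch (the learning rate and iteration count are illustrative defaults, not values from the notes):

```python
import numpy as np

def gradient_descent(X, t, alpha=0.01, num_iters=1000):
    """Gradient descent for linear regression with a separate bias term."""
    N, D = X.shape
    w = np.zeros(D)
    b = 0.0
    for _ in range(num_iters):
        y = X @ w + b                          # current predictions
        residual = y - t                       # y^(i) - t^(i)
        w -= alpha * X.T @ residual / N        # w_j <- w_j - alpha * dE/dw_j
        b -= alpha * residual.sum() / N        # b   <- b   - alpha * dE/db
    return w, b
```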

Advantages over direct solution:

  1. Applicable to any model with computable gradients
  2. Often computationally cheaper for large systems
  3. Can be automated using frameworks like TensorFlow

4. Vectorization

4.1 Matrix Representation

Data organization:

  • Inputs \(X\): \(N \times D\) matrix (\(N\) examples, \(D\) dimensions)
  • Weights \(w\): \(D\)-dimensional vector
  • Targets \(t\): \(N\)-dimensional vector
  • Predictions \(y\): \(N\)-dimensional vector

4.2 Vectorized Computations

Predictions:

\[ y = Xw + b\mathbf{1} \]

where \(\mathbf{1}\) denotes the \(N\)-vector of ones.

Cost function:

\[ \begin{aligned} E &= \frac{1}{2N} \|y - t\|^2 \\ &= \frac{1}{2N} \|Xw + b\mathbf{1} - t\|^2 \end{aligned} \]

Closed-form solution (with the bias absorbed into \(w\) via a constant all-ones feature in \(X\)):

\[ w = (X^TX)^{-1}X^Tt \]

Gradient descent:

\[ w \leftarrow w - \frac{\alpha}{N}X^T(y - t) \]
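A sketch of both vectorized solutions in NumPy; for the closed form, the bias is assumed to be absorbed into \(w\) via a constant feature, and the linear system is solved directly rather than forming an explicit inverse:

```python
import numpy as np

def closed_form(X, t):
    """Closed-form solution w = (X^T X)^{-1} X^T t, computed with a linear
    solve for numerical stability (bias absorbed via a ones column in X)."""
    return np.linalg.solve(X.T @ X, X.T @ t)

def gd_step(X, t, w, alpha):
    """One vectorized gradient-descent update: w <- w - (alpha/N) X^T (y - t)."""
    y = X @ w
    return w - (alpha / X.shape[0]) * (X.T @ (y - t))
```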

4.3 Advantages of Vectorization

  1. Code efficiency:
    • Reduces interpreter overhead
    • Optimized linear algebra libraries
    • Better memory access patterns
  2. GPU optimization:
    • Matrix operations highly parallelizable
    • Minimal control flow
    • Significant speedup (≈50x) possible
  3. Code clarity:
    • Simpler, more readable formulas
    • Avoid explicit indexing
    • More maintainable code

5. Feature Mappings

5.1 Polynomial Regression Example

Feature mapping function:

\[ \phi(x) = \begin{bmatrix} 1 \\ x \\ x^2 \\ x^3 \end{bmatrix} \]

Modified prediction:

\[ y = w^T\phi(x) \]
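A sketch of this polynomial feature map and the corresponding prediction (the degree-3 default matches the example above; names are illustrative):

```python
import numpy as np

def phi(x, degree=3):
    """Polynomial feature map phi(x) = [1, x, x^2, ..., x^degree] for scalar x."""
    return np.array([x ** k for k in range(degree + 1)])

def predict_poly(x, w):
    """y = w^T phi(x)."""
    return np.dot(w, phi(x, degree=len(w) - 1))

# Fitting reuses the same linear regression machinery: build the design
# matrix row-by-row from phi(x^(i)) and apply the closed-form or
# gradient-descent solvers from earlier sections, e.g.
# Phi = np.array([phi(x_i) for x_i in x_train])
```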

5.2 General Properties

Advantages:

  • Enables nonlinear relationships
  • Uses same linear regression algorithms
  • Maintains computational simplicity

Limitations:

  • Requires pre-defined features
  • High computational cost in high dimensions
  • Need for feature engineering expertise

6. Generalization

6.1 Core Concepts

Three key phenomena:

  1. Underfitting:
    • Model too simple
    • Cannot capture underlying patterns
    • Example: Linear function for nonlinear data
  2. Overfitting:
    • Model too complex
    • Fits noise in training data
    • Example: High-degree polynomial
  3. Good fit:
    • Balanced complexity
    • Captures patterns without noise
    • Example: Cubic polynomial for cubic data

6.2 Dataset Partitioning

Three essential sets:

  1. Training set:
    • Used to train model parameters
    • Primary learning dataset
    • Largest portion of data
  2. Validation set:
    • Used for hyperparameter tuning
    • Estimates generalization error
    • Guides model selection
  3. Test set:
    • Final evaluation only
    • Measures true generalization
    • Never used in training/tuning

6.3 Hyperparameter Optimization

Characteristics:

  • Cannot be learned during training
  • Must be set externally
  • Examples: polynomial degree, learning rate
  • Requires validation set for tuning
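As an illustration of validation-based tuning, here is a hedged sketch that selects a polynomial degree by fitting on the training set and picking the degree with the lowest validation error (all function and variable names are hypothetical):

```python
import numpy as np

def val_error(Phi, t, w):
    """Mean squared-error cost 1/(2N) * sum_i (y^(i) - t^(i))^2 on a given set."""
    return 0.5 * np.mean((Phi @ w - t) ** 2)

def select_degree(x_train, t_train, x_val, t_val, max_degree=10):
    """Fit each candidate degree on the training set; choose the one with the
    lowest validation error. The test set is reserved for the final evaluation."""
    best_degree, best_err = None, np.inf
    for d in range(1, max_degree + 1):
        Phi_train = np.array([[xi ** k for k in range(d + 1)] for xi in x_train])
        Phi_val   = np.array([[xi ** k for k in range(d + 1)] for xi in x_val])
        w = np.linalg.solve(Phi_train.T @ Phi_train, Phi_train.T @ t_train)
        err = val_error(Phi_val, t_val, w)
        if err < best_err:
            best_degree, best_err = d, err
    return best_degree
```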

6.4 Practical Considerations

Best practices:

  • Regular validation checks
  • Early stopping when needed
  • Cross-validation for small datasets
  • Careful hyperparameter tuning
  • Final test set evaluation