Topic 10: Linear Regression

Linear Regression

1. Introduction

Linear regression is our first machine learning algorithm, and it illustrates several key concepts that recur throughout the course.

Core Characteristics

  • Predicts scalar-valued targets (like stock prices)
  • Uses linear functions for prediction
  • Belongs to supervised learning paradigm (requires labeled training data)

Key Course Themes Illustrated

  1. Mathematical optimization:
    • Formulating machine learning as optimization
    • Converting practical problems into mathematical forms
    • Defining objective functions
  2. Vector-based thinking:
    • Representing data points as vectors
    • Treating model parameters as vectors
    • Utilizing vector operations for efficiency
  3. Optimization strategies:
    • Deriving closed-form solutions
    • Implementing gradient descent
    • Understanding trade-offs between methods
  4. Linear algebra implementation:
    • Using matrix operations
    • Vectorizing computations
    • Leveraging efficient implementations
  5. Feature enhancement:
    • Using basis functions
    • Transforming inputs for better representation
    • Making linear models more powerful
  6. Generalization analysis:
    • Understanding overfitting and underfitting
    • Analyzing model performance
    • Evaluating on new data

Learning Objectives

Students should be able to:

  1. Understand and derive the linear regression objective function
  2. Derive both closed-form and gradient descent solutions
  3. Implement solutions using matrix operations
  4. Write efficient Python implementations
  5. Apply feature maps for nonlinear relationships
  6. Understand generalization, overfitting, and underfitting
  7. Measure generalization performance practically

2. Problem Setup

2.1 Model Definition

The linear regression model is defined mathematically as:

\[ y = \sum_j w_j x_j + b \]

Key components:

  • \(w_j\): Weight coefficients
  • \(x_j\): Input features
  • \(b\): Bias term (intercept)
  • \(y\): Predicted output

Visual Representation

  • Data space: Shows \(t\) vs \(x\) with linear fits
  • Weight space: Shows \((w,b)\) pairs
  • Connection: \(w\) represents slope, \(b\) represents y-intercept
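As a concrete illustration, here is a minimal NumPy sketch of the prediction defined above for a single example (the variable values are made up for the example):

```python
import numpy as np

def predict(x, w, b):
    """Compute y = sum_j w_j * x_j + b for one example."""
    return np.dot(w, x) + b

# Example: 3 features, illustrative values
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.2, 0.1])
b = 1.0
print(predict(x, w, b))  # 0.5*1 - 0.2*2 + 0.1*3 + 1 = 1.4
```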

2.2 Loss Function

Squared error loss is used:

\[ L(y,t) = \frac{1}{2}(y-t)^2 \]

Properties:

  • Small when prediction (\(y\)) is close to target (\(t\))
  • Large when prediction differs significantly
  • Factor of \(\frac{1}{2}\) simplifies derivative calculations
  • \((y-t)\) termed as residual
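A one-line sketch of this loss (the function name is illustrative):

```python
def squared_error_loss(y, t):
    """Squared error loss L(y, t) = 0.5 * (y - t)**2; (y - t) is the residual."""
    return 0.5 * (y - t) ** 2
```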

2.3 Cost Function

Full optimization objective:

\[ E(w_1,...,w_D,b) = \frac{1}{N}\sum_{i=1}^N L(y^{(i)}, t^{(i)}) \]

Expanded form:

\[ E(w_1,...,w_D,b) = \frac{1}{2N}\sum_{i=1}^N (y^{(i)} - t^{(i)})^2 \]

Complete form with model:

\[ E(w_1,...,w_D,b) = \frac{1}{2N}\sum_{i=1}^N (\sum_j w_j x_j^{(i)} + b - t^{(i)})^2 \]
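A minimal sketch of this cost, written as an explicit loop to mirror the summation (vectorized versions appear in Section 4):

```python
import numpy as np

def cost(X, t, w, b):
    """Average squared-error cost E = 1/(2N) * sum_i (y^(i) - t^(i))^2."""
    N = X.shape[0]
    total = 0.0
    for i in range(N):
        y_i = np.dot(w, X[i]) + b          # model prediction for example i
        total += 0.5 * (y_i - t[i]) ** 2   # squared-error loss for example i
    return total / N
```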

3. Optimization Solutions

3.1 Partial Derivatives

For weights:

\[ \frac{\partial E}{\partial w_j} = \frac{1}{N}\sum_{i=1}^N x_j^{(i)}(\sum_{j'} w_{j'} x_{j'}^{(i)} + b - t^{(i)}) \]

Simplified to:

\[ \frac{\partial E}{\partial w_j} = \frac{1}{N}\sum_{i=1}^N x_j^{(i)}(y^{(i)} - t^{(i)}) \]

For bias:

\[ \frac{\partial E}{\partial b} = \frac{1}{N}\sum_{i=1}^N (\sum_{j'} w_{j'} x_{j'}^{(i)} + b - t^{(i)}) \]

Simplified to:

\[ \frac{\partial E}{\partial b} = \frac{1}{N}\sum_{i=1}^N (y^{(i)} - t^{(i)}) \]
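A minimal sketch that computes these partial derivatives with explicit loops over the training examples (names are illustrative):

```python
import numpy as np

def gradients(X, t, w, b):
    """Compute dE/dw_j and dE/db using the simplified expressions above."""
    N, D = X.shape
    dE_dw = np.zeros(D)
    dE_db = 0.0
    for i in range(N):
        residual = np.dot(w, X[i]) + b - t[i]   # y^(i) - t^(i)
        dE_dw += X[i] * residual                # accumulates x_j^(i) * residual
        dE_db += residual
    return dE_dw / N, dE_db / N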

3.2 Direct Solution Method

System of linear equations:

\[ \sum_{j'=1}^D A_{jj'} w_{j'} - c_j = 0 \quad \forall j \in \{1,...,D\} \]

Where:

  • \(A_{jj'} = \frac{1}{N}\sum_{i=1}^N x_j^{(i)} x_{j'}^{(i)}\)
  • \(c_j = \frac{1}{N}\sum_{i=1}^N x_j^{(i)} t^{(i)}\)

Implementation steps:

  1. Compute coefficients \(A_{jj'}\) and \(c_j\)
  2. Solve linear system using linear algebra library
  3. Extract optimal weights
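A sketch of these steps in NumPy, assuming the bias has been absorbed as a constant all-ones feature so the system involves only the weights:

```python
import numpy as np

def direct_solution(X, t):
    """Solve the linear system A w = c from Section 3.2.
    Assumes the bias is absorbed as a constant feature (column of ones in X)."""
    N = X.shape[0]
    A = X.T @ X / N          # A[j, j'] = (1/N) sum_i x_j^(i) x_j'^(i)
    c = X.T @ t / N          # c[j]     = (1/N) sum_i x_j^(i) t^(i)
    return np.linalg.solve(A, c)

# Usage: prepend a column of ones so one weight plays the role of b
# X_aug = np.hstack([np.ones((X.shape[0], 1)), X])
# w = direct_solution(X_aug, t)
```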

3.3 Gradient Descent Method

The gradient is the direction of steepest ascent (i.e., the direction of fastest increase) of a function.

The entries of the gradient vector are the partial derivatives with respect to each of the variables:

\[ \frac{\partial E}{\partial \mathbf{w}} = \begin{pmatrix}\frac{\partial E}{\partial w_1} \\\vdots \\\frac{\partial E}{\partial w_D}\end{pmatrix} \]

Basic update rule:

\[ \mathbf{w} \leftarrow \mathbf{w} - \alpha \frac{\partial E}{\partial \mathbf{w}} \]

Component-wise updates:

\[ w_j \leftarrow w_j - \alpha \frac{\partial E}{\partial w_j} \]

Full update equation:

\[ w_j \leftarrow w_j - \alpha \frac{1}{N}\sum_{i=1}^N x_j^{(i)}(y^{(i)} - t^{(i)}) \]

Key aspects:

  • \(\alpha\): Learning rate (typically small, e.g., 0.01 or 0.001)
  • Iterative process
  • Converges to solution gradually
  • Fixed points occur where \(\frac{\partial E}{\partial w} = 0\)
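Putting the update rule together, a minimal gradient descent sketch (the learning rate and iteration count are illustrative defaults, not values from the notes):

```python
import numpy as np

def gradient_descent(X, t, alpha=0.01, num_iters=1000):
    """Gradient descent for linear regression with a separate bias term."""
    N, D = X.shape
    w = np.zeros(D)
    b = 0.0
    for _ in range(num_iters):
        y = X @ w + b                          # current predictions
        residual = y - t                       # y^(i) - t^(i)
        w -= alpha * X.T @ residual / N        # w_j <- w_j - alpha * dE/dw_j
        b -= alpha * residual.sum() / N        # b   <- b   - alpha * dE/db
    return w, b
```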

Advantages over direct solution:

  1. Applicable to any model with computable gradients
  2. Often computationally cheaper for large systems
  3. Can be automated using frameworks like TensorFlow

4. Vectorization

4.1 Matrix Representation

Data organization:

  • Inputs \(X\): \(N \times D\) matrix (\(N\) examples, \(D\) dimensions)
  • Weights \(w\): \(D\)-dimensional vector
  • Targets \(t\): \(N\)-dimensional vector
  • Predictions \(y\): \(N\)-dimensional vector

4.2 Vectorized Computations

Predictions:

\[ y = Xw + b\mathbf{1} \]

where \(\mathbf{1}\) denotes the \(N\)-vector of ones.

Cost function:

\[ \begin{aligned} E &= \frac{1}{2N} \|y - t\|^2 \\ &= \frac{1}{2N} \|Xw + b\mathbf{1} - t\|^2 \end{aligned} \]

Closed-form solution (with the bias absorbed into \(w\) via a constant all-ones feature in \(X\)):

\[ w = (X^TX)^{-1}X^Tt \]

Gradient descent:

\[ w \leftarrow w - \frac{\alpha}{N}X^T(y - t) \]
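A sketch of both vectorized solutions in NumPy; for the closed form, the bias is assumed to be absorbed into \(w\) via a constant feature, and the linear system is solved directly rather than forming an explicit inverse:

```python
import numpy as np

def closed_form(X, t):
    """Closed-form solution w = (X^T X)^{-1} X^T t, computed with a linear
    solve for numerical stability (bias absorbed via a ones column in X)."""
    return np.linalg.solve(X.T @ X, X.T @ t)

def gd_step(X, t, w, alpha):
    """One vectorized gradient-descent update: w <- w - (alpha/N) X^T (y - t)."""
    y = X @ w
    return w - (alpha / X.shape[0]) * (X.T @ (y - t))
```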

4.3 Advantages of Vectorization

  1. Code efficiency:
    • Reduces interpreter overhead
    • Optimized linear algebra libraries
    • Better memory access patterns
  2. GPU optimization:
    • Matrix operations highly parallelizable
    • Minimal control flow
    • Significant speedup (≈50x) possible
  3. Code clarity:
    • Simpler, more readable formulas
    • Avoid explicit indexing
    • More maintainable code

5. Feature Mappings

5.1 Polynomial Regression Example

Feature mapping function:

\[ \phi(x) = \begin{bmatrix} 1 \\ x \\ x^2 \\ x^3 \end{bmatrix} \]

Modified prediction:

\[ y = w^T\phi(x) \]
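A sketch of this polynomial feature map and the corresponding prediction (the degree-3 default matches the example above; names are illustrative):

```python
import numpy as np

def phi(x, degree=3):
    """Polynomial feature map phi(x) = [1, x, x^2, ..., x^degree] for scalar x."""
    return np.array([x ** k for k in range(degree + 1)])

def predict_poly(x, w):
    """y = w^T phi(x)."""
    return np.dot(w, phi(x, degree=len(w) - 1))

# Fitting reuses the same linear regression machinery: build the design
# matrix row-by-row from phi(x^(i)) and apply the closed-form or
# gradient-descent solvers from earlier sections, e.g.
# Phi = np.array([phi(x_i) for x_i in x_train])
```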

5.2 General Properties

Advantages:

  • Enables nonlinear relationships
  • Uses same linear regression algorithms
  • Maintains computational simplicity

Limitations:

  • Requires pre-defined features
  • High computational cost in high dimensions
  • Need for feature engineering expertise

6. Generalization

6.1 Core Concepts

Three key phenomena:

  1. Underfitting:
    • Model too simple
    • Cannot capture underlying patterns
    • Example: Linear function for nonlinear data
  2. Overfitting:
    • Model too complex
    • Fits noise in training data
    • Example: High-degree polynomial
  3. Good fit:
    • Balanced complexity
    • Captures patterns without noise
    • Example: Cubic polynomial for cubic data

6.2 Dataset Partitioning

Three essential sets:

  1. Training set:
    • Used to train model parameters
    • Primary learning dataset
    • Largest portion of data
  2. Validation set:
    • Used for hyperparameter tuning
    • Estimates generalization error
    • Guides model selection
  3. Test set:
    • Final evaluation only
    • Measures true generalization
    • Never used in training/tuning

6.3 Hyperparameter Optimization

Characteristics:

  • Cannot be learned during training
  • Must be set externally
  • Examples: polynomial degree, learning rate
  • Requires validation set for tuning
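As an illustration of validation-based tuning, here is a hedged sketch that selects a polynomial degree by fitting on the training set and picking the degree with the lowest validation error (all function and variable names are hypothetical):

```python
import numpy as np

def val_error(Phi, t, w):
    """Mean squared-error cost 1/(2N) * sum_i (y^(i) - t^(i))^2 on a given set."""
    return 0.5 * np.mean((Phi @ w - t) ** 2)

def select_degree(x_train, t_train, x_val, t_val, max_degree=10):
    """Fit each candidate degree on the training set; choose the one with the
    lowest validation error. The test set is reserved for the final evaluation."""
    best_degree, best_err = None, np.inf
    for d in range(1, max_degree + 1):
        Phi_train = np.array([[xi ** k for k in range(d + 1)] for xi in x_train])
        Phi_val   = np.array([[xi ** k for k in range(d + 1)] for xi in x_val])
        w = np.linalg.solve(Phi_train.T @ Phi_train, Phi_train.T @ t_train)
        err = val_error(Phi_val, t_val, w)
        if err < best_err:
            best_degree, best_err = d, err
    return best_degree
```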

6.4 Practical Considerations

Best practices:

  • Regular validation checks
  • Early stopping when needed
  • Cross-validation for small datasets
  • Careful hyperparameter tuning
  • Final test set evaluation