Topic 10: Linear Regression
1. Introduction
Linear regression is our first machine learning algorithm, and it illustrates several key concepts that recur throughout the course.
Core Characteristics
- Predicts scalar-valued targets (like stock prices)
- Uses linear functions for prediction
- Belongs to the supervised learning paradigm (requires labeled training data)
Key Course Themes Illustrated
- Mathematical optimization:
  - Formulating machine learning as optimization
  - Converting practical problems into mathematical forms
  - Defining objective functions
- Vector-based thinking:
  - Representing data points as vectors
  - Treating model parameters as vectors
  - Utilizing vector operations for efficiency
- Optimization strategies:
  - Deriving closed-form solutions
  - Implementing gradient descent
  - Understanding trade-offs between methods
- Linear algebra implementation:
  - Using matrix operations
  - Vectorizing computations
  - Leveraging efficient implementations
- Feature enhancement:
  - Using basis functions
  - Transforming inputs for better representation
  - Making linear models more powerful
- Generalization analysis:
  - Understanding overfitting and underfitting
  - Analyzing model performance
  - Evaluating on new data
Learning Objectives
Students should be able to:
- Understand and derive the linear regression objective function
- Derive both closed-form and gradient descent solutions
- Implement solutions using matrix operations
- Write efficient Python implementations
- Apply feature maps for nonlinear relationships
- Understand generalization, overfitting, and underfitting
- Measure generalization performance practically
2. Problem Setup
2.1 Model Definition
The linear regression model is defined mathematically as:
\[ y = \sum_{j=1}^{D} w_j x_j + b \]
Key components:
- \(w_j\): Weight coefficients
- \(x_j\): Input features
- \(b\): Bias term (intercept)
- \(y\): Predicted output
Visual Representation
- Data space: Shows \(t\) vs \(x\) with linear fits
- Weight space: Shows \((w,b)\) pairs
- Connection: \(w\) represents slope, \(b\) represents y-intercept
2.2 Loss Function
Squared error loss is used:
\[ \mathcal{L}(y, t) = \frac{1}{2}(y - t)^2 \]
Properties:
- Small when prediction (\(y\)) is close to target (\(t\))
- Large when prediction differs significantly
- Factor of \(\frac{1}{2}\) simplifies derivative calculations
- The difference \(y - t\) is termed the residual
2.3 Cost Function
Full optimization objective:
\[ E(w_1, \ldots, w_D, b) = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\big(y^{(i)}, t^{(i)}\big) \]
Expanded form:
\[ E(w_1, \ldots, w_D, b) = \frac{1}{2N} \sum_{i=1}^{N} \big(y^{(i)} - t^{(i)}\big)^2 \]
Complete form with model:
\[ E(w_1, \ldots, w_D, b) = \frac{1}{2N} \sum_{i=1}^{N} \Big( \sum_{j=1}^{D} w_j x_j^{(i)} + b - t^{(i)} \Big)^2 \]
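To make the summations concrete, here is a minimal Python sketch (the function names are illustrative, not defined in these notes) that computes the squared error loss for one example and the average cost over a dataset using explicit loops:

```python
def squared_error_loss(y, t):
    """Squared error for one prediction y and target t: L(y, t) = (y - t)^2 / 2."""
    return 0.5 * (y - t) ** 2

def cost(ws, b, xs, ts):
    """Average loss over the dataset: E = (1/N) * sum_i L(y^(i), t^(i)).

    xs: list of N examples, each a list of D feature values; ts: list of N targets.
    """
    N = len(xs)
    total = 0.0
    for x, t in zip(xs, ts):
        y = sum(w_j * x_j for w_j, x_j in zip(ws, x)) + b  # y = sum_j w_j x_j + b
        total += squared_error_loss(y, t)
    return total / N
```

For instance, cost([2.0], 1.0, [[1.0], [2.0]], [3.0, 5.0]) returns 0.0, since the model \(y = 2x + 1\) fits both points exactly.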
3. Optimization Solutions
3.1 Partial Derivatives
For weights:
\[ \frac{\partial E}{\partial w_j} = \frac{1}{N} \sum_{i=1}^{N} \Big( \sum_{j'=1}^{D} w_{j'} x_{j'}^{(i)} + b - t^{(i)} \Big)\, x_j^{(i)} \]
Simplified to:
\[ \frac{\partial E}{\partial w_j} = \frac{1}{N} \sum_{i=1}^{N} \big(y^{(i)} - t^{(i)}\big)\, x_j^{(i)} \]
For bias:
\[ \frac{\partial E}{\partial b} = \frac{1}{N} \sum_{i=1}^{N} \Big( \sum_{j'=1}^{D} w_{j'} x_{j'}^{(i)} + b - t^{(i)} \Big) \]
Simplified to:
\[ \frac{\partial E}{\partial b} = \frac{1}{N} \sum_{i=1}^{N} \big(y^{(i)} - t^{(i)}\big) \]
3.2 Direct Solution Method
System of linear equations:
\[ \sum_{j'=1}^{D} A_{jj'}\, w_{j'} = c_j \qquad \text{for } j = 1, \ldots, D \]
Where:
- \(A_{jj'} = \frac{1}{N}\sum_{i=1}^N x_j^{(i)} x_{j'}^{(i)}\)
- \(c_j = \frac{1}{N}\sum_{i=1}^N x_j^{(i)} t^{(i)}\)
Implementation steps (a NumPy sketch follows this list):
- Compute coefficients \(A_{jj'}\) and \(c_j\)
- Solve linear system using linear algebra library
- Extract optimal weights
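The three steps above might be sketched in NumPy as follows; it assumes, as an implementation convenience not stated in the notes, that a constant feature of 1 has been appended to every example so the bias is folded into the weight vector:

```python
import numpy as np

def fit_direct(X, t):
    """Direct solution: build A and c, then solve the linear system A w = c.

    X: (N, D) array of inputs (with a column of ones appended for the bias);
    t: (N,) array of targets.
    """
    N = X.shape[0]
    A = X.T @ X / N               # A[j, j'] = (1/N) * sum_i x_j^(i) * x_j'^(i)
    c = X.T @ t / N               # c[j]     = (1/N) * sum_i x_j^(i) * t^(i)
    return np.linalg.solve(A, c)  # optimal weights solving A w = c
```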
3.3 Gradient Descent Method
The gradient is the direction of steepest ascent (i.e. fastest increase) of a function.
The entries of the gradient vector are the partial derivatives with respect to each of the variables:
\[ \frac{\partial E}{\partial w} = \begin{pmatrix} \frac{\partial E}{\partial w_1} \\ \vdots \\ \frac{\partial E}{\partial w_D} \end{pmatrix} \]
Basic update rule:
\[ w \leftarrow w - \alpha \frac{\partial E}{\partial w} \]
Component-wise updates:
\[ w_j \leftarrow w_j - \alpha \frac{\partial E}{\partial w_j} \]
Full update equation:
\[ w_j \leftarrow w_j - \frac{\alpha}{N} \sum_{i=1}^{N} \big(y^{(i)} - t^{(i)}\big)\, x_j^{(i)} \]
Key aspects:
- \(\alpha\): Learning rate (typically small, e.g., 0.01 or 0.001)
- Iterative process
- Converges to solution gradually
- Fixed points occur where \(\frac{\partial E}{\partial w} = 0\)
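As one possible, deliberately unvectorized rendering of the component-wise update rule, a single gradient descent step could be written like this (Section 4 shows the much more compact vectorized form):

```python
def gradient_descent_step(X, t, w, b, alpha):
    """One gradient descent update written with explicit loops.

    X: list of N examples, each a list of D features; t: list of N targets;
    w: list of D weights; b: bias; alpha: learning rate.
    """
    N, D = len(X), len(X[0])
    # Predictions and residuals for every training example.
    y = [sum(w[j] * X[i][j] for j in range(D)) + b for i in range(N)]
    r = [y[i] - t[i] for i in range(N)]
    # Partial derivatives from Section 3.1.
    grad_w = [sum(r[i] * X[i][j] for i in range(N)) / N for j in range(D)]
    grad_b = sum(r) / N
    # Move each parameter a small step against its gradient.
    w = [w[j] - alpha * grad_w[j] for j in range(D)]
    b = b - alpha * grad_b
    return w, b
```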
Advantages over direct solution:
- Applicable to any model with computable gradients
- Often computationally cheaper for large systems
- Can be automated using frameworks like TensorFlow
4. Vectorization
4.1 Matrix Representation
Data organization:
- Inputs \(X\): \(N \times D\) matrix (\(N\) examples, \(D\) dimensions)
- Weights \(w\): \(D\)-dimensional vector
- Targets \(t\): \(N\)-dimensional vector
- Predictions \(y\): \(N\)-dimensional vector
4.2 Vectorized Computations
Predictions:
\[ y = Xw + b\mathbf{1} \]
Cost function:
\[ E = \frac{1}{2N} \lVert y - t \rVert^2 \]
Closed-form solution (with the bias absorbed into \(w\) by appending a column of ones to \(X\)):
\[ w = (X^\top X)^{-1} X^\top t \]
Gradient descent:
\[ w \leftarrow w - \frac{\alpha}{N} X^\top (y - t) \]
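One way to write these four formulas in NumPy, again folding the bias into the weight vector via a column of ones (an implementation choice, not something required by the notes):

```python
import numpy as np

def add_bias_column(X):
    """Append a constant feature of 1 so the bias becomes an ordinary weight."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

def predict(X, w):
    return X @ w                                    # y = X w

def cost(X, w, t):
    y = predict(X, w)
    return np.sum((y - t) ** 2) / (2 * len(t))      # E = ||y - t||^2 / (2N)

def closed_form(X, t):
    return np.linalg.solve(X.T @ X, X.T @ t)        # solves (X^T X) w = X^T t

def gradient_step(X, w, t, alpha):
    y = predict(X, w)
    return w - alpha * X.T @ (y - t) / len(t)       # w <- w - (alpha/N) X^T (y - t)
```

Calling np.linalg.solve on the normal equations avoids forming the matrix inverse explicitly, which is generally the numerically safer choice.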
4.3 Advantages of Vectorization
- Code efficiency:
  - Reduces interpreter overhead
  - Optimized linear algebra libraries
  - Better memory access patterns
- GPU optimization:
  - Matrix operations highly parallelizable
  - Minimal control flow
  - Significant speedup (≈50x) possible; see the timing sketch after this list
- Code clarity:
  - Simpler, more readable formulas
  - Avoid explicit indexing
  - More maintainable code
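To get a feel for the difference, one can time a plain Python loop against the equivalent NumPy call; the exact numbers depend entirely on the machine and array size, so treat this as an informal experiment rather than a benchmark:

```python
import time
import numpy as np

x = np.random.randn(1_000_000)
w = np.random.randn(1_000_000)

start = time.perf_counter()
slow = sum(w[j] * x[j] for j in range(len(x)))   # explicit Python loop
loop_time = time.perf_counter() - start

start = time.perf_counter()
fast = np.dot(w, x)                              # single vectorized call
vec_time = time.perf_counter() - start

assert np.isclose(slow, fast)                    # same result, very different cost
print(f"loop: {loop_time:.3f}s   vectorized: {vec_time:.6f}s")
```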
5. Feature Mappings
5.1 Polynomial Regression Example
Feature mapping function:
\[ \psi(x) = \big(1,\; x,\; x^2,\; \ldots,\; x^M\big)^\top \]
Modified prediction:
\[ y = w^\top \psi(x) = \sum_{m=0}^{M} w_m x^m \]
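A minimal NumPy sketch of this feature map (polynomial_features is an illustrative helper, not a function defined in the notes); the commented lines show how the transformed inputs reuse the same closed-form machinery from Section 4:

```python
import numpy as np

def polynomial_features(x, M):
    """Map a 1-D array x to the features (1, x, x^2, ..., x^M), one row per example."""
    return np.stack([x ** m for m in range(M + 1)], axis=1)

# Polynomial regression is just linear regression on the transformed inputs:
# Phi = polynomial_features(x, M=3)
# w = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)   # same closed-form solution
# y = Phi @ w                                   # predictions
```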
5.2 General Properties
Advantages:
- Enables nonlinear relationships
- Uses same linear regression algorithms
- Maintains computational simplicity
Limitations:
- Requires pre-defined features
- High computational cost in high dimensions
- Need for feature engineering expertise
6. Generalization
6.1 Core Concepts
Three key phenomena:
- Underfitting:
  - Model too simple
  - Cannot capture underlying patterns
  - Example: Linear function for nonlinear data
- Overfitting:
  - Model too complex
  - Fits noise in training data
  - Example: High-degree polynomial
- Good fit:
  - Balanced complexity
  - Captures patterns without noise
  - Example: Cubic polynomial for cubic data
6.2 Dataset Partitioning
Three essential sets:
- Training set:
  - Used to train model parameters
  - Primary learning dataset
  - Largest portion of data
- Validation set:
  - Used for hyperparameter tuning
  - Estimates generalization error
  - Guides model selection
- Test set:
  - Final evaluation only
  - Measures true generalization
  - Never used in training/tuning
6.3 Hyperparameter Optimization
Characteristics:
- Cannot be learned during training
- Must be set externally
- Examples: polynomial degree, learning rate
- Requires a validation set for tuning (see the selection sketch below)
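Putting Sections 6.2 and 6.3 together, a validation-based search over the polynomial degree might look like the following sketch; the split fraction, random seed, and candidate degrees are arbitrary illustrative choices:

```python
import numpy as np

def train_val_split(x, t, val_fraction=0.2, seed=0):
    """Randomly hold out a fraction of the data as a validation set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_val = int(len(x) * val_fraction)
    return x[idx[n_val:]], t[idx[n_val:]], x[idx[:n_val]], t[idx[:n_val]]

def fit_poly(x, t, degree):
    """Fit polynomial regression of the given degree via the closed form."""
    Phi = np.stack([x ** m for m in range(degree + 1)], axis=1)
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

def mean_squared_error(w, x, t):
    Phi = np.stack([x ** m for m in range(len(w))], axis=1)
    return np.mean((Phi @ w - t) ** 2) / 2

# Pick the degree (a hyperparameter) by validation error, never by training error:
# x_tr, t_tr, x_val, t_val = train_val_split(x, t)
# best_degree = min(range(1, 10),
#                   key=lambda d: mean_squared_error(fit_poly(x_tr, t_tr, d), x_val, t_val))
```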
6.4 Practical Considerations
Best practices:
- Regular validation checks
- Early stopping when needed
- Cross-validation for small datasets
- Careful hyperparameter tuning
- Final test set evaluation