Topic 8: Outliers¶
Outlier Detection¶
- Outlier detection:
- Find observations that are "unusually different" from the others.
- Also known as "anomaly detection".
- May want to remove outliers or focus on them (e.g., for security).
- Sources of outliers:
- Measurement errors.
- Data entry errors.
- Contamination from different sources.
- Rare events.
Applications of Outlier Detection¶
- Data Cleaning: Removing outliers may lead to better models.
- Security and Fault Detection: Detect network intrusions, DoS attacks.
- Fraud Detection: Identify irregularities in credit card transactions, stocks, voting.
- Natural Disaster Detection: Locate underwater earthquakes.
- Astronomy: Discover new classes of stars/planets.
- Genetics: Identify individuals with unique or ancient genes.
5 Types of Methods for Outlier Detection¶
- Model-Based Methods
- Graphical Approaches
- Cluster-Based Methods
- Distance-Based Methods
- Supervised-Learning Methods
- Note: This is a challenging problem with no definitive solutions.
Outlier Detection Method 1: Model-Based Outlier Detection¶
- Fit a probabilistic model.
- Outliers are examples with low density.
- Example:
- Assume data follows a normal distribution.
- Compute the z-score for 1D data:
\[ z_i = \frac{x_i-\mu}{\sigma} \]where \(\mu=\frac1n \sum^n_{i=1}x_i\) and \(\sigma = \sqrt{\frac1n \sum^n_{i=1}(x_i-\mu)^2}\)
- "Number of standard deviations away from the mean".
- Mark as "outlier" if \(|z| > 4\) or some other threshold.
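The z-score rule above can be sketched in a few lines of NumPy; the `zscore_outliers` helper name and the toy data are illustrative, not from the notes:

```python
import numpy as np

def zscore_outliers(x, threshold=4.0):
    """Flag 1D points whose |z-score| exceeds a threshold (sketch)."""
    mu = x.mean()
    sigma = x.std()  # population standard deviation, matching the formula above
    z = (x - mu) / sigma
    return np.abs(z) > threshold

x = np.array([1.0, 1.2, 0.9, 1.1, 1.0, 50.0])
print(zscore_outliers(x, threshold=2.0))  # only the last point is flagged
```

Note that even the extreme value 50 only reaches \(|z| \approx 2.2\) here, because it inflates the mean and standard deviation it is measured against — which is exactly the sensitivity problem discussed next.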
Problems with Z-Score¶
- Sensitivity to Outliers:
- Mean and variance are affected by outliers.
- Possible solutions: use quantiles or iteratively remove worst outliers.
- Assumption of Unimodality:
- Z-score assumes data is concentrated around a single mean.
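One quantile-based alternative, as a sketch: standardize by the median and interquartile range instead of the mean and standard deviation, so a single extreme point barely affects the score (the `robust_zscore` name is illustrative):

```python
import numpy as np

def robust_zscore(x):
    """Quantile-based score: median and IQR in place of mean and std."""
    med = np.median(x)
    q1, q3 = np.percentile(x, [25, 75])
    return (x - med) / (q3 - q1)

x = np.array([1.0, 1.2, 0.9, 1.1, 1.0, 50.0])
print(np.abs(robust_zscore(x)) > 2.0)  # the extreme point now stands out clearly
```

On the same data as before, the extreme point's robust score is in the hundreds rather than ~2, because the median and IQR are computed almost entirely from the regular points.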

Global vs. Local Outliers¶
- Global Outliers: Points that are unusual across the entire dataset.
- Local Outliers: Points that are unusual within their local neighborhood.
- Example:
- A point with the lowest z-score could be either a global or a local outlier, depending on context.

Outlier Detection Method 2: Graphical Approaches to Outlier Detection¶
- Look at a plot of the data.
- A human decides whether each point is an outlier.
Box Plots:
- Useful for detecting outliers in univariate data.
- Displays data distribution and highlights "extreme" values.
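Box plots conventionally mark points beyond 1.5 IQR from the quartiles (Tukey's rule) as "extreme"; that rule can be computed directly, without drawing the plot. A minimal sketch (the `boxplot_outliers` helper and data are illustrative):

```python
import numpy as np

def boxplot_outliers(x):
    """Tukey's boxplot rule: flag points beyond 1.5 * IQR from the quartiles."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return (x < lo) | (x > hi)

x = np.array([2.0, 3.0, 3.5, 4.0, 4.5, 5.0, 12.0])
print(boxplot_outliers(x))  # only the last point lies outside the whiskers
```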

Scatter Plots:
- Helpful in bivariate data.
- Outliers may appear as points far from the main cluster.
- Can detect complex patterns.

Scatterplot Arrays and Parallel Coordinate Plots:
- Useful for multivariate data.
- A scatterplot array shows all pairwise combinations of variables, but this is laborious in high dimensions and still examines only 2 variables at a time.
- In a parallel coordinate plot, outliers can be detected as lines that differ significantly from the general trend.

Scatterplot of 2-Dimensional PCA:
- Projects multivariate data to 2D for a single scatterplot.
- But the projection loses information, and PCA itself is sensitive to outliers.

Outlier Detection Method 3: Cluster-Based Outlier Detection¶
- Cluster the data.
- Find points that do not belong to clusters.
- Basic Idea: Outliers are points that do not fit into clusters.
- Examples:
- K-means:
- Find points that are far away from any mean.
- Find clusters with a small number of points.
- Density-based clustering:
- Outliers are points not assigned to any cluster.
- Hierarchical clustering:
- Outliers take longer to join other groups.
- Also good for outlier groups.
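The K-means variant ("find points that are far away from any mean") can be sketched as follows; the synthetic data, the fixed k = 2, and the distance threshold of 3 are all illustrative choices, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two tight clusters plus one far-away point at index 40.
X = np.vstack([rng.normal(0, 0.1, (20, 2)),
               rng.normal(5, 0.1, (20, 2)),
               [[20.0, 20.0]]])

# A few iterations of Lloyd's algorithm (k = 2), just as a sketch.
means = X[[0, 20]]
for _ in range(10):
    d = np.linalg.norm(X[:, None] - means[None], axis=2)  # distances to each mean
    labels = d.argmin(axis=1)                             # assign to nearest mean
    means = np.array([X[labels == c].mean(axis=0) for c in range(2)])

# Outliers: points far from their nearest mean.
dist_to_mean = np.linalg.norm(X - means[labels], axis=1)
print(np.flatnonzero(dist_to_mean > 3.0))  # flags only the far-away point
```

Note the far-away point still gets assigned to *some* cluster (it even drags that cluster's mean toward itself); it is the distance to the mean, not the assignment, that exposes it.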

Outlier Detection Method 4: Distance-Based¶
- Most outlier detection approaches are based on distances.
- Can we skip the model/plot/clustering and just measure distances?
- How many points lie in a radius "epsilon"?
- What is the distance to the k-th nearest neighbour?
Global Distance-Based Outlier Detection: KNN¶
- KNN outlier detection:
- For each point, compute the average distance to its KNN.
- Choose points with biggest values (or values above a threshold) as outliers.
- "Outliers" are points that are far from their KNNs.
- Goldstein and Uchida [2016]:
- Compared 19 methods on 10 datasets.
- KNN best for finding "global" outliers.
- "Local" outliers best found with local distance-based methods...
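A brute-force sketch of the KNN score (average distance to the k nearest neighbours), assuming a small dataset where the full pairwise-distance matrix fits in memory; the `knn_outlier_scores` name is illustrative:

```python
import numpy as np

def knn_outlier_scores(X, k=3):
    """Average Euclidean distance from each point to its k nearest neighbours."""
    d = np.linalg.norm(X[:, None] - X[None], axis=2)  # pairwise distance matrix
    np.fill_diagonal(d, np.inf)                       # exclude the point itself
    knn = np.sort(d, axis=1)[:, :k]                   # k smallest distances per row
    return knn.mean(axis=1)

X = np.array([[0.0, 0], [0, 1], [1, 0], [1, 1], [10, 10]])
scores = knn_outlier_scores(X, k=2)
print(scores.argmax())  # the isolated point gets the largest score
```

Points with the biggest scores (or scores above a threshold) are reported as outliers.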

Local Distance-Based Outlier Detection¶
- As with density-based clustering, problem with differing densities:
- Outlier \(o_2\) has similar density as elements of cluster \(C_1\).
- Basic idea behind local distance-based methods:
- Outlier \(o_2\) is "relatively" far compared to its neighbours.
- "Outlierness" ratio of example 'i':
\[
\text{outlierness}(i) = \frac{\text{average distance of } i \text{ to its KNNs}}{\text{average distance of the neighbours of } i \text{ to their KNNs}}
\]
- If outlierness > 1, \(x_i\) is further away from neighbours than expected.


Problem with Unsupervised Outlier Detection¶
- Can be hard to decide when to report an outlier:
- If you report too many non-outliers (false positives), users will stop trusting the system and turn it off.
Outlier Detection Method 5: Supervised¶
- Final approach to outlier detection is to use supervised learning:
- \(y_i = 1\) if \(x_i\) is an outlier.
- \(y_i = 0\) if \(x_i\) is a regular point.
- We can use our methods for supervised learning:
- We can find very complicated outlier patterns.
- Classic credit card fraud detection methods used decision trees.
- But it needs supervision:
- We need to know what outliers look like.
- We may not detect new "types" of outliers.
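Any supervised learner can be plugged in once labels are available. As a minimal stand-in for the decision trees mentioned above, here is a one-feature decision stump fit to toy labeled data (the data and threshold search are illustrative assumptions):

```python
import numpy as np

# Toy labeled data: y = 1 marks known outliers (assumes labels are available).
X = np.array([1.0, 1.1, 0.9, 1.2, 8.0, 9.5])
y = np.array([0, 0, 0, 0, 1, 1])

# Decision stump: pick the split threshold with the fewest training mistakes.
best_t, best_err = None, np.inf
for t in X:
    err = np.sum((X > t) != y)  # mistakes if we predict "outlier" when x > t
    if err < best_err:
        best_t, best_err = t, err

# Classify two new points with the learned rule.
print(best_t, (np.array([1.05, 12.0]) > best_t).astype(int))
```

The learned threshold separates the regular points from the labeled outliers, but as noted above it can only recognize the *types* of outliers seen during training.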
Summary¶
- Outlier detection is the task of finding unusually different examples.
- A concept that is very difficult to define.
- 5 approaches for outlier detection:
- Model-based: find unlikely examples given a model of the data.
- Graphical: plot data and use human inspection to find outliers.
- Cluster-based: check whether examples belong to clusters.
- Distance-based outlier detection: measure (relative) distance to neighbours.
- Supervised-learning for outlier detection: turns the task into supervised learning.
Outlierness (Symbol Definition)¶
- Let \(N_k(x_i)\) be the k-nearest neighbours of \(x_i\).
- Let \(D_k(x_i)\) be the average distance to its k-nearest neighbours:
\[ D_k(x_i) = \frac{1}{k} \sum_{j \in N_k(x_i)} \| x_i - x_j \| \]
- Outlierness is the ratio of \(D_k(x_i)\) to the average \(D_k(x_j)\) over its neighbours \(j\):
\[ O_k(x_i) = \frac{D_k(x_i)}{\frac{1}{k} \sum_{j \in N_k(x_i)} D_k(x_j)} \]
- If outlierness \(> 1\), \(x_i\) is further from its neighbours than expected.
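These definitions translate directly into NumPy; the `outlierness` helper and the toy cluster-plus-one-point data are illustrative:

```python
import numpy as np

def outlierness(X, k=3):
    """O_k(x_i) = D_k(x_i) / average of D_k over the k nearest neighbours of x_i."""
    d = np.linalg.norm(X[:, None] - X[None], axis=2)     # pairwise distances
    np.fill_diagonal(d, np.inf)                          # exclude the point itself
    nn = np.argsort(d, axis=1)[:, :k]                    # indices of k nearest neighbours
    Dk = np.take_along_axis(d, nn, axis=1).mean(axis=1)  # D_k for every point
    return Dk / Dk[nn].mean(axis=1)                      # ratio to neighbours' D_k

# Dense cluster plus one moderately-distant point: its ratio exceeds 1.
X = np.vstack([np.array([[0.0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1]]), [[1.0, 1.0]]])
O = outlierness(X, k=2)
print(O.argmax())
```

Points inside the dense cluster get \(O_k \approx 1\) (their \(D_k\) matches their neighbours'), while the distant point's ratio is much larger than 1 even though its absolute distances are modest — the "local" behaviour that global KNN scores miss.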
Training/Validation/Testing (Supervised)¶
- A typical supervised learning setup:
- Train parameters on dataset \(D_1\).
- Validate hyper-parameters on dataset \(D_2\).
- Test error evaluated on dataset \(D_3\).
- What should we choose for \(D_1\), \(D_2\), and \(D_3\)?
- Usual answer: should all be IID samples from data distribution \(D_s\).
Training/Validation/Testing (Outlier Detection)¶
- A typical outlier detection setup:
- Train parameters on dataset \(D_1\) (there may be no "training" to do).
- For example, find z-scores.
- Validate hyper-parameters on dataset \(D_2\) (for outlier detection).
- For example, see which z-score threshold separates \(D_1\) and \(D_2\).
- Test error evaluated on dataset \(D_3\) (for outlier detection).
- For example, check whether z-score recognizes \(D_3\) as outliers.
- \(D_1\) will still be samples from \(D_s\) (data distribution).
- \(D_2\) could use IID samples from another distribution \(D_m\).
- \(D_m\) represents the "none" or "outlier" class.
- Tune parameters so that \(D_m\) samples are outliers and \(D_s\) samples aren’t.
- Could just fit a binary classifier here.
- \(D_3\) could use IID samples from \(D_m\).
- How well do you do at recognizing "data" samples from "none" samples?
- What can go wrong?
- You needed to pick a distribution \(D_m\) to represent "none".
- But in the wild, your outliers might follow another "none" distribution.
- This procedure can overfit to your \(D_m\).
- You can overestimate your ability to detect outliers.
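This setup can be sketched as tuning a z-score threshold so that samples from a chosen \(D_m\) (here a shifted Gaussian, an illustrative assumption) are flagged while \(D_s\) samples are not:

```python
import numpy as np

rng = np.random.default_rng(1)
d1 = rng.normal(0, 1, 500)      # D_1: samples from the data distribution D_s
d2_in = rng.normal(0, 1, 200)   # D_2: more D_s samples...
d2_out = rng.normal(6, 1, 200)  # ...plus samples from a chosen "none" distribution D_m

# "Training": fit the z-score parameters on D_1.
mu, sigma = d1.mean(), d1.std()
z = lambda x: np.abs(x - mu) / sigma

# "Validation": pick the z-score threshold that best separates D_s from D_m on D_2.
thresholds = np.linspace(0.5, 5.0, 10)
acc = [((z(d2_in) <= t).mean() + (z(d2_out) > t).mean()) / 2 for t in thresholds]
best = thresholds[int(np.argmax(acc))]
print(best)
```

The tuned threshold works almost perfectly against *this* \(D_m\) — which is precisely how the procedure can overstate real-world performance if the outliers in the wild come from a different distribution.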
OD-Test: a better way to evaluate outlier detections¶
- A reasonable setup:
- \(D_1\) will still be samples from \(D_s\) (data distribution).
- \(D_2\) could use IID samples from another distribution \(D_m\).
- \(D_3\) could use IID samples from yet-another distribution \(D_t\).
- "How do you perform at detecting different types of outliers?"
- Seems like a harder problem, but arguably closer to reality.