Topic 8: Outliers¶
Outlier Detection¶
- Outlier detection:
- Find observations that are "unusually different" from the others.
- Also known as "anomaly detection".
- May want to remove outliers or focus on them (e.g., for security).
- Sources of outliers:
- Measurement errors.
- Data entry errors.
- Contamination from different sources.
- Rare events.
Applications of Outlier Detection¶
- Data Cleaning: Removing outliers may lead to better models.
- Security and Fault Detection: Detect network intrusions, DoS attacks.
- Fraud Detection: Identify irregularities in credit card transactions, stocks, voting.
- Natural Disaster Detection: Locate underwater earthquakes.
- Astronomy: Discover new classes of stars/planets.
- Genetics: Identify individuals with unique or ancient genes.
5 Types of Methods for Outlier Detection¶
- Model-Based Methods
- Graphical Approaches
- Cluster-Based Methods
- Distance-Based Methods
- Supervised-Learning Methods
- Note: This is a challenging problem with no definitive solutions.
Outlier Detection Method 1: Model-Based Outlier Detection¶
- Fit a probabilistic model.
- Outliers are examples with low density.
- Example:
- Assume data follows a normal distribution.
- Compute the z-score for 1D data:
\[ z_i = \frac{x_i-\mu}{\sigma} \]where \(\mu=\frac1n \sum^n_{i=1}x_i\) and \(\sigma = \sqrt{\frac1n \sum^n_{i=1}(x_i-\mu)^2}\)
- "Number of standard deviations away from the mean".
- Mark as "outlier" if \(|z| > 4\) or some other threshold.
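The z-score rule above can be sketched in a few lines of NumPy; the `zscore_outliers` helper name and the toy data are illustrative, not from the notes:

```python
import numpy as np

def zscore_outliers(x, threshold=4.0):
    """Flag 1D points whose |z-score| exceeds a threshold (sketch)."""
    mu = x.mean()
    sigma = x.std()  # population standard deviation, matching the formula above
    z = (x - mu) / sigma
    return np.abs(z) > threshold

x = np.array([1.0, 1.2, 0.9, 1.1, 1.0, 50.0])
print(zscore_outliers(x, threshold=2.0))  # only the last point is flagged
```

Note that even the extreme value 50 only reaches \(|z| \approx 2.2\) here, because it inflates the mean and standard deviation it is measured against — which is exactly the sensitivity problem discussed next.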
Problems with Z-Score¶
- Sensitivity to Outliers:
- Mean and variance are affected by outliers.
- Possible solutions: use quantiles or iteratively remove worst outliers.
- Assumption of Unimodality:
- Z-score assumes data is concentrated around a single mean.
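One quantile-based alternative, as a sketch: standardize by the median and interquartile range instead of the mean and standard deviation, so a single extreme point barely affects the score (the `robust_zscore` name is illustrative):

```python
import numpy as np

def robust_zscore(x):
    """Quantile-based score: median and IQR in place of mean and std."""
    med = np.median(x)
    q1, q3 = np.percentile(x, [25, 75])
    return (x - med) / (q3 - q1)

x = np.array([1.0, 1.2, 0.9, 1.1, 1.0, 50.0])
print(np.abs(robust_zscore(x)) > 2.0)  # the extreme point now stands out clearly
```

On the same data as before, the extreme point's robust score is in the hundreds rather than ~2, because the median and IQR are computed almost entirely from the regular points.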

Global vs. Local Outliers¶
- Global Outliers: Points that are unusual across the entire dataset.
- Local Outliers: Points that are unusual within their local neighborhood.
- Example:
- A point with the lowest z-score could be either a global or a local outlier, depending on context.

Outlier Detection Method 2: Graphical Approaches to Outlier Detection¶
- Look at a plot of the data.
- A human decides whether each point is an outlier.
Box Plots:
- Useful for detecting outliers in univariate data.
- Displays data distribution and highlights "extreme" values.
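Box plots conventionally mark points beyond 1.5 IQR from the quartiles (Tukey's rule) as "extreme"; that rule can be computed directly, without drawing the plot. A minimal sketch (the `boxplot_outliers` helper and data are illustrative):

```python
import numpy as np

def boxplot_outliers(x):
    """Tukey's boxplot rule: flag points beyond 1.5 * IQR from the quartiles."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return (x < lo) | (x > hi)

x = np.array([2.0, 3.0, 3.5, 4.0, 4.5, 5.0, 12.0])
print(boxplot_outliers(x))  # only the last point lies outside the whiskers
```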

Scatter Plots:
- Helpful in bivariate data.
- Outliers may appear as points far from the main cluster.
- Can detect complex patterns.

Scatterplot Arrays and Parallel Coordinate Plots:
- Useful for multivariate data.
- A scatterplot array shows all pairwise combinations of variables, but this is laborious in high dimensions and still examines only 2 variables at a time.
- In a parallel coordinate plot, outliers can be detected as lines that differ significantly from the general trend.

Scatterplot of 2-Dimensional PCA:
- Projects multivariate data to 2D for a single scatterplot.
- But the projection loses information, and PCA itself is sensitive to outliers.

Outlier Detection Method 3: Cluster-Based Outlier Detection¶
- Cluster the data.
- Find points that do not belong to clusters.
- Basic Idea: Outliers are points that do not fit into clusters.
- Examples:
- K-means:
- Find points that are far away from any mean.
- Find clusters with a small number of points.
- Density-based clustering:
- Outliers are points not assigned to any cluster.
- Hierarchical clustering:
- Outliers take longer to join other groups.
- Also good for outlier groups.
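The K-means variant ("find points that are far away from any mean") can be sketched as follows; the synthetic data, the fixed k = 2, and the distance threshold of 3 are all illustrative choices, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two tight clusters plus one far-away point at index 40.
X = np.vstack([rng.normal(0, 0.1, (20, 2)),
               rng.normal(5, 0.1, (20, 2)),
               [[20.0, 20.0]]])

# A few iterations of Lloyd's algorithm (k = 2), just as a sketch.
means = X[[0, 20]]
for _ in range(10):
    d = np.linalg.norm(X[:, None] - means[None], axis=2)  # distances to each mean
    labels = d.argmin(axis=1)                             # assign to nearest mean
    means = np.array([X[labels == c].mean(axis=0) for c in range(2)])

# Outliers: points far from their nearest mean.
dist_to_mean = np.linalg.norm(X - means[labels], axis=1)
print(np.flatnonzero(dist_to_mean > 3.0))  # flags only the far-away point
```

Note the far-away point still gets assigned to *some* cluster (it even drags that cluster's mean toward itself); it is the distance to the mean, not the assignment, that exposes it.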

Outlier Detection Method 4: Distance-Based¶
- Most outlier detection approaches are based on distances.
- Can we skip the model/plot/clustering and just measure distances?
- How many points lie in a radius "epsilon"?
- What is the distance to the k-th nearest neighbour?
Global Distance-Based Outlier Detection: KNN¶
- KNN outlier detection:
- For each point, compute the average distance to its KNN.
- Choose points with biggest values (or values above a threshold) as outliers.
- "Outliers" are points that are far from their KNNs.
- Goldstein and Uchida [2016]:
- Compared 19 methods on 10 datasets.
- KNN best for finding "global" outliers.
- "Local" outliers best found with local distance-based methods...
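A brute-force sketch of the KNN score (average distance to the k nearest neighbours), assuming a small dataset where the full pairwise-distance matrix fits in memory; the `knn_outlier_scores` name is illustrative:

```python
import numpy as np

def knn_outlier_scores(X, k=3):
    """Average Euclidean distance from each point to its k nearest neighbours."""
    d = np.linalg.norm(X[:, None] - X[None], axis=2)  # pairwise distance matrix
    np.fill_diagonal(d, np.inf)                       # exclude the point itself
    knn = np.sort(d, axis=1)[:, :k]                   # k smallest distances per row
    return knn.mean(axis=1)

X = np.array([[0.0, 0], [0, 1], [1, 0], [1, 1], [10, 10]])
scores = knn_outlier_scores(X, k=2)
print(scores.argmax())  # the isolated point gets the largest score
```

Points with the biggest scores (or scores above a threshold) are reported as outliers.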

Local Distance-Based Outlier Detection¶
- As with density-based clustering, problem with differing densities:
- Outlier \(o_2\) has similar density as elements of cluster \(C_1\).
- Basic idea behind local distance-based methods:
- Outlier \(o_2\) is "relatively" far compared to its neighbours.
- "Outlierness" ratio of example 'i':
\[
\text{outlierness}(i) = \frac{\text{average distance of } i \text{ to its KNNs}}{\text{average distance of the neighbours of } i \text{ to their KNNs}}
\]
- If outlierness > 1, \(x_i\) is further away from neighbours than expected.


Problem with Unsupervised Outlier Detection¶
- Can be hard to decide when to report an outlier:
- If you report too many non-outliers (false positives), users will stop trusting the system and turn it off.
Outlier Detection Method 5: Supervised¶
- Final approach to outlier detection is to use supervised learning:
- \(y_i = 1\) if \(x_i\) is an outlier.
- \(y_i = 0\) if \(x_i\) is a regular point.
- We can use our methods for supervised learning:
- We can find very complicated outlier patterns.
- Classic credit card fraud detection methods used decision trees.
- But it needs supervision:
- We need to know what outliers look like.
- We may not detect new "types" of outliers.
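Any supervised learner can be plugged in once labels are available. As a minimal stand-in for the decision trees mentioned above, here is a one-feature decision stump fit to toy labeled data (the data and threshold search are illustrative assumptions):

```python
import numpy as np

# Toy labeled data: y = 1 marks known outliers (assumes labels are available).
X = np.array([1.0, 1.1, 0.9, 1.2, 8.0, 9.5])
y = np.array([0, 0, 0, 0, 1, 1])

# Decision stump: pick the split threshold with the fewest training mistakes.
best_t, best_err = None, np.inf
for t in X:
    err = np.sum((X > t) != y)  # mistakes if we predict "outlier" when x > t
    if err < best_err:
        best_t, best_err = t, err

# Classify two new points with the learned rule.
print(best_t, (np.array([1.05, 12.0]) > best_t).astype(int))
```

The learned threshold separates the regular points from the labeled outliers, but as noted above it can only recognize the *types* of outliers seen during training.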
Summary¶
- Outlier detection is the task of finding unusually different examples.
- A concept that is very difficult to define.
- 5 approaches for outlier detection:
- Model-based: find unlikely examples given a model of the data.
- Graphical: plot data and use human inspection to find outliers.
- Cluster-based: check whether examples belong to clusters.
- Distance-based outlier detection: measure (relative) distance to neighbours.
- Supervised-learning for outlier detection: turns the task into supervised learning.
Outlierness (Symbol Definition)¶
- Let \(N_k(x_i)\) be the k-nearest neighbours of \(x_i\).
- Let \(D_k(x_i)\) be the average distance to its k-nearest neighbours:
\[ D_k(x_i) = \frac{1}{k} \sum_{j \in N_k(x_i)} \| x_i - x_j \| \]
- Outlierness is the ratio of \(D_k(x_i)\) to the average \(D_k(x_j)\) over its neighbours \(j\):
\[ O_k(x_i) = \frac{D_k(x_i)}{\frac{1}{k} \sum_{j \in N_k(x_i)} D_k(x_j)} \]
- If outlierness \(> 1\), \(x_i\) is further from its neighbours than expected.
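These definitions translate directly into NumPy; the `outlierness` helper and the toy cluster-plus-one-point data are illustrative:

```python
import numpy as np

def outlierness(X, k=3):
    """O_k(x_i) = D_k(x_i) / average of D_k over the k nearest neighbours of x_i."""
    d = np.linalg.norm(X[:, None] - X[None], axis=2)     # pairwise distances
    np.fill_diagonal(d, np.inf)                          # exclude the point itself
    nn = np.argsort(d, axis=1)[:, :k]                    # indices of k nearest neighbours
    Dk = np.take_along_axis(d, nn, axis=1).mean(axis=1)  # D_k for every point
    return Dk / Dk[nn].mean(axis=1)                      # ratio to neighbours' D_k

# Dense cluster plus one moderately-distant point: its ratio exceeds 1.
X = np.vstack([np.array([[0.0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1]]), [[1.0, 1.0]]])
O = outlierness(X, k=2)
print(O.argmax())
```

Points inside the dense cluster get \(O_k \approx 1\) (their \(D_k\) matches their neighbours'), while the distant point's ratio is much larger than 1 even though its absolute distances are modest — the "local" behaviour that global KNN scores miss.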
Training/Validation/Testing (Supervised)¶
- A typical supervised learning setup:
- Train parameters on dataset \(D_1\).
- Validate hyper-parameters on dataset \(D_2\).
- Test error evaluated on dataset \(D_3\).
- What should we choose for \(D_1\), \(D_2\), and \(D_3\)?
- Usual answer: should all be IID samples from data distribution \(D_s\).
Training/Validation/Testing (Outlier Detection)¶
- A typical outlier detection setup:
- Train parameters on dataset \(D_1\) (there may be no "training" to do).
- For example, find z-scores.
- Validate hyper-parameters on dataset \(D_2\) (for outlier detection).
- For example, see which z-score threshold separates \(D_1\) and \(D_2\).
- Test error evaluated on dataset \(D_3\) (for outlier detection).
- For example, check whether z-score recognizes \(D_3\) as outliers.
- \(D_1\) will still be samples from \(D_s\) (data distribution).
- \(D_2\) could use IID samples from another distribution \(D_m\).
- \(D_m\) represents the "none" or "outlier" class.
- Tune parameters so that \(D_m\) samples are outliers and \(D_s\) samples aren’t.
- Could just fit a binary classifier here.
- \(D_3\) could use IID samples from \(D_m\).
- How well do you do at recognizing "data" samples from "none" samples?
- What can go wrong?
- You needed to pick a distribution \(D_m\) to represent "none".
- But in the wild, your outliers might follow another "none" distribution.
- This procedure can overfit to your \(D_m\).
- You can overestimate your ability to detect outliers.
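This setup can be sketched as tuning a z-score threshold so that samples from a chosen \(D_m\) (here a shifted Gaussian, an illustrative assumption) are flagged while \(D_s\) samples are not:

```python
import numpy as np

rng = np.random.default_rng(1)
d1 = rng.normal(0, 1, 500)      # D_1: samples from the data distribution D_s
d2_in = rng.normal(0, 1, 200)   # D_2: more D_s samples...
d2_out = rng.normal(6, 1, 200)  # ...plus samples from a chosen "none" distribution D_m

# "Training": fit the z-score parameters on D_1.
mu, sigma = d1.mean(), d1.std()
z = lambda x: np.abs(x - mu) / sigma

# "Validation": pick the z-score threshold that best separates D_s from D_m on D_2.
thresholds = np.linspace(0.5, 5.0, 10)
acc = [((z(d2_in) <= t).mean() + (z(d2_out) > t).mean()) / 2 for t in thresholds]
best = thresholds[int(np.argmax(acc))]
print(best)
```

The tuned threshold works almost perfectly against *this* \(D_m\) — which is precisely how the procedure can overstate real-world performance if the outliers in the wild come from a different distribution.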
OD-Test: a better way to evaluate outlier detections¶
- A reasonable setup:
- \(D_1\) will still be samples from \(D_s\) (data distribution).
- \(D_2\) could use IID samples from another distribution \(D_m\).
- \(D_3\) could use IID samples from yet-another distribution \(D_t\).
- "How do you perform at detecting different types of outliers?"
- Seems like a harder problem, but arguably closer to reality.