
Topic 8: Outliers

Outlier Detection

  • Outlier detection:
    • Find observations that are "unusually different" from the others.
    • Also known as "anomaly detection".
    • May want to remove outliers or focus on them (e.g., for security).
  • Sources of outliers:
    • Measurement errors.
    • Data entry errors.
    • Contamination from different sources.
    • Rare events.

Applications of Outlier Detection

  • Data Cleaning: Removing outliers may lead to better models.
  • Security and Fault Detection: Detect network intrusions, DOS attacks.
  • Fraud Detection: Identify irregularities in credit card transactions, stocks, voting.
  • Natural Disaster Detection: Locate underwater earthquakes.
  • Astronomy: Discover new classes of stars/planets.
  • Genetics: Identify individuals with unique or ancient genes.

5 Types of Methods for Outlier Detection

  1. Model-Based Methods
  2. Graphical Approaches
  3. Cluster-Based Methods
  4. Distance-Based Methods
  5. Supervised-Learning Methods
  • Note: Outlier detection is a challenging problem with no definitive solutions.

Outlier Detection Method 1: Model-Based Outlier Detection

  1. Fit a probabilistic model.
  2. Outliers are examples with low density.
  3. Example:

    • Assume data follows a normal distribution.
    • Compute the z-score for 1D data:
    \[ z_i = \frac{x_i-\mu}{\sigma} \]

    where \(\mu=\frac{1}{n} \sum^n_{i=1}x_i\) and \(\sigma = \sqrt{\frac{1}{n} \sum^n_{i=1}(x_i-\mu)^2}\)

    • "Number of standard deviations away from the mean".
    • Mark as "outlier" if \(|z| > 4\) or some other threshold.
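The recipe above can be sketched in a few lines of NumPy (a minimal illustration; the synthetic data, the injected outlier, and the threshold of 4 are just example choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=2.0, size=1000)  # "regular" 1D data
x = np.append(x, 100.0)                         # inject one obvious outlier

mu = x.mean()                # sample mean
sigma = x.std()              # standard deviation (population form, as above)
z = (x - mu) / sigma         # number of standard deviations from the mean

outlier_idx = np.where(np.abs(z) > 4)[0]        # mark |z| > 4 as outliers
```

Note that the injected point inflates both the mean and the standard deviation, which is exactly the sensitivity problem discussed next.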

Problems with Z-Score

  • Sensitivity to Outliers:
    • Mean and variance are affected by outliers.
    • Possible solutions: use quantiles or iteratively remove worst outliers.
  • Assumption of Unimodality:
    • Z-score assumes data is concentrated around a single mean.
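One quantile-based variant (an illustration of the "use quantiles" fix above, not a prescribed method from these notes) replaces the mean with the median and estimates the scale from the interquartile range, so a single extreme point barely moves either statistic:

```python
import numpy as np

def robust_z(x):
    """Quantile-based z-score: median instead of the mean, and a scale
    estimate from the interquartile range instead of the standard deviation."""
    med = np.median(x)
    q1, q3 = np.percentile(x, [25, 75])
    scale = (q3 - q1) / 1.349  # the IQR of a normal is about 1.349 sigma
    return (x - med) / scale

x = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 10.1, 100.0])
flags = np.abs(robust_z(x)) > 4  # only the last point is flagged
```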


Global vs. Local Outliers

  • Global Outliers: Points that are unusual across the entire dataset.
  • Local Outliers: Points that are unusual within their local neighborhood.
  • Example:
    • The point with the lowest z-score could be either a global or a local outlier, depending on the surrounding data.


Outlier Detection Method 2: Graphical Approaches to Outlier Detection

  1. Look at a plot of the data.
  2. A human decides whether each point is an outlier.

Box Plots:

  • Useful for detecting outliers in univariate data.
  • Displays data distribution and highlights "extreme" values.
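Box plots conventionally flag points beyond the whiskers at \(Q_1 - 1.5\,\mathrm{IQR}\) and \(Q_3 + 1.5\,\mathrm{IQR}\) (Tukey's fences); the same rule can be applied numerically (a small sketch with made-up data):

```python
import numpy as np

x = np.array([3, 4, 5, 5, 6, 6, 7, 7, 8, 30])
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # Tukey's fences
outliers = x[(x < lo) | (x > hi)]          # points outside the whiskers
```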


Scatter Plots:

  • Helpful in bivariate data.
  • Outliers may appear as points far from the main cluster.
  • Can detect complex patterns.


Scatterplot Arrays and Parallel Coordinate Plots:

  • Useful for multivariate data.
  • Scatterplot arrays show all pairwise combinations of variables.
    • But laborious in high dimensions, and still only 2 variables at a time.
  • Parallel coordinate plots draw each example as a line across the variables.
    • Outliers can be detected as lines that differ significantly from the general trend.


Scatterplot of 2-Dimensional PCA:

  • Allows plotting high-dimensional data, but loses information, and PCA itself is sensitive to outliers.


Outlier Detection Method 3: Cluster-Based Outlier Detection

  1. Cluster the data
  2. Find points that do not belong to clusters.
  3. Basic Idea: Outliers are points that do not fit into clusters.
  4. Examples:
    1. K-means:
      • Find points that are far away from any mean.
      • Find clusters with a small number of points.
    2. Density-based clustering:
      • Outliers are points not assigned to any cluster.
    3. Hierarchical clustering:
      • Outliers take longer to join other groups.
      • Also good for finding groups of outliers.
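As a concrete sketch of the density-based case, scikit-learn's DBSCAN labels points not assigned to any cluster with -1, so cluster-based outlier detection falls out directly (the data and parameters here are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(0, 0.3, size=(50, 2)),   # dense cluster near the origin
    rng.normal(5, 0.3, size=(50, 2)),   # dense cluster near (5, 5)
    [[2.5, 2.5]],                       # isolated point between the clusters
])

labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
outliers = np.where(labels == -1)[0]    # DBSCAN marks noise points with -1
```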


Outlier Detection Method 4: Distance-Based

  • Most outlier detection approaches are based on distances.
  • Can we skip the model/plot/clustering and just measure distances?
    • How many points lie in a radius "epsilon"?
    • What is the distance to the k-th nearest neighbour?

Global Distance-Based Outlier Detection: KNN

  • KNN outlier detection:
    • For each point, compute the average distance to its KNN.
    • Choose points with biggest values (or values above a threshold) as outliers.
      • "Outliers" are points that are far from their KNNs.
  • Goldstein and Uchida [2016]:
    • Compared 19 methods on 10 datasets.
    • KNN best for finding "global" outliers.
    • "Local" outliers best found with local distance-based methods...


Local Distance-Based Outlier Detection

  • As with density-based clustering, differing densities cause problems:
    • Outlier \(o_2\) has a density similar to that of the elements of cluster \(C_1\).
  • Basic idea behind local distance-based methods:
    • Outlier \(o_2\) is "relatively" far compared to its neighbours.
  • "Outlierness" ratio of example 'i':
\[ \frac{\text{average distance of ``i'' to its KNNs}}{\text{average distance of neighbours of ``i'' to their KNNs}} \]
  • If outlierness > 1, \(x_i\) is further away from neighbours than expected.


Problem with Unsupervised Outlier Detection

  • It can be hard to decide when to report an outlier:
    • If you report too many non-outliers, users will lose trust and ignore (or disable) the system.

Outlier Detection Method 5: Supervised

  • Final approach to outlier detection is to use supervised learning:
    • \(y_i = 1\) if \(x_i\) is an outlier.
    • \(y_i = 0\) if \(x_i\) is a regular point.
  • We can use our methods for supervised learning:
    • We can find very complicated outlier patterns.
    • Classic credit card fraud detection methods used decision trees.
  • But it needs supervision:
    • We need to know what outliers look like.
    • We may not detect new "types" of outliers.
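Continuing the credit-card example, a minimal supervised sketch (with synthetic, made-up "transaction" features) trains a decision tree on labelled outliers:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)
X_reg = rng.normal(0, 1, size=(200, 2))   # regular examples (y = 0)
X_out = rng.normal(4, 0.5, size=(20, 2))  # known outlier pattern (y = 1)
X = np.vstack([X_reg, X_out])
y = np.array([0] * 200 + [1] * 20)

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
preds = clf.predict([[4.2, 3.8], [0.1, -0.2]])  # flags the first, not the second
```

As the notes caution, a model like this only recognizes outliers that resemble the labelled ones; a genuinely new outlier pattern would likely be classified as regular.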

Summary

  • Outlier detection is the task of finding unusually different examples.
    • A concept that is very difficult to define.
  • 5 approaches for outlier detection:
    • Model-based: find unlikely examples given a model of the data.
    • Graphical: plot data and use human inspection to find outliers.
    • Cluster-based: check whether examples belong to clusters.
    • Distance-based outlier detection: measure (relative) distance to neighbours.
    • Supervised-learning for outlier detection: turns the task into supervised learning.

Outlierness (Symbol Definition)

  • Let \(N_k(x_i)\) be the k-nearest neighbours of \(x_i\) .
  • Let \(D_k(x_i)\) be the average distance to k-nearest neighbours:

    \[ D_k(x_i) = \frac{1}{k} \sum_{j \in N_k(x_i)} \| x_i - x_j \| \]
  • Outlierness is the ratio of \(D_k(x_i)\) to the average \(D_k(x_j)\) for its neighbours \(j\) :

    \[ O_k(x_i) = \frac{D_k(x_i)}{\frac{1}{k} \sum_{j \in N_k(x_i)} D_k(x_j)} \]
  • If outlierness \(> 1\) , \(x_i\) is further away from neighbours than expected.
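The definition above translates directly into NumPy (a small sketch; the cluster-plus-one-point data is made up to make the ratio easy to check):

```python
import numpy as np

def outlierness(X, k=5):
    """O_k(x_i) = D_k(x_i) / average D_k over the k nearest neighbours of x_i."""
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(-1))        # pairwise Euclidean distances
    np.fill_diagonal(D, np.inf)             # a point is not its own neighbour
    nn = np.argsort(D, axis=1)[:, :k]       # indices of the k nearest neighbours
    Dk = np.take_along_axis(D, nn, axis=1).mean(axis=1)  # D_k for every point
    return Dk / Dk[nn].mean(axis=1)         # ratio to the neighbours' average D_k

# dense cluster plus one moderately distant point: its ratio is well above 1
X = np.vstack([np.random.default_rng(3).normal(0, 0.1, size=(30, 2)), [[1.0, 1.0]]])
O = outlierness(X)
```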

Training/Validation/Testing (Supervised)

  • A typical supervised learning setup:
    • Train parameters on dataset \(D_1\) .
    • Validate hyper-parameters on dataset \(D_2\) .
    • Test error evaluated on dataset \(D_3\) .
  • What should we choose for \(D_1\) , \(D_2\) , and \(D_3\) ?
  • Usual answer: should all be IID samples from data distribution \(D_s\) .

Training/Validation/Testing (Outlier Detection)

  • A typical outlier detection setup:
    • Train parameters on dataset \(D_1\) (there may be no "training" to do).
      • For example, find z-scores.
    • Validate hyper-parameters on dataset \(D_2\) (for outlier detection).
      • For example, see which z-score threshold separates \(D_1\) and \(D_2\).
    • Test error evaluated on dataset \(D_3\) (for outlier detection).
      • For example, check whether z-score recognizes \(D_3\) as outliers.
  • \(D_1\) will still be samples from \(D_s\) (data distribution).
  • \(D_2\) could use IID samples from another distribution \(D_m\).
    • \(D_m\) represents the "none" or "outlier" class.
    • Tune parameters so that \(D_m\) samples are outliers and \(D_s\) samples aren’t.
      • Could just fit a binary classifier here.
  • \(D_3\) could use IID samples from \(D_m\).
    • How well do you do at recognizing "data" samples from "none" samples?
  • What can go wrong?
  • You needed to pick a distribution \(D_m\) to represent "none".
    • But in the wild, your outliers might follow another "none" distribution.
    • This procedure can overfit to your \(D_m\).
      • You can overestimate your ability to detect outliers.
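The \(D_1\)/\(D_2\) tuning step above can be sketched as follows (all distributions here are synthetic stand-ins: \(D_s\) is N(0, 1) and the chosen \(D_m\) is N(8, 1)):

```python
import numpy as np

rng = np.random.default_rng(5)
D1 = rng.normal(0, 1, size=2000)       # "train": samples from D_s
D2_in = rng.normal(0, 1, size=200)     # "validate": more D_s samples...
D2_out = rng.normal(8, 1, size=200)    # ...plus samples from the chosen D_m

mu, sigma = D1.mean(), D1.std()        # z-score parameters fit on D_1 only
def z(x):
    return np.abs(x - mu) / sigma

# pick the z threshold that best separates D_s samples from D_m samples
thresholds = np.linspace(1, 6, 51)
acc = [(np.mean(z(D2_in) <= t) + np.mean(z(D2_out) > t)) / 2 for t in thresholds]
best = thresholds[int(np.argmax(acc))]
```

The caveat above applies: a threshold tuned this way can look perfect on this \(D_m\) and still miss outliers drawn from a different "none" distribution.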

OD-Test: A Better Way to Evaluate Outlier Detectors

  • A reasonable setup:
    • \(D_1\) will still be samples from \(D_s\) (data distribution).
    • \(D_2\) could use IID samples from another distribution \(D_m\).
    • \(D_3\) could use IID samples from yet-another distribution \(D_t\).
  • "How do you perform at detecting different types of outliers?"
    • Seems like a harder problem, but arguably closer to reality.