Topic 7: Clustering¶
Clustering¶
- Clustering:
- Input: set of examples described by features \(x_i\).
- Output: an assignment of examples to groups.
- Unlike classification, we are not given the 'groups'.
- Algorithm must discover groups.
- Example of groups we might discover in e-mail spam:
- 'Lucky winner' group
- 'Your payment has been processed' group
- 'I need your help' group
- General goal of clustering algorithms:
- Examples in the same group should be similar.
- Examples in different groups should be different.
- But the 'best' clustering is hard to define:
- We don’t have a test error.
- Generally, there is no 'best' method in unsupervised learning.
- So there are lots of methods: we’ll focus on important/representative ones.
- Why cluster?
- You could want to know what the groups are.
- You could want to find the group for a new example \(x_i\).
- You could want to find examples related to a new example \(x_i\).
- You could want a 'prototype' example for each group.
- For example, what does the typical breakfast look like?
Defining Clusters¶
Before discussing clustering strategies for unlabeled data, we need to consider what a "cluster" means.
- Similarity-based Definition: A cluster is a set of samples similar to each other, with different similarity criteria leading to different clustering results.
- Density-based Definition: A cluster is a region in feature space containing a high density of samples, with peaks in the sample density function associated with clusters.
The definition of clusters influences the clustering strategies we use.
Intuitive Clustering Approaches¶
Strategy 1: Mixture Density Strategy¶
If we define clusters based on high sample density regions:
- Estimate the combined PDF, \(p(x)\), using methods like Parzen window estimation.
- This density \(p(x)\) is a mixture of the class densities:
\[ p(x) = \sum_{i=1}^{K} p(x \mid c_i) P(c_i) \]
where \(P(c_i)\) is the prior probability of class \(i\).
If the classes are distinct and nearly equally likely, \(p(x)\) should have \(K\) peaks (local maxima), one for each class.
Define a cluster prototype \(\underline{z}_i\) at each of the \(K\) local maxima of \(p(x)\).
Clusters can be determined using distance metrics (e.g., Euclidean distance) or boundaries based on local minima.
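This peak-finding strategy can be sketched in a few lines. The example below is a minimal 1-D illustration assuming NumPy and SciPy are available; a Gaussian kernel density estimate stands in for the Parzen window estimate, and all variable names are illustrative.

```python
# Mixture-density strategy: estimate p(x), find its peaks, use the peaks
# as cluster prototypes, and assign samples to the nearest prototype.
import numpy as np
from scipy.stats import gaussian_kde
from scipy.signal import argrelmax

# Unlabeled samples drawn from two (hypothetical) classes.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 0.5, 200), rng.normal(2.0, 0.5, 200)])

# Step 1: estimate the combined density p(x) (a Gaussian KDE here, rather
# than a fixed-width Parzen window).
p = gaussian_kde(x)

# Step 2: locate local maxima of p(x) on a grid; each peak is a prototype z_i.
grid = np.linspace(x.min(), x.max(), 500)
density = p(grid)
prototypes = grid[argrelmax(density)[0]]

# Step 3: assign each sample to the nearest prototype (Euclidean distance).
labels = np.argmin(np.abs(x[:, None] - prototypes[None, :]), axis=1)
print("prototypes:", prototypes)
```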

Mixture Density Strategy Drawbacks¶
- Too few samples can result in a large number of spurious local maxima.
- Individual sample densities for a class may be skewed (e.g., Gamma distribution), which makes a simple Euclidean distance metric inadequate for defining cluster boundaries.
- Cluster peaks may be too close or even overlap, making distinction difficult.
Strategy 2: Principal Component Analysis (PCA) Strategy¶
Assume two compact and distinct clusters. The combined covariance matrix describes the shape of the total sample distribution.
- Take the eigenvector \(\phi_1\) with the largest eigenvalue as the direction of maximum separation.
- Project the samples onto \(\phi_1\).
- Construct a histogram of the projections and choose a threshold \(\theta\) at the minimum of the distribution to separate the clusters (a code sketch follows at the end of this strategy):

This scheme is simple and intuitive but cannot handle more complex clustering problems such as:
- Having more than two clusters (\(K > 2\))
- Non-spherical distributions
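Under the stated assumptions (two compact, well-separated clusters), the PCA strategy amounts to a handful of NumPy operations. The sketch below uses synthetic data; the names phi_1 and theta mirror the notation above and are otherwise illustrative.

```python
# PCA thresholding: project onto the top eigenvector of the combined
# covariance matrix, then split at the minimum of the projection histogram.
import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([-2, 0], 0.5, (200, 2)),
               rng.normal([2, 0], 0.5, (200, 2))])

# Eigenvector of the combined covariance matrix with the largest eigenvalue.
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
phi_1 = eigvecs[:, np.argmax(eigvals)]

# Project samples onto phi_1 and build a histogram of the projections.
proj = (X - X.mean(axis=0)) @ phi_1
counts, edges = np.histogram(proj, bins=30)

# Choose theta at the emptiest interior bin (the minimum between the two
# modes), then split the samples at that threshold.
theta_bin = 1 + np.argmin(counts[1:-1])
theta = 0.5 * (edges[theta_bin] + edges[theta_bin + 1])
labels = (proj > theta).astype(int)
```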
Strategy 3: Euclidean Distance Threshold Strategy¶
This strategy is based on similarity:
- A pattern is assigned to a cluster if its distance to the prototype is less than a threshold \(T\).
- Steps (a code sketch follows this list):
- Let \(\underline{z}_1 = \underline{x}_1\), the first sample.
- For each remaining sample, assign it to the closest prototype \(\underline{z}_j\) if \(|\underline{x}_i - \underline{z}_j| < T\). If no such prototype exists, let \(\underline{x}_i\) become a new prototype.
- Repeat until all samples are assigned to clusters.

- Threshold \(T\) is arbitrary.
- Small \(T\) results in many clusters.
- Large \(T\) results in fewer clusters.
- Sensitive to the order of samples.
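A minimal sketch of the single-pass procedure above, assuming NumPy; the function name is illustrative, and both \(T\) and the sample ordering remain arbitrary, as noted.

```python
# Distance-threshold clustering: one pass over the samples, growing the
# set of prototypes whenever no existing prototype is within T.
import numpy as np

def threshold_cluster(X, T):
    """Assign each sample to the nearest prototype within T,
    otherwise promote the sample to a new prototype."""
    prototypes = [X[0]]          # z_1 = x_1, the first sample
    labels = [0]
    for x in X[1:]:
        dists = [np.linalg.norm(x - z) for z in prototypes]
        j = int(np.argmin(dists))
        if dists[j] < T:
            labels.append(j)      # join the nearest existing cluster
        else:
            prototypes.append(x)  # start a new cluster with x as its prototype
            labels.append(len(prototypes) - 1)
    return np.array(labels), np.array(prototypes)
```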
To remedy these issues, formal criteria for clustering are developed to minimize arbitrary structure and detect natural structure.
Criteria for Clustering¶
A more formal approach to clustering is to define a criterion function. This function serves as a quantitative measure of the clusters.
Thus, a given partitioning of the sample set is optimum when the criterion function is maximized or minimized.
Commonly used criterion functions:
- Sum of squared errors
- Scatter volume
Sum of Squared Error Criterion¶
Suppose we want \(K\) clusters, and a single cluster, \(c_i\), has \(N_i\) samples. The sum of squared errors when the \(N_i\) samples are represented by the cluster prototype, \(\underline{z}_i\), can be expressed as:
\[ J_i = \sum_{\underline{x} \in c_i} \| \underline{x} - \underline{z}_i \|^2 \]
The prototype \(\underline{z}_i\) that minimizes this single-cluster error is simply the class mean, \(\underline{m}_i = \frac{1}{N_i} \sum_{\underline{x} \in c_i} \underline{x}\).
Therefore, the total sum of squared error criterion is:
\[ J_e = \sum_{i=1}^{K} \sum_{\underline{x} \in c_i} \| \underline{x} - \underline{m}_i \|^2 \]
A clustering or partitioning that minimizes this total error \(J_e\) is called a minimum variance partition.
Scatter Volume Criterion¶
Suppose we compute the scatter matrix for class \(c_i\), defined as:
\[ S_i = \sum_{\underline{x} \in c_i} (\underline{x} - \underline{m}_i)(\underline{x} - \underline{m}_i)^T \]
Here, \(S_i\) is just \(N_i\) times the covariance matrix of class \(i\).
Summing over the \(K\) classes, we obtain a measure of the total within-class scatter:
\[ S_W = \sum_{i=1}^{K} S_i \]
If we compute the trace of \(S_W\), we see that it is exactly the sum of squared error criterion:
\[ \operatorname{tr}(S_W) = \sum_{i=1}^{K} \sum_{\underline{x} \in c_i} \| \underline{x} - \underline{m}_i \|^2 = J_e \]
Another useful criterion is the scatter volume, given by the determinant of \(S_W\):
\[ J_v = |S_W| \]
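As a quick numerical check of the relation \(\operatorname{tr}(S_W) = J_e\), the sketch below computes \(J_e\), \(S_W\), its trace, and its determinant for a given partition. It assumes NumPy; the data and names are illustrative.

```python
# Compute the within-class scatter matrix S_W and the sum of squared
# errors J_e for a labelled partition, then verify tr(S_W) == J_e.
import numpy as np

def within_class_scatter(X, labels, K):
    d = X.shape[1]
    S_W = np.zeros((d, d))
    J_e = 0.0
    for i in range(K):
        Xi = X[labels == i]
        m_i = Xi.mean(axis=0)        # class mean
        D = Xi - m_i
        S_W += D.T @ D               # scatter matrix S_i
        J_e += np.sum(D ** 2)        # sum of squared errors for class i
    return S_W, J_e

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels = np.repeat([0, 1], 50)
S_W, J_e = within_class_scatter(X, labels, K=2)
print(np.isclose(np.trace(S_W), J_e))   # True: tr(S_W) equals J_e
print(np.linalg.det(S_W))               # scatter volume criterion J_v
```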
Minimum Variance Algorithms¶
With the criterion functions discussed, we can now develop specific clustering algorithms that satisfy these criterion functions.
Basic Assumption: The number of clusters is known, although more advanced methods exist for unknown cluster numbers by starting with a large number and converging to the ideal count.
Here, we focus on minimum variance algorithms, aiming to minimize the sum of squared error criterion.
K-means Algorithm¶
The most well-known minimum variance algorithm is the K-means algorithm.
Problem: Suppose we are given a set of \(N\) unlabeled samples \(\{\underline{x}_1, \ldots, \underline{x}_N\}\) and the desired number of clusters \(K\).
Steps of the K-means Algorithm:
- Choose prototypes \(\{ \underline{z}_1, \ldots, \underline{z}_K \}\) arbitrarily.
- Compute the Euclidean distance between each of the \(N\) samples and each cluster prototype.
- Assign each sample to the cluster with the nearest prototype:
\[ \underline{x} \in C_i \quad \text{if} \quad \|\underline{x} - \underline{z}_i\| \le \|\underline{x} - \underline{z}_j\| \ \text{ for all } j \ne i \]
- Compute the new cluster prototypes as the mean of the assigned samples:
\[ \underline{z}_i = \frac{1}{N_i} \sum_{\underline{x} \in C_i} \underline{x} \]
where \(N_i\) is the number of samples in cluster \(C_i\).
- If any cluster prototype has changed, go back to Step 2.
Stopping Criterion: The algorithm stops when there is no change in the cluster prototypes.
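The steps above translate directly into a short NumPy implementation. This is a minimal sketch rather than a production version; the handling of empty clusters (keep the old prototype) is one of several possible choices.

```python
# K-means: alternate nearest-prototype assignment and mean updates until
# the prototypes stop changing.
import numpy as np

def kmeans(X, K, rng=None, max_iter=100):
    rng = np.random.default_rng(rng)
    # Step 1: choose K prototypes arbitrarily (here: random distinct samples).
    Z = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(max_iter):
        # Steps 2-3: Euclidean distance to each prototype; assign to nearest.
        dists = np.linalg.norm(X[:, None, :] - Z[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Step 4: new prototypes are the means of the assigned samples.
        Z_new = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                          else Z[i] for i in range(K)])
        # Step 5: stop when no prototype has changed.
        if np.allclose(Z_new, Z):
            break
        Z = Z_new
    return Z, labels
```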
Considerations and Possible Solutions for K-means¶
There are several considerations associated with the K-means algorithm:
- Sensitivity to Initialization: K-means can yield different results depending on the initial choice of prototypes. This sensitivity can be reduced by running the algorithm multiple times with different initializations and choosing the best result.
- Fixed Number of Clusters: K-means requires the number of clusters to be specified in advance. This can be mitigated by starting with a large number of clusters and consolidating them.
- Sensitivity to Cluster Shape: K-means assumes clusters are spherical and of similar size, which may not suit all data distributions.
Variants of the K-means Algorithm¶
A more complex variant of K-means is the ISODATA (Iterative Self-Organizing Data Analysis Technique A):
- If the number of samples in any cluster is below a threshold, the cluster may be eliminated.
- If the maximum variance feature for a cluster exceeds a threshold and there are enough samples, the cluster may be split.
- If the distance between clusters is below a threshold, the clusters may be merged.
ISODATA is highly flexible but requires several thresholds to be defined. It is often used in an interactive, empirical environment.
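A full ISODATA implementation is beyond the scope of these notes, but the flavour of its discard/split/merge rules can be conveyed by a highly simplified, single-pass sketch. It assumes NumPy and the kmeans() sketch above; all thresholds and names are illustrative, and real ISODATA iterates these rules interactively.

```python
# One simplified ISODATA-style pass over a K-means result:
# discard small clusters, split spread-out clusters, merge close prototypes.
import numpy as np

def isodata_step(X, Z, labels, min_size=5, split_var=4.0, merge_dist=1.0):
    clusters = [X[labels == i] for i in range(len(Z))]
    # 1) Discard clusters with too few samples.
    clusters = [C for C in clusters if len(C) >= min_size]
    new_Z = []
    for C in clusters:
        m, var = C.mean(axis=0), C.var(axis=0)
        j = int(np.argmax(var))
        # 2) Split along the maximum-variance feature if it is too spread out
        #    and there are enough samples to form two clusters.
        if var[j] > split_var and len(C) >= 2 * min_size:
            offset = np.zeros_like(m)
            offset[j] = np.sqrt(var[j])
            new_Z.extend([m + offset, m - offset])
        else:
            new_Z.append(m)
    # 3) Merge prototypes that are closer than merge_dist.
    merged = []
    for z in new_Z:
        for k, zk in enumerate(merged):
            if np.linalg.norm(z - zk) < merge_dist:
                merged[k] = 0.5 * (z + zk)   # replace the pair by their midpoint
                break
        else:
            merged.append(z)
    return np.array(merged)

# Usage (assuming X and the kmeans() sketch above):
# Z, labels = kmeans(X, K=6, rng=0)
# Z_new = isodata_step(X, Z, labels)
```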
K-means Summary¶
- Most popular clustering method is k-means.
- Input:
- The number of clusters 'k' (hyper-parameter).
- Initial guess of the center (the "mean") of each cluster.
- K-Means Algorithm for Finding Means:
- Assign each \(x_i\) to its closest mean.
- Update the means based on the assignment.
- Repeat until convergence.
K-means Issues¶
- Guaranteed to converge when using Euclidean distance.
- Given a new test example:
- Assign it to the nearest mean to cluster it.
- Assumes you know the number of clusters 'k':
- Lots of heuristics to pick 'k', none fully satisfying.
- Each example is assigned to one (and only one) cluster:
- No possibility for overlapping clusters or leaving examples unassigned.
- It may converge to a sub-optimal solution...
- Don’t confuse KNN classification and k-means clustering:
| Property | KNN Classification | K-Means Clustering |
|---|---|---|
| Task | Supervised learning (given \(y_i\)) | Unsupervised learning (no given \(y_i\)) |
| Meaning of ‘k’ | Number of neighbors to consider (not the number of classes). | Number of clusters (always assign to the single nearest mean). |
| Initialization | No training phase. | Training that is sensitive to initialization. |
| Model complexity | Model is complicated for small ‘k’, simple for large ‘k’. | Model is simple for small ‘k’, complicated for large ‘k’. |
| Parametric? | Non-parametric: Stores data ‘X’ | Parametric (for ‘k’ not depending on ‘n’): Stores means ‘W’ |
- We can interpret the K-means steps as minimizing an objective:
- Total sum of squared distances from each example \(x_i\) to its center \(w_{\hat{y}_i}\):
$$
f(w_1, w_2, \dots, w_k, \hat{y}_1, \hat{y}_2, \dots, \hat{y}_n) = \sum_{i=1}^{n} \| w_{\hat{y}_i} - x_i \|^2
$$
where $\hat{y}_i \in \{1, 2, \dots, k\}$ is the cluster assigned to example $i$, and $w_{\hat{y}_i}$ is the mean (center) of that cluster.
- The K-means steps:
- Minimize \(f\) in terms of the \(\hat{y}_i\) (update cluster assignments).
- Minimize \(f\) in terms of the \(w_c\) (update means).
- Termination of the algorithm follows because:
- Each step does not increase the objective.
- There are a finite number of assignments to \(k\) clusters.
- Use \(f\) to choose between initializations (fixed 'k'); see the sketch below.
- Need to change \(w_c\) update under other distances:
- L1-norm: set \(w_c\) to median ("k-medians").
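The sketch below uses the objective \(f\) to pick the best of several random restarts for a fixed 'k'. It assumes NumPy and the kmeans() sketch above; the function names are illustrative.

```python
# Evaluate f (total squared distance to assigned means) and keep the best
# of several random restarts.
import numpy as np

def kmeans_objective(X, Z, labels):
    return np.sum((X - Z[labels]) ** 2)

def kmeans_best_of(X, k, restarts=10):
    best = None
    for seed in range(restarts):
        Z, labels = kmeans(X, k, rng=seed)
        f = kmeans_objective(X, Z, labels)
        if best is None or f < best[0]:
            best = (f, Z, labels)
    return best   # (objective value, means, assignments)
```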
Hierarchical Clustering¶
Both K-means and ISODATA can be sensitive to initial conditions, such as the number of clusters, prototype selection, and sample ordering.
Alternative Approach: Hierarchical clustering aims to build a sequence of partitions where clusters are combined in each step based on some similarity criterion.
This approach is often used to determine initial cluster starting points or for clustering itself.
Hierarchical Clustering: Simple Example¶
Suppose we have \(N\) samples, each treated as a one-sample cluster initially.
- Find the two most similar clusters and merge them, reducing the number of clusters to \(N-1\).
- Continue merging the most similar clusters until a stopping criterion is met.
This process can be conveniently represented using a dendrogram.
A dendrogram is a tree-like diagram that records the sequences of merges or splits in hierarchical clustering.
It provides a visual representation of the clustering process and helps to determine the most appropriate number of clusters.
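As a concrete illustration, SciPy's hierarchical clustering routines can build and plot such a dendrogram directly. This is a minimal sketch on synthetic data; it assumes SciPy and Matplotlib are available.

```python
# Agglomerative clustering: start from N one-sample clusters and merge the
# most similar pair at each step; the merge sequence forms a dendrogram.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(6, 1, (20, 2))])

# 'single' linkage corresponds to the minimum-distance (nearest neighbour)
# similarity measure discussed in the next subsection.
Z = linkage(X, method="single")
dendrogram(Z)
plt.show()
```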

Measures of Similarity for Hierarchical Clustering¶
Hierarchical clustering relies on different measures of similarity, often based on Euclidean distance. Each measure can yield different clustering results.
- Minimum Distance (Nearest Neighbor):
\[ d_{\text{min}}(c_i, c_j) = \min_{\underline{x} \in c_i, \, \underline{x}' \in c_j} |\underline{x} - \underline{x}'| \]
Nearest neighbor is well-suited for string-like clusters but is highly sensitive to noise and outliers.
- Maximum Distance (Furthest Neighbor):
\[ d_{\text{max}}(c_i, c_j) = \max_{\underline{x} \in c_i, \, \underline{x}' \in c_j} |\underline{x} - \underline{x}'| \]
Furthest neighbor discourages elongated clusters, as two sub-clusters are merged only if the least similar pair is sufficiently close. This measure is less sensitive to noise and outliers, making it suitable for compact, spherical clusters.
- Average Distance:
\[ d_{\text{avg}}(c_i, c_j) = \frac{1}{N_i N_j} \sum_{\underline{x} \in c_i} \sum_{\underline{x}' \in c_j} |\underline{x} - \underline{x}'| \]
Average distance is less sensitive to noise and outliers and balances the influence of all sample pairs.
- Distance Between Means:
\[ d_{\text{mean}}(c_i, c_j) = |\underline{m}_i - \underline{m}_j| \]
This measure, based on the distance between the means of the two clusters, is less sensitive to noise and outliers and is the most computationally efficient to implement.
Hierarchical Clustering: Termination Criterion¶
Hierarchical clustering continues merging clusters until a specified criterion is met. Possible termination criteria include:
- Maximum Number of Levels: Limit the hierarchy depth.
- Desired Number of Clusters: Stop once the target number of clusters is reached.
- Error Threshold: Halt merging when clusters become too dissimilar.
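In SciPy, the four similarity measures above correspond to the linkage methods 'single', 'complete', 'average', and 'centroid', and the termination criteria correspond to how the resulting tree is cut. The sketch below assumes SciPy and synthetic data; the threshold values are illustrative.

```python
# Build linkage matrices under the four measures and cut the hierarchy
# under two different termination criteria.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(6, 1, (20, 2))])

Z_min  = linkage(X, method="single")    # minimum distance (nearest neighbour)
Z_max  = linkage(X, method="complete")  # maximum distance (furthest neighbour)
Z_avg  = linkage(X, method="average")   # average distance
Z_mean = linkage(X, method="centroid")  # distance between means

# Stop once the desired number of clusters is reached...
labels_k = fcluster(Z_max, t=2, criterion="maxclust")
# ...or stop merging once clusters become too dissimilar (error threshold).
labels_d = fcluster(Z_max, t=3.0, criterion="distance")
```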
Density-based clustering¶
- The density-based clustering algorithm DBSCAN has two hyperparameters:
- Epsilon (\(\epsilon\)): distance we use to decide if another point is a "neighbour".
- MinNeighbours: number of neighbours needed to say a region is "dense".
- If you have at least MinNeighbours "neighbours", you are called a core point.
- Main idea: merge all neighbouring core points to form clusters.
- Intuitively, density-based clustering algorithm implements a "chain reaction" throughout the dense areas.
- For each example \(x_i\):
- If \(x_i\) is already assigned to a cluster, do nothing.
- Else: Test whether \(x_i\) is a "core" point (\(\geq\) MinNeighbours examples within "\(\epsilon\)")
- If \(x_i\) is not a core point, do nothing (this could be an outlier).
- If \(x_i\) is a core point, make a new cluster and call the "expand cluster" function.
- Which spreads the "reaction" to nearby points.
- "Expand cluster" function:
- Assign to this cluster all \(x_j\) within distance "\(\epsilon\)" of core point \(x_i\) to this cluster.
- For each new "core" point found, call "expand cluster" (recursively).
- Some points are not assigned to a cluster.
- Good or bad, depending on the application.
- Ambiguity of "non-core" (boundary) points: their cluster can depend on the order in which clusters are expanded.
- Otherwise, not sensitive to initialization (except for these boundary points).
- Sensitive to the choice of \(\epsilon\) and MinNeighbours.
- Original paper proposed an "elbow" method.
- If you get a new example, finding cluster is expensive.
- Need to compute distances to core points (or maybe all training points).
- In high dimensions, need a lot of points to "fill" the space.
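A minimal sketch of the procedure using scikit-learn's DBSCAN implementation, assuming scikit-learn is available; the data and the values chosen for \(\epsilon\) and MinNeighbours are illustrative.

```python
# DBSCAN: core points (enough neighbours within eps) are chained together
# into clusters; sparse points are left unassigned and labelled -1.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.3, (100, 2)),
               rng.normal(3, 0.3, (100, 2)),
               rng.uniform(-2, 5, (20, 2))])   # a few scattered outliers

# eps plays the role of epsilon; min_samples plays the role of MinNeighbours
# (note: scikit-learn counts the point itself in min_samples).
db = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = db.labels_                 # -1 marks points not assigned to any cluster
core_idx = db.core_sample_indices_  # indices of the "core" points
print(sorted(set(labels)))
```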