Topic 2: Distance Measures for Pattern Classification

Sample Statistics

The sample mean \(\underline{m}_x\) is given by: \(\underline{m}_x = \frac{1}{N} \sum_{i=1}^N \underline{x}_i\)

The covariance matrix is defined as: \(\Sigma = E[\underline{x}\,\underline{x}^T] - E[\underline x]E[\underline x]^T\)

The sample covariance \(S_x\) can be computed as: \(S_x = \frac{1}{N} \sum_{i=1}^N (\underline x_i - \underline m)(\underline x_i - \underline m)^T = \frac{1}{N} \sum_{i=1}^N (\underline x_i \underline x_i^T) - \underline{mm}^T\)

Multivariate Gaussian distribution: \(p(\mathbf{x}) = \frac{1}{(2\pi)^{n/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left\{ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right\}\)
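As a quick illustration of these definitions, here is a minimal NumPy sketch (the data values and variable names are made up for illustration) computing the sample mean, both forms of the sample covariance, and the Gaussian density:

```python
import numpy as np

# N samples of an n-dimensional feature vector, one sample per row (hypothetical data)
X = np.array([[2.0, 1.0], [3.0, 4.0], [5.0, 2.0], [4.0, 3.0]])
N, n = X.shape

m = X.mean(axis=0)                     # sample mean m_x
S = (X - m).T @ (X - m) / N            # sample covariance S_x (1/N convention, as above)
S_alt = X.T @ X / N - np.outer(m, m)   # equivalent form: mean of x x^T minus m m^T

def gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density p(x) with mean mu and covariance Sigma."""
    d = x - mu
    norm = (2 * np.pi) ** (len(x) / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)) / norm

print(np.allclose(S, S_alt))                      # True: the two forms agree
print(gaussian_pdf(np.array([3.0, 2.0]), m, S))   # density at a test point
```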

Equiprobability Contour

Just an ellipse with the following properties:

  • Centered at \(\boldsymbol{\mu}\)
  • Axes are eigenvectors of \(\Sigma\)
  • Lengths of the axes are proportional to the square roots of the eigenvalues of \(\Sigma\)


Pattern Recognition


Framework

  • Measurement/Preprocessing: input pattern is measured and preprocessed for improved recognition
  • Feature Extraction: features are extracted to create concise representation of pattern
  • Classification: assign a class label to the pattern based on some learned model
  • Feature selection: select a set of meaningful features to represent samples used for training the classifier
  • Learning: learn a model based on features representing the samples


Patterns have properties or attributes which distinguish them from other patterns (e.g., apples vs. oranges).

Measurements taken of a pattern should reflect either directly or indirectly the attributes associated with the pattern (e.g., images).

Features provide a concise representation of measurements to facilitate classification (e.g., shape, color, size, etc.).

Measurements are represented by a vector, \(\boldsymbol{x}\), consisting of \(m\) measurements obtained from a sensor (e.g., color intensities at each pixel in an image).

Features are represented by a vector, \(\boldsymbol{y}\), consisting of \(n\) features derived from the measurements (e.g., overall color, shape, size, texture, etc.).

Classes: Goal of pattern recognition is to assign input to a class. A class is a group or set of patterns which are similar in some sense (i.e., they share some common properties or attributes).

Prototype: an idealized, often simplified representation of a class that captures the essential characteristics or features of that class (e.g., the “mean” apple).

In vector space representation, sharing common properties implies closeness in feature space. Such closeness can be measured quantitatively using a distance metric \(d(\boldsymbol{y}_1,\boldsymbol{y}_2)\) (the lower the \(d\), the greater the similarity)

Minkowski Distance:

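The original figure is omitted here; for reference, the standard Minkowski distance of order \(p\) between two feature vectors is:

\[ d_M(\underline{x}, \underline{z}) = \left[ \sum_{i=1}^n |x_i - z_i|^p \right]^{\frac{1}{p}} \]

\(p = 1\) gives the city-block (Manhattan) distance and \(p = 2\) gives the ordinary Euclidean distance.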

Scaled Euclidean Distance:

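The original figure is omitted here; one common form, where each feature is scaled by its standard deviation \(\sigma_i\), is:

\[ d_S(\underline{x}, \underline{z}) = \left[ \sum_{i=1}^n \frac{(x_i - z_i)^2}{\sigma_i^2} \right]^{\frac{1}{2}} \]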

Minimum Euclidean Distance (MED) Classifier

Definition

\(\underline{x} \in c_k \text{ iff } d_E(\underline{x}, \underline{z}_k) < d_E(\underline{x}, \underline{z}_l)\) for all \(l \neq k\) where \(d_E(\underline{x}, \underline{z}_k) = [(\underline{x}-\underline{z}_k)^T(\underline{x}-\underline{z}_k)]^{1/2}\)

Meaning

\(\underline{x}\) belongs to the class \(k\) if and only if the Euclidean distance between \(\underline{x}\) and the prototype of \(c_k\) is less than the distance between \(\underline{x}\) and all other class prototypes.

  • Simplifying the decision criterion \(d_E(\underline{x}, \underline{z}_k) < d_E(\underline{x}, \underline{z}_l)\) gives us:
\[ -\underline{z}_k^T \underline{x} + \frac{1}{2} \underline{z}_k^T \underline{z}_k < -\underline{z}_l^T \underline{x} + \frac{1}{2} \underline{z}_l^T \underline{z}_l \]
  • This gives us the discrimination/decision function:
\[ g_k(\underline{x}) = -\underline{z}_k^T \underline{x} + \frac{1}{2} \underline{z}_k^T \underline{z}_k \]
  • Therefore, MED classification made based on minimum discriminant for given \(\underline{x}\):
\[ \underline{x} \in c_k \text{ iff } g_k(\underline{x}) < g_l(\underline{x}) \quad \forall l \neq k \]
  • (Decision Boundary) Formed by features equidistant to two classes (\(g_k(\underline{x}) = g_l(\underline{x})\)):
\[ g(\underline{x}) = g_k(\underline{x}) - g_l(\underline{x}) = 0 \]
  • For MED classifier, the decision boundary becomes:
\[ g(\underline{x}) = (\underline{z}_k - \underline{z}_l)^T \underline{x} + \frac{1}{2} (\underline{z}_l^T \underline{z}_l - \underline{z}_k^T \underline{z}_k) = 0 \]
\[ g(\underline{x}) = \underline{w}^T \underline{x} + w_o = 0 \]
  • The MED decision boundary is just a hyperplane with normal vector \(\underline{w}\), at a distance \(\frac{|w_o|}{\|\underline{w}\|}\) from the origin (see the sketch below).

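Below is a minimal NumPy sketch of an MED classifier built from the discriminant \(g_k\) above (the prototypes are arbitrary values standing in for class sample means):

```python
import numpy as np

def med_discriminant(x, z):
    """g_k(x) = -z_k^T x + 0.5 * z_k^T z_k (smaller means closer to prototype z_k)."""
    return -z @ x + 0.5 * z @ z

def med_classify(x, prototypes):
    """Assign x to the class whose prototype yields the minimum discriminant."""
    return int(np.argmin([med_discriminant(x, z) for z in prototypes]))

# Hypothetical class prototypes (e.g., class sample means)
prototypes = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
print(med_classify(np.array([1.0, 1.5]), prototypes))   # 0: nearer the first prototype
```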

Common prototypes

  • Sample mean: less sensitive to noise and outliers, but poor at handling long, thin, tendril-like clusters

    \[ \underline{z}_k = \frac{1}{N_k}\sum^{N_k}_{i=1} \underline{x}_i \]

    where \(N_k\) is the number of samples in class \(c_k\) and \(\underline{x}_i\) is the \(i^{th}\) sample of \(c_k\).

  • Nearest Neighbor (NN): better at handling long, thin, tendril-like clusters, but more sensitive to noise and outliers, and computationally complex (the distance to every labeled sample must be computed for each new point).

    \[ z_k(x) = \underline{x}_k \text{ such that } d_E(\underline{x}, \underline{x}_k) = \min_i d_E(\underline{x}, \underline{x}_i) \quad \forall \underline{x}_i \in c_k \]

    Meaning: For a given \(\underline{x}\) you wish to classify, you compute the distance between \(\underline{x}\) and all labeled samples, and you assign \(\underline{x}\) the same class as its nearest neighbor.

  • Furthest Neighbor (FNN): produces tighter, more compact clusters, but more sensitive to noise and outliers, and computationally complex (all prototypes must be re-computed for each new point)

    \[ z_k(x) = \underline{x}_k \text{ such that } d_E(\underline{x}, \underline{x}_k) = \max_i d_E(\underline{x}, \underline{x}_i) \quad \forall \underline{x}_i \in c_k \]

    Meaning: For a given \(\underline{x}\) you wish to classify, you compute the distance between \(\underline{x}\) and all labeled samples, and you define the prototype in each cluster as that point furthest from \(\underline{x}\).

  • K-nearest Neighbor: Less sensitive to noise and outliers, and better at handling long, thin, tendril-like clusters but computationally complex (need to re-compute all prototypes for each new point)

    For a given \(\underline{x}\) you wish to classify, you compute the distance between \(\underline{x}\) and all labeled samples, and you define the prototype in each cluster as the sample mean of the \(K\) samples within that cluster that are nearest to \(\underline{x}\) (see the sketch below).
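A minimal NumPy sketch of the NN and K-nearest-neighbor prototype rules described above (the labeled samples are hypothetical):

```python
import numpy as np

def knn_prototype(x, class_samples, K=1):
    """Prototype for one class: sample mean of the K labeled samples nearest to x.
    K = 1 reduces to the nearest neighbor (NN) prototype."""
    d = np.linalg.norm(class_samples - x, axis=1)   # Euclidean distance to every sample
    nearest = class_samples[np.argsort(d)[:K]]
    return nearest.mean(axis=0)

def classify(x, classes, K=1):
    """Assign x to the class whose (K)NN prototype is nearest in Euclidean distance."""
    protos = [knn_prototype(x, c, K) for c in classes]
    return int(np.argmin([np.linalg.norm(x - z) for z in protos]))

# Two hypothetical classes of labeled samples (one sample per row)
class_a = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0]])
class_b = np.array([[5.0, 5.0], [6.0, 5.5], [5.5, 6.0]])
print(classify(np.array([1.0, 1.0]), [class_a, class_b], K=3))   # 0
```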

Weighted Euclidean Distance Metric

Different measurements and features may:

  • be more or less dependent
  • have different units and scales
  • have different variances

The use of Euclidean distance can lead to poor classification performance in certain cases where the above situations hold true.

We can weight the features differently when measuring distances!

\[ d_{w_o}(\underline{x}, \underline{z}) = \left[ \sum_{i=1}^n \big(w_i(x_i-z_i)\big)^{2} \right]^{\frac12} \]

What we are doing is essentially scaling the feature axes with a linear transformation and then applying Euclidean distance metric.

\[ d_{w_o}(\underline{x}, \underline{z}) = d_E(\underline{x'}, \underline{z'}) \]

where \(\underline{x'} = W_o\underline{x}, \underline{z'} = W_o\underline{z}\) , and \(W_o\) is a diagonal matrix of weights.

The weighted Euclidean distance in the original feature space is just Euclidean distance in transformed space!


A more general form of the weighted Euclidean distance metric can be defined as:

\[ d_w(\underline{x}, \underline{z}) = \left[(\underline{x}- \underline{z})^TW^TW(\underline{x}- \underline{z})\right]^{\frac12} \]

where \(W\) is the general weight matrix of the form:

\[ \begin{bmatrix} W_{11} & W_{12} & \cdots & W_{1n} \\ W_{21} & W_{22} & \cdots & W_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ W_{n1} & W_{n2} & \cdots & W_{nn}\end{bmatrix} \]

which allows scaling AND rotation of the axes! To rotate the axes by an angle \(\theta\), use the rotation matrix:

\[ R = \begin{bmatrix} \cos \theta & -\sin \theta \\ \sin \theta & \cos \theta \end{bmatrix} \]
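As a sketch (the weights and rotation angle below are arbitrary choices for illustration), a general \(W\) can be built by composing a rotation with a per-axis scaling, and \(d_w\) in the original space then equals the Euclidean distance after transforming by \(W\):

```python
import numpy as np

def d_w(x, z, W):
    """General weighted Euclidean distance: [(x - z)^T W^T W (x - z)]^(1/2)."""
    d = x - z
    return np.sqrt(d @ W.T @ W @ d)

theta = np.deg2rad(30)                        # arbitrary rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
W = np.diag([1.0, 0.5]) @ R.T                 # rotate by R^T, then scale each new axis

x = np.array([2.0, 1.0])
z = np.array([0.0, 0.0])
print(d_w(x, z, W), np.linalg.norm(W @ x - W @ z))   # identical values
```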

Orthonormal Covariance Transforms

How do we determine the weights \(W\)?

Euclidean distance is only valid for cases where features are:

  • uncorrelated
  • unit variance

Visually, shape of distribution in feature space is a hypersphere.

We wish to find the \(W\) that transforms the shape of the distribution into a hypersphere!


Note: non-zero off-diagonal elements in the covariance matrix imply correlation between features.

To compute this transformation:

As a first step, we wish to transform the samples into a space in which features are uncorrelated, by finding the transform \(\Phi\) that diagonalizes the covariance matrix \(\Sigma\)

\[ \Phi^T \Sigma \Phi = \Lambda = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n\end{bmatrix} \]

where the columns of \(\Phi\) are the eigenvectors of \(\Sigma\), and the elements of \(\Lambda\) are the eigenvalues.

The transform \(\Phi\) that diagonalizes \(\Sigma\) has the eigenvectors as its columns:

\[ \Phi = \begin{bmatrix} \underline{\phi}_1 & \underline{\phi}_2 & \cdots & \underline{\phi}_n \end{bmatrix} \]

\(\Lambda\) will be our new covariance matrix.

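A minimal NumPy sketch of this diagonalization, using an assumed 2×2 covariance matrix with correlated features:

```python
import numpy as np

# Hypothetical covariance with non-zero off-diagonal entries (correlated features)
Sigma = np.array([[4.0, 1.5],
                  [1.5, 1.0]])

eigvals, Phi = np.linalg.eigh(Sigma)   # columns of Phi are the eigenvectors of Sigma
Lambda = np.diag(eigvals)

# Phi^T Sigma Phi = Lambda: in the rotated space the features are uncorrelated
print(np.allclose(Phi.T @ Sigma @ Phi, Lambda))   # True
```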

Generalized Euclidean Metric

Now we need to transform these uncorrelated features into ones with unit variances. To achieve equal variance features, we just need to scale the feature axes based on their eigenvalues!

The necessary scaling transformation (whitening transformation) is \(\Lambda^{-\frac12}\)

\[ \Lambda^{-\frac{1}{2}} = \begin{bmatrix} \frac{1}{\sqrt{\lambda_1}} & 0 & \cdots & 0 \\ 0 & \frac{1}{\sqrt{\lambda_2}} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{\sqrt{\lambda_n}} \end{bmatrix} \]

Therefore, the weight matrix W that we want can be defined as:

\[ W = \Lambda^{-\frac12}\Phi^T \]

What this weight matrix does is:

  • Rotate the coordinate axes to obtain a diagonal covariance matrix
  • Scale the axes to obtain the identity covariance matrix


Given \(W\), the final generalized Euclidean metric is:

\[ d_G(\underline{x}, \underline{z}) = \left[ (\underline{x} - \underline{z})^T \left(\Lambda^{-\frac{1}{2}} \Phi^T\right)^T \left(\Lambda^{-\frac{1}{2}} \Phi^T\right) (\underline{x} - \underline{z}) \right]^\frac{1}{2} \]

Simplifying the formulation gives:

\[ \begin{align*} d_G(\underline{x}, \underline{z}) & = \left[ (\underline{x} - \underline{z})^T (\Phi \Lambda^{-\frac{1}{2}} \Lambda^{-\frac{1}{2}} \Phi^T) (\underline{x} - \underline{z}) \right]^{\frac{1}{2}} \\ d_G(\underline{x}, \underline{z}) & = \left[ (\underline{x} - \underline{z})^T (\Phi \Lambda^{-1} \Phi^T) (\underline{x} - \underline{z}) \right]^{\frac{1}{2}} \\ d_G(\underline{x}, \underline{z}) & = \left[ (\underline{x} - \underline{z})^T (S^{-1}) (\underline{x} - \underline{z}) \right]^{\frac{1}{2}} \end{align*} \]
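A NumPy sketch (continuing with an assumed sample covariance \(S\)) confirming that the whitened Euclidean distance equals the quadratic form with \(S^{-1}\):

```python
import numpy as np

S = np.array([[4.0, 1.5],    # assumed sample covariance
              [1.5, 1.0]])
eigvals, Phi = np.linalg.eigh(S)
W = np.diag(1.0 / np.sqrt(eigvals)) @ Phi.T   # whitening transform W = Lambda^(-1/2) Phi^T

x = np.array([3.0, 1.0])
z = np.array([1.0, 0.0])

d_whitened = np.linalg.norm(W @ x - W @ z)                  # Euclidean distance after whitening
d_general = np.sqrt((x - z) @ np.linalg.inv(S) @ (x - z))   # [(x - z)^T S^{-1} (x - z)]^(1/2)
print(np.isclose(d_whitened, d_general))                    # True
```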

Aside:

To compute eigenvalues \(\lambda_1\) and \(\lambda_2\) of a \(2 \times 2\) matrix:

\[ \det(S-\lambda I) = 0 \]

To compute the corresponding eigenvectors \(\underline{\phi}_1\) and \(\underline{\phi}_2\), solve:

\[ (S-\lambda_i I)\underline{\phi}_i=\underline{0} \]
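For instance, for a hypothetical matrix \(S = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}\):

\[ \det(S - \lambda I) = (2-\lambda)^2 - 1 = 0 \quad \Rightarrow \quad \lambda_1 = 1, \; \lambda_2 = 3, \]

with corresponding normalized eigenvectors

\[ \underline{\phi}_1 = \tfrac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ -1 \end{bmatrix}, \qquad \underline{\phi}_2 = \tfrac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ 1 \end{bmatrix}. \]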

Minimum Intra-Class Distance (MICD) Classifier

For classification, we want to maximize within-class similarity. In terms of distance metrics, we want to minimize the intra-class distance: the distance between data points within the same class. Keeping points of the same class close together in feature space ensures they form similar, well-clustered groups, making them easier to classify correctly.

Based on the criterion of minimum mean squared distance within classes, the generalized Euclidean metric IS the minimum intra-class distance (MICD) metric.

The MICD classifier is defined by the following decision rule:

\[ \underline{x} \in A \text{ iff } (\underline{x} -\underline{m}_A)^TS_A^{-1}(\underline{x}-\underline{m}_A)<(\underline{x} -\underline{m}_B)^TS_B^{-1}(\underline{x}-\underline{m}_B) \]

Important observations:

  • The distance to each class is measured with its own metric, determined by that class's own covariance matrix (e.g., \(\underline{x}\) is transformed by \(S^{-1}_A\) to compute \(d_{MICD}(\underline{x}, \underline{m}_A)\), while \(\underline{x}\) is transformed by \(S^{-1}_B\) to compute \(d_{MICD}(\underline{x}, \underline{m}_B)\))
  • This means that while a linear transformation is associated with each metric, it is not possible in general to map both hyperellipsoidal distributions to hyperspheres with the same transformation.

What distance are we really measuring when we compare a pattern and class prototype in a transformed feature space?

  • To get into the transformed feature space for each class, we normalize each feature axis by that feature's standard deviation.
  • What that means is that the new unit of measure in the transformed space is now in terms of the number of standard deviations.

Unlike the Euclidean distance classifier, the decision boundaries between class regions are in general NOT linear.

The decision boundaries lie on the intersections of the corresponding equidistance contours around the classes. The equidistance contour for a class \(A\) can be defined as:

\[ (\underline{x}-\underline{m}_A)^TS^{-1}_A(\underline{x}-\underline{m}_A)=c \]

Therefore, the intersection of the corresponding equidistance contours around the two classes satisfies

\[ (\underline{x}-\underline{m}_A)^TS^{-1}_A(\underline{x}-\underline{m}_A) = (\underline{x}-\underline{m}_B)^TS^{-1}_B(\underline{x}-\underline{m}_B) \]

Expanding and simplifying leads to the general quadratic surface in hyperspace:

\[ \underline{x}^T Q_0 \underline{x} + Q_1 \underline{x} + Q_2 = 0, \]

where,

\[ \begin{align*} Q_0 &= S_A^{-1} - S_B^{-1},\\ Q_1 &= 2 [\underline{m}_B^T S_B^{-1} - \underline{m}_A^T S_A^{-1}], \\ Q_2 &= \underline{m}_A^T S_A^{-1} \underline{m}_A - \underline{m}_B^T S_B^{-1} \underline{m}_B. \end{align*} \]
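A minimal NumPy sketch of the MICD decision rule and the boundary coefficients \(Q_0\), \(Q_1\), \(Q_2\) (the class statistics are assumed values for illustration):

```python
import numpy as np

def micd_distance_sq(x, m, S):
    """Squared MICD distance: (x - m)^T S^{-1} (x - m)."""
    d = x - m
    return d @ np.linalg.solve(S, d)

def micd_classify(x, m_A, S_A, m_B, S_B):
    """x is assigned to class A iff its MICD distance to A is smaller than to B."""
    return 'A' if micd_distance_sq(x, m_A, S_A) < micd_distance_sq(x, m_B, S_B) else 'B'

# Assumed class statistics
m_A, S_A = np.array([0.0, 0.0]), np.array([[2.0, 0.5], [0.5, 1.0]])
m_B, S_B = np.array([3.0, 3.0]), np.array([[1.0, 0.0], [0.0, 3.0]])

# Coefficients of the quadratic decision boundary x^T Q0 x + Q1 x + Q2 = 0
SAi, SBi = np.linalg.inv(S_A), np.linalg.inv(S_B)
Q0 = SAi - SBi
Q1 = 2 * (m_B @ SBi - m_A @ SAi)
Q2 = m_A @ SAi @ m_A - m_B @ SBi @ m_B

x = np.array([1.0, 1.0])
print(micd_classify(x, m_A, S_A, m_B, S_B))   # 'A': x is closer to class A in MICD distance
print(np.isclose(x @ Q0 @ x + Q1 @ x + Q2,
                 micd_distance_sq(x, m_A, S_A) - micd_distance_sq(x, m_B, S_B)))   # True
```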

Some linear algebra properties:

  • \((AB)^T = B^TA^T\)
  • \((A+B)^T = A^T+B^T\)
  • \((A+B)C = AC+BC\)

Special cases:

  • Case 1: Different means, equal covariance (\(\underline{m}_A \neq \underline{m}_B\), \(S_A = S_B = S\))

    The parameters become:

    \[ \begin{align*}Q_0 &= S^{-1} - S^{-1} = 0 \\ Q_1 &= 2[\underline{m}_B^T S_B^{-1} - \underline{m}_A^T S_A^{-1}] = 2[\underline{m}_B - \underline{m}_A]^T S^{-1} \\ Q_2 &= \underline{m}_A^T S^{-1} \underline{m}_A - \underline{m}_B^T S^{-1} \underline{m}_B = (\underline{m}_A - \underline{m}_B)^T S^{-1} (\underline{m}_A + \underline{m}_B) \end{align*} \]

    This gives us the final decision boundary:

    \[ (\underline{m}_A - \underline{m}_B)^T S^{-1} \left[ \underline{x} - \frac{(\underline{m}_A + \underline{m}_B)}{2} \right] = 0 \]

    This is just a hyperplane (a straight line in 2-D) through \(\frac{\underline{m}_A + \underline{m}_B}{2}\) (i.e., the midpoint between the means), with an orientation influenced by \(S\).


  • Case 2: different means, different variances, uncorrelated features


  • Case 3: different means, different variances, correlated features


  • Case 4: Same means, different covariance (\(\underline{m}_A = \underline{m}_B = \underline{m}\), \(S_A \neq S_B\))

    The parameters become:

    \[ \begin{align*}Q_0 &= S_A^{-1} - S_B^{-1}\\Q_1 &= 2[\underline{m}_B^T S_B^{-1} - \underline{m}_A^T S_A^{-1}] = -2\underline{m}^T [S_A^{-1} - S_B^{-1}] \\ Q_2 &= \underline{m}_A^T S_A^{-1} \underline{m}_A - \underline{m}_B^T S_B^{-1} \underline{m}_B = \underline{m}^T (S_A^{-1} - S_B^{-1}) \underline{m} \end{align*} \]

    This gives us the final decision boundary:

    \[ (\underline{x} - \underline{m})^T (S_A^{-1} - S_B^{-1}) (\underline{x} - \underline{m}) = 0 \]

    This surface is complicated in general, but can be visualized more easily in some special cases.

    Note

    With the same means, different covariances, only one feature (\(n = 1\)), and \(\underline{m} = 0\): the MICD classification rule decides in favour of the class with the largest variance, regardless of \(x\)!

    • When applying MICD, distance is really measured in units of standard deviation.
    • Therefore, any unknown \(x\) is always closer to \(m = 0\) under the metric of the class with the largest variance (that class's whitening transform divides by a larger standard deviation, shrinking distances and pulling any unknown \(x\) closer)

Advantages: lower sensitivity to noise and outliers, great for handling class distributions that are well modelled by Gaussian models (e.g., mean and variance)

Disadvantages: poor at handling more complex class distributions