Topic 3: Probability Measures for Classification
Probability and Classification
Consider a general probability distribution of two variables, \(p(x_1, x_2)\). Using the product rule of probability, without loss of generality, we can write
\[
p(x_1, x_2) = p(x_1 \mid x_2) p(x_2)
\]
Similarly, if we had another "class" variable, \(c\), we can write, using the product rule again:
\[
p(x_1, x_2 \mid c) = p(x_1 \mid x_2, c) p(x_2 \mid c)
\]
In linear discriminant analysis (LDA), there are generally two types of approaches:
- Generative approach: Estimate the model, then define the classifier.
- Discriminative approach: Directly define the classifier.
Goal: Construct a discriminant function \(g(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0\) from the data.
- Suppose there are two classes \(C_1\) and \(C_2\).
- Each class is modelled as a Gaussian.
We are going to utilize two concepts: the class-conditional density
\[
p_{\mathbf{X}|Y}(\mathbf{x}|i) = \mathcal{N}(\mathbf{x} \mid \mu_i, \Sigma_i)
\]
and the class prior
\[
p_Y(i) = \pi_i
\]
High Dimension Gaussian
A \(d\)-dimensional Gaussian has the PDF
\[
p_x(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \exp\left(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right),
\]
where \(d\) denotes the dimensionality of the vector \(\mathbf{x}\).
- The mean vector \(\boldsymbol{\mu}\) is
\[
\boldsymbol{\mu} = \mathbb{E}[\mathbf{X}] = \begin{bmatrix}
\mathbb{E}[X_1] \\
\vdots \\
\mathbb{E}[X_d]
\end{bmatrix}
\]
- The covariance matrix \(\Sigma\) is
\[
\Sigma = \mathbb{E}[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T] = \begin{bmatrix}
\operatorname{Var}[X_1] & \operatorname{Cov}(X_1, X_2) & \cdots & \operatorname{Cov}(X_1, X_d) \\
\operatorname{Cov}(X_2, X_1) & \operatorname{Var}[X_2] & \cdots & \operatorname{Cov}(X_2, X_d) \\
\vdots & \vdots & \ddots & \vdots \\
\operatorname{Cov}(X_d, X_1) & \operatorname{Cov}(X_d, X_2) & \cdots & \operatorname{Var}[X_d]
\end{bmatrix}
\]
- \(\Sigma\) is always symmetric and positive semi-definite. (A real symmetric matrix \(A\) is positive semi-definite if \(x^T A x \geq 0\) for all \(x \in \mathbb{R}^n\).)
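As a quick numerical sketch (NumPy; the helper name `gaussian_pdf` and the test values are ours, not from the notes), the density above can be evaluated directly:

```python
import numpy as np

# Minimal sketch: evaluate the d-dimensional Gaussian PDF defined above.
def gaussian_pdf(x, mu, Sigma):
    d = len(mu)
    diff = x - mu
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), via a linear solve
    quad = diff @ np.linalg.solve(Sigma, diff)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

mu = np.array([0.0, 0.0])
Sigma = np.eye(2)
p = gaussian_pdf(mu, mu, Sigma)  # density at the mean
# With identity covariance in 2-D, the peak value is 1 / (2*pi)
```

Using `np.linalg.solve` instead of explicitly inverting \(\Sigma\) is the standard numerically stable choice.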
Conditional Gaussian
- Data: \(\{x_1, \ldots, x_N\}\).
- Class \(Y \in \{1, 2, \ldots, K\}\).
- Key quantities:
\[
p_{X|Y}(x|k) = \text{likelihood: probability of observing } x \text{ given class } k
\]
\[
p_Y(k) = \text{prior: probability of class } k
\]
\[
p_{Y|X}(k|x) = \text{posterior: probability of class } k \text{ given } x
\]
\[
p_{Y|X}(k|x) = \frac{p_{X|Y}(x|k) p_Y(k)}{p_X(x)} = \frac{p_{X|Y}(x|k) p_Y(k)}{\sum_{k'} p_{X|Y}(x|k') p_Y(k')}
\]
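The posterior computation above is just elementwise multiplication followed by normalization; a minimal sketch with made-up likelihood and prior values:

```python
import numpy as np

# Posterior via Bayes' rule: p(k|x) = p(x|k) pi_k / sum_{k'} p(x|k') pi_{k'}.
# The likelihood and prior values below are illustrative, not from the notes.
likelihood = np.array([0.20, 0.05, 0.10])  # p(x | k) for k = 1, 2, 3
prior = np.array([0.5, 0.3, 0.2])          # p_Y(k)

joint = likelihood * prior                 # numerator p(x|k) p_Y(k)
posterior = joint / joint.sum()            # normalize by p_X(x)
```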
Negative Log-Likelihood for Gaussian:
\[
- \log p_{\mathbf{X}|Y}(\mathbf{x}|k)
= - \log \left(\frac{1}{\sqrt{(2\pi)^d |\Sigma_k|}} \exp \left(-\frac{1}{2} (\mathbf{x} - \mu_k)^T \Sigma_k^{-1} (\mathbf{x} - \mu_k)\right)\right) = \frac{1}{2} (\mathbf{x} - \mu_k)^T \Sigma_k^{-1} (\mathbf{x} - \mu_k) + \frac{d}{2} \log 2\pi + \frac{1}{2} \log |\Sigma_k|
\]
- \((\mathbf{x} - \mu)^T \Sigma^{-1} (\mathbf{x} - \mu) \geq 0\), always.
- \(\sqrt{(\mathbf{x} - \mu)^T \Sigma^{-1} (\mathbf{x} - \mu)}\) is called Mahalanobis distance.
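A small sketch of the Mahalanobis distance and the Gaussian negative log-likelihood above (NumPy; the function names and test values are ours):

```python
import numpy as np

# Sketch: Mahalanobis distance and Gaussian negative log-likelihood.
def mahalanobis(x, mu, Sigma):
    diff = x - mu
    return np.sqrt(diff @ np.linalg.solve(Sigma, diff))

def gaussian_nll(x, mu, Sigma):
    # 0.5 (x-mu)^T Sigma^{-1} (x-mu) + (d/2) log(2 pi) + 0.5 log |Sigma|
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)
    return 0.5 * quad + 0.5 * d * np.log(2 * np.pi) + 0.5 * np.log(np.linalg.det(Sigma))

x = np.array([1.0, 2.0])
mu = np.zeros(2)
Sigma = np.diag([1.0, 4.0])
dM = mahalanobis(x, mu, Sigma)  # sqrt(1^2/1 + 2^2/4) = sqrt(2)
```

Note how the larger variance in the second coordinate shrinks that coordinate's contribution to the distance.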
Bayesian classifier
\[
P(c_j \mid d) = \text{probability of class } c_j, \text{ given that we have observed }d
\]
Bayesian classifiers use Bayes theorem, which says
\[
P(c_j \mid d) = \frac{P(d \mid c_j)P(c_j)}{P(d)}
\]
To simplify the task, naïve Bayesian classifiers assume attributes have independent distributions, and thereby estimate
\[
P(\mathbf{d} \mid c_j) = P(d_1 \mid c_j) \cdots P(d_n \mid c_j)
\]
Naïve Bayes is NOT sensitive to irrelevant features.
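A toy Gaussian naïve Bayes scorer illustrating the independence assumption above (all function names and parameter values are made-up illustration values, not from the notes):

```python
import numpy as np

# Toy Gaussian naive Bayes: each attribute is modelled independently per class,
# so the joint class-conditional is a product of 1-D Gaussians.
def log_gauss(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma ** 2) - 0.5 * ((x - mu) / sigma) ** 2

def nb_log_scores(x, params, priors):
    # log P(c) + sum_i log P(d_i | c)  (the shared -log P(d) constant is dropped)
    scores = {}
    for c in priors:
        mus, sigmas = params[c]
        scores[c] = np.log(priors[c]) + sum(
            log_gauss(xi, m, s) for xi, m, s in zip(x, mus, sigmas)
        )
    return scores

params = {"A": ([0.0, 0.0], [1.0, 1.0]), "B": ([3.0, 3.0], [1.0, 1.0])}
priors = {"A": 0.5, "B": 0.5}
pred = max(nb_log_scores([0.2, -0.1], params, priors).items(), key=lambda kv: kv[1])[0]
```

Working in log space turns the product over attributes into a sum, which avoids numerical underflow.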
A general rule of statistical decision theory is to minimize the “cost” associated with making a wrong decision.
Let \(L_{ij}\) be the cost of deciding on class \(c_j\) when the true class is \(c_i\).
The total risk associated with deciding that \(\underline{x}\) belongs to \(c_j\) can be defined by the expected cost:
\[
r_j(\underline{x}) = \sum_{i=1}^K L_{ij} P(c_i|\underline{x})
\]
where \(K\) is the number of classes and \(P(c_i|\underline{x})\) is the posterior distribution of class \(c_i\) given the pattern \(\underline{x}\).
For the two class case:
\[
r_1(\underline{x}) = L_{11}P(c_1|\underline{x})+L_{21}P(c_2|\underline{x})
\]
\[
r_2(\underline{x}) = L_{12}P(c_1|\underline{x})+L_{22}P(c_2|\underline{x})
\]
Applying Bayes' rule gives us:
\[
r_1(\underline{x}) = \frac{L_{11} P(\underline{x} \mid c_1) P(c_1) + L_{21} P(\underline{x} \mid c_2) P(c_2)}{p(\underline{x})}
\]
\[
r_2(\underline{x}) = \frac{L_{12} P(\underline{x} \mid c_1) P(c_1) + L_{22} P(\underline{x} \mid c_2) P(c_2)}{p(\underline{x})}
\]
The general \(K\)-class Bayesian classifier is defined as follows, and minimizes total risk:
\[
\underline{x} \in c_i \quad \text{iff} \quad r_i(\underline{x}) < r_j(\underline{x}) \quad \forall j \neq i
\]
For the two class case:
\[
\left( L_{12} - L_{11} \right) P(\underline{x} \mid c_1) P(c_1) \underset{\text{2}}{\overset{\text{1}}{\substack{> \\ <}}} \left( L_{21} - L_{22} \right) P(\underline{x} \mid c_2) P(c_2)
\]
The most common cost used in the situation where no other cost criterion is known is the “zero-one” loss function:
\[
L_{ij} =
\begin{cases}
0 & \text{if } i = j \\
1 & \text{if } i \neq j
\end{cases}
\]
Meaning: all errors have equal costs.
Given the “zero-one” loss function, the total risk function becomes:
\[
r_j(\underline{x}) = \sum_{\substack{i=1 \\ i \neq j}}^K P(c_i \mid \underline{x}) = P(\epsilon \mid \underline{x})
\]
So minimizing total risk in this case is the same as minimizing probability of error!
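A short sketch of risk-based decisions with an asymmetric loss matrix (the loss and posterior values are illustrative); note how a costly error can override the MAP choice:

```python
import numpy as np

# Risk-based decision with an asymmetric loss matrix.
# L[i, j] is the cost of deciding class j when the true class is i (0-indexed).
L = np.array([[0.0, 10.0],
              [1.0, 0.0]])
posterior = np.array([0.3, 0.7])  # P(c_1 | x), P(c_2 | x)

risks = posterior @ L             # r_j = sum_i L[i, j] P(c_i | x)
decision = int(np.argmin(risks))  # minimize the expected cost
# Class 1 wins despite its lower posterior: deciding class 2 wrongly is too costly.
```

With the zero-one loss (`L = 1 - np.eye(2)`), the same code reduces to picking the class with the largest posterior.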
Maximum a Posteriori (MAP) Probability Classifier
Given two classes \(A\) and \(B\), the MAP classifier can be defined as follows:
\[
P(A \mid \underline{x}) \underset{\text{B}}{\overset{\text{A}}{\substack{> \\ <}}} P(B \mid \underline{x})
\]
where \(P(A \mid \underline{x})\) and \(P(B \mid \underline{x})\) are the posterior class probabilities of A and B, respectively, given observation \(\underline{x}\).
Meaning: All patterns with a higher posterior probability for \(A\) than for \(B\) will be classified as \(A\), and all patterns with a higher posterior probability for \(B\) than for \(A\) will be classified as \(B\).
Class probability models are usually given in terms of the class-conditional probabilities \(P(\underline{x}|A)\) and \(P(\underline{x}|B)\), so it is more convenient to express the MAP rule in the form:
\[
\frac{P(\underline{x}|A)}{P(\underline{x}|B)} \underset{\text{B}}{\overset{\text{A}}{\substack{> \\ <}}} \frac{P(B)}{P(A)}
\]
\[
I(\underline{x}) \underset{\text{B}}{\overset{\text{A}}{\substack{> \\ <}}} \theta
\]
where \(I(\underline{x}) = P(\underline{x}|A)/P(\underline{x}|B)\) is the likelihood ratio and \(\theta = P(B)/P(A)\) is the threshold.
When dealing with probability density functions with exponential dependence (e.g., Gamma, Gaussian, etc.), it is more convenient to deal with MAP in the log-likelihood form:
\[
\log I(\underline{x}) \underset{\text{B}}{\overset{\text{A}}{\substack{> \\ <}}} \log \theta
\]
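A minimal log-likelihood-ratio test for two equal-variance 1-D Gaussians (all parameter values are illustrative):

```python
import numpy as np

# Log-likelihood-ratio test: decide A when log I(x) > log theta.
# For equal-variance Gaussians,
# log I(x) = 0.5 * [((x - mu_B)/sigma)^2 - ((x - mu_A)/sigma)^2].
def log_lr(x, mu_A, mu_B, sigma):
    return 0.5 * (((x - mu_B) / sigma) ** 2 - ((x - mu_A) / sigma) ** 2)

theta = 0.3 / 0.7  # threshold P(B) / P(A), assuming P(A) = 0.7
decide_A = log_lr(1.2, mu_A=0.0, mu_B=2.0, sigma=1.0) > np.log(theta)
```

Here \(x = 1.2\) lies slightly on class \(B\)'s side of the midpoint, but the larger prior on \(A\) shifts the threshold enough that \(A\) is chosen.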
Ideally, we would like to use the MAP classifier, which chooses the most probable class. However, in many cases the priors \(P(A)\) and \(P(B)\) are unknown, making it impossible to use the posteriors \(P(A\mid \underline{x})\) and \(P(B\mid \underline{x})\). A common alternative is, instead of choosing the most probable class, to choose the class that makes the observed pattern \(\underline{x}\) most probable.
Instead of maximizing the posterior, we maximize the likelihood (the maximum likelihood, or ML, classifier):
\[
P(\underline{x} | A)\underset{\text{B}}{\overset{\text{A}}{\substack{> \\ <}}}{P(\underline{x} | B)}
\]
In likelihood form:
\[
\frac{P(\underline{x} | A)}{P(\underline{x} | B)} \underset{\text{B}}{\overset{\text{A}}{\substack{> \\ <}}} 1
\]
This can be viewed as a special case of MAP where \(P(A) = P(B)\).
MAP Classifier for Normal Distributions
By far the most popular conditional class distribution model is the Gaussian distribution:
\[
p(x | A) = \mathcal{N}(\mu_A, \sigma_A^2) = \frac{1}{\sqrt{2\pi }\sigma_A} \exp \left(-\frac{1}{2} \left(\frac{x - \mu_A}{\sigma_A}\right)^2\right)
\]
and
\[
p(x | B) = \mathcal{N}(\mu_B, \sigma_B^2)
\]
For the two-class case where both distributions are Gaussian, the MAP classifier becomes:
\[
\frac{\mathcal{N}(\mu_A, \sigma_A^2)}{\mathcal{N}(\mu_B, \sigma_B^2)} \underset{\text{B}}{\overset{\text{A}}{\substack{> \\ <}}} \frac{P(B)}{P(A)}
\]
\[
\frac{\exp \left(-\frac{1}{2} \left(\frac{x - \mu_A}{\sigma_A}\right)^2\right)}{\exp \left(-\frac{1}{2} \left(\frac{x - \mu_B}{\sigma_B}\right)^2\right)} \underset{\text{B}}{\overset{\text{A}}{\substack{> \\ <}}} \frac{\sigma_AP(B)}{\sigma_BP(A)}
\]
In log-likelihood form:
\[
\left[\left(\frac{x-\mu_B}{\sigma_B}\right)^2 - \left(\frac{x-\mu_A}{\sigma_A}\right)^2\right] \underset{\text{B}}{\overset{\text{A}}{\substack{> \\ <}}} 2 \left[\ln (\sigma_A P(B)) - \ln (\sigma_B P(A))\right]
\]
The decision boundary (threshold) for the MAP classifier where \(P(x|A)\) and \(P(x|B)\) are Gaussian distributions can be found by solving the following expression for \(x\):
\[
\begin{equation*}
\left(\frac{x - \mu_B}{\sigma_B}\right)^2 - \left(\frac{x - \mu_A}{\sigma_A}\right)^2 = 2 \left[\ln (\sigma_A P(B)) - \ln (\sigma_B P(A))\right]
\end{equation*}
\]
which can be simplified to:
\[
\begin{equation*}
x^2 \left(\frac{1}{\sigma_B^2} - \frac{1}{\sigma_A^2}\right) - 2x \left(\frac{\mu_B}{\sigma_B^2} - \frac{\mu_A}{\sigma_A^2}\right) + \left(\frac{\mu_B^2}{\sigma_B^2} - \frac{\mu_A^2}{\sigma_A^2}\right) = 2 \ln \left(\frac{\sigma_A P(B)}{\sigma_B P(A)}\right)
\end{equation*}
\]
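The quadratic above can be solved numerically; a sketch for the equal-variance case, where the quadratic term vanishes and the boundary lands at the midpoint of the means (all parameter values are illustrative):

```python
import numpy as np

# Solve the simplified 1-D boundary equation a x^2 + b x + c = 0,
# obtained by moving the right-hand side over to the left.
mu_A, s_A, pA = 0.0, 1.0, 0.5
mu_B, s_B, pB = 2.0, 1.0, 0.5

a = 1 / s_B**2 - 1 / s_A**2
b = -2 * (mu_B / s_B**2 - mu_A / s_A**2)
c = (mu_B**2 / s_B**2 - mu_A**2 / s_A**2) - 2 * np.log(s_A * pB / (s_B * pA))

if abs(a) < 1e-12:
    x_star = -c / b               # equal variances: a single linear boundary
else:
    x_star = np.roots([a, b, c])  # in general two roots (quadratic boundary)
# Equal variances and priors put the boundary at the midpoint of the means.
```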
For the \(n\)-dimensional case, where \(p(\underline{x}|A) = \mathcal{N}(\underline{\mu}_A, \Sigma_A)\) and \(p(\underline{x}|B) = \mathcal{N}(\underline{\mu}_B, \Sigma_B)\):
\[
\frac{P(A) \exp \left[ -\frac{1}{2} (\underline{x} - \underline{\mu}_A)^T \Sigma_A^{-1} (\underline{x} - \underline{\mu}_A) \right]}{(2\pi)^{n/2} |\Sigma_A|^{1/2}} \underset{\text{B}}{\overset{\text{A}}{\substack{> \\ <}}} \frac{P(B) \exp \left[ -\frac{1}{2} (\underline{x} - \underline{\mu}_B)^T \Sigma_B^{-1} (\underline{x} - \underline{\mu}_B) \right]}{(2\pi)^{n/2} |\Sigma_B|^{1/2}}
\]
\[
\frac{\exp\left[-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_A)^T \boldsymbol{\Sigma}_A^{-1} (\mathbf{x} - \boldsymbol{\mu}_A)\right]}{\exp\left[-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_B)^T \boldsymbol{\Sigma}_B^{-1} (\mathbf{x} - \boldsymbol{\mu}_B)\right]} \underset{\text{B}}{\overset{\text{A}}{\substack{> \\ <}}}\frac{|\boldsymbol{\Sigma}_A|^{1/2} P(B)}{|\boldsymbol{\Sigma}_B|^{1/2} P(A)}
\]
Taking the log and simplifying:
\[
\begin{equation*}(\mathbf{x} - \boldsymbol{\mu}_B)^T \Sigma_B^{-1} (\mathbf{x} - \boldsymbol{\mu}_B) - (\mathbf{x} - \boldsymbol{\mu}_A)^T \Sigma_A^{-1} (\mathbf{x} - \boldsymbol{\mu}_A) \underset{\text{B}}{\overset{\text{A}}{\substack{> \\ <}}} 2 \ln \left[ \frac{|\Sigma_A|^{1/2} P(B)}{|\Sigma_B|^{1/2} P(A)} \right]\end{equation*}
\]
\[
\begin{equation*}(\mathbf{x} - \boldsymbol{\mu}_B)^T \Sigma_B^{-1} (\mathbf{x} - \boldsymbol{\mu}_B) - (\mathbf{x} - \boldsymbol{\mu}_A)^T \Sigma_A^{-1} (\mathbf{x} - \boldsymbol{\mu}_A) \underset{\text{B}}{\overset{\text{A}}{\substack{> \\ <}}} 2 \ln \left[ \frac{P(B)}{P(A)} \right] + \ln \left[ \frac{|\Sigma_A|}{|\Sigma_B|} \right]\end{equation*}
\]
MAP Decision Boundaries for Normal Distribution
What are the MAP decision boundaries if our classes can be characterized by normal distributions? The boundary is given by
\[
\begin{equation}\mathbf{x}^T Q_0 \mathbf{x} + Q_1 \mathbf{x} + Q_2 + 2 Q_3 + Q_4 = 0,\end{equation}
\]
where (writing \(S\) for \(\Sigma\) and \(\mathbf{m}\) for \(\underline{\mu}\))
\[
\begin{align*}
Q_0 &= S_A^{-1} - S_B^{-1}, \\
Q_1 &= 2(\mathbf{m}_B^T S_B^{-1} - \mathbf{m}_A^T S_A^{-1}), \\
Q_2 &= \mathbf{m}_A^T S_A^{-1} \mathbf{m}_A - \mathbf{m}_B^T S_B^{-1} \mathbf{m}_B, \\
Q_3 &= \ln \left( \frac{P(B)}{P(A)} \right), \\
Q_4 &= \ln \left( \frac{|S_A|}{|S_B|} \right).
\end{align*}
\]
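A direct evaluation of the quadratic discriminant using the \(Q\) definitions above (parameter values are illustrative; with equal covariances and priors the boundary passes through the midpoint of the means):

```python
import numpy as np

# Evaluate the quadratic discriminant from the Q definitions.
mA, SA, pA = np.array([0.0, 0.0]), np.eye(2), 0.5
mB, SB, pB = np.array([2.0, 0.0]), np.eye(2), 0.5

SA_inv, SB_inv = np.linalg.inv(SA), np.linalg.inv(SB)
Q0 = SA_inv - SB_inv
Q1 = 2 * (mB @ SB_inv - mA @ SA_inv)
Q2 = mA @ SA_inv @ mA - mB @ SB_inv @ mB
Q3 = np.log(pB / pA)
Q4 = np.log(np.linalg.det(SA) / np.linalg.det(SB))

def g(x):
    # x^T Q0 x + Q1 x + Q2 + 2 Q3 + Q4; the decision boundary is g(x) = 0
    return x @ Q0 @ x + Q1 @ x + Q2 + 2 * Q3 + Q4
```

With equal covariances, `Q0` vanishes and the boundary degenerates to the familiar linear discriminant.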
Probability of Error
Since the Bayes classifier minimizes the probability of error, one way to analyze how well it does is to compute the probability of error \(P(\epsilon)\) itself.
For any pattern \(\underline{x}\) such that \(P( A\mid \underline{x}) > P(B\mid\underline{x})\):
- \(\underline{x}\) is classified as part of class \(A\)
- The probability of error of classifying \(\underline{x}\) as \(A\) is \(P(B\mid\underline{x})\)
Therefore, naturally, for any given \(\underline{x}\) the probability of error \(P(\epsilon \mid \underline{x})\) is:
\[
P(\epsilon \mid \underline{x}) = \min(P(A\mid\underline{x}), P(B\mid\underline{x}))
\]
Rationale: Since we always choose the class with the maximum posterior probability, the minimum posterior probability is the probability of choosing incorrectly.
Now that we know the probability of error for a given \(\underline{x}\), denoted as \(P(\epsilon | \underline{x})\), the expected probability of error \(P(\epsilon)\) can be found as:
\[
P(\epsilon) = \int P(\epsilon | \underline{x})p(\underline{x})d\underline{x}
\]
\[
P(\epsilon) = \int \min[P(A | \underline{x}), P(B | \underline{x})] p(\underline{x}) d\underline{x}
\]
In terms of class PDFs:
\[
P(\epsilon) = \int \min[P(\underline{x} | A)P(A), P(\underline{x} | B)P(B)] d\underline{x}
\]
Now if we were to define decision regions \(R_A\) and \(R_B\):
- \(R_A = \{\underline{x} \text{ such that } P(A|\underline{x}) > P(B|\underline{x})\}\)
- \(R_B = \{\underline{x} \text{ such that } P(B|\underline{x}) > P(A|\underline{x})\}\)
The expected probability of error can be defined as:
\[
P(\epsilon) = \int_{R_A} P(\underline{x}|B)P(B)d\underline{x} + \int_{R_B} P(\underline{x}|A)P(A)d\underline{x}
\]
Rationale: For all patterns in \(R_A\), the probability of \(A\) will be the maximum between \(A\) and \(B\), so the probability of error of patterns in \(R_A\) is just the minimum probability (in this case, the probability of \(B\)), and vice versa.
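The error integral can be checked numerically; a sketch for two equal-variance, equal-prior 1-D Gaussians, where the exact value is \(\Phi(-|\mu_A - \mu_B|/(2\sigma))\) (all parameter values are illustrative):

```python
import numpy as np

# Numerically evaluate P(eps) = integral of min[p(x|A)P(A), p(x|B)P(B)] dx.
def gauss(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (np.sqrt(2 * np.pi) * s)

x = np.linspace(-10.0, 12.0, 200001)   # grid wide enough to capture both tails
dx = x[1] - x[0]
P_err = np.sum(np.minimum(0.5 * gauss(x, 0.0, 1.0), 0.5 * gauss(x, 2.0, 1.0))) * dx
# Here the boundary is the midpoint x = 1, and the exact value is Phi(-1)
```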
Error Bounds
In practice, the exact \(P(\epsilon)\) is easy to compute only for simple cases like those shown before. Instead of finding the exact \(P(\epsilon)\), we determine bounds on \(P(\epsilon)\), which:
- Are easier to compute
- Lead to estimates of classifier performance
Bhattacharyya Bound
Using the following inequality (valid for \(a, b \geq 0\)):
\[
\min[a, b] \leq \sqrt{ab}
\]
The following holds true:
\[
P(\epsilon) = \int \min[P(\underline{x}|A)P(A), P(\underline{x}|B)P(B)]d\underline{x}
\]
\[
P(\epsilon) \leq \sqrt{P(A)P(B)} \int \sqrt{P(\underline{x}|A)P(\underline{x}|B)}d\underline{x}
\]
You don’t need the actual decision regions to compute this!
Since \(P(A) + P(B) = 1\) implies \(\sqrt{P(A)P(B)} \leq \frac{1}{2}\), and the Bhattacharyya coefficient \(\rho\) is defined as:
\[
\rho = \int \sqrt{P(\underline{x}|A)P(\underline{x}|B)}d\underline{x}
\]
the upper bound (Bhattacharyya bound) on \(P(\epsilon)\) can be written as:
\[
P(\epsilon) \leq \frac{1}{2}\rho
\]
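A numerical sketch of the Bhattacharyya coefficient for two equal-variance 1-D Gaussians, compared against the known closed form \(\rho = \exp(-(\mu_A - \mu_B)^2 / (8\sigma^2))\) for this case (parameter values are illustrative):

```python
import numpy as np

# Numerically estimate the Bhattacharyya coefficient rho and the bound
# P(eps) <= 0.5 * rho for two 1-D Gaussians.
def gauss(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (np.sqrt(2 * np.pi) * s)

x = np.linspace(-10.0, 12.0, 200001)
dx = x[1] - x[0]
rho = np.sum(np.sqrt(gauss(x, 0.0, 1.0) * gauss(x, 2.0, 1.0))) * dx
bound = 0.5 * rho

# Closed form for equal variances: rho = exp(-(mu_A - mu_B)^2 / (8 sigma^2))
rho_exact = np.exp(-((0.0 - 2.0) ** 2) / 8.0)
```

For these parameters the bound is about \(0.30\), comfortably above the true error \(\Phi(-1) \approx 0.16\), and no decision regions were needed to compute it.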