Topic 4: Estimation and Learning¶
Estimation is an inverse problem with the goal of recovering the underlying parameter \(\theta\) of a distribution \(f_X(x;\theta)\) based on the observed samples \(X_1, \ldots, X_N\).
From the previous chapter, we know that the Bayesian classifier achieves the minimum probability of error. Therefore, it outperforms the MICD classifier when the class conditional PDFs \(p(\underline{x}|c_i)\) are known.
However, in general, the PDFs are not known a priori, so how do we perform Bayesian classification?
- What if we have samples with known class labels?
- With these samples, we can try to learn the PDFs of the individual classes.
- These empirical PDFs allow us to apply Bayesian classification!
Bayesian classification is optimal in terms of probability of error ONLY if the true class conditional PDFs \(p(\underline{x}|c_i)\) are known. The use of empirical PDFs results in sub-optimal classifiers.
There are two main categories of statistical learning approaches:
- Parametric Estimation: functional form of PDF is assumed to be known and the necessary parameters of the PDF are estimated.
- Non-parametric Estimation: functional form of PDF is not assumed to be known and the PDF is estimated directly.
Parametric Learning¶
Here, we assume that we know the functional form of the class conditional probability function, but not the parameters that define it. In this scenario, we want to estimate these parameters based on a set of labeled samples for the class!
There are two main categories of parametric estimation approaches:
- Maximum Likelihood Estimation: Treat parameters as being fixed but unknown quantities, with the goal of finding estimate values which maximize the probability that the given samples came from the resulting PDF \(p(\underline{x}|\theta)\).
- Bayesian (Maximum a Posteriori) Estimation: Treat parameters as random variables with an assumed a priori distribution, with the goal of obtaining an a posteriori distribution \(p(\theta \mid \{\underline{x}_i\})\) which indicates the estimated value based on the given samples.
Maximum Likelihood Estimation¶
Unbiased and consistent estimator¶
The goal is to obtain estimates for the parameters \(\theta\) that define this PDF.
Writing the PDF as \(p(\underline{x}|\theta)\) to emphasize the dependence on parameters, the Maximum Likelihood estimate of the parameters \(\theta\) is the set of parameters that maximizes the probability that the given samples \(\{x_1,x_2,...,x_N\}\) are obtained given \(\theta\):
\[\hat{\theta}_{ML} = \arg\max_{\theta} p(\{\underline{x}_i\} \mid \theta)\]
Assuming that the samples are independent of each other, \(p(\{\underline{x}_i\} | \theta)\) becomes:
\[p(\{\underline{x}_i\} \mid \theta) = \prod_{i=1}^{N} p(\underline{x}_i \mid \theta)\]
Therefore, the sample set probability is just the product of the individual sample probabilities.
To maximize \(p(\{\underline{x}_i\} | \theta)\), we take the derivative and set it to zero:
\[\frac{\partial}{\partial \theta} p(\{\underline{x}_i\} \mid \theta) = 0\]
It is often more convenient to deal with \(p(\{\underline{x}_i\} | \theta)\) in log form:
\[\ln p(\{\underline{x}_i\} \mid \theta) = \sum_{i=1}^{N} \ln p(\underline{x}_i \mid \theta)\]
This gives us the final maximum likelihood condition:
\[\frac{\partial}{\partial \theta} \sum_{i=1}^{N} \ln p(\underline{x}_i \mid \theta) \Bigg|_{\theta = \hat{\theta}_{ML}} = 0\]
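This condition can also be applied numerically when no closed form is available. The sketch below is illustrative (the exponential distribution and the use of SciPy are assumptions, not from the notes): it maximizes the log-likelihood directly and compares against the closed-form ML estimate \(\hat{\lambda}_{ML} = 1/\bar{x}\).

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
samples = rng.exponential(scale=2.0, size=1000)  # true rate lambda = 0.5

# Negative log-likelihood of an exponential PDF p(x|lambda) = lambda * exp(-lambda * x):
# -ln p({x_i}|lambda) = -N ln(lambda) + lambda * sum(x_i)
def neg_log_likelihood(lam):
    return -len(samples) * np.log(lam) + lam * samples.sum()

# Numerically maximize the log-likelihood (minimize its negative)
result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 10.0), method="bounded")

# Setting the derivative of the log-likelihood to zero gives the
# closed form: lambda_hat = N / sum(x_i) = 1 / sample mean
closed_form = 1.0 / samples.mean()
print(result.x, closed_form)  # the two estimates agree
```

The numerical optimum matches the analytic solution of the ML condition, which is the point of writing the likelihood in log form: sums are easier to differentiate and optimize than products.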
Example:
Suppose we know that the PDF is a multivariate Normal distribution, \(p(\underline{x}) = \mathcal{N}(\underline{\mu}, \Sigma)\), but we know neither the mean vector \(\underline{\mu}\) nor the covariance matrix \(\Sigma\). Then the ML estimates are:
\[\hat{\underline{\mu}}_{ML} = \frac{1}{N} \sum_{i=1}^N \underline{x}_i, \qquad \hat{\Sigma}_{ML} = \frac{1}{N} \sum_{i=1}^N (\underline{x}_i - \hat{\underline{\mu}}_{ML})(\underline{x}_i - \hat{\underline{\mu}}_{ML})^T\]
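These two estimators can be checked numerically; a minimal NumPy sketch (the specific mean, covariance, and sample size below are illustrative choices, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)
true_mu = np.array([1.0, -2.0])
true_sigma = np.array([[2.0, 0.5],
                       [0.5, 1.0]])
X = rng.multivariate_normal(true_mu, true_sigma, size=5000)  # rows are samples x_i
N = X.shape[0]

# MLE of the mean: the sample average
mu_ml = X.mean(axis=0)

# MLE of the covariance: average outer product of the centered samples
# (note the 1/N factor, not 1/(N-1))
centered = X - mu_ml
sigma_ml = centered.T @ centered / N

print(mu_ml)     # close to [1, -2]
print(sigma_ml)  # close to [[2, 0.5], [0.5, 1]]
```

With a large sample, both estimates land close to the true parameters, consistent with the claim that MLE recovers the true parameters when \(n\) is large and the model is correct.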
Advantages:
- MLE finds the parameter values that best explain the observed data.
- If \(n\) is large and your model/distribution is correct, then MLE recovers the true parameters.
Disadvantages:
- MLE can overfit the data if \(n\) is small.
- If you do not have the correct model (and \(n\) is small) then MLE can be terribly wrong.
Estimation Bias¶
Formal definition: an estimate \(\hat{\underline{\theta}}\) is unbiased if its expected value is equal to the true value:
\[E[\hat{\underline{\theta}}] = \underline{\theta}\]
The ML estimate of the mean is unbiased, since
\[E[\hat{\underline{\mu}}_{ML}] = \frac{1}{N} \sum_{i=1}^N E[\underline{x}_i] = \underline{\mu}\]
However, the ML estimate for the covariance matrix is biased:
\[E[\hat{\Sigma}_{ML}] = \frac{N-1}{N} \Sigma \neq \Sigma\]
As \(N \to \infty\), the bias becomes negligible. To get an unbiased estimate, simply multiply the ML estimate by \(\frac{N}{N-1}\).
The bias stems from using the ML estimate of the mean, \(\hat{\underline{\mu}}_{ML}\), rather than the true mean in the expression for \(\hat{\Sigma}_{ML}\).
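A small simulation makes the bias visible. The sketch below (parameter values are illustrative) averages the 1D ML variance estimate over many repeated experiments and shows that it converges to \(\frac{N-1}{N}\sigma^2\) rather than \(\sigma^2\):

```python
import numpy as np

rng = np.random.default_rng(2)
true_var = 4.0
N = 5  # a small N makes the bias clearly visible

# Average the ML variance estimate over many repeated experiments
trials = 20000
estimates = np.empty(trials)
for t in range(trials):
    x = rng.normal(0.0, np.sqrt(true_var), size=N)
    mu_ml = x.mean()
    estimates[t] = ((x - mu_ml) ** 2).mean()  # 1/N ML estimate

# E[var_ml] = (N-1)/N * true_var, so for N = 5 we expect about 3.2, not 4
print(estimates.mean())                # roughly 3.2
print(estimates.mean() * N / (N - 1))  # bias-corrected: roughly 4.0
```

This is also why `numpy.var` and `numpy.cov` expose a `ddof` argument: `ddof=0` gives the (biased) ML estimate and `ddof=1` gives the bias-corrected one.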
Bayesian Estimation¶
Instead of treating the parameters as fixed and finding the parameters that maximize the probability that the given samples come from the resulting PDF, we do the following:
- Treat the parameters as random variables with an assumed a priori distribution
- Use the observed samples to obtain an a posterior distribution which indicates the parameters!
Let \(p(\theta)\) be the a priori distribution and \(\{\underline{x}_i\}\) be the set of samples.
The a posteriori distribution can be written as:
\[p(\theta \mid \{\underline{x}_i\}) = \frac{p(\{\underline{x}_i\} \mid \theta)\, p(\theta)}{p(\{\underline{x}_i\})}\]
The term \(p(\{\underline{x}_i\})\) is treated as a scale factor, which may be obtained from the normalization requirement for PDFs:
\[p(\{\underline{x}_i\}) = \int p(\{\underline{x}_i\} \mid \theta)\, p(\theta)\, d\theta\]
Philosophical Foundations
- Frequentist Statistics:
- Probability Interpretation: Probability is interpreted as the long-run frequency of events occurring in repeated experiments.
- Parameters: Parameters are considered fixed but unknown quantities. The goal is to estimate these parameters based on sample data.
- Bayesian Statistics:
- Probability Interpretation: Probability represents a degree of belief or certainty about an event. It is subjective and can be updated as new information becomes available.
- Parameters: Parameters are treated as random variables with their own probability distributions (the prior distribution), which can be updated with data (the likelihood) to form a posterior distribution.
Maximum A Posteriori (MAP)
Maximum A Posteriori (MAP) estimation is a statistical technique used in Bayesian inference. It seeks to find the mode of the posterior distribution of a parameter given observed data. In simpler terms, MAP provides a way to estimate the most likely value of a parameter after taking into account prior beliefs (the prior distribution) and the likelihood of the observed data.
The MAP estimate is mathematically expressed as:
\[\hat{\theta}_{MAP} = \arg\max_{\theta} P(\theta \mid D) = \arg\max_{\theta} \frac{P(D \mid \theta)\, P(\theta)}{P(D)}\]
Where:
- \(\theta\) represents the parameter of interest.
- \(D\) is the observed data.
- \(P(\theta | D)\) is the posterior distribution.
- \(P(D | \theta)\) is the likelihood of the data given the parameter.
- \(P(\theta)\) is the prior distribution of the parameter.
Since \(P(D)\) is constant with respect to \(\theta\), we can simplify this to:
\[\hat{\theta}_{MAP} = \arg\max_{\theta} P(D \mid \theta)\, P(\theta)\]
In practical terms, MAP estimation combines both prior information and the data, making it particularly useful when the sample size is small or when prior knowledge is strong.
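As a concrete sketch of that small-sample behaviour (the coin-flip example and Beta prior below are illustrative assumptions, not from the notes): with a Beta\((a, b)\) prior on a coin's heads probability \(\theta\) and \(h\) heads in \(n\) flips, the posterior is Beta\((a+h,\, b+n-h)\) and its mode is the MAP estimate.

```python
# Coin-flip example: estimate the heads probability theta.
# With a Beta(a, b) prior and h heads out of n flips, the posterior is
# Beta(a + h, b + n - h), whose mode (the MAP estimate) is:
def map_estimate(h, n, a, b):
    return (h + a - 1) / (n + a + b - 2)

def mle_estimate(h, n):
    return h / n

# 3 heads in 3 flips: MLE says the coin ALWAYS lands heads
h, n = 3, 3
print(mle_estimate(h, n))            # 1.0: overfits the tiny sample
# A Beta(5, 5) prior encodes the belief that the coin is roughly fair
print(map_estimate(h, n, a=5, b=5))  # 7/11, about 0.64: pulled toward 0.5
```

With only three observations, the prior keeps the MAP estimate sensible, whereas MLE commits fully to the observed frequency.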
MLE and MAP¶
There are two typical ways of estimating parameters.
- Maximum-likelihood estimation (MLE): \(\theta\) is deterministic
- Maximum-a-posteriori estimation (MAP): \(\theta\) is random and has a prior distribution
From MLE to MAP¶
In MLE, the parameter \(\theta\) is deterministic, and we maximize the likelihood \(p(D \mid \theta)\) alone.
- Maximum-likelihood: You know nothing about \(\theta_1\) and \(\theta_2\), so you must take measurements to estimate them.
- Maximum-a-Posteriori: You know something about \(\theta_1\) and \(\theta_2\) beforehand, encoded in a prior \(p(\theta)\).
Maximum-a-Posteriori estimation¶
The Bayesian way is to model \(\theta\) as a random variable drawn from a distribution \(p(\theta)\). Note that \(\theta\) is not a random variable associated with an event in a sample space. In frequentist statistics, this is forbidden. In Bayesian statistics, this is allowed.
Now, we can look at \(p(\theta \mid D)\): \(p(\theta \mid D) = \frac{p(D\mid\theta)p(\theta)}{p(D)}\) (Bayes rule), where
- \(p(D | \theta)\) is the likelihood of the data given the parameter(s) \(\theta\)
- \(p(\theta)\) is the prior distribution over the parameter(s) \(\theta\)
- \(p(\theta | D)\) is the posterior distribution over the parameter(s) \(\theta\)
Formally (MAP principle): Find \(\hat{\theta}\) that maximizes the posterior distribution over parameters \(p(\theta \mid D)\):
\[\hat{\theta}_{MAP} = \arg\max_{\theta} p(\theta \mid D)\]
Advantages:
- As \(n \rightarrow \infty\), \(\hat{\theta}_{MAP} \rightarrow \hat{\theta}_{MLE}\)
- MAP is a great estimator if prior belief exists and is accurate
Disadvantages:
- If \(n\) is small, it can be very wrong if prior belief is wrong
- Also, we have to choose a reasonable prior (\(p(\theta) > 0 \quad \forall \theta\))
By Bayes' Theorem again:
\[p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} \propto p(D \mid \theta)\, p(\theta)\]
To maximize the posterior distribution, we can equivalently maximize its logarithm. Let's look at the MAP estimate:
\[\hat{\theta}_{MAP} = \arg\max_{\theta} \left[ \ln p(D \mid \theta) + \ln p(\theta) \right]\]
Special case: When the prior is a delta function,
\[p_{\Theta}(\theta) = \delta(\theta - \theta_0)\]
Then the delta function gives:
\[p(\theta \mid D) \propto p(D \mid \theta)\, \delta(\theta - \theta_0)\]
This will give:
\[\hat{\theta}_{MAP} = \theta_0\]
No uncertainty: we are absolutely sure that \(\theta = \theta_0\).
Illustration: 1D Example¶
Suppose that the likelihood and the prior are both Gaussian:
\[p(x \mid \theta) = \mathcal{N}(\theta, \sigma^2), \qquad p_{\Theta}(\theta) = \mathcal{N}(\theta_0, \sigma_0^2)\]
When \(N = 1\), the MAP problem is simply
\[\hat{\theta} = \arg\max_{\theta} \left[ -\frac{(x - \theta)^2}{2\sigma^2} - \frac{(\theta - \theta_0)^2}{2\sigma_0^2} \right]\]
Taking derivatives and setting to zero:
\[\frac{x - \theta}{\sigma^2} - \frac{\theta - \theta_0}{\sigma_0^2} = 0\]
Therefore, the solution is
\[\hat{\theta} = \frac{\sigma_0^2\, x + \sigma^2\, \theta_0}{\sigma_0^2 + \sigma^2}\]
Let us interpret the result:
Does it make sense?
- If \(\sigma_0 = 0\), then \(\theta = \frac{0 \cdot x + \sigma^2 \theta_0}{0 + \sigma^2} = \theta_0\).
- This means: No uncertainty. Absolutely sure that \(\theta = \theta_0\).
- \(p_{\Theta}(\theta) = \delta(\theta - \theta_0)\)
The other extreme:
- If \(\sigma_0 = \infty\), then \(\theta = \frac{\sigma_0^2 \cdot x + \cancel{\sigma^2 \theta_0}}{\sigma_0^2 + \cancel{\sigma^2}} = x\).
- This means: I don’t trust my prior at all. Use data.
- \(p_{\Theta}(\theta) = \frac{1}{|\Omega|}\), for all \(\theta\) in \(\Omega\).
Therefore, the MAP solution gives you a trade-off between data and prior.
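This trade-off is easy to demonstrate numerically; a minimal sketch assuming the 1D Gaussian setup above (observation \(x \sim \mathcal{N}(\theta, \sigma^2)\), prior \(\theta \sim \mathcal{N}(\theta_0, \sigma_0^2)\); the numeric values are illustrative):

```python
def map_1d(x, theta0, sigma, sigma0):
    """MAP estimate for a single Gaussian observation x ~ N(theta, sigma^2)
    with a Gaussian prior theta ~ N(theta0, sigma0^2)."""
    return (sigma0**2 * x + sigma**2 * theta0) / (sigma0**2 + sigma**2)

x, theta0, sigma = 10.0, 0.0, 1.0

# Tiny prior variance: the estimate collapses to the prior mean theta0
print(map_1d(x, theta0, sigma, sigma0=1e-6))  # ~0.0
# Huge prior variance: the estimate follows the data
print(map_1d(x, theta0, sigma, sigma0=1e6))   # ~10.0
# Equal variances: a 50/50 compromise between data and prior
print(map_1d(x, theta0, sigma, sigma0=1.0))   # 5.0
```

The two extremes reproduce the \(\sigma_0 = 0\) and \(\sigma_0 = \infty\) cases discussed above, and intermediate prior variances interpolate between them.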
Arbitrary N Case¶
For general \(N\), the same derivation with \(N\) independent samples gives
\[\hat{\theta} = \frac{\sigma_0^2 \sum_{i=1}^{N} x_i + \sigma^2\, \theta_0}{N \sigma_0^2 + \sigma^2}\]
What does it mean? The estimate is again a weighted combination of the data and the prior mean, and the weight on the data grows with \(N\).
As N goes to infinity¶
Let's do some algebra:
- Fix \(\theta_0\) and \(\sigma_0\)
- As \(N \to \infty\),
\[\hat{\theta} \to \frac{1}{N} \sum_{i=1}^{N} x_i\]
This is the maximum-likelihood estimate.
When I have a lot of samples, the prior does not really matter.
As N goes to 0¶
- Fix \(\sigma_0\) and \(\sigma\)
- As \(N \rightarrow 0\),
\[\hat{\theta} \to \theta_0\]
This is just the prior.
- When I have very few samples, I should rely on the prior.
- If the prior is good, then I can do well even if I have very few samples.
- Maximum-likelihood does not have the same luxury!
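Both limits can be checked with a short simulation. The sketch below assumes the Gaussian model with \(N\) samples, where the MAP estimate weights the sample sum against the prior mean; the parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
true_theta, sigma = 5.0, 2.0  # data model: x ~ N(theta, sigma^2)
theta0, sigma0 = 0.0, 1.0     # prior: theta ~ N(theta0, sigma0^2)

def map_estimate(x, theta0, sigma, sigma0):
    """MAP estimate for N Gaussian samples with a Gaussian prior on the mean:
    (sigma0^2 * sum(x_i) + sigma^2 * theta0) / (N * sigma0^2 + sigma^2)."""
    N = len(x)
    return (sigma0**2 * x.sum() + sigma**2 * theta0) / (N * sigma0**2 + sigma**2)

for N in [0, 1, 10, 10000]:
    x = rng.normal(true_theta, sigma, size=N)
    print(N, map_estimate(x, theta0, sigma, sigma0))
# N = 0 gives exactly theta0 = 0 (pure prior);
# large N gives a value close to the sample mean (the ML estimate)
```

With no data the estimate is the prior mean, and as \(N\) grows the prior's influence vanishes and the MAP estimate approaches the sample mean, exactly as the two limits above state.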