Topic 6: Non Parametric Estimation & Learning¶
Non-parametric Learning¶
Here, we assume that we do not know anything about the class-conditional probability function. All we are given are \(N\) samples \(\underline{x}_i\), labeled so that we know they all belong to a single class.
In this scenario, we aim to estimate the distribution directly based on a set of labeled samples for the class. This process is repeated for multiple classes, each with its own labeled samples. Non-parametric learning focuses on learning a single class distribution from labeled samples.
Types of Non-parametric Estimation¶
Three main categories:
- Histogram Estimation: Group samples into discrete regions to approximate \(p(\underline{x})\).
- Parzen Window Estimation: Approximate \(p(\underline{x})\) continuously based on each sample's local contribution.
- k-Nearest Neighbor Estimation (k-NN): Approximate \(p(\underline{x})\) continuously based on the nearest neighbor samples.
Histogram Estimation¶
The simplest approach to non-parametric estimation of \(p(\underline{x})\) is constructing a normalized histogram.
For an interval \(R = [a, b]\), if \(p(x)\) is constant over this region, then:
\[ p_R = P(x \in R) = \int_a^b p(x)\,dx = p(x)\,|R| \]
Given \(N\) samples, the probability of \(M\) samples within \(R\) follows a binomial distribution:
\[ P(M) = \binom{N}{M} p_R^M (1 - p_R)^{N - M} \]
Taking the log of the binomial distribution for simplicity:
\[ \ln P(M) = \ln \binom{N}{M} + M \ln p_R + (N - M) \ln (1 - p_R) \]
Differentiating with respect to \(p_R\) and setting to zero:
\[ \frac{\partial \ln P(M)}{\partial p_R} = \frac{M}{p_R} - \frac{N - M}{1 - p_R} = 0 \]
Solving this gives the maximum likelihood estimate of \(p_R\) as:
\[ \hat{p}_R = \frac{M}{N} \]
Recall:
\[ p_R = p(x)\,|R| \quad \text{for } x \in R \]
Plugging in our maximum likelihood estimate for \(p_R\):
\[ \hat{p}(x)\,|R| = \frac{M}{N} \]
Thus:
\[ \hat{p}(x) = \frac{M}{N\,|R|} \]
where \(|R|\) is the size of region \(R\).
Histogram Estimation: Steps¶
Given a set of bins \(R_i\):
- Count the number of samples \(M_i\) that fall into each bin \(i\).
- Count the total number of samples \(N\).
- For a particular pattern \(x\), the probability density \(p(x)\) can be estimated as:
\[ \hat{p}(x) = \frac{M_i}{N\,|R_i|} \quad \text{for } x \in R_i \]
This means for a particular pattern \(x\), find which bin \(i\) it belongs to, and compute the ratio between the number of samples in that bin (\(M_i\)) and the total number of samples \(N\) times the bin size \(|R_i|\).
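As a concrete illustration, here is a minimal Python sketch of the steps above (the function name, bin count, and test data are illustrative, not part of the notes): it builds equal-width bins from the samples and returns \(M_i / (N\,|R_i|)\) for the bin containing \(x\).

```python
import numpy as np

def histogram_pdf_estimate(samples, x, num_bins=10):
    """Estimate p(x) as M_i / (N * |R_i|), where R_i is the bin containing x."""
    samples = np.asarray(samples)
    N = len(samples)
    edges = np.linspace(samples.min(), samples.max(), num_bins + 1)  # equal-width bins
    counts, _ = np.histogram(samples, bins=edges)                    # M_i for each bin
    bin_width = edges[1] - edges[0]                                  # |R_i|
    i = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, num_bins - 1)
    return counts[i] / (N * bin_width)

# Example: estimate the density of a standard normal at x = 0 (true value ~0.40).
rng = np.random.default_rng(0)
data = rng.standard_normal(1000)
print(histogram_pdf_estimate(data, x=0.0, num_bins=20))
```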
Histogram Estimation: Tradeoffs¶
There is a fundamental tradeoff between resolution along \(x\) and resolution along \(p(x)\):
- For good resolution along \(x\), we want to have small-sized regions, leading to small \(M_i\), resulting in poor PDF resolution.
- For good resolution along \(p(x)\), we want large \(M_i\) and large-sized regions, which leads to poor \(x\) resolution.
Other Disadvantages:
- Estimated PDF is always discontinuous.
- Shifting the origin changes the shape of the estimated PDF.
Parzen Window Estimation¶
Histogram estimation assumes that the PDF \(p(x)\) is constant over each region, which may not be realistic.
Solution: Parzen Window Estimation avoids predefined regions and bin counting.
Advantages:
- No need to predefine region sizes.
- Estimated PDF is always continuous.
- Method is origin-independent.
Fundamental Idea:
- Each sample \(x_i\) locally influences the estimated PDF in the vicinity of \(x_i\).
- Thus, since we observed \(x_i\), the PDF near \(x_i\) should be higher; areas with more samples should show a larger PDF.
- The estimated PDF is the sum of contributions from each sample:
\[ \hat{p}(x) = \frac{1}{N} \sum_{i=1}^{N} \phi(x - x_i) \]
where \(\phi\) is a window function that controls each sample's influence on the PDF.
The window function \(\phi\) must be normalized:
\[ \int_{-\infty}^{\infty} \phi(x)\,dx = 1 \]
To adjust the locality of a sample’s influence, we stretch or compress \(\phi\):
\[ \phi(x) \rightarrow \phi\!\left(\frac{x}{h}\right) \]
where \(h\) is a scaling factor. To keep the function normalized:
\[ \int_{-\infty}^{\infty} \frac{1}{h}\,\phi\!\left(\frac{x}{h}\right) dx = 1 \]
Given \(N\) samples \(\{x_1, \ldots, x_N\}\), the Parzen Window estimate of \(p(x)\) is:
\[ \hat{p}(x) = \frac{1}{N h} \sum_{i=1}^{N} \phi\!\left(\frac{x - x_i}{h}\right) \]
Common Window Functions:
- Rectangular
- Triangular
- Gaussian
- Exponential
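A minimal sketch of the Parzen Window estimate, assuming the Gaussian window from the list above and a hand-picked scale factor \(h\) (the function name and test data are illustrative):

```python
import numpy as np

def parzen_estimate(x, samples, h):
    """p_hat(x) = (1 / (N h)) * sum_i phi((x - x_i) / h),
    with a Gaussian window phi(u) = exp(-u^2 / 2) / sqrt(2 pi)."""
    samples = np.asarray(samples)
    u = (x - samples) / h
    phi = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return phi.sum() / (len(samples) * h)

rng = np.random.default_rng(0)
data = rng.standard_normal(500)
print(parzen_estimate(0.0, data, h=0.2))   # roughly 0.4 for a standard normal
```

Note that, unlike the histogram, the resulting estimate is a continuous function of \(x\) and does not depend on any choice of origin.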
Parzen Window Estimation: Scaling Factor¶
One drawback of Parzen Window estimation is that we still need to choose a window function and scale factor \(h\).
General rule: try \(h = \frac{K}{\sqrt{N}}\) for some constant \(K\).
For small \(N\), the constant \(K\) is important:
- If \(K\) is too small, the estimate is noisy with sharp peaks at the samples.
- If \(K\) is too large, the estimate is smeared with low resolution.
k-Nearest Neighbor (k-NN) Estimation¶
In histogram and Parzen Window methods, we fix the region size or window width, controlling resolution along the \(x\)-axis while data controls the resolution along the PDF axis.
In k-NN estimation, we fix the number of samples \(M\) and determine the region size needed at each point to enclose this many samples.
Thus, this controls the resolution along the PDF axis directly, and the \(x\)-axis resolution becomes data-dependent.
To compute the k-NN estimate of \(p(x)\):
- Create an interval \([x - \alpha, x + \alpha]\) centered around \(x\).
- Increase \(\alpha\) until it contains the desired number of observations \(M\).
- Compute the estimate of \(p(x)\) as:
\[ \hat{p}(x) = \frac{M}{N |R(x)|} = \frac{M}{N \cdot 2 \alpha} \]
where \(R(x)\) is the smallest region centered around \(x\) that encloses \(M\) sample points.
A common choice is to set \(M = \sqrt{N}\), which removes any free parameters.
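A minimal Python sketch of the one-dimensional k-NN density estimate under the \(M = \sqrt{N}\) setting (names and test data are illustrative):

```python
import numpy as np

def knn_pdf_estimate(x, samples, M=None):
    """p_hat(x) = M / (N * 2 * alpha), where alpha is the distance from x to its
    M-th nearest sample, so [x - alpha, x + alpha] is the smallest centred
    interval containing M samples."""
    samples = np.asarray(samples)
    N = len(samples)
    if M is None:
        M = max(1, int(np.sqrt(N)))      # the common M = sqrt(N) setting
    dists = np.sort(np.abs(samples - x))
    alpha = dists[M - 1]                 # radius enclosing exactly M samples
    return M / (N * 2.0 * alpha)

rng = np.random.default_rng(0)
data = rng.standard_normal(1000)
print(knn_pdf_estimate(0.0, data))       # roughly 0.4 for a standard normal
```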
Properties:
- If the sample density is high, \(|R(x)|\) will be small, providing high resolution where needed.
- If the sample density is low, \(|R(x)|\) will be large, resulting in lower resolution in sparsely populated regions.
Advantages:
- Avoids setting \(p(x)\) to zero in areas with no samples, resulting in a more realistic non-zero probability.
Disadvantages:
- The estimated PDF is highly “peaked” and non-normalized (it does not integrate to one).
- Optimal Classifier:
\(f^*(x) = \arg \max_y P(y|x) = \arg \max_y P(x|y) P(y)\)
- k-NN Classifier:
\(\hat{f}_{kNN}(x) = \arg \max_y \hat{P}_{kNN}(x|y) \hat{P}(y) = \arg \max_y k_y\)
where:
\[ \hat{P}_{kNN}(x|y) = \frac{k_y}{n_y} \]
where \(k_y\) is the number of training points of class \(y\) amongst the \(k\) nearest neighbours of \(x\), and \(n_y\) is the total number of training points of class \(y\), with
\[ \sum_y k_y = k \]
and
\[ \hat{P}(y) = \frac{n_y}{n} \]
Approximation vs. Stability (aka Bias vs Variance) Tradeoff¶
- Larger \(K\) ⇒ predicted label is more stable
- Smaller \(K\) ⇒ predicted label can approximate the best classifier well given enough data
K-Nearest Neighbours¶
- An old/simple classifier: k-nearest neighbours (KNN).
- To classify an example \(\tilde{x}_i\):
- Find the 'k' training examples \(x_i\) that are "nearest" to \(\tilde{x}_i\).
- Classify using the most common label of "nearest" training examples.
- Assumption:
- Examples with similar features are likely to have similar labels.
- Seems strong, but all good classifiers basically rely on this assumption.
- If not true there may be nothing to learn and you are in "no free lunch" territory.
- Methods just differ in how you define "similarity".
- Most common distance function is the Euclidean distance:
\[ \| x_i - \tilde{x}_{\tilde{i}} \| = \sqrt{\sum_{j=1}^d (x_{i,j} - \tilde{x}_{\tilde{i},j})^2} \]
- \(x_i\) is the features of training example ‘\(i\)’, and \(\tilde{x}_{\tilde{i}}\) is the features of test example ‘\(\tilde{i}\)’.
- Costs \(O(d)\) to calculate for a pair of examples.
- Assumption: Similar inputs have similar outputs
- Classification rule: For a test input \(\mathbf{x}\), assign the most common label amongst its \(k\) most similar training inputs
Formal, Incomprehensible Definition¶
- Test point: \(\mathbf{x}\)
- Denote the set of the \(k\) nearest neighbors of \(\mathbf{x}\) as \(S_{\mathbf{x}}\). Formally, \(S_{\mathbf{x}}\) is defined as \(S_{\mathbf{x}} \subseteq D \text{ s.t. } |S_{\mathbf{x}}| = k \text{ and } \forall (\mathbf{x}', y') \in D \setminus S_{\mathbf{x}},\)
\[ \text{dist}(\mathbf{x}, \mathbf{x}') \geq \max_{(\mathbf{x}'', y'') \in S_{\mathbf{x}}} \text{dist}(\mathbf{x}, \mathbf{x}'') \]
(i.e., every point in \(D\) but not in \(S_{\mathbf{x}}\) is at least as far away from \(\mathbf{x}\) as the furthest point in \(S_{\mathbf{x}}\)). We can then define the classifier \(h()\) as a function returning the most common label in \(S_{\mathbf{x}}\):
\[ h(\mathbf{x}) = \text{mode}(\{y'' : (\mathbf{x}'', y'') \in S_{\mathbf{x}}\}) \]
where mode(\(\cdot\)) means to select the label of the highest occurrence. (Hint: In case of a draw, a good solution is to return the result of \(k\)-NN with smaller \(k\).)
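A minimal sketch of this classifier with Euclidean distance (the tie-breaking here simply takes the first most common label rather than the smaller-\(k\) trick from the hint; names and data are illustrative):

```python
import numpy as np
from collections import Counter

def knn_classify(x_test, X_train, y_train, k=3):
    """Return the most common label among the k training points nearest to x_test."""
    dists = np.linalg.norm(X_train - x_test, axis=1)   # O(n d) distance computations
    nearest = np.argsort(dists)[:k]                    # indices of S_x
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Tiny example with two well-separated 2-D classes.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_classify(np.array([0.3, 0.3]), X, y, k=3))   # -> 0
print(knn_classify(np.array([5.0, 5.1]), X, y, k=3))   # -> 1
```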
Distance Measure to Use¶
The k-nearest neighbor classifier fundamentally relies on a distance metric. The better that metric reflects label similarity, the better the classifier will be. The most common choice is the Minkowski distance:
\[ \text{dist}(\mathbf{x}, \mathbf{z}) = \left( \sum_{r=1}^{d} |x_r - z_r|^p \right)^{1/p} \]
Effect of k¶
- With large \(k\) (hyper-parameter), KNN model will be very simple.
- With \(k=n\), you just predict the mode of the labels.
- Model gets more complicated as \(k\) decreases.
- With \(k=1\), it is very sensitive to data (can fit data better but can overfit better too).
- Effect of \(k\) on fundamental trade-off:
- As \(k\) grows, training error tends to increase.
- As \(k\) grows, generalization gap tends to decrease.
Summary¶
- \(k\)-NN is a simple and effective classifier if distances reliably reflect a semantically meaningful notion of dissimilarity. (It becomes truly competitive through metric learning.)
- As \(n \rightarrow \infty\), \(k\)-NN becomes provably very accurate, but also very slow.
- As \(d \gg 0\), points drawn from a probability distribution stop being similar to each other, and the \(k\)-NN assumption breaks down.
Decision Theory¶
We can give a cost to each scenario.
Instead of the most probable label, take the \(\hat{y}_i\) minimizing expected cost. Even if “spam” has a higher probability, predicting “spam” might have a higher expected cost.
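A small worked sketch of this idea, with hypothetical costs and posteriors chosen purely for illustration (wrongly filtering a legitimate e-mail is assumed ten times as costly as letting spam through):

```python
import numpy as np

labels = ["spam", "not spam"]
# Hypothetical cost matrix C[predicted, true].
cost = np.array([[0.0, 10.0],    # predict "spam":     cost 0 if spam, 10 if not spam
                 [1.0,  0.0]])   # predict "not spam": cost 1 if spam,  0 if not spam

p = np.array([0.6, 0.4])         # hypothetical posterior P(y | x_i)

expected_cost = cost @ p         # expected cost of each possible prediction
for label, c in zip(labels, expected_cost):
    print(label, float(c))                       # spam 4.0, not spam 0.6
print(labels[int(np.argmin(expected_cost))])     # "not spam", despite spam being more probable
```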
Unbalanced Class Labels¶
- A related idea is that of "unbalanced" class labels.
- If 99% of the e-mails are spam, you can get 99% accuracy by always predicting spam.
- There are a variety of other performance measures available:
- Weighted classification error.
- Jaccard similarity.
- Precision and recall.
- False positive and false negative rate.
- ROC curves

Other Performance Measures¶
- Classification error might be the wrong measure:
- Use weighted classification error if there are different costs.
- Might want to use things like Jaccard measure: TP / (TP + FP + FN).
- Often, we report precision and recall (want both to be high):
- Precision: "if I classify as spam, what is the probability it actually is spam?"
- Precision = TP / (TP + FP).
- High precision means the filtered messages are likely to really be spam.
- Recall: "if a message is spam, what is the probability it is classified as spam?"
- Recall = TP / (TP + FN).
- High recall means that most spam messages are filtered.
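A minimal sketch computing precision and recall from predicted and true labels (the data here is illustrative):

```python
import numpy as np

def precision_recall(y_true, y_pred, positive="spam"):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN) for one positive class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    return tp / (tp + fp), tp / (tp + fn)

y_true = ["spam", "spam", "spam", "not", "not", "not"]
y_pred = ["spam", "spam", "not",  "spam", "not", "not"]
print(precision_recall(y_true, y_pred))   # (0.666..., 0.666...): 2/3 precision, 2/3 recall
```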
Precision-Recall Curve¶
- Consider the rule \(p(y_i = \text{``spam''} \mid x_i) > t\), for threshold \(t\).
- Precision-recall (PR) curve plots precision vs. recall as \(t\) varies.

ROC Curve¶
- Receiver operating characteristic (ROC) curve:
- Plot true positive rate (recall) vs. false positive rate, FP / (FP + TN), i.e. the fraction of negative examples classified as positive.
- Diagonal is random; perfect classifier would be in the upper left.
- Sometimes papers report area under curve (AUC).
- Reflects performance for different possible thresholds on the probability.
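A minimal sketch of the threshold sweep behind an ROC curve, with hypothetical scores standing in for \(p(y_i = \text{“spam”} \mid x_i)\):

```python
import numpy as np

def roc_points(scores, y_true, thresholds):
    """(FPR, TPR) pairs for the rule 'predict positive if score > t' at each threshold t."""
    scores, y_true = np.asarray(scores), np.asarray(y_true)
    points = []
    for t in thresholds:
        pred_pos = scores > t
        tpr = np.mean(pred_pos[y_true == 1])   # recall: TP / (TP + FN)
        fpr = np.mean(pred_pos[y_true == 0])   # FP / (FP + TN)
        points.append((fpr, tpr))
    return points

scores = np.array([0.9, 0.8, 0.4, 0.7, 0.2, 0.1])   # hypothetical p(spam | x_i)
y_true = np.array([1,   1,   1,   0,   0,   0])      # 1 = spam, 0 = not spam
print(roc_points(scores, y_true, thresholds=[0.05, 0.5, 0.95]))
```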

Unbalanced Classes¶
- With unbalanced classes, there are many alternatives to accuracy as a measure of performance:
- Two common ones are the Jaccard coefficient and the F-score.
- Some machine learning models don’t work well with unbalanced data. Some common heuristics to improve performance are:
- Under-sample the majority class (only take 5% of the spam messages).
- Re-weight the examples in the accuracy measure (multiply training error of getting non-spam messages wrong by 10).
Encouraging invariance with data augmentation¶
- May want classifier to be invariant to certain feature transforms.
- Images: translations, small rotations, changes in size, mild warping,...
- Recognize same signal in different-looking images.
- The hard/slow way is to modify your distance function:
- Find neighbours that require the “smallest” transformation of the image.
- The easy/fast way is to use data augmentation.
- Just add transformed versions of your training examples to the training set.
- Make translated/rotated/resized/warped versions of training images, and add them to the train set.
- Crucial part of many successful vision systems.
- Also really important for sound (translate, change volume, and so on)
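A minimal sketch of data augmentation by translation, assuming the training images are stored as arrays (here `np.roll` wraps pixels around the edge, which is a simplification of a true translation):

```python
import numpy as np

def augment_with_shifts(X_images, y, shifts=(-1, 1)):
    """Add horizontally shifted copies of each training image, keeping the same label."""
    augmented_X, augmented_y = [X_images], [y]
    for s in shifts:
        augmented_X.append(np.roll(X_images, shift=s, axis=2))   # shift along image width
        augmented_y.append(y)
    return np.concatenate(augmented_X), np.concatenate(augmented_y)

# Tiny example: 10 random 8x8 "images" become 30 examples (original + two shifts each).
rng = np.random.default_rng(0)
X = rng.random((10, 8, 8))
y = rng.integers(0, 2, size=10)
X_aug, y_aug = augment_with_shifts(X, y)
print(X_aug.shape, y_aug.shape)   # (30, 8, 8) (30,)
```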
Ensemble methods¶
- Ensemble methods are classifiers that have classifiers as input.
- They have interesting names:
- Averaging.
- Blending.
- Boosting.
- Bootstrapping.
- Bagging.
- Cascading.
- Random Forests.
- Stacking.
- Voting.
- Ensemble methods often have higher accuracy than input classifiers.
Example - Voting¶
- Ensemble methods use predictions of a set of models.
- For example, we could have:
- Decision trees make one prediction.
- Naïve Bayes makes another prediction.
- KNN makes another prediction.
- One of the simplest ensemble methods is voting:
- Take the mode of the predictions across the classifiers.
- Another variation on voting is stacking
- Fit another classifier that uses the predictions as features.
- Can tune the second classifier with validation data.
- Sometimes called "blending".
- Stacking often performs better than individual models.
- Typically used by Kaggle winners.
- E.g., Netflix $1M user-rating competition winner was a stacked classifier.
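A minimal sketch of the voting idea above (the classifier predictions are hypothetical):

```python
import numpy as np
from collections import Counter

def vote(predictions):
    """Ensemble prediction for each example is the mode of the individual classifiers'
    predictions (rows = classifiers, columns = examples)."""
    predictions = np.asarray(predictions)
    return np.array([Counter(col).most_common(1)[0][0] for col in predictions.T])

# Hypothetical predictions from three classifiers (e.g. decision tree, naive Bayes, KNN)
# on five test examples.
preds = np.array([[1, 0, 1, 1, 0],
                  [1, 1, 1, 0, 0],
                  [0, 0, 1, 1, 1]])
print(vote(preds))   # [1 0 1 1 0]
```

Stacking would instead treat `preds.T` as a feature matrix and fit a second classifier on it (tuned with validation data), rather than taking the mode.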
Why can Voting work?¶
- Consider 3 binary classifiers, each independently correct with probability 0.80:
- With voting, ensemble prediction is correct if we have “at least 2 right”:
- P(all 3 right) = \(0.8^3 = 0.512\).
- P(2 rights, 1 wrong) = \(3 \cdot 0.8^2 \cdot (1 - 0.8) = 0.384\).
- P(1 right, 2 wrongs) = \(3 \cdot (1 - 0.8)^2 \cdot 0.8 = 0.096\).
- P(all 3 wrong) = \((1 - 0.8)^3 = 0.008\).
- So ensemble is right with probability 0.896 (which is \(0.512 + 0.384\)).
- Notes:
- For voting to work, errors of classifiers need to be at least somewhat independent.
- You also want the probability of being right to be > 0.5, otherwise it can do much worse.
- But accuracy does not have to be the same across classifiers (“weak” classifiers can help “strong” ones).
- Why can voting lead to better results?
- Consider classifiers that overfit (like deep decision trees):
- If they all overfit in exactly the same way, voting does nothing.
- But if they make independent errors:
- Probability that “vote” is wrong can be lower than for each classifier.
- Less attention to specific overfitting of each classifier.

Random Forests¶
- Random forests take vote from a set of deep decision trees.
- Tend to be one of the best “out of the box” classifiers.
- Often close to the best performance of any method on the first run.
- And predictions are very fast.
- Do deep decision trees make independent errors?
- No: with the same training data you’ll get the same decision tree.
- Two key ingredients in random forests:
- Bootstrapping: a way to generate different “versions” of your dataset.
- Random feature selection: a way to grow decision trees incorporating randomness.
- Random forests train on different “bootstrap samples” of your dataset:
- And models vote to make the final decision.
- The hope is that the “bootstrap samples” make errors more independent.
Bootstrapping Sampling¶
- Start with a standard deck of 52 cards:
- Sample a random card: (put it back and re-shuffle)
- Sample a random card: (put it back and re-shuffle)
- Sample a random card: (put it back and re-shuffle)
- \(\ldots\)
- Sample a random card: (which may be a repeat)
- Makes a new deck of the 52 samples:
- New 52-card deck is called a “bootstrap sample”:
- Some cards will be missing, and some cards will be duplicated.
- So calculations on the bootstrap sample will give different results than original data.
- However, the bootstrap sample roughly maintains trends:
- Roughly 25% of the cards will be diamonds.
- Roughly 3/13 of the cards will be “face” cards.
- There will be roughly four “10” cards.
- Bootstrap sampling is a general technique that is used in many settings:
- Sample \(n\) examples with replacement from your set of size \(n\).
- Repeat this several times, and compute some statistic on each bootstrap sample.
- Gives you an idea of how the statistic varies as you vary the data.
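A minimal sketch of bootstrap sampling as a general technique, using the card-deck example above (card values 1–13, four of each):

```python
import numpy as np

def bootstrap_statistics(data, statistic, num_samples=1000, rng=None):
    """Compute `statistic` on many bootstrap samples (n draws with replacement from a
    dataset of size n), to see how the statistic varies as the data varies."""
    rng = np.random.default_rng(rng)
    data = np.asarray(data)
    n = len(data)
    return np.array([statistic(data[rng.integers(0, n, size=n)])
                     for _ in range(num_samples)])

deck = np.repeat(np.arange(1, 14), 4)            # a 52-card deck of values 1..13
means = bootstrap_statistics(deck, np.mean)
print(float(means.mean()), float(means.std()))   # close to 7.0, with some spread
```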
Size¶
- Each bootstrap sample should have the same size as the original dataset (n)
- E.g. 52 in card deck example
- Reason:
- You want to simulate the statistics (including variance) of the original data.
- If your bootstrapped samples had a huge \(N\), they would all be very similar, artificially reducing variance between datasets.
- Opposite for small \(N\).
Random Forest Continued¶
Ingredient 1 - Bootstrapping¶
- “Bagging”: ensemble of the same type of classifier on different bootstraps
- Bagging: “bootstrap aggregation”
- Random forests are an example
Ingredient 2 - Random Trees¶
- For each split in a random tree model:
- Randomly sample a small number of possible features (typically \(\sqrt{d}\)).
- Only consider these random features when searching for the optimal rule.
- So splits will tend to use different features in different trees.
Putting it together¶
Training: for each tree, draw a bootstrap sample of the training data and fit a deep decision tree to it, sampling a random subset of the features (typically \(\sqrt{d}\)) at each split.
Prediction: each tree predicts a label for the test example, and the final prediction is the vote (mode) across the trees.
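A minimal sketch of this recipe, assuming scikit-learn's `DecisionTreeClassifier` as the deep tree (the function names and parameters are illustrative):

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def train_random_forest(X, y, num_trees=50, rng=None):
    """Fit one deep decision tree per bootstrap sample, with random feature
    selection at each split (max_features='sqrt' considers ~sqrt(d) features)."""
    rng = np.random.default_rng(rng)
    n = len(X)
    forest = []
    for _ in range(num_trees):
        idx = rng.integers(0, n, size=n)                     # bootstrap sample
        tree = DecisionTreeClassifier(max_features="sqrt")   # grown deep by default
        forest.append(tree.fit(X[idx], y[idx]))
    return forest

def predict_random_forest(forest, X):
    """Each tree predicts, and the final prediction is the vote (mode) across trees."""
    all_preds = np.array([tree.predict(X) for tree in forest])
    return np.array([Counter(col).most_common(1)[0][0] for col in all_preds.T])

# Usage sketch:
# forest = train_random_forest(X_train, y_train)
# y_hat = predict_random_forest(forest, X_test)
```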
Summary¶
- Curse of dimensionality:
- Number of points to "fill" a space grows exponentially with dimension.
- Data augmentation:
- Add transformed data to be invariant to transformations that preserve label.
- Ensemble methods take multiple classifiers as inputs.
- Voting ensemble method:
- Improves predictions of multiple classifiers if errors are independent.
- Bootstrap sampling:
- Generating a new dataset by sampling \(n\) examples with replacement.