Chapter 19 Supervised Learning Overview
At this point, you should know…
Bayes Classifier
- Classify to the class with the highest probability given a particular input \(x\).
\[ C^B({\bf x}) = \underset{k}{\mathrm{argmax}} \ P[Y = k \mid {\bf X} = {\bf x}] \]
- Since we rarely, if ever, know the true probabilities, use a classification method to estimate them using data.
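In practice the same argmax rule is applied with estimated probabilities. A minimal sketch, assuming simulated data and a logistic regression fit (the data-generating process and cutoff below are illustrative choices, not part of the chapter):

```r
# simulate a simple two-class problem
set.seed(42)
n = 200
x = rnorm(n)
y = factor(rbinom(n, size = 1, prob = 1 / (1 + exp(-(1 - 2 * x)))))
sim_data = data.frame(x = x, y = y)

# estimate P[Y = 1 | X = x] with logistic regression
fit = glm(y ~ x, data = sim_data, family = binomial)

# classify to the class with the highest estimated probability
p_hat = predict(fit, newdata = data.frame(x = 0.25), type = "response")
ifelse(p_hat > 0.5, "1", "0")
```

With two classes, choosing the class with the higher estimated probability is the same as comparing \(\hat{p}({\bf x})\) to 0.5.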
The Bias-Variance Tradeoff
- As model complexity increases, bias decreases.
- As model complexity increases, variance increases.
- As a result, a model of intermediate complexity attains the best test accuracy. (Or the lowest test RMSE for regression.)
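One way to see the tradeoff is to vary a complexity knob and watch test error. A minimal simulation sketch using polynomial regression, where the degree plays the role of model complexity (the data-generating function and candidate degrees are arbitrary assumptions):

```r
set.seed(42)
# simulate train and test data from a nonlinear truth
f = function(x) sin(2 * x)
x_trn = runif(100, 0, 3)
y_trn = f(x_trn) + rnorm(100, sd = 0.3)
x_tst = runif(100, 0, 3)
y_tst = f(x_tst) + rnorm(100, sd = 0.3)

rmse = function(actual, predicted) sqrt(mean((actual - predicted) ^ 2))

# low degree underfits (high bias); high degree overfits (high variance)
degrees = c(1, 3, 5, 9, 15)
test_rmse = sapply(degrees, function(d) {
  fit = lm(y_trn ~ poly(x_trn, degree = d))
  rmse(y_tst, predict(fit, newdata = data.frame(x_trn = x_tst)))
})
data.frame(degree = degrees, test_rmse = test_rmse)
```

The test RMSE typically dips at an intermediate degree, then rises again as the fit starts chasing noise.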
The Test-Train Split
- Never use test data to train a model. Test accuracy is a measure of how well a method works in general.
- We can identify underfitting and overfitting models relative to the best test accuracy.
- A less complex model than the model with the best test accuracy is underfitting.
- A more complex model than the model with the best test accuracy is overfitting.
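A minimal sketch of such a split (the 80/20 proportion, the `iris` data, and LDA are illustrative choices):

```r
set.seed(42)
n = nrow(iris)
trn_idx = sample(n, size = 0.8 * n)
trn = iris[trn_idx, ]   # used for fitting
tst = iris[-trn_idx, ]  # never touched during fitting

# fit only on the training data
fit = MASS::lda(Species ~ ., data = trn)

# test accuracy: how well the fitted model works on unseen data
mean(predict(fit, tst)$class == tst$Species)
```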
Classification Methods
Logistic Regression
Linear Discriminant Analysis (LDA)
Quadratic Discriminant Analysis (QDA)
Naive Bayes (NB)
\(k\)-Nearest Neighbors (KNN)
For each, we can:
- Obtain predicted probabilities.
- Make classifications.
- Find decision boundaries. (Shown explicitly only for some methods.)
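A sketch of the probability-then-classify pattern for a few of these methods, on simulated data (the `e1071` package is assumed for naive Bayes; the cutoff and variable names are illustrative):

```r
library(MASS)   # lda()
library(e1071)  # naiveBayes()

set.seed(42)
n = 300
sim = data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y = factor(ifelse(sim$x1 - sim$x2 + rnorm(n) > 0, "yes", "no"))

# logistic regression: predicted probabilities, then classifications
fit_glm  = glm(y ~ x1 + x2, data = sim, family = binomial)
prob_glm = predict(fit_glm, type = "response")   # estimated P[Y = "yes" | X = x]
pred_glm = ifelse(prob_glm > 0.5, "yes", "no")

# LDA: predict() returns both posterior probabilities and classes
fit_lda = lda(y ~ x1 + x2, data = sim)
head(predict(fit_lda)$posterior)  # predicted probabilities
head(predict(fit_lda)$class)      # classifications

# naive Bayes follows the same pattern
fit_nb = naiveBayes(y ~ x1 + x2, data = sim)
head(predict(fit_nb, sim, type = "raw"))    # predicted probabilities
head(predict(fit_nb, sim, type = "class"))  # classifications
```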
Discriminative versus Generative Methods
- Discriminative methods learn the conditional distribution \(p(y \mid x)\), and thus can only simulate \(y\) for a given, fixed \(x\).
- Generative methods learn the joint distribution \(p(x, y)\), and thus can simulate entirely new data \((x, y)\).
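To make the simulation point concrete, here is a rough sketch of the generative side, using class priors and class-conditional normals estimated from data (a hand-rolled LDA-style model; the one-feature setup is purely illustrative):

```r
set.seed(42)
# training data: one numeric feature, two classes
x = c(rnorm(100, mean = 0), rnorm(100, mean = 2))
y = factor(rep(c("A", "B"), each = 100))

# generative estimates: class priors and class-conditional normals
prior = prop.table(table(y))
mu    = tapply(x, y, mean)
sigma = sqrt(mean(tapply(x, y, var)))  # (roughly) pooled sd, as in LDA

# simulate new (x, y) pairs: draw a class, then draw x given that class
y_new = sample(names(prior), size = 5, replace = TRUE, prob = prior)
x_new = rnorm(5, mean = mu[y_new], sd = sigma)
data.frame(x = x_new, y = y_new)
```

A discriminative fit such as logistic regression has no estimate of \(p(x)\), so it offers no comparable way to generate the \(x\) values.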
Parametric and Non-Parametric Methods
- Parametric methods model \(P[Y = k \mid X = x]\) as a specific function of parameters, which are learned from data.
- Non-parametric methods instead use an algorithmic approach to estimate \(P[Y = k \mid X = x]\) for each possible input \(x\).
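A brief sketch of the contrast, on simulated data, with logistic regression as the parametric example and KNN as the non-parametric one (variable names are illustrative):

```r
library(class)  # knn()

set.seed(42)
sim = data.frame(x = rnorm(100))
sim$y = factor(ifelse(sim$x + rnorm(100, sd = 0.5) > 0, "pos", "neg"))

# parametric: assume P[Y = "pos" | X = x] = 1 / (1 + exp(-(b0 + b1 * x))), then learn b0 and b1
fit_glm = glm(y ~ x, data = sim, family = binomial)
coef(fit_glm)  # the learned parameters

# non-parametric: no assumed functional form; classify a new x from its 10 nearest training points
knn(train = sim["x"], test = data.frame(x = 0.5), cl = sim$y, k = 10)
```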
Tuning Parameters
- Specify how a model is trained. This is in contrast to model parameters, which are learned during training.
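For example, \(k\) in KNN is a tuning parameter: it is chosen before fitting, often by comparing performance on held-out validation data, rather than being estimated from the training data like regression coefficients. A hedged sketch (the split size and candidate values are arbitrary):

```r
library(class)  # knn()

set.seed(42)
idx = sample(nrow(iris), size = 100)
est = iris[idx, ]    # used to fit
val = iris[-idx, ]   # used only to compare tuning parameter values

k_vals = c(1, 3, 5, 11, 21)
val_acc = sapply(k_vals, function(k) {
  pred = knn(train = est[, 1:4], test = val[, 1:4], cl = est$Species, k = k)
  mean(pred == val$Species)
})
k_vals[which.max(val_acc)]  # chosen value of the tuning parameter
```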
Cross-Validation
- A method to estimate test metrics with training data. Repeats the train-validate split inside the training data.
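A sketch of 5-fold cross-validation done by hand, estimating test error using only the training data (the fold count, data, and method are illustrative choices):

```r
library(MASS)  # lda()

set.seed(42)
trn   = iris[sample(nrow(iris)), ]  # shuffle the "training" data
folds = cut(seq_len(nrow(trn)), breaks = 5, labels = FALSE)

cv_err = sapply(1:5, function(f) {
  val = trn[folds == f, ]   # held-out fold acts as a temporary validation set
  est = trn[folds != f, ]   # remaining folds are used for fitting
  fit = lda(Species ~ ., data = est)
  mean(predict(fit, val)$class != val$Species)
})
mean(cv_err)  # cross-validated estimate of test error
```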
Curse of Dimensionality
- As the feature space grows, that is, as \(p\) grows, “neighborhoods” must become much larger to contain “neighbors,” so local methods are no longer very local.
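A quick calculation along the lines of the usual hypercube argument (an illustrative way to quantify “neighborhood size,” not a result from the chapter): with features uniform on \([0, 1]^p\), a cubical neighborhood that captures 10% of the data needs edge length \(0.1^{1/p}\) on each axis.

```r
p = c(1, 2, 5, 10, 50, 100)
# per-axis edge length needed for a cube covering 10% of the unit hypercube's volume
data.frame(p = p, edge_length = round(0.1 ^ (1 / p), 3))
```

Already at \(p = 10\), the “neighborhood” spans roughly 80% of each feature’s range.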
No-Free-Lunch Theorem
- There is no one classifier that will be best across all datasets.
19.1 External Links
- Wikipedia: No-Free-Lunch
- Do we Need Hundreds of Classifiers to Solve Real World Classification Problems? - A paper arguing that while No-Free-Lunch may hold in theory, in practice only a few classifiers outperform most others.
19.2 RMarkdown
The RMarkdown file for this chapter can be found here. The file was created using R version 4.0.2.