---
title: "BAIT 509 Class Meeting 02"
subtitle: "Reducible Error"
date: "January 4, 2019"
---
- Classification example
- Generalization error
- Optimizing the reducible error through the fundamental tradeoff
Let's use the `Default` data in the `ISLR` R package for classification: predict whether or not a new individual will default on their credit card debt.
Our task: make the best prediction model we can.
- Code the null model (see the sketch after this list).
- What would you predict for a new person?
- What's the error using the original data (aka training data)?
- Plot all the data, and come up with a classifier by eye.
- What would you predict for a non-student with income $40,000 and balance $1,011?
- Classify the original (training) data. What's the overall error?
- How might we optimize the classifier? Let's code it.
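Here's a minimal sketch of the null model and its training error, assuming the `ISLR` package is installed. The null model ignores every predictor and always predicts the most common class; whatever optimized classifier you build in the last item would replace this rule.

```r
library(ISLR)  # provides the Default data

# Null model: ignore all predictors and always predict the majority class.
majority <- names(which.max(table(Default$default)))
majority  # the most common value of the response

# Training error: the proportion of training observations misclassified
# when everyone is predicted to be in the majority class.
mean(Default$default != majority)
```

Note that for any new person, including the non-student with income $40,000 and balance $1,011, the null model makes the same majority-class prediction.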
Discussion: does it make sense to classify a new iris plant species based on the `iris` dataset?
Discussion:
- What's wrong with calculating error on the training data?
- What's an alternative way to compute error?
Activity: Get the generalization error for the above classifier.
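A minimal sketch of one way to do this, estimating the generalization error with a random train/test split. The seed and the 75/25 split are arbitrary choices, and the null model stands in for whatever classifier you built above:

```r
library(ISLR)

set.seed(1)  # arbitrary seed, for reproducibility
n <- nrow(Default)
train_id <- sample(n, size = round(0.75 * n))  # hold out 25% of the data
train <- Default[train_id, ]
test  <- Default[-train_id, ]

# Fit the classifier on the training set only (here, the null model).
majority <- names(which.max(table(train$default)))

# Estimate the generalization error on the held-out test set.
mean(test$default != majority)
```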
A big part of machine learning is choosing how complex to make a model. Examples:
- More partitions in the above credit card default classifier
- Adding higher order polynomials in linear regression
And lots more that we'll see when we explore more machine learning models.
The difficulty: adding more complexity will always decrease the training error, even when it makes predictions on new data worse. What's a way of dealing with this?
Let's return to the `iris` example from lecture 1, predicting `Sepal.Width`, this time using `Petal.Width` only, with polynomial linear regression.
Task: Choose an optimal polynomial order.
Here's code for a graph to visualize the model fit:
```r
library(ggplot2)

p <- 4  # polynomial order to visualize
ggplot(iris, aes(Petal.Width, Sepal.Width)) +
    geom_point(alpha = 0.2) +
    geom_smooth(method = "lm", formula = y ~ poly(x, p)) +
    theme_bw()
```
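One hedged sketch of the task itself: compare candidate orders by their error on held-out data. The seed, the split, and the range of orders below are my choices for illustration, not prescribed ones.

```r
set.seed(2)  # arbitrary seed
train_id <- sample(nrow(iris), size = 100)
train <- iris[train_id, ]
test  <- iris[-train_id, ]

# Fit one polynomial regression per candidate order, and compute the
# mean squared error of each on the held-out data.
orders <- 1:6
test_mse <- sapply(orders, function(p) {
    fit <- lm(Sepal.Width ~ poly(Petal.Width, p), data = train)
    mean((test$Sepal.Width - predict(fit, newdata = test))^2)
})
orders[which.min(test_mse)]  # the order with the smallest test MSE
```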
Last time, we saw what irreducible error is, and how to "beat" it.
The other type of error is reducible error, which arises from not knowing the true distribution of the data (or some aspect of it, such as the mean or mode). We therefore have to estimate that distribution, and the error in this estimation is the reducible error. For example, suppose:
- one numeric predictor
- one numeric response
- the true (unknown) distribution of the data is $Y|X=x \sim N(5/x, 1)$ (and take $X \sim 1 + \text{Exp}(1)$)
- you only see the following 100 observations, stored in `dat` and plotted below, and choose to use linear regression as a model:
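To make this concrete, here's a sketch of how such a data set could be generated and fit; the seed is an arbitrary choice, and `dat` matches the name above.

```r
set.seed(3)  # arbitrary seed
n <- 100
x <- 1 + rexp(n)             # X ~ 1 + Exp(1)
y <- rnorm(n, mean = 5 / x)  # Y | X = x ~ N(5/x, 1)
dat <- data.frame(x, y)

fit <- lm(y ~ x, data = dat)  # our chosen model: simple linear regression
```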
The difference between the true curve and the estimated curve is due to reducible error.
In the classification setting, a misidentification of the mode is due to reducible error.
(Why the toy data set instead of real ones? Because I can embed characteristics into the data for pedagogical reasons. You'll see real data at least in the assignments and final project.)
There are two key aspects to reducible error: bias and variance. They only make sense in light of a hypothetical situation: generating a new data set over and over again, and building a model/forecaster on each one.
- Bias occurs when your estimates are systematically different from the truth. For regression, this means that the estimated mean is either usually bigger or usually smaller than the true mean. For a classifier, it's a systematic tendency to choose an incorrect mode.
- Variance refers to the variability of your estimates from one generated data set to the next.
There is usually (always?) a tradeoff between bias and variance. It's referred to as the bias/variance tradeoff, and we'll see examples of this later.
Let's look at the above linear regression example again. I'll generate 100 data sets, and fit a linear regression for each:
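A sketch of that simulation, overlaying the 100 fitted lines on the true mean curve $5/x$ (the seed and plotting range are my choices):

```r
library(ggplot2)

# Simulate 100 data sets from the true distribution, fitting a linear
# regression to each and keeping the fitted coefficients.
set.seed(4)  # arbitrary seed
coefs <- t(replicate(100, {
    x <- 1 + rexp(100)
    y <- rnorm(100, mean = 5 / x)
    coef(lm(y ~ x))
}))
coefs <- as.data.frame(coefs)
names(coefs) <- c("intercept", "slope")

# Grey lines: the 100 estimates. Red curve: the true mean, 5/x.
ggplot() +
    geom_abline(aes(intercept = intercept, slope = slope),
                data = coefs, alpha = 0.25) +
    stat_function(fun = function(x) 5 / x, colour = "red") +
    xlim(1, 6) +
    theme_bw()
```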
The spread of the linear regression estimates is the variance; the difference between the center of the regression lines and the true mean curve is the bias.
As the name suggests, we can reduce reducible error. Exactly how depends on the machine learning method, but in general:
- We can reduce variance by increasing the sample size, or by adding more model assumptions.
- We can reduce bias by being less strict with model assumptions, or by specifying them to be closer to the truth (which we never know).
Consider the above regression example again. Notice how my estimates tighten up when they're based on a larger sample size (1000 here, instead of 100):
Notice also that, even after fitting the linear regression to these larger samples, the bias remains: a straight line still cannot capture the true mean curve $5/x$.
We saw that we measure error using mean squared error (MSE) in the case of regression, and the error rate in the case of a classifier. These both contain all errors: irreducible error, bias, and variance:
$$\text{MSE} = \text{bias}^2 + \text{variance} + \text{irreducible variance}$$
A similar decomposition for error rate exists.
Note: If you look online, the MSE is often defined as the expected squared difference between a parameter and its estimate, in which case the "irreducible error" is not present. We're taking MSE to be the expected squared distance between a true "new" observation and our prediction (mean estimate).
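For completeness, here's the standard derivation behind that decomposition at a fixed $x$, writing $f(x) = E(Y \mid X = x)$ for the true mean and $\hat{f}(x)$ for our estimate, and assuming the new observation $Y$ is independent of the training data:

$$
\begin{aligned}
E\big[(Y - \hat{f}(x))^2\big]
&= E\big[(Y - f(x))^2\big] + E\big[(f(x) - \hat{f}(x))^2\big] \\
&= \underbrace{\mathrm{Var}(Y \mid X = x)}_{\text{irreducible variance}}
 + \underbrace{\big(f(x) - E[\hat{f}(x)]\big)^2}_{\text{bias}^2}
 + \underbrace{\mathrm{Var}\big(\hat{f}(x)\big)}_{\text{variance}}
\end{aligned}
$$

The cross terms vanish because $E\big(Y - f(x)\big) = 0$ and $Y$ is independent of $\hat{f}(x)$.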