BAIT 509 Class Meeting 08
Supervised Learning Beyond the Mean and Mode
January 23, 2019
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(Lahman))
suppressPackageStartupMessages(library(knitr))
opts_chunk$set(warning=FALSE, fig.width=6, fig.height=3, fig.align="center")
my_accent <- "#d95f02"
Housekeeping:
- Office hour room moved to ESB 3174 when possible.
- Can't accept late submissions for Assignments 2+. Solutions made available right away.
- Questions for Assignment 2?
- Hint for ISLR 8(c): `prune.tree()` from the `tree` package in R will prune a fitted tree.
Concepts:
- CART:
  - For numeric predictors, decisions can only involve one threshold.
  - Decisions are always binary.
- What "probabilistic forecasting" and "quantile regression" are, how to interpret the forecasts/predictions, and when it's appropriate to use them.
- Probabilistic forecasts: how to estimate (plot) predictive distribution densities using subset approaches.
- Quantile regression:
  - How to fit a linear quantile regression model in R (using both `ggplot2` and the `quantreg` package), and how to interpret it.
  - Understand the "crossing quantile" problem.
  - Know how to evaluate the goodness of quantile regression models.
Up until now, we've only seen different ways of using predictors to give us more information about the mean and mode of the response. The world places a huge emphasis on the mean and mode, but these are not always what's important. Two alternatives are:
- Probabilistic forecasting
- Quantile Regression (numeric response only)
The idea here is to put forth an entire probability distribution as a prediction.
Let's look at an example. Suppose there are two baseball teams, one that gets 1000 total hits in a year, and another that gets 1500. Using "total hits in a year" as a predictor, we set out to predict the total number of runs of both teams. Here's the top snippet of the data:
dat <- Teams %>% tbl_df %>%
select(runs=R, hits=H)
kable(head(dat))
| runs | hits |
|-----:|-----:|
|  401 |  426 |
|  302 |  323 |
|  249 |  328 |
|  137 |  178 |
|  302 |  403 |
|  376 |  410 |
Let's not concern ourselves with the methods yet. Using a standard regression technique, here are our predictions:
| Number of Hits (X) | Expected Number of Runs (E(Y)) |
|-------------------:|-------------------------------:|
| 1000 | 558 |
| 1500 | 768 |
Using a probabilistic forecast, here are our predictions:
Don't you think this is far more informative than the mean estimates in the above table?
The probabilistic forecast/prediction contains as much information about the response as possible (based on a set of predictors), because it communicates the entire belief about what values the response might take.
Predictions/forecasts here are called predictive distributions.
From [@gneiting_raftery]:

> Indeed, over the past two decades, probabilistic forecasting has become routine in such applications as weather and climate prediction (Palmer 2002; Gneiting and Raftery 2005), computational finance (Duffie and Pan 1997), and macroeconomic forecasting (Garratt, Lee, Pesaran, and Shin 2003; Granger 2006).
Let's review how to estimate a univariate probability density function or probability mass function.
Here's a random sample of 10 values of a continuous random variable, ordered from smallest to largest and stored in the variable `x`:
x
## [1] -19.8 -13.6 -12.0 -3.5 -2.8 4.4 4.5 15.1 16.3 20.2
Recall that we can use histograms to estimate the density of the data. The idea is:
- Cut the range of the data into "bins" of a certain width.
- For these data, the range is 40. Let's set up four bins of width 10: -19.8 to -9.8, -9.8 to 0.2, etc.
- Count the number of observations that fall into each bin.
- For our setup, the number of observations falling into the four bins, in order, are: 3,2,2,3.
- Make a bar plot (with no space between the bars), where the bar width corresponds to the bins, and the bar height corresponds to the number of observations in that bin.
- For our setup, we have:
ggplot(data.frame(x=x), aes(x)) +
geom_histogram(binwidth=10, center=min(x)+5,
fill=my_accent, colour="black") +
theme_bw()
(Note: this is not a true density, since the area under the curve is not 1, but the shape is what matters)
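If you'd like to verify those bin counts programmatically, base R's `cut()` and `table()` can reproduce them. A quick sketch (`include.lowest=TRUE` keeps the minimum observation in the first bin):

```r
# Tally observations into four bins of width 10, starting at min(x).
table(cut(x, breaks=min(x) + 10*(0:4), include.lowest=TRUE))
# Should give counts 3, 2, 2, 3, matching the histogram above.
```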
You'd have to play with the binwidth to get a histogram that looks about right (not too jagged, not too coarse). For the above example, there are too few data to make a good estimate. Let's now generate 1000 observations, and make a histogram using `qplot` from R's `ggplot2` package, with a variety of binwidths -- too small, too large, and just right.
x <- rnorm(1000, sd=10)
qplot(x, binwidth=1) # too small
qplot(x, binwidth=10) # too big
qplot(x, binwidth=3.5) # just right
Advanced method: There's a technique called the kernel density estimate that works as an alternative to the histogram. The idea is to put a "little mound" (a kernel) on top of each observation, and add them up. Instead of playing with the binwidth, you can play with the "bandwidth" of the kernels. Use `geom="density"` in `qplot`, and use `bw` to play with the bandwidth:
qplot(x, geom="density", bw=2.5)
When the response is discrete (this includes categorical), the approach is simpler:
- Calculate the proportion of observations that fall into each category.
- Make a bar chart, placing a bar over each category, and using the proportions as the bar heights.
Here are ten observations, stored in `x`:
x
## [1] 1 0 0 0 2 0 1 2 3 0
The proportions are as follows:
props <- tibble(Value=x) %>%
group_by(Value) %>%
summarize(Proportion=length(Value)/length(x))
kable(props)
| Value | Proportion |
|------:|-----------:|
| 0 | 0.5 |
| 1 | 0.2 |
| 2 | 0.2 |
| 3 | 0.1 |
You can plot these proportions with `qplot`, specifying `geom="col"`:
qplot(x=Value, y=Proportion, data=props, geom="col")
You can use `ggplot2` to calculate the proportions, but it's more complex. It's easier to plot the raw counts instead of proportions -- and that's fine, you'll still get the same shape. Using `qplot` again, let's make a plot for 1000 observations (note that I indicate that my data are discrete by using the `factor` function):
set.seed(2)
x <- rpois(1000, lambda=1)
qplot(factor(x))
Here's code to get proportions instead of counts (using `ggplot()` directly, since a computed aesthetic is needed):

ggplot(data.frame(x), aes(factor(x))) +
  geom_bar(aes(y=..prop.., group=1))
The local methods and classification/regression trees that we've seen so far can be used to produce probabilistic forecasts. For local methods, let's ignore the complications of kernel weighting and local polynomials. These methods result in a subset of the data, for which we're used to taking the mean or mode. Instead, use the subsetted data to plot a distribution.
- For kNN, form a histogram/density plot/bar plot using the $k$ nearest neighbours (a sketch follows this list).
- For the moving window (loess), form a histogram/density plot/bar plot using the observations that fall in the window.
- For tree-based methods, use the observations within a leaf to form a histogram/density plot/bar plot for that leaf.
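For example, here is a minimal sketch of a kNN probabilistic forecast for the baseball data at 1500 hits; the choices of `k = 50` and 10 histogram bins are mine, for illustration only:

```r
# kNN probabilistic forecast of "runs" when hits = 1500:
# keep the k teams with the closest hit totals, and plot their runs.
k  <- 50
x0 <- 1500
nearest <- dat %>%
  mutate(dist = abs(hits - x0)) %>%  # distance in the (single) predictor
  arrange(dist) %>%
  head(k)
ggplot(nearest, aes(runs)) +
  geom_histogram(bins=10, fill=my_accent, colour="black") +
  theme_bw() +
  labs(x="Number of Runs (Y)", y="Count")
```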
The above baseball example used a moving window with a radius of 20 hits. Visually, you can see the data that I subsetted within these two narrow windows, for hits of 1000 and 1500:
r <- 20  # window radius, in hits
ggplot(dat, aes(hits, runs)) +
  geom_point(colour=my_accent, alpha=0.1) +
  geom_vline(xintercept=c(1000+c(-r,r), 1500+c(-r,r)),
             linetype="dashed") +
  theme_bw() +
  labs(x="Number of Hits (X)",
       y="Number of Runs (Y)")
- Install the `Lahman` package, which contains the `Teams` dataset.
- Build a null model probabilistic forecast of "number of runs" (`R` column).
- Build a probabilistic forecast, using kNN, of "number of runs" for a team that has 1500 hits (`H` column) and 70 wins (`W` column). Don't forget to scale the predictors! (A sketch follows this list.)
- Do the same thing, but using linear regression. What additional assumption(s) is/are needed here?
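For the kNN part, the main wrinkle is that `H` and `W` live on different scales, so distances should be computed on standardized predictors. Here is one possible sketch; the choice of `k`, and standardizing with `scale()`/`mean()`/`sd()`, are my assumptions:

```r
# kNN probabilistic forecast of runs (R) at H = 1500, W = 70,
# with both predictors standardized before computing distances.
library(Lahman)
teams <- Teams %>% select(R, H, W)
Hs <- as.numeric(scale(teams$H))
Ws <- as.numeric(scale(teams$W))
# Standardize the query point using the same centre and spread:
h0 <- (1500 - mean(teams$H)) / sd(teams$H)
w0 <- (70   - mean(teams$W)) / sd(teams$W)
k  <- 50
nearest <- teams %>%
  mutate(dist = sqrt((Hs - h0)^2 + (Ws - w0)^2)) %>%
  arrange(dist) %>%
  head(k)
qplot(nearest$R, bins=10, xlab="Number of Runs (R)")
```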
Let's examine the bias-variance / overfitting-underfitting tradeoff with kNN-based probabilistic forecasts. I'll run a simulation like so (a code sketch follows the list):

- Generate data from a bivariate Normal distribution, so that $X \sim N(0, 100)$, and $Y = X + N(0, 100)$.
- Training data will contain 500 observations, for which a kNN probabilistic forecast will be built when $X=25$.
- Try both a small ($k=15$) and a large ($k=100$) value of $k$.
- For each value of $k$, we'll generate 20 training data sets.
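Here is a sketch of how this simulation could be coded. I'm interpreting $N(0, 100)$ as variance 100 (i.e., sd 10), consistent with the `rnorm(…, sd=10)` calls earlier; the looping structure and plotting choices are mine:

```r
# Simulate kNN probabilistic forecasts at x0 = 25, for k = 15 and k = 100,
# over 20 training sets of 500 observations each.
set.seed(1)
one_forecast <- function(k, rep_id, n=500, x0=25) {
  x <- rnorm(n, sd=10)                 # X ~ N(0, variance 100)
  y <- x + rnorm(n, sd=10)             # Y = X + N(0, variance 100)
  nearest <- order(abs(x - x0))[1:k]   # indices of the k nearest neighbours
  tibble(k=k, rep_id=rep_id, y=y[nearest])
}
res <- bind_rows(
  map_dfr(1:20, ~ one_forecast(k=15,  rep_id=.x)),
  map_dfr(1:20, ~ one_forecast(k=100, rep_id=.x))
)
ggplot(res, aes(y, group=rep_id)) +
  geom_density() +
  facet_wrap(~ k) +
  theme_bw()
```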
Here are the 20 estimates for each value of $k$:
Notice that:
- When $k$ is large, our estimates are biased, because the distributions are not centered correctly. But the estimates are more consistent.
- When $k$ is small, our estimates are less biased, because the distributions overall have a mean that is close to the true mean. But the variance is high -- we get all sorts of distribution shapes here.
A similar thing happens with a moving window, with the window width parameter. For tree-based methods, the amount that you partition the predictor space controls the bias-variance tradeoff.
To choose a balance between bias and variance, we need a measure of prediction goodness. When predicting the mean, the MSE works. When predicting the mode, the classification error works. But what works for probabilistic forecasts?
This is an active area of research. The idea is to use a proper scoring rule -- a way of assigning a score based only on the forecast distribution and the realized outcome, designed so that honest forecasts get the best expected score. We won't go into the specifics; see [@gneiting_raftery] for details.
At the very least, one should check that the forecast distributions are "calibrated" -- that is, the actual outcomes are spread evenly amongst the forecasts. You can check this by applying each forecast cdf to its corresponding outcome -- the resulting sample should look like a Uniform(0,1) sample. Note that this check is built into at least some proper scoring rules.
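To see what this check looks like in practice, here is a small simulated illustration (entirely my own sketch, not a recipe from class): when the forecast cdf is correct, the transformed outcomes look uniform; when the forecast is biased, they pile up in one tail.

```r
# Probability integral transform (PIT) check for calibration.
set.seed(3)
y_obs <- rnorm(500)                # actual outcomes
pit_good <- pnorm(y_obs)           # correct forecast cdf applied to outcomes
pit_bad  <- pnorm(y_obs, mean=-1)  # forecast centred too low
qplot(pit_good, binwidth=0.1)      # roughly flat: calibrated
qplot(pit_bad, binwidth=0.1)       # piles up near 1: biased forecast
```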
For this course, we won't be picky about how you choose your tuning parameters. Just look for a subset that you think has "enough" observations in it so that the distribution starts to take some shape, but not so much that it starts to shift.
For (1) and (2) below, you're choosing between two candidates to hire. Discuss the pros and cons of choosing one candidate over the other in the following situations.
- Both are predicted to have the same productivity score of 75, but have the following probabilistic forecasts.
It's hard to make a decision here. On the one hand, we can be fairly certain that the actual productivity of candidate A will be about 75, but there's more of a gamble with candidate B. There's a very real chance that B's productivity is actually quite a bit higher than A -- for example, a productivity of 80 is plausible for B, but not for A. On the other hand, there's also a very real chance that B's productivity is actually quite a bit lower than A, for the same reason. Your decision would depend on whether you would want to take a risk or not.
In reality, though, this is only one tool out of many other aspects of the candidate that you would consider. It might be a good idea to chat with B to get a better sense of what their productivity might actually be.
- Two "non-overlapping" forecasts:
In this case, B is very very likely to have higher productivity than A, because all "plausible" productivity values for B are higher than all "plausible" productivity values of A.
Again, this is just one tool you might use to make a decision.
- You've formed a probabilistic forecast for a particular value of the predictors, displayed below as a density. You then collect test data for that same value of the predictor, indicated as the points below the density. What is the problem with the probabilistic forecast?
The forecast is biased, because the actual values are occurring near the upper tail of the distribution -- they should be scattered about the middle, with a higher density of points occurring near 0. If using local methods, we'd have to reduce $k$ or the window width to decrease bias (to remove "further" data that are less relevant); if using a tree-based method, you could grow the tree deeper to lower the bias.
Probabilistic forecasts are useful if you're making a small number of decisions at a time. For example:
- Predicting which hockey team will win the Stanley Cup
- Looking at the 2-day-ahead prediction of river flow every day to decide whether to take flood mitigation measures.
But they are not appropriate when making decisions en masse. For example:
- A bus company wants to know how long it takes a bus to travel between stops, for all stops and all busses.
- You want to predict future behaviour of customers.
It's common to "default" to using the mean to make decisions. But, the mean is not always appropriate (I wrote a blog post about this):
- Sometimes there are outliers, in which case the median is a more robust measure of central tendency.
- For example, there was a very large flu outbreak in 2009.
- Sometimes a conservative/liberal estimate is wanted.
- For example, the bus company wants conservative estimates so that most busses fall within the estimated travel time.
In these cases, we care about quantiles, not the mean. Estimating them is called quantile regression (as opposed to mean regression).
Recall what quantiles are: the $\tau$-quantile is the value below which a proportion $\tau$ of the distribution lies.
For example, the bus company might want to predict the 0.8-quantile of transit time -- 80% of busses will get to their destination within that time.
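For a sample in hand, R's `quantile()` computes empirical quantiles directly. A tiny sketch with hypothetical transit times:

```r
# 0.8-quantile of (hypothetical) bus transit times, in minutes:
set.seed(5)
transit_time <- rexp(1000, rate=1/10)
quantile(transit_time, probs=0.8)  # 80% of trips finish within this time
```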
Be warned: you may have a hard time convincing people that quantiles are actually what they care about, because the world is trained to think about the mean. Quantiles aside from the median are also harder to interpret.
The idea here is to model a quantile of the response, rather than its mean, as a function of the predictors.
Here are the 0.25-, 0.5-, and 0.75-quantile regression lines for the baseball data:
ggplot(dat, aes(hits, runs)) +
geom_point(alpha=0.1, colour=my_accent) +
geom_quantile(colour="black") +
theme_bw() +
labs(x="Number of Hits (X)",
y="Number of Runs (Y)")
## Loading required package: SparseM
##
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
##
## backsolve
## Smoothing formula not specified. Using: y ~ x
I did this easily with `ggplot2`, just by adding a layer `geom_quantile` to my scatterplot, specifying the quantile levels with the `quantiles=` argument. We could also use the function `rq` in the `quantreg` package in R:
library(quantreg)  # provides rq()
(fit_rq <- rq(runs ~ hits, data=dat, tau=c(0.25, 0.5, 0.75)))
## Call:
## rq(formula = runs ~ hits, tau = c(0.25, 0.5, 0.75), data = dat)
##
## Coefficients:
## tau= 0.25 tau= 0.50 tau= 0.75
## (Intercept) -118.8297872 8.2101818 64.0347349
## hits 0.5531915 0.4923636 0.4908592
##
## Degrees of freedom: 2835 total; 2833 residual
If we were to again focus on the two teams (one with 1000 hits, and one with 1500 hits), we have (by evaluating the above three lines):
predict(fit_rq, newdata=data.frame(hits=c(1000, 1500)))
## tau= 0.25 tau= 0.50 tau= 0.75
## 1 434.3617 500.5738 554.8940
## 2 710.9574 746.7556 800.3236
So, we could say that the team with 1000 hits:
- is estimated to have a 50% chance to have between 434 and 555 runs;
- has a 25% chance of achieving over 555 runs;
- has a 25% chance of getting less than 434 runs;
- would typically get 501 runs (median);
amongst other things.
- Get a 95% prediction interval using linear quantile regression, with Y = `R` (number of runs) and X = `H` (number of hits), when X = 1500. (A sketch follows this list.)
- What about a 95% PI using kNN, going back to the earlier example we did?
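For the first exercise, one way to proceed is to fit the 0.025- and 0.975-quantile regressions, whose predictions bracket a central 95% interval. A sketch:

```r
# 95% prediction interval for runs at hits = 1500,
# from the 0.025- and 0.975-quantile regression lines.
fit_pi <- rq(runs ~ hits, data=dat, tau=c(0.025, 0.975))
predict(fit_pi, newdata=data.frame(hits=1500))
```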
Because each quantile is allowed to have its own line, some of these lines might cross, giving an invalid result. Here is an example with the `iris` data set, fitting the 0.2- and 0.3-quantiles:
ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
geom_point(alpha=0.25, colour=my_accent) +
geom_quantile(aes(colour="0.2"), quantiles=0.2) +
geom_quantile(aes(colour="0.3"), quantiles=0.3) +
scale_colour_discrete("Quantile\nLevel") +
theme_bw() +
labs(x="Sepal Length",
y="Sepal Width")
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
fit_iris <- rq(Sepal.Width ~ Sepal.Length, data=iris, tau=2:3/10)
b <- coef(fit_iris)
at8 <- round(predict(fit_iris, newdata=data.frame(Sepal.Length=8)), 2)
Quantile estimates of Sepal Width for plants with Sepal Length less than 7.3 are valid, but otherwise are not. For example, for plants with a Sepal Length of 8, this model predicts 30% of such plants to have a Sepal Width of less than 2.75, but only 20% of such plants to have a Sepal Width of less than 2.82. This is an illogical statement.
There have been several "adjustments" proposed to ensure that this doesn't happen (see below), but ultimately, this suggests an inadequacy in the model assumptions. Luckily, this usually only happens at extreme values of the predictor space, and/or for large quantile levels, so is usually not a problem.
- Bondell HD, Reich BJ, Wang H. Noncrossing quantile regression curve estimation. Biometrika. 2010;97(4):825-838.
- Dette H, Volgushev S. Non-crossing non-parametric estimates of quantile curves. J R Stat Soc Ser B Stat Methodol. 2008;70(3):609-627.
- Tokdar ST, Kadane JB. Simultaneous linear quantile regression: a semiparametric Bayesian approach. Bayesian Anal. 2011;6(4):1-22.
Estimates of quantiles usually become worse for large/small values of $\tau$.
Here is a histogram of 100 observations generated from a Student's t(1) distribution (it's heavy-tailed):
set.seed(4)
y <- rt(100, df=1)
qplot(y) + theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Here are estimates of high and low quantiles, compared to the actual. You can see the discrepancy grows quickly. Extreme-low quantiles are too high, whereas extreme-high quantiles are too low.
As a rule of thumb, it's best to stay away from extreme quantile levels -- those with $\tau$ very close to 0 or 1 -- unless you have a lot of data.
The question here is: if we have two or more models that predict a $\tau$-quantile, which is best? We need a measure of goodness so that we can:
- Choose which predictors to include in a model;
- Choose optimal hyperparameters;
- Estimate parameters in a quantile regression model.
**NOTE**: Mean Squared Error is not appropriate here!! This is very important to remember.
The reason is technical -- the MSE is not a proper scoring rule for quantiles. In other words, the MSE does not elicit an honest prediction.
If we're predicting the median, then the mean absolute error works. This is like the MSE, but instead of squaring the errors, we take the absolute value.
In general, a "correct" scoring rule for the
Here is a plot of various check functions. Notice that, when $\tau = 0.5$, the check function is proportional to the absolute value function -- this is why the mean absolute error works for the median.
base <- ggplot(data.frame(x=c(-2,2)), aes(x)) +
theme_bw() +
labs(y=expression(rho)) +
theme(axis.title.y=element_text(angle=0, vjust=0.5)) +
ylim(c(0, 1.5))
rho <- function(tau) function(x) (tau - (x<0))*x
cowplot::plot_grid(
base + stat_function(fun=rho(0.2)) +
ggtitle(expression(paste(tau, "=0.2"))),
base + stat_function(fun=rho(0.5)) +
ggtitle(expression(paste(tau, "=0.5"))),
base + stat_function(fun=rho(0.8)) +
ggtitle(expression(paste(tau, "=0.8"))),
ncol=3
)
For quantile regression estimation, we minimize the sum of scores instead of the sum of squared residuals, as in the usual (mean) linear regression.
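As an illustration of using the check function as a score, here is a sketch that evaluates the earlier `fit_rq` on the training data (in practice you'd score held-out data; `check_score` is a helper defined here, not a `quantreg` function):

```r
# Average check-function score: smaller is better for a tau-quantile model.
check_score <- function(y, yhat, tau) {
  e <- y - yhat                  # residuals
  mean((tau - (e < 0)) * e)      # average check-function loss
}
pred <- predict(fit_rq, newdata=dat)    # matrix: one column per tau
check_score(dat$runs, pred[, 1], 0.25)  # score for the 0.25-quantile line
check_score(dat$runs, pred[, 3], 0.75)  # score for the 0.75-quantile line
```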