predictive performance via bootstrap variants
May 3, 2018

When we build a predictive model, we are interested in how the model will perform on data it hasn’t seen before. If we have lots of data, we can split it into training and test sets to assess model performance. If we don’t have lots of data, it’s better to fit a model using all of the available data and to assess its predictive performance using resampling techniques. The bootstrap is one such resampling technique. This post discusses several variants of the bootstrap that are appropriate for estimating predictive performance.

## A brief introduction to the bootstrap

The bootstrap isn’t a particular procedure so much as a general strategy. We assume that our data comes from some underlying population $F$ that we care about. We’re interested in the sampling distribution of some statistic $T$ that we calculate from the data. In our case $T$ is predictive performance.

We don’t know $F$, but we can treat the data as an approximation $\hat F$ of $F$. Here $\hat F$ is the empirical distribution of the data, which gives equal probablity to each observed data point. So we know how to sample from $\hat F$. The bootstrap says to sample from $\hat F$ many times and calculate $T$ using these samples. The sampling distribution of $T$ under $\hat F$ then approximates the sampling distribution of $T$ under $F$.

The canonical diagram comes from Efron and Tibshirani (1994). In the diagram the bootstrap world approximates the real world: In terms of assessing model performance, this means that we sample our original data $X \in \mathbb{R}^{n, p}$ with replacement to generate new datasets $X^*_1, ..., X^*_B \in \mathbb{R}^{n, p}$, and then we estimate model performance on each of the bootstrap samples $X^*_b$.

## Data & Model

In this post, we’ll fit a linear regression1 and compare:

• bootstrap in-sample error
• bootstrap out-of-sample error
• the optimism bootstrap
• the 0.632 bootstrap
• the 0.632+ bootstrap

For data, we’ll use average college tuition costs in each state. The original data source is here, but I downloaded it from the Tidy Tuesday repository.

In particular, we’ll see how predictive tuition costs in 2014-15 are for 2015-16 using a simple linear regression of the form2:

$y_i = f(x_i) = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \varepsilon_i \sim \mathrm{N}(0, \sigma^2)$

Here $y_i$ is the average tuition in state $i$ for the 2015-16 academic year3, and $x_i$ is the average tuition in state $i$ for the 2014-15 academic year. We have $n = 50$ data points. Once we fit a model, we’ll refer to the predicted tuition for state $i$ as $f(x_i)$. The data looks like:

library(readxl)
library(here)

library(tidyverse)
library(rsample)
library(Metrics)

set.seed(27)

data <- read_xlsx(here("content", "data", "us_avg_tuition.xlsx")) %>%
transmute(state = State,
avg_tuition_15_16 = 2015-16,
avg_tuition_14_15 = 2014-15) %>%
mutate_if(is.numeric, round)

data
## # A tibble: 50 x 3
##    state       avg_tuition_15_16 avg_tuition_14_15
##    <chr>                   <dbl>             <dbl>
##  1 Alabama                  9751              9496
##  3 Arizona                 10646             10414
##  4 Arkansas                 7867              7606
##  5 California               9270              9187
##  7 Connecticut             11397             10664
##  8 Delaware                11676             11515
##  9 Florida                  6360              6345
## 10 Georgia                  8447              8063
## # ... with 40 more rows

We quickly plot the data with a linear regression overlaid:

ggplot(data, aes(avg_tuition_14_15, avg_tuition_15_16)) +
geom_smooth(method = "lm", se = FALSE) +
geom_point(size = 2) +
labs(title = "Average College Tuition",
subtitle = "Each point represents one state",
y = "2015-2016 Average Tuition (dollars)",
x = "2014-2015 Average Tuition (dollars)") To assess model performance, we’ll use squared error, which we calculate as:

$\mathcal{L(y_i, f(x_i))} = (y_i - f(x_i))^2$

where we treat $y$ as a vector in $\mathbb{R}^n$. Our first measure of model performance is the in sample performance, which is the mean squared error (MSE) on all the training data:

$\texttt{in sample error} = {1 \over n} \sum_{i = 1}^n \mathcal{L}(y_i, f(x_i))$

We can calculate this in R like so:

formula <- avg_tuition_15_16 ~ avg_tuition_14_15
model <- lm(formula, data)

# utility to extract response vector from formula and dataframe
response <- function(formula, data) model.response(model.frame(formula, data))

pred <- predict(model, data)
true <- response(formula, data)

# in the code I use root mean squared error (RMSE) as a metric rather than
# MSE. this keeps the presentation of the math slightly nicer, since I can
# omit square roots, but keeps the errors on a more interpretable scale.

# I use Metrics::rmse for predictions that live in vectors
# and yardstick::rmse for predictions that live in dataframes
in_sample <- Metrics::rmse(pred, true)

The in sample error on the original data is 186 dollars.

### Apparent and out of sample bootstrap error

There are several ways to calculate the error on our bootstrap samples. One first step is to calculate the bootstrap in sample error by estimating the performance of a model fit on each bootstrap sample on the each bootstrap sample $X^*_b$ itself.

To write this out more formally, let $f_b$ be the model fit to the $b^{th}$ bootstrapped dataset, let $I_b$ be the data points that made it into the $b^{th}$ bootstrap sample and let $n_b$ be the total number of data points in the $b^{th}$ bootstrap sample

$\texttt{bootstrap in sample error} = \\ {1 \over B } \sum_{b = 1}^B {1 \over n_b} \sum_{i \in I_b} \mathcal{L}(y_i, f_b(x_i))$

However, we also know that for example bootstrap sample, some of the original data didn’t get used to fit $f_b$. We can use that data to calculate bootstrap out of sample error:

$\texttt{bootstrap out of sample error} = \\ {1 \over B } \sum_{b = 1}^B {1 \over n - n_b} \sum_{i \notin I_b} \mathcal{L}(y_i, f_b(x_i))$

When someone tells me that they used the bootstrap to evaluate their model, I typically assume that they’re reporting the bootstrap out of sample error, especially if they’re from the machine learning community.

The bootstrap in sample error typically underestimates prediction error, and the bootstrap out of sample error typically overestimates prediction error. There are several variants of the bootstrap that try to correct these biases.

### The optimism bootstrap

The first of these is the optimism bootstrap. First we define the optimism of the $b^{th}$ bootstrap model.

$\mathcal{O}_b = {1 \over n} \sum_{i = 1}^n \mathcal{L}(y_i, f_b(x_i)) - {1 \over n_b} \sum_{i \in I_b} \mathcal{L}(y_i, f_b(x_i))$

This is the same as calculating the average error of $f_b$ on the entire original sample, and then subtracting the bootstrap in sample error. To get a better estimate of the overall error we take the average optimism and add it to the in sample error estimate:

$\texttt{optimism bootstrap error} = \\ \texttt{in sample error} + {1 \over B} \sum_{b=1}^B \mathcal{O}_b$

### The 0.632 and 0.632+ bootstrap

Interestingly, the bootstrap out of sample error is somewhat pessimistic (Efron and Tibshirani (1994), Efron and Tibshirani (1997)). The 0.632 bootstrap estimator tries to address this problem by combining the in sample performance estimate with the bootstrap out of sample performance estimate:

$\texttt{0.632 bootstrap error} = \\ \quad \\ 0.632 \cdot \texttt{bootstrap out of sample error} + \\ 0.368 \cdot \texttt{in sample error}$

Where does the 0.632 come from? On average a bootstrap sample contains 63.2% of the data points in the original dataset. This Cross Validated answer has some nice discussion.

Similarly, the 0.632+ bootstrap estimator tries to find a good balance between the in sample error and the bootstrap out of sample error. To do this, it considers the no information error rate:

$\texttt{no info error rate} = {1 \over n^2} \sum_{i=1}^n \sum_{j=1}^n \mathcal{L}(y_i, f(x_j))$

which is the expected error rate when data points and responses are randomly assigned. Averaging over all $i, j$ you get a measure of how well you can predict when you know pretty much nothing. Then you can estimate the relative overfitting of a model with:

$\texttt{overfit} = {\texttt{bootstrap out of sample error} - \texttt{in sample error} \over \texttt{no info error rate} - \texttt{in sample error}}$

The 0.632+ uses this to weight the bootstrap out of sample error and the in sample error according to

$\texttt{w} = \texttt{weight on out of sample bootstrap error} \\ = {0.632 \over 1 - 0.368 \cdot \texttt{overfit}}$

and the final 0.632+ estimator is

$\texttt{0.632+ bootstrap error} = \\ \quad \\ \texttt{w} \cdot \texttt{out of sample bootstrap error} + \\ (1 - \texttt{w}) \cdot \texttt{in sample error}$

The idea is that the 0.632 estimator can be optimistic, and to take overfitting into account to correct this. If the relative overfitting is zero, the 0.632+ estimator reduces to the 0.632 estimator.

## Actually calculating these things in R

Before anything else, we need bootstrap samples. Luckily, the rsample package makes this super convenient.

# create an rset object: a data frame with a list-column full
# of bootstrap resamples
boots <- bootstraps(data, times = 25)
boots
## # Bootstrap sampling
## # A tibble: 25 x 2
##    splits          id
##    <list>          <chr>
##  1 <split [50/17]> Bootstrap01
##  2 <split [50/14]> Bootstrap02
##  3 <split [50/18]> Bootstrap03
##  4 <split [50/19]> Bootstrap04
##  5 <split [50/16]> Bootstrap05
##  6 <split [50/16]> Bootstrap06
##  7 <split [50/20]> Bootstrap07
##  8 <split [50/17]> Bootstrap08
##  9 <split [50/17]> Bootstrap09
## 10 <split [50/17]> Bootstrap10
## # ... with 15 more rows

The individual bootstrap samples are contained in the rsplit objects:

one_boot <- boots\$splits[]
one_boot
## <50/17/50>

This print method is to help track which data went into the bootstrap sample. The format here is <# data points in resampled data set / # original data points not in the resampled data set / # original data points>.

If we want to see the bootstrap sample itself, we can:

analysis(one_boot)
## # A tibble: 50 x 3
##    state         avg_tuition_15_16 avg_tuition_14_15
##    <chr>                     <dbl>             <dbl>
##  1 Wisconsin                  8815              8785
##  2 California                 9270              9187
##  3 Utah                       6363              6171
##  4 Kentucky                   9567              9223
##  5 Idaho                      6818              6610
##  6 Massachusetts             11588             10987
##  7 Arkansas                   7867              7606
##  8 Alabama                    9751              9496
##  9 Connecticut               11397             10664
## 10 Georgia                    8447              8063
## # ... with 40 more rows

If we want to see all the data points that didn’t go into the bootstrap sample, we can use:

assessment(one_boot)
## # A tibble: 17 x 3
##    state        avg_tuition_15_16 avg_tuition_14_15
##    <chr>                    <dbl>             <dbl>
##  2 Florida                   6360              6345
##  3 Hawaii                   10175              9713
##  4 Indiana                   9120              9049
##  5 Kansas                    8530              8270
##  6 Louisiana                 7871              7337
##  7 Maine                     9573              9560
##  8 Minnesota                10831             10582
## 10 North Dakota              7688              7527
## 11 Ohio                     10196             10104
## 12 Oklahoma                  7450              7094
## 13 Pennsylvania             13395             13157
## 14 Rhode Island             11390             10977
## 15 Tennessee                 9263              8941
## 16 Virginia                 11819             11202
## 17 Washington               10288             10703

Now we want to fit models on each of the bootstrap samples and assess various performance metrics. We write some helper functions to smooth things along.

# evaluate model performance on a particular dataset
performance <- function(model, data, metric = Metrics::rmse) {

# aside: it would be nice to be able to extract a data/design object from
# the model, and then extract the response from that. someday.

true <- response(formula, data)
pred <- predict(model, data)
metric(true, pred)
}

# get the no info error rate. should be used on the model
# fit to full original data
no_info_error_rate <- function(model, formula, data, metric = Metrics::rmse) {

true <- response(formula, data)
pred <- predict(model, data)

crossed <- crossing(true, pred)
with(crossed, metric(true, pred))
}

# NOTE: bs in variable names is an abbreviation for bootstrap

# 0.632 bootstrap estimate of performance
bs_632 <- function(in_sample, bs_out_of_sample) {
0.368 * in_sample + 0.632 * bs_out_of_sample
}

# 0.632+ bootstrap estimate of performance
bs_632p <- function(in_sample, bs_out_of_sample, no_info_error_rate) {
relative_overfit <- (bs_out_of_sample - in_sample) /
(no_info_error_rate - in_sample)
w <- 0.632 / (1 - 0.368 * relative_overfit)
w * bs_out_of_sample + (1 - w) * in_sample
}

Now we can fit models on each bootstrapped dataset:

no_info_perf <- no_info_error_rate(model, formula, data)

boot_performance <- boots %>%
mutate(
# fit a model on *bootstrap sample* of data
model = map(splits, ~lm(formula, analysis(.x))),

# bootstrap in sample error
bs_is_perf = map2_dbl(model, splits, ~performance(.x, analysis(.y))),

# bootstrap out of sample error
bs_os_perf = map2_dbl(model, splits, ~performance(.x, assessment(.y))),

# bootstrap error on full data
data_perf = map_dbl(model, ~performance(.x, data)),

# optimism
optimism = bs_is_perf - data_perf,

# optimism corrected error estimate
bs_optimism = in_sample - optimism,

# 0.632 bootstrap error estimate
bs_632_perf = bs_632(bs_is_perf, bs_os_perf),

# 0.632+ bootstrap error estimate
bs_632p_perf = bs_632p(bs_is_perf, bs_os_perf, no_info_perf))

We can calculate the point estimates we discussed above:

clean_performance <- boot_performance %>%
select_if(is.numeric) %>%
select(-optimism, -data_perf) %>%
gather(model, statistic) %>%
transmute(measure = recode(
model,
"bs_is_perf" = "bootstrap in sample",
"bs_os_perf" = "bootstrap out of sample" ,
"bs_optimism" = "optimism bootstrap",
"bs_632_perf" = "0.632 bootstrap",
"bs_632p_perf" = "0.632+ bootstrap"),
error = statistic
)

clean_performance %>%
group_by(measure) %>%
summarize_if(is.numeric, mean) %>%
mutate_if(is.numeric, round) %>%
knitr::kable()
measure error
0.632 bootstrap 189
0.632+ bootstrap 189
bootstrap in sample 186
bootstrap out of sample 190
optimism bootstrap 190

However, point estimates aren’t good ways to compare the performance of two models because they don’t give us a sense of how much model performance might vary on different data sets. Visualizing the sampling distribution of each of the metrics gives us a more complete view of model performance:

clean_performance %>%
ggplot(aes(measure, error,color = measure)) +
geom_boxplot() +
geom_jitter() +
scale_color_brewer(type = "qual") +
coord_flip() +
labs(title = "Sampling distributions of bootstrap performance metrics",
y = "rmse error (dollars)") +
theme(axis.title.y = element_blank(),
legend.position = "none") For this particular model, we see that the 0.632 and 0.632+ estimators are pretty much the same. This makes sense because a linear model fits the data quick well and so there should be little overfitting. We can also see that the bootstrap in sample error rate has the lowest median.

Given how it can be difficult to keep track of which metric should be calculated on which dataset, I should probably write some tests to confirm that my code does what I want it to. Since this is a blog post, I’m going to pass on that for the moment.

## Using the bootstrap in practice

When you have fewer than 20,000 data points, it’s reasonable to use the optimism bootstrap with $B = 200$ to $B = 400$. When you have more data, cross-validation4 or simply splitting the data becomes more reliable. Any steps taken to develop a model should be performed on each bootstrap dataset. For example, if you use LASSO for feature selection and then fit a model, the feature selection step needs to be included in the bootstrap procedure.

In some studies (Steyerberg et al. (2001)) the 0.632 and 0.632+ estimators have similar performance. Asymptotically, the 0.632 estimator is equivalent to the optimism bootstrap (Efron and Tibshirani (1994))5. It isn’t clear if one is preferrable over the other in finite samples, so I’d just stick with the estimator your audience is most likely to already be familiar with. You should be fine so long as you avoid using the bootstrap in sample error estimate.

If you are interested in comparing several models, it probably isn’t sufficient to compare the sampling distributions of these error measures. If you fit multiple models on the same bootstrap datasets, the bootstrap samples will have an impact on model performance. Hierarchicals models are a good way to model this. Once you have your bootstrapped error estimates, Max Kuhn’s tidyposterior package can help you take the next step when comparing models.

If you’re interested, stick around for future posts, where I’ll cover the process of building predictive models and how to select from several predictive models.

Feedback is always welcome!

1. I feel guilty about using linear regression as an example. OLS on the original data is too simple. There’s no feature selection, no parameter tuning, etc. I’m on the lookout for a good canonical modelling example to use in posts like this. Hopefully the work on recipes and parsnip standardizes interfaces enough to make this reasonable sometime soon.

2. I’m trying to get into the habit of always writing out the mathematical form of the model I’m fitting. Richard McElreath writes about why this is important.

3. Typically, regressions using averaged data have inappropriately small confidence intervals. However, the goal here is to demonstrate various bootstrapping methods, rather than inference, so we’ll ignore this for now.

4. Cross validation is much less stable than the bootstrap on data sets with less than 20,000 data points (Steyerberg et al. (2001)). Frank Harrell has run a number of simulations showing that you need at least 100 repetitions of 10-fold cross-validation to get accurate error estimates at this data size (Harrell (2015)). Here’s one such simulation.

5. The relevant chapter can be found online. You may also enjoy the draft of Frank Harrell’s new book.