When we build a predictive model, we are interested in how the model will perform on data it hasn’t seen before. If we have lots of data, we can split it into training and test sets to assess model performance. If we don’t have lots of data, it’s better to fit a model using all of the available data and to assess its predictive performance using resampling techniques. The bootstrap is one such resampling technique. This post discusses several variants of the bootstrap that are appropriate for estimating predictive performance.

## A brief introduction to the bootstrap

The bootstrap isn’t a particular procedure so much as a general strategy. We assume that our data comes from some underlying population \(F\) that we care about. We’re interested in the sampling distribution of some statistic \(T\) that we calculate from the data. In our case \(T\) is predictive performance.

We don’t know \(F\), but we can treat the data as an approximation \(\hat F\) of \(F\). Here \(\hat F\) is the empirical distribution of the data, which gives equal probablity to each observed data point. So we know how to sample from \(\hat F\). The bootstrap says to sample from \(\hat F\) many times and calculate \(T\) using these samples. The sampling distribution of \(T\) under \(\hat F\) then approximates the sampling distribution of \(T\) under \(F\).

The canonical diagram comes from Efron and Tibshirani (1994). In the diagram the bootstrap world approximates the real world:

In terms of assessing model performance, this means that we sample our original data \(X \in \mathbb{R}^{n, p}\) with replacement to generate new datasets \(X^*_1, ..., X^*_B \in \mathbb{R}^{n, p}\), and then we estimate model performance on each of the bootstrap samples \(X^*_b\).

## Data & Model

In this post, we’ll fit a linear regression^{1} and compare:

- bootstrap in-sample error
- bootstrap out-of-sample error
- the optimism bootstrap
- the 0.632 bootstrap
- the 0.632+ bootstrap

For data, we’ll use average college tuition costs in each state. The original data source is here, but I downloaded it from the Tidy Tuesday repository.

In particular, we’ll see how predictive tuition costs in 2014-15 are for 2015-16 using a simple linear regression of the form^{2}:

\[y_i = f(x_i) = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \varepsilon_i \sim \mathrm{N}(0, \sigma^2)\]

Here \(y_i\) is the average tuition in state \(i\) for the 2015-16 academic year^{3}, and \(x_i\) is the average tuition in state \(i\) for the 2014-15 academic year. We have \(n = 50\) data points. Once we fit a model, we’ll refer to the predicted tuition for state \(i\) as \(f(x_i)\). The data looks like:

```
library(readxl)
library(here)
library(tidyverse)
library(rsample)
library(Metrics)
set.seed(27)
data <- read_xlsx(here("content", "data", "us_avg_tuition.xlsx")) %>%
transmute(state = State,
avg_tuition_15_16 = `2015-16`,
avg_tuition_14_15 = `2014-15`) %>%
mutate_if(is.numeric, round)
data
```

```
## # A tibble: 50 x 3
## state avg_tuition_15_16 avg_tuition_14_15
## <chr> <dbl> <dbl>
## 1 Alabama 9751 9496
## 2 Alaska 6571 6149
## 3 Arizona 10646 10414
## 4 Arkansas 7867 7606
## 5 California 9270 9187
## 6 Colorado 9748 9299
## 7 Connecticut 11397 10664
## 8 Delaware 11676 11515
## 9 Florida 6360 6345
## 10 Georgia 8447 8063
## # ... with 40 more rows
```

We quickly plot the data with a linear regression overlaid:

```
ggplot(data, aes(avg_tuition_14_15, avg_tuition_15_16)) +
geom_smooth(method = "lm", se = FALSE) +
geom_point(size = 2) +
labs(title = "Average College Tuition",
subtitle = "Each point represents one state",
y = "2015-2016 Average Tuition (dollars)",
x = "2014-2015 Average Tuition (dollars)")
```

To assess model performance, we’ll use squared error, which we calculate as:

\[\mathcal{L(y_i, f(x_i))} = (y_i - f(x_i))^2\]

where we treat \(y\) as a vector in \(\mathbb{R}^n\). Our first measure of model performance is the *in sample* performance, which is the mean squared error (MSE) on all the training data:

\[\texttt{in sample error} = {1 \over n} \sum_{i = 1}^n \mathcal{L}(y_i, f(x_i))\]

We can calculate this in R like so:

```
formula <- avg_tuition_15_16 ~ avg_tuition_14_15
model <- lm(formula, data)
# utility to extract response vector from formula and dataframe
response <- function(formula, data) model.response(model.frame(formula, data))
pred <- predict(model, data)
true <- response(formula, data)
# in the code I use root mean squared error (RMSE) as a metric rather than
# MSE. this keeps the presentation of the math slightly nicer, since I can
# omit square roots, but keeps the errors on a more interpretable scale.
# I use `Metrics::rmse` for predictions that live in vectors
# and `yardstick::rmse` for predictions that live in dataframes
in_sample <- Metrics::rmse(pred, true)
```

The in sample error on the original data is 186 dollars.

### Apparent and out of sample bootstrap error

There are several ways to calculate the error on our bootstrap samples. One first step is to calculate the *bootstrap in sample error* by estimating the performance of a model fit on each bootstrap sample on the each bootstrap sample \(X^*_b\) itself.

To write this out more formally, let \(f_b\) be the model fit to the \(b^{th}\) bootstrapped dataset, let \(I_b\) be the data points that made it into the \(b^{th}\) bootstrap sample and let \(n_b\) be the total number of data points in the \(b^{th}\) bootstrap sample

\[\texttt{bootstrap in sample error} = \\ {1 \over B } \sum_{b = 1}^B {1 \over n_b} \sum_{i \in I_b} \mathcal{L}(y_i, f_b(x_i))\]

However, we also know that for example bootstrap sample, some of the original data didn’t get used to fit \(f_b\). We can use that data to calculate *bootstrap out of sample error*:

\[\texttt{bootstrap out of sample error} = \\ {1 \over B } \sum_{b = 1}^B {1 \over n - n_b} \sum_{i \notin I_b} \mathcal{L}(y_i, f_b(x_i))\]

When someone tells me that they used the bootstrap to evaluate their model, I typically assume that they’re reporting the *bootstrap out of sample error*, especially if they’re from the machine learning community.

The *bootstrap in sample error* typically underestimates prediction error, and the *bootstrap out of sample error* typically overestimates prediction error. There are several variants of the bootstrap that try to correct these biases.

### The optimism bootstrap

The first of these is the optimism bootstrap. First we define the optimism of the \(b^{th}\) bootstrap model.

\[\mathcal{O}_b = {1 \over n} \sum_{i = 1}^n \mathcal{L}(y_i, f_b(x_i)) - {1 \over n_b} \sum_{i \in I_b} \mathcal{L}(y_i, f_b(x_i))\]

This is the same as calculating the average error of \(f_b\) on the entire original sample, and then subtracting the bootstrap in sample error. To get a better estimate of the overall error we take the average optimism and add it to the *in sample error* estimate:

\[\texttt{optimism bootstrap error} = \\ \texttt{in sample error} + {1 \over B} \sum_{b=1}^B \mathcal{O}_b\]

### The 0.632 and 0.632+ bootstrap

Interestingly, the *bootstrap out of sample error* is somewhat pessimistic (Efron and Tibshirani (1994), Efron and Tibshirani (1997)). The *0.632 bootstrap* estimator tries to address this problem by combining the `in sample`

performance estimate with the *bootstrap out of sample* performance estimate:

\[\texttt{0.632 bootstrap error} = \\ \quad \\ 0.632 \cdot \texttt{bootstrap out of sample error} + \\ 0.368 \cdot \texttt{in sample error}\]

Where does the 0.632 come from? On average a bootstrap sample contains 63.2% of the data points in the original dataset. This Cross Validated answer has some nice discussion.

Similarly, the *0.632+ bootstrap estimator* tries to find a good balance between the in sample error and the bootstrap out of sample error. To do this, it considers the no information error rate:

\[\texttt{no info error rate} = {1 \over n^2} \sum_{i=1}^n \sum_{j=1}^n \mathcal{L}(y_i, f(x_j))\]

which is the expected error rate when data points and responses are randomly assigned. Averaging over all \(i, j\) you get a measure of how well you can predict when you know pretty much nothing. Then you can estimate the relative overfitting of a model with:

\[\texttt{overfit} = {\texttt{bootstrap out of sample error} - \texttt{in sample error} \over \texttt{no info error rate} - \texttt{in sample error}}\]

The 0.632+ uses this to weight the bootstrap out of sample error and the in sample error according to

\[\texttt{w} = \texttt{weight on out of sample bootstrap error} \\ = {0.632 \over 1 - 0.368 \cdot \texttt{overfit}}\]

and the final 0.632+ estimator is

\[\texttt{0.632+ bootstrap error} = \\ \quad \\ \texttt{w} \cdot \texttt{out of sample bootstrap error} + \\ (1 - \texttt{w}) \cdot \texttt{in sample error}\]

The idea is that the 0.632 estimator can be optimistic, and to take overfitting into account to correct this. If the relative overfitting is zero, the 0.632+ estimator reduces to the 0.632 estimator.

## Actually calculating these things in R

Before anything else, we need bootstrap samples. Luckily, the `rsample`

package makes this super convenient.

```
# create an `rset` object: a data frame with a list-column full
# of bootstrap resamples
boots <- bootstraps(data, times = 25)
boots
```

```
## # Bootstrap sampling
## # A tibble: 25 x 2
## splits id
## <list> <chr>
## 1 <S3: rsplit> Bootstrap01
## 2 <S3: rsplit> Bootstrap02
## 3 <S3: rsplit> Bootstrap03
## 4 <S3: rsplit> Bootstrap04
## 5 <S3: rsplit> Bootstrap05
## 6 <S3: rsplit> Bootstrap06
## 7 <S3: rsplit> Bootstrap07
## 8 <S3: rsplit> Bootstrap08
## 9 <S3: rsplit> Bootstrap09
## 10 <S3: rsplit> Bootstrap10
## # ... with 15 more rows
```

The individual bootstrap samples are contained in the `rsplit`

objects:

```
one_boot <- boots$splits[[1]]
one_boot
```

`## <50/17/50>`

This print method is to help track which data went into the bootstrap sample. The format here is `<# data points in resampled data set / # original data points not in the resampled data set / # original data points>`

.

If we want to see the bootstrap sample itself, we can:

`analysis(one_boot)`

```
## # A tibble: 50 x 3
## state avg_tuition_15_16 avg_tuition_14_15
## <chr> <dbl> <dbl>
## 1 Wisconsin 8815 8785
## 2 California 9270 9187
## 3 Utah 6363 6171
## 4 Kentucky 9567 9223
## 5 Idaho 6818 6610
## 6 Massachusetts 11588 10987
## 7 Arkansas 7867 7606
## 8 Alabama 9751 9496
## 9 Connecticut 11397 10664
## 10 Georgia 8447 8063
## # ... with 40 more rows
```

If we want to see all the data points that *didn’t* go into the bootstrap sample, we can use:

`assessment(one_boot)`

```
## # A tibble: 17 x 3
## state avg_tuition_15_16 avg_tuition_14_15
## <chr> <dbl> <dbl>
## 1 Alaska 6571 6149
## 2 Florida 6360 6345
## 3 Hawaii 10175 9713
## 4 Indiana 9120 9049
## 5 Kansas 8530 8270
## 6 Louisiana 7871 7337
## 7 Maine 9573 9560
## 8 Minnesota 10831 10582
## 9 Nevada 6667 6438
## 10 North Dakota 7688 7527
## 11 Ohio 10196 10104
## 12 Oklahoma 7450 7094
## 13 Pennsylvania 13395 13157
## 14 Rhode Island 11390 10977
## 15 Tennessee 9263 8941
## 16 Virginia 11819 11202
## 17 Washington 10288 10703
```

Now we want to fit models on each of the bootstrap samples and assess various performance metrics. We write some helper functions to smooth things along.

```
# evaluate model performance on a particular dataset
performance <- function(model, data, metric = Metrics::rmse) {
# aside: it would be nice to be able to extract a data/design object from
# the model, and then extract the response from that. someday.
true <- response(formula, data)
pred <- predict(model, data)
metric(true, pred)
}
# get the no info error rate. should be used on the model
# fit to full original data
no_info_error_rate <- function(model, formula, data, metric = Metrics::rmse) {
true <- response(formula, data)
pred <- predict(model, data)
crossed <- crossing(true, pred)
with(crossed, metric(true, pred))
}
# NOTE: `bs` in variable names is an abbreviation for `bootstrap`
# 0.632 bootstrap estimate of performance
bs_632 <- function(in_sample, bs_out_of_sample) {
0.368 * in_sample + 0.632 * bs_out_of_sample
}
# 0.632+ bootstrap estimate of performance
bs_632p <- function(in_sample, bs_out_of_sample, no_info_error_rate) {
relative_overfit <- (bs_out_of_sample - in_sample) /
(no_info_error_rate - in_sample)
w <- 0.632 / (1 - 0.368 * relative_overfit)
w * bs_out_of_sample + (1 - w) * in_sample
}
```

Now we can fit models on each bootstrapped dataset:

```
no_info_perf <- no_info_error_rate(model, formula, data)
boot_performance <- boots %>%
mutate(
# fit a model on *bootstrap sample* of data
model = map(splits, ~lm(formula, analysis(.x))),
# bootstrap in sample error
bs_is_perf = map2_dbl(model, splits, ~performance(.x, analysis(.y))),
# bootstrap out of sample error
bs_os_perf = map2_dbl(model, splits, ~performance(.x, assessment(.y))),
# bootstrap error on full data
data_perf = map_dbl(model, ~performance(.x, data)),
# optimism
optimism = bs_is_perf - data_perf,
# optimism corrected error estimate
bs_optimism = in_sample - optimism,
# 0.632 bootstrap error estimate
bs_632_perf = bs_632(bs_is_perf, bs_os_perf),
# 0.632+ bootstrap error estimate
bs_632p_perf = bs_632p(bs_is_perf, bs_os_perf, no_info_perf))
```

We can calculate the point estimates we discussed above:

```
clean_performance <- boot_performance %>%
select_if(is.numeric) %>%
select(-optimism, -data_perf) %>%
gather() %>%
transmute(measure = recode(
model,
"bs_is_perf" = "bootstrap in sample",
"bs_os_perf" = "bootstrap out of sample" ,
"bs_optimism" = "optimism bootstrap",
"bs_632_perf" = "0.632 bootstrap",
"bs_632p_perf" = "0.632+ bootstrap"),
error = statistic)
clean_performance %>%
group_by(measure) %>%
summarize_if(is.numeric, mean) %>%
mutate_if(is.numeric, round) %>%
knitr::kable()
```

measure | error |
---|---|

bootstrap in sample | 186 |

bootstrap out of sample | 190 |

optimism bootstrap | 190 |

0.632 bootstrap | 189 |

0.632+ bootstrap | 189 |

However, point estimates aren’t good ways to compare the performance of two models because they don’t give us a sense of how much model performance might vary on different data sets. Visualizing the sampling distribution of each of the metrics gives us a more complete view of model performance:

```
clean_performance %>%
ggplot(aes(measure, error,color = measure)) +
geom_boxplot() +
geom_jitter() +
scale_color_brewer(type = "qual") +
coord_flip() +
labs(title = "Sampling distributions of bootstrap performance metrics",
y = "rmse error (dollars)") +
theme(axis.title.y = element_blank(),
legend.position = "none")
```

For this particular model, we see that the 0.632 and 0.632+ estimators are pretty much the same. This makes sense because a linear model fits the data quick well and so there should be little overfitting. We can also see that the bootstrap in sample error rate has the lowest median.

Given how it can be difficult to keep track of which metric should be calculated on which dataset, I should probably write some tests to confirm that my code does what I want it to. Since this is a blog post, I’m going to pass on that for the moment.

## Using the bootstrap in practice

When you have fewer than 20,000 data points, it’s reasonable to use the optimism bootstrap with \(B = 200\) to \(B = 400\). When you have more data, cross-validation^{4} or simply splitting the data becomes more reliable. Any steps taken to develop a model should be performed on each bootstrap dataset. For example, if you use LASSO for feature selection and then fit a model, the feature selection step needs to be included in the bootstrap procedure.

In some studies (Steyerberg et al. (2001)) the 0.632 and 0.632+ estimators have similar performance. Asymptotically, the 0.632 estimator is equivalent to the optimism bootstrap (Efron and Tibshirani (1994))^{5}. It isn’t clear if one is preferrable over the other in finite samples, so I’d just stick with the estimator your audience is most likely to already be familiar with. You should be fine so long as you avoid using the *bootstrap in sample error* estimate.

If you are interested in comparing several models, it probably isn’t sufficient to compare the sampling distributions of these error measures. If you fit multiple models on the same bootstrap datasets, the bootstrap samples will have an impact on model performance. Hierarchicals models are a good way to model this. Once you have your bootstrapped error estimates, Max Kuhn’s tidyposterior package can help you take the next step when comparing models.

If you’re interested, stick around for future posts, where I’ll cover the process of building predictive models and how to select from several predictive models.

Feedback is always welcome!

## References

Efron, Bradley, and Robert Tibshirani. 1997. “Improvements on Cross-Validation: The .632+ Bootstrap Method.” *Journal of the American Statistical Association* 92 (438). http://www.jstor.org/stable/2965703.

Efron, Bradley, and Robert J Tibshirani. 1994. *An Introduction to the Bootstrap*. CRC press.

Harrell, Frank. 2015. *Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis*. Springer.

Steyerberg, Ewout W, Frank E Harrell, Gerard J.J.M Borsboom, M.J.C Eijkemans, Yvonne Vergouwe, and J.Dik F Habbema. 2001. “Internal Validation of Predictive Models: Efficiency of Some Procedures for Logistic Regression Analysis.” *Journal of Clinical Epidemiology* 54 (8). http://www.sciencedirect.com/science/article/pii/S0895435601003419.

I feel guilty about using linear regression as an example. OLS on the original data is too simple. There’s no feature selection, no parameter tuning, etc. I’m on the lookout for a good canonical modelling example to use in posts like this. Hopefully the work on

`recipes`

and`parsnip`

standardizes interfaces enough to make this reasonable sometime soon.↩I’m trying to get into the habit of always writing out the mathematical form of the model I’m fitting. Richard McElreath writes about why this is important.↩

Typically, regressions using averaged data have inappropriately small confidence intervals. However, the goal here is to demonstrate various bootstrapping methods, rather than inference, so we’ll ignore this for now.↩

Cross validation is much less stable than the bootstrap on data sets with less than 20,000 data points (Steyerberg et al. (2001)). Frank Harrell has run a number of simulations showing that you need at least 100 repetitions of 10-fold cross-validation to get accurate error estimates at this data size (F. Harrell (2015)). Here’s one such simulation.↩

The relevant chapter can be found online. You may also enjoy the draft of Frank Harrell’s new book.↩