The bootstrap — or why you should care about uncertainty

by Adrià Luz

5 min read · Sep 10, 2018

As the Bruce brothers explain in their excellent book Practical Statistics for Data Scientists, one easy and effective way to estimate the sampling distribution of a statistic is to draw additional samples (with replacement) from the sample itself and recalculate the statistic for each resample. This procedure is called the bootstrap.

Resampling is the process of taking repeated samples from observed data (i.e. the dataset), and it includes both the bootstrap and permutation tests. I will cover permutation tests in a future post.

The bootstrap is widely used to find and plot the sampling distribution of a statistic (e.g. the mean) or of model quantities (e.g. the coefficient β1 in a linear regression, or a metric such as the AIC). This sampling distribution is then used to calculate the standard error of the statistic and its confidence intervals. The beauty of the bootstrap is that it doesn't necessarily involve any assumptions about the data or the sample statistic being normally distributed. This is in contrast to the traditional statistical approach of calculating standard errors and confidence intervals with formulas, which does require us to check that such assumptions hold. Nevertheless, it is important to note that parametric tests (e.g. t-tests) are perfectly suitable when the sampling distribution of the statistic is normal.
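For the sample mean, the textbook formula alluded to here is SE = s / √n, where s is the sample standard deviation and n the sample size; the corresponding interval is then mean ± z·SE for a standard normal quantile z. It is the z-based interval that leans on normality, and that is precisely the assumption the bootstrap lets us sidestep.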

But… why would we ever want to know the sampling distribution of a statistic when we could just calculate the statistic itself?! In my experience, skipping straight to the statistic is what most people, and certainly everyone without some kind of statistical background, do in the real world. We are all used to looking at point estimates, which dangerously omit any information about the uncertainty of the estimate. Statistics is about using a sample to infer something about a population we don't have access to, so understanding how uncertain our sample statistic is will help us make better inferences about that population.

Imagine we were interested in knowing how much the average person weighs in the UK. In this case, our population would be the UK population. So we would need to go and measure the weight of every single person living in the UK, which would be unfeasible for obvious reasons. What we would realistically do is the following: take a sample of the UK population (let's say 10,000 people) and measure their weights. It's important that we randomly select our sample, otherwise it might not be representative of the whole population. These 10,000 people would be the equivalent of the dataset we have at work. Now imagine we take that dataset and calculate the mean: the average weight in our sample is, say, 75kg. Does that mean that the average weight in the UK is 75kg? Remember, that is what we're actually interested in! In order to get some intuition regarding how good our sample statistic is, we need to understand the uncertainty that surrounds it. Thus, we need to work out how variable our statistic is by calculating its standard error and confidence intervals. This is where the bootstrap comes in handy.

First, we can take a look at the distribution of our data in order to see how the weights of the 10,000 people in our sample are distributed. We can see that in our (fake) dataset, the weight distribution is slightly skewed to the right, with a lower concentration of people in the higher end of the weight scale.
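The code behind the fake dataset isn't shown in the post, but a right-skewed sample roughly consistent with the numbers quoted below can be simulated like this (a minimal sketch; the gamma distribution, its parameters, and the seed are my assumptions, not the author's):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# a gamma distribution with shape=5 and scale=15 has mean 75kg and a
# right skew, roughly matching the weight distribution described above
data = pd.DataFrame({'weight': rng.gamma(shape=5, scale=15, size=10000)})

data.weight.hist(bins=50)
plt.xlabel('weight (kg)')
plt.ylabel('number of people')
plt.show()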

Calculating the mean of these weights gives us a point estimate: the average UK citizen in our sample weighs 74.46kg. However, the question that we need to ask ourselves is: what would have happened if we had selected a different set of 10,000 people? Would the mean of that hypothetical dataset be exactly 74.46kg? Probably not, and we need to quantify this uncertainty!
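(With the hypothetical data dataframe from the sketch above, that point estimate is simply data.weight.mean().)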

Let’s use bootstrapping to find the sampling distribution of the mean, which we’ll then use to calculate its standard error and confidence intervals. The bootstrap algorithm works like this:

  1. Draw a sample of the same size as your original dataset. Make sure to sample with replacement, effectively creating an infinite population in which the probability of an element being drawn remains unchanged from draw to draw. Since we allow for replacement, this bootstrap sample is most likely not identical to our initial sample — some data points will be duplicated and other data points from the original dataset will be omitted.
  2. Calculate and store the mean (or any other statistic or metric) of the resampled values.
  3. Repeat steps 1–2 R times, where R is a large number (e.g. 10,000).

This is how I’ve done it in Python:

import pandas as pd

def bootstrap(data, R=10000):

    means = []
    n = len(data)

    for i in range(R):
        # resample the whole dataset with replacement and store the mean
        sampled_data = data.sample(n=n, replace=True)
        mean = sampled_data.weight.mean()
        means.append(mean)

    return pd.DataFrame(means, columns=['means'])

Once we’ve done this, we’ll end up with a new dataset containing R means. Now we’re ready to plot a histogram of this dataset, which will show the sampling distribution of the mean.
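The plotting code isn't included in the post; assuming the bootstrap function above and the matplotlib import from the earlier sketch, it could look like this:

bootstrap_means = bootstrap(data)

bootstrap_means.means.hist(bins=50)
plt.xlabel('bootstrap sample mean (kg)')
plt.ylabel('count')
plt.show()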

As per the central limit theorem, this sampling distribution will be approximately normal and centred around the "true value" of 74.46kg (our original sample mean). The width of the bell-shaped curve is determined by the standard error, which is simply the standard deviation of the sampling distribution of the statistic. We can therefore estimate it by taking the standard deviation of the "means" dataframe. The standard error (SE) is 0.33107.
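As a sketch, reusing bootstrap_means and data from the snippets above, the standard error is one line; for the mean it should also land close to the textbook formula s / √n mentioned earlier:

# the bootstrap standard error is the standard deviation of the R means
print('Bootstrap SE: {:.5f}'.format(bootstrap_means.means.std()))

# for the sample mean, this should be close to the parametric s / sqrt(n)
print('Formula SE:   {:.5f}'.format(data.weight.std() / np.sqrt(len(data))))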

More interestingly, we can use this sampling distribution to calculate confidence intervals for our statistic. The process is quite simple and again doesn’t involve statistical formulas that depend on several assumptions — it is just a matter of calculating two percentiles. For example, if we wanted to calculate the 90% confidence interval, all we’d need to do is find the 5th and 95th percentiles, leaving 90% of the data in between. Those two percentiles would be the endpoints of our confidence interval. This is how you would do it in Python:

import numpy as np

def confidence_intervals(data, confidence_level=0.95):

    # the two percentiles that leave `confidence_level` of the data in between
    low_end = (1 - confidence_level) / 2
    high_end = 1 - low_end
    bottom_percentile = np.round(data.means.quantile(low_end), 2)
    top_percentile = np.round(data.means.quantile(high_end), 2)

    print('The {}% confidence interval is [{}, {}]'.format(
        confidence_level * 100, bottom_percentile, top_percentile))

Then you could call this function and calculate several confidence intervals:

for ci in [0.6, 0.7, 0.8, 0.9, 0.95, 0.99]:
confidence_intervals(bootstrap_means, confidence_level=ci)

And this would be the output:

The 60.0% confidence interval is [74.18, 74.74]
The 70.0% confidence interval is [74.11, 74.81]
The 80.0% confidence interval is [74.04, 74.89]
The 90.0% confidence interval is [73.92, 75.01]
The 95.0% confidence interval is [73.83, 75.11]
The 99.0% confidence interval is [73.61, 75.31]

As you can see, the “true value” of our original dataset (which is acting as the population in this case) is captured by all the confidence intervals above. Also notice the relationship between the confidence level and the width of the confidence interval: the more confident we want to be, the wider the confidence interval will be.
