The bootstrap — or why you should care about uncertainty

As the Bruce brothers explain in their excellent book Practical Statistics for Data Scientists, one easy and effective way to estimate the sampling distribution of a statistic is to draw additional samples (with replacement) from the sample itself and recalculate the statistic for each resample. This procedure is called the bootstrap.

  1. Draw a resample of n values, with replacement, from the original sample.
  2. Calculate and store the mean (or any other statistic or metric) of the resampled values.
  3. Repeat steps 1–2 R times, where R is a large number, e.g. 10,000.
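import pandas as pd

def bootstrap(data, R=10000):
    """Return R bootstrap means of the `weight` column of `data`."""
    means = []
    n = len(data)

    for _ in range(R):
        # Resample n rows from the data, with replacement
        sampled_data = data.sample(n=n, replace=True)
        # Calculate and store the mean of the resample
        means.append(sampled_data.weight.mean())

    return pd.DataFrame(means, columns=['means'])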
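The function above is tied to the mean of a `weight` column. Since the same procedure works for any statistic, a more general version might look like the sketch below. It assumes NumPy is available and runs on made-up weights; `bootstrap_statistic`, its parameters, and the synthetic data are illustrative names that do not come from the original analysis.

```python
import numpy as np

def bootstrap_statistic(values, statistic=np.mean, R=10_000, seed=0):
    """Return R bootstrap replicates of `statistic` computed on `values`.

    `statistic` must accept an `axis` keyword (np.mean, np.median, np.std, ...).
    """
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    n = len(values)
    # Draw all R resamples at once: each row holds n indices sampled with replacement
    idx = rng.integers(0, n, size=(R, n))
    # Apply the statistic to every resample (one per row)
    return statistic(values[idx], axis=1)

# Synthetic weights, made up purely for illustration
rng = np.random.default_rng(42)
weights = rng.normal(loc=75, scale=10, size=200)

boot_means = bootstrap_statistic(weights, np.mean)
boot_medians = bootstrap_statistic(weights, np.median)

# The spread of the replicates estimates the standard error of each statistic
print(boot_means.std(ddof=1), boot_medians.std(ddof=1))
```

Drawing every resample in one index array avoids the Python loop, at the cost of holding an R × n array in memory; for very large samples the loop version is the safer choice.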
import numpy as np

def confidence_intervals(data, confidence_level=0.95):
    # Split the excluded probability equally between the two tails
    low_end = (1 - confidence_level) / 2
    high_end = 1 - low_end
    bottom_percentile = np.round(data.means.quantile(low_end), 2)
    top_percentile = np.round(data.means.quantile(high_end), 2)

    print('The {}% confidence interval is [{}, {}]'.format(
        confidence_level * 100, bottom_percentile, top_percentile))

# bootstrap_means is the DataFrame returned by bootstrap()
for ci in [0.6, 0.7, 0.8, 0.9, 0.95, 0.99]:
    confidence_intervals(bootstrap_means, confidence_level=ci)
The 60.0% confidence interval is [74.18, 74.74]
The 70.0% confidence interval is [74.11, 74.81]
The 80.0% confidence interval is [74.04, 74.89]
The 90.0% confidence interval is [73.92, 75.01]
The 95.0% confidence interval is [73.83, 75.11]
The 99.0% confidence interval is [73.61, 75.31]
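
The output shows the interval widening as the confidence level rises: being more certain of capturing the true mean means accepting a wider range. A variant that returns the bounds instead of printing them makes this easy to check. This is only a sketch on a synthetic sample; `percentile_interval` and the data below are hypothetical and are not the article's dataset.

```python
import numpy as np

def percentile_interval(replicates, confidence_level=0.95):
    """Return the (low, high) percentile interval of bootstrap replicates."""
    low_end = (1 - confidence_level) / 2
    high_end = 1 - low_end
    low, high = np.quantile(replicates, [low_end, high_end])
    return low, high

# Synthetic sample, made up for illustration, and its bootstrap means
rng = np.random.default_rng(0)
sample = rng.normal(loc=75, scale=10, size=300)
idx = rng.integers(0, len(sample), size=(5_000, len(sample)))
replicates = sample[idx].mean(axis=1)

widths = []
for level in [0.6, 0.7, 0.8, 0.9, 0.95, 0.99]:
    low, high = percentile_interval(replicates, level)
    widths.append(high - low)
    print(f"{level:.0%}: [{low:.2f}, {high:.2f}] (width {high - low:.2f})")
```

Each width in `widths` should be larger than the one before it, mirroring the pattern in the printed intervals above.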

Tales about data, statistics, machine learning, visualisation, and much more. By Adrià Luz (@adrialuz) and Sara Gaspar (@sargaspar).
