Discussion/Review of Sampling Distributions

A statistic \(T\) is a function of the data that does not depend on any unknown parameters (i.e. we can calculate it given the data). This means that given the data values \(x_1, x_2, \cdots x_n\), the statistic \(T\) is “just a number”. For instance, if \(x_1 =17, x_2=20, x_3 = 13\), then the sample mean (a statistic) is \(\bar x = \frac{50}{3} \approx 16.67\). Before we have observed the data, however, the “data” are random variables \(X_1, X_2,\cdots X_n\), and our statistic \(T\), being a function of random variables, is also a random variable. The distribution of \(T\) is called the “Sampling Distribution of \(T\)”.

For example, if we have \[X_1, X_2, \cdots X_n \stackrel{iid}{\sim} N(\mu, \sigma^2)\] and the statistic of interest is \(\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i\) then we know that \[\bar{X} \sim N(\mu, \sigma^2/n)\] and this is the sampling distribution of \(\bar{X}\). The sampling distribution of a statistic is not always so easy to find. Let’s consider two cases where the sampling distribution is harder to find analytically.
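We can verify this sampling distribution by simulation. The sketch below repeatedly draws a Normal sample and records its mean; the particular values of \(\mu\), \(\sigma\), and \(n\) are arbitrary choices for illustration.

```r
# Simulate many datasets of size n from N(mu, sigma^2)
# and record the sample mean of each
set.seed(1)                   # for reproducibility
mu <- 5; sigma <- 2; n <- 25  # arbitrary illustrative values
M <- 10000                    # number of simulated datasets
xbars <- replicate(M, mean(rnorm(n, mean=mu, sd=sigma)))

mean(xbars)  # should be close to mu
sd(xbars)    # should be close to sigma/sqrt(n) = 0.4
```

The empirical mean and standard deviation of the simulated \(\bar{X}\) values line up with \(\mu\) and \(\sigma/\sqrt{n}\), as the theory predicts.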

Case 1

Suppose we have \(40\) data points from a skewed distribution. A histogram of the data is given below.

Let’s say we are interested in the mean of this population, so the statistic of interest is the sample mean \(\bar{X}\). Depending on the distribution that we assume for the data, this sampling distribution may or may not be analytically tractable.

We begin by calculating the sample mean and sample variance.

#Mean of data
mean(x)
## [1] 9.60526
#Variance of data
var(x)
## [1] 261.9177

The Central Limit Theorem (CLT) tells us that when \(n\) is large, the sample mean will be approximately Normally distributed. But how do we know whether \(n\) is large enough? Should we trust the CLT approximation here, despite the heavy skew of the data?
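One way to build intuition is to simulate from a known skewed distribution and watch how the sampling distribution of \(\bar{X}\) behaves as \(n\) grows. The sketch below uses the Exponential distribution purely for illustration; it is not the distribution of our data.

```r
# Sketch: sample means from a known right-skewed distribution
# (Exponential) at a small and a large sample size
set.seed(1)
M <- 10000
xbar_small <- replicate(M, mean(rexp(5)))    # n = 5
xbar_large <- replicate(M, mean(rexp(100)))  # n = 100

par(mfrow=c(1,2))
hist(xbar_small, breaks=30, main="n = 5")
hist(xbar_large, breaks=30, main="n = 100")
```

The \(n=5\) histogram is visibly skewed right, while the \(n=100\) histogram looks close to Normal, showing that “large enough” depends on how skewed the underlying distribution is.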

Case 2

Consider a new set of \(200\) data points (we will call these data \(y_i\)). A histogram is given below.

Is this distribution skewed right? Or symmetric? It’s hard to tell. Recall that the population skew of a distribution is defined as \[\gamma = E\left(\left(\frac{Y-E(Y)}{SD(Y)}\right)^3\right)\]

A simple estimator (a statistic) for this parameter is the “sample skewness” given below \[\hat\gamma = \frac{\frac{1}{n}\sum_{i=1}^n(y_i - \bar y)^3}{s^3}\] where \(\bar y\) and \(s\) are the sample mean and standard deviation of the data respectively. So here’s the question… what is the sampling distribution of \(\hat\gamma\) ? This distribution would surely be a nightmare to calculate analytically.
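The estimator \(\hat\gamma\) is simple to code directly. Here is a minimal sketch; the function name `sample_skew` is my own, not from the text.

```r
# Sample skewness: average cubed deviation from the mean,
# divided by s^3 (s = sample standard deviation, via sd())
sample_skew <- function(y){
  mean((y - mean(y))^3) / sd(y)^3
}

# Example on a clearly right-skewed sample
set.seed(1)
g <- sample_skew(rexp(1000))
g  # positive, near the Exponential's population skewness of 2
```

A positive value indicates right skew, a negative value left skew, and values near \(0\) suggest symmetry.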

The Bootstrap Algorithm

The Bootstrap Algorithm was first published by Bradley Efron in 1979 and is known as a resampling technique. The Bootstrap (at least this version) is a non-parametric method, because it doesn’t require that we assume any particular distribution for the data, a useful feature.

The basic idea is that, if the sample data accurately reflects the population, we can “resample” the data and construct an approximation of the sampling distribution for a statistic \(T\). This approximation is sometimes called the Bootstrap distribution of \(T\). It is important to keep in mind that, like most statistical methods, the Bootstrap can fail when the sample size is very small.

The Algorithm is actually pretty simple, and works as follows.

  1. Create a “new” dataset by sampling from the original data (with replacement) until you have a new dataset of size \(n\).
  2. Calculate the statistic on this new dataset and call it \(T_1\).
  3. Repeat steps \(1\) and \(2\) a large number of times (let’s say \(B\) times) so that you have \(B\) estimates \(T_1, T_2, \cdots T_B\). This is a numerical approximation to the sampling distribution of \(T\).
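The three steps above can be wrapped into a short helper function. This is a sketch; the name `bootstrap` and its arguments are my own, and the example data are simulated for illustration.

```r
# Generic (non-parametric) bootstrap: resample the data with
# replacement B times and apply the statistic to each resample
bootstrap <- function(x, statistic, B=1000){
  replicate(B, statistic(sample(x, length(x), replace=TRUE)))
}

# Example: bootstrap distribution of the sample mean
set.seed(1)
x <- rexp(40)
T_boot <- bootstrap(x, mean, B=2000)
hist(T_boot, breaks=30)
```

Because `statistic` is passed as a function, the same helper works for the mean, the sample skewness, or any other statistic of interest.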

Case 1 - With Bootstrap

In this example, we can use the Bootstrap to approximate the sampling distribution of the sample mean \(\bar{X}\). If the bootstrap distribution looks approximately Normal, it is reasonable to assume that the CLT will give (at least) a decent approximation. If not, we should be wary of trusting the CLT for this data.

B <- 1000 #Set B to a large number
boot_samples <- rep(NA, B) #Create a vector to store the bootstrap estimates
for(i in 1:B){
  x_new <- sample(x, length(x), replace=T) #Create new dataset
  boot_samples[i] <- mean(x_new) #Store the bootstrap estimate
}

Now that we have constructed the bootstrap distribution, we can plot it and check for normality.

par(mfrow=c(1,2)) #Put plots side by side
#Plot the Bootstrap distribution with an overlayed Normal density curve 
hist(boot_samples, breaks=30, freq=F)
curve(dnorm(x, mean(boot_samples), sd=sd(boot_samples)), add=T, col='red', lwd=2)

#Create a QQ-plot of the bootstrap distribution
qqnorm(boot_samples)
qqline(boot_samples)

It is clear from these plots that the sampling distribution of \(\bar{X}\) is slightly skewed right. Technically speaking, we likely need more samples before we should completely trust the CLT. Still, the bootstrap distribution is approximately Normal and thus the CLT may give reasonable answers.

Case 2 - With Bootstrap

We can begin by calculating the sample skewness of the original data.

#Calculate sample skewness
n <- length(y)
(1/n)*sum((y-mean(y))^3)/sd(y)^3
## [1] 0.3757131

As we observed, the skew is positive, indicating that the data is slightly skewed right. But how significant is this result? Since the sample size is fairly large, this is an excellent candidate for the bootstrap. Let’s approximate the sampling distribution of \(\hat\gamma\) as follows.

n <- length(y) #Get sample size
B <- 1000 #Set B to a large number
boot_samples <- rep(NA, B) #Create a vector to store the bootstrap estimates
for(i in 1:B){
  y_new <- sample(y, length(y), replace=T) #Create new dataset
  boot_samples[i] <- (1/n)*sum((y_new-mean(y_new))^3)/sd(y_new)^3 #Store the bootstrap estimate
}
hist(boot_samples, breaks=20) #Show bootstrap distribution

Now that we have an approximation to the sampling distribution, we can actually find a Bootstrapped Confidence Interval for \(\gamma\). For a fixed confidence level \(C \in (0,1)\) we find the interval \((a,b)\) which contains the middle \(C\times 100\)% of the bootstrap estimates. This can be done quite easily in R as follows.

#Set confidence level to 0.95
C <- 0.95
alpha <- 1-C
#Get confidence interval
CI <- quantile(boot_samples, probs=c(alpha/2, 1-alpha/2))
CI
##      2.5%     97.5% 
## 0.1324308 0.6182025
#Plot bootstrap distribution and show CI
hist(boot_samples, breaks=20)
abline(v=CI, lwd=2, col='red', lty=3)

Interpreting this interval goes something like this: We are \(95\%\) confident that the true population skewness is between \(0.132\) and \(0.618\). Thus we have some level of confidence that the skew of this distribution is indeed positive.

Homework Assignment

Problem One.

For this problem, you will create a dataset by simulating data as follows. Each dataset will have \(n=100\) data points.

a <- 0.1
set.seed(100) 
x <- rgamma(100, a, a/10)

Create a histogram of the data. Setting \(B=10,000\), find the bootstrap distribution for the sample mean \(\bar{X}\). Create a histogram of the bootstrap distribution and comment on what it says about the trustworthiness of the CLT.

Repeat this problem two more times, setting \(a=0.01\) and \(a = 0.001\).

Problem Two.

Read in the CDI dataset, which contains demographic information for \(440\) US counties. Then extract the column titled PercentOver65, which gives the percentage of the population (for each county) over the age of \(65\). This can be done as follows.

data <- read.csv('http://math.unm.edu/~knrumsey/classes/spring17/MiniProjects/data.csv')
y <- data$PercentOver65
  1. Create a histogram of the data.
  2. Calculate the sample skewness of the data. How does this compare to what you observe?
  3. Setting \(B=10,000\), use the Bootstrap to approximate the sampling distribution of the sample skewness. Plot this distribution using a histogram.
  4. Create a \(99\%\) Bootstrapped confidence interval for the population skew \(\gamma\). Interpret your interval. Do you have evidence to believe the underlying data distribution is skewed?