Descriptive Statistics

Kellin Rumsey

9/4/2018

An Illustrative Dataset

The CDI dataset contains demographic information for the \(440\) most populated counties in the US. Throughout this handout, we will focus on two variables.

  1. percentPoverty - The percent of the population which is below poverty level.
  2. percentDiploma - The percent of the population with a high school diploma.
#Read in the data
CDI_data <- read.csv(
  'http://math.unm.edu/~knrumsey/classes/spring17/MiniProjects/data.csv')

#Create the variables
percentPoverty <- CDI_data$PercentBelowPoverty
percentDiploma <- CDI_data$PercentHSGraduates

Visualizing Data

It is always good to begin with a visual inspection of the data. Histograms are a great place to start. Similarly, Kernel Density Estimates (KDEs) are useful for visualizing the density function of a continuous variable.

#Create a two panel plot
par(mfrow=c(1,2))
#Make a histogram
hist(percentPoverty)
#Plot the KDE
plot(density(percentPoverty))

Measures of Center - Mean

The sample mean is what most people mean when they say “average”. In Statistics, we denote the mean of observations \(x_1, x_2, \cdots x_n\) by \(\bar{x}\), which we obtain by adding up the observations and dividing by the sample size. \[\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i\]

mean(percentPoverty)
## [1] 8.720682

Measures of Center - Median

The sample median of a set of numbers is any number \(M\) that splits the data in half. That is, the number of observations which are less than or equal to \(M\) should be the same as the number of observations which are greater than or equal to \(M\).

Definition: If we have \(n\) observations \(x_1\), \(x_2\), \(\cdots x_n\), we can arrange them from smallest to largest. We call the sorted sample \(x_{(1)}, x_{(2)}, \cdots x_{(n)}\). Then the median is defined as

\[M = \begin{cases} x_{\left(\frac{n+1}{2} \right)}, & \text{$n$ is odd} \\ \frac{x_{(n/2)} + x_{(n/2+1)}}{2}, & \text{$n$ is even} \end{cases}\]
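As a sanity check, here is a direct implementation of this definition (a minimal sketch; the helper my_median is ours, not part of base R):

#Implement the median definition directly
my_median <- function(x){
  xs <- sort(x)               #the sorted sample (order statistics)
  n <- length(x)
  if(n %% 2 == 1){
    xs[(n+1)/2]               #odd n: the middle order statistic
  } else {
    (xs[n/2] + xs[n/2+1])/2   #even n: average the two middle values
  }
}
my_median(percentPoverty)     #agrees with R's median() below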

median(percentPoverty)
## [1] 7.9

Measures of Center - Mode

The sample mode of a set of numbers is the value which occurs most frequently.

As an example, we can simulate \(n=100\) observations of some dummy data.

#Simulate some Poisson data (without a set.seed, your counts will differ)
dummy_data <- rpois(n=100, lambda=2.5)
table(dummy_data)
## dummy_data
##  0  1  2  3  4  5  6  7 
##  5 19 22 25 12  9  6  2

Based on the table, we can see that the sample mode is 3, since this value occurs 25 times.

#Extract the mode using R (kinda sloppy)
names(which.max(table(dummy_data)))
## [1] "3"

Question: Why does this sample mode not work well for continuous data?

Measures of Center - Mode

For continuous data, the sample mode fails because exact data points are rarely repeated, and any repeats are usually just an artifact of rounding.

names(which.max(table(percentPoverty)))
## [1] "9.8"

For the percent poverty example, the mode is 9.8, but it only occurs 10 times (out of 440 observations).

One option is to find the sample mode after rounding the data.

names(which.max(table(round(percentPoverty))))
## [1] "8"

Measures of Center - Mode \(\star\)

Intuitively, the mode is just the location of the peak of the distribution. Therefore, for continuous data, we can simply use the peak of a Kernel Density Estimate as an estimate of the mode.

#Get KDE
d <- density(percentPoverty)
#Estimate Mode
d$x[which.max(d$y)]
## [1] 7.661602

Statistics and Resistance

The mean, median and mode that we discussed in chapters \(3\) and \(4\) are different from what we are discussing now. We can refer to those as the “population” or “true” mean, median and mode.

The sample mean, sample median and sample mode are called statistics, because we can calculate them using only the data.

Now, we say that a statistic is resistant if it doesn’t change drastically in the presence of an outlier.

Example: What happens to the sample mean and the sample median if we append a single extreme observation to the data? The sketch below illustrates.
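A minimal sketch (the appended value of \(50\) is arbitrary, chosen only to be extreme):

#Append one artificial outlier to the data
x_dirty <- c(percentPoverty, 50)
#The mean is pulled noticeably toward the outlier
mean(percentPoverty)
mean(x_dirty)
#The median barely moves
median(percentPoverty)
median(x_dirty)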

The sample mean is not resistant to outliers, but the sample median is!

Measures of Spread - Variance

The sample variance, denoted by \(S^2\), is a commonly used measure of spread defined by \[S^2 = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}\]

The sample standard deviation, denoted \(S\), is just the square root of \(S^2\). The term \(\sum_{i=1}^n(x_i-\bar{x})^2\) is referred to as the sum of squares. The fact that we divide by \(n-1\) instead of \(n\) is sometimes unintuitive; doing so makes \(S^2\) an unbiased estimator of the population variance.

There is also a convenient identity for calculating \(S^2\) which is sometimes called the shortcut method. This identity will be very helpful for us in the next chapter.

\[\sum_{i=1}^n(x_i-\bar x)^2 = \sum_{i=1}^n x_i^2 - n\bar{x}^2 \]

Measures of Spread - Variance

#Calculate the variance
var(percentPoverty)
## [1] 21.6852
#Calculate the standard deviation
sqrt(var(percentPoverty))
## [1] 4.656737
sd(percentPoverty) #This works too
## [1] 4.656737
#Calculate the raw sum of squares (sum of the x_i^2, used by the shortcut method)
sum(percentPoverty^2)
## [1] 42981.93
#Calculate variance using shortcut method
n <- length(percentPoverty)

(sum(percentPoverty^2)-n*mean(percentPoverty)^2)/(n-1)
## [1] 21.6852

Order Statistics and Five Number Summary

Given a sample \(x_1, x_2, \cdots x_n\), the order statistics are just the sorted observations \(x_{(1)}, x_{(2)}, \cdots x_{(n)}\). That is, \(x_{(i)} \leq x_{(i+1)}\) for every \(i=1,2,\cdots n-1\). This means that \(x_{(1)}\) is the minimum of the sample and \(x_{(n)}\) is the maximum.

The Five Number Summary refers to \(5\) numbers (shocker) which describe the spread of the observations.

  1. The minimum, \(x_{(1)}\)
  2. The first quartile, \(Q_1\)
  3. The median, \(M\)
  4. The third quartile, \(Q_3\)
  5. The maximum, \(x_{(n)}\)

Intuitively, the quartiles divide the observations into \(4\) equally sized groups. For instance, \(25\%\) of the observations should be less than \(Q_1\) and \(25\%\) of the observations should be greater than \(Q_3\).

Five Number Summary

Consider the observations \((3,2,5,1,4)\). Sorting gives \(1, 2, 3, 4, 5\), so the minimum is \(1\), \(Q_1 = 2\), the median is \(3\), \(Q_3 = 4\), and the maximum is \(5\).

We can calculate the \(5\) number summary in R as follows.

quantile(c(3,2,5,1,4))
##   0%  25%  50%  75% 100% 
##    1    2    3    4    5
quantile(percentPoverty)
##   0%  25%  50%  75% 100% 
##  1.4  5.3  7.9 10.9 36.3

Measures of Spread - Range and IQR

Finally, we define the sample range \[R = x_{(n)} - x_{(1)}\] and the inter-quartile range (IQR) \[IQR = Q_3 - Q_1\]

#Compute range
diff(range(percentPoverty))
## [1] 34.9
#Compute IQR
IQR(percentPoverty)
## [1] 5.6

Identifying Outliers

An outlier is an observation which lies far from the other observations. An outlier can be a legitimate (but extreme) value, or the result of an error in measurement or data entry.

In general we do not remove outliers from the dataset without good reason.

Identifying Outliers:

  1. Visual inspection of data
  2. The \(1.5\times IQR\) Rule
  3. Using \(Z\)-scores
  4. Bonferroni’s correction

Identifying Outliers - Visual Inspection

Most of the time, it is sufficient to plot a histogram and visually inspect for strange points. For instance, the histogram below shows 3 or 4 points off to the right which may be outliers.

hist(percentPoverty, breaks=30)

Identifying Outliers - Visual Inspection

Boxplots are a convenient way to readily visualize the \(5\) number summary.

par(mfrow=c(1,2))
boxplot(percentPoverty, range=0)
boxplot(percentPoverty)

The default boxplot in R (right side) automatically implements the \(1.5\times IQR\) rule to detect outliers.

Identifying Outliers - \(1.5 \times IQR\) Rule

Define the upper and lower fence by \[UF = Q_3 + 1.5(IQR)\] \[LF = Q_1 - 1.5(IQR)\] Now any observations below the lower fence or above the upper fence are flagged as potential outliers. We can accomplish this in R with the following code.

UF <- quantile(percentPoverty, probs=0.75) + 1.5*IQR(percentPoverty)
LF <- quantile(percentPoverty, probs=0.25) - 1.5*IQR(percentPoverty)
outliers <- (percentPoverty > UF) | (percentPoverty < LF) # | is the or operator
percentPoverty[outliers]
##  [1] 19.5 22.4 27.3 20.6 36.3 33.7 19.6 20.7 33.1 20.8 20.7

Identifying Outliers - With \(Z\)-Scores

If we believe that the data is normally distributed, then we can try to identify outliers based on \(Z\)-scores. If we standardize each observation, \[z_i = \frac{x_i - \bar{x}}{s}\] then we know from the normal distribution that \(|z_i|\) greater than \(2\) should only occur about \(5\%\) of the time. In general, we can choose a critical value \(z_\star\) such that \(P(|Z_i| > z_\star) = \alpha\), by choosing \(z_\star = \Phi^{-1}(1-\alpha/2)\).

#Identify outliers
alpha <- 0.05
z_star <- qnorm(1-alpha/2)
z <- (percentPoverty - mean(percentPoverty))/sd(percentPoverty)
outliers <- which(abs(z) > z_star)
percentPoverty[outliers]
##  [1] 19.5 22.4 27.3 20.6 36.3 18.0 33.7 19.1 19.6 20.7 33.1 18.6 18.7 20.8
## [15] 20.7

Identifying Outliers - Bonferroni’s Correction \(\star\)

In the previous approach, we flag an outlier if it is so extreme that it would have occurred less than \(5\%\) of the time. This means that in a sample of \(100\) observations, we expect \(5\) of these points to be flagged as outliers… with this logic, are these points really outliers?

A better approach is to let the critical value \(z_\star\) grow with the sample size. Without going into detail, we can use Bonferroni’s correction.

If we choose \(z_\star = \Phi^{-1}\left(1 - \frac{\alpha}{2n} \right)\), then we can guarantee \[P(\text{any of the $|Z_i|$ are greater than $z_\star$}) \leq \alpha\]

alpha <- 0.05
n <- length(percentPoverty)
z_bonf <- qnorm(1 - alpha/(2*n))
outliers <- which(abs(z) > z_bonf) #two-sided: compare |z| to the critical value
percentPoverty[outliers]
## [1] 27.3 36.3 33.7 33.1

Identifying Outliers - With \(Z\)-scores

#Make plot
hist(z, breaks=30, xlim=range(z, -z_bonf, z_bonf))
abline(v=c(-z_star, z_star), col='red', lty=3)
abline(v=c(-z_bonf, z_bonf), col='blue', lty=3)

QQ-Plots

It is common in statistics to assume that data is Normally distributed. Plotting a Histogram is a good way to determine if this assumption is reasonable. QQ-Plots (or probability plots) provide us with another tool.

Simply put, we plot the sorted data \(x_{(i)}\) on the \(y\)-axis, and we plot the theoretical quantiles \(E(X_{(i)})\) on the \(x\)-axis. If these points form an approximately straight line, then the distributional assumption is reasonable.
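We can build such a plot by hand (a minimal sketch; ppoints gives the standard approximation to the expected quantiles of a standard normal), although in practice we just use the built-in qqnorm:

#Manual construction of a normal QQ-plot
n <- length(percentPoverty)
theoretical <- qnorm(ppoints(n))   #approximate E(X_(i)) under normality
plot(theoretical, sort(percentPoverty),
     xlab='Theoretical Quantiles', ylab='Sample Quantiles')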

qqnorm(percentPoverty)
qqline(percentPoverty)

Since the QQ-plot shows a distinct curve, we have further evidence that this data is not quite Normally distributed. In general, curvature in a QQ-plot indicates skewness; points bending above the line in the upper tail, as seen here, suggest right-skewed data.

Transformations

Sometimes, we can transform our data \(y_i = g(x_i)\) and achieve (approximate) normality. One useful family of transformations is known as the Box-Cox Power Transformation.

\[y_i = \begin{cases} \frac{x_i^\lambda - 1}{\lambda}, & \lambda \neq 0 \\ \ln(x_i), & \lambda = 0 \end{cases}\]
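A minimal helper implementing this family (the name boxcox_transform is ours, for illustration only):

#Apply the Box-Cox transformation for a given lambda
boxcox_transform <- function(x, lambda){
  if(lambda == 0) log(x) else (x^lambda - 1)/lambda
}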

A Box-Cox Plot tries out many of these transformations, and gives a curve indicating which values of \(\lambda\) are the most likely. To do this in R, we need to install the “MASS” package. Do this by typing install.packages('MASS') into the R console.

library('MASS')
boxcox(lm(percentPoverty~1))

Transformations

Since the band in the Box-Cox plot (a confidence interval for \(\lambda\)) is very close to \(0\), we might try a log-transformation of the data.

y <- log(percentPoverty)
par(mfrow=c(1,2))
hist(y, breaks=30, main='Histogram of Transformed Data')
qqnorm(y)
qqline(y)

The normality assumption seems much better for the transformed data.

Question: If a log-transformation makes our data normal, what does that say about the distribution of percent below poverty level?

Relationships Between Variables

Assume we have paired observations of two variables \((x_i, y_i)\). For instance, \(x_i\) and \(y_i\) could represent the height and weight of a particular person. Let \(\bar x\), \(\bar y\), \(s_x\) and \(s_y\) denote the means and standard deviations of the two variables. We define the sample correlation as \[r = \frac{1}{n-1}\frac{\sum_{i=1}^n(x_i-\bar x)(y_i -\bar y)}{s_x \ s_y}\]
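Computing \(r\) directly from the definition (a quick sketch to check that it matches the built-in cor):

#Compute the sample correlation from the definition
n <- length(percentPoverty)
sum((percentPoverty - mean(percentPoverty)) *
    (percentDiploma - mean(percentDiploma))) /
  ((n-1) * sd(percentPoverty) * sd(percentDiploma))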

cor(percentPoverty, percentDiploma)
## [1] -0.6917505
#Order doesn't matter
cor(percentDiploma, percentPoverty)
## [1] -0.6917505

Scatterplots

Scatter plots are great tools for quickly visualizing the relationship between two variables. Each pair \((x_i, y_i)\) simply becomes a point on the plot.

plot(percentDiploma, percentPoverty, pch=16, main='Scatterplot of Education vs Poverty')

Linear Regression

Regression is a big topic in Statistics and Machine Learning. This is outside the scope of this class, but it is worth noting that R makes it very easy to add a linear regression line to a scatterplot. It should be apparent from the plot that a linear relationship may not be adequate for describing the relationship between these two variables.

plot(percentDiploma, percentPoverty, pch=16, main='Scatterplot of Education vs Poverty')
fit <- lm(percentPoverty~percentDiploma) #regress the y-axis variable on the x-axis variable so abline draws correctly
abline(fit, lwd=2, lty=3, col='orange')