Today we will learn

  1. How to Construct Confidence Intervals

    1. Analytically
    2. Using Simulation
    3. Using Bootstrapping
  2. What is the idea behind confidence intervals

  3. How to Estimate Difference in Means

In other words, our goals are to:

  • Really understand what “statistical inference” means
  • Review how we can assess the uncertainty of the inferences we make
  • Learn about simulations, bootstrapping and classical hypothesis testing (and it’s only the third session, we are making good progress!)

With today’s repository you downloaded the .Rmd script and the two datasets support.RData and polity.dta. They are in the folder raw-data. You can see them in the Files window.

1 Constructing Confidence Intervals (CIs)

1.1 Theory Refresher

  • With point estimates, we get the estimate that is correct on average.
  • An even more useful quantity can be a range of plausible values for an estimate.
  • Confidence interval is a way to construct an interval that will contain the true value in some fixed proportion of repeated samples.

1.2 Analytical Apporach

In the file support.RData you find the policy support measurements from the lecture. We want to calculate the 95% confidence interval for the mean.

First, we need to load the data. This time it’s easier because the data is in the .RData format, one of the native formats for R. Hence, no extra packages are required:

load(file = here("raw-data/support.RData"))

# Let's have a look at the data first.

summary(support)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   13.00   14.00   14.36   15.75   17.00

# summary() shows you the key measures of central tendency and variability.

Base R

# We should also have a look at the distribution, quick but ugly...
hist(support,
  main = "Histogram of Support for a New Policy",
  xlab = "Support for a new policy",
  col = viridis(1),
  border = F,
  las = 1
)

ggplot2

ggplot() +
  geom_histogram(aes(x = support),
    boundary = 14,
    binwidth = 1,
    color = "white",
    fill = viridis(1)
  ) +
  labs(
    title = "Histogram of Support for a New Policy",
    x = "Support Value",
    y = "Frequency"
  ) +
  theme_minimal() +
  scale_x_continuous(breaks = c(seq(11, 17, by = 1)))

We can easily calculate the mean support for the new policy in our sample:

mean(support)
## [1] 14.36

Since we only have a sample (i.e., the survey was only conducted once), we cannot confidently infer that the mean support in the population (unknown to us, \(\theta\)) is the same as in our sample (what we observed, \(\hat{\theta}\)). That’s why we may want to calculate the confidence intervals.

Let’s calculate the analytical 95% CI. We start by calculating the standard error of the estimate of the population mean:

\(SE(\hat\theta) = \dfrac{\hat\sigma}{\sqrt{N}}\)

\(\hat\theta\) = Estimate of population mean

\(\hat\sigma\) = Standard deviation of a sample

\(N\) = Number of observations

support_se <- sd(support) / sqrt(length(support))
support_se
## [1] 0.2281425

How will the estimate \(SE(\hat\theta)\) change if there were more data points (but standard deviation of the sample were the same)?

Now we can calculate the analytical 95% CI: \(\hat\theta \pm 1.96 \times SE(\hat\theta)\)

ci_lo <- mean(support) + qnorm(0.025, 0, 1) * support_se # why use qnorm?
ci_up <- mean(support) + qnorm(0.975, 0, 1) * support_se
ci_an <- c(ci_lo, ci_up)

ci_an
## [1] 13.91285 14.80715

What should we change to calculate the 99% confidence intervals? Will this 99% interval be larger or smaller than 95% CI?

Tip: you can comment and un-comment multiple lines with Ctrl + Shift + C /Cmd + Shift + C.

# modify the following line to get 99% CIs
# mean(support) + qnorm(c(0.025, 0.975), 0, 1) * support_se 

1.2.1 Understanding Confidence Intervals

In the lecture, you have seen a simulation for the concept of confidence intervals. Let’s try to make one ourselves to better grasp the idea behind the interpretation of confidence intervals.

First, we’ll need to decide on the true population parameters. We will never know true population parameters in real life, but for the sake of demonstration, let’s do this.

# create a population
true_mean <- 500
pop <- rnorm(10000, true_mean, 100)

Now we need to do the sampling part, like we would be doing in real life. We will take a subset of the population and estimate statistics based on that sample. But unlike in real life, we will take samples many times, say 100.

n_iter <- 100 # we will take 100 samples (but only one in real life)
estimates <- NULL # empty object for estimates
confis <- matrix(rep(NA, n_iter * 2), # empty matrix for CIs
  ncol = 2, # for lower and upper CIs
  nrow = n_iter
)

for (i in seq_len(n_iter)) { # for the length of n_iter do:
  pop_sample <- sample(x = pop, size = 100) # take a sample of 100
  estimates[i] <- mean(pop_sample) # store its mean
  se <- sd(pop_sample) / sqrt(length(pop_sample)) # store SE for the sample
  confis[i, ] <- mean(pop_sample) + qnorm(c(0.025, 0.975)) * se # store 95% CIs
}

head(estimates)
## [1] 512.5565 510.3536 498.9780 506.6933 528.1269 513.4134
head(confis)
##          [,1]     [,2]
## [1,] 494.3300 530.7830
## [2,] 489.2261 531.4812
## [3,] 478.5214 519.4347
## [4,] 486.4937 526.8928
## [5,] 507.9621 548.2918
## [6,] 494.2595 532.5673

Base R

plot(
  confis,
  xlim = c(450, 550),
  ylim = c(1, 100),
  type = "n",
  xlab = "",
  ylab = "",
  yaxt = "n",
  bty = "n",
  main = "Confidence Interval for Mean"
)

# Let's add our true parameter
abline(v = true_mean, col = "red") 

for (i in seq_len(n_iter)) {
  segments(confis[i, 1], i, confis[i, 2], i, col = "azure4")
  points(estimates[i],
    i,
    pch = 19,
    cex = 0.4,
    col = "azure4"
  )
  # if lower CI is larger than 10 or upper CI smaller than 10
  if (confis[i, 1] > true_mean | confis[i, 2] < true_mean) {
    # paint red horizontal lines (since y0 == y1)
    segments(
      x0 = confis[i, 1],
      y0 = i,
      x1 = confis[i, 2],
      y1 = i,
      col = "red"
    )
    # add red points
    points(estimates[i], 
      i,
      pch = 19,
      cex = 0.4,
      col = "red"
    )
  }
}

ggplot2


# create a dataset from plotting from existing objects
CIs_data <- data.frame(
  mean = estimates,
  confis,
  n = 1:n_iter
)

# add a column for color
# if lower CI > true_mean or upper CI < true_mean, CI missed the true value and
# is coded as "out"
# ifelse(condition, value-if-true, value-if-false)
CIs_data$missed <- ifelse(CIs_data$X1 > true_mean | CIs_data$X2 < true_mean, "Out", "In")

ggplot(data = CIs_data) +
  geom_pointrange(
    aes(
      x = mean, # point value
      xmin = X1, # lower CI
      xmax = X2, # upper CI
      y = n, # y axis - just observation number
      color = missed
    ) # color varies by missed variable
  ) +
  geom_vline(
    aes(xintercept = true_mean), # add vertical line at true_mean
  ) +
  scale_color_manual(values = c("azure4", "red")) + # set our preferred colors
  theme_minimal() + # some theme to change the appearance
  labs(
    title = "Confidence Interval for Mean",
    subtitle = "Population mean equals 500",
    x = "Mean",
    y = "Sample",
    color = "Is true population parameter inside the 95% CI?"
  ) +
  theme(legend.position = "top") + # switch the legend to the top
  scale_x_continuous(breaks = c(seq(450, 550, by = 10))) # change values on x-axis

1.3 Simulation Approach

Now we turn to the simulation approach. Here we simulate the sampling distribution of the statistic we are interested in and take the empirical quantiles from this simulated distribution.

The central limit theorem tells us that the sampling distribution is normally distributed. We assume that the sample mean is the population mean. The standard error is the standard deviation of this distribution.

First, let’s set the number of simulations to 1000.

nsim <- 1000

Now let’s take random draws from a distribution with assumed parameters.

simsupport <- rnorm(n = nsim, 
                    mean = mean(support), 
                    sd = support_se)

And now we are plotting the resulting distribution:

hist(
  x = simsupport,
  main = "Sampling Distribution of the Sample Mean (via Simulation)",
  xlab = "Support for Policy",
  las = 1,
  col = viridis(2)[2],
  border = F, # remove the border
  breaks = 20,
  freq = FALSE,
  yaxt = "n",
  ylab = ""
)

# Now we will add some extras to the plot:
# Lines for density.

lines(density(simsupport),
  col = viridis(2)[1],
  lwd = 2
)

Now we can calculate empirical quantiles from the distribution:

ci_sim <- quantile(x = simsupport,
                   probs =  c(0.025, 0.975))

ci_sim
##     2.5%    97.5% 
## 13.89720 14.79784

And let’s see if there is a numerical difference between two approaches:

ci_sim - ci_an # should be close to zero!
##         2.5%        97.5% 
## -0.015649749 -0.009315387

Even better, we can even add the quantiles to the plot.

Base R

# This is the plot from above.
hist(simsupport,
  las = 1,
  breaks = 20,
  freq = FALSE,
  main = "Sampling Distribution of the Sample Mean\n(via Simulation)",
  xlab = "Support for Policy",
  col = viridis(4)[1],
  border = F,
  yaxt = "n",
  ylab = ""
)


# Now we will add some extras to the plot:
# Lines for density.

lines(density(simsupport),
  col = viridis(4)[2],
  lwd = 2
)

# Let's add the analytical CI
abline(
  v = ci_an,
  col = viridis(4, alpha = 0.75)[3],
  lwd = 2,
  lty = 2
)

# How does the simulated CI compare to the analytical one?
# Let's have a look at it in the same plot!
abline(
  v = ci_sim,
  col = viridis(4, alpha = 0.75)[4],
  lwd = 2,
  lty = 2
)

# And let's add a legend
legend("topright",
  bty = "n",
  col = viridis(4)[3:4],
  lwd = 2,
  lty = 2,
  legend = c(
    "Analytical 95% CI",
    "Simulated 95% CI"
  )
)

ggplot2

# create a dataset
simsupport_df <- data.frame(simsupport)

ggplot(
  simsupport_df, # use our dataset
  aes(x = simsupport)
) + # put simsupport on x-axis
  labs(
    title = "Sampling Distribution of the Sample Mean (via Simulation)",
    x = "Support for Policy",
    y = ""
  ) +
  geom_histogram(aes(y = ..density..), # add histogram
    boundary = 14,
    binwidth = 0.1,
    fill = viridis(4)[1], # change color
    color = "white"
  ) + # make white bin borders
  geom_density(color = viridis(4)[2]) + # add density
  geom_vline( # add simulated CIs
    xintercept = ci_sim,
    color = viridis(4, alpha = 0.75)[4],
    linetype = "dashed"
  ) +
  geom_vline( # add analytical CIs
    xintercept = ci_an,
    color = viridis(4, alpha = 0.75)[3],
    linetype = "dashed"
  ) +
  theme_minimal() + # change appearance
  annotate("text", # add CIs label
    x = 15, # where to place label on x axis
    y = 1.7, # where to place label on y axis
    label = "Analytical 95% CI",
    color = viridis(4)[3]
  ) +
  annotate("text",
    x = 15,
    y = 1.6,
    label = "Simulated 95% CI",
    color = viridis(4)[4]
  ) +
  theme(
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank()
  )

1.4 Bootstrapping Approach

If we treat our sample as population, how do we achieve sampling variability with bootstrapping?

To remind you, here is what we need to do when using bootstrapping to generate confidence intervals:

  1. Take a bootstrap sample - a random sample taken with replacement from the original sample, of the same size as the original sample
  2. Calculate the bootstrap statistic - a statistic such as mean, median, proportion, etc. computed on the bootstrap samples
  3. Repeat steps (1) and (2) many times to create a bootstrap distribution - a distribution of bootstrap statistics
  4. Calculate the bounds of the XX% confidence interval as the middle XX% of the bootstrap distribution

Let’s see how we do this in R.

First, we need an empty vector for samples. This is good practice when using a loop, to make the process faster.

bootmeans <- rep(NA, 1000) # repeat "NA" 1000 times, store in object "bootmeans" 

Let’s loop over the mean from random samples with replacement from the support dataset.

Does anyone remember what a for loop does?

for (i in seq_len(1000)) {
  bootmeans[i] <- mean(sample(
    x = support,
    size = length(support),
    replace = TRUE # why?
  )) 
}

Let’s have a look at the results:

Base R

hist(bootmeans,
  las = 1,
  breaks = 20,
  freq = FALSE,
  main = "Sampling Distribution of the Sample Mean (via Bootstrap)",
  xlab = "Support for Policy",
  col = viridis(4)[1],
  border = F,
  yaxt = "n",
  ylab = ""
)

# And we add the density curve again.

lines(density(bootmeans),
  col = viridis(4)[2],
  lwd = 2
)

# calculate empirical quantiles from your distribution

ci_boot <- quantile(bootmeans, p = c(0.025, 0.975))

# As above we can plot the CI.
abline(
  v = ci_boot,
  col = viridis(4, alpha = 0.75)[4],
  lwd = 2,
  lty = 2
)

# And we compare it to the analytical solution.
abline(
  v = ci_an,
  col = viridis(4, alpha = 0.75)[3],
  lwd = 2,
  lty = 2
)

# And the legend
legend("topright",
  bty = "n",
  col = viridis(4)[3:4],
  lwd = 2,
  lty = 2,
  legend = c(
    "Analytical CI",
    "Bootstrapped CI"
  )
)

ggplot2

bootmeans_df <- data.frame(bootmeans)
ci_boot <- quantile(bootmeans, p = c(0.025, 0.975))

ggplot(
  bootmeans_df,
  aes(x = bootmeans)
) +
  labs(
    title = "Sampling Distribution of the Sample Mean (via Bootstrapping)",
    x = "Support for Policy"
  ) +
  geom_histogram(aes(y = ..density..),
    fill = viridis(4)[1],
    color = "white"
  ) +
  geom_density(color = viridis(4)[2]) +
  theme_minimal() +
  geom_vline(
    xintercept = ci_an,
    color = viridis(4, alpha = 0.75)[3],
    linetype = "dashed"
  ) +
  annotate("text",
    x = 15,
    y = 1.7,
    label = "Analytical 95% CI",
    color = viridis(4)[3]
  ) +
  geom_vline(
    xintercept = ci_boot,
    color = viridis(4, alpha = 0.75)[4],
    linetype = "dashed"
  ) +
  annotate("text",
    x = 15,
    y = 1.6,
    label = "Bootstrap 95% CI",
    color = viridis(4)[4]
  ) +
  theme(
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank()
  )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can also see that the difference between analytical and bootstrap CIs is rather small:

ci_boot - ci_an
##       2.5%      97.5% 
## 0.02715099 0.01334901

2 Beyond Basic Quantities of Interest

What’s the point about the simulation and bootstrap approach if we are able to derive CIs analytically? The next example shows that for some quantities of interest, CIs are not easy to derive analytically, but the simulation and bootstrap approaches work just as before.

2.1 Example: Left-Voting Ratios

In a sample of size 1000 (500 men and 500 women), 55% of men and 65% of women vote left. Calculate the confidence interval for the “left-voting” ratio of men to women using the simulation approach. This is an example from the lecture slides 20-21.

With simulation approach, we need some values to set up the distribution, so mean and standard error for the Normal distribution, and then we will take draws from a distribution with such parameters.

The ratio is \(\dfrac{0.55}{0.65}\).

The formula for the standard error for proportions:

\(SE(\hat p) = \sqrt{\dfrac{\hat p (1-\hat p)}{n}}\)

Let’s translate it all into R:

p_hat_men <- 0.55
p_hat_women <- 0.65

n_men <- 500
n_women <- 500

# Calculate the standard errors: sqrt(p * (1 - p) / n)

se_men <- sqrt(p_hat_men * (1 - p_hat_men) / n_men)
se_women <- sqrt(p_hat_women * (1 - p_hat_women) / n_women)


# Draw a sample
nsim <- 1000

p_men <- rnorm(nsim, p_hat_men, se_men)
p_women <- rnorm(nsim, p_hat_women, se_women)

# Calculate the ratios from the simulated vectors

ratio <- p_men / p_women

# Finally, provide CI via simulation
ci_sim <- quantile(ratio, c(0.025, 0.975))

ci_sim
##      2.5%     97.5% 
## 0.7617279 0.9359397

2.2 Exercise: Difference in Means

Our previous quantity of interest was the “left-voting” ratio of men to women. Now we are interested in calculating a difference in means. Let’s work with another substantive example here.

The file polity.dta contains information on Polity scores, a measure of democracy in the country, which varies from 0 to 10, with 0 being the least democratic. We want to test if the difference in polity scores between Eastern Europe and Western Europe and North America is “statistically significant”.

Let’s load the dataset:

data <- read.dta("raw-data/polity.dta")  # Why doesn't load("polity.dta") work? 

First, we always investigate the dataset:

head(data)
##   ht_region ccode      cname fh_polity2         region
## 1         1   100   Bulgaria   9.333334 Eastern Europe
## 2         1   268    Georgia   6.250000 Eastern Europe
## 3         1   804    Ukraine   6.500000 Eastern Europe
## 4         1   112    Belarus   1.583333 Eastern Europe
## 5         1   705   Slovenia  10.000000 Eastern Europe
## 6         1    31 Azerbaijan   2.000000 Eastern Europe
glimpse(data) # function from dplyr package
## Rows: 159
## Columns: 5
## $ ht_region  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ ccode      <fct> 100, 268, 804, 112, 705, 31, 440, 616, 233, 642, 643, 795, ~
## $ cname      <fct> "Bulgaria", "Georgia", "Ukraine", "Belarus", "Slovenia", "A~
## $ fh_polity2 <dbl> 9.3333340, 6.2500000, 6.5000000, 1.5833333, 10.0000000, 2.0~
## $ region     <fct> Eastern Europe, Eastern Europe, Eastern Europe, Eastern Eur~
table(data$region)
## 
##          East Asia (including Japan & Mongolia) 
##                                               6 
##                                  Eastern Europe 
##                                              26 
##                                   Latin America 
##                                              20 
##                  North Africa & the Middle East 
##                                              20 
##                                      South Asia 
##                                               7 
##                                 South-East Asia 
##                                               9 
##                              Sub-Saharan Africa 
##                                              45 
##                                   The Caribbean 
##                                               3 
## The Pacific (excluding Australia & New Zeeland) 
##                                               3 
##                Western Europe and North America 
##                                              20

We start by extracting Polity scores of Eastern Europe and Western Europe and North America from the data set. Remember selecting/subsetting?

west <- data$fh_polity2[data$region == "Western Europe and North America"]
east <- data$fh_polity2[data$region == "Eastern Europe"]

2.2.1 Difference in Means via Simulation

In order to find out whether the difference in mean Polity scores between Eastern Europe and *Western Europe and North America“* is”statistically significant", we need to derive confidence intervals of this difference.

2.2.1.1 Calculate Standard Errors

To start, calculate the standard errors of the mean estimates of polity scores for Eastern Europe and Western Europe and North America.

Formula for standard errors: \(SE(\hat\theta) = \dfrac{\hat\sigma}{\sqrt N}\)

\(\hat\theta\) = Estimate of population mean

\(\hat\sigma\) = Standard deviation of a sample

\(N\) = Number of observations

# se_east <- 
# se_west <- 

2.2.1.2 Simulate Confidence Intervals for the Means

We now have standard deviations for the mean polity scores of both regions. This equips us to simulate confidence intervals for these estimates.

# # 1. Simulate sampling distributions 
# 
# nsim <- 
# d_west <- 
# d_east <- 
# 
# # 2. Calculate quantiles to get CI
# 
# ci_west <- 
# ci_east <- 

2.2.1.3 Simulate Confidence Intervals for the Difference in Means

Now we can check the difference: Subtract the simulated sampling distribution of the mean of western countries from the simulated sampling distribution of eastern countries and calculate the quantiles to get the CI of the difference in means. You can also plot the sampling distribution of the difference in means and calculate its mean.

# # 1. Subtract the simulated sampling distributions of the mean (west-east)
# 
# diff <- 
# 
# # 2. Calculate quantiles to get CI
# 
# ci_sim <- 
# 
# # 3. plot and mean
# 
# hist() 
# mean()

2.2.2 Appendix 1: This is how we get this analytically

We do it by hand, using the normal approximation.

n1 <- length(west)
n2 <- length(east)

ci_up <- (mean(west) - mean(east)) + 1.96 * sqrt(var(west) / n1 + var(east) / n2)
ci_low <- (mean(west) - mean(east)) - 1.96 * sqrt(var(west) / n1 + var(east) / n2)

ci_ahand <- c(ci_low, ci_up)

Of course there is also an easy way: in this case the t-test function.

t.test(west, east)
## 
##  Welch Two Sample t-test
## 
## data:  west and east
## t = 5.4398, df = 25.073, p-value = 1.189e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.100570 4.659686
## sample estimates:
## mean of x mean of y 
##  9.966667  6.586539

ci_t <- t.test(west, east)[[4]][1:2]

2.2.3 Appendix 2: This is how we get this via bootstrap

We again need an empty vector to save the results.

diff_boot <- rep(NA, 1000)

for (i in seq_len(1000)) {
  a <- mean(sample(west, length(west), replace = T))
  b <- mean(sample(east, length(east), replace = T))

  diff_boot[i] <- a - b
}

mean(diff_boot)
## [1] 3.389298
ci_boot <- quantile(diff_boot, c(0.025, 0.975))

# Finally, we present the results in a table.
res <- matrix(NA, 4, 2)

res[1, ] <- ci_sim
res[2, ] <- ci_ahand
res[3, ] <- ci_t
res[4, ] <- ci_boot

rownames(res) <- c("Simulation", "By hand", "t-test", "Bootstrapping")
colnames(res) <- c("Lower 95% bound", "Upper 95% bound")
res
##               Lower 95% bound Upper 95% bound
## Simulation          0.7617279       0.9359397
## By hand             2.1622324       4.5980239
## t-test              2.1005704       4.6596858
## Bootstrapping       2.2514022       4.6369711

Pro-tip: Use knitr::kable() to get nicely formatted tables from R objects to Markdown

knitr::kable(res,
  digits = 3
)
Lower 95% bound Upper 95% bound
Simulation 0.762 0.936
By hand 2.162 4.598
t-test 2.101 4.660
Bootstrapping 2.251 4.637

Concluding remarks

You learned about the different ways to calculate confidence intervals in R, with simulation and bootstrapping as well as analytically. You may still wonder which confidence level is the most appropriate to use. There is always a trade-off between precision and accuracy of the statements when it comes to reporting the quantities of interest. If we want to be very certain that we capture the population parameter, we may be tempted to use a wider confidence interval. This, however, comes with certain drawbacks, such as the interval becoming rather uninformative:

The best solution to get the best of both worlds - high accuracy as well as high precision - would be to increase the sample size, which is often easier said than done.


In your homework you will:

  • Calculate some CIs using different approaches.
  • See how smart students in Mannheim are.
  • And check whether there was foul play in an election.