How to Construct Confidence Intervals
What is the idea behind confidence intervals
How to Estimate Difference in Means
In other words, our goals are to:
With today’s repository you downloaded the .Rmd script and two datasets, support.RData and polity.dta. They are in the folder raw-data; you can see them in the Files window.
In the file support.RData you will find the policy support measurements from the lecture. We want to calculate the 95% confidence interval for the mean.
First, we need to load the data. This time it’s easier because the data is in the .RData format, one of the native formats for R. Hence, no extra packages are required for loading:
# Packages used below (assumed to be installed): here for file paths,
# viridis for colors, ggplot2 and dplyr for plotting and data handling,
# foreign for read.dta()
library(here)
library(viridis)
library(ggplot2)
library(dplyr)
library(foreign)

load(file = here("raw-data/support.RData"))
# Let's have a look at the data first.
summary(support)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 13.00 14.00 14.36 15.75 17.00
# summary() shows you the key measures of central tendency and variability.
# We should also have a look at the distribution, quick but ugly...
hist(support,
main = "Histogram of Support for a New Policy",
xlab = "Support for a new policy",
col = viridis(1),
border = F,
las = 1
)
ggplot() +
  geom_histogram(aes(x = support),
    boundary = 14,
    binwidth = 1,
    color = "white",
    fill = viridis(1)
  ) +
  labs(
    title = "Histogram of Support for a New Policy",
    x = "Support Value",
    y = "Frequency"
  ) +
  theme_minimal() +
  scale_x_continuous(breaks = seq(11, 17, by = 1))
We can easily calculate the mean support for the new policy in our sample:
mean(support)
## [1] 14.36
Since we only have a sample (i.e., the survey was only conducted once), we cannot confidently infer that the mean support in the population (unknown to us, \(\theta\)) is the same as in our sample (what we observed, \(\hat{\theta}\)). That’s why we may want to calculate the confidence intervals.
Let’s calculate the analytical 95% CI. We start by calculating the standard error of the estimate of the population mean:
\(SE(\hat\theta) = \dfrac{\hat\sigma}{\sqrt{N}}\)
\(\hat\theta\) = Estimate of population mean
\(\hat\sigma\) = Standard deviation of a sample
\(N\) = Number of observations
support_se <- sd(support) / sqrt(length(support))
support_se
## [1] 0.2281425
How will the estimate \(SE(\hat\theta)\) change if there were more data points (but standard deviation of the sample were the same)?
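We can check this with a quick sketch (the numbers below are hypothetical, not the support data): holding the sample standard deviation fixed, the SE shrinks proportionally to \(1/\sqrt{N}\).

```r
# Illustration with a hypothetical sample SD held fixed while N grows
sigma_hat <- 2 # hypothetical sample standard deviation
se_50 <- sigma_hat / sqrt(50) # SE with 50 observations
se_200 <- sigma_hat / sqrt(200) # SE with four times as many observations
c(se_50, se_200) # quadrupling N halves the SE
```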
Now we can calculate the analytical 95% CI: \(\hat\theta \pm 1.96 \times SE(\hat\theta)\)
ci_lo <- mean(support) + qnorm(0.025, 0, 1) * support_se # why use qnorm?
ci_up <- mean(support) + qnorm(0.975, 0, 1) * support_se

ci_an <- c(ci_lo, ci_up)
ci_an
## [1] 13.91285 14.80715
What should we change to calculate the 99% confidence intervals? Will this 99% interval be larger or smaller than 95% CI?
Tip: you can comment and un-comment multiple lines with Ctrl + Shift + C / Cmd + Shift + C.
# modify the following line to get 99% CIs
# mean(support) + qnorm(c(0.025, 0.975), 0, 1) * support_se
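One way to see the answer (a sketch): the only change is the pair of quantiles, 0.5% and 99.5% instead of 2.5% and 97.5%. Because the 99.5% quantile of the standard normal is larger than the 97.5% quantile, the 99% interval is wider.

```r
# For a 99% CI we would use:
# mean(support) + qnorm(c(0.005, 0.995), 0, 1) * support_se
# The multiplier grows, so the 99% interval is wider than the 95% one:
z_95 <- qnorm(0.975) # about 1.96
z_99 <- qnorm(0.995) # about 2.58
z_99 > z_95
```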
In the lecture, you have seen a simulation for the concept of confidence intervals. Let’s try to make one ourselves to better grasp the idea behind the interpretation of confidence intervals.
First, we’ll need to decide on the true population parameters. We will never know true population parameters in real life, but for the sake of demonstration, let’s do this.
# create a population
true_mean <- 500
pop <- rnorm(10000, true_mean, 100)
Now we need to do the sampling part, like we would be doing in real life. We will take a subset of the population and estimate statistics based on that sample. But unlike in real life, we will take samples many times, say 100.
n_iter <- 100 # we will take 100 samples (but only one in real life)
estimates <- NULL # empty object for estimates
confis <- matrix(rep(NA, n_iter * 2), # empty matrix for CIs
  ncol = 2, # for lower and upper CIs
  nrow = n_iter
)

for (i in seq_len(n_iter)) { # for the length of n_iter do:
  pop_sample <- sample(x = pop, size = 100) # take a sample of 100
  estimates[i] <- mean(pop_sample) # store its mean
  se <- sd(pop_sample) / sqrt(length(pop_sample)) # store SE for the sample
  confis[i, ] <- mean(pop_sample) + qnorm(c(0.025, 0.975)) * se # store 95% CIs
}
head(estimates)
## [1] 512.5565 510.3536 498.9780 506.6933 528.1269 513.4134
head(confis)
## [,1] [,2]
## [1,] 494.3300 530.7830
## [2,] 489.2261 531.4812
## [3,] 478.5214 519.4347
## [4,] 486.4937 526.8928
## [5,] 507.9621 548.2918
## [6,] 494.2595 532.5673
plot(confis,
  xlim = c(450, 550),
  ylim = c(1, 100),
  type = "n",
  xlab = "",
  ylab = "",
  yaxt = "n",
  bty = "n",
  main = "Confidence Interval for Mean"
)
# Let's add our true parameter
abline(v = true_mean, col = "red")
for (i in seq_len(n_iter)) {
  segments(confis[i, 1], i, confis[i, 2], i, col = "azure4")
  points(estimates[i], i,
    pch = 19,
    cex = 0.4,
    col = "azure4"
  )
  # if the lower CI is above the true mean or the upper CI is below it,
  # the interval missed the true value
  if (confis[i, 1] > true_mean | confis[i, 2] < true_mean) {
    # paint red horizontal lines (since y0 == y1)
    segments(
      x0 = confis[i, 1],
      y0 = i,
      x1 = confis[i, 2],
      y1 = i,
      col = "red"
    )
    # add red points
    points(estimates[i], i,
      pch = 19,
      cex = 0.4,
      col = "red"
    )
  }
}
# create a dataset for plotting from existing objects
CIs_data <- data.frame(
  mean = estimates,
  confis,
  n = 1:n_iter
)

# add a column for color:
# if lower CI > true_mean or upper CI < true_mean, the CI missed the true value
# and is coded as "Out"
# ifelse(condition, value-if-true, value-if-false)
CIs_data$missed <- ifelse(CIs_data$X1 > true_mean | CIs_data$X2 < true_mean, "Out", "In")
ggplot(data = CIs_data) +
  geom_pointrange(
    aes(
      x = mean, # point value
      xmin = X1, # lower CI
      xmax = X2, # upper CI
      y = n, # y-axis - just the observation number
      color = missed # color varies by the missed variable
    )
  ) +
  geom_vline(
    aes(xintercept = true_mean) # add vertical line at true_mean
  ) +
  scale_color_manual(values = c("azure4", "red")) + # set our preferred colors
  theme_minimal() + # some theme to change the appearance
  labs(
    title = "Confidence Interval for Mean",
    subtitle = "Population mean equals 500",
    x = "Mean",
    y = "Sample",
    color = "Is true population parameter inside the 95% CI?"
  ) +
  theme(legend.position = "top") + # switch the legend to the top
  scale_x_continuous(breaks = seq(450, 550, by = 10)) # change values on x-axis
Now we turn to the simulation approach. Here we simulate the sampling distribution of the statistic we are interested in and take the empirical quantiles from this simulated distribution.
The central limit theorem tells us that the sampling distribution of the mean is approximately normal. We assume that the sample mean equals the population mean, and the standard error is the standard deviation of this sampling distribution.
First, let’s set the number of simulations to 1000.
nsim <- 1000
Now let’s take random draws from a distribution with assumed parameters.
simsupport <- rnorm(n = nsim,
  mean = mean(support),
  sd = support_se
)
And now we are plotting the resulting distribution:
hist(
x = simsupport,
main = "Sampling Distribution of the Sample Mean (via Simulation)",
xlab = "Support for Policy",
las = 1,
col = viridis(2)[2],
border = F, # remove the border
breaks = 20,
freq = FALSE,
yaxt = "n",
ylab = ""
)
# Now we will add some extras to the plot:
# Lines for density.
lines(density(simsupport),
col = viridis(2)[1],
lwd = 2
)
Now we can calculate empirical quantiles from the distribution:
ci_sim <- quantile(x = simsupport,
  probs = c(0.025, 0.975)
)
ci_sim
## 2.5% 97.5%
## 13.89720 14.79784
And let’s see if there is a numerical difference between the two approaches:
ci_sim - ci_an # should be close to zero!
## 2.5% 97.5%
## -0.015649749 -0.009315387
Even better, we can add the quantiles to the plot.
# This is the plot from above.
hist(simsupport,
las = 1,
breaks = 20,
freq = FALSE,
main = "Sampling Distribution of the Sample Mean\n(via Simulation)",
xlab = "Support for Policy",
col = viridis(4)[1],
border = F,
yaxt = "n",
ylab = ""
)
# Now we will add some extras to the plot:
# Lines for density.
lines(density(simsupport),
col = viridis(4)[2],
lwd = 2
)
# Let's add the analytical CI
abline(
v = ci_an,
col = viridis(4, alpha = 0.75)[3],
lwd = 2,
lty = 2
)
# How does the simulated CI compare to the analytical one?
# Let's have a look at it in the same plot!
abline(
v = ci_sim,
col = viridis(4, alpha = 0.75)[4],
lwd = 2,
lty = 2
)
# And let's add a legend
legend("topright",
  bty = "n",
  col = viridis(4)[3:4],
  lwd = 2,
  lty = 2,
  legend = c(
    "Analytical 95% CI",
    "Simulated 95% CI"
  )
)
# create a dataset
simsupport_df <- data.frame(simsupport)

ggplot(
  simsupport_df, # use our dataset
  aes(x = simsupport) # put simsupport on the x-axis
) +
  labs(
    title = "Sampling Distribution of the Sample Mean (via Simulation)",
    x = "Support for Policy",
    y = ""
  ) +
  geom_histogram(aes(y = ..density..), # add histogram
    boundary = 14,
    binwidth = 0.1,
    fill = viridis(4)[1], # change color
    color = "white" # make white bin borders
  ) +
  geom_density(color = viridis(4)[2]) + # add density
  geom_vline( # add simulated CIs
    xintercept = ci_sim,
    color = viridis(4, alpha = 0.75)[4],
    linetype = "dashed"
  ) +
  geom_vline( # add analytical CIs
    xintercept = ci_an,
    color = viridis(4, alpha = 0.75)[3],
    linetype = "dashed"
  ) +
  theme_minimal() + # change appearance
  annotate("text", # add CI labels
    x = 15, # where to place the label on the x-axis
    y = 1.7, # where to place the label on the y-axis
    label = "Analytical 95% CI",
    color = viridis(4)[3]
  ) +
  annotate("text",
    x = 15,
    y = 1.6,
    label = "Simulated 95% CI",
    color = viridis(4)[4]
  ) +
  theme(
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank()
  )
If we treat our sample as population, how do we achieve sampling variability with bootstrapping?
To remind you, here is what we need to do when using bootstrapping to generate confidence intervals: treat the sample as if it were the population, repeatedly draw samples of the same size from it with replacement, calculate the statistic of interest for each resample, and take the empirical quantiles of those statistics.
Let’s see how we do this in R.
First, we need an empty vector for the resampled means. Preallocating like this is good practice when using a loop; it makes the process faster.
bootmeans <- rep(NA, 1000) # repeat "NA" 1000 times, store in object "bootmeans"
Let’s loop over the mean from random samples with replacement from the support dataset.
Does anyone remember what a for loop does?
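As a quick refresher, here is a minimal toy example (unrelated to the data): a for loop runs its body once per value of the loop index.

```r
# A toy for loop: the body runs once for each value of i
squares <- rep(NA, 3) # preallocate, as above
for (i in seq_len(3)) {
  squares[i] <- i^2 # fill position i on each pass
}
squares # 1 4 9
```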
for (i in seq_len(1000)) {
  bootmeans[i] <- mean(sample(
    x = support,
    size = length(support),
    replace = TRUE # why?
  ))
}
Let’s have a look at the results:
hist(bootmeans,
las = 1,
breaks = 20,
freq = FALSE,
main = "Sampling Distribution of the Sample Mean (via Bootstrap)",
xlab = "Support for Policy",
col = viridis(4)[1],
border = F,
yaxt = "n",
ylab = ""
)
# And we add the density curve again.
lines(density(bootmeans),
col = viridis(4)[2],
lwd = 2
)
# calculate empirical quantiles from your distribution
ci_boot <- quantile(bootmeans, p = c(0.025, 0.975))
# As above we can plot the CI.
abline(
v = ci_boot,
col = viridis(4, alpha = 0.75)[4],
lwd = 2,
lty = 2
)
# And we compare it to the analytical solution.
abline(
v = ci_an,
col = viridis(4, alpha = 0.75)[3],
lwd = 2,
lty = 2
)
# And the legend
legend("topright",
  bty = "n",
  col = viridis(4)[3:4],
  lwd = 2,
  lty = 2,
  legend = c(
    "Analytical CI",
    "Bootstrapped CI"
  )
)
bootmeans_df <- data.frame(bootmeans)
ci_boot <- quantile(bootmeans, p = c(0.025, 0.975))

ggplot(
  bootmeans_df,
  aes(x = bootmeans)
) +
  labs(
    title = "Sampling Distribution of the Sample Mean (via Bootstrapping)",
    x = "Support for Policy"
  ) +
  geom_histogram(aes(y = ..density..),
    fill = viridis(4)[1],
    color = "white"
  ) +
  geom_density(color = viridis(4)[2]) +
  theme_minimal() +
  geom_vline(
    xintercept = ci_an,
    color = viridis(4, alpha = 0.75)[3],
    linetype = "dashed"
  ) +
  annotate("text",
    x = 15,
    y = 1.7,
    label = "Analytical 95% CI",
    color = viridis(4)[3]
  ) +
  geom_vline(
    xintercept = ci_boot,
    color = viridis(4, alpha = 0.75)[4],
    linetype = "dashed"
  ) +
  annotate("text",
    x = 15,
    y = 1.6,
    label = "Bootstrap 95% CI",
    color = viridis(4)[4]
  ) +
  theme(
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank()
  )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can also see that the difference between analytical and bootstrap CIs is rather small:
ci_boot - ci_an
## 2.5% 97.5%
## 0.02715099 0.01334901
What’s the point of the simulation and bootstrap approaches if we can derive CIs analytically? The next example shows that for some quantities of interest, CIs are not easy to derive analytically, while the simulation and bootstrap approaches work just as before.
In a sample of size 1000 (500 men and 500 women), 55% of men and 65% of women vote left. Calculate the confidence interval for the “left-voting” ratio of men to women using the simulation approach. This is an example from the lecture slides 20-21.
With the simulation approach, we need some values to set up the distribution, namely a mean and a standard error for the Normal distribution, and then we take draws from a distribution with those parameters.
The ratio is \(\dfrac{0.55}{0.65}\).
The formula for the standard error for proportions:
\(SE(\hat p) = \sqrt{\dfrac{\hat p (1-\hat p)}{n}}\)
Let’s translate it all into R:
p_hat_men <- 0.55
p_hat_women <- 0.65

n_men <- 500
n_women <- 500

# Calculate the standard errors: sqrt(p * (1 - p) / n)
se_men <- sqrt(p_hat_men * (1 - p_hat_men) / n_men)
se_women <- sqrt(p_hat_women * (1 - p_hat_women) / n_women)

# Draw a sample
nsim <- 1000

p_men <- rnorm(nsim, p_hat_men, se_men)
p_women <- rnorm(nsim, p_hat_women, se_women)

# Calculate the ratios from the simulated vectors
ratio <- p_men / p_women

# Finally, get the CI via simulation
ci_sim <- quantile(ratio, c(0.025, 0.975))
ci_sim
## 2.5% 97.5%
## 0.7617279 0.9359397
Our previous quantity of interest was the “left-voting” ratio of men to women. Now we are interested in calculating a difference in means. Let’s work with another substantive example here.
The file polity.dta
contains information on Polity scores, a measure of democracy in the country, which varies from 0 to 10, with 0 being the least democratic. We want to test if the difference in polity scores between Eastern Europe and Western Europe and North America is “statistically significant”.
Let’s load the dataset:
data <- read.dta("raw-data/polity.dta") # from the foreign package. Why doesn't load("polity.dta") work?
First, we always investigate the dataset:
head(data)
## ht_region ccode cname fh_polity2 region
## 1 1 100 Bulgaria 9.333334 Eastern Europe
## 2 1 268 Georgia 6.250000 Eastern Europe
## 3 1 804 Ukraine 6.500000 Eastern Europe
## 4 1 112 Belarus 1.583333 Eastern Europe
## 5 1 705 Slovenia 10.000000 Eastern Europe
## 6 1 31 Azerbaijan 2.000000 Eastern Europe
glimpse(data) # function from dplyr package
## Rows: 159
## Columns: 5
## $ ht_region <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ ccode <fct> 100, 268, 804, 112, 705, 31, 440, 616, 233, 642, 643, 795, ~
## $ cname <fct> "Bulgaria", "Georgia", "Ukraine", "Belarus", "Slovenia", "A~
## $ fh_polity2 <dbl> 9.3333340, 6.2500000, 6.5000000, 1.5833333, 10.0000000, 2.0~
## $ region <fct> Eastern Europe, Eastern Europe, Eastern Europe, Eastern Eur~
table(data$region)
##
## East Asia (including Japan & Mongolia)
## 6
## Eastern Europe
## 26
## Latin America
## 20
## North Africa & the Middle East
## 20
## South Asia
## 7
## South-East Asia
## 9
## Sub-Saharan Africa
## 45
## The Caribbean
## 3
## The Pacific (excluding Australia & New Zeeland)
## 3
## Western Europe and North America
## 20
We start by extracting Polity scores of Eastern Europe and Western Europe and North America from the data set. Remember selecting/subsetting?
west <- data$fh_polity2[data$region == "Western Europe and North America"]
east <- data$fh_polity2[data$region == "Eastern Europe"]
In order to find out whether the difference in mean Polity scores between Eastern Europe and Western Europe and North America is “statistically significant”, we need to derive confidence intervals of this difference.
To start, calculate the standard errors of the mean estimates of polity scores for Eastern Europe and Western Europe and North America.
Formula for standard errors: \(SE(\hat\theta) = \dfrac{\hat\sigma}{\sqrt N}\)
\(\hat\theta\) = Estimate of population mean
\(\hat\sigma\) = Standard deviation of a sample
\(N\) = Number of observations
# se_east <-
# se_west <-
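One possible solution (a sketch): plug each region’s vector into the SE formula. The stand-in vector below uses the first few Eastern Europe scores from head(data) so the snippet runs on its own; in the script you would apply the same pattern to the full east and west vectors.

```r
# east_demo stands in for the full east vector extracted above
east_demo <- c(9.33, 6.25, 6.50, 1.58, 10.00, 2.00)
se_east_demo <- sd(east_demo) / sqrt(length(east_demo)) # same pattern for west
se_east_demo
```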
We now have standard errors for the mean polity scores of both regions. This equips us to simulate confidence intervals for these estimates.
# # 1. Simulate sampling distributions
#
# nsim <-
# d_west <-
# d_east <-
#
# # 2. Calculate quantiles to get CI
#
# ci_west <-
# ci_east <-
Now we can check the difference: Subtract the simulated sampling distribution of the mean of western countries from the simulated sampling distribution of eastern countries and calculate the quantiles to get the CI of the difference in means. You can also plot the sampling distribution of the difference in means and calculate its mean.
# # 1. Subtract the simulated sampling distributions of the mean (west-east)
#
# diff <-
#
# # 2. Calculate quantiles to get CI
#
# ci_sim <-
#
# # 3. plot and mean
#
# hist()
# mean()
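A possible sketch of these steps with stand-in numbers: the means come from the t-test output further below (about 9.97 and 6.59), while the SEs here are hypothetical placeholders; in the script, use mean(west), mean(east), se_west, and se_east instead.

```r
set.seed(123) # for reproducibility of the random draws
nsim <- 1000
# hypothetical means/SEs standing in for mean(west)/se_west and mean(east)/se_east
d_west <- rnorm(nsim, mean = 9.97, sd = 0.30)
d_east <- rnorm(nsim, mean = 6.59, sd = 0.55)
diff_sim <- d_west - d_east # simulated sampling distribution of the difference
ci_diff <- quantile(diff_sim, c(0.025, 0.975)) # 95% CI of the difference
ci_diff
```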
We do it by hand, using the normal approximation.
n1 <- length(west)
n2 <- length(east)

ci_up <- (mean(west) - mean(east)) + 1.96 * sqrt(var(west) / n1 + var(east) / n2)
ci_low <- (mean(west) - mean(east)) - 1.96 * sqrt(var(west) / n1 + var(east) / n2)

ci_ahand <- c(ci_low, ci_up)
Of course, there is also an easy way: in this case, the t.test() function.
t.test(west, east)
##
## Welch Two Sample t-test
##
## data: west and east
## t = 5.4398, df = 25.073, p-value = 1.189e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.100570 4.659686
## sample estimates:
## mean of x mean of y
## 9.966667 6.586539
ci_t <- t.test(west, east)[[4]][1:2] # the 4th list element is conf.int
We again need an empty vector to save the results.
diff_boot <- rep(NA, 1000)

for (i in seq_len(1000)) {
  a <- mean(sample(west, length(west), replace = TRUE))
  b <- mean(sample(east, length(east), replace = TRUE))

  diff_boot[i] <- a - b
}
mean(diff_boot)
## [1] 3.389298
ci_boot <- quantile(diff_boot, c(0.025, 0.975))
# Finally, we present the results in a table.
res <- matrix(NA, 4, 2)

res[1, ] <- ci_sim
res[2, ] <- ci_ahand
res[3, ] <- ci_t
res[4, ] <- ci_boot

rownames(res) <- c("Simulation", "By hand", "t-test", "Bootstrapping")
colnames(res) <- c("Lower 95% bound", "Upper 95% bound")
res
## Lower 95% bound Upper 95% bound
## Simulation 0.7617279 0.9359397
## By hand 2.1622324 4.5980239
## t-test 2.1005704 4.6596858
## Bootstrapping 2.2514022 4.6369711
Pro-tip: Use knitr::kable() to get nicely formatted Markdown tables from R objects.
knitr::kable(res,
  digits = 3
)
|               | Lower 95% bound | Upper 95% bound |
|---------------|-----------------|-----------------|
| Simulation    | 0.762           | 0.936           |
| By hand       | 2.162           | 4.598           |
| t-test        | 2.101           | 4.660           |
| Bootstrapping | 2.251           | 4.637           |
You learned about different ways to calculate confidence intervals in R: via simulation and bootstrapping as well as analytically. You may still wonder which confidence level is the most appropriate to use. There is always a trade-off between the precision and the accuracy of the statements when reporting quantities of interest. If we want to be very certain that we capture the population parameter, we may be tempted to use a wider confidence interval. This, however, comes with a drawback: the interval becomes rather uninformative.
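To see how quickly the interval grows, here is a small sketch: the normal multiplier increases sharply as the confidence level approaches 100%, so with the same standard error a 99.9% interval is roughly 1.7 times as wide as a 95% interval.

```r
# Two-sided normal multipliers for increasing confidence levels
conf_levels <- c(0.90, 0.95, 0.99, 0.999)
multipliers <- qnorm(1 - (1 - conf_levels) / 2)
round(multipliers, 2) # about 1.64, 1.96, 2.58, 3.29
```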
The way to get the best of both worlds - high accuracy as well as high precision - is to increase the sample size, which is often easier said than done.
In your homework you will: