Clone the repository named “ae14-GitHubUsername” from the course GitHub organization page into RStudio.
Open the .Rmd file and replace “Your Name” with your name.
On a given day in 2018, twenty one-bedroom apartments were randomly selected on Craigslist Manhattan from apartments listed as “by owner”. The data are in the manhattan data frame. We will use this sample to conduct inference on the typical rent of one-bedroom apartments in Manhattan.
Q - State the research question and identify the population and our sample.
We start by loading the relevant packages. For bootstrapping, we will use the infer package, which is included as part of tidymodels.
library(tidyverse)
library(tidymodels)
manhattan <- read_csv("data/manhattan.csv")
Let’s start by using bootstrapping to estimate the mean rent of one-bedroom apartments in Manhattan.
Q - What is the point estimate of the mean rent?
Let’s bootstrap! We should start by setting a seed. The function set.seed() is a base R function that allows us to control R’s random number generation. This ensures our analysis is reproducible, which means we’ll get the same random sample each time we run the code or knit.
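As a small illustration of what setting a seed does, the sketch below draws the same random sample twice in base R. The object names first_draw and second_draw are made up for this example and are not part of the activity.

```r
# Resetting the seed to the same value replays the same random draws.
set.seed(123)
first_draw <- sample(1:100, size = 5)

set.seed(123)
second_draw <- sample(1:100, size = 5)

# The two draws match exactly because the generator restarted from the same state.
identical(first_draw, second_draw)
```

Without the second set.seed() call, the two draws would generally differ.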
Step 1: Sample with replacement.
Q - How many observations do we need for our bootstrap sample?
set.seed(123)
mht_boot1 <- manhattan %>%
  slice_sample(n = ___, replace = ___)
# compare the original sample to the bootstrap sample
data.frame(org = manhattan$rent,
           boot1 = mht_boot1$rent)
Step 2: Compute the statistic from the bootstrap sample.
Step 3: Repeat steps 1 and 2 multiple times and create a bootstrap distribution of sample statistics.
set.seed(123)
# how many times?
reps <- 1000
# 1000 bootstrap samples
boot_samp <- manhattan %>%
  rep_slice_sample(n = 20, replace = TRUE, reps = reps)
# calculate sample mean from each bootstrap sample
boot_dist <- boot_samp %>%
  group_by(replicate) %>%
  summarize(stat = mean(rent))
# bootstrap distribution of sample means
boot_dist %>%
  ggplot(aes(x = stat)) +
  geom_histogram(binwidth = 50) +
  geom_vline(xintercept = mean(manhattan$rent),
             color = "red", linetype = "dashed") +
  labs(x = "Sample mean", y = "Count",
       title = "Bootstrap distribution of sample means",
       subtitle = "Rent of one-bedroom apartments in Manhattan")
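Steps 1 to 3 can also be sketched in base R, which may help show what the tidyverse code is doing under the hood. This is only an illustration: toy_rent holds made-up values, not the manhattan data, and the object names are hypothetical.

```r
# Hypothetical toy values for illustration only (NOT the manhattan data).
toy_rent <- c(1800, 2100, 2250, 2400, 2600, 2750, 3000, 3200, 3500, 4100)
n <- length(toy_rent)

set.seed(123)

# Step 1: draw one bootstrap sample, the same size as the original, with replacement.
one_boot <- sample(toy_rent, size = n, replace = TRUE)

# Step 2: compute the statistic of interest from that bootstrap sample.
one_stat <- mean(one_boot)

# Step 3: repeat steps 1 and 2 many times to build a bootstrap distribution.
boot_means <- replicate(1000, mean(sample(toy_rent, size = n, replace = TRUE)))

length(boot_means)
```

Each element of boot_means plays the same role as one row of boot_dist in the code above.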
Step 4: Calculate a confidence interval (CI) using percentiles of the bootstrap distribution.
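A percentile interval takes the quantiles of the bootstrap distribution that cut off equal tails on each side. The sketch below shows the idea for a 95% interval in base R, again using made-up values (toy_rent) rather than the manhattan data.

```r
# Hypothetical toy values for illustration only (NOT the manhattan data).
toy_rent <- c(1800, 2100, 2250, 2400, 2600, 2750, 3000, 3200, 3500, 4100)

set.seed(123)

# Bootstrap distribution of the sample mean (1000 resamples with replacement).
boot_means <- replicate(
  1000,
  mean(sample(toy_rent, size = length(toy_rent), replace = TRUE))
)

# The middle 95% runs from the 2.5th to the 97.5th percentile.
ci_95 <- quantile(boot_means, probs = c(0.025, 0.975))
ci_95
```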
Q - What do you notice about the relationship between the confidence level and the width of the confidence interval?
infer
Steps 1-3 can be done in one pipeline using functions in infer.
Q - Complete the code chunk below. Use 1000 reps for the in-class activity. (You will use about 15,000 reps for assignments outside of class.)
set.seed(123)
# save resulting bootstrap distribution
boot_dist2 <- manhattan %>%
  # specify the variable of interest
  specify(response = ____) %>%
  # generate reps (say 100, 1000, 10000, etc.) bootstrap samples
  generate(reps = ____, type = _____) %>%
  # calculate the statistic of each bootstrap sample
  calculate(stat = _____)
specify: what variable will you use?
generate: how many repetitions (samples) to create for the variable you specified? What type? type = "draw", type = "permute", or type = "bootstrap".
calculate: what statistic should be computed for each sample generated?
Order of functions matters! You can read more about each function using the help command, e.g., ?specify.
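To show how the three verbs fit together without filling in the blanks above, here is a sketch on a different, made-up data frame that bootstraps a standard deviation instead of a mean. The data frame toy_df and its column commute_min are hypothetical. The native pipe |> (R 4.1 or later) is used so that only infer needs to be attached.

```r
library(infer)

# Hypothetical toy data for illustration only.
toy_df <- data.frame(commute_min = c(12, 18, 25, 30, 34, 41, 45, 52, 60, 75))

set.seed(123)

toy_boot <- toy_df |>
  # which variable are we studying?
  specify(response = commute_min) |>
  # 500 resamples, each drawn with replacement from the original sample
  generate(reps = 500, type = "bootstrap") |>
  # one statistic per resample
  calculate(stat = "sd")

nrow(toy_boot)
names(toy_boot)
```

The result has one row per bootstrap resample, with a replicate identifier and the statistic computed for that resample.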
Q - How many rows are in boot_dist2?
Q - What does each row represent?
Q - What are the variables in boot_dist2? What do they mean?
Q - Visualize the bootstrap distribution using a histogram. Describe the shape, center, and spread of this distribution.
Q - Step 4. Construct the 95% confidence interval for the mean rent of one-bedroom apartments in Manhattan. To get the middle 95%, we want to omit 2.5% on the left and 2.5% on the right.
boot_dist2 %>%
  summarize(lower_bound = quantile(____, ____),
            upper_bound = quantile(____, 0.975))
Q - What is the correct interpretation for the interval calculated above?
Q - Modify the code used to calculate a 95% confidence interval to calculate a 90% confidence interval for the mean rent of one-bedroom apartments in Manhattan. Hint: upper - lower should yield what you want, e.g., 0.975 - 0.025 = 0.95.
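The arithmetic in the hint generalizes: for a confidence level, the tail area on each side is (1 - level) / 2. The helper below, percentile_probs(), is a hypothetical function written for this illustration and is not part of infer or base R.

```r
# Hypothetical helper: convert a confidence level into the two quantile
# probabilities that bound the middle `level` share of a distribution.
percentile_probs <- function(level) {
  tail_area <- (1 - level) / 2
  c(lower = tail_area, upper = 1 - tail_area)
}

# For a 95% interval this gives the familiar 0.025 and 0.975.
probs_95 <- percentile_probs(0.95)
probs_95

# upper - lower recovers the confidence level.
probs_95[["upper"]] - probs_95[["lower"]]
```

The same function can be called with any other level to check the probabilities you plan to pass to quantile().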
Q - Calculate a 99% confidence interval for the mean rent of one-bedroom apartments in Manhattan. How does the width of this interval compare to the width of the 90% or the 95% confidence interval?
Q - What is one advantage to using a 90% confidence interval instead of a 95% confidence interval to estimate a parameter? What is one advantage to using a 99% confidence interval instead of a 95% confidence interval to estimate a parameter? Explain in terms of accuracy and precision.
Use bootstrapping to estimate the median rent for one-bedroom apartments in Manhattan.
Generate the bootstrap distribution and save it as boot_dist_median.
set.seed(123)
Calculate a 92% confidence interval.
Interpret the 92% confidence interval.