AE 18: Central Limit Theorem 2

Getting Started

Clone the repository titled “ae18-GitHubUsername” from the course GitHub organization page into RStudio.
Open the .Rmd file and replace “Your Name” with your name.

Central Limit Theorem for Proportion

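If $X$ is a categorical variable with only two categories, for instance, success vs. failure, or heads vs. tails, then we can write $X \sim \text{Bernoulli}(p)$, coding one category as 1 and the other as 0. In this case, the sample mean becomes the sample proportion, i.e., $\bar{X} = \hat{p}$. Therefore, the CLT can apply to $\hat{p}$ as it does to $\bar{X}$.

For a random variable $X \sim \text{Bernoulli}(p)$, we know its mean is $p$ and its standard deviation is $\sqrt{p(1-p)}$.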
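As a quick sanity check of the mean and standard deviation of a two-category (0/1) variable, the sketch below simulates Bernoulli draws in base R. The success probability `p_demo`, the number of draws, and the seed are arbitrary illustrative choices, not values from this exercise.

```r
# Illustrative simulation; every value here is arbitrary, not from the exercise
set.seed(199)
p_demo <- 0.3                                          # hypothetical success probability
draws <- rbinom(n = 100000, size = 1, prob = p_demo)   # Bernoulli draws coded 0/1

sim_mean <- mean(draws)   # should be close to p
sim_sd <- sd(draws)       # should be close to sqrt(p * (1 - p))

c(simulated_mean = sim_mean, theoretical_mean = p_demo)
c(simulated_sd = sim_sd, theoretical_sd = sqrt(p_demo * (1 - p_demo)))
```

Note that `mean(draws)` is simply the proportion of 1s in the simulated sample.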

Q - Provided that we randomly took $n$ observations and there are more than 10 “successes” and 10 “failures” in the sample, what is the distribution of the sample proportion $\hat{p}$ by the CLT?

Suppose we want to formally test $H_0: p = p_0$ vs. $H_a: p \neq p_0$, where $p_0$ is a number, e.g., 0.4, 0.5, 0.9, etc.

Q - What would be the test statistic and its distribution under the null?

Pokemon

library(tidyverse)
theme_set(theme_bw())
library(tidymodels)

We will be using the pokemon dataset, which contains information about 42 randomly selected Pokemon (from all generations).

pokemon <- read_csv("data/pokemon.csv")

In this analysis, we will use CLT-based inference to draw conclusions about the mean height among all Pokemon species.

Part 1: Point estimate

Let’s start by looking at the distribution of height_m, the typical height in meters for a Pokemon species, using a visualization and summary statistics.

ggplot(data = pokemon, aes(x = height_m)) +
  geom_histogram(binwidth = 0.25, fill = "steelblue", color = "black") +
  labs(x = "Height (in meters)",
       y = "Count",
       title = "Distribution of Pokemon heights")

mean_height	sd_height	n_pokemon
0.9285714	0.4974499	42
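Summary statistics like those in the table above can be computed with `summarise()`. The sketch below is only an illustration: it uses a small made-up data frame, `heights_demo`, rather than the actual `pokemon` data, so its output will not match the table.

```r
library(dplyr)

# Hypothetical heights (in meters), for illustration only
heights_demo <- data.frame(height_m = c(0.4, 0.7, 1.0, 1.5, 0.6, 1.1))

height_summary <- heights_demo %>%
  summarise(
    mean_height = mean(height_m),  # sample mean
    sd_height = sd(height_m),      # sample standard deviation
    n_pokemon = n()                # number of rows in the sample
  )

height_summary
```

Applying the same `summarise()` call to `pokemon` should, presumably, reproduce the three columns shown above.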

In the previous lecture, we assumed that we knew the mean, $\mu$, and standard deviation, $\sigma$, of the population. That is unrealistic in practice (if we knew $\mu$ and $\sigma$, we wouldn’t need to do statistical inference!).

Today we will be realistic: we don’t know what $\mu$ or $\sigma$ are. We aim to draw conclusions about $\mu$, the mean height in the population of Pokemon, using our sample data.

Q - What is the point estimate for $\mu$, i.e., the “best guess” for the mean height of all Pokemon?

In order to construct confidence intervals or conduct hypothesis tests about $\mu$, we need a sampling distribution of the sample mean. We will use theoretical distributions based on the CLT.

Before moving forward, however, we need to estimate another unknown, but less interesting, parameter $\sigma$.

Q - What is the point estimate for $\sigma$, i.e., the “best guess” for the standard deviation of the distribution of Pokemon heights?

Part 2: CLT conditions

Before applying the CLT, always check if the following CLT conditions are met. For your information, there are approximately 900 Pokemon species in total.

Independence?
Sample size / distribution?

Part 3: CLT

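By the Central Limit Theorem,

$$\bar{X} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$$

approximately for a large enough $n$, where the second argument is the standard deviation.

Q - Describe the distribution of $\bar{X}$ in words.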

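In practice, we can’t calculate the standardized score $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ due to the unknown $\sigma$, so instead we will use the standardized random variable $T$ when conducting inference for a population mean.

$$T = \frac{\bar{X} - \mu}{s/\sqrt{n}}$$

where

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2}$$

Q - How do $Z$ and $T$ differ?

$T$ follows a $t$ distribution with $n-1$ degrees of freedom. It is written as $T \sim t_{n-1}$. We will use the $t$ distribution to help us conduct hypothesis tests and construct confidence intervals.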
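To get a feel for how the $t$ distribution compares with the standard normal, the sketch below evaluates lower-tail probabilities with `pt()` and `pnorm()`. The cutoff and the degrees of freedom are arbitrary illustrative choices.

```r
cutoff <- -2                    # arbitrary illustrative cutoff
dfs <- c(2, 10, 30, 1000)       # arbitrary degrees of freedom

tail_t <- pt(cutoff, df = dfs)  # P(T < cutoff) under the t distribution for each df
tail_z <- pnorm(cutoff)         # P(Z < cutoff) under the standard normal

data.frame(df = dfs, t_tail = tail_t, normal_tail = tail_z)
```

The $t$ tail areas are larger than the normal tail area and shrink toward it as the degrees of freedom grow.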

Part 4: CLT-based hypothesis test

The mean height of humans is about 1.65 meters (m). We would like to test whether the mean height of Pokemon is less than the mean height of humans.

Q - Step 1: State the null and alternative hypotheses in words and statistical notation.

The next step in hypothesis testing is to summarize the data into a point estimate and assess how likely it is to observe what we observed, or something even more extreme, if the null hypothesis is in fact true.

In order to use the CLT-based distributions we learned in Part 3, we compute a standardized point estimate under the null. We plug in the null value $\mu_0$ for $\mu$, the observed sample mean $\bar{x}$ for $\bar{X}$, and the observed sample standard deviation for $s$ in $T$. Then we get the test statistic

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$

Q - Step 2: What is the estimated standard error for the Pokemon data?

Q - Step 2: Calculate the test statistic ($t$-value).

Q - Step 3: What is the distribution of the test statistic under the null?

Q - Step 3: Now let’s calculate the p-value. Fill in the code below to use the pt() function to calculate the p-value.

pt(___, df = ___)

State what the p-value means.
State the conclusion in the context of the data using the chosen significance level $\alpha$.
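The steps in this part can be sketched end to end on made-up numbers. All values below (`x_bar`, `s`, `n`, `mu_0`) and the direction of the alternative are hypothetical and deliberately differ from the Pokemon data.

```r
# Hypothetical summary statistics, for illustration only (not the Pokemon data)
x_bar <- 5.2   # observed sample mean
s <- 1.5       # observed sample standard deviation
n <- 25        # sample size
mu_0 <- 5      # null value; here H0: mu = 5 vs. Ha: mu > 5

se <- s / sqrt(n)               # estimated standard error
t_stat <- (x_bar - mu_0) / se   # test statistic
# Upper-tail p-value because this alternative is "greater than"
p_value <- pt(t_stat, df = n - 1, lower.tail = FALSE)

c(se = se, t_stat = t_stat, p_value = p_value)
```

For a “less than” alternative, the default `lower.tail = TRUE` in `pt()` gives the p-value instead.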

Part 5: CLT-based confidence interval

We would like to construct a 90% confidence interval for the mean height of Pokemon species. The confidence interval for the population mean is

$$\bar{x} \pm t^*_{n-1} \times \frac{s}{\sqrt{n}}$$

where $t^*_{n-1} \times \frac{s}{\sqrt{n}}$ is called the margin of error.

We already know $\bar{x}$ and $\frac{s}{\sqrt{n}}$, so let’s talk about $t^*_{n-1}$. This value is determined based on the confidence level, $C$. It is the point on the $t$ distribution with $n-1$ degrees of freedom, such that the area between $-t^*_{n-1}$ and $t^*_{n-1}$ is $C$.
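One way to find such a critical value in R is with `qt()`. The sketch below uses a hypothetical 95% confidence level and made-up summary statistics, not the values from this exercise, and then verifies that the central area matches the confidence level.

```r
# Hypothetical inputs, for illustration only
conf_level <- 0.95
n <- 25
x_bar <- 5.2
s <- 1.5

# Upper cutoff that leaves (1 - conf_level) / 2 in each tail
t_star <- qt(1 - (1 - conf_level) / 2, df = n - 1)
margin <- t_star * s / sqrt(n)                       # margin of error
ci <- c(lower = x_bar - margin, upper = x_bar + margin)

# Check: the area between -t_star and t_star should equal conf_level
area <- pt(t_star, df = n - 1) - pt(-t_star, df = n - 1)

ci
```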

Q - What is the critical value for our 90% confidence interval of the mean Pokemon height?

Q - Now calculate the 90% confidence interval for the mean Pokemon height.

Q - Interpret the interval in the context of the data.

Part 6: CLT-based hypothesis test using `infer`

Q - Conduct the hypothesis test from Part 4 using the t_test() function.

pokemon %>%
  t_test(response = ____, 
         alternative = ____, 
         mu = ____, 
         conf_int = FALSE)
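For comparison, here is a sketch of the same kind of call on simulated data, checked against base R's `t.test()`. The data frame `demo`, its column `y`, the null value, and the direction of the alternative are all made up for illustration.

```r
library(infer)

set.seed(18)
# Simulated data for illustration only; `demo` and `y` are made-up names
demo <- data.frame(y = rnorm(30, mean = 5.5, sd = 1.5))

# infer version: H0: mu = 5 vs. Ha: mu > 5
infer_res <- t_test(demo, response = y, mu = 5,
                    alternative = "greater", conf_int = FALSE)

# Base R version of the same test, for comparison
base_res <- t.test(demo$y, mu = 5, alternative = "greater")

infer_res
c(statistic = unname(base_res$statistic), p_value = base_res$p.value)
```

The two approaches should agree on the test statistic and p-value; the `infer` version returns them in a tidy data frame.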

Part 7: CLT-based confidence interval using `infer`

Q - Calculate the 90% confidence interval from Part 5 using the t_test() function.

pokemon %>%
  t_test(response = ____, 
         conf_int = TRUE, 
         conf_level = ____) %>%
  select(lower_ci, upper_ci)

Q - Why not set conf_int = TRUE in Part 6 and obtain the confidence interval and the hypothesis test at once?

Practice

We found that the observed average height of Pokemon in our sample is about 0.93 m. For humans, the average weight of those about 0.95 meters tall is 14 kg. We would like to test if the proportion of Pokemon heavier than 14 kg is significantly different from 50%.

Q - State the null and the alternative hypothesis.

Q - Verify if the CLT conditions are satisfied.

Independence?
Sample size / distribution?

Q - Conduct a CLT-based hypothesis test both manually and using infer at the significance level 0.05. State your conclusion in the context of the data. Hint: use prop_test().

# sample proportion


# null value 


# compute z-statistic


# p-value under the null distribution

# infer
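The manual steps outlined in the comments above can be sketched on made-up counts. The numbers below are hypothetical, not the Pokemon data, and the cross-check uses base R's `prop.test()` rather than `infer`, purely so the example depends on nothing beyond base R.

```r
# Hypothetical counts, for illustration only (not the Pokemon data)
n <- 200          # sample size
successes <- 116  # number of "successes"
p_0 <- 0.5        # null value; H0: p = 0.5 vs. Ha: p != 0.5

p_hat <- successes / n                                  # sample proportion
se_0 <- sqrt(p_0 * (1 - p_0) / n)                       # standard error under the null
z_stat <- (p_hat - p_0) / se_0                          # z-statistic
p_value <- 2 * pnorm(abs(z_stat), lower.tail = FALSE)   # two-sided p-value

# Base R check: without continuity correction, X-squared equals z^2
check <- prop.test(successes, n, p = p_0, correct = FALSE)

c(z_stat = z_stat, p_value = p_value, base_p_value = check$p.value)
```

Because the standard error is computed under the null, it uses `p_0` rather than `p_hat`.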

Q - Suppose the true proportion of Pokemon heavier than 14 kg is 0.5. In your conclusion above, did you make the correct decision, a Type 1 error, or a Type 2 error? Explain.

Submitting Application Exercises

Once you have completed the activity, push your final changes to your GitHub repo.
Make sure you committed at least three times.
Check that your repo is updated on GitHub, and that’s all you need to do to submit application exercises for participation.

AE 18: Central Limit Theorem 2

due Thursday, June 9 at 9:29am

Bora Jin
