hw03.Rmd to open the template R Markdown file.
For each exercise:
Make sure we see all relevant code and output in the knitted PDF. If you use inline code, make sure we can still see the code used to derive that answer.
Write all narrative in complete sentences with decimal numbers rounded to three decimal places.
Use a small number of reps (about 500) as you write and test out your code. Once you have finalized all of your code, increase the number of reps to 15,000 to produce your final results.
Write your code and narrative in a reproducible way, so you can easily change the number of reps. For example, consider ways you can write your narrative using inline code, so the relevant values update when you change the number of reps.
For each simulation exercise, use the seed specified in the exercise instructions.
All plots should follow best visualization practices; plots should include:
viridis, scico, and many others.
Place all plots in the center and properly adjust their size so that they are placed nicely in a written report.
Don’t forget to label your R chunk as well. Your label should be short, informative, shouldn’t include spaces, and shouldn’t repeat a previous label.
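The advice above about keeping the number of reps easy to change can be sketched as follows. This is only an illustration in base R: the variable names `n_reps`, `toy_weights`, `boot_means`, and `boot_se` are hypothetical, and the data are simulated rather than taken from the assignment.

```r
# Hypothetical setup: one variable controls the number of reps everywhere.
n_reps <- 500  # switch to 15000 for the final results

set.seed(5)
# Made-up data for illustration only; NOT the assignment's dataset.
toy_weights <- rnorm(100, mean = 7.2, sd = 1.2)

# Resample with replacement n_reps times and store each resample mean.
boot_means <- replicate(n_reps, mean(sample(toy_weights, replace = TRUE)))

# A rounded value that the narrative can reference through inline code.
boot_se <- round(sd(boot_means), 3)
```

In the narrative, inline code such as `r n_reps` or `r boot_se` would then update on its own whenever `n_reps` changes, so no numbers need to be retyped by hand.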
We’ll use the tidyverse package for much of the data wrangling and visualization and the tidymodels package for inference. The data live in the openintro package.
library(tidyverse)
theme_set(theme_bw())
library(tidymodels)
library(openintro)
Every year, the US releases a large dataset containing information on births recorded in the country. This dataset is useful to researchers studying the relation between habits and practices of expectant mothers and the birth of their children. We will work with a random sample of 1,000 observations from the dataset released in 2014.
The subsetted data can be found in the openintro package, and it’s called births14. Each observation represents a birth in the US. You can find out more about the dataset by running ?births14 in the Console.
premie | min_week | max_week |
---|---|---|
full term | 37 | 46 |
premie | 21 | 36 |
Let \(A\) be the event that a baby is premature and \(B\) be the event that a baby weighs more than 9.5 pounds. Determine if the two events are disjoint or not. Also determine if they are independent. Explain your reasoning.
What is the probability that a baby is premature given the baby is female? What about the probability that a baby is premature given the baby is male? Calculate the probabilities and also create a horizontal stacked bar plot of sex with relative frequencies of premie. Have the sex of a baby on the y-axis and fill the bars according to whether the baby was premature or not.
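As a rough illustration of how conditional probabilities fall out of a two-way table, here is a base R sketch. The counts are made up for the example and are not taken from births14; the names `toy_counts`, `cond_probs`, and `p_premie_given_female` are hypothetical.

```r
# Made-up counts for illustration only; NOT taken from births14.
toy_counts <- matrix(
  c(40, 360,   # female: premie, full term
    50, 350),  # male:   premie, full term
  nrow = 2, byrow = TRUE,
  dimnames = list(sex = c("female", "male"),
                  premie = c("premie", "full term"))
)

# Dividing each row by its row total gives P(premie status | sex).
cond_probs <- prop.table(toy_counts, margin = 1)

# Conditional probability of a premature birth given a female baby,
# rounded to three decimal places as the assignment requests.
p_premie_given_female <- round(cond_probs["female", "premie"], 3)
```

Each row of `cond_probs` sums to 1, which is the same row-wise relative frequency that a stacked bar plot filled to full width would display.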
Using the results from the exercises above and Bayes’ theorem, compute the probability that a baby is female given that the baby is premature. Let \(A\) be the event that a baby is female and \(B\) be the event that a baby is premature. Is \(A\) independent of \(B\)? Why or why not?
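For reference, the standard statement of Bayes’ theorem and the usual independence check are written out below, using \(A\) for "female" and \(B\) for "premature" as in this exercise. The expansion of \(P(B)\) in the denominator assumes that \(A\) and its complement \(A^{c}\) (male) cover all births in the sample, and that \(P(B) > 0\).

```latex
\[
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}
            = \frac{P(B \mid A)\,P(A)}
                   {P(B \mid A)\,P(A) + P(B \mid A^{c})\,P(A^{c})}
\]
\[
A \text{ and } B \text{ are independent} \iff P(A \mid B) = P(A)
\]
```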
🧶 ✅ ⬆️ Knit, commit, and push your final changes to GitHub with a meaningful commit message.
According to this article, the average birth weight of a full-term female baby reported by the World Health Organization (WHO) is 7.125 pounds (lbs).
We want to evaluate whether the average weight of full-term female babies in the US is significantly different from 7.125 lbs.
(a). State the null hypothesis and the alternative hypothesis in math. Clearly define all parameters you introduce.
(b). Create a filtered data frame called births_girl that contains data only for full-term female babies. Then, calculate the mean of the weights of these babies.
(c). Simulate data under the null hypothesis, visualize it with the p-value region shaded, and calculate the p-value. You may start with the code chunk provided below. Set the seed to 5. Use reps = 15000 for your final submission. Remove eval = FALSE once you fill in the blanks.
null_dist <- births_girl %>%
  specify(response = ____) %>%
  hypothesize(null = ____, __ = ____) %>%
  generate(reps = 500, type = _____) %>%
  calculate(stat = ____)
(d). Make a conclusion at \(\alpha = 0.05\) and interpret the results in context of the data.
(e). Construct a confidence interval at the equivalent level to the hypothesis test above and interpret the interval in context of the data. Use the same seed as Ex 5 (c).
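To make the mechanics behind parts (c) and (e) more concrete, here is a base R sketch on simulated data. It is not the tidymodels pipeline the exercise asks for and does not use births14; the names `toy`, `mu0`, `shifted`, `null_means`, and `ci_95` are hypothetical. The two-sided p-value below uses one common definition (distance from the null value), which may differ slightly from how a given package computes it.

```r
# Made-up sample for illustration only; NOT the births14 data.
set.seed(5)
toy <- rnorm(80, mean = 7.4, sd = 1.1)
mu0 <- 7.125     # null value stated in the exercise
n_reps <- 500    # small while testing; raise for final results
obs_mean <- mean(toy)

# Simulate under the null: shift the sample so its mean equals mu0,
# then resample with replacement and record each resample mean.
shifted <- toy - obs_mean + mu0
null_means <- replicate(n_reps, mean(sample(shifted, replace = TRUE)))

# Two-sided p-value: share of simulated means at least as far from mu0
# as the observed mean is.
p_value <- mean(abs(null_means - mu0) >= abs(obs_mean - mu0))

# Percentile bootstrap interval; a two-sided test at alpha = 0.05
# corresponds to a 95% confidence level.
boot_means <- replicate(n_reps, mean(sample(toy, replace = TRUE)))
ci_95 <- quantile(boot_means, probs = c(0.025, 0.975))
```

The null distribution is centered at `mu0` by construction, while the bootstrap distribution used for the interval is centered near the observed mean; keeping the two separate avoids mixing up the test and the interval.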
🧶 ✅ ⬆️ Knit, commit, and push your final changes to GitHub with a meaningful commit message.
Consider the possible relationship between a mother’s smoking habit and the weight of her baby. Plotting the data is a useful first step because it helps us quickly visualize trends, identify strong associations, and develop research questions.
Make side-by-side boxplots displaying the relationship between habit and weight. What does the plot highlight about the relationship between these two variables?
Before moving forward, save a version of the dataset omitting observations where there are NAs for habit. You can call this version births_habitgiven.
We want to examine if the relationship seen in the side-by-side boxplots is statistically significant. We will conduct a hypothesis test on whether the average weight of babies born to smoking mothers is less than that of babies born to non-smoking mothers.
(a). State the null and the alternative hypothesis in math with a clear definition of any parameter you introduce.
(b). Calculate the observed difference between the average weights of babies born to smoking and non-smoking mothers.
(c). Simulate data under the null hypothesis, visualize it with the p-value region shaded, and calculate the p-value. Set the seed to 8. Use reps = 15000 for your final submission.
(d). State your conclusion in context of the research question with \(\alpha = 0.01\).
(e). Construct a confidence interval, at the equivalent level to the hypothesis test, for the difference between the average weights of babies born to smoking and non-smoking mothers, and interpret this interval in context of the data. Again, use the seed number 8.
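The idea of simulating under the null for a difference in means can be sketched with a permutation test in base R. Everything here is an assumption made for illustration: the data are simulated (not births14), the group sizes and means are invented, and `diff_means`, `toy`, and `null_diffs` are hypothetical names. The difference is taken as smoker minus nonsmoker, so a left-tailed p-value matches a "less than" alternative.

```r
# Made-up data for illustration only; NOT the births14 data.
set.seed(8)
n_reps <- 500  # small while testing; raise for final results
toy <- data.frame(
  habit  = rep(c("smoker", "nonsmoker"), times = c(30, 70)),
  weight = c(rnorm(30, mean = 6.8, sd = 1.2),
             rnorm(70, mean = 7.3, sd = 1.2))
)

# Difference in mean weight, smoker minus nonsmoker.
diff_means <- function(weight, habit) {
  mean(weight[habit == "smoker"]) - mean(weight[habit == "nonsmoker"])
}
obs_diff <- diff_means(toy$weight, toy$habit)

# Under the null of no difference, habit labels are exchangeable:
# shuffle the labels and recompute the difference each time.
null_diffs <- replicate(n_reps, diff_means(toy$weight, sample(toy$habit)))

# Left-tailed p-value: share of shuffled differences at or below the observed one.
p_value <- mean(null_diffs <= obs_diff)
```

Shuffling the labels keeps the group sizes fixed while breaking any link between habit and weight, which is what the null hypothesis of no difference describes.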
🧶 ✅ ⬆️ Knit, commit, and push your final changes to GitHub with a meaningful commit message.
Knit to PDF to create a PDF document. Stage and commit all remaining changes, and push your work to GitHub. Make sure all files are updated on your GitHub repo.
Only upload your PDF document to Gradescope. Before you submit the uploaded document, mark where the answer to each exercise is located. If any answer spans multiple pages, then mark all pages. Associate the “Workflow & formatting” section with the first page.
Component | Points |
---|---|
Ex 1 | 1 |
Ex 2 | 4 |
Ex 3 | 4 |
Ex 4 | 5 |
Ex 5 | 18 |
Ex 6 | 2 |
Ex 7 | 1.5 |
Ex 8 | 18.5 |
Workflow & formatting | 6 |
Grading notes: