Clone the repository named “ae22-GitHubUsername” from the course GitHub organization into your RStudio. Open the .Rmd file and replace “Your Name” with your name.
library(tidyverse)
theme_set(theme_bw())
library(tidymodels)
car_prices <- read_csv("data/car_prices.csv")
Based on the plot below, we suspect there may be an interaction effect between car age and car type in explaining selling price.
We will consider the following two models and compare their fit to the data.
We call Age*Type an interaction term, defined as the product of age and type. The effect of an interaction term on the dependent variable is called an “interaction effect”. “Main effects,” on the other hand, represent the effect of a single independent variable on the dependent variable. Since Model 1 includes only the main effects for age and type, it is called a main-effects-only model.
Q - How many more terms do you expect to appear in Model 2 compared to Model 1? Hint: there are three car types (Accord, Maxima, and Mazda6), one of which is used as the baseline.
Q - What do we assume about a relationship between car age and selling price by different car types in Model 1 and Model 2?
Q - Use appropriate functions to find the fitted models and display the results in tidy format. Hint: y ~ x1*x2 fits two main effects for x1 and x2 and their interaction effect.
# model 1
# model 2
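One possible sketch of the two fits. The column names of car_prices aren’t shown in this AE, so age, type, and price below are assumed names on simulated stand-in data; swap in car_prices and its actual columns.

```r
library(broom)

# Stand-in data -- replace with car_prices and its real column names
set.seed(1)
cars <- data.frame(
  age  = runif(90, 0, 10),
  type = sample(c("Accord", "Maxima", "Mazda6"), 90, replace = TRUE)
)
cars$price <- 30 - 2 * cars$age + rnorm(90)

# Model 1: main effects only
lm_main <- lm(price ~ age + type, data = cars)
tidy(lm_main)

# Model 2: main effects plus the age-by-type interaction
lm_int <- lm(price ~ age * type, data = cars)
tidy(lm_int)
```

With three car types (two indicator variables after choosing a baseline), the interaction contributes two extra age-by-type terms, so Model 2’s tidy output has two more rows than Model 1’s.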
Q - Write out the equation of the fitted models with variable names.
Now we will interpret Models 1 and 2.
Q - What is the relationship between selling price and age for Maximas / Mazda6 / Accords in Model 1? What about in Model 2?
Model 1:
Model 2:
Q - Compare adjusted R² between the models. Which model do you prefer?
# model 1
# model 2
Model 2 has an interaction term which is the product of two independent variables… Huh? Wait, is Model 2 still a linear model?
The answer is yes. Model 2 is a linear model. In linear regression, what must be linear are the parameters, the β’s, not the data. As long as the model is linear in its parameters, you can transform the data however you like.
For instance, the following models are all linear.
But, these are not linear models.
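To make the distinction concrete, here is an illustrative fit (the formulas below are examples of the idea, not necessarily the ones listed in the AE): a model with a transformed predictor is still linear in its β’s and works with lm(), while a model nonlinear in its parameters does not.

```r
# Linear in the parameters even though the predictor is transformed:
#   y = b0 + b1 * log(x) + e   -> lm() handles this fine
set.seed(6)
d <- data.frame(x = runif(50, 1, 5))
d$y <- 1 + 2 * log(d$x) + rnorm(50, sd = 0.1)

fit_log <- lm(y ~ log(x), data = d)
coef(fit_log)

# By contrast, something like y = b0 * exp(b1 * x) + e is NOT linear
# in b0 and b1, and would need nls() rather than lm().
```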
Q - (Optional) Consider the dataset mystery_data.
mystery_data <- read_csv("data/lm_example.csv")
mystery_data %>%
ggplot(aes(x = x, y = y)) +
geom_point() +
geom_smooth(method = 'lm') +
labs(title = "Naive linear model is a poor choice!")
Write the equation of the line above (by finding the fitted model). What is the R² associated with the fitted line?
Which linear model do you think will better describe the relationship between x and y above?
Transform the predictor x according to your guess in (2) and plot y versus the transformed predictor below.
Fit a linear model with the transformed x. Find the R² of the new model and compare.
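A sketch of the workflow above on simulated stand-in data, since lm_example.csv isn’t reproduced here. The quadratic transformation is only an assumed example; use whatever shape the real mystery_data scatterplot suggests.

```r
library(broom)

# Stand-in for mystery_data: suppose the true relationship is quadratic
set.seed(5)
mystery <- data.frame(x = runif(100, -3, 3))
mystery$y <- 1 + mystery$x^2 + rnorm(100, sd = 0.3)

# Naive linear fit vs a fit on the transformed predictor x^2
r2_naive <- glance(lm(y ~ x, data = mystery))$r.squared
r2_trans <- glance(lm(y ~ I(x^2), data = mystery))$r.squared
c(naive = r2_naive, transformed = r2_trans)
```

When the transformation matches the true shape, R² jumps from near 0 to near 1; comparing the two R² values quantifies how much better the transformed model describes the data.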
Let’s revisit one of the models from the latest AE:
price = β0 + β1·age + β2·mileage + ϵ
Q - To remind ourselves, fit the above model and find β̂0, β̂1, and β̂2.
Beyond these point estimates, we can find a range of plausible values for the β’s.
Let’s build a theory-based CI on the regression parameters. In order to make theory-based inferences, we need to assume the errors are independent and identically distributed (iid) with ϵ ∼ N(0, σ²). (So far, we haven’t assumed normality!)
Independence means knowing the error term for one observation doesn’t tell you anything about the error term for another observation.
Then a (1 − α) × 100% confidence interval for β is obtained as

β̂ ± t*_{n−p} × SE(β̂)

where β̂ is the estimate for β in the regression, SE(β̂) is its standard error (std.error in the tidy output), and t*_{n−p} is the critical value from a t-distribution with n − p degrees of freedom. Notice the similar structure to CIs for a population mean using the CLT.
Q - Manually compute a 95% CI for β2.
# df for t-distribution
# beta_hat
# se_hat
# tstar
# CI
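One possible way to fill in the stubs above. Since lm_pam and the course data aren’t reproduced here, the block simulates a price ~ age + mileage fit with the same structure; in the AE, start from your actual lm_pam.

```r
library(broom)

# Stand-in model -- replace with lm_pam fit to the course data
set.seed(2)
d <- data.frame(age = runif(60, 0, 10), mileage = runif(60, 0, 120))
d$price <- 40 - 2 * d$age - 0.1 * d$mileage + rnorm(60)
lm_pam <- lm(price ~ age + mileage, data = d)

# df for t-distribution: n - p, where p counts all estimated coefficients
df <- nrow(d) - length(coef(lm_pam))

# beta_hat and se_hat for mileage, read off the tidy output
est      <- tidy(lm_pam)
beta_hat <- est$estimate[est$term == "mileage"]
se_hat   <- est$std.error[est$term == "mileage"]

# tstar: 97.5th percentile, since alpha = 0.05 is split across two tails
tstar <- qt(0.975, df = df)

# CI
ci <- c(beta_hat - tstar * se_hat, beta_hat + tstar * se_hat)
ci
```

A quick sanity check is that this manual interval matches confint(lm_pam, "mileage", level = 0.95).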
A tidy way of constructing the same CI is the following:
lm_pam %>%
tidy(conf.int = TRUE, conf.level = 0.95) %>%
filter(term == "mileage") %>%
select(conf.low, conf.high)
Q - Interpret the confidence interval for β2.
Q - Create a 90% confidence interval for β1, both manually and using tidymodels functions, and interpret the interval in the context of the data.
# beta_hat
# se_hat
# tstar
# CI
# tidy
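A sketch of the 90% version, again on a simulated stand-in for lm_pam. The one change from the 95% CI is the critical value: with α = 0.10 split between two tails, we use qt(0.95, df), and tidy() takes conf.level = 0.90.

```r
library(broom)

# Stand-in model -- in the AE, use your lm_pam
set.seed(7)
d <- data.frame(age = runif(60, 0, 10), mileage = runif(60, 0, 120))
d$price <- 40 - 2 * d$age - 0.1 * d$mileage + rnorm(60)
lm_pam <- lm(price ~ age + mileage, data = d)

# Manual 90% CI for the age coefficient
est   <- tidy(lm_pam)
b1    <- est$estimate[est$term == "age"]
se1   <- est$std.error[est$term == "age"]
tstar <- qt(0.95, df = nrow(d) - length(coef(lm_pam)))
manual_ci <- c(b1 - tstar * se1, b1 + tstar * se1)

# Same CI from tidy()
tidy_ci <- tidy(lm_pam, conf.int = TRUE, conf.level = 0.90)
tidy_ci <- tidy_ci[tidy_ci$term == "age", c("conf.low", "conf.high")]

manual_ci
tidy_ci
```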
Based on the point estimate and the confidence interval for β2 in Part 3, we may want to determine if β2 is significantly different from 0.
We can set it up as a hypothesis test, with the hypotheses below:
H0: β2 = 0. There is no relationship between price and mileage.
H1: β2 ≠ 0. There is a relationship between price and mileage.
We reject H0 in favor of H1 if the data provide strong evidence that the true slope β2 is different from 0.
Again, we assume ϵ ∼ N(0, σ²), iid.
Then our test statistic is t = (β̂2 − 0) / SE(β̂2), which follows a t-distribution with n − p degrees of freedom under H0.
Q - Using the equation above, compute the p-value.
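One way to carry out the computation, sketched on a simulated stand-in for lm_pam (in the AE, reuse your fitted model): form the t statistic from the estimate and standard error, then double the upper-tail t probability for a two-sided p-value.

```r
library(broom)

# Stand-in model -- replace with lm_pam fit to the course data
set.seed(3)
d <- data.frame(age = runif(60, 0, 10), mileage = runif(60, 0, 120))
d$price <- 40 - 2 * d$age - 0.1 * d$mileage + rnorm(60)
lm_pam <- lm(price ~ age + mileage, data = d)

est      <- tidy(lm_pam)
beta_hat <- est$estimate[est$term == "mileage"]
se_hat   <- est$std.error[est$term == "mileage"]
df       <- nrow(d) - length(coef(lm_pam))

# test statistic and two-sided p-value
t_stat <- (beta_hat - 0) / se_hat
p_val  <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)
p_val
```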
R takes care of much of this behind the scenes with the tidy output and reports a p-value for each β by default.
lm_pam %>%
tidy()
Q - Is β2 significant at the α=0.05 level? State your conclusion.
Q - What about β0 or β1? Are they significant at the α=0.05 level?
After fitting models, we should examine if certain model conditions hold. If they do, our inferences based on the fitted models are reliable. Otherwise, we should consider fitting different models.
We will examine residuals vs. predicted values to check the following conditions:
While we can manually compute residuals and predicted values from the regression output, this time we will use the augment() function in the tidymodels package.
We first fit a model and pipe it into the augment() function. For instance, for Model 1,
aug_main <- ____ %>% # fitted model
augment()
aug_main %>%
head()
We will focus on two output values from the augmented results:
- .fitted: the fitted (predicted) value
- .resid: the residual, i.e., the difference between observed and fitted values

See ?augment for details.

Q - Create a scatterplot of residuals vs. predicted values using the augmented results above. To aid the visual assessment, add a horizontal gray dashed line at 0.
aug_main %>%
ggplot(aes(y = _____, x = _____)) +
geom_point() +
geom_hline(yintercept = __, linetype = _____, color = ____) +
labs(x = "Predicted price", y = "Residuals")
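One possible completion of the template above, sketched with a simulated stand-in model since the AE’s fitted Model 1 isn’t reproduced here; in your own work, pipe your aug_main into ggplot directly.

```r
library(broom)
library(ggplot2)

# Stand-in augmented results -- in the AE, aug_main comes from piping
# the fitted Model 1 into augment()
set.seed(4)
d <- data.frame(x = runif(50))
d$y <- 1 + 2 * d$x + rnorm(50, sd = 0.2)
aug_main <- augment(lm(y ~ x, data = d))

p <- ggplot(aug_main, aes(y = .resid, x = .fitted)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray") +
  labs(x = "Predicted price", y = "Residuals")
p
```

A patternless band of points around the dashed line suggests the linearity and constant-variance conditions are reasonable.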
If we want theory-based inferences about model parameters, we also need to check normality.
Q - Check if the normality assumption is plausible for Model 1 using the augmented result.
If the cars in our data were completely randomly selected, then the independence assumption also holds, as the error for one car wouldn’t inform us of the error for another car.
Q - Repeat the same exercises for Model 2.