Clone the repository entitled “ae22-GitHubUsername” from the course GitHub organization page into your RStudio.
Open the .Rmd file and replace “Your Name” with your name.
# load packages and set a ggplot theme
library(tidyverse)
theme_set(theme_bw())
library(tidymodels)

# read in the data
car_prices <- read_csv("data/car_prices.csv")
Based on the plot below, we suspect that there might be an interaction effect between car age and car type in explaining selling price.
We will consider the following two models and compare their fit to the data:

- Model 1 (main effects only): price ~ age + type
- Model 2 (with interaction): price ~ age + type + age*type
We call Age*Type an interaction term, defined as the product of age and type. The effect of an interaction term on the dependent variable is called an “interaction effect.” On the other hand, a “main effect” represents the effect of a single independent variable on the dependent variable. Since Model 1 includes only the main effects of age and type, it is called a main-effects-only model.
Q - How many more terms do you expect to appear in Model 2 compared to Model 1? Hint: there are three car types (Accord, Maxima, and Mazda6), among which one is used as a baseline.
Q - What do we assume about the relationship between car age and selling price for different car types in Model 1 and in Model 2?
Q - Use appropriate functions to find the fitted models and display the results in tidy format. Hint: y ~ x1*x2 fits two main effects for x1 and x2 plus their interaction effect.
# model 1
# model 2
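If you need a starting point, here is a minimal sketch using lm() with tidy() from broom (both loaded with tidymodels). It assumes car_prices has columns named price, age, and type, and the object names lm_main and lm_int are just placeholders:

# a sketch, assuming car_prices has columns price, age, and type
# model 1: main effects only
lm_main <- lm(price ~ age + type, data = car_prices)
tidy(lm_main)

# model 2: main effects plus their interaction
lm_int <- lm(price ~ age * type, data = car_prices)
tidy(lm_int)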
Q - Write out the equation of the fitted models with variable names.
Now we will interpret Models 1 and 2.
Q - What is the relationship between selling price and age for Maximas / Mazda6 / Accords in Model 1? What about in Model 2?
Model 1:
Model 2:
Q - Compare adjusted \(R^2\) between models. Which model do you prefer?
# model 1
# model 2
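One way to extract the adjusted \(R^2\) is glance(); this sketch reuses the hypothetical lm_main and lm_int objects from above:

# adjusted R^2 for each model
glance(lm_main)$adj.r.squared
glance(lm_int)$adj.r.squared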
Model 2 has an interaction term, which is the product of two independent variables… Huh? Wait, is Model 2 still a linear model?
The answer is yes: Model 2 is a linear model. In linear regression, what must be linear are the parameters, the \(\beta\)’s, not the data. As long as the model is linear in its parameters, you can transform the data however you like.
For instance, models such as the following are all linear:
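\[ y = \beta_0 + \beta_1 x^2 + \epsilon, \qquad y = \beta_0 + \beta_1 \log(x) + \epsilon, \qquad y = \beta_0 + \beta_1 x_1 x_2 + \epsilon \]

Each of these is linear in the \(\beta\)’s, even though the predictors have been transformed.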
But models such as these are not linear:
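\[ y = \beta_0 + e^{\beta_1 x} + \epsilon, \qquad y = \beta_0 + x^{\beta_1} + \epsilon \]

Here \(\beta_1\) enters the model nonlinearly, and no transformation of the data alone can make these linear in the parameters.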
Q - (Optional) Consider the dataset mystery_data.
mystery_data <- read_csv("data/lm_example.csv")
mystery_data %>%
ggplot(aes(x = x, y = y)) +
geom_point() +
geom_smooth(method = 'lm') +
labs(title = "Naive linear model is a poor choice!")
1. Write the equation of the line above (by finding the fitted model). What is the \(R^2\) associated with the fitted line?
2. Which linear model do you think will better describe the relationship between \(x\) and \(y\) above?
3. Transform the predictor \(x\) according to your guess in (2) and plot \(y\) versus the transformed predictor below.
4. Fit a linear model with the transformed \(x\). Find the \(R^2\) of the new model and compare. (One possible workflow is sketched after this list.)
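If you need a starting point for (3) and (4), here is a sketch; the log transformation is only a placeholder for whatever you guessed in (2):

# sketch: suppose the guess in (2) is a log transformation (assumes x > 0)
mystery_data %>%
  mutate(x_trans = log(x)) %>%
  ggplot(aes(x = x_trans, y = y)) +
  geom_point() +
  geom_smooth(method = 'lm')

# fit the model with the transformed predictor and extract R^2
lm_trans <- lm(y ~ log(x), data = mystery_data)
glance(lm_trans)$r.squared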
Let’s revisit one of the models from the latest AE:
\[price = \beta_0 + \beta_1~age + \beta_2~mileage + \epsilon \]
Q - To remind ourselves, fit the above model and find \(\hat{\beta}_0\), \(\hat{\beta}_1\), and \(\hat{\beta}_2\).
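One possible way to fit it, assuming car_prices has columns named price, age, and mileage, and storing the result as lm_pam to match the code later in this AE:

# sketch, assuming columns price, age, and mileage in car_prices
lm_pam <- lm(price ~ age + mileage, data = car_prices)
tidy(lm_pam)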
Beyond these point estimates, we can find a range of plausible values for the \(\beta\)’s.
Let’s build a theory-based CI for the regression parameters. In order to make theory-based inferences, we need to assume the errors are independent and identically distributed (iid) normal: \(\epsilon \stackrel{iid}{\sim} N(0, \sigma^2)\). (So far, we haven’t assumed normality!)
Independence means knowing the error term for one observation doesn’t tell you anything about the error term for another observation.
Then a \((1-\alpha)\times 100\)% confidence interval for \(\beta\) is obtained as

\[ \hat{\beta} \pm t^*_{n-p} \times \hat{se}, \]

where

- \(\hat{\beta}\) is the estimate for \(\beta\) in the regression output,
- \(\hat{se}\) is the std.error of \(\hat{\beta}\), and
- \(t^*_{n-p}\) is the critical value from a \(t\)-distribution with \(n - p\) degrees of freedom (\(n\) observations, \(p\) estimated parameters).

Notice the similar structure to CLT-based confidence intervals for a population mean.
Q - Manually compute a 95% CI for \(\beta_2\).
# df for t-distribution
# beta_hat
# se_hat
# tstar
# CI
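Here is one way to fill in the steps, reusing the hypothetical lm_pam fit from above:

# df for t-distribution: n observations, p = 3 estimated parameters
n <- nrow(car_prices)
p <- 3

# beta_hat and se_hat for the mileage coefficient
est <- tidy(lm_pam) %>% filter(term == "mileage")
beta_hat <- est$estimate
se_hat <- est$std.error

# tstar: critical value for a 95% CI
tstar <- qt(0.975, df = n - p)

# CI
c(beta_hat - tstar * se_hat, beta_hat + tstar * se_hat)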
A tidy way of constructing the same CI is the following:
lm_pam %>%
tidy(conf.int = TRUE, conf.level = 0.95) %>%
filter(term == "mileage") %>%
select(conf.low, conf.high)
Q - Interpret the confidence interval for \(\beta_2\).
Q - Create a 90% confidence interval for \(\beta_1\) manually and using tidymodels functions, and interpret the interval in the context of the data.
# beta_hat
# se_hat
# tstar
# CI
# tidy
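A sketch, assuming the age coefficient appears as age in the tidy output and reusing n and p from above:

# beta_hat and se_hat for the age coefficient
est <- tidy(lm_pam) %>% filter(term == "age")
beta_hat <- est$estimate
se_hat <- est$std.error

# tstar: for a 90% CI, alpha = 0.10, so we need the 0.95 quantile
tstar <- qt(0.95, df = n - p)

# CI
c(beta_hat - tstar * se_hat, beta_hat + tstar * se_hat)

# tidy version
lm_pam %>%
  tidy(conf.int = TRUE, conf.level = 0.90) %>%
  filter(term == "age") %>%
  select(conf.low, conf.high)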
Based on the point estimate and the confidence interval for \(\beta_2\) in Part 3, we may want to determine if \(\beta_2\) is significantly different from 0.
We can set it up as a hypothesis test, with the hypotheses below:
\(H_0\): \(\beta_2 = 0\). There is no relationship between price and mileage.

\(H_1\): \(\beta_2 \neq 0\). There is a relationship between price and mileage.
We reject \(H_0\) in favor of \(H_1\) if the data provide strong evidence that the true slope \(\beta_2\) is different from 0.
Again, we assume \(\epsilon \stackrel{iid}{\sim} N(0, \sigma^2)\).
Then our test statistic is \[ t = \frac{\hat{\beta}_2 - 0}{\hat{se}}, \] and its null distribution is a t-distribution with \(n-p\) degrees of freedom.
Q - Using the equation above, compute the p-value.
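As a sketch, reusing the hypothetical lm_pam fit and the n and p defined earlier:

# two-sided p-value for beta_2 (mileage)
est <- tidy(lm_pam) %>% filter(term == "mileage")
t_stat <- est$estimate / est$std.error
2 * pt(abs(t_stat), df = n - p, lower.tail = FALSE)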
R takes care of much of this behind the scenes with the tidy output and reports a p-value for each \(\beta\) by default.
lm_pam %>%
tidy()
Q - Is \(\beta_2\) significant at the \(\alpha = 0.05\) level? State your conclusion.
Q - What about \(\beta_0\) or \(\beta_1\)? Are they significant at the \(\alpha = 0.05\) level?
After fitting models, we should examine if certain model conditions hold. If they do, our inferences based on the fitted models are reliable. Otherwise, we should consider fitting different models.
We will examine residuals vs. predicted values to check the following conditions:

- Linearity: the residuals should scatter randomly around 0, with no systematic pattern.
- Constant variance: the spread of the residuals should be roughly the same across all predicted values.
While we can manually compute residuals and predicted values from the regression output, this time we will use the augment() function from the tidymodels package.
We first fit a model and pipe it into the augment() function. For instance, for Model 1,
aug_main <- ____ %>% # fitted model
augment()
aug_main %>%
head()
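For reference, a completed version might look like this, assuming Model 1 was stored as the hypothetical lm_main object from earlier:

# sketch: augment the lm_main fit
aug_main <- lm_main %>% # fitted model
  augment()
aug_main %>%
  head()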
We will focus on two output values from the augmented results:

- .fitted: the fitted (predicted) value
- .resid: the difference between the observed and fitted values

For more details, run ?augment.

Q - Create a scatterplot of residuals vs. predicted values using the augmented results above. To aid the visual assessment, add a horizontal gray dashed line at 0.
aug_main %>%
ggplot(aes(y = _____, x = _____)) +
geom_point() +
geom_hline(yintercept = __, linetype = _____, color = ____) +
labs(x = "Predicted price", y = "Residuals")
If we want theory-based inferences about model parameters, we also need to check normality.
Q - Check if the normality assumption is plausible for Model 1 using the augmented result.
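One common approach is a histogram or a normal quantile-quantile plot of the residuals; for example:

# sketch: histogram of residuals
aug_main %>%
  ggplot(aes(x = .resid)) +
  geom_histogram(bins = 20) +
  labs(x = "Residuals")

# sketch: normal Q-Q plot of residuals
aug_main %>%
  ggplot(aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line()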
If the cars in our data are completely randomly selected, then the independence assumption also holds, as the error for one car wouldn’t inform us about the error for another car.
Q - Repeat the same exercises for Model 2.