Getting Started

Used Car Prices

library(tidyverse)
theme_set(theme_bw())
library(tidymodels)
car_prices <- read_csv("data/car_prices.csv")

Part 1: Price vs. Age + Type + Age*Type

Based on the plot below, we suspect that there might be an interaction effect between car age and car type in explaining selling price.
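A sketch of such a plot, assuming the columns in car_prices are named price, age, and type (column names are not shown in this AE):

# scatterplot of price vs. age with a separate fitted line per type
car_prices %>% 
  ggplot(aes(x = age, y = price, color = type)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) + 
  labs(x = "Age (years)", y = "Selling price", color = "Type")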

We will consider the following two models and compare their fit to the data.

  • Model 1: Price vs. Age + Type
  • Model 2: Price vs. Age + Type + Age*Type

We call Age*Type an interaction term, defined as the product of age and type. The effect of an interaction term on the dependent variable is called an “interaction effect”.

On the other hand, “main effects” represent the effect of a single independent variable on the dependent variable. Since Model 1 includes only the main effects of age and type, it is called a main-effects-only model.

Q - How many more terms do you expect to appear in Model 2 compared to Model 1? Hint: there are three car types (Accord, Maxima, and Mazda6) among which one is used as a baseline.

Q - What do we assume about a relationship between car age and selling price by different car types in Model 1 and Model 2?

  • Model 1:
  • Model 2:

Q - Use appropriate functions to find the fitted models and display the results in tidy format. Hint: y ~ x1*x2 fits the two main effects of x1 and x2 along with their interaction.

# model 1

# model 2
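One possible completion, assuming the price, age, and type columns from the plot sketch above; tidy() comes from broom, which is loaded with tidymodels. The object names lm_main and lm_int are chosen here so later sketches can reuse them:

# model 1: main effects only
lm_main <- lm(price ~ age + type, data = car_prices)
tidy(lm_main)

# model 2: main effects plus the interaction
lm_int <- lm(price ~ age * type, data = car_prices)
tidy(lm_int)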

Q - Write out the equations of the fitted models with variable names.

  • Model 1
  • Model 2

Now we will interpret Models 1 and 2.

Q - What is the relationship between selling price and age for Maximas, Mazda6s, and Accords in Model 1? What about in Model 2?

Model 1:

  • Maxima:
  • Mazda6:
  • Accord:

Model 2:

  • Maxima:
  • Mazda6:
  • Accord:

Q - Compare adjusted \(R^2\) between models. Which model do you prefer?

# model 1

# model 2
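One possible completion, using the lm_main and lm_int fits sketched in Part 1; glance() (also from broom) reports adjusted \(R^2\) in the adj.r.squared column:

# adjusted R^2 for each model
glance(lm_main)$adj.r.squared
glance(lm_int)$adj.r.squared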

Part 2: Linearity in linear regression

Model 2 has an interaction term, which includes the product of two independent variables… Huh? Wait, is Model 2 a linear model?

The answer is yes: Model 2 is a linear model. In linear regression, what must be linear are the parameters, the \(\beta\)’s, not the data. As long as your model is linear in its parameters, you can transform the data however you like.

For instance, the following models are all linear.

  • \(y = \beta_0 + \beta_1 x + \epsilon\)
  • \(y = \beta_0 + \beta_1 x_1 x_2 + \epsilon\)
  • \(y = \beta_0 + \beta_1 x^2 + \epsilon\)
  • \(y = \beta_0 + \beta_1 x^2 + \beta_2 \sqrt{x} + \beta_3 x^3 + \epsilon\)

But these are not linear models:

  • \(y = \beta_0\beta_1 x + \epsilon\)
  • \(y = \beta_1^2 x + \epsilon\)
  • \(y = \beta_0 + \exp(\beta_1) x + \epsilon\)
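In R, such transformed-but-still-linear models can be fit directly with lm() by wrapping arithmetic transformations in I(). A minimal sketch with made-up data, just to show the mechanics:

# linear in the betas even though x is transformed (d is made-up data)
d <- tibble(x = runif(100, 1, 10), y = rnorm(100))
lm(y ~ I(x^2) + sqrt(x) + I(x^3), data = d)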

Q - (Optional) Consider the dataset mystery_data.

mystery_data <- read_csv("data/lm_example.csv")
mystery_data %>%
  ggplot(aes(x = x, y = y)) + 
  geom_point() +
  geom_smooth(method = 'lm') +
  labs(title = "Naive linear model is a poor choice!")

  1. Write the equation of the line above (by finding the fitted model). What is the \(R^2\) associated with the fitted line?

  2. Which linear model do you think will better describe the relationship between \(x\) and \(y\) above?

  3. Transform the predictor \(x\) as your guess in (2) and plot \(y\) versus the transformed predictor below.

  4. Fit a linear model with the transformed \(x\). Find \(R^2\) of the new model and compare (one possible walk-through is sketched below).
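A possible walk-through of steps 1-4, assuming for illustration that a log transform is appropriate; the right transformation depends on the actual shape of mystery_data:

# step 1: naive fit and its R^2
lm_naive <- lm(y ~ x, data = mystery_data)
glance(lm_naive)$r.squared

# step 3: plot y against the transformed predictor (log assumed, requires x > 0)
mystery_data %>% 
  ggplot(aes(x = log(x), y = y)) + 
  geom_point() + 
  geom_smooth(method = "lm")

# step 4: refit with the transformed x and compare R^2
lm_trans <- lm(y ~ log(x), data = mystery_data)
glance(lm_trans)$r.squared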

Part 3: CI for parameters in a regression framework

Let’s revisit one of the models from the latest AE:

\[price = \beta_0 + \beta_1~age + \beta_2~mileage + \epsilon \]

Q - To remind ourselves, fit the above model and find \(\hat{\beta}_0\), \(\hat{\beta}_1\), and \(\hat{\beta}_2\).

  • \(\hat{\beta}_0 = ?\)
  • \(\hat{\beta}_1 = ?\)
  • \(\hat{\beta}_2 = ?\)
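A sketch of the fit; lm_pam is the object name used later in this AE, and the columns price, age, and mileage are assumed to match the equation above:

# fit price ~ age + mileage and read off the coefficient estimates
lm_pam <- lm(price ~ age + mileage, data = car_prices)
tidy(lm_pam)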

Beyond these point estimates, we can find some plausible values for \(\beta\)’s.

Let’s build a theory-based CI for the regression parameters. In order to make theory-based inferences, we need to assume the errors are independent and identically distributed (iid) with \(\epsilon \sim N(0, \sigma^2)\). (So far, we hadn’t assumed normality!)

Independence means knowing the error term for one observation doesn’t tell you anything about the error term for another observation.

Then a \((1-\alpha)*100\)% confidence interval for \(\beta\) is obtained as

\[ \hat{\beta} \pm t^*_{n-p} \times \hat{se}\] where

  • \(\hat{\beta}\) is the estimate for \(\beta\) in the regression
  • \(\hat{se}\) is the estimated std.error of \(\hat{\beta}\)
  • \(t^*_{n-p}\) is the \(1-\alpha/2\) quantile of a t-distribution with \(n-p\) degrees of freedom. The number of observations is \(n\), and the number of parameters in the model is \(p\); for instance, in the above model, \(p=3\).

Notice the similar structure to CIs for a population mean using the CLT.

Q - Manually compute a 95% CI for \(\beta_2\).

# df for t-distribution

# beta_hat

# se_hat

# tstar

# CI 
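One possible completion, reading \(\hat{\beta}_2\) and its standard error for mileage out of the tidy() output:

# df for t-distribution: n observations, p = 3 parameters
n <- nrow(car_prices)
df <- n - 3

# beta_hat and se_hat for the mileage coefficient
est <- tidy(lm_pam) %>% filter(term == "mileage")
beta_hat <- est$estimate
se_hat <- est$std.error

# tstar: the 1 - alpha/2 = 0.975 quantile of t with n - p df
tstar <- qt(0.975, df)

# CI
c(beta_hat - tstar * se_hat, beta_hat + tstar * se_hat)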

A tidy way of constructing the same CI is the following:

lm_pam %>% 
  tidy(conf.int = TRUE, conf.level = 0.95) %>% 
  filter(term == "mileage") %>% 
  select(conf.low, conf.high)

Q - Interpret the confidence interval for \(\beta_2\).

Q - Create a 90% confidence interval for \(\beta_1\) manually and using tidymodels functions and interpret the interval in the context of data.

# beta_hat

# se_hat

# tstar

# CI 

# tidy
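A possible completion for a 90% interval (\(\alpha = 0.10\), so tstar is the 0.95 quantile), assuming the age coefficient is labeled "age" in the tidy output:

# manual CI
df <- nrow(car_prices) - 3
est <- tidy(lm_pam) %>% filter(term == "age")
tstar <- qt(0.95, df)
c(est$estimate - tstar * est$std.error, est$estimate + tstar * est$std.error)

# tidy CI
lm_pam %>% 
  tidy(conf.int = TRUE, conf.level = 0.90) %>% 
  filter(term == "age") %>% 
  select(conf.low, conf.high)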

Part 4: HT for parameters in a regression framework

Based on the point estimate and the confidence interval for \(\beta_2\) in Part 3, we may want to determine if \(\beta_2\) is significantly different from 0.

We can set it up as a hypothesis test, with the hypotheses below:

  • \(H_0\): \(\beta_2 = 0\). There is no relationship between price and mileage.
  • \(H_1\): \(\beta_2 \neq 0\). There is a relationship between price and mileage.

We reject \(H_0\) in favor of \(H_1\) if the data provide strong evidence that the true slope \(\beta_2\) is different from 0.

Again, we assume \(\epsilon \stackrel{iid}{\sim} N(0, \sigma^2)\).

Then our test statistic is \[ t = \frac{\hat{\beta}_2 - 0}{\hat{se}}, \] and its null distribution is a t-distribution with \(n-p\) degrees of freedom.

Q - Using the equation above, compute the p-value.
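A sketch of the computation: the t statistic divides \(\hat{\beta}_2\) by its standard error (from Part 3), and the two-sided p-value doubles the upper tail of the null t-distribution:

# t statistic and two-sided p-value for mileage
est <- tidy(lm_pam) %>% filter(term == "mileage")
t_stat <- est$estimate / est$std.error
2 * pt(abs(t_stat), df = nrow(car_prices) - 3, lower.tail = FALSE)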

R takes care of much of this behind the scenes and reports a p-value for each \(\beta\) by default in the tidy() output.

lm_pam %>% 
  tidy()

Q - Is \(\beta_2\) significant at the \(\alpha = 0.05\) level? State your conclusion.

Q - What about \(\beta_0\) or \(\beta_1\)? Are they significant at the \(\alpha = 0.05\) level?

Part 5: Model diagnostics

After fitting models, we should examine if certain model conditions hold. If they do, our inferences based on the fitted models are reliable. Otherwise, we should consider fitting different models.

We will examine residuals vs. predicted values to check the following conditions:

  • Linearity: There is a linear relationship between the response and predictor variables.
  • Equal variance: The variability of the errors is equal for all values of the predictor variable.

While we can manually compute residuals and predicted values from the regression output, this time we will use the augment() function from the broom package, loaded as part of tidymodels.

We first fit a model and pipe it into the augment() function. For instance, for Model 1,

aug_main <- ____ %>% # fitted model 
  augment() 
aug_main %>% 
  head()
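One possible completion of the template above, using the lm_main fit sketched in Part 1:

aug_main <- lm_main %>% # fitted model for Model 1
  augment() 
aug_main %>% 
  head()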

We will focus on two output values from the augmented results:

  • .fitted: the fitted (predicted) value
  • .resid: the residual, i.e., the difference between the observed and fitted values

For information on more outputs, try ?augment.

Q - Create a scatterplot of residuals vs. predicted values using the augmented results above. To aid the visual assessment, add a horizontal gray dashed line at 0.

  • Does the linearity condition hold?
  • Does the equal variance condition hold?

aug_main %>% 
  ggplot(aes(y = _____, x = _____)) +
  geom_point() + 
  geom_hline(yintercept = __, linetype = _____, color = ____) + 
  labs(x = "Predicted price", y = "Residuals")
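A possible completion of the template above:

aug_main %>% 
  ggplot(aes(y = .resid, x = .fitted)) +
  geom_point() + 
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray") + 
  labs(x = "Predicted price", y = "Residuals")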

If we want theory-based inferences about model parameters, we also need to check normality.

  • Normality: The errors follow a normal distribution.

Q - Check if the normality assumption is plausible for Model 1 using the augmented result.
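One common check is a normal QQ plot of the residuals (a histogram of .resid works too); a sketch using the augmented results:

# residuals should fall close to the reference line if normality is plausible
aug_main %>% 
  ggplot(aes(sample = .resid)) + 
  geom_qq() + 
  geom_qq_line() + 
  labs(title = "Normal QQ plot of residuals")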

If the cars in our data were completely randomly selected, then the independence assumption also holds, as the error for one car wouldn’t inform us about the error for another car.

Q - Repeat the same exercises for Model 2.

Submitting Application Exercises

  • Once you have completed the activity, push your final changes to your GitHub repo.
  • Make sure you have committed at least three times.
  • Check that your repo is updated on GitHub, and that’s all you need to do to submit application exercises for participation.