Clone the repository named “ae21-GitHubUsername” from the course GitHub organization into RStudio. Open the .Rmd file and replace “Your Name” with your name.
library(tidyverse)
theme_set(theme_bw())
library(tidymodels)
library(plotly)
car_prices <- read_csv("data/car_prices.csv")
We will continue examining the used cars dataset with the following variables:

- type: Model (Accord, Maxima, Mazda6)
- age: Age of the used car (in years)
- price: Price (in thousands of dollars)
- mileage: Previous miles driven (in thousands of miles)

We have worked on linear models with a single predictor. But there is no reason we couldn’t have more predictors. For example, suppose we model price using both age and mileage as predictors.
An associated linear model would be
\[ y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \epsilon \]

where \(y\) is price, \(x_1\) is age, and \(x_2\) is mileage.

Linear regression with multiple (\(\geq\) 2) predictors is called “multiple regression” or “multiple linear regression”.
Consider the above regression model with the response price and predictors age and mileage.
Q - Use appropriate functions to find the fitted model and display the results in tidy format. Hint: multiple predictors are added in the regression formula with +.
## option 1
## option 2
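One way the two options might be filled in, assuming the tidymodels and base-R interfaces covered in class (the object names `price_fit` and `price_lm` are placeholders, not names from the exercise):

```r
## option 1: tidymodels interface
price_fit <- linear_reg() %>%
  fit(price ~ age + mileage, data = car_prices)
tidy(price_fit)

## option 2: base R lm(), tidied with broom
price_lm <- lm(price ~ age + mileage, data = car_prices)
tidy(price_lm)
```

Both approaches fit the same model; `tidy()` returns the coefficient table as a tibble either way.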
Q - Write out the equation of the fitted model with variable names, and interpret the slope and intercept in the context of data.
- age: All else held constant,
- mileage:

Q - What is the predicted selling price of a 5-year-old car with 45,000 miles?
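A sketch of the prediction, assuming the fitted model is stored in an object such as `price_lm` (a placeholder name). Note that mileage is recorded in thousands, so 45,000 miles enters as 45:

```r
## predicted price for a 5-year-old car with 45,000 miles
new_car <- tibble(age = 5, mileage = 45)
predict(price_lm, newdata = new_car)
```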
Q - Five cars in the data are actually 5 years old. What are the residuals associated with these observations? What do negative / positive residuals mean? Remember the residual for the \(i\)th observation is \[e_i = y_i - \hat{y}_i.\]
car_prices %>%
filter(____) %>%
mutate(pred = ____) %>%
mutate(resid = ____)
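One way the blanks in the template might be filled in, assuming the fitted model object is called `price_lm` (an assumption; use whatever name you created above):

```r
## observations that are 5 years old, with predictions and residuals
cars_5yo <- car_prices %>%
  filter(age == 5)

cars_5yo %>%
  mutate(pred = predict(price_lm, newdata = cars_5yo)) %>%
  mutate(resid = price - pred)
```

A negative residual means the observed price is below the model’s prediction (the model over-predicts); a positive residual means the observed price is above it.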
After fitting a model, we may wonder how “good” our model is. This depends on how much of the variability in the response variable is explained by the explanatory variables, which is summarized by a statistic called \(R^2\). \(R^2\) is the percentage of variability in the response variable explained by the model.
In mathematical terms,
\[ R^2 = 1 - \frac{\sum_i^n e_i^2}{\sum_i^n (y_i - \bar{y})^2} \]
where \(\bar{y}\) is the sample mean of \(y_1, \dots, y_n\).
In words,
\[ R^2 = 1 - \frac{\text{sum of squared residuals}}{\text{sum of squared deviations}} \]
Let’s focus on the second term, the fraction, to build intuition.
The numerator “sum of squared residuals” is a measure of how wrong our model is (the amount of variability not explained by the model).
The denominator is proportional to the sample variance i.e., the amount of variability in the data. With the sample variance denoted by \(S^2\), we have \(\text{sum of squared deviations} = \sum_i^n (y_i - \bar{y})^2 = (n-1)S^2\).
Together, the fraction represents the proportion of variability not explained by the model.
\(R^2\) is 1 minus the fraction. Therefore, it’s the proportion of variability explained by the model.
If the sum of squared residuals is 0, then the model is perfect and \(R^2 = 1 - 0 = 1\).
If the sum of squared residuals equals all the variability in the data, then the model explains no variability at all, and \(R^2 = 1 - 1 = 0\).
Final take-away: \(R^2\) is a measure of the proportion of variability the model explains. An \(R^2\) of 0 indicates a poor fit, and an \(R^2\) of 1 a perfect fit.
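To connect the formula to code, \(R^2\) can be computed by hand and compared with the value R reports. The model here, `price ~ age`, is just an illustrative choice:

```r
## compute R^2 directly from its definition
fit <- lm(price ~ age, data = car_prices)

ss_resid <- sum(residuals(fit)^2)                               # sum of squared residuals
ss_total <- sum((car_prices$price - mean(car_prices$price))^2)  # sum of squared deviations

1 - ss_resid / ss_total   # matches summary(fit)$r.squared
```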
Q - Compute \(R^2\) of the following models we fitted last time. Based on the \(R^2\) statistics, which model is better? Hint: use glance to construct a single-row summary of a model.
## model 1
## model 2
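A sketch using `glance()`, under the assumption that Models 1 and 2 were the single-predictor fits `price ~ age` and `price ~ mileage` (adjust the formulas to whatever you actually fit last time):

```r
## model 1
model1 <- lm(price ~ age, data = car_prices)
glance(model1)$r.squared

## model 2
model2 <- lm(price ~ mileage, data = car_prices)
glance(model2)$r.squared
```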
Q - We now wonder how good today’s model from Part 1 is. Before writing any code, do you think \(R^2\) will increase, decrease or stay the same? Why?
Q - Report \(R^2\) of Model 3. Based on the \(R^2\) statistics, how does it compare to Model 1 and 2? Which one is the best model?
## model 3
Suppose a completely irrelevant variable is added to the dataset. No matter how irrelevant it is, including one more variable can only increase the amount of variability explained, or at worst leave it unchanged. Therefore, \(R^2\) never decreases as more variables are added, which is not desirable.
Q - (Optional) Create a dataset car_prices2 by mutating a silly variable from car_prices. It can be any variable. Be creative! Then fit a multiple linear model for price vs. age, mileage, and silly. Report \(R^2\) and verify that it doesn’t decrease.
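One possible silly variable, sketched under the assumption that any constructed column will do:

```r
## a column of pure noise, unrelated to price
set.seed(1)
car_prices2 <- car_prices %>%
  mutate(silly = rnorm(n()))

silly_fit <- lm(price ~ age + mileage + silly, data = car_prices2)
glance(silly_fit)$r.squared   # compare with the R^2 of price ~ age + mileage
```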
For the reason described in Part 2, we should use adjusted \(R^2\) to compare multiple regression models.
In mathematical terms, the adjusted \(R^2\) is
\[ 1 - (1 - R^2) \frac{n-1}{n - k - 1} \]
where \(n\) is the number of observations (in the data) and \(k\) is the number of predictors (in the model).
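As a quick numeric check of the formula, with made-up values \(R^2 = 0.70\), \(n = 90\), and \(k = 2\):

```r
## adjusted R^2 by hand (illustrative numbers, not from the data)
r2 <- 0.70
n  <- 90
k  <- 2
1 - (1 - r2) * (n - 1) / (n - k - 1)   # approximately 0.693
```

Because \((n - 1) / (n - k - 1) > 1\) whenever \(k \geq 1\), the adjusted \(R^2\) is always at most \(R^2\), and the gap grows as predictors are added.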
Q - Report adjusted \(R^2\) values for Models 1-3. Based on these, determine which one is the best model.
## model 1
## model 2
## model 3
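`glance()` also returns the adjusted \(R^2\). A self-contained sketch, assuming the three models are `price ~ age`, `price ~ mileage`, and `price ~ age + mileage`:

```r
model1 <- lm(price ~ age, data = car_prices)
model2 <- lm(price ~ mileage, data = car_prices)
model3 <- lm(price ~ age + mileage, data = car_prices)

glance(model1)$adj.r.squared
glance(model2)$adj.r.squared
glance(model3)$adj.r.squared
```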
We now consider a regression model with the response price versus a numeric variable age and a categorical variable type (Accord, Maxima, Mazda6).
Last time, we created the plot below.
car_prices %>%
ggplot(aes(x = age, y = price, color = type)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "Car age (in years)",
y = "Price (in thousands of dollars)",
color = "Model") +
scale_color_viridis_d() +
theme(aspect.ratio = .6)
Q - From the visualization, do you think the relationship between age and price depends on the model of a car?
Q - Fit a linear model (Model 4) for price with age and type as predictors. Write out the equation of the fitted model with variable names. Interpret the coefficients in the context of data. Report either \(R^2\) or adjusted \(R^2\) to compare with the previous models (Models 1-3).
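A sketch of Model 4, assuming the same `lm()` workflow as before:

```r
model4 <- lm(price ~ age + type, data = car_prices)
tidy(model4)
glance(model4)$adj.r.squared
```

R encodes the categorical `type` with indicator variables, using its first level (Accord, alphabetically) as the baseline, so the `typeMaxima` and `typeMazda6` coefficients are price differences relative to an Accord of the same age.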
Interpretations

- age:
- type = Maxima:
- type = Mazda6:

Q - In the above visualization, lines for Accord and Mazda6 seem parallel, while the slope of the line for Maxima is different. What do you think this suggests?