Clone the repository named “ae21-GitHubUsername” from the course GitHub organization into RStudio. Open the .Rmd file and replace “Your Name” with your name.
library(tidyverse)
theme_set(theme_bw())
library(tidymodels)
library(plotly)
car_prices <- read_csv("data/car_prices.csv")
We will continue examining the used cars dataset with the following variables:

- type: Model (Accord, Maxima, Mazda6)
- age: Age of the used car (in years)
- price: Price (in thousands of dollars)
- mileage: Previous miles driven (in thousands of miles)

We have worked on linear models with a single predictor. But there is no reason we couldn’t have more predictors. For example, suppose we model price using both age and mileage as predictors.
An associated linear model would be
\[ y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \epsilon \]

where \(y\) is price, \(x_1\) is age, and \(x_2\) is mileage.

Linear regression with multiple (\(\geq\) 2) predictors is called “multiple regression” or “multiple linear regression”.
Consider the above regression model with the response price and predictors age and mileage.
Q - Use appropriate functions to find the fitted model and display the results in tidy format. Hint: multiple predictors are added in the regression formula with +.
## option 1
## option 2
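One way the two options might be filled in, assuming the tidymodels and base-R interfaces covered in class (the object names `price_fit` and `price_lm` are placeholders, not names from the exercise):

```r
## option 1: tidymodels interface
price_fit <- linear_reg() %>%
  fit(price ~ age + mileage, data = car_prices)
tidy(price_fit)

## option 2: base R lm(), tidied with broom
price_lm <- lm(price ~ age + mileage, data = car_prices)
tidy(price_lm)
```

Both approaches fit the same model; `tidy()` returns the coefficient table as a tibble either way.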
Q - Write out the equation of the fitted model with variable names, and interpret the slope and intercept in the context of data.
- age: All else held constant,
- mileage:

Q - What is the predicted selling price of a 5-year-old car with 45,000 miles?
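A sketch of the prediction, assuming the fitted model is stored in an object such as `price_lm` (a placeholder name). Note that mileage is recorded in thousands, so 45,000 miles enters as 45:

```r
## predicted price for a 5-year-old car with 45,000 miles
new_car <- tibble(age = 5, mileage = 45)
predict(price_lm, newdata = new_car)
```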
Q - Five cars in the data are actually 5 years old. What are the residuals associated with these observations? What do negative / positive residuals mean? Remember the residual for the \(i\)th observation is \[e_i = y_i - \hat{y}_i.\]
car_prices %>%
filter(____) %>%
mutate(pred = ____) %>%
mutate(resid = ____)
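One way the blanks in the template might be filled in, assuming the fitted model object is called `price_lm` (an assumption; use whatever name you created above):

```r
## observations that are 5 years old, with predictions and residuals
cars_5yo <- car_prices %>%
  filter(age == 5)

cars_5yo %>%
  mutate(pred = predict(price_lm, newdata = cars_5yo)) %>%
  mutate(resid = price - pred)
```

A negative residual means the observed price is below the model’s prediction (the model over-predicts); a positive residual means the observed price is above it.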
After fitting a model, we may wonder how “good” our model is. This depends on how much of the variability in the response variable is explained by the explanatory variables, which is summarized by a statistic called \(R^2\). \(R^2\) is the percentage of variability in the response variable explained by the model.
In mathematical terms,
\[ R^2 = 1 - \frac{\sum_i^n e_i^2}{\sum_i^n (y_i - \bar{y})^2} \]
where \(\bar{y}\) is the sample mean of \(y_1, \dots, y_n\).
In words,
\[ R^2 = 1 - \frac{\text{sum of squared residuals}}{\text{sum of squared deviations}} \]
Let’s focus on the second term, the fraction, to build intuition.
The numerator “sum of squared residuals” is a measure of how wrong our model is (the amount of variability not explained by the model).
The denominator is proportional to the sample variance i.e., the amount of variability in the data. With the sample variance denoted by \(S^2\), we have \(\text{sum of squared deviations} = \sum_i^n (y_i - \bar{y})^2 = (n-1)S^2\).
Together, the fraction represents the proportion of variability not explained by the model.
\(R^2\) is 1 minus the fraction. Therefore, it’s the proportion of variability explained by the model.
If the sum of squared residuals is 0, then the model is perfect and \(R^2 = 1 - 0 = 1\).
If the sum of squared residuals equals all the variability in the data, then the model explains no variability at all, and \(R^2 = 1 - 1 = 0\).
Final take-away: \(R^2\) is a measure of the proportion of variability the model explains. An \(R^2\) of 0 indicates a poor fit, and an \(R^2\) of 1 a perfect fit.
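To connect the formula to code, \(R^2\) can be computed by hand and compared with the value R reports. The model here, `price ~ age`, is just an illustrative choice:

```r
## compute R^2 directly from its definition
fit <- lm(price ~ age, data = car_prices)

ss_resid <- sum(residuals(fit)^2)                               # sum of squared residuals
ss_total <- sum((car_prices$price - mean(car_prices$price))^2)  # sum of squared deviations

1 - ss_resid / ss_total   # matches summary(fit)$r.squared
```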
Q - Compute \(R^2\) of the following models we fitted last time. Based on the \(R^2\) statistics, which model is better? Hint: use glance to construct a single-row summary of a model.
## model 1
## model 2
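A sketch using `glance()`, under the assumption that Models 1 and 2 were the single-predictor fits `price ~ age` and `price ~ mileage` (adjust the formulas to whatever you actually fit last time):

```r
## model 1
model1 <- lm(price ~ age, data = car_prices)
glance(model1)$r.squared

## model 2
model2 <- lm(price ~ mileage, data = car_prices)
glance(model2)$r.squared
```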
Q - We now wonder how good today’s model from Part 1 is. Before writing any code, do you think \(R^2\) will increase, decrease or stay the same? Why?
Q - Report \(R^2\) of Model 3. Based on the \(R^2\) statistics, how does it compare to Model 1 and 2? Which one is the best model?
## model 3
Suppose a completely irrelevant variable is added to the dataset. No matter how irrelevant it is, including one more variable can only increase the amount of variability explained, or at worst leave it unchanged. Therefore, \(R^2\) never decreases as more variables are added, which is not desirable.
Q - (Optional) Create a dataset car_prices2 by mutating a silly variable from car_prices. It can be any variable. Be creative! Then fit a multiple linear model for price vs. age, mileage, and silly. Report \(R^2\) and verify that it doesn’t decrease.
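One possible silly variable, sketched under the assumption that any constructed column will do:

```r
## a column of pure noise, unrelated to price
set.seed(1)
car_prices2 <- car_prices %>%
  mutate(silly = rnorm(n()))

silly_fit <- lm(price ~ age + mileage + silly, data = car_prices2)
glance(silly_fit)$r.squared   # compare with the R^2 of price ~ age + mileage
```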
For the reason described in Part 2, we should use adjusted \(R^2\) to compare multiple regression models.
In mathematical terms, the adjusted \(R^2\) is
\[ 1 - (1 - R^2) \frac{n-1}{n - k - 1} \]
where \(n\) is the number of observations (in the data) and \(k\) is the number of predictors (in the model).
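As a quick numeric check of the formula, with made-up values \(R^2 = 0.70\), \(n = 90\), and \(k = 2\):

```r
## adjusted R^2 by hand (illustrative numbers, not from the data)
r2 <- 0.70
n  <- 90
k  <- 2
1 - (1 - r2) * (n - 1) / (n - k - 1)   # approximately 0.693
```

Because \((n - 1) / (n - k - 1) > 1\) whenever \(k \geq 1\), the adjusted \(R^2\) is always at most \(R^2\), and the gap grows as predictors are added.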
Q - Report adjusted \(R^2\) values for Models 1-3. Based on these, determine which one is the best model.
## model 1
## model 2
## model 3
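`glance()` also returns the adjusted \(R^2\). A self-contained sketch, assuming the three models are `price ~ age`, `price ~ mileage`, and `price ~ age + mileage`:

```r
model1 <- lm(price ~ age, data = car_prices)
model2 <- lm(price ~ mileage, data = car_prices)
model3 <- lm(price ~ age + mileage, data = car_prices)

glance(model1)$adj.r.squared
glance(model2)$adj.r.squared
glance(model3)$adj.r.squared
```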
We now consider a regression model with the response price versus a numeric variable age and a categorical variable type (Accord, Maxima, Mazda6).
Last time, we created the plot below.
car_prices %>%
ggplot(aes(x = age, y = price, color = type)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "Car age (in years)",
y = "Price (in thousands of dollars)",
color = "Model") +
scale_color_viridis_d() +
theme(aspect.ratio = .6)
Q - From the visualization, do you think the relationship between age and price depends on the model of a car?
Q - Fit a linear model (Model 4) for price with age and type as predictors. Write out the equation of the fitted model with variable names. Interpret the coefficients in the context of data. Report either \(R^2\) or adjusted \(R^2\) to compare with the previous models (Models 1-3).
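A sketch of Model 4, assuming the same `lm()` workflow as before:

```r
model4 <- lm(price ~ age + type, data = car_prices)
tidy(model4)
glance(model4)$adj.r.squared
```

R encodes the categorical `type` with indicator variables, using its first level (Accord, alphabetically) as the baseline, so the `typeMaxima` and `typeMazda6` coefficients are price differences relative to an Accord of the same age.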
Interpretations

- age:
- type = Maxima:
- type = Mazda6:

Q - In the above visualization, lines for Accord and Mazda6 seem parallel, while the slope of the line for Maxima is different. What do you think this suggests?