Clone the repository named “ae20-GitHubUsername” from the course GitHub organization page into RStudio.
Open the .Rmd file and replace “Your Name” with your name.
library(tidyverse)
theme_set(theme_bw())
library(tidymodels)
The dataset car_prices.csv contains attributes of cars offered for sale on cars.com in 2017 [1]. The codebook is available below:
type: Model (Accord, Maxima, Mazda6)
age: Age of the used car (in years)
price: Price (in thousands of dollars)
mileage: Previous miles driven (in thousands of miles)
car_prices <- read_csv("data/car_prices.csv")
glimpse(car_prices)
## Rows: 90
## Columns: 4
## $ type <chr> "Mazda6", "Mazda6", "Mazda6", "Mazda6", "Mazda6", "Mazda6", "M…
## $ age <dbl> 3, 2, 1, 2, 2, 1, 2, 3, 3, 4, 4, 3, 3, 3, 3, 1, 7, 8, 6, 7, 10…
## $ price <dbl> 15.9, 16.4, 18.9, 16.9, 20.5, 19.0, 17.5, 18.0, 13.6, 12.0, 10…
## $ mileage <dbl> 17.8, 19.0, 20.9, 24.0, 24.0, 24.2, 30.1, 32.0, 34.8, 35.7, 49…
Consider a regression model with the response price and a single predictor mileage.
Q - Write out the equation of a model using parameters and variable names.
Q - Create a scatterplot of price and mileage. Do you see any patterns?
Q - Use appropriate functions to find the fitted model and display the results in tidy format.
## option 1
linear_reg(engine = "lm") %>%
  fit(____ ~ _____, data = ______) %>%
  tidy()
## option 2
lm_pm <- lm(____ ~ _____, data = ______) %>%
  tidy()
lm_pm
linear_reg(engine = "lm"): specify a linear regression model and the engine used to fit it (“lm” = linear model)
fit: fit the model, i.e., estimate its parameters. Write the formula as y ~ x, using variable names from the data argument.
tidy: construct a tidy data frame summarizing the model results
lm: base R function to fit a linear model. It uses the same formula syntax as fit.
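To make the function descriptions above concrete, here is a minimal sketch that runs both options on a small made-up data frame. The data frame toy and its columns x and y are hypothetical and are not part of car_prices; the sketch only assumes that the tidymodels package is installed.

```r
library(tidymodels)  # attaches parsnip, broom, tibble, and the pipe

# Hypothetical toy data, not taken from car_prices
toy <- tibble(
  x = c(1, 2, 3, 4, 5),
  y = c(2.1, 3.9, 6.2, 7.8, 10.1)
)

# Option 1: specify the model with parsnip, fit it, then tidy the result
fit_parsnip <- linear_reg(engine = "lm") %>%
  fit(y ~ x, data = toy) %>%
  tidy()

# Option 2: fit with base R lm(), then tidy the result
fit_base <- lm(y ~ x, data = toy) %>%
  tidy()

fit_parsnip
fit_base
```

Both options should report one row per model term (the intercept and x) with the same estimates, since the "lm" engine calls lm() under the hood.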
Q - Write out the equation of the fitted model, and interpret the slope and intercept in the context of the data.
Q - What is the predicted selling price of a car with 50,000 miles?
Q - Include a visualization of the linear model on the scatterplot we created above. Try the two options provided below and compare the resulting visualizations, noting any differences.
## option 1: intercept and slope
## option 2: predicted values
car_prices <- car_prices %>%
  mutate(pred = ____)
car_prices %>%
  ggplot() +
  geom_point(aes(x = mileage, y = price)) +
  geom_line(aes(x = _____, y = _____),
            size = 1.5, color = "red") +
  labs(x = "Mileage (in thousands of miles)",
       y = "Price (in thousands of dollars)")
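As a rough illustration of the two plotting approaches named above, the sketch below draws a fitted line on a made-up data set, once from the estimated intercept and slope and once from stored predicted values. The objects toy, toy_fit, p_abline, and p_line are hypothetical names chosen for this example; geom_abline() is one possible way to implement option 1, not necessarily the one intended by the exercise.

```r
library(ggplot2)

# Hypothetical toy data, not taken from car_prices
toy <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2.1, 3.9, 6.2, 7.8, 10.1)
)
toy_fit <- lm(y ~ x, data = toy)

# Option 1: draw the line from the estimated intercept and slope
p_abline <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  geom_abline(intercept = coef(toy_fit)[["(Intercept)"]],
              slope = coef(toy_fit)[["x"]],
              color = "red")

# Option 2: store the predicted values, then connect them with geom_line
toy$pred <- predict(toy_fit)
p_line <- ggplot(toy, aes(x = x)) +
  geom_point(aes(y = y)) +
  geom_line(aes(y = pred), color = "red")

p_abline
p_line
```

Printing the two plots side by side is a quick way to see how each approach treats the range of the x-axis.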
Q - Suppose one friend has a Honda Accord with 225,000 miles and another friend has a BMW with 80,000 miles. Is it appropriate to use this model to predict the selling price of each car? Why or why not?
Consider a regression model with the response price and the categorical predictor type (Accord, Maxima, Mazda6).
Q - Create side-by-side boxplots of price for each type. Comment on what you observe.
Q - Use appropriate functions to find the fitted model and display the results in tidy format. Write out the equation of the fitted model.
## option 1
## option 2
Q - How many terms are in the model for type? Is this equal to the number of car types in the dataset? If not, briefly explain why the number of terms for type in the model differs from the number of car types in the dataset.
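To inspect how R encodes a categorical predictor before answering, one option is to look at the design matrix directly. The sketch below uses a made-up data frame (toy, with a hypothetical three-level variable group) and base R only; it is meant as a way to explore the mechanics, not as the fitted model for car_prices.

```r
# Hypothetical toy data with a three-level categorical predictor
toy <- data.frame(
  group = c("a", "a", "b", "b", "c", "c"),
  y     = c(10, 12, 15, 17, 20, 22)
)

# Design matrix that R builds for y ~ group
X <- model.matrix(y ~ group, data = toy)
colnames(X)

# The fitted coefficients correspond to the design-matrix columns
toy_fit <- lm(y ~ group, data = toy)
names(coef(toy_fit))
```

The same two calls can be run on car_prices with price and type to see which columns R creates for the car models.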
Q - Interpret the intercept and slope(s) in the context of the problem.
Q - Create a scatterplot of price and age. Comment on what you observe.
Q - Add a fitted regression line for each type of car on top of the scatterplot of price and age. Comment on what you observe.
car_prices %>%
  ggplot(aes(x = ___, y = ___, color = ___)) +
  geom_point() +
  geom_XXXX(method = ___, se = FALSE) +
  labs(x = "Car age (in years)",
       y = "Price (in thousands of dollars)",
       color = "Model") +
  scale_color_viridis_d()
Q - What are possible limitations of the two regression models in Parts 1 and 2?
[1] The data are from the Stat2Data R package.