Clone the repository named “ae23-GitHubUsername” from the course GitHub organization page into your RStudio.
Open the .Rmd file and replace “Your Name” with your name.
Multiple linear regression allows us to relate a numerical response variable to one or more numerical or categorical predictors. We have used it to understand relationships and make predictions.
But what about a situation where the response of interest is binary?
Then we use logistic regression!
To illustrate logistic regression, we will build a spam filter from email data. Today’s data consist of 4,601 emails from George’s inbox, each classified as spam or non-spam. The data were collected at Hewlett-Packard Labs and contain 58 variables. The first 48 variables are specific keywords; each value is the percentage of words in the email that match that keyword.
- `type` \(= 1\): spam
- `type` \(= 0\): non-spam

We first load relevant packages
library(tidyverse)
theme_set(theme_bw())
library(tidymodels)
and then read the data.
spam <- read_csv("data/spam.csv")
glimpse(spam)
## Rows: 4,601
## Columns: 58
## $ make <dbl> 0.00, 0.21, 0.06, 0.00, 0.00, 0.00, 0.00, 0.00, 0.15…
## $ address <dbl> 0.64, 0.28, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ all <dbl> 0.64, 0.50, 0.71, 0.00, 0.00, 0.00, 0.00, 0.00, 0.46…
## $ num3d <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ our <dbl> 0.32, 0.14, 1.23, 0.63, 0.63, 1.85, 1.92, 1.88, 0.61…
## $ over <dbl> 0.00, 0.28, 0.19, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ remove <dbl> 0.00, 0.21, 0.19, 0.31, 0.31, 0.00, 0.00, 0.00, 0.30…
## $ internet <dbl> 0.00, 0.07, 0.12, 0.63, 0.63, 1.85, 0.00, 1.88, 0.00…
## $ order <dbl> 0.00, 0.00, 0.64, 0.31, 0.31, 0.00, 0.00, 0.00, 0.92…
## $ mail <dbl> 0.00, 0.94, 0.25, 0.63, 0.63, 0.00, 0.64, 0.00, 0.76…
## $ receive <dbl> 0.00, 0.21, 0.38, 0.31, 0.31, 0.00, 0.96, 0.00, 0.76…
## $ will <dbl> 0.64, 0.79, 0.45, 0.31, 0.31, 0.00, 1.28, 0.00, 0.92…
## $ people <dbl> 0.00, 0.65, 0.12, 0.31, 0.31, 0.00, 0.00, 0.00, 0.00…
## $ report <dbl> 0.00, 0.21, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ addresses <dbl> 0.00, 0.14, 1.75, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ free <dbl> 0.32, 0.14, 0.06, 0.31, 0.31, 0.00, 0.96, 0.00, 0.00…
## $ business <dbl> 0.00, 0.07, 0.06, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ email <dbl> 1.29, 0.28, 1.03, 0.00, 0.00, 0.00, 0.32, 0.00, 0.15…
## $ you <dbl> 1.93, 3.47, 1.36, 3.18, 3.18, 0.00, 3.85, 0.00, 1.23…
## $ credit <dbl> 0.00, 0.00, 0.32, 0.00, 0.00, 0.00, 0.00, 0.00, 3.53…
## $ your <dbl> 0.96, 1.59, 0.51, 0.31, 0.31, 0.00, 0.64, 0.00, 2.00…
## $ font <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num000 <dbl> 0.00, 0.43, 1.16, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ money <dbl> 0.00, 0.43, 0.06, 0.00, 0.00, 0.00, 0.00, 0.00, 0.15…
## $ hp <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ hpl <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ george <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num650 <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ lab <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ labs <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ telnet <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num857 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ data <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.15…
## $ num415 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num85 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ technology <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ num1999 <dbl> 0.00, 0.07, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ parts <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ pm <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ direct <dbl> 0.00, 0.00, 0.06, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ cs <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ meeting <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ original <dbl> 0.00, 0.00, 0.12, 0.00, 0.00, 0.00, 0.00, 0.00, 0.30…
## $ project <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ re <dbl> 0.00, 0.00, 0.06, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ edu <dbl> 0.00, 0.00, 0.06, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ table <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ conference <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ charSemicolon <dbl> 0.000, 0.000, 0.010, 0.000, 0.000, 0.000, 0.000, 0.0…
## $ charRoundbracket <dbl> 0.000, 0.132, 0.143, 0.137, 0.135, 0.223, 0.054, 0.2…
## $ charSquarebracket <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.0…
## $ charExclamation <dbl> 0.778, 0.372, 0.276, 0.137, 0.135, 0.000, 0.164, 0.0…
## $ charDollar <dbl> 0.000, 0.180, 0.184, 0.000, 0.000, 0.000, 0.054, 0.0…
## $ charHash <dbl> 0.000, 0.048, 0.010, 0.000, 0.000, 0.000, 0.000, 0.0…
## $ capitalAve <dbl> 3.756, 5.114, 9.821, 3.537, 3.537, 3.000, 1.671, 2.4…
## $ capitalLong <dbl> 61, 101, 485, 40, 40, 15, 4, 11, 445, 43, 6, 11, 61,…
## $ capitalTotal <dbl> 278, 1028, 2259, 191, 191, 54, 112, 49, 1257, 749, 2…
## $ type <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
The basic modeling logic is that the percentage of certain words in each email can help us determine whether or not the email is spam.
Let’s start by examining one predictor, the word “george” (`george`), because it’s George’s inbox.
Q - Create a faceted histogram showing the percentage of the word “george” for spam and non-spam emails. Make a table of the mean and maximum percentage of “george” by spam status. Would you expect emails with or without the word “george” to be spam?
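A minimal sketch of one possible approach (the summary column names here are arbitrary choices, not required ones):

# histogram of the "george" percentage, faceted by spam status
spam %>%
  ggplot(aes(x = george)) +
  geom_histogram() +
  facet_wrap(~ type)

# mean and maximum of "george" by spam status
spam %>%
  group_by(type) %>%
  summarize(mean_george = mean(george),
            max_george = max(george))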
Then we might assume the following model:
\[type = \beta_0 + \beta_1~george + \epsilon\]
Q - Fit the linear regression model above and print the output in a `tidy` way. Visualize the fitted line on a scatterplot between `type` and `george` using `geom_smooth()`. Discuss your visualization with your neighbor. Is this a good model? Why or why not?
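One possible sketch, using the `tidymodels` interface as in earlier labs (the object name `lm1` is arbitrary):

# fit the linear model type ~ george
lm1 <- linear_reg() %>%
  fit(type ~ george, data = spam)

lm1 %>%
  tidy()

# scatterplot of type vs. george with the fitted line overlaid
ggplot(spam, aes(x = george, y = type)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)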
Q - What are some problems with this approach?
\[type = \beta_0 + \beta_1~george + \epsilon\]
We need a new tool to match the ranges of the left-hand side (LHS) and the right-hand side (RHS) of the model equation.
Q - Suppose there is a 70% chance it will rain tomorrow. Calculate the following:
- Probability it will rain:
- Probability it won’t rain:
- Odds it will rain:
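Recall that the odds of an event with probability \(p\) are defined as \(p/(1-p)\). A quick illustration in R with a different probability (so as not to give the answer away):

p <- 0.25        # an example probability, not the one in the question
1 - p            # probability of the complement event
p / (1 - p)      # odds = p / (1 - p)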
Q - What do you notice in the visualization of \(logit(p)\) vs. \(p\) below in terms of the ranges of the \(x\) and \(y\) axes?
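The figure itself is not reproduced here; the following sketch generates the same kind of curve (axis labels are an editorial choice):

# logit(p) = log(p / (1 - p)) for p in (0, 1)
tibble(p = seq(0.001, 0.999, by = 0.001)) %>%
  mutate(logit_p = log(p / (1 - p))) %>%
  ggplot(aes(x = p, y = logit_p)) +
  geom_line() +
  labs(x = "p", y = "logit(p)")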
Logistic regression is a generalized linear model that relates a binary response variable to a linear combination of predictors through a logit link:
\[\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1x_1 + \cdots + \beta_kx_k\]
Using the inverse logit function, we can find the equation for \(p\):
\[ p = \frac{\exp(\beta_0 + \beta_1x_1 + \cdots + \beta_kx_k)}{1+\exp(\beta_0 + \beta_1x_1 + \cdots + \beta_kx_k)} = \frac{1}{1+\exp\{-(\beta_0 + \beta_1x_1 + \cdots + \beta_kx_k)\}}\]
Suppose we want to relate the probability that an email is spam to the percentage of the word “you” (`you`) and the total number of capital letters (`capitalTotal`) in the email.
Q - Write out a logistic regression model in context of our data using parameters and variable names. Write out a probability equation as well using the inverse logit function.
Q - Using functions in `R`, fit the logistic regression model in the previous question and print its output in a `tidy` way.
# option 1
glm1 <- XXXXX_reg(engine = "____") %>%
  fit(_______, data = spam, family = _______)

glm1 %>%
  tidy()
Q - What is different in the code above compared to the previous linear models we fit?
Another way of fitting the same GLM is the following:
# option 2
glm2 <- glm(type ~ you + capitalTotal, data = spam, family = "binomial")
glm2 %>%
  tidy()
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -1.50 0.0554 -27.1 2.97e-162
## 2 you 0.361 0.0198 18.3 1.84e- 74
## 3 capitalTotal 0.00173 0.000104 16.6 5.66e- 62
Here, you don’t have to manually convert `type` to a factor: `glm()` accepts a 0/1 numeric response with `family = "binomial"`, whereas `logistic_reg()` requires the response to be a factor.
Q - Based on the regression result, write out the fitted logistic regression model and reformulate it into the fitted equation for \(\hat{p}\) as well.
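For reference in the calculations that follow, substituting the rounded estimates from the output above into the model gives

\[\log\left(\frac{\hat{p}}{1-\hat{p}}\right) = -1.50 + 0.361~you + 0.00173~capitalTotal\]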
Q - Interpret the estimated coefficients in context of the data using the term “log odds”.
- `you`:
- `capitalTotal`:

Q - Interpret the estimated coefficients in context of the data using the term “odds”.

- `you`:
- `capitalTotal`:
Let’s examine predicted values using the `augment()` function. One caveat: `augment()` must be applied to the `fit` element of the model object from the first option with `logistic_reg()`, while it can be applied directly to the model object from the second option with `glm()`. See the code below.
# option 1
glm1$fit %>%
  augment() %>%
  head()

# option 2
glm2 %>%
  augment() %>%
  head()
Q - Which of the following is the predicted value (the values in the `.fitted` column) in this model? Hint: What is on the LHS of a logistic regression model?
For instance, the predicted odds of being spam for the first email are \(\exp(-0.324) = 0.723\).
Q - Calculate the predicted probability of being spam for the first email.
Q - Calculate the predicted probability of being spam for the third email.
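One possible sketch for such calculations, using base R’s `plogis()` (the inverse logit, equivalent to \(\exp(x)/(1+\exp(x))\)):

# predicted probabilities from the log odds in the .fitted column
glm2 %>%
  augment() %>%
  mutate(phat = plogis(.fitted)) %>%   # plogis() is the inverse logit
  select(type, .fitted, phat) %>%
  head()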
We can also obtain predicted probabilities of being spam at new levels of the explanatory variables.
Q - Calculate the predicted probability that an email is spam if the percentage of the word “you” is 5% in the email and there are 2,500 capital letters.
You can also obtain the predicted probabilities at new values of the explanatory variables using the `augment()` function.
new_data <- tibble(you = 5,
                   capitalTotal = 2500)

glm2 %>%
  augment(., newdata = ______) %>%
  mutate(phat = _____)
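As a side note, broom’s `augment()` for `glm` objects also accepts a `type.predict` argument (passed on to `predict.glm()`), which can return probabilities directly, as in this sketch:

# type.predict = "response" returns probabilities instead of log odds
glm2 %>%
  augment(newdata = new_data, type.predict = "response")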
Q - Compute the predicted probability that the following new email is spam. The percentage of the word “you” is defined as 100*(total number of “you”)/(total number of words).
email <- read_lines("data/test_email.txt")
email
## [1] "You Have Been Selected To Win A Free Trip To Disney World!"
## [2] ""
## [3] "YOU HAVE 30 SECONDS TO CLICK HERE TO CLAIM YOUR REWARD!"
## [4] ""
## [5] "WHAT ARE YOU WAITING FOR? ACT NOW!"
## [6] ""
## [7] "Sincerely,"
## [8] ""
## [9] "WALT DISNEY"
# total number of words: spaces per line, plus one word for each
# of the 5 non-empty lines
total_word <- sum(str_count(email, " ")) + 5
# total number of the word "you" (subtracting the one match that
# comes from "YOUR")
total_you <- sum(str_count(tolower(email), "you")) - 1
# total number of capital letters
total_cap <- sum(str_count(email, "[A-Z]"))
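One way to finish the computation, assuming the quantities defined above (the object name `new_email` is arbitrary):

# assemble the new predictor values for the test email
new_email <- tibble(you = 100 * total_you / total_word,
                    capitalTotal = total_cap)

# predicted probability of spam for the new email
glm2 %>%
  augment(newdata = new_email, type.predict = "response")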
One might also be interested in the predicted labels (spam vs. non-spam), not just predicted probabilities.
Just because there’s greater than 50% chance an email is spam doesn’t mean we have to label it as such. We can adjust our threshold or cutoff probability to be more or less sensitive to spam emails.
In other words, we get to select a number \(p_c\) such that if \(p > p_c\), then label the email as spam.
Let’s make a 2x2 contingency table at cutoff probability = 0.5, where rows represent predicted labels of emails, columns represent the actual type of emails, and each cell contains counts.
cutoff <- 0.5

glm2 %>%
  augment() %>%
  mutate(type_name = ifelse(type == 1, "spam", "non-spam"),
         phat = exp(.fitted) / (1 + exp(.fitted))) %>%
  mutate(type_pred = ifelse(phat > cutoff,
                            "labeled spam", "labeled non-spam")) %>%
  count(type_pred, type_name) %>%
  pivot_wider(values_from = n,
              names_from = type_name)
Q - Change the cutoff value in the above code chunk to 0.1, 0.3, 0.7, and 0.99. What do you observe?
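A sketch of one way to repeat the tabulation over several cutoffs without copy-pasting (the helper name `confusion_at` is an assumption, not part of the assignment):

# hypothetical helper: counts of predicted vs. actual labels at one cutoff
confusion_at <- function(cutoff) {
  glm2 %>%
    augment() %>%
    mutate(type_name = ifelse(type == 1, "spam", "non-spam"),
           phat = plogis(.fitted),
           type_pred = ifelse(phat > cutoff,
                              "labeled spam", "labeled non-spam")) %>%
    count(type_pred, type_name)
}

# tabulate at each cutoff of interest
map(c(0.1, 0.3, 0.7, 0.99), confusion_at)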