Getting Started

Introduction

Multiple linear regression allows us to relate a numerical response variable to one or more numerical or categorical predictors. We have used it to understand relationships and make predictions.

But what about a situation where the response of interest is binary?

  • spam or non-spam
  • malignant or benign tumor
  • survived or died
  • admitted or denied
  • won or lost an election

Then we use logistic regression!

Email Spam

To illustrate logistic regression, we will build a spam filter from email data. Today's data consist of 4,601 emails from George's inbox, each classified as spam or non-spam. The data were collected at Hewlett-Packard Labs and contain 58 variables. The first 48 variables are specific keywords, and each value is the percentage of words in the email that match that keyword.

  • type \(= 1\) is spam
  • type \(= 0\) is non-spam

We first load relevant packages

library(tidyverse)
theme_set(theme_bw())
library(tidymodels)

and then read the data.

spam <- read_csv("data/spam.csv")
glimpse(spam)
## Rows: 4,601
## Columns: 58
## $ make              <dbl> 0.00, 0.21, 0.06, 0.00, 0.00, 0.00, 0.00, 0.00, 0.15…
## $ address           <dbl> 0.64, 0.28, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ all               <dbl> 0.64, 0.50, 0.71, 0.00, 0.00, 0.00, 0.00, 0.00, 0.46…
## $ num3d             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ our               <dbl> 0.32, 0.14, 1.23, 0.63, 0.63, 1.85, 1.92, 1.88, 0.61…
## $ over              <dbl> 0.00, 0.28, 0.19, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ remove            <dbl> 0.00, 0.21, 0.19, 0.31, 0.31, 0.00, 0.00, 0.00, 0.30…
## $ internet          <dbl> 0.00, 0.07, 0.12, 0.63, 0.63, 1.85, 0.00, 1.88, 0.00…
## $ order             <dbl> 0.00, 0.00, 0.64, 0.31, 0.31, 0.00, 0.00, 0.00, 0.92…
## $ mail              <dbl> 0.00, 0.94, 0.25, 0.63, 0.63, 0.00, 0.64, 0.00, 0.76…
## $ receive           <dbl> 0.00, 0.21, 0.38, 0.31, 0.31, 0.00, 0.96, 0.00, 0.76…
## $ will              <dbl> 0.64, 0.79, 0.45, 0.31, 0.31, 0.00, 1.28, 0.00, 0.92…
## $ people            <dbl> 0.00, 0.65, 0.12, 0.31, 0.31, 0.00, 0.00, 0.00, 0.00…
## $ report            <dbl> 0.00, 0.21, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ addresses         <dbl> 0.00, 0.14, 1.75, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ free              <dbl> 0.32, 0.14, 0.06, 0.31, 0.31, 0.00, 0.96, 0.00, 0.00…
## $ business          <dbl> 0.00, 0.07, 0.06, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ email             <dbl> 1.29, 0.28, 1.03, 0.00, 0.00, 0.00, 0.32, 0.00, 0.15…
## $ you               <dbl> 1.93, 3.47, 1.36, 3.18, 3.18, 0.00, 3.85, 0.00, 1.23…
## $ credit            <dbl> 0.00, 0.00, 0.32, 0.00, 0.00, 0.00, 0.00, 0.00, 3.53…
## $ your              <dbl> 0.96, 1.59, 0.51, 0.31, 0.31, 0.00, 0.64, 0.00, 2.00…
## $ font              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num000            <dbl> 0.00, 0.43, 1.16, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ money             <dbl> 0.00, 0.43, 0.06, 0.00, 0.00, 0.00, 0.00, 0.00, 0.15…
## $ hp                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ hpl               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ george            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num650            <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ lab               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ labs              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ telnet            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num857            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ data              <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.15…
## $ num415            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num85             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ technology        <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ num1999           <dbl> 0.00, 0.07, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ parts             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ pm                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ direct            <dbl> 0.00, 0.00, 0.06, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ cs                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ meeting           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ original          <dbl> 0.00, 0.00, 0.12, 0.00, 0.00, 0.00, 0.00, 0.00, 0.30…
## $ project           <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ re                <dbl> 0.00, 0.00, 0.06, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ edu               <dbl> 0.00, 0.00, 0.06, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ table             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ conference        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ charSemicolon     <dbl> 0.000, 0.000, 0.010, 0.000, 0.000, 0.000, 0.000, 0.0…
## $ charRoundbracket  <dbl> 0.000, 0.132, 0.143, 0.137, 0.135, 0.223, 0.054, 0.2…
## $ charSquarebracket <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.0…
## $ charExclamation   <dbl> 0.778, 0.372, 0.276, 0.137, 0.135, 0.000, 0.164, 0.0…
## $ charDollar        <dbl> 0.000, 0.180, 0.184, 0.000, 0.000, 0.000, 0.054, 0.0…
## $ charHash          <dbl> 0.000, 0.048, 0.010, 0.000, 0.000, 0.000, 0.000, 0.0…
## $ capitalAve        <dbl> 3.756, 5.114, 9.821, 3.537, 3.537, 3.000, 1.671, 2.4…
## $ capitalLong       <dbl> 61, 101, 485, 40, 40, 15, 4, 11, 445, 43, 6, 11, 61,…
## $ capitalTotal      <dbl> 278, 1028, 2259, 191, 191, 54, 112, 49, 1257, 749, 2…
## $ type              <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…

Part 1: Business as usual

The basic modeling logic is that the percentage of certain words in each email can help us determine whether or not the email is spam.

Let's start by examining one predictor, the word "george" (the variable george), since this is George's inbox.

Q - Create a faceted histogram showing the percentage of the word "george" by spam status. Make a table of the mean and maximum percentage of "george" for spam and non-spam emails. Would you expect emails with or without the word "george" to be spam?
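If you want a starting point, here is a minimal sketch (the binwidth and facet layout are arbitrary choices, not part of the assignment):

# histogram of `george`, faceted by spam status
spam %>% 
  ggplot(aes(x = george)) +
  geom_histogram(binwidth = 0.25) +
  facet_wrap(~ type, ncol = 1)

# mean and maximum percentage of `george` by spam status
spam %>% 
  group_by(type) %>% 
  summarize(mean_george = mean(george), 
            max_george = max(george))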

Then we might assume the following model:

\[type = \beta_0 + \beta_1~george + \epsilon\]

Q - Fit the linear regression model above and print the output in a tidy way. Visualize the fitted line on a scatterplot of type versus george using geom_smooth(). Discuss your visualization with your neighbor. Is this a good model? Why or why not?
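One possible sketch, using the same tidymodels interface as in earlier models (the object name lm_george is ours):

# fit the linear model and tidy the output
lm_george <- linear_reg() %>% 
  fit(type ~ george, data = spam)
lm_george %>% 
  tidy()

# scatterplot with the fitted least-squares line
spam %>% 
  ggplot(aes(x = george, y = type)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)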

Q - What are some problems with this approach?

\[type = \beta_0 + \beta_1~george + \epsilon\]

We need a new tool that matches the range of the left-hand side (LHS) of the model equation to the range of the right-hand side (RHS).

Part 2: Preliminaries

  • Let \(p\) be the success probability of some event, e.g., \(P(type = spam) = p\) or \(P(game = win) = p\).
  • The odds of success are \(\frac{p}{1-p}\).
    • Odds are sometimes expressed as \(p:(1-p)\) and read "\(p\) to \(1-p\)".
  • The log odds is the natural log of the odds, called the logit of \(p\): \[\log\left(\frac{p}{1-p}\right) = logit(p)\]
    • The logit function takes an input in \((0,1)\) and maps it to a value between \(-\infty\) and \(\infty\).
  • The logistic (or inverse-logit) function undoes the logit transformation: \[logistic(x) = \frac{\exp(x)}{1+\exp(x)} = \frac{1}{1+\exp(-x)}\]
    • It takes a value between \(-\infty\) and \(\infty\) and maps it to a value in \((0,1)\). A quick numeric check follows this list.
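As a quick numeric check of these definitions, note that base R's qlogis() and plogis() implement the logit and logistic functions:

p <- 0.25
p / (1 - p)        # odds: 0.25/0.75 = 1/3
qlogis(p)          # logit: log(1/3), about -1.10
plogis(qlogis(p))  # inverse logit recovers p = 0.25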

Q - Suppose there is a 70% chance it will rain tomorrow. Calculate the following:

Probability it will rain:

Probability it won’t rain:

Odds it will rain:

Q - What do you notice in the visualization of \(logit(p)\) vs. \(p\) below in terms of the ranges of the \(x\) and \(y\) axes?
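In case the figure is helpful to reproduce, a sketch like the following generates the same curve (the grid of \(p\) values is an arbitrary choice):

tibble(p = seq(0.001, 0.999, by = 0.001)) %>% 
  mutate(logit_p = log(p / (1 - p))) %>% 
  ggplot(aes(x = p, y = logit_p)) +
  geom_line() +
  labs(x = "p", y = "logit(p)")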

Part 3: Logistic regression

Logistic regression is a generalized linear model that relates the probability of a binary response to a linear combination of predictors through a logit link:

\[\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1x_1 + \cdots + \beta_kx_k\]

Using the inverse logit function, we can solve for \(p\):

\[ p = \frac{\exp(\beta_0 + \beta_1x_1 + \cdots + \beta_kx_k)}{1+\exp(\beta_0 + \beta_1x_1 + \cdots + \beta_kx_k)} = \frac{1}{1+\exp\{-(\beta_0 + \beta_1x_1 + \cdots + \beta_kx_k)\}}\]

Suppose we want to relate the probability that an email is spam to the percentage of the word “you” (you) and the total number of capital letters (capitalTotal) in the email.

Q - Write out a logistic regression model in context of our data using parameters and variable names. Write out a probability equation as well using the inverse logit function.

Q - Using functions in R, fit the logistic regression model in the previous question and print its output in a tidy way.

# option 1
glm1 <- XXXXX_reg(engine = "____") %>% 
  fit(_______, data = spam, family = _______)
glm1 %>% 
  tidy()

Q - What is different in the code above from previous linear models we fit?

Another way of fitting the same glm is the following:

# option 2
glm2 <- glm(type ~ you + capitalTotal, data = spam, family = "binomial") 
glm2 %>% 
  tidy()
## # A tibble: 3 × 5
##   term         estimate std.error statistic   p.value
##   <chr>           <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)  -1.50     0.0554       -27.1 2.97e-162
## 2 you           0.361    0.0198        18.3 1.84e- 74
## 3 capitalTotal  0.00173  0.000104      16.6 5.66e- 62

Here, you don't have to manually convert type to a factor.

Q - Based on the regression result, write out the fitted logistic regression model and reformulate it into the fitted equation for \(\hat{p}\) as well.

Q - Interpret the estimated coefficients in context of the data using the term “log odds”.

  • Intercept:
  • Slope of you:
  • Slope of capitalTotal:

Q - Interpret the estimated coefficients in context of the data using the term “odds”.

  • Intercept:
  • Slope of you:
  • Slope of capitalTotal:
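As a hint for the odds-scale interpretation, broom's tidy() can exponentiate the estimates, turning additive changes in log odds into multiplicative changes in odds:

glm2 %>% 
  tidy(exponentiate = TRUE)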

Part 4: Predicted probabilities

Let's examine predicted values using the augment() function. One caveat: augment() must be applied to the fit element of the model object from the first option (logistic_reg()), while it can be applied directly to the model object from the second option (glm()). See the code below.

# option 1
glm1$fit %>% 
  augment() %>% 
  head()

# option 2
glm2 %>% 
  augment() %>% 
  head()

Q - Which of the following is the predicted value (the values in the .fitted column) in this model? Hint: what is on the LHS of a logistic regression model?

  1. predicted odds
  2. predicted log odds
  3. predicted binary response
  4. predicted probability

For instance, the predicted odds of being spam for the first email are \(\exp(-0.324) = 0.723\).

Q - Calculate the predicted probability of being spam for the first email.

Q - Calculate the predicted probability of being spam for the third email.
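For these calculations, recall that .fitted is on the log-odds scale, so the inverse logit converts it to a probability. A sketch of the mechanics (not the full answer), using base R's plogis():

glm2 %>% 
  augment() %>% 
  mutate(phat = plogis(.fitted)) %>%   # inverse logit of the log odds
  select(type, .fitted, phat) %>% 
  head()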

We can also obtain predicted probabilities of being spam at new levels of the explanatory variables.

Q - Calculate the predicted probability that an email is spam if the percentage of the word "you" in the email is 5% and the email contains 2,500 capital letters.

You can also obtain the predicted probabilities at new values of the explanatory variables using the augment() function.

new_data <- tibble(you = 5, 
                   capitalTotal = 2500)
glm2 %>% 
  augment(., newdata = ______) %>% 
  mutate(phat = _____)
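Alternatively, base R's predict() returns the probability directly when type = "response", since the inverse logit is applied internally:

predict(glm2, newdata = new_data, type = "response")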

Q - Compute the predicted probability that the following new email is spam. The percentage of the word "you" is defined as 100*(total number of "you")/(total number of words).

email <- read_lines("data/test_email.txt")
email 
## [1] "You Have Been Selected To Win A Free Trip To Disney World!"
## [2] ""                                                          
## [3] "YOU HAVE 30 SECONDS TO CLICK HERE TO CLAIM YOUR REWARD!"   
## [4] ""                                                          
## [5] "WHAT ARE YOU WAITING FOR? ACT NOW!"                        
## [6] ""                                                          
## [7] "Sincerely,"                                                
## [8] ""                                                          
## [9] "WALT DISNEY"
# total number of words: count spaces, then add one word per
# non-empty line (there are 5 non-empty lines)
total_word <- sum(str_count(email, " ")) + 5
# total number of the word "you" (subtract 1 to drop the match
# inside "YOUR")
total_you <- sum(str_count(tolower(email), "you")) - 1
# total number of capital letters 
total_cap <- sum(str_count(email, "[A-Z]"))
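One way to finish, assuming the counts above (email_data is an illustrative name, not part of the original data):

# assemble the new email's predictors and predict its spam probability
email_data <- tibble(you = 100 * total_you / total_word, 
                     capitalTotal = total_cap)
predict(glm2, newdata = email_data, type = "response")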

Part 5: Sensitivity and specificity

One might also be interested in the predicted labels (spam vs. non-spam), not just predicted probabilities.

Just because there's a greater than 50% chance an email is spam doesn't mean we have to label it as such. We can adjust our threshold, or cutoff probability, to be more or less sensitive to spam emails.

In other words, we get to select a number \(p_c\) such that if \(\hat{p} > p_c\), we label the email as spam.

Let's make a 2x2 contingency table at cutoff probability 0.5, where rows represent predicted labels of emails, columns represent the actual types of emails, and each cell contains counts.

cutoff <- 0.5
glm2 %>% 
  augment() %>%
  mutate(type_name = ifelse(type == 1, "spam", "non-spam"),
         phat = exp(.fitted)/(1+exp(.fitted))) %>% 
  mutate(type_pred = ifelse(phat > cutoff, 
                            "labeled spam", "labeled non-spam")) %>% 
  count(type_pred, type_name) %>% 
  pivot_wider(values_from = n, 
              names_from = type_name)

Q - Change the cutoff value in the above code chunk to 0.1, 0.3, 0.7, and 0.99. What do you observe?

  • A lower threshold means we are more sensitive to spam and label more emails as spam.
  • A higher threshold means we tolerate more spam and label fewer emails as spam.
  • Sensitivity = P(labeled spam | spam): how well we detect success events.
  • Specificity = P(labeled non-spam | non-spam): how well we rule out failure events.
  • We want both to be high, but there is always a trade-off!
    • Imagine a spam filter that labels all emails as spam: sensitivity of 1, but specificity of 0.
  • We can adjust the threshold depending on how much we dislike spam (sensitivity) vs. how much we value receiving important emails (specificity). A sketch of computing both at a given cutoff follows this list.
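As referenced above, a minimal sketch of computing sensitivity and specificity at a given cutoff, reusing the pieces from the earlier code chunk:

cutoff <- 0.5
glm2 %>% 
  augment() %>% 
  mutate(phat = plogis(.fitted),          # predicted probability
         labeled_spam = phat > cutoff) %>% 
  summarize(sensitivity = sum(labeled_spam & type == 1) / sum(type == 1),
            specificity = sum(!labeled_spam & type == 0) / sum(type == 0))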

Submitting Application Exercises

  • Once you have completed the activity, push your final changes to your GitHub repo.
  • Make sure you committed at least three times.
  • Check that your repo is updated on GitHub, and that’s all you need to do to submit application exercises for participation.