Homework #04: Linear Regression

due Thursday, June 16 at 11:59pm

Packages

We’ll use the tidyverse package for much of the data wrangling and visualization, the tidymodels package for inference, and the data live in the openintro package. We use the knitr package for nice-looking tables.

library(tidyverse)    # data wrangling and visualization
theme_set(theme_bw()) # clean default ggplot2 theme
library(tidymodels)   # modeling and inference
library(openintro)    # contains the evals dataset
library(knitr)        # kable() for nice-looking tables

Grading the professor

Note: This homework is adapted from lab assignments in Data Science in a Box. See the original exercises at https://rstudio-education.github.io/datascience-box/course-materials/lab-instructions/lab-10/lab-10-slr-course-evals.html and https://rstudio-education.github.io/datascience-box/course-materials/lab-instructions/lab-11/lab-11-mlr-course-evals.html

Many college courses conclude by giving students the opportunity to evaluate the course and the instructor anonymously. However, the use of these student evaluations as an indicator of course quality and teaching effectiveness is often criticized, because the measures may reflect the influence of characteristics unrelated to teaching, such as the physical appearance of the instructor. The article "Beauty in the classroom: instructors' pulchritude and putative pedagogical productivity" (Hamermesh and Parker, 2005) found that instructors who are viewed as better looking receive higher instructional ratings. (Daniel S. Hamermesh and Amy Parker, Economics of Education Review, Volume 24, Issue 4, August 2005, Pages 369-376, ISSN 0272-7757, doi:10.1016/j.econedurev.2004.07.013, http://www.sciencedirect.com/science/article/pii/S0272775704001165.)

In this homework you will analyze the data from this study in order to learn what goes into a positive professor evaluation.

The data were gathered from end-of-semester student evaluations for a large sample of professors from the University of Texas at Austin. In addition, six students rated the professors' physical appearance. (This is a slightly modified version of the original dataset that was released as part of the replication data for Data Analysis Using Regression and Multilevel/Hierarchical Models (Gelman and Hill, 2007).) The result is a data frame where each row represents a different course and the columns represent variables about the courses and professors.

The data can be found in the openintro package, and it’s called evals. Since the dataset is distributed with the package, we don’t need to load it separately; it becomes available to us when we load the package. You can find out more about the dataset by inspecting its documentation, which you can access by running ?evals in the Console.
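As a quick sanity check that the data are available, you can preview the dataset once the package is loaded; a minimal sketch:

```r
library(openintro)  # makes the evals data frame available directly
library(tidyverse)

glimpse(evals)  # lists the columns, including score and bty_avg
# Running ?evals in the Console opens the dataset documentation
```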

Exploratory Data Analysis

  1. Visualize the distribution of score. Is the distribution skewed? What does that tell you about how students rate courses? Is this what you expected to see? Why, or why not? Include any summary statistics and visualizations you use in your response.

  2. Visualize and describe the relationship between score and bty_avg in a scatterplot.
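One possible starting point for these two exercises, assuming standard ggplot2 geoms and dplyr summaries; a sketch, not a required approach (the binwidth is an arbitrary choice you may want to adjust):

```r
# Ex 1: distribution of score
ggplot(evals, aes(x = score)) +
  geom_histogram(binwidth = 0.25) +
  labs(x = "Evaluation score", y = "Count")

evals %>%
  summarise(mean = mean(score), median = median(score), sd = sd(score))

# Ex 2: relationship between score and bty_avg
ggplot(evals, aes(x = bty_avg, y = score)) +
  geom_point(alpha = 0.5) +
  labs(x = "Average beauty rating", y = "Evaluation score")
```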

🧶 ✅ ⬆️ Knit, commit, and push your final changes to GitHub with a meaningful commit message.

Linear Regression with a Numerical Predictor

  3. Let’s see if the trend in the plot from Ex 2 is something more than natural variation. Fit a linear model called score_bty_fit to predict average professor evaluation score by average beauty rating (bty_avg). Print the regression output as a nice kable table with all numbers rounded to three decimal places. Based on the regression output, write the linear model.

  4. Recreate the scatterplot from Ex 2, and add the regression line to the plot in orange, with the shading for the uncertainty of the line turned off. Hint: use the geom_smooth() function.

  5. Interpret the slope and the intercept of the linear model in context of the data.
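The tidymodels fitting workflow for this section can be sketched as follows, using the object name from the exercise (linear_reg() defaults to the "lm" engine):

```r
# Fit score ~ bty_avg
score_bty_fit <- linear_reg() %>%
  fit(score ~ bty_avg, data = evals)

# Regression output as a kable table, rounded to three decimal places
tidy(score_bty_fit) %>%
  kable(digits = 3)

# Scatterplot with the fitted line in orange and no uncertainty shading
ggplot(evals, aes(x = bty_avg, y = score)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "orange")
```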

🧶 ✅ ⬆️ Knit, commit, and push your final changes to GitHub with a meaningful commit message.

Linear Regression with a Categorical Predictor

  6. Fit a new linear model called score_gender_fit to predict average professor evaluation score based on the gender of the professor. Based on the regression output, write the linear model using an indicator variable and interpret the slope and intercept in context of the data.

  7. Determine the \(R^2\) of both models and interpret the \(R^2\) statistics in context of the data. Which model do you prefer, and why?
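For the model comparison, glance() (from broom, loaded with tidymodels) reports r.squared for a fitted model; a sketch using the exercise's object names:

```r
# Fit score ~ gender
score_gender_fit <- linear_reg() %>%
  fit(score ~ gender, data = evals)

tidy(score_gender_fit)

# R-squared for each model (score_bty_fit from the previous section)
glance(score_bty_fit)$r.squared
glance(score_gender_fit)$r.squared
```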

🧶 ✅ ⬆️ Knit, commit, and push your final changes to GitHub with a meaningful commit message.

Multiple Linear Regression

  8. Fit a linear model: score_bty_gen_fit, predicting average professor evaluation score based on average beauty rating (bty_avg) and gender. Write the linear model with an indicator variable for gender and interpret the slopes and intercept in context of the data.

  9. What is the equation of the line between score and bty_avg for male professors? What is it for female professors?

  10. For two professors who received the same beauty rating, which gender tends to have the higher course evaluation score? Why?

  11. How do the \(R^2\) and the adjusted \(R^2\) values of score_bty_gen_fit and score_bty_fit compare? Which metric do you trust more in this case? Why?

  12. What does this tell us about how useful gender is in explaining the variability in evaluation scores when we already have information on the beauty score of the professor?

  13. Fit a linear model: score_bty_gen_int_fit, predicting average professor evaluation score based on average beauty rating (bty_avg), gender, and their interaction. Print the regression output as a nice kable table with all numbers rounded to three decimal places. Based on the regression output, write the linear model with an indicator variable for gender.

  14. What is the equation of the line between score and bty_avg for male professors from score_bty_gen_int_fit? What is it for female professors? Which gender benefits more on the course evaluation from a one-point increase in the average beauty rating? Why?

  15. How do the adjusted \(R^2\) values of score_bty_gen_fit and score_bty_gen_int_fit compare? Which model do you prefer, and why?

  16. What does this tell us about how useful the interaction term between bty_avg and gender is in explaining the variability in evaluation scores when we already have information on the main effects of the beauty score and gender of a professor?

  17. Test whether the coefficient of the interaction term (\(\beta_3\)) is significantly different from zero using a hypothesis test based on the Central Limit Theorem.
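The additive and interaction models in this section can be fit the same way; a sketch using the exercise's object names (in an R formula, bty_avg * gender expands to both main effects plus their interaction):

```r
# Additive model: score ~ bty_avg + gender
score_bty_gen_fit <- linear_reg() %>%
  fit(score ~ bty_avg + gender, data = evals)

# Interaction model: main effects plus bty_avg:gender
score_bty_gen_int_fit <- linear_reg() %>%
  fit(score ~ bty_avg * gender, data = evals)

# Regression output rounded to three decimal places; the statistic and
# p.value columns give the CLT-based test of each coefficient, including
# the interaction term
tidy(score_bty_gen_int_fit) %>%
  kable(digits = 3)

# Adjusted R-squared for the model comparison
glance(score_bty_gen_fit)$adj.r.squared
glance(score_bty_gen_int_fit)$adj.r.squared
```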

🧶 ✅ ⬆️ Knit, commit, and push your final changes to GitHub with a meaningful commit message. Make sure to commit and push all changed files so that your Git pane is clear afterwards. Go to your homework repo on GitHub and check that your commit is reflected there.

Submission

Knit to PDF to create a PDF document. Stage and commit all remaining changes, and push your work to GitHub. Make sure all files are updated on your GitHub repo.

Upload only your PDF document to Gradescope. Before you submit the uploaded document, mark where your answer to each exercise is located. If an answer spans multiple pages, mark all of those pages. Associate the “Workflow & formatting” section with the first page.

Grading (60 pts)


Component               Points
Ex 1                         4
Ex 2                         2
Ex 3                         3
Ex 4                         2
Ex 5                         3
Ex 6                         6
Ex 7                         5
Ex 8                         7
Ex 9                         2
Ex 10                        2
Ex 11                        4
Ex 12                        1
Ex 13                        3
Ex 14                        4
Ex 15                        2
Ex 16                        1
Ex 17                        7
Workflow & formatting        2
Total                       60

Grading notes: