Homework #01: Data Visualization and Wrangling

due Tuesday, May 24 at 11:59 PM

Goals

Getting started

Formatting

Packages

library(tidyverse)
theme_set(theme_bw())
library(sf)

Hawaii Airbnb Listings in 2021

The following data were originally distributed by Inside Airbnb (date compiled: 11 Dec, 2021) and has been modified for the purpose of this homework.

We will work on two datasets, namely, hawaii and hawaii_nb. The dataset hawaii contains summary information and metrics for listings in Hawaii for the following variables:

An sf object hawaii_nbcontains geometry information on neighborhoods of Hawaii.

Note: You do not have to worry about the last line in the code chunk below; it simply replaces column names with better understandable ones for future use.

hawaii <- read_csv("data/hawaii_airbnb.csv")
hawaii_nb <- st_read("data/hawaii_airbnb_neighborhood.shp", quiet = TRUE)
names(hawaii_nb)[1:2] <- c("neighbourhood", "neighbourhood_group")
  1. How many observations (rows) does the dataset hawaii have? What does each row in the dataset represent? How many variables (columns) does hawaii have? Identify review_scores_rating and room_type as numeric continuous, numeric discrete, categorical ordinal, or categorical nominal.

🧶 ✅ ⬆️ Knit, commit, and push your final changes to GitHub with a meaningful commit message.

  1. Reproduce the following tables. The column “count” represents the number of listings, and “percentage” is the associated percentage. Hint:

    • Include knitr::kable(., digits = 3) at the end of your pipeline to neatly display tables in your final document with numbers rounded to three decimal places.
    • You may create a new variable using ifelse() for the column “minimum stay < 30?”.
    • When your variable name has space in between, you need quotation marks.

🧶 ✅ ⬆️ Knit, commit, and push your final changes to GitHub with a meaningful commit message.

  1. Create a faceted histogram where each facet represents a neighborhood and displays the distribution of Airbnb prices in that neighborhood. In order to better understand the neighborhoods, fill histograms for neighborhoods in the same group (neighbourhood_group) in the same color. How would you describe the distribution of price in general? How do neighborhoods compare to each other in terms of price?

    • Think critically about whether it makes more sense to stack the facets on top of each other in a column, lay them out in a row, or wrap them around. Along with your visualization, include your reasoning for the layout you chose for your facets.
    • You should decide what makes a reasonable bin width for the histogram by trying out a few options.
    • You should decide if scales should be fixed or not. If the scales are fixed, every facet is drawn on the same ranges of x values and y values.
    • Make sure your figure is large enough to properly display histograms for all facets.
  2. We will examine median listing prices of neighborhoods.

    • Create a summary dataset called hawaii_summary with the minimum, mean, median, standard deviation, IQR, and maximum listing price in each neighborhood and identify neighborhoods with the top five median listing prices.
    • Create a new sf object hawaii_sf that includes all rows and columns from hawaii_summary and the associated geometry information from hawaii_nb.
    • Create a map of Hawaii where each neighborhood is colored by the level of its median listing price using scale_fill_gradient. Describe what you observe.

🧶 ✅ ⬆️ Knit, commit, and push your final changes to GitHub with a meaningful commit message.

  1. Now you will meet the Airbnb hosts in Hawaii.
    • How many hosts are there in Hawaii?
    • Using one pipeline:
      • create a new variable annual_revenue that estimates the annual revenue for the listing in the last 12 months. You may assume that visitors always left a review and stayed in the listing for minimum nights only. Hint: use minimum_nights, price and number_of_reviews.
      • calculate the sum of the estimated annual revenue and the number of listings by each of the hosts identified by host_id.
      • create a scatterplot illustrating the two variables’ relationship. Put the number of listings on x-axis. Use geom_smooth() for a fitted line.
    • Interpret your visualization.
    • Are there any outliers? If so, identify them.

🧶 ✅ ⬆️ Knit, commit, and push your final changes to GitHub with a meaningful commit message.

  1. Come up with a question you want to answer using these data, and write it down. Then, answer it using summary statistic(s) and/or visualization(s).

🧶 ✅ ⬆️ Knit, commit, and push your final changes to GitHub with a meaningful commit message.

Submission

Knit to PDF to create a PDF document. Stage and commit all remaining changes, and push your work to GitHub. Make sure all files are updated on your GitHub repo.

Only upload your PDF document to Gradescope. Before you submit the uploaded document, mark where each answer is to the exercises. If any answer spans multiple pages, then mark all pages. Associate the “Workflow & formatting” section with the first page.

Grading (60 pts)


Component Points
Ex 1 5
Ex 2 8
Ex 3 8
Ex 4 15
Ex 5 8
Ex 6 6
Workflow & formatting 10

Grading notes: