AE 04: Data Visualization 2

Getting Started

Minneapolis Housing Data

Part 1: Identify variable types
Part 2: Numeric variables
Part 3: Categorical variables
Part 4: Fix errors

Practice

Submitting Application Exercises

Additional Resources

Getting Started

Go to the course GitHub organization page and find the repository entitled “ae04-GitHubUsername”.
Click the green “code” button and copy the SSH URL.
Go to RStudio, select File New Project Version Control Git and paste the URL.
Open the .Rmd file and replace “Your Name” with your name.

Minneapolis Housing Data

We always begin by loading relevant libraries.

library(tidyverse)

Next, we load data. We will continue our investigation of single-family home prices in Minneapolis, Minnesota.

mn_homes <- read_csv("data/mn_homes.csv")

Part 1: Identify variable types

Add a glimpse() to the code chunk below and identify the following variables as numeric continuous, numeric discrete, categorical ordinal, or categorical nominal.

salesprice
numfireplaces
community

# code here

The summary() command is also useful in looking at numerical variables. Use this command to look at the numeric variables from the previous chunk.

# code here

Part 2: Numeric variables

We can use a histogram to summarize a numeric variable. Play with different binwidth.

ggplot(data = mn_homes, 
       mapping = aes(x = salesprice)) + 
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Now let’s look at the distribution of price for each community. We will make a faceted histogram where each facet represents a community and displays the distribution of sales price for that community. Fill in the blank with an appropriate variable.

Note: Once you have modified the code, remove the option eval = FALSE from the code chunk header and knit to see the updates.

ggplot(data = mn_homes, 
       mapping = aes(x = salesprice)) + 
  geom_histogram(binwidth = ____) + 
  facet_wrap(~____)

You might notice that 1) a few expensive homes exist in one community and 2) some communities have less homes sold than others. These obscure our ability to see the histograms for sales prices in all of the communities at once. In this case, we might change the scales for each histogram by scales = "free_x", scales = "free_y", or scales = "free". Choose the option that you think best solves the issues.

Note: Once you have modified the code, remove the option eval = FALSE from the code chunk header and knit to see the updates.

ggplot(data = mn_homes, 
       mapping = aes(x = salesprice)) + 
  geom_histogram(binwidth = 100000) + 
  facet_wrap(~____, scales = ____)

A density plot is another option. We just connect the boxes in a histogram with a smooth curve. Fill in the code below to create a density plot of salesprice.

Note: Once you have modified the code, remove the option eval = FALSE from the code chunk header and knit to see the updates.

ggplot(data = mn_homes, 
       mapping = aes(x = ____)) + 
   geom_xxxx()

Side-by-side boxplots are helpful to visualize the distribution of a numeric variable across the levels of a categorical variable.

ggplot(data = mn_homes, 
       mapping = aes(x = community, y = salesprice)) + 
  geom_boxplot() + 
  coord_flip() + 
  labs(title = "Sales Price by Community", x = "Community", y = "Sales Price")

Q - What is coord_flip() doing in the code chunk above? Try removing it to see. Does it affect labs()?

Q - Can you detect any homes sold for unusually high prices? In which community? Does it match what we observed in the faceted histogram above?

Part 3: Categorical variables

Bar plots allow us to visualize categorical variables.

ggplot(data = mn_homes, mapping = aes(x = community)) + 
  geom_bar() + 
  labs(title = "Homes by Community", x = "Community", y = "Number of Homes")

Segmented bar plots can be used to visualize two categorical variables. Fill in the blanks for segmenting the number of homes in each community into whether a fireplace exists or not (the variable fireplace). We intend to make horizontal bars.

Note: Once you have modified the code, remove the option eval = FALSE from the code chunk header and knit to see the updates.

ggplot(data = mn_homes, mapping = aes(x = ____, fill = ____)) + 
  geom_bar() +
  coord_flip() + 
  scale_fill_viridis_d(option = "cividis", name = "Fireplace?") + 
  labs(title = "Fireplaces by Community", 
       x = "Community", y = "Number of Homes")

Fill in the blank with an informative y-label in the following code chunk.

ggplot(data = mn_homes, mapping = aes(x = community, fill = fireplace)) + 
  geom_bar(position = "fill") + 
  scale_fill_viridis_d(option = "cividis", name="Fireplace?") + 
  coord_flip() + 
  labs(title = "Percentage of Homes with a Fireplace by Community", 
       x = "Community", y = "____")

Q - Which of the above two visualizations do you prefer? Why? Is this answer always the same?

Part 4: Fix errors

There is something wrong with each of the plots below. Run the code for each plot, read the error, then identify and fix the problem.

Note: Once you have modified the code, remove the option eval = FALSE from the code chunk header and knit to see the updates.

ggplot(data = mn_homes) +
  geom_point(mapping = aes(x = lotsize, y = salesprice,
                           shape = 21, size = .85))

ggplot(data = mn_homes, mapping = (x = otsize, y = area)) +
  geom_point(, shape = 21, size = .85)

ggplot(data = mn_homes) +
  geom_point(mapping = aes(x = lotsize, y = area),
             color = community, size = 0.85)

ggplot(data = mn_homes) +
  geom_point(mapping = aes(x = 1otsize, y = area))

General principles for effective data visualization

keep it simple
use color effectively
tell a story

Practice

Modify the code outline to create a ridge plot examining the distribution of year built within each community.

Note: Once you have modified the code, remove the option eval = FALSE from the code chunk header and knit to see the updates.

library(ggridges)
ggplot(data = ___, aes(x = ___, y = ___, fill = ____, color = ____)) +
  geom_density_ridges(alpha = 0.5) + 
  scale_fill_viridis_d() + 
  scale_color_viridis_d() + 
  labs(x = "_____", 
       y = "_____", 
       fill = "_____", 
       color = "_____", 
       title = "_____", 
       subtitle = "_____")

Submitting Application Exercises

Once you have completed the activity, push your final changes to your GitHub repo.
Make sure you committed at least three times.
Check that your repo is updated on GitHub, and that’s all you need to do to submit application exercises for participation.

AE 04: Data Visualization 2

due Tuesday, May 17 at 9:29am

Bora Jin

Getting Started

Minneapolis Housing Data

Part 1: Identify variable types

Part 2: Numeric variables

Part 3: Categorical variables

Part 4: Fix errors

Practice

Submitting Application Exercises

Additional Resources