Copy the repository's SSH URL. In RStudio, go to File \(\rightarrow\) New Project \(\rightarrow\) Version Control \(\rightarrow\) Git and paste the URL. Open the .Rmd file and replace “Your Name” with your name.
We always begin by loading relevant libraries.
library(tidyverse)
Next, we load data. We will continue our investigation of single-family home prices in Minneapolis, Minnesota.
mn_homes <- read_csv("data/mn_homes.csv")
Add a glimpse() call to the code chunk below and identify each of the following variables as numeric continuous, numeric discrete, categorical ordinal, or categorical nominal.
salesprice
numfireplaces
community
# code here
The summary() command is also useful for looking at numeric variables. Use it to look at the numeric variables from the previous chunk.
# code here
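As a rough illustration of what these two commands report, here is a sketch on a small made-up data frame. The object toy_homes and all of its columns are hypothetical and are not part of mn_homes; the sketch assumes the tidyverse is installed.

```r
library(tidyverse)

# Hypothetical toy data, invented for illustration only (not from mn_homes)
toy_homes <- tibble(
  price     = c(150000, 232500, 410000, 189900),           # numeric continuous
  bedrooms  = c(2L, 3L, 4L, 3L),                           # numeric discrete
  condition = factor(c("fair", "good", "excellent", "good"),
                     levels = c("fair", "good", "excellent"),
                     ordered = TRUE),                      # categorical ordinal
  style     = c("ranch", "colonial", "ranch", "bungalow")  # categorical nominal
)

# glimpse() lists every column with its type and its first few values
glimpse(toy_homes)

# summary() reports min, quartiles, mean, and max for numeric columns
summary(select(toy_homes, price, bedrooms))
```

The column types shown by glimpse() are a starting point, but deciding whether a variable is continuous, discrete, ordinal, or nominal still depends on what the values mean.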
We can use a histogram to summarize a numeric variable. Experiment with different values of binwidth.
ggplot(data = mn_homes,
       mapping = aes(x = salesprice)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
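To see how much binwidth matters, the sketch below draws the same simulated data with two different bin widths and counts the resulting bars. The data frame toy and its price column are hypothetical, simulated only for illustration.

```r
library(ggplot2)

# Hypothetical simulated prices, for illustration only
set.seed(1)
toy <- data.frame(price = rlnorm(500, meanlog = 12, sdlog = 0.4))

# Same data, two bin widths: narrow bins show more detail but also more noise
p_narrow <- ggplot(toy, aes(x = price)) + geom_histogram(binwidth = 10000)
p_wide   <- ggplot(toy, aes(x = price)) + geom_histogram(binwidth = 100000)

# layer_data() returns the computed bins, one row per bar
nrow(layer_data(p_narrow))
nrow(layer_data(p_wide))
```

Printing p_narrow and p_wide side by side shows the trade-off: too many bars make the shape jagged, while too few hide features such as skew or multiple peaks.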
Now let’s look at the distribution of price for each community. We will make a faceted histogram where each facet represents a community and displays the distribution of sales price for that community. Fill in the blank with an appropriate variable.
Note: Once you have modified the code, remove the option eval = FALSE from the code chunk header and knit to see the updates.
ggplot(data = mn_homes,
       mapping = aes(x = salesprice)) +
  geom_histogram(binwidth = ____) +
  facet_wrap(~____)
You might notice that 1) one community contains a few very expensive homes and 2) some communities have fewer homes sold than others. Both make it hard to read the histograms of sales price for all of the communities at once. In this case, we might let the axis scales vary across histograms with scales = "free_x", scales = "free_y", or scales = "free". Choose the option that you think best solves these issues.
Note: Once you have modified the code, remove the option eval = FALSE from the code chunk header and knit to see the updates.
ggplot(data = mn_homes,
       mapping = aes(x = salesprice)) +
  geom_histogram(binwidth = 100000) +
  facet_wrap(~____, scales = ____)
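The three scales options can be compared on a small made-up example before trying them on mn_homes. In this sketch, the data frame toy, its columns group and price, and all of the values are hypothetical.

```r
library(ggplot2)

# Hypothetical toy data: group "a" has many cheap homes, group "b" has few,
# one of them very expensive (values invented for illustration)
toy <- data.frame(
  group = c(rep("a", 8), rep("b", 3)),
  price = c(100, 110, 120, 120, 130, 140, 150, 160, 200, 250, 900)
)

base <- ggplot(toy, aes(x = price)) + geom_histogram(binwidth = 50)

# "free_x": each panel gets its own x-axis range
# "free_y": each panel gets its own y-axis (count) range
# "free":   both axes vary by panel
options <- c("free_x", "free_y", "free")
plots <- lapply(options, function(s) base + facet_wrap(~group, scales = s))
names(plots) <- options

plots$free_x  # print one plot at a time and compare the axes
```

Keep in mind that free scales make each panel easier to read on its own but harder to compare with the others, because the same position no longer means the same value in every panel.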
A density plot is another option. Roughly speaking, it replaces the bars of a histogram with a smooth curve. Fill in the code below to create a density plot of salesprice.
Note: Once you have modified the code, remove the option eval = FALSE from the code chunk header and knit to see the updates.
ggplot(data = mn_homes,
       mapping = aes(x = ____)) +
  geom_xxxx()
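The relationship between a histogram and a density curve can be sketched in base R, outside of ggplot2. The vector x below is simulated and hypothetical, and density() is used here with its default settings, which may differ from the defaults ggplot2 uses.

```r
# Hypothetical simulated values, for illustration only
set.seed(42)
x <- rnorm(300, mean = 200, sd = 40)

# Histogram on the density scale, so the bar areas sum to 1
h <- hist(x, freq = FALSE, main = "Histogram with density curve")

# Kernel density estimate: a smooth curve over the same data
d <- density(x)
lines(d, lwd = 2)

# Both summaries describe the same distribution, so both areas are about 1
hist_area    <- sum(h$density * diff(h$breaks))
density_area <- sum(d$y) * diff(d$x[1:2])
hist_area
density_area
```

The smooth curve is not literally drawn through the tops of the bars; it is estimated from the data directly, which is why its shape depends on a smoothing bandwidth in the same way a histogram depends on binwidth.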
Side-by-side boxplots are helpful to visualize the distribution of a numeric variable across the levels of a categorical variable.
ggplot(data = mn_homes,
       mapping = aes(x = community, y = salesprice)) +
  geom_boxplot() +
  coord_flip() +
  labs(title = "Sales Price by Community", x = "Community", y = "Sales Price")
Q - What is coord_flip() doing in the code chunk above? Try removing it to see. Does it affect labs()?
Q - Can you detect any homes sold for unusually high prices? In which community? Does it match what we observed in the faceted histogram above?
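By default, a boxplot draws as individual points the observations that lie more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule by hand to a made-up vector; prices and its values (in thousands of dollars) are hypothetical.

```r
# Hypothetical sale prices in thousands of dollars, for illustration only
prices <- c(180, 200, 210, 220, 235, 250, 260, 900)

# Quartiles and interquartile range
q   <- unname(quantile(prices, c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Fences: values outside them are flagged as potential outliers
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

outliers <- prices[prices < lower_fence | prices > upper_fence]
outliers  # → 900
```

A flagged point is only unusual relative to the rest of its group, so an expensive home can be an outlier in one community and unremarkable in another.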
Bar plots allow us to visualize categorical variables.
ggplot(data = mn_homes, mapping = aes(x = community)) +
  geom_bar() +
  labs(title = "Homes by Community", x = "Community", y = "Number of Homes")
Segmented bar plots can be used to visualize two categorical variables. Fill in the blanks to segment the number of homes in each community by whether the home has a fireplace (the variable fireplace). We want horizontal bars.
Note: Once you have modified the code, remove the option eval = FALSE from the code chunk header and knit to see the updates.
ggplot(data = mn_homes, mapping = aes(x = ____, fill = ____)) +
  geom_bar() +
  coord_flip() +
  scale_fill_viridis_d(option = "cividis", name = "Fireplace?") +
  labs(title = "Fireplaces by Community",
       x = "Community", y = "Number of Homes")
Fill in the blank with an informative y-label in the following code chunk.
ggplot(data = mn_homes, mapping = aes(x = community, fill = fireplace)) +
  geom_bar(position = "fill") +
  scale_fill_viridis_d(option = "cividis", name = "Fireplace?") +
  coord_flip() +
  labs(title = "Percentage of Homes with a Fireplace by Community",
       x = "Community", y = "____")
Q - Which of the above two visualizations do you prefer? Why? Is this answer always the same?
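The difference between the two plots comes down to counts versus within-group proportions: position = "fill" rescales every bar to the same total height. A minimal base R sketch of the two underlying tables follows, using a made-up data frame; toy and its values are hypothetical.

```r
# Hypothetical toy data, invented for illustration only
toy <- data.frame(
  community = c("A", "A", "A", "A", "B", "B"),
  fireplace = c("yes", "yes", "yes", "no", "yes", "no")
)

# Counts: what a stacked (default) segmented bar plot displays
counts <- table(toy$community, toy$fireplace)
counts

# Row proportions: what position = "fill" displays (each row sums to 1)
props <- prop.table(counts, margin = 1)
props
```

Here community A has three times as many homes with fireplaces as community B, which the counts show, but the proportions show that the share with a fireplace is 0.75 in A against 0.5 in B. Which view is more useful depends on whether the question is about totals or about rates.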
There is something wrong with each of the plots below. Run the code for each plot, read the error, then identify and fix the problem.
Note: Once you have modified the code, remove the option eval = FALSE from the code chunk header and knit to see the updates.
ggplot(data = mn_homes) +
geom_point(mapping = aes(x = lotsize, y = salesprice,
shape = 21, size = .85))
ggplot(data = mn_homes, mapping = (x = otsize, y = area)) +
geom_point(, shape = 21, size = .85)
ggplot(data = mn_homes) +
geom_point(mapping = aes(x = lotsize, y = area),
color = community, size = 0.85)
ggplot(data = mn_homes) +
geom_point(mapping = aes(x = 1otsize, y = area))
General principles for effective data visualization
Modify the code outline to create a ridge plot examining the distribution of year built within each community.
Note: Once you have modified the code, remove the option eval = FALSE from the code chunk header and knit to see the updates.
library(ggridges)
ggplot(data = ___, aes(x = ___, y = ___, fill = ____, color = ____)) +
  geom_density_ridges(alpha = 0.5) +
  scale_fill_viridis_d() +
  scale_color_viridis_d() +
  labs(x = "_____",
       y = "_____",
       fill = "_____",
       color = "_____",
       title = "_____",
       subtitle = "_____")