Getting Started

  • Go to the course GitHub organization page and find the repository entitled “ae03-GitHubUsername”.
  • Click the green “code” button and copy the SSH URL.
  • Go to RStudio, select File → New Project → Version Control → Git, and paste the URL.
  • Open the .Rmd file.

Minneapolis Housing Data

Exploratory data analysis (EDA) is an approach to analyzing datasets in order to summarize their main characteristics, often with visual representations of the data (today's focus). We can also calculate summary statistics and perform data wrangling, manipulation, and transformation (next week).

We will introduce visualization using data on single-family homes sold in Minneapolis, Minnesota between 2005 and 2015.

We start by loading the tidyverse, which includes ggplot2 for plotting and readr for reading in data:

library(tidyverse)

Part 1: Data

Q - What happens when you click the green arrow in the code chunk below? What changes in the “Environment” pane?

[Write your answer here. You will do this for questions like this one in your .Rmd file.]

mn_homes <- read_csv("data/mn_homes.csv")

Q - In a data frame, what does each row represent? Each column? Does glimpse() output match this?

glimpse(mn_homes)
## Rows: 495
## Columns: 13
## $ saleyear      <dbl> 2012, 2014, 2005, 2010, 2010, 2013, 2011, 2007, 2013, 20…
## $ salemonth     <dbl> 6, 7, 7, 6, 2, 9, 1, 9, 10, 6, 7, 8, 5, 2, 7, 6, 10, 6, …
## $ salesprice    <dbl> 690467.0, 235571.7, 272507.7, 277767.5, 148324.1, 242871…
## $ area          <dbl> 3937, 1440, 1835, 2016, 2004, 2822, 2882, 1979, 3140, 35…
## $ beds          <dbl> 5, 2, 2, 3, 3, 3, 4, 3, 4, 3, 3, 3, 2, 3, 3, 6, 2, 3, 2,…
## $ baths         <dbl> 4, 1, 1, 2, 1, 3, 3, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1,…
## $ stories       <dbl> 2.5, 1.7, 1.7, 2.5, 1.0, 2.0, 1.7, 1.5, 1.5, 2.5, 1.0, 2…
## $ yearbuilt     <dbl> 1907, 1919, 1913, 1910, 1956, 1934, 1951, 1929, 1940, 19…
## $ neighborhood  <chr> "Lowry Hill", "Cooper", "Hiawatha", "King Field", "Shing…
## $ community     <chr> "Calhoun-Isles", "Longfellow", "Longfellow", "Southwest"…
## $ lotsize       <dbl> 6192, 5160, 5040, 4875, 5060, 6307, 6500, 5600, 6350, 75…
## $ numfireplaces <dbl> 0, 0, 0, 0, 0, 2, 2, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,…
## $ fireplace     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, TR…
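
As a minimal sketch of the row and column idea, the snippet below builds a small hypothetical tibble and checks its dimensions. The name toy_homes and all of its values are made up for illustration and are not taken from mn_homes. Each row is one observation (here, one home) and each column is one variable.

```r
library(tibble)

# Hypothetical toy data: three made-up homes, not rows from mn_homes
toy_homes <- tibble(
  area      = c(1200, 1850, 2400),
  beds      = c(2, 3, 4),
  fireplace = c(FALSE, TRUE, TRUE)
)

nrow(toy_homes)   # number of observations (homes)
ncol(toy_homes)   # number of variables
names(toy_homes)  # variable names, the same names glimpse() lists after each $
```

Running the same three functions on mn_homes is one way to compare against the "Rows" and "Columns" lines in the glimpse() output.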

Part 2: ggplot layers

ggplot() creates the initial base coordinate system that we will add layers to. We first specify the dataset we will use with data = mn_homes. The mapping argument is paired with an aesthetic (aes), which tells us how the variables in our dataset should be mapped to the visual properties of the graph.

Q - What does the first code chunk immediately below do?

ggplot(data = mn_homes, 
       mapping = aes(x = area, y = salesprice))

ggplot(data = mn_homes, 
       mapping = aes(x = area, y = salesprice)) + 
   geom_point()

ggplot(data = mn_homes, 
       mapping = aes(x = area, y = salesprice)) + 
   geom_point() + 
   geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Q - What does geom_smooth() do? Hint: Run ?geom_smooth in the console.

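geom_smooth() adds a smoothed trend line, with a shaded confidence band, to help reveal patterns in the data. With fewer than 1,000 observations, as here, the default method is loess (local regression).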
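
The smoother can be changed through the method argument of geom_smooth(). Here is a minimal sketch that uses a small simulated data frame (the name toy and its values are hypothetical, not mn_homes) so that it runs on its own. Setting method = "lm" fits a straight line instead of a loess curve, and se = FALSE hides the confidence band.

```r
library(ggplot2)

# Simulated toy data for illustration only (not the mn_homes data)
set.seed(1)
toy <- data.frame(area = seq(1000, 3000, length.out = 50))
toy$price <- 100 * toy$area + rnorm(50, sd = 20000)

p_lm <- ggplot(data = toy, mapping = aes(x = area, y = price)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)  # straight-line fit, no confidence band

p_lm
```

Swapping toy for mn_homes (with y = salesprice) should let you compare the straight-line fit against the default loess curve.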

ggplot(data = mn_homes, 
       mapping = aes(x = area, y = salesprice)) + 
   geom_point() + 
   geom_smooth() +
   labs(title = "Sales price vs. area of homes in Minneapolis, MN",
        x = "Area (square feet)", y = "Sales Price (dollars)")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

The procedure used to construct plots can be summarized using the code below.

ggplot(data = [dataset], 
       mapping = aes(x = [x-variable], y = [y-variable])) +
   geom_xxxx() +
   geom_xxxx() + 
   other options

Q - What do you think eval = FALSE is doing in the code chunk above?

Part 3: Aesthetics

An aesthetic is a visual property of one of the objects in your plot. Common aesthetics include:

  • shape
  • color
  • size
  • alpha (transparency)

We can map a variable in our dataset to a color, a size, a transparency, and so on. The aesthetics that can be used with each geom_xxxx can be found in the documentation.

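Here we are going to use the viridis package, which provides color palettes designed to be more accessible to viewers with color blindness. scale_color_viridis_d() specifies which palette to use for a discrete variable. You can learn more about the options in the viridis package documentation.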

Other sources that can be helpful in devising accessible color schemes include the scico package, Color Brewer, the Wes Anderson package, and the cividis package.

This visualization shows a scatter plot of area (x variable) and sales price (y variable). Using scale_color_viridis_d() with the cividis option, we color points for houses with a fireplace yellow and those without navy. We also add axis labels and an overall title.

library(viridis)
## Loading required package: viridisLite
ggplot(data = mn_homes, 
       mapping = aes(x = area, y = salesprice,
                     color = fireplace)) + 
   geom_point() + 
   labs(title = "Sales price vs. area of homes in Minneapolis, MN",
        x = "Area (square feet)", y = "Sales Price (dollars)") + 
  scale_color_viridis_d(option = "cividis", name="Fireplace?")

Q - What will the visualization look like below? Write your answer down before running the code.

ggplot(data = mn_homes, 
       mapping = aes(x = area, y = salesprice,
                     shape = fireplace)) + 
   geom_point() + 
   labs(title = "Sales price vs. area of homes in Minneapolis, MN",
        x = "Area (square feet)", y = "Sales Price (dollars)",
        shape="Fireplace?") 

Q - This one?

ggplot(data = mn_homes, 
       mapping = aes(x = area, y = salesprice,
                     color = fireplace,
                     size = lotsize)) + 
   geom_point() + 
   labs(title = "Sales price vs. area of homes in Minneapolis, MN",
        x = "Area (square feet)", y = "Sales Price (dollars)", 
        size = "Lot Size") +
  scale_color_viridis_d(option = "cividis", name="Fireplace?")

Q - Are the above visualizations effective? Why or why not? How might you improve them?

Q - What is the difference between the two plots below?

ggplot(data = mn_homes) + 
  geom_point(mapping = aes(x = area, y = salesprice, color = "blue"))

ggplot(data = mn_homes) + 
  geom_point(mapping = aes(x = area, y = salesprice), color = "blue")

Use aes() to map variables to plot features. Use arguments in geom_xxxx(), outside of aes(), for customization that is not mapped to a variable.
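
One way to check this is to look at the colors ggplot2 actually computes for each point. The sketch below uses a tiny hypothetical data frame (toy, with made-up values, not mn_homes) and layer_data() to pull out the computed colour column for each plot.

```r
library(ggplot2)

# Tiny hypothetical data frame, for illustration only
toy <- data.frame(area  = c(1200, 1850, 2400),
                  price = c(150000, 230000, 310000))

# "blue" inside aes() is treated as a one-level variable, not as a color name
p_mapped <- ggplot(data = toy) +
  geom_point(mapping = aes(x = area, y = price, color = "blue"))

# "blue" outside aes() is a fixed setting applied to every point
p_fixed <- ggplot(data = toy) +
  geom_point(mapping = aes(x = area, y = price), color = "blue")

unique(layer_data(p_mapped)$colour)  # a default palette color picked by the color scale
unique(layer_data(p_fixed)$colour)   # the literal color "blue"
```

The mapped version also gets a legend with a single entry labeled "blue", which is a quick visual cue that the value was treated as data.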

Mappings in the ggplot() function are global, meaning they apply to every layer we add. Mappings in a particular geom_xxxx() function are local, meaning they apply only to that layer.
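
The difference between global and local mappings shows up as soon as a plot has more than one layer. The following sketch uses hypothetical toy data (made-up values, not mn_homes): mapping color globally splits both the points and the smoother by fireplace, while mapping it locally in geom_point() leaves a single smoother for all homes.

```r
library(ggplot2)

# Hypothetical toy data with a logical grouping variable (not mn_homes)
toy <- data.frame(
  area      = c(1200, 1500, 1850, 2100, 2400, 2800),
  price     = c(150000, 180000, 230000, 250000, 310000, 360000),
  fireplace = c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE)
)

# Global: color is mapped in ggplot(), so points AND smoothers are split by fireplace
p_global <- ggplot(toy, aes(x = area, y = price, color = fireplace)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Local: color is mapped only in geom_point(), so one smoother is drawn for all homes
p_local <- ggplot(toy, aes(x = area, y = price)) +
  geom_point(aes(color = fireplace)) +
  geom_smooth(method = "lm", se = FALSE)

p_global
p_local
```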

Create a scatter plot using variables of your choosing using the mn_homes data.

Modify your scatter plot above by coloring the points for each community.

Part 4: Faceting

We can use smaller plots to display different subsets of the data using faceting. This is helpful to examine conditional relationships.

Let’s try a few simple examples of faceting. Note that these plots should be improved by careful consideration of labels, aesthetics, etc.

ggplot(data = mn_homes, 
       mapping = aes(x = area, y = salesprice)) + 
   geom_point() + 
   labs(title = "Sales price vs. area of homes in Minneapolis, MN",
        x = "Area (square feet)", y = "Sales Price (dollars)") + 
   facet_grid(. ~ beds)

ggplot(data = mn_homes, 
       mapping = aes(x = area, y = salesprice)) + 
   geom_point() + 
   labs(title = "Sales price vs. area of homes in Minneapolis, MN",
        x = "Area (square feet)", y = "Sales Price (dollars)") + 
   facet_grid(beds ~ .)

ggplot(data = mn_homes, 
       mapping = aes(x = area, y = salesprice)) + 
   geom_point() + 
   labs(title = "Sales price vs. area of homes in Minneapolis, MN",
        x = "Area (square feet)", y = "Sales Price (dollars)") + 
   facet_grid(beds ~ baths)

ggplot(data = mn_homes, 
       mapping = aes(x = area, y = salesprice)) + 
   geom_point() + 
   labs(title = "Sales price vs. area of homes in Minneapolis, MN",
        x = "Area (square feet)", y = "Sales Price (dollars)") + 
   facet_wrap(~ community)

facet_grid()

  • 2d grid
  • rows ~ cols
  • use . for no split in that direction (rows or columns)

facet_wrap()

  • 1d ribbon wrapped into 2d
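
To see the two layouts side by side, here is a short sketch on hypothetical toy data (the communities "A", "B", "C" and all values are made up, not taken from mn_homes). The ncol argument of facet_wrap() controls how many columns the ribbon of panels is wrapped into.

```r
library(ggplot2)

# Hypothetical toy data: made-up homes in three made-up communities
toy <- data.frame(
  area      = c(1200, 1500, 1850, 2100, 2400, 2800),
  price     = c(150000, 180000, 230000, 250000, 310000, 360000),
  community = rep(c("A", "B", "C"), times = 2),
  beds      = rep(c(2, 3), each = 3)
)

base <- ggplot(toy, aes(x = area, y = price)) +
  geom_point()

# 2d grid: one row of panels per value of beds, one column per community
p_grid <- base + facet_grid(beds ~ community)

# 1d ribbon: three community panels wrapped into two columns
p_wrap <- base + facet_wrap(~ community, ncol = 2)

p_grid
p_wrap
```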

Practice

  1. Modify the code outline to make the changes described below.
  • Change the color of the points to green.
  • Add alpha to make the points more transparent.
  • Add labels for the x axis, y axis, and the color of the points.
  • Add an informative title.
  • Consider using the viridis palette. (Note: you can't apply all of these color changes at once. These are just suggestions.)

When you are finished, remove eval = FALSE and knit the file to see the changes.

Here is some starter code:

ggplot(data = mn_homes, 
       mapping = aes(x = lotsize, y = salesprice)) + 
   geom_point(color = ____, alpha = ____) + 
   labs(____)

  2. Modify the code outline to make the changes described below.
  • Create a histogram of lotsize.
  • Modify the histogram by adding fill = "blue" inside the geom_histogram() function.
  • Modify the histogram by adding color = "red" inside the geom_histogram() function.

When you are finished, remove eval = FALSE and knit the file to see the changes.

ggplot(data = mn_homes, 
       mapping = aes(x = _____)) +
  geom_histogram(fill = ____, color = ____) +
  labs(title = "Histogram of Lot Size" , x = "Size of Lot", y = "Number of Homes")

Q - What is the difference between the color and fill arguments?

  3. Develop an effective visualization on your own using the code chunk provided below. Use three variables and at least one aesthetic mapping.

Submitting Application Exercises

  • Once you have completed the activity, push your final changes to your GitHub repo.
  • Make sure you have committed at least three times.
  • Check that your repo is updated on GitHub. That's all you need to do to submit application exercises for participation.