AE 07: Spatial Data and Visualization

Getting Started

Covid-19 Community Level

Part 1: Spatial Data are Different
Part 2: sf and dplyr
Part 3: sf and ggplot

Practice

Submitting Application Exercises

Additional Resources

Getting Started

Go to the course GitHub organization page and find the repository entitled “ae07-GitHubUsername”.
Click the green “code” button and copy the SSH URL.
Go to RStudio, select File New Project Version Control Git and paste the URL.
Open the .Rmd file and replace “Your Name” with your name.

Covid-19 Community Level

The following paragraph and figure directly come from the website of Centers for Disease Control and Prevention (CDC):

COVID-19 Community Levels are a new tool to help communities decide what prevention steps to take based on the latest data. Levels can be low, medium, or high and are determined by looking at hospital beds being used, hospital admissions, and the total number of new COVID-19 cases in an area.

We will investigate how Covid-19 community levels have changed across US counties since the first data release on Feb 24, 2022. The data are weekly updated, and we use data slightly modified from the version updated on May 5, 2022. Click here if you are interested in raw data.

We first load relevant packages:

library(tidyverse)
library(sf)

Part 1: Spatial Data are Different

We load Covid-19 community level data in a csv file. This is our typical “tidy” data frame.

covid <- read_csv("data/covid.csv") 
covid

## # A tibble: 35,464 × 12
##    county   county_fips state county_populati… health_service_… health_service_…
##    <chr>    <chr>       <chr>            <dbl>            <dbl> <chr>           
##  1 America… 60000       Amer…            47392              901 American Samoa  
##  2 Guam     66000       Guam            168489              902 Guam            
##  3 Commonw… 69000       Comm…            51851              903 Commonwealth of…
##  4 United … 78000       Unit…           106290              905 United States V…
##  5 America… 60000       Amer…            47392              901 American Samoa  
##  6 Guam     66000       Guam            168489              902 Guam            
##  7 Commonw… 69000       Comm…            51851              903 Commonwealth of…
##  8 United … 78000       Unit…           106290              905 United States V…
##  9 America… 60000       Amer…            47392              901 American Samoa  
## 10 Guam     66000       Guam            168489              902 Guam            
## # … with 35,454 more rows, and 6 more variables:
## #   health_service_area_population <dbl>,
## #   covid_inpatient_bed_utilization <dbl>,
## #   covid_hospital_admissions_per_100k <dbl>, covid_cases_per_100k <dbl>,
## #   covid_community_level <chr>, date <date>

We use filter and select to focus on the contiguous US and the following variables:

variable name	description
`state`	State name
`county`	County name
`county_fips`	Federal Information Processing Standards (FIPS) five character county code
`county_population`	County population (2019 Census estimate)
`covid_community_level`	Covid-19 community level
`date`	Date of data release

covid <- covid %>% 
  filter(!(state %in% c("United States Virgin Islands", 
                        "Commonwealth of the Northern Mariana Islands", 
                        "American Samoa",
                        "Puerto Rico", 
                        "Alaska", 
                        "Hawaii", 
                        "Guam"))) %>% 
  select(state, county, county_fips, county_population, 
         covid_community_level, date)

In order to draw maps based on Covid-19 community level, we need another data set that has information on geometry of US counties. We read the shapefile us_counties.shp using st_read(). Its original shapefile (.shp) is downloadable from the US Census Bureau under “cb_2018_us_county_20m.zip” (resolution level 1:20,000,000).

The loaded data frame has the following variables:

variable name	description
`statefips`	State FIPS
`countyfips`	County FIPS
`county`	County name
`geometry`	geometry features

This is an sf object.

us_counties <- st_read("data/us_counties.shp", quiet = TRUE)
us_counties

## Simple feature collection with 3108 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -124.7258 ymin: 24.49813 xmax: -66.9499 ymax: 49.38436
## Geodetic CRS:  NAD83
## First 10 features:
##    statefips countyfips   county                       geometry
## 1         37        017   Bladen MULTIPOLYGON (((-78.902 34....
## 2         37        167   Stanly MULTIPOLYGON (((-80.49737 3...
## 3         39        153   Summit MULTIPOLYGON (((-81.68699 4...
## 4         42        113 Sullivan MULTIPOLYGON (((-76.81373 4...
## 5         48        459   Upshur MULTIPOLYGON (((-95.15274 3...
## 6         48        049    Brown MULTIPOLYGON (((-99.19587 3...
## 7         45        021 Cherokee MULTIPOLYGON (((-81.87441 3...
## 8         01        043  Cullman MULTIPOLYGON (((-87.11199 3...
## 9         54        023    Grant MULTIPOLYGON (((-79.48687 3...
## 10        46        055   Haakon MULTIPOLYGON (((-102.0011 4...

Q - What differences do you observe when comparing a typical tidy data frame to the new simple feature object?

Part 2: `sf` and `dplyr`

The sf package plays nicely with our earlier data wrangling functions from dplyr.

`filter()`

Compare county FIPS from two data sets by filtering for Durham county in NC. The state FIPS for NC is 37. Check variable names, types, and value formats before filtering.

us_counties %>% 
  filter(____, ____) 

covid %>% 
  filter(____, ____)

Q - How are they different?

`mutate()`

In us_counties data, we will create a new variable county_fips that exactly matches county_fips from covid data. Here we use paste0() to concatenate strings.

us_counties <- us_counties %>% 
  mutate(county_fips = paste0(statefips, countyfips))

`select()`

Let’s check if county_fips is well created as we intended by selecting statefips, countyfips, and county_fips from us_counties.

us_counties %>% 
  select(statefips, countyfips, county_fips)

## Simple feature collection with 3108 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -124.7258 ymin: 24.49813 xmax: -66.9499 ymax: 49.38436
## Geodetic CRS:  NAD83
## First 10 features:
##    statefips countyfips county_fips                       geometry
## 1         37        017       37017 MULTIPOLYGON (((-78.902 34....
## 2         37        167       37167 MULTIPOLYGON (((-80.49737 3...
## 3         39        153       39153 MULTIPOLYGON (((-81.68699 4...
## 4         42        113       42113 MULTIPOLYGON (((-76.81373 4...
## 5         48        459       48459 MULTIPOLYGON (((-95.15274 3...
## 6         48        049       48049 MULTIPOLYGON (((-99.19587 3...
## 7         45        021       45021 MULTIPOLYGON (((-81.87441 3...
## 8         01        043       01043 MULTIPOLYGON (((-87.11199 3...
## 9         54        023       54023 MULTIPOLYGON (((-79.48687 3...
## 10        46        055       46055 MULTIPOLYGON (((-102.0011 4...

Notice that geometries are “sticky”. They are kept until deliberately dropped using st_drop_geometry. Manipulating spatial data objects is similar, but not identical to manipulating data frames.

us_counties %>% 
  select(county, county_fips) %>% 
  st_drop_geometry() %>% 
  slice(1:5)

##     county county_fips
## 1   Bladen       37017
## 2   Stanly       37167
## 3   Summit       39153
## 4 Sullivan       42113
## 5   Upshur       48459

`something_join()`

Create a new sf object covid_sf by joining covid and us_counties keeping all rows in covid. Note that two variable names are common, namely, county_fips and county, in both data frames.

glimpse(covid)

## Rows: 34,188
## Columns: 6
## $ state                 <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Ala…
## $ county                <chr> "Geneva County", "Greene County", "Hale County",…
## $ county_fips           <chr> "01061", "01063", "01065", "01067", "01069", "01…
## $ county_population     <dbl> 26271, 8111, 14651, 17205, 105882, 51626, 658573…
## $ covid_community_level <chr> "High", "Medium", "Medium", "High", "High", "Hig…
## $ date                  <date> 2022-02-24, 2022-02-24, 2022-02-24, 2022-02-24,…

glimpse(us_counties)

## Rows: 3,108
## Columns: 5
## $ statefips   <chr> "37", "37", "39", "42", "48", "48", "45", "01", "54", "46"…
## $ countyfips  <chr> "017", "167", "153", "113", "459", "049", "021", "043", "0…
## $ county      <chr> "Bladen", "Stanly", "Summit", "Sullivan", "Upshur", "Brown…
## $ geometry    <MULTIPOLYGON [°]> MULTIPOLYGON (((-78.902 34...., MULTIPOLYGON …
## $ county_fips <chr> "37017", "37167", "39153", "42113", "48459", "48049", "450…

Q - Which variable should we use to join the data frames by? Can we use any? Why or why not?

covid_sf <- covid %>% 
  left_join(us_counties, by = "county_fips") %>% # tbl_df/ tbl/ data.frame, no sf!
  st_as_sf()
covid_sf

## Simple feature collection with 34188 features and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -124.7258 ymin: 24.49813 xmax: -66.9499 ymax: 49.38436
## Geodetic CRS:  NAD83
## # A tibble: 34,188 × 10
##    state   county.x     county_fips county_populati… covid_community… date      
##    <chr>   <chr>        <chr>                  <dbl> <chr>            <date>    
##  1 Alabama Geneva Coun… 01061                  26271 High             2022-02-24
##  2 Alabama Greene Coun… 01063                   8111 Medium           2022-02-24
##  3 Alabama Hale County  01065                  14651 Medium           2022-02-24
##  4 Alabama Henry County 01067                  17205 High             2022-02-24
##  5 Alabama Houston Cou… 01069                 105882 High             2022-02-24
##  6 Alabama Jackson Cou… 01071                  51626 High             2022-02-24
##  7 Alabama Jefferson C… 01073                 658573 High             2022-02-24
##  8 Alabama Lamar County 01075                  13805 Medium           2022-02-24
##  9 Alabama Lauderdale … 01077                  92729 Medium           2022-02-24
## 10 Alabama Lawrence Co… 01079                  32924 Low              2022-02-24
## # … with 34,178 more rows, and 4 more variables: statefips <chr>,
## #   countyfips <chr>, county.y <chr>, geometry <MULTIPOLYGON [°]>

Q - Check what happens to the common variable that is not used in joining.

`group_by()`, `summarize()`

Using covid, let’s compute the total population living in counties with each of the Covid-19 community levels based on the latest data (date == "2022-05-05"). Complete the code chunk below.

covid %>% 
  filter(date == "2022-05-05") %>%

Part 3: `sf` and `ggplot`

We will examine how the Covid-19 community levels change over space and over time. As usual, we can build up a visualization layer-by-layer beginning with ggplot. Let’s start by making a basic plot of all US counties.

covid_sf %>% 
  ggplot() + 
  geom_sf() + 
  labs(title = "US counties")

Let’s try NC.

covid_sf %>% 
  filter(state == "_____") %>% 
  ggplot() + 
  geom_xxxx() + 
  labs(title = "NC counties")

Now adjust the theme with theme_bw().

covid_sf %>% 
  filter(state == "_____") %>% 
  ggplot() + 
  geom_xxxx() + 
  labs(title = "NC counties") +
  theme_bw() # white canvas

If you liked the theme, we can globally fix it as a default theme instead of attaching that layer for every ggplot.

theme_set(theme_bw())

We now fill each county by the latest value of Covid-19 community level (date == "2022-05-05"). To match colors used in CDC, we use scale_fill_manual().

covid_sf %>% 
  ______(date == "2022-05-05") %>% 
  ggplot() + 
  geom_sf(aes(fill = _______)) + 
  labs(title = "Covid-19 community level across US counties as of May 5, 2022", 
       fill = "Covid-19 community level") +
  scale_fill_manual(values = c("High" = "#fb8b5b",
                               "Medium" = "#f9f99d", 
                               "Low" = "#00cc99"))

Now adjust color in geom_sf to change the color of the county borders.

covid_sf %>% 
  filter(date == "2022-05-05") %>% 
  ggplot() + 
  geom_sf(aes(fill = covid_community_level), # mapping
          _________) + # setting
  labs(title = "Covid-19 community level across US counties as of May 5, 2022", 
       fill = "Covid-19 community level") +
  scale_fill_manual(values = c("High" = "#fb8b5b",
                               "Medium" = "#f9f99d", 
                               "Low" = "#00cc99"))

Adjust the width of the county borders using size.

covid_sf %>% 
  filter(date == "2022-05-05") %>% 
  ggplot() + 
  geom_sf(aes(fill = covid_community_level), # mapping
          _________) + # setting
  labs(title = "Covid-19 community level across US counties as of May 5, 2022", 
       fill = "Covid-19 community level") +
  scale_fill_manual(values = c("High" = "#fb8b5b",
                               "Medium" = "#f9f99d", 
                               "Low" = "#00cc99"))

Finally, adjust the transparency using alpha by mapping

covid_sf %>% 
  filter(date == "2022-05-05") %>% 
  ggplot() + 
  geom_sf(aes(fill = covid_community_level), # mapping
          _________) + # setting
  labs(title = "Covid-19 community level across US counties as of May 5, 2022", 
       fill = "Covid-19 community level") +
  scale_fill_manual(values = c("High" = "#fb8b5b",
                               "Medium" = "#f9f99d", 
                               "Low" = "#00cc99"))

Q - What is your final choice of color, size, and alpha?

# code here

Q - Create the same plot for NC counties only. Different choices of color, size, and alpha might be more appealing in this case.

# code here

Q - What is the county that has a different community level than the rest of the counties in NC?

# code here

We can visualize temporal trends of Covid-19 community levels by faceting.

Q - Create a faceted plot of US counties over all available dates where each county is filled with Covid-19 community level. Briefly explain the spatial and temporal trends.

# code here

Q - Repeat the previous exercise for states of your choosing.

# code here

Practice

Write a brief research question that you could answer with this dataset and then investigate it here.

# code here

What are limitations of your visualizations above?

Submitting Application Exercises

Once you have completed the activity, push your final changes to your GitHub repo.
Make sure you committed at least three times.
Check that your repo is updated on GitHub, and that’s all you need to do to submit application exercises for participation.

AE 07: Spatial Data and Visualization

due Monday, May 23 at 9:29am

Bora Jin

Getting Started

Covid-19 Community Level

Part 1: Spatial Data are Different

Part 2: `sf` and `dplyr`

`filter()`

`mutate()`

`select()`

`something_join()`

`group_by()`, `summarize()`

Part 3: `sf` and `ggplot`

Practice

Submitting Application Exercises

Additional Resources

AE 07: Spatial Data and Visualization

due Monday, May 23 at 9:29am

Bora Jin

Getting Started

Covid-19 Community Level

Part 1: Spatial Data are Different

Part 2: sf and dplyr

filter()

mutate()

select()

something_join()

group_by(), summarize()

Part 3: sf and ggplot

Practice

Submitting Application Exercises

Additional Resources

Part 2: `sf` and `dplyr`

`filter()`

`mutate()`

`select()`

`something_join()`

`group_by()`, `summarize()`

Part 3: `sf` and `ggplot`