SSH
URL.File
\(\rightarrow\) New Project
\(\rightarrow\) Version Control
\(\rightarrow\) Git
and paste the URL..Rmd
file and replace “Your Name” with your name.The following paragraph and figure directly come from the website of Centers for Disease Control and Prevention (CDC):
COVID-19 Community Levels are a new tool to help communities decide what prevention steps to take based on the latest data. Levels can be low, medium, or high and are determined by looking at hospital beds being used, hospital admissions, and the total number of new COVID-19 cases in an area.
We will investigate how Covid-19 community levels have changed across US counties since the first data release on Feb 24, 2022. The data are weekly updated, and we use data slightly modified from the version updated on May 5, 2022. Click here if you are interested in raw data.
We first load relevant packages:
library(tidyverse)
library(sf)
We load Covid-19 community level data in a csv file. This is our typical “tidy” data frame.
covid <- read_csv("data/covid.csv")
covid
## # A tibble: 35,464 × 12
## county county_fips state county_populati… health_service_… health_service_…
## <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 America… 60000 Amer… 47392 901 American Samoa
## 2 Guam 66000 Guam 168489 902 Guam
## 3 Commonw… 69000 Comm… 51851 903 Commonwealth of…
## 4 United … 78000 Unit… 106290 905 United States V…
## 5 America… 60000 Amer… 47392 901 American Samoa
## 6 Guam 66000 Guam 168489 902 Guam
## 7 Commonw… 69000 Comm… 51851 903 Commonwealth of…
## 8 United … 78000 Unit… 106290 905 United States V…
## 9 America… 60000 Amer… 47392 901 American Samoa
## 10 Guam 66000 Guam 168489 902 Guam
## # … with 35,454 more rows, and 6 more variables:
## # health_service_area_population <dbl>,
## # covid_inpatient_bed_utilization <dbl>,
## # covid_hospital_admissions_per_100k <dbl>, covid_cases_per_100k <dbl>,
## # covid_community_level <chr>, date <date>
We use filter
and select
to focus on the contiguous US and the following variables:
variable name | description |
---|---|
state |
State name |
county |
County name |
county_fips |
Federal Information Processing Standards (FIPS) five character county code |
county_population |
County population (2019 Census estimate) |
covid_community_level |
Covid-19 community level |
date |
Date of data release |
covid <- covid %>%
filter(!(state %in% c("United States Virgin Islands",
"Commonwealth of the Northern Mariana Islands",
"American Samoa",
"Puerto Rico",
"Alaska",
"Hawaii",
"Guam"))) %>%
select(state, county, county_fips, county_population,
covid_community_level, date)
In order to draw maps based on Covid-19 community level, we need another data set that has information on geometry of US counties. We read the shapefile us_counties.shp
using st_read()
. Its original shapefile (.shp) is downloadable from the US Census Bureau under “cb_2018_us_county_20m.zip” (resolution level 1:20,000,000).
The loaded data frame has the following variables:
variable name | description |
---|---|
statefips |
State FIPS |
countyfips |
County FIPS |
county |
County name |
geometry |
geometry features |
This is an sf
object.
us_counties <- st_read("data/us_counties.shp", quiet = TRUE)
us_counties
## Simple feature collection with 3108 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -124.7258 ymin: 24.49813 xmax: -66.9499 ymax: 49.38436
## Geodetic CRS: NAD83
## First 10 features:
## statefips countyfips county geometry
## 1 37 017 Bladen MULTIPOLYGON (((-78.902 34....
## 2 37 167 Stanly MULTIPOLYGON (((-80.49737 3...
## 3 39 153 Summit MULTIPOLYGON (((-81.68699 4...
## 4 42 113 Sullivan MULTIPOLYGON (((-76.81373 4...
## 5 48 459 Upshur MULTIPOLYGON (((-95.15274 3...
## 6 48 049 Brown MULTIPOLYGON (((-99.19587 3...
## 7 45 021 Cherokee MULTIPOLYGON (((-81.87441 3...
## 8 01 043 Cullman MULTIPOLYGON (((-87.11199 3...
## 9 54 023 Grant MULTIPOLYGON (((-79.48687 3...
## 10 46 055 Haakon MULTIPOLYGON (((-102.0011 4...
Q - What differences do you observe when comparing a typical tidy data frame to the new simple feature object?
sf
and dplyr
The sf
package plays nicely with our earlier data wrangling functions from dplyr
.
filter()
Compare county FIPS from two data sets by filtering for Durham county in NC. The state FIPS for NC is 37. Check variable names, types, and value formats before filtering.
us_counties %>%
filter(____, ____)
covid %>%
filter(____, ____)
Q - How are they different?
mutate()
In us_counties
data, we will create a new variable county_fips
that exactly matches county_fips
from covid
data. Here we use paste0()
to concatenate strings.
us_counties <- us_counties %>%
mutate(county_fips = paste0(statefips, countyfips))
select()
Let’s check if county_fips
is well created as we intended by selecting statefips
, countyfips
, and county_fips
from us_counties
.
us_counties %>%
select(statefips, countyfips, county_fips)
## Simple feature collection with 3108 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -124.7258 ymin: 24.49813 xmax: -66.9499 ymax: 49.38436
## Geodetic CRS: NAD83
## First 10 features:
## statefips countyfips county_fips geometry
## 1 37 017 37017 MULTIPOLYGON (((-78.902 34....
## 2 37 167 37167 MULTIPOLYGON (((-80.49737 3...
## 3 39 153 39153 MULTIPOLYGON (((-81.68699 4...
## 4 42 113 42113 MULTIPOLYGON (((-76.81373 4...
## 5 48 459 48459 MULTIPOLYGON (((-95.15274 3...
## 6 48 049 48049 MULTIPOLYGON (((-99.19587 3...
## 7 45 021 45021 MULTIPOLYGON (((-81.87441 3...
## 8 01 043 01043 MULTIPOLYGON (((-87.11199 3...
## 9 54 023 54023 MULTIPOLYGON (((-79.48687 3...
## 10 46 055 46055 MULTIPOLYGON (((-102.0011 4...
Notice that geometries are “sticky”. They are kept until deliberately dropped using st_drop_geometry
. Manipulating spatial data objects is similar, but not identical to manipulating data frames.
us_counties %>%
select(county, county_fips) %>%
st_drop_geometry() %>%
slice(1:5)
## county county_fips
## 1 Bladen 37017
## 2 Stanly 37167
## 3 Summit 39153
## 4 Sullivan 42113
## 5 Upshur 48459
something_join()
Create a new sf
object covid_sf
by joining covid
and us_counties
keeping all rows in covid
. Note that two variable names are common, namely, county_fips
and county
, in both data frames.
glimpse(covid)
## Rows: 34,188
## Columns: 6
## $ state <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Ala…
## $ county <chr> "Geneva County", "Greene County", "Hale County",…
## $ county_fips <chr> "01061", "01063", "01065", "01067", "01069", "01…
## $ county_population <dbl> 26271, 8111, 14651, 17205, 105882, 51626, 658573…
## $ covid_community_level <chr> "High", "Medium", "Medium", "High", "High", "Hig…
## $ date <date> 2022-02-24, 2022-02-24, 2022-02-24, 2022-02-24,…
glimpse(us_counties)
## Rows: 3,108
## Columns: 5
## $ statefips <chr> "37", "37", "39", "42", "48", "48", "45", "01", "54", "46"…
## $ countyfips <chr> "017", "167", "153", "113", "459", "049", "021", "043", "0…
## $ county <chr> "Bladen", "Stanly", "Summit", "Sullivan", "Upshur", "Brown…
## $ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((-78.902 34...., MULTIPOLYGON …
## $ county_fips <chr> "37017", "37167", "39153", "42113", "48459", "48049", "450…
Q - Which variable should we use to join the data frames by? Can we use any? Why or why not?
covid_sf <- covid %>%
left_join(us_counties, by = "county_fips") %>% # tbl_df/ tbl/ data.frame, no sf!
st_as_sf()
covid_sf
## Simple feature collection with 34188 features and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -124.7258 ymin: 24.49813 xmax: -66.9499 ymax: 49.38436
## Geodetic CRS: NAD83
## # A tibble: 34,188 × 10
## state county.x county_fips county_populati… covid_community… date
## <chr> <chr> <chr> <dbl> <chr> <date>
## 1 Alabama Geneva Coun… 01061 26271 High 2022-02-24
## 2 Alabama Greene Coun… 01063 8111 Medium 2022-02-24
## 3 Alabama Hale County 01065 14651 Medium 2022-02-24
## 4 Alabama Henry County 01067 17205 High 2022-02-24
## 5 Alabama Houston Cou… 01069 105882 High 2022-02-24
## 6 Alabama Jackson Cou… 01071 51626 High 2022-02-24
## 7 Alabama Jefferson C… 01073 658573 High 2022-02-24
## 8 Alabama Lamar County 01075 13805 Medium 2022-02-24
## 9 Alabama Lauderdale … 01077 92729 Medium 2022-02-24
## 10 Alabama Lawrence Co… 01079 32924 Low 2022-02-24
## # … with 34,178 more rows, and 4 more variables: statefips <chr>,
## # countyfips <chr>, county.y <chr>, geometry <MULTIPOLYGON [°]>
Q - Check what happens to the common variable that is not used in joining.
group_by()
, summarize()
Using covid
, let’s compute the total population living in counties with each of the Covid-19 community levels based on the latest data (date == "2022-05-05"
). Complete the code chunk below.
covid %>%
filter(date == "2022-05-05") %>%
sf
and ggplot
We will examine how the Covid-19 community levels change over space and over time. As usual, we can build up a visualization layer-by-layer beginning with ggplot
. Let’s start by making a basic plot of all US counties.
covid_sf %>%
ggplot() +
geom_sf() +
labs(title = "US counties")
Let’s try NC.
covid_sf %>%
filter(state == "_____") %>%
ggplot() +
geom_xxxx() +
labs(title = "NC counties")
Now adjust the theme with theme_bw()
.
covid_sf %>%
filter(state == "_____") %>%
ggplot() +
geom_xxxx() +
labs(title = "NC counties") +
theme_bw() # white canvas
If you liked the theme, we can globally fix it as a default theme instead of attaching that layer for every ggplot.
theme_set(theme_bw())
We now fill each county by the latest value of Covid-19 community level (date == "2022-05-05"
). To match colors used in CDC, we use scale_fill_manual()
.
covid_sf %>%
______(date == "2022-05-05") %>%
ggplot() +
geom_sf(aes(fill = _______)) +
labs(title = "Covid-19 community level across US counties as of May 5, 2022",
fill = "Covid-19 community level") +
scale_fill_manual(values = c("High" = "#fb8b5b",
"Medium" = "#f9f99d",
"Low" = "#00cc99"))
Now adjust color
in geom_sf
to change the color of the county borders.
covid_sf %>%
filter(date == "2022-05-05") %>%
ggplot() +
geom_sf(aes(fill = covid_community_level), # mapping
_________) + # setting
labs(title = "Covid-19 community level across US counties as of May 5, 2022",
fill = "Covid-19 community level") +
scale_fill_manual(values = c("High" = "#fb8b5b",
"Medium" = "#f9f99d",
"Low" = "#00cc99"))
Adjust the width of the county borders using size
.
covid_sf %>%
filter(date == "2022-05-05") %>%
ggplot() +
geom_sf(aes(fill = covid_community_level), # mapping
_________) + # setting
labs(title = "Covid-19 community level across US counties as of May 5, 2022",
fill = "Covid-19 community level") +
scale_fill_manual(values = c("High" = "#fb8b5b",
"Medium" = "#f9f99d",
"Low" = "#00cc99"))
Finally, adjust the transparency using alpha
by mapping
covid_sf %>%
filter(date == "2022-05-05") %>%
ggplot() +
geom_sf(aes(fill = covid_community_level), # mapping
_________) + # setting
labs(title = "Covid-19 community level across US counties as of May 5, 2022",
fill = "Covid-19 community level") +
scale_fill_manual(values = c("High" = "#fb8b5b",
"Medium" = "#f9f99d",
"Low" = "#00cc99"))
Q - What is your final choice of color
, size
, and alpha
?
# code here
Q - Create the same plot for NC counties only. Different choices of color
, size
, and alpha
might be more appealing in this case.
# code here
Q - What is the county that has a different community level than the rest of the counties in NC?
# code here
We can visualize temporal trends of Covid-19 community levels by faceting.
Q - Create a faceted plot of US counties over all available dates where each county is filled with Covid-19 community level. Briefly explain the spatial and temporal trends.
# code here
Q - Repeat the previous exercise for states of your choosing.
# code here
# code here