AE 06: Data Wrangling 2

Getting Started

Part 1: Demonstrations

Part 2: RDU flights

Practice

Submitting Application Exercises

Additional Resources

Getting Started

Go to the course GitHub organization page and find the repository entitled “ae06-GitHubUsername”.
Click the green “code” button and copy the SSH URL.
Go to RStudio, select File New Project Version Control Git and paste the URL.
Open the .Rmd file and replace “Your Name” with your name.

Part 1: Demonstrations

library(tidyverse)

# lab scores
Lab <- tibble(id = c(2, 3, 4, 6),
              lab1 = c(100, 100, 100, 100), 
              lab2 = c(80, 20, 60, 100))
# ae scores
AE <- tibble(id = c(1, 2, 3, 5),
             ae1 = c(100, 80, 50, 100), 
             ae2 = c(90, 100, 40, 95),  
             ae3 = c(70, 80, 60, 100), 
             ae4 = c(90, 70, 30, 90))
            
Lab

## # A tibble: 4 × 3
##      id  lab1  lab2
##   <dbl> <dbl> <dbl>
## 1     2   100    80
## 2     3   100    20
## 3     4   100    60
## 4     6   100   100

AE

## # A tibble: 4 × 5
##      id   ae1   ae2   ae3   ae4
##   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1     1   100    90    70    90
## 2     2    80   100    80    70
## 3     3    50    40    60    30
## 4     5   100    95   100    90

We will demonstrate each of the joins on these small, toy datasets.

Print all the scores on the labs and AEs for all six students. NA if the assignment is not turned in.

full_join(Lab, AE)

## Joining, by = "id"

## # A tibble: 6 × 7
##      id  lab1  lab2   ae1   ae2   ae3   ae4
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1     2   100    80    80   100    80    70
## 2     3   100    20    50    40    60    30
## 3     4   100    60    NA    NA    NA    NA
## 4     6   100   100    NA    NA    NA    NA
## 5     1    NA    NA   100    90    70    90
## 6     5    NA    NA   100    95   100    90

Who has completed all the labs and AEs and what are their scores on the completed assignments?

inner_join(Lab, AE)

## Joining, by = "id"

## # A tibble: 2 × 7
##      id  lab1  lab2   ae1   ae2   ae3   ae4
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1     2   100    80    80   100    80    70
## 2     3   100    20    50    40    60    30

What are the scores of labs and AEs for those who have completed lab assignments?

left_join(Lab, AE)

## Joining, by = "id"

## # A tibble: 4 × 7
##      id  lab1  lab2   ae1   ae2   ae3   ae4
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1     2   100    80    80   100    80    70
## 2     3   100    20    50    40    60    30
## 3     4   100    60    NA    NA    NA    NA
## 4     6   100   100    NA    NA    NA    NA

This time, I want the scores for those who have completed AEs.

right_join(Lab, AE)

## Joining, by = "id"

## # A tibble: 4 × 7
##      id  lab1  lab2   ae1   ae2   ae3   ae4
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1     2   100    80    80   100    80    70
## 2     3   100    20    50    40    60    30
## 3     1    NA    NA   100    90    70    90
## 4     5    NA    NA   100    95   100    90

What are the lab scores for those who have completed both AEs and labs?

semi_join(Lab, AE)

## Joining, by = "id"

## # A tibble: 2 × 3
##      id  lab1  lab2
##   <dbl> <dbl> <dbl>
## 1     2   100    80
## 2     3   100    20

What are the lab scores for those who have completed labs but not AEs?

anti_join(Lab, AE)

## Joining, by = "id"

## # A tibble: 2 × 3
##      id  lab1  lab2
##   <dbl> <dbl> <dbl>
## 1     4   100    60
## 2     6   100   100

How do the join functions above know to join Lab and AE by id? Examine the names to find out.

names(Lab)

## [1] "id"   "lab1" "lab2"

names(AE)

## [1] "id"  "ae1" "ae2" "ae3" "ae4"

Also, try above functions by specifying the join variable. Do you see any differences?

full_join(Lab, AE)

## Joining, by = "id"

## # A tibble: 6 × 7
##      id  lab1  lab2   ae1   ae2   ae3   ae4
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1     2   100    80    80   100    80    70
## 2     3   100    20    50    40    60    30
## 3     4   100    60    NA    NA    NA    NA
## 4     6   100   100    NA    NA    NA    NA
## 5     1    NA    NA   100    90    70    90
## 6     5    NA    NA   100    95   100    90

full_join(Lab, AE, by = "id")

## # A tibble: 6 × 7
##      id  lab1  lab2   ae1   ae2   ae3   ae4
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1     2   100    80    80   100    80    70
## 2     3   100    20    50    40    60    30
## 3     4   100    60    NA    NA    NA    NA
## 4     6   100   100    NA    NA    NA    NA
## 5     1    NA    NA   100    90    70    90
## 6     5    NA    NA   100    95   100    90

Part 2: RDU flights

We will again work with RDU data stored in rduall_2021.RDS. This can be downloaded from the anyflights package.

library(anyflights)

rdu <- readRDS("data/rduall_2021.RDS")
airports <- rdu$airports
flights <- rdu$flights

The R object rdu is a list where each element is a data frame. We will focus on the following data sets:

flights: Data on all flights that departed from RDU in 2021
airports: Data on airports at which those flights arrived
airlines: Data to translate between two letter carrier codes and names
planes: Construction information about each plane

flights <- rdu$flights
airports <- rdu$airports
airlines <- rdu$airlines
planes <- rdu$planes

Q - How are these datasets related? Examine their variable names. You may find helpful to read help pages of the datasets:

?get_flights
?get_airports
?get_airlines
?get_planes

# code here

Q - Last time, it was not straightforward to understand Airlines by their carrier code. Let’s append full airline names in flights.

flights %>% 
  xxxx_join(____, by = "____") %>%
  rename(carrier_name = name) %>% # new = old
  select(year:day, contains("carrier"))

Q - Suppose you wanted to make a map of the route of every flight from RDU. What variables would you need from which datasets?

Let’s add latitude and longitude information of the destination airports in flights. Note these flights and airports have no variables in common so we will have to specify the variable to join by using by =. Check out the documentation for more information.

flights %>% 
  left_join(______, 
            by = c("____" = "faa"))

Practice

Create a new dataset dest_delays with the median arrival delay in 2021 for each destination. Note this question does not require you to use joins.

# code here

Q - How many different destinations could you visit directly from RDU in 2021?

Join dest_delays and airports, only including observations that have both delay and airport information. Note dest_delays and airports have no variables in common so you will need to specify the variables to join using by as in the example above.

# code here

Q - Are all of the observations in dest_delays included in the new dataset you created by joining dest_delays and airports? Use an appropriate join function to investigate this issue and determine what is going on here.

# code here

Is there a relationship between the age of a plane and its delays? The plane tail number is given in the tailnum variable in the flights dataset. The year the plane was manufactured is given in the year variable in the planes dataset.

Step #1: Start by finding the average arrival delay for each plane and store the resulting dataset in plane_delays.

# code here

Step #2: Join plane_delays to the planes data using an appropriate join and then use mutate to create an age variable.

# code here

Step #3: Finally, create an effective visualization of the data.

# code here

Submitting Application Exercises

Once you have completed the activity, push your final changes to your GitHub repo.
Make sure you committed at least three times.
Check that your repo is updated on GitHub, and that’s all you need to do to submit application exercises for participation.

Additional Resources

https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf