SSH
URL.File
\(\rightarrow\) New Project
\(\rightarrow\) Version Control
\(\rightarrow\) Git
and paste the URL..Rmd
file and replace “Your Name” with your name.library(tidyverse)
# lab scores
Lab <- tibble(id = c(2, 3, 4, 6),
lab1 = c(100, 100, 100, 100),
lab2 = c(80, 20, 60, 100))
# ae scores
AE <- tibble(id = c(1, 2, 3, 5),
ae1 = c(100, 80, 50, 100),
ae2 = c(90, 100, 40, 95),
ae3 = c(70, 80, 60, 100),
ae4 = c(90, 70, 30, 90))
Lab
## # A tibble: 4 × 3
## id lab1 lab2
## <dbl> <dbl> <dbl>
## 1 2 100 80
## 2 3 100 20
## 3 4 100 60
## 4 6 100 100
AE
## # A tibble: 4 × 5
## id ae1 ae2 ae3 ae4
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 100 90 70 90
## 2 2 80 100 80 70
## 3 3 50 40 60 30
## 4 5 100 95 100 90
We will demonstrate each of the joins on these small, toy datasets.
Print all the scores on the labs and AEs for all six students. NA
if the assignment is not turned in.
full_join(Lab, AE)
## Joining, by = "id"
## # A tibble: 6 × 7
## id lab1 lab2 ae1 ae2 ae3 ae4
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 100 80 80 100 80 70
## 2 3 100 20 50 40 60 30
## 3 4 100 60 NA NA NA NA
## 4 6 100 100 NA NA NA NA
## 5 1 NA NA 100 90 70 90
## 6 5 NA NA 100 95 100 90
Who has completed all the labs and AEs and what are their scores on the completed assignments?
inner_join(Lab, AE)
## Joining, by = "id"
## # A tibble: 2 × 7
## id lab1 lab2 ae1 ae2 ae3 ae4
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 100 80 80 100 80 70
## 2 3 100 20 50 40 60 30
What are the scores of labs and AEs for those who have completed lab assignments?
left_join(Lab, AE)
## Joining, by = "id"
## # A tibble: 4 × 7
## id lab1 lab2 ae1 ae2 ae3 ae4
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 100 80 80 100 80 70
## 2 3 100 20 50 40 60 30
## 3 4 100 60 NA NA NA NA
## 4 6 100 100 NA NA NA NA
This time, I want the scores for those who have completed AEs.
right_join(Lab, AE)
## Joining, by = "id"
## # A tibble: 4 × 7
## id lab1 lab2 ae1 ae2 ae3 ae4
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 100 80 80 100 80 70
## 2 3 100 20 50 40 60 30
## 3 1 NA NA 100 90 70 90
## 4 5 NA NA 100 95 100 90
What are the lab scores for those who have completed both AEs and labs?
semi_join(Lab, AE)
## Joining, by = "id"
## # A tibble: 2 × 3
## id lab1 lab2
## <dbl> <dbl> <dbl>
## 1 2 100 80
## 2 3 100 20
What are the lab scores for those who have completed labs but not AEs?
anti_join(Lab, AE)
## Joining, by = "id"
## # A tibble: 2 × 3
## id lab1 lab2
## <dbl> <dbl> <dbl>
## 1 4 100 60
## 2 6 100 100
How do the join functions above know to join Lab
and AE
by id
? Examine the names to find out.
names(Lab)
## [1] "id" "lab1" "lab2"
names(AE)
## [1] "id" "ae1" "ae2" "ae3" "ae4"
Also, try above functions by specifying the join variable. Do you see any differences?
full_join(Lab, AE)
## Joining, by = "id"
## # A tibble: 6 × 7
## id lab1 lab2 ae1 ae2 ae3 ae4
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 100 80 80 100 80 70
## 2 3 100 20 50 40 60 30
## 3 4 100 60 NA NA NA NA
## 4 6 100 100 NA NA NA NA
## 5 1 NA NA 100 90 70 90
## 6 5 NA NA 100 95 100 90
full_join(Lab, AE, by = "id")
## # A tibble: 6 × 7
## id lab1 lab2 ae1 ae2 ae3 ae4
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 100 80 80 100 80 70
## 2 3 100 20 50 40 60 30
## 3 4 100 60 NA NA NA NA
## 4 6 100 100 NA NA NA NA
## 5 1 NA NA 100 90 70 90
## 6 5 NA NA 100 95 100 90
We will again work with RDU data stored in rduall_2021.RDS
. This can be downloaded from the anyflights
package.
library(anyflights)
rdu <- readRDS("data/rduall_2021.RDS")
airports <- rdu$airports
flights <- rdu$flights
The R
object rdu
is a list where each element is a data frame. We will focus on the following data sets:
flights
: Data on all flights that departed from RDU in 2021airports
: Data on airports at which those flights arrivedairlines
: Data to translate between two letter carrier codes and namesplanes
: Construction information about each planeflights <- rdu$flights
airports <- rdu$airports
airlines <- rdu$airlines
planes <- rdu$planes
Q - How are these datasets related? Examine their variable names. You may find helpful to read help pages of the datasets:
?get_flights
?get_airports
?get_airlines
?get_planes
# code here
Q - Last time, it was not straightforward to understand Airlines by their carrier code. Let’s append full airline names in flights
.
flights %>%
xxxx_join(____, by = "____") %>%
rename(carrier_name = name) %>% # new = old
select(year:day, contains("carrier"))
Q - Suppose you wanted to make a map of the route of every flight from RDU. What variables would you need from which datasets?
Let’s add latitude and longitude information of the destination airports in flights
. Note these flights
and airports
have no variables in common so we will have to specify the variable to join by using by =
. Check out the documentation for more information.
flights %>%
left_join(______,
by = c("____" = "faa"))
dest_delays
with the median arrival delay in 2021 for each destination. Note this question does not require you to use joins.# code here
dest_delays
and airports
, only including observations that have both delay and airport information. Note dest_delays
and airports
have no variables in common so you will need to specify the variables to join using by
as in the example above.# code here
dest_delays
included in the new dataset you created by joining dest_delays
and airports
? Use an appropriate join function to investigate this issue and determine what is going on here.# code here
tailnum
variable in the flights
dataset. The year the plane was manufactured is given in the year
variable in the planes
dataset.plane_delays
.# code here
plane_delays
to the planes
data using an appropriate join and then use mutate
to create an age variable.# code here
# code here