SSH
URL.File
\(\rightarrow\) New Project
\(\rightarrow\) Version Control
\(\rightarrow\) Git
and paste the URL..Rmd
file and replace “Your Name” with your name.We start with loading relevant packages. For the very first time only, uncomment the line install.packages("anyflights")
and install the package anyflights
.
library(tidyverse)
# install.packages("anyflights")
library(anyflights)
To demonstrate data wrangling we will use a dataset of characteristics of all flights departing from RDU (Raleigh–Durham International Airport) in 2021. We could download the data directly from the package anyflights
with the commented line of the following code chunk; but to save time, I did it for you, and all you need to do is to read the RDS file using readRDS()
.
# flights <- get_flights("RDU", 2021)
flights <- readRDS("data/rduflights_2021.RDS")
We first explore the data a bit. Examine the documentation as well.
glimpse(flights)
## Rows: 44,136
## Columns: 19
## $ year <int> 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2…
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time <int> 552, 553, 557, 648, 655, 701, 701, 715, 722, 724, 730, …
## $ sched_dep_time <int> 600, 600, 600, 654, 700, 705, 705, 700, 730, 730, 735, …
## $ dep_delay <dbl> -8, -7, -3, -6, -5, -4, -4, 15, -8, -6, -5, -6, -1, -8,…
## $ arr_time <int> 725, 704, 722, 749, 850, 847, 826, 935, 922, 844, 826, …
## $ sched_arr_time <int> 758, 722, 727, 807, 915, 912, 837, 926, 945, 909, 840, …
## $ arr_delay <dbl> -33, -18, -5, -18, -25, -25, -11, 9, -23, -25, -14, -24…
## $ carrier <chr> "YX", "AA", "DL", "AA", "AA", "9E", "DL", "AA", "9E", "…
## $ flight <int> 203, 33, 63, 21, 31, 152, 59, 25, 149, 193, 117, 16, 15…
## $ tailnum <chr> "N815MD", "N9021H", "N823DN", "N109UW", "N929NN", "N907…
## $ origin <chr> "RDU", "RDU", "RDU", "RDU", "RDU", "RDU", "RDU", "RDU",…
## $ dest <chr> "DTW", "ORD", "ATL", "CLT", "MIA", "MSP", "ATL", "DFW",…
## $ air_time <dbl> 76, 110, 66, 41, 100, 146, 64, 176, 105, 61, 102, 245, …
## $ distance <dbl> 501, 646, 356, 130, 700, 980, 356, 1061, 680, 431, 632,…
## $ hour <dbl> 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 9…
## $ minute <dbl> 0, 0, 0, 54, 0, 5, 5, 0, 30, 30, 35, 40, 40, 0, 15, 30,…
## $ time_hour <dttm> 2021-01-01 06:00:00, 2021-01-01 06:00:00, 2021-01-01 0…
names(flights) # to get column names
## [1] "year" "month" "day" "dep_time"
## [5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time"
## [9] "arr_delay" "carrier" "flight" "tailnum"
## [13] "origin" "dest" "air_time" "distance"
## [17] "hour" "minute" "time_hour"
?get_flights # to examine data dictionary
head(flights)
## # A tibble: 6 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2021 1 1 552 600 -8 725 758
## 2 2021 1 1 553 600 -7 704 722
## 3 2021 1 1 557 600 -3 722 727
## 4 2021 1 1 648 654 -6 749 807
## 5 2021 1 1 655 700 -5 850 915
## 6 2021 1 1 701 705 -4 847 912
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
tail(flights)
## # A tibble: 6 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2021 12 31 NA 1445 NA NA 1624
## 2 2021 12 31 NA 710 NA NA 1032
## 3 2021 12 31 NA 720 NA NA 903
## 4 2021 12 31 NA 530 NA NA 725
## 5 2021 12 31 NA 1434 NA NA 1550
## 6 2021 12 31 NA 2059 NA NA 2250
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
The head()
/ tail()
function returns “A tibble: 6 x 19” and then the first / last six rows of the flights
data.
A tibble is a tweaked, opinionated version of the R
data frame. The two main differences between a data frame vs a tibble are printing and subsetting.
You can specify the number of rows shown with the second argument in the functions, e.g. tail(flights, 2)
.
Q - Can you print the first three rows of the flights
data?
# code here
Often we need to wrangle our data to extract meaning. This includes calculating new variables, summary statistics, grouping by variables, renaming, reordering, selecting subsets of data, filtering by various conditions, etc.
We can accomplish a great deal of wrangling by learning some key verbs and a pipe operator. Each of the following functions takes a data frame as input and returns a data frame as output.
select()
The select()
function picks off one or more columns by name.
Let’s say we want a dataset that only contains the variables month
and dest
.
select(flights, month, dest)
## # A tibble: 44,136 × 2
## month dest
## <int> <chr>
## 1 1 DTW
## 2 1 ORD
## 3 1 ATL
## 4 1 CLT
## 5 1 MIA
## 6 1 MSP
## 7 1 ATL
## 8 1 DFW
## 9 1 FLL
## 10 1 LGA
## # … with 44,126 more rows
We can also use select()
to exclude variables. Let’s exclude year
but keep all the other variables because we know the year is 2021.
select(flights, -year)
## # A tibble: 44,136 × 18
## month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <dbl> <int> <int>
## 1 1 1 552 600 -8 725 758
## 2 1 1 553 600 -7 704 722
## 3 1 1 557 600 -3 722 727
## 4 1 1 648 654 -6 749 807
## 5 1 1 655 700 -5 850 915
## 6 1 1 701 705 -4 847 912
## 7 1 1 701 705 -4 826 837
## 8 1 1 715 700 15 935 926
## 9 1 1 722 730 -8 922 945
## 10 1 1 724 730 -6 844 909
## # … with 44,126 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
We can also use select()
with certain characteristics. Here, we select all the variables whose names contain “time”.
select(flights, contains("time"))
## # A tibble: 44,136 × 6
## dep_time sched_dep_time arr_time sched_arr_time air_time time_hour
## <int> <int> <int> <int> <dbl> <dttm>
## 1 552 600 725 758 76 2021-01-01 06:00:00
## 2 553 600 704 722 110 2021-01-01 06:00:00
## 3 557 600 722 727 66 2021-01-01 06:00:00
## 4 648 654 749 807 41 2021-01-01 06:00:00
## 5 655 700 850 915 100 2021-01-01 07:00:00
## 6 701 705 847 912 146 2021-01-01 07:00:00
## 7 701 705 826 837 64 2021-01-01 07:00:00
## 8 715 700 935 926 176 2021-01-01 07:00:00
## 9 722 730 922 945 105 2021-01-01 07:00:00
## 10 724 730 844 909 61 2021-01-01 07:00:00
## # … with 44,126 more rows
Try ?contains
for more info about other select helpers.
arrange()
The arrange()
function orders rows (observations) in ascending / descending order.
Let’s arrange the data by descending departure delays, with large departure delays on top.
arrange(flights, desc(dep_delay))
## # A tibble: 44,136 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2021 5 11 838 1120 2718 1041 1326
## 2 2021 8 7 637 600 1477 NA 819
## 3 2021 5 29 517 525 1432 709 631
## 4 2021 10 28 655 915 1300 806 1038
## 5 2021 7 4 751 1156 1195 945 1354
## 6 2021 10 5 759 1252 1147 908 1406
## 7 2021 9 23 737 1325 1092 911 1505
## 8 2021 8 14 1016 1630 1066 1152 1827
## 9 2021 7 6 1258 1915 1063 1447 2109
## 10 2021 8 12 1113 1730 1063 1225 1847
## # … with 44,126 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Or with low departure delays on top.
arrange(flights, dep_delay)
## # A tibble: 44,136 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2021 3 8 2037 2109 -32 2214 2302
## 2 2021 3 12 2003 2035 -32 2203 2215
## 3 2021 3 15 2005 2035 -30 2135 2215
## 4 2021 2 12 2120 2147 -27 2326 2336
## 5 2021 11 17 654 720 -26 849 846
## 6 2021 3 19 2010 2035 -25 2139 2215
## 7 2021 6 5 2155 2220 -25 2310 2355
## 8 2021 3 1 2059 2123 -24 2243 2312
## 9 2021 7 7 946 1010 -24 1138 1200
## 10 2021 8 31 2021 2045 -24 2209 2240
## # … with 44,126 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
%>%
What if we only want to examine the dep_delay
, dest
, month
, and day
columns? We can combine arrange()
and select()
!
arrange(select(flights, month:day, dest, dep_delay), desc(dep_delay))
## # A tibble: 44,136 × 4
## month day dest dep_delay
## <int> <int> <chr> <dbl>
## 1 5 11 DFW 2718
## 2 8 7 LAX 1477
## 3 5 29 CLT 1432
## 4 10 28 ORD 1300
## 5 7 4 DFW 1195
## 6 10 5 CLT 1147
## 7 9 23 EWR 1092
## 8 8 14 IAH 1066
## 9 7 6 JFK 1063
## 10 8 12 ORD 1063
## # … with 44,126 more rows
It is not easy to understand what is going on in the code chunk above.
The pipe is a technique for passing information from one process to another. It can rewrite the code with many parentheses (like the one above!) into a more natural and easier-to-read structure (see the code chunk below!).
flights %>%
select(month:day, dest, dep_delay) %>%
arrange(desc(dep_delay))
## # A tibble: 44,136 × 4
## month day dest dep_delay
## <int> <int> <chr> <dbl>
## 1 5 11 DFW 2718
## 2 8 7 LAX 1477
## 3 5 29 CLT 1432
## 4 10 28 ORD 1300
## 5 7 4 DFW 1195
## 6 10 5 CLT 1147
## 7 9 23 EWR 1092
## 8 8 14 IAH 1066
## 9 7 6 JFK 1063
## 10 8 12 ORD 1063
## # … with 44,126 more rows
When reading code “in English”, say “then” whenever you see a pipe.
Q - How would you read the code chunk above in English? What is it accomplishing?
slice()
Slice subsets a data frame for certain row numbers.
Here we slice off the first 5 rows of the flights data.
flights %>%
slice(1:5)
## # A tibble: 5 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2021 1 1 552 600 -8 725 758
## 2 2021 1 1 553 600 -7 704 722
## 3 2021 1 1 557 600 -3 722 727
## 4 2021 1 1 648 654 -6 749 807
## 5 2021 1 1 655 700 -5 850 915
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
We can also slice the bottom 5 rows of the flights
data. Here n()
computes the number of rows in the data.
flights %>%
slice((n()-4):n())
## # A tibble: 5 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2021 12 31 NA 710 NA NA 1032
## 2 2021 12 31 NA 720 NA NA 903
## 3 2021 12 31 NA 530 NA NA 725
## 4 2021 12 31 NA 1434 NA NA 1550
## 5 2021 12 31 NA 2059 NA NA 2250
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Q - You can accomplish the same thing with one of the functions introduced above. What is it?
We can also slice every other row, e.g., 1st, 3rd, 5th, \(\cdots\).
flights %>%
slice(seq(1, n(), by=2))
## # A tibble: 22,068 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2021 1 1 552 600 -8 725 758
## 2 2021 1 1 557 600 -3 722 727
## 3 2021 1 1 655 700 -5 850 915
## 4 2021 1 1 701 705 -4 826 837
## 5 2021 1 1 722 730 -8 922 945
## 6 2021 1 1 730 735 -5 826 840
## 7 2021 1 1 739 740 -1 920 932
## 8 2021 1 1 811 815 -4 959 1025
## 9 2021 1 1 827 836 -9 933 952
## 10 2021 1 1 854 900 -6 1053 1116
## # … with 22,058 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Q - What is the code chunk below accomplishing? Guess before running the code.
flights %>%
select(month:day, dest, dep_delay) %>%
arrange(desc(dep_delay)) %>%
slice(1:5)
## # A tibble: 5 × 4
## month day dest dep_delay
## <int> <int> <chr> <dbl>
## 1 5 11 DFW 2718
## 2 8 7 LAX 1477
## 3 5 29 CLT 1432
## 4 10 28 ORD 1300
## 5 7 4 DFW 1195
Q - What will happen if you comment out the line containing arrange()
in the code chunk above? Try it.
filter()
filter()
selects rows satisfying certain conditions.
Around 8am on Dec 12, I was at RDU waiting for my Delta flight to Atlanta (ATL). I was supposed to take a direct flight from ATL to South Korea at around 11am. However, because of a plane maintenance delay I could not depart to ATL on time and had to reschedule my itinerary completely. Here we will find that incident.
We first select all flights that experienced departure delays to ATL.
flights %>%
filter(dep_delay > 0, dest == "ATL")
## # A tibble: 1,020 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2021 1 1 1256 1238 18 1427 1400
## 2 2021 1 1 1418 1410 8 1541 1536
## 3 2021 1 1 1546 1515 31 1705 1645
## 4 2021 1 1 1749 1745 4 1907 1913
## 5 2021 1 2 604 600 4 740 727
## 6 2021 1 2 840 830 10 1005 1000
## 7 2021 1 3 847 830 17 1011 1000
## 8 2021 1 3 1110 1105 5 1235 1235
## 9 2021 1 3 1247 1238 9 1415 1400
## 10 2021 1 3 1616 1615 1 1738 1745
## # … with 1,010 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
We add another condition for Delta Airlines (DL).
flights %>%
filter(dep_delay > 0, dest == "ATL",
carrier == "DL")
## # A tibble: 802 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2021 1 1 1256 1238 18 1427 1400
## 2 2021 1 1 1418 1410 8 1541 1536
## 3 2021 1 1 1749 1745 4 1907 1913
## 4 2021 1 2 604 600 4 740 727
## 5 2021 1 2 840 830 10 1005 1000
## 6 2021 1 3 847 830 17 1011 1000
## 7 2021 1 3 1247 1238 9 1415 1400
## 8 2021 1 3 2012 1759 133 2133 1931
## 9 2021 1 4 631 600 31 746 727
## 10 2021 1 4 740 705 35 906 837
## # … with 792 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
We further filter with information about date and time.
flights %>%
filter(dep_delay > 0, dest == "ATL",
carrier == "DL",
month == 12, day == 12, sched_arr_time < 1100)
## # A tibble: 1 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2021 12 12 1121 830 171 1251 1009
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
We found it!
Q - In plain English, what is the code below accomplishing?
flights %>%
filter(carrier %in% c("DL", "AA"), # American Airlines
arr_delay < 0 | dep_delay < 0)
## # A tibble: 12,322 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2021 1 1 553 600 -7 704 722
## 2 2021 1 1 557 600 -3 722 727
## 3 2021 1 1 648 654 -6 749 807
## 4 2021 1 1 655 700 -5 850 915
## 5 2021 1 1 701 705 -4 826 837
## 6 2021 1 1 734 740 -6 1005 1029
## 7 2021 1 1 827 836 -9 933 952
## 8 2021 1 1 828 830 -2 953 1000
## 9 2021 1 1 1545 1547 -2 1712 1710
## 10 2021 1 1 1749 1745 4 1907 1913
## # … with 12,312 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
The table of logical operators below will be helpful as you work with filtering.
operator | definition |
---|---|
< |
less than |
<= |
less than or equal to |
> |
greater than |
>= |
greater than or equal to |
== |
exactly equal to |
!= |
not equal to |
is.na(x) |
is x NA? |
!is.na(x) |
is x not NA? |
x %in% y |
is x in y? |
!(x %in% y) |
is x not in y? |
x & y |
x AND y |
x | y |
x OR y |
!x |
not x |
mutate()
mutate()
creates a new variable.
In the code chunk below, air_time
is converted to hours, and a new variable mph
is created, corresponding to the miles per hour of the flight.
flights %>%
mutate(hours = air_time / 60,
mph = distance / hours) %>%
select(air_time, distance, hours, mph)
## # A tibble: 44,136 × 4
## air_time distance hours mph
## <dbl> <dbl> <dbl> <dbl>
## 1 76 501 1.27 396.
## 2 110 646 1.83 352.
## 3 66 356 1.1 324.
## 4 41 130 0.683 190.
## 5 100 700 1.67 420
## 6 146 980 2.43 403.
## 7 64 356 1.07 334.
## 8 176 1061 2.93 362.
## 9 105 680 1.75 389.
## 10 61 431 1.02 424.
## # … with 44,126 more rows
Note that the above code chunk creates hour
and mph
available only locally within the code chunk because dplyr
functions do not modify the input data frame. If you want to use these newly created variables for future code chunks, you may create another data frame with two additional variables or overwrite flights
.
# option1: create a new data frame
flights2 <- flights %>%
mutate(hours = air_time / 60,
mph = distance / hours)
# option2: overwrite the original data frame
flights <- flights %>%
mutate(hours = air_time / 60,
mph = distance / hours)
Q - Using <=
in the mutate, create a new variable on_time
that is TRUE if the flight arrived on time and FALSE if it did not.
Note: Once you have modified the code, remove the option eval = FALSE
from the code chunk header and knit to see the updates.
flights %>%
mutate(on_time = ____) %>%
select(arr_delay, on_time)
Q - What do you think will happen if you take the mean of the on_time
variable?
summarize()
summarize
calculates summary statistics. It collapses rows into summary statistics and removes columns irrelevant to the calculation.
Be sure to name your summary statistics!
# option3: create a variable every time I use it
flights %>%
mutate(on_time = (arr_delay <= 0)) %>%
summarize(prop_on_time = mean(on_time))
## # A tibble: 1 × 1
## prop_on_time
## <dbl>
## 1 NA
Huh? Shall we try another variable?
flights %>%
summarize(mean_dep_delay = mean(dep_delay))
## # A tibble: 1 × 1
## mean_dep_delay
## <dbl>
## 1 NA
Q - The code chunks above should return an NA. What is going wrong? Try to fix it to find the proportion of flights on time and the mean departure delay. Hint: Try summary(flights$dep_delay)
.
group_by()
group_by()
is used for grouped operations. It’s very powerful when paired with summarize()
to calculate summary statistics by group.
Here we find the proportion of flights that are on time by each carrier and print top 5 airlines in descending order of the proportions.
flights %>%
group_by(carrier) %>%
summarize(prop_on_time = mean((arr_delay <= 0), na.rm = TRUE)) %>%
arrange(desc(prop_on_time)) %>%
head(5)
## # A tibble: 5 × 2
## carrier prop_on_time
## <chr> <dbl>
## 1 9E 0.843
## 2 DL 0.795
## 3 AS 0.767
## 4 YX 0.766
## 5 AA 0.763
9E, AS, and YX represent Endeavor Airlines, Alaska Airlines, and Midwest Express Airlines, respectively.
We can calculate more than one summary statistic in summarize()
. Let’s find the maximum delay, median delay, and the count of flights in each month as well as the proportion on time for each month.
Here n()
calculates the current group size.
flights %>%
group_by(month) %>%
summarize(prop_on_time = mean((arr_delay <= 0), na.rm = TRUE),
max_delay = max(arr_delay, na.rm = TRUE),
median_delay = median(arr_delay, na.rm = TRUE),
count = n())
## # A tibble: 12 × 5
## month prop_on_time max_delay median_delay count
## <int> <dbl> <dbl> <dbl> <int>
## 1 1 0.811 792 -13 2239
## 2 2 0.794 897 -12 2222
## 3 3 0.845 806 -14 3040
## 4 4 0.831 769 -14 3252
## 5 5 0.786 2715 -11 3647
## 6 6 0.662 1009 -7 3951
## 7 7 0.662 1191 -7 4189
## 8 8 0.688 1058 -8 4197
## 9 9 0.792 1086 -13 4007
## 10 10 0.759 1288 -10 4367
## 11 11 0.708 761 -9 4565
## 12 12 0.702 806 -9 4460
Finally, let’s see what the proportion on time, the maximum delay, the median delay, and the count of flights are in each month for DL, AS, and AA over the summer peak season (June ~ August).
flights %>%
filter(between(month, 6, 8),
carrier %in% c("DL", "AS", "AA")) %>%
group_by(month, carrier) %>%
summarize(prop_on_time = mean((arr_delay <= 0), na.rm = TRUE),
max_delay = max(arr_delay, na.rm = TRUE),
median_delay = median(arr_delay, na.rm = TRUE),
count = n())
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.
## # A tibble: 9 × 6
## # Groups: month [3]
## month carrier prop_on_time max_delay median_delay count
## <int> <chr> <dbl> <dbl> <dbl> <int>
## 1 6 AA 0.689 866 -9 577
## 2 6 AS 0.730 102 -12 38
## 3 6 DL 0.780 1009 -10 780
## 4 7 AA 0.757 1191 -11 609
## 5 7 AS 0.857 274 -17 49
## 6 7 DL 0.775 402 -11 813
## 7 8 AA 0.737 945 -12 609
## 8 8 AS 0.816 20 -19 49
## 9 8 DL 0.764 859 -10 795
distinct()
distinct()
selects only unique rows.
Q - I am planning on a trip to Seattle this summer. I want to find which airlines have direct flights to Seattle-Tacoma Airport (SEA). Find unique combinations of carrier
and dest
and filter for dest
equal to “SEA”.
Note: Once you have modified the code, remove the option eval = FALSE
from the code chunk header and knit to see the updates.
flights %>%
distinct(____, ____) %>%
filter(____)
count()
count
counts the unique values of one or more variables. It creates frequency tables.
Here we count the number of flights to each destination to find popular destinations from RDU.
flights %>%
count(dest) %>%
arrange(desc(n))
## # A tibble: 47 × 2
## dest n
## <chr> <int>
## 1 ATL 4008
## 2 CLT 2884
## 3 LGA 2520
## 4 MCO 2332
## 5 BOS 2317
## 6 JFK 2124
## 7 ORD 2109
## 8 EWR 1823
## 9 MIA 1798
## 10 PHL 1646
## # … with 37 more rows
Q - What is the code chunk below doing?
flights %>%
count(dest) %>%
mutate(prop = n / sum(n))
## # A tibble: 47 × 3
## dest n prop
## <chr> <int> <dbl>
## 1 ATL 4008 0.0908
## 2 AUS 825 0.0187
## 3 BDL 54 0.00122
## 4 BNA 1080 0.0245
## 5 BOS 2317 0.0525
## 6 BUF 25 0.000566
## 7 BWI 1087 0.0246
## 8 CLE 52 0.00118
## 9 CLT 2884 0.0653
## 10 DAL 100 0.00227
## # … with 37 more rows
Create a new dataset that only contains flights that do not have a missing departure time. Include the columns year
, month
, day
, dep_time
, dep_delay
, and dep_delay_hours
(the departure delay in hours). Note you may need to use mutate()
to make one or more of these variables.
For each airline (uniquely identified by carrier
), use a group_by()
paired with summarize()
to find the sample size, mean, and standard deviation of flight distances. Then, pick off the top 5 and bottom 5 airlines in terms of mean distance traveled per flight.