AE 05: Data Wrangling 1

Getting Started

RDU Flights

Part 1: Data overview
Part 2: Wrangling

Practice

Submitting Application Exercises

Additional Resources

Getting Started

Go to the course GitHub organization page and find the repository entitled “ae05-GitHubUsername”.
Click the green “code” button and copy the SSH URL.
Go to RStudio, select File New Project Version Control Git and paste the URL.
Open the .Rmd file and replace “Your Name” with your name.

RDU Flights

We start with loading relevant packages. For the very first time only, uncomment the line install.packages("anyflights") and install the package anyflights.

library(tidyverse)
# install.packages("anyflights")
library(anyflights)

To demonstrate data wrangling we will use a dataset of characteristics of all flights departing from RDU (Raleigh–Durham International Airport) in 2021. We could download the data directly from the package anyflights with the commented line of the following code chunk; but to save time, I did it for you, and all you need to do is to read the RDS file using readRDS().

# flights <- get_flights("RDU", 2021) 
flights <- readRDS("data/rduflights_2021.RDS")

Part 1: Data overview

We first explore the data a bit. Examine the documentation as well.

glimpse(flights)

## Rows: 44,136
## Columns: 19
## $ year           <int> 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2…
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time       <int> 552, 553, 557, 648, 655, 701, 701, 715, 722, 724, 730, …
## $ sched_dep_time <int> 600, 600, 600, 654, 700, 705, 705, 700, 730, 730, 735, …
## $ dep_delay      <dbl> -8, -7, -3, -6, -5, -4, -4, 15, -8, -6, -5, -6, -1, -8,…
## $ arr_time       <int> 725, 704, 722, 749, 850, 847, 826, 935, 922, 844, 826, …
## $ sched_arr_time <int> 758, 722, 727, 807, 915, 912, 837, 926, 945, 909, 840, …
## $ arr_delay      <dbl> -33, -18, -5, -18, -25, -25, -11, 9, -23, -25, -14, -24…
## $ carrier        <chr> "YX", "AA", "DL", "AA", "AA", "9E", "DL", "AA", "9E", "…
## $ flight         <int> 203, 33, 63, 21, 31, 152, 59, 25, 149, 193, 117, 16, 15…
## $ tailnum        <chr> "N815MD", "N9021H", "N823DN", "N109UW", "N929NN", "N907…
## $ origin         <chr> "RDU", "RDU", "RDU", "RDU", "RDU", "RDU", "RDU", "RDU",…
## $ dest           <chr> "DTW", "ORD", "ATL", "CLT", "MIA", "MSP", "ATL", "DFW",…
## $ air_time       <dbl> 76, 110, 66, 41, 100, 146, 64, 176, 105, 61, 102, 245, …
## $ distance       <dbl> 501, 646, 356, 130, 700, 980, 356, 1061, 680, 431, 632,…
## $ hour           <dbl> 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 9…
## $ minute         <dbl> 0, 0, 0, 54, 0, 5, 5, 0, 30, 30, 35, 40, 40, 0, 15, 30,…
## $ time_hour      <dttm> 2021-01-01 06:00:00, 2021-01-01 06:00:00, 2021-01-01 0…

names(flights) # to get column names

##  [1] "year"           "month"          "day"            "dep_time"      
##  [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
##  [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
## [13] "origin"         "dest"           "air_time"       "distance"      
## [17] "hour"           "minute"         "time_hour"

?get_flights # to examine data dictionary

head(flights)

## # A tibble: 6 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2021     1     1      552            600        -8      725            758
## 2  2021     1     1      553            600        -7      704            722
## 3  2021     1     1      557            600        -3      722            727
## 4  2021     1     1      648            654        -6      749            807
## 5  2021     1     1      655            700        -5      850            915
## 6  2021     1     1      701            705        -4      847            912
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

tail(flights)

## # A tibble: 6 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2021    12    31       NA           1445        NA       NA           1624
## 2  2021    12    31       NA            710        NA       NA           1032
## 3  2021    12    31       NA            720        NA       NA            903
## 4  2021    12    31       NA            530        NA       NA            725
## 5  2021    12    31       NA           1434        NA       NA           1550
## 6  2021    12    31       NA           2059        NA       NA           2250
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

The head() / tail() function returns “A tibble: 6 x 19” and then the first / last six rows of the flights data.

A tibble is a tweaked, opinionated version of the R data frame. The two main differences between a data frame vs a tibble are printing and subsetting.
You can specify the number of rows shown with the second argument in the functions, e.g. tail(flights, 2).

Q - Can you print the first three rows of the flights data?

# code here

Part 2: Wrangling

Often we need to wrangle our data to extract meaning. This includes calculating new variables, summary statistics, grouping by variables, renaming, reordering, selecting subsets of data, filtering by various conditions, etc.

We can accomplish a great deal of wrangling by learning some key verbs and a pipe operator. Each of the following functions takes a data frame as input and returns a data frame as output.

select
arrange
slice
filter
mutate
summarize
group_by
distinct
count

`select()`

The select() function picks off one or more columns by name.

Let’s say we want a dataset that only contains the variables month and dest.

select(flights, month, dest)

## # A tibble: 44,136 × 2
##    month dest 
##    <int> <chr>
##  1     1 DTW  
##  2     1 ORD  
##  3     1 ATL  
##  4     1 CLT  
##  5     1 MIA  
##  6     1 MSP  
##  7     1 ATL  
##  8     1 DFW  
##  9     1 FLL  
## 10     1 LGA  
## # … with 44,126 more rows

We can also use select() to exclude variables. Let’s exclude year but keep all the other variables because we know the year is 2021.

select(flights, -year)

## # A tibble: 44,136 × 18
##    month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1     1     1      552            600        -8      725            758
##  2     1     1      553            600        -7      704            722
##  3     1     1      557            600        -3      722            727
##  4     1     1      648            654        -6      749            807
##  5     1     1      655            700        -5      850            915
##  6     1     1      701            705        -4      847            912
##  7     1     1      701            705        -4      826            837
##  8     1     1      715            700        15      935            926
##  9     1     1      722            730        -8      922            945
## 10     1     1      724            730        -6      844            909
## # … with 44,126 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

We can also use select() with certain characteristics. Here, we select all the variables whose names contain “time”.

select(flights, contains("time"))

## # A tibble: 44,136 × 6
##    dep_time sched_dep_time arr_time sched_arr_time air_time time_hour          
##       <int>          <int>    <int>          <int>    <dbl> <dttm>             
##  1      552            600      725            758       76 2021-01-01 06:00:00
##  2      553            600      704            722      110 2021-01-01 06:00:00
##  3      557            600      722            727       66 2021-01-01 06:00:00
##  4      648            654      749            807       41 2021-01-01 06:00:00
##  5      655            700      850            915      100 2021-01-01 07:00:00
##  6      701            705      847            912      146 2021-01-01 07:00:00
##  7      701            705      826            837       64 2021-01-01 07:00:00
##  8      715            700      935            926      176 2021-01-01 07:00:00
##  9      722            730      922            945      105 2021-01-01 07:00:00
## 10      724            730      844            909       61 2021-01-01 07:00:00
## # … with 44,126 more rows

Try ?contains for more info about other select helpers.

`arrange()`

The arrange() function orders rows (observations) in ascending / descending order.

Let’s arrange the data by descending departure delays, with large departure delays on top.

arrange(flights, desc(dep_delay))

## # A tibble: 44,136 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2021     5    11      838           1120      2718     1041           1326
##  2  2021     8     7      637            600      1477       NA            819
##  3  2021     5    29      517            525      1432      709            631
##  4  2021    10    28      655            915      1300      806           1038
##  5  2021     7     4      751           1156      1195      945           1354
##  6  2021    10     5      759           1252      1147      908           1406
##  7  2021     9    23      737           1325      1092      911           1505
##  8  2021     8    14     1016           1630      1066     1152           1827
##  9  2021     7     6     1258           1915      1063     1447           2109
## 10  2021     8    12     1113           1730      1063     1225           1847
## # … with 44,126 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Or with low departure delays on top.

arrange(flights, dep_delay)

## # A tibble: 44,136 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2021     3     8     2037           2109       -32     2214           2302
##  2  2021     3    12     2003           2035       -32     2203           2215
##  3  2021     3    15     2005           2035       -30     2135           2215
##  4  2021     2    12     2120           2147       -27     2326           2336
##  5  2021    11    17      654            720       -26      849            846
##  6  2021     3    19     2010           2035       -25     2139           2215
##  7  2021     6     5     2155           2220       -25     2310           2355
##  8  2021     3     1     2059           2123       -24     2243           2312
##  9  2021     7     7      946           1010       -24     1138           1200
## 10  2021     8    31     2021           2045       -24     2209           2240
## # … with 44,126 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Pipe operator `%>%`

What if we only want to examine the dep_delay, dest, month, and day columns? We can combine arrange() and select()!

arrange(select(flights, month:day, dest, dep_delay), desc(dep_delay))

## # A tibble: 44,136 × 4
##    month   day dest  dep_delay
##    <int> <int> <chr>     <dbl>
##  1     5    11 DFW        2718
##  2     8     7 LAX        1477
##  3     5    29 CLT        1432
##  4    10    28 ORD        1300
##  5     7     4 DFW        1195
##  6    10     5 CLT        1147
##  7     9    23 EWR        1092
##  8     8    14 IAH        1066
##  9     7     6 JFK        1063
## 10     8    12 ORD        1063
## # … with 44,126 more rows

It is not easy to understand what is going on in the code chunk above.

we have to read from inside out and right to left
not clear which argument goes with which function
doesn’t focus on the functions

The pipe is a technique for passing information from one process to another. It can rewrite the code with many parentheses (like the one above!) into a more natural and easier-to-read structure (see the code chunk below!).

flights %>%
  select(month:day, dest, dep_delay) %>%
  arrange(desc(dep_delay))

## # A tibble: 44,136 × 4
##    month   day dest  dep_delay
##    <int> <int> <chr>     <dbl>
##  1     5    11 DFW        2718
##  2     8     7 LAX        1477
##  3     5    29 CLT        1432
##  4    10    28 ORD        1300
##  5     7     4 DFW        1195
##  6    10     5 CLT        1147
##  7     9    23 EWR        1092
##  8     8    14 IAH        1066
##  9     7     6 JFK        1063
## 10     8    12 ORD        1063
## # … with 44,126 more rows

When reading code “in English”, say “then” whenever you see a pipe.

Q - How would you read the code chunk above in English? What is it accomplishing?

`slice()`

Slice subsets a data frame for certain row numbers.

Here we slice off the first 5 rows of the flights data.

flights %>%
  slice(1:5)

## # A tibble: 5 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2021     1     1      552            600        -8      725            758
## 2  2021     1     1      553            600        -7      704            722
## 3  2021     1     1      557            600        -3      722            727
## 4  2021     1     1      648            654        -6      749            807
## 5  2021     1     1      655            700        -5      850            915
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

We can also slice the bottom 5 rows of the flights data. Here n() computes the number of rows in the data.

flights %>%
  slice((n()-4):n())

## # A tibble: 5 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2021    12    31       NA            710        NA       NA           1032
## 2  2021    12    31       NA            720        NA       NA            903
## 3  2021    12    31       NA            530        NA       NA            725
## 4  2021    12    31       NA           1434        NA       NA           1550
## 5  2021    12    31       NA           2059        NA       NA           2250
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Q - You can accomplish the same thing with one of the functions introduced above. What is it?

We can also slice every other row, e.g., 1st, 3rd, 5th, .

flights %>%
  slice(seq(1, n(), by=2))

## # A tibble: 22,068 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2021     1     1      552            600        -8      725            758
##  2  2021     1     1      557            600        -3      722            727
##  3  2021     1     1      655            700        -5      850            915
##  4  2021     1     1      701            705        -4      826            837
##  5  2021     1     1      722            730        -8      922            945
##  6  2021     1     1      730            735        -5      826            840
##  7  2021     1     1      739            740        -1      920            932
##  8  2021     1     1      811            815        -4      959           1025
##  9  2021     1     1      827            836        -9      933            952
## 10  2021     1     1      854            900        -6     1053           1116
## # … with 22,058 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Q - What is the code chunk below accomplishing? Guess before running the code.

flights %>%
  select(month:day, dest, dep_delay) %>%
  arrange(desc(dep_delay)) %>%
  slice(1:5)

## # A tibble: 5 × 4
##   month   day dest  dep_delay
##   <int> <int> <chr>     <dbl>
## 1     5    11 DFW        2718
## 2     8     7 LAX        1477
## 3     5    29 CLT        1432
## 4    10    28 ORD        1300
## 5     7     4 DFW        1195

Q - What will happen if you comment out the line containing arrange() in the code chunk above? Try it.

`filter()`

filter() selects rows satisfying certain conditions.

Around 8am on Dec 12, I was at RDU waiting for my Delta flight to Atlanta (ATL). I was supposed to take a direct flight from ATL to South Korea at around 11am. However, because of a plane maintenance delay I could not depart to ATL on time and had to reschedule my itinerary completely. Here we will find that incident.

We first select all flights that experienced departure delays to ATL.

flights %>%
  filter(dep_delay > 0, dest == "ATL")

## # A tibble: 1,020 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2021     1     1     1256           1238        18     1427           1400
##  2  2021     1     1     1418           1410         8     1541           1536
##  3  2021     1     1     1546           1515        31     1705           1645
##  4  2021     1     1     1749           1745         4     1907           1913
##  5  2021     1     2      604            600         4      740            727
##  6  2021     1     2      840            830        10     1005           1000
##  7  2021     1     3      847            830        17     1011           1000
##  8  2021     1     3     1110           1105         5     1235           1235
##  9  2021     1     3     1247           1238         9     1415           1400
## 10  2021     1     3     1616           1615         1     1738           1745
## # … with 1,010 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

We add another condition for Delta Airlines (DL).

flights %>%
  filter(dep_delay > 0, dest == "ATL", 
         carrier == "DL")

## # A tibble: 802 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2021     1     1     1256           1238        18     1427           1400
##  2  2021     1     1     1418           1410         8     1541           1536
##  3  2021     1     1     1749           1745         4     1907           1913
##  4  2021     1     2      604            600         4      740            727
##  5  2021     1     2      840            830        10     1005           1000
##  6  2021     1     3      847            830        17     1011           1000
##  7  2021     1     3     1247           1238         9     1415           1400
##  8  2021     1     3     2012           1759       133     2133           1931
##  9  2021     1     4      631            600        31      746            727
## 10  2021     1     4      740            705        35      906            837
## # … with 792 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

We further filter with information about date and time.

flights %>%
  filter(dep_delay > 0, dest == "ATL", 
         carrier == "DL", 
         month == 12, day == 12, sched_arr_time < 1100)

## # A tibble: 1 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2021    12    12     1121            830       171     1251           1009
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

We found it!

Q - In plain English, what is the code below accomplishing?

flights %>%
  filter(carrier %in% c("DL", "AA"), # American Airlines
         arr_delay < 0 | dep_delay < 0)

## # A tibble: 12,322 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2021     1     1      553            600        -7      704            722
##  2  2021     1     1      557            600        -3      722            727
##  3  2021     1     1      648            654        -6      749            807
##  4  2021     1     1      655            700        -5      850            915
##  5  2021     1     1      701            705        -4      826            837
##  6  2021     1     1      734            740        -6     1005           1029
##  7  2021     1     1      827            836        -9      933            952
##  8  2021     1     1      828            830        -2      953           1000
##  9  2021     1     1     1545           1547        -2     1712           1710
## 10  2021     1     1     1749           1745         4     1907           1913
## # … with 12,312 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

The table of logical operators below will be helpful as you work with filtering.

operator	definition
`<`	less than
`<=`	less than or equal to
`>`	greater than
`>=`	greater than or equal to
`==`	exactly equal to
`!=`	not equal to
`is.na(x)`	is x NA?
`!is.na(x)`	is x not NA?
`x %in% y`	is x in y?
`!(x %in% y)`	is x not in y?
`x & y`	x AND y
`x \| y`	x OR y
`!x`	not x

`mutate()`

mutate() creates a new variable.

In the code chunk below, air_time is converted to hours, and a new variable mph is created, corresponding to the miles per hour of the flight.

flights %>%
  mutate(hours = air_time / 60,
         mph = distance / hours) %>%
  select(air_time, distance, hours, mph)

## # A tibble: 44,136 × 4
##    air_time distance hours   mph
##       <dbl>    <dbl> <dbl> <dbl>
##  1       76      501 1.27   396.
##  2      110      646 1.83   352.
##  3       66      356 1.1    324.
##  4       41      130 0.683  190.
##  5      100      700 1.67   420 
##  6      146      980 2.43   403.
##  7       64      356 1.07   334.
##  8      176     1061 2.93   362.
##  9      105      680 1.75   389.
## 10       61      431 1.02   424.
## # … with 44,126 more rows

Note that the above code chunk creates hour and mph available only locally within the code chunk because dplyr functions do not modify the input data frame. If you want to use these newly created variables for future code chunks, you may create another data frame with two additional variables or overwrite flights.

# option1: create a new data frame
flights2 <- flights %>%
  mutate(hours = air_time / 60,
         mph = distance / hours) 

# option2: overwrite the original data frame
flights <- flights %>%
  mutate(hours = air_time / 60,
         mph = distance / hours)

Q - Using <= in the mutate, create a new variable on_time that is TRUE if the flight arrived on time and FALSE if it did not.

Note: Once you have modified the code, remove the option eval = FALSE from the code chunk header and knit to see the updates.

flights %>%
  mutate(on_time = ____) %>%
  select(arr_delay, on_time)

Q - What do you think will happen if you take the mean of the on_time variable?

`summarize()`

summarize calculates summary statistics. It collapses rows into summary statistics and removes columns irrelevant to the calculation.

Be sure to name your summary statistics!

# option3: create a variable every time I use it
flights %>%
  mutate(on_time = (arr_delay <= 0)) %>% 
  summarize(prop_on_time = mean(on_time))

## # A tibble: 1 × 1
##   prop_on_time
##          <dbl>
## 1           NA

Huh? Shall we try another variable?

flights %>%
  summarize(mean_dep_delay = mean(dep_delay))

## # A tibble: 1 × 1
##   mean_dep_delay
##            <dbl>
## 1             NA

Q - The code chunks above should return an NA. What is going wrong? Try to fix it to find the proportion of flights on time and the mean departure delay. Hint: Try summary(flights$dep_delay).

`group_by()`

group_by() is used for grouped operations. It’s very powerful when paired with summarize() to calculate summary statistics by group.

Here we find the proportion of flights that are on time by each carrier and print top 5 airlines in descending order of the proportions.

flights %>%
  group_by(carrier) %>%
  summarize(prop_on_time = mean((arr_delay <= 0), na.rm = TRUE)) %>% 
  arrange(desc(prop_on_time)) %>% 
  head(5)

## # A tibble: 5 × 2
##   carrier prop_on_time
##   <chr>          <dbl>
## 1 9E             0.843
## 2 DL             0.795
## 3 AS             0.767
## 4 YX             0.766
## 5 AA             0.763

9E, AS, and YX represent Endeavor Airlines, Alaska Airlines, and Midwest Express Airlines, respectively.

We can calculate more than one summary statistic in summarize(). Let’s find the maximum delay, median delay, and the count of flights in each month as well as the proportion on time for each month.

Here n() calculates the current group size.

flights %>%
  group_by(month) %>%
  summarize(prop_on_time = mean((arr_delay <= 0), na.rm = TRUE),
            max_delay = max(arr_delay, na.rm = TRUE),
            median_delay = median(arr_delay, na.rm = TRUE),
            count = n())

## # A tibble: 12 × 5
##    month prop_on_time max_delay median_delay count
##    <int>        <dbl>     <dbl>        <dbl> <int>
##  1     1        0.811       792          -13  2239
##  2     2        0.794       897          -12  2222
##  3     3        0.845       806          -14  3040
##  4     4        0.831       769          -14  3252
##  5     5        0.786      2715          -11  3647
##  6     6        0.662      1009           -7  3951
##  7     7        0.662      1191           -7  4189
##  8     8        0.688      1058           -8  4197
##  9     9        0.792      1086          -13  4007
## 10    10        0.759      1288          -10  4367
## 11    11        0.708       761           -9  4565
## 12    12        0.702       806           -9  4460

Finally, let’s see what the proportion on time, the maximum delay, the median delay, and the count of flights are in each month for DL, AS, and AA over the summer peak season (June ~ August).

flights %>%
  filter(between(month, 6, 8), 
         carrier %in% c("DL", "AS", "AA")) %>% 
  group_by(month, carrier) %>%
  summarize(prop_on_time = mean((arr_delay <= 0), na.rm = TRUE),
            max_delay = max(arr_delay, na.rm = TRUE),
            median_delay = median(arr_delay, na.rm = TRUE),
            count = n())

## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.

## # A tibble: 9 × 6
## # Groups:   month [3]
##   month carrier prop_on_time max_delay median_delay count
##   <int> <chr>          <dbl>     <dbl>        <dbl> <int>
## 1     6 AA             0.689       866           -9   577
## 2     6 AS             0.730       102          -12    38
## 3     6 DL             0.780      1009          -10   780
## 4     7 AA             0.757      1191          -11   609
## 5     7 AS             0.857       274          -17    49
## 6     7 DL             0.775       402          -11   813
## 7     8 AA             0.737       945          -12   609
## 8     8 AS             0.816        20          -19    49
## 9     8 DL             0.764       859          -10   795

`distinct()`

distinct() selects only unique rows.

Q - I am planning on a trip to Seattle this summer. I want to find which airlines have direct flights to Seattle-Tacoma Airport (SEA). Find unique combinations of carrier and dest and filter for dest equal to “SEA”.

Note: Once you have modified the code, remove the option eval = FALSE from the code chunk header and knit to see the updates.

flights %>% 
  distinct(____, ____) %>% 
  filter(____)

`count()`

count counts the unique values of one or more variables. It creates frequency tables.

Here we count the number of flights to each destination to find popular destinations from RDU.

flights %>%
  count(dest) %>% 
  arrange(desc(n))

## # A tibble: 47 × 2
##    dest      n
##    <chr> <int>
##  1 ATL    4008
##  2 CLT    2884
##  3 LGA    2520
##  4 MCO    2332
##  5 BOS    2317
##  6 JFK    2124
##  7 ORD    2109
##  8 EWR    1823
##  9 MIA    1798
## 10 PHL    1646
## # … with 37 more rows

Q - What is the code chunk below doing?

flights %>%
  count(dest) %>%
  mutate(prop = n / sum(n))

## # A tibble: 47 × 3
##    dest      n     prop
##    <chr> <int>    <dbl>
##  1 ATL    4008 0.0908  
##  2 AUS     825 0.0187  
##  3 BDL      54 0.00122 
##  4 BNA    1080 0.0245  
##  5 BOS    2317 0.0525  
##  6 BUF      25 0.000566
##  7 BWI    1087 0.0246  
##  8 CLE      52 0.00118 
##  9 CLT    2884 0.0653  
## 10 DAL     100 0.00227 
## # … with 37 more rows

Practice

Create a new dataset that only contains flights that do not have a missing departure time. Include the columns year, month, day, dep_time, dep_delay, and dep_delay_hours (the departure delay in hours). Note you may need to use mutate() to make one or more of these variables.
For each airline (uniquely identified by carrier), use a group_by() paired with summarize() to find the sample size, mean, and standard deviation of flight distances. Then, pick off the top 5 and bottom 5 airlines in terms of mean distance traveled per flight.

Submitting Application Exercises

Once you have completed the activity, push your final changes to your GitHub repo.
Make sure you committed at least three times.
Check that your repo is updated on GitHub, and that’s all you need to do to submit application exercises for participation.

AE 05: Data Wrangling 1

due Wednesday, May 18 at 9:29am

Bora Jin

Getting Started

RDU Flights

Part 1: Data overview

Part 2: Wrangling

`select()`

`arrange()`

Pipe operator `%>%`

`slice()`

`filter()`

`mutate()`

`summarize()`

`group_by()`

`distinct()`

`count()`

Practice

Submitting Application Exercises

Additional Resources

AE 05: Data Wrangling 1

due Wednesday, May 18 at 9:29am

Bora Jin

Getting Started

RDU Flights

Part 1: Data overview

Part 2: Wrangling

select()

arrange()

Pipe operator %>%

slice()

filter()

mutate()

summarize()

group_by()

distinct()

count()

Practice

Submitting Application Exercises

Additional Resources

`select()`

`arrange()`

Pipe operator `%>%`

`slice()`

`filter()`

`mutate()`

`summarize()`

`group_by()`

`distinct()`

`count()`