Goals

Calculate marginal, joint, and conditional probabilities in a reproducible way.
Visualize categorical data.
Use visualizations and probabilities to describe the association between two categorical variables.

Getting started

Go to the course GitHub organization page and clone the repository named “lab04-GitHubUsername” in RStudio.
Open lab04.Rmd, the template R Markdown file.

Don’t forget to label your R chunks. Each label should be short and informative, should not include spaces, and should not repeat a previous label.

Packages

We will use the tidyverse and knitr packages in this lab.

library(tidyverse)
library(knitr)

NC Courage

Today, we will be working with data from the first three full seasons of the NC Courage, a highly successful National Women’s Soccer League (NWSL) team located near Duke in Cary, NC. The Courage moved to the Triangle from Western New York in 2017 and had three very successful first seasons, culminating in winning the 2019 championship game, which was held at their stadium in Cary! (Data for this lab were sourced from the nwslR package on GitHub and verified against the NC Courage website by Meredith Brown in a previous semester.)

Use the code below to load the data set.

courage <- read_csv("data/courage.csv")
glimpse(courage)

## Rows: 78
## Columns: 10
## $ game_id     <chr> "washington-spirit-vs-north-carolina-courage-2017-04-15", …
## $ game_date   <chr> "4/15/2017", "4/22/2017", "4/29/2017", "5/7/2017", "5/14/2…
## $ game_number <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ home_team   <chr> "WAS", "NC", "NC", "BOS", "ORL", "NC", "NC", "CHI", "NC", …
## $ away_team   <chr> "NC", "POR", "ORL", "NC", "NC", "CHI", "NJ", "NC", "KC", "…
## $ opponent    <chr> "WAS", "POR", "ORL", "BOS", "ORL", "CHI", "NJ", "CHI", "KC…
## $ home_pts    <dbl> 0, 1, 3, 0, 3, 1, 2, 3, 2, 3, 0, 0, 2, 1, 1, 0, 1, 2, 2, 2…
## $ away_pts    <dbl> 1, 0, 1, 1, 1, 3, 0, 2, 0, 1, 1, 1, 0, 0, 0, 1, 2, 0, 3, 1…
## $ result      <chr> "win", "win", "win", "win", "loss", "loss", "win", "loss",…
## $ season      <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017…

How many observations are in this dataset? What does each observation represent? (You do not need to create a code chunk here.)

Each season, the Courage play 26 games. We want to find out whether they win more in the early, middle, or late season.

Create a new variable called seasonal_category that classifies NC Courage games as early (games 1-9), middle (games 10-17), or late (games 18-26) season.
Create a new variable called win that takes the value 0 if the Courage lose and 1 if they win.
Create a 3 x 2 tibble that has the season category (early, middle, late) in one column and the proportion of wins in the second column.
Save your tibble as seasonal_courage and print it to the screen.
Describe what this table shows using the term “conditional probability”.
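
The pattern behind these steps (bin a numeric variable into categories, then compute a proportion within each category) can be sketched on a small made-up data set. Everything below is hypothetical: the tibble toy_games, its six games, and the names part and prop_win are invented for illustration and are not the Courage data. The sketch also assumes that any result other than a win counts as 0, which the lab does not specify for ties.

```r
library(tidyverse)

# Hypothetical toy data: six made-up games, NOT the Courage data
toy_games <- tibble(
  game_number = c(1, 5, 12, 15, 20, 25),
  result      = c("win", "loss", "win", "win", "tie", "win")
)

toy_summary <- toy_games %>%
  mutate(
    # case_when() checks the conditions in order and uses the first match
    part = case_when(
      game_number <= 9  ~ "early",
      game_number <= 17 ~ "middle",
      TRUE              ~ "late"
    ),
    # Assumption for this sketch: anything other than a win counts as 0
    win = if_else(result == "win", 1, 0)
  ) %>%
  group_by(part) %>%
  # The mean of a 0/1 variable is the proportion of 1s in each group
  summarise(prop_win = mean(win))

toy_summary
```

Each value of prop_win is computed only from the games in its own group, which is what makes it a proportion conditional on the group.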

🧶 ✅ ⬆️ Knit, commit, and push your final changes to GitHub with a meaningful commit message.

By default, R arranges the categories of a categorical variable in alphabetical order in output and visualizations, but we want the levels of seasonal_category to be in logical order. To achieve this, we will use the factor() function to make this variable a factor (a categorical variable with ordered levels) and specify the levels we wish to use.

The code to reorder levels for seasonal_category is below.

# seasonal_courage %>%
#   mutate(seasonal_category =
#            factor(seasonal_category,
#                   levels = c("early", "middle", "late")))

Add one line of code to the chunk above so that the seasonal categories print in the correct order (early, then middle, then late). Hint: what dplyr verb changes the order of output?
Uncomment and run the code above.
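
To see why the levels matter, here is a minimal base R sketch on a made-up character vector (x and f are hypothetical names, not part of the lab data). It only illustrates how factor levels change sort order; it is not the dplyr line the exercise asks for.

```r
# Hypothetical toy vector, NOT taken from the lab data
x <- c("late", "early", "middle")

# Character vectors sort alphabetically
sort(x)

# A factor sorts by the order of its levels instead
f <- factor(x, levels = c("early", "middle", "late"))
levels(f)
sort(f)
```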

Based on the data:

What is the marginal probability that the Courage win a game?
What is the conditional probability that the Courage win a game given that it was a home game?
Based on your findings, would you say winning is independent of whether they are playing on their home field? Why? What does this say about home-field advantage?
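
As a reminder, the standard definition of independence that this question relies on can be written as follows, for generic events $A$ and $B$ with $P(B) > 0$ (the symbols are placeholders, not notation taken from the lab):

```latex
A \text{ and } B \text{ are independent}
\quad\Longleftrightarrow\quad
P(A \mid B) = P(A)
\quad\Longleftrightarrow\quad
P(A \cap B) = P(A)\,P(B)
```

With proportions estimated from a sample, the two sides will rarely match exactly, so the practical question is whether they are close.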

🧶 ✅ ⬆️ Knit, commit, and push your final changes to GitHub with a meaningful commit message.

Independence, contingency tables, and ties

Create a new column called home_courage that takes the value “home” if the Courage are the home team and “away” if the Courage are the away team. Save the resulting data frame.
Using the data frame above, create a 3 x 2 contingency table with
- columns denoting whether a game is home or away for the Courage and
- rows denoting whether the Courage win, lose, or tie.
Note: your tibble output may be 3 x 3, because the game result (loss, tie, win) occupies its own column. Viewed as a contingency table, however, its dimensions are 3 x 2.
Use the contingency table to find
- The marginal probability a game is home
- The marginal probability a game is a tie
- The conditional probability of a game being at home given the game was a tie
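
A minimal sketch of how counts can be turned into a contingency table, and how marginal and conditional probabilities can be read from it, is shown below. The tibble toy_games, its eight games, and the column name venue are hypothetical and invented for illustration; they are not the Courage data.

```r
library(tidyverse)

# Hypothetical toy data: eight made-up games, NOT the Courage data
toy_games <- tibble(
  venue  = c("home", "home", "home", "home", "away", "away", "away", "away"),
  result = c("win", "win", "tie", "loss", "win", "tie", "tie", "loss")
)

# Count each (result, venue) pair, then spread venue across the columns
toy_table <- toy_games %>%
  count(result, venue) %>%
  pivot_wider(names_from = venue, values_from = n)

toy_table

n_total <- sum(toy_table$home) + sum(toy_table$away)

# Marginal probability: a column (or row) total over the grand total
p_home <- sum(toy_table$home) / n_total

# Conditional probability: restrict to one row first, then take the share
tie_row <- filter(toy_table, result == "tie")
p_home_given_tie <- tie_row$home / (tie_row$home + tie_row$away)

p_home
p_home_given_tie
```

The key difference is the denominator: a marginal probability divides by all games, while a conditional probability divides only by the games that satisfy the condition.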

Bayes’ theorem tells us that, for any two events $A$ and $B$ with $P(B) > 0$,

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

Using Bayes’ theorem, find and report the conditional probability a game is a tie given a game is home. Check your result using the contingency table.
Finally, is the event that a game is a tie independent of the Courage playing at home or away? Why?
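
The check described above (apply Bayes’ theorem, then confirm against the table) can be sketched with made-up counts. The numbers below are hypothetical and are not the Courage data; the variable names are invented for this illustration.

```r
# Hypothetical counts, NOT the Courage data:
# 8 games, 4 of them at home, 3 ties, 1 of the ties at home
n_games    <- 8
n_home     <- 4
n_tie      <- 3
n_home_tie <- 1

p_home           <- n_home / n_games
p_tie            <- n_tie / n_games
p_home_given_tie <- n_home_tie / n_tie

# Bayes' theorem: P(tie | home) = P(home | tie) * P(tie) / P(home)
p_tie_given_home <- p_home_given_tie * p_tie / p_home

# Direct check from the counts: ties at home over all home games
p_tie_given_home_direct <- n_home_tie / n_home

# TRUE when the two routes agree (up to floating-point tolerance)
all.equal(p_tie_given_home, p_tie_given_home_direct)
```

Comparing p_tie_given_home with p_tie is then one way to reason about independence in this toy setting.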

🧶 ✅ ⬆️ Knit, commit, and push your final changes to GitHub with a meaningful commit message.

Submission

Once you are fully satisfied with your lab, Knit to .pdf to create a PDF document.

Follow the instructions in previous labs to submit your PDF to Gradescope.

Be sure to identify which problems are on each page using Gradescope.

Once you are finished with the lab, submit to Gradescope the PDF document produced from your final knit, commit, and push.

Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes. Remember: you must turn in a .pdf file on Gradescope by the submission deadline to be considered “on time”.

To submit your assignment:

Go to http://www.gradescope.com and click Log in in the top right corner.
Click School Credentials, then Duke NetID, and log in using your NetID credentials.
Click on your STA 199 course.
Click on the assignment, and you’ll be prompted to submit it.
Mark the pages associated with each exercise. All the pages of your lab should be associated with at least one question (i.e., should be “checked”).
Select the first page of your .pdf submission to be associated with the “Workflow & formatting” question.

Grading (50 pts)

Component	Points
Ex 1	2
Ex 2	10
Ex 3	3
Ex 4	10
Ex 5	20
Workflow & formatting	5

Grading notes:

The “Workflow & formatting” grade assesses the reproducible workflow. This includes updating the name on the assignment to your name, having at least 3 informative commit messages, labeling the code chunks, and having readable code that does not exceed 80 characters per line (i.e., we can read all your code in the knitted PDF).

Lab #04: Probability

due Tuesday, May 31 at 11:59pm
