Your task is to fill in the blanks denoted by ___.


In this mini analysis we work with the data used in the FiveThirtyEight story titled “The Dollar-And-Cents Case Against Hollywood’s Exclusion of Women”. The data contains information about whether movies pass the Bechdel test, which checks the following three criteria: the movie (1) has to have at least two women in it, who (2) talk to each other, about (3) something besides a man.

Data and Packages

We start with loading the packages we’ll use.

library(fivethirtyeight) # for data
library(tidyverse)       # for data wrangling (%>%, filter(), glimpse()) and ggplot()

The dataset contains information on 1,794 movies released between 1970 and 2013. However, we’ll focus our analysis on movies released between 2000 and 2013. We first create a new dataset containing only those years, using the pipe operator %>% and filter(). The filter() function keeps the observations that meet a given condition, and you can think of the pipe operator %>% as “then”. Hence, the following code translates to “take the bechdel data, then keep the rows with year between 2000 and 2013”.

bechdel00_13 <- bechdel %>% 
  filter(between(year, 2000, 2013))
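
To make the “then” reading concrete, here is a minimal sketch on a small made-up tibble (toy_movies and its values are hypothetical and not part of the bechdel data). It suggests that the piped form and the plain nested call select the same rows, and that between() includes both endpoints.

```r
library(dplyr)

# Hypothetical toy data, not taken from the bechdel dataset
toy_movies <- tibble(
  title = c("A", "B", "C", "D"),
  year  = c(1995, 2000, 2007, 2013)
)

# Piped form: take toy_movies, then keep rows with year in [2000, 2013]
piped <- toy_movies %>%
  filter(between(year, 2000, 2013))

# Equivalent form without the pipe
nested <- filter(toy_movies, between(year, 2000, 2013))

identical(piped, nested)  # TRUE
nrow(piped)               # 3, since both 2000 and 2013 are kept
```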

Let’s preview our data with the glimpse() function:

glimpse(bechdel00_13)
## Rows: 1,278
## Columns: 15
## $ year          <int> 2013, 2012, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 20…
## $ imdb          <chr> "tt1711425", "tt1343727", "tt2024544", "tt1272878", "tt0…
## $ title         <chr> "21 & Over", "Dredd 3D", "12 Years a Slave", "2 Guns", "…
## $ test          <chr> "notalk", "ok-disagree", "notalk-disagree", "notalk", "m…
## $ clean_test    <ord> notalk, ok, notalk, notalk, men, men, notalk, ok, ok, no…
## $ binary        <chr> "FAIL", "PASS", "FAIL", "FAIL", "FAIL", "FAIL", "FAIL", …
## $ budget        <int> 13000000, 45000000, 20000000, 61000000, 40000000, 225000…
## $ domgross      <dbl> 25682380, 13414714, 53107035, 75612460, 95020213, 383624…
## $ intgross      <dbl> 42195766, 40868994, 158607035, 132493015, 95020213, 1458…
## $ code          <chr> "2013FAIL", "2012PASS", "2013FAIL", "2013FAIL", "2013FAI…
## $ budget_2013   <int> 13000000, 45658735, 20000000, 61000000, 40000000, 225000…
## $ domgross_2013 <dbl> 25682380, 13611086, 53107035, 75612460, 95020213, 383624…
## $ intgross_2013 <dbl> 42195766, 41467257, 158607035, 132493015, 95020213, 1458…
## $ period_code   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ decade_code   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…

There are ___ such movies.

The financial variables we’ll focus on are the following:

  • budget_2013: Budget in 2013 inflation adjusted dollars
  • domgross_2013: Domestic gross (US) in 2013 inflation adjusted dollars
  • intgross_2013: Total International (i.e., worldwide) gross in 2013 inflation adjusted dollars

And we’ll also use the binary and clean_test variables for grouping.


Part 1

Let’s take a look at how median budget and gross vary by whether the movie passed the Bechdel test, which is stored in the binary variable.

bechdel00_13 %>%
  group_by(binary) %>%
  summarise(med_budget = median(budget_2013),
            med_domgross = median(domgross_2013, na.rm = TRUE),
            med_intgross = median(intgross_2013, na.rm = TRUE))
## # A tibble: 2 × 4
##   binary med_budget med_domgross med_intgross
##   <chr>       <dbl>        <dbl>        <dbl>
## 1 FAIL     46231854    56199150.    106399210
## 2 PASS     29955246    41913460      75682603
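
The na.rm = TRUE argument in the gross medians matters whenever a group contains missing values. The following is a small sketch on made-up numbers (the toy tibble and its gross column are hypothetical) showing what a grouped median looks like with and without it.

```r
library(dplyr)

# Hypothetical toy data; one gross value is missing on purpose
toy <- tibble(
  binary = c("FAIL", "FAIL", "FAIL", "PASS", "PASS"),
  gross  = c(10, 20, NA, 5, 15)
)

med_by_result <- toy %>%
  group_by(binary) %>%
  summarise(
    med_with_na = median(gross),               # NA for any group containing an NA
    med_no_na   = median(gross, na.rm = TRUE)  # NAs dropped before taking the median
  )

med_by_result
```

In this toy case the FAIL group has a missing median unless na.rm = TRUE is set, while the PASS group is unaffected.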

Part 2

Next, let’s take a look at how median budget and gross vary by a more detailed indicator of the Bechdel test result. This information is stored in the clean_test variable, which takes on the following values:

  • ok = passes test
  • dubious
  • men = women only talk about men
  • notalk = women don’t talk to each other
  • nowomen = fewer than two women

bechdel00_13 %>%
  # group_by(___) %>%
  summarise(med_budget = median(budget_2013),
            med_domgross = median(domgross_2013, na.rm = TRUE),
            med_intgross = median(intgross_2013, na.rm = TRUE))
## # A tibble: 1 × 3
##   med_budget med_domgross med_intgross
##        <dbl>        <dbl>        <dbl>
## 1  36145608.     48120751     92571961

Part 3

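In order to evaluate how return on investment varies among movies that pass and fail the Bechdel test, we’ll first create a new variable called roi as the ratio of gross to budget. In the code below, gross is computed as the sum of intgross_2013 and domgross_2013; since intgross_2013 is described above as the worldwide total, this sum counts domestic gross twice, so treat roi as a rough measure.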

bechdel00_13 <- bechdel00_13 %>%
  mutate(roi = (intgross_2013 + domgross_2013) / budget_2013)
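
As a quick illustration of how this ratio behaves, here is a sketch that applies the same mutate() call to two made-up movies (the titles and dollar figures are hypothetical, chosen only to make the arithmetic easy to check).

```r
library(dplyr)

# Hypothetical figures in dollars, not taken from the bechdel data
toy <- tibble(
  title         = c("Low budget hit", "Blockbuster"),
  budget_2013   = c(1e6, 2e8),
  domgross_2013 = c(4e7, 3e8),
  intgross_2013 = c(6e7, 7e8)
)

# Same formula as in the exercise: (international + domestic) / budget
toy_roi <- toy %>%
  mutate(roi = (intgross_2013 + domgross_2013) / budget_2013)

toy_roi$roi  # 100 for the low budget movie, 5 for the blockbuster
```

Even though the made-up blockbuster earns far more in absolute terms, the small budget in the denominator gives the low budget movie a much larger roi.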

Let’s see which movies have the highest return on investment.

bechdel00_13 %>%
  arrange(desc(roi)) %>% 
  select(title, roi, year)
## # A tibble: 1,278 × 3
##    title                      roi  year
##    <chr>                    <dbl> <int>
##  1 Paranormal Activity      671.   2007
##  2 Napoleon Dynamite        227.   2004
##  3 Once                     190.   2006
##  4 The Devil Inside         155.   2012
##  5 Primer                   142.   2004
##  6 Fireproof                134.   2008
##  7 Saw                      132.   2004
##  8 My Big Fat Greek Wedding 119.   2002
##  9 Insidious                103.   2010
## 10 Paranormal Activity 2     87.4  2010
## # … with 1,268 more rows

Part 4

Below is a visualization of the return on investment by test result; however, it’s difficult to see the distributions due to a few extreme observations.

ggplot(data = bechdel00_13, 
       mapping = aes(x = clean_test, y = roi, color = binary)) +
  geom_boxplot() +
  labs(title = "Return on investment vs. Bechdel test result",
       x = "Detailed Bechdel result",
       y = "___",
       color = "Binary Bechdel result")

What are those movies with very high returns on investment?

bechdel00_13 %>%
  filter(roi > 150) %>%
  select(title, budget_2013, domgross_2013, year)
## # A tibble: 4 × 4
##   title               budget_2013 domgross_2013  year
##   <chr>                     <int>         <dbl> <int>
## 1 The Devil Inside        1014639      54041622  2012
## 2 Paranormal Activity      505595     121251476  2007
## 3 Once                     173369      10917487  2006
## 4 Napoleon Dynamite        493277      54927590  2004

Zooming in on the movies with roi < ___ provides a better view of how the medians across the categories compare:

ggplot(data = bechdel00_13, mapping = aes(x = clean_test, y = roi, fill = binary)) +
  geom_boxplot() +
  labs(title = "___",
       subtitle = "___", # Something about zooming in to a certain level
       x = "___",
       y = "___",
       fill = "___") + 
  coord_cartesian(ylim = c(0, 15)) # zooming in without dropping data 
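
The comment about zooming in without dropping data can be checked on a toy example. The sketch below (toy data and object names are hypothetical) compares coord_cartesian(), which only changes the visible window, with ylim(), which sets scale limits and discards values outside them before the boxplot statistics are computed. It uses layer_data() from ggplot2 to read the computed median out of each plot.

```r
library(ggplot2)

# Hypothetical toy data with one extreme value
toy <- data.frame(group = "a", roi = c(1, 2, 3, 4, 100))

base <- ggplot(toy, aes(x = group, y = roi)) +
  geom_boxplot()

# Zoom only: all five observations still feed the boxplot statistics
p_zoom <- base + coord_cartesian(ylim = c(0, 15))

# Scale limits: the value 100 is treated as missing and removed first
p_drop <- base + ylim(0, 15)

median_zoom <- layer_data(p_zoom)$middle
median_drop <- suppressWarnings(layer_data(p_drop)$middle)

median_zoom  # 3, the median of all five values
median_drop  # 2.5, the median after the extreme value is removed
```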

Submitting Application Exercises

  • Once you have completed the activity, push your final changes to your GitHub repo.
  • Make sure you committed at least three times.
  • Check that your repo is updated on GitHub, and that’s all you need to do to submit application exercises for participation.


This assignment was adapted from the Bechdel exercise.