Your task is to fill in the blanks denoted by ___.
In this mini analysis we work with the data used in the FiveThirtyEight story titled “The Dollar-And-Cents Case Against Hollywood’s Exclusion of Women”. The data contains information about whether movies pass the Bechdel test, which checks the following three criteria: the movie (1) has to have at least two women in it, who (2) talk to each other, about (3) something besides a man.
We start with loading the packages we’ll use.
library(fivethirtyeight) # for data
library(tidyverse)
The dataset contains information on 1794 movies released between 1970 and 2013. However, we’ll focus our analysis on movies released between 2000 and 2013. We first create a new dataset containing only those years, using the pipe operator %>% and filter(). The filter() function keeps the observations that meet a given condition, and you can think of the pipe operator %>% as a “then”. Hence, the following code translates to “take the bechdel data, then keep the rows with year between 2000 and 2013”.
bechdel00_13 <- bechdel %>%
  filter(between(year, 2000, 2013))
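To make the “then” reading concrete, here is a minimal sketch on a made-up four-row table (the `toy` data and its titles are hypothetical, not part of the bechdel dataset). It shows that the piped call and the nested call return the same rows, and that `between()` includes both endpoints.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(
  title = c("A", "B", "C", "D"),
  year  = c(1999, 2000, 2013, 2014)
)

# Piped form: take toy, then keep rows with 2000 <= year <= 2013
piped <- toy %>%
  filter(between(year, 2000, 2013))

# Equivalent nested form without the pipe
nested <- filter(toy, between(year, 2000, 2013))

# Both forms keep the same rows; the endpoint years 2000 and 2013 are retained
piped
nested
```

The piped form tends to read more naturally once several steps are chained, which is why the rest of this analysis uses it.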
Let’s preview our data with the glimpse() function:
glimpse(bechdel00_13)
## Rows: 1,278
## Columns: 15
## $ year <int> 2013, 2012, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 20…
## $ imdb <chr> "tt1711425", "tt1343727", "tt2024544", "tt1272878", "tt0…
## $ title <chr> "21 & Over", "Dredd 3D", "12 Years a Slave", "2 Guns", "…
## $ test <chr> "notalk", "ok-disagree", "notalk-disagree", "notalk", "m…
## $ clean_test <ord> notalk, ok, notalk, notalk, men, men, notalk, ok, ok, no…
## $ binary <chr> "FAIL", "PASS", "FAIL", "FAIL", "FAIL", "FAIL", "FAIL", …
## $ budget <int> 13000000, 45000000, 20000000, 61000000, 40000000, 225000…
## $ domgross <dbl> 25682380, 13414714, 53107035, 75612460, 95020213, 383624…
## $ intgross <dbl> 42195766, 40868994, 158607035, 132493015, 95020213, 1458…
## $ code <chr> "2013FAIL", "2012PASS", "2013FAIL", "2013FAIL", "2013FAI…
## $ budget_2013 <int> 13000000, 45658735, 20000000, 61000000, 40000000, 225000…
## $ domgross_2013 <dbl> 25682380, 13611086, 53107035, 75612460, 95020213, 383624…
## $ intgross_2013 <dbl> 42195766, 41467257, 158607035, 132493015, 95020213, 1458…
## $ period_code <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ decade_code <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
There are ___ such movies.
The financial variables we’ll focus on are the following:
budget_2013: Budget in 2013 inflation-adjusted dollars
domgross_2013: Domestic gross (US) in 2013 inflation-adjusted dollars
intgross_2013: Total international (i.e., worldwide) gross in 2013 inflation-adjusted dollars
We’ll also use the binary and clean_test variables for grouping.
Let’s take a look at how median budget and gross vary by whether the movie passed the Bechdel test, which is stored in the binary variable.
bechdel00_13 %>%
  group_by(binary) %>%
  summarise(med_budget = median(budget_2013),
            med_domgross = median(domgross_2013, na.rm = TRUE),
            med_intgross = median(intgross_2013, na.rm = TRUE))
## # A tibble: 2 × 4
## binary med_budget med_domgross med_intgross
## <chr> <dbl> <dbl> <dbl>
## 1 FAIL 46231854 56199150. 106399210
## 2 PASS 29955246 41913460 75682603
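The gross medians above are computed with `na.rm = TRUE`. The sketch below, on a small made-up table (the `toy` data and the `gross` column are hypothetical), illustrates why that argument matters: without it, a single missing value makes the group's median `NA`.

```r
library(dplyr)

# Hypothetical toy data with one missing gross value in the FAIL group
toy <- tibble(
  binary = c("FAIL", "FAIL", "FAIL", "PASS", "PASS"),
  gross  = c(10, 30, NA, 5, 15)
)

toy_summary <- toy %>%
  group_by(binary) %>%
  summarise(
    med_default = median(gross),                # NA if any value in the group is missing
    med_na_rm   = median(gross, na.rm = TRUE)   # missing values are dropped first
  )

toy_summary
```

In this toy case the FAIL group has an `NA` median by default and a median of 20 once missing values are removed, while the PASS group gives 10 either way.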
Next, let’s take a look at how median budget and gross vary by a more detailed indicator of the Bechdel test result. This information is stored in the clean_test variable, which takes on the following values:
ok = passes test
dubious
men = women only talk about men
notalk = women don’t talk to each other
nowomen = fewer than two women
bechdel00_13 %>%
  # group_by(___) %>%
  summarise(med_budget = median(budget_2013),
            med_domgross = median(domgross_2013, na.rm = TRUE),
            med_intgross = median(intgross_2013, na.rm = TRUE))
## # A tibble: 1 × 3
## med_budget med_domgross med_intgross
## <dbl> <dbl> <dbl>
## 1 36145608. 48120751 92571961
To evaluate how return on investment varies among movies that pass and fail the Bechdel test, we’ll first create a new variable called roi, the ratio of gross (international plus domestic, as computed in the code below) to budget.
bechdel00_13 <- bechdel00_13 %>%
  mutate(roi = (intgross_2013 + domgross_2013) / budget_2013)
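As a quick check of the arithmetic, the same `mutate()` call can be run on two made-up movies (the titles and dollar figures below are hypothetical, chosen so the ratios are easy to verify by hand).

```r
library(dplyr)

# Hypothetical toy movies; the figures are invented for illustration
toy <- tibble(
  title         = c("Low budget hit", "Big budget flop"),
  budget_2013   = c(1e6, 1e8),
  domgross_2013 = c(2e7, 3e7),
  intgross_2013 = c(3e7, 5e7)
)

# Same formula as above: (international gross + domestic gross) / budget
toy <- toy %>%
  mutate(roi = (intgross_2013 + domgross_2013) / budget_2013)

toy %>%
  select(title, roi)
```

Here the first movie returns (3e7 + 2e7) / 1e6 = 50 times its budget and the second (5e7 + 3e7) / 1e8 = 0.8, which shows how a small budget in the denominator can produce a very large roi.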
Let’s see which movies have the highest return on investment.
bechdel00_13 %>%
  arrange(desc(roi)) %>%
  select(title, roi, year)
## # A tibble: 1,278 × 3
## title roi year
## <chr> <dbl> <int>
## 1 Paranormal Activity 671. 2007
## 2 Napoleon Dynamite 227. 2004
## 3 Once 190. 2006
## 4 The Devil Inside 155. 2012
## 5 Primer 142. 2004
## 6 Fireproof 134. 2008
## 7 Saw 132. 2004
## 8 My Big Fat Greek Wedding 119. 2002
## 9 Insidious 103. 2010
## 10 Paranormal Activity 2 87.4 2010
## # … with 1,268 more rows
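The ranking step combines `arrange()` with `desc()` to sort from largest to smallest, then `select()` to keep only the columns of interest. A minimal sketch on made-up values (the `toy` table is hypothetical):

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(
  title = c("A", "B", "C"),
  roi   = c(2.5, 40, 0.8),
  year  = c(2001, 2005, 2010)
)

# Sort by roi in descending order, then keep three columns
ranked <- toy %>%
  arrange(desc(roi)) %>%
  select(title, roi, year)

ranked
```

Without `desc()`, `arrange(roi)` would sort in ascending order and put the lowest return first.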
Below is a visualization of the return on investment by test result. However, it’s difficult to see the distributions due to a few extreme observations.
ggplot(data = bechdel00_13,
       mapping = aes(x = clean_test, y = roi, color = binary)) +
  geom_boxplot() +
  labs(title = "Return on investment vs. Bechdel test result",
       x = "Detailed Bechdel result",
       y = "___",
       color = "Binary Bechdel result")
What are those movies with very high returns on investment?
bechdel00_13 %>%
  filter(roi > 150) %>%
  select(title, budget_2013, domgross_2013, year)
## # A tibble: 4 × 4
## title budget_2013 domgross_2013 year
## <chr> <int> <dbl> <int>
## 1 The Devil Inside 1014639 54041622 2012
## 2 Paranormal Activity 505595 121251476 2007
## 3 Once 173369 10917487 2006
## 4 Napoleon Dynamite 493277 54927590 2004
Zooming in on the movies with roi < ___ provides a better view of how the medians across the categories compare:
ggplot(data = bechdel00_13,
       mapping = aes(x = clean_test, y = roi, fill = binary)) +
  geom_boxplot() +
  labs(title = "___",
       subtitle = "___", # Something about zooming in to a certain level
       x = "___",
       y = "___",
       fill = "___") +
  coord_cartesian(ylim = c(0, 15)) # zooming in without dropping data
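The comment “zooming in without dropping data” points at a difference between `coord_cartesian(ylim = ...)` and setting limits on the scale with `ylim()`. The sketch below uses a made-up five-value table (the `toy` data is hypothetical) to illustrate it: with `coord_cartesian()` the boxplot statistics are still computed from all values, whereas `ylim()` removes out-of-range values before the statistics are computed.

```r
library(ggplot2)

# Hypothetical toy data with one extreme value
toy <- data.frame(group = "a", value = c(1, 2, 3, 4, 100))

base_plot <- ggplot(toy, aes(x = group, y = value)) +
  geom_boxplot()

# Zoom only: the extreme value still contributes to the box statistics
zoomed <- base_plot + coord_cartesian(ylim = c(0, 15))

# Scale limits: the value 100 is removed before the statistics are computed
clipped <- base_plot + ylim(0, 15)

# layer_data() returns the computed boxplot statistics; `middle` is the median
median_zoomed  <- layer_data(zoomed)$middle
median_clipped <- suppressWarnings(layer_data(clipped)$middle)

median_zoomed
median_clipped
```

In this toy case the zoomed plot keeps the median of all five values, while the clipped plot reports the median of the four remaining values, so the two boxes would differ even though the visible y-range is the same.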
This assignment was adapted from the Bechdel exercise.