Getting Started

Statistical Process Terminology

Part 1: Population vs. Sample

Go to the Monmouth University Polling Institute website and select a poll of interest. Briefly read the poll results and the methodology section at the end. Then try to identify the following items for your chosen poll. An example from one poll is given below:

  • Title: Steady Support for Russia Sanctions

  • Population of interest: American adults (age 18 and older)

  • Parameter of interest: Proportion supporting the economic sanctions imposed on Russia in response to its invasion of Ukraine

  • Sample: A nationwide random sample of adults age 18 and older

  • Sample size: 807

  • Sample statistic: Proportion supporting the economic sanctions imposed on Russia in response to its invasion of Ukraine

  • Sample statistic’s value: 77%

Part 2: Scientific Studies1

Researchers aim to evaluate the effects of prenatal exposure to air pollutants on neurodevelopmental and behavioral development in children. They recruited a total of 700 African American and Dominican women from two prenatal clinics in Northern Manhattan who satisfied the following inclusion criteria:

  • enrolled in a longitudinal birth cohort as a mother-child dyad
  • aged 18-35
  • had first prenatal visit before the 20th week of gestation
  • were free of diabetes, hypertension, or known HIV
  • did not report tobacco or illicit drug use
  • resided in the study area for at least one year prior to pregnancy

For neurodevelopmental and behavioral development in children, an IQ test was administered to children at age seven.

Q - What type of study is it? What kind of conclusions can we make?

Q - Identify a response variable, explanatory variables, and confounding variables.

Q - Can we extend findings in the study to a broader population, for instance, mother-child dyads across the US?

Q - What could have been done differently for a more representative sample?

Data Generative Process

Your friend wants to play a game. The game goes like this: you flip a coin. If it lands tails, the turn passes to your friend. If it lands heads, you keep playing.

We will say a Tails is 0 and a Heads is 1. To be more precise, we will say:

Let \(X\) be a random variable that maps the outcome (“Heads”, “Tails”) to the numbers (1, 0) respectively.

A random variable takes each outcome (possibly a categorical variable) in the sample space and maps it to a number.
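To make the mapping concrete, here is a minimal sketch in R. The vector `outcomes` is made up purely for illustration and is not real data, and the names `outcomes` and `x` are assumptions introduced here.

```r
# Hypothetical outcomes, for illustration only (not real data)
outcomes <- c("Heads", "Tails", "Tails", "Heads", "Heads")

# The random variable X maps "Heads" to 1 and "Tails" to 0
x <- ifelse(outcomes == "Heads", 1, 0)
x

# Because heads are coded as 1, summing the values counts the heads
sum(x)
```

Coding heads as 1 is what later lets us count heads by simply adding up the flips.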

Let’s flip a real coin 10 times with your friend and see how this game goes.

Part 3

Input the data.

# coin_flips <- c()

Q - Do you think this coin is fair? Why?

Let’s come up with a data generative process, a model for how the data were generated.

We might imagine that \(X\) (the outcome of an individual coin flip, represented as 0 or 1) is obeying some universal law. Each time I flip the coin, there is a probability \(p\) that the coin lands heads. What is the probability the coin lands tails?

This probability \(p\) is a fixed property of the coin. Some example scenarios:

  • the coin is fair, \(p = 0.5\)
  • the coin has the same face on both sides: \(p = 0\) (two tails) or \(p = 1\) (two heads)
  • the coin is weighted slightly in favor of tails: \(p = 0.4\)

We say “\(X\) is Bernoulli distributed with parameter \(p\).”

\[ X \sim \text{Bernoulli}(p) \]

This means \(P(X = 1) = p\) and \(P(X = 0) = 1-p\).
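As a small check of these two probabilities, the sketch below uses `dbinom()` with a single trial, since a Bernoulli variable is a Binomial with one trial. The value `p <- 0.4` is taken from the weighted-coin scenario above, and the names `p_heads` and `p_tails` are introduced here only for illustration.

```r
p <- 0.4  # the weighted-coin scenario listed above

# A Bernoulli(p) variable is a Binomial with a single trial (size = 1)
p_heads <- dbinom(1, size = 1, prob = p)  # P(X = 1), which equals p
p_tails <- dbinom(0, size = 1, prob = p)  # P(X = 0), which equals 1 - p

c(heads = p_heads, tails = p_tails)
```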

Now we can frame questions about the coin being fair in terms of \(p\).

Part 4

Let’s flip the coin another 10 times.

# coin_flips <- c(coin_flips, )

What's the probability of this outcome if the coin is fair?

To answer this question, we need the distribution of the number of heads in 20 flips of a fair coin.

We can define \(Y\) to be the number of heads. In other words, \(Y = \sum_{i=1}^{n} X_i = X_1 + X_2 + \cdots + X_{n}\) where \(n\) is the total number of coin flips. In our case, \(n=20\).
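The relationship between the individual flips and \(Y\) can be sketched in R as follows. This is a simulated illustration, not your recorded data; the seed is arbitrary and the names `x` and `y` are assumptions introduced here.

```r
set.seed(1)  # arbitrary seed, chosen only for reproducibility

n <- 20   # total number of coin flips
p <- 0.5  # probability of heads for a fair coin

# n independent Bernoulli(p) flips: each X_i is 0 (tails) or 1 (heads)
x <- rbinom(n, size = 1, prob = p)

# Y is the sum of the X_i, that is, the number of heads
y <- sum(x)
y
```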

We say “\(Y\) follows a Binomial distribution with parameters \(n\) and \(p\)”.

\[ Y \sim \text{Binomial}(n, p) \]

This means:

\[ P(Y = y) = {n \choose y} p^{y} (1-p)^{n-y} \]

Notice:

  • the number of heads observed is lower case \(y\)
  • \(n-y\) is the number of tails
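One way to see the formula in action is to evaluate it term by term and compare it with R's built-in `dbinom()`. The value `y <- 12` is a hypothetical number of heads chosen only for illustration, and `by_hand` and `built_in` are names introduced here.

```r
n <- 20
p <- 0.5
y <- 12  # hypothetical number of heads, for illustration only

# The Binomial formula written out term by term
by_hand <- choose(n, y) * p^y * (1 - p)^(n - y)

# R's built-in Binomial probability mass function
built_in <- dbinom(y, size = n, prob = p)

c(by_hand = by_hand, built_in = built_in)
```

The two values should agree, which suggests that `dbinom()` can stand in for the formula in later calculations.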

Part 5

Once we figure out the distribution, we can either calculate the probability analytically or through simulations. For analytical results, we can simply replace \(y\) with the number of heads that we want to compute the probability for.

Let’s practice simulating coin flips. The code below simulates 100,000 coin flips where each flip is Bernoulli distributed with \(p = 0.5\). Note that \(\text{Bernoulli}(p) \stackrel{d}{=} \text{Binomial}(1, p)\).

set.seed(527)
flips <- rbinom(100000, 1, 0.5)
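
As a quick sanity check on a simulation like this, one might look at the proportion of heads among the simulated flips, which should land close to \(p = 0.5\). The sketch below repeats the simulation so that it runs on its own; `prop_heads` is a name introduced here for illustration.

```r
set.seed(527)
flips <- rbinom(100000, 1, 0.5)

# Proportion of heads among the simulated flips; should be close to p = 0.5
prop_heads <- mean(flips)
prop_heads

# Counts of tails (0) and heads (1)
table(flips)
```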

We can also repeat a 20-flip experiment many times. The code below simulates the number of heads in 20 coin flips, repeated 50,000 times. Each element of total_head is a draw of \(Y\), which follows a Binomial distribution with \(n = 20\) and \(p = 0.5\).

set.seed(527)
total_head <- rbinom(50000, 20, 0.5)
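
To illustrate how a simulation like this can approximate a probability, the sketch below estimates \(P(Y = 10)\) as the share of simulated experiments with exactly 10 heads and compares it with the analytical value. The choice of 10 heads is arbitrary, and `sim_prob` and `exact_prob` are names introduced here for illustration.

```r
set.seed(527)
total_head <- rbinom(50000, 20, 0.5)

# Simulation-based estimate of P(Y = 10):
# the share of simulated experiments with exactly 10 heads
sim_prob <- mean(total_head == 10)

# Analytical value from the Binomial(20, 0.5) distribution
exact_prob <- dbinom(10, size = 20, prob = 0.5)

c(simulated = sim_prob, exact = exact_prob)
```

With 50,000 repetitions the two numbers should be close but not identical, because the simulated value carries random error.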

Q - What does rbinom() do?

Practice

  1. Using total_head, find the probability of seeing 15 tails out of 20 coin flips.

  2. Compare the above results based on simulation with analytical results using dbinom().

  3. Plot a histogram of the simulation. What does this show?

library(tidyverse)

  4. What is the probability of seeing at least 15 tails?

  5. Verify the results in (4) with analytical results using pbinom().

  6. Using total_head, find the probability of seeing your result out of 20 coin flips.
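
When working with "at least" probabilities, it helps to remember that `pbinom(q, ...)` returns \(P(Y \le q)\), and with `lower.tail = FALSE` it returns \(P(Y > q)\). The sketch below shows three equivalent ways to compute \(P(Y \ge 12)\) for a hypothetical cutoff of 12 heads. The cutoff and the names `via_sum`, `via_complement`, and `via_upper` are assumptions introduced here, not part of the exercise.

```r
n <- 20
p <- 0.5
k <- 12  # hypothetical cutoff, for illustration only

# P(Y >= k) as a sum of point probabilities
via_sum <- sum(dbinom(k:n, size = n, prob = p))

# P(Y >= k) = 1 - P(Y <= k - 1)
via_complement <- 1 - pbinom(k - 1, size = n, prob = p)

# lower.tail = FALSE gives P(Y > q), so use q = k - 1
via_upper <- pbinom(k - 1, size = n, prob = p, lower.tail = FALSE)

c(via_sum, via_complement, via_upper)
```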

Submitting Application Exercises

  • Once you have completed the activity, push your final changes to your GitHub repo.
  • Make sure you have committed at least three times.
  • Check that your repo is updated on GitHub, and that’s all you need to do to submit application exercises for participation.