Section 4: Statistical Relationships

Sam Frederick

Columbia University

2/21/23

Today’s Section

Correlation Coefficients
Law of Large Numbers
The Central Limit Theorem
Hypothesis Testing

Correlation Coefficients

Statistically summarize the linear relationship between two numeric variables

Range from -1 to 1

Values closer to -1 or 1 indicate stronger linear relationships; values near 0 indicate a weak or no linear relationship
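
A quick illustration of these properties, as a minimal sketch on small made-up vectors (the data and variable names are purely illustrative). It also shows that the coefficient only captures linear association:

```r
# Illustrative data only; none of these vectors come from the section's examples
x <- 1:10
cor_pos <- cor(x, 2 * x + 1)   # perfect positive linear relationship
cor_neg <- cor(x, -3 * x)      # perfect negative linear relationship

# A strong but non-linear (U-shaped) relationship can still have correlation 0
u <- -5:5
cor_curve <- cor(u, u^2)

c(positive = cor_pos, negative = cor_neg, curved = cor_curve)
```

The first two values sit at the ends of the range, while the curved relationship yields 0 even though u^2 is fully determined by u.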

Correlation Coefficients

\(Cor(X, Y) = \frac{Cov(X,Y)}{\sqrt{Var(X) \cdot Var(Y)}} = \frac{\frac{1}{n-1}\sum_{i=1}^n(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\frac{1}{n-1}\sum_{i=1}^n(X_i-\bar{X})^2 \cdot \frac{1}{n-1} \sum_{i=1}^n(Y_i-\bar{Y})^2}}\)

Equivalently:

\(Cor(X,Y) = \frac{\sum_{i=1}^n (X_i-\bar{X})(Y_i-\bar{Y})}{ \sqrt{\sum_{i=1}^n(X_i-\bar{X})^2 \cdot \sum_{i=1}^n(Y_i-\bar{Y})^2}}\)
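
Why the two expressions agree (a short derivation sketch): the two \(\frac{1}{n-1}\) factors under the square root multiply to \(\frac{1}{(n-1)^2}\), and its square root, \(\frac{1}{n-1}\), cancels the \(\frac{1}{n-1}\) in the numerator:

\(\frac{\frac{1}{n-1}\sum_{i=1}^n(X_i-\bar{X})(Y_i-\bar{Y})}{\frac{1}{n-1}\sqrt{\sum_{i=1}^n(X_i-\bar{X})^2 \cdot \sum_{i=1}^n(Y_i-\bar{Y})^2}} = \frac{\sum_{i=1}^n(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^n(X_i-\bar{X})^2 \cdot \sum_{i=1}^n(Y_i-\bar{Y})^2}}\)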

Correlation Coefficients in R

x <- c(1, 3, 5, 7, 9)
y <- c(1, 4, 7, 2 , 10)

By hand:

numerator <- sum((x-mean(x))*(y - mean(y)))
denominator <- sqrt(sum((x-mean(x))^2)*sum((y - mean(y))^2))
round(numerator/denominator, 3)

[1] 0.683

Using R functions:

round(cov(x, y)/sqrt(var(x)*var(y)), 3)

[1] 0.683

Using one R function:

round(cor(x, y), 3)

[1] 0.683

Custom Functions in R

name <- function(arguments){
  tasks
  return(output)
}
name(arguments)

Custom Functions in R

square <- function(x) {
  return(x^2)
}
square(2)

[1] 4

square(4)

[1] 16

Custom Functions in R

correlation <- function(x, y) {
  numerator <- sum((x-mean(x))*(y - mean(y)))
  denominator <- sqrt(sum((x-mean(x))^2)*sum((y - mean(y))^2))
  return(round(numerator/denominator, 3))
}
correlation(x,y)

[1] 0.683

Law of Large Numbers

As \(n \to \infty\), the sample mean approaches the true population mean: \(\bar{X}_n \to \mu\)
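
A small simulation can make this concrete. The sketch below assumes a normal population with mean 15 and standard deviation 10; the seed, sample sizes, and variable names are arbitrary choices for illustration:

```r
set.seed(42)                       # arbitrary seed, for reproducibility
mu <- 15                           # assumed true population mean
draws <- rnorm(100000, mean = mu, sd = 10)

# Sample mean using the first n draws, for increasing n
sample_sizes <- c(10, 100, 1000, 10000, 100000)
running_means <- sapply(sample_sizes, function(n) mean(draws[1:n]))

# Absolute distance from the true mean tends to shrink as n grows
data.frame(n = sample_sizes, sample_mean = running_means,
           abs_error = abs(running_means - mu))
```

The error need not fall at every single step, but it tends toward 0 as n increases.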

Central Limit Theorem

As \(n \to \infty\), \(\sqrt{n}(\bar{X}_n-\mu) \to N(0, \sigma^2)\)

Central Limit Theorem

Equivalently, as \(n \to \infty\), \(\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \to N(0, 1)\)
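
The theorem holds even when the population itself is far from normal. As a minimal sketch, assume an Exponential(rate = 1) population, which is strongly skewed and has mean 1 and standard deviation 1; the sample size, number of repetitions, and seed are arbitrary:

```r
set.seed(42)            # arbitrary seed
n <- 200                # size of each sample
reps <- 5000            # number of repeated samples
mu <- 1; sigma <- 1     # mean and sd of an Exponential(rate = 1) population

# Standardized sample mean for each repeated sample
z <- replicate(reps, (mean(rexp(n, rate = 1)) - mu) / (sigma / sqrt(n)))

c(mean = mean(z), sd = sd(z))   # should be near 0 and 1
```

Plotting a histogram of z (for example with hist(z)) should show a roughly bell-shaped curve centered at 0.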

Hypothesis Testing

Null Hypothesis:
- Usually that the true population mean equals some value (\(\mu = \mu_0\))
- e.g., the true approval rate of Joe Biden is 50%
- e.g., the difference in mean conservatism between Democrats and Republicans is 0
Alternative Hypothesis:
- Two-sided: \(\mu \neq \mu_0\)
- One-sided: \(\mu > \mu_0\) or \(\mu < \mu_0\)

Hypothesis Testing

Calculate the Z-score based on the null hypothesis:
- \(Z = \frac{\bar{X} - \mu_{0}}{\sigma/\sqrt{n}}\)
- Two-Sample/Difference-in-Means Test:
  - \(Z = \frac{(\bar{X} - \bar{Y}) - (\mu_{0x}-\mu_{0y})}{\sqrt{\sigma_x^2/n_x + \sigma_y^2/n_y}}\)
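
A worked one-sample example, as a sketch with entirely hypothetical numbers: suppose a poll of 1,000 people finds 53% approval and the null hypothesis is 50%. For a 0/1 variable, the standard deviation under the null is \(\sqrt{\mu_0(1-\mu_0)}\):

```r
# Hypothetical poll: all numbers below are made up for illustration
x_bar <- 0.53    # observed sample approval rate
mu_0  <- 0.50    # value under the null hypothesis
n     <- 1000    # sample size
sigma <- sqrt(mu_0 * (1 - mu_0))  # sd of a 0/1 variable under the null

z <- (x_bar - mu_0) / (sigma / sqrt(n))
round(z, 3)
```

Here the sample mean lies about 1.9 standard errors above the hypothesized value.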

Hypothesis Testing

Under the Central Limit Theorem, the Z-score should be distributed approximately standard normal
- if we repeated the sampling process many times with a large enough sample
- and if the null hypothesis is true

Hypothesis Testing

The standard normal distribution has known properties
- Calculate the probability of observing a Z-score at least as extreme as the observed Z-score
  - if the null hypothesis is true
- This is the p-value
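
In R, these tail probabilities come from pnorm(). A minimal sketch, assuming a hypothetical observed Z-score of 1.96:

```r
z <- 1.96   # hypothetical observed Z-score

# One-sided: probability of a Z-score at least this large under the null
p_one_sided <- pnorm(z, lower.tail = FALSE)

# Two-sided: probability of a Z-score at least this far from 0 in either direction
p_two_sided <- 2 * pnorm(abs(z), lower.tail = FALSE)

round(c(one_sided = p_one_sided, two_sided = p_two_sided), 3)
```

The two-sided p-value is twice the one-sided one because the standard normal distribution is symmetric around 0.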

Hypothesis Testing: Example

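library(tidyverse)  # provides read_csv(), filter(), drop_na(), and %>%

# Keep Democrats (100) and Republicans (200) in the 118th Congress
nominate <- read_csv("~/Downloads/HSall_members.csv") %>%
  filter(congress == 118 & party_code %in% c(100, 200) &
           chamber != "President")
dem <- nominate %>% filter(party_code == 100) %>% drop_na(nominate_dim1)
rep <- nominate %>% filter(party_code == 200) %>% drop_na(nominate_dim1)
# Difference in mean nominate_dim1 scores (Republicans minus Democrats)
diff_in_means <- mean(rep$nominate_dim1) - mean(dem$nominate_dim1)
# Standard error of the difference in means
denominator <- sqrt((var(rep$nominate_dim1) / nrow(rep)) +
                      (var(dem$nominate_dim1) / nrow(dem)))
z_score <- diff_in_means / denominator
round(z_score, 3)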

[1] 73.525

Hypothesis Testing: Example

ggplot()+
  stat_function(fun = dnorm) + 
  xlim(c(-3, 3)) + 
  labs(x = "X", y = "Density", 
       title = "Standard Normal Distribution")

Hypothesis Testing: Example

ggplot()+
  stat_function(fun = dnorm) + 
  xlim(c(-100,100)) + 
  labs(x = "X", y = "Density", 
       title = "Standard Normal Distribution") + 
  geom_vline(xintercept = z_score, color = "red", lty = "dashed")

Hypothesis Testing: Example

One-sided test (alternative hypothesis: \(\mu_{R} - \mu_{D} > 0\))

1-pnorm(z_score, mean = 0, sd = 1)

[1] 0

Two-sided test (alternative hypothesis: \(\mu_{R} - \mu_{D} \neq 0\))

2*(1-pnorm(abs(z_score), mean = 0, sd = 1))

[1] 0
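
A printed p-value of exactly 0 is floating-point rounding rather than a true zero: for a Z-score this large, pnorm() returns a value indistinguishable from 1. One way to see how small such tail probabilities are is through the lower.tail and log.p arguments of pnorm(). The sketch below hard-codes the Z-score of 73.525 from the example and uses z = 10 as a made-up smaller case:

```r
z_big <- 73.525   # Z-score reported in the example above

# Natural log of the upper-tail probability, which avoids underflow to 0
log_p <- pnorm(z_big, lower.tail = FALSE, log.p = TRUE)

# With a smaller (hypothetical) Z-score the difference is visible directly
naive    <- 1 - pnorm(10)                  # rounds to exactly 0
accurate <- pnorm(10, lower.tail = FALSE)  # tiny but positive

c(log_p = log_p, naive = naive, accurate = accurate)
```

Either way the conclusion is the same: the p-value is far below any conventional significance threshold.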

Hypothesis Testing: T-Distribution

We often use the t-distribution instead of the normal distribution
- especially with small sample sizes
- the t-distribution places more probability in the tails
- in large samples, the t-distribution is approximately equal to the normal distribution
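
Comparing critical values shows both the heavier tails and the large-sample convergence. A minimal sketch; the degrees of freedom below are illustrative choices:

```r
# 97.5th percentile (the cutoff for a two-sided 5% test) under each distribution
df_values <- c(5, 30, 1000)   # illustrative degrees of freedom
t_cutoffs <- qt(0.975, df = df_values)
z_cutoff  <- qnorm(0.975)

# t cutoffs are larger (heavier tails) and approach the normal cutoff as df grows
round(c(setNames(t_cutoffs, paste0("t_df", df_values)), normal = z_cutoff), 3)
```

With few degrees of freedom, a test statistic has to be noticeably larger to reach the same significance level.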

T Distribution

Hypothesis Testing

In R, we can use the t.test() function

t.test(rep$nominate_dim1, dem$nominate_dim1)


    Welch Two Sample t-test

data:  rep$nominate_dim1 and dem$nominate_dim1
t = 73.525, df = 500.98, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.8825648 0.9310273
sample estimates:
 mean of x  mean of y 
 0.5222045 -0.3845916
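
The t statistic reported here (73.525) matches the Z-score computed by hand earlier, because the Welch test uses the same difference-in-means formula; only the reference distribution differs. A minimal check on simulated data (the group sizes, means, and the helper name welch_stat are hypothetical, not taken from the NOMINATE data):

```r
set.seed(42)                                    # arbitrary seed
group_a <- rnorm(220, mean = 0.5, sd = 0.15)    # simulated, not the NOMINATE data
group_b <- rnorm(210, mean = -0.4, sd = 0.12)

# Difference in means divided by its estimated standard error
welch_stat <- function(x, y) {
  (mean(x) - mean(y)) / sqrt(var(x) / length(x) + var(y) / length(y))
}

by_hand <- welch_stat(group_a, group_b)
from_t_test <- unname(t.test(group_a, group_b)$statistic)
all.equal(by_hand, from_t_test)
```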

Confidence Intervals

We often want some idea of the uncertainty in our estimates
Use the Central Limit Theorem to construct “confidence intervals” around our estimates
Lower end: \(\text{sample estimate} - qnorm(0.975) \times \text{standard error}\)
Upper end: \(\text{sample estimate} + qnorm(0.975) \times \text{standard error}\)
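
These two formulas can be wrapped in a small function. The sketch below is a hypothetical helper (ci_mean is not a built-in R function) for the mean of a numeric vector, using the normal approximation and made-up data:

```r
# Hypothetical helper: normal-approximation confidence interval for a mean
ci_mean <- function(x, level = 0.95) {
  se <- sd(x) / sqrt(length(x))          # standard error of the sample mean
  z  <- qnorm(1 - (1 - level) / 2)       # 0.975 quantile when level = 0.95
  c(lower = mean(x) - z * se, upper = mean(x) + z * se)
}

scores <- c(2, 4, 6, 8)   # made-up data
ci <- ci_mean(scores)
ci
```

The interval is symmetric around the sample mean, and a higher level (for example level = 0.99) gives a wider interval.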

Confidence Intervals: Example

dem_mean <- mean(dem$nominate_dim1)
dem_mean - qnorm(0.975)*sd(dem$nominate_dim1)/sqrt(nrow(dem))

[1] -0.399397

dem_mean

[1] -0.3845916

dem_mean + qnorm(0.975)*sd(dem$nominate_dim1)/sqrt(nrow(dem))

[1] -0.3697862

Confidence Intervals: Example

Confidence Intervals

Most common: 95% “Confidence Intervals”
- Does NOT mean there is a 95% chance that the true population value is in the particular interval we computed

Real meaning: if we repeated the sampling process many times, about 95% of the resulting 95% confidence intervals would contain the true population value

Confidence Intervals

set.seed(123)
n <- 100000
pop <- rnorm(n, 15, 10)
samp <- sample(pop, size = 100)
samp_mean <- mean(samp)
lb <- samp_mean - qnorm(0.975, mean = 0, sd = 1)*sd(samp)/sqrt(100)
ub <- samp_mean +qnorm(0.975, mean= 0, sd = 1)*sd(samp)/sqrt(100)
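
The repeated-sampling interpretation can be checked by simulation. As a sketch, the code below draws each sample directly from a normal distribution with mean 15 and standard deviation 10 (a simplifying assumption, rather than sampling from a finite population vector) and counts how often the interval covers the true mean; the number of repetitions is arbitrary:

```r
set.seed(123)
mu <- 15; sigma <- 10   # assumed true population mean and sd
n <- 100                # size of each sample
reps <- 1000            # number of repeated samples (arbitrary)

# For each repeated sample, record whether the 95% interval contains mu
covered <- replicate(reps, {
  samp <- rnorm(n, mean = mu, sd = sigma)
  half_width <- qnorm(0.975) * sd(samp) / sqrt(n)
  (mean(samp) - half_width <= mu) && (mu <= mean(samp) + half_width)
})

mean(covered)   # share of intervals containing mu; should be near 0.95
```

Any single interval either contains the true mean or it does not; the 95% describes the long-run share of intervals that do.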

Confidence Intervals

Recap

The Central Limit Theorem (and the Law of Large Numbers) are central to many scientific tasks
Used for calculating p-values, hypothesis testing, and constructing confidence intervals
p-value: probability of observing a Z-score/t-statistic at least as extreme as the one actually observed if the null hypothesis is true

Recap

Confidence Intervals:
- lower bound: \(\text{sample estimate} - qnorm(0.975) \times \text{standard error}\)
- upper bound: \(\text{sample estimate} + qnorm(0.975) \times \text{standard error}\)
A 95% confidence interval does NOT mean there is a 95% chance that the true population value is in the particular interval we computed
- Only that, if we repeated the sampling process many times, roughly 95% of the intervals would contain the true population mean