Section 7. Tables, Chi-Squared Tests, and Correlations
Preliminary Data Cleaning
First, we want to read in our data from the Cooperative Election Study conducted in 2020. We select the three variables we will use for our analyses (ideo5, pid3, and commonweight). Next, we use the mutate() function to give our ideology and party variables comprehensible labels.
# read_csv(), select(), mutate(), and %>% all come from the tidyverse
library(tidyverse)
ces <- read_csv("ces2020_example.csv")
ces <- ces %>%
  select(ideo5, pid3, commonweight) %>%
  mutate(ideology = factor(ideo5,
                           labels = c("Very liberal",
                                      "Liberal",
                                      "Moderate",
                                      "Conservative",
                                      "Very Conservative",
                                      "Not Sure")),
         party = factor(pid3,
                        labels = c("Democrat",
                                   "Republican",
                                   "Independent",
                                   "Other",
                                   "Not Sure")))
Review
Review of Hypothesis Testing
We usually start with our hypothesis, which is known as the alternative hypothesis, that some statistic is not equal to 0 or is greater than or less than 0. Implicit in this hypothesis is what is known as the null hypothesis: that the statistic we are interested in is equal to 0.
There are two main types of alternative hypotheses: two-sided and one-sided. Two-sided hypotheses are of the form: \(A\not=B\).
For example:
- \(H_0\): The difference between the means for group A and group B is 0.
- \(H_A\): The difference between the means for group A and group B is not 0.
One-sided hypotheses are of the form: \(A>B\) or \(A<B\).
For example:
- \(H_0\): The difference between the means for group A and group B is 0.
- \(H_A\): The mean for group A is greater than the mean for group B.
Review of t-tests
Under the Central Limit Theorem, the statistic \(Z = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}}\) is distributed approximately standard normal as the sample size increases.
This allows us to conduct a test of the null hypothesis (as discussed above), under the assumption that the null hypothesis is true. If our null hypothesis is true, the statistic \(Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}\), where \(\mu_0\) is the population value under the null hypothesis, will be distributed approximately standard normal with larger sample sizes.
We can calculate this statistic using our observed data, and as a result of the Central Limit Theorem, we can calculate the probability of observing a value of this statistic at least as large in magnitude as the one we actually observe if the null hypothesis is true.
This probability, of observing a value at least as extreme as the one actually observed if the null hypothesis is true, is known as the p-value.
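As a rough illustration of the steps above, the following sketch computes the Z statistic and its p-values for a simulated sample. The data, the null value mu0, and the "known" sigma are all assumptions made up for this example; they do not come from the CES data.

```r
# Simulated sample; all values here are made up for illustration.
set.seed(1)
x <- rnorm(100, mean = 0.3, sd = 1)

mu0   <- 0   # population mean under the null hypothesis (assumed)
sigma <- 1   # population standard deviation, treated as known here (assumed)

# Z statistic: (sample mean - null value) / standard error
z <- (mean(x) - mu0) / (sigma / sqrt(length(x)))

# Two-sided p-value: probability of a statistic at least this large in magnitude
p_two_sided <- 2 * pnorm(abs(z), lower.tail = FALSE)

# One-sided p-value for the alternative that the mean is greater than mu0
p_greater <- pnorm(z, lower.tail = FALSE)

z
p_two_sided
p_greater
```

The two-sided p-value counts both tails of the standard normal distribution, while the one-sided p-value counts only the tail in the direction of the alternative hypothesis.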
The t-distribution
Because we often have small sample sizes and usually don’t know the true value of \(\sigma\) (so we must estimate it from the sample), we usually use the t-distribution instead of the normal distribution.
The t-distribution is more conservative than the normal distribution (there is more probability in the tails of the distribution, so it is less likely we reject the null hypothesis using the t-distribution). This helps us account for the uncertainty from the small sample sizes and our estimation of the standard deviation.
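One way to see the heavier tails is to compare critical values. This short sketch uses base R's qnorm() and qt(); the degrees of freedom chosen are arbitrary illustrations.

```r
# Critical value for a two-sided test at the 5% level under the normal distribution
z_crit <- qnorm(0.975)   # about 1.96

# The same critical value under t-distributions with 5, 30, and 1000 degrees of freedom
t_crit <- qt(0.975, df = c(5, 30, 1000))   # about 2.57, 2.04, 1.96

z_crit
t_crit
```

The t critical values are always larger than the normal one, so a larger test statistic is needed to reject the null hypothesis, and the gap shrinks as the degrees of freedom grow.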
Weighted t-tests
It is important to remember that we have to apply survey weights when we are analyzing survey data. A raw survey sample may not be representative of the population; applying the weights adjusts for this so that our estimates better reflect the population.
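To make the effect of weighting concrete, here is a minimal sketch with made-up values (not CES data) comparing an unweighted mean with a weighted mean using base R's weighted.mean().

```r
# Toy responses and weights, invented for illustration only
y <- c(1, 1, 1, 5)
w <- c(0.5, 0.5, 0.5, 2.5)   # the last respondent represents more of the population

unweighted <- mean(y)               # (1 + 1 + 1 + 5) / 4 = 2
weighted   <- weighted.mean(y, w)   # sum(w * y) / sum(w) = 14 / 4 = 3.5

unweighted
weighted
```

Because the respondent with the largest weight also has the largest value, the weighted mean is pulled upward relative to the unweighted mean.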
To apply weights when we are conducting t-tests, we should use the weights package in R.
install.packages("weights")
library(weights)
# Subset to each party, dropping respondents who are "Not Sure" (ideo5 == 6)
dem.sub <- ces %>% filter(party == "Democrat" & ideo5 != 6)
rep.sub <- ces %>% filter(party == "Republican" & ideo5 != 6)
wtd.t.test(x = dem.sub$ideo5,
           y = rep.sub$ideo5,
           weight = dem.sub$commonweight,
           weighty = rep.sub$commonweight,
           samedata = FALSE,
           alternative = "two.sided")
$test
[1] "Two Sample Weighted T-Test (Welch)"
$coefficients
t.value df p.value
-28.88007 583.43253 0.00000
$additional
Difference Mean.x Mean.y Std. Err
-2.09809561 2.14947769 4.24757330 0.07264856
Tables in R
We talked a bit about tables earlier in the semester. Tables are very helpful for summarizing categorical data. We can get the number of observations that fall into a given category or the proportion of observations that fall into each category.
Tables: table() and prop.table()
We can calculate basic tables in R using the table() function, as we saw a while ago. For example, if we want to discover the number of respondents in each ideological category in each party group, we can use the table() function.
table(ces$party, ces$ideology)
Very liberal Liberal Moderate Conservative Very Conservative
Democrat 105 136 105 18 6
Republican 1 4 48 96 96
Independent 16 32 145 47 19
Other 13 4 17 5 6
Not Sure 3 3 13 3 2
Not Sure
Democrat 12
Republican 9
Independent 7
Other 1
Not Sure 27
We can also calculate the proportion of respondents in each category using the prop.table() function.
prop.table(table(ces$party, ces$ideology))
Very liberal Liberal Moderate Conservative
Democrat 0.105105105 0.136136136 0.105105105 0.018018018
Republican 0.001001001 0.004004004 0.048048048 0.096096096
Independent 0.016016016 0.032032032 0.145145145 0.047047047
Other 0.013013013 0.004004004 0.017017017 0.005005005
Not Sure 0.003003003 0.003003003 0.013013013 0.003003003
Very Conservative Not Sure
Democrat 0.006006006 0.012012012
Republican 0.096096096 0.009009009
Independent 0.019019019 0.007007007
Other 0.006006006 0.001001001
Not Sure 0.002002002 0.027027027
Remember, we can round these numbers too.
prop.table(table(ces$party, ces$ideology)) %>%
round(digits = 3)
Very liberal Liberal Moderate Conservative Very Conservative
Democrat 0.105 0.136 0.105 0.018 0.006
Republican 0.001 0.004 0.048 0.096 0.096
Independent 0.016 0.032 0.145 0.047 0.019
Other 0.013 0.004 0.017 0.005 0.006
Not Sure 0.003 0.003 0.013 0.003 0.002
Not Sure
Democrat 0.012
Republican 0.009
Independent 0.007
Other 0.001
Not Sure 0.027
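The proportions above are shares of the whole table. As a small sketch on made-up data (not the CES file), the margin argument of prop.table() gives row or column proportions instead.

```r
# Toy data, invented for illustration only
party    <- c("Democrat", "Democrat", "Democrat", "Republican", "Republican")
ideology <- c("Liberal", "Liberal", "Moderate", "Conservative", "Moderate")

tab <- table(party, ideology)

# margin = 1: each row sums to 1 (ideology shares within each party)
row_props <- prop.table(tab, margin = 1)

# margin = 2: each column sums to 1 (party shares within each ideology)
col_props <- prop.table(tab, margin = 2)

round(row_props, digits = 3)
round(col_props, digits = 3)
```

Row proportions answer "what share of Democrats are liberal?", while column proportions answer "what share of liberals are Democrats?".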
CrossTable()
There are a variety of packages that can be used to create nicer looking tables. One of these packages is gmodels. Within gmodels, there is a function called CrossTable() which can generate aesthetically pleasing cross tables.
install.packages("gmodels")
library(gmodels)
CrossTable(ces$ideology,
           ces$party,
           digits = 3,
           prop.chisq = FALSE)
Cell Contents
|-------------------------|
| N |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 999
| ces$party
ces$ideology | Democrat | Republican | Independent | Other | Not Sure | Row Total |
------------------|-------------|-------------|-------------|-------------|-------------|-------------|
Very liberal | 105 | 1 | 16 | 13 | 3 | 138 |
| 0.761 | 0.007 | 0.116 | 0.094 | 0.022 | 0.138 |
| 0.275 | 0.004 | 0.060 | 0.283 | 0.059 | |
| 0.105 | 0.001 | 0.016 | 0.013 | 0.003 | |
------------------|-------------|-------------|-------------|-------------|-------------|-------------|
Liberal | 136 | 4 | 32 | 4 | 3 | 179 |
| 0.760 | 0.022 | 0.179 | 0.022 | 0.017 | 0.179 |
| 0.356 | 0.016 | 0.120 | 0.087 | 0.059 | |
| 0.136 | 0.004 | 0.032 | 0.004 | 0.003 | |
------------------|-------------|-------------|-------------|-------------|-------------|-------------|
Moderate | 105 | 48 | 145 | 17 | 13 | 328 |
| 0.320 | 0.146 | 0.442 | 0.052 | 0.040 | 0.328 |
| 0.275 | 0.189 | 0.545 | 0.370 | 0.255 | |
| 0.105 | 0.048 | 0.145 | 0.017 | 0.013 | |
------------------|-------------|-------------|-------------|-------------|-------------|-------------|
Conservative | 18 | 96 | 47 | 5 | 3 | 169 |
| 0.107 | 0.568 | 0.278 | 0.030 | 0.018 | 0.169 |
| 0.047 | 0.378 | 0.177 | 0.109 | 0.059 | |
| 0.018 | 0.096 | 0.047 | 0.005 | 0.003 | |
------------------|-------------|-------------|-------------|-------------|-------------|-------------|
Very Conservative | 6 | 96 | 19 | 6 | 2 | 129 |
| 0.047 | 0.744 | 0.147 | 0.047 | 0.016 | 0.129 |
| 0.016 | 0.378 | 0.071 | 0.130 | 0.039 | |
| 0.006 | 0.096 | 0.019 | 0.006 | 0.002 | |
------------------|-------------|-------------|-------------|-------------|-------------|-------------|
Not Sure | 12 | 9 | 7 | 1 | 27 | 56 |
| 0.214 | 0.161 | 0.125 | 0.018 | 0.482 | 0.056 |
| 0.031 | 0.035 | 0.026 | 0.022 | 0.529 | |
| 0.012 | 0.009 | 0.007 | 0.001 | 0.027 | |
------------------|-------------|-------------|-------------|-------------|-------------|-------------|
Column Total | 382 | 254 | 266 | 46 | 51 | 999 |
| 0.382 | 0.254 | 0.266 | 0.046 | 0.051 | |
------------------|-------------|-------------|-------------|-------------|-------------|-------------|
If we also want to perform a Chi-Squared test of independence, we can do so using the chisq argument:
CrossTable(ces$ideology,
           ces$party,
           digits = 3,
           prop.chisq = FALSE,
           chisq = TRUE)
Warning in chisq.test(t, correct = FALSE, ...): Chi-squared approximation may be
incorrect
Cell Contents
|-------------------------|
| N |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 999
| ces$party
ces$ideology | Democrat | Republican | Independent | Other | Not Sure | Row Total |
------------------|-------------|-------------|-------------|-------------|-------------|-------------|
Very liberal | 105 | 1 | 16 | 13 | 3 | 138 |
| 0.761 | 0.007 | 0.116 | 0.094 | 0.022 | 0.138 |
| 0.275 | 0.004 | 0.060 | 0.283 | 0.059 | |
| 0.105 | 0.001 | 0.016 | 0.013 | 0.003 | |
------------------|-------------|-------------|-------------|-------------|-------------|-------------|
Liberal | 136 | 4 | 32 | 4 | 3 | 179 |
| 0.760 | 0.022 | 0.179 | 0.022 | 0.017 | 0.179 |
| 0.356 | 0.016 | 0.120 | 0.087 | 0.059 | |
| 0.136 | 0.004 | 0.032 | 0.004 | 0.003 | |
------------------|-------------|-------------|-------------|-------------|-------------|-------------|
Moderate | 105 | 48 | 145 | 17 | 13 | 328 |
| 0.320 | 0.146 | 0.442 | 0.052 | 0.040 | 0.328 |
| 0.275 | 0.189 | 0.545 | 0.370 | 0.255 | |
| 0.105 | 0.048 | 0.145 | 0.017 | 0.013 | |
------------------|-------------|-------------|-------------|-------------|-------------|-------------|
Conservative | 18 | 96 | 47 | 5 | 3 | 169 |
| 0.107 | 0.568 | 0.278 | 0.030 | 0.018 | 0.169 |
| 0.047 | 0.378 | 0.177 | 0.109 | 0.059 | |
| 0.018 | 0.096 | 0.047 | 0.005 | 0.003 | |
------------------|-------------|-------------|-------------|-------------|-------------|-------------|
Very Conservative | 6 | 96 | 19 | 6 | 2 | 129 |
| 0.047 | 0.744 | 0.147 | 0.047 | 0.016 | 0.129 |
| 0.016 | 0.378 | 0.071 | 0.130 | 0.039 | |
| 0.006 | 0.096 | 0.019 | 0.006 | 0.002 | |
------------------|-------------|-------------|-------------|-------------|-------------|-------------|
Not Sure | 12 | 9 | 7 | 1 | 27 | 56 |
| 0.214 | 0.161 | 0.125 | 0.018 | 0.482 | 0.056 |
| 0.031 | 0.035 | 0.026 | 0.022 | 0.529 | |
| 0.012 | 0.009 | 0.007 | 0.001 | 0.027 | |
------------------|-------------|-------------|-------------|-------------|-------------|-------------|
Column Total | 382 | 254 | 266 | 46 | 51 | 999 |
| 0.382 | 0.254 | 0.266 | 0.046 | 0.051 | |
------------------|-------------|-------------|-------------|-------------|-------------|-------------|
Statistics for All Table Factors
Pearson's Chi-squared test
------------------------------------------------------------
Chi^2 = 768.1034 d.f. = 20 p = 8.288596e-150
The CrossTable() function also allows us to see what the “expected” values in each cell are under the null hypothesis if we use the expected argument.
CrossTable(ces$ideology,
           ces$party,
           digits = 3,
           prop.chisq = FALSE,
           chisq = TRUE,
           expected = TRUE)
Warning in chisq.test(t, correct = FALSE, ...): Chi-squared approximation may be
incorrect
Cell Contents
|-------------------------|
| N |
| Expected N |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 999
| ces$party
ces$ideology | Democrat | Republican | Independent | Other | Not Sure | Row Total |
------------------|-------------|-------------|-------------|-------------|-------------|-------------|
Very liberal | 105 | 1 | 16 | 13 | 3 | 138 |
| 52.769 | 35.087 | 36.745 | 6.354 | 7.045 | |
| 0.761 | 0.007 | 0.116 | 0.094 | 0.022 | 0.138 |
| 0.275 | 0.004 | 0.060 | 0.283 | 0.059 | |
| 0.105 | 0.001 | 0.016 | 0.013 | 0.003 | |
------------------|-------------|-------------|-------------|-------------|-------------|-------------|
Liberal | 136 | 4 | 32 | 4 | 3 | 179 |
| 68.446 | 45.512 | 47.662 | 8.242 | 9.138 | |
| 0.760 | 0.022 | 0.179 | 0.022 | 0.017 | 0.179 |
| 0.356 | 0.016 | 0.120 | 0.087 | 0.059 | |
| 0.136 | 0.004 | 0.032 | 0.004 | 0.003 | |
------------------|-------------|-------------|-------------|-------------|-------------|-------------|
Moderate | 105 | 48 | 145 | 17 | 13 | 328 |
| 125.421 | 83.395 | 87.335 | 15.103 | 16.745 | |
| 0.320 | 0.146 | 0.442 | 0.052 | 0.040 | 0.328 |
| 0.275 | 0.189 | 0.545 | 0.370 | 0.255 | |
| 0.105 | 0.048 | 0.145 | 0.017 | 0.013 | |
------------------|-------------|-------------|-------------|-------------|-------------|-------------|
Conservative | 18 | 96 | 47 | 5 | 3 | 169 |
| 64.623 | 42.969 | 44.999 | 7.782 | 8.628 | |
| 0.107 | 0.568 | 0.278 | 0.030 | 0.018 | 0.169 |
| 0.047 | 0.378 | 0.177 | 0.109 | 0.059 | |
| 0.018 | 0.096 | 0.047 | 0.005 | 0.003 | |
------------------|-------------|-------------|-------------|-------------|-------------|-------------|
Very Conservative | 6 | 96 | 19 | 6 | 2 | 129 |
| 49.327 | 32.799 | 34.348 | 5.940 | 6.586 | |
| 0.047 | 0.744 | 0.147 | 0.047 | 0.016 | 0.129 |
| 0.016 | 0.378 | 0.071 | 0.130 | 0.039 | |
| 0.006 | 0.096 | 0.019 | 0.006 | 0.002 | |
------------------|-------------|-------------|-------------|-------------|-------------|-------------|
Not Sure | 12 | 9 | 7 | 1 | 27 | 56 |
| 21.413 | 14.238 | 14.911 | 2.579 | 2.859 | |
| 0.214 | 0.161 | 0.125 | 0.018 | 0.482 | 0.056 |
| 0.031 | 0.035 | 0.026 | 0.022 | 0.529 | |
| 0.012 | 0.009 | 0.007 | 0.001 | 0.027 | |
------------------|-------------|-------------|-------------|-------------|-------------|-------------|
Column Total | 382 | 254 | 266 | 46 | 51 | 999 |
| 0.382 | 0.254 | 0.266 | 0.046 | 0.051 | |
------------------|-------------|-------------|-------------|-------------|-------------|-------------|
Statistics for All Table Factors
Pearson's Chi-squared test
------------------------------------------------------------
Chi^2 = 768.1034 d.f. = 20 p = 8.288596e-150
Chi-Squared Tests
We test the null hypothesis that our categorical variables are independent using the Chi-Squared test statistic and the Chi-Squared distribution.
As a reminder, the Chi-Squared test statistic is \(\chi^2 = \sum_{k=1}^K \frac{(Observed_k - Expected_k)^2}{Expected_k}\) where \(k\) indicates a given cell. The expected value in a given cell under the null hypothesis is the row total times the column total divided by the total number of observations.
Finally, under the null hypothesis, our \(\chi^2\) statistic will be distributed according to the \(\chi^2\) distribution with \(g\) degrees of freedom, where \(g\) is the number of rows minus 1 multiplied by the number of columns minus 1. Above, the table has 6 rows and 5 columns, so the degrees of freedom are \((6-1)\times(5-1) = 20\), leading to a Chi-Squared distribution that looks like this:
[Figure: density of the Chi-Squared distribution with 20 degrees of freedom]
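The calculation just described can be sketched by hand in base R. The counts below are typed in from the unweighted party-by-ideology table shown earlier; the object names are chosen for this example only. If the counts are transcribed correctly, the result should agree with the Chi^2 value reported by CrossTable().

```r
# Observed counts: rows are party, columns are ideology
observed <- matrix(
  c(105, 136, 105, 18,  6, 12,
      1,   4,  48, 96, 96,  9,
     16,  32, 145, 47, 19,  7,
     13,   4,  17,  5,  6,  1,
      3,   3,  13,  3,  2, 27),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    party = c("Democrat", "Republican", "Independent", "Other", "Not Sure"),
    ideology = c("Very liberal", "Liberal", "Moderate",
                 "Conservative", "Very Conservative", "Not Sure")))

# Expected count in each cell: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-Squared statistic: sum over cells of (observed - expected)^2 / expected
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# p-value: upper tail of the Chi-Squared distribution
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

chi_sq
df
p_value
```

Transposing the table (ideology in rows, party in columns) leaves the statistic and the degrees of freedom unchanged.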
Unweighted Chi-Squared Test
If we just want to conduct a Chi-Squared test, we can do so using the chisq.test() function.
chisq.test(ces$ideology,
           ces$party)
Warning in chisq.test(ces$ideology, ces$party): Chi-squared approximation may be
incorrect
Pearson's Chi-squared test
data: ces$ideology and ces$party
X-squared = 768.1, df = 20, p-value < 2.2e-16
The Problem: the chisq.test() function does not apply survey weights.
Weighted Chi-Squared Test
If we want to conduct a weighted Chi-Squared test (which we should if our dataset contains weights), we should turn to the weights package. In the weights package, there is a wtd.chi.sq() function which allows us to easily apply weights.
library(weights)
wtd.chi.sq(var1 = ces$ideology,
var2 = ces$party,
weight = ces$commonweight)
Chisq df p.value
9.902687e+02 2.000000e+01 4.640567e-197
Counting numbers in groups: count()
Another way to obtain a table is using the count() function in dplyr. One of the nice things about the count() function is that it allows us to obtain the weighted number of observations in each category using the wt argument.
ces %>%
  count(ideology, wt = commonweight)
# A tibble: 7 × 2
ideology n
<fct> <dbl>
1 Very liberal 105.
2 Liberal 136.
3 Moderate 266.
4 Conservative 176.
5 Very Conservative 170.
6 Not Sure 67.0
7 <NA> 0.970
To see how important weights are, we can compare to the unweighted counts:
ces %>%
  count(ideology)
# A tibble: 7 × 2
ideology n
<fct> <int>
1 Very liberal 138
2 Liberal 179
3 Moderate 328
4 Conservative 169
5 Very Conservative 129
6 Not Sure 56
7 <NA> 1
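For readers who prefer base R, a similar weighted count can be sketched with xtabs(), which sums a weight variable within each category. The data frame below is made up for illustration and is not the CES data.

```r
# Toy data, invented for illustration only
toy <- data.frame(
  ideology = c("Liberal", "Liberal", "Moderate", "Conservative"),
  w        = c(0.5, 0.7, 1.0, 2.0)
)

# Unweighted counts: number of rows in each category
unweighted <- table(toy$ideology)

# Weighted counts: sum of the weights in each category
weighted <- xtabs(w ~ ideology, data = toy)

unweighted
weighted
```

Here "Liberal" has two respondents but a weighted count of 1.2, which mirrors the gap between the weighted and unweighted counts above.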
gt
Finally, there are many packages in R that can be used to generate nice tables. The one we’ll talk about is gt.
install.packages("gt")
library(gt)
ces %>%
  group_by(party) %>%
  count(ideology, wt = commonweight) %>%
  # group rows by party (pid3 is not a column in the counted data)
  gt(groupname_col = "party") %>%
  opt_row_striping(row_striping = TRUE)
party | ideology | n |
---|---|---|
Democrat | Very liberal | 86.1513497 |
Democrat | Liberal | 103.8848878 |
Democrat | Moderate | 79.9934793 |
Democrat | Conservative | 14.8124399 |
Democrat | Very Conservative | 6.7042232 |
Democrat | Not Sure | 9.3306875 |
Republican | Very liberal | 0.3514461 |
Republican | Liberal | 7.2793520 |
Republican | Moderate | 41.8180993 |
Republican | Conservative | 104.4875094 |
Republican | Very Conservative | 126.9780632 |
Republican | Not Sure | 7.1685850 |
Independent | Very liberal | 9.2598282 |
Independent | Liberal | 21.3422818 |
Independent | Moderate | 116.5471971 |
Independent | Conservative | 45.6921294 |
Independent | Very Conservative | 25.8717080 |
Independent | Not Sure | 4.2594498 |
Independent | NA | 0.9702168 |
Other | Very liberal | 7.5267886 |
Other | Liberal | 2.1869923 |
Other | Moderate | 16.1317720 |
Other | Conservative | 6.0002626 |
Other | Very Conservative | 7.2110614 |
Other | Not Sure | 0.4421898 |
Not Sure | Very liberal | 2.1532383 |
Not Sure | Liberal | 1.5623802 |
Not Sure | Moderate | 11.7003928 |
Not Sure | Conservative | 5.1063630 |
Not Sure | Very Conservative | 3.2099986 |
Not Sure | Not Sure | 45.8422459 |
Correlation Coefficients
Turning back to our dataset of House election results and campaign finance data from 2020, we can think about how we might summarize the relationship between two numeric variables.
We generally use what is known as the correlation coefficient to summarize bivariate relationships between two numeric variables.
The correlation coefficient is calculated as follows:
\(r = \frac{Cov(X, Y)}{\sqrt{Var(X) \cdot Var(Y)}} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \cdot \sum_{i=1}^{n}(Y_i - \bar{Y})^2}}\)
You can think of this as roughly the share of the variation in the two variables that is captured by the co-variation in the two variables.
Correlation coefficients run from -1 to 1, where -1 is a perfect negative linear relationship and 1 is a perfect positive linear relationship. 0 indicates that there is no linear relationship between the variables (the variables may still be related in a non-linear way).
Note: a positive relationship means that, as X increases, Y tends to increase as well. A negative relationship means that, as X increases, Y tends to decrease.
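As a quick sketch on made-up numbers, the sum form of the formula can be computed directly and compared with base R's cor(). The example also shows that a correlation of 0 only rules out a linear relationship.

```r
# Toy data, invented for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)

# Correlation from the sums of deviations around the means
r_by_hand <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

r_by_hand
cor(x, y)

# A perfect but non-linear relationship can still have zero correlation
u <- -2:2
v <- u^2
cor(u, v)   # 0: v depends entirely on u, but not linearly
```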
We can easily calculate the correlation in R using the cor() function.
cor(log(house$disbursements), house$voteshare)
[1] 0.4844947
We can see this is the same as if we actually calculated the correlation coefficient by hand:
cov(log(house$disbursements), house$voteshare)/
sqrt(var(log(house$disbursements))*var(house$voteshare))
[1] 0.4844947
To conduct an actual hypothesis test, we need to use the cor.test() function.
cor.test(log(house$disbursements), house$voteshare)
Pearson's product-moment correlation
data: log(house$disbursements) and house$voteshare
t = 14.902, df = 724, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.4267770 0.5382815
sample estimates:
cor
0.4844947
This is just a t-test with \(n - 2\) degrees of freedom, where the t-statistic is calculated as follows:
r <- cor(log(house$disbursements), house$voteshare)
(r - 0)/(sqrt((1-r^2)/(nrow(house)-2)))
[1] 14.90227
We can confirm that the t-statistic we calculated by hand matches the t-statistic from the cor.test() function.
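The same check can be reproduced on any dataset. The sketch below uses R's built-in mtcars data (an assumption for illustration, unrelated to the House data) and also recovers the two-sided p-value from the t-distribution with n - 2 degrees of freedom.

```r
# Built-in example data: car weight and fuel efficiency
r <- cor(mtcars$wt, mtcars$mpg)
n <- nrow(mtcars)

# t-statistic for the null hypothesis that the true correlation is 0
t_by_hand <- (r - 0) / sqrt((1 - r^2) / (n - 2))

# Two-sided p-value from the t-distribution with n - 2 degrees of freedom
p_by_hand <- 2 * pt(abs(t_by_hand), df = n - 2, lower.tail = FALSE)

# Compare with cor.test()
test <- cor.test(mtcars$wt, mtcars$mpg)
c(by_hand = t_by_hand, cor_test = unname(test$statistic))
c(by_hand = p_by_hand, cor_test = test$p.value)
```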
Weighted Correlation Coefficients
If we are working with survey data or data with observation weights, we want to apply those weights in calculating the correlation coefficient.
We can do this using the weights package and the wtd.cor() function.
library(weights)
# var1, var2, and weight are placeholders for your own numeric vectors
wtd.cor(x = var1,
        y = var2,
        weight = weight)
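To show what a weighted correlation is doing, here is a minimal by-hand sketch on made-up data, checked against base R's cov.wt(). It replaces each mean and sum in the correlation formula with its weighted version; this illustrates the idea and is not necessarily how wtd.cor() computes its estimate or standard error internally.

```r
# Toy data and weights, invented for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)
w <- c(0.5, 1.0, 1.5, 1.0, 2.0)

# Weighted means
mx <- weighted.mean(x, w)
my <- weighted.mean(y, w)

# Weighted correlation: weighted co-variation over weighted variation
r_weighted <- sum(w * (x - mx) * (y - my)) /
  sqrt(sum(w * (x - mx)^2) * sum(w * (y - my)^2))

# Cross-check with cov.wt(), which returns a weighted correlation matrix
r_cov_wt <- cov.wt(cbind(x, y), wt = w, cor = TRUE)$cor[1, 2]

r_weighted
r_cov_wt
```

With equal weights, this reduces to the ordinary correlation coefficient from cor().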