Section 3. Summarizing Data

Sam Frederick

1/31/23

Last Week

  • Tibbles:
    • Changing and creating columns: mutate()
    • Subset data: filter()
  • Plotting:
    • Foundation: ggplot(data, aes())
    • Build on top of foundation with +

Last Week

  • Factors: factor(variable, levels = c(...), labels = c(...))

  • Logical Data: TRUE or FALSE

  • Ignore missing data in summary functions with the na.rm = TRUE argument
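
As a quick refresher, here is a minimal sketch of these three ideas on a made-up vector. The values and labels are hypothetical and are not taken from the class survey.

```r
# Hypothetical responses: 1 = has coding experience, 0 = none, NA = missing
code_exp <- c(1, 0, NA, 1, 0)

# Factor: attach readable labels to the underlying codes
code_exp_f <- factor(code_exp, levels = c(0, 1), labels = c("No", "Yes"))

# Logical data: comparisons return TRUE/FALSE (or NA when the input is missing)
is_coder <- code_exp == 1

# Without na.rm = TRUE the mean is NA; with it, missing values are dropped first
mean_with_na <- mean(code_exp)
mean_no_na <- mean(code_exp, na.rm = TRUE)
```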

Last Week

  • Numeric Data
    • Summary Statistics
      • Central Tendency: Mean, Median
      • Spread: Standard Deviation, Variance, Range, Interquartile Range
    • Visual Summaries: Histogram
  • Categorical Data
    • Summary Statistics: Table, Prop. Table
    • Visual Summaries: Barplot

Summarizing Bivariate Relationships

  • Categorical-Categorical
    • Cross-Tabs
    • Facetted Barplots
  • Categorical-Numeric
    • Box-and-Whisker Plots
    • Facetted/Filled Histograms
  • Numeric-Numeric
    • Scatterplots
    • Line Plots

Categorical-Categorical Data: Cross-Tabs

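library(tidyverse)     # provides read_csv(), ggplot(), and the %>% pipe
library(modelsummary)  # provides datasummary_crosstab()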
df <- read_csv("https://raw.githubusercontent.com/SamuelFrederick/scope-and-methods-spring2023/main/section-2/intro_survey.csv")
datasummary_crosstab(r_exp~python_exp, statistic = 1~1+N, data = df)
r_exp 0 1 All
0 N 7 3 10
1 N 2 1 3
All N 9 4 13

Categorical-Categorical Data: Cross-Tabs

datasummary_crosstab(r_exp~python_exp, statistic = 1~1+Percent(), 
                     data = df)
r_exp 0 1 All
0 % 53.8 23.1 76.9
1 % 15.4 7.7 23.1
All % 69.2 30.8 100.0
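
To see where these percentages come from, the sketch below rebuilds the table in base R from the counts shown in the first cross-tab. The counts are typed in by hand for illustration rather than read from the survey file.

```r
# Counts copied from the cross-tab above (rows: r_exp, columns: python_exp)
counts <- as.table(matrix(
  c(7, 2, 3, 1), nrow = 2,
  dimnames = list(r_exp = c("0", "1"), python_exp = c("0", "1"))
))

# Each cell as a percentage of all 13 respondents
cell_pct <- round(100 * prop.table(counts), 1)

# Add row and column totals to mirror the "All" margins
pct_with_totals <- round(100 * addmargins(prop.table(counts)), 1)

cell_pct
pct_with_totals
```

For example, 7 of the 13 respondents have neither R nor Python experience, and 7/13 is about 53.8%.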

Categorical-Categorical Data: Barplots with Facets

df %>% 
  ggplot(aes(code_experience, fill = factor(code_experience))) + 
  geom_bar() +
  facet_wrap(~american_pol, nrow = 1) +
  labs(x = "Code Experience", y = "Count", 
       title = "Barplot of Coding Experience By Interest", 
       fill = "Coding Experience")

Categorical-Numeric: Box-and-Whisker Plots

df %>%
  ggplot(aes(x = code_experience, y = sleep)) + 
  geom_boxplot() +
  labs(x = "Coding Experience", y = "Sleep", title = "Sleep by Prior Coding Experience") 

Categorical-Numeric: Box-and-Whisker Plots

  • Box-and-Whisker Plots:
    • Show several important summary statistics
      • Median (Bold line inside the Box)
      • Interquartile Range (Box)
      • Minimum/Maximum, or the most extreme points within 1.5*IQR of the box when there are outliers (Whiskers)
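
The pieces of a box-and-whisker plot can be computed by hand. This is a minimal sketch on a made-up vector (the sleep values are hypothetical), using the default quantile() method.

```r
# Hypothetical hours of sleep for ten people (not the class data)
sleep_hours <- c(4, 5, 6, 6, 7, 7, 8, 8, 9, 14)

med <- median(sleep_hours)                  # bold line inside the box
q1 <- unname(quantile(sleep_hours, 0.25))   # lower edge of the box
q3 <- unname(quantile(sleep_hours, 0.75))   # upper edge of the box
iqr <- q3 - q1                              # height of the box

# Whiskers stop at the most extreme observations within 1.5 * IQR of the box
lower_whisker <- min(sleep_hours[sleep_hours >= q1 - 1.5 * iqr])
upper_whisker <- max(sleep_hours[sleep_hours <= q3 + 1.5 * iqr])

# Anything beyond the whiskers is drawn as a separate point
outliers <- sleep_hours[sleep_hours < lower_whisker | sleep_hours > upper_whisker]
```

Here the box runs from 6 to 8 with the median at 7, the whiskers reach 4 and 9, and the value 14 is plotted as an outlier.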

Categorical-Numeric: Histograms

df %>% 
  ggplot(aes(sleep)) + 
  geom_histogram() + 
  facet_wrap(~code_experience, nrow = 2)

Categorical-Numeric: Density Plots

df %>% 
  ggplot(aes(sleep, fill = code_experience)) + 
  geom_density(alpha = 0.5, position = "identity") 

Numeric-Numeric Data: Scatterplots

df %>%
  mutate(Timestamp = lubridate::as_datetime(Timestamp), 
         time = lubridate::hour(Timestamp)) %>%
  ggplot(aes(time, sleep)) + geom_point() +
  labs(x = "Time of Day Completed Survey", y = "Hours of Sleep")
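
The two lubridate calls above can be tried on a single value first. A small sketch, assuming a timestamp written as year-month-day hour:minute:second (the string below is made up):

```r
library(lubridate)

# A hypothetical timestamp string
stamp <- as_datetime("2023-01-24 14:35:00")

# hour() pulls out the hour of the day on a 24-hour clock
completed_hour <- hour(stamp)
completed_hour
```

This returns 14, so a survey completed at 2:35 PM is plotted at 14 on the x-axis.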

Conditional Logic in Data: ifelse()

  • Check whether a condition holds for each observation, and return one value if it does and another if it does not
    • e.g., check if ideology is below the median, and if so, label that person liberal
  • Use ifelse() function
    • ifelse(condition, output, output2)
      • if condition is TRUE, returns “output”
      • otherwise, returns “output2”

ifelse() function

x <- 1:10
ifelse(x<5.5, "Below Median", "Above Median")
 [1] "Below Median" "Below Median" "Below Median" "Below Median" "Below Median"
 [6] "Above Median" "Above Median" "Above Median" "Above Median" "Above Median"

ifelse() function

df <- df %>% 
  mutate(nocturnal = ifelse(lubridate::hour(Timestamp)<12, 
                            "morning person", 
                            "nocturnal"))
df$nocturnal
 [1] "morning person" "morning person" "morning person" "morning person"
 [5] "nocturnal"      "nocturnal"      "nocturnal"      "morning person"
 [9] "morning person" "morning person" "morning person" "morning person"
[13] "nocturnal"     

Multiple Conditions

  • Sometimes we have more than one condition we want to evaluate
    • x&y checks whether both x AND y are TRUE
    • x|y checks whether either x OR y is TRUE
df <- df %>% 
  mutate(nocturnal_coder = ifelse(
    lubridate::hour(Timestamp)<12&code_experience=="Yes", 
                            "Not Nocturnal Coder", 
                            "Nocturnal Coder"))
df$nocturnal_coder[1:7]
[1] "Nocturnal Coder"     "Not Nocturnal Coder" "Nocturnal Coder"    
[4] "Not Nocturnal Coder" "Nocturnal Coder"     "Nocturnal Coder"    
[7] "Nocturnal Coder"    
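
To see how & and | differ, here is a small sketch on four hypothetical respondents covering every combination of the two conditions:

```r
# Hypothetical respondents
hour_completed <- c(9, 9, 15, 15)
code_experience <- c("Yes", "No", "Yes", "No")

# & is TRUE only where both conditions are TRUE
both <- hour_completed < 12 & code_experience == "Yes"

# | is TRUE where at least one condition is TRUE
either <- hour_completed < 12 | code_experience == "Yes"

both
either
```

Only the first respondent satisfies both conditions, while only the last respondent satisfies neither.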

Negating Conditions

  • Check whether a condition is not TRUE
    • !condition means NOT condition
df <- df %>% 
  mutate(not_american_politics = ifelse(!(american_pol==1), 
                            "Not interested in AP", 
                            "Interested in AP"))
df$not_american_politics[1:7]
[1] "Interested in AP"     "Not interested in AP" "Not interested in AP"
[4] "Interested in AP"     "Interested in AP"     "Interested in AP"    
[7] "Interested in AP"    
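
One detail worth checking is the order in which R evaluates ! and ==. The sketch below uses a hypothetical 0/1 vector to show that R evaluates == before !, so the negation applies to the whole comparison:

```r
american_pol <- c(1, 0, 1)  # hypothetical 0/1 interest indicator

neg_implicit <- !american_pol == 1    # R evaluates == first, then !
neg_explicit <- !(american_pol == 1)  # same thing, with the order made explicit
not_equal <- american_pol != 1        # "not equal" gives the same answer here
```

All three give FALSE, TRUE, FALSE. Writing the parentheses explicitly can make the intent easier to read.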

case_when()

  • case_when(): tidyverse function to check multiple conditions sequentially
case_when(cond1~output1, 
          cond2~output2, 
          cond3~output3, 
          TRUE~NA_typeofoutput)
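
A concrete version of this template, as a sketch on made-up values (the ideology scores and cutoffs are hypothetical). Conditions are checked in order and each element gets the output of the first condition that is TRUE:

```r
library(dplyr)

ideology <- c(1, 5, 10, NA)  # hypothetical scores on a 1-10 scale

label <- case_when(
  ideology < 3 ~ "low",       # checked first
  ideology < 8 ~ "medium",    # only reached if the first condition was not TRUE
  ideology >= 8 ~ "high",
  TRUE ~ NA_character_        # everything else, including missing values
)
label
```

The score of 1 is also below 8, but it is labeled "low" because that condition comes first. The NA_character_ in the last line matches the type of the other outputs (character).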

case_when()

df <- df %>% 
  mutate(code_interests = 
           case_when(code_experience=="Yes"&american_pol==1~"AP Coder", 
                     code_experience=="Yes"&comparative_pol==1~"CP Coder", 
                     code_experience=="Yes"&international_rel==1~"IR Coder", 
                     T~"Other")) 
df$code_interests
 [1] "Other"    "CP Coder" "Other"    "AP Coder" "AP Coder" "AP Coder"
 [7] "Other"    "Other"    "AP Coder" "AP Coder" "Other"    "AP Coder"
[13] "Other"   

Tidyverse Digression: Selecting Columns

  • Often only want certain columns
  • Specify with select()
df %>% select(year, python_exp) %>% head(2)
# A tibble: 2 × 2
  year      python_exp
  <chr>          <dbl>
1 Junior             0
2 Sophomore          1
  • If columns share a name pattern or a type, can select them with dplyr helper functions such as starts_with(), contains(), or where(is.numeric)
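
A short sketch of selecting by pattern and by type. The tibble is hypothetical; it only borrows a few column names used later in this section.

```r
library(dplyr)

# Hypothetical tibble (values made up)
toy <- tibble(
  state_abbrev = c("NY", "CA"),
  chamber = c("House", "Senate"),
  nominate_dim1 = c(-0.4, 0.3),
  nominate_dim2 = c(0.1, -0.2)
)

by_prefix <- toy %>% select(starts_with("nominate"))  # columns matching a name pattern
by_type <- toy %>% select(where(is.character))        # columns of a given type

names(by_prefix)
names(by_type)
```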

Working with Real Data

  • Navigate to voteview.com
  • Click the data tab
  • Click Download Data
  • Move to course folder
nominate <- read_csv("HSall_members.csv")

Working with Real Data

  • Navigate to course page
  • Click on data for today
nominate <- read_csv("https://raw.githubusercontent.com/SamuelFrederick/scope-and-methods-spring2023/main/section-3/HSall_members.csv")

Cleaning Data

  • Filter/subset the data, so it only contains members of the House and Senate from the 97th Congress to the current Congress
  • Change the party_code variable, so it reads “D”, “R” instead of “100”, “200”
  • Select only the variables we will use for this analysis:
    • party_code, congress, chamber, state_abbrev, nominate_dim1, nominate_dim2

Cleaning Data

  • Filter/subset the data, so it only contains members of the House and Senate from the 97th Congress to the current Congress
nominate <- nominate %>% filter(chamber%in%c("House", "Senate")&
                                  congress>=97)
  • Change the party_code variable, so it reads “D”, “R” instead of “100”, “200”
nominate <- nominate %>%
  mutate(party_code = case_when(party_code==100~"D", 
                                party_code==200~"R",
                                T~NA_character_))

Cleaning Data

  • Select only the variables we will use for this analysis:
    • party_code, congress, chamber, state_abbrev, nominate_dim1, nominate_dim2
nominate <- nominate %>% 
  select(party_code, congress, chamber, 
         state_abbrev, nominate_dim1, nominate_dim2)

Putting It Together

nominate <- nominate %>% 
  filter(chamber%in%c("House", "Senate")&congress>=97) %>%
  mutate(party_code = case_when(party_code==100~"D", 
                                party_code==200~"R",
                                T~NA_character_)) %>%
  select(party_code, congress, chamber, 
       state_abbrev, nominate_dim1, nominate_dim2) %>%
  drop_na(party_code)
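
One way to check a pipeline like this is to run it on a few rows where the right answer is known. The sketch below uses a hypothetical tibble (all values and the extra_col column are made up) with one row that should fail each step:

```r
library(dplyr)
library(tidyr)

# Hypothetical rows that mimic the columns used above
toy <- tibble(
  chamber = c("House", "Senate", "President", "House", "House"),
  congress = c(97, 110, 110, 96, 117),
  party_code = c(100, 200, 200, 100, 328),
  state_abbrev = c("NY", "TX", "USA", "CA", "VT"),
  nominate_dim1 = c(-0.4, 0.5, 0.6, -0.3, -0.5),
  nominate_dim2 = c(0.1, 0.2, 0.0, -0.1, 0.3),
  extra_col = c("a", "b", "c", "d", "e")
)

toy_clean <- toy %>%
  filter(chamber %in% c("House", "Senate") & congress >= 97) %>%  # drops rows 3 and 4
  mutate(party_code = case_when(party_code == 100 ~ "D",
                                party_code == 200 ~ "R",
                                TRUE ~ NA_character_)) %>%        # other codes become NA
  select(party_code, congress, chamber,
         state_abbrev, nominate_dim1, nominate_dim2) %>%          # drops extra_col
  drop_na(party_code)                                             # drops row 5
toy_clean
```

Only the first two rows survive, recoded as "D" and "R".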

Summarizing Data: Univariate

  • Create a barplot of party membership with fill set to party variable
    • Hint: you might have to re-level the factor variable
  • Create a histogram of nominate_dim1
    • What stands out to you in this histogram?
  • Use datasummary() (or datasummary_skim()) to summarize the numeric variables
    • Reminder: you will likely have to load the modelsummary package

Summarizing Data: Univariate

  • Create a barplot of party membership with fill set to party variable
nominate <- nominate  %>% mutate(party_code = factor(party_code, 
                             levels = c("R", "D")))
nominate %>%
  ggplot(aes(party_code, fill = party_code)) + 
  geom_bar() + 
  labs(x = "Party", y = "Count", fill = "Party")
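
Why re-level? A small sketch on a hypothetical character vector shows how the levels argument changes the order that ggplot uses for bars and legend entries:

```r
party <- c("D", "R", "D", "R")  # hypothetical party labels

# By default, levels are sorted alphabetically
default_levels <- levels(factor(party))

# Passing levels = ... sets the order explicitly
party_releveled <- factor(party, levels = c("R", "D"))
new_levels <- levels(party_releveled)

default_levels
new_levels
```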

Summarizing Data: Univariate

  • Create a histogram of nominate_dim1
    • What stands out to you in this histogram?
nominate %>%
  ggplot(aes(nominate_dim1)) +
  geom_histogram() + 
  labs(x = "NOMINATE-Dim. 1", y = "Count")

Summarizing Data: Univariate

  • Use datasummary to summarize nominate_dim1 and nominate_dim2
    • Reminder: you will likely have to load the modelsummary package
datasummary((nominate_dim1 + nominate_dim2) ~ Min + Median + Mean + 
              Max + SD, 
            data = nominate)
                Min    Median  Mean   Max   SD
nominate_dim1   −0.83  −0.06   0.02   0.94  0.41
nominate_dim2   −1.00  −0.03   −0.02  1.00  0.36

Summarizing Data: Bivariate and Beyond

  • Categorical-Numeric:
    • Create a density plot of nominate_dim1 filled by party
  • Categorical-Categorical-Numeric:
    • Facet the above density plot by chamber of Congress
  • Numeric-Numeric:
    • Create a scatterplot of nominate_dim2 against nominate_dim1
  • Numeric-Numeric-Categorical
    • Color the above scatterplot by party

nominate %>%
  ggplot(aes(nominate_dim1, y = after_stat(density), fill = party_code)) + 
  geom_histogram(alpha = 0.5, position = "identity") + 
  labs(x = "NOMINATE-Dim. 1", y = "Density", fill = "Party")

nominate %>%
  ggplot(aes(nominate_dim1, fill = party_code)) + 
  geom_density(alpha = 0.5, position = "identity") + 
  labs(x = "NOMINATE-Dim. 1", y = "Density", fill = "Party")

nominate %>%
  ggplot(aes(nominate_dim1, y = after_stat(density), fill = party_code)) + 
  geom_histogram(alpha = 0.5, position = "identity") + 
  labs(x = "NOMINATE-Dim. 1", y = "Density", fill = "Party") +
  facet_wrap(~chamber, nrow =2)

nominate %>%
  ggplot(aes(nominate_dim1, y = after_stat(density), fill = party_code)) + 
  geom_density(alpha = 0.5, position = "identity") + 
  labs(x = "NOMINATE-Dim. 1", y = "Density", fill = "Party") +
  facet_wrap(~chamber, nrow =2)

Summarizing Data: Bivariate and Beyond

  • Numeric-Numeric:
    • Create a scatterplot of nominate_dim2 against nominate_dim1
nominate %>%
  ggplot(aes(nominate_dim1, nominate_dim2)) + 
  geom_point()+
  labs(x = "NOMINATE-Dim. 1", y = "NOMINATE-Dim. 2")

Summarizing Data: Bivariate and Beyond

  • Numeric-Numeric-Categorical
    • Color the above scatterplot by party
nominate %>%
  ggplot(aes(nominate_dim1, nominate_dim2, col = party_code)) + 
  geom_point()+
  labs(x = "NOMINATE-Dim. 1", y = "NOMINATE-Dim. 2", col = "Party") 

Tidyverse Digression: Grouping and Summarizing Data

  • Often want to figure out some quantity within a category
    • e.g., ideology by party, average GDP per capita by country
  • Use group_by() function in tidyverse to group our data by given variables
  • Use summarize() with the desired function (e.g., mean()) to create a new summary column
    • Very similar to mutate(), but returns one row per group instead of one row per observation

Tidyverse Digression: Grouping and Summarizing

  • Example with our class survey data
df %>%
  group_by(code_experience)%>%
  summarize(sleep = mean(sleep))
# A tibble: 2 × 2
  code_experience sleep
  <chr>           <dbl>
1 No               6.33
2 Yes              7.32
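
To see how summarize() differs from mutate() after group_by(), here is a minimal sketch on a hypothetical three-row tibble:

```r
library(dplyr)

toy <- tibble(group = c("a", "a", "b"), value = c(1, 3, 10))  # hypothetical data

# summarize(): collapses the data to one row per group
collapsed <- toy %>%
  group_by(group) %>%
  summarize(value_mean = mean(value))

# mutate(): keeps every row and repeats the group mean within each group
kept <- toy %>%
  group_by(group) %>%
  mutate(value_mean = mean(value)) %>%
  ungroup()

collapsed
kept
```

collapsed has two rows (means 2 and 10), while kept still has three rows with the means 2, 2, and 10 attached.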

Getting Ideology by Congress and Party

  • What would be the first step in getting the average of nominate_dim1 by Congress and party?
nominate %>% 
  group_by(congress, party_code) %>% head(c(2,2))
# A tibble: 2 × 2
# Groups:   congress, party_code [1]
  party_code congress
  <fct>         <dbl>
1 R                97
2 R                97
  • What comes next?
nominate %>% 
  group_by(congress, party_code) %>%
  summarize(nominate_dim1 = 
              mean(nominate_dim1, 
                   na.rm = T)) %>%
  head(4)
# A tibble: 4 × 3
# Groups:   congress [2]
  congress party_code nominate_dim1
     <dbl> <fct>              <dbl>
1       97 R                  0.307
2       97 D                 -0.300
3       98 R                  0.321
4       98 D                 -0.302
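
Note the "# Groups: congress [2]" line in the output: after summarizing over two grouping variables, the result is still grouped by the first one. The sketch below, on a hypothetical tibble, shows how the .groups argument of summarize() controls this:

```r
library(dplyr)

toy <- tibble(  # hypothetical data
  congress = c(97, 97, 98, 98),
  party_code = c("D", "R", "D", "R"),
  nominate_dim1 = c(-0.3, 0.3, -0.3, 0.3)
)

# "drop_last" is the default here: the result stays grouped by congress
still_grouped <- toy %>%
  group_by(congress, party_code) %>%
  summarize(nominate_dim1 = mean(nominate_dim1), .groups = "drop_last")

# "drop" removes all grouping from the result
ungrouped <- toy %>%
  group_by(congress, party_code) %>%
  summarize(nominate_dim1 = mean(nominate_dim1), .groups = "drop")

group_vars(still_grouped)
group_vars(ungrouped)
```

Leftover grouping matters if more mutate() or summarize() steps follow, since they will be applied within each remaining group.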

Line Plots

  • How could we use our summarized data to make a scatterplot of average NOMINATE score against Congress number, colored by party?

Line Plots

nominate %>%
  group_by(congress, party_code) %>% 
  summarize(nominate_dim1_mean = mean(nominate_dim1, na.rm = T)) %>%
  ggplot(aes(congress, nominate_dim1_mean, col = party_code)) +
  geom_point()

Line Plots

  • Start from the scatterplot code, now with axis labels:
nominate %>%
  group_by(congress, party_code) %>% 
  summarize(nominate_dim1_mean = mean(nominate_dim1, na.rm = T)) %>%
  ggplot(aes(congress, nominate_dim1_mean, col = party_code)) +
  geom_point() +
  labs(x = "Congress", y= "Mean NOMINATE-Dim. 1", col = "Party")

Line Plots

  • To make a line plot, we just switch one line of code: geom_point() becomes geom_line()
nominate %>%
  group_by(congress, party_code) %>% 
  summarize(nominate_dim1_mean = mean(nominate_dim1, na.rm = T)) %>%
  ggplot(aes(congress, nominate_dim1_mean, col = party_code)) +
  geom_line() +
  labs(x = "Congress", y= "Mean NOMINATE-Dim. 1", col = "Party")
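
Because ggplot layers are added with +, the two geoms can also be combined in one plot. A minimal sketch on made-up party means (the numbers are hypothetical, not computed from the Voteview data):

```r
library(ggplot2)

# Hypothetical party means for three Congresses
toy_means <- data.frame(
  congress = rep(c(97, 98, 99), times = 2),
  party_code = rep(c("D", "R"), each = 3),
  nominate_dim1_mean = c(-0.30, -0.31, -0.32, 0.31, 0.32, 0.33)
)

p <- ggplot(toy_means, aes(congress, nominate_dim1_mean, col = party_code)) +
  geom_line() +    # connects each party's means across Congresses
  geom_point() +   # marks the individual Congress-party means
  labs(x = "Congress", y = "Mean NOMINATE-Dim. 1", col = "Party")
p
```

Mapping col = party_code also splits the data by party, so each party gets its own line.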

Recap

  • Numeric Data
    • 1 Variable
      • Summary Statistics
        • Central Tendency: Mean, Median
        • Spread: Standard Deviation, Variance, Range, Interquartile Range
      • Visual Summaries: Histogram
    • 2+ Variables
      • With another numeric variable: Scatterplot, Line Plot
      • With a categorical variable: Facetted Histograms/Density Plots, Box-and-Whisker Plots
  • Categorical Data
    • 1 Variable
      • Summary Statistics: Table, Prop. Table
      • Visual Summaries: Barplot
    • 2+ Variables
      • With another categorical variable: Cross-Tabs, Facetted Barplots
      • With a numeric variable: Facetted Histograms/Density Plots, Box-and-Whisker Plots