Section 3. Summarizing Data

Sam Frederick

1/31/23

Last Week

Tibbles:
- Changing and creating columns: mutate()
- Subset data: filter()
Plotting:
- Foundation: ggplot(data, aes())
- Build on top of foundation with +

Last Week

Factors: factor(variable, levels = c(...), labels = c(...))
Logical Data: TRUE or FALSE
Ignore missing data in a calculation with the na.rm = TRUE argument
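
As a quick refresher, the sketch below uses a small made-up vector (the values and labels are illustrative, not from the class survey) to show how factor() attaches labels and how na.rm = TRUE changes a summary.

```r
# Hypothetical 0/1 responses with one missing value
x <- c(0, 1, 1, NA, 0)

# levels lists the raw values (and their order); labels renames them
f <- factor(x, levels = c(0, 1), labels = c("No", "Yes"))
levels(f)

# Without na.rm, a single NA makes the mean NA
mean(x)

# With na.rm = TRUE, the NA is dropped before averaging
mean(x, na.rm = TRUE)
```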

Summarizing Bivariate Relationships

	Categorical	Numeric
Categorical	Cross-Tabs; Faceted Barplots	Box-and-Whisker Plots; Faceted/Filled Histograms
Numeric		Scatterplots; Line Plots

Categorical-Categorical Data: Cross-Tabs

library(tidyverse)    # read_csv(), %>%, and ggplot()
library(modelsummary) # datasummary_crosstab()
df <- read_csv("https://raw.githubusercontent.com/SamuelFrederick/scope-and-methods-spring2023/main/section-2/intro_survey.csv")
datasummary_crosstab(r_exp~python_exp, statistic = 1~1+N, data = df)

r_exp		0	1	All
0	N	7	3	10
1	N	2	1	3
All	N	9	4	13

Categorical-Categorical Data: Cross-Tabs

datasummary_crosstab(r_exp~python_exp, statistic = 1~1+Percent(), 
                     data = df)

r_exp		0	1	All
0	%	53.8	23.1	76.9
1	%	15.4	7.7	23.1
All	%	69.2	30.8	100.0
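
To see where these percentages come from, the counts from the earlier cross-tab can be converted by hand. This is only a base-R sketch: the matrix is typed in from the table above, and it assumes the reported figures are cell percentages (each count divided by the grand total of 13), which is what the numbers suggest.

```r
# Counts copied from the cross-tab above (rows: r_exp, columns: python_exp)
counts <- matrix(c(7, 2, 3, 1), nrow = 2,
                 dimnames = list(r_exp = c("0", "1"),
                                 python_exp = c("0", "1")))

# Cell percentages: each count divided by the grand total
cell_pct <- round(100 * prop.table(counts), 1)
cell_pct

# Row percentages: each count divided by its row total
row_pct <- round(100 * prop.table(counts, margin = 1), 1)
row_pct
```

Row percentages answer a different question (among people with a given r_exp value, what share has Python experience?), so it is worth checking which kind a table reports.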

Categorical-Categorical Data: Barplots with Facets

df %>% 
  ggplot(aes(code_experience, fill = factor(code_experience))) + 
  geom_bar() +
  facet_wrap(~american_pol, nrow = 1) +
  labs(x = "Code Experience", y = "Count", 
       title = "Barplot of Coding Experience By Interest", 
       fill = "Coding Experience")

Categorical-Numeric: Box-and-Whisker Plots

df %>%
  ggplot(aes(x = code_experience, y = sleep)) + 
  geom_boxplot() +
  labs(x = "Coding Experience", y = "Sleep", title = "Sleep by Prior Coding Experience")

Categorical-Numeric: Box-and-Whisker Plots

Box-and-Whisker Plots:
- Show several important summary statistics
  - Median (Bold line inside the Box)
  - Interquartile Range (Box)
  - Whiskers: extend to the most extreme points within 1.5*IQR of the box; points beyond that are drawn individually as outliers
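
The pieces of a box-and-whisker plot can be computed by hand. The sketch below uses made-up sleep values (not the class survey) and the 1.5*IQR rule that geom_boxplot() applies by default.

```r
# Hypothetical hours of sleep
sleep <- c(4, 5, 6, 6, 7, 7, 8, 8, 9, 12)

# Quartiles: the box runs from the 25th to the 75th percentile
q <- quantile(sleep, c(0.25, 0.5, 0.75))
iqr <- q[["75%"]] - q[["25%"]]

# Fences sit 1.5 * IQR beyond the edges of the box
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr

# Whiskers stop at the most extreme observations inside the fences
lower_whisker <- min(sleep[sleep >= lower_fence])
upper_whisker <- max(sleep[sleep <= upper_fence])

# Anything outside the fences is plotted as an individual point
outliers <- sleep[sleep < lower_fence | sleep > upper_fence]

c(median = q[["50%"]], iqr = iqr,
  lower_whisker = lower_whisker, upper_whisker = upper_whisker)
outliers
```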

Categorical-Numeric: Histograms

df %>% 
  ggplot(aes(sleep)) + 
  geom_histogram() + 
  facet_wrap(~code_experience, nrow = 2)

Categorical-Numeric: Density Plots

df %>% 
  ggplot(aes(sleep, fill = code_experience)) + 
  geom_density(alpha = 0.5, position = "identity")

Numeric-Numeric Data: Scatterplots

df %>%
  mutate(Timestamp = lubridate::as_datetime(Timestamp), 
         time = lubridate::hour(Timestamp)) %>%
  ggplot(aes(time, sleep)) + geom_point() +
  labs(x = "Time of Day Completed Survey", y = "Hours of Sleep")

Conditional Logic in Data: ifelse()

Check whether a condition holds for each observation, and return one value if it does and another if it does not
- e.g., check if ideology is below the median, and if so, label that person liberal

Use ifelse() function
- ifelse(condition, output, output2)
  - if condition is TRUE, returns “output”
  - otherwise, returns “output2”

ifelse() function

x <- 1:10
ifelse(x<5.5, "Below Median", "Above Median")

 [1] "Below Median" "Below Median" "Below Median" "Below Median" "Below Median"
 [6] "Above Median" "Above Median" "Above Median" "Above Median" "Above Median"

ifelse() function

df <- df %>% 
  mutate(nocturnal = ifelse(lubridate::hour(Timestamp)<12, 
                            "morning person", 
                            "nocturnal"))
df$nocturnal

 [1] "morning person" "morning person" "morning person" "morning person"
 [5] "nocturnal"      "nocturnal"      "nocturnal"      "morning person"
 [9] "morning person" "morning person" "morning person" "morning person"
[13] "nocturnal"

Multiple Conditions

Sometimes we have more than one condition we want to evaluate
- x&y checks whether both x AND y are TRUE
- x|y checks whether either x OR y is TRUE
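
A minimal sketch of the difference between the two operators, using made-up vectors (the names hour and codes are hypothetical stand-ins, not survey columns):

```r
# Hypothetical values, not taken from the class survey
hour  <- c(9, 14, 23, 7)
codes <- c("Yes", "Yes", "No", "No")

# & is TRUE only where both conditions are TRUE
both <- hour < 12 & codes == "Yes"
both

# | is TRUE where at least one condition is TRUE
either <- hour < 12 | codes == "Yes"
either
```

Both operators work element by element, so they can be used directly inside ifelse() or filter().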

df <- df %>% 
  mutate(nocturnal_coder = ifelse(
    lubridate::hour(Timestamp)<12&code_experience=="Yes", 
                            "Not Nocturnal Coder", 
                            "Nocturnal Coder"))
df$nocturnal_coder[1:7]

[1] "Nocturnal Coder"     "Not Nocturnal Coder" "Nocturnal Coder"    
[4] "Not Nocturnal Coder" "Nocturnal Coder"     "Nocturnal Coder"    
[7] "Nocturnal Coder"

Negating Conditions

Check whether a condition is not TRUE
- !condition means NOT condition
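
A small sketch of negation on a made-up 0/1 indicator (the values are hypothetical). In R, ! is applied after ==, so !x == 1 is read as !(x == 1); the parentheses just make that explicit.

```r
# Hypothetical 0/1 indicator with one missing value
american_pol <- c(1, 0, NA, 1)

# Negate the comparison; NA stays NA
not_ap <- !(american_pol == 1)
not_ap

# Same result without parentheses, and with the != operator
same_no_parens <- !american_pol == 1
same_not_equal <- american_pol != 1
```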

df <- df %>% 
  mutate(not_american_politics = ifelse(!(american_pol == 1),
                            "Not interested in AP", 
                            "Interested in AP"))
df$not_american_politics[1:7]

[1] "Interested in AP"     "Not interested in AP" "Not interested in AP"
[4] "Interested in AP"     "Interested in AP"     "Interested in AP"    
[7] "Interested in AP"

case_when()

case_when(): tidyverse function to check multiple conditions sequentially

case_when(cond1~output1, 
          cond2~output2, 
          cond3~output3, 
          TRUE~NA_typeofoutput)
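
A runnable version of this template, as a sketch on a made-up numeric vector. Conditions are checked in order and the first one that is TRUE wins; the final TRUE line catches everything left over. NA_character_ is used because the other outputs are character strings.

```r
library(dplyr)

# Hypothetical scores, including a missing value
x <- c(1, 5, 10, NA)

size <- case_when(x < 3  ~ "low",
                  x < 8  ~ "medium",   # only reached if x < 3 was not TRUE
                  x >= 8 ~ "high",
                  TRUE   ~ NA_character_)
size
```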

case_when()

df <- df %>% 
  mutate(code_interests = 
           case_when(code_experience=="Yes"&american_pol==1~"AP Coder", 
                     code_experience=="Yes"&comparative_pol==1~"CP Coder", 
                     code_experience=="Yes"&international_rel==1~"IR Coder", 
                     TRUE ~ "Other"))
df$code_interests

 [1] "Other"    "CP Coder" "Other"    "AP Coder" "AP Coder" "AP Coder"
 [7] "Other"    "Other"    "AP Coder" "AP Coder" "Other"    "AP Coder"
[13] "Other"

Tidyverse Digression: Selecting Columns

Often we only want certain columns
Specify them with select()

df %>% select(year, python_exp) %>% head(2)

# A tibble: 2 × 2
  year      python_exp
  <chr>          <dbl>
1 Junior             0
2 Sophomore          1

If columns share a name pattern or a type, we can select them with helper functions such as starts_with(), ends_with(), contains(), or where(is.numeric)
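
A sketch of pattern- and type-based selection on a made-up tibble (the rows are hypothetical; the column names are borrowed from the NOMINATE data used later in this section):

```r
library(dplyr)

# Hypothetical two-row tibble
toy <- tibble(state_abbrev  = c("NY", "TX"),
              nominate_dim1 = c(-0.4, 0.5),
              nominate_dim2 = c(0.1, -0.2),
              congress      = c(117, 117))

# Select by name pattern
by_name <- toy %>% select(starts_with("nominate"))
names(by_name)

# Select by column type
by_type <- toy %>% select(where(is.numeric))
names(by_type)
```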

Working with Real Data

Navigate to voteview.com
Click the data tab
Click Download Data
Move to course folder

nominate <- read_csv("HSall_members.csv")

Working with Real Data

Navigate to course page
Click on data for today

nominate <- read_csv("https://raw.githubusercontent.com/SamuelFrederick/scope-and-methods-spring2023/main/section-3/HSall_members.csv")

Cleaning Data

Filter/subset the data, so it only contains members of the House and Senate from the 97th Congress to the current Congress
Change the party_code variable, so it reads “D”, “R” instead of “100”, “200”
Select only the variables we will use for this analysis:
- party_code, congress, chamber, state_abbrev, nominate_dim1, nominate_dim2

Cleaning Data

Filter/subset the data, so it only contains members of the House and Senate from the 97th Congress to the current Congress

nominate <- nominate %>% filter(chamber%in%c("House", "Senate")&
                                  congress>=97)

Change the party_code variable, so it reads “D”, “R” instead of “100”, “200”

nominate <- nominate %>%
  mutate(party_code = case_when(party_code==100~"D", 
                                party_code==200~"R",
                                TRUE ~ NA_character_))

Cleaning Data

Select only the variables we will use for this analysis:
- party_code, congress, chamber, state_abbrev, nominate_dim1, nominate_dim2

nominate <- nominate %>% 
  select(party_code, congress, chamber, 
         state_abbrev, nominate_dim1, nominate_dim2)

Putting It Together

nominate <- nominate %>% 
  filter(chamber %in% c("House", "Senate") & congress >= 97) %>%
  mutate(party_code = case_when(party_code == 100 ~ "D", 
                                party_code == 200 ~ "R",
                                TRUE ~ NA_character_)) %>%
  select(party_code, congress, chamber, 
         state_abbrev, nominate_dim1, nominate_dim2) %>%
  # members of other parties now have party_code NA; drop those rows
  drop_na(party_code)
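
The last step uses drop_na() from tidyr, which has not appeared before. A minimal sketch on a made-up tibble shows that naming a column restricts the check to that column:

```r
library(dplyr)
library(tidyr)

# Hypothetical rows: one missing party, one missing score
toy <- tibble(party_code    = c("D", NA, "R"),
              nominate_dim1 = c(-0.3, 0.1, NA))

# Only rows with a missing party_code are removed;
# the NA in nominate_dim1 is left alone
kept <- toy %>% drop_na(party_code)
kept
```

Calling drop_na() with no column names would instead drop every row that has a missing value anywhere.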

Summarizing Data: Univariate

Create a barplot of party membership with fill set to party variable
- Hint: you might have to re-level the factor variable
Create a histogram of nominate_dim1
- What stands out to you in this histogram?
Use datasummary_skim() to summarize the numeric variables
- Reminder: you will likely have to load the modelsummary package

Summarizing Data: Univariate

Create a barplot of party membership with fill set to party variable

nominate <- nominate  %>% mutate(party_code = factor(party_code, 
                             levels = c("R", "D")))
nominate %>%
  ggplot(aes(party_code, fill = party_code)) + 
  geom_bar() + 
  labs(x = "Party", y = "Count", fill = "Party")

Summarizing Data: Univariate

Create a histogram of nominate_dim1
- What stands out to you in this histogram?

nominate %>%
  ggplot(aes(nominate_dim1)) +
  geom_histogram() + 
  labs(x = "NOMINATE-Dim. 1", y = "Count")

Summarizing Data: Univariate

Use datasummary() to summarize nominate_dim1 and nominate_dim2
- Reminder: you will likely have to load the modelsummary package

datasummary((nominate_dim1 + nominate_dim2) ~ Min + Median + Mean + 
              Max + SD, 
            data = nominate)

	Min	Median	Mean	Max	SD
nominate_dim1	−0.83	−0.06	0.02	0.94	0.41
nominate_dim2	−1.00	−0.03	−0.02	1.00	0.36

Summarizing Data: Bivariate and Beyond

Categorical-Numeric:
- Create a density plot of nominate_dim1 filled by party
Categorical-Categorical-Numeric:
- Facet the above density plot by chamber of Congress
Numeric-Numeric:
- Create a scatterplot of nominate_dim2 against nominate_dim1
Numeric-Numeric-Categorical
- Color the above scatterplot by party

Histogram:

# after_stat(density) replaces the deprecated ..density.. notation
nominate %>%
  ggplot(aes(nominate_dim1, y = after_stat(density), fill = party_code)) + 
  geom_histogram(alpha = 0.5, position = "identity") + 
  labs(x = "NOMINATE-Dim. 1", y = "Density", fill = "Party")

Density Plot:

nominate %>%
  ggplot(aes(nominate_dim1, fill = party_code)) + 
  geom_density(alpha = 0.5, position = "identity") + 
  labs(x = "NOMINATE-Dim. 1", y = "Density", fill = "Party")

Histogram, faceted by chamber:

# after_stat(density) replaces the deprecated ..density.. notation
nominate %>%
  ggplot(aes(nominate_dim1, y = after_stat(density), fill = party_code)) + 
  geom_histogram(alpha = 0.5, position = "identity") + 
  labs(x = "NOMINATE-Dim. 1", y = "Density", fill = "Party") +
  facet_wrap(~chamber, nrow = 2)

Density Plot, faceted by chamber:

nominate %>%
  ggplot(aes(nominate_dim1, y = after_stat(density), fill = party_code)) + 
  geom_density(alpha = 0.5, position = "identity") + 
  labs(x = "NOMINATE-Dim. 1", y = "Density", fill = "Party") +
  facet_wrap(~chamber, nrow = 2)

Summarizing Data: Bivariate and Beyond

Numeric-Numeric:
- Create a scatterplot of nominate_dim2 against nominate_dim1

nominate %>%
  ggplot(aes(nominate_dim1, nominate_dim2)) + 
  geom_point()+
  labs(x = "NOMINATE-Dim. 1", y = "NOMINATE-Dim. 2")

Summarizing Data: Bivariate and Beyond

Numeric-Numeric-Categorical
- Color the above scatterplot by party

nominate %>%
  ggplot(aes(nominate_dim1, nominate_dim2, col = party_code)) + 
  geom_point()+
  labs(x = "NOMINATE-Dim. 1", y = "NOMINATE-Dim. 2", col = "Party")

Tidyverse Digression: Grouping and Summarizing Data

Often want to figure out some quantity within a category
- e.g., ideology by party, average GDP per capita by country

Use group_by() function in tidyverse to group our data by given variables

Use summarize() with the desired function to create a new summary variable
- Very similar to mutate(), but returns one row per group instead of one row per observation
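
A sketch of the difference between the two verbs on a made-up tibble (the column names party and score are hypothetical):

```r
library(dplyr)

# Hypothetical scores for four legislators
toy <- tibble(party = c("D", "D", "R", "R"),
              score = c(-0.4, -0.2, 0.3, 0.5))

# summarize() collapses each group to a single row
by_party <- toy %>%
  group_by(party) %>%
  summarize(mean_score = mean(score))
by_party

# mutate() keeps every row and repeats the group mean alongside it
with_means <- toy %>%
  group_by(party) %>%
  mutate(mean_score = mean(score)) %>%
  ungroup()
with_means
```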

Tidyverse Digression: Grouping and Summarizing

Example with our class survey data

df %>%
  group_by(code_experience)%>%
  summarize(sleep = mean(sleep))

# A tibble: 2 × 2
  code_experience sleep
  <chr>           <dbl>
1 No               6.33
2 Yes              7.32

Getting Ideology by Congress and Party

What would be the first step in getting the average of nominate_dim1 by Congress and party?

nominate %>% 
  group_by(congress, party_code) %>% head(c(2,2))

# A tibble: 2 × 2
# Groups:   congress, party_code [1]
  party_code congress
  <fct>         <dbl>
1 R                97
2 R                97

What comes next?

nominate %>% 
  group_by(congress, party_code) %>%
  summarize(nominate_dim1 = 
              mean(nominate_dim1, 
                   na.rm = TRUE)) %>%
  head(4)

# A tibble: 4 × 3
# Groups:   congress [2]
  congress party_code nominate_dim1
     <dbl> <fct>              <dbl>
1       97 R                  0.307
2       97 D                 -0.300
3       98 R                  0.321
4       98 D                 -0.302
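
Note the "# Groups: congress [2]" line in the output: after summarizing over two grouping variables, the result is still grouped by the first one. The sketch below, on made-up rows, uses summarize()'s .groups argument to return an ungrouped tibble instead.

```r
library(dplyr)

# Hypothetical member-level rows
toy <- tibble(congress      = c(97, 97, 97, 98, 98, 98),
              party_code    = c("D", "D", "R", "D", "R", "R"),
              nominate_dim1 = c(-0.3, NA, 0.3, -0.2, 0.4, 0.2))

# One row per congress-party pair; .groups = "drop" removes all grouping
means <- toy %>%
  group_by(congress, party_code) %>%
  summarize(nominate_dim1 = mean(nominate_dim1, na.rm = TRUE),
            .groups = "drop")
means
```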

Line Plots

How could we make a scatterplot from our summarized data of average NOMINATE score, colored by party, against the Congress number?

Line Plots

nominate %>%
  group_by(congress, party_code) %>% 
  summarize(nominate_dim1_mean = mean(nominate_dim1, na.rm = TRUE)) %>%
  ggplot(aes(congress, nominate_dim1_mean, col = party_code)) +
  geom_point()

Line Plots

Starting from the scatterplot code (now with axis labels), we only need to change one line, geom_point(), to make a line plot:

nominate %>%
  group_by(congress, party_code) %>% 
  summarize(nominate_dim1_mean = mean(nominate_dim1, na.rm = TRUE)) %>%
  ggplot(aes(congress, nominate_dim1_mean, col = party_code)) +
  geom_point() +
  labs(x = "Congress", y= "Mean NOMINATE-Dim. 1", col = "Party")

Line Plots

Switching geom_point() to geom_line() gives the line plot:

nominate %>%
  group_by(congress, party_code) %>% 
  summarize(nominate_dim1_mean = mean(nominate_dim1, na.rm = TRUE)) %>%
  ggplot(aes(congress, nominate_dim1_mean, col = party_code)) +
  geom_line() +
  labs(x = "Congress", y= "Mean NOMINATE-Dim. 1", col = "Party")
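
Because ggplot layers are added with +, the two geoms can also be combined. The sketch below uses a small made-up summary table (the numbers are illustrative, not computed from the NOMINATE data) to draw a line for each party with points on top.

```r
library(ggplot2)

# Hypothetical party means by Congress
toy_means <- data.frame(
  congress           = c(97, 98, 99, 97, 98, 99),
  party_code         = c("D", "D", "D", "R", "R", "R"),
  nominate_dim1_mean = c(-0.30, -0.30, -0.31, 0.31, 0.32, 0.33)
)

# Lines connect the means within each party; points mark each Congress
p <- ggplot(toy_means, aes(congress, nominate_dim1_mean, col = party_code)) +
  geom_line() +
  geom_point() +
  labs(x = "Congress", y = "Mean NOMINATE-Dim. 1", col = "Party")
p
```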
