Section 2. Introduction to R

Sam Frederick

1/31/23

Last Section

  • Setting Working Directory to Course Folder
    • setwd("/path/to/your/folder")
    • RProjects
  • RScript and RMarkdown files
  • Beginning functions in R
    • e.g., sum(), mean(), min(), max(), sqrt()

  • Vectors
    • c()
  • Objects
    • x <- 1:3

Today’s Section

  • Types of Objects in R
    • Numeric
    • Categorical
    • Logical
  • Summarizing Data in One Variable

  • Working with Real Data in R

Numeric Data

  • Integers (int type)

  • Doubles

  • Ways of Summarizing (Univariate):

    • Mean, median, min, max, range, IQR, standard deviation
    • summary() function
    • Histograms: hist()
  • Ways of Summarizing (Bivariate):

    • Scatterplot

Summary Statistics - Central Tendency: Mean

  • Mean/Average:
    • \(\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n}x_i\)
x <- c(1, 100, 7, 6, 5)
sum(x)/length(x)
[1] 23.8
mean(x)
[1] 23.8

Summary Statistics - Central Tendency: Median

  • Median
    • arrange vector in numerical order
    • find the middle value (50% above and 50% below)
    • not susceptible to outliers like the mean/average
  • What’s the median of this vector?
x <- c(1, 100, 7, 6, 5)
quantile(x, probs = 0.5)
50% 
  6 
median(x)
[1] 6
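
The two median steps above can be done by hand in R. This is a minimal sketch using the same vector; the second vector, x_no_outlier, is made up here to illustrate the point about outliers.

```r
# Same vector as on the slide
x <- c(1, 100, 7, 6, 5)

# Step 1: arrange the vector in numerical order
x_sorted <- sort(x)                    # 1 5 6 7 100

# Step 2: take the middle value (the length is odd, so there is one middle)
middle <- (length(x_sorted) + 1) / 2
x_sorted[middle]                       # 6, same as median(x)

# Hypothetical comparison: swap the outlier 100 for 8
x_no_outlier <- c(1, 8, 7, 6, 5)
mean(x_no_outlier)                     # 5.4 (was 23.8 with the outlier)
median(x_no_outlier)                   # 6 (unchanged)
```

The mean moves from 23.8 to 5.4 when the outlier is replaced, while the median stays at 6.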

Summary Statistics: Measures of Spread

  • Standard Deviation
    • Measures spread around mean
    • Square root of the variance

\(Var(x) = \sigma^2 = \frac{1}{n-1}\sum_{i=1}^n(x_i - \bar{x})^2\)

sum((x - mean(x))^2)/(length(x) - 1)
[1] 1819.7
var(x)
[1] 1819.7

\(sd(x) = \sigma = \sqrt{\frac{1}{n-1}\sum_{i=1}^n(x_i - \bar{x})^2}\)

sqrt(var(x))
[1] 42.65794
sqrt(sum((x - mean(x))^2)/(length(x) - 1))
[1] 42.65794
sd(x)
[1] 42.65794
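
It can help to see the pieces of the variance formula one at a time. The sketch below breaks the one-line calculation into steps; the intermediate names (deviations, sum_sq, and so on) are only illustrative.

```r
x <- c(1, 100, 7, 6, 5)

# Distance of each value from the mean (mean is 23.8)
deviations <- x - mean(x)        # -22.8 76.2 -16.8 -17.8 -18.8

# Square the deviations so negatives and positives do not cancel
squared <- deviations^2

# Add them up, then divide by n - 1
sum_sq <- sum(squared)           # 7278.8
n <- length(x)
variance <- sum_sq / (n - 1)     # 1819.7

# The standard deviation is the square root of the variance
std_dev <- sqrt(variance)        # about 42.66

# Both should agree with the built-in functions
all.equal(variance, var(x))
all.equal(std_dev, sd(x))
```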

Summary Statistics: Measures of Spread

  • Range (minimum, maximum)
range(x)
[1]   1 100
min(x)
[1] 1
max(x)
[1] 100

Summary Statistics: Measures of Spread

  • Interquartile Range (IQR)
    • Arrange in numerical order
    • Find values below which 25% and 75% of the data lie
quantile(x, probs = c(0.25, 0.75))
25% 75% 
  5   7 
IQR(x)
[1] 2
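
The IQR is the distance between those two quantiles. A short sketch with the same vector (the names q and iqr_by_hand are just for illustration):

```r
x <- c(1, 100, 7, 6, 5)

# 25th and 75th percentiles
q <- quantile(x, probs = c(0.25, 0.75))

# IQR = 75th percentile minus 25th percentile
iqr_by_hand <- unname(q["75%"] - q["25%"])   # 2
iqr_by_hand == IQR(x)                        # TRUE

# Unlike the standard deviation, the IQR is not pulled up by the outlier 100
sd(x)                                        # about 42.66
IQR(x)                                       # 2
```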

Summary Statistics

  • summary() function
    • min, 1st quartile, median, mean, 3rd quartile, max (and the number of NAs, if any)
summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    1.0     5.0     6.0    23.8     7.0   100.0 

Tidyverse Digression

  • Install tidyverse
install.packages("tidyverse")
  • Load tidyverse for use
library(tidyverse)

Tidyverse Digression

  • Pipe Operator: x %>% function()
    • Passes the object x to the function as its first argument
    • Lets code be written and read left to right
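
A small sketch of what the pipe does, assuming the magrittr package is installed (it supplies %>%, which is also available after library(tidyverse)):

```r
library(magrittr)  # provides %>%

x <- c(1, 100, 7, 6, 5)

# These two lines do the same thing
mean(x)
x %>% mean()

# Nested calls are read inside-out...
nested <- round(sqrt(mean(x)), 2)

# ...while the piped version is read left to right
piped <- x %>% mean() %>% sqrt() %>% round(2)

nested   # 4.88
piped    # 4.88
```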

Tidyverse Digression

  • Tibbles:
    • Tidyverse version of data.frame
    • A lot of helpful functions that perform various operations
      • Example: mutate() to create and change column(s)
df <- tibble(x = x, y = 1:5)
df <- df %>%
  mutate(z = 6:10)
df
# A tibble: 5 × 3
      x     y     z
  <dbl> <int> <int>
1     1     1     6
2   100     2     7
3     7     3     8
4     6     4     9
5     5     5    10

Plotting in R: ggplot

  • Data

  • Plot Foundation: ggplot(data, aes())

    • Creates a plot base
    • Can add items on top of base with +
  • aes(): aesthetic arguments

    • Map variables in the data to features of the plot
    • Examples: x, y, col (color)

Visually Summarizing Data

df %>% 
  ggplot(aes(x)) 

Visually Summarizing Data

df %>% 
  ggplot(aes(x)) + 
  geom_histogram()

Visually Summarizing Data

df %>% 
  ggplot(aes(x)) + 
  geom_histogram(fill = "blue")

Visually Summarizing Data

df %>% 
  ggplot(aes(x)) + 
  geom_histogram(fill = "blue") + 
  labs(x = "X", y = "Count", title = "Histogram of X")

Categorical Data

  • Character (chr) data

  • Factors

  • Ways of Summarizing (Univariate):

    • Tables
    • Barplots
  • Ways of Summarizing (Bivariate):

    • Cross-tabs
    • Box-and-whisker plots

Factors

  • Usually turn character data into factors for analysis
    • factor()
  • R’s modeling functions often turn these into dummy/indicator variables
    • Indicator variables: take on a value of 1 if some condition is met, 0 otherwise
    • e.g., Male (1 if individual identifies as a man, 0 otherwise)
  • Levels come in a specific order (by default, alphabetical or numerical order)
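
To make the default ordering and the indicator variables concrete, here is a minimal base R sketch. The vector yr is made up for illustration, and model.matrix() is used only as one way to display the indicators that modeling functions build from a factor.

```r
# Hypothetical character data
yr <- c("Sophomore", "Junior", "First Year", "Junior")

# Without a levels argument, levels are sorted alphabetically
yr_f <- factor(yr)
levels(yr_f)        # "First Year" "Junior" "Sophomore"

# One 0/1 indicator column per level, except the first (the baseline)
mm <- model.matrix(~ yr_f)
colnames(mm)        # "(Intercept)" "yr_fJunior" "yr_fSophomore"
mm[, "yr_fJunior"]  # 1 where the observation is a Junior, 0 otherwise
```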

Factors

  • factor(variable, levels = c(...), labels = c(...))
    • levels argument:
      • must match exact spelling of categories
      • can be used to reorder the levels/categories
    • labels argument:
      • doesn’t have to match spelling (can be anything)
      • must be same length as number of levels/categories

Factors

grp <- c(rep("A", 3), rep("B", 6), rep("C", 8))
grp
 [1] "A" "A" "A" "B" "B" "B" "B" "B" "B" "C" "C" "C" "C" "C" "C" "C" "C"
grp <- factor(grp)
grp
 [1] A A A B B B B B B C C C C C C C C
Levels: A B C
grp <- factor(grp, levels = c("C", "B", "A"))
grp
 [1] A A A B B B B B B C C C C C C C C
Levels: C B A
grp <- factor(grp, 
              levels = c("C", "B", "A"), 
              labels = c("Group C", "Group B", "Group A"))
grp
 [1] Group A Group A Group A Group B Group B Group B Group B Group B Group B
[10] Group C Group C Group C Group C Group C Group C Group C Group C
Levels: Group C Group B Group A

Tables and Proportion Tables

  • table()
    • Number of observations in each category
table(grp)
grp
Group C Group B Group A 
      8       6       3 
  • prop.table()
    • Proportion of total observations in each category
prop.table(table(grp))
grp
  Group C   Group B   Group A 
0.4705882 0.3529412 0.1764706 
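
A proportion table is just each count divided by the total. A quick sketch, using the original A/B/C labels for brevity:

```r
grp <- factor(c(rep("A", 3), rep("B", 6), rep("C", 8)))

counts <- table(grp)

# Dividing by the total gives the same result as prop.table()
by_hand <- counts / sum(counts)
by_hand

# Multiply by 100 and round to report percentages
pct <- round(100 * prop.table(counts), 1)
pct                 # A 17.6, B 35.3, C 47.1
```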

Visual Summaries of Categorical Data

  • Barplots
df <- tibble(grp = grp)
df %>% 
  ggplot(aes(grp)) + 
  geom_bar() 

Visual Summaries of Categorical Data

  • Barplots
df <- tibble(grp = grp)
df %>% 
  ggplot(aes(grp, fill = grp)) + 
  geom_bar() + 
  labs(x = "Group", y = "Count", 
       title = "Barplot of Groups")

Logical Operators

  • Logical Operators
    • Check whether condition is TRUE/FALSE
  • x==y
    • Is x equal to y?
  • x<y
    • Is x less than y?
  • x>=y
    • Is x greater than or equal to y?

Logical Operators

Operator   Meaning
==         equal to
<          less than
>          greater than
<=         less than or equal to
>=         greater than or equal to
%in%       in (is an element of)
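
These operators work element by element on vectors. A short sketch with the vector x from earlier in the section:

```r
x <- c(1, 100, 7, 6, 5)

x == 7            # FALSE FALSE  TRUE FALSE FALSE
x < 6             #  TRUE FALSE FALSE FALSE  TRUE
x >= 7            # FALSE  TRUE  TRUE FALSE FALSE
x %in% c(1, 5)    #  TRUE FALSE FALSE FALSE  TRUE

# TRUE counts as 1 and FALSE as 0, so sum() and mean() give
# the number and the share of values meeting a condition
sum(x > 5)        # 3
mean(x > 5)       # 0.6
```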

Logical Data

  • TRUE or FALSE
    • can also be abbreviated T or F
  • NA is also technically a logical value
    • Indicates a missing observation
  • Can be used for subsetting data
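
A brief sketch of how a logical vector picks out elements, and how NA behaves in comparisons. The vector y is made up for illustration.

```r
x <- c(1, 100, 7, 6, 5)

# A logical vector inside brackets keeps the positions that are TRUE
keep <- x > 5
x[keep]                    # 100 7 6

# NA has type logical, and comparing with NA gives NA
class(NA)                  # "logical"
NA > 5                     # NA

# Hypothetical vector with a missing value
y <- c(2, NA, 9)
is.na(y)                   # FALSE TRUE FALSE
y[y > 5]                   # NA 9 (the NA comes along)
y[!is.na(y) & y > 5]       # 9
```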

Subsetting Data

  • Brackets []
    • For vectors, indicates position in vector you want
    • For data frames and matrices, [row, column]
seq(0, 50, by = 2)[4]
[1] 6
df[2,1]
# A tibble: 1 × 1
  grp    
  <fct>  
1 Group A

Subsetting Data

  • $
    • In data frames/tibbles, gives variable after $ as vector
df$grp
 [1] Group A Group A Group A Group B Group B Group B Group B Group B Group B
[10] Group C Group C Group C Group C Group C Group C Group C Group C
Levels: Group C Group B Group A

Subsetting Data

subset(df, 
       subset = grp=="Group A")
# A tibble: 3 × 1
  grp    
  <fct>  
1 Group A
2 Group A
3 Group A
df %>% 
  filter(grp=="Group A")
# A tibble: 3 × 1
  grp    
  <fct>  
1 Group A
2 Group A
3 Group A
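
The same rows can also be selected with brackets and a logical condition. This sketch uses a base data.frame with the original A/B/C labels rather than the tibble above, so it runs without the tidyverse.

```r
# Base R data frame standing in for the tibble on the slide
df_base <- data.frame(grp = c(rep("A", 3), rep("B", 6), rep("C", 8)))

# Logical vector: TRUE for rows in group A
rows_a <- df_base$grp == "A"

# [row, column]: keep the TRUE rows, leave the column slot empty to keep all columns
only_a <- df_base[rows_a, , drop = FALSE]
nrow(only_a)                                 # 3

# subset() gives the same rows
nrow(subset(df_base, subset = grp == "A"))   # 3
```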

Working with Data in R

  • Download survey responses from Courseworks

    • “Section 2” > “intro_survey.csv”
  • Put file into course folder

  • Set working directory in R or open course RProject

  • Read file into R using tidyverse:

intro <- read_csv("intro_survey.csv")
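
If read_csv() cannot find the file, it usually means the working directory is not the course folder. The sketch below shows two checks and then reads a small made-up file written to a temporary location, so it runs without the real survey; the rows in the toy file are invented, and it assumes the readr package (part of the tidyverse) is installed.

```r
library(readr)  # read_csv() lives here; library(tidyverse) loads it too

# Where is R looking, and is the file there?
getwd()
file.exists("intro_survey.csv")

# Toy file with invented rows, written to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("sleep,year", "7,Junior", "6,Sophomore", "8,Junior"), tmp)

toy <- read_csv(tmp, show_col_types = FALSE)

dim(toy)     # 3 rows, 2 columns
names(toy)   # "sleep" "year"
head(toy)    # first few rows
```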

Summarizing Data

  • What is the median number of hours of sleep students in this section get each night? How about the standard deviation and range?
median(intro$sleep)
[1] 7
quantile(intro$sleep, probs = 0.5)
50% 
  7 
sd(intro$sleep)
[1] 0.7813942
range(intro$sleep)
[1] 5 8

Summarizing Data

  • See if you can reproduce this plot:
intro %>%
  ggplot(aes(sleep)) + 
  geom_histogram() + 
  labs(x = "Hours of Sleep per Night", 
       y = "Number",
       title = "Histogram of Number of Hours of Sleep")

Summarizing Data

  • Make the variable year into a factor, with the levels in the proper order

    • Hint: remember the mutate() function
  • Make a proportion table from the new factor variable year

  • Try making a barplot for the factor variable year with each year in a different color

Summarizing Data

  • Make the variable year into a factor, with the levels in the proper order
intro <- intro %>% 
  mutate(year = factor(year, levels = c("First Year", "Sophomore", 
                                        "Junior", "Senior")))
  • Make a proportion table from the new factor variable year
prop.table(table(intro$year))

First Year  Sophomore     Junior     Senior 
0.00000000 0.07692308 0.92307692 0.00000000 
  • Try making a barplot for the factor variable year with each year in a different color

Summarizing Data

  • Try making a barplot for the factor variable year with each year in a different color
intro %>%
  ggplot(aes(year, fill = year)) + 
  geom_bar() + 
  labs(x = "Year", y = "Count", 
       title = "Barplot of Year")

Making Nice Tables in R

  • Install modelsummary package
install.packages("modelsummary")
  • Load the modelsummary package
library(modelsummary)

Making Nice Tables in R

datasummary_skim(data = intro %>% select(sleep, homework), 
                 histogram = FALSE)
          Unique (#)  Missing (%)  Mean   SD  Min  Median   Max
sleep              6            0   6.9  0.8  5.0     7.0   8.0
homework           4           77   8.0  4.4  5.0     6.0  13.0

Missing Data

  • Often data incomplete or missing altogether

  • R shows as NA

  • Some functions return NA whenever NAs are present in the data

mean(intro$homework)
[1] NA
  • Solution: na.rm argument
mean(intro$homework, na.rm = T)
[1] 8
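
Before dropping NAs, it is worth checking how many there are. A minimal sketch with a made-up vector hw (not the survey data):

```r
# Hypothetical data with two missing values
hw <- c(5, NA, 13, NA, 6)

mean(hw)                     # NA, because NAs are present

# How many values are missing, and what share?
sum(is.na(hw))               # 2
mean(is.na(hw))              # 0.4

# na.rm = TRUE drops the NAs before calculating
mean(hw, na.rm = TRUE)       # 8
median(hw, na.rm = TRUE)     # 6
```

Here the mean of 8 is based on only three of the five observations.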

Recap

  • Data
    • Numeric
      • Summary Statistics
        • Central Tendency: Mean, Median
        • Spread: Standard Deviation, Variance, Range, Interquartile Range
      • Visual Summaries: Histogram
    • Categorical
      • Summary Statistics: Table, Prop. Table
      • Visual Summaries: Barplot

Recap

  • Tidyverse
    • Tibbles
      • Create new columns with mutate()
      • Subset with filter() and logical conditions
    • Plotting
      • Foundation for plot with ggplot(data, aes())
      • Build on top of foundation with +

Recap

  • Factors
    • Use factor(variable, levels = c(...), labels = c(...))
  • Missing Data
    • Be aware of how many observations are missing
    • Can usually drop NAs from a calculation using the na.rm = TRUE argument

Next Section

  • Summarizing more than one variable

  • “If” Statements

  • For Loops