Section 2. Introduction to R

Sam Frederick

1/31/23

Last Section

Setting Working Directory to Course Folder
- setwd("/path/to/your/folder")
- RProjects

RScript and RMarkdown files

Beginning functions in R
- e.g., sum(), mean(), min(), max(), sqrt()

Last Section

Vectors
- c()

Objects
- x <- 1:3

Today’s Section

Types of Objects in R
- Numeric
- Categorical
- Logical

Summarizing Data in One Variable
Working with Real Data in R

Numeric Data

Integers (int type)
Doubles (dbl type)
Ways of Summarizing (Univariate):
- Mean, median, min, max, range, IQR, standard deviation
- summary() function
- Histograms: hist()
Ways of Summarizing (Bivariate):
- Scatterplot

Summary Statistics - Central Tendency: Mean

Mean/Average:
- $\bar{x} = \frac{x_1 + x_2+...+x_n}{n} = \frac{1}{n}\sum_{i=1}^{n}x_i$

x <- c(1, 100, 7, 6, 5)
sum(x)/length(x)

[1] 23.8

mean(x)

[1] 23.8

Summary Statistics - Central Tendency: Median

Median
- arrange vector in numerical order
- find the middle value (50% above and 50% below)
- less sensitive to outliers than the mean/average
What’s the median of this vector?

x <- c(1, 100, 7,6,5)

quantile(x, probs = 0.5)

50% 
  6

median(x)

[1] 6
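
To make the outlier point concrete, here is a small sketch that compares the mean and median of the same vector with and without the extreme value 100. It also shows that, with an even number of values, median() averages the two middle values. The name x_no_outlier is made up for this illustration.

```r
x <- c(1, 100, 7, 6, 5)
x_no_outlier <- x[x < 100]   # hypothetical name: x with the value 100 dropped

# The mean moves a lot when the outlier is removed
mean(x)               # 23.8
mean(x_no_outlier)    # 4.75

# The median barely moves
median(x)             # 6
sort(x_no_outlier)    # 1 5 6 7
median(x_no_outlier)  # 5.5, the average of the two middle values 5 and 6
```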

Summary Statistics: Measures of Spread

Standard Deviation
- Measures spread around mean
- Square root of the variance

$Var(x) = \sigma^2 = \frac{1}{n-1}\sum_{i=1}^n(x_i - \bar{x})^2$

$sd(x) = \sigma = \sqrt{\frac{1}{n-1}\sum_{i=1}^n(x_i - \bar{x})^2}$

Summary Statistics: Measures of Spread

$Var(x) = \sigma^2 = \frac{1}{n-1}\sum_{i=1}^n(x_i - \bar{x})^2$

sum((x - mean(x))^2)/(length(x) -1)

[1] 1819.7

var(x)

[1] 1819.7

$sd(x) = \sigma = \sqrt{\frac{1}{n-1}\sum_{i=1}^n(x_i - \bar{x})^2}$

sqrt(var(x))

[1] 42.65794

sqrt(sum((x - mean(x))^2)/(length(x) -1))

[1] 42.65794

sd(x)

[1] 42.65794

Summary Statistics: Measures of Spread

Range (minimum, maximum)

range(x)

[1]   1 100

min(x)

[1] 1

max(x)

[1] 100

Summary Statistics: Measures of Spread

Interquartile Range (IQR)
- Arrange in numerical order
- Find values below which 25% and 75% of the data lie
- IQR is the difference between these two values

quantile(x, probs = c(0.25, 0.75))

25% 75% 
  5   7

IQR(x)

[1] 2
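
As a quick check, the IQR can also be computed by hand from the two quantiles. This is only a sketch; the name iqr_by_hand is made up for the illustration.

```r
x <- c(1, 100, 7, 6, 5)

# 25th and 75th percentiles
q <- quantile(x, probs = c(0.25, 0.75))

# Difference between the two; unname() drops the "75%" label
iqr_by_hand <- unname(q["75%"] - q["25%"])
iqr_by_hand  # 2
IQR(x)       # 2
```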

Summary Statistics

summary() function
- min, 1st quartile, median, mean, 3rd quartile, max, and (if there are any) the number of missing observations

summary(x)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    1.0     5.0     6.0    23.8     7.0   100.0
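
The count of missing observations only appears when the vector actually contains NAs. A minimal sketch, using a made-up vector x_with_na that is not part of the section data:

```r
x_with_na <- c(1, 100, 7, NA, 5)   # hypothetical vector with one missing value

s <- summary(x_with_na)
s          # same six statistics, plus an extra "NA's" entry
names(s)   # the last name is "NA's"
```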

Tidyverse Digression

Install tidyverse

install.packages("tidyverse")

Load tidyverse for use

library(tidyverse)

Tidyverse Digression

Pipe Operator: x %>% function()
- Passes the object x to function() as its first argument
- Lets you write and read code left to right
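
A small sketch of the same calculation written two ways, nested and piped. The names nested and piped are made up for the illustration, and the choice of functions is arbitrary.

```r
library(tidyverse)  # provides the %>% operator

x <- c(1, 100, 7, 6, 5)

# Nested calls are read from the inside out
nested <- round(sqrt(mean(x)), 2)

# The pipe version is read left to right: take x, then mean, then sqrt, then round
piped <- x %>% mean() %>% sqrt() %>% round(2)

nested  # 4.88
piped   # 4.88
```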

Tidyverse Digression

Tibbles:
- Tidyverse version of data.frame
- The tidyverse provides many helper functions that operate on tibbles
  - Example: mutate() to create and change column(s)

df <- tibble(x = x, y = 1:5)
df <- df %>%
  mutate(z = 6:10)
df

# A tibble: 5 × 3
      x     y     z
  <dbl> <int> <int>
1     1     1     6
2   100     2     7
3     7     3     8
4     6     4     9
5     5     5    10
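
mutate() can also build a column from existing columns, or overwrite one. A minimal sketch, where the tibble is rebuilt so the block runs on its own and the names df2 and w are made up:

```r
library(tidyverse)

df <- tibble(x = c(1, 100, 7, 6, 5), y = 1:5)

df2 <- df %>%
  mutate(w = x + y,     # new column computed from two existing columns
         y = y * 10)    # overwrite an existing column (w above used the old y)

df2$w   # 2 102 10 10 10
df2$y   # 10 20 30 40 50
```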

Plotting in R: ggplot

Data
Plot Foundation: ggplot(data, aes())
- Creates a plot base
- Can add items on top of base with +
aes(): aesthetic arguments
- Take information from data and put into plot
- Examples: x, y, col (color)

Visually Summarizing Data

df %>% 
  ggplot(aes(x))

Visually Summarizing Data

df %>% 
  ggplot(aes(x)) + 
  geom_histogram()

Visually Summarizing Data

df %>% 
  ggplot(aes(x)) + 
  geom_histogram(fill = "blue")

Visually Summarizing Data

df %>% 
  ggplot(aes(x)) + 
  geom_histogram(fill = "blue") + 
  labs(x = "X", y = "Count", title = "Histogram of X")
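
By default geom_histogram() uses 30 bins and prints a message suggesting you choose a value yourself. One way to do that is the bins argument; the value 10 below is an arbitrary choice for illustration, and the tibble is rebuilt so the block runs on its own.

```r
library(tidyverse)

df <- tibble(x = c(1, 100, 7, 6, 5))

p <- df %>%
  ggplot(aes(x)) +
  geom_histogram(bins = 10, fill = "blue") +   # 10 bins instead of the default 30
  labs(x = "X", y = "Count", title = "Histogram of X")

p   # printing the object draws the plot
```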

Categorical Data

Character chr data
Factors
Ways of Summarizing (Univariate):
- Tables
- Barplots
Ways of Summarizing (Bivariate):
- Cross-tabs
- Box-and-whisker plots

Factors

Usually turn character data into factors for analysis
- factor()
R often turns these into dummy/indicator variables
- Indicator variables: take on a value of 1 if some condition is met, 0 otherwise
- e.g., Male (1 if individual identifies as a man, 0 otherwise)
Levels come in a specific order (by default, alphabetical or numerical order)
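
An indicator variable can also be built by hand from a logical comparison. This is only a sketch with made-up data; the vectors sex and male are hypothetical.

```r
sex <- c("Male", "Female", "Female", "Male")   # hypothetical data

# The comparison gives TRUE/FALSE; as.integer() turns that into 1/0
male <- as.integer(sex == "Male")
male   # 1 0 0 1
```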

Factors

factor(variable, levels = c(...), labels = c(...))
- levels argument:
  - must match exact spelling of categories
  - can be used to reorder the levels/categories
- labels argument:
  - doesn’t have to match spelling (can be anything)
  - must be same length as number of levels/categories
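
The sketch below shows why the spelling in levels matters: any value that does not match a level becomes NA. The vector yr and the label text are made up for the illustration.

```r
yr <- c("Junior", "Senior", "Junior")   # hypothetical data

# "junior" does not match "Junior", so those values are turned into NA
bad <- factor(yr, levels = c("junior", "Senior"))
bad          # <NA> Senior <NA>
is.na(bad)   # TRUE FALSE TRUE

# Levels spelled exactly; labels can be anything of the same length
good <- factor(yr, levels = c("Junior", "Senior"),
               labels = c("3rd year", "4th year"))
levels(good)  # "3rd year" "4th year"
```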

Factors

grp <- c(rep("A", 3), rep("B", 6), rep("C", 8))
grp

 [1] "A" "A" "A" "B" "B" "B" "B" "B" "B" "C" "C" "C" "C" "C" "C" "C" "C"

grp <- factor(grp)
grp

 [1] A A A B B B B B B C C C C C C C C
Levels: A B C

grp <- factor(grp, levels = c("C", "B", "A"))
grp

 [1] A A A B B B B B B C C C C C C C C
Levels: C B A

grp <- factor(grp, 
              levels = c("C", "B", "A"), 
              labels = c("Group C", "Group B", "Group A"))
grp

 [1] Group A Group A Group A Group B Group B Group B Group B Group B Group B
[10] Group C Group C Group C Group C Group C Group C Group C Group C
Levels: Group C Group B Group A

Tables and Proportion Tables

table()
- Number of observations in each category

table(grp)

grp
Group C Group B Group A 
      8       6       3

prop.table()
- Proportion of total observations in each category

prop.table(table(grp))

grp
  Group C   Group B   Group A 
0.4705882 0.3529412 0.1764706
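
Proportions are often easier to read as percentages. A minimal sketch, rebuilding grp with its default (alphabetical) level order so the block runs on its own; the name pct is made up.

```r
grp <- factor(c(rep("A", 3), rep("B", 6), rep("C", 8)))

# Multiply the proportions by 100 and round to one decimal place
pct <- round(100 * prop.table(table(grp)), 1)
pct   # A 17.6, B 35.3, C 47.1
```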

Visual Summaries of Categorical Data

Barplots

df <- tibble(grp = grp)
df %>% 
  ggplot(aes(grp)) + 
  geom_bar()

Visual Summaries of Categorical Data

Barplots

df <- tibble(grp = grp)
df %>% 
  ggplot(aes(grp, fill = grp)) + 
  geom_bar() + 
  labs(x = "Group", y = "Count", 
       title = "Barplot of Groups")

Logical Operators

Logical Operators
- Check whether condition is TRUE/FALSE
x==y
- Is x equal to y?
x<y
- Is x less than y?
x>=y
- Is x greater than or equal to y?

Logical Operators

Operator	Meaning
==	equal to
<	less than
>	greater than
<=	less than or equal to
>=	greater than or equal to
%in%	is an element of
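
Applied to a vector, each operator checks every element and returns one TRUE/FALSE per element. A short sketch using the vector x from the earlier slides:

```r
x <- c(1, 100, 7, 6, 5)

x == 7          # FALSE FALSE  TRUE FALSE FALSE
x < 6           #  TRUE FALSE FALSE FALSE  TRUE
x >= 6          # FALSE  TRUE  TRUE  TRUE FALSE
x %in% c(5, 6)  # FALSE FALSE FALSE  TRUE  TRUE
```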

Logical Data

TRUE or FALSE
- can also be abbreviated T or F (writing TRUE/FALSE in full is safer, since T and F can be overwritten)
NA is also technically a logical value
- Indicates a missing observation
Can be used for subsetting data
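
A short sketch of logical values in use: a logical vector picks out elements inside brackets, and because TRUE counts as 1, sum() and mean() give the number and share of TRUE values. The name big is made up for the illustration.

```r
x <- c(1, 100, 7, 6, 5)

big <- x > 5   # hypothetical name for the logical vector
big            # FALSE  TRUE  TRUE  TRUE FALSE

x[big]         # 100 7 6  (keeps the elements where big is TRUE)
sum(big)       # 3        (number of values above 5)
mean(big)      # 0.6      (share of values above 5)

class(NA)      # "logical"
```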

Subsetting Data

Brackets []
- For vectors, indicates position in vector you want
- For data frames and matrices, [row, column]

seq(0, 50, by = 2)[4]

[1] 6

df[2,1]

# A tibble: 1 × 1
  grp    
  <fct>  
1 Group A
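
Leaving the row or the column position empty selects everything along that dimension. The sketch below uses a small made-up base data.frame d; note that for a base data.frame, selecting a single cell or column returns a plain value or vector, whereas the tibble above returned a 1 × 1 tibble.

```r
d <- data.frame(x = c(1, 100, 7), y = c("a", "b", "c"),
                stringsAsFactors = FALSE)   # hypothetical data

d[2, 1]        # 100 (row 2, column 1, as a plain value)
d[2, ]         # all columns of row 2
d[, "y"]       # "a" "b" "c" (column y as a vector)
d[d$x > 5, ]   # rows where x is greater than 5
```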

Subsetting Data

$
- In data frames/tibbles, gives variable after $ as vector

df$grp

 [1] Group A Group A Group A Group B Group B Group B Group B Group B Group B
[10] Group C Group C Group C Group C Group C Group C Group C Group C
Levels: Group C Group B Group A

Subsetting Data

subset(df, 
       subset = grp=="Group A")

# A tibble: 3 × 1
  grp    
  <fct>  
1 Group A
2 Group A
3 Group A

df %>% 
  filter(grp=="Group A")

# A tibble: 3 × 1
  grp    
  <fct>  
1 Group A
2 Group A
3 Group A

Working with Data in R

Download survey responses from Courseworks
- “Section 2” > “intro_survey.csv”
Put file into course folder
Set working directory in R or open course RProject
Read file into R using tidyverse:

intro <- read_csv("intro_survey.csv")

Summarizing Data

What is the median number of hours of sleep students in this section get each night? How about the standard deviation and range?
See if you can reproduce this plot:

Summarizing Data

What is the median number of hours of sleep students in this section get each night? How about the standard deviation and range?

median(intro$sleep)

[1] 7

quantile(intro$sleep, probs = 0.5)

50% 
  7

sd(intro$sleep)

[1] 0.7813942

range(intro$sleep)

[1] 5 8

Summarizing Data

See if you can reproduce this plot:

intro %>%
  ggplot(aes(sleep)) + 
  geom_histogram() + 
  labs(x = "Hours of Sleep per Night", 
       y = "Number",
       title = "Histogram of Number of Hours of Sleep")

Summarizing Data

Make the variable year into a factor, and put its levels in the proper order
- Hint: remember the mutate() function
Make a proportion table from the new factor variable year
Try making a barplot for the factor variable year with each year in a different color

Summarizing Data

Make the variable year into a factor, and put its levels in the proper order

intro <- intro %>% 
  mutate(year = factor(year, levels = c("First Year", "Sophomore", 
                                        "Junior", "Senior")))

Make a proportion table from the new factor variable year

prop.table(table(intro$year))


First Year  Sophomore     Junior     Senior 
0.00000000 0.07692308 0.92307692 0.00000000

Try making a barplot for the factor variable year with each year in a different color

Summarizing Data

Try making a barplot for the factor variable year with each year in a different color

intro %>%
  ggplot(aes(year, fill = year)) + 
  geom_bar() + 
  labs(x = "Year", y = "Count", 
       title = "Barplot of Year")

Making Nice Tables in R

Install modelsummary package

install.packages("modelsummary")

Load the modelsummary package

library(modelsummary)

Making Nice Tables in R

datasummary_skim(data = intro %>% select(sleep, homework), 
                  histogram=F)

          Unique (#)  Missing (%)  Mean   SD  Min  Median   Max
sleep              6            0   6.9  0.8  5.0     7.0   8.0
homework           4           77   8.0  4.4  5.0     6.0  13.0

Missing Data

Data are often incomplete or missing altogether
R shows missing values as NA
Some functions return NA if any NAs are present in the data

mean(intro$homework)

[1] NA

Solution: na.rm argument

mean(intro$homework, na.rm = T)

[1] 8
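
Before dropping missing values it helps to know how many there are. A minimal sketch using is.na(); the vector hw contains made-up values and is not the survey data.

```r
hw <- c(5, NA, 6, NA, 13)   # made-up values, not the survey data

sum(is.na(hw))           # 2   (number of missing values)
mean(is.na(hw))          # 0.4 (share of values that are missing)

mean(hw)                 # NA
mean(hw, na.rm = TRUE)   # 8   (mean of the three observed values)
```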

Recap

Tidyverse
- Tibbles
  - Create new columns with mutate()
  - Subset with filter() and logical conditions
- Plotting
  - Foundation for plot with ggplot(data, aes())
  - Build on top of foundation with +

Recap

Factors
- Use factor(variable, levels = c(...), labels = c(...))
Missing Data
- Be aware of how much data is missing
- Can usually exclude missing values from a calculation with the na.rm = TRUE argument

Next Section

Summarizing more than one variable
“If” Statements
For Loops