Section 3. Types of Data in R

Author

Sam Frederick

Published

May 31, 2023

Picking Up Where We Left Off…

Review

Creating Objects in R

In R, we often want to refer back to code output we have previously generated, so we can perform further operations on that output without having to retrace all of our steps.

To do this, we use the assignment operator (<-). This operator assigns the value on the right-hand side to the object named on the left-hand side. For example, to create an object called x that stores the number 2, we could use the following code:

x <- 2

Once we run this code, x will appear in the “Global Environment” in the upper right corner of RStudio.

We can now refer to x in future calculations.

x^2 + 2
[1] 6

We can call objects whatever we want, with several notable restrictions:

  • Object names cannot contain spaces.
  • We should not give objects the names of functions that already exist in R (for example, mean or c).
  • Object names cannot start with a number.
  • Object names should start with a letter and can otherwise contain letters, numbers, periods (.), and underscores (_).
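
As a quick illustration of these naming rules, the sketch below creates a few objects with valid names. The names and values are made up for this example. The invalid alternatives are left as comments because R would stop with an error if we ran them.

```r
# Valid names: start with a letter, then letters, numbers, periods, or underscores
vote_share <- 51.2
voteShare2020 <- 51.2
district.count <- 7

# Invalid names (commented out because they produce errors):
# 2020_voteshare <- 51.2   # starts with a number
# vote share <- 51.2       # contains a space

# Both valid objects store the same value
vote_share == voteShare2020
```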

Vectors

Last week, we talked about vectors in R. We can think of vectors as columns in a spreadsheet or dataset. Another way to think of vectors is as “lists” of data; however, the word list has a special meaning in R, so we want to be careful about this terminology. Vectors cannot contain more than one type of data (for example, they can contain either numbers or characters/strings but not both).

Vectors can be created in R using the c() command, which puts the arguments into a vector. The arguments inside of the parentheses should be separated by commas. For example,

c(1,2,3)
[1] 1 2 3
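
To see the one-type rule in action, here is a small sketch of what happens when we try to mix numbers and text in c(). The object names numbers and mixed are just illustrative. R does not give an error; it quietly converts everything to a single type (here, character).

```r
# A vector of numbers stays numeric
numbers <- c(1, 2, 3)
class(numbers)

# Mixing numbers and text converts every element to text
mixed <- c(1, 2, "three")
mixed
class(mixed)
```

This is worth remembering when a column we expect to be numeric shows up as character: a single stray text value is often the cause.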

We can also create vectors in R using the : operator or the seq() command. Both of these options create sequences of numbers in vector form.

The : operator outputs the sequence of integers from the left side number to the right side number.

1:6
[1] 1 2 3 4 5 6

The seq(a, b, by = x) command outputs the sequence of numbers from a to b in increments of x.

seq(1, 6, by = 1)
[1] 1 2 3 4 5 6
seq(10, 11, by = 0.1)
 [1] 10.0 10.1 10.2 10.3 10.4 10.5 10.6 10.7 10.8 10.9 11.0

After we have created a vector, we can use it as input for the functions we are using.

For example, to calculate the average of the sequence of integers from 1 to 6, we could use the following code:

mean(1:6)
[1] 3.5
mean(seq(1, 6, by = 1))
[1] 3.5

Reading Data into R

Much of the time, the data we want to work with in our research comes in some form of spreadsheet (mostly .csv format). In order to work with this data in R, we need to read the file into R. To do so, we need to know where this data is stored on our computer.

This is why it’s helpful to have a dedicated folder somewhere on your computer for this course or for a project you’re working on. You can then move your data files from your Downloads folder to the course folder.

Once we know where the data are located, we can tell R where to look for the data by

  1. Manually entering the path to the file
    • "/Users/samfrederick/Desktop/Scope and Methods/dataset.csv"
  2. Setting our working directory at the beginning of our R session
    • setwd("/Users/samfrederick/Desktop/Scope and Methods/")
  3. Using an RProject associated with our course folder
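
The first two options above can be sketched with base R functions. To keep the example runnable on any computer, it creates a hypothetical course folder and a tiny stand-in dataset.csv inside R's temporary directory; in practice, you would point to your own course folder instead.

```r
# Hypothetical course folder, created in a temporary location for this sketch
course_dir <- file.path(tempdir(), "scope_and_methods")
dir.create(course_dir, showWarnings = FALSE)

# A tiny stand-in file so there is something to find
data_path <- file.path(course_dir, "dataset.csv")
write.csv(data.frame(x = 1:3), data_path, row.names = FALSE)

# Option 1: refer to the file by its full path
file.exists(data_path)

# Option 2: set the working directory, then use only the file name
old_wd <- setwd(course_dir)   # setwd() returns the previous working directory
getwd()
found_by_name <- file.exists("dataset.csv")
found_by_name

# Return to the original working directory
setwd(old_wd)
```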

A Short Digression on Packages in R

The main way we will read csv data into R in this course is by using the R package tidyverse. This package bundles a collection of other helpful packages for reading, cleaning, processing, and visualizing data in R.

To use tidyverse, we first have to download the package. We can download the package using this code:

# Note the use of quotation marks around the word tidyverse
install.packages("tidyverse")

We only have to install packages once.
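
If we are not sure whether a package has already been installed, one way to check is sketched below using requireNamespace(), which returns TRUE or FALSE. The object name tidyverse_installed is just illustrative. The sketch only prints a reminder rather than installing anything.

```r
# TRUE if tidyverse is already installed on this computer, FALSE otherwise
tidyverse_installed <- requireNamespace("tidyverse", quietly = TRUE)
tidyverse_installed

# Print a reminder if the package still needs to be installed
if (!tidyverse_installed) {
  message("tidyverse is not installed yet; run install.packages(\"tidyverse\") once.")
}
```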

To use packages, we need to load the packages using the library() command:

library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.1     ✔ purrr   0.3.4
✔ tibble  3.1.8     ✔ dplyr   1.1.0
✔ tidyr   1.2.0     ✔ stringr 1.4.1
✔ readr   2.1.2     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Now, we can use the functions that come within the tidyverse package.

Back to Reading Files

To read files, we use the read_csv() command.

Say the name of the file we want to read is “house2020_elections.csv”. We can read this into R by (1) setting our working directory or using an RProject in our course folder and (2) using this code:

read_csv("house2020_elections.csv")
Rows: 726 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): party, state, incumbent_challenge_full, last
dbl (4): district, receipts, disbursements, voteshare

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 726 × 8
   party state district incumbent_challenge_full receipts disbur…¹ last  votes…²
   <chr> <chr>    <dbl> <chr>                       <dbl>    <dbl> <chr>   <dbl>
 1 REP   AL           1 Open seat                2344517. 2232544. CARL     64.4
 2 DEM   AL           1 Open seat                  80095.   78973. AVER…    35.5
 3 DEM   AL           2 Open seat                  57723.   57661. HARV…    34.7
 4 REP   AL           5 Incumbent                 669026.  223707. BROO…    95.8
 5 DEM   AL           7 Incumbent                2171040. 1498832. SEWE…    97.2
 6 REP   AR           1 Incumbent                 966801. 1095518. CRAW…   100  
 7 DEM   CA           7 Incumbent                1830741. 1126436. BERA     56.6
 8 REP   CA           3 Challenger                397099.  384584. HAMI…    45.3
 9 DEM   CA           4 Challenger               3022017. 3018529. KENN…    44.1
10 REP   CA           6 Challenger                 45504.   45286. BISH     26.7
# … with 716 more rows, and abbreviated variable names ¹​disbursements,
#   ²​voteshare

Finally, we will want to store the data from this file in an object:

house <- read_csv("house2020_elections.csv")
Rows: 726 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): party, state, incumbent_challenge_full, last
dbl (4): district, receipts, disbursements, voteshare

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Now, this data should appear in our “Global Environment” in the top right corner of RStudio.
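
Once data are stored in an object, a few base R functions give a quick overview of what we loaded. The sketch below uses a small stand-in data frame called house_demo (a hypothetical name), typed in by hand from the first five rows printed above with only some of the columns, so it runs without the csv file. The same functions work on the full house object.

```r
# Stand-in for the first five rows of the house data (subset of columns)
house_demo <- data.frame(
  party = c("REP", "DEM", "DEM", "REP", "DEM"),
  state = c("AL", "AL", "AL", "AL", "AL"),
  district = c(1, 1, 2, 5, 7),
  incumbent_challenge_full = c("Open seat", "Open seat", "Open seat",
                               "Incumbent", "Incumbent"),
  voteshare = c(64.4, 35.5, 34.7, 95.8, 97.2)
)

dim(house_demo)      # number of rows, then number of columns
names(house_demo)    # column names
head(house_demo, 3)  # first three rows
str(house_demo)      # type of each column
```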

Working with Data Frames or Tibbles in R

We can access columns in a tibble using the $ operator. For example, if we want the party column from our house dataset, we can use the following code:

house$party

This will return the party column as a vector containing each candidate’s political party.

We can also use the [] operator to access specific columns and rows of our data. We use these brackets by specifying data[row, column]. To access a column, we can use either its number or its name. The first two lines below give the same output, a one-column tibble, while house$party returns the same values as a vector:

house[,"party"]
house[,1]
house$party

Say we want the second row of the data:

house[2,]
# A tibble: 1 × 8
  party state district incumbent_challenge_full receipts disburs…¹ last  votes…²
  <chr> <chr>    <dbl> <chr>                       <dbl>     <dbl> <chr>   <dbl>
1 DEM   AL           1 Open seat                  80095.    78973. AVER…    35.5
# … with abbreviated variable names ¹​disbursements, ²​voteshare

Finally, what if we want the second value of the party column of our data?

house[2, 'party']
# A tibble: 1 × 1
  party
  <chr>
1 DEM  
house$party[2]
[1] "DEM"

Main Types of Data in R

Types of Data   R Types                          Ways to Summarize
Numeric         integer (int), double (dbl)      mean, median, min, max, range, IQR, sd, var, summary
Categorical     character (chr), factor (fct)    table, prop.table
Logical         logical (lgl), TRUE, FALSE, NA
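
We can ask R which type a value has using class(). The sketch below checks one made-up example of each type from the table; the object names are illustrative.

```r
voteshare <- 51.2             # a number with decimals is stored as a double
district <- 7L                # the L suffix asks R for an integer
party <- "DEM"                # text in quotes is a character
party_fct <- factor(party)    # factor() turns a character into a factor
won <- TRUE                   # TRUE and FALSE are logical

class(voteshare)   # "numeric"
class(district)    # "integer"
class(party)       # "character"
class(party_fct)   # "factor"
class(won)         # "logical"
```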

Categorical Data

In political science, we often treat categorical data, or data that comes as a string/character variable, as a factor variable. A factor variable is a way of storing this data in categories. We can create factors using the factor() command.

Try turning the party variable in our house dataset into a factor:

factor(house$party)

It is important to note that the levels of a factor are put in alphabetical or numeric order by default, though we often want to specify a different order.

We can use the “levels” argument of the factor() command to accomplish this.

Say we want to organize our house party factor in reverse alphabetical order. We could do that like this:

factor(house$party, levels = c("REP", "DEM"))

Note about the levels argument: the levels must be spelled exactly the same as in the data. Additionally, you should list every category that appears in the data, because any value that does not match one of the levels is turned into NA. This is where it can help to determine the specific levels of the categorical variable using unique(variable).
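
The sketch below illustrates this note on a short, made-up party vector rather than the full dataset. It checks the categories with unique(), sets the level order, and then shows what happens when a level is misspelled.

```r
# A short made-up vector of party labels
party <- c("REP", "DEM", "DEM", "REP", "DEM")

# Check which categories appear in the data before choosing levels
unique(party)

# Levels in reverse alphabetical order
party_fct <- factor(party, levels = c("REP", "DEM"))
levels(party_fct)
table(party_fct)

# A misspelled level ("Dem" instead of "DEM") turns the unmatched values into NA
misspelled <- factor(party, levels = c("REP", "Dem"))
misspelled
sum(is.na(misspelled))
```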

Summarizing Factor Variables

Since factor variables are in categories (i.e., are not numeric), we need to take a different approach to summarizing. We generally do this by looking at a table of our data. Tables tell us how many observations are in each category of our data.

table(house$party)

DEM REP 
378 348 

Sometimes, we might also want to know what proportion of observations falls into each category. We can calculate this using the following code:

prop.table(table(house$party))

      DEM       REP 
0.5206612 0.4793388 
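
Proportions are often easier to read as percentages. As a sketch, we can multiply the proportion table by 100 and round it. To keep the example self-contained, it rebuilds a party vector from the counts in the table above (378 DEM and 348 REP) instead of reading the csv file; party_share is an illustrative name.

```r
# Rebuild a party vector from the counts shown in the table above
party <- rep(c("DEM", "REP"), times = c(378, 348))

# Proportions, then percentages rounded to one decimal place
party_share <- prop.table(table(party))
party_pct <- round(100 * party_share, 1)
party_pct
```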

Logical Data in R

Another common type of data in R is “logical” data. There are three values logical data can take on: (1) TRUE, (2) FALSE, and (3) NA.

TRUE indicates that some condition is met, or “true.” FALSE indicates that some condition is not met, or is “false.” Finally, NA indicates that the data are missing.
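
Missing values need a little extra care. The sketch below uses a made-up vector with one missing value to show how is.na() finds missing data and how NA affects a summary function like mean(). The na.rm argument tells mean() to drop missing values before calculating.

```r
# A made-up vector with one missing value
values <- c(10, NA, 30)

is.na(values)               # TRUE where the value is missing
mean(values)                # NA, because one value is unknown
mean(values, na.rm = TRUE)  # drops the NA first, so the result is 20
```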

R will check whether various conditions are met for us. We can do this using logical operators:

Logical Operator   Task
==                 equal to
!=                 not equal to
<                  less than
>                  greater than
<=                 less than or equal to
>=                 greater than or equal to
!                  not
%in%               in

If we’d like to check whether 2 is equal to 3, we can use this code:

2==3
[1] FALSE

R checks whether this is true, and returns FALSE because it’s not true.

How about whether 2 is less than or equal to 3?

2<=3
[1] TRUE

What if we have a vector of integers between 1 and 7 called x, and we want to check whether the number 3 is in this vector?

x <- 1:7
3%in%x
[1] TRUE
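
These operators also work on whole vectors, checking the condition once for each element. The sketch below uses two short vectors copied from the first five rows of the house data printed earlier, so it runs without the csv file.

```r
# Values from the first five rows of the house data
voteshare <- c(64.4, 35.5, 34.7, 95.8, 97.2)
party <- c("REP", "DEM", "DEM", "REP", "DEM")

voteshare > 50       # TRUE FALSE FALSE TRUE TRUE
!(voteshare > 50)    # ! flips each TRUE to FALSE and each FALSE to TRUE
party != "REP"       # FALSE TRUE TRUE FALSE TRUE

# %in% checks each element on the left against the set on the right
party %in% c("DEM", "IND")

# TRUE counts as 1 and FALSE as 0, so sum() counts how many elements meet a condition
sum(voteshare > 50)  # 3
```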

Logical conditions like these are useful for taking portions/subsets of our data to which a certain condition applies. For example, in our house data, we can take only the rows for Democratic candidates:

house[house$party=="DEM",]
# A tibble: 378 × 8
   party state district incumbent_challenge_full receipts disbur…¹ last  votes…²
   <chr> <chr>    <dbl> <chr>                       <dbl>    <dbl> <chr>   <dbl>
 1 DEM   AL           1 Open seat                  80095.   78973. AVER…    35.5
 2 DEM   AL           2 Open seat                  57723.   57661. HARV…    34.7
 3 DEM   AL           7 Incumbent                2171040. 1498832. SEWE…    97.2
 4 DEM   CA           7 Incumbent                1830741. 1126436. BERA     56.6
 5 DEM   CA           4 Challenger               3022017. 3018529. KENN…    44.1
 6 DEM   CA           8 Open seat                1878106. 1876718. BUBS…    43.9
 7 DEM   CA          11 Incumbent                 612404.  442547. DESA…    73.0
 8 DEM   CA           3 Incumbent                1103806.  717060. GARA…    54.7
 9 DEM   CA          18 Challenger                777037.  697543. KUMAR    36.8
10 DEM   CA          22 Challenger               5175373. 5146107. ARBA…    45.8
# … with 368 more rows, and abbreviated variable names ¹​disbursements,
#   ²​voteshare

Because this dataset contains only Democratic and Republican candidates, this code will give you the same output:

house[house$party!="REP",]
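
To see what is happening inside the brackets, it can help to store the condition in its own object first. The sketch below uses a hand-typed stand-in data frame, house_demo (a hypothetical name), built from the first five rows of the house data with only three columns. The condition is a vector of TRUE and FALSE values, one per row, and R keeps the rows where it is TRUE.

```r
# Stand-in for the first five rows of the house data (subset of columns)
house_demo <- data.frame(
  party = c("REP", "DEM", "DEM", "REP", "DEM"),
  district = c(1, 1, 2, 5, 7),
  voteshare = c(64.4, 35.5, 34.7, 95.8, 97.2)
)

# The condition is a logical vector with one value per row
is_dem <- house_demo$party == "DEM"
is_dem

# Rows where the condition is TRUE are kept
dem_rows <- house_demo[is_dem, ]
dem_rows

# With only two parties in the data, "not REP" selects the same rows
not_rep_rows <- house_demo[house_demo$party != "REP", ]
identical(dem_rows, not_rep_rows)
```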

We can also look only at candidates who received more than 50% of the vote:

house[house$voteshare>50,]
# A tibble: 380 × 8
   party state district incumbent_challenge_full  receipts disbu…¹ last  votes…²
   <chr> <chr>    <dbl> <chr>                        <dbl>   <dbl> <chr>   <dbl>
 1 REP   AL           1 Open seat                 2344517.  2.23e6 CARL     64.4
 2 REP   AL           5 Incumbent                  669026.  2.24e5 BROO…    95.8
 3 DEM   AL           7 Incumbent                 2171040.  1.50e6 SEWE…    97.2
 4 REP   AR           1 Incumbent                  966801.  1.10e6 CRAW…   100  
 5 DEM   CA           7 Incumbent                 1830741.  1.13e6 BERA     56.6
 6 REP   CA           8 Open seat                 2028339.  1.96e6 OBER…    56.1
 7 DEM   CA          11 Incumbent                  612404.  4.43e5 DESA…    73.0
 8 DEM   CA           3 Incumbent                 1103806.  7.17e5 GARA…    54.7
 9 REP   CA          25 Incumbent                10140614.  9.76e6 GARC…    50.0
10 DEM   CA          28 Incumbent                19598363.  1.04e7 SCHI…    72.7
# … with 370 more rows, and abbreviated variable names ¹​disbursements,
#   ²​voteshare

Evaluating Multiple Logical Conditions

We can combine logical conditions using two more operators:

Operator   Task
&          AND
|          OR

For example, we can get only the observations in our dataset for Democrats who received more than 50% of the vote.

house[house$party=="DEM"&house$voteshare>50,]
# A tibble: 192 × 8
   party state district incumbent_challenge_full  receipts disbu…¹ last  votes…²
   <chr> <chr>    <dbl> <chr>                        <dbl>   <dbl> <chr>   <dbl>
 1 DEM   AL           7 Incumbent                 2171040.  1.50e6 SEWE…    97.2
 2 DEM   CA           7 Incumbent                 1830741.  1.13e6 BERA     56.6
 3 DEM   CA          11 Incumbent                  612404.  4.43e5 DESA…    73.0
 4 DEM   CA           3 Incumbent                 1103806.  7.17e5 GARA…    54.7
 5 DEM   CA          28 Incumbent                19598363.  1.04e7 SCHI…    72.7
 6 DEM   CA          27 Incumbent                 1220538.  9.63e5 CHU      69.8
 7 DEM   CA          37 Incumbent                 2233947.  1.22e6 BASS     85.9
 8 DEM   CT           3 Incumbent                 1956686.  1.81e6 DELA…    56.1
 9 DEM   DC           0 Incumbent                  387489.  3.59e5 NORT…    81.9
10 DEM   FL          24 Incumbent                  412592.  3.21e5 WILS…    75.6
# … with 182 more rows, and abbreviated variable names ¹​disbursements,
#   ²​voteshare
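
The difference between & and | is easiest to see side by side. The sketch below uses a hand-typed stand-in data frame, house_demo (a hypothetical name), built from five of the rows printed earlier in this section. With &, a row is kept only if both conditions are TRUE; with |, a row is kept if at least one condition is TRUE.

```r
# Stand-in built from five rows of the house data printed earlier
house_demo <- data.frame(
  party = c("REP", "DEM", "DEM", "DEM", "REP"),
  state = c("AL", "AL", "AL", "CA", "CA"),
  voteshare = c(64.4, 35.5, 97.2, 56.6, 45.3)
)

dem <- house_demo$party == "DEM"    # FALSE TRUE TRUE TRUE FALSE
won <- house_demo$voteshare > 50    # TRUE FALSE TRUE TRUE FALSE

dem & won   # TRUE only where both conditions hold
dem | won   # TRUE where at least one condition holds

# Democrats who received more than 50% of the vote
house_demo[dem & won, ]

# Number of rows meeting each combined condition
sum(dem & won)   # 2
sum(dem | won)   # 4
```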

Exercises

  1. Find the average vote share of candidates in the 2020 elections dataset.
mean(house$voteshare)
[1] 51.15903
  2. What is the median amount of money spent by candidates in 2020?
median(house$disbursements)
[1] 1141486
  3. What are three ways we could summarize how spread out campaign fundraising is (use the receipts column here)?
range(house$receipts)
[1]     3065.2 38160641.6
IQR(house$receipts)
[1] 2228867
sd(house$receipts)
[1] 3251062
var(house$receipts)
[1] 1.05694e+13
  4. What’s one command we could use to get an overview of many summary statistics for our voteshare column?
summary(house$voteshare)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.168  38.045  51.104  51.159  63.713 100.000 
  5. Take the incumbent_challenge_full variable. Turn it into a factor, and order the levels in this order: “Incumbents”, “Challengers”, and “Open-Seat Candidates.” Make a proportion table from this factor variable.
factor(house$incumbent_challenge_full, levels = c("Incumbent", "Challenger",
                                                  "Open seat"))%>%
  table() %>% 
  prop.table()
.
 Incumbent Challenger  Open seat 
0.48898072 0.41597796 0.09504132 
  6. Subset the data so that we only have rows for incumbents OR candidates who received more than 60% of the vote.
house[house$incumbent_challenge_full=="Incumbent"|house$voteshare>60,]
# A tibble: 374 × 8
   party state district incumbent_challenge_full  receipts disbu…¹ last  votes…²
   <chr> <chr>    <dbl> <chr>                        <dbl>   <dbl> <chr>   <dbl>
 1 REP   AL           1 Open seat                 2344517.  2.23e6 CARL     64.4
 2 REP   AL           5 Incumbent                  669026.  2.24e5 BROO…    95.8
 3 DEM   AL           7 Incumbent                 2171040.  1.50e6 SEWE…    97.2
 4 REP   AR           1 Incumbent                  966801.  1.10e6 CRAW…   100  
 5 DEM   CA           7 Incumbent                 1830741.  1.13e6 BERA     56.6
 6 DEM   CA          11 Incumbent                  612404.  4.43e5 DESA…    73.0
 7 DEM   CA           3 Incumbent                 1103806.  7.17e5 GARA…    54.7
 8 REP   CA          25 Incumbent                10140614.  9.76e6 GARC…    50.0
 9 DEM   CA          28 Incumbent                19598363.  1.04e7 SCHI…    72.7
10 DEM   CA          27 Incumbent                 1220538.  9.63e5 CHU      69.8
# … with 364 more rows, and abbreviated variable names ¹​disbursements,
#   ²​voteshare