Section 3. Types of Data in R
Picking Up Where We Left Off…
Review
Creating Objects in R
In R, we often want to refer back to code output we have previously generated, so we can perform further operations on that output without having to retrace all of our steps.
To do this, we use the assignment operator (<-). This operator assigns the value on the right-hand side to the object on the left-hand side. For example, to create an object called x that stores the number 2, we could use the following code:
x <- 2
Once we run this code, x will appear in the “Global Environment” in the upper right corner of RStudio.
We can now refer to x in future calculations.
x^2 + 2
[1] 6
We can call objects whatever we want with several notable restrictions:
- We should not use spaces in object names.
- We should not create objects with the titles of functions that already exist in R.
- We should not start object names with numbers.
- Object names have to contain some alphabetical characters.
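As a quick illustration of these naming rules, here is a minimal sketch. The object names are made up for this example, and make.names() is a base R helper (not otherwise used in this section) that shows how R would repair an invalid name.

```r
# Valid names: start with a letter and contain no spaces
vote_share <- 51.2
voteShare2020 <- 51.2

# Invalid names cause a syntax error, so they are commented out here:
# 2020_votes <- 51.2    # starts with a number
# vote share <- 51.2    # contains a space

# make.names() shows how R would turn an invalid name into a valid one
fixed_name <- make.names("2020 votes")
fixed_name
```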
Vectors
Last week, we talked about vectors in R. We can think of vectors as columns in a spreadsheet or dataset. Another way is to think of vectors as “lists” of data; however, the word list has a special meaning in R, so we want to be careful about this terminology. Vectors cannot contain more than one type of data (for example, they can contain either numbers or characters/strings but not both).
Vectors can be created in R using the c() command, which puts the arguments into a vector. The arguments inside of the parentheses should be separated by commas. For example,
c(1,2,3)
[1] 1 2 3
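Because a vector can hold only one type of data, R converts the elements to a common type when we mix them. The short sketch below (with made-up values) shows what happens when numbers, strings, and logical values are combined with c().

```r
# Mixing types: every element is converted to a character string
mixed <- c(1, "a", TRUE)
mixed
class(mixed)

# A vector of only numbers stays numeric
nums <- c(1, 2, 3)
class(nums)
```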
We can also create vectors in R using the : operator or the seq() command. Both of these options create sequences of numbers in vector form.
The : operator outputs the sequence of integers from the left side number to the right side number.
1:6
[1] 1 2 3 4 5 6
The seq(a, b, by = x) command outputs the sequence of numbers from a to b in increments of x.
seq(1, 6, by = 1)
[1] 1 2 3 4 5 6
seq(10, 11, by = 0.1)
[1] 10.0 10.1 10.2 10.3 10.4 10.5 10.6 10.7 10.8 10.9 11.0
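As a small addition to the examples above: seq() also accepts a length.out argument (not covered in this section), which asks for a fixed number of evenly spaced values instead of a fixed increment. A minimal sketch:

```r
# Five evenly spaced values from 0 to 1
five_points <- seq(0, 1, length.out = 5)
five_points

# The same sequence, written with an increment instead
by_quarter <- seq(0, 1, by = 0.25)
by_quarter
```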
After we have created a vector, we can pass it as input to functions.
For example, to calculate the average of the sequence of integers from 1 to 6, we could use the following code:
mean(1:6)
[1] 3.5
mean(seq(1, 6, by = 1))
[1] 3.5
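To see what mean() is doing with the vector, we can store the vector in an object and compute the average by hand. This is only an illustrative sketch; the object names are chosen for this example.

```r
# Store the vector so we can reuse it
x <- 1:6

# The average is the sum of the values divided by how many there are
by_hand <- sum(x) / length(x)
by_hand

# mean() gives the same answer
built_in <- mean(x)
built_in
```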
Reading Data into R
Much of the time, the data we want to work with in our research comes in some form of spreadsheet (mostly .csv format). In order to work with this data in R, we need to read the file into R. To do so, we need to know where this data is stored on our computer.
This is why it’s helpful to have a dedicated folder somewhere on your computer for this course or for a project you’re working on. You can then move your data files from your Downloads folder to the course folder.
Once we know where the data are located, we can tell R where to look for the data by:
- Manually entering the path to the file, for example “/Users/samfrederick/Desktop/Scope and Methods/dataset.csv”
- Setting our working directory at the beginning of our R session:
setwd("/Users/samfrederick/Desktop/Scope and Methods/")
- Using an RProject associated with our course folder
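Whichever option we choose, it can help to check where R is currently looking and whether the file is really there. The sketch below uses base R functions; the "data" subfolder is a hypothetical example and may not exist on your computer.

```r
# Where is R currently looking for files?
getwd()

# Build a path relative to the working directory.
# "data" is a hypothetical subfolder used only for illustration.
data_path <- file.path("data", "house2020_elections.csv")
data_path

# Check whether the file exists before trying to read it (TRUE or FALSE)
file_found <- file.exists(data_path)
file_found
```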
A Short Digression on Packages in R
The main way we will read CSV data into R in this course is by using the R package tidyverse. This is a package that contains a bunch of other helpful packages for reading, cleaning and processing, and visualizing data in R.
To use tidyverse, we first have to download the package. We can download the package using this code:
# Note the use of quotation marks around the word tidyverse
install.packages("tidyverse")
We only have to install packages once.
To use packages, we need to load the packages using the library() command:
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.1 ✔ purrr 0.3.4
✔ tibble 3.1.8 ✔ dplyr 1.1.0
✔ tidyr 1.2.0 ✔ stringr 1.4.1
✔ readr 2.1.2 ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
Now, we can use the functions that come within the tidyverse package.
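Since packages only need to be installed once but loaded in every session, one possible pattern is to check whether a package is already installed before installing it. This is a sketch, not a required step; needs_install() is a helper name invented for this example, built on the base R function requireNamespace().

```r
# Hypothetical helper: TRUE if the package is not installed yet
needs_install <- function(pkg) {
  !requireNamespace(pkg, quietly = TRUE)
}

if (needs_install("tidyverse")) {
  # Uncomment the next line to install (requires an internet connection)
  # install.packages("tidyverse")
  message("tidyverse is not installed yet")
}

# The stats package ships with R, so it never needs installing
needs_install("stats")
```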
Back to Reading Files
To read files, we use the read_csv() command.
Say the name of the file we want to read is “house2020_elections.csv”. We can read this into R by (1) setting our working directory or using an RProject in our course folder and (2) using this code:
read_csv("house2020_elections.csv")
Rows: 726 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): party, state, incumbent_challenge_full, last
dbl (4): district, receipts, disbursements, voteshare
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 726 × 8
party state district incumbent_challenge_full receipts disbur…¹ last votes…²
<chr> <chr> <dbl> <chr> <dbl> <dbl> <chr> <dbl>
1 REP AL 1 Open seat 2344517. 2232544. CARL 64.4
2 DEM AL 1 Open seat 80095. 78973. AVER… 35.5
3 DEM AL 2 Open seat 57723. 57661. HARV… 34.7
4 REP AL 5 Incumbent 669026. 223707. BROO… 95.8
5 DEM AL 7 Incumbent 2171040. 1498832. SEWE… 97.2
6 REP AR 1 Incumbent 966801. 1095518. CRAW… 100
7 DEM CA 7 Incumbent 1830741. 1126436. BERA 56.6
8 REP CA 3 Challenger 397099. 384584. HAMI… 45.3
9 DEM CA 4 Challenger 3022017. 3018529. KENN… 44.1
10 REP CA 6 Challenger 45504. 45286. BISH 26.7
# … with 716 more rows, and abbreviated variable names ¹disbursements,
# ²voteshare
Finally, we will want to store the data from this file in an object:
house <- read_csv("house2020_elections.csv")
Rows: 726 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): party, state, incumbent_challenge_full, last
dbl (4): district, receipts, disbursements, voteshare
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Now, this data should appear in our “Global Environment” in the top right corner of RStudio.
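To practice the read-then-store pattern without the course file, here is a self-contained sketch. It assumes nothing beyond base R: it writes a tiny CSV (three rows and four columns copied from the output printed above) to a temporary file and reads it back with base R's read.csv() rather than read_csv(), so it runs even before tidyverse is installed.

```r
# Write a tiny CSV to a temporary file
csv_lines <- c(
  "party,state,district,voteshare",
  "REP,AL,1,64.4",
  "DEM,AL,1,35.5",
  "DEM,AL,2,34.7"
)
tmp_file <- tempfile(fileext = ".csv")
writeLines(csv_lines, tmp_file)

# Read the file and store the result in an object
house_small <- read.csv(tmp_file)

# Check what we got: number of rows and columns, then the column names
dim(house_small)
names(house_small)
```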
Working with Data Frames or Tibbles in R
We can access columns in a tibble using the $ operator. For example, if we want the party column from our house dataset, we can use the following code:
house$party
This will return the party column as a vector containing each candidate’s political party.
We can also use the [] operator to access specific columns and rows of our data. We use these brackets by specifying data[row,column]. To access columns, we can use either the number of the column we are interested in or the name, and both will give the same output:
house[,"party"]
house[,1]
house$party
Say we want the second row of the data:
house[2,]
# A tibble: 1 × 8
party state district incumbent_challenge_full receipts disburs…¹ last votes…²
<chr> <chr> <dbl> <chr> <dbl> <dbl> <chr> <dbl>
1 DEM AL 1 Open seat 80095. 78973. AVER… 35.5
# … with abbreviated variable names ¹disbursements, ²voteshare
Finally, what if we want the second value of the party column of our data?
house[2, 'party']
# A tibble: 1 × 1
party
<chr>
1 DEM
house$party[2]
[1] "DEM"
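The sketch below repeats these access patterns on a small stand-in dataset so they can be run without the course file. Two assumptions to note: house_small is a plain data frame built from three rows of the output above (not a tibble, so bracket results may print differently), and the [[ ]] operator shown here is an additional base R way to pull out a column that this section does not otherwise cover.

```r
# A small stand-in for the house data (values copied from the output above)
house_small <- data.frame(
  party = c("REP", "DEM", "DEM"),
  state = c("AL", "AL", "AL"),
  voteshare = c(64.4, 35.5, 34.7)
)

# Two ways to get the party column as a vector
party_dollar <- house_small$party
party_double <- house_small[["party"]]

# The second value of the party column
second_party <- house_small$party[2]
second_party

# Number of rows and columns
nrow(house_small)
ncol(house_small)
```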
Main Types of Data in R
Types of Data | R Types | Ways to Summarize |
---|---|---|
Numeric | integer (int), double (dbl) | mean, median, min, max, range, IQR, sd, var, summary |
Categorical | character (chr), factor (fct) | table, prop.table |
Logical | logical (lgl), TRUE, FALSE, NA | |
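To check which of these types a value has, base R provides class() and typeof(). The following is a minimal sketch with made-up values; the object names are chosen for this example.

```r
# Numeric data: integers (note the L suffix) and doubles
int_class <- class(2L)
dbl_type <- typeof(2.5)

# Categorical data: character strings and factors
chr_class <- class("DEM")
fct_class <- class(factor("DEM"))

# Logical data
lgl_class <- class(TRUE)

# Print them together
c(int_class, dbl_type, chr_class, fct_class, lgl_class)
```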
Categorical Data
In political science, we often treat categorical data or data that comes as a string/character variable as a factor variable. A factor variable is a way of storing this data in categories. We can create them using the factor() command.
Try turning the party variable in our house dataset into a factor:
factor(house$party)
It is important to note that factor levels are stored in alphabetical or numeric order by default, though we often want to specify a different order.
We can use the “levels” argument of the factor() command to accomplish this.
Say we want to organize our house party factor in reverse alphabetical order. We could do that like this:
factor(house$party, levels = c("REP", "DEM"))
Note about the levels argument: the levels must be spelled exactly the same as in the data. Additionally, you should list every category that appears in the data, because any value not included in levels is converted to NA. This is where it can help to determine the specific levels of the categorical variable using unique(variable).
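The behavior of the levels argument can be checked on a short made-up vector. This sketch shows the default order, a custom order, and what happens when a category is left out of levels.

```r
party <- c("DEM", "REP", "DEM")

# See which categories appear in the data
unique(party)

# Default: levels are in alphabetical order
default_levels <- levels(factor(party))
default_levels

# Custom order: REP first, then DEM
party_rev <- factor(party, levels = c("REP", "DEM"))
levels(party_rev)

# Leaving out a category turns those values into NA
party_missing <- factor(party, levels = c("DEM"))
party_missing
```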
Summarizing Factor Variables
Since factor variables are in categories (i.e., are not numeric), we need to take a different approach to summarizing. We generally do this by looking at a table of our data. Tables tell us how many observations are in each category of our data.
table(house$party)
DEM REP
378 348
Sometimes, we might also want to know what proportion of observations falls into each category. We can calculate this using the following code:
prop.table(table(house$party))
DEM REP
0.5206612 0.4793388
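A proportion is simply each count divided by the total number of observations. As a check, this sketch recomputes the proportions by hand from the counts printed above (378 and 348); the object names are chosen for this example.

```r
# Counts taken from the table() output above
counts <- c(DEM = 378, REP = 348)

# Each count divided by the total
props <- counts / sum(counts)
props

# The same proportions as percentages, rounded to one decimal place
round(100 * props, 1)
```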
Logical Data in R
Another common type of data in R is “logical” data. There are three values logical data can take on: (1) TRUE, (2) FALSE, and (3) NA.
TRUE indicates that some condition is met, or “true.” FALSE indicates that some condition is not met, or is “false.” Finally, NA indicates that the data are missing.
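Missing values affect calculations, so it is worth seeing how NA behaves. The sketch below uses a made-up vector; is.na() and the na.rm argument are base R features that this section does not otherwise cover.

```r
answers <- c(TRUE, FALSE, NA)

# Which values are missing?
missing_flags <- is.na(answers)
missing_flags

# A calculation that includes NA returns NA
total_with_na <- sum(answers)
total_with_na

# na.rm = TRUE drops the missing values first (TRUE counts as 1, FALSE as 0)
total_without_na <- sum(answers, na.rm = TRUE)
total_without_na
```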
R will check whether various conditions are met for us. We can do this using logical operators:
Logical Operator | Task |
---|---|
== | equal to |
!= | not equal to |
< | less than |
> | greater than |
<= | less than or equal to |
>= | greater than or equal to |
! | not |
%in% | in |
If we’d like to check whether 2 is equal to 3, we can use this code:
2==3
[1] FALSE
R checks whether this is true, and returns FALSE because it’s not true.
How about whether 2 is less than or equal to 3?
2<=3
[1] TRUE
What if we have a vector of integers between 1 and 7 called x, and we want to check whether the number 3 is in this vector?
x <- 1:7
3 %in% x
[1] TRUE
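Logical operators also work on whole vectors at once, returning one TRUE or FALSE per element. This minimal sketch reuses the vector of integers from 1 to 7; the other object names are chosen for this example.

```r
x <- 1:7

# One TRUE/FALSE per element of x
bigger <- x > 4
bigger

# Counting the TRUEs: how many elements are greater than 4?
n_bigger <- sum(bigger)
n_bigger

# Keep only the elements where the condition is TRUE
kept <- x[x > 4]
kept

# %in% checks each value on the left against the vector on the right
found <- c(3, 9) %in% x
found
```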
Logical conditions like these are useful for taking portions/subsets of our data to which a certain condition applies. For example, in our house data, we can take only the rows for Democratic candidates:
house[house$party=="DEM",]
# A tibble: 378 × 8
party state district incumbent_challenge_full receipts disbur…¹ last votes…²
<chr> <chr> <dbl> <chr> <dbl> <dbl> <chr> <dbl>
1 DEM AL 1 Open seat 80095. 78973. AVER… 35.5
2 DEM AL 2 Open seat 57723. 57661. HARV… 34.7
3 DEM AL 7 Incumbent 2171040. 1498832. SEWE… 97.2
4 DEM CA 7 Incumbent 1830741. 1126436. BERA 56.6
5 DEM CA 4 Challenger 3022017. 3018529. KENN… 44.1
6 DEM CA 8 Open seat 1878106. 1876718. BUBS… 43.9
7 DEM CA 11 Incumbent 612404. 442547. DESA… 73.0
8 DEM CA 3 Incumbent 1103806. 717060. GARA… 54.7
9 DEM CA 18 Challenger 777037. 697543. KUMAR 36.8
10 DEM CA 22 Challenger 5175373. 5146107. ARBA… 45.8
# … with 368 more rows, and abbreviated variable names ¹disbursements,
# ²voteshare
This code will give you the same output, because every candidate in this dataset is either a Democrat or a Republican:
house[house$party!="REP",]
We can also look only at candidates who received more than 50% of the vote:
house[house$voteshare>50,]
# A tibble: 380 × 8
party state district incumbent_challenge_full receipts disbu…¹ last votes…²
<chr> <chr> <dbl> <chr> <dbl> <dbl> <chr> <dbl>
1 REP AL 1 Open seat 2344517. 2.23e6 CARL 64.4
2 REP AL 5 Incumbent 669026. 2.24e5 BROO… 95.8
3 DEM AL 7 Incumbent 2171040. 1.50e6 SEWE… 97.2
4 REP AR 1 Incumbent 966801. 1.10e6 CRAW… 100
5 DEM CA 7 Incumbent 1830741. 1.13e6 BERA 56.6
6 REP CA 8 Open seat 2028339. 1.96e6 OBER… 56.1
7 DEM CA 11 Incumbent 612404. 4.43e5 DESA… 73.0
8 DEM CA 3 Incumbent 1103806. 7.17e5 GARA… 54.7
9 REP CA 25 Incumbent 10140614. 9.76e6 GARC… 50.0
10 DEM CA 28 Incumbent 19598363. 1.04e7 SCHI… 72.7
# … with 370 more rows, and abbreviated variable names ¹disbursements,
# ²voteshare
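To see how this kind of subsetting works step by step, the sketch below builds the logical condition first and then uses it inside the brackets. It assumes a small stand-in data frame, house_small, made from six rows of the output shown earlier; the status column is a shortened, hypothetical name for incumbent_challenge_full.

```r
# Six rows copied from the printed house data
house_small <- data.frame(
  party = c("REP", "DEM", "DEM", "REP", "DEM", "REP"),
  state = c("AL", "AL", "AL", "AL", "AL", "CA"),
  status = c("Open seat", "Open seat", "Open seat",
             "Incumbent", "Incumbent", "Challenger"),
  voteshare = c(64.4, 35.5, 34.7, 95.8, 97.2, 45.3)
)

# Step 1: one TRUE/FALSE per row
over_50 <- house_small$voteshare > 50
over_50

# Step 2: keep only the rows where the condition is TRUE
winners <- house_small[over_50, ]
winners

# How many rows meet the condition, and what share of all rows is that?
n_over_50 <- sum(over_50)
share_over_50 <- mean(over_50)
```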
Evaluating Multiple Logical Conditions
Operator | Task |
---|---|
& | AND |
\| | OR |
For example, we can get only the observations in our dataset for Democrats who received more than 50% of the vote.
house[house$party=="DEM" & house$voteshare>50,]
# A tibble: 192 × 8
party state district incumbent_challenge_full receipts disbu…¹ last votes…²
<chr> <chr> <dbl> <chr> <dbl> <dbl> <chr> <dbl>
1 DEM AL 7 Incumbent 2171040. 1.50e6 SEWE… 97.2
2 DEM CA 7 Incumbent 1830741. 1.13e6 BERA 56.6
3 DEM CA 11 Incumbent 612404. 4.43e5 DESA… 73.0
4 DEM CA 3 Incumbent 1103806. 7.17e5 GARA… 54.7
5 DEM CA 28 Incumbent 19598363. 1.04e7 SCHI… 72.7
6 DEM CA 27 Incumbent 1220538. 9.63e5 CHU 69.8
7 DEM CA 37 Incumbent 2233947. 1.22e6 BASS 85.9
8 DEM CT 3 Incumbent 1956686. 1.81e6 DELA… 56.1
9 DEM DC 0 Incumbent 387489. 3.59e5 NORT… 81.9
10 DEM FL 24 Incumbent 412592. 3.21e5 WILS… 75.6
# … with 182 more rows, and abbreviated variable names ¹disbursements,
# ²voteshare
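The difference between & and | is easiest to see on a few rows. This sketch again assumes a small stand-in data frame, house_small, built from six rows of the output shown earlier: & keeps rows where both conditions are TRUE, while | keeps rows where at least one is TRUE.

```r
# Six rows copied from the printed house data
house_small <- data.frame(
  party = c("REP", "DEM", "DEM", "REP", "DEM", "REP"),
  voteshare = c(64.4, 35.5, 34.7, 95.8, 97.2, 45.3)
)

is_dem <- house_small$party == "DEM"
over_50 <- house_small$voteshare > 50

# AND: Democrats who also received more than 50% of the vote
dem_and_over_50 <- house_small[is_dem & over_50, ]
dem_and_over_50

# OR: Democrats, or anyone who received more than 50% of the vote
dem_or_over_50 <- house_small[is_dem | over_50, ]
dem_or_over_50
```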
Exercises
- Find the average vote share of candidates in the 2020 elections dataset.
mean(house$voteshare)
[1] 51.15903
- What is the median amount of money spent by candidates in 2020?
median(house$disbursements)
[1] 1141486
- What are three ways we could summarize how spread out campaign fundraising is (look at the receipts column here)?
range(house$receipts)
[1] 3065.2 38160641.6
IQR(house$receipts)
[1] 2228867
sd(house$receipts)
[1] 3251062
var(house$receipts)
[1] 1.05694e+13
- What’s one command we could use to get an overview of many summary statistics for our voteshare column?
summary(house$voteshare)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.168 38.045 51.104 51.159 63.713 100.000
- Take the incumbent_challenge_full variable. Turn it into a factor, and order the levels in this order: “Incumbents”, “Challengers”, and “Open-Seat Candidates.” Make a proportion table from this factor variable.
factor(house$incumbent_challenge_full,
       levels = c("Incumbent", "Challenger", "Open seat")) %>%
  table() %>%
  prop.table()
.
Incumbent Challenger Open seat
0.48898072 0.41597796 0.09504132
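The same steps can also be written one at a time with intermediate objects, which may be easier to read when you are starting out. This sketch uses a short made-up status vector in place of the real column, and the object names are chosen for this example.

```r
# A made-up stand-in for house$incumbent_challenge_full
status <- c("Incumbent", "Challenger", "Open seat", "Incumbent")

# Step 1: turn the variable into a factor with the levels in the order we want
status_factor <- factor(status,
                        levels = c("Incumbent", "Challenger", "Open seat"))

# Step 2: count the observations in each category
status_counts <- table(status_factor)
status_counts

# Step 3: convert the counts into proportions
status_props <- prop.table(status_counts)
status_props
```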
- Subset the data so that we only have rows for incumbents OR candidates who received more than 60% of the vote.
house[house$incumbent_challenge_full=="Incumbent" | house$voteshare>60,]
# A tibble: 374 × 8
party state district incumbent_challenge_full receipts disbu…¹ last votes…²
<chr> <chr> <dbl> <chr> <dbl> <dbl> <chr> <dbl>
1 REP AL 1 Open seat 2344517. 2.23e6 CARL 64.4
2 REP AL 5 Incumbent 669026. 2.24e5 BROO… 95.8
3 DEM AL 7 Incumbent 2171040. 1.50e6 SEWE… 97.2
4 REP AR 1 Incumbent 966801. 1.10e6 CRAW… 100
5 DEM CA 7 Incumbent 1830741. 1.13e6 BERA 56.6
6 DEM CA 11 Incumbent 612404. 4.43e5 DESA… 73.0
7 DEM CA 3 Incumbent 1103806. 7.17e5 GARA… 54.7
8 REP CA 25 Incumbent 10140614. 9.76e6 GARC… 50.0
9 DEM CA 28 Incumbent 19598363. 1.04e7 SCHI… 72.7
10 DEM CA 27 Incumbent 1220538. 9.63e5 CHU 69.8
# … with 364 more rows, and abbreviated variable names ¹disbursements,
# ²voteshare