get_bio
Unstructured Biographical Text
We often work with unstructured text but want to extract specific
structured data from it. We could read through each text and extract
the desired information manually; however, this is time-consuming and
can become infeasible as the number of texts grows. The function
get_bio() allows us to extract the desired data from unstructured texts
using the ChatGPT API.
If you don’t have an API key for ChatGPT, you should create one
through OpenAI before attempting to use this package. For simplicity,
you can save this API key to your .Renviron file (note: the function
usethis::edit_r_environ() may be helpful here).
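As a minimal sketch of checking that a key saved in .Renviron is visible to R, the snippet below reads an environment variable with base R. The variable name OPENAI_API_KEY is an assumption made for illustration; the package documentation is the place to confirm which name it actually reads.

```r
# Assumption: the key is stored under the name OPENAI_API_KEY. This name is
# illustrative; confirm the name the package expects in its documentation.
key_name <- "OPENAI_API_KEY"

# In .Renviron the entry would be a single line such as:
# OPENAI_API_KEY=your-key-here

# Sys.getenv() returns "" (the `unset` value) when the variable is missing.
api_key <- Sys.getenv(key_name, unset = "")
key_is_set <- nzchar(api_key)

if (key_is_set) {
  message("Found a value for ", key_name, ".")
} else {
  message("No value found for ", key_name,
          "; add it to .Renviron and restart R.")
}
```

Changes to .Renviron only take effect in a new R session, so restart R after editing the file.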
As an example, we can scrape the biography of actor Adam Driver from
Wikipedia using the rvest package.
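# Scrape the biography if rvest is available; otherwise use a saved copy.
if (rlang::is_installed("rvest")) {
  # Select the elements that follow the "Early life" heading.
  driver_bio <- rvest::read_html("https://en.wikipedia.org/wiki/Adam_Driver") |>
    rvest::html_elements(xpath = "//h2/span[@id='Early_life']/parent::*/following-sibling::*")
  # Combine the text of the selected elements into a single string.
  driver_bio <- paste(rvest::html_text2(driver_bio[2:5]), collapse = " ")
  driver_bio
} else {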
driver_bio <- "Driver was born on November 19, 1983,[5] in San Diego, California,[6] the son of Nancy Wright (née Needham), a paralegal, and Joe Douglas Driver.[7][8] Director Terry Gilliam has claimed that Driver has Native American ancestry,[9] though Driver has no known Native American ancestors. His father's family is from Arkansas, and his mother's family is from Indiana. His stepfather, Rodney G. Wright, is a minister at a Baptist church.[10][11] When Driver was seven years old, he moved with his older sister and mother to his mother's hometown Mishawaka, Indiana, where he graduated from Mishawaka High School in 2001.[12][13] Driver was raised Baptist, and sang in the choir at church.[14] Driver has described his teenage self as a \"misfit\"; he told M Magazine that he climbed radio towers, set objects on fire, and co-founded a fight club with friends, inspired by the 1999 film Fight Club.[15] After high school, he worked as a door-to-door salesman selling Kirby vacuum cleaners and as a telemarketer for a basement waterproofing company and Ben Franklin Construction.[16] He applied to the Juilliard School for drama but was not accepted.[17] Shortly after the September 11 attacks, Driver enlisted in the United States Marine Corps.[5] He was assigned to Weapons Company, 1st Battalion, 1st Marines as an 81mm mortar man.[18] He served for two years and eight months before fracturing his sternum while mountain biking.[19] He was medically discharged with the rank of Lance Corporal. Subsequently, Driver attended the University of Indianapolis for a year before auditioning again for Juilliard, this time succeeding. He got the news he was accepted while at work at the Target Distribution Center in Indianapolis. 
Driver has said that his classmates saw him as an intimidating and volatile figure, and he struggled to fit into a lifestyle so different from the Marines.[15] He was a member of the Drama Division's Group 38 from 2005 to 2009, where he met his future wife, Joanne Tucker. He graduated with a Bachelor of Fine Arts in 2009.[20]"
}
#> [1] "Driver was born on November 19, 1983,[5] in San Diego, California,[6] the son of Nancy Wright (née Needham), a paralegal, and Joe Douglas Driver.[7][8] Director Terry Gilliam has claimed that Driver has Native American ancestry,[9] though Driver has no known Native American ancestors. His father's family is from Arkansas, and his mother's family is from Indiana. His stepfather, Rodney G. Wright, is a minister at a Baptist church.[10][11] When Driver was seven years old, he moved with his older sister and mother to his mother's hometown Mishawaka, Indiana, where he graduated from Mishawaka High School in 2001.[12][13] Driver was raised Baptist, and sang in the choir at church.[14] Driver has described his teenage self as a \"misfit\"; he told M Magazine that he climbed radio towers, set objects on fire, and co-founded a fight club with friends, inspired by the 1999 film Fight Club.[15] After high school, he worked as a door-to-door salesman selling Kirby vacuum cleaners and as a telemarketer for a basement waterproofing company and Ben Franklin Construction.[16] He applied to the Juilliard School for drama but was not accepted.[17] Shortly after the September 11 attacks, Driver enlisted in the United States Marine Corps.[5] He was assigned to Weapons Company, 1st Battalion, 1st Marines as an 81mm mortar man.[18] He served for two years and eight months before fracturing his sternum while mountain biking.[19] He was medically discharged with the rank of Lance Corporal. Subsequently, Driver attended the University of Indianapolis for a year before auditioning again for Juilliard, this time succeeding. He got the news he was accepted while at work at the Target Distribution Center in Indianapolis. 
Driver has said that his classmates saw him as an intimidating and volatile figure, and he struggled to fit into a lifestyle so different from the Marines.[15] He was a member of the Drama Division's Group 38 from 2005 to 2009, where he met his future wife, Joanne Tucker. He graduated with a Bachelor of Fine Arts in 2009.[20]"
Using get_bio() to extract data
Now that we have a real example of an unstructured biography, we can decide what information we want to extract from the text. In practice, it’s helpful to read through a sample of unstructured texts to find out what type of information tends to be included in the texts and to create a gold standard set of information to check against the ChatGPT output.
The Adam Driver biography from Wikipedia contains a variety of
information that might be interesting for potential study (his date of
birth, place of birth, college, military experience, marriage, etc.). We
use the get_bio() function to call ChatGPT’s API to extract this
information.
The get_bio() function has five key arguments: bio, bio_name,
prompt_fields, prompt_fields_formats, and prompt_fields_values.
bio must contain the biographical text from which you want to extract
data. bio_name is optional but recommended; it specifies the individual
for whom ChatGPT should extract information.
We put the desired biographical information fields in the
prompt_fields argument as a character vector. The names in
prompt_fields should be informative (people reading the fields
should be able to understand what specific information you want each
field to contain). For example, the names might be something like
c("birthdate", "town_of_birth"). If you don’t pass anything to the
prompt_fields argument, the function uses the default biographical
fields: birth_date, highest_level_of_education, college,
graduate_school, previous_occupation, gender, town_of_birth,
state_of_birth, and married.
For certain fields, you might want information returned in a specific
format (e.g., dates in MM/DD/YYYY format); you can pass this information
through the prompt_fields_formats argument as a named list with names
corresponding to values in prompt_fields. For example, we could pass
prompt_fields_formats=list(birthdate="MM/DD/YYYY")
to instruct ChatGPT about proper birthdate formatting.
Finally, we might want to restrict certain fields to only take
certain values: the prompt_fields_values argument allows us to pass
this information as a named list of vectors, where each vector holds
the acceptable values for one field. As an example, we could pass
prompt_fields_values=list(education=c("High School or less", "College", "Graduate School"))
to tell ChatGPT that we only want education information to be one of
these three values.
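Because the names in the formats and values lists must correspond to entries in prompt_fields, a quick check before calling the API can catch typos. The helper below is hypothetical (it is not part of the package) and uses only base R; it is a sketch of one way to do the check, not a description of what get_bio() does internally.

```r
# Hypothetical helper (not part of the package): return the names in a
# formats/values list that do not appear in prompt_fields.
unmatched_names <- function(prompt_fields, named_list) {
  setdiff(names(named_list), prompt_fields)
}

prompt_fields <- c("birth_date", "college", "military_experience", "married")

prompt_fields_formats <- list(birth_date = "{MM}/{DD}/{YYYY}",
                              college = "{SCHOOL} - {DEGREE}")

# "maried" is a deliberate typo to show what the check catches.
prompt_fields_values <- list(military_experience = c("Yes", "No"),
                             maried = c("Yes", "No"))

bad_formats <- unmatched_names(prompt_fields, prompt_fields_formats)
bad_values <- unmatched_names(prompt_fields, prompt_fields_values)

if (length(bad_values) > 0) {
  message("Names not found in prompt_fields: ",
          paste(bad_values, collapse = ", "))
}
```

Here bad_formats is empty because both format names match a field, while bad_values flags the misspelled name.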
get_bio(bio = driver_bio,
        bio_name = "Adam Driver",
        prompt_fields = c("birth_date", "town_of_birth", "state_of_birth",
                          "college", "religion", "military_experience",
                          "married"),
        prompt_fields_formats = list(birth_date = "{MM}/{DD}/{YYYY}",
                                     college = "{SCHOOL} - {DEGREE}"),
        prompt_fields_values = list(military_experience = c("Yes", "No"),
                                    married = c("Yes", "No")))
#> Input Tokens: 607
#> Output Tokens: 82
#> Total Tokens: 689
#> # A tibble: 1 × 7
#>   birth_date town_of_birth state_of_birth college   religion military_experience
#>   <chr>      <chr>         <chr>          <chr>     <chr>    <chr>
#> 1 11/19/1983 San Diego     California     Universi… Baptist  Yes
#> # ℹ 1 more variable: married <chr>
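To connect this output back to the gold standard idea mentioned earlier, the sketch below compares extracted values against hand-coded values field by field. Both data frames are typed in by hand for illustration (the extracted one mirrors some of the columns shown above); in practice the first would be the tibble returned by get_bio().

```r
# Hypothetical extracted output, typed in by hand to mirror the result above.
extracted <- data.frame(birth_date = "11/19/1983",
                        town_of_birth = "San Diego",
                        state_of_birth = "California",
                        military_experience = "Yes",
                        stringsAsFactors = FALSE)

# Hypothetical gold standard, coded by hand from the same biography.
gold <- data.frame(birth_date = "11/19/1983",
                   town_of_birth = "San Diego",
                   state_of_birth = "California",
                   military_experience = "Yes",
                   stringsAsFactors = FALSE)

# Compare field by field; TRUE means the extracted value matches the gold value.
shared_fields <- intersect(names(extracted), names(gold))
field_matches <- vapply(shared_fields,
                        function(f) identical(extracted[[f]], gold[[f]]),
                        logical(1))

# Share of fields that match exactly.
accuracy <- mean(field_matches)
```

Exact matching is a deliberately strict choice; fields with free-form values may need a looser comparison.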
Custom Prompts with get_bio()
If you would like, you can input a custom prompt for get_bio() using
the prompt argument, which will override the defaults. If you input a
custom prompt, you should include all applicable information from the
prompt_fields, prompt_fields_formats, and prompt_fields_values
arguments in your custom prompt.
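One way to keep a custom prompt in sync with the fields, formats, and values is to assemble it from the same objects. The sketch below does this with base R string functions. The prompt wording and the describe_field() helper are assumptions for illustration; they are not the package's default prompt, and how get_bio() combines the prompt with the biography text is not covered here.

```r
fields <- c("birth_date", "military_experience")
formats <- list(birth_date = "{MM}/{DD}/{YYYY}")
values <- list(military_experience = c("Yes", "No"))

# Hypothetical helper: one instruction line per field, adding the format
# and the allowed values when they were supplied.
describe_field <- function(field) {
  line <- paste0("- ", field)
  if (!is.null(formats[[field]])) {
    line <- paste0(line, " (format: ", formats[[field]], ")")
  }
  if (!is.null(values[[field]])) {
    line <- paste0(line, " (one of: ",
                   paste(values[[field]], collapse = ", "), ")")
  }
  line
}

# Illustrative wording only; adapt it to your own extraction task.
custom_prompt <- paste(
  c("Extract the following fields from the biography below.",
    vapply(fields, describe_field, character(1))),
  collapse = "\n"
)
cat(custom_prompt)

# Not run (requires an API key):
# get_bio(bio = driver_bio, bio_name = "Adam Driver", prompt = custom_prompt)
```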
Few-Shot Prompting with get_bio()
We can also use few-shot prompting in get_bio() with the
prompt_fewshot argument. The prompt_fewshot argument should be a
data.frame or tibble that contains example bios in a column called
“bio”, example names in a column called “bio_name” (if desired), and
example outputs for each of the prompt_fields in correspondingly named
columns.
For the get_bio() example above, few-shot prompting might look
something like this:
fewshot_example <- data.frame(bio = "John Smith was born on the thirteenth of October in 1992 in St. Louis, MO. He went on to earn his Bachelor of Arts degree from Invisible University where he met his wife, Marie. Raised as a Quaker, he was opposed to entering the military.",
                              bio_name = "John Smith",
                              birth_date = "10/13/1992",
                              town_of_birth = "St. Louis",
                              state_of_birth = "Missouri",
                              college = "Invisible University - B.A.",
                              religion = "Quaker",
                              military_experience = "No",
                              married = "Yes")
get_bio(bio = driver_bio,
        bio_name = "Adam Driver",
        prompt_fields = c("birth_date", "town_of_birth", "state_of_birth",
                          "college", "religion", "military_experience",
                          "married"),
        prompt_fields_formats = list(birth_date = "{MM}/{DD}/{YYYY}",
                                     college = "{SCHOOL} - {DEGREE}"),
        prompt_fields_values = list(military_experience = c("Yes", "No"),
                                    married = c("Yes", "No")),
        prompt_fewshot = fewshot_example)
#> Input Tokens: 835
#> Output Tokens: 61
#> Total Tokens: 896
#> # A tibble: 1 × 7
#>   birth_date town_of_birth state_of_birth college   religion military_experience
#>   <chr>      <chr>         <chr>          <chr>     <chr>    <chr>
#> 1 11/19/1983 San Diego     California     Universi… Baptist  Yes
#> # ℹ 1 more variable: married <chr>
Using get_bio_function_call() to extract data
ChatGPT also supports a different type of prompting known as function
calling. This type of prompting can be helpful for extracting structured
information from user input. The function get_bio_function_call()
allows us to use a function call to extract biographical data.
Function calling essentially tells ChatGPT we have a function that
takes specific arguments, and we want to extract those arguments from
the input. Function call prompts take three types of information about
each function argument you want to extract: the type of object you want
in the argument (kept as a string in biographR for simplicity), the
possible values of the argument (passed through the
prompt_fields_values argument to get_bio_function_call()), and a
description of the argument (this combines elements from
prompt_fields_descriptions and prompt_fields_formats).
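To make the three pieces of information concrete, the sketch below builds a function definition as a nested R list, following the general shape OpenAI documents for function calling (a name, a description, and JSON-schema-style parameters with type, description, and enum). This is an illustration only: the function name record_biography and the helper build_property() are hypothetical, and the package's internal construction may differ.

```r
fields <- c("birth_date", "college", "married")
formats <- list(birth_date = "{MM}/{DD}/{YYYY}",
                college = "{SCHOOL} - {DEGREE}")
values <- list(married = c("Yes", "No"))
descriptions <- list(college = "Information about the individual's college degree.")

# Hypothetical helper: one schema entry per field with a type, an optional
# description, and optional allowed values.
build_property <- function(field) {
  # Combine the free-text description and the format hint, when present.
  parts <- c(descriptions[[field]],
             if (!is.null(formats[[field]])) paste("Format:", formats[[field]]))
  property <- list(type = "string")  # every field is treated as a string
  if (length(parts) > 0) property$description <- paste(parts, collapse = " ")
  if (!is.null(values[[field]])) property$enum <- values[[field]]
  property
}

# Hypothetical function name and description, for illustration only.
function_spec <- list(
  name = "record_biography",
  description = "Record structured biographical data.",
  parameters = list(
    type = "object",
    properties = stats::setNames(lapply(fields, build_property), fields)
  )
)

# Inspect the per-field schema entries.
str(function_spec$parameters$properties)
```

Restricting a field through enum is what makes the allowed values part of the function definition rather than part of the free-text prompt.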
According to OpenAI, the GPT-4 models appear to be somewhat better at returning function call information correctly.
Turning back to the Adam Driver biography from above, we can use ChatGPT function calling to extract biographical information from the text.
get_bio_function_call(bio = driver_bio,
                      bio_name = "Adam Driver",
                      prompt_fields = c("birth_date", "town_of_birth",
                                        "state_of_birth", "college",
                                        "religion", "military_experience",
                                        "married"),
                      prompt_fields_formats = list(birth_date = "{MM}/{DD}/{YYYY}",
                                                   college = "{SCHOOL} - {DEGREE}"),
                      prompt_fields_values = list(military_experience = c("Yes", "No"),
                                                  married = c("Yes", "No")),
                      prompt_fields_descriptions = list(college = "Information about the individual's college degree.",
                                                        religion = "Information about any religious history the individual has."))
#> Input Tokens: 758
#> Output Tokens: 92
#> Total Tokens: 850
#> # A tibble: 1 × 7
#>   birth_date town_of_birth state_of_birth college   religion military_experience
#>   <chr>      <chr>         <chr>          <chr>     <chr>    <chr>
#> 1 11/19/1983 San Diego     California     Universi… Baptist  Yes
#> # ℹ 1 more variable: married <chr>
Few-Shot Prompting with get_bio_function_call()
The syntax for few-shot prompting with get_bio_function_call() is the
same as the syntax for get_bio(). We should input the few-shot examples
as a data.frame or tibble with example biographical text in a column
called “bio”, example biographical names in a column called “bio_name”
(if desired), and example biographical data in columns with names from
prompt_fields.
Reusing the example from the Few-Shot Prompting with get_bio() section:
fewshot_example <- data.frame(bio = "John Smith was born on the thirteenth of October in 1992 in St. Louis, MO. He went on to earn his Bachelor of Arts degree from Invisible University where he met his wife, Marie. Raised as a Quaker, he was opposed to entering the military.",
                              bio_name = "John Smith",
                              birth_date = "10/13/1992",
                              town_of_birth = "St. Louis",
                              state_of_birth = "Missouri",
                              college = "Invisible University - B.A.",
                              religion = "Quaker",
                              military_experience = "No",
                              married = "Yes")
get_bio_function_call(bio = driver_bio,
                      bio_name = "Adam Driver",
                      prompt_fields = c("birth_date", "town_of_birth", "state_of_birth",
                                        "college", "religion", "military_experience",
                                        "married"),
                      prompt_fields_formats = list(birth_date = "{MM}/{DD}/{YYYY}",
                                                   college = "{SCHOOL} - {DEGREE}"),
                      prompt_fields_values = list(military_experience = c("Yes", "No"),
                                                  married = c("Yes", "No")),
                      prompt_fewshot = fewshot_example)
#> Input Tokens: 951
#> Output Tokens: 71
#> Total Tokens: 1022
#> # A tibble: 1 × 7
#>   birth_date town_of_birth state_of_birth college   religion military_experience
#>   <chr>      <chr>         <chr>          <chr>     <chr>    <chr>
#> 1 11/19/1983 San Diego     California     Universi… Baptist  Yes
#> # ℹ 1 more variable: married <chr>