Get structured biographical data from unstructured text
get_bio.Rd
get_bio() uses standard ChatGPT chat completions to retrieve structured data from input text and allows for fully customizable prompts.
get_bio_function_call() uses ChatGPT function calling to retrieve structured data from input text.
Usage
get_bio(
  bio,
  bio_name = NULL,
  prompt = NULL,
  prompt_fields = NULL,
  prompt_fields_formats = NULL,
  prompt_fields_values = NULL,
  prompt_fewshot = NULL,
  openai_api_key = NULL,
  openai_model = "gpt-3.5-turbo",
  openai_temperature = 0,
  openai_seed = NULL
)

get_bio_function_call(
  bio,
  bio_name = NULL,
  prompt_fields = NULL,
  prompt_fields_formats = NULL,
  prompt_fields_values = NULL,
  prompt_fields_descriptions = NULL,
  prompt_fewshot = NULL,
  openai_api_key = NULL,
  openai_model = "gpt-3.5-turbo",
  openai_temperature = 0,
  openai_seed = NULL
)
Arguments
- bio
The bio to be processed, a string
- bio_name
The name of the individual whose biographical information is desired, a string. For get_bio(), bio_name can be a vector of strings containing the names of all individuals for whom biographical information is desired.
- prompt
Only for use in get_bio(). A string. If desired, a custom prompt. This overrides the default prompt and should include any desired prompt fields, formats, and values.
- prompt_fields
A character vector of desired biographical output fields (e.g., "college", "graduate_school")
- prompt_fields_formats
A named list of strings giving desired formats for output fields (e.g., "{SCHOOL} - {DEGREE}"). Names should be present in prompt_fields.
- prompt_fields_values
A named list of character vectors of desired output values for each prompt field. Names should be present in prompt_fields.
- prompt_fewshot
A data.frame or tibble with complete example data. It should have a column called 'bio' containing unstructured example text, a column called 'bio_name' containing the name of the individual in the example (if applicable), and one column of expected output for every field in prompt_fields. get_bio() example: data.frame(bio = "John Smith went to Nowhere University, and he graduated with a B.A.", bio_name = "John Smith", gender = "Male", college = "Nowhere University - B.A.")
- openai_api_key
API key for OpenAI, a string. If this is NULL, get_bio() searches .Renviron for the API key.
- openai_model
ChatGPT model to use, defaults to "gpt-3.5-turbo"
- openai_temperature
A number between 0 and 2 that sets the amount of randomness in ChatGPT output; higher values give more randomness. Defaults to 0.
- openai_seed
An integer, specifies a random seed for ChatGPT (this is in the development stage at OpenAI, so it might not work perfectly).
- prompt_fields_descriptions
Only for use in get_bio_function_call(). A named list of strings with additional text describing each prompt field. Names should be present in prompt_fields.
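The rule that names in prompt_fields_formats, prompt_fields_values, and prompt_fields_descriptions should be present in prompt_fields can be checked before any API call is made. The following is a minimal base-R sketch of such a check. The helper check_field_names() is hypothetical and not part of the package, and the field values shown are illustrative only.

```r
# Hypothetical helper (not part of the package): returns the names of a
# named list that do NOT appear in prompt_fields.
check_field_names <- function(prompt_fields, x) {
  setdiff(names(x), prompt_fields)
}

prompt_fields <- c("college", "gender")

# Format template for one field, following the "{SCHOOL} - {DEGREE}" style.
prompt_fields_formats <- list(college = "{SCHOOL} - {DEGREE}")

# Allowed output values for one field (illustrative values).
prompt_fields_values <- list(gender = c("Male", "Female", "Unknown"))

# An empty result means every name is a known prompt field.
check_field_names(prompt_fields, prompt_fields_formats)
check_field_names(prompt_fields, prompt_fields_values)

# A misspelled name is reported back so it can be fixed before calling the API.
check_field_names(prompt_fields, list(colege = "{SCHOOL}"))
```

Running a check like this first avoids spending tokens on a request whose formats or values refer to a field that was never requested.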
Value
A tibble containing the desired biographical information, or the unprocessed API output when a custom prompt is used
Examples
# Biographical Information about Kevin McCarthy from
# https://bioguide.congress.gov/search/bio/M001165
get_bio(bio = "MCCARTHY, KEVIN, a Representative from California;
born in Bakersfield, Kern County, Calif., January 26,
1965; graduated from Bakersfield High School,
Bakersfield, Calif., 1983; attended Bakersfield College,
Bakersfield. Calif., 1983-1986; B.S., California State
University, Bakersfield, Calif., 1989; M.B.A., California
State University, Bakersfield, Calif., 1994; staff,
United States Representative William Thomas of California,
1987-2002; member of the California state assembly,
2002-2007, minority leader, 2004-2006; elected as a
Republican to the One Hundred Tenth and to the eight
succeeding Congresses (January 3, 2007-present); majority
whip (One Hundred Twelfth and One Hundred Thirteenth
Congresses); majority leader (One Hundred Thirteenth
through One Hundred Fifteenth Congresses); minority
leader (One Hundred Sixteenth and One Hundred Seventeenth
Congress); Speaker of the House (One Hundred Eighteenth
Congress).",
        bio_name = "Kevin McCarthy")
#> No prompt_fields argument provided. Defaulting to: birth_date, highest_level_of_education, college, graduate school, previous_occupation, gender, town_of_birth, state_of_birth, married.
#> Input Tokens: 429
#> Output Tokens: 138
#> Total Tokens: 567
#> # A tibble: 1 × 9
#>   birth_date highest_level_of_educ…¹ college graduate_school previous_occupation
#>   <chr>      <chr>                   <chr>   <chr>           <chr>
#> 1 01/26/1965 M.B.A.                  Bakers… California Sta… staff, United Stat…
#> # ℹ abbreviated name: ¹highest_level_of_education
#> # ℹ 4 more variables: gender <chr>, town_of_birth <chr>, state_of_birth <chr>,
#> #   married <chr>
get_bio_function_call(bio = "MCCARTHY, KEVIN, a Representative from California;
born in Bakersfield, Kern County, Calif., January 26,
1965; graduated from Bakersfield High School,
Bakersfield, Calif., 1983; attended Bakersfield College,
Bakersfield. Calif., 1983-1986; B.S., California State
University, Bakersfield, Calif., 1989; M.B.A., California
State University, Bakersfield, Calif., 1994; staff,
United States Representative William Thomas of California,
1987-2002; member of the California state assembly,
2002-2007, minority leader, 2004-2006; elected as a
Republican to the One Hundred Tenth and to the eight
succeeding Congresses (January 3, 2007-present); majority
whip (One Hundred Twelfth and One Hundred Thirteenth
Congresses); majority leader (One Hundred Thirteenth
through One Hundred Fifteenth Congresses); minority
leader (One Hundred Sixteenth and One Hundred Seventeenth
Congress); Speaker of the House (One Hundred Eighteenth
Congress).",
  bio_name = "Kevin McCarthy",
  prompt_fields = c("highest_level_of_education",
                    "previous_occupation", "birth_date"),
  prompt_fields_formats = list(highest_level_of_education = "{DEGREE}",
                               previous_occupation = "{OCCUPATION} - {YEARS}",
                               birth_date = "{MM}/{DD}/{YYYY}"))
#> Input Tokens: 494
#> Output Tokens: 86
#> Total Tokens: 580
#> # A tibble: 1 × 3
#>   highest_level_of_education previous_occupation                      birth_date
#>   <chr>                      <chr>                                    <chr>
#> 1 M.B.A.                     staff, United States Representative Wil… 01/26/1965
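The examples above do not use prompt_fewshot. As a rough sketch of how a few-shot table might be assembled and passed to get_bio(), the code below builds the data.frame from the argument description, checks that it has the required columns, and then makes the call only if the function and a key are available. The environment variable name OPENAI_API_KEY and the "Jane Doe" input text are assumptions made for illustration, not values taken from the package.

```r
# Few-shot table: 'bio', 'bio_name', and one column per prompt field.
prompt_fields <- c("gender", "college")
fewshot <- data.frame(
  bio = "John Smith went to Nowhere University, and he graduated with a B.A.",
  bio_name = "John Smith",
  gender = "Male",
  college = "Nowhere University - B.A."
)

# Columns the argument description asks for that are missing from the table.
missing_cols <- setdiff(c("bio", "bio_name", prompt_fields), names(fewshot))
stopifnot(length(missing_cols) == 0)

# The call needs the package loaded and an OpenAI API key, so it is guarded.
# OPENAI_API_KEY is an assumed variable name, not one confirmed by the docs.
api_key <- Sys.getenv("OPENAI_API_KEY")
if (exists("get_bio") && nzchar(api_key)) {
  result <- get_bio(
    bio = "Jane Doe earned a B.S. from Somewhere College.",  # invented text
    bio_name = "Jane Doe",
    prompt_fields = prompt_fields,
    prompt_fields_formats = list(college = "{SCHOOL} - {DEGREE}"),
    prompt_fewshot = fewshot,
    openai_api_key = api_key
  )
  print(result)
}
```

Keeping the few-shot columns in the same format as prompt_fields_formats (here "{SCHOOL} - {DEGREE}") gives the model a worked instance of the requested output shape.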