Clean a series of columns using ChatGPT by mapping data into new categories and formats
Maps existing values in the columns named in column_values to new categories and formats using ChatGPT. clean_columns() returns data with new columns corresponding to matched and reformatted values. This can be helpful for processing messy or unstructured text data in a column (e.g., open-ended survey responses, names, etc.). clean_columns() is particularly helpful for post-processing the output of get_bio().
clean_columns() uses the standard chat completion to reformat columns. It makes separate API calls for each column.
clean_columns_function_call() uses function calling to reformat columns. It tries to complete as many columns as possible in a single API call.
Usage
clean_columns(
  data,
  column_values,
  column_formats,
  prompt_fewshot = NULL,
  prompt_fewshot_type = "specific",
  prompt_fewshot_n = 1,
  openai_api_key = NULL,
  openai_model = "gpt-3.5-turbo",
  openai_temperature = 0,
  openai_seed = NULL
)

clean_columns_function_call(
  data,
  column_values,
  column_formats,
  column_descriptions = NULL,
  prompt_fewshot = NULL,
  prompt_fewshot_n = 1,
  openai_api_key = NULL,
  openai_model = "gpt-3.5-turbo",
  openai_temperature = 0,
  openai_seed = NULL,
  openai_context_window = 4096
)
Arguments
- data
The data to be processed
- column_values
A named list with column names from data as names and values as vectors with the desired categories for the corresponding column
- column_formats
A named list with column names from data as names and values as strings with the desired format for the corresponding column
- prompt_fewshot
A data.frame, tibble, or named list containing example inputs and example outputs (each output is named after its example input with the suffix "_gpt"). Example: list(education = c("J.D.", "BA", "GED", "Did not graduate high school"), education_gpt = c("Graduate School", "College", "High School or less", "High School or less"))
- prompt_fewshot_type
Only for clean_columns(). A string, one of "specific" or "general", defaults to "specific". If type is "specific", prompt_fewshot must have column names from data, and few-shot examples should correspond to the specific columns from the data. If type is "general", examples in prompt_fewshot will be reused for each column. It is recommended that the few-shot examples in prompt_fewshot also include example values or formats, with the suffixes _values and _formats.
- prompt_fewshot_n
An integer or named list of integers (with names from example inputs in prompt_fewshot) giving the number of segments to divide each example input into. For example, prompt_fewshot_n=2 would divide the example input vectors into two separate example prompts and outputs. Note: for clean_columns_function_call(), this must be an integer.
- openai_api_key
API key for OpenAI, a string. If this is NULL, clean_columns() searches .Renviron for the API key.
- openai_model
ChatGPT model to use, defaults to "gpt-3.5-turbo"
- openai_temperature
Specifies the amount of randomness in ChatGPT, a number between 0 and 2 with more randomness for higher numbers, defaults to 0
- openai_seed
An integer, specifies a random seed for ChatGPT (this is in the development stage at OpenAI, so it might not work perfectly)
- column_descriptions
Only for clean_columns_function_call(). A named list with column names from data as names and values as strings with the desired description for the column
- openai_context_window
Only for clean_columns_function_call(). An integer, defaults to 4096, specifies the context window for the ChatGPT model in use. This is used to determine whether to split the columns to be cleaned into several portions. Note: this is a rough approximation of whether the prompt is too long. It is best to split your data into parts if needed or to use larger GPT models.
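The shape expected for prompt_fewshot can be sketched in base R, without calling the API. The example values come from the argument description above. The helper check_fewshot() is hypothetical and not part of the package, and the segment split at the end only illustrates what prompt_fewshot_n = 2 could mean for four examples. How the package actually divides the examples is an assumption here.

```r
# Few-shot examples as a named list: each input vector is paired with an
# output vector whose name carries the "_gpt" suffix.
prompt_fewshot <- list(
  education = c("J.D.", "BA", "GED", "Did not graduate high school"),
  education_gpt = c("Graduate School", "College",
                    "High School or less", "High School or less")
)

# Hypothetical helper (not part of the package): TRUE when every example
# input has a "_gpt" partner of the same length.
check_fewshot <- function(fewshot) {
  is_aux <- grepl("_(gpt|values|formats)$", names(fewshot))
  inputs <- names(fewshot)[!is_aux]
  for (nm in inputs) {
    out <- fewshot[[paste0(nm, "_gpt")]]
    if (is.null(out) || length(out) != length(fewshot[[nm]])) {
      return(FALSE)
    }
  }
  length(inputs) > 0
}

fewshot_ok <- check_fewshot(prompt_fewshot)

# The same examples as a data.frame, which the argument also accepts.
prompt_fewshot_df <- as.data.frame(prompt_fewshot, stringsAsFactors = FALSE)

# Illustration only (assumed, not taken from the package): an even split of
# the four examples into prompt_fewshot_n = 2 segments.
n_segments <- 2
idx <- seq_along(prompt_fewshot$education)
segment_id <- ceiling(idx * n_segments / length(idx))
segments <- split(prompt_fewshot$education, segment_id)
```

Checking the pairing before a call may catch a misspelled suffix or mismatched lengths before any tokens are spent.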
Examples
df <- data.frame(age = rnorm(4, 50, 10),
                 education = c("BA", "B.A.", "High School", "MBA"),
                 name = c("Wardell Stephen Curry II", "Michael J Jordan",
                          "James, LEBRON", "Shaq"))
clean_columns(data = df,
              column_values = list(education = c("High School", "College",
                                                 "Graduate School"),
                                   name = c("Steph Curry", "Michael Jordan",
                                            "LeBron James", "Shaquille O'Neal")))
#> education
#> Input Tokens: 106
#> Output Tokens:89
#> Total Tokens:195
#> name
#> Input Tokens: 124
#> Output Tokens:37
#> Total Tokens:161
#> # A tibble: 4 × 5
#>     age education   name                     education_gpt   name_gpt
#>   <dbl> <chr>       <chr>                    <chr>           <chr>
#> 1  29.4 BA          Wardell Stephen Curry II College         Steph Curry
#> 2  34.7 B.A.        Michael J Jordan         College         Michael Jordan
#> 3  37.5 High School James, LEBRON            High School     LeBron James
#> 4  42.7 MBA         Shaq                     Graduate School Shaquille O'Neal
clean_columns(data = data.frame(birthday = c("08-13-1923",
                                             "05/15/1976",
                                             "March 13, 1998",
                                             "19th of March in 1994")),
              column_formats = list(birthday = "{MM}/{DD}/{YYYY}"))
#> birthday
#> Input Tokens: 121
#> Output Tokens:47
#> Total Tokens:168
#> # A tibble: 4 × 2
#>   birthday              birthday_gpt
#>   <chr>                 <chr>
#> 1 08-13-1923            08/13/1923
#> 2 05/15/1976            05/15/1976
#> 3 March 13, 1998        03/13/1998
#> 4 19th of March in 1994 03/19/1994
clean_columns_function_call(data = df,
                            column_values = list(education = c("High School or less",
                                                               "College",
                                                               "Graduate School")))
#> education
#> Input Tokens: 182
#> Output Tokens: 23
#> Total Tokens: 205
#> # A tibble: 4 × 4
#>     age education   name                     education_gpt
#>   <dbl> <chr>       <chr>                    <chr>
#> 1  29.4 BA          Wardell Stephen Curry II College
#> 2  34.7 B.A.        Michael J Jordan         College
#> 3  37.5 High School James, LEBRON            High School or less
#> 4  42.7 MBA         Shaq                     Graduate School
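The note on openai_context_window suggests splitting the data into parts when a prompt may be too long. A minimal sketch of that idea in base R follows. The helper clean_in_chunks() and the stand-in fake_cleaner() are hypothetical and not part of the package, and the chunk size is an arbitrary choice. In practice the cleaning function would presumably be clean_columns_function_call() with its usual arguments passed through `...`.

```r
# Hypothetical helper (not part of the package): applies a cleaning function
# to consecutive row chunks of `data` and stacks the results.
clean_in_chunks <- function(data, chunk_size, clean_fun, ...) {
  groups <- ceiling(seq_len(nrow(data)) / chunk_size)
  parts <- lapply(split(data, groups), clean_fun, ...)
  combined <- do.call(rbind, unname(parts))
  rownames(combined) <- NULL
  combined
}

# Stand-in cleaner so the sketch runs without an API key. It only mimics the
# "_gpt" output column; it does not call ChatGPT.
fake_cleaner <- function(data, ...) {
  data$education_gpt <- toupper(data$education)
  data
}

demo <- data.frame(education = c("BA", "B.A.", "High School", "MBA"),
                   stringsAsFactors = FALSE)
result <- clean_in_chunks(demo, chunk_size = 2, clean_fun = fake_cleaner)
```

Each chunk would trigger its own API call, so smaller chunks trade more calls for shorter prompts.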