Clean a series of columns using ChatGPT by mapping data into new categories and formats

Maps existing values in the columns named in column_values and column_formats to new categories and formats using ChatGPT. clean_columns() returns the data with new columns containing the matched and reformatted values. This can be helpful for processing messy or unstructured text data in a column (e.g., open-ended survey responses, names, etc.). clean_columns() is particularly helpful for post-processing the output of get_bio().

clean_columns() uses the standard chat completion to reformat columns and makes a separate API call for each column. clean_columns_function_call() uses function calling instead and tries to complete as many columns as possible in a single API call.

Usage

clean_columns(
  data,
  column_values,
  column_formats,
  prompt_fewshot = NULL,
  prompt_fewshot_type = "specific",
  prompt_fewshot_n = 1,
  openai_api_key = NULL,
  openai_model = "gpt-3.5-turbo",
  openai_temperature = 0,
  openai_seed = NULL
)

clean_columns_function_call(
  data,
  column_values,
  column_formats,
  column_descriptions = NULL,
  prompt_fewshot = NULL,
  prompt_fewshot_n = 1,
  openai_api_key = NULL,
  openai_model = "gpt-3.5-turbo",
  openai_temperature = 0,
  openai_seed = NULL,
  openai_context_window = 4096
)

Arguments

data: The data to be processed
column_values: A named list with column names from data as names and values as vectors with the desired categories for the corresponding column
column_formats: A named list with column names from data as names and values as strings with the desired format for the corresponding column
prompt_fewshot: A data.frame, tibble, or named list containing example inputs and example outputs. Each output is named after its input with the suffix "_gpt". Example: list(education = c("J.D.", "BA", "GED", "Did not graduate high school"), education_gpt = c("Graduate School", "College", "High School or less", "High School or less"))
prompt_fewshot_type: Only for clean_columns(). A string, one of "specific" or "general", defaults to "specific". If type is "specific", prompt_fewshot must have column names from data, and few-shot examples should correspond to the specific columns from the data. If type is "general", examples in prompt_fewshot will be reused for each column. It is recommended that the few-shot examples in prompt_fewshot also include example values or formats, using the suffixes _values and _formats.
prompt_fewshot_n: An integer or named list of integers (with names from example inputs in prompt_fewshot) giving the number of segments to divide each example input into. For example, prompt_fewshot_n=2 would divide inputted example vectors into two separate example prompts and outputs. Note: for clean_columns_function_call(), this must be an integer.
openai_api_key: API key for OpenAI, a string. If this is NULL, the function searches .Renviron for the API key.
openai_model: ChatGPT model to use, defaults to "gpt-3.5-turbo"
openai_temperature: Specifies the amount of randomness in ChatGPT, a number between 0 and 2 with more randomness for higher numbers, defaults to 0
openai_seed: An integer, specifies a random seed for ChatGPT (this is in the development stage at OpenAI, so it might not work perfectly)
column_descriptions: Only for clean_columns_function_call(). A named list with column names from data as names and values as strings with the desired description for the column
openai_context_window: Only for clean_columns_function_call(). An integer, defaults to 4096, specifies the context window of the ChatGPT model in use. This is used to decide whether to split the columns to be cleaned into several portions. Note: this is only a rough approximation of whether the prompt is too long. If needed, it is best to split your data into parts or to use a GPT model with a larger context window.
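
To make the prompt_fewshot and prompt_fewshot_n descriptions above concrete, the following is a minimal base R sketch. It builds the named list from the prompt_fewshot example and then divides it into two segments. The splitting logic (segment_id, segments) is an assumption made purely for illustration; it is not the package's internal code and may not match how the package actually divides the examples.

```r
# Few-shot examples: inputs under the column name, outputs under "<name>_gpt".
fewshot <- list(
  education = c("J.D.", "BA", "GED", "Did not graduate high school"),
  education_gpt = c("Graduate School", "College",
                    "High School or less", "High School or less")
)

# Inputs and outputs must line up one-to-one.
stopifnot(length(fewshot$education) == length(fewshot$education_gpt))

# Illustration only (an assumption, not the package's implementation):
# one way the four examples could be divided into prompt_fewshot_n = 2 segments.
n_segments <- 2
n_examples <- length(fewshot$education)
segment_id <- ceiling(seq_len(n_examples) / (n_examples / n_segments))

# Each element of `segments` holds the input/output pairs for one example prompt.
segments <- split(as.data.frame(fewshot, stringsAsFactors = FALSE), segment_id)
print(segments)
```

Under this assumed splitting, the first segment pairs "J.D." and "BA" with their outputs and the second pairs "GED" and "Did not graduate high school" with theirs.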

Value

The input data with one new column, named with the suffix "_gpt", for each column that was cleaned, containing the new mappings
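
Because model output can be wrong, it may be worth spot-checking the new columns before using them. The sketch below makes no API call: result is a hand-built copy of two columns from the output of the first example in the Examples section, and new_cols and mapping are hypothetical names used only for illustration.

```r
# Hand-built copy of the documented example output (no API call is made here).
result <- data.frame(
  education = c("BA", "B.A.", "High School", "MBA"),
  education_gpt = c("College", "College", "High School", "Graduate School"),
  stringsAsFactors = FALSE
)

# Columns added by the cleaning step carry the "_gpt" suffix.
new_cols <- grep("_gpt$", names(result), value = TRUE)

# Cross-tabulate original against cleaned values to spot-check the mapping.
mapping <- table(original = result$education, cleaned = result$education_gpt)
print(mapping)
```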

Examples

df <- data.frame(age = rnorm(4, 50, 10),
                 education = c("BA", "B.A.", "High School", "MBA"),
                 name = c("Wardell Stephen Curry II", "Michael J Jordan",
                 "James, LEBRON", "Shaq"))
clean_columns(data = df,
              column_values = list(education = c("High School", "College",
                                                 "Graduate School"),
                                   name = c("Steph Curry", "Michael Jordan",
                                            "LeBron James", "Shaquille O'Neal")))
#> education
#> Input Tokens: 106
#> Output Tokens:89
#> Total Tokens:195
#> name
#> Input Tokens: 124
#> Output Tokens:37
#> Total Tokens:161
#> # A tibble: 4 × 5
#>     age education   name                     education_gpt   name_gpt        
#>   <dbl> <chr>       <chr>                    <chr>           <chr>           
#> 1  29.4 BA          Wardell Stephen Curry II College         Steph Curry     
#> 2  34.7 B.A.        Michael J Jordan         College         Michael Jordan  
#> 3  37.5 High School James, LEBRON            High School     LeBron James    
#> 4  42.7 MBA         Shaq                     Graduate School Shaquille O'Neal

clean_columns(data = data.frame(birthday = c("08-13-1923",
                                             "05/15/1976",
                                             "March 13, 1998",
                                             "19th of March in 1994")),
              column_formats = list(birthday = "{MM}/{DD}/{YYYY}"))
#> birthday
#> Input Tokens: 121
#> Output Tokens:47
#> Total Tokens:168
#> # A tibble: 4 × 2
#>   birthday              birthday_gpt
#>   <chr>                 <chr>       
#> 1 08-13-1923            08/13/1923  
#> 2 05/15/1976            05/15/1976  
#> 3 March 13, 1998        03/13/1998  
#> 4 19th of March in 1994 03/19/1994  

clean_columns_function_call(data = df,
                            column_values = list(education = c("High School or less",
                                                               "College",
                                                               "Graduate School")))
#> education
#> Input Tokens: 182
#> Output Tokens: 23
#> Total Tokens: 205
#> # A tibble: 4 × 4
#>     age education   name                     education_gpt      
#>   <dbl> <chr>       <chr>                    <chr>              
#> 1  29.4 BA          Wardell Stephen Curry II College            
#> 2  34.7 B.A.        Michael J Jordan         College            
#> 3  37.5 High School James, LEBRON            High School or less
#> 4  42.7 MBA         Shaq                     Graduate School