This repository contains code for the following paper:
Automatically Auditing Large Language Models via Discrete Optimization
Erik Jones, Anca Dragan, Aditi Raghunathan, and Jacob Steinhardt
First, create and activate the conda environment using:
conda env create -f environment.yml
conda activate auditing-llms
To run the experiments where we reverse large language models, i.e. search for prompts that make the model generate a fixed output, modify the following example command:
python reverse_experiment.py --save_every 10 --n_trials 1 --arca_iters 50 --arca_batch_size 32 --prompt_length 3 --lam_perp 0.2 --label your-file-label --filename senators.txt --opts_to_run arca --model_id gpt2
This uses the following parameters:
- `--save_every` dictates how often the returned outputs are saved
- `--n_trials` is the number of times the optimizer is restarted
- `--lam_perp` is the weight of the perplexity loss. Set it to 0 to drop the perplexity term (this makes inputs easier to recover, but they tend to be less natural)
- `--prompt_length` is the number of tokens in the prompt
- `--label` is a name used for saving
- `--filename` is a text file containing the fixed outputs, stored in `data`. We include `senators.txt`, `tox_1tok.txt`, `tox_2tok.txt`, and `tox_3tok.txt`, where the last three files contain CivilComments examples that at least half of annotators label as toxic, and that are 1, 2, and 3 tokens long respectively under the GPT-2 tokenizer
- `--opts_to_run` specifies which of arca, autoprompt, or gbda should be run
- `--arca_iters` is the number of full coordinate ascent iterations (through all coordinates) for arca and autoprompt
- `--arca_batch_size` is both the number of gradients averaged and the number of candidates on which the loss is computed exactly for arca and autoprompt
- `--gbda_initializations` is the number of parallel gbda optimizers to run at once (used when gbda is run)
- `--gbda_iters` is the number of gbda steps (used when gbda is run)
- `--model_id` specifies which model should be audited

You can also optionally add constraints on what tokens are allowed to appear in the input:

- `--unigram_input_constraint` [optional] specifies a unigram objective over the inputs
- `--inpt_tok_constraint` [optional] specifies a constraint on what kind of tokens are allowed to appear in the input (e.g., only tokens consisting entirely of letters)
- `--prompt_prefix` [optional] a fixed prefix that comes before the optimized prompt
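As a rough illustration of what these reversal runs search for, the sketch below (not code from this repository; it assumes the Hugging Face `transformers` library and uses a hypothetical helper `greedy_matches_target`) checks whether a candidate prompt's greedy continuation under GPT-2 exactly reproduces a fixed target output, which is the kind of prompt the optimizers try to find:

```python
# Sketch only: check whether the audited model's greedy continuation of a
# candidate prompt reproduces a fixed target output. Assumes Hugging Face
# transformers; `greedy_matches_target` is a hypothetical helper, not part
# of this repository.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def greedy_matches_target(prompt: str, target: str) -> bool:
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    target_ids = tokenizer(target).input_ids
    with torch.no_grad():
        generated = model.generate(
            prompt_ids,
            max_new_tokens=len(target_ids),
            do_sample=False,  # greedy decoding
            pad_token_id=tokenizer.eos_token_id,
        )
    # Compare only the newly generated tokens against the target tokens.
    return generated[0, prompt_ids.shape[1]:].tolist() == target_ids

# Example usage with an arbitrary prompt/target pair:
print(greedy_matches_target("My favorite senator is Senator", " Warren"))
```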
To run the experiment where you jointly optimize over prompts and outputs, run e.g.:
python joint_optimization_experiment.py --save_every 10 --n_trials 100 --arca_iters 50 --arca_batch_size 32 --lam_perp 0.5 --label your-file-label --model gpt2 --unigram_weight 0.6 --unigram_input_constraint not_toxic --unigram_output_constraint toxic --opts_to_run arca --prompt_length 3 --output_length 2 --prompt_prefix He said
This includes the following additional parameters:

- `--output_length` is the number of tokens in the output
- `--unigram_output_constraint` [optional] specifies a unigram objective over the outputs
- `--output_tok_constraint` [optional] specifies a constraint on what kind of tokens are allowed to appear in the output (e.g., only tokens consisting entirely of letters)
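For intuition about the joint setting, the sketch below (again not code from this repository; it assumes Hugging Face `transformers` and a hypothetical helper `output_log_prob`) computes the core quantity the joint optimization pushes up, namely the log-probability of a candidate output given a candidate prompt under the audited model; the full objective in the experiment additionally includes the perplexity term weighted by `--lam_perp` and the unigram input/output terms, which are omitted here:

```python
# Sketch only: score a candidate (prompt, output) pair by the log-probability
# of the output tokens given the prompt under GPT-2. Assumes Hugging Face
# transformers; `output_log_prob` is a hypothetical helper, not part of this
# repository.
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def output_log_prob(prompt: str, output: str) -> float:
    """Sum of log p(output token | prompt, previous output tokens)."""
    prompt_ids = tokenizer(prompt).input_ids
    output_ids = tokenizer(output).input_ids
    ids = torch.tensor([prompt_ids + output_ids])
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = F.log_softmax(logits, dim=-1)
    total = 0.0
    for i, tok in enumerate(output_ids):
        # Logits at position (len(prompt_ids) + i - 1) predict this output token.
        total += log_probs[0, len(prompt_ids) + i - 1, tok].item()
    return total

# Example usage with an arbitrary prompt/output pair:
print(output_log_prob("He said", " something"))
```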