Yongchao Zhou*, Andrei Ioan Muresanu*, Ziwen Han*, Keiran Paster, Silviu Pitis, Harris Chan, Jimmy Ba
Project Page | ArXiv | Colab | Demo
This repo contains code for "Large Language Models Are Human-Level Prompt Engineers". Please see our paper and project page for more results.
By conditioning on natural language instructions, large language models (LLMs) have displayed impressive capabilities as general-purpose computers. However, task performance depends significantly on the quality of the prompt used to steer the model, and most effective prompts have been handcrafted by humans. Inspired by classical program synthesis and the human approach to prompt engineering, we propose Automatic Prompt Engineer (APE) for automatic instruction generation and selection. In our method, we treat the instruction as the “program,” optimized by searching over a pool of instruction candidates proposed by an LLM in order to maximize a chosen score function. To evaluate the quality of the selected instruction, we evaluate the zero-shot performance of another LLM following the selected instruction. Experiments on 24 NLP tasks show that our automatically generated instructions outperform the prior LLM baseline by a large margin and achieve better or comparable performance to the instructions generated by human annotators on 21/24 tasks. We conduct extensive qualitative and quantitative analyses to explore the performance of APE. We show that APE-engineered prompts can be applied to steer models toward truthfulness and/or informativeness, as well as to improve few-shot learning performance by simply prepending them to standard in-context learning prompts.
To install APE, simply run:
pip install -e .
And add your OPENAI_API_KEY with the following command:
export OPENAI_API_KEY=YOUR_KEY
APE comes with two interfaces: the full find_prompts function and a simplified version, simple_ape, which takes care of most of the configuration for you.
APE is built around three types of templates: evaluation templates, prompt generation templates, and demonstration templates.
The evaluation template (eval_template
) defines the format of the input to the language model used to evaluate the
quality of different prompts. The template supports the following tokens:
- [PROMPT] - The prompt to evaluate
- [INPUT] - The input to the model
- [OUTPUT] - The target output of the model
- [full_DEMO] - Demonstrations (if you want to use few-shot learning)
For example, the zero-shot evaluation template for the instruction induction tasks is:
Instruction: [PROMPT]
Input: [INPUT]
Output: [OUTPUT]
For instruction + few-shot learning, the template is:
Instruction: [PROMPT]
[full_DEMO]
Input: [INPUT]
Output: [OUTPUT]
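The token substitution can be sketched in plain Python. This is a simplified stand-in for the library's own template handling, shown only to illustrate what the tokens mean:

```python
# Minimal sketch of evaluation-template filling via string replacement.
# The real library has its own template code; this only illustrates the tokens.
eval_template = "Instruction: [PROMPT]\nInput: [INPUT]\nOutput: [OUTPUT]"

def fill_eval_template(template, prompt, model_input, model_output):
    """Substitute the [PROMPT], [INPUT], and [OUTPUT] tokens."""
    return (template
            .replace("[PROMPT]", prompt)
            .replace("[INPUT]", model_input)
            .replace("[OUTPUT]", model_output))

query = fill_eval_template(eval_template,
                           "Write the antonym of the word.",
                           "sane", "insane")
```

The evaluation model then scores how likely the target output is given the filled-in query.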
The prompt generation template (prompt_gen_template
) defines the format of the input to the language model used to
generate candidate prompts. The template supports the following tokens:
- [APE] - A token to be replaced by the LLM's generated text
- [full_DEMO] - Demonstrations of input/output pairs
- [INPUT] - The input of the first demonstration
- [OUTPUT] - The output of the first demonstration
For example, in the instruction induction task, we generate candidate prompts using the following template:
I gave a friend an instruction. Based on the instruction they produced the following input-output pairs:
[full_DEMO]
The instruction was to [APE]
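A generation query can be sketched the same way: [full_DEMO] is replaced by rendered demonstrations, and the LLM's completion fills in where [APE] sits. Again, this is an illustration of the tokens, not the library's actual code:

```python
# Sketch of building a forward-mode generation query. The demos string below is
# assumed to have been rendered from the dataset already.
prompt_gen_template = (
    "I gave a friend an instruction. Based on the instruction they produced "
    "the following input-output pairs:\n\n[full_DEMO]\n\nThe instruction was to [APE]"
)

demos = "Input: sane\nOutput: insane\n\nInput: direct\nOutput: indirect"
generation_query = prompt_gen_template.replace("[full_DEMO]", demos)
# The model's continuation at the [APE] position becomes a candidate instruction.
```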
By default, prompt_gen_template is set to None. In this case, if the prompt_gen_mode is forward, the above template is used. If the prompt_gen_mode is insert, the template is derived from the evaluation template by replacing [PROMPT] with [APE].
Finally, the demonstration template (demos_template
) defines the format of the demonstrations. The template supports
the following tokens:
- [INPUT] - The input of the demonstration
- [OUTPUT] - The output of the demonstration
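Rendering [full_DEMO] can be sketched by applying the demonstration template to each (input, output) pair and joining the results. The blank-line separator is an assumption here; the library's exact separator may differ:

```python
# Sketch: build the [full_DEMO] string from the demonstration template.
demos_template = "Input: [INPUT]\nOutput: [OUTPUT]"

def render_demos(template, pairs):
    """Fill the demo template for each pair and join with blank lines."""
    return "\n\n".join(
        template.replace("[INPUT]", x).replace("[OUTPUT]", y) for x, y in pairs
    )

full_demo = render_demos(demos_template, [("sane", "insane"), ("direct", "indirect")])
```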
Datasets in this codebase are represented using separate lists for inputs and outputs. For example, a dataset for words and their antonyms can be written as:
words = ["sane", "direct", "informally", "unpopular", "subtractive", "nonresidential",
"inexact", "uptown", "incomparable", "powerful", "gaseous", "evenly", "formality",
"deliberately", "off"]
antonyms = ["insane", "indirect", "formally", "popular", "additive", "residential",
"exact", "downtown", "comparable", "powerless", "solid", "unevenly", "informality",
"accidentally", "on"]
data = (words, antonyms)
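Since find_prompts takes separate prompt_gen_data and eval_data (see below), one simple way to produce both from a single dataset is a random split. The helper below is hypothetical, not part of the library:

```python
import random

# Hypothetical helper: split one (inputs, outputs) dataset into data for
# prompt generation and data for evaluation.
def split_dataset(data, num_gen, seed=0):
    inputs, outputs = data
    idx = list(range(len(inputs)))
    random.Random(seed).shuffle(idx)
    gen, ev = idx[:num_gen], idx[num_gen:]
    prompt_gen_data = ([inputs[i] for i in gen], [outputs[i] for i in gen])
    eval_data = ([inputs[i] for i in ev], [outputs[i] for i in ev])
    return prompt_gen_data, eval_data

data = (["sane", "direct", "off"], ["insane", "indirect", "on"])
prompt_gen_data, eval_data = split_dataset(data, num_gen=2)
```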
The find_prompts function takes the following arguments:
- eval_template - The evaluation template
- demos_template - The demonstration template
- prompt_gen_data - The data used to generate candidate prompts
- eval_data - The data used to evaluate the quality of candidate prompts
- conf - A dictionary of configuration options
- base_conf - The path to the base configuration file:
  - 'configs/default.yaml': Default configuration that uses likelihood to evaluate prompts.
  - 'configs/bandits.yaml': Configuration that uses the Upper Confidence Bound algorithm to save resources when evaluating prompts. Uses likelihood as its base evaluation method.
- few_shot_data - The data used for few-shot learning
- prompt_gen_template - The prompt generation template
For more information on the conf
dictionary, please refer to the annotations in configs/bandits.yaml
.
find_prompts
returns an evaluation result object and a demo function to allow you to evaluate the quality of the
selected prompt manually. To get the best prompts and their associated scores from the evaluation result object, use
the sorted()
method.
The simple_ape function takes the following arguments:
- dataset - The dataset to use (simple APE uses the same dataset for prompt generation, evaluation, and few-shot learning)
- eval_template - The evaluation template (default: 'Instruction: [PROMPT]\nInput: [INPUT]\nOutput: [OUTPUT]')
- prompt_gen_template - The prompt generation template (default: None)
- prompt_gen_mode - The mode to use for prompt generation (default: 'forward')
- demos_template - The demonstration template (default: 'Input: [INPUT]\nOutput: [OUTPUT]')
- eval_model - The model to use for evaluation (default: 'text-davinci-002')
- prompt_gen_model - The model to use for prompt generation (default: 'text-davinci-002')
- num_prompts - The number of candidate prompts to generate (default: 50)
- eval_rounds - The number of UCB rounds of evaluation to perform (default: 20)
- prompt_gen_batch_size - The batch size to use for prompt generation (default: 200)
- eval_batch_size - The batch size to use for evaluation (default: 500)
simple_ape is designed to simplify many of the choices that need to be made when using find_prompts. In particular, it simplifies evaluation: you choose only the number of candidate prompts to generate and the number of UCB rounds to run. Behind the scenes, simple_ape sets num_prompts_per_round equal to num_prompts // 3 and fixes the number of samples per prompt per round to 5.
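With the default values, the evaluation budget works out roughly as follows. This is back-of-the-envelope arithmetic under the scheme described above, not a trace of the library's internals:

```python
# Rough UCB evaluation budget for simple_ape's defaults.
num_prompts = 50
eval_rounds = 20
num_prompts_per_round = num_prompts // 3          # prompts scored per UCB round
samples_per_prompt_per_round = 5                  # fixed by simple_ape
evals_per_round = num_prompts_per_round * samples_per_prompt_per_round
total_evals = evals_per_round * eval_rounds       # upper bound on scored examples
```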
An example usage of this function would look like:
from automatic_prompt_engineer import ape
words = ["sane", "direct", "informally", "unpopular", "subtractive", "nonresidential",
"inexact", "uptown", "incomparable", "powerful", "gaseous", "evenly", "formality",
"deliberately", "off"]
antonyms = ["insane", "indirect", "formally", "popular", "additive", "residential",
"exact", "downtown", "comparable", "powerless", "solid", "unevenly", "informality",
"accidentally", "on"]
eval_template = \
"""Instruction: [PROMPT]
Input: [INPUT]
Output: [OUTPUT]"""
result, demo_fn = ape.simple_ape(
dataset=(words, antonyms),
eval_template=eval_template,
)
As APE can often be expensive to run, we provide cost estimations for find_prompts
and simple_ape
. Simply
use ape.estimate_cost
or ape.simple_estimate_cost
with the same arguments as find_prompts
and simple_ape
respectively.
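To get a feel for why estimation matters, a crude manual calculation can be done as below. This is NOT the library's estimator, and the price per 1K tokens is an assumption that may be outdated; always check current pricing:

```python
# Rough API cost arithmetic (hypothetical helper, assumed pricing).
def rough_cost(num_queries, avg_tokens_per_query, price_per_1k_tokens=0.02):
    """Estimate dollar cost given a per-1K-token price (assumption)."""
    return num_queries * avg_tokens_per_query / 1000 * price_per_1k_tokens

estimate = rough_cost(num_queries=1000, avg_tokens_per_query=100)
```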
We provide a colab notebook for easily using APE:
We also provide a GUI for easily using APE. Please follow the instructions in the following colab to run it:
- automatic_prompt_engineer
|- configs
|- evaluation
|- ape.py
|- config.py
|- evaluate.py
|- generate.py
|- llm.py
|- template.py
- experiments: scripts for experiments
|- configs
|- data
|- evaluation
|- run_instruction_induction.py
|- run_truthful_qa.py
- tests: unit tests
- demo.py: script for launching the GUI
- demo.ipynb: notebook demonstrating simple_ape
To reproduce the experiments from the paper, simply run the scripts in the experiments
folder. For example, to
reproduce the experiments for the instruction induction task, run:
python experiments/run_instruction_induction.py --task=antonyms
To run the TruthfulQA experiment, run:
python experiments/run_truthful_qa.py
Our codebase is based on the following repo. Thanks for open-sourcing!
@article{zhou2022large,
title={Large Language Models Are Human-Level Prompt Engineers},
author={Yongchao Zhou and Andrei Ioan Muresanu and Ziwen Han and Keiran Paster and Silviu Pitis and Harris Chan and Jimmy Ba},
year={2022},
eprint={2211.01910},
archivePrefix={arXiv},
primaryClass={cs.LG}
}