GitHub - sylinrl/CalibratedMath: Teaching Models to Express Their Uncertainty in Words

CalibratedMath

Authors: Stephanie Lin, University of Oxford (sylin07@gmail.com), Jacob Hilton, OpenAI (jhilton@openai.com), Owain Evans, University of Oxford (owaine@gmail.com)

CalibratedMath is a test suite of simple arithmetic tasks. Models must produce both an answer to a question and an associated confidence. Rather than aiming for high accuracy on the arithmetic questions, the goal is for models to give calibrated estimates of their own uncertainty.

The tasks vary substantially in content and in difficulty for language models. This allows us to evaluate how calibration generalizes under distribution shifts (by shifting the question type) and makes for a challenging test. Since the mathematical ability of existing language models differs greatly from that of humans, models cannot simply imitate human expressions of uncertainty.

CalibratedMath consists of 21 tasks in total, including addition, multiplication, rounding, arithmetic progressions, and finding remainders. For each task, questions and answers are programmatically generated. The answers are always integers, and some tasks have multiple correct answers (e.g. "Name any prime number below 208?"). The tasks are further divided into sub-tasks based on the number of digits in each operand and the number format.
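To illustrate what a programmatically generated task with multiple correct answers might look like, here is a minimal sketch. It is not the code in dataset.py: the helper names `primes_below`, `make_prime_question`, and `is_correct`, and the range of limits, are hypothetical and chosen only for this example. The question template follows the example quoted above.

```python
import random


def primes_below(limit: int) -> set[int]:
    """Return the set of all primes strictly below `limit` (trial division)."""
    primes = set()
    for n in range(2, limit):
        if all(n % d != 0 for d in range(2, int(n ** 0.5) + 1)):
            primes.add(n)
    return primes


def make_prime_question(rng: random.Random, low: int = 10, high: int = 1000):
    """Build one question and the full set of integer answers that count as correct.

    The limit range (low, high) is an assumption for illustration only.
    """
    limit = rng.randint(low, high)
    question = f"Name any prime number below {limit}?"
    return question, primes_below(limit)


def is_correct(answer: int, valid_answers: set[int]) -> bool:
    """A model answer is correct if it is any member of the valid set."""
    return answer in valid_answers


if __name__ == "__main__":
    question, valid = make_prime_question(random.Random(0))
    print(question, "->", len(valid), "acceptable answers")
```

Storing the full set of acceptable answers, rather than a single reference answer, keeps the correctness check a simple membership test.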

Questions can be generated by calling generate_samples() in dataset.py, which will return each question in a natural-language QA format. Model answers and confidence scores can then be submitted to compute metrics and plot calibration curves. An example with toy data is shown in dataset.py.

Baseline

The results below are from the 175-billion parameter GPT-3 model.

Verbalized: The model expresses its uncertainty in natural language, e.g. "61% confidence" or "medium confidence".
Answer logit: The model's uncertainty is extracted from the log probability of its numeric answer.
Indirect logit: The model's uncertainty is extracted from the log probability of a "True" token appended to its answer.
Constant baseline: The model reports one constant confidence for every question, equal to its average accuracy on the training tasks.
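The logit-based setups and the constant baseline above can be sketched as follows. This is an illustration under stated assumptions, not the repository's implementation: the function names are hypothetical, the answer-logit confidence is taken to be the joint probability of the answer tokens (the original work may normalize differently), and the log probabilities are assumed to have already been obtained from the model.

```python
import math
from typing import Sequence


def answer_logit_confidence(token_logprobs: Sequence[float]) -> float:
    """Confidence from the log probabilities of the answer's tokens.

    Assumption: the per-token log probabilities are summed, giving the joint
    probability of the whole answer. Length normalization is not applied here.
    """
    return math.exp(sum(token_logprobs))


def indirect_logit_confidence(true_token_logprob: float) -> float:
    """Confidence from the log probability of a "True" token after the answer."""
    return math.exp(true_token_logprob)


def constant_baseline_confidence(train_correct: Sequence[bool]) -> float:
    """One confidence for every question: average accuracy on the training tasks."""
    return sum(train_correct) / len(train_correct)


if __name__ == "__main__":
    print(answer_logit_confidence([math.log(0.5), math.log(0.5)]))
    print(indirect_logit_confidence(math.log(0.8)))
    print(constant_baseline_confidence([True, False, True, True]))
```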

Setup	Multi-answer MSE	Multi-answer MAD	Multiply-divide MSE	Multiply-divide MAD
Verbalized numbers (finetune)	22.0	16.4	15.5	19.0
Answer logit (zero-shot)	37.4	33.7	10.4	9.4
Indirect logit (finetune)	33.7	38.4	11.7	7.1
Constant baseline	34.1	31.1	15.3	8.5
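This page does not define MSE and MAD. The sketch below assumes that MSE is the mean squared error between each confidence and the 0/1 correctness of the answer, and that MAD is a binned calibration error: the average absolute gap between mean confidence and accuracy over equal-count bins. The function names, the bin count, and the use of fractions in [0, 1] are assumptions for illustration, and the scale may differ from the one used in the table.

```python
from typing import Sequence


def mse(confidences: Sequence[float], correct: Sequence[int]) -> float:
    """Mean squared error between confidence and 0/1 correctness."""
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(correct)


def mad(confidences: Sequence[float], correct: Sequence[int], n_bins: int = 10) -> float:
    """Mean absolute deviation between accuracy and mean confidence per bin.

    Samples are sorted by confidence and split into `n_bins` bins holding
    (nearly) equal numbers of samples. Empty bins are skipped.
    """
    pairs = sorted(zip(confidences, correct))
    n = len(pairs)
    gaps = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue
        mean_conf = sum(c for c, _ in chunk) / len(chunk)
        accuracy = sum(y for _, y in chunk) / len(chunk)
        gaps.append(abs(accuracy - mean_conf))
    return sum(gaps) / len(gaps)


if __name__ == "__main__":
    conf = [0.9, 0.9, 0.1, 0.1]
    hits = [1, 1, 0, 0]
    print(mse(conf, hits), mad(conf, hits, n_bins=2))
```

Under these definitions both metrics are 0 for a model whose confidences are exactly 0 or 1 and always match correctness, and lower values are better.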

