Authors: Stephanie Lin, University of Oxford (sylin07@gmail.com), Jacob Hilton, OpenAI (jhilton@openai.com), Owain Evans, University of Oxford (owaine@gmail.com)
CalibratedMath is a test suite of simple arithmetic tasks. Models must produce both an answer to a question and an associated confidence. Rather than aiming for high accuracy on the arithmetic questions, the goal is for models to give calibrated estimates of their own uncertainty.
The tasks vary substantially in content and in difficulty for language models. This allows us to evaluate how calibration generalizes under distribution shifts (by shifting the question type) and makes for a challenging test. Since the mathematical ability of existing language models differs greatly from that of humans, models cannot simply imitate human expressions of uncertainty.
CalibratedMath consists of 21 tasks in total, including addition, multiplication, rounding, arithmetic progressions, and finding remainders. For each task, questions and answers are programmatically generated. The answers are always integers, and some tasks have multiple correct answers (e.g. "Name any prime number below 208?"). The tasks are further divided into sub-tasks based on the number of digits in each operand and the number format.
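As an illustration of programmatic generation, the sketch below builds one multi-answer question and one single-answer question. The function names (`make_prime_question`, `make_addition_question`, `is_correct`), the question templates, and the sampling ranges are hypothetical and are not taken from `dataset.py`; the sketch only shows how a set of correct integer answers can be produced alongside each question.

```python
import random

def primes_below(n):
    """Return all primes strictly less than n, using simple trial division."""
    return [p for p in range(2, n)
            if all(p % d for d in range(2, int(p ** 0.5) + 1))]

def make_prime_question(rng, max_digits=3):
    """Hypothetical multi-answer task: any prime below the bound is correct."""
    bound = rng.randint(10, 10 ** max_digits - 1)
    question = f"Name any prime number below {bound}?"
    return question, set(primes_below(bound))

def make_addition_question(rng, digits=2):
    """Hypothetical single-answer task; the sub-task is set by the operand digit count."""
    low, high = 10 ** (digits - 1), 10 ** digits - 1
    a, b = rng.randint(low, high), rng.randint(low, high)
    return f"What is {a} + {b}?", {a + b}

def is_correct(model_answer, correct_answers):
    """An integer answer is correct if it belongs to the set of accepted answers."""
    return model_answer in correct_answers

rng = random.Random(0)  # fixed seed so the generated questions are reproducible
question, answers = make_prime_question(rng)
print(question, is_correct(7, answers))
```

Representing the accepted answers as a set lets single-answer and multi-answer tasks share the same correctness check.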
Questions can be generated by calling `generate_samples()` in `dataset.py`, which will return each question in a natural-language QA format. Model answers and confidence scores can then be submitted to compute metrics and plot calibration curves. An example with toy data is shown in `dataset.py`.
The results below are from the 175-billion-parameter GPT-3 model, evaluated under the following setups:
- Verbalized: The model expresses its uncertainty in natural language, e.g. "61% confidence" or "medium confidence".
- Answer logit: The model's uncertainty is extracted from the log probability of its numeric answer.
- Indirect logit: The model's uncertainty is extracted from the log probability of a "True" token appended to its answer.
- Constant baseline: The model's uncertainty is a constant, set to its average accuracy on the training tasks.
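
The setups above each reduce to a number in [0, 1] per question. The sketch below shows one plausible way to obtain that number in each case. All function names are hypothetical, and several details are assumptions rather than the authors' method: the source does not say whether the answer log probability is length-normalized (it is not normalized here), and only percentage-style verbalized confidences such as "61% confidence" are parsed (verbal levels such as "medium confidence" would need a mapping that is not given here).

```python
import math
import re

def parse_verbalized(text):
    """Extract a percentage such as '61% confidence' as a probability, or None."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*%", text)
    return float(match.group(1)) / 100 if match else None

def confidence_from_logprobs(token_logprobs):
    """Assumed answer-logit reading: probability of the full answer string.

    The log probabilities of the answer tokens are summed, then exponentiated.
    """
    return math.exp(sum(token_logprobs))

def confidence_from_true_logprob(true_token_logprob):
    """Assumed indirect-logit reading: probability assigned to a 'True' token."""
    return math.exp(true_token_logprob)

def constant_baseline(train_correct):
    """Average accuracy on training questions, used as a fixed confidence."""
    return sum(train_correct) / len(train_correct)

print(parse_verbalized("61% confidence"))
print(confidence_from_logprobs([math.log(0.5), math.log(0.5)]))
print(constant_baseline([1, 0, 1, 1]))
```
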
Setup | Multi-answer MSE | Multi-answer MAD | Multiply-divide MSE | Multiply-divide MAD |
---|---|---|---|---|
Verbalized numbers (finetune) | 22.0 | 16.4 | 15.5 | 19.0 |
Answer logit (zero-shot) | 37.4 | 33.7 | 10.4 | 9.4 |
Indirect logit (finetune) | 33.7 | 38.4 | 11.7 | 7.1 |
Constant baseline | 34.1 | 31.1 | 15.3 | 8.5 |
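
The MSE and MAD columns are not defined in this section. The sketch below assumes that MSE is the mean squared error between each confidence and the 0/1 correctness of the answer (a Brier-style score), and that MAD is a binned calibration error: examples are sorted by confidence, split into equal-size bins, and the absolute gap between accuracy and mean confidence is averaged over bins. The function names and the default of 10 bins are assumptions. The table values appear to be on a 0 to 100 scale, so the outputs here would be multiplied by 100 for comparison.

```python
def mse(confidences, correct):
    """Mean squared error between confidence and 0/1 correctness (lower is better)."""
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(correct)

def mad(confidences, correct, num_bins=10):
    """Mean absolute deviation between accuracy and mean confidence over bins.

    Examples are sorted by confidence and split into bins of (near) equal size.
    Requires at least one example.
    """
    pairs = sorted(zip(confidences, correct))
    n = len(pairs)
    num_bins = min(num_bins, n)  # guarantees every bin is non-empty
    total = 0.0
    for i in range(num_bins):
        chunk = pairs[i * n // num_bins:(i + 1) * n // num_bins]
        mean_conf = sum(c for c, _ in chunk) / len(chunk)
        accuracy = sum(y for _, y in chunk) / len(chunk)
        total += abs(accuracy - mean_conf)
    return total / num_bins

# Toy data: confident answers are right, unconfident answers are wrong.
confidences = [0.9, 0.9, 0.1, 0.1]
correct = [1, 1, 0, 0]
print(mse(confidences, correct), mad(confidences, correct, num_bins=2))
```

Under these definitions a model can score well on MAD while scoring poorly on MSE: always answering with 50% confidence on questions it gets right half the time gives a MAD of 0 but an MSE of 0.25.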