
0xjunhao/simple-evals-continue

 
 


Overview

This repository contains a lightweight library for evaluating language models. We are open sourcing it so we can be transparent about the accuracy numbers we're publishing alongside our latest models.

Benchmark Results

| Model | Prompt | MMLU | GPQA [1] | MATH [2] | HumanEval | MGSM [3] | DROP [3] (F1, 3-shot) | SimpleQA |
|---|---|---|---|---|---|---|---|---|
| **o3** | | | | | | | | |
| o3-high [4] | n/a [5] | 93.3 | 83.4 | 98.1 | 88.4 | 92.0 | 89.8 | 48.6 |
| o3 [6] [4] | n/a | 92.9 | 82.8 | 97.8 | 87.4 | 92.3 | 80.6 | 49.4 |
| o3-low [4] | n/a | 92.8 | 78.6 | 96.9 | 87.3 | 91.9 | 82.3 | 49.4 |
| **o4-mini** | | | | | | | | |
| o4-mini-high [6] [4] | n/a | 90.3 | 81.3 | 98.2 | 99.3 | 93.5 | 78.1 | 19.3 |
| o4-mini [6] [4] | n/a | 90.0 | 77.6 | 97.5 | 97.3 | 93.7 | 77.7 | 20.2 |
| o4-mini-low [4] | n/a | 89.5 | 73.6 | 96.2 | 95.9 | 93.0 | 76.0 | 20.2 |
| **o3-mini** | | | | | | | | |
| o3-mini-high | n/a | 86.9 | 77.2 | 97.9 | 97.6 | 92.0 | 80.6 | 13.8 |
| o3-mini | n/a | 85.9 | 74.9 | 97.3 | 96.3 | 90.8 | 79.2 | 13.4 |
| o3-mini-low | n/a | 84.9 | 67.6 | 95.8 | 94.5 | 89.4 | 77.6 | 13.0 |
| **o1** | | | | | | | | |
| o1 | n/a | 91.8 | 75.7 | 96.4 | - | 89.3 | 90.2 | 42.6 |
| o1-preview | n/a | 90.8 | 73.3 | 85.5 | 92.4 | 90.8 | 74.8 | 42.4 |
| o1-mini | n/a | 85.2 | 60.0 | 90.0 | 92.4 | 89.9 | 83.9 | 07.6 |
| **GPT-4.1** | | | | | | | | |
| gpt-4.1-2025-04-14 | assistant [7] | 90.2 | 66.3 | 82.1 | 94.5 | 86.9 | 79.4 | 41.6 |
| gpt-4.1-mini-2025-04-14 | assistant | 87.5 | 65.0 | 81.4 | 93.8 | 88.2 | 81.0 | 16.8 |
| gpt-4.1-nano-2025-04-14 | assistant | 80.1 | 50.3 | 62.3 | 87.0 | 73.0 | 82.2 | 07.6 |
| **GPT-4o** | | | | | | | | |
| gpt-4o-2024-11-20 | assistant | 85.7 | 46.0 | 68.5 | 90.2 | 90.3 | 81.5 | 38.8 |
| gpt-4o-2024-08-06 | assistant | 88.7 | 53.1 | 75.9 | 90.2 | 90.0 | 79.8 | 40.1 |
| gpt-4o-2024-05-13 | assistant | 87.2 | 49.9 | 76.6 | 91.0 | 89.9 | 83.7 | 39.0 |
| gpt-4o-mini-2024-07-18 | assistant | 82.0 | 40.2 | 70.2 | 87.2 | 87.0 | 79.7 | 09.5 |
| **GPT-4.5-preview** | | | | | | | | |
| gpt-4.5-preview-2025-02-27 | assistant | 90.8 | 69.5 | 87.1 | 88.6 | 86.9 | 83.4 | 62.5 |
| **GPT-4 Turbo and GPT-4** | | | | | | | | |
| gpt-4-turbo-2024-04-09 | assistant | 86.7 | 49.3 | 73.4 | 88.2 | 89.6 | 86.0 | 24.2 |
| gpt-4-0125-preview | assistant | 85.4 | 41.4 | 64.5 | 86.6 | 85.1 | 81.5 | n/a |
| gpt-4-1106-preview | assistant | 84.7 | 42.5 | 64.3 | 83.7 | 87.1 | 83.2 | n/a |
| **Other Models (Reported)** | | | | | | | | |
| Claude 3.5 Sonnet | unknown | 88.3 | 59.4 | 71.1 | 92.0 | 91.6 | 87.1 | 28.9 |
| Claude 3 Opus | unknown | 86.8 | 50.4 | 60.1 | 84.9 | 90.7 | 83.1 | 23.5 |
| Llama 3.1 405b | unknown | 88.6 | 50.7 | 73.8 | 89.0 | 91.6 | 84.8 | n/a |
| Llama 3.1 70b | unknown | 82.0 | 41.7 | 68.0 | 80.5 | 86.9 | 79.6 | n/a |
| Llama 3.1 8b | unknown | 68.4 | 30.4 | 51.9 | 72.6 | 68.9 | 59.5 | n/a |
| Grok 2 | unknown | 87.5 | 56.0 | 76.1 | 88.4 | n/a | n/a | n/a |
| Grok 2 mini | unknown | 86.2 | 51.0 | 73.0 | 85.7 | n/a | n/a | n/a |
| Gemini 1.0 Ultra | unknown | 83.7 | n/a | 53.2 | 74.4 | 79.0 | 82.4 | n/a |
| Gemini 1.5 Pro | unknown | 81.9 | n/a | 58.5 | 71.9 | 88.7 | 78.9 | n/a |
| Gemini 1.5 Flash | unknown | 77.9 | 38.6 | 40.9 | 71.5 | 75.5 | 78.4 | n/a |

Background

Evals are sensitive to prompting, and there's significant variation in the formulations used in recent publications and libraries. Some use few-shot prompts or role-playing prompts ("You are an expert software programmer..."). These approaches are carryovers from evaluating base models (rather than instruction/chat-tuned models) and from models that were worse at following instructions.

For this library, we are emphasizing the zero-shot, chain-of-thought setting, with simple instructions like "Solve the following multiple choice problem". We believe that this prompting technique is a better reflection of the models' performance in realistic usage.
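
For illustration only, a zero-shot, chain-of-thought prompt in this spirit might look like the sketch below. The template wording and field names are hypothetical, not necessarily the exact templates used in this repository.

```python
# Hypothetical zero-shot, chain-of-thought prompt template in the spirit
# described above (not necessarily the exact wording this repository uses).
QUERY_TEMPLATE = """
Solve the following multiple choice problem. Think step by step, then finish
your answer with "Answer: X", where X is one of A, B, C, or D.

{question}

A) {a}
B) {b}
C) {c}
D) {d}
""".strip()

# Fill the template for a single example and inspect the resulting prompt.
prompt = QUERY_TEMPLATE.format(
    question="Which planet is known as the Red Planet?",
    a="Venus", b="Mars", c="Jupiter", d="Mercury",
)
print(prompt)
```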

Evals

This repository currently contains evals for the benchmarks reported in the table above:

- MMLU: Measuring Massive Multitask Language Understanding
- MATH: Measuring Mathematical Problem Solving (MATH-500 for newer models; see the footnotes)
- GPQA: A Graduate-Level Google-Proof Q&A Benchmark
- DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs
- MGSM: Multilingual Grade School Math
- HumanEval: Evaluating Large Language Models Trained on Code
- SimpleQA: Measuring short-form factuality in large language models
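
Each eval follows the same basic pattern: build a prompt for each example, send it to a sampler, grade the response, and aggregate a score. The sketch below illustrates that pattern only; the names (SamplerBase, EvalResult, run_eval) are hypothetical and not necessarily the interfaces defined in this repository.

```python
# Illustrative sketch of the eval/sampler pattern described above.
from dataclasses import dataclass


class SamplerBase:
    """Anything that maps a list of chat messages to a model completion."""

    def __call__(self, messages: list[dict[str, str]]) -> str:
        raise NotImplementedError


@dataclass
class EvalResult:
    score: float       # aggregate metric, e.g. accuracy in [0, 1]
    num_examples: int


def run_eval(sampler: SamplerBase, examples: list[dict[str, str]]) -> EvalResult:
    """Zero-shot evaluation loop: prompt, sample, grade, aggregate."""
    correct = 0
    for ex in examples:
        response = sampler([{"role": "user", "content": ex["prompt"]}])
        # Naive grading for illustration; real evals use per-benchmark graders
        # (regex answer extraction, F1 scoring, code execution, etc.).
        correct += int(ex["answer"] in response)
    return EvalResult(score=correct / max(len(examples), 1), num_examples=len(examples))
```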

Samplers

We have implemented sampling interfaces for the following language model APIs:

Make sure to set the corresponding *_API_KEY environment variables (for example, OPENAI_API_KEY for the OpenAI API) before using these APIs.
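
As a concrete illustration, a minimal OpenAI-backed sampler could look like the sketch below. It assumes OPENAI_API_KEY is set in the environment (the official openai Python client reads it automatically); the class name ChatCompletionSampler and the default model are placeholders, not necessarily what this repository defines.

```python
# Minimal sketch of an OpenAI-backed sampler; assumes OPENAI_API_KEY is set.
from openai import OpenAI


class ChatCompletionSampler:
    """Wraps the OpenAI Chat Completions API behind a simple callable."""

    def __init__(self, model: str = "gpt-4o-mini", system_message: str | None = None):
        self.client = OpenAI()  # picks up OPENAI_API_KEY from the environment
        self.model = model
        self.system_message = system_message

    def __call__(self, messages: list[dict[str, str]]) -> str:
        if self.system_message:
            messages = [{"role": "system", "content": self.system_message}] + messages
        response = self.client.chat.completions.create(model=self.model, messages=messages)
        return response.choices[0].message.content or ""


# Example usage:
#   sampler = ChatCompletionSampler(system_message="You are a helpful assistant.")
#   print(sampler([{"role": "user", "content": "Say hello."}]))
```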

Setup

Clone the repository, then install the requirements from the project root directory:

pip3 install -r requirements.txt

Running the evals

python3 -m simple-evals-continue.simple_evals --list-models

This will list all the models that you can evaluate.

python3 -m simple-evals-continue.simple_evals --list-evals

This will list all the evals that you can run.

To run the evaluations, you can use the following command:

python3 -m simple-evals-continue.simple_evals --model <model_name> --examples <num_examples>

This will launch evaluations through the OpenAI API.

Run these commands from the directory that contains the cloned simple-evals-continue folder (its parent directory), not from inside the repository itself, so that Python can resolve simple-evals-continue as a package for the -m module imports.
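
For example, to run a quick smoke test limited to 10 examples (assuming gpt-4o-mini-2024-07-18 appears in the --list-models output):

python3 -m simple-evals-continue.simple_evals --model gpt-4o-mini-2024-07-18 --examples 10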

Footnotes

  1. Includes a tweak to the answer-extraction regex for the GPQA benchmark.

  2. For newer models (anything on or after o1), we evaluate on MATH-500, which is a newer, IID version of MATH.

  3. We believe these evals are saturated for our newer models, but are reporting them for completeness.

  4. These results are with no tools enabled for o3 or o4-mini.

  5. o-series models do not support using a system prompt.

  6. The default reasoning level for o3-mini is "medium".

  7. The "assistant" prompt is the system message from the OpenAI API documentation: "You are a helpful assistant."
