Commit 985b860

gptoss complete eval (#15)
1 parent 5984053 commit 985b860

1 file changed: +7 -5 lines changed

OpenAI/GPT-OSS.md

Lines changed: 7 additions & 5 deletions
Original file line number | Diff line number | Diff line change
@@ -162,7 +162,7 @@ python -m gpt_oss.evals --model 120b-low --eval gpqa --n-threads 128
162 162   python -m gpt_oss.evals --model 120b --eval gpqa --n-threads 128
163 163   python -m gpt_oss.evals --model 120b-high --eval gpqa --n-threads 128
164 164   ```
165     -
    165 + To eval on AIME2025, change `gpqa` to `aime25`.
166 166   With vLLM deployed:
167 167
168 168   ```
@@ -176,23 +176,25 @@ vllm serve openai/gpt-oss-120b \
176 176   --no-enable-prefix-caching
177 177   ```
178 178
179     - Here is the score we are able to reproduce, and we encourage you to help reproduce as well!
    179 + Here is the score we were able to reproduce without tool use, and we encourage you to try reproducing it as well!
    180 + We’ve observed that the numbers may vary slightly across runs, so feel free to run the evaluation multiple times to get a sense of the variance.
    181 + For a quick correctness check, we recommend starting with the low reasoning effort setting (120b-low), which should complete within minutes.
180 182
181 183   Model: 120B
182 184
183 185   | Reasoning Effort | GPQA | AIME25 |
184 186   | :---- | :---- | :---- |
185 187   | Low | 65.3 | 51.2 |
186     - | Mid | (77) | 79.6 |
187     - | High | (82.2) | 93.0 |
    188 + | Mid | 72.4 | 79.6 |
    189 + | High | 79.4 | 93.0 |
188 190
189 191   Model: 20B
190 192
191 193   | Reasoning Effort | GPQA | AIME25 |
192 194   | :---- | :---- | :---- |
193 195   | Low | 56.8 | 38.8 |
194 196   | Mid | 67.5 | 75.0 |
195     - | High | | |
    197 + | High | 70.9 | 85.8 |
196 198
197 199   ## Known Limitations
198 200
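The changed file covers three 120B reasoning-effort variants (`120b-low`, `120b`, `120b-high`) and two evals (`gpqa`, `aime25`). As a minimal sketch, the full set of command lines for the 120B table could be generated as follows. The helper name `build_eval_commands` is hypothetical, and the sketch assumes `aime25` accepts the same flags as `gpqa`, as the "change `gpqa` to `aime25`" note suggests. It only builds the strings and does not run them.

```python
from itertools import product

# Model variants and eval names as they appear in the commands in this diff.
MODEL_VARIANTS = ["120b-low", "120b", "120b-high"]
EVALS = ["gpqa", "aime25"]


def build_eval_commands(n_threads: int = 128) -> list[str]:
    """Return one gpt_oss.evals command line per (eval, model variant) pair.

    Hypothetical helper for illustration; it is not part of gpt_oss.
    """
    return [
        f"python -m gpt_oss.evals --model {model} --eval {eval_name} --n-threads {n_threads}"
        for eval_name, model in product(EVALS, MODEL_VARIANTS)
    ]


if __name__ == "__main__":
    # Print the commands so they can be reviewed or pasted into a shell.
    for command in build_eval_commands():
        print(command)
```

Since the commit notes that scores vary slightly across runs, each generated command may be worth running more than once, starting with the `120b-low` variant as the quick correctness check.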