@@ -162,7 +162,7 @@ python -m gpt_oss.evals --model 120b-low --eval gpqa --n-threads 128
 python -m gpt_oss.evals --model 120b --eval gpqa --n-threads 128
 python -m gpt_oss.evals --model 120b-high --eval gpqa --n-threads 128
 ```
-
+To eval on AIME2025, change `gpqa` to `aime25`.
 With vLLM deployed:

 ```
@@ -176,23 +176,25 @@ vllm serve openai/gpt-oss-120b \
   --no-enable-prefix-caching
 ```

-Here is the score we are able to reproduce, and we encourage you to help reproduce as well!
+Here is the score we were able to reproduce without tool use, and we encourage you to try reproducing it as well!
+We’ve observed that the numbers may vary slightly across runs, so feel free to run the evaluation multiple times to get a sense of the variance.
+For a quick correctness check, we recommend starting with the low reasoning effort setting (120b-low), which should complete within minutes.

 Model: 120B

 | Reasoning Effort | GPQA | AIME25 |
 | :---- | :---- | :---- |
 | Low | 65.3 | 51.2 |
-| Mid | (77) | 79.6 |
-| High | (82.2) | 93.0 |
+| Mid | 72.4 | 79.6 |
+| High | 79.4 | 93.0 |

 Model: 20B

 | Reasoning Effort | GPQA | AIME25 |
 | :---- | :---- | :---- |
 | Low | 56.8 | 38.8 |
 | Mid | 67.5 | 75.0 |
-| High |  |  |
+| High | 70.9 | 85.8 |

 ## Known Limitations

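The added README text notes that scores may vary slightly across runs and suggests repeating the evaluation to get a sense of the variance. A minimal sketch of how such repeated scores could be summarized, assuming you have noted one accuracy value per run by hand; the helper name `summarize_runs` and the sample values are hypothetical placeholders, not output of `gpt_oss.evals`:

```python
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation) for scores from repeated runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variance")
    return statistics.mean(scores), statistics.stdev(scores)


if __name__ == "__main__":
    # Placeholder values for illustration only; replace with your own scores.
    example_scores = [70.0, 72.0, 74.0]
    mean, spread = summarize_runs(example_scores)
    print(f"mean={mean:.1f} stdev={spread:.1f} over {len(example_scores)} runs")
```

The sample standard deviation (`statistics.stdev`) is used here because a handful of runs is a sample rather than the full population of possible runs.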