Benchmark feature like for AIME/HMMT #9

@jjoshua2

Description

I'm testing R1 0528 (free) on HMMT, and it works to just paste in a question and manually check whether it gets the answer right. But I think we could orchestrate it to run the whole test and grade it automatically, like the matharena project does (its code is open source), and then tune the prompts or the enhanced pipeline to get better scores. Your fixes for making enhanced work like pure Python code should show up as an improvement here.
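
A minimal sketch of what the run-and-grade loop could look like, assuming a JSONL problem file with `question`/`answer` fields, a caller-supplied `ask_model` function, and a `\boxed{}` answer convention (all of these are assumptions, not matharena's actual interface):

```python
import json
import re

def extract_answer(text: str) -> str | None:
    """Pull the last \\boxed{...} value out of a model response (no nested braces)."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", text)
    return matches[-1].strip() if matches else None

def run_benchmark(problems_path: str, ask_model) -> float:
    """Run every problem through the model and grade by exact string match."""
    with open(problems_path) as f:
        problems = [json.loads(line) for line in f]
    correct = 0
    for prob in problems:
        response = ask_model(prob["question"])  # hypothetical model call
        if extract_answer(response) == str(prob["answer"]):
            correct += 1
    return correct / len(problems)
```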

We need versioning of basic and enhanced modes so we can show that, say, v2 is better than v1. Or we could show the git hash of the checkout for beta-test versions before they are tagged.
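
One way to record this, sketched below, is to stamp each run with the mode name plus the short git hash of the checkout; the function name and the version-string format are illustrative only:

```python
import subprocess

def run_version(mode: str) -> str:
    """Return e.g. 'enhanced@a1b2c3d' so results can be tied to a specific checkout."""
    git_hash = subprocess.check_output(
        ["git", "rev-parse", "--short", "HEAD"], text=True
    ).strip()
    return f"{mode}@{git_hash}"
```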

Problem 1: I got the correct solution on the first try with basic, 1 iteration. I then tested +1 iteration and it worked fine; the second pass retained the same correct answer but was much more readable.
I skipped to problem 9, where matharena shows the model sometimes gets it right and sometimes wrong, and it got it right. I tried extending the run and also starting a new task to measure the accuracy rate. All 5 runs got it right (one needed 2 iterations). It had to use the retry code a lot to get DeepSeek to finish and produce a result.
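
For reference, the retry behavior I'm relying on is something like the following sketch, a simple exponential-backoff wrapper around the model call (the attempt count, delays, and broad exception handling are placeholder assumptions, not the actual retry code):

```python
import time

def with_retries(call, max_attempts: int = 5, base_delay: float = 2.0):
    """Retry a flaky model call with exponential backoff between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))

# usage: answer = with_retries(lambda: ask_model(question))
```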

Problem 11: the model didn't get any runs correct on matharena! I'm testing basic with 1 iteration and enhanced with 2,2 iterations. Basic got it wrong, but enhanced got it correct! I'm trying to extend the basic run, plus a second enhanced 2,2 and a 3,1.
