Benchmark feature like for AIME/HMMT #9

@jjoshua2

Description

I'm testing R1 0528 (free) on HMMT, and it works to just paste in a question and manually check whether it gets the answer right. But I think we could orchestrate it to run the whole test and grade it automatically, like the matharena project does (its code is open source), and then tune the prompts or the enhanced pipeline to get better scores. Your fixes for making enhanced work like pure Python code should show up as an improvement here.
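
A minimal sketch of what the run-and-grade loop could look like, assuming a JSONL problem file with `question`/`answer` fields, a caller-supplied `ask_model` function, and a `\boxed{}` answer convention (all of these are assumptions, not matharena's actual interface):

```python
import json
import re

def extract_answer(text: str) -> str | None:
    """Pull the last \\boxed{...} value out of a model response (no nested braces)."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", text)
    return matches[-1].strip() if matches else None

def run_benchmark(problems_path: str, ask_model) -> float:
    """Run every problem through the model and grade by exact string match."""
    with open(problems_path) as f:
        problems = [json.loads(line) for line in f]
    correct = 0
    for prob in problems:
        response = ask_model(prob["question"])  # hypothetical model call
        if extract_answer(response) == str(prob["answer"]):
            correct += 1
    return correct / len(problems)
```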

We need versioning of basic and enhanced modes so we can show that, say, v2 is better than v1. Or we could show the git hash of the checkout for beta-test versions before they are tagged.
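
One way to record this, sketched below, is to stamp each run with the mode name plus the short git hash of the checkout; the function name and the version-string format are illustrative only:

```python
import subprocess

def run_version(mode: str) -> str:
    """Return e.g. 'enhanced@a1b2c3d' so results can be tied to a specific checkout."""
    git_hash = subprocess.check_output(
        ["git", "rev-parse", "--short", "HEAD"], text=True
    ).strip()
    return f"{mode}@{git_hash}"
```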

Problem 1: I got the correct solution on the first try with basic, 1 iteration. I then tested +1 iteration and it worked fine; the second pass retained the same correct answer but was much more readable.
I skipped to problem 9, where matharena shows the model sometimes gets it right and sometimes wrong, and it got it right. I tried extending the run and also starting a new task to measure the accuracy rate. All 5 runs got it right (one needed 2 iterations). It had to use the retry code a lot to get DeepSeek to finish and produce a result.
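
For reference, the retry behavior I'm relying on is something like the following sketch, a simple exponential-backoff wrapper around the model call (the attempt count, delays, and broad exception handling are placeholder assumptions, not the actual retry code):

```python
import time

def with_retries(call, max_attempts: int = 5, base_delay: float = 2.0):
    """Retry a flaky model call with exponential backoff between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))

# usage: answer = with_retries(lambda: ask_model(question))
```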

Problem 11: the model didn't get any runs correct on matharena! I'm testing basic with 1 iteration and enhanced with 2,2 iterations. Basic got it wrong, but enhanced got it correct! I'm trying to extend the basic run, plus a second enhanced 2,2 and a 3,1.
