
Code for evaluation of Eurus Models #2

Open
archiki opened this issue Apr 9, 2024 · 5 comments

Comments


archiki commented Apr 9, 2024

Can you add the code for reproducing the main results in the paper for various math and coding datasets, along with their prompts and the data splits used?

cgq15 (Collaborator) commented Apr 10, 2024

Thanks for your interest! We are working on it right now and we will release the evaluation code soon.

archiki (Author) commented Apr 16, 2024

Thanks @cgq15! In the meantime, could you let me know the generation configs used for the different task types? The numbers I have been able to reproduce are not consistent with Table 3 of your paper (see below). Since I am using the same prompts as listed in the paper, I suspect the gap is due to generation settings such as temperature, top_p, top_k, and do_sample.

Reproduction Study:
Dataset: HumanEval, Reported Performance: 55.5, Performance Obtained: 47.56 (temp=0.2, top_p=0.9, top_k=50)
Dataset: MATH, Reported Performance: 32.6, Performance Obtained: 29.6 (PoT with above parameters)

Note: I have loaded the model with 8-bit quantization.
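
For reference, a minimal sketch of the setup described above, assuming the Hugging Face transformers API; the checkpoint path and prompt are placeholders, not the exact evaluation script:

```python
# Minimal sketch of the reproduction settings above (8-bit weights, sampling).
# The checkpoint path and prompt are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/eurus-checkpoint"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,   # 8-bit quantization, as noted above
    device_map="auto",
)

prompt = "..."  # prompt from the paper for the given task
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.2,
    top_p=0.9,
    top_k=50,
    max_new_tokens=512,
)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```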

cgq15 (Collaborator) commented Apr 17, 2024

Hi, we are releasing the eval code today, so please stay tuned.
For hyperparameters, we set temperature=0, i.e. greedy decoding, for all coding and math tasks. We also evaluate the models with float16 weights.
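
For anyone trying to match these settings, a minimal sketch under the same assumptions as the snippet above (transformers API, placeholder paths), with the two differences being float16 weights and greedy decoding:

```python
# Minimal sketch of the settings described above: float16 weights and greedy decoding.
# The checkpoint path and prompt are placeholders, not the released eval code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/eurus-checkpoint"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # float16 weights instead of 8-bit quantization
    device_map="auto",
)

inputs = tokenizer("...", return_tensors="pt").to(model.device)  # task prompt goes here
# do_sample=False gives greedy decoding (temperature=0); top_p/top_k are not used.
outputs = model.generate(**inputs, do_sample=False, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```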

lifan-yuan (Collaborator) commented

Hi @archiki,

we have released the eval code. Enjoy!

archiki (Author) commented May 14, 2024

Thanks a lot! @lifan-yuan, can you clarify whether the performance reported on MBPP and HumanEval is on the regular test sets or on the EvalPlus suites? TIA!
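
For context, the base HumanEval/MBPP sets and the EvalPlus variants (HumanEval+/MBPP+) differ in the number of test cases per problem, so scores on the two are generally not comparable. A minimal sketch of generating samples against the EvalPlus version of HumanEval, assuming the evalplus package's data helpers (names taken from its README and possibly different across versions; generate_solution is a placeholder):

```python
# Hypothetical sketch using the evalplus data helpers (API may differ by version).
# HumanEval+ augments each base problem with additional test inputs, so a model's
# score on the base tests is usually higher than its EvalPlus score.
from evalplus.data import get_human_eval_plus, write_jsonl

def generate_solution(prompt: str) -> str:
    """Placeholder for model inference on a single HumanEval prompt."""
    raise NotImplementedError

samples = [
    dict(task_id=task_id, solution=generate_solution(problem["prompt"]))
    for task_id, problem in get_human_eval_plus().items()
]
write_jsonl("samples.jsonl", samples)
# samples.jsonl is then scored with EvalPlus's evaluation tooling, which reports
# pass@k on both the base tests and the extended (plus) tests.
```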
