Hi Authors and Contributors,
Thanks for the clear and well-organized codebase!
I am reproducing the results from Table 2 of the paper (results of general methods on diverse domains: mathematics, science, knowledge, medicine, coding) using Llama-3.3-70B-Instruct.
Some of my reproduced results don't match those reported in Table 2, and I suspect this may be due to the evaluation setup. Could you clarify which model you used as the evaluator, and whether the current evaluate_xverify.py script differs from the version used for the paper?
Also, echoing issue #1, we're looking forward to more methods being released, such as AFlow.
Thanks again for your contribution!