Hi Authors and Contributors,
Thanks for the clear and well-organized codebase!
I am reproducing the results from Table 2 of the paper (results of general methods on diverse domains: mathematics, science, knowledge, medicine, coding) using Llama-3.3-70B-Instruct.
Some of my reproduced results don't match those reported in Table 2, and I suspect this may be due to the evaluation setup. Could you clarify which model you used as the evaluator, and whether the current evaluate_xverify.py script differs from the version used for the paper?
Also, echoing issue #1, we're looking forward to more methods being released, such as AFlow.
Thanks again for your contribution!