Question about Reproducibility #6

@May-Mq


Hi Authors and Contributors,

Thanks for the clear and well-organized codebase!

I am reproducing the results in Table 2 of the paper (results of general methods on diverse domains: mathematics, science, knowledge, medicine, and coding), using Llama-3.3-70B-Instruct.

Some of my reproduced results don't match those reported in Table 2, and I suspect this is related to the evaluation setup. Could you clarify which model you used as the evaluator, and whether the current evaluate_xverify.py script differs from the version used for the paper?

Also, similar to issue #1, I'm looking forward to more methods being released, such as AFlow.

Thanks again for your contribution!
