Aggregation logic to obtain domain-function performance ranking from dataset inference files provided #3

@being-agentic

Thanks for your work and code on X-MAS. I am trying to reproduce the results and have not succeeded so far, so I have a few questions. Please let me know what I might be missing:

  1. On page 14, Section D, how do plan, verify, direct, and aggregate map to the AgentVerse role assigner, solver, critic, and evaluator? Is the mapping plan == role assigner, direct == solver, verify == critic, and aggregate == evaluator, or something else? Likewise, what is the mapping for LLM-Debate and DyLAN?

  2. Table 1 gives the ranking across all chatbots for X-MAS-Bench. Could you provide the same ranking for the candidate LLMs used in X-MAS-Design?

  3. I downloaded results.zip with the predictions from Google Drive and tried to aggregate the performance to reproduce the rankings in Table 1 and Table 5 of the paper, but the rankings I get do not match. What aggregation logic produces the ranking at the domain level (mathematics, finance, etc.)? I tried a simple average, a weighted average, and approximate accuracy over the dataset groupings listed in the paper (below*), but none of them matched Table 1 or Table 5. Could you please share the ranking logic and how to obtain it from the Drive files?

  • mathematics (AIME-2024 [53], AQUA-RAT [51], GSM-Hard [52], MATH [27], MMLU-Math [54], MMLU-Pro-Math [55]), coding (HumanEval [56], HumanEval-Plus [58], MBPP [57], MBPP-Plus, MMLU-Coding, MMLU-Pro-coding), science (GPQA-Main [36], GPQA-Diamond, SciBench [59], SciEval [60], SciKnowEval [61], MMLU-Sci, MMLU-Pro-Sci), medicine (MedMCQA [62], MedQA [63], PubMedQA [64], MMLU-Med, MMLU-Pro-Med), and finance (FinanceBench [65], FinQA [66], FPB [67], MMLU-Finan, MMLU-Pro-Finan)
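
For concreteness, here is a minimal sketch of the simple average and the sample-weighted average over the groupings above. It assumes per-dataset accuracies and sample counts have already been extracted from the prediction files. The function names, model names, and toy numbers are illustrative placeholders, not values from results.zip, and this is not claimed to be the aggregation used in the paper.

```python
# Domain -> dataset groupings as quoted from the paper above.
DOMAINS = {
    "mathematics": ["AIME-2024", "AQUA-RAT", "GSM-Hard", "MATH",
                    "MMLU-Math", "MMLU-Pro-Math"],
    "coding": ["HumanEval", "HumanEval-Plus", "MBPP", "MBPP-Plus",
               "MMLU-Coding", "MMLU-Pro-coding"],
    "science": ["GPQA-Main", "GPQA-Diamond", "SciBench", "SciEval",
                "SciKnowEval", "MMLU-Sci", "MMLU-Pro-Sci"],
    "medicine": ["MedMCQA", "MedQA", "PubMedQA", "MMLU-Med", "MMLU-Pro-Med"],
    "finance": ["FinanceBench", "FinQA", "FPB", "MMLU-Finan", "MMLU-Pro-Finan"],
}


def macro_average(acc, datasets):
    """Unweighted mean of per-dataset accuracy (each dataset counts equally)."""
    present = [d for d in datasets if d in acc]
    return sum(acc[d] for d in present) / len(present)


def weighted_average(acc, counts, datasets):
    """Mean of per-dataset accuracy weighted by number of samples."""
    present = [d for d in datasets if d in acc]
    total = sum(counts[d] for d in present)
    return sum(acc[d] * counts[d] for d in present) / total


def rank_models(scores):
    """Return model names sorted from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy inputs: two made-up models on two mathematics datasets.
toy_counts = {"AIME-2024": 30, "MATH": 70}
toy_acc = {
    "model_a": {"AIME-2024": 0.5, "MATH": 0.8},
    "model_b": {"AIME-2024": 0.8, "MATH": 0.6},
}

math_sets = DOMAINS["mathematics"]
macro_rank = rank_models(
    {m: macro_average(a, math_sets) for m, a in toy_acc.items()}
)
weighted_rank = rank_models(
    {m: weighted_average(a, toy_counts, math_sets) for m, a in toy_acc.items()}
)
```

Even on this toy input the two aggregations order the models differently, so the exact choice of aggregation (and any per-function averaging applied before it) matters for matching Table 1 and Table 5.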
