Description
Thanks for your work and the code for X-MAS. I have a few questions, as I am trying to reproduce the results but have not managed to yet. Please let me know what I might be missing:
- On page 14, Section D, how do you map plan, verify, direct, and aggregate to the AgentVerse role assigner, solver, critic, and evaluator? Is the mapping plan == role assigner, direct == solver, verify == critic, and aggregate == evaluator, or something else? Similarly, what is the mapping for LLM-Debate and DyLAN?
- Table 1 provides the ranking across all chatbot LLMs for X-MAS-Bench; could you provide the same ranking for the candidate LLMs used for X-MAS-Design?
- I was able to download results.zip with the predictions from Google Drive. I then tried to aggregate the per-benchmark performance to reproduce the rankings reported in Table 1 and Table 5 of the paper, but I cannot match them. What aggregation logic is used to obtain the domain-level rankings (mathematics, finance, etc.)? I tried a simple average, a weighted average, and an approximate accuracy over the benchmark grouping listed below, but none of them matched the rankings in Table 1 and Table 5. Could you please share the ranking logic and how it can be obtained from the Drive files? (A sketch of my simple-average attempt follows the list.)
  - mathematics (AIME-2024 [53], AQUA-RAT [51], GSM-Hard [52], MATH [27], MMLU-Math [54], MMLU-Pro-Math [55]), coding (HumanEval [56], HumanEval-Plus [58], MBPP [57], MBPP-Plus, MMLU-Coding, MMLU-Pro-Coding), science (GPQA-Main [36], GPQA-Diamond, SciBench [59], SciEval [60], SciKnowEval [61], MMLU-Sci, MMLU-Pro-Sci), medicine (MedMCQA [62], MedQA [63], PubMedQA [64], MMLU-Med, MMLU-Pro-Med), and finance (FinanceBench [65], FinQA [66], FPB [67], MMLU-Finan, MMLU-Pro-Finan)
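
For reference, here is a minimal sketch of the simple-average aggregation I tried. The directory layout, the `<benchmark>.jsonl` file names, and the boolean `correct` field are assumptions on my side about the results.zip contents, not the actual format:

```python
import json
from pathlib import Path

# Domain -> benchmark grouping as listed in the paper (names assumed to match the file names).
DOMAINS = {
    "mathematics": ["AIME-2024", "AQUA-RAT", "GSM-Hard", "MATH", "MMLU-Math", "MMLU-Pro-Math"],
    "coding": ["HumanEval", "HumanEval-Plus", "MBPP", "MBPP-Plus", "MMLU-Coding", "MMLU-Pro-Coding"],
    "science": ["GPQA-Main", "GPQA-Diamond", "SciBench", "SciEval", "SciKnowEval", "MMLU-Sci", "MMLU-Pro-Sci"],
    "medicine": ["MedMCQA", "MedQA", "PubMedQA", "MMLU-Med", "MMLU-Pro-Med"],
    "finance": ["FinanceBench", "FinQA", "FPB", "MMLU-Finan", "MMLU-Pro-Finan"],
}

def benchmark_accuracy(pred_file: Path) -> float:
    """Fraction of correct predictions in one benchmark's result file.
    Assumes one JSON record per line with a boolean 'correct' field (my guess at the format)."""
    records = [json.loads(line) for line in pred_file.open()]
    return sum(r["correct"] for r in records) / len(records)

def domain_scores(model_dir: Path) -> dict:
    """Unweighted mean accuracy per domain for one model's results directory.
    Assumes the directory contains one <benchmark>.jsonl file per benchmark."""
    scores = {}
    for domain, benchmarks in DOMAINS.items():
        accs = [benchmark_accuracy(model_dir / f"{b}.jsonl") for b in benchmarks]
        scores[domain] = sum(accs) / len(accs)
    return scores

# Example: rank all models on mathematics by the simple-average score.
results_root = Path("results")  # unpacked results.zip
ranking = sorted(
    ((m.name, domain_scores(m)["mathematics"]) for m in results_root.iterdir() if m.is_dir()),
    key=lambda kv: kv[1],
    reverse=True,
)
for model, score in ranking:
    print(f"{model}\t{score:.3f}")
```

If the result files store per-sample correctness differently, the per-domain unweighted mean above is still what I was trying to compute; the weighted-average variant I tried simply weighted each benchmark by its number of samples.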