Description
Thanks for your work and the code for X-MAS. I have a few questions, as I am trying to reproduce the results but have not managed to yet. Please let me know what I might be missing:
- On page 14, Section D, how do you map plan, verify, direct, and aggregate to the AgentVerse role assigner, solver, critic, and evaluator? Is the mapping plan == role assigner, direct == solver, verify == critic, and aggregate == evaluator, or something else? Similarly, what is the mapping for LLM-Debate and DyLAN?
- Table 1 provides the ranking across all chatbot LLMs for X-MAS-Bench; could you provide the same ranking for the candidate LLMs used for X-MAS-Design?
- I was able to download results.zip with the predictions from Google Drive. I then tried to aggregate the per-benchmark performance to reproduce the rankings reported in Table 1 and Table 5 of the paper, but I cannot match them. What aggregation logic is used to obtain the domain-level rankings (mathematics, finance, etc.)? I tried a simple average, a weighted average, and an approximate accuracy over the benchmark grouping listed below, but none of them matched the rankings in Table 1 and Table 5. Could you please share the ranking logic and how it can be obtained from the Drive files? (A sketch of my simple-average attempt follows the list.)
  - mathematics (AIME-2024 [53], AQUA-RAT [51], GSM-Hard [52], MATH [27], MMLU-Math [54], MMLU-Pro-Math [55]), coding (HumanEval [56], HumanEval-Plus [58], MBPP [57], MBPP-Plus, MMLU-Coding, MMLU-Pro-Coding), science (GPQA-Main [36], GPQA-Diamond, SciBench [59], SciEval [60], SciKnowEval [61], MMLU-Sci, MMLU-Pro-Sci), medicine (MedMCQA [62], MedQA [63], PubMedQA [64], MMLU-Med, MMLU-Pro-Med), and finance (FinanceBench [65], FinQA [66], FPB [67], MMLU-Finan, MMLU-Pro-Finan)
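
For reference, here is a minimal sketch of the simple-average aggregation I tried. The directory layout, the `<benchmark>.jsonl` file names, and the boolean `correct` field are assumptions on my side about the results.zip contents, not the actual format:

```python
import json
from pathlib import Path

# Domain -> benchmark grouping as listed in the paper (names assumed to match the file names).
DOMAINS = {
    "mathematics": ["AIME-2024", "AQUA-RAT", "GSM-Hard", "MATH", "MMLU-Math", "MMLU-Pro-Math"],
    "coding": ["HumanEval", "HumanEval-Plus", "MBPP", "MBPP-Plus", "MMLU-Coding", "MMLU-Pro-Coding"],
    "science": ["GPQA-Main", "GPQA-Diamond", "SciBench", "SciEval", "SciKnowEval", "MMLU-Sci", "MMLU-Pro-Sci"],
    "medicine": ["MedMCQA", "MedQA", "PubMedQA", "MMLU-Med", "MMLU-Pro-Med"],
    "finance": ["FinanceBench", "FinQA", "FPB", "MMLU-Finan", "MMLU-Pro-Finan"],
}

def benchmark_accuracy(pred_file: Path) -> float:
    """Fraction of correct predictions in one benchmark's result file.
    Assumes one JSON record per line with a boolean 'correct' field (my guess at the format)."""
    records = [json.loads(line) for line in pred_file.open()]
    return sum(r["correct"] for r in records) / len(records)

def domain_scores(model_dir: Path) -> dict:
    """Unweighted mean accuracy per domain for one model's results directory.
    Assumes the directory contains one <benchmark>.jsonl file per benchmark."""
    scores = {}
    for domain, benchmarks in DOMAINS.items():
        accs = [benchmark_accuracy(model_dir / f"{b}.jsonl") for b in benchmarks]
        scores[domain] = sum(accs) / len(accs)
    return scores

# Example: rank all models on mathematics by the simple-average score.
results_root = Path("results")  # unpacked results.zip
ranking = sorted(
    ((m.name, domain_scores(m)["mathematics"]) for m in results_root.iterdir() if m.is_dir()),
    key=lambda kv: kv[1],
    reverse=True,
)
for model, score in ranking:
    print(f"{model}\t{score:.3f}")
```

If the result files store per-sample correctness differently, the per-domain unweighted mean above is still what I was trying to compute; the weighted-average variant I tried simply weighted each benchmark by its number of samples.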