about gpt-4-0125-preview reference answer #21

Open
duguodong7 opened this issue Sep 10, 2024 · 4 comments
Labels
good first issue Good for newcomers

Comments

@duguodong7

hello,

I would like to ask: when evaluating on MT-Bench, were the reference answers obtained by running gen_api_answer.py --model gpt-4-0125-preview?
Did you generate 80 reference answers and then replace questions 100~130 among them with the 30 corrected ones from the official comment (https://github.com/lm-sys/FastChat/pull/3158)?
To summarize: the judge model is gpt-4-0125-preview, but how were the reference answers for the 80 questions obtained, and are the judge results reproducible or do they fluctuate?

@duguodong7
Author

Now I see that reference answers are only prepared for questions 100~130, so we do not need to run gen_api_answer.py since the 30 reference answers are given.
However, with the provided gpt-4-0125-preview.jsonl and gpt-4-0125-preview as the judge model, we still cannot reproduce the result for openchat_3.5 reported in the paper; we are running more tests. The result we got is (7.4375, 6.9875, 7.2125) instead of (7.14, 6.55, 6.84) as shown in the paper.

@duguodong7
Author

I just tested the FuseChat-2.0 model provided in your link; the result is (1st turn: 7.6125, 2nd turn: 6.425, mean: 7.01875) instead of the (7.70, 7.05, 7.38) that you report. May I know what is wrong and how I can reproduce it? I did not use vLLM, and I kept all environment versions the same as yours. I generated the FuseChat-2.0 answers with 8 GPUs.

@yangzy39
Contributor

yangzy39 commented Sep 10, 2024

Regarding MT-Bench, the evaluation results may fluctuate. When conducting the evaluation, we only changed the reference and judge model from gpt-4-0613 to gpt-4-0125-preview. Below are several things to check that could lead to differences in reproducibility:

  1. Ensure that the reference answers and the API used during the evaluation are from gpt-4-0125-preview.
  2. Ensure that the correct chat template is used when evaluating our model, which may require modifying the matching code in FastChat (https://github.com/lm-sys/FastChat/blob/main/fastchat/model/model_adapter.py); see the sketch after this list.
  3. Ensure that the openchat_3.5 model you tested is from https://huggingface.co/openchat/openchat_3.5.
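
For point 2, here is a minimal sketch of what such a modification might look like, assuming FastChat's adapter mechanism (`BaseModelAdapter`, `register_model_adapter`) and its registered `openchat_3.5` conversation template. The adapter class name and the `fusechat` path keyword are placeholders for whatever your local checkpoint directory is called, and you should confirm against the FuseChat README which template the checkpoints actually expect:

```python
# Sketch only: FastChat selects the chat template by substring-matching the
# model path against registered adapters in fastchat/model/model_adapter.py,
# so a local directory name that no adapter recognizes can silently fall back
# to a generic template. If you edit model_adapter.py directly, the imports
# below are already available there; place the registration above the final
# catch-all register_model_adapter(BaseModelAdapter) call.
from fastchat.conversation import Conversation, get_conv_template
from fastchat.model.model_adapter import BaseModelAdapter, register_model_adapter


class FuseChatAdapter(BaseModelAdapter):
    """Route FuseChat-style checkpoints to the OpenChat-3.5 chat template."""

    def match(self, model_path: str):
        # Placeholder keyword: adjust to match your local checkpoint directory name.
        return "fusechat" in model_path.lower()

    def get_default_conv_template(self, model_path: str) -> Conversation:
        # Assumption: FuseChat-7B checkpoints use the OpenChat-3.5 conversation format.
        return get_conv_template("openchat_3.5")


register_model_adapter(FuseChatAdapter)
```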

Also, we provide our judgement files for openchat_3.5 and FuseChat-2.0 (FuseChat-7B-SCE).
judgement.zip
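
If it helps when comparing against the attached file, here is a small sketch for recomputing the per-turn and overall averages from a judgment JSONL, assuming FastChat's single-answer-grading output format (one JSON object per line with `model`, `turn`, and `score` fields); the path and model name below are placeholders:

```python
import json
from collections import defaultdict


def mt_bench_scores(judgment_path: str, model_name: str):
    """Average score per turn and overall mean for one model in a judgment JSONL."""
    per_turn = defaultdict(list)
    with open(judgment_path) as f:
        for line in f:
            record = json.loads(line)
            # Skip failed judgments (assumed to be written with score -1).
            if record["model"] == model_name and record["score"] >= 0:
                per_turn[record["turn"]].append(record["score"])

    turn_means = {turn: sum(s) / len(s) for turn, s in per_turn.items()}
    all_scores = [s for scores in per_turn.values() for s in scores]
    return turn_means, sum(all_scores) / len(all_scores)


# Placeholder invocation:
# print(mt_bench_scores("gpt-4-0125-preview_single.jsonl", "openchat_3.5"))
```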

@duguodong7
Author

Thank you so much! I did make an error when setting the chat template, and the performance improved after correcting it.

@18907305772 18907305772 added the good first issue Good for newcomers label Sep 14, 2024