[bug] MT-Bench only evaluates the second turn. #1085
Unanswered
notoschord asked this question in Q&A
-
Hello, you can set … Besides, if you want to test AlpacaEval 2, just use this config, which is exactly the same as the official one: https://github.com/open-compass/opencompass/blob/main/configs/eval_subjective_alpacaeval.py. Feel free to contact me if you have any other problems with subjective evaluation for LLMs.
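For reference, one way to reuse that config from your own config file is the `read_base()` pattern used across OpenCompass configs; a minimal sketch is below, where the relative import path is an assumption and depends on where your config file lives.

```python
# Minimal sketch: pull the official AlpacaEval 2 settings into a local config.
# Assumes this file sits in the same configs/ directory as
# eval_subjective_alpacaeval.py; adjust the relative import otherwise.
from mmengine.config import read_base

with read_base():
    from .eval_subjective_alpacaeval import *  # noqa: F401,F403
```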
-
Thank you for your answer. It can now evaluate the first turn separately.
-
When MTBenchDataset is loaded, multi_turn is always True, so only the second-turn answers will be evaluated.
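A minimal sketch of how the dataset entry could expose that flag is below, assuming MTBenchDataset accepts a multi_turn keyword when loaded; the abbr, path, and name values are placeholders, so check the MT-Bench subjective config in the repo for the real keys.

```python
# Sketch only: placeholder paths and names, and the multi_turn keyword is
# assumed from the report above that it currently defaults to True.
from opencompass.datasets import MTBenchDataset

subjective_datasets = [
    dict(
        abbr='mtbench_first_turn',            # placeholder name for this variant
        type=MTBenchDataset,
        path='./data/subjective/mtbench',     # assumed data location
        name='mtbench',
        multi_turn=False,  # evaluate first-turn answers instead of only the second turn
    ),
]
```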