📑 Paper | 🤗 Leaderboard & 🤗 Dataset
Note
If your model is on HuggingFace and/or it is supported by vLLM, please create an Issue here to tell us your model id, chat template, and your preferred sampling parameters. We will add the script to run your model to the repo here and run inference and evaluation for you.
If you'd like to run inference on your model yourself, or to create a PR for adding your model here, you can follow the instructions below.
Installation
```bash
conda create -n wildbench python=3.10
conda activate wildbench
pip install vllm -U # pip install -e vllm
pip install openai datasets tenacity
# pip install google-cloud-aiplatform
pip install google-generativeai
pip install cohere mistralai
pip install anthropic==0.19.0
pip install reka-api==3.0.8
# export HF_HOME=/path/to/your/custom/cache_dir/
```
Case 1: Models supported by vLLM
You can take the files under `scripts` as a reference for adding a new model to the benchmark. For example, to add `Yi-1.5-9B-Chat` to the benchmark, follow these steps:
- Create a script named `Yi-1.5-9B-Chat.sh` under the `scripts` folder.
- Copy and paste the most similar existing script file into it, and rename the file to `[model_pretty_name].sh`.
- Change `model_name` and `model_pretty_name` to `01-ai/Yi-1.5-9B-Chat` and `Yi-1.5-9B-Chat` respectively. Make sure that `model_name` is the same as the model name on the Hugging Face model hub, and that `model_pretty_name` is the same as the script name without the `.sh` extension.
- Specify the conversation template for this model by modifying the code in `src/fastchat_conversation.py`, or set the `--use_hf_conv_template` argument if your Hugging Face model contains a conversation template (see the sketch after this list).
- Run your script to make sure it works. You can run it with `bash scripts/Yi-1.5-9B-Chat.sh` from the root folder.
- Create a PR to add your script to the benchmark.
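If your model ships a chat template with its tokenizer on the Hub, the `--use_hf_conv_template` path conceptually relies on applying that template via `transformers`. The snippet below is only an illustrative sketch of that idea, using the example model id above; it is not the repo's actual inference code:

```python
# Sketch: render a conversation with the chat template bundled in the model's tokenizer.
# Illustrates what "--use_hf_conv_template" relies on; not the repo's own code.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-9B-Chat")

messages = [
    {"role": "user", "content": "What is WildBench?"},
]

# Produces the exact prompt string the model was trained to expect.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```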
Case 2: Models that are only supported by the native HuggingFace API
Some new models may not be supported by vLLM yet. You can do the same thing as above, but use `--engine hf` in the script instead, and test your script. Note that some models may need more specific configurations, and you will need to read the code and modify it accordingly. In these cases, add name-checking conditions so that the model-specific changes are only applied to that specific model (a minimal sketch follows).
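As a minimal sketch of such a name-checking condition; the function and option names below (`adjust_generation_config`, `gen_kwargs`) are placeholders for illustration and not the repo's actual identifiers:

```python
# Hypothetical example: apply model-specific settings only when the model name matches.
def adjust_generation_config(model_name: str, gen_kwargs: dict) -> dict:
    # Only this specific model gets the extra stop token; all other models are untouched.
    if "Yi-1.5-9B-Chat" in model_name:
        gen_kwargs.setdefault("stop", []).append("<|im_end|>")
    return gen_kwargs
```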
Case 3: Private API-based Models
You should change the code to add these APIs, for example, Gemini, Cohere, Claude, and Reka. You can refer to the `--engine openai` logic in the existing scripts to add your own API-based models. Please make sure that you do not expose your API keys in the code. If your model is on the Together.AI platform, you can use the `--engine together` option to run your model; see `scripts/dbrx-instruct@together.sh` for an example.
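As a hedged illustration of keeping API keys out of the code, the sketch below reads the key from an environment variable and calls an OpenAI-compatible endpoint. The environment variable name, base URL, and model id are placeholders; this is not the repo's `--engine openai` implementation itself:

```python
# Sketch: call an OpenAI-compatible API without hard-coding the key.
# MY_API_KEY, the base_url, and the model name are placeholders for illustration.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["MY_API_KEY"],        # never commit keys; read them from the environment
    base_url="https://api.example.com/v1",   # the provider's OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="my-private-model",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
)
print(response.choices[0].message.content)
```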
Note
If you'd like to have your model results verified and published on our leaderboard, please create an Issue to tell us, and we'll do the inference and evaluation for you.
How do you evaluate the performance of LLMs on WildBench? (V2 Updates)
- Reward=100 if A is much better than B.
- Reward=50 if A is slightly better than B.
- Reward=0 if there is a tie.
- Reward=-50 if A is slightly worse than B.
- Reward=-100 if A is much worse than B.
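These per-example rewards are averaged into a model-level reward against each baseline. A minimal sketch of that aggregation, assuming judgments arrive as the five outcomes above (the label strings are illustrative; the repo's evaluator output format may differ):

```python
# Sketch: map pairwise judgment labels to rewards and average them.
# Label strings are placeholders, not the repo's exact output format.
LABEL_TO_REWARD = {
    "A_much_better": 100,
    "A_slightly_better": 50,
    "tie": 0,
    "A_slightly_worse": -50,
    "A_much_worse": -100,
}

def wb_reward(judgments: list[str]) -> float:
    """Average reward of model A over baseline B across all examples."""
    rewards = [LABEL_TO_REWARD[j] for j in judgments]
    return sum(rewards) / len(rewards)

print(wb_reward(["A_much_better", "tie", "A_slightly_worse"]))  # ~16.67
```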
We suggest using OpenAI's Batch Mode for evaluation, which is faster, cheaper, and more reliable.
You can:
1. Run `bash evaluation/run_all_eval_batch.sh ${MODEL_PRETTY_NAME}` to submit the eval jobs. Or, if you only want to do scoring, run `bash evaluation/run_score_eval_batch.sh` to submit the eval jobs for the WB Score only (about $5 per model).
2. Run `python src/openai_batch_eval/check_batch_status_with_model_name.py ${MODEL_PRETTY_NAME}` to track the status of the batch jobs (see the sketch after this list).
3. Step 2 will download the results when the batch jobs are finished, and then you can view the results (see the next section).
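For reference, checking batch status through the OpenAI SDK looks roughly like the sketch below. The repo's `check_batch_status_with_model_name.py` additionally matches batches to your model and downloads the outputs, which this sketch does not do:

```python
# Sketch: list recent OpenAI batch jobs and print their status.
# A rough illustration of what the repo's status-check script builds on.
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY from the environment

for batch in client.batches.list(limit=10):
    # status is e.g. "validating", "in_progress", "completed", or "failed"
    print(batch.id, batch.status)
```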
Remarks
- `${MODEL_PRETTY_NAME}` should be the same as the script name without the `.sh` extension.
- You can also track the progress of your batch jobs at https://platform.openai.com/batches. The maximum turnaround time is 24 hours, but it is usually much faster depending on the queue and rate limits.
- If you'd like more control over the evaluation methods, the detailed steps are illustrated in EVAL.md.
When Step 3 in the above section is finished, you can view the results by running the following commands:
- WB Score: `python src/view_wb_eval.py score`
- WB Reward on GPT-4-turbo: `python src/view_wb_eval.py pairwise-gpt4t 500`
- WB Reward on Claude-3-Haiku: `python src/view_wb_eval.py pairwise-haiku 500`
- WB Reward on Llama-2-70b-chat: `python src/view_wb_eval.py pairwise-llama 500`
The 2nd argument is `K`, the length margin for the length penalty. You can set it to -1 or leave it empty to disable the length penalty.
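As a hedged illustration only: one way a length margin `K` can be used is to count a win as a tie when the winning response is longer than the other by more than `K` characters. The exact rule used by WildBench is defined in the repo's scoring code; the sketch below just demonstrates the general idea:

```python
# Illustrative sketch of a length-margin adjustment; not the repo's exact rule.
def adjusted_reward(reward: int, len_a: int, len_b: int, k: int) -> int:
    """Convert a win into a tie when the winner is longer by more than k characters."""
    if k < 0:
        return reward  # K = -1 disables the length penalty
    if reward > 0 and len_a - len_b > k:
        return 0
    if reward < 0 and len_b - len_a > k:
        return 0
    return reward
```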
To analyze the correlation between WildBench (v2) and human evaluation, we consider the correlation between different metrics and human-based Chatbot Arena Elo scores (as of 2024-05-20, on the Hard-English split). We find that WB Reward-Mix has the highest correlation. Please find the Pearson correlation coefficients below:
- Top Models:
['gpt-4-turbo-2024-04-09', 'claude-3-opus-20240229', 'Meta-Llama-3-70B-Instruct', 'claude-3-sonnet-20240229', 'mistral-large-2402', 'Meta-Llama-3-8B-Instruct']
- All Models:
['gpt-4-turbo-2024-04-09', 'claude-3-opus-20240229', 'Meta-Llama-3-70B-Instruct', 'Qwen1.5-72B-Chat', 'claude-3-sonnet-20240229', 'mistral-large-2402', 'dbrx-instruct@together', 'Mixtral-8x7B-Instruct-v0.1', 'Meta-Llama-3-8B-Instruct', 'tulu-2-dpo-70b', 'Llama-2-70b-chat-hf', 'Llama-2-7b-chat-hf', 'gemma-7b-it', 'gemma-2b-it']
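If you want to reproduce such an analysis on your own numbers, a Pearson correlation between a benchmark metric and Arena Elo can be computed as in the sketch below. The scores here are made-up placeholders, not the actual values behind the reported coefficients:

```python
# Sketch: Pearson correlation between a benchmark metric and Arena Elo.
# The numbers below are placeholders for illustration only.
from scipy.stats import pearsonr

wb_reward_mix = {"model-a": 45.2, "model-b": 20.1, "model-c": -5.3}
arena_elo = {"model-a": 1250, "model-b": 1180, "model-c": 1100}

models = sorted(wb_reward_mix)
r, p_value = pearsonr(
    [wb_reward_mix[m] for m in models],
    [arena_elo[m] for m in models],
)
print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")
```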
Recently added or requested models:
- m-a-p/neo_7b_instruct_v0.1
- GLM-4
- Reka Flash
- DeepSeekV2-Chat
- Reka Core
- Yi-Large (via OpenAI-like APIs)
- ZhangShenao/SELM-Llama-3-8B-Instruct-iter-3
- chujiezheng/Llama-3-Instruct-8B-SimPO-ExPO
- chujiezheng/Starling-LM-7B-beta-ExPO
- Gemini 1.5 series
- Qwen2-72B-Instruct
- ZhangShenao/SELM-Zephyr-7B-iter-3
- NousResearch/Hermes-2-Theta-Llama-3-8B
- princeton-nlp/Llama-3-Instruct-8B-SimPO
- Command-R-plus
- Phi-3 series
Create an Issue if you'd like to add a model that you want to see on our leaderboard!
- Support models via OpenAI-style APIs
- Show task-categorized results
Citation
```bibtex
@misc{wildbench2024,
  title  = {WildBench: Benchmarking Language Models with Challenging Tasks from Real Users in the Wild},
  author = {Bill Yuchen Lin and Khyathi Chandu and Faeze Brahman and Yuntian Deng and Abhilasha Ravichander and Valentina Pyatkin and Ronan Le Bras and Yejin Choi},
  year   = {2024},
  url    = {https://huggingface.co/spaces/allenai/WildBench}
}
```