Add instructions for evaluating on MT bench using vLLM #2770

Merged on Dec 16, 2023 (1 commit)
42 changes: 32 additions & 10 deletions fastchat/llm_judge/README.md
@@ -40,26 +40,48 @@ You can use this QA browser to view the answers generated by you later.
### Evaluate a model on MT-bench

#### Step 1. Generate model answers to MT-bench questions

To generate model answers, you can either use [vLLM](https://github.com/vllm-project/vllm) via a FastChat server (recommended) or Hugging Face.

##### Using vLLM (recommended):

1. Launch a vLLM worker
```
# Run each command in its own terminal; all three are long-running processes.
python3 -m fastchat.serve.controller
python3 -m fastchat.serve.vllm_worker --model-path [MODEL-PATH]
python3 -m fastchat.serve.openai_api_server --host localhost --port 8000
```
- Arguments:
  - `[MODEL-PATH]` is the path to the weights, which can be a local folder or a Hugging Face repo ID.

2. Generate the answers
```
python gen_api_answer.py --model [MODEL-NAME] --openai-api-base http://localhost:8000/v1 --parallel 50
```
- Arguments:
  - `[MODEL-NAME]` is the name of the model served by the worker launched in step 1.
  - `--parallel` is the number of concurrent API calls to the vLLM worker.
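
If you are unsure which name to pass as `[MODEL-NAME]`, one option is to ask the API server which models it is serving. The sketch below assumes the server from step 1 exposes an OpenAI-style `/models` endpoint under the `/v1` base and returns the usual `{"data": [{"id": ...}]}` payload. The helper names are illustrative and are not part of FastChat.

```python
import json
import urllib.request


def models_url(api_base: str) -> str:
    """Build the model-listing URL from an API base such as http://localhost:8000/v1."""
    return api_base.rstrip("/") + "/models"


def parse_model_names(payload: dict) -> list:
    """Extract model names, assuming an OpenAI-style {"data": [{"id": ...}]} payload."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_model_names(api_base: str, timeout: float = 10.0) -> list:
    """Query the running server (requires the servers from step 1 to be up)."""
    with urllib.request.urlopen(models_url(api_base), timeout=timeout) as resp:
        return parse_model_names(json.load(resp))
```

With the servers running, `list_model_names("http://localhost:8000/v1")` should return the names that can be passed to `--model`.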

##### Using Hugging Face:
1. Generate the answers
```
python gen_model_answer.py --model-path [MODEL-PATH] --model-id [MODEL-ID]
```
- Arguments:
  - `[MODEL-PATH]` is the path to the weights, which can be a local folder or a Hugging Face repo ID.
  - `[MODEL-ID]` is a name you give to the model.

- You can also specify `--num-gpus-per-model` for model parallelism (needed for large 65B models) and `--num-gpus-total` to parallelize answer generation across multiple GPUs.

- Example: `python gen_model_answer.py --model-path lmsys/vicuna-7b-v1.5 --model-id vicuna-7b-v1.5`

The answers will be saved to `data/mt_bench/model_answer/[MODEL-ID].jsonl` when using Hugging Face, or to `data/mt_bench/model_answer/[MODEL-NAME].jsonl` when using vLLM.
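
A quick sanity check before judging is to confirm that the answer file parses and covers the questions you expect. This is a minimal sketch: it assumes each line of the `.jsonl` file is a JSON object with a `question_id` field, which is an assumption about the file layout, and the helper names are hypothetical.

```python
import json


def parse_jsonl(lines):
    """Parse JSON Lines: one JSON object per non-empty line."""
    records = []
    for line in lines:
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


def load_answers(path):
    """Read an answer file from disk and return its records."""
    with open(path, encoding="utf-8") as f:
        return parse_jsonl(f)


def answered_question_ids(records):
    """Return the sorted, de-duplicated question IDs (assumes a `question_id` key)."""
    return sorted({record["question_id"] for record in records})
```

For instance, `answered_question_ids(load_answers("data/mt_bench/model_answer/vicuna-7b-v1.5.jsonl"))` lists the questions that already have answers.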

To make sure FastChat loads the correct prompt template, see the supported models and how to add a new model [here](../../docs/model_support.md#how-to-support-a-new-model).

#### Step 2. Generate GPT-4 judgments
There are several ways to use GPT-4 as a judge, such as pairwise win rate and single-answer grading.
In MT-bench, we recommend single-answer grading as the default mode.
This mode asks GPT-4 to grade the model's answer directly and assign it a score, without pairwise comparison.
For each turn, GPT-4 gives a score on a scale of 1 to 10. We then compute the average score over all turns.
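
As a worked example of the averaging step, the sketch below takes per-turn scores and computes the overall mean. The `(question_id, turn, score)` tuple layout and the sample values are made up for illustration and do not reflect the actual judgment file format.

```python
from statistics import mean


def average_score(judgments):
    """Average the score over every graded turn.

    `judgments` is an iterable of (question_id, turn, score) tuples.
    """
    scores = [score for _question_id, _turn, score in judgments]
    if not scores:
        raise ValueError("no judgments to average")
    return mean(scores)


# Two questions, two turns each, with illustrative scores.
sample = [(81, 1, 9), (81, 2, 7), (82, 1, 8), (82, 2, 6)]
print(average_score(sample))  # (9 + 7 + 8 + 6) / 4 = 7.5
```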
