diff --git a/fastchat/llm_judge/README.md b/fastchat/llm_judge/README.md
index 1d2646b13..5b97800c2 100644
--- a/fastchat/llm_judge/README.md
+++ b/fastchat/llm_judge/README.md
@@ -40,26 +40,48 @@ You can use this QA browser to view the answers generated by you later.
 
 ### Evaluate a model on MT-bench
 
 #### Step 1. Generate model answers to MT-bench questions
+
+To generate model answers, you can either use [vLLM](https://github.com/vllm-project/vllm) via a FastChat server (recommended) or Hugging Face.
+
+##### Using vLLM (recommended):
+
+1. Launch the FastChat controller, a vLLM worker, and the OpenAI-compatible API server
 ```
-python gen_model_answer.py --model-path [MODEL-PATH] --model-id [MODEL-ID]
+python3 -m fastchat.serve.controller
+python3 -m fastchat.serve.vllm_worker --model-path [MODEL-PATH]
+python3 -m fastchat.serve.openai_api_server --host localhost --port 8000
 ```
-Arguments:
-  - `[MODEL-PATH]` is the path to the weights, which can be a local folder or a Hugging Face repo ID.
-  - `[MODEL-ID]` is a name you give to the model.
+   - Arguments:
+     - `[MODEL-PATH]` is the path to the weights, which can be a local folder or a Hugging Face repo ID.
 
-e.g.,
+2. Generate the answers
 ```
-python gen_model_answer.py --model-path lmsys/vicuna-7b-v1.5 --model-id vicuna-7b-v1.5
+python gen_api_answer.py --model [MODEL-NAME] --openai-api-base http://localhost:8000/v1 --parallel 50
 ```
-The answers will be saved to `data/mt_bench/model_answer/[MODEL-ID].jsonl`.
+   - Arguments:
+     - `[MODEL-NAME]` is the name of the model served by the vLLM worker launched in step 1.
+     - `--parallel` is the number of concurrent API calls to the vLLM worker.
 
-To make sure FastChat loads the correct prompt template, see the supported models and how to add a new model [here](../../docs/model_support.md#how-to-support-a-new-model).
+##### Using Hugging Face:
+1. Generate the answers
+```
+python gen_model_answer.py --model-path [MODEL-PATH] --model-id [MODEL-ID]
+```
+- Arguments:
+  - `[MODEL-PATH]` is the path to the weights, which can be a local folder or a Hugging Face repo ID.
+  - `[MODEL-ID]` is a name you give to the model.
+
+- You can also specify `--num-gpus-per-model` for model parallelism (needed for large 65B models) and `--num-gpus-total` to parallelize answer generation with multiple GPUs.
 
-You can also specify `--num-gpus-per-model` for model parallelism (needed for large 65B models) and `--num-gpus-total` to parallelize answer generation with multiple GPUs.
+- e.g., `python gen_model_answer.py --model-path lmsys/vicuna-7b-v1.5 --model-id vicuna-7b-v1.5`
+
+The answers will be saved to `data/mt_bench/model_answer/[MODEL-ID].jsonl` (or `[MODEL-NAME].jsonl` when using `gen_api_answer.py`).
+
+To make sure FastChat loads the correct prompt template, see the supported models and how to add a new model [here](../../docs/model_support.md#how-to-support-a-new-model).
 
 #### Step 2. Generate GPT-4 judgments
 There are several options to use GPT-4 as a judge, such as pairwise winrate and single-answer grading.
-In MT-bench, we recommond single-answer grading as the default mode.
+In MT-bench, we recommend single-answer grading as the default mode.
 This mode asks GPT-4 to grade and give a score to model's answer directly without pairwise comparison.
 For each turn, GPT-4 will give a score on a scale of 10. We then compute the average score on all turns.
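
Before kicking off the full parallel run with `gen_api_answer.py`, it can help to confirm the OpenAI-compatible endpoint is actually serving the model. The sketch below is illustrative only, not part of the patch: it assumes the controller, vLLM worker, and API server from the vLLM instructions above are running on `localhost:8000`, and `vicuna-7b-v1.5` is a placeholder for whatever `/v1/models` reports.

```python
# Illustrative sanity check for the FastChat OpenAI-compatible server (not part of the patch).
# Assumes the controller, vllm_worker, and openai_api_server from the vLLM instructions
# are running, and that MODEL_NAME matches the model the worker registered.
import requests

API_BASE = "http://localhost:8000/v1"
MODEL_NAME = "vicuna-7b-v1.5"  # placeholder; use the id reported by /v1/models

# List the models the server is currently serving.
models = requests.get(f"{API_BASE}/models").json()
print([m["id"] for m in models["data"]])

# Send one chat completion, roughly what gen_api_answer.py does per question turn.
resp = requests.post(
    f"{API_BASE}/chat/completions",
    json={
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "temperature": 0.7,
        "max_tokens": 64,
    },
    timeout=120,
).json()
print(resp["choices"][0]["message"]["content"])
```

If both calls succeed, `gen_api_answer.py --model [MODEL-NAME] --openai-api-base http://localhost:8000/v1` should work against the same endpoint.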
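
On Step 2's scoring: each judged turn gets a score out of 10, and the reported MT-bench score is simply the mean over all judged turns (e.g., a model scoring 8 on every first turn and 6 on every second turn ends up at 7.0). Below is a minimal aggregation sketch; the file path and the `model`/`score` field names are assumptions about the single-answer-grading judgment JSONL, so check the file Step 2 actually produces.

```python
# Illustrative aggregation of single-answer grading results into an MT-bench score.
# The path and the "model"/"score" keys are assumptions about the judgment JSONL
# produced in Step 2; verify them against the actual file.
import json
from collections import defaultdict

scores = defaultdict(list)
with open("data/mt_bench/model_judgment/gpt-4_single.jsonl") as f:  # assumed output path
    for line in f:
        record = json.loads(line)
        if record.get("score", -1) >= 0:  # skip records without a usable score
            scores[record["model"]].append(record["score"])

# The MT-bench score is the average of per-turn scores on a 1-10 scale.
for model, turn_scores in sorted(scores.items()):
    print(f"{model}: {sum(turn_scores) / len(turn_scores):.2f}")
```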