Paper link: https://arxiv.org/abs/2406.14952
This is the official repository of ESC-Eval, which includes the datasets and models used in the ESC-Eval paper. The paper proposes a method for evaluating ESC models with a role-playing model; the overall process is illustrated in the figure below.
- Uploaded middle-quality character cards.
- ./data: role_cards data used in the paper.
- ./ESC-Role: our trained role-playing agent, which performs better than GPT-4 at role-playing a person in distress.
- ./ESC-RANK: our trained scorer, which scores dialogue data along 7 well-designed dimensions.
- ./result: some examples of multi-turn conversations.
- ./score: some examples of scoring results.
- ./evaluate.py: script to collect multi-turn dialogues from the ESC model.
- ./score.py: script to score each dimension of the multi-turn dialogues.
- Download ESC-Role and place it in the './ESC-Role' folder.
- Wrap your LLM-based ESC model in the format below (examples for Llama 3 and Qwen1.5 are also listed in evaluate.py):
from transformers import AutoTokenizer, AutoModelForCausalLM

class YourModel():
    def __init__(self):
        self.tokenizer = AutoTokenizer.from_pretrained("model_dir")
        self.model = AutoModelForCausalLM.from_pretrained("model_dir", torch_dtype="auto", device_map="auto").eval()

    def __call__(self, message) -> str:
        # replace .chat() with your model's own generation interface
        response = self.model.chat(message)
        return response
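The wrapper's contract is simply: take a chat-format message list in, return a reply string. A minimal sketch with a stub backend is shown below; `EchoModel` and the exact message fields are illustrative assumptions, not the repository's API, so no model download is needed to try it:

```python
# Sketch of the wrapper contract used by evaluate.py: __call__ receives
# a chat-format message list and must return a string reply.
# EchoModel is a hypothetical stand-in for a real HuggingFace model.

class EchoModel:
    """Stub backend that echoes the last user turn."""
    def chat(self, messages):
        return "echo: " + messages[-1]["content"]

class YourModel:
    def __init__(self):
        # in practice: AutoModelForCausalLM.from_pretrained(...)
        self.model = EchoModel()

    def __call__(self, messages) -> str:
        response = self.model.chat(messages)
        return response

model = YourModel()
reply = model([
    {"role": "system", "content": "You are an emotional support assistant."},
    {"role": "user", "content": "I feel anxious about my exams."},
])
print(reply)  # → echo: I feel anxious about my exams.
```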
- Run evaluate.py to collect multi-turn dialogue data, for example:
python evaluate.py -ef ./data/test_zh.json -rf ./result/ -lang zh
python evaluate.py -ef ./data/test_en.json -rf ./result/ -lang en
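After a run, the result folder holds the collected dialogues. A quick sanity check is sketched below; the JSON layout (a list of role/content turns) is an assumption for illustration — consult the examples in ./result/ for the actual fields:

```python
# Sanity-check a multi-turn dialogue file such as those written by
# evaluate.py. The file layout here is hypothetical: a JSON list of
# {"role", "content"} turns. Adjust field names to match ./result/.
import json, os, tempfile

sample = [
    {"role": "user", "content": "I just lost my job and feel hopeless."},
    {"role": "assistant", "content": "I'm sorry to hear that. Do you want to talk about it?"},
]

path = os.path.join(tempfile.mkdtemp(), "dialogue.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(sample, f, ensure_ascii=False)

with open(path, encoding="utf-8") as f:
    turns = json.load(f)

# Every turn should have a valid role and non-empty content.
assert all(t["role"] in {"user", "assistant"} and t["content"] for t in turns)
print(f"{len(turns)} turns loaded")
```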
- Download ESC-RANK into the ESC-RANK folder, and set the path to the InternLM2-Chat folder in score.py.
- Run score.py to score your interactive data with ESC-RANK:
python score.py
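Once score.py has produced per-dimension scores, they can be averaged across dialogues to compare models. The sketch below assumes a hypothetical output layout — the "dimension_scores" field and the dim_* keys are placeholders; see the examples in ./score/ for the real field names:

```python
# Aggregate per-dimension scores (as produced by score.py) into an
# average per dimension across all scored dialogues.
# Field names below are placeholders, not the repository's actual schema.
from collections import defaultdict
from statistics import mean

scored = [
    {"dimension_scores": {"dim_1": 3, "dim_2": 4}},
    {"dimension_scores": {"dim_1": 5, "dim_2": 2}},
]

totals = defaultdict(list)
for item in scored:
    for dim, score in item["dimension_scores"].items():
        totals[dim].append(score)

# Mean score per dimension over all dialogues.
averages = {dim: mean(values) for dim, values in totals.items()}
print(averages)
```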
Models
ESC-Role is a role-playing model specific to ESC evaluation, which can be downloaded from: https://huggingface.co/haidequanbu/ESC-Role
ESC-RANK is our trained scoring model for ESC evaluation, which can be downloaded from: https://huggingface.co/haidequanbu/ESC-RANK
Scoring performance
Our paper is coming soon.