Hi,
I found that some answers with a higher overall_score have a lower helpfulness score in the evol_instruct.jsonl dataset, whose principle is 100% helpfulness.
For example, the scores of the 9th sample in evol_instruct.jsonl are as follows:
models | helpfulness | honesty | instruction following | truthfulness | overall score |
---|---|---|---|---|---|
gpt-3.5-turbo | 4 | 5 | 4 | 5 | 7 |
llama-2-70b-chat | 4 | 4 | 5 | 5 | 7.5 |
mpt-30b-chat | 3 | 4 | 3 | 5 | 6.5 |
vicuna-33b | 5 | 4 | 4 | 5 | 6.5 |
The answer from vicuna-33b has the highest helpfulness score but is tied with mpt-30b-chat for the lowest overall score.
My question is: should I pick the answer with the highest overall score or the one with the highest helpfulness score as the preferred answer, or should I use the mean of the four principle scores?
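To make the three options concrete, here is a minimal Python sketch that applies each selection rule to the scores from the table above. The dictionary keys and the `top_models` helper are illustrative names I made up for this example, not fields from the dataset, and ties are returned as a list rather than broken arbitrarily.

```python
# Scores copied from the table above (9th sample of evol_instruct.jsonl).
# Key names are hypothetical; the real dataset fields may differ.
scores = {
    "gpt-3.5-turbo":    {"helpfulness": 4, "honesty": 5, "instruction_following": 4, "truthfulness": 5, "overall": 7.0},
    "llama-2-70b-chat": {"helpfulness": 4, "honesty": 4, "instruction_following": 5, "truthfulness": 5, "overall": 7.5},
    "mpt-30b-chat":     {"helpfulness": 3, "honesty": 4, "instruction_following": 3, "truthfulness": 5, "overall": 6.5},
    "vicuna-33b":       {"helpfulness": 5, "honesty": 4, "instruction_following": 4, "truthfulness": 5, "overall": 6.5},
}

ASPECTS = ("helpfulness", "honesty", "instruction_following", "truthfulness")


def top_models(key_fn):
    """Return every model tied for the best value of key_fn."""
    best = max(key_fn(s) for s in scores.values())
    return [model for model, s in scores.items() if key_fn(s) == best]


# Option 1: highest overall score.
by_overall = top_models(lambda s: s["overall"])
# Option 2: highest helpfulness score.
by_helpfulness = top_models(lambda s: s["helpfulness"])
# Option 3: highest mean of the four principle scores.
by_mean = top_models(lambda s: sum(s[a] for a in ASPECTS) / len(ASPECTS))

print("overall:", by_overall)
print("helpfulness:", by_helpfulness)
print("mean of four:", by_mean)
```

On this sample the three rules disagree: the overall score picks llama-2-70b-chat, helpfulness picks vicuna-33b, and the mean of the four principles is a three-way tie at 4.5 between gpt-3.5-turbo, llama-2-70b-chat and vicuna-33b.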
Any suggestions would be appreciated, thanks.