
Perplexity: Compute scores correlated to HellaSwag #2312

Merged: 6 commits from klosax:perplexity-lines into ggml-org:master on Jul 22, 2023

Conversation

@klosax (Contributor) commented on Jul 21, 2023

This PR adds a --perplexity-lines parameter to the perplexity tool. In this mode, perplexity is calculated over each line of the prompt instead of over each context window.


The HellaSwag score is a good way to measure how well a model understands the English language.

Make two runs on a model: one prompted with a file containing "correct" sentences (one per line) and another with a file containing "wrong" sentences. The perplexities measured from the two files can be combined into a score that is linearly correlated with the HellaSwag score.

- `ppl_correct` = cumulative perplexity over the lines of hellaswag_val_correct.txt (lower is better).
- `ppl_wrong` = cumulative perplexity over the lines of hellaswag_val_wrong.txt (higher is better).

The formula `(ppl_wrong - ppl_correct) / ppl_correct` correlates linearly with HellaSwag scores on the Open LLM Leaderboard.
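As a sketch, the score can be computed from the two measured perplexities like this (the function name and the ×100 scaling are illustrative choices, matching the "formula x 100" column in the tables below, not part of the tool itself):

```python
def hellaswag_corr_score(ppl_wrong: float, ppl_correct: float) -> float:
    """Relative perplexity gap, scaled by 100.

    Higher is better: the model should be more "surprised" by the
    wrong sentences than by the correct ones.
    """
    return (ppl_wrong - ppl_correct) / ppl_correct * 100

# Example with the Open LLaMA 3B F16 numbers (200 lines):
# hellaswag_corr_score(24.6445, 16.0094) is roughly 53.94
```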

Test files: klosax/ppl_hellaswag.

Open LLaMA 3B

**200 lines**

| Quant | ppl_wrong | ppl_correct | formula × 100 |
|-------|-----------|-------------|---------------|
| F16   | 24.6445   | 16.0094     | 53.937455     |
| Q8_0  | 24.6335   | 16.0000     | 53.959752     |
| Q5_1  | 24.9139   | 16.2154     | 53.643474     |
| Q4_0  | 25.4092   | 16.4574     | 54.393903     |

**400 lines**

| Quant | ppl_wrong | ppl_correct | formula × 100 |
|-------|-----------|-------------|---------------|
| F16   | 23.7929   | 16.3507     | 45.515894     |
| Q8_0  | 23.7787   | 16.3446     | 45.483392     |
| Q5_1  | 24.0065   | 16.5328     | 45.205637     |
| Q4_0  | 24.4787   | 16.8715     | 45.088413     |
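For reference, a minimal sketch of what the perplexity of a single line means, using the standard definition (this illustrates the concept only; how the PR accumulates per-line values into the cumulative figure is not specified here and is an assumption):

```python
import math

def line_perplexity(token_logprobs: list[float]) -> float:
    # Standard perplexity of one line: the exponential of the negative
    # mean log-probability of its tokens under the model.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A line whose tokens each have probability 0.5 has perplexity 2:
# line_perplexity([math.log(0.5)] * 4) == 2.0
```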

@ggerganov added the labels high priority (Very important issue) and generation quality (Quality of model output) on Jul 22, 2023
@ggerganov (Member) left a comment

Thanks for adding this.

It would be interesting to see the HellaSwag numbers for the different quantizations we have and how they compare to F16.
We can merge and do the evaluations later, or you can post some numbers here in the PR.

@klosax klosax merged commit b5fe67f into ggml-org:master Jul 22, 2023
@klosax klosax deleted the perplexity-lines branch July 22, 2023 13:05
@klosax klosax mentioned this pull request Jul 25, 2023