Perplexity: Compute scores correlated to HellaSwag #2312
Merged
This PR adds a `--perplexity-lines` parameter to the perplexity tool. In this mode the perplexity is calculated over each line of the prompt instead of over each context window.

HellaSwag scores are a great way to measure how much of the English language a model understands.
Make two runs on a model: one prompted with a file containing "correct" sentences (one per line), and another with a file containing "wrong" sentences. The perplexities measured from the two files can then be combined into a score that is linearly correlated with the HellaSwag score.
- `ppl_correct` = cumulative perplexity on each line of `hellaswag_val_correct.txt`; lower values are better.
- `ppl_wrong` = cumulative perplexity on each line of `hellaswag_val_wrong.txt`; higher values are better.

The formula `(ppl_wrong - ppl_correct) / ppl_correct` correlates linearly with HellaSwag scores on the Open LLM Leaderboard.

Test files: klosax/ppl_hellaswag.
Open LLaMA 3B