perplexity: more statistics, added documentation #6936
Conversation
It doesn't seem like Q4_K_M shows as substantial a decrease in quality as Q4_0 does: Mean PPL: L3 8b 7.2904. It's now primarily up to other downstream projects to refrain from hosting or using the quantized Q4_0 LLaMA 3 8b model and to use the best quality quantizations instead. To discourage the use of this quantization: would it be appropriate to update the quantize binary with the mean perplexity for LLaMA 3 only? This is what a large percentage of people use it for. |
I added a LLaMA 3 8b scoreboard to the perplexity README: Edit: this is pre BPE tokenizer fix
|
The difference is massive and to me it looks like something is off. Is that perhaps caused by #6920 or other problems with Llama3 tokenizer and quantization? |
My understanding is that the tokenizer issues are related to special tokens which should not be appearing in Wikitext. So I don't think that this is the problem. |
I see, my understanding was that it is a broad issue affecting overall performance, which can be observed for example as errors in mathematics that are not present in FP or correctly quantized models. But perhaps I was wrong and you are right; I guess such a broad issue with the tokenizer would probably break the model altogether and produce incoherent, unusable output. Output from the quants is overall very usable. |
It's a pre-tokenization issue, so it should affect everything. Originally it was discovered because there were issues with addition, see #6914. I'm not sure if it makes sense to compare the same model with different tokenizers, but here it goes.
Comparing Q4_K_M before and after tokenizer changes: Mean ΔPPL: -0.6616
Comparing Q8_0 vs. Q4_K_M after tokenizer changes: Mean ΔPPL: 0.1054 |
If these results can be improved with tokenizer fixes that's great. I'll redo the table. |
(branch updated from 0b8d056 to 9912fc8)
@JohannesGaessler can you include the PPL for FP16 Llama 2? It seems like a valuable comparison. Llama 2 vs. Llama 3 is fine, but we care more about the effect of quantization on Llama 2 vs. the effect of quantization on Llama 3, right? |
It's implicitly already in the table since it includes the delta to FP16. |
Oh yeah, okay, fair, I didn't notice the delta. Am I correct then in concluding that, besides Q2, it all seems pretty reasonable in terms of deltas? |
Almost all of these metrics are not directly comparable since the tokenizer has changed. But based on the results I have so far I would say that the biggest difference is for the small quants that have always been problematic. |
(branch updated from 9912fc8 to 281a2d8)
I might be blind, but what was the chunk size and context in that perplexity test? |
I updated the tables with the values post BPE tokenizer fix. The absolute values change, but because both FP16 and the quantized models now do better, the relative performance has stayed largely the same.
I am using the llama.cpp defaults unless noted otherwise, so the chunk size is 512 tokens. |
I see, thank you for the information. I was perplexed (heh) at first, because I got significantly higher results, close to what Galunid was showing. But then I remembered I was using an instruct model, not the base model. Over the full 546 chunks I was getting PPL = 8.6246 +/- 0.06425 with the default settings (ctx 512) for q4_k_s and an imatrix. I guess that's fine then. |
Looks good to me. I am not sure that keeping the pre-BPE-tokenizer data in the README is very useful, though; I think in the long run it will add to the confusion for people out of the loop.
Just a word of caution: comparing perplexities across models with different token dictionary sizes is meaningless because a bigger dictionary means each token is "harder" to predict, resulting in a naturally higher PPL. Also since the total amount of tokens for a given prompt is also different between Llama 2 and 3, the running average PPL is meaningless too. |
I definitely agree when it comes to direct LLaMA 2 vs. LLaMA 3 comparisons. I think you can still extract some useful information by comparing how the metrics change towards smaller quant formats, though; I think it's undeniable that, for example, the q2_K quality loss is much more severe for LLaMA 3 than for LLaMA 2 (based on the token probability metrics). |
* perplexity: more statistics, added documentation
* add LLaMA 3 8b scoreboard
@JohannesGaessler May I ask whether I am doing this correctly to get the PPL results?
It seems I can't quite reproduce the results in your table. For Meta-Llama-3-8B.Q4_0.gguf, I get the following result: Hope to get some guidance and help! Thanks! |
The tables don't have any results for LLaMA 2 7b q4_0, only results for LLaMA 2 7b q4_K_M and LLaMA 3 8b q4_0. |
Yeah, I know, I'm comparing against the results for LLaMA 3 8b q4_0 in my previous comment:
Not sure why the latest llama.cpp strangely gets an even higher PPL, and my results don't match yours. Hope to get some help in reproducing your official results! Thanks! |
Okay sorry, I misread your previous post. I thought you had only posted a single link, namely the one to the LLaMA 2 repository, and thought that meant you are using only LLaMA 2 models. Looking at your post in more detail, the number of chunks for your calculation is different from mine. When I run LLaMA 3 on the Wikitext-2 test set I get 564 chunks, not 635. So this implies that the input data you use is different. I obtained my copy of Wikitext-2 via scripts/get-wikitext-2.sh. Also, what hardware and backend are you using? That can also make a (small) difference. I'll download the specific models that you linked and test them myself. |
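For anyone trying to reproduce these numbers, here is a minimal sketch of the setup described above (the binary name, model filename, and output paths are placeholders, assuming a standard local build):

```sh
# Fetch the Wikitext-2 test set that is used by convention for llama.cpp perplexity runs.
./scripts/get-wikitext-2.sh

# Run the perplexity example over the full test set; the chunk/context size defaults to 512 tokens.
./perplexity -m models/Meta-Llama-3-8B.Q4_0.gguf -f wikitext-2-raw/wiki.test.raw
```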
Hi @JohannesGaessler. If I use yesterday's latest llama.cpp, I get 635 chunks and an even higher PPL; I'm not sure if this is due to the change in BPE processing or something else? I downloaded the data manually from the same link in the script, so it should be the same as yours. I run it on CPUs, both a desktop Core CPU and a server Xeon CPU; the result is slightly different on different CPUs but still quite similar. Thanks for your help! |
Can you do a git bisect to pinpoint the commit that is causing the problem? |
Actually, when I downloaded and ran the v1 model myself using the latest llama.cpp commit I got this warning:
So presumably this is a known issue and I don't need you to do a git bisect after all. |
I checked out this commit: e2764cd (the last commit on April 26th) and get the result of 564 chunks. But it seems both of my results, with v1 and v2, differ quite a lot from your result in the table. Is there anything I can do to check my results? |
Don't use this model - it's outdated as the warning indicates. Just convert it yourself |
I can confirm that I get 635 chunks (and worse perplexity) with both linked models along with the warning so this is not an issue with the code but with those specific model files. |
I wrote in the earlier comment that, when using the latest llama.cpp, I also tried to convert Llama-3-8B to Q4_0 manually. I saw in the README it says
Am I doing something wrong in the model conversion? Thanks for the help! |
Check the console output, there should be a big warning about a hash not matching. Ever since the tokenizer changes I have had the issue of the hash used for determining the BPE pre-tokenizer not matching, see #6920 (comment). I was told it's a version issue, and since the conversion works if I just replace the hash in the Python script I didn't bother looking further into this. I think I saw someone on Reddit or somewhere say that there are issues with the Python script getting confused by the file |
Cool, I took a close look at the discussions in #6920, updated the config files of Llama 3, and then used convert-hf-to-gguf.py to convert manually; eventually I got similar results to yours: Thanks so much for your patience! Really helps me a lot! |
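(A rough sketch of the conversion workflow that ended up working here, with placeholder paths; the exact flags and output names are assumptions based on common llama.cpp usage rather than taken from this thread.)

```sh
# Convert the Hugging Face checkpoint to an FP16 GGUF file (placeholder paths).
python convert-hf-to-gguf.py /path/to/Meta-Llama-3-8B --outtype f16 --outfile llama-3-8b-f16.gguf

# Quantize the FP16 GGUF file to Q4_0 with the quantize binary.
./quantize llama-3-8b-f16.gguf llama-3-8b-q4_0.gguf Q4_0
```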
I have seen subjective reports about quantization being more harmful for LLaMA 3 than for LLaMA 2. I decided to investigate this and to this end have added more statistics (and documentation) to `perplexity`. My findings align with the subjective reports.

LLaMA 2 vs. LLaMA 3 comparison
LLaMA 3 quantization scoreboard
Results are sorted by Kullback-Leibler divergence relative to FP16.
The "WT" importance matrices were created using varying numbers of Wikitext tokens and can be found here.
There seems to be no consistent improvement from using more Wikitext tokens for the importance matrix.
K-quants score better on mean Δp relative to the legacy quants than e.g. the KL divergence would suggest.
Pre BPE tokenizer fix
Metric explanation
The `perplexity` example can be used to calculate the so-called perplexity value of a language model over a given text corpus. Perplexity measures how well the model can predict the next token, with lower values being better. Note that perplexity is **not** directly comparable between models, especially if they use different tokenizers. Also note that finetunes typically result in a higher perplexity value even though the human-rated quality of outputs increases.

Within llama.cpp the perplexity of base models is used primarily to judge the quality loss from e.g. quantized models vs. FP16.
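For reference, the perplexity value being reported is the standard definition over the $N$ evaluated tokens (the textbook formula, not quoted from the llama.cpp source):

$$\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log p\left(t_i \mid t_{<i}\right)\right)$$

where $p(t_i \mid t_{<i})$ is the probability the model assigns to the correct token $t_i$ given its context.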
The convention among contributors is to use the Wikitext-2 test set for testing unless noted otherwise (it can be obtained with `scripts/get-wikitext-2.sh`).

By default only the mean perplexity value and the corresponding uncertainty are calculated. The uncertainty is determined empirically by assuming a Gaussian distribution of the "correct" logits per token and then applying error propagation.
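A sketch of how such an estimate typically works (generic error propagation, not necessarily the exact implementation): if $\mu$ is the mean negative log-likelihood per token and $\sigma_\mu$ its standard error, then

$$\mathrm{PPL} = e^{\mu}, \qquad \sigma_{\mathrm{PPL}} \approx \left|\frac{\partial\,\mathrm{PPL}}{\partial \mu}\right| \sigma_\mu = e^{\mu}\,\sigma_\mu .$$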
More statistics can be obtained by recording the logits from the FP16 version of a model. To do this, supply `perplexity` with `--kl-divergence-base path/to/logit/binary/file.kld`. The program will then record all logits and save them to the provided path in binary format. The logit file will be very large: 11 GiB for LLaMA 2 or 37 GiB for LLaMA 3 when using the Wikitext-2 test set.
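A sketch of this first step, assuming the binary is built as `./perplexity`; model and output paths are placeholders:

```sh
# Record the FP16 logits over the test corpus and save them to the given .kld file (this file is large).
./perplexity -m models/llama-3-8b-f16.gguf -f wikitext-2-raw/wiki.test.raw \
    --kl-divergence-base logits/llama-3-8b-f16.kld
```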
Once you have the file, supply `perplexity` with the quantized model, the logits file via `--kl-divergence-base`, and finally the `--kl-divergence` argument to indicate that the program should calculate the so-called Kullback-Leibler divergence. This is a measure of how similar the FP16 and the quantized logit distributions are, with a value of 0 indicating that the distributions are the same. The uncertainty on the mean KL divergence is calculated by assuming that the KL divergence per token follows a Gaussian distribution.
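And a sketch of the second step under the same assumptions, reusing the placeholder paths from above:

```sh
# Evaluate the quantized model against the recorded FP16 logits and compute the KL divergence statistics.
./perplexity -m models/llama-3-8b-q4_0.gguf -f wikitext-2-raw/wiki.test.raw \
    --kl-divergence-base logits/llama-3-8b-f16.kld --kl-divergence
```

For reference, the per-token Kullback-Leibler divergence between the FP16 distribution $P$ and the quantized distribution $Q$ over the vocabulary is the standard

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{v \in \mathrm{vocab}} P(v) \log \frac{P(v)}{Q(v)},$$

which is then averaged over all evaluated tokens (the direction of the divergence is my assumption, based on the FP16 logits serving as the base).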
In addition to the KL divergence, the following statistics are calculated with `--kl-divergence`:

Unfortunately, due to the difference in vocabulary size it is very difficult to make direct comparisons between the two models. However, the minimum and maximum changes in token probability are much larger for LLaMA 3 than they are for LLaMA 2, and the percentiles are also much more asymmetric, which (I would argue) is indicative of higher quality loss. This effect gets larger towards smaller quant sizes.
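(Here Δp presumably refers to the change in the probability assigned to the correct token when going from FP16 to the quantized model; under that reading, for each evaluated token $t_i$

$$\Delta p_i = q(t_i \mid t_{<i}) - p(t_i \mid t_{<i}),$$

with $p$ the FP16 model and $q$ the quantized model, and "mean Δp", the minimum/maximum, and the percentiles are statistics of this per-token quantity.)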
Example of new console output