KL-divergence #5076
Conversation
How are you calculating KL divergence? Average or mean? Would it also be possible to show the top-1 match frequency and maybe the max divergence (or q99)?
Cool!
I don't think I understand the difference between average and mean. Top-1 match can be easily added, along with other statistics such as max/q99, etc. But I personally don't like reviewing big PRs, and I assume neither do other people, so my preference would be to merge this and then add additional functionality in subsequent PR(s).
To clarify what KL-divergence is being computed: one can compute either the KL-divergence between the base and quantized token probability distributions for each evaluated token and then average over tokens, or the KL-divergence between the two probability distributions averaged over all evaluated tokens. I.e., in the former case, implemented by the PR, we compute a KL-divergence for each token and take the average of that. In the latter case, we first compute average probabilities for the tokens in the vocabulary over the evaluated tokens, and then we compute one KL-divergence based on that.
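In symbols (my notation, not from the PR: `p_i` and `q_i` are the base and quantized models' probability distributions over the vocabulary at evaluated position `i`, and `N` is the number of evaluated tokens), the two options are

```math
\text{(a)}\;\; \overline{D}_{\mathrm{KL}} = \frac{1}{N}\sum_{i=1}^{N}\sum_{v=1}^{n_\text{vocab}} p_i(v)\,\ln\frac{p_i(v)}{q_i(v)},
\qquad
\text{(b)}\;\; D_{\mathrm{KL}}(\bar p\,\|\,\bar q) = \sum_{v=1}^{n_\text{vocab}} \bar p(v)\,\ln\frac{\bar p(v)}{\bar q(v)},
\quad \bar p(v) = \frac{1}{N}\sum_{i=1}^{N} p_i(v).
```

The PR implements option (a).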
```cpp
        max_logit = std::max(max_logit, logits[i]);
        min_logit = std::min(min_logit, logits[i]);
    }
    min_logit = std::max(min_logit, max_logit - 16);
```
Why is this done? Because the value would be 0 anyway due to the scale? A comment would be helpful.
So, to reduce the size of the data stored in the base run, I store the log-probabilities as `uint16_t` (the log-probabilities for `wiki.test.run` would otherwise be 20 GB; this way the file size is 10 GB). The minimum logit can be very small, so I have decided to limit the probability range to e^(-16) ~ 1e-7. This slightly improves the precision of the 16-bit values being stored.
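A minimal sketch of the idea, not the exact code in the PR (names and the scaling scheme here are illustrative; the PR stores log-probabilities plus additional per-token data):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Clamp the logit range to [max_logit - 16, max_logit] and quantize to 16 bits.
// Probabilities below e^(-16) ~ 1e-7 relative to the most likely token are
// negligible, so discarding them improves the resolution of the 16-bit encoding.
static std::vector<uint16_t> quantize_logits_16bit(const std::vector<float> & logits) {
    float max_logit = logits[0];
    float min_logit = logits[0];
    for (float l : logits) {
        max_logit = std::max(max_logit, l);
        min_logit = std::min(min_logit, l);
    }
    min_logit = std::max(min_logit, max_logit - 16);

    const float scale = max_logit > min_logit ? 65535.f/(max_logit - min_logit) : 0.f;
    std::vector<uint16_t> result(logits.size());
    for (size_t i = 0; i < logits.size(); ++i) {
        const float clamped = std::min(max_logit, std::max(min_logit, logits[i]));
        result[i] = (uint16_t) std::lround((clamped - min_logit)*scale);
    }
    return result;
}
```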
```cpp
    in.read((char *)&n_vocab, sizeof(n_vocab));
    in.read((char *)&n_chunk, sizeof(n_chunk));
    if (in.fail()) {
        fprintf(stderr, "%s: failed rwading n_vocab, n_chunk from %s\n", __func__, params.logits_file.c_str());
```
rwading -> reading
Thanks for noticing. Will fix in another PR.
A very quick benchmark/sanity check against the Python script from #4739:
The sanity check tells us that either at least one of the two implementations is wrong, or that not the same thing is being computed. From a quick look at the Python script, it seems all tokens in each chunk are being evaluated there, while this PR only evaluates the second half of each chunk.
That would make sense, because a full logits.gz file is over 20 GB for the full wiki.test.raw. Is there any benefit to only evaluating half the tokens per chunk? I guess the difference between the two would average out as the sample size increases; the test was only done on 6K tokens.
Are you interested in what token probabilities the model predicts without any context? Or with a context consisting of the BOS token, BOS + 1 token, etc.? If so, then you would include all tokens in the evaluation. But if you want to see performance with a context of reasonable length, then it is better to evaluate only the second half of each chunk, so that every evaluated token has at least half a context window of preceding tokens.
Not really. Based on the standard deviation of
I forgot to say: thank you for this PR. It's already useful to me since it helps me to judge the precision loss in #4801.
Glad to hear it is useful. To me, PPL and KL-divergence are basically the same thing. I wrote about this earlier (see #4739 (comment)). So, just for fun, here is a comparison between the two:
Maybe I'm beating a dead horse here, but to me the usefulness of the KL divergence is the fact that you get a much more fine-grained measurement of how much change in probability mass is occurring. A model might be better at predicting some tokens and worse at others due to quant loss or other changes caused by lower precision, and this would not necessarily be reflected in the average PPL (not without a much larger amount of perplexity calculations). When you measure the whole distribution instead of one token probability, you don't have to make any of those assumptions, and you don't need as much compute to rule out the margin of error. For example, these are my old charts for Mistral 7B (before importance matrix, etc.), where I did KL divergences with my own hacked-together Python script; they show off the exponential degradation quite well.
According to your graph,
Yes, that's probably correct. However, at least the KL centiles / top-token data are decreasing correctly, so they seem to be the more useful indicator versus KL average/perplexity.
* kl-divergence: be able to save all logits to a file
* Add ability to compute KL-divergence

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
There have been several discussions about the potential value of being able to compute KL-divergence as another quantization accuracy test.
There is the Python script that @Ttl provided in PR #4739. But for those who prefer C/C++, this PR adds the ability to perform KL-divergence calculations natively in `llama.cpp`.

Usage
First get all logits of the `fp16` model.
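A sketch of the invocation (the `--kl-divergence-base` flag name, the binary path, and the file names here are assumptions for illustration; adjust them to your setup):

```sh
# Base run: evaluate the fp16 model and write all logits for every evaluated token to a file
./perplexity -m ggml-model-f16.gguf -f wiki.test.raw --kl-divergence-base logits-f16.dat
```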
Be warned: the file can become quite large (about 10 GB for `wiki.test.raw` and a context of 512), as all `n_vocab` logits are stored in the file for each evaluated token.

Then run a calculation using the base logits obtained in step 1 for a quantized model (or any model that has the same vocabulary).
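Again as a sketch, assuming a `--kl-divergence` flag to enable the computation (model and file names are illustrative):

```sh
# Quantized run: read the base logits and report PPL plus KL-divergence statistics
./perplexity -m ggml-model-q4_0.gguf --kl-divergence-base logits-f16.dat --kl-divergence
```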
Note: you don't need to provide the test dataset via `-f` again, as the tokens are taken from the data stored in `<file_name>` (and if you do provide it, it will simply be ignored; this way it is assured that the base model logits and the quantized model logits are based on the exact same set of tokens). If everything goes well, you will see the computed statistics in the output.
You get the PPL of the quantized model along with the KL-divergence and the logarithm of the ratio of the quantized model PPL to the base model PPL. The statistical uncertainty on the KL-divergence and on `ln(PPL(Q)/PPL(Base))` is much lower than the uncertainty of PPL itself.
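One way to see why the ratio is better determined than PPL itself (a sketch of the reasoning, not taken from the PR): since

```math
\ln\mathrm{PPL} = -\frac{1}{N}\sum_{i=1}^{N}\ln p(t_i \mid \mathrm{context}_i),
\qquad
\ln\frac{\mathrm{PPL}(Q)}{\mathrm{PPL}(\mathrm{Base})} = \frac{1}{N}\sum_{i=1}^{N}\Bigl[\ln p_{\mathrm{Base}}(t_i \mid \mathrm{context}_i) - \ln p_{Q}(t_i \mid \mathrm{context}_i)\Bigr],
```

and the per-token log-probabilities of the two models are strongly correlated, their differences fluctuate much less from token to token than the log-probabilities themselves, giving a smaller statistical uncertainty for the ratio.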