Penalty threshold: A mechanism for improving repetition penalties #5561
base: master
Conversation
Only apply penalties to tokens whose relative frequency in the penalty context is less than or equal to this value.
@@ -259,13 +259,17 @@ int main(void) {
    test_typical({0.97f, 0.01f, 0.01f, 0.01f}, {0.97f}, 0.5f);
    test_typical({0.4f, 0.2f, 0.2f, 0.2f}, {0.2f, 0.2f, 0.2f}, 0.5f);

    test_repetition_penalties({0.2f, 0.2f, 0.2f, 0.2f, 0.2f}, {0}, {0.25f, 0.25f, 0.25f, 0.25f, 0}, 50.0f, 0.0f, 0.0f);
I do not understand how these tests are intended to work, even before my changes. The only prior token here is `0`, so I would expect the resulting probability vector to be `{0, 0.25f, 0.25f, 0.25f, 0.25f}`, that is, the probability at index `0` to be penalized. Please help me understand what is going on here so I can make sure the code actually works correctly.
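For reference, here is a minimal Python sketch (an illustration, not the actual llama.cpp code) of the classic repetition penalty applied to this test's inputs. With a penalty of 50, token 0's probability collapses toward zero and the remaining four renormalize to roughly 0.25 each, which matches the test's expected vector up to ordering:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def apply_repetition_penalty(logits, prev_tokens, penalty):
    # Classic repetition penalty: divide positive logits by the penalty,
    # multiply negative logits by it.
    out = list(logits)
    for tok in set(prev_tokens):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

# Reproduce the test's setup: five equally likely tokens, prior token 0.
logits = [math.log(0.2)] * 5
probs = softmax(apply_repetition_penalty(logits, [0], 50.0))
# Token 0 collapses to ~0; the other four renormalize to ~0.25 each.
```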
The repetition penalty is inherently flawed in obvious ways. I'm not very familiar with the literature and the history of this sampling strategy, but my guess is that it was useful in the early days, when base models fell into repetition loops quite easily. Today, there are almost zero reasons to use it, so it is probably not worth investing in.
To me, the repetition penalty is the single most important sampling parameter. Every model I've ever used repeats itself without it. Just recently, I accidentally ran Mixtral-8x7b (currently the top-ranked open source model) with the repetition penalty disabled for a few hours. The looping was unbearable. You can find plenty of discussions in forums where people argue whether 1.18 or 1.19 is the better repetition penalty preset. This is the first time I've heard someone say there is no reason to use RepPen. If RepPen is not needed, why is it enabled by default? It does distort the distribution emitted by the model, after all. In fact, I had planned on creating multiple PRs improving several aspects of this crucial parameter. Please let me know whether it is worth it for me to continue working on this.
I made a Python reimplementation at oobabooga/text-generation-webui#5539 for tests. Here is what I got for the prompt below, where there is no new line or space after the final "Yes":
Results:
My first impression is that the parameter is very sensitive; probably 3 decimal places are needed to find something optimal for a given situation. The space token is never penalized here. I don't know if that's the optimal way to fix the repetition penalty, but it seems like an interesting starting point.
Is this the base model or the instruct model? My experience with the instruct model is that it never enters repetition loops with temp 0 and all repetition penalties disabled.
Sampling from the original distribution avoids looping in almost all cases (1.0 temperature with nothing else), but when you layer things like lower temperature + Top P + etc., that's when the problem is usually introduced.

Recently though, I've been thinking about a repetition penalty that penalizes tokens that are "average" in terms of total occurrences: tokens repeated over and over as a part of natural English grammar don't get penalized, tokens that are used extremely rarely don't get penalized, and those that lie in the "mean" get penalized. I think that would maybe be a more natural solution to the issue you describe with rep pen (because there wouldn't be a hard cutoff, perhaps?)

But really, the best / most natural solution would be to have a model that isn't as overfit on Q/A or Instruct tasks in the first place, and has greater text diversity during SFT (prose mixed in with Instruct data, etc.). Not to mention, stateful sampling is kind of wacky to meaningfully control across the board (see Mirostat, which doesn't really make sense in the modern era).
There is not much relevant sampling literature because it is an afterthought in academia; the best solution is almost always a better model if you have the compute to make one, not a better way of choosing from the model's flawed predictions. Most of the interesting sampling developments have been on the open source end for that reason, because we aren't in a position to do pretrains, unfortunately. Sampling is very relevant for controlling the output of models, though; I made an overview for the popular ones the other day that might interest you.
Optimal, maybe. But beneficial, no. This is just a heuristic, and it doesn't have to be perfect in order to improve the distribution. I would say that in your experiment, every parameter value you've tried resulted in the exclusion of (mostly) tokens that shouldn't be penalized. Note that the example text is very short, and tokens whose high frequency is an outlier will quickly average away once the length increases.
Ah, TIL. I could swear I saw spaces as tokens last time I looked, but maybe that was a different model or I just misremember.
I'm trying to retrace my steps, and I just realized it might have been Nous Hermes 2 Mixtral 8x7B. That finetune has some other quirks, so this might have been part of the problem. My basic point still stands though. I don't think most people who use LLMs to write fiction can imagine life without repetition penalty.
Very nice, thank you. Your point about frequency penalty breaking standard grammar is pretty much what this PR tries to fix. In fact, I believe that with the penalty threshold enabled, much stricter frequency penalties suddenly become viable because the structural tokens are all exempted. BTW, has anyone ever tried to find an optimal set of sampling parameters by minimizing perplexity over a given text or something?
Don't think anyone has, but I've considered it before; I was just too lazy to hack it into perplexity.cpp. My main concern is that truncation probably just makes ppl worse on average whenever the reference token is not part of the considered candidate set, even if it can qualitatively result in better outputs, but I'm not sure about that.
Also, keep in mind that truncation gives the excluded tokens 0% probability, which would result in infinite perplexity. You'd have to add a small constant (e.g. 1e-6) when checking to keep it from failing when evaluating ppl, I think. Overall, I'm a fan of this PR and of the fact that other people are looking at sampling-based solutions; the steerability of these models goes far beyond just the prompt.
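As a rough sketch of that epsilon idea (assuming per-token probabilities for the reference text have already been collected), flooring each probability keeps the perplexity finite even when truncation has zeroed out the reference token:

```python
import math

def perplexity(token_probs, eps=1e-6):
    # Floor each probability so a token truncated to exactly 0 contributes a
    # large but finite log-loss instead of an infinite one.
    nll = sum(-math.log(max(p, eps)) for p in token_probs) / len(token_probs)
    return math.exp(nll)
```

The function name and the exact floor value are illustrative; the point is only that the floor turns an infinite penalty into a large finite one, so truncated samplers can still be compared.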
Repetition isn't just loops; it's using the same words and phrases sprinkled throughout. Sadly, I still have to use it.
Here is an idea I had: rather than doing the exclusion in absolute terms, do it relative to the most common token.
Here is a test for the same prompt above:
It looks more like
I like your idea, and I can see how in many cases, it would improve the range of values that make sense. The reason I don't think it's a replacement for my approach is that it doesn't generalize as well. If a token makes up 10% of all tokens (and the input is of sufficient length), you can be pretty sure it's an "essential" token that shouldn't be penalized. By contrast, if a token occurs 10% as often as the most common token, that doesn't necessarily tell you anything. Even the fact that a token is the most common token doesn't automatically imply it should not be penalized, if no token is particularly common. The outcome might be heavily dependent on the structure and type of the text, and the language it is written in. I think that ideally, both approaches should be implemented.

Honestly, there are far too few sampling parameters at the moment. Considering how incredibly complex the whole thing is, and that other than changing the model, sampling is all you can do to improve output quality, there should be hundreds of knobs that you can tweak rather than a dozen or so.
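To make the contrast between the two criteria concrete, here is a hypothetical helper (names and thresholds are illustrative, not taken from the PR) that implements both side by side on a character-level toy input:

```python
from collections import Counter

def exempt_tokens(context, abs_threshold=None, rel_threshold=None):
    # absolute: exempt tokens whose share of the whole context exceeds
    #   abs_threshold (the approach in this PR);
    # relative: exempt tokens whose count exceeds rel_threshold times the
    #   count of the most common token (the idea proposed above).
    counts = Counter(context)
    total = len(context)
    top = max(counts.values())
    exempt = set()
    for tok, n in counts.items():
        if abs_threshold is not None and n / total > abs_threshold:
            exempt.add(tok)
        if rel_threshold is not None and n > rel_threshold * top:
            exempt.add(tok)
    return exempt

# Toy example with characters standing in for tokens.
prose = list("the cat sat on the mat")
```

On this toy input, an absolute threshold of 0.1 exempts the space (5 of 22 characters), while what the relative criterion exempts depends entirely on which token happens to be most common, which is the generalization concern described above.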
How do you think this method would work with coding models? I've found you need to keep the repetition penalty at ~1.0 (none) when asking for large sections of code to be written or edited, otherwise the quality of the code degrades quite significantly. But if you leave it at 1.0 when you want to discuss code verbally, the same models will often start to loop... The problem is:
If code in the language that is to be generated is already present in the context, the penalty threshold should protect the programming language's standard tokens from being penalized. So it would indeed improve the situation you describe.
The current repetition penalty system suffers from a fundamental, conceptual problem. This PR implements a new sampling parameter that I believe can help alleviate that problem in many cases.
The problem
Consider a typical prose text:
Imagine we want to generate tokens based on that input. To avoid looping and boring results, we apply a repetition and/or frequency penalty, penalizing the generation of new tokens that are already present in our input.
The most frequent tokens in our input are spaces, punctuation, and words like `a`, `the`, etc. In other words, we are penalizing the very structure of standard English. That's really bad.
Consider the following dialogue:
In addition to the aforementioned tokens, the most frequent tokens in that dialogue include the tokens comprising the names of our chat participants, `Phaedrus` and `Socrates`. In other words, we are penalizing the very chat structure that we want our model to generate. This is super bad. In fact, this problem is already being tacitly acknowledged by the existence of llama.cpp's `--no-penalize-nl` option, though that option feels rather ad hoc because the underlying issue is much more general.

This PR
This PR implements a new sampling parameter, `penalty_threshold`. Tokens whose relative frequency in the penalty context exceeds that threshold are exempted from repetition penalties. For example, if `penalty_threshold` is set to `0.1`, any token that makes up more than 10% of the input will not have penalties applied to it.

The idea is that if a token is very common in the input, it is probably a token that is essential to the structure of the type of text we are dealing with. This could be regular prose, a formatted chat log, code, etc. Such essential tokens should never be penalized, as doing so distorts the structure implied by the input.
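In Python pseudocode, the proposed behavior looks roughly like this (a sketch for illustration, not the actual C++ implementation in this PR; function names and the default values shown are assumptions):

```python
from collections import Counter

def penalize_with_threshold(logits, penalty_context,
                            repeat_penalty=1.18, penalty_threshold=0.1):
    # Tokens whose relative frequency in the penalty context exceeds
    # penalty_threshold are exempt from the repetition penalty.
    counts = Counter(penalty_context)
    total = max(len(penalty_context), 1)
    out = list(logits)
    for tok, n in counts.items():
        if n / total > penalty_threshold:
            continue  # frequent structural token: leave it untouched
        # Classic repetition penalty on everything else.
        out[tok] = (out[tok] / repeat_penalty if out[tok] > 0
                    else out[tok] * repeat_penalty)
    return out
```

For example, with a context in which token 0 makes up 95% of all tokens and token 1 only 5%, token 0 is exempted while token 1 is penalized, and tokens absent from the context are never touched.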
This is a very general solution that works in many practical situations. A value of `0.1`, applied to prose, will usually exclude only space characters from being penalized. Lower values will then also exclude punctuation, conjunctions, common pronouns, etc. The cool thing is that unlike with `--no-penalize-nl` (or its proposed extension, #3675), we do not have to think about the type of text we are dealing with. The threshold mechanism automatically adapts to the input.

Default value
The default is set to the conservative value of `1.0`. Since all tokens by definition have a relative frequency of <= 1, the penalty applies to all tokens as before; that is, this feature is inactive by default.

I do, however, believe that an active default of `0.1` or so should be considered, since `penalty_repeat` is also active by default, and as demonstrated above, applying a repetition penalty in the current sense flat out does the wrong thing in many very common situations.

Note
I am rather unfamiliar with the relevant literature. If this approach has been previously suggested or implemented (which wouldn't surprise me at all), please point out the paper or code so I can give credit to prior art as appropriate.