
Penalty threshold: A mechanism for improving repetition penalties #5561

Open · wants to merge 1 commit into base: master

Conversation

@p-e-w commented Feb 18, 2024

The current repetition penalty system suffers from a fundamental, conceptual problem. This PR implements a new sampling parameter that I believe can help alleviate that problem in many cases.

The problem

Consider a typical prose text:

Call me Ishmael. Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen, and regulating the circulation.

Imagine we want to generate tokens based on that input. To avoid looping and boring results, we apply a repetition and/or frequency penalty, penalizing the generation of new tokens that are already present in our input.

The most frequent tokens in our input are spaces, punctuation, and words like a, the, etc.

In other words, we are penalizing the very structure of standard English. That's really bad.

Consider the following dialogue:

Phaedrus: Enough; I see that I have no hope of practising my art upon you. But if I am to read, where would you please to sit?

Socrates: Let us turn aside and go by the Ilissus; we will sit down at some quiet spot.

Phaedrus: I am fortunate in not having my sandals, and as you never have any, I think that we may go along the brook and cool our feet in the water; this will be the easiest way, and at midday and in the summer is far from being unpleasant.

Socrates: Lead on, and look out for a place in which we can sit down.

Phaedrus: Do you see that tallest plane–tree in the distance?

Socrates: Yes.

In addition to the aforementioned tokens, the most frequent tokens in that dialogue include the tokens comprising the names of our chat participants, Phaedrus and Socrates.

In other words, we are penalizing the very chat structure that we want our model to generate. This is super bad. In fact, this problem is already being tacitly acknowledged by the existence of llama.cpp's --no-penalize-nl option, though that option feels rather ad-hoc because the underlying issue is much more general.
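
For reference, here is a minimal standalone sketch of how repetition and frequency penalties are commonly applied (the function name and signature are illustrative, not llama.cpp's actual API). Note that every token appearing in the context gets penalized, including the structural tokens discussed above.

```cpp
#include <unordered_map>
#include <vector>

// Illustrative only: apply classic repetition/frequency/presence penalties to a
// logit vector, given the tokens already present in the context.
void apply_classic_penalties(std::vector<float> & logits,
                             const std::vector<int> & context_tokens,
                             float penalty_repeat,    // e.g. 1.1
                             float penalty_freq,      // e.g. 0.1
                             float penalty_present) { // e.g. 0.0
    std::unordered_map<int, int> counts;
    for (int tok : context_tokens) {
        counts[tok]++;
    }

    for (const auto & kv : counts) {
        const int tok   = kv.first;
        const int count = kv.second;
        float & logit = logits[tok];

        // Repetition penalty: push the logit towards "less likely".
        if (logit > 0.0f) {
            logit /= penalty_repeat;
        } else {
            logit *= penalty_repeat;
        }

        // OpenAI-style frequency and presence penalties (count > 0 for every
        // token reached here, so the presence term always applies).
        logit -= count * penalty_freq + penalty_present;
    }
}
```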

This PR

This PR implements a new sampling parameter, penalty_threshold. Tokens whose relative frequency in the penalty context exceeds that threshold are exempted from repetition penalties. For example, if penalty_threshold is set to 0.1, any token that makes up more than 10% of the input will not have penalties applied to it.

The idea is that if a token is very common in the input, it is probably a token that is essential to the structure of the type of text we are dealing with. This could be regular prose, a formatted chat log, code, etc. Such essential tokens should never be penalized, as doing so distorts the structure implied by the input.

This is a very general solution that works in many practical situations. A value of 0.1, applied to prose, will usually exclude only space characters from being penalized. Lower values will then also exclude punctuation, conjunctions, common pronouns etc. The cool thing is that unlike with --no-penalize-nl (or its proposed extension, #3675), we do not have to think about the type of text we are dealing with. The threshold mechanism automatically adapts to the input.
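
As a rough sketch of the mechanism (again with illustrative names, not the actual code in this PR): compute each token's relative frequency in the penalty context and skip the penalty entirely when that frequency exceeds penalty_threshold.

```cpp
#include <unordered_map>
#include <vector>

// Illustrative sketch of the penalty_threshold idea: tokens whose relative
// frequency in the penalty context exceeds the threshold are left untouched.
void apply_penalty_with_threshold(std::vector<float> & logits,
                                  const std::vector<int> & context_tokens,
                                  float penalty_repeat,
                                  float penalty_threshold) { // e.g. 0.1
    std::unordered_map<int, int> counts;
    for (int tok : context_tokens) {
        counts[tok]++;
    }

    const float context_size = (float) context_tokens.size();

    for (const auto & kv : counts) {
        // Relative frequency of this token within the penalty context.
        const float rel_freq = kv.second / context_size;

        // Exempt "structural" tokens: those making up more than
        // penalty_threshold of the context (e.g. more than 10% for 0.1).
        if (rel_freq > penalty_threshold) {
            continue;
        }

        float & logit = logits[kv.first];
        if (logit > 0.0f) {
            logit /= penalty_repeat;
        } else {
            logit *= penalty_repeat;
        }
    }
}
```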

Default value

The default is set to the conservative value of 1.0. Since all tokens by definition have a relative frequency of <= 1, this means the penalty applies to all tokens as before, that is, this feature is inactive by default.

I do, however, believe that setting an active default of 0.1 or so should be considered, since penalty_repeat is also active by default and, as demonstrated above, applying a repetition penalty in the current sense flat-out does the wrong thing in many very common situations.

Note

I am rather unfamiliar with the relevant literature. If this approach has been previously suggested or implemented (which wouldn't surprise me at all), please point out the paper or code so I can give credit to prior art as appropriate.

Only apply penalties to tokens whose relative frequency in the penalty context is less than or equal to this value.
@@ -259,13 +259,17 @@ int main(void) {
test_typical({0.97f, 0.01f, 0.01f, 0.01f}, {0.97f}, 0.5f);
test_typical({0.4f, 0.2f, 0.2f, 0.2f}, {0.2f, 0.2f, 0.2f}, 0.5f);

test_repetition_penalties({0.2f, 0.2f, 0.2f, 0.2f, 0.2f}, {0}, {0.25f, 0.25f, 0.25f, 0.25f, 0}, 50.0f, 0.0f, 0.0f);
@p-e-w (Author) commented on this diff:

I do not understand how these tests are intended to work, even before my changes. The only prior token here is 0, so I would expect the resulting probability vector to be {0, 0.25f, 0.25f, 0.25f, 0.25f}, that is, the probability at index 0 to be penalized. Please help me understand what is going on here so I can make sure the code actually works correctly.

@ggerganov (Owner)
The repetition penalty is inherently flawed in obvious ways. I'm not very familiar with the literature and the history of this sampling strategy, but my guess is that it was useful in the early days, when base models used to fall into repetition loops quite easily. Today, there is almost zero reason to use it, so it is probably not worth investing in.

@p-e-w (Author) commented Feb 19, 2024

@ggerganov

To me, the repetition penalty is the single most important sampling parameter. Every model I've ever used repeats itself without it. Just recently, I accidentally ran Mixtral-8x7b (currently the top-ranked open source model) with repetition penalty disabled for a few hours. The looping was unbearable. You can find plenty of discussions in forums where people argue whether 1.18 or 1.19 is the better repetition penalty preset. This is the first time I've heard someone say there is no reason to use RepPen. If RepPen is not needed, why is it enabled by default? It does distort the distribution emitted by the model, after all.

In fact, I had planned on creating multiple PRs improving several aspects of this crucial parameter. Please let me know whether it is worth my continuing to work on this.

@oobabooga (Contributor)

I made a Python reimplementation at oobabooga/text-generation-webui#5539 for tests. Here is what I got for the prompt below, where there is no new line or space after the final "Yes":

Phaedrus: Enough; I see that I have no hope of practising my art upon you. But if I am to read, where would you please to sit?

Socrates: Let us turn aside and go by the Ilissus; we will sit down at some quiet spot.

Phaedrus: I am fortunate in not having my sandals, and as you never have any, I think that we may go along the brook and cool our feet in the water; this will be the easiest way, and at midday and in the summer is far from being unpleasant.

Socrates: Lead on, and look out for a place in which we can sit down.

Phaedrus: Do you see that tallest plane–tree in the distance?

Socrates: Yes

Results:

penalty_threshold=0.01
Non-valid tokens: ['<0x0A>', '▁the', 'ed', '▁in', '▁to', '▁and', '▁I', '▁you', '▁that', '▁at', '▁we', '▁have', '▁go', '▁my', '▁will', '▁am', 'ates', '▁see', '▁down', '▁sit', 'ha', 'ocr', 'rus', '.', ',', 'S', ';', ':', 'P', '?']

penalty_threshold=0.02
Non-valid tokens: ['<0x0A>', '▁the', '▁in', '▁and', '▁I', '▁you', '.', ',', ':']

penalty_threshold=0.03
Non-valid tokens: ['<0x0A>', '▁the', '▁and', ':']

penalty_threshold=0.04
Non-valid tokens: ['<0x0A>']

penalty_threshold=0.05
Non-valid tokens: ['<0x0A>']

penalty_threshold=0.06
Non-valid tokens: []

<0x0A> is \n, and a "Non-valid token" is a token excluded from repetition penalty due to being above the threshold.

My first impression is that the parameter is very sensitive. Probably 3 decimal places are needed to find something optimal for a given situation.

Space is never penalized because it is not a separate token in the llama tokenizer.

I don't know if that's the optimal way to fix repetition penalty, but it seems like an interesting starting point.

@ggerganov (Owner)

Just recently, I accidentally ran Mixtral-8x7b

Is this the base model or the instruct model? My experience with the instruct model is that it never enters repetition loops with temp 0 and all repetition penalties disabled.

@kalomaze (Contributor) commented Feb 19, 2024

To me, the repetition penalty is the single most important sampling parameter.

Sampling from the original distribution avoids looping in almost all cases (1.0 temperature with nothing else), but when you layer things like lower temperature, Top P, etc., that's when the problem is usually introduced.
However, you sort of need truncation (or just lower Temp by itself) to keep the model coherent.
I created Min P truncation a while back to help alleviate the problem, because it means you can use higher temperature while the model stays "sane", which can be greatly beneficial for improving creativity while avoiding sampling from the tail end of the distribution.
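
For context, Min P keeps only tokens whose probability is at least min_p times that of the most likely token. A minimal sketch (illustrative, not the actual llama.cpp implementation):

```cpp
#include <algorithm>
#include <vector>

// Illustrative Min P truncation: keep only tokens whose probability is at
// least min_p times the probability of the single most likely token.
std::vector<float> min_p_filter(const std::vector<float> & probs, float min_p) {
    const float p_max   = *std::max_element(probs.begin(), probs.end());
    const float cutoff  = min_p * p_max;

    std::vector<float> filtered = probs;
    float total = 0.0f;
    for (float & p : filtered) {
        if (p < cutoff) {
            p = 0.0f;   // truncated: this token can no longer be sampled
        }
        total += p;
    }
    // Renormalize the surviving probabilities (p_max always survives, so total > 0).
    for (float & p : filtered) {
        p /= total;
    }
    return filtered;
}
```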

Recently though, I've been thinking about a repetition penalty that penalizes tokens that are "average" in terms of total occurrences. So tokens that repeat over and over as part of natural English grammar don't get penalized, tokens that are used extremely rarely don't get penalized, and those that lie in the "mean" get penalized.

I think that would maybe be a more natural solution to the issue you describe with rep pen (because there wouldn't be a hard cutoff perhaps?) But really, the best / most natural solution would be to have a model that isn't as overfit on Q/A or Instruct tasks in the first place, and has a greater text diversity during SFT (prose mixed in with Instruct data, etc). Not to mention, stateful sampling is kind of wacky to meaningfully control across the board (see Mirostat, which doesn't really make sense in the modern era)
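
One possible reading of that idea, sketched with made-up cutoffs low_freq and high_freq (both hypothetical): only tokens whose relative frequency falls in the middle band get penalized.

```cpp
#include <unordered_map>
#include <vector>

// Hypothetical sketch of an "average occurrence" penalty: very common tokens
// (grammar/structure) and very rare tokens are left alone; only tokens in the
// middle band of relative frequency are penalized. The cutoffs are made up.
void penalize_average_tokens(std::vector<float> & logits,
                             const std::vector<int> & context_tokens,
                             float penalty_repeat,        // e.g. 1.1
                             float low_freq  = 0.002f,    // hypothetical
                             float high_freq = 0.05f) {   // hypothetical
    std::unordered_map<int, int> counts;
    for (int tok : context_tokens) {
        counts[tok]++;
    }
    const float n = (float) context_tokens.size();

    for (const auto & kv : counts) {
        const float rel_freq = kv.second / n;
        if (rel_freq < low_freq || rel_freq > high_freq) {
            continue; // too rare or too common: don't penalize
        }
        float & logit = logits[kv.first];
        if (logit > 0.0f) {
            logit /= penalty_repeat;
        } else {
            logit *= penalty_repeat;
        }
    }
}
```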

I am rather unfamiliar with the relevant literature.

There is not much relevant sampling literature because it is an afterthought in academia; the best solution is almost always a better model if you have the compute to make one, not a better way of choosing from the model's flawed predictions. Most of the interesting sampling developments have been on the open source end for that reason, because we aren't in a position to do pretrains, unfortunately.

Sampling is very relevant for controlling the output of models, though; I made an overview for the popular ones the other day that might interest you.

@p-e-w (Author) commented Feb 19, 2024

@oobabooga

My first impression is that the parameter is very sensitive. Probably 3 decimal places are needed to find something optimal for a given situation.

Optimal, maybe. But beneficial, no. This is just a heuristic, and it doesn't have to be perfect in order to improve the distribution. I would say that in your experiment, every parameter value you've tried resulted in the exclusion of (mostly) tokens that shouldn't be penalized.

Note that the example text is very short, and tokens whose high frequency is an outlier will quickly average away once the length increases.

Space is never penalized because it is not a separate token in the llama tokenizer.

Ah, TIL. I could swear I saw spaces as tokens last time I looked, but maybe that was a different model or I just misremember.

@ggerganov

Is this the base model or the instruct model? My experience with the instruct model is that it never enters repetition loops with temp 0 and all repetition penalties disabled.

I'm trying to retrace my steps, and I just realized it might have been Nous Hermes 2 Mixtral 8x7B. That finetune has some other quirks, so this might have been part of the problem.

My basic point still stands though. I don't think most people who use LLMs to write fiction can imagine life without repetition penalty.

@kalomaze

Sampling is very relevant for controlling the output of models, though; I made an overview for the popular ones the other day that might interest you.

Very nice, thank you. Your point about frequency penalty breaking standard grammar is pretty much what this PR tries to fix. In fact, I believe with the penalty threshold enabled, much stricter frequency penalties suddenly become viable because the structural tokens are all exempted.

BTW, has anyone ever tried to find an optimal set of sampling parameters by minimizing perplexity over a given text or something?

@kalomaze (Contributor) commented Feb 19, 2024

BTW, has anyone ever tried to find an optimal set of sampling parameters by minimizing perplexity over a given text or something?

I don't think anyone has, but I've considered it before; I was just too lazy to hack it into perplexity.cpp. My main concern is that truncation probably just makes perplexity worse on average whenever the reference token falls outside the considered candidate set, even if it can qualitatively result in better outputs, but I'm not sure about that.
It'd be interesting to see if you can "calibrate" values for something like Dynamic Temperature that way, to better match the entropy profile of a piece of text or something...

@kalomaze (Contributor) commented Feb 19, 2024

Also, keep in mind that truncation gives the excluded tokens 0% probability, which would result in infinite perplexity. You'd have to add a small constant (e.g. 1e-6) when evaluating perplexity to keep it from failing, I think.
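
A minimal sketch of that guard (a hypothetical helper, purely illustrative): clamp the reference token's probability to a small floor when accumulating the negative log-likelihood, so a truncated token contributes a large but finite penalty instead of infinity.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Illustrative: perplexity of the reference tokens under (possibly truncated)
// per-step distributions, with a small floor so tokens truncated to
// probability 0 don't blow up to infinity.
double perplexity_with_floor(const std::vector<std::vector<float>> & step_probs,
                             const std::vector<int> & reference_tokens,
                             float eps = 1e-6f) {
    double nll = 0.0;
    for (size_t i = 0; i < reference_tokens.size(); ++i) {
        const float p = std::max(step_probs[i][reference_tokens[i]], eps);
        nll += -std::log((double) p);
    }
    // Perplexity = exp(mean negative log-likelihood).
    return std::exp(nll / reference_tokens.size());
}
```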

Overall I'm a fan of this PR and the fact other people are looking at sampling based solutions; the steerability of these models goes far beyond just the prompt.

@Ph0rk0z commented Feb 20, 2024

Repetition isn't just loops; it's using the same words and phrases sprinkled throughout. Sadly, we still have to use it.

@oobabooga (Contributor) commented Feb 20, 2024

Here is an idea I had: rather than doing the exclusion in absolute terms, do it relative to the most common token.

  • Before: exclude tokens that appear more than penalty_factor * (total number of tokens) times
  • After: exclude tokens that appear more than penalty_factor * (how many times the most common token appears) times

penalty_factor = 1 disables the effect, and now the parameter is more interpretable and has a better range of values.
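
A minimal sketch of this relative variant (illustrative names, not an actual implementation): compare each token's count against the count of the most common token rather than against the total context length.

```cpp
#include <algorithm>
#include <unordered_map>
#include <vector>

// Illustrative: collect the tokens that are exempted from penalties when the
// cutoff is taken relative to the most common token in the context.
std::vector<int> exempt_tokens_relative(const std::vector<int> & context_tokens,
                                        float penalty_factor) { // 1.0 disables
    std::unordered_map<int, int> counts;
    for (int tok : context_tokens) {
        counts[tok]++;
    }

    int max_count = 0;
    for (const auto & kv : counts) {
        max_count = std::max(max_count, kv.second);
    }

    std::vector<int> exempt;
    for (const auto & kv : counts) {
        // Exempt tokens that appear more than penalty_factor times as often
        // as the most common token; with penalty_factor = 1 nothing qualifies.
        if (kv.second > penalty_factor * max_count) {
            exempt.push_back(kv.first);
        }
    }
    return exempt;
}
```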

Here is a test for the same prompt above:

penalty_threshold=1
Non-valid tokens: []

penalty_threshold=0.9
Non-valid tokens: ['<0x0A>']

penalty_threshold=0.8
Non-valid tokens: ['<0x0A>']

penalty_threshold=0.7
Non-valid tokens: ['<0x0A>']

penalty_threshold=0.6
Non-valid tokens: ['<0x0A>']

penalty_threshold=0.5
Non-valid tokens: ['<0x0A>', '▁the', '▁and', ':']

penalty_threshold=0.4
Non-valid tokens: ['<0x0A>', '▁the', '▁in', '▁I', '▁and', ',', ':']

penalty_threshold=0.3
Non-valid tokens: ['<0x0A>', '▁the', '▁in', '▁I', '▁and', '▁you', '.', ',', ':']

penalty_threshold=0.2
Non-valid tokens: ['<0x0A>', '▁the', 'ed', '▁in', '▁I', '▁and', '▁you', '▁that', '▁we', 'ates', '▁sit', 'ocr', 'rus', '.', ',', ':', 'S', ';']

penalty_threshold=0.1
Non-valid tokens: ['<0x0A>', '▁the', 'ed', '▁in', '▁to', '▁I', '▁and', '▁you', '▁that', '▁at', '▁have', '▁my', '▁we', '▁am', '▁will', '▁go', '▁see', 'ates', '▁down', 'ha', '▁sit', 'ocr', 'rus', '.', ',', ':', 'S', 'P', ';', '?']

penalty_threshold=0.01
Non-valid tokens: ['<s>', '<0x0A>', '▁a', '▁the', '▁m', 'ed', '▁in', '▁to', '▁I', '▁of', '▁and', 'ad', '▁is', 'est', 'un', '▁for', '▁you', '▁be', '▁on', 'us', 'ay', '▁that', 'ate', '▁as', '▁un', '▁this', '▁not', '▁at', '▁by', '▁us', '▁have', '▁can', '▁from', 'ple', 'ok', '▁if', '▁my', '▁we', '▁which', '▁am', '▁will', '▁no', '▁out', '▁would', '▁any', '▁go', '▁some', 'iss', 'ough', '▁Le', '▁way', '▁where', '▁see', 'ates', '▁look', '▁may', '▁En', '▁But', '▁read', 'als', '▁think', '▁art', '▁down', '▁being', '▁Il', '▁our', '▁Do', '▁Ph', '▁place', 'idd', '▁far', 'ha', '▁never', '▁upon', '▁turn', '▁having', '▁bro', '▁Let', '▁please', '▁along', '▁Yes', '▁water', '▁pract', '▁hope', '▁fort', '▁distance', 'ising', '▁feet', '▁sit', 'tree', 'ocr', '▁spot', '▁plane', '▁summer', '▁quiet', '▁sand', '▁cool', 'asant', '▁tall', 'rus', '▁aside', '▁easiest', 'a', '.', ',', ':', 'S', 'P', ';', '?', '–']

Framed this way, it looks more like min_p.

@p-e-w (Author) commented Feb 20, 2024

@oobabooga

I like your idea, and I can see how in many cases, it would improve the range of values that make sense.

The reason I don't think it's a replacement for my approach is that it doesn't generalize as well. If a token makes up 10% of all tokens (and the input is of sufficient length), you can be pretty sure it's an "essential" token that shouldn't be penalized. By contrast, if a token occurs 10% as often as the most common token, that doesn't necessarily tell you anything. Even the fact that a token is the most common token doesn't automatically imply it should not be penalized, if no token is particularly common. The outcome might be heavily dependent on the structure and type of the text, and the language it is written in.

I think that ideally, both approaches should be implemented. Honestly, there are far too few sampling parameters at the moment. Considering how incredibly complex the whole thing is, and that other than changing the model, sampling is all you can do to improve output quality, there should be hundreds of knobs that you can tweak rather than a dozen or so.

@jukofyork (Contributor) commented Apr 13, 2024

How do you think this method would work with coding models?

I've found you need to keep the repetition penalty at ~1.0 (none) when asking for large sections of code to be written or edited, otherwise the quality of the code degrades quite significantly; but if you leave it at 1.0 when you want to discuss code verbally, the same models will often start to loop...

The problem is:

  • The code itself (especially in languages derived from C syntax, with lots of curly braces, etc.) is very repetitive, and getting "creative" is absolutely not what you want.
  • The changes needed to a section of code being discussed all tend to be very similar: "Rename x to...", "Rename y to...", and so on, and it's easy for the model to just get stuck repeating or bore the user with long lists like this when you'd really prefer the writing to be more "creative" and have the information portrayed better.

@p-e-w (Author) commented Apr 18, 2024

@jukofyork

How do you think this method would work with coding models?

If code in the language that is to be generated is already present in the context, the penalty threshold should protect the programming language's standard tokens from being penalized. So it would indeed improve the situation you describe.

@mofosyne added the Review Complexity : Medium, performance, generation quality, and enhancement labels, and removed the performance label, on May 10, 2024
@81549361 mentioned this pull request on Aug 23, 2024