Exclude Top Choices (XTC): A sampler that boosts creativity, breaks writing clichés, and inhibits non-verbatim repetition #6335
Conversation
Wouldn't you get a similar effect from setting a high temperature after removing all poor candidates? Let's say you removed all candidates except the top 4 (via top-K/min-P/top-P or whatever) Bear = 50% Now for more creativity, crank up the temperature, and you end up with whereas with XTC you just get how would that be more creative? The first set still has an equally likely chance to pick from a good variety of candidates. Especially considering people like to reroll gens a lot, you probably just end up with many runs of Sword, whereas you have more variety without it. |
The only benefit perhaps would be to remove "toxic" slop tokens from the output, e.g. Shivers down her spine, but then identifying such slop tokens is non-trivial |
Not sure where to put this, but I did a quick, hacky, might-be-bugged implementation of XTC on llama-cpp-python using (It is hacky because llama-cpp-python's samplers mostly call back to samplers implemented in llama.cpp itself. On the other hand, trying to get new proposed samplers merged into major backends in the ecosystem probably requires passing through a rigorous process (and there are good, legitimate reasons for it) and is simply going to take time.)
|
I have tried that approach many times. The problem is that this throws away the information contained in the probability distribution, by essentially making all remaining tokens (almost) equally likely. One of the following two things will happen: If you truncate aggressively, only 1-2 candidates will remain, which are then sampled with near-equal probability. This is the opposite of creativity, as it simply locks in the most likely candidates. If, on the other hand, you truncate more loosely, the model will start to derail because it can no longer distinguish between likely and less likely tokens. And enhanced creativity is still not guaranteed, because the most likely tokens remain the most likely tokens. XTC doesn't alter the relative probabilities of tokens, retaining all the information from the distribution. It only excludes high-probability tokens from sampling under certain circumstances. The output generated with XTC is very different from what happens when you increase the temperature. The best way to convince yourself of that is to try it.
Actually, identifying such tokens is quite easy: They will usually be the most probable tokens in the distribution. If the input is "shivers down her", then the probability distribution might be
And in that case, |
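As a purely illustrative sketch (the numbers below are invented, not the values from the original comment), the situation being described might look like this:

```python
# Hypothetical next-token distribution after "... shivers down her"
probs = {"spine": 0.65, "back": 0.18, "arms": 0.11, "neck": 0.04, "skin": 0.02}

# Raising the temperature keeps "spine" on top: softmax(logits / T) preserves
# the ranking and only flattens the gaps (while inflating the garbage tail).

# XTC with threshold = 0.1: "spine", "back" and "arms" all count as viable;
# the most likely of these are removed and only "arms" (the least likely
# viable token) survives, together with everything below the threshold.
xtc_survivors = {"arms": 0.11, "neck": 0.04, "skin": 0.02}  # renormalized before sampling
```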
Fair enough. This sampler is simple enough that I could probably get a working example sometime soon. Just thinking, do you have ideas on how a "critical" token can be preserved? I understand that so long as more than one token passes the threshold then the most likely token(s) MUST be discarded - I could imagine some cases where that would lead to extensive degradation of outputs unless the threshold is very high. Would it be worth considering the probability difference between the most likely and the least likely token? Consider examples with threshold = 0.1 CandidateA = 85% Here we only have 2 candidates that pass the threshold, and XTC means we MUST pick B. Do you think that is ideal in this scenario, considering how confident the model is of A over B? This is quite different from your Bear/Tree/Door/Sword example. This would more likely be Again this is just brainstorming. |
I wanted to test this in SillyTavern (using text-generation-webui with this pull request applied), so I wrote a patch for it, I am sharing it here in case someone else prefers using SillyTavern UI too (at least for me, it makes testing much easier): Note: In SillyTavern, within "AI Response Configuration" > "Sampler Select" enable "Ooba Sampler Priority Block", "xtc_probability" and "xtc_threshold". Then in "Sampler Priority" section click "Load default order" to make sure it is correct. |
I played around with this for a while and found that with the default params of 0.1/0.5 it had a tendency to run away with huge amounts of verbosity. I expect this is because the EOS token is being truncated when it really shouldn't be. I would probably add a parameter to handle the EOS token separately to more accurately control the length of generations. Maybe add a separate probability for excluding EOS from the top candidates, where 0 means it's never excluded and 1 means it's always excluded (current behavior). |
Maybe show the logits - why is the EOS not the only candidate with p>0.1 after the AI response was completed? What other candidates were there? |
This is more of an issue for prose, where the "end" is an open-ended question. There are multiple places where the model could potentially cut itself off. Per this post on Reddit: I'll try setting a logit bias for EOS and see how that works. |
I'm also wondering if instead of excluding all but the minimum above the threshold, whether the number to exclude could be parametrized as "Exclude Count": If there were three tokens above the threshold: "Yes" : 0.3 An "Exclude Count: 1" parameter would exclude the top result, returning "No", while an "Exclude Count: 2" parameter would exclude "Yes" and "No", returning "Maybe". This could be another way of controlling the aggressiveness of the sampler, with 0 defaulting to the normal behavior. Something like applying the logic only to the top n probs:
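The snippet that originally followed is not preserved here; a rough sketch of what such an "exclude count" variant might look like (the parameter name exclude_count is hypothetical, and this is plain-Python pseudocode over a token-to-probability dict, not the PR's actual implementation):

```python
def xtc_exclude_count(probs, threshold=0.1, exclude_count=0):
    """Hypothetical variant: remove at most `exclude_count` tokens that are
    above the threshold, starting from the most likely, instead of removing
    all but the least likely. exclude_count=0 falls back to standard XTC."""
    above = sorted((t for t, p in probs.items() if p >= threshold),
                   key=lambda t: probs[t], reverse=True)
    if len(above) < 2:
        return probs                        # fewer than two viable tokens: no-op
    max_removable = len(above) - 1          # always keep at least one viable token
    n = max_removable if exclude_count == 0 else min(exclude_count, max_removable)
    dropped = set(above[:n])
    kept = {t: p for t, p in probs.items() if t not in dropped}
    total = sum(kept.values())
    return {t: p / total for t, p in kept.items()}
```

On the "Yes"/"No"/"Maybe" example above, exclude_count=1 would leave "No" as the most likely surviving token, and exclude_count=2 would leave "Maybe".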
I would guess that even excluding just the top one or two could have a large impact on the feel of the result. |
XTC only takes effect if at least *two* tokens are above the threshold, so values larger than 0.5 do not make sense
I have considered several additional parameters and mechanisms (probability blending rather than "all or nothing", exclusion count control, token whitelist, ...) already during development, but they all add complexity to the user experience, and at the end of the day, there are already two parameters for controlling the strength of XTC and toning down its effects. Right now, you can look at any probability distribution and immediately see what effect a given set of XTC parameters would have on it, without needing to do any computation or even real thinking. The only other sampler for which this is true is Top-K, and it's a feature that I would really like to preserve. |
Looks good. I had tried a rudimentary version of this more similar to @p-e-w have you experienced any issues with the model failing to stop due to the EOS/EOT tokens not being generated while using these new parameters? |
That's a fair point. There's something beautiful about something like min_p where it's simple, elegant, and easy to understand. Due to the effectiveness of XTC, I would also expect this to become a standard inclusion for any sort of creative generation going forward, so some more granular control may be nice to have. If there are added parameters, it would still work just fine without using them, but the option could exist all the same for people to experiment with. Then again an exclude_n may not add any real value, but it does feel like a natural generalization of the base sampler. In fact, it's kind of like an inverted top-k. |
No. In fact, I haven't noticed any artifacts with the recommended parameter values. My real-world testing has mostly consisted of adventure-style roleplay chat with these parameters:
Message length was what I'm used to, and I saw no special characters or other garbage in the output. Number of paragraphs was also like before, even though XTC can theoretically suppress line breaks in some situations. What is your opinion on additional parameters like those proposed in the comments above? They are easy enough to add of course, but I'm worried that there will simply be too many knobs to turn. Determining when output is "better" is difficult enough even in the best case, but with more than two parameters that all control XTC's strength in some sense, the parameter space would be overwhelmingly large and any perceived improvement might be little more than noise. |
I won't belabor the point then. This is a great feature and the amount of testing you've done probably outweighs the few tests I've done. Raising the threshold would probably result in a similar effect to excluding the top n. |
in my limited testing with a mistral large model, while it did improve creative writing dramatically it also seemed to make the model much dumber. |
Which parameters did you use?
|
I used your suggested settings: xtc_probability (0.5), Min-P (0.02), DRY (multiplier 0.8), with all other samplers disabled. |
Try raising the XTC threshold to 0.2 from 0.1 and see how it feels then. Increasing the threshold should result in fewer low probability tokens selected. |
Mistral models are famous for running hot by default (for NeMo, Mistral officially recommends a temperature of just 0.3). What this means is that the top logits are more tightly packed together than for other models, which can lead to unusually many tokens meeting the default threshold of 0.1. That in turn leads to many sensible choices being discarded, so nonsense is occasionally generated. As suggested by @stepfunction83, you can try raising the threshold to get a more sensible cutoff. You could also lower the temperature, as long as you don't have the "temperature last" sampling option active. |
After extensive testing over the last few days, I think what is missing is a list of tokens to exclude from the effect of XTC. It could work just like dry_sequence_breakers, but in this case as a list of tokens to exclude from the effect (so if a token is the most probable and it is in the exclusion list, it should not be cut off).

As it is now, XTC can cut off end-of-stream tokens, newlines, and other things like "```". This can break workflows: for example, if I want more creativity while generating one prompt at a time in a text block, it can mess up formatting by missing a newline before ending the text block, or fail to end the message and generate more than one text block. It is even more unstable if the number of blocks to be generated is more than one. Just adjusting the threshold or probability does not achieve the desired effect, quite the opposite: the issue can still occur, even if less frequently, and the output becomes less creative.

There are more nuanced cases as well. For example, character names can sometimes be determined incorrectly, especially if a character has more than one way to be named, like a title and a name, and throwing out the most probable option causes an unwanted change in style, making it either more formal or more casual than it should be.

Having a field with a list like xtc_sequence_exclusions (implemented in a way similar to dry_sequence_breakers, as a comma-separated list of quoted strings) would completely solve this. Perhaps consider setting it by default to newlines, "```" and end-of-stream tokens; if someone wants longer paragraphs or messages, they could just delete those from the list, so it should be easy to adjust and understand. And adding custom strings based on personal use case would add great flexibility. |
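For illustration only, the proposal boils down to a check along these lines (xtc_sequence_exclusions is the commenter's suggested name, not a parameter that exists in the PR, and the matching rule here is an assumption):

```python
DEFAULT_EXCLUSIONS = ["\n", "```", "</s>"]  # newlines, code fences, an EOS-style token

def protected_from_xtc(token_text, exclusions=DEFAULT_EXCLUSIONS):
    """Hypothetical helper: a token whose text contains one of the
    user-supplied exclusion strings keeps its place even when it is
    among the most probable candidates; XTC would skip it when deciding
    which top choices to remove."""
    return any(s in token_text for s in exclusions)
```

As the maintainer notes further down, the version that was ultimately kept only special-cases "\n" and EOS rather than exposing a user-editable list.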
@p-e-w btw I implemented it in koboldcpp too. Think I got it right (LostRuins/koboldcpp@5bf527a). The only part to confirm is: if less than 2 tokens exceed xtc_threshold, the sampler does nothing, correct? |
The way it's implemented, it looks at the next token over's probability after sorting to determine exclusion, so the last token above the threshold will automatically be retained without needing to worry about specific counts. If there's only one above the threshold, the next one over would be below, so it would be the one retained. Also, definitely agree on the ability to add a list of excluded tokens.
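Concretely, the trick described above amounts to comparing each position with its neighbour in the descending-sorted probabilities; a minimal numpy sketch (illustrative rather than the PR's exact code):

```python
import numpy as np

def xtc_removal_mask(sorted_probs, threshold=0.1):
    """sorted_probs: token probabilities sorted in descending order.
    A position is marked for removal only if the *next* position is also
    above the threshold, so the last token above the threshold always
    survives, and nothing is removed when fewer than two tokens exceed it."""
    sorted_probs = np.asarray(sorted_probs)
    remove = np.zeros(len(sorted_probs), dtype=bool)
    remove[:-1] = sorted_probs[1:] >= threshold
    return remove

# Example: [0.65, 0.18, 0.11, 0.04] with threshold 0.1 -> [True, True, False, False]
```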
|
I don't have a lot of models lower than 70b that I use regularly. Especially nothing lower than 30b. Was hoping someone who enjoys those would find one. I already use low temperature, and the parameters do work for me. But in that case it's turning XTC off or having negligible effect. Having set threshold lower and probability slightly higher, it generates real kino. Filtering for those 2 tokens, it isn't as likely to degrade. For me it's win-win. |
The latest patch feels like an improvement after using it for a few days. I am testing with Mistral Large. I also tested with Magnum, but it is based on Mistral Large; I did not find any smaller models that would work well for me. I am fine with EOS being hardcoded if it is hard to make it controllable... but I feel that an exclusion list is definitely a necessity with this sampler. It could also make exclusion of the newline controllable without any additional checkboxes, just by having it as an item in the list by default.

When actually using this sampler for writing, as a writer I need fine control. There are many phrases, names or words which I want ignored by the XTC sampler (so it would be as if XTC were turned off for the current token when the current and previous tokens match a string I added to the exclusion list). As it is now, if some phrase is supposed to be used most of the time but another with similar meaning may be used too, then without the possibility to exclude it, the rarer option gets forced much more often, and I have to stop the generation, manually edit it, and continue. Otherwise, not just that phrase but the overall tone of the writing may change in the wrong direction, especially in longer generations. Of course, no matter how good the sampler is, I still have to do manual edits in the middle of generation; I do not expect the model to be perfect 100% of the time.

This problem arises not just with phrases. Say I give context implying that a variety of items may be discussed, with one item that needs to be mentioned more often; the sampler forces the opposite to happen. So when I make multiple generations, instead of the expected distribution I get the opposite of what I wanted, because I could not add the necessary keywords to an exclusion list. Even within a single generation it sometimes makes it harder to steer the model the way I want, because it keeps inserting less probable words, names or phrases more often. Generally this is a good thing, and it is why I like this sampler, but as a writer, if I know the exact words or phrases that I do not want affected, I really miss an option to add them.

I hope it is possible to implement. Like someone else mentioned, just a simple list of strings, like the one for the DRY sampler, would work great. I think it is even more important, because I don't remember ever changing the DRY list of strings... but I definitely would be changing an XTC list of excluded strings if it were implemented, adjusting it depending on the story, and perhaps keeping custom profiles I can quickly switch between. |
I'm at a loss for what to do here. Every single report of problems mentions 70+B models. I don't even have the ghost of a theory for why larger models are affected but smaller ones are not (I now have several thousands of messages generated with 4 different models <= 35B, and not a single case of excessive message length or missing newlines).

The best proposed "solution" is either hardcoding or partially-hardcoding a bandaid exclusion list, even though there is no theoretical justification for treating EOS/newline differently than any other token.

I'm leaning towards recommending that this PR be merged in its original form without special-casing any tokens, and that people experiencing problems patch the sampler code themselves to do what they want, until we get a better understanding of what is actually going on here. I accept and believe that there are problems for certain use cases with certain models, but I don't think adding ad-hoc parameters is a good idea just because they have been observed to alleviate some of those problems in some cases. Other truncation samplers don't have special treatment for specific tokens either.

I have described multiple times why it doesn't make sense that XTC should introduce behaviors that don't happen without it, since by construction such behaviors would happen with a significant probability even with XTC disabled. The fact that some reports appear to contradict this demonstrates that we currently lack an understanding of the actual mechanics that cause runaway outputs. I don't believe that trying to fix this issue without understanding it is the right way forward. |
Runaway generation or not, it doesn't make sense to me that XTC should touch structural tokens like EOS or newline. They are categorically different from textual tokens. I would argue that most people would desire the improved word variety that comes with XTC, but would not want it to impact paragraph or generation length. If they do, that should be an explicit choice to do so.

I disagree that it should be merged in its original form; a minimal user-facing implementation should contain at least a checkbox for excluding EOS and newline from the sampler (if not a user-editable exclusion list as previously discussed). Users would determine what works best in practice and then a future implementation can adjust accordingly. Without this user-facing flexibility, it would be difficult to gauge preferences. If nothing else, this could provide a greater range of creative options to use, which I feel is really the goal at the end of the day.
|
Every existing sampler impacts paragraph and generation length, and none provide an option to prevent that. If you have Min-P set to 0.1, and the newline token has a probability below 10% of the top token's probability, then Min-P will set that probability to zero, and suddenly a newline cannot occur at that position when previously it could. If you increase or decrease the temperature, the probability of newlines etc. will be altered. By definition, samplers modify token probabilities, and since token probabilities control paragraph and output length, all samplers impact those. But there is no reason to expect that XTC would distort paragraphs and output lengths to an observable degree, and indeed it doesn't, for any model that I have tested. The fact that some users have observed distortions shows that we don't understand what is happening in those cases, and "let's tape over that with a feature that no other sampler provides" isn't the right answer. I'm unconvinced that the problems seen with some large models cannot be solved by simply combining XTC with other samplers. All existing truncation samplers are content-agnostic, and I don't see why XTC shouldn't be as well. |
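To make the Min-P arithmetic being referred to concrete, here is a small sketch with invented numbers:

```python
# Min-P removes every token whose probability falls below
# min_p * (probability of the most likely token).
probs = {"word": 0.52, ",": 0.23, "\n": 0.04, "and": 0.03}
min_p = 0.1

cutoff = min_p * max(probs.values())                  # 0.1 * 0.52 = 0.052
kept = {t: p for t, p in probs.items() if p >= cutoff}
# The newline token (0.04 < 0.052) is zeroed out, so a paragraph break that
# was possible before the sampler ran can no longer occur at this position.
```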
The difference is that existing samplers touch the tail of the distribution, not the head. Min-p excluding low probability tokens is very different than XTC excluding the highest probability ones. XTC is basically restricting the "intelligence" of the model instead of trying to enhance it, so avoiding applying that to key structural tokens would be desirable. My other point is that there's no harm in providing flexibility to users. If they don't need it, they can ignore it. If they want to engage with it, they can engage. Limiting functionality just because it's been done that way so far is not a good argument in my view. |
I don't see much harm in a checkbox. Makes both sides happy. It still writes long when the tokens are not biased towards EOS/newline, and there is no way that any truncating sampler can catch this behavior from the other end. How do you push specific tokens to be in exactly the right middle ground? IME, XTC has had to be run at low temperature anyway to keep the models from going incoherent. I tried to use token bias on EOS and it didn't work that well either. The theory seems simple: when newline or EOS is among the top candidates, it gets eliminated and the model keeps writing. The alternative is large models outputting long, dense and rambling text after some messages. Set up correctly, XTC is kino. Set up incorrectly, it's thesaurus schizo. |
I rented an A100 server this morning, planning to test XTC with large models, but I couldn't get TGWUI to work with CUDA 11.8. So I still haven't been able to reproduce these problems yet. This issue is quite deep, and I think it's time to separate it from this PR, which is working fine for many people including myself. I propose the following:
That way, the vast majority of users who aren't running 70+B models can enjoy the benefits of XTC, and potential improvements for very large models can be tested and evaluated separately. |
I tried your original settings to Magnum-v2-70b, and the result is beautiful I think. Prompt: Write a story about a battle to the death between 2 men: Jeff, who controls fire, and John, who controls water. Output: Jeff focused his mind, raising his arms towards the sky. With a snarl of effort, he summoned flames from thin air. The fire coalesced into a searing blaze in his hands. The heat radiated out, crackling with power. John responded by diving into the cool waters at his feet. He submerged himself fully, then erupted upwards, spewing a geyser of water into the air. The liquid took shape, cohering around his outstretched arms like a shimmering shield. "Come then, you charred pretender!" John called out defiantly over the roar of the sea. Jeff responded with only a glowering glare, flinging balls of scorching fire at his foe. John dodged them nimbly, the flames hissing as they struck the beach. In turn, he sent tidal waves crashing over Jeff, drenching the flames. They battled back and forth like this for long moments - John's surging deluges snuffing out Jeff's infernal blasts, only for the fire user to conjure new flames, burning through John's drenching waters. The sand sizzled and steamed from the elemental clash. The sun beat down mercilessly. Finally, they tired of this stalemate. Simultaneously, the two elemental warriors unleashed their greatest efforts. Jeff called down a pillar of white-hot fire from the heavens above. John called up a tsunami of water from the depths below. Fire met water in the middle, an explosion of scalding steam filling the sky. The two fighters grappled together amidst the blinding fog of their battle's final release. When the steam dissipated, only one figure stood. Jeff, blackened and blistered but triumphant. "I've claimed my rightful place as king of the elements!" he cackled with blackened, burning lips. But his laughter cut off short, and his body went limp, crumpling to the scoured beach. An arrow, dripping with water's essence, protruded from his smoking breastbone. John walked out from behind the cover of a nearby rock formation. "You made a critical error, old friend. Did you really think I didn't have one final trick up my sleeve? You were so focused on playing with fire, you forgot about my trusty bow and quiver full of enchanted arrows." The water user retrieved his bow from the fallen Jeff's corpse. He nocked a single arrow, aimed at the lifeless form. "Now, to the grave with you, traitor and foe!" He fired one last shaft into Jeff's unmoving body, ensuring the fire mage was well and truly gone. John stood over his enemy, a satisfied grin on his face, triumphant at last over his former companion, and the master of both water and fire. And that, as they say, is that! |
I switched to the SillyTavern staging branch and tested XTC with Nemo-based 12B models, and for me it seems like the model is definitely dumber, struggling with facts from the context here and there, at least with the default 0.1/0.5 settings. But yes, the writing is more creative, although I'm not sure it's worth it. |
That's because they are finetuned from an already good model to have better responses at the top - as such, XTC can negate the work done by finetuning. Nemo Instruct works well with 0.1/0.5. A way to control that would be an upper limit, such as |
I had the same thought. @p-e-w Wouldn't it be better to use a relative probability threshold like |
@josephrocca fwiw that comment was made a month ago. Since then, XTC has been live in KoboldCpp and hundreds of people have already been using it to great success. So I think the existing implementation is good enough - something like threshold 0.15 probability 0.5 works very well. |
Yep, I know it works and has rave reviews. But it seems like it could be made plainly better with a change like this. The current approach doesn't really make sense. |
This would be a substantially different sampler than the one described and is likely out of scope of the PR. "Better" is very much subjective at this point. Also, on another note, after more experimentation with this in Kobold, I haven't experienced the issues I did when using it in text-gen, so the implementation here is likely good to go. |
Here's why that doesn't work, or at least, why it doesn't do the same thing as XTC: Let's say the prompt is
We want to break the trite cliché of "shivers down her spine". With XTC, we can decide in an absolute sense that a token with a probability above 10% is a "viable" token. That's quite intuitive, because something that has a 10% chance of happening is pretty much by definition a sensible option. If choosing something with a probability of 10% makes the model go off the rails, then it can go off the rails anyway. Now let's say that instead, the sampler worked with relative probabilities like you propose. That is, the threshold is not fixed, but a percentage of the probability of the top token. Which percentage should we choose? In order to eliminate
The model is much less certain this time, and a relative threshold of 15% of the top token probability ends up eliminating all tokens with a probability greater than 1.8%! In other words, only extremely unlikely tokens remain for sampling.

Relative thresholds are a bad idea because whether a token makes sense is not a relative concept. It has nothing to do with the probability of other tokens, and everything to do with a (subjective) assessment of "possible". Here's an analogy to make this more clear: A gambler who believes that there is a 20% chance of Barcelona beating Real may consider that a viable bet. If he assesses the chance at a mere 3%, that bet might not be viable in his opinion. But regardless of his individual perception, whether Barca beating Real is a viable bet has nothing whatsoever to do with the probability of Liverpool beating Chelsea. That's simply an entirely different matter. Whether a token is viable is measured against what the user considers a viable probability – not against how probable other tokens in the same distribution happen to be.

All that being said, there are obviously alternative ways to do top-truncation. Considering that there are half a dozen bottom-truncation samplers, I don't see a reason why more than one top-truncation sampler shouldn't be implemented as well. But those other samplers won't be XTC, and that's fine. |
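A quick numerical illustration of this point (both distributions are invented; the original comment's examples are not preserved here): an absolute threshold stays put, while a relative one collapses exactly when the model is uncertain.

```python
threshold_abs = 0.10  # XTC: anything with >= 10% probability counts as viable
threshold_rel = 0.15  # hypothetical relative variant: 15% of the top probability

confident = {"spine": 0.60, "back": 0.15, "arms": 0.11, "neck": 0.05}  # cliché case
uncertain = {"out": 0.12, "up": 0.10, "away": 0.09, "past": 0.07}      # open-ended case

for name, dist in [("confident", confident), ("uncertain", uncertain)]:
    top = max(dist.values())
    print(f"{name}: absolute cutoff = {threshold_abs}, "
          f"relative cutoff = {threshold_rel * top:.3f}")
# confident: relative cutoff 0.090 -> behaves roughly like the absolute 0.1
# uncertain: relative cutoff 0.018 -> every token above ~2% gets culled,
#            leaving only extremely unlikely continuations for sampling
```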
Ahh I see, so basically with a reasonable threshold like 10%, XTC never 'activates' in situations where it wouldn't make sense to activate, so the aforementioned issue is never a practical concern. And RE the next section of your answer, I was initially confused about your choice of 15% of the size of the top probability (rather than within 15% of the size of the top probability), but IIUC, what you're saying is that you'd need to have such an extreme relative probability threshold in order to solve the Great explanation, thank you! 🙏 |
I want to keep the "\n" and EOS exclusion because I agree with @Ph0rk0z, the goal of this sampler is to get the model creative in its words, not in the format of its reply. This is an ugly heuristic but I believe that it should lead to better results in conversations. About adding a parameter to exclude certain tokens, I don't have any use for this and don't see much demand. Please write a PR if you want this feature. |
An issue may be caused by this pr:
Please review. |
@Touch-Night can you share the preset that generates the error? Export it with the 💾 button and paste the parameters here. |
Certainly. Let's move to #6414, I pasted the exported parameters there. |
Background
Apart from some special cases like repetition penalties, all widely used sampling algorithms fall into two categories: truncation samplers (Top-K, Top-P, Min-P, etc.), which remove the least likely tokens from consideration, and distribution-shaping samplers (most notably temperature), which rescale token probabilities without removing any.
All of these sampling strategies have one thing in common: They don't change the probability order of tokens, and in particular, the most probable tokens from the raw distribution are still the most probable tokens after applying such samplers.
It is therefore unsurprising that existing samplers are somewhat ill-suited for the task of enhancing a model's creativity. The best you can do is either reduce truncation (which will shift the range of acceptable tokens towards the "garbage end" of the distribution), or reshape the distribution to make low probability (garbage) tokens more likely. The result tends to be models going "off the rails" rather than being more creative in the commonly used sense of the word.
What XTC does
This pull request introduces the Exclude Top Choices (XTC) sampling algorithm. XTC is a novel sampler that turns truncation on its head: Instead of pruning the least likely tokens, under certain circumstances, it removes the most likely tokens from consideration.
More precisely, it removes all except the least likely token meeting a given threshold, with a given probability. This ensures that at least one "viable" choice remains, retaining coherence. Truncation samplers can be applied as usual, preventing garbage from being sampled. The result is coherent output (because truncation removes bad tokens) with unprecedented creativity (because XTC removes "boring" tokens).
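As a reference for the description above, here is a minimal sketch of the mechanism in plain Python over a token-to-probability dict (the actual PR implements this as a Transformers-based sampler operating on logits, so the names and details here are illustrative only):

```python
import random

def xtc_filter(probs, threshold=0.1, probability=0.5):
    """Exclude Top Choices, as described above: with the given probability,
    remove every token whose probability meets the threshold EXCEPT the
    least likely of them, so one "viable" candidate always survives."""
    if random.random() >= probability:
        return probs                               # XTC skipped this step
    above = [t for t, p in probs.items() if p >= threshold]
    if len(above) < 2:
        return probs                               # fewer than two viable tokens: no-op
    keep = min(above, key=lambda t: probs[t])      # least likely token above the threshold
    kept = {t: p for t, p in probs.items() if t == keep or t not in above}
    total = sum(kept.values())
    return {t: p / total for t, p in kept.items()}  # renormalize before sampling
```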
My experience so far has been that this gives spectacular results. The creativity is off the charts, while the coherence is virtually unchanged. This is especially apparent when regenerating a chat message several times: Models tend to generate roughly the same message structure each time once a sufficiently long context has established expectations. But with XTC enabled, models will often generate messages that are completely different from previous attempts, because eliminating the most likely choices breaks ingrained patterns.
One stone, many birds
XTC doesn't just boost creativity, it also breaks writing clichés and inhibits repetition, including non-verbatim (paraphrased/structural) repetition. It is the first sampler that I'm aware of that can successfully do the latter. Standard repetition penalties operate by first trying to identify repetition and then penalizing tokens accordingly. But detecting paraphrased or structural repetition is extremely difficult, so repetition penalties usually aren't able to prevent it from happening. By contrast, XTC penalizes tokens simply for being very likely, which often includes tokens that reflect the model's tendency to repeat previous output.
Demonstration
The following outputs are not cherry-picked. They were the first outputs I generated with each given configuration.
mistral-7b-instruct-v0.2.Q4_K_M
Baseline (Min-P = 0.02)
Notes:
Min-P = 0.02, Temperature = 1.5
Notes:
Min-P = 0.02, XTC threshold = 0.1, XTC probability = 0.5
Notes:
How to try out XTC
- Check out the xtc branch from my fork.
- Set xtc_probability to a value greater than zero (0.5 is a good start). I recommend pairing it with Min-P (0.02) and DRY (multiplier 0.8), with all other samplers disabled.

If you want to use XTC over the API (e.g. with SillyTavern), you will need to patch the client to send the appropriate XTC parameters, or TGWUI itself to hardcode a non-zero probability. Note that SillyTavern also sends the "sampler priority" parameter, which might interfere with proper operation of XTC unless further patching is done (see next section).
Important note: To use XTC with a GGUF model, you need to use the "llamacpp_HF creator" in the "Model" tab and then load the model using llamacpp_HF, because otherwise Transformers-based samplers have no effect.
Position in the sampler stack
While there is certainly room for experimentation, I strongly recommend placing XTC after all truncation samplers. This ensures that truncation happens based on the original distribution and remains predictable, regardless of how much probability mass is removed by XTC.
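A sketch of that ordering, reusing the xtc_filter sketch from the "What XTC does" section above together with an inline Min-P truncation (illustrative only; in the web UI this ordering is controlled via the sampler priority setting rather than code):

```python
import random

def truncate_min_p(probs, min_p=0.02):
    """Min-P truncation computed on the original, untouched distribution."""
    cutoff = min_p * max(probs.values())
    kept = {t: p for t, p in probs.items() if p >= cutoff}
    total = sum(kept.values())
    return {t: p / total for t, p in kept.items()}

def sample_next_token(probs, min_p=0.02, xtc_threshold=0.1, xtc_probability=0.5):
    probs = truncate_min_p(probs, min_p)                        # truncation first
    probs = xtc_filter(probs, xtc_threshold, xtc_probability)   # XTC last
    return random.choices(list(probs), weights=list(probs.values()))[0]
```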
Checklist