-
Notifications
You must be signed in to change notification settings - Fork 10.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
added implementation of DRY sampler (post-refactor) #9702
added implementation of DRY sampler (post-refactor) #9702
Conversation
@compilade You had kindly offered possibly doing a code review for the previous iteration of this PR before the sampler API overhaul. If you still have availability for that, I would definitely love to get your feedback and insights on this newest iteration. |
@compilade Thanks for taking a look! If it’s any help the koboldcpp version of this has a lot of extra comments that go more in-depth on the intricacies of the Z-algorithm and the clever way they implement it. Most of the core implementation here is exactly the same except for the bits I had to change due to the ring_buffer accessing from the end. |
KoboldCPP's handling of sequence breakers is excellent, and it's great to see it ported here. However, the Z-algorithm is actually unnecessary complexity as I had remarked in the Kobold PR. Quadratic runtime can be avoided by simply capping the maximum sequence length being matched (to 50, in the case of text-generation-webui). The penalty is exponential in the match length, so this makes no difference in practice as the value will be astronomically large regardless. This can simplify the algorithm considerably, and has been tested to perform well even in the worst-case scenario where the input consists entirely of repetitions of a single token. Personally, I prefer simplicity over notions of algorithmic elegance if performance is demonstrably equivalent as in this case, although I accept and respect that not everyone shares that opinion. Either way, great to see this happening! Make sure you run some basic smoke tests before merging this in, e.g. using the input from the "Demonstration" section of my original PR, to verify the system is actually doing what it is supposed to, since the implementation is fairly complex at this point. |
@p-e-w I really appreciate your feedback and am a fan of not just this sampler but also XTC, which I was happy to see at least two people beating me to the punch on as far as implementing with this new sampler API (albeit one of the submissions being made through reddit rather than a PR). As for the issues you raise about complexity, you bring up very fair points. I will admit that I do tend to be a sucker for "elegant" algorithms or ones that seem particularly clever to me. Having said that, I would not have been opposed to implementing it the less complex way but didn't realize you weren't a fan of the z-algorithm portion until I was already pretty far into the process and saw your comments while digging deeper into all the PRs.
In regard to your demonstration text, that was indeed one of the first things I tested it on! For anybody else looking at this, here is the first part of it that is fed to the LLM (in this case, I fed it to Nemo):
Here is what Nemo gave me with DRY off when I ran it just now:
Here is what Nemo gave me with this implementation of DRY turned on (multiplier: 0.8, base: 1.75, allowed_length: 2, default sequence breakers which include ":"):
Here is some relevant debug output showing the final tokens of the prompt, as well as the penalty itself and a comparison of logits before and after:
|
Fair enough, as long as everything works it should be fine, and since you've already implemented it I guess there's little point in changing it now. The elephant in the room with such decisions is usually maintainability, but to be frank I doubt this code will need to be touched very often. BTW, I love the output you got with the demonstration example. It shows not only that DRY is capable of stopping repetition (which is a mathematical certainty in principle), but that the model recovers organically from the intervention, creating a natural transition to a completely different train of thought. Many people have told me that they expect DRY to just make the model paraphrase the previous output instead of repeating it verbatim, or introduce spelling mistakes, but your output demonstrates that this usually doesn't happen with current-gen models. And it's even less likely to happen in actual conversations, because in such scenarios, DRY gradually increases the penalty rather than just slamming the door because the user already imposed a 9-token repetition like in this example. |
@compilade I went ahead and made several small fixes that hopefully should address the strict compile warnings/errors I saw from the workflows. Incidentally, I also removed a "cstring" include in sampler.cpp that I had added earlier but that is no longer needed, and also removed the redundant line in server.cpp I had commented on earlier (and resolved that comment). Thanks again and let me know if I can do anything else to help the process. |
It looks like @compilade has been showing a busy status since the weekend and I’m guessing has a pretty full plate at the moment. I look forward to seeing any feedback he may have, but I also wanted to extend an invitation for feedback to any other maintainers who may have a moment as well? @slaren or @ggerganov, with your recent sampling-focused work/review, would either of you have a chance to take a look? I am excited about this being so close to the finish line and I have seen great results from all the testing I have done. Thank you! |
There is a lot of code here that makes this change hard to review. It would be easier to do so if all the debug code was removed. Functions like |
@slaren Thank you for the feedback!
Despite it not being a strict interface function, I left the About an hour ago I resolved the recent conflicts that arose from #9805. Prior to that, all of the CI workflows that @compilade triggered were showing as passing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like @compilade has been showing a busy status since the weekend and I’m guessing has a pretty full plate at the moment.
@wwoodsTM Correct, I've had much less free time that I'd like in the last weeks. But I still want to give you feedback.
I really appreciate the efforts you've put into documenting your changes in the PR description.
We now have the explicit permission from @pi6am to base this version on his implementation from koboldcpp. I also added him as a co-author on earlier commits for this PR. I believe I had addressed all of the concerns from @compilade's code review in my latest commit, but I of course welcome any and all additional feedback. @ggerganov @slaren Thanks. |
@slaren I implemented your suggestions and fixed a conflict with the latest version of master. Continuing the conversation on the sequence breaker functions: While I had only meant these to be a "part 2" of the init process, I can see how this was not clear and how regardless of intention they don't mesh well with the design. To address this, I came up with a way of getting rid of the sequence breaker functions altogether which I just implemented in the latest commit. This new approach uses one main init function which includes the sequence breakers as a parameter, and then I made two overloaded wrapper versions, one for the C-interface and another for test-sampling.cpp. Does this approach seem workable? |
Why no take |
So as @compilade mentioned before, I preferred the flexibility of C++ strings to allow for the possibility of NUL characters as I wasn't sure whether there might be some contexts where this literal "sequence breaker" for C-strings might be used as a sequence breaker for DRY. I also wasn't sure whether there might be any other weird encoding things that might crop up with C-strings. But maybe the chances are quite remote of this? Also, in my particular use case, I use llama-server and pass sequence breakers through JSON, and so this probably biases me a bit toward thinking of that as kind of the principal interface versus the C-interface. As a side note, I just made an additional commit to address warnings-made-errors from the pedantic runs that I had missed. |
I am not familiar with this sampler, and I am not sure what kinds of strings are used as sequence breakers, but it seems very unlikely that allowing NUL characters is important. If it is really necessary, it can be done by passing an additional array with the sizes of each string. |
Do you feel this is a deal-breaker in its current form? My feeling is that at present it is a decent compromise to allow for keeping things in pure C++ string form where possible, such as when the breakers are submitted through JSON and tokenized in get_overlapping_token_sequences, as opposed to converting to C strings and then converting back to C++ strings every time. In the current implementation, sequence breakers would only be converted from C-strings on an as-needed basis when they are specifically submitted through the C-interface. |
My opinion is that it is not worth the maintenance cost of having "secret" API functions, especially for something as minor as this. It takes two lines of code to convert a vector of |
I hear you, it’s just that C-strings have a history of sending shivers down my spine and making me curse in repetitive ways that would probably be penalized by this sampler. I was hoping to avoid converting the overlap code from kobold to use them, but honestly it shouldn’t be hard since we’re not really manipulating the strings or doing anything complicated with them, so I will get over my fear of them and make the change a bit later today when I get home. Thanks again for your feedback. |
I am not sure if I understand correctly, but it is ok to store the strings in the sampler state as |
Ok, I have gone ahead and eliminated the C++ version of |
I have implemented all suggestions so far into the latest commit, including the use of |
faff4b1
to
519cd19
Compare
Co-authored-by: l3utterfly <gc.pthzfoldr@gmail.com> Co-authored-by: pi6am <34464159+pi6am@users.noreply.github.com> * Implementation of DRY Sampling (post-sampling-refactor) * Better merge of latest changes to master in server.cpp * Added co-authors, testing, fix for consistent prompt sampling * Misc. cleanup * Remove debug define * Fixing strict compile errors, cleaning up * Removed unnecessary include * Removed debugging code, moved (de)tokenizing functions * Bug fixes to z-alg implementation * small fix for string_format * restoring Koboldcpp comments, adding attribution in llama.h and llama-sampling.cpp * fixes, adjustments from code review * Removing sequence breaker functions * Addressed warnings from pedantic compile * Trying alternative ways to pull in external tokenize/detoneize functions * Removed C++ version of llama_sampler_init_dry * Implemented testing code * Switched to using llama_tokenize_internal * Removed support for JSON-within-JSON for seq breakers * Removed old llama_sampler_init_dry declaration * Trying ggerganov's suggestion with llama_n_ctx_train * Using vocab instead of model * Restored common_sampler_init signature, using llama_n_ctx_train for context, updated testing code * Minor cleanup * Added --dry-sequence-breaker arg for single breaker, fixed docs and default values * Removed mixed arrays for prompt
519cd19
to
9697338
Compare
Co-authored-by: l3utterfly <gc.pthzfoldr@gmail.com> Co-authored-by: pi6am <34464159+pi6am@users.noreply.github.com>
Not sure what I did wrong with the "co-authored" part in the first force-pushes as GitHub didn't seem to want to recognize that; maybe it deals with force-pushes differently in that respect? Anyway, I removed an extra empty line that I probably should've removed originally anyway and did a normal commit to make sure that @l3utterfly and @pi6am were restored as co-authors on this PR. |
Is it possible my force-pushes and commit all in quick succession threw off the tests and that's why we're getting failures? |
Some of the builds have warnings as errors, it fails to build here:
|
Thank you! Just recommitted. I thought I had my warnings and pedantic settings all cranked to the max but somehow that one didn't show for me. |
I am able to reproduce the crash on my own machine coming from running this: I am now trying to track down the issue. |
@ggerganov @ngxson I resolved the crash issue, which was due to DRY being in the default sampler chain but not being initialized when disabled. Now DRY will properly initialize with empty data structures even when disabled. |
I believe this thing may be finally ready! 😅 @ngxson Would you be ok with removing the block for merging at this point? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup thanks, the code looks much cleaner now.
Thank you, it was greatly helped by your guidance and that of everyone else who has offered feedback. @ggerganov Do you feel this is ready to merge? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wwoodsTM Good job!
damn, what a ride thanks @wwoodsTM for your work ! |
@ggerganov Thank you, I learned a lot throughout this process and have appreciated your feedback throughout. I am really excited to see this merge, let me know if I need to do anything else on my end for that step. I also hope if there is a squash that the “co-authored” sections of the commits survive better than when I did my force-pushes. |
Sorry for barging in, but what's the reason for DRY being a sampler in queue and not a sampler before the queue, like penalties? It used to go right after penalties and before the queue in the first PR, and I'm not sure there is a reason to have DRY anywhere except at the beginning of the queue. |
@MaggotHATE You are right that I originally implemented it as occurring between traditional penalties and the sampler chain. The change to add it into the chain stemmed from @compilade's request in this review comment. I also like you wonder whether there is much applicability in other places, but I felt that I could address compilade's request and expand flexibility a bit without hurting anything if I put it first in the chain. |
Oh, sorry, looks like I've missed this comment. If it is by request, then it's ok. I'm still not sure where else it might be used, but more flexibility is definitely for better. |
The new PR that was merged needs an array instead of a str ggerganov/llama.cpp#9702
* llama.cpp Enable dry w/ array convert The new PR that was merged needs an array instead of a str ggerganov/llama.cpp#9702 * Safe sequence breakers parse * Support comma-separated list of llama.cpp sequence breakers #3026 --------- Co-authored-by: Beinsezii <beinsezii@gmail.com>
* sampling : add DRY sampler (post-refactor) * DRY: Trying to fix coauthors, removed unneeded line * DRY: Fixed redundant code * DRY: Fixed crash issue due to DRY being in chain but uninitialized --------- Co-authored-by: l3utterfly <gc.pthzfoldr@gmail.com> Co-authored-by: pi6am <34464159+pi6am@users.noreply.github.com>
* sampling : add DRY sampler (post-refactor) * DRY: Trying to fix coauthors, removed unneeded line * DRY: Fixed redundant code * DRY: Fixed crash issue due to DRY being in chain but uninitialized --------- Co-authored-by: l3utterfly <gc.pthzfoldr@gmail.com> Co-authored-by: pi6am <34464159+pi6am@users.noreply.github.com>
This is a continuation of the previous PR #6839 to implement DRY sampling (and I think a pretty faithful port of Koboldcpp’s impressive implementation, authored by @pi6am [who is also credited as co-author for this PR], which in turn is originally based on @p-e-w‘s design and ideas as explained here), but now overhauled to use the new sampler API. I added @l3utterfly as co-author in a recent commit of this PR.
Here are some important notes (EDIT: all have been addressed at this point, and in the case of the first point, this was moved to a separate PR, #10019):
I may have also fixed Bug: uncached prompt is not used for penalty #8971 (or it may turn out to be more of a bandaid), but by moving/forcing sampling to happen over the entire prompt at the conclusion of prompt processing in server.cpp, I was able to get consistent sampling of the prompt, regardless of caching or whether it is the first generation or not.I also fixed a recent breaking change that was made, though I am not sure when this change happened. For quite a while it was possible (and indeed I have heavily relied on the feature in my own software for a long while now) to be able to submit a mixed array of strings and token IDs as a singleton prompt. This I believe was originally implemented from PR server: allow json array in prompt or content for direct token input #2306. Apologies if this is considered out of scope here, and I can remove it or move it to a separate PR if need be, but to be honest this feature was the main reason why I converted to using llama.cpp's server. It also seems this fix restores the feature and doesn't stop the other prompting options from working either, but if I missed something here, please let me know.I added a context size parameter to gpt_sampler_init(). If there is a smarter way to handle this, I am happy to do it in a less drastic way, but I couldn't immediately figure that out.The classic issue that I have dealt with from day one with this where I can't figure out how to get the vector of strings for the sequence breakers from server.cpp/common/API land to src/llama-sampling land is still here. I again "cheated" with a C++ section at the bottom of llama.h and made a separate function for setting the sequence breakers apart from the regular init function. There is probably some clever way to avoid this or to use a void or extern that I am not as experienced with. I welcome any help or suggestions with that or any of these other issues.Finally, I did not implement a CLI argument for sequence breakers. If anybody wants to take a whack at that, please feel free. My thought was that maybe it could be done similarly to how the reverse/anti prompt is done (otherwise known as stopping strings) or how JSON schema is done in the CLI, but I wasn't really sure how to pull it off correctly.