proposal: Move token decoding and stopping evaluation to router #138
Conversation
Hello! Thanks for the PR. First a disclaimer: I only skimmed through the PR but I want to get this discussion started.
I would argue that the decoding logic is already "in rust" since right now
I'm not sure I see what you mean here. From what I saw, you moved part of the
True. But this could have been implemented with the current system.
Do you have an example use case for time-limit based stopping criteria?
Shouldn't this be added directly to
The main claim of this PR is that moving the code from Python to Rust is faster.
Thanks @OlivierDehaene, and sorry for the PR being quite large. For context - I have been making many changes/additions on an internal fork for some time now with the intention of contributing (or at least offering) most of them back, but you've also been making changes quickly, including implementing some of the same things before I had a chance to port PRs over! This could probably be separated into two logical parts - moving the stop criteria handling could be done while leaving the detokenization on the python side - so perhaps it's worth discussing them separately.
I know what you mean, but the lookup of individual tokens is probably not very significant either way (just a hashtable index). I was thinking more about how that can be inlined with surrounding conditional logic, including that related to incremental decoding. It seemed nice for the python shards to just stream token ids back and not worry about the text manipulation side of things.
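For readers unfamiliar with the incremental-decoding issue being discussed, here is a minimal Python sketch of the common prefix-offset technique - an illustration of the general idea only, not the code in this PR; the function name and offset handling are assumptions:

```python
# Minimal sketch of prefix-offset incremental decoding (illustrative only).
# Decode the tail of the sequence with and without the newest token(s) and
# emit only the newly completed text, holding back output while the last
# token ends in an incomplete character (e.g. byte-fallback pieces).
def incremental_decode(tokenizer, all_ids, prefix_offset, read_offset):
    prefix_text = tokenizer.decode(all_ids[prefix_offset:read_offset])
    new_text = tokenizer.decode(all_ids[prefix_offset:])
    if len(new_text) > len(prefix_text) and not new_text.endswith("\ufffd"):
        # New fully-decoded text is available; advance both offsets.
        return new_text[len(prefix_text):], read_offset, len(all_ids)
    # The newest token only produced a partial character; emit nothing yet.
    return "", prefix_offset, read_offset
```

Whether this logic lives in the python shards or in the router, the point above is that it only needs the stream of token ids, not any other per-model state.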
Yes, exactly. I did this originally in support of moving the stopping criteria logic out but then found that this decomposition of
and the resulting simpler code for each of these should be easier to maintain or implement for new kinds of models (again just imho)
Sure, but this seems a bit more complicated - the stopping decision logic is split between two different places, and each of the separate model subclasses has to be involved. Performance/simplicity-wise, what's the downside of having the stop criteria logic done in one place on the rust side? It also means the stop sequence evaluation can be more efficient... it's all on the infer loop critical path.
Depending on the model/hardware, the generation might not be super fast. So it can be useful to say, for example, "give me as much as you can in 2 seconds" rather than guessing a max token limit.
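For illustration, a minimal Python sketch of what combined stopping evaluation (a token budget, stop sequences, and the time-limit option discussed here) could look like; the class and parameter names are hypothetical, not this repo's actual API:

```python
import time

class StoppingCriteria:
    """Hypothetical router-side stopping evaluation (sketch only)."""

    def __init__(self, max_new_tokens, stop_sequences=(), time_limit_ms=None):
        self.max_new_tokens = max_new_tokens
        self.stop_sequences = stop_sequences
        self.time_limit_ms = time_limit_ms
        self.start = time.monotonic()
        self.generated_tokens = 0

    def should_stop(self, generated_text: str) -> bool:
        """Called once per generated token with the text decoded so far."""
        self.generated_tokens += 1
        if self.generated_tokens >= self.max_new_tokens:
            return True
        # A real implementation would only scan the newly decoded suffix.
        if any(seq in generated_text for seq in self.stop_sequences):
            return True
        if self.time_limit_ms is not None:
            elapsed_ms = (time.monotonic() - self.start) * 1000
            if elapsed_ms >= self.time_limit_ms:
                return True
        return False
```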
See the discussion with @Narsil here. I agree with you that it would be nice for it to be added there, but it sounds like streaming functionality may be too niche. In any case I thought perhaps this project could be a good place to incubate it given it's the main application for it. Kind of like how you've included various custom/specialized changes to
I wouldn't say that's the main claim, more like just one of the potential benefits. I think more significant is the separation/movement of concerns.
We are working on benchmarking, so I hope we can get some data on this soon. There are other variations too, like lots of stop sequences being used, etc.
I thought I saw that @Narsil was working hard on filling these gaps :)
Also:
- Fixes incremental decoding inconsistency issues
- Adds time limit based stopping option
@OlivierDehaene I was going to rebase this but then realized your benchmark stuff talks directly to the internal proto interface and so would need to be adjusted too.
@njhill,
Closing in favor of more recent work that implements large parts of this PR: #202
Wrong link?
Whoops, sorry about that bad copy/paste 😅 I've updated the comment above.
Benefits:
I've included an implementation of IncrementalDecoder logic that addresses the various problems associated with incremental decoding when streaming for the various kinds of tokenizers. Specifically:
The main change is the removal of the stopping criteria logic from the python model `generate_token` methods, and the addition of a separate `prune` method to the batch classes that prunes the batch based on a list of completed request ids. These lists are passed in via a second parameter added to the internal `Decode` rpc.

A new optional `time_limit_ms` parameter is added to the external API.

Given that this changes the internal python APIs a bit, the python-based tests need some adjustments. I've only partially done that, and set most of them to be skipped for now.
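As a rough illustration of the prune idea described above (a sketch only; the real batch classes hold tensors and more per-request state, and these field names are assumptions):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Batch:
    request_ids: List[int]
    input_ids: List[List[int]]  # per-request token ids (tensors in practice)

    def prune(self, completed_ids: List[int]) -> Optional["Batch"]:
        """Drop the rows for requests the router reports as finished."""
        completed = set(completed_ids)
        keep = [i for i, rid in enumerate(self.request_ids)
                if rid not in completed]
        if not keep:
            return None  # every request in the batch has finished
        return Batch(
            request_ids=[self.request_ids[i] for i in keep],
            input_ids=[self.input_ids[i] for i in keep],
        )
```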