[Core] Fix sharing of stateful logits processors #5329
Conversation
This allows vLLM to support stateful logits processors that must be unique for each sequence. Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Thanks @maxdebayser! You need to run format.sh to fix the linting, and it looks like there are some tests that need updating.
I did run format.sh, but maybe I need to update my dev dependencies.
@br3no I have a couple of questions:
Unless (1) is possible and very cheap, the best option would be to extend this PR to support pooling and make use of (2), assuming that (2) is possible. This library looks like exactly what we'd need for that: https://pypi.org/project/pondpond/, in conjunction with finding the right place to hook the sequence completion event (possibly …)
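For illustration, here is a minimal sketch of the pooling idea. Note that this does not use the pondpond API (which I haven't verified); `ProcessorPool` and its methods are illustrative stand-ins:

```python
from collections import deque

class ProcessorPool:
    """Generic lease/return pool for stateful logits processors (sketch)."""
    def __init__(self, factory):
        self._factory = factory
        self._free = deque()

    def acquire(self):
        # Hand out a pooled instance if one is free, otherwise build a new one.
        return self._free.popleft() if self._free else self._factory()

    def release(self, proc):
        # A real pool would reset the processor's state here; as noted in the
        # discussion, Outlines currently lacks such a reset hook.
        self._free.append(proc)

pool = ProcessorPool(list)  # 'list' stands in for a stateful-processor factory
p1 = pool.acquire()
pool.release(p1)            # sequence finished: return the instance to the pool
p2 = pool.acquire()
assert p2 is p1             # the pooled instance gets reused
```

The open question this sketch sidesteps is exactly the one raised above: where in the engine to hook the sequence-completion event so that `release` gets called.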
I haven't benchmarked anything, but the code suggests that copying the … There is no support for resetting the … I believe that to improve this situation, we need to work with the Outlines community. NOTE: This problem only applies to the …
Thanks @br3no, so …
Ah, I just remembered there is another PR #5006 related to this that I had been meaning to look at; hopefully that can help solve it!
Sounds like there should be a replay of the FSM; at the very least, there should be a parameter passed when initializing new … Just some more plumbing is required to ensure that when you call the method …
I've synced the branch with main, and all the unit tests touched by this PR pass, but guided decoding in main currently seems to be broken: #7557
This is related to PR #4109.
In vLLM, users can submit more than one sequence at the same time with a completion request in the legacy OpenAI API. Since the sampling params are validated only once, a single SamplingParams object is shared between all sequences, even when they belong to different sequence groups. SamplingParams is mostly a data object, but it carries a list of executable logits processors. If the logits processors are stateless, there is no problem; but there are logits processor implementations whose internal state depends on the sequence seen so far.
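To make the failure mode concrete, here is a minimal, self-contained sketch; the processor class and its call signature are illustrative, not vLLM's actual interface:

```python
from typing import List

class BanRepeatedTokens:
    """Toy stateful processor: masks any token id this sequence already produced."""
    def __init__(self) -> None:
        self.seen = set()

    def __call__(self, generated_token_ids: List[int],
                 logits: List[float]) -> List[float]:
        # Real processors operate on a logits tensor; plain lists keep the
        # sketch dependency-free.
        self.seen.update(generated_token_ids)
        return [float("-inf") if tok in self.seen else score
                for tok, score in enumerate(logits)]

# Two independent sequences sharing ONE instance -- the bug described above:
shared = BanRepeatedTokens()
logits_a = shared([0], [1.0, 1.0, 1.0])  # sequence A generated token 0
logits_b = shared([2], [1.0, 1.0, 1.0])  # sequence B generated token 2
assert logits_a == [float("-inf"), 1.0, 1.0]
# Sequence B also sees token 0 banned, because A's state leaked into it:
assert logits_b == [float("-inf"), 1.0, float("-inf")]
```

With a stateless processor the sharing would be harmless; it is the per-sequence state (`self.seen` here, or an FSM state in Outlines) that makes a single shared instance incorrect.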
For example, the CFGLogitsProcessor would crash with the request below, resulting in the following error:
PR #4109 fixes the crashing behavior, but the output is still affected by the sharing problem.
The additional solution proposed here is based on adding factories for stateful logits processors. The basic idea is: when each sequence is created, copy the sampling_params.logits_processors and call the factories to populate SequenceData.logits_processors.
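As a rough sketch of that idea (the class and function names here are illustrative stand-ins, not the actual vLLM types):

```python
from typing import Callable, List

class CountingProcessor:
    """Toy stateful processor: counts its calls instead of doing real masking."""
    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, token_ids: List[int], logits: List[float]) -> List[float]:
        self.calls += 1
        return logits

class SamplingParamsSketch:
    """Illustrative stand-in for SamplingParams, not the real vLLM class."""
    def __init__(self, logits_processors, logits_processor_factories):
        self.logits_processors = logits_processors                    # shared, stateless
        self.logits_processor_factories = logits_processor_factories  # built per sequence

def build_sequence_processors(params: SamplingParamsSketch):
    # Copy the shared stateless processors and call each factory so that
    # every sequence owns fresh stateful instances.
    return (list(params.logits_processors)
            + [factory() for factory in params.logits_processor_factories])

params = SamplingParamsSketch([], [CountingProcessor])
seq_a = build_sequence_processors(params)
seq_b = build_sequence_processors(params)
seq_a[0]([], [0.0])
assert seq_a[0].calls == 1 and seq_b[0].calls == 0  # state is no longer shared
```

The point of the factory indirection is that SamplingParams stays a validate-once data object, while each sequence still gets its own stateful instances.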
Here are some diagrams illustrating the current code structure, to better visualize the proposed changes:
There are two scenarios that need further investigation:
BEFORE SUBMITTING, PLEASE READ THE CHECKLIST BELOW AND FILL IN THE DESCRIPTION ABOVE
PR Checklist
Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain the code quality and improve the efficiency of the review process.
PR Title and Classification
Only specific types of PRs will be reviewed. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:

- `[Bugfix]` for bug fixes.
- `[CI/Build]` for build or continuous integration improvements.
- `[Doc]` for documentation fixes and improvements.
- `[Model]` for adding a new model or improving an existing model. Model name should appear in the title.
- `[Frontend]` for changes on the vLLM frontend (e.g., OpenAI API server, `LLM` class, etc.).
- `[Kernel]` for changes affecting CUDA kernels or other compute kernels.
- `[Core]` for changes in the core vLLM logic (e.g., `LLMEngine`, `AsyncLLMEngine`, `Scheduler`, etc.).
- `[Hardware][Vendor]` for hardware-specific changes. Vendor name should appear in the prefix (e.g., `[Hardware][AMD]`).
- `[Misc]` for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.
Code Quality
The PR needs to meet the following code quality standards:

- Use `format.sh` to format your code.
- Add documentation to `docs/source/` if the PR modifies the user-facing behaviors of vLLM. It helps vLLM users understand and utilize the new features or changes.

Notes for Large Changes
Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag it with `rfc-required` and it might not go through the PR.

What to Expect for the Reviews
The goal of the vLLM team is to be a transparent reviewing machine. We would like to make the review process transparent and efficient and make sure no contributor feels confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process:

- An `action-required` label will be added to the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.

Thank You
Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone!