[Core] Async Scheduling X Spec Decoding Compatibility #24799
base: main
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request adds support for speculative decoding with asynchronous scheduling, which is a great feature enhancement. The core logic of handling draft tokens within the worker process for async scheduling is sound. However, I've identified a few critical issues in gpu_model_runner.py related to tensor manipulation for scatter operations that will likely cause runtime errors. There's also a minor logic error in how speculative token lists are truncated. The proposed fixes are straightforward. Once these issues are addressed, the implementation should be solid.
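The review above flags scatter-style tensor updates as the likely source of runtime errors. As an illustration only (plain Python, not the actual `gpu_model_runner.py` code; the function name and shapes are hypothetical), the failure mode is typically an index/source length mismatch when writing draft token ids into a flat input buffer:

```python
# Hypothetical sketch of scattering per-request draft token ids into a flat
# input_ids buffer. The classic bug is an index/source length mismatch, which
# a torch scatter_ call would reject at runtime; the explicit check below
# mirrors that behavior in plain Python.

def scatter_draft_tokens(input_ids, positions, draft_tokens):
    """Write draft_tokens into input_ids at the given flat positions."""
    if len(positions) != len(draft_tokens):
        # Equivalent to the shape-mismatch error the review warns about.
        raise ValueError(
            f"index/source mismatch: {len(positions)} vs {len(draft_tokens)}"
        )
    for pos, tok in zip(positions, draft_tokens):
        input_ids[pos] = tok
    return input_ids

# Two requests, each with two draft tokens after their last sampled token.
buf = [101, 0, 0, 202, 0, 0]
scatter_draft_tokens(buf, [1, 2, 4, 5], [7, 8, 9, 10])
```

After the call, `buf` holds the two requests' sampled tokens followed by their draft tokens.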
This pull request has merge conflicts that must be resolved before it can be merged.
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
fix typo and grammar Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com>
Thanks for the contribution and continued effort. I am satisfied with the state of the PR
May I double check?
@benchislett Maybe it's worth running CI against this PR?
The answer is yes; you can see the e2e test case.
Waiting for final sign-off from @njhill or @WoosukKwon before marking as "ready".
Apologies, I was battling a perf regression and more CI issues today, but will focus on this tomorrow.
Thanks @Ronald1995 ... I'm still in the middle of reviewing but posting a few comments as I go which you could maybe start to look at in parallel...
```python
        )
        assert _all_logprobs_match(
            results[0][1], other_test_logprobs

def test_with_spec_decoding(self, monkeypatch: pytest.MonkeyPatch):
```
It would be good for the baseline run of this to not use spec decoding, so that we compare the output with that too.
OK, I will complete the test case.
```python
            results[0][1], other_test_logprobs
        )

        outputs.append((test_config, results))
```
It would be good to record the acceptance rate for each test run here and compare that at the end too (for the spec decoding cases of course).
I think this could be done fairly easily via the exposed metrics, you can look at other tests as an example, e.g.
vllm/tests/v1/e2e/test_spec_decode.py, lines 162 to 163 (at 18b3982):

```python
# Collect draft and acceptance stats.
metrics = spec_llm.get_metrics()
```
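The acceptance-rate comparison the reviewer asks for boils down to a ratio of two counters. As a hedged, dependency-free sketch (the counter names and where they come from are assumptions; in the real test they would be read from the values returned by `get_metrics()`):

```python
# Sketch: computing the draft-token acceptance rate from two counters, as the
# review suggests doing per test run and comparing at the end. The counts here
# are placeholders; in the actual test they would come from vLLM's metrics.

def acceptance_rate(num_draft_tokens: int, num_accepted_tokens: int) -> float:
    """Fraction of proposed draft tokens accepted by the target model."""
    if num_draft_tokens == 0:
        return 0.0
    return num_accepted_tokens / num_draft_tokens

# Example: 70 of 100 proposed draft tokens were accepted.
rate = acceptance_rate(100, 70)
```

The per-config rates can then be appended alongside `(test_config, results)` and asserted to be close across the async and sync runs.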
OK, I will add this.
vllm/v1/engine/core.py (outdated):

```python
if (
    (sampling_params := request.sampling_params)
    and self.use_spec_decode
    and self.async_scheduling
):
```
We should move these checks to the front-end. I can do this soon in another branch based on this one...
By front-end, do you mean the core client? I tried to add the check in the add_request method of the core client, but there are several kinds of core clients, so I added the check logic in core.py to avoid too many modifications to the core clients.
@njhill I have fixed the issues with the e2e test. Thanks.
Thanks @Ronald1995. I guess it would be good if possible to make these additional changes to the test:
- Have the baseline for output comparisons be the non-spec decode run
- Don't run any other non-spec server configs as part of the spec decode test (since we don't really need to test those and it adds significant time having to start the larger model)
Signed-off-by: Nick Hill <nhill@redhat.com>
@Ronald1995 I've pushed a commit with this change (moving the validation to processor.py), along with some other minor simplifications and comment cleanup. I hope this is OK.
```python
        monkeypatch,
        MTP_MODEL,
        [{}],
        spec_configs=[{"method": "mtp", "num_speculative_tokens": 1}, None],
```
Could we change this to 2? Perhaps it doesn't make a difference but I feel like it may avoid some >1 edge case.
Suggested change:

```diff
-spec_configs=[{"method": "mtp", "num_speculative_tokens": 1}, None],
+spec_configs=[{"method": "mtp", "num_speculative_tokens": 2}, None],
```
@Ronald1995 what will happen in the case shown in vllm/vllm/v1/worker/gpu_model_runner.py, lines 2692 to 2695 in ca6f755?

I think it won't update, and I think we also need to address that case if so.
```python
if self.speculative_config.get("method") not in get_args(EagleModelTypes):
    raise ValueError(
        "Currently, async scheduling is only supported "
        "with EAGLE/MTP kind of speculative decoding"
    )
elif self.speculative_config.get("disable_padded_drafter_batch"):
    raise ValueError(
        "async scheduling for EAGLE/MTP kind of speculative "
        "decoding is enabled, but disable_padded_drafter_batch=True "
        "is not supported in this case. Please set "
        "disable_padded_drafter_batch=False"
    )
```
I am moving the async scheduling config validation into VllmConfig in #28250. Hopefully that PR can be merged quickly and we can then move this check along with that.
Purpose
PR #19970 implements async_scheduling, and PR #23569 implements `prepare_input` overlap based on PR #19970. PR #24539 refactors the EAGLE spec decode logic so that it no longer relies on the CPU's sampled token IDs. This PR is based on #24539 and aims to support speculative decoding with async_scheduling. When both async_scheduling and spec decode are enabled, we no longer copy the draft token IDs back to the scheduler; instead, we cache them in gpu_model_runner and update the input_ids with `_draft_token_ids` directly for the next step's `execute_model`. Because ngram and medusa currently rely on the CPU's sampled token IDs, they could be refactored in the future, but for now this PR only supports EAGLE spec decode with async_scheduling.
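The data flow described above can be illustrated with a minimal sketch (plain Python, not the actual vLLM code; class and variable names are invented for illustration): the runner keeps the draft token ids from one step locally and splices them into the next step's input ids itself, with no round trip through the scheduler.

```python
# Illustrative sketch of the PR's data flow under async scheduling + spec
# decode: draft token ids stay cached in the model runner between steps
# instead of being copied back to the scheduler.

class ModelRunnerSketch:
    def __init__(self):
        self._draft_token_ids = {}  # request id -> drafts cached from last step

    def execute_model(self, scheduled):
        """scheduled: request id -> newly scheduled token ids for this step."""
        input_ids = {}
        for req_id, new_tokens in scheduled.items():
            # Drafts cached from the previous step are appended directly on
            # the runner side; the scheduler never sees them.
            input_ids[req_id] = new_tokens + self._draft_token_ids.pop(req_id, [])
        # ... run the target model and verify drafts, then cache new drafts
        # for the next step (placeholder values here):
        self._draft_token_ids.update({req_id: [42, 43] for req_id in scheduled})
        return input_ids
```

On the first step a request has no cached drafts; on every later step its input ids are extended with the previous step's drafts before `execute_model` runs.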
Test Plan
We will run an end-to-end (e2e) test.
Test config:
test device: Nvidia A100
Test Result
performance
precision
I compared the outputs of async_scheduling and sync_scheduling with speculative decoding, and the outputs are identical, so async_scheduling does not introduce a precision problem.
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.