v0.7.3 support speculative decoding #252

mengwei805 · 2025-03-06T08:57:03Z

What this PR does / why we need it?

Support speculative decoding on Ascend, including speculating with a draft model, matching n-grams in the prompt, using MLP speculators, and using EAGLE-based draft models.
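To illustrate the n-gram mode mentioned above, here is a minimal pure-Python sketch of prompt lookup: the trailing n-gram of the context is searched for earlier in the same context, and the tokens that followed the match become the draft. The function name `propose_ngram` and its parameters are hypothetical and exist only for this example; this is not the vLLM or vllm-ascend implementation, which operates on tensors and has its own matching policy.

```python
from typing import List, Optional


def propose_ngram(token_ids: List[int], ngram_size: int,
                  k: int) -> Optional[List[int]]:
    """Propose up to k draft tokens by matching the trailing n-gram.

    Hypothetical helper for illustration only.
    """
    if ngram_size <= 0 or k <= 0 or len(token_ids) <= ngram_size:
        return None
    suffix = token_ids[-ngram_size:]
    # Scan backwards over earlier positions, skipping the suffix itself,
    # so the most recent earlier occurrence wins.
    for start in range(len(token_ids) - ngram_size - 1, -1, -1):
        if token_ids[start:start + ngram_size] == suffix:
            end = start + ngram_size
            return token_ids[end:end + k]
    return None


if __name__ == "__main__":
    context = [1, 2, 3, 4, 1, 2]
    # The trailing 2-gram [1, 2] also occurs at the start of the context,
    # so the tokens that followed it there are proposed as the draft.
    print(propose_ngram(context, ngram_size=2, k=3))
```

The draft tokens are only proposals; in speculative decoding the target model still scores them and accepts or rejects each one.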

Does this PR introduce any user-facing change?

Users can refer to https://docs.vllm.ai/en/latest/features/spec_decode.html# for usage.

How was this patch tested?

All four speculative decoding modes have been tested, and the behavior is consistent with GPU devices.

Signed-off-by: mengwei805 <mengwei25@huawei.com>

wangxiyuan · 2025-03-07T07:27:31Z

IMO, platform.check_and_update_config should be updated as well to init spec_worker.

wangxiyuan · 2025-03-07T10:32:48Z

vllm_ascend/patch/patch_spec_decode_worker.py

+logger = init_logger(__name__)
+
+
+def create_worker(


https://github.com/vllm-project/vllm/blob/main/vllm/platforms/cuda.py#L131-L134

The adaptation work here has been completed in #236

I missed it.
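For readers following this thread, the kind of hook being discussed can be sketched roughly as below: when a speculative config is present, the platform points the worker class at a spec-decode worker factory and records the device worker separately. This is a simplified sketch under stated assumptions, not the code from #236 or from `cuda.py`. The dataclasses are stand-ins for the real config objects, and the two dotted paths are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical placeholder paths, used only for this sketch.
SPEC_WORKER_FACTORY = "my_plugin.spec_decode.create_spec_worker"
DEVICE_WORKER = "my_plugin.worker.DeviceWorker"


@dataclass
class ParallelConfig:
    """Stand-in for the real parallel config (assumption)."""
    worker_cls: str = "auto"
    sd_worker_cls: str = "auto"


@dataclass
class VllmConfig:
    """Stand-in for the real top-level config (assumption)."""
    parallel_config: ParallelConfig
    speculative_config: Optional[dict] = None


def check_and_update_config(cfg: VllmConfig) -> None:
    """Pick worker classes, wrapping the device worker for spec decode."""
    parallel = cfg.parallel_config
    if parallel.worker_cls != "auto":
        return  # respect an explicit user choice
    if cfg.speculative_config is not None:
        # The factory builds the spec-decode worker, which in turn
        # instantiates the device worker named in sd_worker_cls.
        parallel.worker_cls = SPEC_WORKER_FACTORY
        parallel.sd_worker_cls = DEVICE_WORKER
    else:
        parallel.worker_cls = DEVICE_WORKER


if __name__ == "__main__":
    cfg = VllmConfig(ParallelConfig(), speculative_config={"k": 5})
    check_and_update_config(cfg)
    print(cfg.parallel_config)
```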

wangxiyuan · 2025-03-10T01:40:16Z

vllm_ascend/patch/patch_spec_decode_worker.py

+            # the use of TP1DraftModelRunner
+            if draft_tp == 1 and draft_model_config.hf_config.model_type !=\
+                    "deepseek_mtp":
+                draft_worker_kwargs["model_runner_cls"] = TP1DraftModelRunner


So the patch code here mainly makes two changes:

remove the hard-coded is_cuda_like check

change draft_worker to use a new TP1DraftModelRunner in vllm-ascend

Right?

Remove is_cuda_like both when importing the package and here, because the mode that uses EAGLE-based draft models must use TP1DraftModelRunner.

Change draft_worker to a new TP1DraftModelRunner in vllm-ascend.

When the draft model type is deepseek_mtp, TP1DraftModelRunner is deliberately not used. Refer to "[Model][Speculative Decoding] Expand DeepSeek MTP code to support k > n_predict" (vllm#13626) and "[Model][Speculative Decoding] DeepSeek MTP spec decode" (vllm#12755). I think this is acceptable for this version, because fixing the problem properly may require a lot of work (it may need adapting multi_step_worker, modifying npu_worker, etc.); we can do better in the next version.

In summary, I have verified that the current modification correctly runs 4+1 speculative decoding modes. The 4 are speculating with a draft model, matching n-grams in the prompt, using MLP speculators, and using EAGLE-based draft models; the 1 is deepseek_mtp.
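The selection rule discussed in this thread can be isolated into a small sketch. The condition mirrors the diff hunk quoted above; the helper name `select_draft_runner` and the placeholder class are hypothetical and only stand in for the real `TP1DraftModelRunner` in vllm-ascend.

```python
from typing import Any, Dict


class TP1DraftModelRunner:
    """Placeholder for the real runner class (assumption for this sketch)."""


def select_draft_runner(draft_tp: int, model_type: str,
                        draft_worker_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the kwargs, adding the TP1 runner when it applies.

    The TP1 runner is chosen only when the draft model runs with tensor
    parallel size 1 and is not a deepseek_mtp model.
    """
    kwargs = dict(draft_worker_kwargs)
    if draft_tp == 1 and model_type != "deepseek_mtp":
        kwargs["model_runner_cls"] = TP1DraftModelRunner
    return kwargs


if __name__ == "__main__":
    print(select_draft_runner(1, "llama", {}))         # TP1 runner set
    print(select_draft_runner(1, "deepseek_mtp", {}))  # left unchanged
```

Returning a copy instead of mutating the input is a choice made for this example; the patched code in the diff assigns into `draft_worker_kwargs` directly.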

### What this PR does / why we need it?

Backport: #252

This supports speculative decoding on Ascend, including speculating with a draft model, matching n-grams in the prompt, using MLP speculators, and using EAGLE-based draft models.

Backport: #423

Spec decode MultiStepWorker fully supports TP1DraftModelRunner, supports running the draft_model_runner with multi-step prepare directly on the NPU, and supports draft_model_runner using MLA.

1. Before this PR, `MultiStepWorker` would not step into the branch using NPU prepare, but only into the branch using CPU prepare (`line 52` of `vllm_ascend/patch/patch_multi_step_worker.py`). Although this has `no effect` on the `correct operation` of speculative decoding and the performance of the two branches is basically the same as of the current version, I support entering this branch in this PR. In general, there are two main changes in `patch_multi_step_worker.py`: first, the `is_cuda_like()` check is removed and the `TP1DraftModelRunner` rewritten in vllm_ascend is used; second, the `supports_gpu_multi_step()` function is made to return true on NPU devices when the outer Multi_step_worker can work correctly.

3. Before this PR, `TP1DraftModelRunner` only supported Attention on NPU, but not MLA. The relevant adaptation is in `vllm_ascend/worker/draft_model_runner.py`. Although I don't know why the `input_positions` of `model_input.attn_metadata` in vllm-ascend needs to be added in `execute_model`, it is done in `model_runner.py`, so I also made the corresponding changes. Otherwise, when atten_backend is MLA, it reports that input_positions cannot be found.

4. I commented out two lines in `draft_model_runner.py` at `line118` to support the scenario of K>1.

```
# lora_mapping=model_input.lora_mapping,
# lora_requests=model_input.lora_requests,
```

I added comments. In the future, when vllm-ascend supports the LoRA feature, the changes here can be restored.

TODO:
- [ ] revert the patch when the related issues are addressed in vllm

### How was this patch tested?

CI passed with newly added tests.
- e2e test for medusa proposer: tests/singlecard/spec_decode/e2e/test_medusa_correctness.py
- e2e test for mlp proposer: tests/singlecard/spec_decode/e2e/test_mlp_correctness.py
- e2e test for n-gram proposer: tests/singlecard/spec_decode/e2e/test_ngram_correctness.py

Tests for patched files:
- tests/singlecard/spec_decode/test_dynamic_spec_decode.py
- tests/singlecard/spec_decode/test_multi_step_worker.py
- tests/singlecard/spec_decode/test_ngram_worker.py
- tests/singlecard/spec_decode/test_spec_decode_worker.py

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: mengwei805 <mengwei25@huawei.com>


mengwei805 added 3 commits March 6, 2025 08:51

v0.7.3 support speculative decoding

100c2fd

Signed-off-by: mengwei805 <mengwei25@huawei.com>

fix codecheck

eb7c4c7

Signed-off-by: mengwei805 <mengwei25@huawei.com>

fix codecheck

89864f2

Signed-off-by: mengwei805 <mengwei25@huawei.com>

wangxiyuan reviewed Mar 7, 2025


wangxiyuan approved these changes Mar 7, 2025


wangxiyuan mentioned this pull request Mar 7, 2025

Speculative decoding not working #47

Closed

wangxiyuan reviewed Mar 10, 2025


wangxiyuan merged commit 11f4971 into vllm-project:v0.7.3-dev Mar 10, 2025
11 checks passed

wangxiyuan mentioned this pull request Mar 12, 2025

[Feature]: speculative decoding、Chunked Prefill、Prefix caching #289

Closed

This was referenced Apr 9, 2025

[SpecDecode][MiniCPM] pick certain feature to main #484

Closed

[SpecDecode] Add spec decode support #500

Merged
