[Model][Speculative Decoding] Expand DeepSeek MTP code to support k > n_predict #13626
Conversation
Also, should lines 100-101 be moved outside the for-loop?
@benchislett another issue here concerns a specific setting: draft tp = 1, with an attention backend that is not flash attention (MLA, xFormers). In this case, self.model_runner is TP1DraftModelRunner, but self.model_runner.supports_gpu_multi_step(expanded_request) will return false, since this runner only supports the flash attention backend. Then, even if self.worker.model_runner.return_hidden_states is set to true, the runner won't return the hidden states for multiple draft steps because that isn't currently supported, which results in errors when we get to self._maybe_update_previous_hidden_states(), since the input hidden states are None. To reproduce this error, one can simply run https://github.com/vllm-project/vllm/blob/main/tests/spec_decode/e2e/test_eagle_correctness.py. This issue doesn't undermine the usefulness of this PR, but I would like to see if we can figure out a better way to handle it, for example by throwing an error / exception early on.
I have refactored the snippet so that it sits outside the loop.
As for the attention backend, there are a few solutions I see to this problem, and all are simple. One easy fix is to add `and model_config.use_mla` to the selector in multi_step_worker so that TP1DraftModelRunner is never used in "fallback" mode for MTP speculative decoding.
However, it seems easier to just enable return_hidden_states in TP1DraftModelRunner. I am not sure why this was omitted originally, but the code from vllm/worker/model_runner.py seems to work just fine. The test_eagle_correctness.py tests are now passing locally. Please take a look at the updated implementation.
Thanks again for your great work! Regarding returning the hidden states, I think we might be able to simplify it a bit more? Do we have to consider prefill & decoding, with / without cuda graph here? It seems adding
if self.return_hidden_states and is_fallback:
    outputs[0].hidden_states = hidden_states
at the end of execute_model works just fine at my end.
I think this would be sufficient for most cases. My motivation was to maintain compatibility with ModelRunner so that TP1DraftModelRunner is as flexible as possible for future extensions.
However, given that this runner is exclusively used for the draft model, the hidden states are only needed right now for the EAGLE code path. So the code as proposed should work. I'm comfortable with either implementation.
I have updated the code to the shorter diff.
LGTM!
This pull request has merge conflicts that must be resolved before it can be merged.
else:
-    if draft_tp == 1 or draft_model_config.hf_config.model_type ==\
-            "deepseek_mtp":
+    if draft_tp == 1:
What's the concern with using TP1DraftModelRunner for deepseek_mtp when tp > 1 here?
@luccafong
I don't think that TP1DraftModelRunner should be used at all for MTP. Since the advance-step is not compatible with the MLA backend, it isn't going to work for k > 1 without some significant changes. In this PR I maintain compatibility with TP1DraftModelRunner so that it can be used in the future when the advance-step is implemented, but even then I believe it will only be TP=1 as the broadcast limitations of TP1DraftModelRunner haven't been added either.
Using the plain ModelRunner for this branch makes the most sense to me, as we are using it as intended for single-step execution.
I was planning to make TP1DraftModelRunner compatible with both cases, so the name is probably confusing. Actually, changing the model runner here will make it harder to support real multi-step (k > 1) MTP later, but it is good for this PR's use case.
That sounds challenging. Best of luck!
@luccafong QQ: does TP1DraftModelRunner work when the draft model has TP=8? If not, then this PR also has the benefit of greatly relieving memory pressure, I believe, since otherwise the full MTP module (~16GB of weights) would all reside on the first device, which limits the max context length that can be supported.
The previous implementation supports TP=8 in the draft model runner for MTP in the case of k=1, and we attached the benchmark for TP=8 in #12755.
@luccafong Ah, this makes total sense now. Thanks for the clarification!
@pyc96 Yes, there is some special handling there and in a few other places. Right now multi-step is hard-coded in several ways to be specific to the Flash Attention backend. The advance_step operation handles some attention-specific state processing, and so its behaviour is specific to the attention backend being used.
I expect that much of this logic could be re-used for MLA, but this will have to be its own contribution. In the past I have tried to forcibly enable this code path for MLA, and there were several mismatches in state that would need to be handled specifically. I don't think that adding this compatibility is a trivial change. If you disable MLA, then the multi-step code is usable.
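To make that gating concrete, here is a minimal, hypothetical sketch of the kind of check being discussed; the function name, the backend-name string, and the supported-backend set are illustrative assumptions, not the actual vLLM code.

```python
# Hypothetical sketch (not vLLM's implementation): advance_step mutates
# backend-specific attention metadata, so the multi-step fast path is only taken
# for backends that implement it; everything else falls back to single-step calls.
MULTI_STEP_CAPABLE_BACKENDS = {"FLASH_ATTN"}  # MLA deliberately excluded for now

def can_use_gpu_multi_step(attn_backend_name: str, num_steps: int) -> bool:
    if num_steps == 1:
        # Single-step execution never calls advance_step, so any backend works.
        return True
    return attn_backend_name in MULTI_STEP_CAPABLE_BACKENDS
```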
@benchislett Thank you for your reply! On top of your PR, I changed
-assert isinstance(attn_metadata, FlashAttentionMetadata)
+assert isinstance(attn_metadata, (FlashAttentionMetadata, MLACommonMetadata))
and it just worked fine for the drafter TP=1 case.
Actually, without this change, I believe the current code will throw this assertion error during request serving for the TP=1 case. Is that intended?
@pyc96 I can confirm that it does seem to work with multi-step now. Even vllm/worker/multi_step_model_runner.py seems to be successful with the MLA backend. However, changing the allowed backends list is still required there as well, which leads me to believe this support wasn't specifically added but rather that the blocking restrictions were independently eliminated.
For this PR, I would prefer to merge sooner and have follow-up work (cc @luccafong) add the multi-step functionality for DeepSeek with some thorough testing and sign-off.
For now, I've pushed a commit which removes the MLA backend from supports_gpu_multi_step so that the fallback path must be used. This will avoid the assertion failure you identified. If the community is confident that multi-step is safe with MLA, I am happy to turn the support back on.
Hi @benchislett, sorry for the delay here. I was thinking through the implementation at a higher level. I list my understanding below; please correct me if I am wrong.
How does the current implementation handle the KV cache for the MTP modules? Since their KV caches are not aligned, I feel there is some complexity here.
I think this PR just reuses the same module for k > 1; correct me if I am wrong. @benchislett
That's right. This PR just reuses the same module for k > 1, so the KV cache is handled the same as for an ordinary draft model (at least for n_predict == 1, which is the only value supported right now). Handling the KV cache for n_predict > 1 does indeed seem like a difficult challenge, and I'm not sure how it would best be accomplished in practice. Please see #13805 to discuss such matters.
LGTM! Let's get this PR in first since it already brings some performance improvement.
@LiuXiaoxuanPKU since you were sharing the diagrams, I want to use this thread to point out that the current handling of the first token for MTP / EAGLE, by initializing it as zeros, is actually wrong in vLLM. Due to the presence of attention modules, even though the Q, K, V tensors will all be zeros, the attention score normalizer will still be messed up because of the softmax operator (e^0 = 1). The correct way, per both EAGLE and MTP, is to get rid of the first hidden state and adjust the attention metadata correspondingly so that both the attention scores and the position encoding are correct (the second token from the target model should be the first token to EAGLE and get the first position encoding). cc @benchislett @luccafong.
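To illustrate the normalizer point numerically, here is a small standalone example (plain Python, not vLLM code): a position whose attention logit is zero still contributes e^0 = 1 to the softmax denominator, so zero-initializing the first hidden state does not actually remove it from attention.

```python
import math

def softmax(scores):
    z = sum(math.exp(s) for s in scores)
    return [math.exp(s) / z for s in scores]

# Attention logits for one query; the 0.0 comes from the zero-initialized slot.
print(softmax([0.0, 2.0, 1.0]))  # the "empty" slot still gets a nonzero weight
                                 # and skews the other weights downward

# Weights if the first slot is dropped from the attention metadata instead.
print(softmax([2.0, 1.0]))       # remaining positions are normalized correctly
```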
Hi @benchislett, just a quick question, by
@Neo9061 This is the draft acceptance rate, actually. If you are surprised by the falloff as k increases, I can offer some intuition as to why I think this happens. Since the MTP module was only trained with k=1, it was never used autoregressively during training (hidden states returned by MTP module 1 are never seen as inputs to the same module). So as its own hidden states are fed back into it, the distribution drifts further from the training data and the quality falls off. I think this is expected behaviour given the setup.
@benchislett Hi, I have a question I'd like to ask for clarification. If k=1, that means MTP (Multi-Token Prediction) can only predict one additional token ahead. How exactly does this achieve inference acceleration? Based on my understanding, the principle behind speeding up speculative decoding is that the main model can verify in parallel the multiple tokens predicted by the draft model. But if it only predicts one additional token, wouldn't verifying even that single token actually take extra time? Please feel free to correct me if I've misunderstood something. Thanks!
Hi @TianTengya, speculative decoding still accelerates with a single predicted token because of the "bonus token". The main idea is that token generation is based on sampling a "next" token from a sequence. For k=1, we have two sequences we can sample from: the current completion sequence, and the sequence plus the speculated token. We sample from the end of both sequences, and if we keep the speculated token then we also get to keep the "bonus" token that comes next. This means that with an acceptance rate of X for the first token, each decoding step will sample 1+X tokens.
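As a back-of-the-envelope sketch of that arithmetic (assuming, for simplicity, that the single draft token is accepted independently with probability x):

```python
def expected_tokens_per_step(x: float) -> float:
    # k = 1 only: reject -> 1 token (the target model's own sample),
    #             accept -> 2 tokens (draft token + bonus token).
    return (1 - x) * 1.0 + x * 2.0  # simplifies to 1 + x

print(expected_tokens_per_step(0.83))  # ~1.83 tokens per decode step at 83% acceptance
```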
Thanks for the clarification! I see now how k=1 still yields a bonus token when accepted and speeds up decoding. Appreciate it.
This PR extends the DeepSeek MTP implementation so that the number of speculative tokens may be greater than the number of prediction modules. In such cases, the layers are reused by feeding the hidden states from one step into the next.
This will also fix a few issues with the original implementation that would not have been able to handle k > 1:
- `TP1DraftModelRunner` usage. As I understand it, DeepSeek MLA is not compatible with multi-step at all, so this pathway would not be compatible with k > 1. I do not expect there would be any benefits to using this path for k=1.
- TP > 1 on all EAGLE models, but I have not tested this claim beyond MTP.

Benchmark results on 8xH200 with sample requests from ShareGPT are attached as plots, with the following reported acceptance rates: 83% for k=1, 72% for k=2, 61% for k=3.
Until additional MTP module weights are released, using this implementation with k=2 gives the best performance from my testing.
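For intuition, here is a hedged sketch of the reuse scheme described above; mtp_module, its call signature, and the helper name are hypothetical, not the actual vLLM implementation.

```python
# Hedged sketch (assumed interface, not vLLM code): when k > n_predict, the single
# trained MTP module is applied autoregressively, feeding the hidden state it
# produces back in as the "previous hidden state" for the next speculative step.
def propose_draft_tokens(mtp_module, last_token, prev_hidden, k):
    draft_tokens = []
    for _ in range(k):
        next_token, next_hidden = mtp_module(last_token, prev_hidden)
        draft_tokens.append(next_token)
        last_token, prev_hidden = next_token, next_hidden
    return draft_tokens
```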