[Model][Speculative Decoding] Expand DeepSeek MTP code to support k > n_predict #13626
Conversation
Also, should lines 100-101 be moved outside the for-loop?
@benchislett another issue here concerns a specific setting: draft tp = 1, with an attention backend that is not flash attention (MLA, xFormers). In this case, self.model_runner is TP1DraftModelRunner, but self.model_runner.supports_gpu_multi_step(expanded_request) will return false, since this runner only supports the flash attention backend. Then, even if self.worker.model_runner.return_hidden_states is set to true, the runner won't return the hidden states for multiple draft steps because that isn't currently supported, which results in errors when we get to self._maybe_update_previous_hidden_states(), since the input hidden states are None. To reproduce this error, one can simply run https://github.com/vllm-project/vllm/blob/main/tests/spec_decode/e2e/test_eagle_correctness.py. This issue doesn't undermine the usefulness of this PR, but I would like to see if we can figure out a better way to handle it, for example by throwing an error / exception early on.
I have refactored the snippet so that it sits outside the loop.
As for the attention backend, there are a few solutions I see to this problem, and all are simple. One easy fix is to add `and model_config.use_mla` to the selector in multi_step_worker so that TP1DraftModelRunner is never used in "fallback" mode for MTP speculative decoding.
However, it seems easier to just enable return_hidden_states in TP1DraftModelRunner. I am not sure why this was omitted originally, but the code from vllm/worker/model_runner.py seems to work just fine. The test_eagle_correctness.py tests are now passing locally. Please take a look at the updated implementation.
Thanks again for your great work! Regarding returning the hidden states, I think we might be able to simplify it a bit more? Do we have to consider prefill & decoding, with / without cuda graph here? It seems adding
if self.return_hidden_states and is_fallback:
    outputs[0].hidden_states = hidden_states
at the end of execute_model works just fine at my end.
I think this would be sufficient for most cases. My motivation was to maintain compatibility with ModelRunner so that TP1DraftModelRunner is as flexible as possible for future extensions.
However, given that this runner is exclusively used for the draft model, the hidden states are only needed right now for the EAGLE code path. So the code as proposed should work. I'm comfortable with either implementation.
I have updated the code to the shorter diff.
LGTM!
This pull request has merge conflicts that must be resolved before it can be merged.
else:
-    if draft_tp == 1 or draft_model_config.hf_config.model_type ==\
-            "deepseek_mtp":
+    if draft_tp == 1:
What's the concern with using TP1DraftModelRunner for deepseek_mtp when tp > 1 here?
@luccafong
I don't think that TP1DraftModelRunner should be used at all for MTP. Since the advance-step is not compatible with the MLA backend, it isn't going to work for k > 1 without some significant changes. In this PR I maintain compatibility with TP1DraftModelRunner so that it can be used in the future when the advance-step is implemented, but even then I believe it will only be TP=1 as the broadcast limitations of TP1DraftModelRunner haven't been added either.
Using the plain ModelRunner for this branch makes the most sense to me, as we are using it as intended for single-step execution.
I was planning to make TP1DraftModelRunner compatible with both cases, so the name is probably confusing. Actually, changing the model runner here will make it harder to support real multi-step (k > 1) MTP later, but it is good for this PR's use case.
That sounds challenging. Best of luck!
@luccafong QQ: does TP1DraftModelRunner work when the draft model has TP=8? If not, then this PR also has the benefit of greatly relieving memory pressure, I believe, since otherwise the full MTP module (~16GB of weights) would all reside on the first device, which limits the max context length that can be supported.
The previous implementation supports TP=8 in the draft model runner for MTP in the case of k=1, and we attached the benchmark for TP=8 in #12755.
@luccafong Ah, this makes total sense now. Thanks for the clarification!
@pyc96 Yes, there is some special handling there and in a few other places. Right now multi-step is hard-coded in several ways to be specific to the Flash Attention backend. The advance_step operation handles some attention-specific state processing, and so its behaviour is specific to the attention backend being used.
I expect that much of this logic could be re-used for MLA, but this will have to be its own contribution. In the past I have tried to forcibly enable this code path for MLA, and there were several mismatches in state that would need to be handled specifically. I don't think that adding this compatibility is a trivial change. If you disable MLA, then the multi-step code is usable.
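To make that gating concrete, here is a minimal, hypothetical sketch of the kind of check being discussed; the function name, the backend-name string, and the supported-backend set are illustrative assumptions, not the actual vLLM code.

```python
# Hypothetical sketch (not vLLM's implementation): advance_step mutates
# backend-specific attention metadata, so the multi-step fast path is only taken
# for backends that implement it; everything else falls back to single-step calls.
MULTI_STEP_CAPABLE_BACKENDS = {"FLASH_ATTN"}  # MLA deliberately excluded for now

def can_use_gpu_multi_step(attn_backend_name: str, num_steps: int) -> bool:
    if num_steps == 1:
        # Single-step execution never calls advance_step, so any backend works.
        return True
    return attn_backend_name in MULTI_STEP_CAPABLE_BACKENDS
```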
@benchislett Thank you for your reply! On top of your PR, I changed
-assert isinstance(attn_metadata, FlashAttentionMetadata)
+assert isinstance(attn_metadata, (FlashAttentionMetadata, MLACommonMetadata))
and it just worked fine for the drafter TP=1 case.
Actually, without this change, I believe the current code will throw this assertion error during request serving for the TP=1 case. Is that intended?
@pyc96 I can confirm that it does seem to work with multi-step now. Even vllm/worker/multi_step_model_runner.py seems to be successful with the MLA backend. However, changing the allowed backends list is still required there as well, which leads me to believe this support wasn't specifically added but rather that the blocking restrictions were independently eliminated.
For this PR, I would prefer to merge sooner and have follow-up work (cc @luccafong) add the multi-step functionality for DeepSeek with some thorough testing and sign-off.
For now, I've pushed a commit which removes the MLA backend from supports_gpu_multi_step so that the fallback path must be used. This will avoid the assertion failure you identified. If the community is confident that multi-step is safe with MLA, I am happy to turn the support back on.
Hi @benchislett, sorry for the delay here. I was thinking through the implementation at a higher level. I list my understanding below; please correct me if I am wrong.
How does the current implementation handle the KV cache for the MTP modules? Since their KV caches are not aligned, I feel there is some complexity here.
I think this PR just reuses the same module for k > 1; correct me if I am wrong. @benchislett
That's right. This PR just reuses the same module for k > 1, so the KV cache is handled the same as for an ordinary draft model (at least for n_predict == 1, which is the only value supported right now). Handling the KV cache for n_predict > 1 does indeed seem like a difficult challenge, and I'm not sure how it would best be accomplished in practice. Please see #13805 to discuss such matters.
LGTM! Let's get this PR in first since it already brings some performance improvement.
@LiuXiaoxuanPKU since you were sharing the diagrams, I want to use this thread to point out that the current handling of the first token for MTP / EAGLE, by initializing it as zeros, is actually wrong in vLLM. Due to the presence of attention modules, even though the Q, K, V tensors will all be zeros, the attention score normalizer will still be messed up because of the softmax operator (e^0 = 1). The correct way, per both EAGLE and MTP, is to get rid of the first hidden state and adjust the attention metadata correspondingly so that both the attention scores and the position encoding are correct (the second token from the target model should be the first token to EAGLE and get the first position encoding). cc @benchislett @luccafong.
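To illustrate the normalizer point numerically, here is a small standalone example (plain Python, not vLLM code): a position whose attention logit is zero still contributes e^0 = 1 to the softmax denominator, so zero-initializing the first hidden state does not actually remove it from attention.

```python
import math

def softmax(scores):
    z = sum(math.exp(s) for s in scores)
    return [math.exp(s) / z for s in scores]

# Attention logits for one query; the 0.0 comes from the zero-initialized slot.
print(softmax([0.0, 2.0, 1.0]))  # the "empty" slot still gets a nonzero weight
                                 # and skews the other weights downward

# Weights if the first slot is dropped from the attention metadata instead.
print(softmax([2.0, 1.0]))       # remaining positions are normalized correctly
```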
Hi @benchislett, just a quick question, by
@Neo9061 This is the draft acceptance rate, actually. If you are surprised by the falloff as k increases, I can offer some intuition as to why I think this happens. Since the MTP module was only trained with k=1, it was never used autoregressively during training (hidden states returned by MTP module 1 are never seen as inputs to the same module). So as its own hidden states are fed back into it, the distribution drifts further from the training data and the quality falls off. I think this is expected behaviour given the setup.
@benchislett Hi, I have a question I'd like to ask for clarification. If k=1, that means MTP (Multi-Token Prediction) can only predict one additional token ahead. How exactly does this achieve inference acceleration? Based on my understanding, the principle behind speeding up speculative decoding is that the main model can verify in parallel the multiple tokens predicted by the draft model. But if it only predicts one additional token, wouldn't verifying even that single token actually take extra time? Please feel free to correct me if I've misunderstood something. Thanks!
Hi @TianTengya, speculative decoding still accelerates with a single predicted token because of the "bonus token". The main idea is that token generation is based on sampling a "next" token from a sequence. For k=1, we have two sequences we can sample from: the current completion sequence, and the sequence plus the speculated token. We sample from the end of both sequences, and if we keep the speculated token then we also get to keep the "bonus" token that comes next. This means that with an acceptance rate of X for the first token, each decoding step will sample 1+X tokens.
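As a back-of-the-envelope sketch of that arithmetic (assuming, for simplicity, that the single draft token is accepted independently with probability x):

```python
def expected_tokens_per_step(x: float) -> float:
    # k = 1 only: reject -> 1 token (the target model's own sample),
    #             accept -> 2 tokens (draft token + bonus token).
    return (1 - x) * 1.0 + x * 2.0  # simplifies to 1 + x

print(expected_tokens_per_step(0.83))  # ~1.83 tokens per decode step at 83% acceptance
```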
Thanks for the clarification! I see now how k=1 still yields a bonus token when accepted and speeds up decoding. Appreciate it.
This PR extends the DeepSeek MTP implementation so that the number of speculative tokens may be greater than the number of prediction modules. In such cases, the layers are reused by feeding the hidden states from one step into the next.
This will also fix a few issues with the original implementation that would not have been able to handle k > 1:
- `TP1DraftModelRunner` usage. As I understand it, DeepSeek MLA is not compatible with multi-step at all, so this pathway would not be compatible with k > 1. I do not expect there would be any benefits to using this path for k=1.
- TP > 1 on all EAGLE models, but I have not tested this claim beyond MTP.

Benchmark results on 8xH200 with sample requests from ShareGPT are attached as plots, with the following reported acceptance rates: 83% for k=1, 72% for k=2, 61% for k=3.
Until additional MTP module weights are released, using this implementation with k=2 gives the best performance from my testing.
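For intuition, here is a hedged sketch of the reuse scheme described above; mtp_module, its call signature, and the helper name are hypothetical, not the actual vLLM implementation.

```python
# Hedged sketch (assumed interface, not vLLM code): when k > n_predict, the single
# trained MTP module is applied autoregressively, feeding the hidden state it
# produces back in as the "previous hidden state" for the next speculative step.
def propose_draft_tokens(mtp_module, last_token, prev_hidden, k):
    draft_tokens = []
    for _ in range(k):
        next_token, next_hidden = mtp_module(last_token, prev_hidden)
        draft_tokens.append(next_token)
        last_token, prev_hidden = next_token, next_hidden
    return draft_tokens
```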