
Conversation

Collaborator

@benchislett benchislett commented Feb 20, 2025

This PR extends the DeepSeek MTP implementation so that the number of speculative tokens may be greater than the number of prediction modules. In such cases, the modules are reused by feeding the hidden states produced in one step into the next.
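
To make the reuse concrete, here is a minimal sketch of the idea (illustrative only: the function and module names are placeholders rather than vLLM APIs, and the real MTP step also consumes the previously sampled token, which this sketch elides):

```python
import torch
import torch.nn as nn

def propose_draft_tokens(mtp_module: nn.Module, lm_head: nn.Module,
                         hidden_states: torch.Tensor, k: int) -> list[int]:
    """Reuse a single MTP prediction module for k speculative steps by feeding
    the hidden states produced at step i back in as the input to step i + 1."""
    draft_tokens = []
    for _ in range(k):
        hidden_states = mtp_module(hidden_states)                 # one MTP step
        next_token = int(lm_head(hidden_states).argmax(dim=-1))   # greedy draft token
        draft_tokens.append(next_token)
    return draft_tokens

# Example with stand-in modules (hidden size 8, vocab size 32):
torch.manual_seed(0)
print(propose_draft_tokens(nn.Linear(8, 8), nn.Linear(8, 32), torch.randn(8), k=3))
```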

This PR also fixes a few issues in the original implementation that would have prevented it from handling k > 1:

  • Disable TP1DraftModelRunner usage. As I understand it, DeepSeek MLA is not compatible with multi-step execution at all, so this pathway would not work for k > 1. I do not expect any benefit from using this path for k=1.
  • Enable hidden states forwarding in the fallback model-runner case, so that either code path can be used for EAGLE-style speculative decoding. I believe this will enable TP > 1 on all EAGLE models, but I have not tested this claim beyond MTP.

Below are benchmark results on 8xH200 with sample requests from ShareGPT:

| Max Concurrency | Baseline TPOT | k=1 TPOT | k=2 TPOT | k=3 TPOT |
|-----------------|---------------|----------|----------|----------|
| 1 | 22.97ms | 17.31ms | 14.98ms | 14.30ms |
| 2 | 27.45ms | 17.50ms | 16.17ms | 16.72ms |
| 4 | 29.86ms | 22.62ms | 20.53ms | 25.30ms |
| 8 | 35.09ms | 38.43ms | 25.25ms | 32.39ms |

With the following reported acceptance rates: 83% for k=1, 72% for k=2, 61% for k=3.

Until additional MTP module weights are released, using this implementation with k=2 gives the best performance from my testing.

Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run other CI tests on top of that by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

Contributor

Also, should lines 100-101 be moved outside the for-loop?

Contributor
@luyuzhe111 luyuzhe111 Feb 24, 2025

@benchislett Another issue here concerns a specific setting: draft TP = 1 with an attention backend that is not FlashAttention (MLA, xFormers). In this case, self.model_runner is TP1DraftModelRunner, but self.model_runner.supports_gpu_multi_step(expanded_request) returns False since that runner only supports the FlashAttention backend. Then, even if self.worker.model_runner.return_hidden_states is set to True, the runner won't return hidden states across multiple draft steps since that isn't currently supported, which results in errors when we get to self._maybe_update_previous_hidden_states(), since the input hidden states are None. To reproduce this error, one can simply run https://github.com/vllm-project/vllm/blob/main/tests/spec_decode/e2e/test_eagle_correctness.py. This issue doesn't undermine the usefulness of this PR, but I'd like to see if we can figure out a better way to handle it, for example by throwing an error / exception early on?

Collaborator Author

I have refactored the snippet to move it outside the loop.

As for the attention backend, there are a few solutions I see to this problem, and all are simple. One easy fix is to add `and model_config.use_mla` to the selector in multi_step_worker so that TP1DraftModelRunner is never used in "fallback" mode for MTP speculative decoding.

However, it seems easier to just enable return_hidden_states in TP1DraftModelRunner. I am not sure why this was omitted originally, but the code from vllm/worker/model_runner.py seems to work just fine. The test_eagle_correctness.py tests are now passing locally. Please take a look at the updated implementation.

Contributor

Thanks again for your great work! Regarding returning the hidden states, I think we might be able to simplify it a bit more. Do we have to consider prefill & decoding, with/without CUDA graphs, here? It seems that adding

        if self.return_hidden_states and is_fallback:
            outputs[0].hidden_states = hidden_states

at the end of execute_model works just fine on my end.

Collaborator Author

I think this would be sufficient for most cases. My motivation was to maintain compatibility with ModelRunner so that TP1DraftModelRunner is as flexible as possible for future extensions.

However, given that this runner is used exclusively for the draft model, the hidden states are currently only needed for the EAGLE code path. So the code as proposed should work; I'm comfortable with either implementation.

Collaborator Author

I have updated the code to the shorter diff.

Contributor

LGTM!

@mergify

mergify bot commented Feb 25, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @benchislett.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

```diff
 else:
-    if draft_tp == 1 or draft_model_config.hf_config.model_type == \
-            "deepseek_mtp":
+    if draft_tp == 1:
```
Collaborator

What's the concern with using TP1DraftModelRunner for deepseek_mtp when tp > 1 here?

Collaborator Author

@luccafong
I don't think that TP1DraftModelRunner should be used at all for MTP. Since the advance-step is not compatible with the MLA backend, it isn't going to work for k > 1 without some significant changes. In this PR I maintain compatibility with TP1DraftModelRunner so that it can be used in the future once the advance-step is implemented, but even then I believe it will only support TP=1, since the broadcast handling that TP1DraftModelRunner lacks hasn't been added either.

Using the plain ModelRunner for this branch makes the most sense to me, as we are using it as intended for single-step execution.

Collaborator

I was planning to make TP1DraftModelRunner compatible with both cases, so the name is probably confusing. Actually, switching the model runner will make it harder to support real multi-step k > 1 MTP later, but it's fine for this PR's use case.

Collaborator Author

That sounds challenging. Best of luck!

Contributor

@luccafong QQ: does TP1DraftModelRunner work when the draft model has TP=8? If not, then I believe this PR also has the benefit of greatly relieving memory pressure, since otherwise the full MTP module (~16GB of weights) would reside entirely on the first device, which limits the max context length that can be supported.

Collaborator
@luccafong luccafong Feb 25, 2025

> @luccafong QQ: does TP1DraftModelRunner work when the draft model has TP=8? If not, then I believe this PR also has the benefit of greatly relieving memory pressure […]

The previous implementation supports TP=8 in the draft model runner for MTP in the k=1 case, and we attached the TP=8 benchmark there: #12755.

Contributor

@luccafong ah this makes total sense now. thanks for the clarification!

Collaborator Author

@pyc96 Yes, there is some special handling there and in a few other places. Right now multi-step is hard-coded in several ways to be specific to the Flash Attention backend. The advance_step operation handles some attention-specific state processing, and so its behaviour is specific to the attention backend being used.

I expect that much of this logic could be re-used for MLA, but this will have to be its own contribution. In the past I have tried to forcibly enable this code path for MLA, and there were several mismatches in state that would need to be handled specifically. I don't think that adding this compatibility is a trivial change. If you disable MLA, then the multi-step code is usable.

Contributor

@benchislett Thank you for your reply! On top of your PR, I changed

    assert isinstance(attn_metadata, FlashAttentionMetadata)

to

    assert isinstance(attn_metadata, (FlashAttentionMetadata, MLACommonMetadata))

and it just worked fine for the drafter TP=1 case.

Actually, without this change, I believe the current code will throw this assertion error during request serving in the TP=1 case. Is that intended?

Collaborator Author

@pyc96 I can confirm that it does seem to work with multi-step now. Even vllm/worker/multi_step_model_runner.py seems to be successful with the MLA backend. However, changing the allowed backends list is still required there as well, which leads me to believe this support wasn't specifically added but rather that the blocking restrictions were independently eliminated.

For this PR, I would prefer to merge sooner and have follow-up work (cc @luccafong) add the multi-step functionality for DeepSeek with thorough testing and sign-off.

For now, I've pushed a commit which removes the MLA backend from supports_gpu_multi_step so that the fallback path must be used. This will avoid the assertion failure you identified. If the community is confident that multi-step is safe with MLA, I am happy to turn the support back on.

@mergify mergify bot removed the needs-rebase label Feb 25, 2025
@LiuXiaoxuanPKU
Collaborator

Hi @benchislett, sorry for the delay here. I was thinking through the implementation at a higher level. I list my understanding below; please correct me if I am wrong.

[Two screenshots attached.]

How does the current implementation handle the KV cache for the MTP modules? Since their KV caches are not aligned, I feel there is some complexity here.

@luccafong
Collaborator

> How does the current implementation handle the KV cache for the MTP modules? Since their KV caches are not aligned, I feel there is some complexity here.

I think this PR just reuses the same module for k > 1; correct me if I am wrong. @benchislett

@benchislett
Collaborator Author

That's right. This PR just reuses the same module for k > 1, so the KV cache is handled the same as for an ordinary draft model (at least for n_predict == 1, which is the only configuration supported right now).

Handling the KV cache for n_predict > 1 does indeed seem like a difficult challenge, and I'm not sure how it would best be accomplished in practice. Please see #13805 to discuss such matters.

Collaborator
@LiuXiaoxuanPKU LiuXiaoxuanPKU left a comment

LGTM! Let's get this PR in first since it already brings some performance improvement.

@luyuzhe111
Contributor

@LiuXiaoxuanPKU since you were sharing the diagrams, I want to use this thread to point out that the current handling of the first token for MTP / EAGLE, by initializing the hidden states as zeros, is actually wrong in vLLM. Due to the presence of attention modules, even though the Q, K, V tensors will all be zeros, the attention score normalizer will still be thrown off by the softmax operator (e^0 = 1). The correct way, per both EAGLE and MTP, is to get rid of the first hidden state and adjust the attention metadata accordingly, so that both the attention scores and the position encodings are correct (the second token from the target model should be the first token fed to EAGLE and should receive the first position encoding). cc @benchislett @luccafong.
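
A toy numerical illustration of this normalizer issue (illustrative only, not vLLM code): a position whose query/key projections are all zeros still contributes e^0 = 1 to the softmax denominator, scaling down the attention weights of every real token.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

real_logits = np.array([2.0, 1.0, 0.5])   # attention logits for real tokens
print(softmax(real_logits))               # -> approx. [0.63, 0.23, 0.14]

# Prepending a "zero hidden state" position: zero Q and K give a logit of 0,
# which still adds e^0 = 1 to the denominator and dilutes the real weights.
with_zero = np.concatenate(([0.0], real_logits))
print(softmax(with_zero))                 # -> approx. [0.08, 0.58, 0.21, 0.13]
```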

@LiuXiaoxuanPKU LiuXiaoxuanPKU added the ready label (ONLY add when PR is ready to merge/full CI is needed) Feb 27, 2025
@LiuXiaoxuanPKU LiuXiaoxuanPKU merged commit 9804145 into vllm-project:main Feb 27, 2025
51 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in DeepSeek V3/R1 Feb 27, 2025
@Neo9061

Neo9061 commented Feb 28, 2025

Hi @benchislett, just a quick question: by "With the following reported acceptance rates: 83% for k=1, 72% for k=2, 61% for k=3", I assume you are looking at "System efficiency" rather than "Draft acceptance rate" in the vLLM log?

@benchislett
Collaborator Author

@Neo9061 This is the draft acceptance rate, actually.

If you are surprised by the falloff as k increases, I can offer some intuition as to why I think this happens. Since the MTP module was only trained with k=1, it was never used autoregressively during training (hidden states returned by MTP module 1 are never seen as inputs to the same module). So as the module's own hidden states are fed back into it, the distribution drifts further from the training data and the quality falls off. I think this is expected behaviour given the setup.

Akshat-Tripathi pushed a commit to krai/vllm that referenced this pull request Mar 3, 2025
lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
czhu15 pushed a commit to HabanaAI/vllm-fork that referenced this pull request Apr 23, 2025
czhu15 added a commit to HabanaAI/vllm-fork (#1144) that referenced this pull request Apr 24, 2025: Expand DeepSeek MTP code to support k > n_predict (vllm-project#13626)
shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025
inkcherry pushed a commit to inkcherry/vllm that referenced this pull request May 27, 2025
@TianTengya

> This PR extends the DeepSeek MTP implementation such that the number of speculative tokens may be greater than the number of prediction modules. […] Until additional MTP module weights are released, using this implementation with k=2 gives the best performance from my testing.

@benchislett Hi, I have a question I'd like to ask for clarification. If k=1, MTP (Multi-Token Prediction) can only predict one additional token ahead, so how exactly does this achieve inference acceleration? My understanding is that the speedup from speculative decoding comes from the main model verifying the draft model's multiple predicted tokens in parallel. But if it only predicts one additional token, wouldn't verifying even that single token actually take extra time? Please feel free to correct me if I've misunderstood something. Thanks!

@benchislett
Collaborator Author

Hi @TianTengya, speculative decoding still accelerates with a single predicted token thanks to the "bonus token". The main idea is that token generation is based on sampling a "next" token from a sequence. For k=1, we have two sequences we can sample from: the current completion, and the completion plus the speculated token. We sample from the end of both sequences, and if we keep the speculated token then we also get to keep the "bonus" token that comes after it. This means that with an acceptance rate of X for the first token, each decoding step produces 1 + X tokens on average.
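
As a quick back-of-the-envelope check (illustrative arithmetic only, using the k=1 acceptance rate reported earlier in this thread):

```python
# With k=1 and per-step draft acceptance rate X, each target-model decoding
# step emits the bonus token plus the speculated token whenever it is
# accepted, i.e. 1 + X tokens on average.
X = 0.83  # reported draft acceptance rate for k=1
print(f"~{1 + X:.2f} tokens per target-model step at k=1")  # ~1.83
```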

@TianTengya

> Hi @TianTengya, speculative decoding still accelerates with a single predicted token thanks to the "bonus token". […]

Thanks for the clarification! I see now how k=1 still yields a bonus token when accepted and speeds up decoding. Appreciate it.


Labels

ready (ONLY add when PR is ready to merge/full CI is needed), speculative-decoding

Projects

Status: Done

7 participants