feat: spec decode with draft models #24322
Conversation
Code Review
This pull request introduces support for speculative decoding using a draft model. The changes are comprehensive, touching configuration, model loading, scheduling, and the core speculative decoding logic. New tests and benchmark modifications are also included to validate and measure the new feature. The overall implementation appears solid. However, I've identified a critical issue in the refactoring of the `bind_kv_cache` utility function: it removes an important safety check and could lead to incorrect behavior for certain model architectures.
@tomasruizt - Thank you for the PR!

What is the TP you are using for Qwen3-32B? By default, the draft model TP is equal to the target model TP. Since Qwen3-1.7B is a small model, running it at high TP may incur NCCL communication overhead. Try setting the draft TP to 1.

I ran the benchmarks with TP=1 and num_draft_tokens=3, so we can rule out TP communication issues.
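For illustration, a sketch of the kind of config being discussed (the `draft_tensor_parallel_size` key is the one referenced later in this thread; per the discussion below, the current PR requires it to match the target TP, so a mismatching value is rejected):

```
# Sketch only: draft TP pinned to 1 via the speculative config.
# In the current state of this PR this is accepted only when the
# target model also runs with --tensor-parallel-size 1.
vllm serve Qwen/Qwen3-32B \
  --tensor-parallel-size 1 \
  --speculative-config '{"model": "Qwen/Qwen3-1.7B", "method": "draft_model", "num_speculative_tokens": 3, "draft_tensor_parallel_size": 1}'
```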
This pull request has merge conflicts that must be resolved before it can be merged.

Force-pushed from 7de2ae1 to 2e0fb65.
@ggg-s setting different TP sizes for the target and draft models will raise an error in the latest commit. If you don't specify the TP size for the draft model, it will default to the target model TP.

@tomasruizt yep. I pulled the latest commit and locally removed `self._raise_if_draft_tp_mismatch()` to test heterogeneous TP. With `"draft_tensor_parallel_size": 1` set in `speculative-config`, the service starts without errors, but the drafter still gets instantiated on every TP rank: logs show each worker printing "Loading drafter model…", and GPU memory usage is identical across GPUs. So even though the check is bypassed, the behavior remains "replicated-per-rank" rather than a single TP=1 drafter. If there's a flag or code path to pin the drafter to rank 0 (and broadcast results), I'm happy to try it.
@ggg-s that's expected. As I said:
> The only way to have TP = 1 on the drafter is to have TP = 1 on the target model, atm.

@tomasruizt Got it, thank you for your explanation!
I encountered a new issue involving tensor dimension mismatches under high concurrency. What causes this problem?

Command: `vllm bench serve --model qwen3fd`

Error message:
I think I know precisely the line causing it. But to reproduce I would need the questions.jsonl file. Would it be possible to share it? @QingNagi

Sure! Download the dataset using:

wget https://raw.githubusercontent.com/hemingkx/Spec-Bench/refs/heads/main/data/spec_bench/question.jsonl

@QingNagi can you share what commit you are testing on?

vllm==0.11.0
@QingNagi I meant which commit of this branch?

This commit: 37f013e. When I add the --request-rate parameter, the problem doesn't occur, which is strange.

@QingNagi Are you on the vLLM Slack? We can chat over there. Edit: I'm not able to reproduce the issue. Can you try with the latest commit of this branch? It would help with reproduction if the error happens with a smaller model (e.g. target=Qwen3-1.7B-FP8, draft=Qwen3-0.6B), if it happens with --enforce-eager, and if you let me know what GPU you are running on.

Ok, I will try it on the latest commit.
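For reference, a minimal serving setup along the lines suggested above might look like the sketch below (the smaller Qwen3 pair and `--enforce-eager` come from the suggestion in the thread; this is not the reporter's exact command):

```
# Smaller target/draft pair plus --enforce-eager, as suggested for easier
# reproduction and to rule out CUDA-graph-related effects. Sketch only.
vllm serve Qwen/Qwen3-1.7B-FP8 \
  --speculative-config '{"model": "Qwen/Qwen3-0.6B", "method": "draft_model", "num_speculative_tokens": 3}' \
  --enforce-eager
```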
@tomasruizt I was wondering if this PR (#26937) could help improve the decoding speed of the draft model?

@ggg-s currently the EAGLE/draft model is forced to use piecewise CUDA graphs. We can optimize this aspect as a follow-up feature.
Got it! |
> I measured online throughput metrics using the commands below. Hardware was an RTX PRO 6000 96GB. After making sure the draft model also uses CUDA graph, SD has higher throughput than not using SD. See tables below.

@tomasruizt Does using CUDA graph here refer to using the FULL CUDA graph for SD? If so, could you please let me know which parameters need to be set to enable SD to use it? Thank you!

@ggg-s Full CUDA graphs are not yet supported. That comment referred to piecewise CUDA graphs. They are used by default unless you pass --enforce-eager.
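To make that concrete, a sketch of the two modes mentioned above (piecewise CUDA graphs are the default; `--enforce-eager` disables CUDA graphs entirely; full CUDA graphs for SD are not yet supported):

```
# Default: piecewise CUDA graphs are used for the target and the draft model.
vllm serve Qwen/Qwen3-4B \
  --speculative-config '{"model": "Qwen/Qwen3-0.6B", "method": "draft_model", "num_speculative_tokens": 3}'

# Eager mode: no CUDA graphs at all (slower decode, useful for debugging).
vllm serve Qwen/Qwen3-4B \
  --speculative-config '{"model": "Qwen/Qwen3-0.6B", "method": "draft_model", "num_speculative_tokens": 3}' \
  --enforce-eager
```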
@tomasruizt Can I contact you on Slack?

Here is my link: https://join.slack.com/shareDM/zt-3g4c4k5qe-o_KDj9TXkNCtZBRsumhwsA
@LiuXiaoxuanPKU I ran throughput benchmarks on a multi-GPU (TP=4) setup with Meta-Llama-3-70B as the target and Llama-3.2-1B as the draft model; see the Llama3 Multi-GPU Metrics section below.

@tomasruizt do you have any comparison to EAGLE3? There's a head for Llama 3.3 70B: https://huggingface.co/yuhuili/EAGLE3-LLaMA3.3-Instruct-70B

@benchislett I haven't run these benchmarks for EAGLE3, but I could compute that tomorrow for comparison.
@benchislett I added the EAGLE3 benchmark for reference. It shows very good acceleration, strongest with small batch sizes. It's faster than the draft_model contributed here, but draft_model doesn't require training a dedicated drafter the way EAGLE does. Furthermore, there are follow-up optimizations that should substantially improve draft_model performance (see the Follow-up Optimizations section below). I tried both for this PR, but found them non-trivial, so I stepped back. I will be away in November, so I think we should try to merge this PR before November.

This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
Enabling draft models for speculative decoding (SD).
E.g. `Qwen3-1.7B` as draft model and `Qwen3-32B` as target model. This type of SD requires no specially trained heads (like EAGLE or Medusa).
Example usage:
```
vllm serve \
  --model=Qwen/Qwen3-4B \
  --speculative-config '{"model": "Qwen/Qwen3-0.6B", "method": "draft_model", "num_speculative_tokens": 3, "max-model-len": 2000, "disable_padded_drafter_batch": true}' \
  --max-model-len 2000
```

Get a generation:

```
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Capital of France", "max_tokens": 16}'
```

Status
Acceptance Length
As suggested by @ekagra-ranjan, I benchmarked acceptance length (AL) with the command below:
```
VLLM_USE_V1=1 python3 examples/offline_inference/spec_decode.py \
  --model-dir Qwen/Qwen3-32B \
  --draft-model Qwen/Qwen3-1.7B \
  --method draft_model \
  --num_spec_tokens 3 \
  --dataset-name hf \
  --dataset-path philschmid/mt-bench \
  --num_prompts 100 \
  --temp 1.0 \
  --gpu-memory-utilization 0.9
```

The AL values within the Qwen3 family seem good, both with temperatures of 0.0 (greedy) and 1.0.
As a sanity check, I benchmarked Llama-3.2-1B as both target and draft, which had almost perfect AL (3.97/4), suggesting it's working as intended.
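For reference, the sanity check presumably corresponds to the command above with both roles set to Llama-3.2-1B; a sketch (temperature and other flags assumed to match the greedy setup):

```
VLLM_USE_V1=1 python3 examples/offline_inference/spec_decode.py \
  --model-dir meta-llama/Llama-3.2-1B \
  --draft-model meta-llama/Llama-3.2-1B \
  --method draft_model \
  --num_spec_tokens 3 \
  --dataset-name hf \
  --dataset-path philschmid/mt-bench \
  --num_prompts 100 \
  --temp 0.0 \
  --gpu-memory-utilization 0.9
```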
I have not run the default model `meta-llama/Llama-3.1-8B-Instruct`, because I didn't find a good draft model for it, but feel free to suggest one and I can run the benchmarks.

Temperature t=0:
Temperature t=1.0:
Using t=1.0, the AL metric degrades. However, spec-decode with draft probabilities, which is needed for lossless rejection sampling, is not yet implemented. This is being addressed atm: #20459. After that PR is merged, the AL for non-greedy spec-decode should improve.
All scripts and logs used for the benchmarks can be found in this Google Drive.
Online Throughput Metrics
I measured online throughput metrics using the commands below. Hardware was an RTX PRO 6000 96GB. After making sure the draft model also uses CUDA graph, SD has higher throughput than not using SD. See tables below.
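The exact serving and benchmark scripts are in the linked Google Drive; as a rough illustration only (model pair and flags assumed, not the exact scripts used), the setup might look like:

```
# Serving side (sketch): target plus draft model with SD enabled.
vllm serve Qwen/Qwen3-32B \
  --speculative-config '{"model": "Qwen/Qwen3-1.7B", "method": "draft_model", "num_speculative_tokens": 3}'

# Client side (sketch): vllm bench serve against the running server,
# varying --max-concurrency to emulate the batch sizes reported below.
vllm bench serve \
  --model Qwen/Qwen3-32B \
  --dataset-name random \
  --num-prompts 100 \
  --max-concurrency 1
```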
The metrics (lower is better) are:
Batch Size = 1
For Temperature = 0.0:

Using SD, runtimes and TPOT are shorter by ~50%.
Batch Size = 100
For Temperature = 0.0:

For Temperature = 1.0:
This scenario with batch size 100 is a more realistic inference case.
Using SD, runtimes and TPOT are shorter.
Profiling
This section was removed, since using CUDA graphs on the draft model significantly improved its speed.
Profiling script
I used the command below to profile the generation process and identify that the draft model was running too slowly before. Note: the command uses the `--profile` flag, which I introduce in this PR: #24575

Test Plan
The added unit test checks the correctness metrics. To run it:

```
cd tests/v1/e2e/
pytest test_spec_decode.py -k test_draft_model_correctness
```

EAGLE testing
I tested that the EAGLE implementation stays unaffected using the command below:

```
VLLM_USE_V1=1 python3 examples/offline_inference/spec_decode.py \
  --model-dir meta-llama/Llama-3.1-8B-Instruct \
  --eagle-dir yuhuili/EAGLE3-LLaMA3.1-Instruct-8B \
  --method eagle3 \
  --num_spec_tokens 7 \
  --dataset-name hf \
  --dataset-path philschmid/mt-bench \
  --num_prompts 80 \
  --temp 0.0 \
  --gpu-memory-utilization 0.9
```

The results are in line with previous measurements like #17504 (comment)
Follow-up Optimizations
Passing `next_token_ids` together with `target_token_ids` in the first forward pass of the draft model. This reduces the number of forward passes needed in each drafting phase by one, speeding up drafting.

Qwen3 Metrics
I compare Qwen3-32B against Qwen3-32B with Qwen3-1.7B as the drafter model, on a single H100 GPU (TP=1).
The benchmarks show (left to right)
Broken down we find:
Llama3 Multi-GPU Metrics
I benchmarked `meta-llama/Meta-Llama-3-70B`, with `meta-llama/Llama-3.2-1B` as a draft model, on 4 x H100 GPUs (TP=4). The goal was to measure the effect of tensor parallelism on throughput, since the small draft model is also running at the same level of tensor parallelism as the target model. I found an acceleration in this setup only up to bsz=32, and a slowdown for bsz=64. This setup shows that using TP > 1 degrades the acceleration of draft_model, presumably because of the inter-GPU communication overhead for the draft model. In a follow-up PR this should be optimized.

serving-script.sh
benchmark-script.sh
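The contents of serving-script.sh and benchmark-script.sh are not reproduced here; a sketch of what the serving side of this setup presumably looks like (flags assumed):

```
# Sketch of the 4 x H100 (TP=4) serving setup described above. The draft
# model currently inherits the target's TP, which is the source of the
# communication overhead discussed in this section.
vllm serve meta-llama/Meta-Llama-3-70B \
  --tensor-parallel-size 4 \
  --speculative-config '{"model": "meta-llama/Llama-3.2-1B", "method": "draft_model", "num_speculative_tokens": 3}'
```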
The TPOT metric improves in general, showing similar results to Qwen3.
EAGLE3 Metrics (Reference)
Below are benchmark values using method="eagle3", with target model `meta-llama/Llama-3.3-70B-Instruct` and draft model `yuhuili/EAGLE3-LLaMA3.3-Instruct-70B` on 4 x H100 GPUs (TP=4). EAGLE achieves over 2x acceleration at batch_size=1. In contrast to draft_model, the EAGLE drafter runs with tensor_parallelism = 1.

serve-script.sh
bench-script.sh
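The contents of serve-script.sh and bench-script.sh are likewise not reproduced here; a sketch of what the EAGLE3 serving side presumably looks like (num_speculative_tokens and other flags assumed):

```
# Sketch of the EAGLE3 reference setup: target on TP=4, EAGLE3 head as drafter
# (the EAGLE drafter itself runs with tensor parallelism 1, as noted above).
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 4 \
  --speculative-config '{"model": "yuhuili/EAGLE3-LLaMA3.3-Instruct-70B", "method": "eagle3", "num_speculative_tokens": 3}'
```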