[Spec-decode] Refactor cudagraphs for spec-decode; support uniform_alignment of cudagraph sizes. #23679
base: main
Conversation
Signed-off-by: fhl <2410591650@qq.com>
Code Review
This pull request provides a comprehensive fix and refactoring for speculative decoding with CUDA graphs. The changes are well-structured and address several key issues, including the broken CUDA graph for the EAGLE drafter, by introducing a separate dispatcher and a unified planning mechanism. The refactoring of the CudagraphDispatcher to support multiple uniform query lengths and the consolidation of padding and dispatching logic into a single plan method significantly improve code clarity and maintainability. The removal of build_for_cudagraph_capture from attention backends is a good simplification. Overall, the code quality is high, and the changes appear to be correct and robust. I have reviewed the changes in detail and could not find any issues of high or critical severity.
Thanks for doing this, will take a look soon.
This might have issues: currently Triton forces seq_lens to 1 (we could ask the builder what seq_lens it wants, or make it optional, not sure), and AITER FA does something with total tokens, although (a) that seems unused and (b) it was reported broken.
I have confirmed that forcing seq_lens to 1 is not necessary for speeding up Triton attention capture. Currently, with the frozen CG feature, the capture time is almost the same as FA2, and it has no influence on the generated contents. Yeah, the …
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
cc @elvischenv for vis
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
Documentation preview: https://vllm--23679.org.readthedocs.build/en/23679/
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
Already prepared for the next round of review! The main updates resolve the potential cudagraph-miss issue when DP size > 1. cc @ProExpertProg @LucasWilkinson @SageMoore. Edit: also added a new flag to disable the uniform_alignment feature introduced in this PR.
Purpose
This PR does several things to fix and refactor speculative decoding:

1. Fix the broken cudagraph of the EAGLE drafter. (Now temporarily fixed by [Spec-Decode] Support piecewise cudagraphs for Eagle head #25109.) The cudagraph ability of the EAGLE(3) model in V1 has been broken for a while since the landing of [Core] Allow full cudagraph with separate attention routines and orthogonal to compilation, add support for FA2 and FlashInfer #20059, because vllm torch.compile now never implicitly runs a cudagraph until we explicitly dispatch it via the cudagraph dispatcher. This PR fixes that by adding a new dispatcher for the drafter that is separate from the main model's. The reason we don't share a single dispatcher is that in the spec-decode scenario, the uniform decode query length for the main model is `1 + num_spec_tokens`, while for a drafter it is generally just 1.
2. Make the "cudagraph_capture_size" of uniform decode divisible by `uniform_query_len = 1 + num_spec_tokens` when spec decode is enabled. This requires a new non-configurable property `uniform_cudagraph_capture_sizes` to be added to `CompilationConfig`; it is post-computed from, and kept separate from, the configurable `cudagraph_capture_sizes`. The reason for doing this is that attention backends with `AttentionCGSupport.UNIFORM_BATCH` support require token uniformity strictly, i.e., we couldn't pad to a batch size that is not divisible by `uniform_query_len`, unlike varlen-supporting backends such as FlashAttention (v2/v3) and Triton Attention. See also the "Padded speculation" for the drafter proposed in [Performance]: Padded Speculative Decoding #21984 and the example in [Spec Decode] Efficient padded speculation #24539. (A rough sketch of this alignment appears at the end of this Purpose section.)
3. The cudagraph capturing of the drafter is also separated from the capturing of the main model, since the main model uses `uniform_cudagraph_capture_sizes` for the validation phase (uniform_decode_query_len > 1), while the EAGLE drafter uses the chunked `cudagraph_capture_sizes`, as it generally has uniform_decode_query_len = 1 (after the first draft token).
4. The cudagraph_dispatcher and batch_descriptor are further refactored to support multiple uniform_decode_query_len values together. This paves the way for full cudagraph support of "Padded speculation" in [Performance]: Padded Speculative Decoding #21984 and [Spec Decode] Efficient padded speculation #24539, where for a pure spec-decode loop (no prefill tokens) the `uniform_decode_query_len` can be `1 + num_spec_tokens` for the first draft token and only 1 for the subsequent draft tokens.

The prototype of batch_descriptor is now:
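A minimal sketch of such a descriptor, assuming a NamedTuple-style dispatch key with `num_tokens` and `uniform_query_len` fields as per the description above (names and defaults are inferred, not the PR's verbatim code):

```python
from typing import NamedTuple


class BatchDescriptor(NamedTuple):
    """Dispatch key used by a cudagraph dispatcher to select a captured graph."""

    # Padded total token count of the batch.
    num_tokens: int
    # Query length shared by every request in a uniform-decode batch:
    # e.g. 1 + num_spec_tokens for the main model's validation step, or 1
    # for the drafter. Forced to 0 for non-uniform batches so that uniform
    # and non-uniform batches never map to the same key.
    uniform_query_len: int = 0
```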
For non-uniform decode, uniform_query_len is forced to 0 to ensure the uniqueness of a cudagraph batch.
For more details, please see the code.
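As a rough illustration of the alignment rule in item 2, here is a hedged sketch of how the uniform capture sizes could be derived from the configured ones; the helper name is hypothetical:

```python
def compute_uniform_capture_sizes(
    capture_sizes: list[int], uniform_query_len: int
) -> list[int]:
    """Round each configured capture size up to a multiple of
    uniform_query_len, so that a uniform-decode batch can always be
    padded to a size that keeps per-request token counts identical."""
    aligned = {
        -(-size // uniform_query_len) * uniform_query_len  # ceiling division
        for size in capture_sizes
    }
    return sorted(aligned)


# Example: num_spec_tokens = 2 gives uniform_query_len = 3, and the
# configured sizes [1, 2, 4, 8, 16] align to [3, 6, 9, 18].
print(compute_uniform_capture_sizes([1, 2, 4, 8, 16], 3))
```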
Test Plan
To be planned! (Call for suggestions: we can't do much for a full cudagraph of spec-decode currently, and most of this code prepares for the full cudagraph feature of spec-decode in the near future.)
Can confirm that the piecewise cudagraph ability of the drafter is fixed!
*Note on compatibility with DP:
When DP is enabled, we have to pad the token counts to the same size across DP ranks for cudagraph (mainly for the collective communication of MoEs). That makes it hard to hit a cudagraph size that fits all DP ranks when some ranks have uniform-decode batches and others have non-uniform batches, since the two cases use different lists of cudagraph sizes.
To address this, a uniform-decode batch is potentially padded to the size of a non-uniform batch and falls back to a piecewise cudagraph. (Thanks for the context from @SageMoore and the ideas from @LucasWilkinson, respectively.)
The attention part still runs in eager mode, so there is no problem supporting "Padded speculation".
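A minimal sketch of that fallback decision, assuming all DP ranks have already agreed on a common padded token count (function and parameter names are hypothetical):

```python
def local_cudagraph_mode(
    agreed_num_tokens: int,
    local_is_uniform: bool,
    uniform_sizes: set[int],
) -> str:
    """Pick the local cudagraph mode after the DP ranks agree on a common
    padded token count (e.g. via an all-reduce of per-rank maxima)."""
    if local_is_uniform and agreed_num_tokens in uniform_sizes:
        # The agreed size is uniform-aligned: use the full uniform-decode graph.
        return "full"
    # A uniform-decode batch padded to a non-uniform size falls back to a
    # piecewise graph: attention runs eagerly, the rest stays captured.
    return "piecewise"
```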