[DCP] Support Decode Context Parallel (DCP) for GQA with FlashAttention #24864
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default; only a small and essential subset of CI tests runs to quickly catch errors. You can ask your reviewers to trigger select CI tests on top of that. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request introduces Decode Context Parallel (DCP) support for GQA models using FlashAttention. The approach separates attention computation for the context and new tokens, computing partial attention on each rank for the distributed context KV cache and then combining them. While the overall logic is sound, I've identified a critical issue in the implementation concerning tensor shape manipulation that will lead to incorrect results. Specifically, there are incorrect transpose operations on the log-sum-exp (LSE) tensor that must be addressed. The other changes for integrating DCP, such as disabling cascade attention and adding necessary metadata, appear to be correct.
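For context on the merge step this review refers to: combining partial attention outputs computed over disjoint KV shards requires their log-sum-exp (LSE) values, and the LSE memory layout (some FlashAttention-style kernels return `[num_heads, num_tokens]`, others `[num_tokens, num_heads]`) is exactly where a stray transpose produces wrong results. Below is a minimal PyTorch sketch of the rescale-and-sum math; the function and shape conventions are illustrative, not the PR's actual helpers:

```python
import torch

def merge_attn_outputs(out_a, lse_a, out_b, lse_b):
    """Merge two partial attention results computed over disjoint KV sets.

    out_a, out_b: [num_tokens, num_heads, head_dim] partial attention outputs
    lse_a, lse_b: [num_tokens, num_heads] log-sum-exp of the attention logits
    """
    lse = torch.logaddexp(lse_a, lse_b)          # combined softmax normalizer
    w_a = torch.exp(lse_a - lse).unsqueeze(-1)   # rescale weight for part A
    w_b = torch.exp(lse_b - lse).unsqueeze(-1)   # rescale weight for part B
    return w_a * out_a + w_b * out_b
```

If a kernel returns LSE in `[num_heads, num_tokens]` layout, it must be transposed before a merge written against the convention above; mixing the two layouts broadcasts silently into incorrect values, which is consistent with the issue flagged here.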
Please add a GQA+DCP unit test in test_context_parallel.py.
Otherwise, LGTM. Clean and good job!
Please fix the pre-commit linter error and add tests. Otherwise LGTM, thanks for the great work!
cc @LucasWilkinson to give the final sign-off.
This pull request has merge conflicts that must be resolved before it can be merged.
Apologies for the delay! LGTM! Thanks for the contribution!
Signed-off-by: yuanyongjie.yyj <yuanyongjie.yyj@antgroup.com> Signed-off-by: FENP <32334296+FENP@users.noreply.github.com>
Signed-off-by: FENP <32334296+FENP@users.noreply.github.com>
Signed-off-by: Jaya Yuan <yuanyongjie.yyj@antgroup.com>
Signed-off-by: FENP <32334296+FENP@users.noreply.github.com>
…on (vllm-project#24864) Signed-off-by: yuanyongjie.yyj <yuanyongjie.yyj@antgroup.com> Signed-off-by: FENP <32334296+FENP@users.noreply.github.com> Signed-off-by: Jaya Yuan <yuanyongjie.yyj@antgroup.com>
…on (vllm-project#24864) Signed-off-by: yuanyongjie.yyj <yuanyongjie.yyj@antgroup.com> Signed-off-by: FENP <32334296+FENP@users.noreply.github.com> Signed-off-by: Jaya Yuan <yuanyongjie.yyj@antgroup.com> Signed-off-by: Dhruvil Bhatt <bhattdbh@amazon.com>
…on (vllm-project#24864) Signed-off-by: yuanyongjie.yyj <yuanyongjie.yyj@antgroup.com> Signed-off-by: FENP <32334296+FENP@users.noreply.github.com> Signed-off-by: Jaya Yuan <yuanyongjie.yyj@antgroup.com> Signed-off-by: bbartels <benjamin@bartels.dev>
…on (vllm-project#24864) Signed-off-by: yuanyongjie.yyj <yuanyongjie.yyj@antgroup.com> Signed-off-by: FENP <32334296+FENP@users.noreply.github.com> Signed-off-by: Jaya Yuan <yuanyongjie.yyj@antgroup.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>
…on (vllm-project#24864) Signed-off-by: yuanyongjie.yyj <yuanyongjie.yyj@antgroup.com> Signed-off-by: FENP <32334296+FENP@users.noreply.github.com> Signed-off-by: Jaya Yuan <yuanyongjie.yyj@antgroup.com> Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
Purpose
This PR adds Decode Context Parallel (DCP) support for GQA, following PR #23734. The current implementation is based on FlashAttention.
Unlike MLA inference, GQA does not distinguish between prefill and decode during the forward pass. To support DCP, this PR separately computes attention over the context KV (sharded across DCP ranks) and the new tokens' KV within a sequence, and then merges the partial results.
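To make the split-and-merge concrete, here is a minimal, runnable PyTorch sketch (not the PR's actual code; it uses naive single-sequence attention and ignores causal masking and the DCP all-gather for brevity). Splitting the KV into a context part and a new-token part, attending to each independently, and merging the partial outputs via their log-sum-exp (LSE) values reproduces full attention exactly:

```python
import torch

torch.manual_seed(0)
T, H, D = 4, 8, 64   # query (new) tokens, heads, head dim
C = 32               # context tokens held in the (sharded) KV cache

q = torch.randn(T, H, D)
k = torch.randn(C + T, H, D)
v = torch.randn(C + T, H, D)

def attn_with_lse(q, k, v):
    """Naive attention that also returns the per-(token, head) log-sum-exp."""
    scores = torch.einsum("thd,shd->ths", q, k) / D**0.5
    lse = torch.logsumexp(scores, dim=-1)                 # [T, H]
    out = torch.einsum("ths,shd->thd", scores.softmax(-1), v)
    return out, lse

# Reference: attention over the full KV in one shot.
full_out, _ = attn_with_lse(q, k, v)

# DCP-style split: context KV and new-token KV attended to separately.
ctx_out, ctx_lse = attn_with_lse(q, k[:C], v[:C])
new_out, new_lse = attn_with_lse(q, k[C:], v[C:])

# Merge the partial results via their LSEs.
lse = torch.logaddexp(ctx_lse, new_lse)
merged = (torch.exp(ctx_lse - lse)[..., None] * ctx_out
          + torch.exp(new_lse - lse)[..., None] * new_out)

assert torch.allclose(merged, full_out, atol=1e-5)
```

Under DCP, each rank would compute `ctx_out`/`ctx_lse` over its own context shard and the partials from all ranks would be folded together the same way.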
Test Plan
Qwen/Qwen3-235B-A22B-Instruct-2507
Test Result
[kv_cache_utils.py:859] Multiplying the GPU KV cache size by the dcp_world_size 2.
[kv_cache_utils.py:864] GPU KV cache size: 1,723,584 tokens
[kv_cache_utils.py:868] Maximum concurrency for 262,144 tokens per request: 6.57x

(The concurrency figure is simply the cache size divided by the per-request token budget: 1,723,584 / 262,144 ≈ 6.57.)

Additionally, I think we can reduce latency by overlapping the query computation with the context communication; see the sketch below.
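A rough, hypothetical illustration of that overlap, using torch.distributed async collectives (`attn_with_lse` and the LSE merge are as in the sketches above; none of the names here are the PR's actual code):

```python
import torch
import torch.distributed as dist

def decode_step_overlapped(q, ctx_kv_local, new_kv, dcp_group):
    """Hypothetical sketch: hide the DCP context all-gather behind the
    attention over the new tokens. ctx_kv_local and new_kv are (k, v) pairs."""
    # Partial attention over this rank's shard of the context KV.
    ctx_out, ctx_lse = attn_with_lse(q, *ctx_kv_local)

    world = dist.get_world_size(dcp_group)
    outs = [torch.empty_like(ctx_out) for _ in range(world)]
    lses = [torch.empty_like(ctx_lse) for _ in range(world)]

    # Launch the all-gathers without blocking...
    work_o = dist.all_gather(outs, ctx_out, group=dcp_group, async_op=True)
    work_l = dist.all_gather(lses, ctx_lse, group=dcp_group, async_op=True)

    # ...and compute attention over the new tokens while they are in flight.
    new_out, new_lse = attn_with_lse(q, *new_kv)

    work_o.wait()
    work_l.wait()

    # Fold every partial result into one output via the LSE merge.
    out, lse = new_out, new_lse
    for o, l in zip(outs, lses):
        m = torch.logaddexp(lse, l)
        out = torch.exp(lse - m)[..., None] * out + torch.exp(l - m)[..., None] * o
        lse = m
    return out
```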
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.