[Kernel] Support DCP for Triton backend #25132
Conversation
Code Review
This pull request extends Distributed Context Parallelism (DCP) support to the Triton backend by enabling the return of Log-Sum-Exp (LSE) values from attention kernels. The changes are generally well-implemented, but I have identified a critical issue where the Multi-Head Attention (MHA) path appears to be broken due to an incomplete function signature update. Additionally, a temporary test script with user-specific configurations seems to have been included by mistake and should be removed.
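For context on what "returning the LSE" means here, below is a minimal PyTorch sketch, illustrative only and not the Triton kernel itself (`attention_with_lse` and the tensor shapes are assumptions): in addition to the attention output, the kernel also exposes the per-query log-sum-exp of the scaled scores, which downstream context-parallel logic can consume.

```python
# Minimal PyTorch reference (not the actual Triton kernel) showing what it
# means for an attention call to also return the LSE. Names are illustrative.
import torch

def attention_with_lse(q, k, v, scale):
    # q: [num_heads, q_len, head_dim], k/v: [num_heads, kv_len, head_dim]
    scores = torch.matmul(q, k.transpose(-1, -2)) * scale   # [H, q_len, kv_len]
    lse = torch.logsumexp(scores, dim=-1)                    # [H, q_len]
    out = torch.matmul(torch.softmax(scores, dim=-1), v)     # [H, q_len, head_dim]
    return out, lse

q = torch.randn(8, 4, 64)
k = torch.randn(8, 16, 64)
v = torch.randn(8, 16, 64)
out, lse = attention_with_lse(q, k, v, scale=64 ** -0.5)
print(out.shape, lse.shape)  # torch.Size([8, 4, 64]) torch.Size([8, 4])
```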
Do we have a perf comparison?
I do not have a perf comparison. This is mainly functional support, in particular adding the LSE return to the Triton kernel; it won't impact existing Triton kernel performance. With this, I think vLLM has completed FlashMLA, FA, and Triton backend support for CP.
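For readers wondering why CP needs the LSE at all: when each context-parallel rank attends over only its shard of the KV cache, the per-rank partial outputs can be combined exactly using their LSEs. Below is a hedged sketch of that standard log-sum-exp merge; `merge_partials` is a hypothetical name, not vLLM's actual combine code.

```python
# Sketch of the standard log-sum-exp merge of two partial attention results,
# e.g. from two context-parallel ranks that each saw part of the KV cache.
# This mirrors the flash-attention-style combine; it is not vLLM's code.
import torch

def merge_partials(out_a, lse_a, out_b, lse_b):
    # out_*: [H, q_len, head_dim], lse_*: [H, q_len]
    lse = torch.logaddexp(lse_a, lse_b)          # combined normalizer
    w_a = torch.exp(lse_a - lse).unsqueeze(-1)   # rescale factor for partial A
    w_b = torch.exp(lse_b - lse).unsqueeze(-1)   # rescale factor for partial B
    return w_a * out_a + w_b * out_b, lse
```

The same formula extends to more than two partials by folding them in pairwise.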
Thanks! This will help with running MLA on Ampere or lower-end GPUs.
Please fix the pre-commit errors.
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
As a follow-up to #23734, this PR makes changes to support the Triton backend for DCP. Specifically, it 1) returns the LSE from the Triton kernel and 2) fixes a bug in DeepSeekV2 that could potentially modify the `residual` variable (a sketch of that kind of hazard is shown below).
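One common way a residual can be unintentionally modified is through tensor aliasing combined with in-place ops; the minimal PyTorch sketch below illustrates that general pattern under that assumption and is not the actual DeepSeekV2 code.

```python
# Hypothetical sketch of the aliasing hazard: if a later op writes into the
# same storage that `residual` refers to, the saved residual is silently
# corrupted. Not the actual DeepSeekV2 code.
import torch

hidden_states = torch.randn(4, 8)
residual = hidden_states                     # alias, not a copy

hidden_states.add_(1.0)                      # in-place update also changes `residual`
print(torch.equal(residual, hidden_states))  # True -- residual was modified

# Avoiding the hazard: keep residual as a separate tensor (or use out-of-place ops).
hidden_states = torch.randn(4, 8)
residual = hidden_states.clone()             # independent copy
hidden_states.add_(1.0)
print(torch.equal(residual, hidden_states))  # False -- residual preserved
```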
Test Plan
```bash
export CUDA_VISIBLE_DEVICES=4,5,6,7
export VLLM_USE_V1=1
export VLLM_ATTENTION_BACKEND=TRITON_MLA
export VLLM_LOG_LEVEL=DEBUG
pytest tests/distributed/test_context_parallel.py -s
```
Test Result
```text
=============================== warnings summary ===============================
:488
  :488: DeprecationWarning: builtin type SwigPyPacked has no module attribute
:488
  :488: DeprecationWarning: builtin type SwigPyObject has no module attribute

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================== 2 passed, 2 warnings in 212.42s (0:03:32) ===================
```
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.