Fp8 paged attention update #22222
Conversation
Code Review
This pull request introduces updates for FP8 paged attention on ROCm. I've identified two critical compilation errors. The first is a stray #endif directive that breaks the build. The second is a variable-scoping issue inside a loop caused by incorrect placement of preprocessor directives, which also prevents compilation. Addressing both will ensure the code compiles and functions as intended.
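Neither snippet is quoted in the review, but the two failure modes it describes usually look like the hypothetical sketch below (function, macro, and variable names are made up, not taken from the diff); the broken shape is shown in comments, followed by a compilable fixed shape.

```cpp
// Hypothetical illustration only; names are invented for this sketch.
// Broken shape described in the review (kept in comments so the file compiles):
//
//   for (int i = 0; i < n; ++i) {
//   #ifdef USE_FP8_MFMA
//     float q_scaled = q[i] * inv_scale;   // declared only when the macro is set
//   #endif
//     acc += q_scaled;                     // undeclared when the macro is off
//   }
//   #endif                                 // stray #endif with no matching #if
//
// Fixed shape: declare the variable on both preprocessor paths and drop the
// unmatched #endif so every guard pairs up.
float accumulate_query(const float* q, int n, float inv_scale) {
  float acc = 0.0f;
  for (int i = 0; i < n; ++i) {
#ifdef USE_FP8_MFMA
    float q_scaled = q[i] * inv_scale;  // fp8 path: pre-scaled query value
#else
    float q_scaled = q[i];              // default path: unscaled (inv_scale unused)
#endif
    acc += q_scaled;                    // in scope on either path
  }
  return acc;
}
```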
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
cc @gshtras for review and comments
@tlrmchlsmth Can you please help me review this PR?
Solid job overall.
Could we have unit tests to cover the new fp8 path? Extending the dedicated ROCm attention test would work.
Another major point is that we need a way to actually enable this (command line, environment variable, heuristic, etc.) rather than leaving it as dead code.
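As a rough illustration of the kind of gate being asked for, the sketch below reads the VLLM_ROCM_FP8_MFMA_PAGE_ATTN environment variable (the knob that appears later in the test plan) on the host side; the helper name and the commented call site are hypothetical and are not the PR's actual dispatch code.

```cpp
// Hypothetical sketch only: an environment-variable gate for the fp8 mfma path.
#include <cstdlib>
#include <cstring>

// True when VLLM_ROCM_FP8_MFMA_PAGE_ATTN is set to a non-zero value.
inline bool use_fp8_mfma_paged_attention() {
  const char* flag = std::getenv("VLLM_ROCM_FP8_MFMA_PAGE_ATTN");
  return flag != nullptr && std::strcmp(flag, "0") != 0;
}

// Possible call site (names are made up):
//   if (use_fp8_mfma_paged_attention()) {
//     launch_paged_attention_fp8(...);      // new fp8 mfma path
//   } else {
//     launch_paged_attention_default(...);  // existing path
//   }
```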
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Xiao Yu <xiao.yu@amd.com>
Signed-off-by: xiao-llm <xiao.yu.dc@outlook.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Co-authored-by: Xiao Yu <xiao.yu@metamaterial.com>
Co-authored-by: Xiao Yu <xiao.yu@amd.com>
Co-authored-by: Bowen Bao <bowenbao@amd.com>
Purpose
Support the fp8 mfma instruction with per-warp dynamic quantization of the Query to improve performance. This reduces the fp8-to-fp16 data-type conversion cost and improves mfma throughput on MI300X and later accelerators.
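To make the approach concrete, here is a minimal CUDA-flavored sketch of per-warp dynamic quantization of the query, assuming a 32-lane warp and the OCP fp8 e4m3 format. The actual PR targets ROCm/HIP on MI300-class GPUs (64-lane wavefronts, fnuz fp8), and the function below is illustrative only, not the PR's kernel code.

```cuda
// Illustrative only: each lane holds one query element; the warp agrees on a
// single dynamic scale, quantizes to fp8, and keeps the scale for dequantizing
// the attention scores later. Not the actual kernel from this PR.
#include <cuda_fp8.h>
#include <math.h>

__device__ __nv_fp8_e4m3 quantize_query_per_warp(float q_val, float* out_scale) {
  // 1. Warp-wide max-abs reduction so every lane derives the same scale.
  float max_abs = fabsf(q_val);
  for (int offset = 16; offset > 0; offset >>= 1) {
    max_abs = fmaxf(max_abs, __shfl_xor_sync(0xffffffffu, max_abs, offset));
  }
  // 2. One dynamic scale per warp (448 is the OCP e4m3 max; the ROCm fnuz
  //    variant tops out at 240, so the constant would differ there).
  float scale = fmaxf(max_abs / 448.0f, 1e-10f);
  *out_scale = scale;
  // 3. Quantize the query to fp8. With Q already in fp8, the fp8 mfma/wmma
  //    instruction can consume Q and the fp8 KV cache directly, avoiding the
  //    fp8 -> fp16 up-conversion on the hot path.
  return __nv_fp8_e4m3(q_val / scale);
}
```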
Test Plan
export VLLM_ROCM_FP8_MFMA_PAGE_ATTN=1
pytest -s tests/kernels/attention/test_attention.py
Sample script:
export CUDA_VISIBLE_DEVICES=0,1,2,3
export MODEL_DIR=Meta-Llama-3.1-8B-Instruct
export VLLM_USE_V1=0
lm_eval --model vllm \
  --model_args pretrained=$MODEL_DIR \
  --tasks gsm8k \
  --trust_remote_code \
  --batch_size 8
Test Result
(Optional) Documentation Update