[V1] [ROCm] [AITER] Upgrade AITER to commit 916bf3c and bugfix APIs #20880
Conversation
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Summary of Changes
Hello @tjtanaa, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request primarily focuses on enhancing vLLM's support for ROCm by upgrading the underlying AITER library to a more recent commit. Concurrently, it includes essential API adjustments within vLLM to align with the updated AITER interfaces, specifically for 8-bit quantization (W8A8) GEMM operations, ensuring continued compatibility and optimal performance on ROCm platforms.
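For context, the math that a W8A8 scaled GEMM implements can be sketched in plain PyTorch. This is a reference illustration only, not the AITER kernel; the per-token/per-channel scale layout is an assumption made for the example:

```python
import torch


def scaled_mm_w8a8_reference(a_q, b_q, a_scale, b_scale, bias, out_dtype):
    # a_q: [M, K] int8 activations; b_q: [N, K] int8 weights.
    # a_scale: [M, 1] per-token scales; b_scale: [N, 1] per-channel scales
    # (layout assumed for illustration only).
    out = (a_q.float() @ b_q.float().t()) * a_scale * b_scale.t()
    if bias is not None:
        out = out + bias
    return out.to(out_dtype)


M, N, K = 4, 8, 16
a_q = torch.randint(-128, 127, (M, K), dtype=torch.int8)
b_q = torch.randint(-128, 127, (N, K), dtype=torch.int8)
a_s = torch.rand(M, 1)
b_s = torch.rand(N, 1)
print(scaled_mm_w8a8_reference(a_q, b_q, a_s, b_s, None, torch.bfloat16).shape)
# torch.Size([4, 8])
```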
Highlights
- AITER Library Upgrade: The AITER (AI Tensor Engine for ROCm) library has been upgraded to commit 916bf3c in the ROCm Dockerfile, ensuring vLLM picks up the latest improvements and fixes from AITER.
- API Refactoring for W8A8 GEMM: `AiterScaledMMLinearKernel` now leverages a newly registered custom `vllm` operation (`rocm_aiter_gemm_w8a8`) for 8-bit weight, 8-bit activation (W8A8) scaled GEMM operations. This replaces direct calls to `aiter.gemm_a8w8_CK`, improving integration and adherence to vLLM's custom op registration mechanism (a minimal registration sketch follows this list).
- AITER Block-Scaled GEMM API Update: The block-scaled W8A8 GEMM call in `fp8_utils.py` was updated from `rocm_aiter.gemm_a8w8_blockscale_CK` to `rocm_aiter.gemm_a8w8_blockscale`, reflecting a necessary adjustment to the AITER library's changed interface.
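As referenced in the highlights above, the general pattern for exposing an AITER kernel through `torch.ops.vllm.*` looks roughly like the following. This is a minimal sketch using plain `torch.library`, not the vLLM implementation (vLLM has its own registration helper), and the exact AITER keyword arguments are assumed from the comments in this PR:

```python
from typing import Optional

import torch


@torch.library.custom_op("vllm::rocm_aiter_gemm_w8a8", mutates_args=())
def rocm_aiter_gemm_w8a8(A: torch.Tensor,      # [M, K] int8 activations
                         B: torch.Tensor,      # [N, K] int8 weights
                         As: torch.Tensor,     # activation scales
                         Bs: torch.Tensor,     # weight scales
                         bias: Optional[torch.Tensor],
                         output_dtype: torch.dtype) -> torch.Tensor:
    # Requires a ROCm build with AITER installed; the dtype kwarg is assumed.
    import aiter as rocm_aiter
    return rocm_aiter.gemm_a8w8_CK(A, B, As, Bs, bias, dtype=output_dtype)


@rocm_aiter_gemm_w8a8.register_fake
def _(A, B, As, Bs, bias, output_dtype):
    # Shape/dtype-only implementation used during tracing/compilation.
    return torch.empty((A.shape[0], B.shape[0]),
                       dtype=output_dtype, device=A.device)
```

Registering the kernel this way lets call sites use `torch.ops.vllm.rocm_aiter_gemm_w8a8(...)`, which integrates with torch.compile and graph capture instead of calling the AITER Python function directly.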
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
Feature | Command | Description
---|---|---
Code Review | `/gemini review` | Performs a code review for the current pull request in its current state.
Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state.
Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help | `/gemini help` | Displays a list of available commands.
Customization
To customize Gemini Code Assist for the GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution. ↩
Code Review
This pull request upgrades the AITER dependency to a newer commit and refactors the code to use the updated APIs. The changes include registering a new custom PyTorch op for the AITER GEMM kernel. I have provided suggestions to update comments and rename a variable for better code clarity.
# gemm_a8w8_CK(a, b, scale_a, scale_b, bias) expects
# a to be [M, K]
# b to be [N, K]
# CutlassScaledMMLinearKernel prepare weight `w_q` in [K, N] format
The comment should focus on the expected input format for `gemm_a8w8_CK` only, as the `CutlassScaledMMLinearKernel` detail is no longer relevant here.
Suggested change:
# gemm_a8w8_CK(a, b, scale_a, scale_b, bias) expects:
# a: [M, K]
# b: [N, K]
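A quick shape sketch of those operand conventions (illustrative values only; it does not call AITER):

```python
import torch

M, N, K = 4, 8, 16
a = torch.randint(-128, 127, (M, K), dtype=torch.int8)  # activations: [M, K]
b = torch.randint(-128, 127, (N, K), dtype=torch.int8)  # weights:     [N, K]
assert a.shape[1] == b.shape[1]  # both operands share the K dimension
# gemm_a8w8_CK(a, b, scale_a, scale_b, bias) would then produce an [M, N] output.
```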
return torch.ops.vllm.rocm_aiter_gemm_w8a8(x_q, w_q.t(), x_s, w_s,
                                            bias, out_dtype)
Update the comment to reflect the use of `torch.ops.vllm.rocm_aiter_gemm_w8a8` instead of the direct `gemm_a8w8_CK` call, explaining the weight tensor transposition for the new custom op.
Suggested change:
# The AITER GEMM kernel expects the weight tensor to be in [N, K] format.
# `CutlassScaledMMLinearKernel` prepares the weight `w_q` in [K, N] format,
# so we transpose it before passing it to the kernel.
return torch.ops.vllm.rocm_aiter_gemm_w8a8(x_q, w_q.t(), x_s, w_s,
                                            bias, out_dtype)
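The transpose in the suggested comment is cheap: for a tensor prepared in [K, N] layout, `.t()` returns a strided view in [N, K] without copying. A small illustration (shapes are arbitrary):

```python
import torch

K, N = 16, 8
w_q = torch.randint(-128, 127, (K, N), dtype=torch.int8)  # prepared as [K, N]
assert w_q.t().shape == (N, K)                # matches the kernel's expected layout
assert w_q.t().data_ptr() == w_q.data_ptr()   # a view only, no data movement
```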
import aiter as rocm_aiter

-    return rocm_aiter.gemm_a8w8_blockscale_CK(A, B, As, Bs, dtype=output_dtype)
+    return rocm_aiter.gemm_a8w8_blockscale(A, B, As, Bs, dtype=output_dtype)
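If code has to run against both the old and the new AITER, a defensive lookup can bridge the rename. This is an illustrative shim, not code from this PR, and it requires a ROCm build with AITER installed:

```python
import aiter as rocm_aiter

# Prefer the new name; fall back to the older name if that is all that exists.
_gemm_a8w8_blockscale = getattr(
    rocm_aiter, "gemm_a8w8_blockscale",
    getattr(rocm_aiter, "gemm_a8w8_blockscale_CK", None))


def blockscale_gemm(A, B, As, Bs, output_dtype):
    if _gemm_a8w8_blockscale is None:
        raise RuntimeError("no block-scaled W8A8 GEMM found in this AITER build")
    return _gemm_a8w8_blockscale(A, B, As, Bs, dtype=output_dtype)
```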
Essential Elements of an Effective PR Description Checklist
- Update `supported_models.md` and `examples` for a new model.

Purpose
Upgrade AITER to the newer commit 916bf3c (Jul 12, 2025).

Test Plan
Perform lm-eval on all of the representative models that use AITER kernels (all lm-evals are evaluated on the V1 engine); an example invocation is sketched after the kernel list below.
- `from aiter import gemm_a8w8_CK` (RedHat/Meta-Llama-3.1-405B-Instruct-quantized.w8a8)
- `aiter.flash_attn_varlen_func` (mistral, mistral-fp8, Llama-3, Llama-4 Bf16)
- `aiter.paged_attention_v1` (mistral, mistral-fp8, Llama-3, Llama-4 Bf16)
- `from aiter import topk_softmax` (mistral, mistral-fp8, Llama-4 Bf16)
- `from aiter.fused_moe_bf16_asm import asm_moe_tkw1` (Llama-4 FP8)
- `from aiter import biased_grouped_topk` (DeepSeek-R1)
- `from aiter import grouped_topk` (mistral, mistral-fp8, Llama-4 Bf16)
- `from aiter.fused_moe import fused_moe` (DeepSeek-R1, mistral, mistral-fp8, Llama-4 Bf16)
- `from aiter.ops.shuffle import shuffle_weight` (DeepSeek-R1, mistral, Llama-4)
- `aiter.gemm_a8w8_blockscale` (DeepSeek-R1)
- `from aiter.mla import mla_decode_fwd` (DeepSeek-R1)
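A hedged sketch of how one of the GSM8K runs above could be reproduced with the lm-evaluation-harness Python API; the model path and parallelism mirror one of the configurations reported below, and the harness version or exact keyword names may differ in your environment:

```python
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=("pretrained=meta-llama/Llama-3.3-70B-Instruct,"
                "tensor_parallel_size=2,trust_remote_code=True"),
    tasks=["gsm8k"],
    num_fewshot=5,
    batch_size="auto",
)
print(results["results"]["gsm8k"])
```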
Test Result

Test AITER Flash Attention Kernel:
LongBench Dataset
RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic
Test Local and Global Attention Mechanism
vllm (pretrained=RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic,tensor_parallel_size=4,max_model_len=131072,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
Other Affected Models
Evaluate on GSM8K on Affected Models:
deepseek-ai/DeepSeek-R1
V1
vllm (pretrained=deepseek-ai/DeepSeek-V3,tensor_parallel_size=8,max_model_len=32768,block_size=1,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
V0
vllm (pretrained=deepseek-ai/DeepSeek-V3,tensor_parallel_size=8,max_model_len=32768,block_size=1,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
mistralai/Mixtral-8x7B-Instruct-v0.1
vllm (pretrained=mistralai/Mixtral-8x7B-Instruct-v0.1,tensor_parallel_size=2,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
mistralai/Mixtral-8x7B-Instruct-v0.1 fp8 per tensor dynamic quantization
vllm (pretrained=mistralai/Mixtral-8x7B-Instruct-v0.1,tensor_parallel_size=2,quantization=fp8,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
meta-llama/Llama-3.3-70B-Instruct
vllm (pretrained=meta-llama/Llama-3.3-70B-Instruct,tensor_parallel_size=2,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
meta-llama/Llama-3.3-70B-Instruct fp8 per tensor dynamic quantization
vllm (pretrained=meta-llama/Llama-3.3-70B-Instruct,tensor_parallel_size=2,quantization=fp8,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
RedHat/Meta-Llama-3.1-405B-Instruct-quantized.w8a8 (V1 PTPC INT8 a8w8)
vllm (pretrained=RedHat/Meta-Llama-3.1-405B-Instruct-quantized.w8a8,tensor_parallel_size=4,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic
vllm (pretrained=RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic,tensor_parallel_size=8,max_model_len=100000,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 128
(Optional) Documentation Update