[V1] [ROCm] [AITER] Upgrade AITER to commit 916bf3c and bugfix APIs #20880
Conversation
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Summary of Changes
Hello @tjtanaa, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request primarily focuses on enhancing vLLM's support for ROCm by upgrading the underlying AITER library to a more recent commit. Concurrently, it includes essential API adjustments within vLLM to align with the updated AITER interfaces, specifically for 8-bit quantization (W8A8) GEMM operations, ensuring continued compatibility and optimal performance on ROCm platforms.
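For context, the math that a W8A8 scaled GEMM implements can be sketched in plain PyTorch. This is a reference illustration only, not the AITER kernel; the per-token/per-channel scale layout is an assumption made for the example:

```python
import torch


def scaled_mm_w8a8_reference(a_q, b_q, a_scale, b_scale, bias, out_dtype):
    # a_q: [M, K] int8 activations; b_q: [N, K] int8 weights.
    # a_scale: [M, 1] per-token scales; b_scale: [N, 1] per-channel scales
    # (layout assumed for illustration only).
    out = (a_q.float() @ b_q.float().t()) * a_scale * b_scale.t()
    if bias is not None:
        out = out + bias
    return out.to(out_dtype)


M, N, K = 4, 8, 16
a_q = torch.randint(-128, 127, (M, K), dtype=torch.int8)
b_q = torch.randint(-128, 127, (N, K), dtype=torch.int8)
a_s = torch.rand(M, 1)
b_s = torch.rand(N, 1)
print(scaled_mm_w8a8_reference(a_q, b_q, a_s, b_s, None, torch.bfloat16).shape)
# torch.Size([4, 8])
```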
Highlights
- AITER Library Upgrade: The AITER (AI Tensor Engine for ROCm) library has been upgraded to commit 916bf3c in the ROCm Dockerfile, ensuring vLLM picks up the latest improvements and fixes from AITER.
- API Refactoring for W8A8 GEMM: `AiterScaledMMLinearKernel` now leverages a newly registered custom `vllm` operation (`rocm_aiter_gemm_w8a8`) for 8-bit weight, 8-bit activation (W8A8) scaled GEMM operations. This replaces direct calls to `aiter.gemm_a8w8_CK`, improving integration and adherence to vLLM's custom op registration mechanism (a minimal registration sketch follows this list).
- AITER Block-Scaled GEMM API Update: The block-scaled W8A8 GEMM call in `fp8_utils.py` was updated from `rocm_aiter.gemm_a8w8_blockscale_CK` to `rocm_aiter.gemm_a8w8_blockscale`, reflecting a necessary adjustment to the AITER library's changed interface.
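As referenced in the highlights above, the general pattern for exposing an AITER kernel through `torch.ops.vllm.*` looks roughly like the following. This is a minimal sketch using plain `torch.library`, not the vLLM implementation (vLLM has its own registration helper), and the exact AITER keyword arguments are assumed from the comments in this PR:

```python
from typing import Optional

import torch


@torch.library.custom_op("vllm::rocm_aiter_gemm_w8a8", mutates_args=())
def rocm_aiter_gemm_w8a8(A: torch.Tensor,      # [M, K] int8 activations
                         B: torch.Tensor,      # [N, K] int8 weights
                         As: torch.Tensor,     # activation scales
                         Bs: torch.Tensor,     # weight scales
                         bias: Optional[torch.Tensor],
                         output_dtype: torch.dtype) -> torch.Tensor:
    # Requires a ROCm build with AITER installed; the dtype kwarg is assumed.
    import aiter as rocm_aiter
    return rocm_aiter.gemm_a8w8_CK(A, B, As, Bs, bias, dtype=output_dtype)


@rocm_aiter_gemm_w8a8.register_fake
def _(A, B, As, Bs, bias, output_dtype):
    # Shape/dtype-only implementation used during tracing/compilation.
    return torch.empty((A.shape[0], B.shape[0]),
                       dtype=output_dtype, device=A.device)
```

Registering the kernel this way lets call sites use `torch.ops.vllm.rocm_aiter_gemm_w8a8(...)`, which integrates with torch.compile and graph capture instead of calling the AITER Python function directly.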
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
Feature | Command | Description
---|---|---
Code Review | `/gemini review` | Performs a code review for the current pull request in its current state.
Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state.
Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help | `/gemini help` | Displays a list of available commands.
Customization
To customize Gemini Code Assist for the GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution. ↩
Code Review
This pull request upgrades the AITER dependency to a newer commit and refactors the code to use the updated APIs. The changes include registering a new custom PyTorch op for the AITER GEMM kernel. I have provided suggestions to update comments and rename a variable for better code clarity.
# gemm_a8w8_CK(a, b, scale_a, scale_b, bias) expects
# a to be [M, K]
# b to be [N, K]
# CutlassScaledMMLinearKernel prepare weight `w_q` in [K, N] format
The comment should focus on the expected input format for `gemm_a8w8_CK` only, as the `CutlassScaledMMLinearKernel` detail is no longer relevant here.
Suggested change:
# gemm_a8w8_CK(a, b, scale_a, scale_b, bias) expects:
# a: [M, K]
# b: [N, K]
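A quick shape sketch of those operand conventions (illustrative values only; it does not call AITER):

```python
import torch

M, N, K = 4, 8, 16
a = torch.randint(-128, 127, (M, K), dtype=torch.int8)  # activations: [M, K]
b = torch.randint(-128, 127, (N, K), dtype=torch.int8)  # weights:     [N, K]
assert a.shape[1] == b.shape[1]  # both operands share the K dimension
# gemm_a8w8_CK(a, b, scale_a, scale_b, bias) would then produce an [M, N] output.
```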
return torch.ops.vllm.rocm_aiter_gemm_w8a8(x_q, w_q.t(), x_s, w_s,
                                            bias, out_dtype)
Update the comment to reflect the use of `torch.ops.vllm.rocm_aiter_gemm_w8a8` instead of the direct `gemm_a8w8_CK` call, explaining the weight tensor transposition for the new custom op.
Suggested change:
# The AITER GEMM kernel expects the weight tensor to be in [N, K] format.
# `CutlassScaledMMLinearKernel` prepares the weight `w_q` in [K, N] format,
# so we transpose it before passing it to the kernel.
return torch.ops.vllm.rocm_aiter_gemm_w8a8(x_q, w_q.t(), x_s, w_s,
                                            bias, out_dtype)
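The transpose in the suggested comment is cheap: for a tensor prepared in [K, N] layout, `.t()` returns a strided view in [N, K] without copying. A small illustration (shapes are arbitrary):

```python
import torch

K, N = 16, 8
w_q = torch.randint(-128, 127, (K, N), dtype=torch.int8)  # prepared as [K, N]
assert w_q.t().shape == (N, K)                # matches the kernel's expected layout
assert w_q.t().data_ptr() == w_q.data_ptr()   # a view only, no data movement
```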
import aiter as rocm_aiter

-    return rocm_aiter.gemm_a8w8_blockscale_CK(A, B, As, Bs, dtype=output_dtype)
+    return rocm_aiter.gemm_a8w8_blockscale(A, B, As, Bs, dtype=output_dtype)
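If code has to run against both the old and the new AITER, a defensive lookup can bridge the rename. This is an illustrative shim, not code from this PR, and it requires a ROCm build with AITER installed:

```python
import aiter as rocm_aiter

# Prefer the new name; fall back to the older name if that is all that exists.
_gemm_a8w8_blockscale = getattr(
    rocm_aiter, "gemm_a8w8_blockscale",
    getattr(rocm_aiter, "gemm_a8w8_blockscale_CK", None))


def blockscale_gemm(A, B, As, Bs, output_dtype):
    if _gemm_a8w8_blockscale is None:
        raise RuntimeError("no block-scaled W8A8 GEMM found in this AITER build")
    return _gemm_a8w8_blockscale(A, B, As, Bs, dtype=output_dtype)
```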
Essential Elements of an Effective PR Description Checklist
- Update `supported_models.md` and `examples` for a new model.

Purpose
Upgrade AITER to the newer commit 916bf3c (Jul 12, 2025).

Test Plan
Perform lm-eval on all of the representative models that use AITER kernels (all lm-evals are evaluated on the V1 engine); an example invocation is sketched after the kernel list below.
- `from aiter import gemm_a8w8_CK` (RedHat/Meta-Llama-3.1-405B-Instruct-quantized.w8a8)
- `aiter.flash_attn_varlen_func` (mistral, mistral-fp8, Llama-3, Llama-4 Bf16)
- `aiter.paged_attention_v1` (mistral, mistral-fp8, Llama-3, Llama-4 Bf16)
- `from aiter import topk_softmax` (mistral, mistral-fp8, Llama-4 Bf16)
- `from aiter.fused_moe_bf16_asm import asm_moe_tkw1` (Llama-4 FP8)
- `from aiter import biased_grouped_topk` (DeepSeek-R1)
- `from aiter import grouped_topk` (mistral, mistral-fp8, Llama-4 Bf16)
- `from aiter.fused_moe import fused_moe` (DeepSeek-R1, mistral, mistral-fp8, Llama-4 Bf16)
- `from aiter.ops.shuffle import shuffle_weight` (DeepSeek-R1, mistral, Llama-4)
- `aiter.gemm_a8w8_blockscale` (DeepSeek-R1)
- `from aiter.mla import mla_decode_fwd` (DeepSeek-R1)
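A hedged sketch of how one of the GSM8K runs above could be reproduced with the lm-evaluation-harness Python API; the model path and parallelism mirror one of the configurations reported below, and the harness version or exact keyword names may differ in your environment:

```python
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=("pretrained=meta-llama/Llama-3.3-70B-Instruct,"
                "tensor_parallel_size=2,trust_remote_code=True"),
    tasks=["gsm8k"],
    num_fewshot=5,
    batch_size="auto",
)
print(results["results"]["gsm8k"])
```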
Test Result

Test AITER Flash Attention Kernel:
LongBench Dataset
RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic
Test Local and Global Attention Mechanism
vllm (pretrained=RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic,tensor_parallel_size=4,max_model_len=131072,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
Other Affected Models
Evaluate on GSM8K on Affected Models:
deepseek-ai/DeepSeek-R1
V1
vllm (pretrained=deepseek-ai/DeepSeek-V3,tensor_parallel_size=8,max_model_len=32768,block_size=1,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
V0
vllm (pretrained=deepseek-ai/DeepSeek-V3,tensor_parallel_size=8,max_model_len=32768,block_size=1,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
mistralai/Mixtral-8x7B-Instruct-v0.1
vllm (pretrained=mistralai/Mixtral-8x7B-Instruct-v0.1,tensor_parallel_size=2,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
mistralai/Mixtral-8x7B-Instruct-v0.1 fp8 per tensor dynamic quantization
vllm (pretrained=mistralai/Mixtral-8x7B-Instruct-v0.1,tensor_parallel_size=2,quantization=fp8,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
meta-llama/Llama-3.3-70B-Instruct
vllm (pretrained=meta-llama/Llama-3.3-70B-Instruct,tensor_parallel_size=2,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
meta-llama/Llama-3.3-70B-Instruct fp8 per tensor dynamic quantization
vllm (pretrained=meta-llama/Llama-3.3-70B-Instruct,tensor_parallel_size=2,quantization=fp8,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
RedHat/Meta-Llama-3.1-405B-Instruct-quantized.w8a8 (V1 PTPC INT8 a8w8)
vllm (pretrained=RedHat/Meta-Llama-3.1-405B-Instruct-quantized.w8a8,tensor_parallel_size=4,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic
vllm (pretrained=RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic,tensor_parallel_size=8,max_model_len=100000,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 128
(Optional) Documentation Update