
Conversation

@tjtanaa (Contributor) commented on Jul 13, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing a test command.
  • The test results, such as pasting a before/after comparison or e2e results.
  • (Optional) Any necessary documentation updates, such as updating supported_models.md and examples for a new model.

Purpose

Upgrade the pinned AITER commit to a newer version, 916bf3c (Jul 12, 2025).

Test Plan

Run lm-eval on representative models that exercise AITER kernels (all lm-evals are run on the V1 engine); a sample harness invocation is sketched after the list.

  • from aiter import gemm_a8w8_CK (RedHat/Meta-Llama-3.1-405B-Instruct-quantized.w8a8)
  • aiter.flash_attn_varlen_func (mistral, mistral-fp8, Llama-3, Llama-4 Bf16)
  • aiter.paged_attention_v1 (mistral, mistral-fp8, Llama-3, Llama-4 Bf16)
  • from aiter import topk_softmax (mistral, mistral-fp8, Llama-4 Bf16)
  • from aiter.fused_moe_bf16_asm import asm_moe_tkw1 (Llama-4 FP8)
  • from aiter import biased_grouped_topk (DeepSeek-R1)
  • from aiter import grouped_topk (mistral, mistral-fp8, Llama-4 Bf16)
  • from aiter.fused_moe import fused_moe (DeepSeek-R1, mistral, mistral-fp8, Llama-4 Bf16)
  • from aiter.ops.shuffle import shuffle_weight (DeepSeek-R1, mistral, Llama-4)
  • aiter.gemm_a8w8_blockscale (DeepSeek-R1)
  • from aiter.mla import mla_decode_fwd (DeepSeek-R1)
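
For reference, here is a minimal sketch of how one such evaluation can be launched through the lm-evaluation-harness Python API; the model name, parallelism, and task are illustrative placeholders, not the exact commands used for the results below:

```python
# Hypothetical lm-evaluation-harness invocation against the vLLM backend.
# Model name and tensor_parallel_size are placeholders; adjust per model.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="vllm",
    model_args=("pretrained=meta-llama/Llama-3.3-70B-Instruct,"
                "tensor_parallel_size=2,trust_remote_code=True"),
    tasks=["gsm8k"],
    num_fewshot=5,
    batch_size="auto",
)
print(results["results"]["gsm8k"])
```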

Test Result

Test the AITER flash-attention kernel on the LongBench dataset with RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic, exercising both the local and global attention mechanisms.

vllm (pretrained=RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic,tensor_parallel_size=4,max_model_len=131072,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| longbench_passage_retrieval_en | 3 | none | 5 | retrieval_score | 0.8667 | ± 0.0233 |

Other Affected Models

GSM8K evaluation on the affected models:

deepseek-ai/DeepSeek-R1

V1

vllm (pretrained=deepseek-ai/DeepSeek-V3,tensor_parallel_size=8,max_model_len=32768,block_size=1,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.9469 | ± 0.0062 |
| | | strict-match | 5 | exact_match | 0.9469 | ± 0.0062 |

V0

vllm (pretrained=deepseek-ai/DeepSeek-V3,tensor_parallel_size=8,max_model_len=32768,block_size=1,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.9265 | ± 0.0072 |
| | | strict-match | 5 | exact_match | 0.9257 | ± 0.0072 |

mistralai/Mixtral-8x7B-Instruct-v0.1

vllm (pretrained=mistralai/Mixtral-8x7B-Instruct-v0.1,tensor_parallel_size=2,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.6414 | ± 0.0132 |
| | | strict-match | 5 | exact_match | 0.6376 | ± 0.0132 |

mistralai/Mixtral-8x7B-Instruct-v0.1 fp8 per tensor dynamic quantization

vllm (pretrained=mistralai/Mixtral-8x7B-Instruct-v0.1,tensor_parallel_size=2,quantization=fp8,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.6027 | ± 0.0135 |
| | | strict-match | 5 | exact_match | 0.5989 | ± 0.0135 |

meta-llama/Llama-3.3-70B-Instruct

vllm (pretrained=meta-llama/Llama-3.3-70B-Instruct,tensor_parallel_size=2,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.9393 | ± 0.0066 |
| | | strict-match | 5 | exact_match | 0.9136 | ± 0.0077 |

meta-llama/Llama-3.3-70B-Instruct fp8 per tensor dynamic quantization

vllm (pretrained=meta-llama/Llama-3.3-70B-Instruct,tensor_parallel_size=2,quantization=fp8,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.9340 | ± 0.0068 |
| | | strict-match | 5 | exact_match | 0.8992 | ± 0.0083 |

RedHat/Meta-Llama-3.1-405B-Instruct-quantized.w8a8 (V1 PTPC INT8 a8w8)

vllm (pretrained=RedHat/Meta-Llama-3.1-405B-Instruct-quantized.w8a8,tensor_parallel_size=4,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.9212 | ± 0.0074 |
| | | strict-match | 5 | exact_match | 0.9151 | ± 0.0077 |

RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic

vllm (pretrained=RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic,tensor_parallel_size=8,max_model_len=100000,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 128

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.9158 | ± 0.0076 |
| | | strict-match | 5 | exact_match | 0.9037 | ± 0.0081 |

(Optional) Documentation Update

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist bot left a comment


Summary of Changes

Hello @tjtanaa, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request primarily focuses on enhancing vLLM's support for ROCm by upgrading the underlying AITER library to a more recent commit. Concurrently, it includes essential API adjustments within vLLM to align with the updated AITER interfaces, specifically for 8-bit quantization (W8A8) GEMM operations, ensuring continued compatibility and optimal performance on ROCm platforms.

Highlights

  • AITER Library Upgrade: The AITER (AMD's AI Tensor Engine for ROCm) library has been upgraded to commit 916bf3c within the ROCm Dockerfile, ensuring vLLM picks up the latest improvements and fixes from AITER.
  • API Refactoring for W8A8 GEMM: The AiterScaledMMLinearKernel now leverages a newly registered custom vllm operation (rocm_aiter_gemm_w8a8) for 8-bit weight, 8-bit activation (W8A8) scaled GEMM operations. This change replaces direct calls to aiter.gemm_a8w8_CK, improving integration and adherence to vLLM's custom-op registration mechanism (a registration sketch follows these highlights).
  • AITER Block-Scaled GEMM API Update: An API call for block-scaled W8A8 GEMM within fp8_utils.py was updated from rocm_aiter.gemm_a8w8_blockscale_CK to rocm_aiter.gemm_a8w8_blockscale, reflecting a necessary adjustment due to changes in the AITER library's interface.
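
As an illustration only, here is a minimal sketch of how such a kernel can be exposed under torch.ops.vllm.* via PyTorch's custom-op API. This is not the exact vLLM registration code, and the assumption that gemm_a8w8_CK accepts the output dtype as its final argument is ours:

```python
# Hypothetical registration sketch; vLLM's actual helper and the exact
# aiter signature may differ.
from typing import Optional

import torch


@torch.library.custom_op("vllm::rocm_aiter_gemm_w8a8", mutates_args=())
def rocm_aiter_gemm_w8a8(x_q: torch.Tensor, w_q: torch.Tensor,
                         x_s: torch.Tensor, w_s: torch.Tensor,
                         bias: Optional[torch.Tensor],
                         out_dtype: torch.dtype) -> torch.Tensor:
    # gemm_a8w8_CK expects a: [M, K] and b: [N, K] (per the PR comments).
    from aiter import gemm_a8w8_CK
    return gemm_a8w8_CK(x_q, w_q, x_s, w_s, bias, out_dtype)


@rocm_aiter_gemm_w8a8.register_fake
def _(x_q, w_q, x_s, w_s, bias, out_dtype):
    # Shape-only meta implementation so the op can be traced/compiled.
    return x_q.new_empty((x_q.shape[0], w_q.shape[0]), dtype=out_dtype)
```
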
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., commenting on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

| Feature | Command | Description |
|---|---|---|
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

@mergify mergify bot added ci/build rocm Related to AMD ROCm labels Jul 13, 2025
@gemini-code-assist bot left a comment


Code Review

This pull request upgrades the AITER dependency to a newer commit and refactors the code to use the updated APIs. The changes include registering a new custom PyTorch op for the AITER GEMM kernel. I have provided suggestions to update comments and rename a variable for better code clarity.

Comment on lines +28 to +31:

```python
# gemm_a8w8_CK(a, b, scale_a, scale_b, bias) expects
# a to be [M, K]
# b to be [N, K]
# CutlassScaledMMLinearKernel prepare weight `w_q` in [K, N] format
```
Severity: medium

The comment should focus on the expected input format for gemm_a8w8_CK only, as the CutlassScaledMMLinearKernel detail is no longer relevant here.

Suggested change:

```diff
-# gemm_a8w8_CK(a, b, scale_a, scale_b, bias) expects
-# a to be [M, K]
-# b to be [N, K]
-# CutlassScaledMMLinearKernel prepare weight `w_q` in [K, N] format
+# gemm_a8w8_CK(a, b, scale_a, scale_b, bias) expects:
+# a: [M, K]
+# b: [N, K]
```

Comment on lines +162 to +163:

```python
return torch.ops.vllm.rocm_aiter_gemm_w8a8(x_q, w_q.t(), x_s, w_s,
                                            bias, out_dtype)
```
Severity: medium

Update the comment to reflect the use of torch.ops.vllm.rocm_aiter_gemm_w8a8 instead of the direct gemm_a8w8_CK call, explaining the weight tensor transposition for the new custom op.

Suggested change:

```diff
+# The AITER GEMM kernel expects the weight tensor to be in [N, K] format.
+# `CutlassScaledMMLinearKernel` prepares the weight `w_q` in [K, N] format,
+# so we transpose it before passing it to the kernel.
 return torch.ops.vllm.rocm_aiter_gemm_w8a8(x_q, w_q.t(), x_s, w_s,
                                             bias, out_dtype)
```
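
For intuition, a minimal shape check of that transpose (the sizes here are made up for illustration):

```python
# Toy shapes only: vLLM stores w_q as [K, N], while the AITER a8w8 kernel
# wants the weight operand as [N, K], hence the w_q.t() at the call site.
import torch

M, N, K = 16, 256, 512
x_q = torch.randint(-128, 127, (M, K), dtype=torch.int8)  # activations [M, K]
w_q = torch.randint(-128, 127, (K, N), dtype=torch.int8)  # stored weight [K, N]
assert w_q.t().shape == (N, K)  # .t() is a view; no copy is made
```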

```diff
 import aiter as rocm_aiter

-return rocm_aiter.gemm_a8w8_blockscale_CK(A, B, As, Bs, dtype=output_dtype)
+return rocm_aiter.gemm_a8w8_blockscale(A, B, As, Bs, dtype=output_dtype)
```
Severity: medium

Consider renaming rocm_aiter to aiter to align with the import alias used in the function.

Suggested change:

```diff
-return rocm_aiter.gemm_a8w8_blockscale(A, B, As, Bs, dtype=output_dtype)
+return aiter.gemm_a8w8_blockscale(A, B, As, Bs, dtype=output_dtype)
```
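
For context, a rough usage sketch of the renamed block-scale GEMM; the 128-wide scale-block convention and the tensor shapes are assumptions for illustration, not taken from this PR:

```python
# Hypothetical call shapes for aiter's block-scale W8A8 GEMM; the 128-wide
# scale blocking mirrors DeepSeek-style block quantization and is assumed here.
import torch
import aiter as rocm_aiter

M, N, K, BLK = 32, 512, 1024, 128
A = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fnuz)  # activations
B = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fnuz)  # weight [N, K]
As = torch.rand(M, K // BLK, device="cuda")          # per-token-group scales
Bs = torch.rand(N // BLK, K // BLK, device="cuda")   # per-block weight scales
out = rocm_aiter.gemm_a8w8_blockscale(A, B, As, Bs, dtype=torch.bfloat16)
```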

@tjtanaa tjtanaa marked this pull request as ready for review July 13, 2025 12:07
@tjtanaa tjtanaa changed the title to [V1] [ROCm] [AITER] Upgrade AITER to commit 916bf3c and bugfix APIs Jul 13, 2025
@DarkLight1337 DarkLight1337 enabled auto-merge (squash) July 13, 2025 12:21
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Jul 13, 2025
@DarkLight1337 DarkLight1337 merged commit 80d38b8 into vllm-project:main Jul 13, 2025
85 of 87 checks passed
@tjtanaa tjtanaa deleted the upgrade-aiter branch July 13, 2025 15:22
x22x22 pushed a commit to x22x22/vllm that referenced this pull request Aug 5, 2025
…vllm-project#20880)

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: x22x22 <wadeking@qq.com>
Pradyun92 pushed a commit to Pradyun92/vllm that referenced this pull request Aug 6, 2025
npanpaliya pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Aug 6, 2025
jinzhen-lin pushed a commit to jinzhen-lin/vllm that referenced this pull request Aug 9, 2025
…vllm-project#20880)

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
paulpak58 pushed a commit to paulpak58/vllm that referenced this pull request Aug 13, 2025
…vllm-project#20880)

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: Paul Pak <paulpak58@gmail.com>
diegocastanibm pushed a commit to diegocastanibm/vllm that referenced this pull request Aug 15, 2025
…vllm-project#20880)

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: Diego-Castan <diego.castan@ibm.com>
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 27, 2025
