[FEAT] [ROCm]: Support AITER Linear #14916
Conversation
Could you please split up the linear between unquantized and quantized? That way we can land unquantized now and the quantized soon after, once I'm done with the FP8 scaledmm refactor.
vllm/model_executor/layers/linear.py (Outdated)

```python
    return tgemm.mm(x, weight, bias)

def dipsatch_unquantized_linear_func() -> Callable[..., torch.Tensor]:
```
I think it'd be good to specify the exact signature here
@ProExpertProg We have fixed the typo and specified the exact signature:
```python
def dispatch_unquantized_linear_func(
) -> Callable[[torch.Tensor, torch.Tensor, torch.Tensor], torch.Tensor]:
    from vllm._aiter_ops import is_rocm_aiter_linear_enabled
    if is_rocm_aiter_linear_enabled():
        return aiter_ops.rocm_aiter_tuned_gemm
    return F.linear
```
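For context, a minimal sketch of how such a dispatcher could be consumed (assuming the `dispatch_unquantized_linear_func` defined above is in scope; the shapes and call site are illustrative assumptions, not code from this PR):

```python
import torch

# Hypothetical call site: resolve the linear function once, then always
# call it positionally, since both F.linear and
# aiter_ops.rocm_aiter_tuned_gemm accept (input, weight, bias) positionally.
linear_func = dispatch_unquantized_linear_func()

x = torch.randn(4, 4096, dtype=torch.bfloat16, device="cuda")
weight = torch.randn(4096, 4096, dtype=torch.bfloat16, device="cuda")

out = linear_func(x, weight, None)  # same semantics as F.linear(x, weight)
```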
This pull request has merge conflicts that must be resolved before it can be merged.
This patch generally looks fine to me. Let's rebase it and clean it up a bit. Same as the other AITER PRs, can you post lm_eval results from models that this kernel should support?
This kernel is for the generic use case, so we will pick just one model, meta-llama/Llama-3.1-8B-Instruct, and compute its lm_eval results to show the correctness of the implementation.
@SageMoore @ProExpertProg We have updated the PR description with the latest information. Should we keep the
```diff
-def cutlass_w8a8_scaled_mm(*, qinput: torch.Tensor, weight: torch.Tensor,
+def cutlass_w8a8_scaled_mm(qinput: torch.Tensor, weight: torch.Tensor,
```
`cutlass_w8a8_scaled_mm(*, qinput: torch.Tensor` -> `cutlass_w8a8_scaled_mm(qinput: torch.Tensor`

Why did we remove the starting `*`? Is it on purpose?
Yes, this is on purpose. The first input argument of the AITER tuned GEMM is not named `qinput`, which would break the dispatcher function logic. So we removed the `*` to avoid having to pass the first argument as a keyword argument.
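To illustrate why (a minimal sketch with hypothetical function names, not code from this PR): two backends whose first parameters are named differently can only sit behind one dispatcher if the call site passes that argument positionally.

```python
import torch

# Hypothetical stand-ins for the two backends; their first parameters
# have different names (qinput vs. x), as in the real kernels.
def cutlass_like_mm(qinput: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    return qinput @ weight.t()

def aiter_like_mm(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    return x @ weight.t()

a = torch.randn(2, 8)
w = torch.randn(4, 8)

for mm in (cutlass_like_mm, aiter_like_mm):
    mm(a, w)  # positional call: works for both backends

# cutlass_like_mm(qinput=a, weight=w) works, but the same keyword call
# against aiter_like_mm raises TypeError, because its parameter is `x`.
# A keyword-only signature (`*, qinput: ...`) would force exactly that
# failure mode, hence the `*` was dropped.
```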
This pull request has merge conflicts that must be resolved before it can be merged.
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it. Thank you!
Description
This PR introduces the AITER Linear kernel so that any upcoming optimization in the AITER kernel can be directly used and evaluated within the vLLM framework.
(Update) This PR has been updated to also support V1.
Given that the AITER ops are used by multiple files and need to be registered as custom ops through `direct_register_custom_op` to be usable with V1, we propose keeping all AITER-related flags (helper functions) and ops within `vllm/_aiter_ops.py`. We would also like to propose that, moving forward, AITER ops are first defined in this file and registered as custom ops using `direct_register_custom_op`. AITER-related flags (helper functions) will be defined here as well.
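A minimal sketch of this registration pattern (the op name, wrapper functions, and fake implementation below are illustrative assumptions; the exact keyword arguments of `direct_register_custom_op` may differ across vLLM versions):

```python
from typing import Optional

import torch
from vllm.utils import direct_register_custom_op


def rocm_aiter_tuned_gemm_impl(x: torch.Tensor, weight: torch.Tensor,
                               bias: Optional[torch.Tensor] = None) -> torch.Tensor:
    # Assumed AITER import path; the reviewed diff above calls tgemm.mm(x, weight, bias).
    from aiter.tuned_gemm import tgemm
    return tgemm.mm(x, weight, bias)


def rocm_aiter_tuned_gemm_fake(x: torch.Tensor, weight: torch.Tensor,
                               bias: Optional[torch.Tensor] = None) -> torch.Tensor:
    # Fake (meta) implementation: only the output shape/dtype matter for tracing.
    return torch.empty((*x.shape[:-1], weight.shape[0]),
                       dtype=x.dtype, device=x.device)


direct_register_custom_op(
    op_name="rocm_aiter_tuned_gemm",
    op_func=rocm_aiter_tuned_gemm_impl,
    mutates_args=[],
    fake_impl=rocm_aiter_tuned_gemm_fake,
)
```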
We have added `tests/v1/rocm/test_aiter_ops.py`. It tests whether the AITER ops are correctly registered and whether they work with torch.compile. This file will be skipped if AITER is not installed or the platform is not ROCm.
NOTE: This unit test is by no means meant to check the numerical correctness of the AITER ops. It only checks that the AITER ops are correctly registered and that torch.compile can be used with them. The correctness of the AITER ops themselves is tested in the AITER repository: https://github.com/ROCm/aiter
To meet those two criteria, the following checks are done.

Criteria 1: Check that the fake tensor implementation is correct, which can be done using `torch.library.opcheck`, as sketched after this list.

Criteria 2: Ensure that torch.compile mode and eager mode generate the same results.
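A minimal sketch of both checks (the op path `torch.ops.vllm.rocm_aiter_tuned_gemm`, the shapes, and the tolerances are assumptions; `torch.library.opcheck` is the PyTorch utility for validating a registered custom op together with its fake implementation):

```python
import torch

# Assumed fully-qualified op after registration; the real namespace/name
# used in the PR may differ.
op = torch.ops.vllm.rocm_aiter_tuned_gemm.default

x = torch.randn(16, 4096, dtype=torch.bfloat16, device="cuda")
w = torch.randn(4096, 4096, dtype=torch.bfloat16, device="cuda")

# Criteria 1: registration and the fake (meta) implementation are consistent.
torch.library.opcheck(op, (x, w, None))

# Criteria 2: torch.compile and eager execution agree.
def run(x, w):
    return op(x, w, None)

eager_out = run(x, w)
compiled_out = torch.compile(run)(x, w)
torch.testing.assert_close(eager_out, compiled_out, rtol=1e-2, atol=1e-2)
```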
What is AITER Linear?
AITER Linear is a kernel from the AI Tensor Engine for ROCm (AITER) that has been integrated into vLLM for the unquantized linear method and for FP8 scaled GEMM with per-tensor weight and per-tensor activation quantization.
How to Enable AITER Linear
AITER Linear is used when VLLM_ROCM_USE_AITER=1 is set, together with the VLLM_ROCM_USE_AITER_LINEAR flag.
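For example, a minimal sketch (the model is simply the one used in the results below; the flags must be set before vLLM reads its environment configuration):

```python
import os

# Enable AITER and its linear path before importing vLLM so the
# environment flags are picked up.
os.environ["VLLM_ROCM_USE_AITER"] = "1"
os.environ["VLLM_ROCM_USE_AITER_LINEAR"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", max_model_len=10000)
out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```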
Outdated results
Performance Comparison with No-AITER
All tests use the V0 engine
Llama-3.1-8B-Instruct (with FP8 per-tensor dynamic quantization)
Llama-3.1-8B-Instruct-BF16
Llama-3.1-70B-Instruct (with FP8 per-tensor dynamic quantization)
Llama-3.1-70B-Instruct-BF16
Throughput Performance Comparison with No-AITER
Before PR: [Performance][ROCm] Add skinny gemms for unquantized linear on ROCm #15830
Settings:
Llama-3.1-8B-Instruct (with FP8 per-tensor dynamic quantization)
Llama-3.1-8B-Instruct-BF16
V1 Accuracy Test
Settings:
Unquantized No AITER
vllm (pretrained=meta-llama/Llama-3.1-8B-Instruct,tensor_parallel_size=1,max_model_len=10000,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
AITER Linear unquantized
vllm (pretrained=meta-llama/Llama-3.1-8B-Instruct,tensor_parallel_size=1,max_model_len=10000,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
AITER Linear per-tensor FP8
vllm (pretrained=meta-llama/Llama-3.1-8B-Instruct,tensor_parallel_size=1,max_model_len=10000,quantization=fp8,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
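For reference, a sketch of how rows like the ones above could be reproduced through lm_eval's Python API (the task name gsm8k is an assumption; the runs above only document num_fewshot=5, batch_size=auto, and the model_args strings shown):

```python
import lm_eval

# Mirrors the "AITER Linear unquantized" settings above; the task is assumed.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=("pretrained=meta-llama/Llama-3.1-8B-Instruct,"
                "tensor_parallel_size=1,max_model_len=10000,"
                "trust_remote_code=True"),
    tasks=["gsm8k"],
    num_fewshot=5,
    batch_size="auto",
)
print(results["results"])
```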