[Perf] Optimize perf of Qwen3 #1245
Conversation
Can you add e2e testing? I want to try it out.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: rjg-lyh <1318825571@qq.com>
class AddRMSNormQuant(RMSNorm):
    """Root mean square normalization.
Please update the comment
    self.post_attention_layernorm = RMSNorm(config.hidden_size,
                                            eps=config.rms_norm_eps)
else:
    from vllm_ascend.quantization.quant_config import AscendQuantConfig
It seems the main change in CustomQwen3DecoderLayer is the AddRMSNormQuant layer. I would prefer to inherit from Qwen3DecoderLayer and add the AddRMSNormQuant logic there. That would make the optimization point clear and reduce redundant code.
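A minimal sketch of the inheritance approach suggested in this comment. Every class below is a simplified stand-in written for illustration, not the real vLLM or vllm-ascend code, and the `quantized` flag is a hypothetical replacement for whatever quantization-config check the PR actually performs.

```python
class RMSNorm:
    """Stand-in for the upstream RMSNorm layer (illustrative only)."""

    def __init__(self, hidden_size, eps=1e-6):
        self.hidden_size = hidden_size
        self.eps = eps


class AddRMSNormQuant(RMSNorm):
    """Stand-in for the fused add + RMSNorm + quant layer from this PR."""


class Qwen3DecoderLayer:
    """Stand-in for the upstream decoder layer (illustrative only)."""

    def __init__(self, hidden_size, rms_norm_eps):
        self.input_layernorm = RMSNorm(hidden_size, eps=rms_norm_eps)
        self.post_attention_layernorm = RMSNorm(hidden_size, eps=rms_norm_eps)


class CustomQwen3DecoderLayer(Qwen3DecoderLayer):
    """Reuses the parent constructor and overrides only the norm layers."""

    def __init__(self, hidden_size, rms_norm_eps, quantized=False):
        super().__init__(hidden_size, rms_norm_eps)
        if quantized:  # hypothetical flag standing in for the quant-config check
            self.input_layernorm = AddRMSNormQuant(hidden_size, eps=rms_norm_eps)
            self.post_attention_layernorm = AddRMSNormQuant(
                hidden_size, eps=rms_norm_eps)
```

With this shape the subclass contains only the lines that differ from upstream, which is the reduction in copied code the comment asks for.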
import torch_npu

if residual is not None:
    x, _, residual = torch_npu.npu_add_rms_norm_quant(x, residual, self.weight,
QQ: what does "add" mean here?
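One plausible reading, sketched below in plain Python: "add" is the residual addition that is fused in front of the normalization, so a single kernel performs residual add, RMSNorm, and int8 quantization. The function name, the argument layout, and the quantization formula (`round(y / scale + offset)` followed by clamping) are assumptions made for illustration; they are not taken from the `torch_npu` implementation.

```python
import math


def add_rms_norm_quant_reference(x, residual, weight, scale, offset, eps=1e-6):
    """Unfused reference: residual add -> RMSNorm -> int8 quantization.

    x, residual and weight are equal-length lists of floats; scale, offset
    and eps are floats. Illustrative assumption, not the fused kernel.
    """
    # 1. "add": fold the residual stream into the activations.
    summed = [a + b for a, b in zip(x, residual)]
    # 2. RMSNorm over the summed vector, scaled by the learned weight.
    mean_sq = sum(v * v for v in summed) / len(summed)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    normed = [v * inv_rms * w for v, w in zip(summed, weight)]
    # 3. Quantize to the int8 range (assumed scheme, see the note above).
    quantized = [max(-128, min(127, round(v / scale + offset))) for v in normed]
    # The sum is returned too, because it serves as the next residual.
    return quantized, summed
```

Returning the summed tensor alongside the quantized output mirrors the `x, _, residual = ...` unpacking in the diff, where the third result is assigned back to `residual`.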
Yes, we should avoid adding large blocks of copied code to vllm-ascend. Please also post the perf results here.
What this PR does / why we need it?
Optimize the performance of Qwen3 model by registering a custom model.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
CI passed with the existing tests.