[main][Prefill Perf] Optimize Quantized MoE Performance by Reducing All2All Communication #2195
Conversation
Codecov Report

❌ Patch coverage is

Additional details and impacted files

@@           Coverage Diff            @@
##             main    #2195      +/-  ##
==========================================
+ Coverage   76.41%   76.72%   +0.30%
==========================================
  Files         113      111       -2
  Lines       12553    12516      -37
==========================================
+ Hits         9593     9603      +10
+ Misses       2960     2913      -47
==========================================
[main][Prefill Perf] Optimize Quantized MoE Performance by Reducing All2All Communication (vllm-project#2195)

- vLLM version: v0.10.0
- vLLM main: vllm-project/vllm@7175817

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
This PR significantly optimizes performance for quantized Mixture of Experts (MoE) layers by changing the order of quantization and communication operations.
In the previous implementation, the `all2all` operation was performed on the unquantized `hidden_states` (in FP16/BF16) *before* quantization, resulting in substantial communication overhead. By performing quantization on each EP rank **first** and then sending the much smaller quantized data, we reduce the communication volume by nearly 50%.
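For illustration, here is a minimal sketch of the idea using plain `torch.distributed` collectives. The helper name `dispatch_quantized`, the `ep_group` handle, and the per-token int8 scheme are assumptions made for this example, not the actual vllm-ascend dispatch path (which uses Ascend/HCCL kernels and explicit per-rank split sizes):

```python
import torch
import torch.distributed as dist


def dispatch_quantized(hidden_states: torch.Tensor, ep_group) -> tuple[torch.Tensor, torch.Tensor]:
    """Quantize locally, then all2all the int8 payload (hypothetical helper).

    Before: all_to_all on BF16 activations -> 2 bytes per element on the wire.
    After:  all_to_all on int8 activations plus one FP32 scale per token
            -> roughly half the traffic for typical hidden sizes.
    """
    # Per-token dynamic quantization on the local EP rank.
    scale = hidden_states.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    q_hidden = torch.round(hidden_states / scale).clamp(-128, 127).to(torch.int8)

    # Exchange the small quantized payload and its scales instead of BF16 data.
    # (Assumes dim 0 is evenly divisible by the EP world size; real code would
    # pass explicit input/output split sizes per rank.)
    q_out = torch.empty_like(q_hidden)
    scale_out = torch.empty_like(scale)
    dist.all_to_all_single(q_out, q_hidden, group=ep_group)
    dist.all_to_all_single(scale_out, scale, group=ep_group)
    return q_out, scale_out
```

Sending the scales as a separate, tiny collective keeps the sketch simple; a production kernel can fuse payload and scales into a single buffer.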
Additionally, this PR includes a minor optimization to cast `int` inputs to `float` for the `argsort` operation, forcing it to run on a faster NPU core instead of the AICPU.

These changes lead to a clear and significant performance gain in MoE quantization scenarios.
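A rough sketch of the `argsort` tweak follows; the helper name and the `topk_ids` input are illustrative, not the project's actual code:

```python
import torch


def sort_tokens_by_expert(topk_ids: torch.Tensor) -> torch.Tensor:
    """Return the permutation that groups token slots by routed expert id.

    An integer argsort may be dispatched to the slow AICPU path on Ascend
    devices; casting to float32 first keeps the sort on the faster NPU core.
    Expert ids are far below 2**24, so the float32 cast is exact and the
    resulting ordering is identical to sorting the integers directly.
    """
    return torch.argsort(topk_ids.view(-1).to(torch.float32))
```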