[feature] Integrate quick allreduce into custom allreduce and select the best allreduce implementation #19094
Conversation
[Benchmark results table: TP2]
[Benchmark results table: TP4]
Force-pushed from 4e2a9ee to d0e1358.
@youkaichao:
For ease of use, should we make VLLM_QUICK_ALLREDUCE_LEVEL default to some special value (None, -1, or 0) that means the user didn't explicitly specify it?
When world size = 8 (which is tensor-parallel-size = 8, I presume) and the user didn't explicitly specify the value, VLLM_QUICK_ALLREDUCE_LEVEL would be set to 5, since your finding shows that VLLM_QUICK_ALLREDUCE_LEVEL=5 is the best config when world_size=8.
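A minimal sketch of that sentinel-default idea in Python; the helper name and the default table are assumptions (only the world_size=8 → level 5 entry comes from the finding above):

```python
import os

# Hypothetical best-level table: only the 8 -> 5 entry is taken from the
# benchmark finding quoted above; other world sizes would need tuning.
_BEST_LEVEL_BY_WORLD_SIZE = {8: 5}

def resolve_quick_allreduce_level(world_size: int, fallback: int = 0) -> int:
    raw = os.getenv("VLLM_QUICK_ALLREDUCE_LEVEL")
    if raw is None or int(raw) < 0:
        # Unset or sentinel (-1): the user didn't explicitly choose, so
        # pick the best-known level for this world size.
        return _BEST_LEVEL_BY_WORLD_SIZE.get(world_size, fallback)
    return int(raw)
```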
Does this feature directly overwrite the old custom allreduce? Or is there a way to fall back to the old custom allreduce?
QR is a complement to CR: we try QR first, and if we find this is not a scenario where QR excels, we fall back to CR.
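A rough sketch of that QR-first / CR-fallback dispatch (the class and method names here are illustrative, not vLLM's actual API):

```python
import torch

class MergedAllReduce:
    """Try quick allreduce (QR) first; fall back to custom allreduce (CR)."""

    def __init__(self, qr, cr):
        self.qr = qr  # quick allreduce backend (may be None if unsupported)
        self.cr = cr  # existing custom allreduce backend

    def all_reduce(self, inp: torch.Tensor) -> torch.Tensor:
        # QR only wins in certain regimes (dtype, size, world size), so
        # ask it first and otherwise go back to CR.
        if self.qr is not None and self.qr.should_use(inp):
            return self.qr.all_reduce(inp)
        return self.cr.all_reduce(inp)
```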
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from a333b8a to 024f2fc.
Signed-off-by: Haoyang Li <Haoyang.Li@amd.com>
Force-pushed from cbd05c1 to d80ccd0.
Signed-off-by: Haoyang Li <Haoyang.Li@amd.com>
Force-pushed from d80ccd0 to 517dc5d.
This pull request has merge conflicts that must be resolved before it can be merged.
Closing in favor of #19744.
1. With its low-granularity quantization, QuickReduce (https://github.com/mk1-project/quickreduce) brings large performance gains to allreduce at TP2 and TP4 on ROCm, without significantly degrading the model's quality.
2. We integrated quick allreduce into vLLM, supporting 1-stage (FP16) and 2-stage (FP16, FP8, Q8, Q6, Q4) modes.
3. It is worth mentioning that QR's speedup comes at the cost of some precision, and custom allreduce (CR) is significantly better than QR's 1-stage and 2-stage methods at low data volumes (less than 128 KB), so we need some conditions to decide whether to choose QR, CR, or RCCL; see the selection sketch after this list. (According to the experimental kernel results, one-shot has no advantageous scenario, so we removed it.)
Based on #18473.
4. Considering that QR has limited usage scenarios and that the QR and CR interfaces are very similar, we merged QR into CR to minimize user confusion.
5. Q4 can cause serious accuracy problems on some models, so we default to FP8 quantization.
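To make point 3 concrete, here is an illustrative selection sketch; the 128 KB threshold comes from point 3, but the function, names, and the TP2/TP4 condition are assumptions, not the PR's actual code:

```python
from enum import Enum

class QuickReduceLevel(Enum):
    FP16 = 0
    FP8 = 1   # default quantization level (point 5)
    Q8 = 2
    Q6 = 3
    Q4 = 4    # can cause serious accuracy problems on some models

CR_MAX_BYTES = 128 * 1024  # below ~128 KB, CR beats QR (point 3)

def pick_allreduce_backend(num_bytes: int, world_size: int,
                           qr_available: bool) -> str:
    if num_bytes < CR_MAX_BYTES:
        return "custom_allreduce"   # CR dominates at low data volumes
    if qr_available and world_size in (2, 4):
        return "quick_allreduce"    # QR's sweet spot on ROCm (point 1)
    return "rccl"                   # safe fallback
```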