Integrate quick allreduce and select the best allreduce implementation #18473

Closed

lihaoyang-amd wants to merge 9 commits into vllm-project:main from lihaoyang-amd:amd/add_quick_allreduce

Contributor

lihaoyang-amd commented May 21, 2025 (edited by github-actions bot)

With its low-granularity quantization, https://github.com/mk1-project/quickreduce brings large performance gains to allreduce at tp2 and tp4 on ROCm, and does not significantly degrade the model's performance.
We integrated quick allreduce (qr) into vLLM with support for 1-stage (f16) and 2-stage (f16, fp8, Q8, Q6, Q4) modes.
Note that qr's speedup comes at the cost of some precision, and custom_qr is significantly faster than qr's 1-stage and 2-stage methods at small data volumes (less than 128 KB), so we need conditions for choosing among qr, cr, and RCCL.
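
A minimal sketch of the kind of size-based selection described above, reading `cr` as custom allreduce and assuming a ROCm platform. The names `AllreduceBackend`, `choose_backend`, and `SMALL_MESSAGE_BYTES` are hypothetical and are not taken from this PR or from vLLM. Only the 128 KB boundary and the tp2/tp4 observation come from the description; the actual conditions in the PR may differ.

```python
from enum import Enum


class AllreduceBackend(Enum):
    QUICK = "qr"    # quick allreduce (quantized)
    CUSTOM = "cr"   # custom allreduce
    RCCL = "rccl"   # library fallback


# Boundary from the description (128 KB); interpreting it as bytes is an assumption.
SMALL_MESSAGE_BYTES = 128 * 1024


def choose_backend(num_bytes: int, tp_size: int) -> AllreduceBackend:
    """Pick a backend for one allreduce call (illustrative only)."""
    # Small messages: the description reports custom allreduce beating qr here.
    if num_bytes < SMALL_MESSAGE_BYTES:
        return AllreduceBackend.CUSTOM
    # Larger messages: the description reports qr gains at tp2 and tp4.
    if tp_size in (2, 4):
        return AllreduceBackend.QUICK
    # Anything else falls back to the collective library.
    return AllreduceBackend.RCCL


print(choose_backend(64 * 1024, tp_size=2).value)    # cr
print(choose_backend(1024 * 1024, tp_size=4).value)  # qr
print(choose_backend(1024 * 1024, tp_size=8).value)  # rccl
```

Keeping the decision in one small function makes the boundary conditions easy to tune against benchmark data without touching the kernels themselves.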

mergify bot added documentation ci/build frontend v1 labels

github-actions bot commented May 21, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

lihaoyang-amd marked this pull request as ready for review

May 21, 2025 10:41

lihaoyang-amd requested review from WoosukKwon, alexm-redhat, comaniac, njhill, robertgshaw2-redhat, tlrmchlsmth, youkaichao, ywang96 and zhuohan123 as code owners

May 21, 2025 10:41

lihaoyang-amd changed the title ~~add quick allreduce for vllm~~ Integrate quick allreduce and select the best allreduce implementation

lihaoyang-amd force-pushed the amd/add_quick_allreduce branch from 5a6a5a8 to e272658 Compare

May 21, 2025 12:22

Member

youkaichao commented May 21, 2025

What's the relationship between this and #16804?

Contributor Author

lihaoyang-amd commented May 21, 2025 (edited)

> What's the relationship between this and #16804?

Aha, maybe we are in competition. We're from AMD. We recently spent some time trying to integrate qr into vLLM (because qr is very well suited to ROCm).

Integrating qr means the two PRs have many similarities, but the PR you mentioned, #16804, seems to support only Q8 and Q4. It has no clear boundary conditions, its quantization seems to have some problems, and it lacks experimental data.

Maybe we can work together to finish the work.

lihaoyang-amd force-pushed the amd/add_quick_allreduce branch 2 times, most recently from 08caa03 to 0989304 Compare

May 22, 2025 16:40

mergify bot commented May 22, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @lihaoyang-amd.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label

lihaoyang-amd force-pushed the amd/add_quick_allreduce branch from 0989304 to 84b2ca1 Compare

May 23, 2025 07:17

mergify bot removed the needs-rebase label

mergify bot commented May 23, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @lihaoyang-amd.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label

lihaoyang-amd force-pushed the amd/add_quick_allreduce branch from d280d21 to f194cac Compare

May 23, 2025 14:39

lihaoyang-amd requested a review from hmellor as a code owner

May 23, 2025 14:39

mergify bot removed the needs-rebase label

mergify bot commented May 23, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @lihaoyang-amd.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label

lihaoyang-amd force-pushed the amd/add_quick_allreduce branch from f194cac to 50bd787 Compare

May 24, 2025 06:47

mergify bot removed the needs-rebase label

mergify bot commented May 25, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @lihaoyang-amd.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label

lihaoyang-amd added 9 commits

May 25, 2025 16:56


solve rebase conflict (9370db1)
add quickreduce (e400201)
boundary condition (fad70a5)
Platform inspection (f66d957)
precommit (0d97a56)
adjust qr condition (910d366)
rebase (780e355)
adjust condition (bb8f513)
rebase (3458abc)

Each commit carries: Signed-off-by: Haoyang Li <Haoyang.Li@amd.com>

lihaoyang-amd force-pushed the amd/add_quick_allreduce branch from 50bd787 to 3458abc Compare

May 25, 2025 17:02

mergify bot removed the needs-rebase label

Contributor Author

lihaoyang-amd commented May 28, 2025 (edited)

@youkaichao Hi Kaichao,
From the experimental results, qr gives no speedup for bfloat16 with tp4 and tp8 because MI300 lacks bfloat16 instructions. qr also supports only ROCm for the time being, so we still need custom allreduce.
I noticed that custom allreduce supports graph capture. Does quick allreduce need to support graph mode as well? The graphs from these 35 captures are used during actual inference, and inference whose shape doesn't match any of the 35 captured graphs falls back to eager mode, right?
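
The dispatch pattern this question assumes can be sketched roughly as follows. This is a toy model with no GPU involved: `GraphRunner`, `_capture`, and `run` are hypothetical names, a "captured graph" is just a stored callable keyed by batch size, and the sketch mirrors the question as asked rather than stating how vLLM actually handles non-matching shapes.

```python
from typing import Callable, Dict, List, Tuple

Batch = List[int]
Result = Tuple[str, int, int]


class GraphRunner:
    """Toy model: replay a pre-captured graph when the shape matches, else run eagerly."""

    def __init__(self, eager_fn: Callable[[Batch], int], capture_sizes: List[int]):
        self.eager_fn = eager_fn
        # One "captured graph" per batch size, built once up front.
        self.captured: Dict[int, Callable[[Batch], Result]] = {
            size: self._capture(size) for size in capture_sizes
        }

    def _capture(self, size: int) -> Callable[[Batch], Result]:
        # Stand-in for real graph capture: a closure bound to one batch size.
        def replay(batch: Batch) -> Result:
            return ("graph", size, self.eager_fn(batch))
        return replay

    def run(self, batch: Batch) -> Result:
        replay = self.captured.get(len(batch))
        if replay is not None:
            return replay(batch)  # shape matches a captured graph
        return ("eager", len(batch), self.eager_fn(batch))  # no match: eager mode


runner = GraphRunner(sum, capture_sizes=[1, 2, 4, 8])
print(runner.run([1, 2]))     # ('graph', 2, 3)
print(runner.run([1, 2, 3]))  # ('eager', 3, 6)
```

Under this model, any allreduce called inside the captured region has to be safe to record and replay, which is why graph-mode support matters for the backend choice.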

mergify bot commented May 30, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @lihaoyang-amd.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label

Member

youkaichao commented Jun 2, 2025

> qr gives no speedup for bfloat16 with tp4 and tp8 because MI300 lacks bfloat16 instructions. qr also supports only ROCm for the time being, so we still need custom allreduce.

I would like qr itself to contain the logic for selecting qr or custom allreduce, since their interfaces are nearly the same. My request is that we don't touch the CUDA code path, so that people reading the code don't need to think about quick reduce.

Graph-mode allreduce is necessary for some low-latency workloads where the batch size is small.
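
One way to read this request is sketched below. All class and function names (`CustomAllReduceStub`, `QuickAllReduceStub`, `QuickOrCustom`, `make_allreduce`) and the `min_quick_len` threshold are hypothetical stand-ins, not vLLM APIs. The assumption is that both backends expose the same `all_reduce` method, so a ROCm-only wrapper can make the choice while non-ROCm platforms get the existing custom allreduce object unchanged.

```python
from typing import List, Tuple

Result = Tuple[str, List[float]]


class CustomAllReduceStub:
    """Stand-in for the existing custom allreduce."""

    def all_reduce(self, data: List[float]) -> Result:
        return ("custom", list(data))


class QuickAllReduceStub:
    """Stand-in for quick allreduce; same interface as the custom one."""

    def all_reduce(self, data: List[float]) -> Result:
        return ("quick", list(data))


class QuickOrCustom:
    """Wrapper that owns the qr-versus-custom decision."""

    def __init__(self, quick: QuickAllReduceStub, custom: CustomAllReduceStub,
                 min_quick_len: int):
        self.quick = quick
        self.custom = custom
        self.min_quick_len = min_quick_len  # assumed size cutoff for using qr

    def all_reduce(self, data: List[float]) -> Result:
        if len(data) >= self.min_quick_len:
            return self.quick.all_reduce(data)
        return self.custom.all_reduce(data)


def make_allreduce(is_rocm: bool, min_quick_len: int = 4):
    custom = CustomAllReduceStub()
    if not is_rocm:
        # Non-ROCm path is untouched: callers get the plain custom allreduce.
        return custom
    return QuickOrCustom(QuickAllReduceStub(), custom, min_quick_len)


print(make_allreduce(is_rocm=False).all_reduce([1.0] * 8)[0])  # custom
print(make_allreduce(is_rocm=True).all_reduce([1.0] * 8)[0])   # quick
print(make_allreduce(is_rocm=True).all_reduce([1.0])[0])       # custom
```

Because the wrapper and the plain object share one method, call sites do not need to know which platform they are on.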

lihaoyang-amd mentioned this pull request

[feature] Integrate quick allreduce into custom allreduce and select the best allreduce implementation #19094

Closed

Contributor Author

lihaoyang-amd commented Jun 3, 2025 (edited)

> qr gives no speedup for bfloat16 with tp4 and tp8 because MI300 lacks bfloat16 instructions. qr also supports only ROCm for the time being, so we still need custom allreduce.
>
> I would like qr itself to contain the logic for selecting qr or custom allreduce, since their interfaces are nearly the same. My request is that we don't touch the CUDA code path, so that people reading the code don't need to think about quick reduce.
>
> Graph-mode allreduce is necessary for some low-latency workloads where the batch size is small.

Hi @youkaichao,
I made the initial changes you suggested, and the code is now working.
I'm not sure whether this meets your expectations. Please leave me some comments so I can move on to the next step. Thank you very much!
#19094

mergify bot added the rocm label

Member

tlrmchlsmth commented Jun 26, 2025

closing in favor of #19744

tlrmchlsmth closed this

Reviewers

Awaiting requested review from: WoosukKwon, robertgshaw2-redhat, njhill, ywang96, comaniac, alexm-redhat, tlrmchlsmth (code owner), zhuohan123, youkaichao, hmellor

Labels

ci/build documentation frontend needs-rebase rocm v1