Skip to content

Conversation

@lihaoyang-amd
Copy link
Contributor

@lihaoyang-amd lihaoyang-amd commented May 21, 2025

  1. With its low-granularity quantization, https://github.com/mk1-project/quickreduce brings huge performance gains to allreduce on tp2 and tp4 on rocm, and does not significantly degrade the model's performance.
  2. We integrated quick allreduce into vllm to support 1stage(f16 ), and 2stage(f16, fp8, Q8, Q6, Q4).
  3. It is worth mentioning that the speedup of qr is brought about by sacrificing a certain amount of precision, and custom_qr is significantly better than qr's 1stage and 2stage methods at low data volumes (less than 128kb), so we need to judge whether to choose qr or cr or rccl by some conditions.

@mergify mergify bot added documentation Improvements or additions to documentation ci/build frontend v1 labels May 21, 2025
@github-actions
Copy link

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@lihaoyang-amd lihaoyang-amd marked this pull request as ready for review May 21, 2025 10:41
@lihaoyang-amd lihaoyang-amd changed the title add quick allreduce for vllm Integrate quick allreduce and select the best allreduce implementation May 21, 2025
@lihaoyang-amd lihaoyang-amd force-pushed the amd/add_quick_allreduce branch from 5a6a5a8 to e272658 Compare May 21, 2025 12:22
@youkaichao
Copy link
Member

what's the relationship between this and #16804 ?

@lihaoyang-amd
Copy link
Contributor Author

lihaoyang-amd commented May 21, 2025

what's the relationship between this and #16804 ?

Aha, maybe we are in competition. We're from amd. We recently spent some time trying to integrate qr into vllm (because qr is very suitable for rocm)

Integrating qr makes the two pr have many similarities, but it seems that the pr you mentioned #16804 only supports Q8 and Q 4. There are no obvious boundary conditions, quantization seems to have some problems, and lack of experimental data.

Maybe we can work together to finish the work.

@lihaoyang-amd lihaoyang-amd force-pushed the amd/add_quick_allreduce branch 2 times, most recently from 08caa03 to 0989304 Compare May 22, 2025 16:40
@mergify
Copy link

mergify bot commented May 22, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @lihaoyang-amd.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 22, 2025
@lihaoyang-amd lihaoyang-amd force-pushed the amd/add_quick_allreduce branch from 0989304 to 84b2ca1 Compare May 23, 2025 07:17
@mergify mergify bot removed the needs-rebase label May 23, 2025
@mergify
Copy link

mergify bot commented May 23, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @lihaoyang-amd.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 23, 2025
@lihaoyang-amd lihaoyang-amd force-pushed the amd/add_quick_allreduce branch from d280d21 to f194cac Compare May 23, 2025 14:39
@lihaoyang-amd lihaoyang-amd requested a review from hmellor as a code owner May 23, 2025 14:39
@mergify mergify bot removed the needs-rebase label May 23, 2025
@mergify
Copy link

mergify bot commented May 23, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @lihaoyang-amd.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 23, 2025
@lihaoyang-amd lihaoyang-amd force-pushed the amd/add_quick_allreduce branch from f194cac to 50bd787 Compare May 24, 2025 06:47
@mergify mergify bot removed the needs-rebase label May 24, 2025
@mergify
Copy link

mergify bot commented May 25, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @lihaoyang-amd.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 25, 2025
Signed-off-by: Haoyang Li <Haoyang.Li@amd.com>
Signed-off-by: Haoyang Li <Haoyang.Li@amd.com>
Signed-off-by: Haoyang Li <Haoyang.Li@amd.com>
Signed-off-by: Haoyang Li <Haoyang.Li@amd.com>
Signed-off-by: Haoyang Li <Haoyang.Li@amd.com>
Signed-off-by: Haoyang Li <Haoyang.Li@amd.com>
Signed-off-by: Haoyang Li <Haoyang.Li@amd.com>
Signed-off-by: Haoyang Li <Haoyang.Li@amd.com>
Signed-off-by: Haoyang Li <Haoyang.Li@amd.com>
@lihaoyang-amd lihaoyang-amd force-pushed the amd/add_quick_allreduce branch from 50bd787 to 3458abc Compare May 25, 2025 17:02
@mergify mergify bot removed the needs-rebase label May 25, 2025
@lihaoyang-amd
Copy link
Contributor Author

lihaoyang-amd commented May 28, 2025

@youkaichao hi, kaichao,
from the experimental results, qr has no acceleration for bfloat16 with tp4 and tp8 due to the lack of bfloat16 instruction in mi300, also qr supports rocm for the time being, so we still need custom allreduce.
I noticed custom supports graph capture, does quick allreduce need to support graph mode as well?This 35 times captured graph will be used in formal reasoning and reasoning that doesn't match the shape of this 35-capture graph then uses eager_mode, right?

@mergify
Copy link

mergify bot commented May 30, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @lihaoyang-amd.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 30, 2025
@youkaichao
Copy link
Member

qr has no acceleration for bfloat16 with tp4 and tp8 due to the lack of bfloat16 instruction in mi300, also qr supports rocm for the time being, so we still need custom allreduce.

I wish qr itself contains the logic of selecting qr or custom allreduce, since their interface is quite the same. My request is that we don't touch the cuda code path, so that people reading the code will not need to think about quick reduce.

graph mode allreduce is necessary for some low-latency workloads where the batchsize is small.

@lihaoyang-amd
Copy link
Contributor Author

lihaoyang-amd commented Jun 3, 2025

qr has no acceleration for bfloat16 with tp4 and tp8 due to the lack of bfloat16 instruction in mi300, also qr supports rocm for the time being, so we still need custom allreduce.

I wish qr itself contains the logic of selecting qr or custom allreduce, since their interface is quite the same. My request is that we don't touch the cuda code path, so that people reading the code will not need to think about quick reduce.

graph mode allreduce is necessary for some low-latency workloads where the batchsize is small.

Hi, @youkaichao
I made the initial changes you suggested and the code is now working.
I'm not sure if this meets your expectations, please leave me some comments so I can move on to the next step, thank you very much!
#19094

@mergify mergify bot added the rocm Related to AMD ROCm label Jun 13, 2025
@tlrmchlsmth
Copy link
Member

closing in favor of #19744

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci/build documentation Improvements or additions to documentation frontend needs-rebase rocm Related to AMD ROCm v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants