@lihaoyang-amd
Contributor

@lihaoyang-amd lihaoyang-amd commented Jun 3, 2025

1. With its low-granularity quantization, https://github.com/mk1-project/quickreduce (quick allreduce, "qr") brings large performance gains to allreduce at TP2 and TP4 on ROCm without significantly degrading model accuracy.
2. We integrated quick allreduce into vLLM with support for 1stage (f16) and 2stage (f16, fp8, Q8, Q6, Q4).
3. The speedup of qr comes at the cost of some precision, and custom allreduce (cr) is significantly faster than qr's 1stage and 2stage methods at small data sizes (less than 128 KB), so we need conditions to decide whether to use qr, cr, or RCCL. (According to the kernel benchmark results below, oneshot has no advantageous scenario, so we removed it.)
Based on #18473.
4. Because qr has limited usage scenarios and the interfaces of qr and cr are very similar, we merged qr into cr to minimize user confusion.
5. Q4 can cause serious accuracy problems on some models, so we default to fp8 quantization.
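
To make the precision trade-off in points 3 and 5 concrete, here is a minimal sketch of a generic symmetric per-block integer quantizer and its round-trip error at 8, 6, and 4 bits. This is an illustration only: it is not the quickreduce codec, and the block size of 32 and the helper names are assumptions made for the example.

```python
import random


def quantize_roundtrip(values, bits, block_size=32):
    """Quantize to signed `bits`-bit integers per block, then dequantize.

    Generic illustration only; not the quickreduce codec.
    """
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 31 for 6-bit, 127 for 8-bit
    restored = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        # One scale per block: the largest magnitude maps to qmax.
        # An all-zero block would give scale 0, so fall back to 1.0.
        scale = max(abs(v) for v in block) / qmax or 1.0
        restored.extend(round(v / scale) * scale for v in block)
    return restored


def max_abs_error(values, bits):
    """Largest absolute difference between the input and its round trip."""
    restored = quantize_roundtrip(values, bits)
    return max(abs(a - b) for a, b in zip(values, restored))


random.seed(0)
data = [random.uniform(-1.0, 1.0) for _ in range(1024)]
for bits in (8, 6, 4):
    print(f"{bits}-bit max abs error: {max_abs_error(data, bits):.5f}")
```

Fewer bits mean a coarser grid per block, so the worst-case error grows as the bit width shrinks, which is consistent with Q4 being the riskiest setting for accuracy.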

@github-actions

github-actions bot commented Jun 3, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@lihaoyang-amd
Contributor Author

TP2

Kernel time (us), float16:

| size | rccl | custom allreduce (vllm) | 1stage (qr0) | 2stage f16 (qr1) | 2stage f8 (qr2) | 2stage Q6 (qr4) | 2stage Q4 (qr5) |
|---|---|---|---|---|---|---|---|
| 4k | 12.152859 | 10.69 | 11.93 | 13.2 | 24.2 | 28.85 | 26.48 |
| 8k | 11.98536 | 8.18 | 12.51 | 13.51 | 24.53 | 28.71 | 26.94 |
| 16k | 12.496617 | 11.04 | 12.06 | 13.4 | 24.33 | 28.95 | 27.42 |
| 32k | 20.224781 | 11.55 | 11.8 | 13.45 | 24.19 | 29.11 | 26.86 |
| 64k | 20.611031 | 11.74 | 13.96 | 14.27 | 24.36 | 29.1 | 27.85 |
| 128k | 21.858852 | 12.12 | 14.62 | 14.7 | 25.2 | 29.24 | 27.34 |
| 256k | 24.634491 | 14.56 | 16.83 | 17.17 | 25.54 | 30.03 | 27.66 |
| 512k | 30.244204 | 20.4 | 23.1 | 23.44 | 28.75 | 30.59 | 28.72 |
| 1M | 42.48489 | 32.57 | 35.52 | 33.94 | 34.88 | 35.6 | 33.14 |
| 2M | 67.120636 | 56.82 | 60.02 | 58.87 | 47.63 | 45.91 | 38.48 |
| 4M | 111.62647 | 106.2 | 107.46 | 103.13 | 75.76 | 65.8 | 52.93 |
| 8M | 202.81882 | 202.2 | 205.25 | 194.85 | 132.02 | 122.05 | 89.92 |
| 16M | 387.97064 | 395.36 | 374.85 | 369.46 | 249.94 | 207.14 | 158.16 |
| 32M | 758.74219 | 786.6 | 729.02 | 728.41 | 453.03 | 368.07 | 296.14 |
| 64M | 1501.2136 | 1539.97 | 1442.51 | 1448.86 | 838.51 | 675.8 | 529 |
| 128M | 2986.8469 | 3068.5 | 2875.23 | 2885.5 | 1644.49 | 1335.28 | 1021.9 |
| 256M | 5838.4868 | 6179.18 | 5748.16 | 5773.99 | 3280.35 | 2675.8 | 2061.21 |
| 512M | 11543.985 | 12234.53 | 11510.27 | 11567.95 | 6555.42 | 5346.56 | 4112.88 |
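
As a reading aid for the TP2 numbers, the following sketch computes how much faster the quantized 2stage kernels are than the better of rccl and custom allreduce at a few sizes. The four rows are copied from the TP2 results above; the helper name `speedup` and the choice of rows are just for illustration.

```python
# (size, rccl, custom allreduce, 2stage f8, 2stage Q4), times in microseconds,
# copied from the TP2 results above.
TP2_ROWS = [
    ("64k", 20.611031, 11.74, 24.36, 27.85),
    ("2M", 67.120636, 56.82, 47.63, 38.48),
    ("64M", 1501.2136, 1539.97, 838.51, 529.0),
    ("512M", 11543.985, 12234.53, 6555.42, 4112.88),
]


def speedup(baseline_us, candidate_us):
    """How many times faster the candidate is than the baseline (>1 is a win)."""
    return baseline_us / candidate_us


for size, rccl, custom, f8, q4 in TP2_ROWS:
    # Compare against whichever existing path is faster at this size.
    best_baseline = min(rccl, custom)
    print(f"{size:>5}: f8 x{speedup(best_baseline, f8):.2f}, "
          f"Q4 x{speedup(best_baseline, q4):.2f}")
```

A ratio below 1 at small sizes and above 1 at large sizes is the pattern that motivates choosing the kernel by message size.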

@lihaoyang-amd
Contributor Author

TP4

Kernel time (us), float16:

| size | rccl | custom allreduce (vllm) | 1stage (qr0) | 2stage f16 (qr1) | 2stage f8 (qr2) | 2stage Q6 (qr4) | 2stage Q4 (qr5) |
|---|---|---|---|---|---|---|---|
| 4k | 16.319765 | 11.07 | 19.6 | 18.94 | 29.35 | 33.3 | 31.09 |
| 8k | 18.678837 | 11.88 | 19.12 | 18.36 | 28.81 | 33.17 | 30.89 |
| 16k | 18.726336 | 12.84 | 19.33 | 17.44 | 29.15 | 34.34 | 30.97 |
| 32k | 19.181345 | 14.92 | 20.16 | 19.44 | 29.31 | 33.59 | 32.4 |
| 64k | 19.441343 | 15.6 | 21.4 | 18.71 | 33.71 | 33.63 | 32.93 |
| 128k | 34.290798 | 15.43 | 23.25 | 19.47 | 29.82 | 39.42 | 32.85 |
| 256k | 35.90361 | 19.03 | 32.88 | 19.35 | 30.23 | 35.42 | 32.31 |
| 512k | 39.845196 | 19.94 | 45.95 | 22.98 | 32.45 | 36.05 | 32.43 |
| 1M | 47.076485 | 26.62 | 74.5 | 28.71 | 37.6 | 39.61 | 34.86 |
| 2M | 59.289352 | 40.67 | 136.09 | 49.46 | 48.79 | 48.66 | 43.96 |
| 4M | 85.137909 | 69.28 | 209.41 | 67.77 | 71.51 | 61.9 | 52.07 |
| 8M | 133.56596 | 127.06 | 345.67 | 128.11 | 128.44 | 104.89 | 82.1 |
| 16M | 221.89734 | 234.79 | 619.63 | 220.47 | 200.7 | 190.04 | 138.04 |
| 32M | 412.81012 | 457.17 | 1034.53 | 410.26 | 341.27 | 307.03 | 243.56 |
| 64M | 791.31445 | 897.27 | 1969.94 | 810.27 | 603.46 | 531.76 | 458.29 |
| 128M | 1552.8333 | 1791.85 | 3764.65 | 1586.13 | 1140.76 | 1021.68 | 916.26 |
| 256M | 3042.7207 | 3549.68 | 7175.55 | 3129.78 | 2253.05 | 2060.62 | 1880.88 |
| 512M | 6036.2041 | 7152.43 | err | 6249.4 | 4469.68 | 4088.05 | 3788.14 |
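
One way to turn the TP4 numbers into a size cutoff is to look for the first size from which a quantized kernel is faster than custom allreduce and stays faster. The sketch below does this for 2stage Q4 using six rows copied from the TP4 results above; `first_crossover` is a hypothetical helper written for this comment, not part of the PR.

```python
# (size, custom allreduce, 2stage Q4), times in microseconds,
# copied from the TP4 results above.
TP4_ROWS = [
    ("512k", 19.94, 32.43),
    ("1M", 26.62, 34.86),
    ("2M", 40.67, 43.96),
    ("4M", 69.28, 52.07),
    ("8M", 127.06, 82.1),
    ("16M", 234.79, 138.04),
]


def first_crossover(rows):
    """Return the first size from which the candidate stays faster.

    `rows` must be ordered by increasing size. Returns None if the
    candidate is not faster at the largest size.
    """
    crossover = None
    for size, baseline_us, candidate_us in rows:
        if candidate_us < baseline_us:
            if crossover is None:
                crossover = size
        else:
            # Candidate lost again at this size, so discard the earlier win.
            crossover = None
    return crossover


print(first_crossover(TP4_ROWS))
```

Such a cutoff is only as good as the benchmark it comes from; it would likely need to be re-measured per GPU, world size, and quantization level.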

@lihaoyang-amd
Contributor Author

lihaoyang-amd commented Jun 3, 2025

float16:

| size | rccl | custom allreduce (vllm) | 1stage (qr0) | 2stage f16 (qr1) | 2stage f8 (qr2) | 2stage Q6 (qr4) | 2stage Q4 (qr5) | Ilya Q4 (qr5) |
|---|---|---|---|---|---|---|---|---|
| 4k | 16.68663 | 19.01 | 28.78 | 22.92 | 41.14 | 39.56 | 37.62 | 18.87 |
| 8k | 16.7382 | 13.28 | 37.23 | 21.87 | 35.46 | 40.89 | 36.63 | 18.47 |
| 16k | 16.76633 | 14.19 | 29.23 | 24.34 | 34.44 | 43.04 | 35.31 | 14.18 |
| 32k | 16.80445 | 22.02 | 30.89 | 24.91 | 38.39 | 39.91 | 38.71 | 14.53 |
| 64k | 17.00008 | 15.31 | 30.01 | 25.01 | 36.04 | 39.63 | 36.85 | 16.72 |
| 128k | 23.0151 | 19.99 | 37.14 | 25.2 | 36.9 | 45.83 | 39.08 | 36 |
| 256k | 22.69166 | 23.89 | 54.06 | 32.74 | 37.91 | 44.62 | 41.62 | 36.7 |
| 512k | 31.71264 | 28.23 | 73.72 | 24.58 | 40.18 | 48.38 | 38.48 | 38.68 |
| 1M | 33.88234 | 34.36 | 128.95 | 27.44 | 42.3 | 42.67 | 38.93 | 40.9 |
| 2M | 42.11831 | 41.5 | 235.02 | 38.02 | 44.3 | 46.93 | 42.31 | 41.06 |
| 4M | 63.82403 | 62.35 | 330.7 | 53.81 | 60.75 | 65.01 | 53.3 | 51.05 |
| 8M | 100.362 | 99.06 | 619.54 | 100.86 | 98.22 | 86.86 | 73.86 | 72.480003 |
| 16M | 170.5773 | 166.98 | 972.1 | 173.93 | 159.44 | 166.03 | 120.28 | 112.48 |
| 32M | 326.7224 | 311.75 | 1494.18 | 312.14 | 354.84 | 274.57 | 221.92 | 211.44 |
| 64M | 509.2664 | 594.36 | 2742.64 | 604.88 | 522.4 | 495.92 | 442.2 | 438.98 |
| 128M | 872.1246 | 1178.95 | 4983.3052 | 1176.42 | 1014.82 | 1059.4 | 916.15 | 930.36 |
| 256M | 1607.178 | 2344.33 | err | 2359.55 | 2068.29 | 2026.74 | 1906.32 | 1917.07 |
| 512M | 3093.578 | 4641.04 | err | 4728.71 | 4138.9 | 4070.6 | 3854 | 3875.09 |

@lihaoyang-amd lihaoyang-amd force-pushed the amd/integrate_qr_into_cr branch from 4e2a9ee to d0e1358 Compare June 5, 2025 09:15
@lihaoyang-amd lihaoyang-amd marked this pull request as ready for review June 5, 2025 09:24
@lihaoyang-amd lihaoyang-amd marked this pull request as draft June 5, 2025 09:25
@lihaoyang-amd
Contributor Author

@youkaichao
Hi, as per your suggestion, I have completed the merge of qr and cr. Could you please give me some comments? Thank you so much.

@lihaoyang-amd lihaoyang-amd marked this pull request as ready for review June 5, 2025 11:50
Contributor

For ease of use, should we make the default of VLLM_QUICK_ALLREDUCE_LEVEL a special value (None, -1, or 0) that means the user didn't explicitly specify a value?

Then, when world size = 8 (i.e. tensor-parallel-size = 8, I presume) and the user didn't specify the value explicitly, VLLM_QUICK_ALLREDUCE_LEVEL could be set to 5, since your findings show that VLLM_QUICK_ALLREDUCE_LEVEL=5 is the best config when world_size=8.
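
A rough sketch of the suggested behavior, assuming -1 is the "not specified" sentinel. The function name, the sentinel constant, and the fallback for other world sizes are hypothetical. The fallback of 2 assumes the `qrN` labels in the benchmark headers are level numbers, so that level 2 is the 2stage f8 default mentioned in the PR description.

```python
import os

UNSET_LEVEL = -1  # hypothetical sentinel meaning "user did not choose a level"
FALLBACK_LEVEL = 2  # assumption: qr2 = 2stage f8, the default named in the PR description


def resolve_quick_allreduce_level(world_size, environ=os.environ):
    """Pick a quantization level, honoring an explicit user setting first."""
    raw = environ.get("VLLM_QUICK_ALLREDUCE_LEVEL")
    level = UNSET_LEVEL if raw is None else int(raw)
    if level != UNSET_LEVEL:
        return level  # an explicit user choice always wins
    if world_size == 8:
        return 5  # per the comment above: level 5 is reported best at world_size=8
    return FALLBACK_LEVEL


print(resolve_quick_allreduce_level(8, environ={}))
print(resolve_quick_allreduce_level(8, environ={"VLLM_QUICK_ALLREDUCE_LEVEL": "3"}))
```

Passing `environ` as a parameter keeps the function easy to test without touching the real process environment.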

@tjtanaa
Contributor

tjtanaa commented Jun 5, 2025

Does this feature directly overwrite the old custom all reduce? Or is there a way to fall back to the old custom all reduce?

@lihaoyang-amd
Contributor Author

> Does this feature directly overwrite the old custom all reduce? Or is there a way to fall back to the old custom all reduce?

qr complements cr rather than replacing it: we try qr first and, if the scenario is not one where qr excels, we fall back to cr.
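
The fallback order described here can be sketched as a small dispatch function. This is a simplified illustration, not the PR's actual code: the function name and flags are hypothetical, and the 128 KB cutoff is taken from the "less than 128kb" remark in the PR description, so real cutoffs may differ by world size and quantization level.

```python
# Placeholder cutoff taken from the "less than 128kb" remark in the PR
# description; the real condition may depend on world size and level.
QR_MIN_BYTES = 128 * 1024


def choose_backend(num_bytes, qr_supported=True, cr_supported=True):
    """Try quick allreduce first, then custom allreduce, then RCCL."""
    if qr_supported and num_bytes >= QR_MIN_BYTES:
        return "qr"
    if cr_supported:
        return "cr"
    return "rccl"


for size in (4 * 1024, 1 << 20):
    print(size, choose_backend(size))
```

With this ordering, disabling qr leaves the existing cr and RCCL behavior unchanged.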

@mergify

mergify bot commented Jun 5, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @lihaoyang-amd.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jun 5, 2025
@lihaoyang-amd lihaoyang-amd force-pushed the amd/integrate_qr_into_cr branch from a333b8a to 024f2fc Compare June 11, 2025 10:55
@mergify mergify bot removed the needs-rebase label Jun 11, 2025
Signed-off-by: Haoyang Li <Haoyang.Li@amd.com>
@lihaoyang-amd lihaoyang-amd force-pushed the amd/integrate_qr_into_cr branch from cbd05c1 to d80ccd0 Compare June 13, 2025 09:23
Signed-off-by: Haoyang Li <Haoyang.Li@amd.com>
@lihaoyang-amd lihaoyang-amd force-pushed the amd/integrate_qr_into_cr branch from d80ccd0 to 517dc5d Compare June 13, 2025 09:26
@mergify

mergify bot commented Jun 19, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @lihaoyang-amd.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jun 19, 2025
@tlrmchlsmth
Member

closing in favor of #19744
