[feature] Integrate quick allreduce into custom allreduce and select the best allreduce implementation #19094

lihaoyang-amd · 2025-06-03T15:35:40Z

1. With its low-granularity quantization, https://github.com/mk1-project/quickreduce brings large performance gains to allreduce at tp2 and tp4 on ROCm, without significantly degrading the model's performance.
2. We integrated quick allreduce (qr) into vLLM, supporting 1stage (f16) and 2stage (f16, fp8, Q8, Q6, Q4).
3. The qr speedup comes at the cost of some precision, and custom allreduce (cr) is significantly faster than qr's 1stage and 2stage methods at small data volumes (less than 128 KB), so we need conditions to decide whether to use qr, cr, or rccl. (According to the kernel benchmark results below, oneshot has no scenario where it wins, so we removed it.)
4. Since qr has limited usage scenarios and the interfaces of qr and cr are very similar, we merged qr into cr to minimize user confusion.
5. Q4 can cause serious accuracy problems on some models, so we default to fp8 quantization.

Based on #18473.
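
The selection between qr, cr, and rccl described in point 3 could be sketched roughly as follows. This is a minimal illustration and not the PR's actual code: the names `Backend`, `QR_MIN_BYTES`, and `choose_allreduce`, and the two availability flags, are hypothetical. Only the 128 KB threshold comes from the description, and the benchmark tables suggest the real crossover depends on world size and quantization level.

```python
from enum import Enum


class Backend(Enum):
    QUICK_REDUCE = "qr"
    CUSTOM_ALLREDUCE = "cr"
    RCCL = "rccl"


# From the PR description: below roughly 128 KB, cr is faster than qr.
QR_MIN_BYTES = 128 * 1024


def choose_allreduce(nbytes: int, qr_available: bool, cr_available: bool) -> Backend:
    """Pick a backend for one allreduce call (hypothetical helper).

    qr_available / cr_available are assumed to be computed elsewhere,
    e.g. from the platform, world size, and dtype.
    """
    # Try qr first, but only for messages large enough to benefit.
    if qr_available and nbytes >= QR_MIN_BYTES:
        return Backend.QUICK_REDUCE
    # Otherwise fall back to cr, then to rccl as the last resort.
    if cr_available:
        return Backend.CUSTOM_ALLREDUCE
    return Backend.RCCL
```

For example, `choose_allreduce(4096, True, True)` returns `Backend.CUSTOM_ALLREDUCE`, while a 2 MB message with both paths available returns `Backend.QUICK_REDUCE`.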

github-actions · 2025-06-03T15:35:49Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

lihaoyang-amd · 2025-06-03T15:39:05Z

TP2 (float16, time in us)

| size | rccl | custom allreduce (vllm) | 1stage (qr0) | 2stage f16 (qr1) | 2stage f8 (qr2) | 2stage Q6 (qr4) | 2stage Q4 (qr5) |
|---|---|---|---|---|---|---|---|
| 4k | 12.152859 | 10.69 | 11.93 | 13.2 | 24.2 | 28.85 | 26.48 |
| 8k | 11.98536 | 8.18 | 12.51 | 13.51 | 24.53 | 28.71 | 26.94 |
| 16k | 12.496617 | 11.04 | 12.06 | 13.4 | 24.33 | 28.95 | 27.42 |
| 32k | 20.224781 | 11.55 | 11.8 | 13.45 | 24.19 | 29.11 | 26.86 |
| 64k | 20.611031 | 11.74 | 13.96 | 14.27 | 24.36 | 29.1 | 27.85 |
| 128k | 21.858852 | 12.12 | 14.62 | 14.7 | 25.2 | 29.24 | 27.34 |
| 256k | 24.634491 | 14.56 | 16.83 | 17.17 | 25.54 | 30.03 | 27.66 |
| 512k | 30.244204 | 20.4 | 23.1 | 23.44 | 28.75 | 30.59 | 28.72 |
| 1M | 42.48489 | 32.57 | 35.52 | 33.94 | 34.88 | 35.6 | 33.14 |
| 2M | 67.120636 | 56.82 | 60.02 | 58.87 | 47.63 | 45.91 | 38.48 |
| 4M | 111.62647 | 106.2 | 107.46 | 103.13 | 75.76 | 65.8 | 52.93 |
| 8M | 202.81882 | 202.2 | 205.25 | 194.85 | 132.02 | 122.05 | 89.92 |
| 16M | 387.97064 | 395.36 | 374.85 | 369.46 | 249.94 | 207.14 | 158.16 |
| 32M | 758.74219 | 786.6 | 729.02 | 728.41 | 453.03 | 368.07 | 296.14 |
| 64M | 1501.2136 | 1539.97 | 1442.51 | 1448.86 | 838.51 | 675.8 | 529 |
| 128M | 2986.8469 | 3068.5 | 2875.23 | 2885.5 | 1644.49 | 1335.28 | 1021.9 |
| 256M | 5838.4868 | 6179.18 | 5748.16 | 5773.99 | 3280.35 | 2675.8 | 2061.21 |
| 512M | 11543.985 | 12234.53 | 11510.27 | 11567.95 | 6555.42 | 5346.56 | 4112.88 |
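
To read a crossover point off these TP2 numbers, a small script like the one below can help. It is only a sketch: the helper names `first_crossover` and `speedup` are made up for illustration, and the dictionary holds just a subset of the rows above (custom allreduce vs. 2stage Q4).

```python
# Subset of the TP2 rows above, in microseconds:
# size -> (custom allreduce, qr 2stage Q4)
TP2_TIMES_US = {
    "512k": (20.4, 28.72),
    "1M": (32.57, 33.14),
    "2M": (56.82, 38.48),
    "4M": (106.2, 52.93),
    "8M": (202.2, 89.92),
}


def first_crossover(times):
    """Return the first size (in insertion order) where qr is faster than cr."""
    for size, (cr_us, qr_us) in times.items():
        if qr_us < cr_us:
            return size
    return None


def speedup(times, size):
    """Ratio of cr time to qr time at the given size (>1 means qr is faster)."""
    cr_us, qr_us = times[size]
    return cr_us / qr_us


print(first_crossover(TP2_TIMES_US))          # 2M
print(round(speedup(TP2_TIMES_US, "8M"), 2))  # 2.25
```

On this subset, 2stage Q4 first beats custom allreduce at 2M, and at 8M it is roughly 2.25x faster.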

lihaoyang-amd · 2025-06-03T15:39:25Z

TP4 (float16, time in us)

| size | rccl | custom allreduce (vllm) | 1stage (qr0) | 2stage f16 (qr1) | 2stage f8 (qr2) | 2stage Q6 (qr4) | 2stage Q4 (qr5) |
|---|---|---|---|---|---|---|---|
| 4k | 16.319765 | 11.07 | 19.6 | 18.94 | 29.35 | 33.3 | 31.09 |
| 8k | 18.678837 | 11.88 | 19.12 | 18.36 | 28.81 | 33.17 | 30.89 |
| 16k | 18.726336 | 12.84 | 19.33 | 17.44 | 29.15 | 34.34 | 30.97 |
| 32k | 19.181345 | 14.92 | 20.16 | 19.44 | 29.31 | 33.59 | 32.4 |
| 64k | 19.441343 | 15.6 | 21.4 | 18.71 | 33.71 | 33.63 | 32.93 |
| 128k | 34.290798 | 15.43 | 23.25 | 19.47 | 29.82 | 39.42 | 32.85 |
| 256k | 35.90361 | 19.03 | 32.88 | 19.35 | 30.23 | 35.42 | 32.31 |
| 512k | 39.845196 | 19.94 | 45.95 | 22.98 | 32.45 | 36.05 | 32.43 |
| 1M | 47.076485 | 26.62 | 74.5 | 28.71 | 37.6 | 39.61 | 34.86 |
| 2M | 59.289352 | 40.67 | 136.09 | 49.46 | 48.79 | 48.66 | 43.96 |
| 4M | 85.137909 | 69.28 | 209.41 | 67.77 | 71.51 | 61.9 | 52.07 |
| 8M | 133.56596 | 127.06 | 345.67 | 128.11 | 128.44 | 104.89 | 82.1 |
| 16M | 221.89734 | 234.79 | 619.63 | 220.47 | 200.7 | 190.04 | 138.04 |
| 32M | 412.81012 | 457.17 | 1034.53 | 410.26 | 341.27 | 307.03 | 243.56 |
| 64M | 791.31445 | 897.27 | 1969.94 | 810.27 | 603.46 | 531.76 | 458.29 |
| 128M | 1552.8333 | 1791.85 | 3764.65 | 1586.13 | 1140.76 | 1021.68 | 916.26 |
| 256M | 3042.7207 | 3549.68 | 7175.55 | 3129.78 | 2253.05 | 2060.62 | 1880.88 |
| 512M | 6036.2041 | 7152.43 | err | 6249.4 | 4469.68 | 4088.05 | 3788.14 |

lihaoyang-amd · 2025-06-03T15:39:36Z

float16

| size | rccl | custom allreduce (vllm) | 1stage (qr0) | 2stage f16 (qr1) | 2stage f8 (qr2) | 2stage Q6 (qr4) | 2stage Q4 (qr5) | Ilya Q4 (qr5) |
|---|---|---|---|---|---|---|---|---|
| 4k | 16.68663 | 19.01 | 28.78 | 22.92 | 41.14 | 39.56 | 37.62 | 18.87 |
| 8k | 16.7382 | 13.28 | 37.23 | 21.87 | 35.46 | 40.89 | 36.63 | 18.47 |
| 16k | 16.76633 | 14.19 | 29.23 | 24.34 | 34.44 | 43.04 | 35.31 | 14.18 |
| 32k | 16.80445 | 22.02 | 30.89 | 24.91 | 38.39 | 39.91 | 38.71 | 14.53 |
| 64k | 17.00008 | 15.31 | 30.01 | 25.01 | 36.04 | 39.63 | 36.85 | 16.72 |
| 128k | 23.0151 | 19.99 | 37.14 | 25.2 | 36.9 | 45.83 | 39.08 | 36 |
| 256k | 22.69166 | 23.89 | 54.06 | 32.74 | 37.91 | 44.62 | 41.62 | 36.7 |
| 512k | 31.71264 | 28.23 | 73.72 | 24.58 | 40.18 | 48.38 | 38.48 | 38.68 |
| 1M | 33.88234 | 34.36 | 128.95 | 27.44 | 42.3 | 42.67 | 38.93 | 40.9 |
| 2M | 42.11831 | 41.5 | 235.02 | 38.02 | 44.3 | 46.93 | 42.31 | 41.06 |
| 4M | 63.82403 | 62.35 | 330.7 | 53.81 | 60.75 | 65.01 | 53.3 | 51.05 |
| 8M | 100.362 | 99.06 | 619.54 | 100.86 | 98.22 | 86.86 | 73.86 | 72.480003 |
| 16M | 170.5773 | 166.98 | 972.1 | 173.93 | 159.44 | 166.03 | 120.28 | 112.48 |
| 32M | 326.7224 | 311.75 | 1494.18 | 312.14 | 354.84 | 274.57 | 221.92 | 211.44 |
| 64M | 509.2664 | 594.36 | 2742.64 | 604.88 | 522.4 | 495.92 | 442.2 | 438.98 |
| 128M | 872.1246 | 1178.95 | 4983.3052 | 1176.42 | 1014.82 | 1059.4 | 916.15 | 930.36 |
| 256M | 1607.178 | 2344.33 | err | 2359.55 | 2068.29 | 2026.74 | 1906.32 | 1917.07 |
| 512M | 3093.578 | 4641.04 | err | 4728.71 | 4138.9 | 4070.6 | 3854 | 3875.09 |

lihaoyang-amd · 2025-06-05T10:18:15Z

@youkaichao
Hi, as per your suggestion, I have completed the merge of qr and cr. Could you please take a look and share any comments? Thank you so much.

tjtanaa · 2025-06-05T13:38:40Z

vllm/distributed/device_communicators/custom_all_reduce.py

For ease of use, should we make the default some special value (None, -1, or 0) that means the user did not explicitly specify it?

For example, when world size = 8 (which is tensor-parallel-size = 8, I presume) and the user did not specify the value explicitly, VLLM_QUICK_ALLREDUCE_LEVEL would be set to 5, since your findings show that VLLM_QUICK_ALLREDUCE_LEVEL=5 is the best config when world_size=8.
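
One way to read this suggestion is sketched below. It is an assumption-laden illustration, not code from the PR: `resolve_level` and `DEFAULT_LEVEL_BY_WORLD_SIZE` are hypothetical names, an unset environment variable is used as the "not specified" sentinel, and only the world_size=8 to level 5 mapping is taken from the comment above.

```python
import os

ENV_NAME = "VLLM_QUICK_ALLREDUCE_LEVEL"

# Hypothetical table of per-world-size defaults. Only the 8 -> 5 entry
# is taken from the review comment; other entries would need measurements.
DEFAULT_LEVEL_BY_WORLD_SIZE = {8: 5}


def resolve_level(world_size, env=None):
    """Return the level to use, or None if nothing is known.

    An explicit user setting always wins; otherwise fall back to the
    per-world-size default (hypothetical helper).
    """
    env = os.environ if env is None else env
    raw = env.get(ENV_NAME)
    if raw is not None and raw.strip() != "":
        # The user explicitly specified a value.
        return int(raw)
    # Unset acts as the "not specified" sentinel.
    return DEFAULT_LEVEL_BY_WORLD_SIZE.get(world_size)
```

With this shape, `resolve_level(8, env={})` yields 5, while an explicit `VLLM_QUICK_ALLREDUCE_LEVEL=2` overrides the default regardless of world size.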

tjtanaa · 2025-06-05T13:39:43Z

Does this feature directly overwrite the old custom allreduce, or is there a way to fall back to it?

lihaoyang-amd · 2025-06-05T13:43:09Z

> Does this feature directly overwrite the old custom allreduce, or is there a way to fall back to it?

qr complements cr: we try qr first, and if the scenario is not one where qr excels, we fall back to cr.

mergify · 2025-06-05T18:22:54Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @lihaoyang-amd.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Signed-off-by: Haoyang Li <Haoyang.Li@amd.com>

mergify · 2025-06-19T04:55:10Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @lihaoyang-amd.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

tlrmchlsmth · 2025-06-26T18:58:57Z

closing in favor of #19744

mergify bot added the ci/build label Jun 3, 2025

lihaoyang-amd mentioned this pull request Jun 3, 2025

Integrate quick allreduce and select the best allreduce implementation #18473

Closed

lihaoyang-amd force-pushed the amd/integrate_qr_into_cr branch from 4e2a9ee to d0e1358 Compare June 5, 2025 09:15

lihaoyang-amd marked this pull request as ready for review June 5, 2025 09:24

lihaoyang-amd requested a review from tlrmchlsmth as a code owner June 5, 2025 09:24

lihaoyang-amd marked this pull request as draft June 5, 2025 09:25

lihaoyang-amd marked this pull request as ready for review June 5, 2025 11:50

tjtanaa reviewed Jun 5, 2025

lihaoyang-amd mentioned this pull request Jun 5, 2025

[Feature] Integrate quick allreduce and select the best allreduce implementation sgl-project/sglang#6619

Merged

mergify bot added the needs-rebase label Jun 5, 2025

lihaoyang-amd force-pushed the amd/integrate_qr_into_cr branch from a333b8a to 024f2fc Compare June 11, 2025 10:55

mergify bot removed the needs-rebase label Jun 11, 2025

lihaoyang-amd added 9 commits June 13, 2025 09:14

merge qr and cr

d902224

Signed-off-by: Haoyang Li <Haoyang.Li@amd.com>

Initial completion of cr+qr

710fc2c

Signed-off-by: Haoyang Li <Haoyang.Li@amd.com>

add condition for qr

a426f84

Signed-off-by: Haoyang Li <Haoyang.Li@amd.com>

finish merger

3c0f441

Signed-off-by: Haoyang Li <Haoyang.Li@amd.com>

add condition

6f7a4c6

Signed-off-by: Haoyang Li <Haoyang.Li@amd.com>

add condition

f1741c4

Signed-off-by: Haoyang Li <Haoyang.Li@amd.com>

add bf2float

6077932

Signed-off-by: Haoyang Li <Haoyang.Li@amd.com>

add option for bf2half

9935961

Signed-off-by: Haoyang Li <Haoyang.Li@amd.com>

for fmt

7470fc3

Signed-off-by: Haoyang Li <Haoyang.Li@amd.com>

lihaoyang-amd force-pushed the amd/integrate_qr_into_cr branch from cbd05c1 to d80ccd0 Compare June 13, 2025 09:23

fix condition for mi300

517dc5d

Signed-off-by: Haoyang Li <Haoyang.Li@amd.com>

lihaoyang-amd force-pushed the amd/integrate_qr_into_cr branch from d80ccd0 to 517dc5d Compare June 13, 2025 09:26

mergify bot added the needs-rebase label Jun 19, 2025

tlrmchlsmth closed this Jun 26, 2025
