[Kernels] Add an inductor pass to rewrite and fuse collective communication ops with gemms #9886
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
looking forward to this one!
device_group = group.device_group
rank = group.rank_in_group

if use_flux:
Could we maybe use a better abstraction than if statements based on use_flux?
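One way to address this review comment would be a small backend registry, so the flux-vs-default decision is made once instead of branching on `use_flux` at every call site. The sketch below is purely illustrative: the class and function names (`FusedGemmBackend`, `FluxBackend`, `get_fused_gemm_backend`) are hypothetical and the returned callables are stand-ins for the real fused-GEMM ops.

```python
from typing import Callable, Dict


class FusedGemmBackend:
    """Hypothetical interface for a GEMM + collective fusion backend."""
    name = "base"

    def fused_gemm_func(self) -> Callable:
        raise NotImplementedError


class FluxBackend(FusedGemmBackend):
    name = "flux"

    def fused_gemm_func(self) -> Callable:
        # In the real pass this would return the flux fused op;
        # a placeholder callable is used here.
        return lambda *args, **kwargs: ("flux", args, kwargs)


class DefaultBackend(FusedGemmBackend):
    name = "default"

    def fused_gemm_func(self) -> Callable:
        return lambda *args, **kwargs: ("default", args, kwargs)


_BACKENDS: Dict[str, FusedGemmBackend] = {
    b.name: b for b in (FluxBackend(), DefaultBackend())
}


def get_fused_gemm_backend(use_flux: bool) -> FusedGemmBackend:
    """Resolve the backend once, instead of if/else at each rewrite site."""
    return _BACKENDS["flux" if use_flux else "default"]
```

The rewrite pass would then ask the resolved backend for `fused_gemm_func()` rather than testing `use_flux` inline.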
fused_node = graph.call_function(fused_gemm_func, kwargs=kwargs)

graph.inserting_after(fused_node)
result_node_new = graph.call_function(operator.getitem, (fused_node, 0))
residual_node_new = graph.call_function(operator.getitem, (fused_node, 1))
my_residual_node_new = graph.call_function(operator.getitem, (fused_node, 2))
I think multi-output match has a utility that emits a function and tuple accessors.
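In the spirit of that suggestion, the repeated `operator.getitem` calls could be factored into one helper. This is a hypothetical sketch (the name `insert_tuple_getitems` is not a real pattern-matcher API), showing the mechanics with plain `torch.fx`:

```python
import operator

import torch
import torch.fx as fx


def insert_tuple_getitems(graph: fx.Graph, node: fx.Node, num_outputs: int):
    """Insert getitem accessors for each output of a multi-output node.

    The accessors are placed immediately after `node`, mirroring the
    inserting_after / call_function(operator.getitem, ...) pattern above.
    """
    with graph.inserting_after(node):
        return [
            graph.call_function(operator.getitem, (node, i))
            for i in range(num_outputs)
        ]


# Usage on a toy graph: torch.split returns a tuple, so it stands in for
# the multi-output fused GEMM node.
g = fx.Graph()
x = g.placeholder("x")
split = g.call_function(torch.split, (x, 1))
a, b = insert_tuple_getitems(g, split, 2)
g.output((a, b))
gm = fx.GraphModule(torch.nn.Module(), g)
out = gm(torch.tensor([3.0, 4.0]))
```

With a helper like this, the three accessor nodes in the snippet above collapse to a single call returning a list.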
res_replacements.append(residual_node_new)
my_res_replacements.append(my_residual_node_new)
Any reason we save all of the residuals instead of just the previous one?
if gemm_1 is None or gemm_2 is None:
    raise ValueError("Missing 'val' in gemm weights meta data")
Wouldn't it be simpler to just do `meta["val"]`?
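The reviewer's point, sketched in isolation: a direct `meta["val"]` lookup already fails loudly on a missing key, so the explicit `None` check buys only a custom error message. Here `meta` stands in for a node's `.meta` dict; both helpers are illustrative, not vLLM code.

```python
def get_gemm_weight_explicit(meta: dict):
    """Current pattern: .get() plus an explicit None check."""
    val = meta.get("val")
    if val is None:
        raise ValueError("Missing 'val' in gemm weights meta data")
    return val


def get_gemm_weight_simple(meta: dict):
    """Suggested pattern: plain indexing raises KeyError on a missing key."""
    return meta["val"]
```

The trade-off is the error type and message: `KeyError: 'val'` versus a descriptive `ValueError`.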
Signed-off-by: Bill Nell <bill@neuralmagic.com>
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it. Thank you!
Add an inductor pass to rewrite and fuse collective communication ops with gemms
See #9883 for version that includes llama hacks.
TODO:
- torch._inductor.ir.ExternKernel.__str__ (pytorch/pytorch#139501)

cc @tlrmchlsmth, @ProExpertProg, @SageMoore, @youkaichao
Requires a special config to run:
Some benchmark results: