[compile] Enable sequence parallelism matching w/o custom ops enabled #27126
Conversation
Signed-off-by: angelayi <yiangela7@gmail.com>
Thanks for taking this on! Could you just add me as a co-author on one of the commits?
```python
"""Base helper for RMSNorm and RMSNorm + Quantization functionalization."""

def get_first_out_wrapper(fn):
    @functools.wraps(fn)
    def wrapper(*args):
```
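For context, a minimal self-contained sketch of how such a wrapper could behave. The `[0]` return and the `norm_and_scale` example are assumptions for illustration only, not vLLM's actual code:

```python
import functools

def get_first_out_wrapper(fn):
    # Hypothetical reconstruction: preserve fn's metadata for the pattern
    # tracer via functools.wraps, but return only the first element of
    # fn's output tuple.
    @functools.wraps(fn)
    def wrapper(*args):
        return fn(*args)[0]
    return wrapper

# Illustrative stand-in for an RMSNorm-style helper that returns
# (normalized_output, residual).
def norm_and_scale(x, w):
    return ([v * w for v in x], x)

f = get_first_out_wrapper(norm_and_scale)
print(f([1.0, 2.0], 3.0))  # [3.0, 6.0]
print(f.__name__)          # norm_and_scale (metadata preserved by wraps)
```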
Does this work? I thought that during tracing, the pattern-matching tracer would treat `args` as a single parameter.
Yes! Updated the test to assert the number of all_reduce/all_gather ops in the graph.
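That assertion strategy can be sketched as follows; the op names and counts are illustrative (sequence parallelism rewrites each `all_reduce` into a `reduce_scatter`/`all_gather` pair), not the actual test code:

```python
# Toy model of the test: represent the traced graph as a list of op-name
# strings and assert collective-op counts before/after the SP pass.
def count_ops(graph_ops, target):
    return sum(1 for op in graph_ops if op == target)

before = ["all_reduce", "rms_norm", "all_reduce", "rms_norm"]
# Sequence parallelism replaces each all_reduce with a
# reduce_scatter + all_gather pair around the sharded region.
after = ["reduce_scatter", "rms_norm", "all_gather",
         "reduce_scatter", "rms_norm", "all_gather"]

assert count_ops(before, "all_reduce") == 2
assert count_ops(after, "all_reduce") == 0
assert count_ops(after, "reduce_scatter") == 2
assert count_ops(after, "all_gather") == 2
print("collective-op counts match")
```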
Signed-off-by: angelayi <yiangela7@gmail.com> Co-authored-by: Luka Govedič <lgovedic@redhat.com>
@cascade812 could you take a look at this please?
Sure!
Co-authored-by: Luka Govedič <lgovedic@redhat.com> Signed-off-by: angelayi <yiangela7@gmail.com>
@angelayi I get the below error if I don't specify

@angelayi It seems odd to me that enabling AsyncTP results in higher latency for Llama-70B. From our earlier benchmark, we observed about a 10% reduction in average latency for the prefill stage with AsyncTP enabled, for the same model on 4xH200.
We no longer have to skip the FP4 tests!
```python
if inductor_graph_partition and "fp4" in model_name.lower():
    pytest.skip(
        "Known bug for fp4 fusion & inductor partition: "
        "https://github.com/vllm-project/vllm/issues/26988"
    )
```
#26988 was fixed
Purpose
Based on #24604, this modifies the sequence-parallelism pass to match custom ops without needing to enable them.
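A hedged sketch of the core idea: register matchers for both the fused custom-op form and the decomposed torch-native form of the same computation, so the pass fires whether or not the custom op is enabled. The op names below are illustrative placeholders, not vLLM's actual graph node targets:

```python
# Two graph shapes for the same RMSNorm computation (names are hypothetical).
PATTERNS = {
    "custom": ("vllm.rms_norm",),               # fused custom-op form
    "native": ("pow", "mean", "rsqrt", "mul"),  # decomposed torch form
}

def find_rms_norm(graph_ops):
    # Scan the op sequence for either pattern variant; return which
    # variant matched and at what index, or None.
    ops = tuple(graph_ops)
    for name, pat in PATTERNS.items():
        for i in range(len(ops) - len(pat) + 1):
            if ops[i:i + len(pat)] == pat:
                return name, i
    return None

print(find_rms_norm(["vllm.rms_norm", "mm"]))                # ('custom', 0)
print(find_rms_norm(["mm", "pow", "mean", "rsqrt", "mul"]))  # ('native', 1)
```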
Test Plan
`pytest -sv tests/compile/test_sequence_parallelism.py`

Performance numbers
I did some benchmarking with the command on H100 w/o flashinfer, while varying:
- `"pass_config": {"enable_async_tp": true, "enable_sequence_parallelism": true}` vs. `"pass_config": {"enable_async_tp": false, "enable_sequence_parallelism": false}`
- `"custom_ops": ["+quant_fp8", "+rms_norm"]` vs. `"custom_ops": []`
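The resulting 2x2 sweep can be enumerated as follows. Only the flag names above come from the PR description; the sweep code itself is an illustrative sketch:

```python
import itertools
import json

# Enumerate the 2x2 sweep: fusion passes on/off x custom ops on/off.
configs = []
for passes_on, custom_ops in itertools.product(
    [True, False], [["+quant_fp8", "+rms_norm"], []]
):
    configs.append({
        "pass_config": {
            "enable_async_tp": passes_on,
            "enable_sequence_parallelism": passes_on,
        },
        "custom_ops": custom_ops,
    })

print(len(configs))  # 4
print(json.dumps(configs[0], indent=2))
```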