Conversation

@ProExpertProg (Collaborator) commented Sep 10, 2025

Purpose

This PR enables matching the torch implementations of the custom ops QuantFP8 and RMSNorm. On main, fusion requires enabling the custom ops, but they are slower than their torch counterparts, so the benefit of the custom fusion passes is reduced.

We add a set of "matcher util" objects that can be called inside patterns; when traced, they automatically produce the same fx nodes as the custom op they correspond to, in both its enabled and disabled form.
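Conceptually, a matcher util wraps both implementations behind a single callable, so a pattern written against it matches whichever form the model was actually traced with. A minimal sketch, assuming a hypothetical `MatcherRMSNorm` (the class name, op handle, and enablement flag below are illustrative, not the PR's actual API):

```python
import torch

class MatcherRMSNorm:
    """Hypothetical matcher util sketch: traces to the custom-op node
    when the op is enabled, and to the torch-native nodes otherwise."""

    def __init__(self, epsilon: float, enabled: bool):
        self.epsilon = epsilon
        self.enabled = enabled  # mirrors the -O.custom_ops selection

    def __call__(self, x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
        if self.enabled:
            # Traces to the single custom-op node the enabled form produces.
            return torch.ops.vllm.rms_norm(x, weight, self.epsilon)  # placeholder handle
        # Traces to the torch-native node sequence the disabled form produces.
        orig_dtype = x.dtype
        x = x.float()
        x = x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.epsilon)
        return (x * weight.float()).to(orig_dtype)
```

A pattern can then call the same object in both regimes instead of hard-coding one implementation.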

This PR also adds debugging utilities, as well as E2E fusion tests that verify fusions happen in models end-to-end instead of just in unit tests.

Test Plan

Unit tests, plus newly added fusion E2E tests.

Test Result

All tests pass.

Performance numbers

Below are B200 numbers (with FlashInfer) from `vllm bench serve` against the following serve command:

```
vllm serve redhatai/meta-llama-3.1-70B-Instruct-FP8 --no-enable-prefix-caching --load-format dummy --kv-cache-dtype=fp8 -O.splitting_ops=[] -O.cudagraph_mode=FULL_DECODE_ONLY
```

We test the following regimes with corresponding additional arguments:

  1. none: -O.custom_ops='["none"]' -O.pass_config={"enable_fi_allreduce_fusion":false,"enable_attn_fusion":false,"enable_noop":true}
  2. none_fusion_attention: -O.custom_ops='["none"]' -O.pass_config={"enable_fi_allreduce_fusion":false,"enable_attn_fusion":true,"enable_noop":true}
  3. none_fusion_attention_allreduce: -O.custom_ops='["none"]' -O.pass_config={"enable_fi_allreduce_fusion":true,"enable_attn_fusion":true,"enable_noop":true}
  4. rms_quant: -O.custom_ops='["none", "+quant_fp8", "+rms_norm"]' -O.pass_config={"enable_fi_allreduce_fusion":false,"enable_attn_fusion":false,"enable_noop":true}
  5. rms_quant_fusion_attention: -O.custom_ops='["none", "+quant_fp8", "+rms_norm"]' -O.pass_config={"enable_fi_allreduce_fusion":false,"enable_attn_fusion":true,"enable_noop":true}
  6. rms_quant_fusion_attention_allreduce: -O.custom_ops='["none", "+quant_fp8", "+rms_norm"]' -O.pass_config={"enable_fi_allreduce_fusion":true,"enable_attn_fusion":true,"enable_noop":true}

Regimes 2 (none_fusion_attention) and 3 (none_fusion_attention_allreduce) are newly possible with this PR. On main, results are similar, except those two regimes are worse because fusion cannot happen without custom ops enabled. A full example invocation is shown below.
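For concreteness, regime 2 (none_fusion_attention) corresponds to an invocation along these lines, assembled from the base command and flags above (shell quoting may need adjusting for your environment):

```bash
vllm serve redhatai/meta-llama-3.1-70B-Instruct-FP8 \
  --no-enable-prefix-caching --load-format dummy --kv-cache-dtype=fp8 \
  -O.splitting_ops=[] -O.cudagraph_mode=FULL_DECODE_ONLY \
  -O.custom_ops='["none"]' \
  -O.pass_config='{"enable_fi_allreduce_fusion":false,"enable_attn_fusion":true,"enable_noop":true}'
```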

redhatai/meta-llama-3.1-70B-Instruct-FP8 (TP=1):

Past QPS=10 the server is overloaded, so latency spikes and becomes much more variable. Also note that allreduce fusion is a no-op for TP=1.

📊 TTFT Median (ms)

| Source | QPS 1.0 | QPS 5.0 | QPS 10.0 | QPS 15.0 | QPS 20.0 | QPS inf |
|---|---|---|---|---|---|---|
| none | 80.24 | 85.20 | 148.05 | 3740.51 | 27131.90 | 29992.91 |
| none_fusion_attention_allreduce | 78.42 | 83.89 | 146.75 | 1739.83 | 24117.31 | 28886.46 |
| rms_quant | 80.38 | 87.31 | 162.54 | 5298.40 | 26009.12 | 35230.81 |
| rms_quant_fusion_attention_allreduce | 79.50 | 85.38 | 149.76 | 3904.91 | 24737.15 | 34760.88 |

📊 TPOT Median (ms)

| Source | QPS 1.0 | QPS 5.0 | QPS 10.0 | QPS 15.0 | QPS 20.0 | QPS inf |
|---|---|---|---|---|---|---|
| none | 16.86 | 22.06 | 41.13 | 287.30 | 293.71 | 285.44 |
| none_fusion_attention_allreduce | 16.61 | 21.86 | 39.95 | 285.85 | 284.93 | 275.04 |
| rms_quant | 17.17 | 23.12 | 44.32 | 234.08 | 232.37 | 227.34 |
| rms_quant_fusion_attention_allreduce | 16.97 | 22.45 | 42.21 | 228.79 | 228.76 | 225.60 |

📊 ITL Median (ms)

| Source | QPS 1.0 | QPS 5.0 | QPS 10.0 | QPS 15.0 | QPS 20.0 | QPS inf |
|---|---|---|---|---|---|---|
| none | 16.11 | 16.49 | 21.31 | 207.46 | 180.96 | 64.85 |
| none_fusion_attention_allreduce | 15.87 | 16.28 | 21.07 | 223.46 | 191.32 | 64.14 |
| rms_quant | 16.35 | 17.02 | 22.16 | 103.97 | 50.63 | 47.99 |
| rms_quant_fusion_attention_allreduce | 16.20 | 16.71 | 21.46 | 143.86 | 48.40 | 47.59 |
*(figure: serving_metrics_llama70B_tp1)*

redhatai/meta-llama-3.1-70B-Instruct-FP8 (TP=4):

Note that allreduce fusion reduces TPOT at low QPS but increases it at high QPS, and increases TTFT across the board; this will be addressed in #24248 and #24252.

📊 TTFT Median (ms)

| Source | QPS 1.0 | QPS 5.0 | QPS 10.0 | QPS 15.0 | QPS 20.0 | QPS inf |
|---|---|---|---|---|---|---|
| none | 72.01 | 76.32 | 89.95 | 112.80 | 136.13 | 12852.81 |
| none_fusion_attention | 71.10 | 75.38 | 93.56 | 116.45 | 133.33 | 12662.96 |
| none_fusion_attention_allreduce | 74.16 | 77.85 | 94.50 | 123.77 | 139.41 | 21980.02 |
| rms_quant | 74.95 | 78.65 | 90.93 | 115.75 | 162.43 | 12620.21 |
| rms_quant_fusion_attention | 73.77 | 76.84 | 94.10 | 124.59 | 146.62 | 20634.74 |
| rms_quant_fusion_attention_allreduce | 75.12 | 77.69 | 86.11 | 112.16 | 199.10 | 13078.33 |

📊 TPOT Median (ms)

| Source | QPS 1.0 | QPS 5.0 | QPS 10.0 | QPS 15.0 | QPS 20.0 | QPS inf |
|---|---|---|---|---|---|---|
| none | 9.66 | 12.72 | 20.68 | 31.90 | 46.11 | 257.66 |
| none_fusion_attention | 9.39 | 12.44 | 20.69 | 29.16 | 43.08 | 142.36 |
| none_fusion_attention_allreduce | 8.28 | 11.16 | 20.84 | 34.21 | 47.54 | 220.19 |
| rms_quant | 9.92 | 13.13 | 21.95 | 32.97 | 62.92 | 146.71 |
| rms_quant_fusion_attention | 9.60 | 12.78 | 20.58 | 34.18 | 50.37 | 146.19 |
| rms_quant_fusion_attention_allreduce | 8.43 | 11.39 | 20.38 | 31.10 | 78.54 | 145.60 |

📊 ITL Median (ms)

| Source | QPS 1.0 | QPS 5.0 | QPS 10.0 | QPS 15.0 | QPS 20.0 | QPS inf |
|---|---|---|---|---|---|---|
| none | 9.23 | 9.43 | 11.45 | 15.04 | 63.60 | 194.04 |
| none_fusion_attention | 8.97 | 9.16 | 11.29 | 20.99 | 62.27 | 188.26 |
| none_fusion_attention_allreduce | 7.85 | 8.05 | 11.51 | 23.63 | 65.02 | 191.91 |
| rms_quant | 9.47 | 9.71 | 11.75 | 16.26 | 66.33 | 188.91 |
| rms_quant_fusion_attention | 9.18 | 9.45 | 11.51 | 24.76 | 68.46 | 187.65 |
| rms_quant_fusion_attention_allreduce | 8.01 | 8.28 | 11.60 | 14.71 | 73.63 | 194.55 |
*(figure: serving_metrics_llama70B_tp4_all)*

@zou3519 (Collaborator) left a comment

My current understanding is that when we pattern match against the torch native implementation of a custom operator, we register a pattern in Inductor using that native implementation. I'm worried that this approach might be fragile. When the torch native implementation is passed through torch.compile, various graph passes can transform it, so by the time it reaches the post-grad phase (where vLLM’s pattern matching currently happens), the structure may look different.

For example, with rms_norm, it seems we’d need to modify the implementation in a non-trivial way to make it pattern match. I don't know if this is an issue in practice, but it suggests that this scheme could unintentionally constrain how custom operators need to be authored — in ways we might not fully understand yet.

It might be more robust to preserve the custom operator as-is (i.e., avoid decomposing it into torch native ops) and then perform pattern matching directly on the custom operator itself. That would make the process less sensitive to internal graph transformations.
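For illustration, matching directly on a preserved custom op with Inductor's pattern matcher could look roughly like the sketch below; the `torch.ops.vllm.*` handles are placeholders for this example, not vLLM's actual op registrations:

```python
import torch
from torch._inductor.pattern_matcher import (
    PatternMatcherPass, fwd_only, register_replacement,
)

pm_pass = PatternMatcherPass()

def pattern(x, weight, scale):
    # The high-level op stays intact, so earlier graph passes cannot
    # rewrite its internals before matching happens.
    rms = torch.ops.vllm.rms_norm(x, weight)        # placeholder handle
    return torch.ops.vllm.quant_fp8(rms, scale)     # placeholder handle

def replacement(x, weight, scale):
    return torch.ops.vllm.rms_norm_quant_fp8(x, weight, scale)  # placeholder handle

example_inputs = [
    torch.empty(8, 4096, dtype=torch.bfloat16),
    torch.empty(4096, dtype=torch.bfloat16),
    torch.empty((), dtype=torch.float32),
]
register_replacement(pattern, replacement, example_inputs, fwd_only, pm_pass)
```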

I did see that you wanted this in for the release. Was there a specific reason? If we are turning on the allreduce+rmsnorm fusion by default, for example, then could the fusion instead imply "+rmsnorm"?

@ProExpertProg (Collaborator, Author) commented

The reason this is needed is that it lets us do fusion without having to enable custom ops (-O.custom_ops='["+quant_fp8"]'). Enabling custom ops costs performance, as demonstrated in the PR description. That's because there are 4 quant ops per layer, one per matmul, and +quant_fp8 slows down the 3 remaining ones that did not get fused onto attention, for example. In the case of SequenceParallelismPass, we're not even fusing, just moving the ops around. In AllReduceRMSNormFusion, we will only fuse for some shapes, so we're again left with an inefficient implementation for the others.

I agree this is a somewhat fragile approach. I would be happy to work on a "lowering" approach where we preserve the high-level structure of ops until later. The downside is that it would require more work (I think), and we might lose access to optimizations that currently happen before our passes. But I think it wouldn't hurt Inductor in general to have a more explicit sense of converting between higher-level and lower-level representations (or we just move where our custom passes happen). We can tie this work into "autotuning custom op implementations" as done in pytorch/pytorch#164212.

@ProExpertProg (Collaborator, Author) commented

As discussed offline, we are going to proceed by merging this PR. After PTC, we will move our custom op matching passes to post_grad_custom_pre_pass to reduce the risk of Inductor transformations breaking custom op matching. Some of our matching depends on removing no-op views and slices, so we'll have to figure out a solution for that.
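For context, Inductor exposes that hook as a config entry taking a callable over the post-grad FX graph; wiring a pass in looks roughly like this minimal sketch (the pass body is elided and is not vLLM's actual plumbing):

```python
import torch
import torch._inductor.config as inductor_config

def custom_op_matching_pre_pass(graph: torch.fx.Graph) -> None:
    # Run custom-op matching here, before Inductor's own post-grad
    # passes can rewrite the node sequences the patterns rely on.
    ...

inductor_config.post_grad_custom_pre_pass = custom_op_matching_pre_pass
```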

@BoyuanFeng (Contributor) commented

View/slice no-op eliminations were upstreamed to PyTorch, so I'm wondering if that is sufficient: pytorch/pytorch#151095, pytorch/pytorch#151175.

@ProExpertProg (Collaborator, Author) commented

@BoyuanFeng wouldn't that run after post_grad_custom_pre_pass though?

@mgoin merged commit bd7157a into vllm-project:main on October 17, 2025; all 87 checks passed.