[DO NOT MERGE] 2.9, Inductor partition, standalone compile, monkeypatch fix(es) #26738
Conversation
Documentation preview: https://vllm--26738.org.readthedocs.build/en/26738/
Code Review
This pull request upgrades PyTorch to version 2.9 and introduces tests for Inductor graph partitioning. The changes span CI configuration, Dockerfiles, build scripts, and test files to support the new PyTorch version and its features. My review focuses on the correctness and maintainability of these changes. I've identified a couple of high-severity issues related to a monkey patch with a misleading comment and an undocumented change in the default behavior of a compilation flag. These should be addressed to ensure code clarity and prevent unexpected behavior for users.
vllm/env_override.py
Outdated
```python
# Copied from torch._inductor.scheduler.Scheduler.should_partition. Patches
# [this code](https://github.com/pytorch/pytorch/blob/ecb53078faf86ca1b33277df33b82985675bb011/torch/_inductor/scheduler.py#L4712-L4724)
# so that we always return True.
```
The comment here is misleading. It states that the patch makes should_partition 'always return True', but the patched function does not always return True. It only changes the behavior for the case where torch._inductor.config.triton.cudagraphs is False. The original implementation returns False in this case, preventing partitioning, while the patch returns True to enable it.
This monkey patch on a PyTorch internal function is a significant change and should be documented with precision to ensure future maintainability. Please update the comment to accurately describe what the patch does and why it's necessary. For example:
```python
# Copied from torch._inductor.scheduler.Scheduler.should_partition. Patches
# [this code](https://github.com/pytorch/pytorch/blob/ecb53078faf86ca1b33277df33b82985675bb011/torch/_inductor/scheduler.py#L4712-L4715)
# to force partitioning even when CUDA graphs are disabled. The original
# implementation returns False in this case, which prevents partitioning.
# This change is necessary to enable Inductor graph partitioning for vLLM's
# piecewise CUDAGraph mode, which may operate without `config.triton.cudagraphs`
# being globally enabled.
```

```diff
  VLLM_DP_RANK_LOCAL: int = -1
  VLLM_DP_SIZE: int = 1
- VLLM_USE_STANDALONE_COMPILE: bool = False
+ VLLM_USE_STANDALONE_COMPILE: bool = True
```
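For readers unfamiliar with the mechanism being reviewed here, this is a minimal, hypothetical sketch of how such a monkey patch is installed — rebinding a method on a class so that every instance picks up the override. The class and method bodies below are stand-ins, not the real PyTorch scheduler:

```python
class Scheduler:
    # Stand-in for torch._inductor.scheduler.Scheduler (illustrative only).
    def should_partition(self, node) -> bool:
        # Upstream refuses to partition in the non-cudagraph case.
        return False

def should_partition_patched(self, node) -> bool:
    # Override: force partitioning where the original would refuse.
    return True

# Install the patch by rebinding the method on the class; all existing and
# future Scheduler instances now dispatch to the patched function.
Scheduler.should_partition = should_partition_patched

s = Scheduler()
assert s.should_partition(node=None) is True
```

Because the rebinding happens on the class object itself, import order matters: the patch must run before any code captures a bound reference to the original method.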
This change modifies the default behavior of VLLM_USE_STANDALONE_COMPILE from False to True. This is a significant change as it alters the default compilation path for users on PyTorch >= 2.8. While this might be intentional for the PyTorch 2.9 upgrade, it's a change in default behavior that should be clearly communicated. Could you confirm if this is the intended new default? If so, please consider adding a note to the release documentation about this change and updating the related comment on lines 493-494.
vllm/envs.py
Outdated
```python
# In torch <= 2.7 we ignore this flag; in torch >= 2.8 this is
# disabled by default.
```
This comment is now outdated due to the change in the default value of VLLM_USE_STANDALONE_COMPILE to True. It should be updated to reflect that standalone compilation is now enabled by default for PyTorch >= 2.8.
```diff
- # In torch <= 2.7 we ignore this flag; in torch >= 2.8 this is
- # disabled by default.
+ # In torch <= 2.7 we ignore this flag; in torch >= 2.8 this is
+ # enabled by default.
```
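To make the effect of the default flip concrete, here is a hypothetical sketch of boolean env-flag parsing. The helper name and accepted values are illustrative, not vLLM's actual envs.py implementation; the point is only that an unset variable now resolves to True:

```python
import os

# Hypothetical helper: read a boolean flag from the environment, falling
# back to a compiled-in default when the variable is unset.
def read_bool_env(name: str, default: bool) -> bool:
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes")

# With the new default of True, leaving the variable unset enables the
# standalone-compile path:
assert read_bool_env("VLLM_USE_STANDALONE_COMPILE", True) is True

# Users must now opt out explicitly:
os.environ["VLLM_USE_STANDALONE_COMPILE"] = "0"
assert read_bool_env("VLLM_USE_STANDALONE_COMPILE", True) is False
```

This is exactly why a default change like this deserves a release note: code that never set the variable silently changes behavior on upgrade.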
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
def should_partition_patched(self, node, should_log: bool = False) -> bool:
    # Copied from torch._inductor.scheduler.Scheduler.should_partition. Patches
    # [this code](https://github.com/pytorch/pytorch/blob/ecb53078faf86ca1b33277df33b82985675bb011/torch/_inductor/scheduler.py#L4712-L4724)
    # so that we always return True.
    """Return True if we should partition the inductor graph on this node"""

    import torch._inductor.ir as ir
    from torch._inductor.scheduler import (
        BaseSchedulerNode,
        FusedSchedulerNode,
        _custom_should_partition_fns,
    )
    from torch._inductor.utils import (
        _unstable_customized_partition_wrapper,
        is_cudagraph_unsafe_op,
        maybe_log_cudagraph_partition,
    )

    # Allow users to manually specify if a node should be partitioned
    # Can only do this for FallbackKernels
    ir_node = node.node
    if isinstance(ir_node, ir.FallbackKernel):
        operator = ir_node.op_overload
        if operator is not None and operator in _custom_should_partition_fns:
            return True

    # When not using cudagraphs, keep all kernels in the `call` function
    # instead of graph partition functions, since graph partition only brings
    # benefit to cudagraph
    if (
        not torch._inductor.config.triton.cudagraphs
        and _unstable_customized_partition_wrapper.wrapper is None
    ):
        return True

    # avoid duplicating logs when should_partition is called multiple times
    # on the same node
    def noop_log(msg: str, node: BaseSchedulerNode | None) -> None:
        return

    log_partition_reason = maybe_log_cudagraph_partition if should_log else noop_log

    if isinstance(node, FusedSchedulerNode):
        return any(self.should_partition(snode) for snode in node.snodes)

    assert node.node is not None

    if not node.is_gpu():
        log_partition_reason("non gpu ops", node=node)
        return True

    if isinstance(node.node, ir.DeviceCopy):
        log_partition_reason("DeviceCopy ops", node=node)
        return True

    if isinstance(node.node, ir.Conditional):
        log_partition_reason("Conditional ops", node=node)
        return True

    if getattr(node.node, "unbacked_bindings", None):
        log_partition_reason("unbacked binding ops", node=node)
        return True

    if is_cudagraph_unsafe_op(node.node):
        log_partition_reason("CUDAGraph-unsafe custom ops", node=node)
        return True

    return False
```
Patched scheduler still skips inductor partitioning
The new should_partition_patched is commented as forcing Inductor to always partition, but it still ends with return False. When none of the earlier conditions match, the monkeypatched method behaves exactly like the upstream implementation and refuses to partition, so the monkeypatch has no effect in the cases the commit is trying to fix. As a result, cudagraph‑unsafe ops can still be left unpartitioned and the referenced issue remains unresolved.
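To make this point concrete, here is a toy sketch (a hypothetical `Node` class, nothing from vLLM or PyTorch) showing that a predicate whose body still falls through to `return False` is indistinguishable from the original for any node matching none of the special cases:

```python
# Hypothetical node type used only for this illustration.
class Node:
    def __init__(self, is_device_copy: bool = False):
        self.is_device_copy = is_device_copy

def original_predicate(node: Node) -> bool:
    if node.is_device_copy:
        return True
    return False

def patched_predicate(node: Node) -> bool:
    # New early-return cases can be added above the fall-through...
    if node.is_device_copy:
        return True
    # ...but this fall-through preserves upstream behavior for every node
    # that hits no special case, so the "patch" has no effect there.
    return False

plain = Node()
assert patched_predicate(plain) is False
assert original_predicate(plain) == patched_predicate(plain)
```

A patch that truly "always returns True" would replace the final fall-through, which is what the review is asking the author to either do or stop claiming in the comment.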
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed ee24dd3 to a654b46
Force-pushed 44de6d0 to a46b7b3
Force-pushed 37947bd to a7d1db9
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed eff712d to 67c4d19
This (Quantization test) may relate to #26878
Force-pushed 4e2976b to e811cb5
This pull request has merge conflicts that must be resolved before it can be merged.
@BoyuanFeng The LoRA test failure seems relevant: perhaps we need to review how we deal with weak_ref_tensors under inductor partitioning, though I'm also not sure why there's a CPU tensor there.
Force-pushed e811cb5 to b4518de
Force-pushed b4518de to 6f1222c
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed 6f1222c to 47bbdff
Force-pushed 47bbdff to 06601b7
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed 06601b7 to f6b27e6
Signed-off-by: ProExpertProg <lgovedic@redhat.com>
Force-pushed f2bde49 to 69bef4b
In-progress PR to test inductor partitioning in CI.

Includes:

Past fixes in this PR now in main:
- `fused_moe::grouped_topk` fix

Past fixes no longer necessary:
- `fused_moe::grouped_topk` - superseded by disable graph partition in custom op #26952