[DO NOT MERGE] 2.9, Inductor partition, standalone compile, monkeypatch fix(es) #26738
Conversation
Documentation preview: https://vllm--26738.org.readthedocs.build/en/26738/
Code Review
This pull request upgrades PyTorch to version 2.9 and introduces tests for Inductor graph partitioning. The changes span CI configuration, Dockerfiles, build scripts, and test files to support the new PyTorch version and its features. My review focuses on the correctness and maintainability of these changes. I've identified a couple of high-severity issues related to a monkey patch with a misleading comment and an undocumented change in the default behavior of a compilation flag. These should be addressed to ensure code clarity and prevent unexpected behavior for users.
vllm/env_override.py
Outdated
```python
# Copied from torch._inductor.scheduler.Scheduler.should_partition. Patches
# [this code](https://github.com/pytorch/pytorch/blob/ecb53078faf86ca1b33277df33b82985675bb011/torch/_inductor/scheduler.py#L4712-L4724)
# so that we always return True.
```
The comment here is misleading. It states that the patch makes should_partition 'always return True', but the patched function does not always return True. It only changes the behavior for the case where torch._inductor.config.triton.cudagraphs is False. The original implementation returns False in this case, preventing partitioning, while the patch returns True to enable it.
This monkey patch on a PyTorch internal function is a significant change and should be documented with precision to ensure future maintainability. Please update the comment to accurately describe what the patch does and why it's necessary. For example:
```python
# Copied from torch._inductor.scheduler.Scheduler.should_partition. Patches
# [this code](https://github.com/pytorch/pytorch/blob/ecb53078faf86ca1b33277df33b82985675bb011/torch/_inductor/scheduler.py#L4712-L4715)
# to force partitioning even when CUDA graphs are disabled. The original
# implementation returns False in this case, which prevents partitioning.
# This change is necessary to enable Inductor graph partitioning for vLLM's
# piecewise CUDAGraph mode, which may operate without `config.triton.cudagraphs`
# being globally enabled.
```

```diff
  VLLM_DP_RANK_LOCAL: int = -1
  VLLM_DP_SIZE: int = 1
- VLLM_USE_STANDALONE_COMPILE: bool = False
+ VLLM_USE_STANDALONE_COMPILE: bool = True
```
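For readers unfamiliar with the mechanism being reviewed here, this is a minimal, hypothetical sketch of how such a monkey patch is installed — rebinding a method on a class so that every instance picks up the override. The class and method bodies below are stand-ins, not the real PyTorch scheduler:

```python
class Scheduler:
    # Stand-in for torch._inductor.scheduler.Scheduler (illustrative only).
    def should_partition(self, node) -> bool:
        # Upstream refuses to partition in the non-cudagraph case.
        return False

def should_partition_patched(self, node) -> bool:
    # Override: force partitioning where the original would refuse.
    return True

# Install the patch by rebinding the method on the class; all existing and
# future Scheduler instances now dispatch to the patched function.
Scheduler.should_partition = should_partition_patched

s = Scheduler()
assert s.should_partition(node=None) is True
```

Because the rebinding happens on the class object itself, import order matters: the patch must run before any code captures a bound reference to the original method.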
This change modifies the default behavior of VLLM_USE_STANDALONE_COMPILE from False to True. This is a significant change as it alters the default compilation path for users on PyTorch >= 2.8. While this might be intentional for the PyTorch 2.9 upgrade, it's a change in default behavior that should be clearly communicated. Could you confirm if this is the intended new default? If so, please consider adding a note to the release documentation about this change and updating the related comment on lines 493-494.
vllm/envs.py
Outdated
```python
# In torch <= 2.7 we ignore this flag; in torch >= 2.8 this is
# disabled by default.
```
This comment is now outdated due to the change in the default value of VLLM_USE_STANDALONE_COMPILE to True. It should be updated to reflect that standalone compilation is now enabled by default for PyTorch >= 2.8.
```diff
- # In torch <= 2.7 we ignore this flag; in torch >= 2.8 this is
- # disabled by default.
+ # In torch <= 2.7 we ignore this flag; in torch >= 2.8 this is
+ # enabled by default.
```
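To make the effect of the default flip concrete, here is a hypothetical sketch of boolean env-flag parsing. The helper name and accepted values are illustrative, not vLLM's actual envs.py implementation; the point is only that an unset variable now resolves to True:

```python
import os

# Hypothetical helper: read a boolean flag from the environment, falling
# back to a compiled-in default when the variable is unset.
def read_bool_env(name: str, default: bool) -> bool:
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes")

# With the new default of True, leaving the variable unset enables the
# standalone-compile path:
assert read_bool_env("VLLM_USE_STANDALONE_COMPILE", True) is True

# Users must now opt out explicitly:
os.environ["VLLM_USE_STANDALONE_COMPILE"] = "0"
assert read_bool_env("VLLM_USE_STANDALONE_COMPILE", True) is False
```

This is exactly why a default change like this deserves a release note: code that never set the variable silently changes behavior on upgrade.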
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
def should_partition_patched(self, node, should_log: bool = False) -> bool:
    # Copied from torch._inductor.scheduler.Scheduler.should_partition. Patches
    # [this code](https://github.com/pytorch/pytorch/blob/ecb53078faf86ca1b33277df33b82985675bb011/torch/_inductor/scheduler.py#L4712-L4724)
    # so that we always return True.
    """Return True if we should partition the inductor graph on this node"""

    import torch._inductor.ir as ir
    from torch._inductor.scheduler import (
        BaseSchedulerNode,
        FusedSchedulerNode,
        _custom_should_partition_fns,
    )
    from torch._inductor.utils import (
        _unstable_customized_partition_wrapper,
        is_cudagraph_unsafe_op,
        maybe_log_cudagraph_partition,
    )

    # Allow users to manually specify if a node should be partitioned
    # Can only do this for FallbackKernels
    ir_node = node.node
    if isinstance(ir_node, ir.FallbackKernel):
        operator = ir_node.op_overload
        if operator is not None and operator in _custom_should_partition_fns:
            return True

    # When not using cudagraphs, keep all kernels in the `call` function
    # instead of graph partition functions, since graph partition only brings
    # benefit to cudagraph
    if (
        not torch._inductor.config.triton.cudagraphs
        and _unstable_customized_partition_wrapper.wrapper is None
    ):
        return True

    # avoid duplicating logs when should_partition is called multiple times
    # on the same node
    def noop_log(msg: str, node: BaseSchedulerNode | None) -> None:
        return

    log_partition_reason = maybe_log_cudagraph_partition if should_log else noop_log

    if isinstance(node, FusedSchedulerNode):
        return any(self.should_partition(snode) for snode in node.snodes)

    assert node.node is not None

    if not node.is_gpu():
        log_partition_reason("non gpu ops", node=node)
        return True

    if isinstance(node.node, ir.DeviceCopy):
        log_partition_reason("DeviceCopy ops", node=node)
        return True

    if isinstance(node.node, ir.Conditional):
        log_partition_reason("Conditional ops", node=node)
        return True

    if getattr(node.node, "unbacked_bindings", None):
        log_partition_reason("unbacked binding ops", node=node)
        return True

    if is_cudagraph_unsafe_op(node.node):
        log_partition_reason("CUDAGraph-unsafe custom ops", node=node)
        return True

    return False
```
Patched scheduler still skips inductor partitioning
The new should_partition_patched is commented as forcing Inductor to always partition, but it still ends with return False. When none of the earlier conditions match, the monkeypatched method behaves exactly like the upstream implementation and refuses to partition, so the monkeypatch has no effect in the cases the commit is trying to fix. As a result, cudagraph‑unsafe ops can still be left unpartitioned and the referenced issue remains unresolved.
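To make this point concrete, here is a toy sketch (a hypothetical `Node` class, nothing from vLLM or PyTorch) showing that a predicate whose body still falls through to `return False` is indistinguishable from the original for any node matching none of the special cases:

```python
# Hypothetical node type used only for this illustration.
class Node:
    def __init__(self, is_device_copy: bool = False):
        self.is_device_copy = is_device_copy

def original_predicate(node: Node) -> bool:
    if node.is_device_copy:
        return True
    return False

def patched_predicate(node: Node) -> bool:
    # New early-return cases can be added above the fall-through...
    if node.is_device_copy:
        return True
    # ...but this fall-through preserves upstream behavior for every node
    # that hits no special case, so the "patch" has no effect there.
    return False

plain = Node()
assert patched_predicate(plain) is False
assert original_predicate(plain) == patched_predicate(plain)
```

A patch that truly "always returns True" would replace the final fall-through, which is what the review is asking the author to either do or stop claiming in the comment.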
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed ee24dd3 to a654b46
Force-pushed 44de6d0 to a46b7b3
Force-pushed 37947bd to a7d1db9
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed eff712d to 67c4d19
This (Quantization test) may relate to #26878
Force-pushed 4e2976b to e811cb5
This pull request has merge conflicts that must be resolved before it can be merged.
@BoyuanFeng The LoRA test failure seems relevant: perhaps we need to review how we deal with weak_ref_tensors under inductor partitioning, though I'm also not sure why there's a CPU tensor there.
Force-pushed e811cb5 to b4518de
Force-pushed b4518de to 6f1222c
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed 6f1222c to 47bbdff
Force-pushed 47bbdff to 06601b7
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed 06601b7 to f6b27e6
Signed-off-by: ProExpertProg <lgovedic@redhat.com>
Force-pushed f2bde49 to 69bef4b
In-progress PR to test inductor partitioning in CI.

Includes:

Past fixes in this PR now in main:
- `fused_moe::grouped_topk` fix

Past fixes no longer necessary:
- `fused_moe::grouped_topk` - superseded by disable graph partition in custom op #26952