[RFC] A Graph Optimization System in vLLM using torch.compile
Motivation.
At a high level, we at Neural Magic are writing a custom compiler for Torch Dynamo that defines a system within vLLM for writing graph transformations. The main goal is a separation of concerns between high-level model definitions and certain performance-critical low-level decisions. This is especially important for optimizations that are particularly invasive to the model definitions, that break abstractions, that cross layer boundaries, or that aren't universally valid or useful. If these optimizations are written into the model definitions themselves, adding new models becomes much more difficult.
We are working on the following initial set of optimizations using this system; they are described in detail in the Proposed Passes section.
- Fusing quantize operations onto LayerNorm kernels (for both fp8 and int8, and for both static and dynamic quantization); a minimal sketch of this kind of pass follows this list
- Fusing the MLP section containing GEMM, SiLU, Mul, and quantize operations
- Rewriting GEMM + AllReduce + LayerNorm + GEMM to a fused GEMM-ReduceScatter + LayerNorm + fused AllGather-GEMM, in order to take advantage of the Flux kernels from ByteDance
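As a concrete illustration of the first pass, here is a minimal sketch (not the actual vLLM implementation) of how an RMSNorm-style normalization followed by a static fp8 quantize could be pattern-matched and replaced with a single fused call using torch.fx pattern replacement. The functions `rms_norm`, `static_fp8_quant`, and `fused_rms_norm_quant` are hypothetical stand-ins for vLLM's kernels:

```python
import torch
from torch.fx import symbolic_trace, subgraph_rewriter


def rms_norm(x, weight, eps: float = 1e-6):
    # Reference RMSNorm, standing in for vLLM's norm kernel.
    variance = x.pow(2).mean(-1, keepdim=True)
    return x * torch.rsqrt(variance + eps) * weight


def static_fp8_quant(x, scale):
    # Static quantization to fp8 with a precomputed scale.
    return (x / scale).clamp(-448.0, 448.0).to(torch.float8_e4m3fn)


def fused_rms_norm_quant(x, weight, scale):
    # Hypothetical fused op; in vLLM this would dispatch to one custom CUDA kernel.
    return static_fp8_quant(rms_norm(x, weight), scale)


# Keep the fused op as a single node instead of tracing into it.
torch.fx.wrap("fused_rms_norm_quant")


def pattern(x, weight, scale):
    return static_fp8_quant(rms_norm(x, weight), scale)


def replacement(x, weight, scale):
    return fused_rms_norm_quant(x, weight, scale)


def toy_model(x, weight, scale):
    # Stand-in for a traced model layer containing the norm + quant pair.
    hidden = x + 1.0
    return static_fp8_quant(rms_norm(hidden, weight), scale)


gm = symbolic_trace(toy_model)
subgraph_rewriter.replace_pattern(gm, pattern, replacement)
print(gm.graph)  # the norm + quant chain is now one fused_rms_norm_quant call
```

The idea is that the replacement dispatches to a single fused kernel, so the fusion decision lives in a pass rather than in every model definition.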
Although this system operates as a custom compiler inside of Torch Dynamo, it’s best to think of it as an optimization system in vLLM rather than a compiler. Instead of a vertical compiler stack that lowers high-level tensor operations through successive layers of IR, we are taking the simple and pragmatic approach of improving vLLM’s existing ecosystem of custom kernels rather than replacing it.
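To make the shape of the system concrete, here is a minimal sketch under our own naming assumptions (none of these names are an existing vLLM API) of a custom torch.compile backend that applies vLLM-controlled FX passes and then hands the rewritten graph to Inductor:

```python
import torch
from torch._inductor.compile_fx import compile_fx  # private API, subject to change


def fuse_rms_norm_quant(gm: torch.fx.GraphModule) -> torch.fx.GraphModule:
    # Placeholder for a pattern-replacement pass like the sketch above.
    return gm


VLLM_PASSES = [fuse_rms_norm_quant]


def vllm_backend(gm: torch.fx.GraphModule, example_inputs):
    # 1. Run vLLM-owned graph rewrites that target our custom kernels.
    for graph_pass in VLLM_PASSES:
        gm = graph_pass(gm)
    # 2. Hand whatever remains to Inductor rather than reimplementing codegen.
    return compile_fx(gm, example_inputs)


# Usage (hypothetical): `model` is any nn.Module.
# compiled_model = torch.compile(model, backend=vllm_backend)
```

In this arrangement Dynamo handles graph capture and guards, Inductor handles generic codegen, and the pass list stays a small surface that vLLM developers control.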
Going forward, based on our experience at Neural Magic with what worked well in DeepSparse, we have a perspective on how graph optimizations should fit into vLLM and how they should fit in with the PyTorch team’s plans for torch.compile. In short, we think:
- A graph optimization/compilation system can be a force multiplier for vLLM developers.
- torch.compile is unlikely to be good enough to replace custom kernels, at least for linear layers.
- vLLM should not treat torch.compile as a black box.
- We should build a system that vLLM developers control and that interoperates well with TorchInductor.
- This graph optimization system should be kept lightweight; vLLM should not try to become a graph compiler.
Proposed Change.
Feedback Period.
No response
CC List.
No response
Any Other Things.
No response