[RFC]: A Graph Optimization System in vLLM using torch.compile #6378

@bnellnm

Description

Motivation.

At a high level, we at Neural Magic are writing a custom compiler backend for Torch Dynamo to define a system within vLLM where we can write graph transformations. The main goal is a separation of concerns between high-level model definitions and certain performance-critical low-level decisions. This is especially important for optimizations that are invasive to the model definitions, break abstractions, cross boundaries between layers, or aren't universally valid or useful. If these optimizations are baked into the model definitions, it becomes much more difficult to add new models.
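
Concretely, a Dynamo backend is just a callable that receives an FX `GraphModule` plus example inputs and returns a compiled callable. A minimal sketch of the shape of such a system follows; `FUSION_PASSES` and `vllm_backend` are hypothetical names for illustration, not vLLM APIs:

```python
import torch
from torch.fx import GraphModule

# Hypothetical pass registry: each entry takes an FX GraphModule and
# rewrites its graph in place (e.g. fusing quantize onto a norm kernel).
FUSION_PASSES: list = []

def vllm_backend(gm: GraphModule, example_inputs):
    for graph_pass in FUSION_PASSES:
        graph_pass(gm)
    gm.recompile()      # regenerate code for the rewritten graph
    return gm.forward   # run the rewritten graph eagerly

model = torch.nn.Linear(16, 16)
compiled = torch.compile(model, backend=vllm_backend)
out = compiled(torch.randn(2, 16))
```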

We are working on the following initial set of optimizations using this system, described in detail in the Proposed Passes section; a sketch of one such rewrite follows the list.

  • Fusing quantize operations onto LayerNorm kernels (for both fp8 and int8, and for both static and dynamic quantization)
  • Fusing the MLP section containing GEMM, SiLU, Mul, and quantize operations
  • Rewriting GEMM + AllReduce + LayerNorm + GEMM into fused GEMM-ReduceScatter + LayerNorm + fused AllGather-GEMM, in order to take advantage of the Flux kernels from ByteDance
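
To make the first bullet concrete, here is a minimal sketch of what such a pass could look like using torch.fx's subgraph rewriter. The functions `rms_norm`, `quant_fp8_static`, and `fused_rms_norm_quant` are hypothetical pure-PyTorch stand-ins; vLLM's actual kernels live behind `torch.ops` and have different signatures:

```python
import torch
import torch.fx
from torch.fx import subgraph_rewriter

def rms_norm(x, weight):
    # Stand-in for the unfused norm kernel.
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + 1e-6) * weight

def quant_fp8_static(x, scale):
    # Stand-in for a static fp8 quantize (clamped to the fp8 e4m3 range).
    return (x / scale).clamp(-448.0, 448.0)

def fused_rms_norm_quant(x, weight, scale):
    # Stand-in for a single fused CUDA kernel.
    return quant_fp8_static(rms_norm(x, weight), scale)

# Keep the fused op as one opaque node when FX traces the replacement.
torch.fx.wrap("fused_rms_norm_quant")

def pattern(x, weight, scale):
    return quant_fp8_static(rms_norm(x, weight), scale)

def replacement(x, weight, scale):
    return fused_rms_norm_quant(x, weight, scale)

def fuse_norm_quant(gm: torch.fx.GraphModule) -> None:
    subgraph_rewriter.replace_pattern(gm, pattern, replacement)

class NormThenQuant(torch.nn.Module):
    def forward(self, x, weight, scale):
        return quant_fp8_static(rms_norm(x, weight), scale)

gm = torch.fx.symbolic_trace(NormThenQuant())
fuse_norm_quant(gm)  # graph now contains a single fused_rms_norm_quant call
```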

Although this system operates as a custom compiler inside Torch Dynamo, it is best thought of as an optimization system in vLLM rather than a compiler. Instead of building a vertical compiler stack that lowers high-level tensor operations through successive layers of IR, we are taking the simple and pragmatic approach of improving vLLM's existing ecosystem of custom kernels rather than replacing it.

Going forward, based on our experience at Neural Magic with what worked well in DeepSparse, we have a perspective on how graph optimizations should fit into vLLM and how they should align with the PyTorch team's plans for torch.compile. In short, we think:

  • A graph optimization/compilation system can be a power multiplier for vLLM developers.
  • torch.compile is unlikely to be good enough to replace custom kernels, at least for linear layers.
  • vLLM should not treat torch.compile as a black box.
  • We should build a system that vLLM developers control and that interoperates well with Torch Inductor (see the sketch after this list).
  • This graph optimization system should be kept lightweight: vLLM should not try to become a graph compiler.
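
As a rough sketch of the last two points: a vLLM-controlled backend can run its own passes first and then delegate the rest of compilation to Inductor, rather than treating torch.compile as an opaque end-to-end pipeline. Note that `compile_fx` is an internal PyTorch entry point subject to change, and `GRAPH_PASSES` is again a hypothetical registry:

```python
import torch
from torch._inductor.compile_fx import compile_fx  # internal/private API

GRAPH_PASSES: list = []  # hypothetical: vLLM-controlled rewrites run first

def vllm_then_inductor(gm: torch.fx.GraphModule, example_inputs):
    for graph_pass in GRAPH_PASSES:
        graph_pass(gm)
    gm.recompile()
    # Hand the rewritten graph to Inductor for everything else
    # (pointwise fusion, codegen) instead of bypassing it entirely.
    return compile_fx(gm, example_inputs)

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU())
compiled = torch.compile(model, backend=vllm_then_inductor)
out = compiled(torch.randn(2, 16))
```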

Proposed Change.

#6377, #9886

Feedback Period.

No response

CC List.

No response

Any Other Things.

No response
