[RFC] A Graph Optimization System in vLLM using torch.compile
Motivation.
At a high level, we at Neural Magic are writing a custom compiler for Torch Dynamo that defines a system within vLLM for writing graph transformations. The main goal is a separation of concerns between high-level model definitions and certain performance-critical low-level decisions. This is especially important for optimizations that are particularly invasive to the model definitions, that break abstractions, that cross layer boundaries, or that aren't universally valid or useful. If these optimizations are written into the model definitions themselves, adding new models becomes much more difficult.
We are working on the following initial set of optimizations using this system; they are described in detail in the Proposed Passes section.
- Fusing quantize operations onto LayerNorm kernels (for both fp8 and int8, and for both static and dynamic quantization); a minimal sketch of this kind of pass follows this list
- Fusing the MLP section containing GEMM, SiLU, Mul, and quantize operations
- Rewriting GEMM + AllReduce + LayerNorm + GEMM to a fused GEMM-ReduceScatter + LayerNorm + fused AllGather-GEMM, in order to take advantage of the Flux kernels from ByteDance
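As a concrete illustration of the first pass, here is a minimal sketch (not the actual vLLM implementation) of how an RMSNorm-style normalization followed by a static fp8 quantize could be pattern-matched and replaced with a single fused call using torch.fx pattern replacement. The functions `rms_norm`, `static_fp8_quant`, and `fused_rms_norm_quant` are hypothetical stand-ins for vLLM's kernels:

```python
import torch
from torch.fx import symbolic_trace, subgraph_rewriter


def rms_norm(x, weight, eps: float = 1e-6):
    # Reference RMSNorm, standing in for vLLM's norm kernel.
    variance = x.pow(2).mean(-1, keepdim=True)
    return x * torch.rsqrt(variance + eps) * weight


def static_fp8_quant(x, scale):
    # Static quantization to fp8 with a precomputed scale.
    return (x / scale).clamp(-448.0, 448.0).to(torch.float8_e4m3fn)


def fused_rms_norm_quant(x, weight, scale):
    # Hypothetical fused op; in vLLM this would dispatch to one custom CUDA kernel.
    return static_fp8_quant(rms_norm(x, weight), scale)


# Keep the fused op as a single node instead of tracing into it.
torch.fx.wrap("fused_rms_norm_quant")


def pattern(x, weight, scale):
    return static_fp8_quant(rms_norm(x, weight), scale)


def replacement(x, weight, scale):
    return fused_rms_norm_quant(x, weight, scale)


def toy_model(x, weight, scale):
    # Stand-in for a traced model layer containing the norm + quant pair.
    hidden = x + 1.0
    return static_fp8_quant(rms_norm(hidden, weight), scale)


gm = symbolic_trace(toy_model)
subgraph_rewriter.replace_pattern(gm, pattern, replacement)
print(gm.graph)  # the norm + quant chain is now one fused_rms_norm_quant call
```

The idea is that the replacement dispatches to a single fused kernel, so the fusion decision lives in a pass rather than in every model definition.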
Although this system operates as a custom compiler inside of Torch Dynamo, it’s best to think of it as an optimization system in vLLM rather than a compiler. Instead of a vertical compiler stack that lowers high-level tensor operations through successive layers of IR, we are taking the simple and pragmatic approach of improving vLLM’s existing ecosystem of custom kernels rather than replacing it.
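To make the shape of the system concrete, here is a minimal sketch under our own naming assumptions (none of these names are an existing vLLM API) of a custom torch.compile backend that applies vLLM-controlled FX passes and then hands the rewritten graph to Inductor:

```python
import torch
from torch._inductor.compile_fx import compile_fx  # private API, subject to change


def fuse_rms_norm_quant(gm: torch.fx.GraphModule) -> torch.fx.GraphModule:
    # Placeholder for a pattern-replacement pass like the sketch above.
    return gm


VLLM_PASSES = [fuse_rms_norm_quant]


def vllm_backend(gm: torch.fx.GraphModule, example_inputs):
    # 1. Run vLLM-owned graph rewrites that target our custom kernels.
    for graph_pass in VLLM_PASSES:
        gm = graph_pass(gm)
    # 2. Hand whatever remains to Inductor rather than reimplementing codegen.
    return compile_fx(gm, example_inputs)


# Usage (hypothetical): `model` is any nn.Module.
# compiled_model = torch.compile(model, backend=vllm_backend)
```

In this arrangement Dynamo handles graph capture and guards, Inductor handles generic codegen, and the pass list stays a small surface that vLLM developers control.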
Going forward, based on our experience at Neural Magic with what worked well in DeepSparse, we have a perspective on how graph optimizations should fit into vLLM and how they should fit in with the PyTorch team’s plans for torch.compile. In short, we think:
- A graph optimization/compilation system can be a force multiplier for vLLM developers.
- torch.compile is unlikely to be good enough to replace custom kernels, at least for linear layers.
- vLLM should not treat torch.compile as a black box.
- We should build a system that vLLM developers control and that interoperates well with TorchInductor.
- This graph optimization system should be kept lightweight; vLLM should not try to become a graph compiler.
Proposed Change.
Feedback Period.
No response
CC List.
No response
Any Other Things.
No response