[issue tracker] make vllm compatible with dynamo #8821
Comments
Hello, how do you feel about #8398? In general, it's extremely hard to make all models run under torch.compile with CUDA graphs. That PR proposes a solution to selectively compile nn.Modules inside a model, with <100 lines of actual code.
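For illustration, selective compilation could look roughly like the following sketch, where only a sub-module known to work is wrapped with torch.compile (a toy model, not the code from #8398):

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(16, 16), nn.ReLU())
        # imagine this part fails under torch.compile and should stay eager
        self.head = nn.Linear(16, 4)

    def forward(self, x):
        return self.head(self.backbone(x))

model = ToyModel()
# compile only the sub-module that is known to work with torch.compile;
# the rest of the model keeps running in eager mode
model.backbone = torch.compile(model.backbone)
model(torch.randn(2, 16))
```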
Why is that the case? We can run CUDA graphs with torch.compile without any problems.
If you call torch.compile on those models and launch vLLM, you will see errors. Phi was one such case.
The problem is that not all models can be compiled as a whole; for example, roughly 50% of Hugging Face models can't.
Please take a look at #8949 for our integration plan. We will not just add one line of torch.compile.
Anything you want to discuss about vllm.
The first step to enable `torch.compile` is to use Dynamo to capture the graph. While Dynamo can handle many Python features, every time there is a Python-side change, Dynamo will try to re-compile the code. For example:
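A toy `test.py` along these lines (illustrative, not necessarily the exact snippet from the issue):

```python
# test.py
import torch

@torch.compile
def f(x: torch.Tensor, i: int):
    # `i` is a plain Python int, so Dynamo specializes the graph on its value
    return x + i

x = torch.randn(4)
for i in range(5):
    f(x, i)  # each new value of `i` can trigger a fresh compilation
```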
Running the code with `TORCH_LOGS=recompiles_verbose python test.py`, we can see that every function call is a re-compilation, because PyTorch embeds the constant into the graph, and the graph is only re-usable when `i` equals that value.

This is because `torch.compile` aims to compile a tensor program, i.e. a program that generalizes only over tensors; it does not generalize over Python integers. To solve the problem, we need to wrap the integer into a tensor, so that PyTorch re-uses the graph as long as the tensor metadata (device, shape, dtype, etc.) matches:
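A sketch of the wrapped version (again illustrative):

```python
import torch

@torch.compile
def f(x: torch.Tensor, i: torch.Tensor):
    # `i` is now a 0-dim tensor; the graph is re-used as long as its
    # metadata (device, shape, dtype) matches
    return x + i

x = torch.randn(4)
for i in range(5):
    f(x, torch.tensor(i))  # same metadata every time, so the graph is re-used
```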
This code will not trigger re-compilation.
To integrate with Dynamo, we need to carefully design the warmup scheme so that we have already compiled for all use cases and future runs do not trigger compilation. (If a new user request triggers compilation, the TTFT will be several seconds because of the compilation overhead.)
Our first goal is to remove the unnecessary Python-side changes that occur every time we run the model. The changes can be found in the following code:
We use two different batches of requests to warm up the compilation, so that PyTorch captures and compiles graphs for all the tensor variations. The final run will then reveal all the Python-side variations we have, which we need to remove.
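As a minimal standalone sketch of this two-batch warmup idea (a toy `nn.Linear` stands in for the real model; this is not vLLM's actual warmup code):

```python
import torch
import torch.nn as nn

model = torch.compile(nn.Linear(16, 16))

# warm up with two different batch sizes so that torch.compile
# captures a dynamic-shape graph up front
for batch_size in (1, 8):
    model(torch.randn(batch_size, 16))

# in this toy case the final run hits the already-compiled graph;
# in vLLM, any remaining Python-side variation would still show up
# here as a re-compilation, which is exactly what we want to find
# and remove
model(torch.randn(3, 16))
```

Running such a script under `TORCH_LOGS=recompiles_verbose` makes any leftover re-compilation visible.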
After warmup, we can see the following re-compilation: