[issue tracker] make vllm compatible with dynamo #8821
Comments
Hello, how do you feel about #8398? In general, it's extremely hard to make all models run under torch.compile with CUDA graphs. That PR proposes a solution to selectively compile nn.Modules inside a model, with <100 lines of actual code.
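For illustration, selective compilation could look roughly like the following sketch, where only a sub-module known to work is wrapped with torch.compile (a toy model, not the code from #8398):

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(16, 16), nn.ReLU())
        # imagine this part fails under torch.compile and should stay eager
        self.head = nn.Linear(16, 4)

    def forward(self, x):
        return self.head(self.backbone(x))

model = ToyModel()
# compile only the sub-module that is known to work with torch.compile;
# the rest of the model keeps running in eager mode
model.backbone = torch.compile(model.backbone)
model(torch.randn(2, 16))
```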
Why is that the case? We can run CUDA graphs with torch.compile without any problems.
If you call torch.compile on those models and launch vLLM, you will see errors. Phi was one such case.
The problem is that not all models can be compiled as a whole; for example, roughly 50% of Hugging Face models can't.
Please take a look at #8949 for our integration plan. We will not just add one line of torch.compile.
Anything you want to discuss about vllm.
The first step to enable `torch.compile` is to use Dynamo to capture the graph. While Dynamo can handle many Python features, every time there is a Python-side change, Dynamo will try to re-compile the code. For example:
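A toy `test.py` along these lines (illustrative, not necessarily the exact snippet from the issue):

```python
# test.py
import torch

@torch.compile
def f(x: torch.Tensor, i: int):
    # `i` is a plain Python int, so Dynamo specializes the graph on its value
    return x + i

x = torch.randn(4)
for i in range(5):
    f(x, i)  # each new value of `i` can trigger a fresh compilation
```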
Running the code with `TORCH_LOGS=recompiles_verbose python test.py`, we can see that every function call is a re-compilation, because PyTorch embeds the constant into the graph, and the graph is only re-usable when `i` equals that value.

This is because `torch.compile` aims to compile a tensor program, i.e. a program that generalizes only over tensors; it does not generalize over Python integers. To solve the problem, we need to wrap the integer into a tensor, so that PyTorch re-uses the graph as long as the tensor metadata (device, shape, dtype, etc.) matches:
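A sketch of the wrapped version (again illustrative):

```python
import torch

@torch.compile
def f(x: torch.Tensor, i: torch.Tensor):
    # `i` is now a 0-dim tensor; the graph is re-used as long as its
    # metadata (device, shape, dtype) matches
    return x + i

x = torch.randn(4)
for i in range(5):
    f(x, torch.tensor(i))  # same metadata every time, so the graph is re-used
```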
This code will not trigger re-compilation.
To integrate with Dynamo, we need to carefully design the warmup scheme so that we have already compiled for all use cases and future runs do not trigger compilation. (If a new user request triggers compilation, the TTFT will be several seconds because of the compilation overhead.)
Our first goal is to remove the unnecessary Python-side changes that occur every time we run the model. The changes can be found in the following code:
We use two different batches of requests to warm up the compilation, so that PyTorch captures and compiles graphs for all the tensor variations. The final run will then reveal all the Python-side variations we have, which we need to remove.
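As a minimal standalone sketch of this two-batch warmup idea (a toy `nn.Linear` stands in for the real model; this is not vLLM's actual warmup code):

```python
import torch
import torch.nn as nn

model = torch.compile(nn.Linear(16, 16))

# warm up with two different batch sizes so that torch.compile
# captures a dynamic-shape graph up front
for batch_size in (1, 8):
    model(torch.randn(batch_size, 16))

# in this toy case the final run hits the already-compiled graph;
# in vLLM, any remaining Python-side variation would still show up
# here as a re-compilation, which is exactly what we want to find
# and remove
model(torch.randn(3, 16))
```

Running such a script under `TORCH_LOGS=recompiles_verbose` makes any leftover re-compilation visible.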
After warmup, we can see the following re-compilation: