
[torch.compile] initial integration #8949

Closed · wants to merge 44 commits

Conversation

@youkaichao (Member) commented Sep 29, 2024

TODOs (can be future PRs):

  • support embedding model, encoder-decoder model, multi-modality model
  • support attention backend other than flash attention
  • support models other than llama
  • support TP
  • support PP
  • test and integrate lora and quantization
  • perf testing
  • profile and investigate compilation time reduction


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which executes a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@youkaichao (Member, Author) commented Sep 29, 2024

A simple test on H100:

Throughput:

$ # main branch
$ python benchmarks/benchmark_throughput.py --input-len 256 --output-len 256 --model meta-llama/Meta-Llama-3-8B
Throughput: 28.99 requests/s, 14843.59 tokens/s

$ # this branch
$ python benchmarks/benchmark_throughput.py --input-len 256 --output-len 256 --model meta-llama/Meta-Llama-3-8B
Throughput: 28.89 requests/s, 14792.03 tokens/s

$ # this branch
$ VLLM_TORCH_COMPILE_LEVEL=2 python benchmarks/benchmark_throughput.py --input-len 256 --output-len 256 --model meta-llama/Meta-Llama-3-8B
Throughput: 29.90 requests/s, 15309.14 tokens/s

That's about a 3.5% throughput improvement over the uncompiled run on this branch.
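
A quick sanity check of that figure from the two runs on this branch (plain Python, not part of the PR):

```python
# Recompute the quoted speedup from the request throughputs above.
baseline = 28.89  # requests/s, this branch, no compilation
compiled = 29.90  # requests/s, this branch, VLLM_TORCH_COMPILE_LEVEL=2
print(f"{(compiled / baseline - 1) * 100:.1f}% improvement")  # prints: 3.5% improvement
```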

Single-request serving, output token throughput (tok/s):

| Multi-step | Torch Compile Level 0 (no compilation) | Torch Compile Level 2 | Torch Compile Level 3 |
|---|---|---|---|
| 1 | 114.32 | 115.61 (+1.1%) | 116.92 (+2.3%) |
| 8 | 119.37 | 120.39 (+0.8%) | 122.15 (+2.3%) |
| 16 | 119.82 | N/A | N/A |

@youkaichao (Member, Author) commented Sep 30, 2024

Pipeline parallel

When I enable pipeline parallel, there's a dynamo error:

[rank0]:     var = tx.output.side_effects.track_object_new(
[rank0]:   File "/data/youkaichao/miniconda/envs/vllm/lib/python3.9/site-packages/torch/_dynamo/side_effects.py", line 243, in track_object_new
[rank0]:     obj = object_new(user_cls)
[rank0]: torch._dynamo.exc.InternalTorchDynamoError: object.__new__(IntermediateTensors) is not safe, use IntermediateTensors.__new__()

[rank0]: from user code:
[rank0]:    File "/data/youkaichao/vllm/vllm/model_executor/models/llama.py", line 450, in forward
[rank0]:     model_output = self.model(input_ids, positions, kv_caches,
[rank0]:   File "/data/youkaichao/miniconda/envs/vllm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/data/youkaichao/vllm/vllm/model_executor/models/llama.py", line 339, in forward
[rank0]:     return IntermediateTensors({

cc @anijain2305

It turns out to be caused by msgpack: the class definition

class IntermediateTensors(

gives IntermediateTensors a custom __new__ (hence the object.__new__ error above). When I change it to a normal dataclass, it works.
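
A minimal sketch of the workaround (not the PR's actual code; the real vLLM class carries more state): a plain dataclass leaves object.__new__ untouched, so Dynamo can construct and return it inside a compiled region, whereas a class with a custom __new__ from a serialization base class trips the InternalTorchDynamoError above.

```python
# Sketch only, assuming the fix is "replace the msgpack-backed class with a
# plain dataclass". A dataclass keeps object.__new__, which Dynamo can replay.
from dataclasses import dataclass, field
from typing import Dict

import torch


@dataclass
class IntermediateTensors:
    # Hypothetical simplified version of vLLM's IntermediateTensors.
    tensors: Dict[str, torch.Tensor] = field(default_factory=dict)

    def __getitem__(self, key: str) -> torch.Tensor:
        return self.tensors[key]


def stage_forward(hidden: torch.Tensor) -> IntermediateTensors:
    # Constructing and returning the object inside the traced region is
    # exactly what triggered the error for the custom-__new__ variant.
    return IntermediateTensors({"hidden_states": hidden * 2})


compiled = torch.compile(stage_forward)
print(compiled(torch.randn(4, 8))["hidden_states"].shape)  # torch.Size([4, 8])
```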

Tensor parallel

When I enable tensor parallel, it runs but the output is wrong. I'm still investigating.

anijain2305 added commits to pytorch/pytorch referencing this pull request (Sep 30 and Oct 1, 2024): "…taclass has untouched __new__" ("Seen in vllm-project/vllm#8949"; Pull Request resolved: #137044).
@youkaichao (Member, Author)

Closing as this has been moved to #9058.

@youkaichao closed this Oct 3, 2024