Log how much time loading a compiled artifact takes #16848
Conversation
The code change looks good to me. I didn't expect it to take so long, though.
When people say they don't like the "torch.compile startup time", they
mean two things:
1) the cold start time
2) the warm start time (when the vLLM disk cache has already been populated).
We had logging for (1) but not for (2); this PR adds logging for (2).
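A minimal sketch of the warm-start timing this adds, assuming a hypothetical cache object with a `.load()` method standing in for the cache-loading path in `vllm/compilation/backends.py`; the names and cache API here are illustrative, not the actual vLLM internals:
```python
import logging
import time

logger = logging.getLogger(__name__)


def load_compiled_graph_from_cache(cache, shape):
    """Warm-start path: load an already-compiled graph from the vLLM disk cache.

    `cache` and its `.load()` method are hypothetical stand-ins for the real
    cache machinery in vllm/compilation/backends.py.
    """
    start = time.monotonic()
    compiled_graph = cache.load(shape)  # assumed cache API
    elapsed = time.monotonic() - start
    # Mirrors the log line added by this PR (see the test plan output below).
    logger.info(
        "Directly load the compiled graph(s) for shape %s from the cache, "
        "took %.3f s", shape, elapsed)
    return compiled_graph
```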
Test Plan:
I ran `VLLM_USE_V1=1 python benchmark_latency.py --model meta-llama/Meta-Llama-3-8B --batch-size 1 -O '{"level": 3, "compile_sizes": {1, 2}}'` and observed the following logs:
```
INFO 04-18 08:26:11 [backends.py:431] Dynamo bytecode transform time: 5.03 s
INFO 04-18 08:26:15 [backends.py:120] Directly load the compiled graph(s) for shape None from the cache, took 4.190 s
INFO 04-18 08:26:18 [kv_cache_utils.py:634] GPU KV cache size: 532,032 tokens
```
Side note: it's probably not good that loading from the cache takes 4 seconds?
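For reference, a sketch of the same compilation settings passed programmatically through the offline `LLM` API; this assumes the `compilation_config` constructor argument accepts the same options as `-O`, and is illustrative rather than part of this PR:
```python
import os

# Force the V1 engine, matching the benchmark command above.
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM

# Assumed equivalent of `-O '{"level": 3, "compile_sizes": {1, 2}}'`:
# compile at level 3 and pre-compile graphs for batch sizes 1 and 2.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B",
    compilation_config={"level": 3, "compile_sizes": [1, 2]},
)

# On a warm start, the "Directly load the compiled graph(s) ... took X s"
# line added by this PR should appear in the initialization logs.
print(llm.generate("Hello, world!"))
```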
Signed-off-by: rzou <zou3519@gmail.com>