[Performance]: The impact of CPU on vLLM performance is significant. #8147

Open

skylee-01 opened this issue Sep 4, 2024 · 11 comments
Labels: performance (Performance-related issues)

@skylee-01 commented Sep 4, 2024

Proposal to improve performance

We used the same GPU on two machines with different CPUs and drew the following conclusions.
Setup: the GPU is an RTX 3090; the CPU was upgraded from a Xeon Gold 6240 to an i9-12900K. The impact is as follows:
a. vLLM achieved a 3.8x speedup in the agent scenario.
b. TGI achieved a 1.23x speedup in the agent scenario.
c. vLLM still has latency issues, but latency dropped to 100 ms (previously 300 ms).
d. GPU utilization increased from 70% to 90%.

From the stress-test data, it is evident that vLLM relies heavily on CPU performance.
What are the main factors behind this CPU dependence, and how can they be optimized?
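
For context, here is a minimal sketch of the kind of concurrent stress test that produces numbers like these; the endpoint URL, model name, prompt, and concurrency below are placeholders, and vLLM's repository also ships a fuller `benchmarks/benchmark_serving.py` for this purpose:

```python
import asyncio
import time

import aiohttp  # pip install aiohttp

# Placeholder endpoint/model/prompt -- adjust to your deployment.
URL = "http://localhost:8000/v1/completions"
MODEL = "meta-llama/Llama-2-7b-hf"
PROMPT = "You are an agent. Decide the next tool call:"


async def one_request(session: aiohttp.ClientSession) -> float:
    """Send one short agent-style completion and return its latency in seconds."""
    payload = {"model": MODEL, "prompt": PROMPT, "max_tokens": 64}
    start = time.perf_counter()
    async with session.post(URL, json=payload) as resp:
        await resp.json()
    return time.perf_counter() - start


async def main(concurrency: int = 32) -> None:
    # Fire `concurrency` requests at once -- the pattern that exposes
    # CPU-bound scheduling/tokenization overhead in the server.
    async with aiohttp.ClientSession() as session:
        latencies = await asyncio.gather(
            *(one_request(session) for _ in range(concurrency))
        )
    print(f"mean latency: {sum(latencies) / len(latencies) * 1000:.1f} ms")


if __name__ == "__main__":
    asyncio.run(main())
```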

skylee-01 added the performance (Performance-related issues) label Sep 4, 2024
@skylee-01 (Author)

Related experiments: #7540

@skylee-01 (Author)

@WoosukKwon @youkaichao Could you please provide some assistance?

@youkaichao (Member)

What vLLM version are you using?

@skylee-01 (Author)

> What vLLM version are you using?

0.5.5
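
For reference, the installed version can be confirmed from vLLM's `__version__` attribute:

```python
import vllm

print(vllm.__version__)  # e.g. 0.5.5
```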

@youkaichao (Member)

We are optimizing the CPU time; please stay tuned. vLLM should not be so dependent on CPU performance in the future.

@skylee-01 (Author)

> We are optimizing the CPU time; please stay tuned. vLLM should not be so dependent on CPU performance in the future.

Why is vLLM currently so dependent on the CPU, and what are the directions for optimization?
Our team would also like to participate in vLLM's development and contribute code to the community.

@youkaichao (Member)

The CPU needs to serve HTTP requests and also prepare lots of input data for the GPU, which changes at every step (because of continuous batching and auto-regressive LLM decoding).

For some examples of this line of optimization, see #7000 and #8092.

Contributions are definitely welcome!
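
To make the per-step CPU work concrete: under continuous batching the set of running sequences changes every decode step, so the input tensors must be rebuilt on the CPU before each forward pass. A simplified, hypothetical sketch (not vLLM's actual scheduler code):

```python
import torch


def prepare_decode_inputs(seqs: list[list[int]]) -> dict[str, torch.Tensor]:
    """Rebuild per-step GPU inputs on the CPU.

    `seqs` holds the token IDs of each currently running sequence; its
    composition changes every step under continuous batching, so this
    work sits on the critical path of every decode iteration.
    """
    # The last generated token of each sequence is the next model input.
    input_ids = torch.tensor([s[-1] for s in seqs], dtype=torch.long)
    # Position of that token within its own sequence.
    positions = torch.tensor([len(s) - 1 for s in seqs], dtype=torch.long)
    # How much KV cache the attention kernel must read per sequence.
    context_lens = torch.tensor([len(s) for s in seqs], dtype=torch.int32)
    # Host-to-device copies are also CPU-driven work on the critical path.
    return {
        "input_ids": input_ids.cuda(non_blocking=True),
        "positions": positions.cuda(non_blocking=True),
        "context_lens": context_lens.cuda(non_blocking=True),
    }
```

A slower CPU lengthens this gap between GPU kernel launches, which is consistent with the GPU utilization jump (70% to 90%) reported above after the CPU upgrade.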

@skylee-01 (Author)

> The CPU needs to serve HTTP requests and also prepare lots of input data for the GPU, which changes at every step (because of continuous batching and auto-regressive LLM decoding).
>
> For some examples of this line of optimization, see #7000 and #8092.
>
> Contributions are definitely welcome!

Our team has developed some speculative-decoding features on top of vLLM, which we use internally and which have yielded good performance benefits. How can we join the vLLM project, and where would be a good place to start?

@youkaichao (Member)

You're welcome to send emails to vllm-questions@lists.berkeley.edu.

@robertgshaw2-neuralmagic (Collaborator)

Really interesting. Thanks for reporting. The GPUs are getting fast :)

@WoosukKwon (Collaborator) commented Sep 5, 2024

Hi @skylee-01, thanks for reporting this! We also recently discovered the same problem, and we plan to do more optimizations to mitigate the CPU effect.

vLLM is a fully open, community-driven project, so we'd appreciate any contributions, including submitting or reviewing PRs, answering questions, and helping with documentation.
