[Bug]: Loading LoRA is super slow when using tensor parallel #6072
Comments
You may want to refer to this troubleshooting guide.
I think you may be able to fix this by changing the following line: vllm/vllm/lora/worker_manager.py, Line 180 in 7c008c5
Be aware, though, that there is some instability. I was facing crashes in the custom all-reduce CUDA kernels when trying to use LoRAs of rank 2, but it worked fine with a LoRA of rank 8. Another option is to disable memory pinning here: https://github.com/vllm-project/vllm/blob/main/vllm/lora/models.py#L219. Just set that line so that pinning is disabled.
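For readers unfamiliar with the trade-off, here is a minimal standalone sketch of what memory pinning does in plain PyTorch. The tensor names and shapes are made up for illustration; this is not the actual line in vllm/lora/models.py.

```python
import torch

# Illustration only (hypothetical names/shapes, not the vllm/lora/models.py line):
# pinned (page-locked) host memory allows faster, asynchronous host-to-GPU copies,
# but allocating it is comparatively expensive, which can hurt when several
# tensor-parallel workers materialize LoRA buffers at the same time.
pin_memory = False  # the suggestion above amounts to forcing this to False

lora_a = torch.empty(4096, 8, dtype=torch.float16, pin_memory=pin_memory)
lora_b = torch.empty(8, 4096, dtype=torch.float16, pin_memory=pin_memory)

if torch.cuda.is_available():
    # With pin_memory=False this copy is cheaper to set up, but it cannot overlap
    # with compute as effectively as a copy from pinned memory would.
    lora_a_gpu = lora_a.to("cuda", non_blocking=pin_memory)
```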
Thanks for your suggestions @sampritipanda! I tried them, but none of them helped in my case. What did help, though, was adding […]. I also set […].
Hello. I have been investigating this behavior as well. I believe the root cause of the slowdown is CPU contention and throttling, particularly in an environment like Kubernetes with containers that have CPU limits. Setting […]. My test environment was Kubernetes, on a node with 80 total cores, with CPU requests set to 8 and CPU limits set to 16. Without […]
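To make the contention argument concrete, here is a minimal mitigation sketch (not taken from this issue; the numbers and environment variables are assumptions) of capping per-process CPU threads so that several tensor-parallel workers do not oversubscribe a container's CPU quota:

```python
import os

import torch

# Hypothetical sketch for the CPU-contention scenario described above;
# the values are assumptions, not figures taken from this issue.
cpu_limit = 16            # e.g. the container's Kubernetes CPU limit
tensor_parallel_size = 4  # number of vLLM worker processes on this pod

# If every worker spawns one thread per visible core (80 on the node above),
# the container's CFS quota is exceeded and the kernel throttles all workers,
# which can turn a seconds-long LoRA load into tens of seconds.
threads_per_worker = max(1, cpu_limit // tensor_parallel_size)

# Must be set before the heavy CPU work starts (ideally before model init).
os.environ["OMP_NUM_THREADS"] = str(threads_per_worker)
os.environ["MKL_NUM_THREADS"] = str(threads_per_worker)
torch.set_num_threads(threads_per_worker)
```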
Your current environment
🐛 Describe the bug
Built the OpenAI-compatible Docker image from https://github.com/vllm-project/vllm/releases/tag/v0.5.0.post1
Started vLLM for the llama-2-70b model with LoRA support and tensor-parallel=4. The first LoRA request takes more than 1 minute. The problem is in this function: it is very fast on the first process and very slow (>40 seconds) on all the other processes. Here is the log output:
As you can see, all 4 processes start loading the LoRA at 14:44:43, but only the first one finishes at 14:44:46; the other 3 finish only at 14:45:23. What is the problem?
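To narrow down where the ~40 seconds go in the lagging workers, one could time the individual phases of an adapter load in isolation. Below is a minimal sketch; the path is a placeholder and the use of safetensors is an assumption about how the adapter is stored, not vLLM's internal code.

```python
import time

import torch
from safetensors.torch import load_file

adapter_path = "/path/to/adapter_model.safetensors"  # placeholder path

t0 = time.perf_counter()
tensors = load_file(adapter_path)  # phase 1: read weights from disk into host memory
t1 = time.perf_counter()
print(f"disk read: {t1 - t0:.2f}s")

if torch.cuda.is_available():
    # phase 2: page-lock (pin) the host tensors
    pinned = {name: t.pin_memory() for name, t in tensors.items()}
    t2 = time.perf_counter()

    # phase 3: copy to the GPU
    on_gpu = {name: t.to("cuda", non_blocking=True) for name, t in pinned.items()}
    torch.cuda.synchronize()
    t3 = time.perf_counter()

    print(f"pin_memory: {t2 - t1:.2f}s, host-to-GPU copy: {t3 - t2:.2f}s")
```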