[Bug]: Custom all-reduce does not work. #3688
Comments
When does this problem occur? Is it related to #2152?
@youkaichao Nope, it's related to the custom-all-reduce feature. After upgrading to nccl==2.19.3, everything is OK.
@esmeetu I guess it's NCCL's problem? Let me know if a fix is needed from my side.
@hanzhi713 Yes, I think so. I will test again after vLLM upgrades PyTorch to v2.2.2.
@hanzhi713 Why is your custom all-reduce kernel influenced by NCCL? IIUC, yours doesn't use NCCL. 🤔
All-reduce with larger sizes (>= 8 MB) and other collectives (like gather) still need NCCL.
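To illustrate the split described above, here is a minimal sketch of a size-based dispatch; the threshold, names, and fallback are hypothetical and not vLLM's actual implementation:

```python
import torch
import torch.distributed as dist

# Hypothetical cutoff: below this size the custom peer-to-peer kernel is used;
# above it (and for other collectives such as gather) NCCL is used.
CUSTOM_ALL_REDUCE_MAX_BYTES = 8 * 1024 * 1024  # 8 MiB, illustrative only


def custom_all_reduce(tensor: torch.Tensor) -> torch.Tensor:
    # Stand-in for the custom kernel; this sketch simply falls back to NCCL
    # so the example is runnable end to end.
    dist.all_reduce(tensor)
    return tensor


def all_reduce(tensor: torch.Tensor) -> torch.Tensor:
    """Dispatch an all-reduce by message size (sketch)."""
    nbytes = tensor.numel() * tensor.element_size()
    if nbytes < CUSTOM_ALL_REDUCE_MAX_BYTES:
        return custom_all_reduce(tensor)
    dist.all_reduce(tensor)  # larger messages still go through NCCL
    return tensor
```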
Has anyone tried vLLM 0.3.3 + torch 2.1.1+cu118 with nccl==2.19.3? By default, vLLM 0.3.3 + torch 2.1.1+cu118 installs nccl==2.18.3, which gives the all_reduce error with multiple nodes.
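One quick way to check which NCCL build PyTorch is actually linked against (and so whether you ended up on 2.18.3 or 2.19.3) is to query it from Python; the pip package name in the comment is the usual NVIDIA wheel and may differ for your setup:

```python
import torch

# NCCL version this PyTorch build links against, e.g. (2, 18, 3).
print("NCCL:", torch.cuda.nccl.version())
print("torch:", torch.__version__, "CUDA:", torch.version.cuda)

# To move to a newer NCCL wheel (cu11 builds), something like:
#   pip install -U nvidia-nccl-cu11==2.19.3
# Note that vLLM may also pin its own NCCL via the vllm-nccl package.
```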
Can you show your environment and error trace?
@youkaichao here is the error file with
@Sande33p I took a look at your error log, and I found the following lines might be relevant:
It seems your CUDA version is too old. Can you try upgrading your CUDA version?
Yes, I faced the same error when I tried vllm 0.4.1+cu11, torch 2.2.1+cu11, nccl==2.19.2, vllm-nccl-cu11 2.18.1.0.4.0 with multiple GPUs. Line 618 in ee3eea0
Same issue here, using the Docker image vllm/vllm-openai:latest.
Maybe. What's your host driver info?
I can't check it; I'm running the vllm/vllm-openai:latest Docker image on RunPod.io, and I'm having the same issue as this topic.
You can follow the issue template https://github.com/vllm-project/vllm/issues/new/choose to run an environment collection script.
@hanzhi713 @youkaichao May I ask, what was the original intention behind vLLM's development of custom all-reduce?
Your current environment
🐛 Describe the bug
`llm_engine` works fine; `async_llm_engine` does not work. Log:
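For context, a rough sketch of the two code paths being compared (the offline `LLM` wrapper around `llm_engine` vs. `AsyncLLMEngine`); the model name and arguments are placeholders rather than the reporter's exact script, and each path should be run in its own process:

```python
import asyncio

from vllm import LLM, SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

sampling = SamplingParams(temperature=0.0, max_tokens=32)


def run_sync() -> None:
    # Offline path built on LLMEngine: reported to work.
    llm = LLM(model="facebook/opt-125m", tensor_parallel_size=2)
    print(llm.generate(["Hello, my name is"], sampling)[0].outputs[0].text)


async def run_async() -> None:
    # AsyncLLMEngine path (used by the OpenAI-compatible server):
    # reported to fail with custom all-reduce enabled.
    engine = AsyncLLMEngine.from_engine_args(
        AsyncEngineArgs(model="facebook/opt-125m", tensor_parallel_size=2)
    )
    final = None
    async for output in engine.generate("Hello, my name is", sampling, request_id="0"):
        final = output
    print(final.outputs[0].text)


if __name__ == "__main__":
    run_sync()  # or, in a separate process: asyncio.run(run_async())
```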