[Bug]: see connection to gpu node timeout issue when initializing ray vllm multi-node serving #13052
Comments
Note: this is Ray Serve running in Kubernetes.
I can confirm this bug.
Ray version 2.41 with vLLM 0.7.2 and 0.7.1 both reproduce the bug. I also checked the container network: other ports on the target pod are reachable, but the destination port is not, so I don't think this is a container-network problem. From the log messages I found that the init address used by non-zero ranks can differ from the one used by rank 0. I also tried setting VLLM_HOST_IP and VLLM_PORT, but the problem persists.
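For reference, this is roughly how I checked reachability and pinned the advertised address/port from inside the pods (a minimal sketch; the host name and port numbers are placeholders, not values from my logs):

```python
import os
import socket

def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Quick TCP reachability check, used to compare the destination port
    against other ports on the same pod (host/port are placeholders)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Other ports on the rank-0 pod answer, but the rendezvous port does not.
print(port_is_open("rank0-pod-ip", 8000))    # a port that is reachable
print(port_is_open("rank0-pod-ip", 29500))   # the port the worker tries to reach

# Pinning the advertised address/port before launching did not help in my
# case, but this is how I set it (values are placeholders):
os.environ["VLLM_HOST_IP"] = socket.gethostbyname(socket.gethostname())
os.environ["VLLM_PORT"] = "29500"
```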
@youkaichao @Jeffwan @houseroad
I also tested with one node with 4 GPUs instead of two nodes with 1 GPU each. The error persists.
Update: going back to version 0.6.4.post1 solved the problem. In my environment, with 0.7.x the PyTorch NCCL test passes but the vLLM NCCL test fails; with 0.6.4.post1 both succeed, so I suspect the new version introduced some communication problem. @meqiangxu maybe this also works for you.
Thanks! Yesterday we used vLLM 0.6.5, which also resolved the issue.
Should this issue be prioritized? It seems to be a regression.
For this, it might be that the test script is obsolete; I just updated it yesterday in #13487.
Your current environment
The output of `python collect_env.py`
🐛 Describe the bug
I am setting up the Meta-Llama-3.1-70B-Instruct-GPTQ-INT4 model for inference serving using two g6.2xlarge instances (each node has one GPU). I created a placement group like this:
Placement Group Bundles: [{'CPU': 1.0}, {'GPU': 1.0, 'CPU': 1.0, 'memory': 9000000000.0}, {'memory': 9000000000.0, 'CPU': 1.0, 'GPU': 1.0}], and configured it on the Serve deployment actor.
The AsyncEngineArgs uses tensor_parallel_size=1, pipeline_parallel_size=2; the other args can be seen in the logs below, and I can also attach the relevant Python code.
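A condensed sketch of roughly how this is wired up (the deployment class name, model path, and placement-group strategy are placeholders I am adding here, not values taken from the logs):

```python
from ray import serve
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Bundles as printed above: one bundle for the replica actor plus one
# GPU bundle per g6.2xlarge node (1 GPU each).
PG_BUNDLES = [
    {"CPU": 1.0},
    {"GPU": 1.0, "CPU": 1.0, "memory": 9_000_000_000.0},
    {"GPU": 1.0, "CPU": 1.0, "memory": 9_000_000_000.0},
]

@serve.deployment(
    placement_group_bundles=PG_BUNDLES,
    placement_group_strategy="PACK",  # assumption; the strategy is not stated above
)
class LlamaDeployment:
    def __init__(self):
        engine_args = AsyncEngineArgs(
            model="Meta-Llama-3.1-70B-Instruct-GPTQ-INT4",  # placeholder model path
            tensor_parallel_size=1,
            pipeline_parallel_size=2,
            distributed_executor_backend="ray",
        )
        self.engine = AsyncLLMEngine.from_engine_args(engine_args)
```

The replica is then started with `serve.run(LlamaDeployment.bind())`.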
Errors
I tried running the debug script on one GPU node.
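The check is along the lines of the multi-node sanity test from the vLLM troubleshooting docs; a trimmed sketch (not the exact script, with a placeholder rendezvous endpoint) looks roughly like this:

```python
# Minimal multi-node communication sanity check, in the spirit of the
# vLLM troubleshooting script (trimmed; not the exact file from the docs).
# Launch on every node, e.g.:
#   torchrun --nnodes 2 --nproc-per-node 1 \
#       --rdzv_backend c10d --rdzv_endpoint <HEAD_NODE_IP>:29500 sanity_check.py
import os

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)

# A single all-reduce exercises the cross-node NCCL path between ranks.
data = torch.ones(1, device="cuda")
dist.all_reduce(data, op=dist.ReduceOp.SUM)
torch.cuda.synchronize()
print(f"rank {dist.get_rank()}: all_reduce result = {data.item()}")

dist.destroy_process_group()
```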
It reports the same error: