[Bugfix] fix client socket timeout when serve multi-node model in Ray #15850
Hi @youkaichao, would you please take a look at this PR, as you likely have the most context? I'm not sure whether this will affect SPMD workers. I think a better approach would be to fix the rank assignment directly, which will need your expertise. Thank you.
Can you merge from main to fix the Docker build issue?
Can you explain more about it?
Signed-off-by: <> Signed-off-by: yangli5t <yangli5t@users.noreply.github.com>
It seems that when the selected driver_dummy_worker's created_rank is not 0, the error is triggered. https://github.com/vllm-project/vllm/blob/main/vllm/executor/ray_distributed_executor.py#L240-L248
This is possible, but I still don't understand why it would cause an error.
Any news on this PR?
Hello guys, did you forget about this one? It's really important for our team.
This pull request has merge conflicts that must be resolved before it can be |
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
This PR reverts the rank assignment during worker initialization that was introduced in ad34c0d. When deploying multi-node models (e.g. DeepSeek R1) with Ray serve, there is a chance that this rank assignment conflicts across workers, which causes a client socket timeout. This PR fixes the issue and unblocks model deployments.
FIX #15744