@yangli5t yangli5t commented Apr 1, 2025

This PR reverts the rank assignment during worker initialization that was introduced in ad34c0d. When deploying multi-node models (e.g. DeepSeek R1) with Ray Serve, there is a chance that this rank assignment conflicts across workers and causes a client socket timeout. This PR fixes the issue and unblocks model deployments.

FIX #15744

github-actions bot commented Apr 1, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

Author

yangli5t commented Apr 1, 2025

Hi @youkaichao, could you please take a look at this PR? You probably have the most context. I'm not sure whether this will affect SPMD workers. I think a better approach would be to fix the rank assignment directly, which would need your expertise. Thank you.

@yangli5t yangli5t changed the title from "[Bugfix] fix client socket timeout when serve multi-node model in Ray (#15744)" to "[Bugfix] fix client socket timeout when serve multi-node model in Ray" on Apr 1, 2025
@DarkLight1337
Member

Can you merge from main to fix the Docker build issue?

@youkaichao
Member

> there is a chance that this rank assignment get into conflicts across workers thus cause client socket timeout

can you explain more about it?

yangli5t added 2 commits April 3, 2025 16:58
Signed-off-by:  <>

Signed-off-by: yangli5t <yangli5t@users.noreply.github.com>
Signed-off-by: yangli5t <yangli5t@users.noreply.github.com>
Author

yangli5t commented Apr 3, 2025

> > there is a chance that this rank assignment get into conflicts across workers thus cause client socket timeout
>
> can you explain more about it?

It seems that the error is triggered when the selected driver_dummy_worker's created_rank is not 0. See https://github.com/vllm-project/vllm/blob/main/vllm/executor/ray_distributed_executor.py#L240-L248
@youkaichao
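
To make the scenario concrete, here is a toy sketch in plain Python. It is not vLLM code: `WorkerMeta`, `final_ranks`, and `rank_conflicts` are hypothetical names, and the model rests on an unverified assumption, namely that the worker picked as the dummy keeps its creation rank while the remaining workers are renumbered starting from 1. Under that assumption, a dummy whose creation rank is not 0 can end up sharing a rank with a renumbered worker.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class WorkerMeta:
    # Hypothetical stand-in for per-worker metadata; not vLLM's class.
    created_rank: int
    ip: str


def final_ranks(workers, driver_ip):
    # Pick the first worker on the driver's node as the dummy worker.
    dummy = next(w for w in workers if w.ip == driver_ip)
    rest = [w for w in workers if w is not dummy]
    # Driver-node workers first, then by IP; new ranks start at 1.
    rest.sort(key=lambda w: (w.ip != driver_ip, w.ip))
    mapping = {w.created_rank: i + 1 for i, w in enumerate(rest)}
    # ASSUMPTION: the dummy is absent from the mapping and keeps its created rank.
    return [mapping.get(w.created_rank, w.created_rank) for w in workers]


def rank_conflicts(ranks):
    # Ranks held by more than one worker, in ascending order.
    return sorted(r for r, n in Counter(ranks).items() if n > 1)


# Dummy has created_rank 0: ranks stay unique.
ok = [WorkerMeta(0, "10.0.0.1"), WorkerMeta(1, "10.0.0.1"),
      WorkerMeta(2, "10.0.0.2"), WorkerMeta(3, "10.0.0.2")]
# Dummy has created_rank 1: another worker is also renumbered to 1.
bad = [WorkerMeta(0, "10.0.0.2"), WorkerMeta(1, "10.0.0.1"),
       WorkerMeta(2, "10.0.0.1"), WorkerMeta(3, "10.0.0.2")]

print(rank_conflicts(final_ranks(ok, "10.0.0.1")))   # []
print(rank_conflicts(final_ranks(bad, "10.0.0.1")))  # [1]
```

Whether the real executor behaves like this toy model is not confirmed; the sketch only shows why a non-zero created_rank on the dummy worker is the case worth checking.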

@youkaichao
Member

> when the selected driver_dummy_worker's created_rank is not 0

This is possible, but I still don't see why it would cause an error.

@andreapairon

Any news on this PR?

@AntonioGr7

Hello guys, did you forget about this one? It's really important for our team.

mergify bot commented Aug 1, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @yangli5t.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Aug 1, 2025
@github-actions

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

@github-actions github-actions bot added the stale Over 90 days of inactivity label Oct 31, 2025
Development

Successfully merging this pull request may close these issues.

[Bug]: client socket has timed out while trying to connect to GPU node, when initializing DeepSeek R1 in ray vllm serving