[Misc] Optimize ray worker initialization time #11275
Conversation
LGTM
Force-pushed from dfa2cb8 to 0f453a7
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com> Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Force-pushed from 0f453a7 to 8254b41
Head branch was pushed to by a user without write access
thanks for the fix!
@@ -179,7 +188,7 @@ def sort_by_driver_then_worker_ip(worker):
     3. Finally, if the work is on a node with smaller IP address, it
     should be placed first.
     """
-    ip = ray.get(worker.get_node_ip.remote())
+    ip = worker_to_ip[worker]
@ruisearch42 this one looks concerning to me. We should change the tuple we sort on, instead of using the worker as the dict key; see the code from #11256.
I see. Can you elaborate a bit on the concern? The pattern of using an external dict for sorting is not uncommon.
Using an arbitrary Python object as a key introduces quite unpredictable behavior and can lead to silent bugs.
It's not about using an external dict; it's about using the worker object as a dict key, which implicitly calls its `__hash__` function.
I think the default behavior without a custom `__hash__` is to use the object's identity (memory address) for `__hash__` and `__eq__`, so it's pretty safe unless there is some non-standard, user-overridden `__hash__` and `__eq__`?
I think your implementation also makes sense.
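For readers following along, here is a minimal, hypothetical sketch (plain Python, no Ray; the `Worker` class, IPs, and variable names are invented for illustration) contrasting the two approaches discussed in this thread: looking up the IP from an external dict keyed by the worker object, which relies on the default identity-based `__hash__`, versus sorting (ip, worker) tuples so the worker object is never hashed.

```python
# Illustrative sketch only -- not vLLM code.
class Worker:
    """Stand-in for a Ray actor handle. It defines no custom __hash__/__eq__,
    so Python falls back to identity-based (memory-address) hashing."""
    def __init__(self, ip: str, is_driver: bool = False):
        self.ip = ip
        self.is_driver = is_driver

workers = [Worker("10.0.0.2"), Worker("10.0.0.1", is_driver=True), Worker("10.0.0.3")]

# Approach in this PR: look up the IP from an external dict keyed by the
# worker object (relies on the default identity-based __hash__).
worker_to_ip = {w: w.ip for w in workers}
sorted_a = sorted(workers, key=lambda w: (not w.is_driver, worker_to_ip[w]))

# Approach suggested in the review: sort (ip, worker) pairs directly,
# so the worker object is never used as a dict key.
ip_worker_pairs = [(w.ip, w) for w in workers]
ip_worker_pairs.sort(key=lambda pair: (not pair[1].is_driver, pair[0]))
sorted_b = [w for _, w in ip_worker_pairs]

# Both orderings put the driver first, then sort by IP.
assert [w.ip for w in sorted_a] == [w.ip for w in sorted_b]
```

Both give the same order here; the tuple-based version just avoids depending on how the worker object hashes.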
Signed-off-by: Rui Qiao <ruisearch42@gmail.com> Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
This PR optimizes Ray worker initialization time.

In the current code base, `ray.get(worker.get_node_ip.remote())` is called for each worker right after we get its handle, and it takes ~3s. This call is expensive because when `RayWorkerWrapper.remote()` returns we only get an actor handle; the actor itself may not be fully initialized yet. At that point, any method call on the actor has to wait for actor initialization to finish, which can take some time (~3s in this case).

And since we call `ray.get(worker.get_node_ip.remote())` serially for each newly created actor handle, this time adds up. For example, with TP=4 this takes ~12 seconds.

We optimize this by making the `ray.get(worker.get_node_ip.remote())` calls on all the actor handles only after they have all been created. Since these then run in parallel, the total time taken is ~3s, so for TP=4 this saves ~9 seconds.

I tested the following command:

Without this PR, `_init_workers_ray` takes ~18 seconds. With this PR, it takes ~9 seconds.

FIX #10283
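To make the before/after pattern concrete, here is a hedged sketch (not the actual vLLM code; the `Worker` actor and helper names below are invented for illustration) of the serialized pattern versus creating all actors first and then fetching their IPs with a single batched `ray.get`.

```python
# Illustrative sketch only -- not vLLM code.
import time
import ray

@ray.remote
class Worker:
    def get_node_ip(self) -> str:
        import socket
        # Stand-in for the real node-IP lookup done by the worker wrapper.
        return socket.gethostbyname(socket.gethostname())

def init_workers_serial(num_workers: int) -> dict:
    """Old pattern: ray.get() right after each actor is created, so each call
    waits for that actor's (~3s) initialization one at a time."""
    worker_to_ip = {}
    for _ in range(num_workers):
        worker = Worker.remote()
        worker_to_ip[worker] = ray.get(worker.get_node_ip.remote())
    return worker_to_ip

def init_workers_parallel(num_workers: int) -> dict:
    """New pattern: create all actors first, then wait on all the IP lookups
    together, so the actor initializations overlap."""
    workers = [Worker.remote() for _ in range(num_workers)]
    ip_refs = [w.get_node_ip.remote() for w in workers]
    ips = ray.get(ip_refs)  # waits on all refs at once
    return dict(zip(workers, ips))

if __name__ == "__main__":
    ray.init()
    start = time.perf_counter()
    init_workers_parallel(4)
    print(f"parallel init: {time.perf_counter() - start:.1f}s")
```

In the serial version each `ray.get` blocks on one actor's start-up at a time; in the batched version the start-ups overlap, so the total wait is roughly the time of a single actor initialization.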