@OneSizeFitsQuorum OneSizeFitsQuorum commented Nov 3, 2025

Description

Starting a Ray cluster in a KubeRay environment can sometimes be slow. In such cases the connection timeout must be increased for startup to succeed; otherwise the error "ray client connection timeout" occurs. This PR therefore makes the timeout and retry policies for the Ray worker configurable.
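As a rough illustration of what a configurable timeout and retry policy looks like, here is a minimal sketch of a connection loop whose per-attempt timeout doubles after each failure up to a cap. The function name `connect_with_retry`, its parameters, and the default values are hypothetical and chosen for illustration; they are not taken from this PR or from Ray's code.

```python
import time


def connect_with_retry(try_connect, initial_timeout_s=5.0, max_timeout_s=30.0,
                       total_timeout_s=30.0, clock=time.monotonic):
    """Call try_connect(timeout_s) until it succeeds or the deadline passes.

    All names and defaults here are illustrative assumptions, not Ray's API.
    The per-attempt timeout starts at initial_timeout_s and doubles after
    each failed attempt, capped at max_timeout_s.
    Returns the number of attempts made on success.
    """
    deadline = clock() + total_timeout_s
    timeout_s = initial_timeout_s
    attempts = 0
    while clock() < deadline:
        attempts += 1
        # try_connect is expected to block for at most timeout_s seconds
        # and return True once the connection is established.
        if try_connect(timeout_s):
            return attempts
        timeout_s = min(timeout_s * 2, max_timeout_s)
    raise ConnectionError(f"connection timed out after {attempts} attempts")
```

With a policy like this, a slow-starting cluster could be accommodated by passing larger values, for example `connect_with_retry(try_connect, initial_timeout_s=10.0, max_timeout_s=60.0, total_timeout_s=300.0)`.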

@OneSizeFitsQuorum OneSizeFitsQuorum requested a review from a team as a code owner November 3, 2025 12:03
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request makes the Ray worker connection timeout parameters configurable through environment variables, which is a valuable addition for environments with slower startup times. The change correctly centralizes the configuration at the module level and removes redundant code.

My main feedback is regarding the robustness of parsing the environment variables. The current implementation can lead to a ValueError and crash the application if the environment variables are set to non-numeric values. I've left a comment with a suggestion to make this more robust by using os.environ.get() and adding error handling, ideally by leveraging existing helper functions within Ray to ensure consistency and prevent crashes from misconfiguration.
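The kind of defensive parsing described above can be sketched as follows. The helper name `read_float_env` and the variable name `EXAMPLE_CONNECT_TIMEOUT_S` are hypothetical, used only for illustration; the existing Ray helper functions mentioned in the review are not reproduced here.

```python
import logging
import os

logger = logging.getLogger(__name__)


def read_float_env(name, default):
    """Read a float from the environment without raising on bad input.

    Hypothetical helper for illustration. Returns `default` when the
    variable is unset or cannot be parsed as a float.
    """
    raw = os.environ.get(name)
    if raw is None:
        return default
    try:
        return float(raw)
    except ValueError:
        # A misconfigured value is logged and ignored instead of crashing.
        logger.warning(
            "Ignoring non-numeric %s=%r; using default %s", name, raw, default
        )
        return default


# Module-level constant resolved once at import time (name is an assumption).
EXAMPLE_CONNECT_TIMEOUT_S = read_float_env("EXAMPLE_CONNECT_TIMEOUT_S", 5.0)
```

Falling back to the default keeps a typo in a deployment manifest from taking the process down, at the cost of silently running with a value the operator did not intend, which is why the sketch logs a warning.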

@ray-gardener ray-gardener bot added the core (Issues that should be addressed in Ray Core) and community-contribution (Contributed by the community) labels Nov 3, 2025
@edoakes edoakes added the go (add ONLY when ready to merge, run all tests) label Nov 3, 2025
@edoakes edoakes left a comment

Thanks for the contribution. I left a naming suggestion to make it consistent with the other constants we have.

Please also sign off your commits to fix the DCO build.

@OneSizeFitsQuorum OneSizeFitsQuorum force-pushed the make_timeout_configurable branch from f0da4f0 to f11b006 Compare November 4, 2025 05:58
Signed-off-by: OneSizeFitsQuorum <txypotato@gmail.com>
@OneSizeFitsQuorum OneSizeFitsQuorum force-pushed the make_timeout_configurable branch from f11b006 to 730fb03 Compare November 4, 2025 06:03
edoakes commented Nov 4, 2025

Signed-off-by: OneSizeFitsQuorum <txypotato@gmail.com>
Signed-off-by: OneSizeFitsQuorum <txypotato@gmail.com>
@OneSizeFitsQuorum OneSizeFitsQuorum commented

@edoakes Thanks a lot for reviewing this. Fixed!

@edoakes edoakes left a comment

Thanks for the contribution

@edoakes edoakes merged commit 7970875 into ray-project:master Nov 5, 2025
6 checks passed
@OneSizeFitsQuorum OneSizeFitsQuorum deleted the make_timeout_configurable branch November 6, 2025 01:42
YoussefEssDS pushed a commit to YoussefEssDS/ray that referenced this pull request Nov 8, 2025
…roject#58372)

When starting a Ray cluster in a Kuberay environment, the startup
process may sometimes be slow. In such cases, it is necessary to
increase the timeout duration for proper startup, otherwise, the error
"ray client connection timeout" will occur. Therefore, we need to make
the timeout and retry policies for the Ray worker configurable.

---------

Signed-off-by: OneSizeFitsQuorum <txypotato@gmail.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…roject#58372)

Signed-off-by: OneSizeFitsQuorum <txypotato@gmail.com>
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
…roject#58372)

Signed-off-by: OneSizeFitsQuorum <txypotato@gmail.com>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
ykdojo pushed a commit to ykdojo/ray that referenced this pull request Nov 27, 2025
…roject#58372)

Signed-off-by: OneSizeFitsQuorum <txypotato@gmail.com>
Signed-off-by: YK <1811651+ykdojo@users.noreply.github.com>
SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025
…roject#58372)

Signed-off-by: OneSizeFitsQuorum <txypotato@gmail.com>
Labels

community-contribution (Contributed by the community), core (Issues that should be addressed in Ray Core), go (add ONLY when ready to merge, run all tests)
