Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increase the raylet start wait timeout to accomodate plasma preallocation #15860

Merged
merged 9 commits into from
May 19, 2021

Conversation

ericl
Copy link
Contributor

@ericl ericl commented May 17, 2021

Why are these changes needed?

When RAY_PREALLOCATE_PLASMA_MEMORY=1, it can take quite some time to preallocate hundreds of GBs of memory. This causes the us to timeout waiting for redis to start. To avoid this problem, increase the default timeout.

Related to #14182

# The number of seconds to wait for the Raylet to start. This is normally
# fast, but when RAY_PREALLOCATE_PLASMA_MEMORY=1 is set, it may take some time
# (a few GB/s) to populate all the pages on Raylet startup.
RAYLET_START_WAIT_TIME_S = 120
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't do 120 if env.get("RAY_PREALLOCATE_PLASMA_MEMORY") else 5?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be simpler to raise it for everyone (minimal downside), rather than adding such a special case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I just thought it would be bad UX if someone needs to wait 120 seconds when there are actual issues (when GCS is actually not connected since it is common among new users). But we anyway writes warning, so I am okay with the current approach.

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label May 18, 2021
Copy link
Contributor

@clarkzinzow clarkzinzow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I'll confirm that the Plasma preallocation goes smoothly on large object stores after this change.

@ericl
Copy link
Contributor Author

ericl commented May 18, 2021

//python/ray/tests:test_multi_node_3 FAILED in 3 out of 3 in 191.3s

@stephanie-wang
Copy link
Contributor

The linux wheel build error looks unrelated.

@stephanie-wang stephanie-wang merged commit 2dc4198 into ray-project:master May 19, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
@author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants