[Serve] long_running_serve_failure is unstable #31741
Comments
Bumping to release blocker; per @shrekris-anyscale, this is to be made stable before 2.3. cc @sihanwang41
What's the status here? Is someone actively working on it?
Yep, I'm working on it.
#31945 should address this issue. I kicked off a release test run to confirm: https://buildkite.com/ray-project/release-tests-branch/builds/1312#_
The long_running_serve_failure test uses a long-running actor, RandomKiller, to randomly kill Serve actors. This change sets the RandomKiller's max_restarts and max_task_retries to -1, so it can restart after crashes. Related issue number Addresses #31741
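To illustrate why the restart budget matters here, the following is a minimal plain-Python model of the semantics described above. It is not Ray's implementation: the function name `survives_crashes` and its parameters are hypothetical, chosen only for this sketch. In Ray itself, `max_restarts` and `max_task_retries` are options passed to `@ray.remote` on an actor, where -1 means unlimited and the default of 0 means the actor is not restarted after it crashes.

```python
def survives_crashes(max_restarts: int, crash_count: int) -> bool:
    """Model an actor's restart budget (hypothetical helper, not a Ray API).

    max_restarts == -1 means unlimited restarts.
    max_restarts == 0 means the first crash is fatal.
    max_restarts == n > 0 means at most n restarts are allowed.
    Returns True if the actor is still alive after `crash_count` crashes.
    """
    restarts_used = 0
    for _ in range(crash_count):
        # A crash with no restart budget left kills the actor for good.
        if max_restarts != -1 and restarts_used >= max_restarts:
            return False
        restarts_used += 1
    return True


# With no restart budget, a single crash ends the killer actor.
print(survives_crashes(max_restarts=0, crash_count=1))
# With an unlimited budget, the actor keeps coming back.
print(survives_crashes(max_restarts=-1, crash_count=10_000))
```

Under this model, a long-running killer actor with a finite budget eventually stops for good once it has crashed often enough, which is consistent with the instability the fix targets.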
I started a test run, and it has been running successfully for 15+ hours without failing: https://buildkite.com/ray-project/release-tests-branch/builds/1319#_
I'll follow up with a PR that marks the release test as stable and adds an iteration limit, so it doesn't run forever.
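One way an iteration limit like the one mentioned above could look is sketched below. This is an assumption for illustration only, not the code from the follow-up PR: `run_iterations`, `step`, `max_iterations`, and `max_seconds` are hypothetical names.

```python
import time
from typing import Callable, Optional


def run_iterations(
    step: Callable[[int], None],
    max_iterations: int,
    max_seconds: Optional[float] = None,
) -> int:
    """Call `step(i)` until an iteration cap or an optional time cap is hit.

    Returns the number of iterations that completed.
    """
    start = time.monotonic()
    completed = 0
    while completed < max_iterations:
        # Stop early if the optional wall-clock budget is exhausted.
        if max_seconds is not None and time.monotonic() - start >= max_seconds:
            break
        step(completed)
        completed += 1
    return completed


# Usage: a stand-in for one round of the long-running test body.
seen = []
finished = run_iterations(seen.append, max_iterations=5)
print(finished, seen)
```

Bounding the loop by iteration count rather than only by wall-clock time gives the test a definite end even if individual iterations become slow.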
Do you know if this release test has your latest fix?
Yes, that release test has the latest fix. It failed because of an unrelated issue with the dashboard.
…32011) The long_running_serve_failure test uses a long-running actor, RandomKiller, to randomly kill Serve actors. This change sets the RandomKiller's max_restarts and max_task_retries to -1, so it can restart after crashes. Related issue number Addresses ray-project#31741 Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
How can this bug be reproduced? Are there any instructions or steps that must be followed? Thank you.
What happened + What you expected to happen
The long_running_serve_failure test is unstable. It passes roughly 50% of the time. This test should be made stable for Ray 2.3.
Versions / Dependencies
Ray on the current master.
Reproduction script
long_running_serve_failure runs this script over a long period of time.
Issue Severity
None