[Serve] long_running_serve_failure is unstable #31741

Closed
shrekris-anyscale opened this issue Jan 18, 2023 · 8 comments · Fixed by #32063
Labels: bug, P0, release-blocker, serve

Comments

@shrekris-anyscale
Contributor

What happened + What you expected to happen

The long_running_serve_failure test is unstable. It passes roughly 50% of the time:

[Image: screenshot taken 2023-01-18 at 10:15 AM]

This test should be made stable for Ray 2.3.

Versions / Dependencies

Ray on the current master.

Reproduction script

long_running_serve_failure runs this script over a long period of time.

Issue Severity

None

@shrekris-anyscale added the bug, P1, and serve labels on Jan 18, 2023
@shrekris-anyscale self-assigned this on Jan 18, 2023
@cadedaniel
Member

Bumping to release blocker; per @shrekris-anyscale, this is to be made stable before 2.3. cc @sihanwang41

@cadedaniel added the release-blocker and P0 labels and removed the P1 label on Jan 24, 2023
@cadedaniel
Member

What's the status here? Is someone actively working on it?

@shrekris-anyscale
Contributor Author

Yep, I'm working on it.

@shrekris-anyscale
Contributor Author

#31945 should address this issue. I kicked off a release test run to confirm: https://buildkite.com/ray-project/release-tests-branch/builds/1312#_

architkulkarni pushed a commit that referenced this issue Jan 27, 2023
The long_running_serve_failure test uses a long-running actor, RandomKiller, to randomly kill Serve actors. This change sets the RandomKiller's max_restarts and max_task_retries to -1, so it can restart after crashes.

Related issue number
Addresses #31741
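
For readers unfamiliar with these options, here is a minimal sketch of how an actor might be declared with unlimited restarts and task retries. `max_restarts` and `max_task_retries` are Ray actor options, and `-1` means unlimited for both. The `RandomKiller` body, `pick_victim`, `RESTART_OPTIONS`, and `as_restartable_actor` are hypothetical names for illustration only and are not taken from the actual test script.

```python
import random

# Real Ray actor option names; -1 means "unlimited" for both.
RESTART_OPTIONS = {"max_restarts": -1, "max_task_retries": -1}


class RandomKiller:
    """Hypothetical stand-in for the test's killer actor (not the real one)."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)

    def pick_victim(self, actor_names):
        """Return one name at random, or None if there is nothing to kill."""
        names = list(actor_names)
        if not names:
            return None
        return self._rng.choice(names)


def as_restartable_actor(cls):
    """Wrap cls as a Ray actor class that can restart after crashes.

    Ray is imported lazily so the plain class stays usable without it.
    """
    import ray

    return ray.remote(**RESTART_OPTIONS)(cls)
```

With Ray installed, something like `as_restartable_actor(RandomKiller).remote(seed=0)` would create the actor. Without these options an actor is not restarted after a crash, which is presumably why the killer could stop partway through a long run.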
@shrekris-anyscale
Contributor Author

I started a test run, and it has been running successfully for 15+ hours without failing: https://buildkite.com/ray-project/release-tests-branch/builds/1319#_

I'll follow up with a PR that marks the release test as stable and adds an iteration limit, so it doesn't run forever.
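
The follow-up PR itself is not shown in this thread, so the following is only a sketch of what an iteration limit on a long-running loop could look like. `run_with_iteration_limit`, `step`, and `max_iterations` are hypothetical names, not the test's actual API.

```python
import itertools


def run_with_iteration_limit(step, max_iterations=None):
    """Call step(i) for i = 0, 1, 2, ... and return how many calls were made.

    With max_iterations=None the loop never ends on its own (the old
    behaviour described above); with an integer it stops after that many calls.
    """
    if max_iterations is not None and max_iterations < 0:
        raise ValueError("max_iterations must be non-negative or None")
    indices = itertools.count() if max_iterations is None else range(max_iterations)
    completed = 0
    for i in indices:
        step(i)
        completed += 1
    return completed
```

A bounded run lets the release test finish and report a pass instead of relying on a timeout to end it.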

@cadedaniel
Member

do you know if this release test has your latest fix?

https://buildkite.com/ray-project/release-tests-branch/builds/1318#0185fc29-1d52-4529-9222-5cddea6b4167

@shrekris-anyscale
Contributor Author

> do you know if this release test has your latest fix?

Yes, that release test has the latest fix. It failed because of an unrelated issue with the dashboard.

edoakes pushed a commit to edoakes/ray that referenced this issue Mar 22, 2023
…32011)

The long_running_serve_failure test uses a long-running actor, RandomKiller, to randomly kill Serve actors. This change sets the RandomKiller's max_restarts and max_task_retries to -1, so it can restart after crashes.

Related issue number
Addresses ray-project#31741

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
@Wordyka

Wordyka commented Jun 5, 2023

How can this bug be reproduced? Are there any instructions or steps that we must follow? Thank you.


3 participants