-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Serve] Make RandomKiller
restart in long_running_serve_failure
test
#32011
[Serve] Make RandomKiller
restart in long_running_serve_failure
test
#32011
Conversation
Signed-off-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch! How did you find out that the RandomKiller was crashing? (Ah I see the logs you sent offline)
@@ -172,5 +172,6 @@ def run(self): | |||
break | |||
|
|||
|
|||
tester = RandomTest(max_deployments=NUM_NODES * CPUS_PER_NODE) | |||
random_killer = RandomKiller.remote() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the effect of pulling it out of RandomTest?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't affect the test's behavior. I did it based on @edoakes's feedback from the previous change.
Will merge as soon as CI passes (i.e. without blocking on running the release test in this PR) since the test is already flaky and we're monitoring it closely. That way we can get more runs in to verify the flakiness is fixed before the branch cut. |
I kicked off the test yesterday, but it failed after ~10 hours with the error:
I reproduced the issue locally by manually killing the actor via terminal. This caused the same error. |
Looks like relevant tests are passing! Can we merge so the weekly run this weekend will include this? |
Windows test failures unrelated |
The long_running_serve_failure release test is marked as unstable due to recent failures. Recently, #31945 and #32011 have resolved the root causes of these failures. After those changes, the test ran successfully for 15+ hours without failure. This change limits the test's iterations, so it doesn't run forever, and it marks the test as stable.
…ct#32063) The long_running_serve_failure release test is marked as unstable due to recent failures. Recently, ray-project#31945 and ray-project#32011 have resolved the root causes of these failures. After those changes, the test ran successfully for 15+ hours without failure. This change limits the test's iterations, so it doesn't run forever, and it marks the test as stable.
…32011) The long_running_serve_failure test uses a long-running actor, RandomKiller, to randomly kill Serve actors. This change sets the RandomKiller's max_restarts and max_task_retries to -1, so it can restart after crashes. Related issue number Addresses ray-project#31741 Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…ct#32063) The long_running_serve_failure release test is marked as unstable due to recent failures. Recently, ray-project#31945 and ray-project#32011 have resolved the root causes of these failures. After those changes, the test ran successfully for 15+ hours without failure. This change limits the test's iterations, so it doesn't run forever, and it marks the test as stable. Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Shreyas Krishnaswamy shrekris@anyscale.com
Why are these changes needed?
The
long_running_serve_failure
test uses a long-running actor,RandomKiller
, to randomly kill Serve actors. This change sets theRandomKiller
'smax_restarts
andmax_task_retries
to-1
, so it can restart after crashes.Related issue number
Addresses #31741
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.long_running_serve_failure
release test.