
[Serve] Make RandomKiller restart in long_running_serve_failure test #32011

Merged

Conversation

shrekris-anyscale
Contributor

Signed-off-by: Shreyas Krishnaswamy shrekris@anyscale.com

Why are these changes needed?

The long_running_serve_failure test uses a long-running actor, RandomKiller, to randomly kill Serve actors. This change sets the RandomKiller's max_restarts and max_task_retries to -1, so it can restart after crashes.
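For context, this is roughly what those options look like on a Ray actor. A minimal sketch only, not the actual test code; the class body and handle name here are illustrative:

import ray

# max_restarts=-1 lets Ray restart the actor indefinitely after a crash;
# max_task_retries=-1 retries actor tasks interrupted by such a crash.
@ray.remote(max_restarts=-1, max_task_retries=-1)
class RandomKiller:
    def run(self):
        ...  # randomly kill Serve actors (body elided)

# The same limits can also be supplied at instantiation time:
random_killer = RandomKiller.options(max_restarts=-1, max_task_retries=-1).remote()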

Related issue number

Addresses #31741

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
      • This change updates the long_running_serve_failure release test.

@architkulkarni
Contributor


Nice catch! How did you find out that the RandomKiller was crashing? (Ah I see the logs you sent offline)

@@ -172,5 +172,6 @@ def run(self):
break


tester = RandomTest(max_deployments=NUM_NODES * CPUS_PER_NODE)
random_killer = RandomKiller.remote()
Contributor

What's the effect of pulling it out of RandomTest?

Contributor Author

It doesn't affect the test's behavior. I did it based on @edoakes's feedback from the previous change.
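For illustration, the difference is only where the actor handle is constructed. A hypothetical before/after sketch (the constructor body shown here is assumed, not taken from the test):

# Hypothetical "before": the killer handle is created inside the test harness.
class RandomTest:
    def __init__(self, max_deployments):
        self.random_killer = RandomKiller.remote()

# "After", matching the diff above: the handle is created at module scope
# alongside the tester, which doesn't change what the test actually does.
tester = RandomTest(max_deployments=NUM_NODES * CPUS_PER_NODE)
random_killer = RandomKiller.remote()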

@architkulkarni
Contributor

Will merge as soon as CI passes (i.e. without blocking on running the release test in this PR) since the test is already flaky and we're monitoring it closely. That way we can get more runs in to verify the flakiness is fixed before the branch cut.

@architkulkarni self-assigned this Jan 27, 2023
@shrekris-anyscale
Contributor Author

Nice catch! How did you find out that the RandomKiller was crashing?

I kicked off the test yesterday, but it failed after ~10 hours with the error:

Traceback (most recent call last):
  File "workloads/serve_failure.py", line 176, in <module>
    tester.run()
  File "workloads/serve_failure.py", line 149, in run
    action_chosen()
  File "workloads/serve_failure.py", line 128, in create_deployment
    ray.get(self.random_killer.stop_spare.remote(new_name))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/worker.py", line 2384, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
        class_name: RandomKiller
        actor_id: 53851ad5fe3aacb063641c9201000000
        pid: 1533
        namespace: serve_failure_test
        ip: 172.31.205.150

I reproduced the issue locally by manually killing the actor from the terminal, which produced the same error.
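A minimal way to see the same failure mode in isolation (a sketch, not the release-test code) is to kill a non-restartable actor's worker process and then call one of its methods:

import os
import signal

import ray

@ray.remote  # no max_restarts, so a crash permanently kills the actor
class Victim:
    def pid(self):
        return os.getpid()

    def ping(self):
        return "pong"

ray.init()
victim = Victim.remote()
pid = ray.get(victim.pid.remote())
os.kill(pid, signal.SIGKILL)      # simulate the crash from the terminal
ray.get(victim.ping.remote())     # raises ray.exceptions.RayActorError

With max_restarts=-1 and max_task_retries=-1 set on the actor, the same sequence should instead restart the actor and let the call go through.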

@cadedaniel
Member

Looks like relevant tests are passing! Can we merge so the weekly run this weekend will include this?

@architkulkarni
Contributor

Windows test failures are unrelated.

@architkulkarni added the tests-ok label (The tagger certifies test failures are unrelated and assumes personal liability.) Jan 27, 2023
@architkulkarni merged commit dd36360 into ray-project:master Jan 27, 2023
architkulkarni pushed a commit that referenced this pull request Jan 30, 2023
The long_running_serve_failure release test is marked as unstable due to recent failures. Recently, #31945 and #32011 have resolved the root causes of these failures. After those changes, the test ran successfully for 15+ hours without failure. This change limits the test's iterations, so it doesn't run forever, and it marks the test as stable.
clarng pushed a commit to clarng/ray that referenced this pull request Jan 31, 2023
…ct#32063)

The long_running_serve_failure release test is marked as unstable due to recent failures. Recently, ray-project#31945 and ray-project#32011 have resolved the root causes of these failures. After those changes, the test ran successfully for 15+ hours without failure. This change limits the test's iterations, so it doesn't run forever, and it marks the test as stable.
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
…32011)

The long_running_serve_failure test uses a long-running actor, RandomKiller, to randomly kill Serve actors. This change sets the RandomKiller's max_restarts and max_task_retries to -1, so it can restart after crashes.

Related issue number
Addresses ray-project#31741

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
…ct#32063)

The long_running_serve_failure release test is marked as unstable due to recent failures. Recently, ray-project#31945 and ray-project#32011 have resolved the root causes of these failures. After those changes, the test ran successfully for 15+ hours without failure. This change limits the test's iterations, so it doesn't run forever, and it marks the test as stable.

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>