
[Serve] Make RandomKiller restart in long_running_serve_failure test #32011

Merged

Conversation

shrekris-anyscale
Contributor

Signed-off-by: Shreyas Krishnaswamy shrekris@anyscale.com

Why are these changes needed?

The long_running_serve_failure test uses a long-running actor, RandomKiller, to randomly kill Serve actors. This change sets the RandomKiller's max_restarts and max_task_retries to -1, so it can restart after crashes.
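For context, this is roughly what those options look like on a Ray actor. A minimal sketch only, not the actual test code; the class body and handle name here are illustrative:

import ray

# max_restarts=-1 lets Ray restart the actor indefinitely after a crash;
# max_task_retries=-1 retries actor tasks interrupted by such a crash.
@ray.remote(max_restarts=-1, max_task_retries=-1)
class RandomKiller:
    def run(self):
        ...  # randomly kill Serve actors (body elided)

# The same limits can also be supplied at instantiation time:
random_killer = RandomKiller.options(max_restarts=-1, max_task_retries=-1).remote()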

Related issue number

Addresses #31741

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
      • This change updates the long_running_serve_failure release test.

@architkulkarni
Contributor


Nice catch! How did you find out that the RandomKiller was crashing? (Ah I see the logs you sent offline)

@@ -172,5 +172,6 @@ def run(self):
break


tester = RandomTest(max_deployments=NUM_NODES * CPUS_PER_NODE)
random_killer = RandomKiller.remote()
Contributor

What's the effect of pulling it out of RandomTest?

Contributor Author

It doesn't affect the test's behavior. I did it based on @edoakes's feedback from the previous change.
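For illustration, the difference is only where the actor handle is constructed. A hypothetical before/after sketch (the constructor body shown here is assumed, not taken from the test):

# Hypothetical "before": the killer handle is created inside the test harness.
class RandomTest:
    def __init__(self, max_deployments):
        self.random_killer = RandomKiller.remote()

# "After", matching the diff above: the handle is created at module scope
# alongside the tester, which doesn't change what the test actually does.
tester = RandomTest(max_deployments=NUM_NODES * CPUS_PER_NODE)
random_killer = RandomKiller.remote()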

@architkulkarni
Contributor

Will merge as soon as CI passes (i.e. without blocking on running the release test in this PR) since the test is already flaky and we're monitoring it closely. That way we can get more runs in to verify the flakiness is fixed before the branch cut.

@architkulkarni self-assigned this Jan 27, 2023
@shrekris-anyscale
Contributor Author

Nice catch! How did you find out that the RandomKiller was crashing?

I kicked off the test yesterday, but it failed after ~10 hours with the error:

Traceback (most recent call last):
  File "workloads/serve_failure.py", line 176, in <module>
    tester.run()
  File "workloads/serve_failure.py", line 149, in run
    action_chosen()
  File "workloads/serve_failure.py", line 128, in create_deployment
    ray.get(self.random_killer.stop_spare.remote(new_name))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/worker.py", line 2384, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
        class_name: RandomKiller
        actor_id: 53851ad5fe3aacb063641c9201000000
        pid: 1533
        namespace: serve_failure_test
        ip: 172.31.205.150

I reproduced the issue locally by manually killing the actor from the terminal, which produced the same error.
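A minimal way to see the same failure mode in isolation (a sketch, not the release-test code) is to kill a non-restartable actor's worker process and then call one of its methods:

import os
import signal

import ray

@ray.remote  # no max_restarts, so a crash permanently kills the actor
class Victim:
    def pid(self):
        return os.getpid()

    def ping(self):
        return "pong"

ray.init()
victim = Victim.remote()
pid = ray.get(victim.pid.remote())
os.kill(pid, signal.SIGKILL)      # simulate the crash from the terminal
ray.get(victim.ping.remote())     # raises ray.exceptions.RayActorError

With max_restarts=-1 and max_task_retries=-1 set on the actor, the same sequence should instead restart the actor and let the call go through.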

@cadedaniel
Member

Looks like relevant tests are passing! Can we merge so the weekly run this weekend will include this?

@architkulkarni
Contributor

Windows test failures are unrelated.

@architkulkarni added the tests-ok label (The tagger certifies test failures are unrelated and assumes personal liability.) Jan 27, 2023
@architkulkarni merged commit dd36360 into ray-project:master Jan 27, 2023
architkulkarni pushed a commit that referenced this pull request Jan 30, 2023
The long_running_serve_failure release test is marked as unstable due to recent failures. Recently, #31945 and #32011 have resolved the root causes of these failures. After those changes, the test ran successfully for 15+ hours without failure. This change limits the test's iterations, so it doesn't run forever, and it marks the test as stable.
clarng pushed a commit to clarng/ray that referenced this pull request Jan 31, 2023
…ct#32063)

The long_running_serve_failure release test is marked as unstable due to recent failures. Recently, ray-project#31945 and ray-project#32011 have resolved the root causes of these failures. After those changes, the test ran successfully for 15+ hours without failure. This change limits the test's iterations, so it doesn't run forever, and it marks the test as stable.
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
…32011)

The long_running_serve_failure test uses a long-running actor, RandomKiller, to randomly kill Serve actors. This change sets the RandomKiller's max_restarts and max_task_retries to -1, so it can restart after crashes.

Related issue number
Addresses ray-project#31741

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
…ct#32063)

The long_running_serve_failure release test is marked as unstable due to recent failures. Recently, ray-project#31945 and ray-project#32011 have resolved the root causes of these failures. After those changes, the test ran successfully for 15+ hours without failure. This change limits the test's iterations, so it doesn't run forever, and it marks the test as stable.

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>