Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Serve] Fail early for user app failure and expose failure reasons #3411

Merged
merged 41 commits into from
Apr 7, 2024

Conversation

Michaelvll
Copy link
Collaborator

@Michaelvll Michaelvll commented Apr 2, 2024

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
    • pytest tests/test_smoke.py --serve
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
    • pytest tests/test_smoke.py::test_skyserve_user_bug_restart
    • pytest tests/test_smoke.py::test_skyserve_failures
    • pytest tests/test_smoke.py --serve
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

@Michaelvll Michaelvll requested a review from cblmemo April 3, 2024 07:56
@Michaelvll Michaelvll marked this pull request as ready for review April 4, 2024 00:41
Copy link
Collaborator

@cblmemo cblmemo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! It looks mostly great for me. Left some nits and after all smoke tests added and passed it should be good to go 🫡

sky/serve/autoscalers.py Show resolved Hide resolved
sky/serve/autoscalers.py Outdated Show resolved Hide resolved
Comment on lines 398 to 406
if info.is_ready:
self.latest_version_ever_ready = self.latest_version
elif (info.status_property.unrecoverable_failure() and
self.latest_version_ever_ready < self.latest_version):
# Stop scaling if one of replica of the latest version
# failed, it is likely that a fatal error happens to the
# user application and may lead to a infinte termination
# and restart.
return []
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we do the elif in a new for loop? If replica 1 is unrecoverable and replica 2 is ready, and the order in replica_infos is [replica_1, replica_2], iiuc here we will first execute the loop body for replica 1 and stop scaling, which is not expected as the replica 2 is ready. instead, should we loop for all replicas and update the self.latest_version_every_ready first?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! When unrecoverable_failure() happens, it is likely all the replicas will fail, but changing it to two loops should be safer.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just wondering some edge case like the initial delay seconds is about to miss and dependent on network speed 🤔

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! I suppose the latest two loops should guard the concerns : )

sky/serve/replica_managers.py Outdated Show resolved Hide resolved
@@ -333,17 +340,17 @@ def to_replica_status(self) -> serve_state.ReplicaStatus:
return serve_state.ReplicaStatus.FAILED
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return serve_state.ReplicaStatus.FAILED
return serve_state.ReplicaStatus.FAILED_USER_APP

should we update this to a status w/ more information as well?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am thinking to use FAILED to always mean FAILED_USER_APP, as FAILED represents user app error in managed spot job. Wdyt? cc'ing @concretevitamin @romilbhardwaj for some input here as well : )

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh if it is the case, agreed that we keep align w/ spot jobs ;)

sky/serve/serve_state.py Show resolved Hide resolved
sky/serve/serve_state.py Outdated Show resolved Hide resolved
tests/skyserve/failures/probing.py Outdated Show resolved Hide resolved
tests/skyserve/spot/base_ondemand_fallback.yaml Outdated Show resolved Hide resolved
tests/test_smoke.py Show resolved Hide resolved
Copy link
Collaborator

@cblmemo cblmemo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! thx :)

sky/serve/autoscalers.py Outdated Show resolved Hide resolved
sky/serve/autoscalers.py Outdated Show resolved Hide resolved
sky/serve/replica_managers.py Outdated Show resolved Hide resolved
@@ -333,17 +340,17 @@ def to_replica_status(self) -> serve_state.ReplicaStatus:
return serve_state.ReplicaStatus.FAILED
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh if it is the case, agreed that we keep align w/ spot jobs ;)

Michaelvll and others added 5 commits April 5, 2024 16:45
Co-authored-by: Tian Xia <cblmemo@gmail.com>
Co-authored-by: Tian Xia <cblmemo@gmail.com>
Co-authored-by: Tian Xia <cblmemo@gmail.com>
@Michaelvll Michaelvll merged commit 48a5c63 into master Apr 7, 2024
20 checks passed
@Michaelvll Michaelvll deleted the fail-early-for-user-error branch April 7, 2024 03:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants