-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Serve] Propagate replica constructor error to deployment status message and print num retries left #48531
[Serve] Propagate replica constructor error to deployment status message and print num retries left #48531
Conversation
Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM, some questions!
@@ -1942,6 +1944,22 @@ def check_curr_status(self) -> Tuple[bool, bool]: | |||
) | |||
return False, any_replicas_recovering | |||
|
|||
if failed_to_start_count > self._prev_replica_constructor_retry_counter: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is _prev_replica_constructor_retry_counter
necessary? Can we just always call update_message
and change it to a no-op if the message is the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah you're right
Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
if message != self._curr_status_info.message: | ||
self._curr_status_info = self._curr_status_info.update_message(message) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might be easier to move the check into update_message
method
Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
…age and print num retries left (ray-project#48531) ## Why are these changes needed? This change will surface the replica constructor error as soon as the replica constructor fails for whatever reason. The exception will be populated in the deployment status so that it's viewable from the ray dashboard. Additionally, the number of replica constructor retries left will also be updated in the error message. This will help users more quickly debug a deployment that is failing to start. ## Related issue number <!-- For example: "Closes ray-project#1234" --> Closes ray-project#35604 Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
Why are these changes needed?
This change will surface the replica constructor error as soon as the replica constructor fails for whatever reason. The exception will be populated in the deployment status so that it's viewable from the ray dashboard. Additionally, the number of replica constructor retries left will also be updated in the error message. This will help users more quickly debug a deployment that is failing to start.
Related issue number
Closes #35604
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.