update ping endpoint default behavior #2254
Codecov Report
@@            Coverage Diff             @@
##           master    #2254      +/-   ##
==========================================
+ Coverage   70.28%   70.40%   +0.11%
==========================================
  Files          75       75
  Lines        3392     3392
  Branches       57       57
==========================================
+ Hits         2384     2388       +4
+ Misses       1005     1001       -4
  Partials        3        3

See 2 files with indirect coverage changes.
    }
} else if (state == WorkerState.WORKER_STOPPED) {
    if (recoveryStartTS == 0) {
        recoveryStartTS = currentTS;
Is `recoveryStartTS = currentTS;` only valid in the case of `WorkerState.WORKER_STOPPED`? What about `WorkerState.WORKER_SCALED_DOWN`?
Also, does the current logic handle the case where the thread is dying because of OOM?
`WORKER_SCALED_DOWN` is used for model unregistration or a scale-down request. It does not trigger a backend worker retry.
Any exception, such as OOM, changes the worker state to `WorkerState.WORKER_STOPPED` and then triggers a retry. That is why `recoveryStartTS = currentTS` only happens on `WorkerState.WORKER_STOPPED`.
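The state handling described in this reply can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual TorchServe code: the class `RecoveryTracker`, its methods, the constructor parameter, and the reset on `WORKER_MODEL_LOADED` are hypothetical. Only the `WorkerState` names and the `recoveryStartTS` / `currentTS` variables come from the diff and the discussion.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of TorchServe: tracks when a worker
// entered recovery and whether it is still inside the retry window.
class RecoveryTracker {
    // Subset of worker states named in this review thread.
    enum WorkerState { WORKER_STARTED, WORKER_MODEL_LOADED, WORKER_STOPPED, WORKER_SCALED_DOWN }

    private final long maxRetryTimeoutMs; // assumed retry window, in milliseconds
    private long recoveryStartTS;         // 0 means no recovery is in progress

    RecoveryTracker(long maxRetryTimeoutInSec) {
        this.maxRetryTimeoutMs = TimeUnit.SECONDS.toMillis(maxRetryTimeoutInSec);
    }

    void onStateChange(WorkerState state, long currentTS) {
        if (state == WorkerState.WORKER_MODEL_LOADED) {
            // Assumption: a successful model load ends the recovery window.
            recoveryStartTS = 0;
        } else if (state == WorkerState.WORKER_STOPPED) {
            // Record only the first failure so repeated retries share one window.
            if (recoveryStartTS == 0) {
                recoveryStartTS = currentTS;
            }
        }
        // WORKER_SCALED_DOWN is intentional (unregister or scale down),
        // so it starts no retry and no recovery window.
    }

    long getRecoveryStartTS() {
        return recoveryStartTS;
    }

    boolean isWithinRecoveryWindow(long currentTS) {
        return recoveryStartTS == 0 || currentTS - recoveryStartTS <= maxRetryTimeoutMs;
    }

    public static void main(String[] args) {
        RecoveryTracker tracker = new RecoveryTracker(300);
        long now = System.currentTimeMillis();
        tracker.onStateChange(WorkerState.WORKER_STOPPED, now);
        System.out.println("within window: " + tracker.isWithinRecoveryWindow(now + 1_000));
    }
}
```

Because the start timestamp is written only when it is still 0, a worker that crashes repeatedly (for example on OOM) does not keep extending its own window; a scale-down never opens one.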
Besides the one comment, LGTM
@@ -41,6 +41,11 @@ If the server is running, the response is:
}
```

"maxRetryTimeoutInSec" (default: 5 minutes) can be defined in a model's config YAML file (e.g., model-config.yaml). It is the maximum time window for recovering a dead backend worker. A healthy worker can be in the state WORKER_STARTED, WORKER_MODEL_LOADED, or WORKER_STOPPED within the maxRetryTimeoutInSec window. "Ping" endpoint
Nit: Would it be better to name this config option `maxRecoveryTimeoutInSec`?
Description
The default behavior of the ping endpoint is changed as described in #2231.
Fixes #2231