add shutdown_delay option to executor & gRPC GracefulStop #3711
Hi @asobrien. Thanks for your PR. I'm waiting for a SeldonIO or todo member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the jenkins-x/lighthouse repository. |
/assign @adriangonz
/test integration
/test notebooks
@asobrien Thanks for these improvements! We will run integration and notebook tests and have a look. @axsaucedo
/test integration
/test notebooks
/retest
@asobrien: Cannot trigger testing until a trusted user reviews the PR.
/test integration
/test notebooks
@asobrien: The following tests failed.
All integration tests pass; only two tests failed and they are confirmed flaky. Merging. Thanks for the contribution @asobrien
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: axsaucedo
What this PR does / why we need it:
Several changes are introduced in this PR; the individual commits reflect them and are described in more detail below. In summary, and in order of importance, the changes are:
- Adds a `shutdown_delay` option to the executor and adds graceful termination to the gRPC server.
- Fixes the maxprocs logging error.
- Returns an error if the `ENGINE_PREDICTOR` env var is not found when `getPredictorFromEnv()` is called.

gRPC graceful shutdown
The executor has a `graceful_timeout` option (default: 15s), which allows for graceful termination of the http (REST) server through the net/http Server's `Shutdown()` method. When there are no connections to the http server, `Shutdown()` doesn't block and `os.Exit()` is called immediately. The gRPC server, which runs in the main goroutine, is then terminated abruptly without its `GracefulStop()` method being called to drain active connections.

The changes introduced here run both the http and gRPC servers in their own goroutines and use per-server channels to signal shutdown. The `context.WithTimeout()` created from the `graceful_timeout` option is moved out of `runHttpServer` and is used within the `waitForShutdown()` function. While the http server can now block for a period longer than the graceful timeout, the server will still be stopped when the program exits. The gRPC server does not use a context in its graceful shutdown method: the graceful timeout expiry is handled within `waitForShutdown()`.
server shutdown_delay
We run several Seldon models on a Kubernetes cluster and have configured Seldon to use Ambassador as a gateway. When we do RollingUpdates (redeployments) of a model, we see a significant increase in error rate, as measured by an upstream service. The errors look something of the form:

These errors are seen in our upstream service whenever one of the pods (running a seldon-container-engine) is deleted. Some simple testing revealed that traffic is still directed at the terminating pod for a period after it has already received a SIGTERM and shut down; this leads to the errors we see above. We deploy the seldondeployment CRD with a `seldon.io/headless-svc: "true"` annotation. What we think is happening is that Envoy (via Ambassador) continues to keep the terminated pod's IP in rotation even after the pod is rotated. Some rudimentary measurements suggest that it can take over 10 seconds for a terminated pod to stop showing up in the DNS record for the service. Additionally, by default, Envoy refreshes from DNS every 5 seconds.

We found that we could delete a pod without seeing gRPC errors if we utilized a `shutdown_delay` of 30s. By default this new option is set to 0, so there is no change in existing behavior.

maxproc logging
The `logger.Info` passed into maxprocs's `Set()` method does not behave like `fmt.Printf()`, which leads to panic-level errors in the logs:

This is because, after the first argument, logr expects the remaining arguments to be key/value pairs. The change here adds an inline function so that only a single argument (a formatted string) is passed to `logger.Info`, and the above error is no longer generated. With this change, the above log line now appears as:

missing ENGINE_PREDICTOR env var
The `getPredictorFromEnv()` function is modified to return an error in the same manner as `getPredictorFromFile()`. If a predictor can't be loaded from a file (because the file is missing or unsupported), that function returns an error. If no filename is specified, a predictor is loaded from the environment. However, if the ENGINE_PREDICTOR env var is missing, a nil pointer is returned, which leads to a subsequent panic. This change returns an error if `getPredictorFromEnv()` is called and no corresponding ENGINE_PREDICTOR env var has been set.

Which issue(s) this PR fixes: None.
The issue is described within this PR.
Special notes for your reviewer: None.
Does this PR introduce a user-facing change?: Yes.