Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add shutdown_delay option to executor & gRPC GracefulStop #3711

Merged
merged 3 commits into from
Nov 2, 2021

Conversation

asobrien
Copy link
Contributor

What this PR does / why we need it:

Several changes changes are introduced in the PR, the individual commits reflect the changes and are described in more detail below; in summary and order of importance the changes are:

  1. (3b098ea) Introduces shutdown_delay option in executor and adds graceful termination to gRPC server.
  2. (33b04da) Fixes maxproc logging.
  3. (b9122dc) Return error if ENGINE_PREDICTOR env var is not found when getPredictorFromEnv() is called.

gRPC graceful shutdown

The executor has a graceful_timeout (default: 15s) option, which allows for graceful termination of the http (REST) server through the net/http Server's Shutdown() method. When there is no connections to the http server, the Shutdown() doesn't block and os.Exit() is immediately called. The gRPC server which is running in the main goroutine is abruptly terminated without calling its associated GracefulStop() method to drain active connections.

The changes introduced here run both the http and gRPC servers in their own goroutines and use per-server channels to signal shutdown. The context.WithTimeout() created from the graceful_timeout option is moved out of runHttpServer and is used within the waitForShutdown() function. While the http server can now block for a period longer than the graceful timeout, the server will still be stopped when the program exits. The gRPC server does not use a context in its graceful shutdown method: the graceful timeout expiry is handled within waitForShutdown().

server shutdown_delay

We run several Seldon models on a kubernetes cluster, and have configured Seldon to use Ambassador as a gateway. When we do RollingUpdates (redeployments) of the model we see a significant increase in error rate, as measured by an upstream service. The errors look something of the form:

<_Rendezvous of RPC that terminated with:	
status = StatusCode.UNAVAILABLE	
details = "upstream connect error or disconnect/reset before headers. reset reason: local reset"	
debug_error_string = "
{
  "created":"@1635250379.345962743",
  "description":"Error received from peer ipv4:xxx.xxx.xxx.xxx:80",
  "file":"src/core/lib/surface/call.cc",
  "file_line":1041,
  "grpc_message":"upstream connect error or disconnect/reset before headers. reset reason: local reset",
  "grpc_status":14
}"
>

These errors are seen in our upstream service whenever one of the pods (running a seldon-container-engine) is deleted. Some simple testing revealed that traffic is still directed at the terminating pod for a period after it has already received a SIGTERM and shutdown, this leads to the errors we see above. We deploy the seldondeployment CRD with a seldon.io/headless-svc: "true" annotation. What we think is happening is that Envoy (via Ambassador) continues to keep the terminated pod's IP in rotation even after the pod is rotated. Some rudimentary measurements suggest that it can take over 10 seconds for a terminated pod to stop showing up in DNS record for the service. Additionally, by default, Envoy refreshes from DNS every 5 seconds.

We found that we could delete a pod without seeing gRPC errors if we utilized a shutdown_delay of 30s. By default this new option is set to 0, so there is no change in existing behavior.

maxproc logging

The logger.Info passed into maxprocs's Set() method does not behave like fmt.Printf() which leads to panic-level errors in the logs:

{"level":"dpanic","ts":1635457222.747586,"logger":"entrypoint","msg":"odd number of arguments passed as key-value pairs for logging","ignored key":"1","stacktrace":"go.uber.org/automaxprocs/maxprocs.(*config).log\n\t/src/go/pkg/mod/go.uber.org/automaxprocs@v1.4.0/maxprocs/maxprocs.go:47\ngo.uber.org/automaxprocs/maxprocs.Set\n\t/src/go/pkg/mod/go.uber.org/automaxprocs@v1.4.0/maxprocs/maxprocs.go:101\nmain.main\n\t/src/seldon-core/executor/cmd/executor/main.go:310\nruntime.main\n\t/opt/local/lib/go/src/runtime/proc.go:255"}

This is because after the first argument logr expects paired arguments that are formatted as keyval pairs. The change here adds an inline function so that only a single argument is passed to logger.Info (a formatted string) and the above error is no longer generated. With this change, the above log line now appears as:

{"level":"info","ts":1635467960.469512,"logger":"entrypoint.maxprocs","msg":"maxprocs: Honoring GOMAXPROCS=\"1\" as set in environment"}

missing ENGINE_PREDICTOR env var

The getPredictorFromEnv() function is modified to error in a similar manner as the getPredictorFromFile(). If a predictor can't be loaded from a file (because the file is missing or unsupported) that function returns an error. If no filename is specified a predictor will be loaded from the environment. However, if the ENGINE_PREDICTOR env var is missing an nil pointer is returned which leads to a subsequent panic. This change returns an error if the getPredictorFromEnv() function is called and no corresponding ENGINE_PREDICTOR env var has been set.

Which issue(s) this PR fixes: None.

The issue is described within this PR.

Special notes for your reviewer: None.

Does this PR introduce a user-facing change?: Yes.

Adds shutdown_delay option to executor

@seldondev
Copy link
Collaborator

Hi @asobrien. Thanks for your PR.

I'm waiting for a SeldonIO or todo member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the jenkins-x/lighthouse repository.

@asobrien
Copy link
Contributor Author

/assign @adriangonz

@asobrien asobrien changed the title add shutdown_delay option to exector & gRPC GracefulStop add shutdown_delay option to executor & gRPC GracefulStop Oct 29, 2021
@ukclivecox
Copy link
Contributor

/test integration

@ukclivecox
Copy link
Contributor

/test notebooks

@ukclivecox
Copy link
Contributor

@asobrien Thanks for these improvements! We will run integration and notebook tests and have a look. @axsaucedo

@ukclivecox
Copy link
Contributor

/test integration

@ukclivecox
Copy link
Contributor

/test notebooks

@asobrien
Copy link
Contributor Author

/retest

@seldondev
Copy link
Collaborator

@asobrien: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the jenkins-x/lighthouse repository.

@axsaucedo
Copy link
Contributor

/test integration

@axsaucedo
Copy link
Contributor

/test notebooks

@seldondev
Copy link
Collaborator

@asobrien: The following tests failed, say /retest to rerun them all:

Test name Commit Details Rerun command
integration b9122dc link /test integration
notebooks b9122dc link /test notebooks

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the jenkins-x/lighthouse repository. I understand the commands that are listed here.

@axsaucedo
Copy link
Contributor

All integration tests pass - only two tests failed and they are confirmed flaky, merging - thanks for the contribution @asobrien
/approve

@axsaucedo axsaucedo merged commit 3a9e469 into SeldonIO:master Nov 2, 2021
@seldondev
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: axsaucedo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

stephen37 pushed a commit to stephen37/seldon-core that referenced this pull request Dec 21, 2021
)

* shutdown delay & gRPC graceful shutdown

* fix maxprocs logger

* getPredictorFromEnv missing key error & tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants