
Immediately drain connections from hosts not in service discovery, as an option #440

Closed
bobzilladev opened this issue Feb 6, 2017 · 2 comments
Labels
enhancement (Feature requests. Not bugs or questions.), help wanted (Needs help!)

Comments

@bobzilladev
Contributor

When a host is removed from service discovery, Envoy will still send it traffic until the health check goes bad. We'd want a configuration option to have Envoy start draining connections immediately, probably along with a minimum-hosts setting as a sanity check in case service discovery ever lost its mind.

Our discovery always shows all instances, but there's an 'active' flag controlling whether to send traffic, i.e. a notion of being "in the load balancer". That flag is used to filter generated HAProxy configurations and other load balancers. For Envoy SDS I just omit any host that isn't 'active', so one option would be a similar 'active' flag on the SDS API versus an "obey service discovery" setting on the cluster config.

In a normal deployment we leave up the old fleet for a time so we can quickly switch back in case of problems. The old fleet is taken out of the load balancer but left alive. If monitoring finds a box going bad or rebooted, it is also taken out of the load balancer to be replaced or investigated. In both cases we don't want any customer traffic going to those hosts. I understand there's a mechanism to force-fail the Envoy health check, but it'd be nice not to have to add that when service discovery is already saying not to send traffic there.
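
To make the 'active' flag approach above concrete, here is a minimal control-plane sketch. It is not Envoy code; the field names (`active`, `ip`, `port`) and the response shape are illustrative assumptions standing in for whatever the real discovery backend returns:

```python
# Hypothetical control-plane helper: hosts carry an 'active' flag (the
# "in the load balancer" notion described above); anything not active is
# simply omitted from the host list served to Envoy. All field names here
# are illustrative, not part of any real Envoy or SDS schema.

def build_sds_hosts(discovery_hosts):
    """Return only the hosts that service discovery marks as active."""
    return [
        {"ip_address": h["ip"], "port": h["port"]}
        for h in discovery_hosts
        if h.get("active", False)
    ]

# Example: the old fleet is still registered but flagged inactive, so it
# disappears from the response; with the option requested in this issue,
# Envoy would drain it immediately instead of waiting for health checks
# to fail.
hosts = [
    {"ip": "10.0.0.1", "port": 8080, "active": True},
    {"ip": "10.0.0.2", "port": 8080, "active": False},  # old fleet, kept alive
]
print(build_sds_hosts(hosts))  # only 10.0.0.1 remains
```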

@alexkrishnan

Am I correct that this behavior could be modified in upstream_impl, possibly by reading a new property of the cluster?

@mattklein123 mattklein123 added the enhancement label Feb 22, 2017
@mattklein123 mattklein123 added the help wanted label Oct 28, 2017
htuch pushed a commit that referenced this issue May 8, 2018
This PR adds a configuration flag that allows disabling the "eventually consistent" aspect of endpoint updates: instead of waiting for endpoints to go unhealthy before removing them from the cluster, they are removed immediately. This gives the control plane greater control in cases where one might want to divert traffic from an endpoint without having it go unhealthy. The flag is set on the cluster and so applies to all endpoints within that cluster.
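
As a rough illustration only, a cluster resource served by a control plane might carry the flag along these lines. The field name below is taken from this PR description and the surrounding structure is an assumption; the exact name and placement in the merged API may differ:

```python
# Hypothetical sketch of a cluster resource a control plane might serve.
# "drain_connections_on_eds_removal" is the name used in this PR
# description; treat both it and the rest of the structure as
# illustrative rather than the definitive merged API.
cluster = {
    "name": "backend",
    "type": "EDS",
    "eds_cluster_config": {"eds_config": {"ads": {}}},
    # Drain endpoints as soon as EDS stops advertising them, regardless
    # of health-check status. Left unset it defaults to false.
    "drain_connections_on_eds_removal": True,
}
```

With the flag left at its default, behavior is unchanged: endpoints removed from the ClusterLoadAssignment keep receiving traffic until health checking marks them unhealthy.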

Risk Level:
Low: Small configurable feature which reuses existing behavior (identical to the behavior when no health checker is configured). Defaults to disabled, so should have no impact on existing configurations.

Testing:
Added a unit test for the case where endpoints are healthy and then removed from the ClusterLoadAssignment in a subsequent config update.

Docs Changes:
Docs were added to the proto field.

Release Notes:
Added a release note under cluster: "Add :ref:`option <envoy_api_field_Cluster.drain_connections_on_eds_removal>` to drain endpoints after they are removed through EDS, regardless of health status."

Fixes #440 and #3276 (note that the latter issue also asks for more fine-grained control over endpoint removal; the solution this PR provides was brought up as a partial solution to #3276).

Signed-off-by: Snow Pettersen <snowp@squareup.com>
@mattklein123
Member

I believe this is fixed by #3302. LMK if not and we can reopen.

rshriram pushed a commit to rshriram/envoy that referenced this issue Oct 30, 2018
* Add support files to test LDS manually.

* fix indent.
PiotrSikora added a commit to PiotrSikora/envoy that referenced this issue Mar 3, 2020
This change restores Wasm Service, which was accidentally removed
in a bad merge (envoyproxy#321).

It also fixes a number of issues:

1. Wasm Services were started before Cluster Manager was initialized,
   which crashed Envoy when Wasm Service tried to dispatch a callout.

2. Wasm Services (per-thread) were not stored and they were getting
   out of scope, so they were immediately destroyed.

3. Wasm Services (singleton) were started twice.

Signed-off-by: Piotr Sikora <piotrsikora@google.com>
istio-testing pushed a commit to istio/envoy that referenced this issue Mar 3, 2020
jplevyak pushed a commit to jplevyak/envoy that referenced this issue Apr 13, 2020
wolfguoliang pushed a commit to wolfguoliang/envoy that referenced this issue Jan 23, 2021
jpsim pushed a commit that referenced this issue Nov 28, 2022
Description: The liveliness CI tests for Android are flaky because the response view holder doesn't log with high fidelity without UI interaction. This PR changes the logs to be printed in the main activity.
Risk Level: low
Testing: CI

Signed-off-by: Jose Nino <jnino@lyft.com>
Signed-off-by: JP Simard <jp@jpsim.com>
jpsim pushed a commit that referenced this issue Nov 29, 2022