
Immediately drain connections from hosts not in service discovery, as an option #440

Closed
bobzilladev opened this issue Feb 6, 2017 · 2 comments
Labels
enhancement (Feature requests. Not bugs or questions.), help wanted (Needs help!)

Comments

@bobzilladev
Contributor

When a host is removed from service discovery, Envoy will still send it traffic until the health check goes bad. We'd want a configuration option to have Envoy start draining connections immediately, probably along with a minimum-hosts setting as a sanity check in case service discovery ever lost its mind.

Our discovery always shows all instances, but there's an 'active' flag controlling whether to send traffic, i.e. a notion of being "in the load balancer". That flag is used to filter generated HAProxy configurations and other load balancers. For Envoy SDS I just omit any host that isn't 'active', so one option would be a similar 'active' flag on the SDS API versus an "obey service discovery" setting on the cluster config.

In a normal deployment we leave up the old fleet for a time so we can quickly switch back in case of problems. The old fleet is taken out of the load balancer but left alive. If monitoring finds a box going bad or rebooted, it is also taken out of the load balancer to be replaced or investigated. In both cases we don't want any customer traffic going to those hosts. I understand there's a mechanism to force-fail the Envoy health check, but it'd be nice not to have to add that when service discovery is already saying not to send traffic there.
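
To make the 'active' flag approach above concrete, here is a minimal control-plane sketch. It is not Envoy code; the field names (`active`, `ip`, `port`) and the response shape are illustrative assumptions standing in for whatever the real discovery backend returns:

```python
# Hypothetical control-plane helper: hosts carry an 'active' flag (the
# "in the load balancer" notion described above); anything not active is
# simply omitted from the host list served to Envoy. All field names here
# are illustrative, not part of any real Envoy or SDS schema.

def build_sds_hosts(discovery_hosts):
    """Return only the hosts that service discovery marks as active."""
    return [
        {"ip_address": h["ip"], "port": h["port"]}
        for h in discovery_hosts
        if h.get("active", False)
    ]

# Example: the old fleet is still registered but flagged inactive, so it
# disappears from the response; with the option requested in this issue,
# Envoy would drain it immediately instead of waiting for health checks
# to fail.
hosts = [
    {"ip": "10.0.0.1", "port": 8080, "active": True},
    {"ip": "10.0.0.2", "port": 8080, "active": False},  # old fleet, kept alive
]
print(build_sds_hosts(hosts))  # only 10.0.0.1 remains
```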

@alexkrishnan

Am I correct that this behavior could be modified in upstream_impl, possibly by reading a new property of the cluster?

@mattklein123 mattklein123 added the enhancement label Feb 22, 2017
@mattklein123 mattklein123 added the help wanted label Oct 28, 2017
htuch pushed a commit that referenced this issue May 8, 2018
This PR adds a configuration flag that allows disabling the "eventually consistent" aspect of endpoint updates: instead of waiting for endpoints to go unhealthy before removing them from the cluster, they are removed immediately. This gives the control plane greater control in cases where one might want to divert traffic from an endpoint without having it go unhealthy. The flag is set on the cluster and so applies to all endpoints within that cluster.
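
As a rough illustration only, a cluster resource served by a control plane might carry the flag along these lines. The field name below is taken from this PR description and the surrounding structure is an assumption; the exact name and placement in the merged API may differ:

```python
# Hypothetical sketch of a cluster resource a control plane might serve.
# "drain_connections_on_eds_removal" is the name used in this PR
# description; treat both it and the rest of the structure as
# illustrative rather than the definitive merged API.
cluster = {
    "name": "backend",
    "type": "EDS",
    "eds_cluster_config": {"eds_config": {"ads": {}}},
    # Drain endpoints as soon as EDS stops advertising them, regardless
    # of health-check status. Left unset it defaults to false.
    "drain_connections_on_eds_removal": True,
}
```

With the flag left at its default, behavior is unchanged: endpoints removed from the ClusterLoadAssignment keep receiving traffic until health checking marks them unhealthy.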

Risk Level:
Low: Small configurable feature which reuses existing behavior (identical to the behavior when no health checker is configured). Defaults to disabled, so should have no impact on existing configurations.

Testing:
Added a unit test for the case where endpoints are healthy and then removed from the ClusterLoadAssignment in a subsequent config update.

Docs Changes:
Docs were added to the proto field.

Release Notes:
Added a release note under cluster: "Add :ref:`option <envoy_api_field_Cluster.drain_connections_on_eds_removal>` to drain endpoints after they are removed through EDS, regardless of health status."

Fixes #440 and #3276 (note that the latter issue also asks for more fine-grained control over endpoint removal; the solution this PR provides was brought up as a partial solution to #3276).

Signed-off-by: Snow Pettersen <snowp@squareup.com>
@mattklein123
Member

I believe this is fixed by #3302. LMK if not and we can reopen.

rshriram pushed a commit to rshriram/envoy that referenced this issue Oct 30, 2018
* Add support files to test LDS manually.

* fix indent.
PiotrSikora added a commit to PiotrSikora/envoy that referenced this issue Mar 3, 2020
This change restores Wasm Service, which was accidentally removed
in a bad merge (envoyproxy#321).

It also fixes a number of issues:

1. Wasm Services were started before Cluster Manager was initialized,
   which crashed Envoy when Wasm Service tried to dispatch a callout.

2. Wasm Services (per-thread) were not stored and they were getting
   out of scope, so they were immediately destroyed.

3. Wasm Services (singleton) were started twice.

Signed-off-by: Piotr Sikora <piotrsikora@google.com>
istio-testing pushed a commit to istio/envoy that referenced this issue Mar 3, 2020
jplevyak pushed a commit to jplevyak/envoy that referenced this issue Apr 13, 2020
wolfguoliang pushed a commit to wolfguoliang/envoy that referenced this issue Jan 23, 2021
jpsim pushed a commit that referenced this issue Nov 28, 2022
Description: The liveliness CI tests for Android are flaky because the response view holder doesn't log with high fidelity without UI interaction. This PR changes the logs to be printed in the main activity.
Risk Level: low
Testing: CI

Signed-off-by: Jose Nino <jnino@lyft.com>
Signed-off-by: JP Simard <jp@jpsim.com>
jpsim pushed a commit that referenced this issue Nov 29, 2022