Connect: Fix Envoy getting stuck during load #5499
Conversation
I realised that I was mistaken - Envoy does support being told there are zero instances in a LoadAssignment. I had just been too clever and avoided creating an empty assignment at all when there were no results, thinking that was invalid or caused by partial loading. I'll modify this PR to use that fact instead, which has the same effect but will actually cause Envoy to show zero healthy backends immediately.

It brings up an interesting point though: if we explicitly tell Envoy there are zero instances then it has no chance. For example, if a Consul health check has a bug and marks all instances as unavailable, then all Envoys will hard-fail them since they have no known instances at all. The behaviour with this PR as it stands would actually be a little more graceful - Envoy would hang on to the last known set of instances and keep trying them, taking them out of rotation via outlier detection, but still sending traffic anyway once more than 10% of instances are considered outliers. That means it would ride out issues like bugs in health checks or mass flapping of registrations. 🤔 i.e. the 'not ideal' behaviour here actually has benefits. Its only real downside is that Envoy still reports that it knows about backends (and will even say they are healthy until they've hit the outlier failure threshold), which is somewhat confusing.

I think on the basis that all of the rest of Consul's service discovery mechanisms hard-remove all instances if they are all critical, I'll follow suit here, but it's certainly something we could make configurable later perhaps.
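To make that concrete, here is a minimal sketch of what an empty load assignment looks like, assuming the gogo-based go-control-plane v2 API in use at the time (newer releases use pointer slices, and the helper name here is made up for illustration):

package xds

import (
	envoy "github.com/envoyproxy/go-control-plane/envoy/api/v2"
	envoyendpoint "github.com/envoyproxy/go-control-plane/envoy/api/v2/endpoint"
)

// emptyLoadAssignment tells Envoy that the cluster exists but currently has
// zero instances, rather than omitting the cluster from the response entirely.
func emptyLoadAssignment(clusterName string) *envoy.ClusterLoadAssignment {
	return &envoy.ClusterLoadAssignment{
		ClusterName: clusterName,
		Endpoints: []envoyendpoint.LocalityLbEndpoints{{
			LbEndpoints: []envoyendpoint.LbEndpoint{},
		}},
	}
}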
agent/xds/clusters.go
Outdated
"time" | ||
|
||
envoy "github.com/envoyproxy/go-control-plane/envoy/api/v2" | ||
envoyauth "github.com/envoyproxy/go-control-plane/envoy/api/v2/auth" | ||
"github.com/envoyproxy/go-control-plane/envoy/api/v2/cluster" |
named import as envoycluster for consistency?
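Presumably the suggestion is a renamed import along these lines, shown here only to illustrate the reviewer's point:

import (
	envoycluster "github.com/envoyproxy/go-control-plane/envoy/api/v2/cluster"
)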
agent/xds/clusters.go
Outdated
@@ -101,9 +103,24 @@ func makeUpstreamCluster(upstream structs.Upstream, cfgSnap *proxycfg.ConfigSnap
	}

	if c == nil {
		conTimeout := 5 * time.Second
		if toRaw, ok := upstream.Config["connect_timeout_ms"]; ok {
			switch v := toRaw.(type) {
The body of this if block feels like it would be better if it were extracted into a parseTimeMillis helper function or something like that, since it's not terribly envoy specific.
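A rough sketch of what such a helper could look like, assuming the value comes from the opaque upstream config map (parseTimeMillis is only the reviewer's suggested name, and the exact set of types handled is a guess based on the switch in the diff above):

package xds

import (
	"fmt"
	"strconv"
	"time"
)

// parseTimeMillis converts a raw opaque config value into a duration in
// milliseconds. Values decoded from JSON/HCL may arrive as float64, int or
// string, so all three are handled. Sketch only, not Consul's actual helper.
func parseTimeMillis(raw interface{}) (time.Duration, error) {
	switch v := raw.(type) {
	case string:
		ms, err := strconv.Atoi(v)
		if err != nil {
			return 0, fmt.Errorf("invalid milliseconds value %q: %v", v, err)
		}
		return time.Duration(ms) * time.Millisecond, nil
	case float64: // JSON numbers decode as float64
		return time.Duration(v) * time.Millisecond, nil
	case int:
		return time.Duration(v) * time.Millisecond, nil
	default:
		return 0, fmt.Errorf("invalid milliseconds value: %v", raw)
	}
}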
agent/xds/endpoints.go
Outdated
@@ -4,6 +4,7 @@ import (
	"errors"

	envoy "github.com/envoyproxy/go-control-plane/envoy/api/v2"
	"github.com/envoyproxy/go-control-plane/envoy/api/v2/core"
named import as envoycore
There is another good argument for the suggested change that actually removes unhealthy instances. In cases where the instances have been deregistered, the IP and port may be recycled for a future workload (e.g. a Kubernetes pod) that is actually a different service. In that case, keeping stale entries around just in case they aren't really gone can do actual harm, so it's better to be explicit for now and remove instances that are not available. In the future, when we have the ability to efficiently provide unhealthy instances too, marked explicitly as unhealthy, that would be best; but leaving Envoy stale for the sake of possibly being more robust to check malfunctions seems to be a bad trade.

tl;dr: I'll update this PR to allow empty LoadAssignments instead of empty discovery results.
Also in this PR:
- Enabled outlier detection on upstreams, which will mark instances unhealthy after 5 failures (using Envoy's defaults)
- Enabled weighted load balancing where DNS weights are configured
a6af4f6 to c34fd44
Parts of this PR are already changed in a WIP branch I have that will be pushed up soon (custom config parsing I rewrote to be centralised and more robust using
LGTM
This may fix or improve the situation reported in #5332, which is a combination of things. It fixes a known issue that has come up in a few other places (e.g. #4868) where an Envoy will hang on startup if one of its upstreams has no healthy instances.
The fix is to allow returning empty responses in xDS since otherwise Envoy blocks waiting for endpoints before it continues to load listeners.
This can also happen if there were healthy upstream endpoints but they all got deregistered or turned unhealthy. In this case we will now send the empty set; however, Envoy's behaviour when it is updated with zero healthy instances for a cluster is to retain the ones it had before. This makes sense as it can't be worse than having nothing to connect to, but it is confusing since in Envoy's metrics they remain marked as healthy, which makes it look like Envoy isn't respecting Consul's health checks.
To help fix that, I've enabled default outlier detection on all upstreams too. This means that eventually, after 5 failed attempts to connect, Envoy will internally mark the stale upstreams as "unhealthy". This is only a partial solution since, by default, outlier detection will only mark the lower of 1 or 10% of the upstreams as unhealthy before it ignores the outlier status again, so in some cases it will still look like Consul has healthy upstreams even when none exist.
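For reference, enabling the defaults is roughly the following with the go-control-plane v2 types (an illustration rather than the PR's exact code; the cluster name is made up, and leaving the OutlierDetection message empty means Envoy applies its built-in defaults):

package xds

import (
	envoy "github.com/envoyproxy/go-control-plane/envoy/api/v2"
	envoycluster "github.com/envoyproxy/go-control-plane/envoy/api/v2/cluster"
)

func exampleUpstreamCluster() *envoy.Cluster {
	return &envoy.Cluster{
		Name: "service:web", // hypothetical upstream name
		// An empty OutlierDetection config turns the feature on with
		// Envoy's defaults, e.g. ejection after 5 consecutive failures
		// and a cap on how many hosts can be ejected at once.
		OutlierDetection: &envoycluster.OutlierDetection{},
	}
}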
Changing Consul to return unhealthy instances with HealthStatus: UNHEALTHY is an option, but it's not trivial since Consul has no single API that can do that currently for catalog/health, and can't do it at all for Prepared Queries. We may work on that in the future.

Also in this PR:
- Enabled outlier detection on upstreams, which will mark instances unhealthy after 5 failures (using Envoy's defaults)
- Enabled weighted load balancing where DNS weights are configured
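As a rough illustration of the HealthStatus: UNHEALTHY idea mentioned above (a hypothetical helper using the go-control-plane v2 types; this is not something Consul does today):

package xds

import (
	envoycore "github.com/envoyproxy/go-control-plane/envoy/api/v2/core"
	envoyendpoint "github.com/envoyproxy/go-control-plane/envoy/api/v2/endpoint"
)

// markUnhealthy shows how an instance in a critical state could be passed
// through to Envoy explicitly marked unhealthy instead of being dropped
// from the load assignment entirely.
func markUnhealthy(ep *envoyendpoint.LbEndpoint) {
	ep.HealthStatus = envoycore.HealthStatus_UNHEALTHY
}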