Connect: Fix Envoy getting stuck during load #5499
Conversation
I realised that I was mistaken - Envoy does support being told there are zero instances in a LoadAssignment. I had just been too clever and avoided creating an empty assignment at all when there were no results, thinking that was invalid or caused by partial loading. I'll modify this PR to use that fact instead, which has the same effect but will actually cause Envoy to show zero healthy backends immediately.

It brings up an interesting point though: if we explicitly tell Envoy there are zero instances then it has no chance. For example, if a Consul health check has a bug and marks all instances as unavailable, then all Envoys will hard-fail them since they have no known instances at all. The behaviour with this PR as it stands would actually be a little more graceful - Envoy would hang on to the last known set of instances and keep trying them, taking them out of rotation via outlier detection, but still sending traffic anyway once more than 10% of instances are considered outliers. That means it would ride out issues like bugs in health checks or mass flapping of registrations. 🤔 i.e. the 'not ideal' behaviour here actually has benefits. Its only real downside is that Envoy still reports that it knows about backends (and will even say they are healthy until they've hit the outlier failure threshold), which is somewhat confusing.

I think on the basis that all of the rest of Consul's service discovery mechanisms hard-remove all instances if they are all critical, I'll follow suit here, but it's certainly something we could make configurable later perhaps.
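To make that concrete, here is a minimal sketch of what an empty load assignment looks like, assuming the gogo-based go-control-plane v2 API in use at the time (newer releases use pointer slices, and the helper name here is made up for illustration):

package xds

import (
	envoy "github.com/envoyproxy/go-control-plane/envoy/api/v2"
	envoyendpoint "github.com/envoyproxy/go-control-plane/envoy/api/v2/endpoint"
)

// emptyLoadAssignment tells Envoy that the cluster exists but currently has
// zero instances, rather than omitting the cluster from the response entirely.
func emptyLoadAssignment(clusterName string) *envoy.ClusterLoadAssignment {
	return &envoy.ClusterLoadAssignment{
		ClusterName: clusterName,
		Endpoints: []envoyendpoint.LocalityLbEndpoints{{
			LbEndpoints: []envoyendpoint.LbEndpoint{},
		}},
	}
}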
agent/xds/clusters.go
Outdated
"time" | ||
|
||
envoy "github.com/envoyproxy/go-control-plane/envoy/api/v2" | ||
envoyauth "github.com/envoyproxy/go-control-plane/envoy/api/v2/auth" | ||
"github.com/envoyproxy/go-control-plane/envoy/api/v2/cluster" |
named import as envoycluster for consistency?
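Presumably the suggestion is a renamed import along these lines, shown here only to illustrate the reviewer's point:

import (
	envoycluster "github.com/envoyproxy/go-control-plane/envoy/api/v2/cluster"
)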
agent/xds/clusters.go
Outdated
@@ -101,9 +103,24 @@ func makeUpstreamCluster(upstream structs.Upstream, cfgSnap *proxycfg.ConfigSnap
	}

	if c == nil {
		conTimeout := 5 * time.Second
		if toRaw, ok := upstream.Config["connect_timeout_ms"]; ok {
			switch v := toRaw.(type) {
The body of this if block feels like it would be better if it were extracted into a parseTimeMillis helper function or something like that, since it's not terribly envoy specific.
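A rough sketch of what such a helper could look like, assuming the value comes from the opaque upstream config map (parseTimeMillis is only the reviewer's suggested name, and the exact set of types handled is a guess based on the switch in the diff above):

package xds

import (
	"fmt"
	"strconv"
	"time"
)

// parseTimeMillis converts a raw opaque config value into a duration in
// milliseconds. Values decoded from JSON/HCL may arrive as float64, int or
// string, so all three are handled. Sketch only, not Consul's actual helper.
func parseTimeMillis(raw interface{}) (time.Duration, error) {
	switch v := raw.(type) {
	case string:
		ms, err := strconv.Atoi(v)
		if err != nil {
			return 0, fmt.Errorf("invalid milliseconds value %q: %v", v, err)
		}
		return time.Duration(ms) * time.Millisecond, nil
	case float64: // JSON numbers decode as float64
		return time.Duration(v) * time.Millisecond, nil
	case int:
		return time.Duration(v) * time.Millisecond, nil
	default:
		return 0, fmt.Errorf("invalid milliseconds value: %v", raw)
	}
}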
agent/xds/endpoints.go
Outdated
@@ -4,6 +4,7 @@ import (
	"errors"

	envoy "github.com/envoyproxy/go-control-plane/envoy/api/v2"
	"github.com/envoyproxy/go-control-plane/envoy/api/v2/core"
named import as envoycore
There is another good argument for the suggested change that actually removes unhealthy instances. In cases where the instances have been deregistered, the IP and port may be recycled for a future workload (e.g. a Kubernetes pod) that is actually a different service. In that case, keeping stale entries around just in case they aren't really gone can do actual harm, so it's better to be explicit for now and remove instances that are not available. In the future, when we have the ability to efficiently provide unhealthy instances too, marked explicitly as unhealthy, that would be best; but leaving Envoy stale for the sake of possibly being more robust to check malfunctions seems to be a bad trade.

tl;dr: I'll update this PR to allow empty LoadAssignments instead of empty discovery results.
Also in this PR:
- Enabled outlier detection on upstreams, which will mark instances unhealthy after 5 failures (using Envoy's defaults)
- Enabled weighted load balancing where DNS weights are configured
a6af4f6 to c34fd44
Parts of this PR are already changed in a WIP branch I have that will be pushed up soon (custom config parsing I rewrote to be centralised and more robust using
LGTM
This may fix or improve the situation reported in #5332, which is a combination of things. It fixes a known issue that has come up in a few other places (e.g. #4868) where an Envoy will hang on startup if one of its upstreams has no healthy instances.
The fix is to allow returning empty responses in xDS since otherwise Envoy blocks waiting for endpoints before it continues to load listeners.
This can also happen if there were healthy upstream endpoints but they all got deregistered or turned unhealthy. In this case we will now send the empty set; however, Envoy's behaviour when it is updated with zero healthy instances for a cluster is to retain the ones it had before. This makes sense as it can't be worse than having nothing to connect to, but it is confusing since in Envoy's metrics they remain marked as healthy, which makes it look like Envoy isn't respecting Consul's health checks.
To help fix that, I've enabled default outlier detection on all upstreams too. This means that eventually, after 5 failed attempts to connect, Envoy will internally mark the stale upstreams as "unhealthy". This is only a partial solution since, by default, outlier detection will only mark the lower of 1 or 10% of the upstreams as unhealthy before it ignores the outlier status again, so in some cases it will still look like Consul has healthy upstreams even when none exist.
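For reference, enabling the defaults is roughly the following with the go-control-plane v2 types (an illustration rather than the PR's exact code; the cluster name is made up, and leaving the OutlierDetection message empty means Envoy applies its built-in defaults):

package xds

import (
	envoy "github.com/envoyproxy/go-control-plane/envoy/api/v2"
	envoycluster "github.com/envoyproxy/go-control-plane/envoy/api/v2/cluster"
)

func exampleUpstreamCluster() *envoy.Cluster {
	return &envoy.Cluster{
		Name: "service:web", // hypothetical upstream name
		// An empty OutlierDetection config turns the feature on with
		// Envoy's defaults, e.g. ejection after 5 consecutive failures
		// and a cap on how many hosts can be ejected at once.
		OutlierDetection: &envoycluster.OutlierDetection{},
	}
}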
Changing Consul to return unhealthy instances with HealthStatus: UNHEALTHY is an option, but it's not trivial since Consul has no single API that can do that currently for catalog/health, and can't do it at all for Prepared Queries. We may work on that in the future.

Also in this PR:
- Enabled outlier detection on upstreams, which will mark instances unhealthy after 5 failures (using Envoy's defaults)
- Enabled weighted load balancing where DNS weights are configured
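As a rough illustration of the HealthStatus: UNHEALTHY idea mentioned above (a hypothetical helper using the go-control-plane v2 types; this is not something Consul does today):

package xds

import (
	envoycore "github.com/envoyproxy/go-control-plane/envoy/api/v2/core"
	envoyendpoint "github.com/envoyproxy/go-control-plane/envoy/api/v2/endpoint"
)

// markUnhealthy shows how an instance in a critical state could be passed
// through to Envoy explicitly marked unhealthy instead of being dropped
// from the load assignment entirely.
func markUnhealthy(ep *envoyendpoint.LbEndpoint) {
	ep.HealthStatus = envoycore.HealthStatus_UNHEALTHY
}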