Connect: Fix Envoy getting stuck during load #5499

Merged: 4 commits, Mar 22, 2019
Conversation

@banks (Member) commented Mar 15, 2019

This may fix or improve the situation reported in #5332, which is a combination of things. It fixes a known issue that has come up in a few other places (e.g. #4868) where an Envoy will hang on startup if one of its upstreams has no healthy instances.

The fix is to allow returning empty responses in xDS, since otherwise Envoy blocks waiting for endpoints before it continues loading listeners.

This can also happen if there were healthy upstream endpoints but they all got deregistered or turned unhealthy. In this case we will now send the empty set; however, Envoy's behaviour when it is updated with zero healthy instances for a cluster is to retain the ones it had before. This makes sense, as it can't be worse than having nothing to connect to, but it is confusing: in Envoy's metrics the stale instances remain marked as healthy, which makes it look like Envoy isn't respecting Consul's health checks.

To help fix that, I've enabled default outlier detection on all upstreams too. This means that eventually, after 5 failed attempts to connect, Envoy will internally mark the stale upstreams as "unhealthy". This is a partial solution, since by default outlier detection will only mark the lower of 1 host or 10% of the upstreams as unhealthy before it stops ejecting, so in some cases it will still look like Consul has healthy upstreams even when none exist.
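
For reference, turning on Envoy's default outlier detection via go-control-plane only requires attaching an empty OutlierDetection message to the cluster. A minimal sketch, assuming the gogo-based v2 go-control-plane API in use at the time (the helper name is hypothetical, not the exact diff):

```go
package example

import (
	envoy "github.com/envoyproxy/go-control-plane/envoy/api/v2"
	envoycluster "github.com/envoyproxy/go-control-plane/envoy/api/v2/cluster"
)

// withDefaultOutlierDetection attaches an empty OutlierDetection message,
// which tells Envoy to apply its built-in defaults: eject a host after 5
// consecutive failures, ejecting at most 10% of the cluster's hosts.
func withDefaultOutlierDetection(c *envoy.Cluster) *envoy.Cluster {
	c.OutlierDetection = &envoycluster.OutlierDetection{}
	return c
}
```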

Changing Consul to return unhealthy instances with HealthStatus: UNHEALTHY is an option, but it's not trivial: Consul currently has no single API that can do that for catalog/health queries, and can't do it at all for Prepared Queries. We may work on that in the future.
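
If that option does land, it would amount to something like the following when building endpoints. This is a hypothetical sketch against the v2 API (markUnhealthy is an invented name), not anything in this PR:

```go
package example

import (
	envoycore "github.com/envoyproxy/go-control-plane/envoy/api/v2/core"
	envoyendpoint "github.com/envoyproxy/go-control-plane/envoy/api/v2/endpoint"
)

// markUnhealthy includes a failing instance in the load assignment but
// flags it explicitly, so Envoy's view matches Consul's health checks
// instead of the instance silently disappearing from the endpoint set.
func markUnhealthy(ep envoyendpoint.LbEndpoint) envoyendpoint.LbEndpoint {
	ep.HealthStatus = envoycore.HealthStatus_UNHEALTHY
	return ep
}
```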

Also in this PR:

  • Enabled outlier detection on upstreams, which will mark instances unhealthy after 5 failures (using Envoy's defaults)
  • Enabled weighted load balancing where DNS weights are configured (see the sketch below)
    • This was an easy win while I was touching this code. I did it originally because I tried to implement explicit unhealthy marking as mentioned above, before realising we can't currently populate that from the existing APIs.
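
The weighting change maps Consul's DNS weights onto Envoy's per-endpoint load balancing weight. A minimal sketch, assuming the gogo-based v2 API (the helper is hypothetical):

```go
package example

import (
	envoyendpoint "github.com/envoyproxy/go-control-plane/envoy/api/v2/endpoint"
	"github.com/gogo/protobuf/types"
)

// applyWeight sets Envoy's per-endpoint LoadBalancingWeight from a Consul
// service weight, so an instance with weight 10 receives ten times the
// traffic of an instance with weight 1 on the same upstream.
func applyWeight(ep envoyendpoint.LbEndpoint, weight int) envoyendpoint.LbEndpoint {
	ep.LoadBalancingWeight = &types.UInt32Value{Value: uint32(weight)}
	return ep
}
```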

@banks banks added the theme/connect Anything related to Consul Connect, Service Mesh, Side Car Proxies label Mar 15, 2019
@banks banks requested a review from a team March 15, 2019 19:55
@banks (Member, Author) commented Mar 18, 2019

I realised that I was mistaken: Envoy does support being told there are zero instances in a LoadAssignment. I had just been too clever and avoided creating an empty assignment at all in the case of no results, thinking that was invalid or caused by partial loading.

I'll modify this PR to use that fact instead, which has the same effect but will actually cause Envoy to show zero healthy backends immediately.

It actually brings up an interesting point though: if we explicitly tell Envoy there are zero instances, then it has no fallback. For example, if a Consul health check has a bug and marks all instances as unavailable, every Envoy will hard-fail requests to them since it has no known instances at all. The behaviour with this PR as it stands would actually be a little more graceful: Envoy would hang on to the last known set of instances and keep trying them, taking them out of rotation via outlier detection, but still sending traffic anyway once more than 10% of instances are considered outliers. That would let it ride out issues like bugs in health checks or mass flapping of registrations.

🤔 i.e. the 'not ideal' behaviour here actually has benefits. Its only real downside is that it might be confusing that Envoy still reports that it knows about backends, and will even say they are healthy until they've hit the outlier failure threshold.

On the basis that all of the rest of Consul's service discovery mechanisms hard-remove all instances if they are all critical, I'll follow suit here, but it's certainly something we could make configurable later.

"time"

envoy "github.com/envoyproxy/go-control-plane/envoy/api/v2"
envoyauth "github.com/envoyproxy/go-control-plane/envoy/api/v2/auth"
"github.com/envoyproxy/go-control-plane/envoy/api/v2/cluster"
Member (review comment): named import as envoycluster for consistency?

@@ -101,9 +103,24 @@ func makeUpstreamCluster(upstream structs.Upstream, cfgSnap *proxycfg.ConfigSnap
}

if c == nil {
conTimeout := 5 * time.Second
if toRaw, ok := upstream.Config["connect_timeout_ms"]; ok {
switch v := toRaw.(type) {
Member (review comment): The body of this if block feels like it would be better if it were extracted into a parseTimeMillis helper function or something like that, since it's not terribly envoy specific.
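
Such a helper might look like this sketch (the name parseTimeMillis is the reviewer's suggestion; the signature and cases are assumptions, not the code that landed):

```go
package example

import (
	"fmt"
	"strconv"
	"time"
)

// parseTimeMillis interprets an opaque config value, such as the one found
// in upstream.Config["connect_timeout_ms"], as a duration in milliseconds.
func parseTimeMillis(raw interface{}) (time.Duration, error) {
	switch v := raw.(type) {
	case string:
		ms, err := strconv.Atoi(v)
		if err != nil {
			return 0, err
		}
		return time.Duration(ms) * time.Millisecond, nil
	case float64: // JSON decodes numbers into float64
		return time.Duration(v) * time.Millisecond, nil
	case int:
		return time.Duration(v) * time.Millisecond, nil
	default:
		return 0, fmt.Errorf("invalid type for milliseconds duration: %T", raw)
	}
}
```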

@@ -4,6 +4,7 @@ import (
"errors"

envoy "github.com/envoyproxy/go-control-plane/envoy/api/v2"
"github.com/envoyproxy/go-control-plane/envoy/api/v2/core"
Member (review comment): named import as envoycore

@banks (Member, Author) commented Mar 18, 2019

There is another good argument for making the suggested change that actually removes unhealthy instances.

In cases where instances have been deregistered, the IP and port may be recycled for a future workload (e.g. a Kubernetes pod) that belongs to a different service. In that case, keeping stale endpoints around just in case they aren't really gone can do actual harm, so it's better to be explicit for now and remove instances that are not available.

In the future, when we have the ability to efficiently return unhealthy instances explicitly marked as unhealthy, that would be best; but leaving Envoy stale for the sake of possibly being more robust to check malfunctions seems to be a bad trade.

tl;dr: I'll update this PR to allow empty LoadAssignments instead of empty discovery results.
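
The resulting shape of the fix, as a minimal sketch rather than the exact diff, assuming the gogo-based v2 API: always build a ClusterLoadAssignment, even when the endpoint list is empty, so Envoy immediately sees zero healthy backends instead of retaining the last known set.

```go
package example

import (
	envoy "github.com/envoyproxy/go-control-plane/envoy/api/v2"
	envoyendpoint "github.com/envoyproxy/go-control-plane/envoy/api/v2/endpoint"
)

// makeLoadAssignment always returns a ClusterLoadAssignment, even for zero
// healthy endpoints; an empty LbEndpoints slice is a valid EDS response.
func makeLoadAssignment(clusterName string, endpoints []envoyendpoint.LbEndpoint) *envoy.ClusterLoadAssignment {
	return &envoy.ClusterLoadAssignment{
		ClusterName: clusterName,
		Endpoints: []envoyendpoint.LocalityLbEndpoints{{
			// endpoints may be empty; previously no assignment was built
			// at all in that case, which made Envoy block or go stale.
			LbEndpoints: endpoints,
		}},
	}
}
```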

@banks (Member, Author) commented Mar 18, 2019

@rboyer thanks! great catches, I think I addressed those.

I'm happy with this. I found another bug, detailed in #5506, which I will fix separately, but this fixes the most important long-standing issue with startup hangs and failing to update Envoy when the last instance of an upstream goes away or fails.

@banks (Member, Author) commented Mar 22, 2019

Parts of this PR are already changed in a WIP branch I have that will be pushed up soon (I rewrote the custom config parsing to be centralised and more robust using mapstructure). I'd still like to land this as it is so I can get my branch merged on top.
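
As a rough illustration of that direction (all names here are hypothetical), the mapstructure approach decodes the opaque config map into a typed struct once, instead of scattering type switches through the xDS code:

```go
package example

import (
	"time"

	"github.com/mitchellh/mapstructure"
)

// UpstreamConfig is a hypothetical typed view of the opaque upstream
// config map, sketching the centralised parsing mentioned above.
type UpstreamConfig struct {
	ConnectTimeoutMs int `mapstructure:"connect_timeout_ms"`
}

// parseUpstreamConfig decodes the raw map in one place, replacing the
// ad-hoc per-key type switches.
func parseUpstreamConfig(raw map[string]interface{}) (UpstreamConfig, error) {
	var cfg UpstreamConfig
	err := mapstructure.Decode(raw, &cfg)
	return cfg, err
}

// ConnectTimeout converts the parsed millisecond value to a time.Duration.
func (c UpstreamConfig) ConnectTimeout() time.Duration {
	return time.Duration(c.ConnectTimeoutMs) * time.Millisecond
}
```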

@mkeeler (Member) left a review: LGTM

@pearkes (Contributor) commented Mar 22, 2019

@banks you mentioned this may fix #5499, but looking at your more recent comments it seems like we have a high degree of confidence this will fix that issue. Is that right?

@banks (Member, Author) commented Mar 22, 2019

@pearkes I think you mean #5332, and yes, I'm pretty confident it will based on the description and logs shared in that issue.

@banks banks merged commit 89fa5ec into master Mar 22, 2019
@banks banks deleted the fix-envoy-empty-cluster branch March 22, 2019 19:37