-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
upstream: rebuild cluster when health check config is changed #4075
Conversation
Signed-off-by: Dhi Aurrahman <dio@tetrate.io>
if (*(*i)->healthCheckAddress() != *host->healthCheckAddress()) { | ||
// If the health check address is updated, set the current host's health check address | ||
// using the updated one. | ||
(*i)->setHealthCheckAddress(host->healthCheckAddress()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although this is safe from a locking perspective since HC is only on the main thread, it seems like the HC address should be an immutable property of a host? How often does it really change? Did we consider making this a thing that causes a host to get removed/re-added? (Basically having the host check pivot off of normal address and HC address)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be an immutable property of a host
Yeah, on the second thought this PR seems like a quick hack for me.
How often does it really change? Did we consider making this a thing that causes a host to get removed/re-added?
In practice probably not that often. Will update accordingly to make it as a case of causing a host to get removed/re-added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, makes sense to do a full add/remove, this code is already crazy complicated ☣️
…ck-address Signed-off-by: Dhi Aurrahman <dio@tetrate.io>
Signed-off-by: Dhi Aurrahman <dio@tetrate.io>
Signed-off-by: Dhi Aurrahman <dio@tetrate.io>
Signed-off-by: Dhi Aurrahman <dio@tetrate.io>
@@ -909,6 +912,9 @@ bool BaseDynamicClusterImpl::updateDynamicHostList(const HostVector& new_hosts, | |||
i = current_hosts.erase(i); | |||
found = true; | |||
} else { | |||
if (address_matched && health_check_changed) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry little confused, why is this change necessary? Same below? Seems like just pivoting on address + HC address should be enough above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tried by only introducing the change above. But from my test, without this, it leads to keeping the old hosts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, can you take a look at why that is? This code is getting really complicated and could use some more comments anyway. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Will try to trace out the flow and craft some comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Referring to the case example by @bplotnick in here: https://github.com/bplotnick/envoy_examples/blob/master/healthcheck_updates/envoy_config/clusters.yaml, the drain_connections_on_host_removal
is not being set (as true) for the EDS cluster config (type: EDS
). If this set to true, we don't need these two changes. Do you think if we have health checker, we should always set this flag to true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two changes make sure, regardless the flag, we always rebuild and skip keeping healthy (old/current) hosts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heh whoops. That was actually unintentional. In my original configuration where I saw this issue, I had that flag set 😛
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😄 yeah, in that case, we only need to add https://github.com/envoyproxy/envoy/pull/4075/files#diff-cd280606785949980237b2a7f08b7885R872 as suggested by @mattklein123.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But probably that's the correct behavior? Once the old hosts die, it will be removed eventually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need a little time to grok this conversation thread. I will follow up today or tomorrow. Thank you!
Signed-off-by: Dhi Aurrahman <dio@tetrate.io>
health_checker_ != nullptr && *(*i)->healthCheckAddress() != *host->healthCheckAddress(); | ||
// If we find a host matched based on address and the health check address is not changed, we | ||
// keep it. However we do change weight inline so do that here. | ||
if (address_matched && !health_check_changed) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not your fault, but this entire method is completely unreadable to me; it's hard to keep track of what current_hosts
represent, which hosts are to be removed, under which conditions we're going to do a rebuild, inplace modify or remove/add. Do you think there's a way to explain to the layman how this works? Each time I reread this method, by head hurts more and more (and I'm partly to blame for sure).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I agree. This has been an iterative process over time and I agree it's now not very readable and it hurts my head also. I think it would be worth it to have a lot more comments in here, possibly variable name changes, and maybe even functions broken out. @dio since you are in here now do you mind taking a look?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since you are in here now do you mind taking a look?
Sure, love to have a look 🙂. Will spend some time dive into this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems there is a quite intensive refactoring in here: #3959
Signed-off-by: Dhi Aurrahman <dio@tetrate.io>
Signed-off-by: Dhi Aurrahman <dio@tetrate.io>
Signed-off-by: Dhi Aurrahman <dio@tetrate.io>
Signed-off-by: Dhi Aurrahman <dio@tetrate.io>
Signed-off-by: Dhi Aurrahman <dio@tetrate.io>
Thanks to @snowp the code is a lot clearer now. Guess this PR is ready for review. Thanks! |
Signed-off-by: Dhi Aurrahman <dio@tetrate.io>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks much nicer, a couple of comments.
test/common/upstream/eds_test.cc
Outdated
@@ -1152,6 +1152,214 @@ TEST_F(EdsTest, PriorityAndLocalityWeighted) { | |||
EXPECT_EQ(1UL, stats_.counter("cluster.name.update_no_rebuild").value()); | |||
} | |||
|
|||
TEST_F(EdsTest, EndpointUpdateHealthCheckConfig) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arguably, you could subclass the test fixture EdsTest
and add in a bunch of helper methods to make these tests less verbose; there is a lot of boilerplate that could be refactored away to make the tests more readable that way.
health_checker_ != nullptr && existing_host_found && | ||
*existing_host->second->healthCheckAddress() != *host->healthCheckAddress(); | ||
|
||
if (existing_host_found && !health_check_changed) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can still be simplified a bit, since later on, we might add another rebuild condition. I'd rename health_check_changed
to no_inplace_host_update
. Also, maybe add a comment to the top of this if
block explaining what it's doing in terms of hosts insertion/deletion behavior.
Signed-off-by: Dhi Aurrahman <dio@tetrate.io>
Signed-off-by: Dhi Aurrahman <dio@tetrate.io>
Signed-off-by: Dhi Aurrahman <dio@tetrate.io>
Signed-off-by: Dhi Aurrahman <dio@tetrate.io>
Signed-off-by: Dhi Aurrahman <dio@tetrate.io>
Signed-off-by: Dhi Aurrahman <dio@tetrate.io>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very clear, thanks!
health_checker_ != nullptr && existing_host_found && | ||
*existing_host->second->healthCheckAddress() != *host->healthCheckAddress(); | ||
|
||
// When there is a match and we decided to do in-place update, we potentially updating the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: updating -> update
Signed-off-by: Dhi Aurrahman <dio@tetrate.io>
* origin/master: (34 commits) fuzz: fixes oss-fuzz: 9895 (envoyproxy#4189) docs: added developer docs for pprof/tcmalloc testing (envoyproxy#4194) fix sometimes returns response body to HEAD requests (envoyproxy#3985) admin: fix config dump order (envoyproxy#4197) thrift: allow translation between downstream and upstream protocols (envoyproxy#4136) Use local_info.node() instead of bootstrap.node() whenever possible (envoyproxy#4120) upstream: allow extension protocol options to be used with http filters (envoyproxy#4165) [thrift_proxy] Replace local HeaderMap with Http::HeaderMap[Impl] (envoyproxy#4169) docs: improve gRPC-JSON filter doc (envoyproxy#4184) stats: refactoring MetricImpl without strings (envoyproxy#4190) fuzz: coverage profile generation instructions. (envoyproxy#4193) upstream: rebuild cluster when health check config is changed (envoyproxy#4075) build: use clang-6.0. (envoyproxy#4168) thrift_proxy: introduce header transport (envoyproxy#4082) tcp: allow connection pool callers to store protocol state (envoyproxy#4131) configs: match latest API changes (envoyproxy#4185) Fix wrong mock function name. (envoyproxy#4187) Bump yaml-cpp so it builds with Visual Studio 15.8 (envoyproxy#4182) Converting envoy configs to V2 (envoyproxy#2957) Add timestamp to HealthCheckEvent definition (envoyproxy#4119) ... Signed-off-by: Snow Pettersen <snowp@squareup.com>
…#1938) This is far from finished, but it reduces memory usage by ~10%. Pulling the following changes from github.com/envoyproxy/envoy: c1cc68d stats: refactoring MetricImpl without strings (envoyproxy#4190) 36809d8 fuzz: coverage profile generation instructions. (envoyproxy#4193) ba40cc9 upstream: rebuild cluster when health check config is changed (envoyproxy#4075) 05c0d52 build: use clang-6.0. (envoyproxy#4168) 01f403e thrift_proxy: introduce header transport (envoyproxy#4082) 564d256 tcp: allow connection pool callers to store protocol state (envoyproxy#4131) 3e1d643 configs: match latest API changes (envoyproxy#4185) bc6a10c Fix wrong mock function name. (envoyproxy#4187) e994c1c Bump yaml-cpp so it builds with Visual Studio 15.8 (envoyproxy#4182) 3d1325e Converting envoy configs to V2 (envoyproxy#2957) 8d0680f Add timestamp to HealthCheckEvent definition (envoyproxy#4119) 497efb9 server: handle non-EnvoyExceptions safely if thrown in constructor. (envoyproxy#4173) 6d6fafd config: strengthen validation for gRPC config sources. (envoyproxy#4171) 132302c fuzz: reduce log level when running under fuzz engine. (envoyproxy#4161) 7c04ac2 test: fix V6EmptyOptions in coverage with IPv6 enable (envoyproxy#4155) 1b2219b ci: remove deprecated bazel --batch option (envoyproxy#4166) 2db6a4c ci: update clang to version 6.0 in the Ubuntu build image. (envoyproxy#4157) 71152b7 ratelimit: Add ratelimit custom response headers (envoyproxy#4015) 3062874 ssl: make Ssl::Connection const everywhere (envoyproxy#4179) 706e262 Handle validation of non expiring tokens in jwt_authn filter (envoyproxy#4007) f06e958 fuzz: tag trivial fuzzers with no_fuzz. (envoyproxy#4156) 27fb1d3 thrift_proxy: add service name matching to router implementation (envoyproxy#4130) 8c189a5 Make over provisioning factor configurable (envoyproxy#4003) 6c08fb4 Making test less flaky (envoyproxy#4149) 592775b fuzz: bare bones HCM fuzzer. (envoyproxy#4118) For istio/istio#7912. Signed-off-by: Piotr Sikora <piotrsikora@google.com>
Description:
This PR makes sure changing health check address triggers rebuilding the cluster.
Risk Level: medium
Testing: unit, manual as depicted by @bplotnick in #4070.
Docs Changes: N/A
Release Notes: N/A
Fixes #4070
Signed-off-by: Dhi Aurrahman dio@tetrate.io