Delay HC activation until SDS is initialized #17529

flashyang · 2021-07-28T22:59:30Z

Description:
Recently we noticed a race condition during Envoy initialization between TLS configuration on an upstream cluster (for example, retrieval of the validation context) and the start of active health checking on that cluster. When it occurs, Envoy takes around 60s to initialize. Envoy initiates the first health check on the upstream cluster before the validation context has been retrieved, so the health check connection fails and the health check interval falls back to no_traffic_interval (because there is no traffic on the cluster). For clusters that use STATIC_DNS and EDS this does not appear to delay Envoy initialization, but a cluster using LOGICAL_DNS appears to wait out the no_traffic_interval and health check again before Envoy considers itself fully initialized.
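To make the timing concrete, here is a minimal sketch of the retry behavior described above. It is a toy model, not Envoy code: the function name `firstSuccessfulCheck` and its parameters are hypothetical, and it assumes only that the first check fires at time 0 and that every failed check is retried after the no-traffic interval.

```cpp
#include <cassert>
#include <chrono>

using std::chrono::milliseconds;

// Hypothetical model, not Envoy code: returns the time of the first health
// check that runs at or after the moment the TLS secret becomes available.
// Assumptions: the first check fires at t = 0, and every failed check is
// retried after `no_traffic_interval`.
milliseconds firstSuccessfulCheck(milliseconds secret_ready_at,
                                  milliseconds no_traffic_interval) {
  // A non-positive interval would make the loop below spin forever.
  assert(no_traffic_interval.count() > 0);

  milliseconds t{0};
  while (t < secret_ready_at) {
    // The check at time t ran without a validation context and failed,
    // so the next attempt is scheduled one full interval later.
    t += no_traffic_interval;
  }
  return t;
}
```

With a secret that arrives 50 ms after the first check (an invented figure) and a 60 s interval, `firstSuccessfulCheck(milliseconds{50}, milliseconds{60000})` returns 60000 ms, in line with the roughly 60s delay reported here. If the secret is ready before the first check, the result is 0 ms, which is the outcome that delaying health checking is meant to guarantee.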

I created the same issue before (#15977), but it was closed without a fix. The closest fix is #16236, but it was not merged.

mattklein123 · 2021-08-06T21:59:51Z

So you are using SDS on the upstream cluster and we are not waiting for SDS to finish before starting health checking? Is that right? If so I agree this should be fixed.

flashyang · 2021-08-06T22:11:38Z

Yes, that's the case. I think we should delay health checking on the upstream cluster until the SDS resources it needs are ready.

mattklein123 · 2021-08-06T22:12:35Z

OK yes we should fix this. Marking it help wanted. cc @lambdai

mpuncel · 2021-08-26T23:05:55Z

@lizan (since you added SDS support, as far as I know), is this as simple as moving health_checker_->start() to onInitDone()? That should run only after the init manager has finished with all targets, which I believe includes an initial SDS secret load for all transport socket matches. I wonder if my PR is overly complicated.
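The ordering proposed here can be sketched as follows. This is an illustration only and does not use Envoy's real init manager or cluster classes: `InitManagerSketch`, `HealthCheckerSketch`, `ClusterSketch`, and their methods are hypothetical stand-ins. Only the names `health_checker_->start()` and `onInitDone()` are taken from the comment above.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical stand-in for an init manager: it tracks pending targets and
// fires a single callback once initialization was requested and no target
// is still pending.
class InitManagerSketch {
public:
  explicit InitManagerSketch(std::function<void()> on_done)
      : on_done_(std::move(on_done)) {}

  // Register one target, for example one SDS secret subscription.
  void addTarget() { ++pending_; }

  // Begin initialization; completes immediately if nothing is pending.
  void initialize() {
    initializing_ = true;
    maybeFinish();
  }

  // Report that one registered target has finished loading.
  void targetReady() {
    assert(pending_ > 0);
    --pending_;
    maybeFinish();
  }

private:
  void maybeFinish() {
    if (initializing_ && !done_ && pending_ == 0) {
      done_ = true;  // Ensure the callback fires exactly once.
      on_done_();
    }
  }

  std::function<void()> on_done_;
  int pending_ = 0;
  bool initializing_ = false;
  bool done_ = false;
};

// Hypothetical health checker that only records whether it was started.
struct HealthCheckerSketch {
  bool started = false;
  void start() { started = true; }
};

// Hypothetical cluster: health checking starts in onInitDone(), not earlier.
class ClusterSketch {
public:
  ClusterSketch()
      : health_checker_(std::make_unique<HealthCheckerSketch>()),
        init_manager_([this] { onInitDone(); }) {}
  ClusterSketch(const ClusterSketch&) = delete;
  ClusterSketch& operator=(const ClusterSketch&) = delete;

  void addSecretTarget() { init_manager_.addTarget(); }
  void initialize() { init_manager_.initialize(); }
  void secretLoaded() { init_manager_.targetReady(); }
  bool healthCheckingStarted() const { return health_checker_->started; }

private:
  // Runs only after every registered secret target has reported ready.
  void onInitDone() { health_checker_->start(); }

  std::unique_ptr<HealthCheckerSketch> health_checker_;
  InitManagerSketch init_manager_;
};
```

In this sketch, a cluster with two secret targets reports `healthCheckingStarted() == false` until `secretLoaded()` has been called twice, while a cluster with no targets starts health checking as soon as `initialize()` is called. Whether Envoy's init manager actually covers every SDS secret for all transport socket matches is the open question raised in this comment, and the sketch does not settle it.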

lizan · 2021-08-27T01:58:09Z

That might work, but I'm not entirely sure.

This should fix envoyproxy#17529 since the init manager waits until SDS secrets have loaded for all transport sockets configured on the Cluster. Signed-off-by: Michael Puncel <mpuncel@squareup.com>



tnsardesai · 2023-07-28T20:26:40Z

Hi, are there any updates on this ticket? I believe this is still an ongoing issue.

flashyang added the bug and triage labels on Jul 28, 2021

This was referenced Aug 4, 2021

Bug: Client x-request-id Mangled by RequestIdExtension aws/aws-app-mesh-roadmap#321

Open

Bug: Active healthchecks, TLS, and DNS service discovery on a Virtual Node can delay Envoy initialization aws/aws-app-mesh-roadmap#227

Open

mattklein123 added the area/health_checking, area/sds, and help wanted labels and removed the triage label on Aug 6, 2021

mattklein123 changed the title from "Should HC activation be delayed until needed secrets are available?" to "Delay HC activation until SDS is initialized" on Aug 6, 2021

This was referenced Aug 13, 2021

Mpuncel/sds hc sequence #17712

Closed

Make health check loop wait for any required SDS secrets to be loaded… #17756

Closed

mpuncel mentioned this issue Sep 14, 2021

upstream: start health checker after init manager complete #18119

Closed

suniltheta mentioned this issue Sep 24, 2021

AppMesh-Gateway with 2 GatewayRoutes - update needs >15min aws/eks-charts#605

Closed

mpuncel mentioned this issue Nov 10, 2021

Mpuncel/hc after sds init manager #18962

Closed

valerian-roche added a commit to valerian-roche/envoy that referenced this issue Mar 16, 2022

[Upstream Health-check][SDS] envoyproxy#17529:

4a524bb

Signed-off-by: Valerian Roche <valerian.roche@datadoghq.com> Co-authored-by: Michael Puncel <mpuncel@squareup.com>
