Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delay HC activation until SDS is initialized #17529

Open
flashyang opened this issue Jul 28, 2021 · 6 comments
Open

Delay HC activation until SDS is initialized #17529

flashyang opened this issue Jul 28, 2021 · 6 comments

Comments

@flashyang
Copy link

flashyang commented Jul 28, 2021

Description:
Recently, we noticed that during the Envoy initialization, there is a race condition between when TLS is configured on a upstream cluster (like validation context) and when active healthcheck begins on that cluster, and it will take around 60s for the Envoy to initialize. In this case, we find that Envoy initiates the first healthcheck on the upstream cluster before the validation context is retrieved, resulting in a health check connection failure and the healthcheck interval will fall back to the no_traffic_interval (because there is no traffic on the cluster). While for Envoy cluster which uses STATIC_DNS and EDS this appears to not delay Envoy initialization, it appears that Envoy cluster using LOGICAL_DNS will wait out the no_traffic_interval to healthcheck again before it considers itself fully initialized.

I have create a same issue #15977 before, but that issue was closed without a fix. The most close fix is this commit #16236 but it wasn't be merged.

@mattklein123
Copy link
Member

So you are using SDS on the upstream cluster and we are not waiting for SDS to finish before starting health checking? Is that right? If so I agree this should be fixed.

@mattklein123 mattklein123 added area/health_checking area/sds SDS related help wanted Needs help! and removed triage Issue requires triage labels Aug 6, 2021
@mattklein123 mattklein123 changed the title Should HC activation be delayed until needed secrets are available? Delay HC activation until SDS is initialized Aug 6, 2021
@flashyang
Copy link
Author

Yes, that's the case. I think we should delay the HC on the upstream cluster until the SDS resources are ready for them.

@mattklein123
Copy link
Member

OK yes we should fix this. Marking it help wanted. cc @lambdai

@mpuncel
Copy link
Contributor

mpuncel commented Aug 26, 2021

@lizan (since you added SDS support as far as I know), is this as simple as just moving health_checker_->start() to onInitDone()? That should only run after the init manager has finished with all targets which I believe includes an initial SDS secret load for all transport socket matches. I wonder if my PR is overly complicated.

@lizan
Copy link
Member

lizan commented Aug 27, 2021

That might work, but I'm not pretty sure.

mpuncel added a commit to mpuncel/envoy that referenced this issue Sep 14, 2021
This should fix envoyproxy#17529 since
the init manager waits until SDS secrets have loaded for all transport
sockets configured on the Cluster.

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
valerian-roche added a commit to valerian-roche/envoy that referenced this issue Mar 16, 2022
Signed-off-by: Valerian Roche <valerian.roche@datadoghq.com>
Co-authored-by: Michael Puncel <mpuncel@squareup.com>
mpuncel added a commit to mpuncel/envoy that referenced this issue Aug 8, 2022
This should fix envoyproxy#17529 since
the init manager waits until SDS secrets have loaded for all transport
sockets configured on the Cluster.

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
@tnsardesai
Copy link

Hi are there any updates on this ticket? I believe this is still an ongoing issue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants