Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should HC activation be delayed until needed secrets are available? #15977

Closed
flashyang opened this issue Apr 14, 2021 · 4 comments
Closed

Should HC activation be delayed until needed secrets are available? #15977

flashyang opened this issue Apr 14, 2021 · 4 comments
Labels
area/health_checking area/tls bug stale stalebot believes this issue/PR has not been touched recently

Comments

@flashyang
Copy link

flashyang commented Apr 14, 2021

Description:
Recently, we noticed that during the Envoy initialization, there is a race condition between when TLS is configured on a upstream cluster (like validation context) and when active healthcheck begins on that cluster, and it will take around 60s for the Envoy to initialize. In this case, we find that Envoy initiates the first healthcheck on the upstream cluster before the validation context is retrieved, resulting in a health check connection failure and the healthcheck interval will fall back to the no_traffic_interval (because there is no traffic on the cluster). While for Envoy cluster which uses STATIC_DNS and EDS this appears to not delay Envoy initialization, it appears that Envoy cluster using LOGICAL_DNS will wait out the no_traffic_interval to healthcheck again before it considers itself fully initialized.

I saw we have same issue created before #12389, but it was closed (the fix commit #13516 was merged but reverted latter). So open this issue again. I have verified that this issue still exist in Envoy v1.17.1 image.

@flashyang flashyang added bug triage Issue requires triage labels Apr 14, 2021
@flashyang flashyang changed the title HC activation should be delayed until needed secrets are available Should HC activation be delayed until needed secrets are available? Apr 14, 2021
@htuch
Copy link
Member

htuch commented Apr 14, 2021

@rgs1 @mpuncel

@htuch htuch added area/health_checking area/tls and removed triage Issue requires triage labels Apr 14, 2021
@mpuncel
Copy link
Contributor

mpuncel commented Apr 29, 2021

Confirming this issue still exists, apologies for not re-opening that other issue. The issue with the reverted commit was that there was a deadlock where health checks might never start even after secrets are loaded, which I believe is fixed in the final 2 commits of #13650. What remains is writing an integration test that would have caught the initial deadlock

@github-actions
Copy link

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label May 29, 2021
@github-actions
Copy link

github-actions bot commented Jun 5, 2021

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/health_checking area/tls bug stale stalebot believes this issue/PR has not been touched recently
Projects
None yet
Development

No branches or pull requests

3 participants