Should HC activation be delayed until needed secrets are available? #15977
Comments
Confirming this issue still exists; apologies for not re-opening the other issue. The problem with the reverted commit was a deadlock in which health checks might never start, even after the secrets were loaded. I believe that is fixed in the final 2 commits of #13650. What remains is writing an integration test that would have caught the initial deadlock.
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.
Description:
Recently, we noticed a race condition during Envoy initialization between the moment TLS (for example, the validation context) is configured on an upstream cluster and the moment active health checking begins on that cluster. When the race is lost, Envoy takes around 60s to initialize. In this case, Envoy initiates the first health check on the upstream cluster before the validation context has been retrieved, so the health check connection fails and the health check interval falls back to the `no_traffic_interval` (because there is no traffic on the cluster). For Envoy clusters that use STATIC_DNS or EDS this does not appear to delay Envoy initialization, but an Envoy cluster using LOGICAL_DNS appears to wait out the `no_traffic_interval` and health check again before it considers itself fully initialized.
I saw that the same issue was reported before in #12389, but it was closed (the fix commit #13516 was merged but later reverted), so I am opening this issue again. I have verified that the issue still exists in the Envoy v1.17.1 image.
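The timing described above can be illustrated with a toy model. This is only a sketch of the reported behavior for the LOGICAL_DNS case, not Envoy's implementation: the class `ClusterInit`, its fields, and the 0.5s secret delay are hypothetical names and values chosen for illustration. The 60s figure is Envoy's documented default for `no_traffic_interval`, and `gate_on_secrets` stands in for the behavior the issue title asks about (delaying health check activation until the secrets are available).

```python
from dataclasses import dataclass


@dataclass
class ClusterInit:
    """Toy model of cluster warm-up; all names here are hypothetical."""

    secret_ready_at: float              # seconds until the validation context arrives
    no_traffic_interval: float = 60.0   # Envoy's documented default
    gate_on_secrets: bool = False       # proposed behavior: wait for secrets first

    def initialized_at(self) -> float:
        """Return the time of the first successful health check."""
        if self.no_traffic_interval <= 0:
            raise ValueError("no_traffic_interval must be positive")
        # Without gating, the first health check fires immediately at t=0.
        # With gating, it fires as soon as the secret is available.
        t = self.secret_ready_at if self.gate_on_secrets else 0.0
        # A check made before the secret arrives fails, and the next
        # attempt happens one no_traffic_interval later.
        while t < self.secret_ready_at:
            t += self.no_traffic_interval
        return t


racy = ClusterInit(secret_ready_at=0.5)
gated = ClusterInit(secret_ready_at=0.5, gate_on_secrets=True)
print(racy.initialized_at())   # 60.0
print(gated.initialized_at())  # 0.5
```

In this model, losing the race by even half a second costs a full `no_traffic_interval`, which is consistent with the roughly 60s initialization time reported above.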