Should HC activation be delayed until needed secrets are available? #15977

flashyang · 2021-04-14T19:23:31Z

Description:
Recently we noticed a race condition during Envoy initialization between when TLS is configured on an upstream cluster (for example, when the validation context arrives) and when active health checking begins on that cluster. As a result, Envoy takes around 60s to initialize. Envoy initiates the first health check on the upstream cluster before the validation context has been retrieved, so the health check connection fails and the health check interval falls back to the no_traffic_interval (because there is no traffic on the cluster). For Envoy clusters that use STATIC_DNS or EDS this does not appear to delay Envoy initialization, but an Envoy cluster that uses LOGICAL_DNS appears to wait out the no_traffic_interval and health check again before it considers itself fully initialized.
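To make the timing concrete, here is a rough model of the race described above. It is only a sketch under stated assumptions, not Envoy code: the function names are hypothetical, the first check is assumed to fire at t=0, and every retry is assumed to happen one no_traffic_interval later (60s is, as far as I know, Envoy's documented default for that field).

```python
def first_successful_check_s(secret_ready_at_s: float,
                             no_traffic_interval_s: float = 60.0) -> float:
    """Time of the first health check that can succeed (hypothetical model).

    Assumptions: a check is sent at t=0 and then once every
    no_traffic_interval_s, and a check can only succeed if the TLS
    validation context has already arrived when the check is sent.
    """
    t = 0.0
    while t < secret_ready_at_s:
        # The check at time t fails: the secret is not there yet.
        t += no_traffic_interval_s
    return t


def first_check_if_delayed_s(secret_ready_at_s: float) -> float:
    """Same model, but the first check waits until the secret is ready."""
    return max(0.0, secret_ready_at_s)


if __name__ == "__main__":
    # Secret arrives 200 ms after startup (an illustrative number).
    print(first_successful_check_s(0.2))   # one full no_traffic_interval later
    print(first_check_if_delayed_s(0.2))   # as soon as the secret is ready
```

In this model a secret that arrives only a fraction of a second late costs a full no_traffic_interval, which matches the roughly 60s initialization delay reported here.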

I saw that the same issue was reported before as #12389, but it was closed (the fix commit #13516 was merged but later reverted), so I am opening this issue again. I have verified that the issue still exists in the Envoy v1.17.1 image.


htuch · 2021-04-14T20:20:45Z

@rgs1 @mpuncel

mpuncel · 2021-04-29T14:04:18Z

Confirming this issue still exists; apologies for not re-opening the other issue. The problem with the reverted commit was a deadlock in which health checks might never start even after secrets are loaded, which I believe is fixed in the final 2 commits of #13650. What remains is writing an integration test that would have caught the initial deadlock.
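For readers unfamiliar with this kind of hazard, the sketch below shows one generic way to gate a start callback on secret readiness without getting stuck. It is not Envoy's implementation and does not claim to reproduce the actual bug in the reverted commit; `SecretGate`, `ready_now`, and `on_secret_ready` are hypothetical names. The point it illustrates is that the gate has to account for secrets that were already loaded before it was created, otherwise it waits for a notification that never comes.

```python
from typing import Callable, Iterable, List


class SecretGate:
    """Hypothetical helper: run `start` once, when all required secrets are ready."""

    def __init__(self, required: Iterable[str], ready_now: Iterable[str],
                 start: Callable[[], None]) -> None:
        # Secrets that are already loaded must not be waited on; ignoring
        # them is one way a "never starts" situation can arise.
        self._pending = set(required) - set(ready_now)
        self._start = start
        self._started = False
        self._maybe_start()

    def on_secret_ready(self, name: str) -> None:
        """Notify the gate that the secret `name` has been loaded."""
        self._pending.discard(name)
        self._maybe_start()

    def _maybe_start(self) -> None:
        # Start exactly once, and only when nothing is pending.
        if not self._pending and not self._started:
            self._started = True
            self._start()


if __name__ == "__main__":
    events: List[str] = []
    gate = SecretGate({"validation_context"}, set(),
                      lambda: events.append("health checks started"))
    print(events)  # nothing started yet
    gate.on_secret_ready("validation_context")
    print(events)  # started after the secret arrived
```

An integration test of the kind mentioned above would presumably exercise both orderings: the secret arriving after the health checker is set up, and the secret being present beforehand.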

github-actions · 2021-05-29T16:14:23Z

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions · 2021-06-05T20:09:14Z

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

flashyang added the bug and triage labels Apr 14, 2021

flashyang changed the title ~~HC activation should be delayed until needed secrets are available~~ Should HC activation be delayed until needed secrets are available? Apr 14, 2021

htuch added the area/health_checking and area/tls labels and removed the triage label Apr 14, 2021

mpuncel mentioned this issue Apr 29, 2021

health_checker: Make health check loop wait for any required SDS secrets to be loaded before starting. #16236

Closed

github-actions bot added the stale label May 29, 2021

github-actions bot closed this as completed Jun 5, 2021

flashyang mentioned this issue Jul 28, 2021

Delay HC activation until SDS is initialized #17529

Open

This was referenced Aug 13, 2021

Mpuncel/sds hc sequence #17712

Closed

Make health check loop wait for any required SDS secrets to be loaded… #17756

Closed
