-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
delay health checks until transport socket secrets are ready. #13516
Conversation
This fixes a sequencing issue that could result in undesirable Envoy bootup times. When active health checks are configured on a cluster using a transport socket that uses SDS to fetch secrets, it was possible for the initial health check to run before secrets are loaded, which results in a failure. The health check would then wait for the no_traffic_interval (default of 60s) before trying again. This commit makes health checks wait until secrets are loaded before starting, implemented by adding support for registering a callback on TransportSocketFactory. SslSocketFactory supports invoking the callback when secrets are loaded, and all other socket types currently invoke the callback immediately, preserving existing behavior. Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Drive-by comment, but wouldn't #13344 (keep clusters warming) also solve this? |
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
@PiotrSikora interesting, I was unaware of that PR! From a quick look, I don't think that PR actually delays health checks, so it does not address the problem from #12389. If there was a delay fetching a secret, Envoy would still enter the LIVE state in the middle of a I do think this PR might coincidentally fix #11120 for some cases, because the cluster is not considered warmed until health checks have run for one interval. There would still be some edge cases, however, for instance if health checks are performed over a raw_buffer socket that does not need to load secrets, they would still run immediately and the cluster might be marked as warmed before all secrets are loaded. So that PR still has value as well. |
That PR could maybe be modified to delay |
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Envoy will still enter LIVE state but the cluster won't be active, then no HC will be there, isn't that the case? |
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
* master: (22 commits) http: using CONNECT_ERROR for HTTP/2 (envoyproxy#13519) listener: respect address.pipe.mode (it didn't work) (envoyproxy#13493) examples: Fix more deprecations/warnings in configs (envoyproxy#13529) overload: tcp connection refusal overload action (envoyproxy#13311) tcp: towards pluggable upstreams (envoyproxy#13331) conn_pool: fixing comments (envoyproxy#13520) Prevent SEGFAULT when disabling listener (envoyproxy#13515) Convert overload manager config literals to YAML (envoyproxy#13518) Fix runtime feature variable name (envoyproxy#13533) dependencies: refactor repository location schema utils, cleanups. (envoyproxy#13452) router: fix an invalid ASSERT when encoding metadata frames in the router. (envoyproxy#13511) http2: Proactively disconnect connections flooded when resetting stream (envoyproxy#13482) ci use azp to sync filter example (envoyproxy#13501) mongo_proxy: support configurable command list for metrics (envoyproxy#13494) http local rate limit: note token bucket is shared (envoyproxy#13525) wasm/extensions: Wasm extension policy. (envoyproxy#13526) http: removing envoy.reloadable_features.http1_flood_protection (envoyproxy#13508) build: update ppc64le CI build status shield (envoyproxy#13521) dependencies: enforce dependency shepherd sign-off via RepoKitteh. (envoyproxy#13522) Add no_traffic_healthy_interval (envoyproxy#13336) ... Signed-off-by: Michael Puncel <mpuncel@squareup.com>
@lizan it looks like that PR doesn't change the flow until Even after that PR is merged, I believe this problematic sequence is still possible:
That said, that PR could maybe be modified to block for secrets updating before calling |
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you merge main branch?
* master: ci: use multiple stage (envoyproxy#13557) tls: update BoringSSL to 2192bbc8 (4240). (envoyproxy#13567) fix macos v8 build (envoyproxy#13572) Fixed Health Check Fuzz corpus syntax (envoyproxy#13576) ci: Remove shellcheck diff (envoyproxy#13560) ci: Increate brew retry interval (envoyproxy#13565) dependencies: fix some of the fallout from Wasm merge. (envoyproxy#13569) hds: add support for delta updates in specifier (envoyproxy#13067) ci: workaround for actions/runner-images#1811 (envoyproxy#13577) ratelimit: be able to disable x-envoy-ratelimited response header sent (envoyproxy#13270) Update opencensus library (envoyproxy#13549) ci: use azp for api and go-control-plane sync (envoyproxy#13550) docs: Remove/make generic lyft references in docs (envoyproxy#13559) check_format: adding 2 more release note checks (envoyproxy#13444) [Wasm] Add cluster metadata fallback and upstream host metadata (envoyproxy#13477) [fuzz] Added validation for secrets (envoyproxy#13543) Add Platform Specific Feature guidance to PR template (envoyproxy#13547) Signed-off-by: Michael Puncel <mpuncel@squareup.com>
merged! |
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
* master: (22 commits) delay health checks until transport socket secrets are ready. (envoyproxy#13516) test, oauth2: Make sure config test runs field validation (envoyproxy#13496) [http] swap codec implementations to default new (envoyproxy#13579) wasm: update proxy-wasm-cpp-host (envoyproxy#13606) postgres: do not copy and linearize received data when it is not going to be used (envoyproxy#13393) configs: Update configs v2 -> v3 (envoyproxy#13562) http2: Remove RELEASE_ASSERTs in sendPendingFrames() error handling (envoyproxy#13546) dependencies: track untracked implied dependencies, wrapup dashboard. (envoyproxy#13571) listener: add match all filter chain (envoyproxy#13449) fix mistakes in docstrings (envoyproxy#13603) ratelimit: add route entry metadata to ratelimit actions (envoyproxy#13269) cluster manager: avoid immediate activation for dynamic inserted cluster when initialize (envoyproxy#12783) ext_authz: Avoid calling check multiple times (envoyproxy#13288) docs: Unexclude remaining configs from validation (envoyproxy#13534) build: update rules_rust to allow Rustc in RBE (envoyproxy#13595) docs: Update sphinxext.rediraffe (envoyproxy#13589) Deprecate moonjit support on Windows before beta (envoyproxy#13541) dependencies: bump LuaJIT to 2.1 branch HEAD @ e9af1ab. (envoyproxy#13474) docs: add TLS stats to cluster stats doc (envoyproxy#13561) ci: stop building alpine-debug images in favor of ubuntu-based debug image (envoyproxy#13598) ... Signed-off-by: Michael Puncel <mpuncel@squareup.com>
x-posting in case someone else bumps into this issue, we had to revert this internally because it breaks (on a significant number of cases) hot restart for us (gets stuck in PRE_INIT). |
…envoyproxy#13516)" This reverts commit 34b67f9. Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>
@mattklein123 fine with me! Where is the best place to discuss what might have gone wrong here and how to proceed? Should I open an issue or another PR after the revert is merged? |
I would reopen a PR, and then try to work with @rgs1 to debug? I don't immediately see how hot restart would change the behavior of this PR so will need some poking. |
* master: (22 commits) ci: various improvements (envoyproxy#13660) dns: fix defunct fd bug in apple resolver (envoyproxy#13641) build: support ppc64le with wasm (envoyproxy#13657) [fuzz] Added random load balancer fuzz (envoyproxy#13400) dependencies: compute and check release dates via GitHub API. (envoyproxy#13582) mac ci: try ignoring update failure (envoyproxy#13658) watchdog: Optimize WatchdogImpl::touch in preparation to more frequent petting of the watchdog. (envoyproxy#13103) typos: fix a couple 'enovy' mispellings (envoyproxy#13645) lua: Expose stream info downstreamLocalAddress and downstreamDirectRemoteAddress for Lua filter (envoyproxy#13536) tap: fix upstream streamed transport socket taps (envoyproxy#13638) Revert "delay health checks until transport socket secrets are ready. (envoyproxy#13516)" (envoyproxy#13639) Watchdog: use abort action as a default if killing is enabled. (envoyproxy#13523) [fuzz] Fixed divide by zero bug (envoyproxy#13545) wasm: flip the meaning of the "repository" in envoy_wasm_cc_binary(). (envoyproxy#13621) fix: record recovered local address (envoyproxy#13581) docs: fix incorrect compressor filter doc (envoyproxy#13611) docs: clean up docs for azp migration (envoyproxy#13558) wasm: fix building Wasm example. (envoyproxy#13619) test: Refactor flood tests into a separate test file (envoyproxy#13556) wasm: re-enable tests with precompiled modules. (envoyproxy#13583) ... Signed-off-by: Michael Puncel <mpuncel@squareup.com>
This commit is a revival of envoyproxy#13516 which aimed to solve envoyproxy#12389. That PR was merged and then reverted due to a deadlock bug, which is still present in this commit. The fix will be applied in a subsequent commit in order to make the resolution clearer to PR reviewers. Signed-off-by: Michael Puncel <mpuncel@squareup.com>
This fixes a sequencing issue that could result in undesirable Envoy
bootup times. When active health checks are configured on a cluster
using a transport socket that uses SDS to fetch secrets, it was possible
for the initial health check to run before secrets are loaded, which
results in a failure. The health check would then wait for the
no_traffic_interval (default of 60s) before trying again.
This commit makes health checks wait until secrets are loaded before
starting, implemented by adding support for registering a callback on
TransportSocketFactory. SslSocketFactory supports invoking the callback
when secrets are loaded, and all other socket types currently invoke the
callback immediately, preserving existing behavior.
Signed-off-by: Michael Puncel mpuncel@squareup.com
Commit Message:
delay health checks until transport socket secrets are ready.
This fixes a sequencing issue that could result in undesirable Envoy
bootup times. When active health checks are configured on a cluster
using a transport socket that uses SDS to fetch secrets, it was possible
for the initial health check to run before secrets are loaded, which
results in a failure. The health check would then wait for the
no_traffic_interval (default of 60s) before trying again.
This commit makes health checks wait until secrets are loaded before
starting, implemented by adding support for registering a callback on
TransportSocketFactory. SslSocketFactory supports invoking the callback
when secrets are loaded, and all other socket types currently invoke the
callback immediately, preserving existing behavior.
Additional Description:
Risk Level: Medium
Testing: Unit tests
Docs Changes: N/A
Release Notes: Updated
Fixes #12389