Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(kds): introduce zone health checks #7821

Merged

Conversation

michaelbeaumont
Copy link
Contributor

@michaelbeaumont michaelbeaumont commented Sep 22, 2023

Explanation

Zone:

  • Send health check periodically

Global:

  • When we get a health check, mark it in the corresponding ZoneInsight
  • Introduce new component ZoneWatch that listens for ZoneOpenedStream and marks the (tenantID, zone) for watching
  • Periodically check all watched (tenantID, zone) and if the time of last health check in ZoneInsight is too late, send ZoneWentOffline
  • All opened streams also listen for ZoneWentOffline events and the handler returns if said event is received, ending the stream

We store the info in ZoneInsight because:
All instances need to potentially kill streams but not every instance will receive a health check from connected zones

Tests

The need for time.Sleep in the tests comes about because it happens asynchronously that:

  1. ZoneWatch subscribes to ZoneOpenedStream events in Start
    in reality, this is guaranteed to happen before ZoneOpenedStream events are sent by the fact that we only send them in response to new gRPC streams being opened
  2. ZoneOpenedStream is witnessed by ZoneWatch
    the test adds a time.Sleep because we only want to update the health check time once and in particular after the zone starts being watched and then check that it's disconnected
    in reality, a zone will continually send its health check ping so it eventually will be updated after the initial seen last time.

Todo

  • configurability

Checklist prior to review

  • Link to relevant issue as well as docs and UI issues --
  • This will not break child repos: it doesn't hardcode values (.e.g "kumahq" as a image registry) and it will work on Windows, system specific functions like syscall.Mkfifo have equivalent implementation on the other OS --
  • Tests (Unit test, E2E tests, manual test on universal and k8s) --
  • Do you need to update UPGRADE.md? --
  • Does it need to be backported according to the backporting policy? (this GH action will add "backport" label based on these file globs, if you want to prevent it from adding the "backport" label use no-backport-autolabel label) --
  • Do you need to explicitly set a > Changelog: entry here or add a ci/ label to run fewer/more tests?

@michaelbeaumont michaelbeaumont force-pushed the feat/do-zone-healthcheck branch 3 times, most recently from d2211af to cb39b5e Compare September 29, 2023 16:06
@michaelbeaumont michaelbeaumont force-pushed the feat/do-zone-healthcheck branch 2 times, most recently from deaad86 to cbab23a Compare October 5, 2023 08:09
@michaelbeaumont michaelbeaumont changed the title Feat/do zone healthcheck feat: introduce zone health checks Oct 5, 2023
pkg/core/features.go Outdated Show resolved Hide resolved
pkg/core/features.go Outdated Show resolved Hide resolved
pkg/kds/mux/client.go Outdated Show resolved Hide resolved
pkg/kds/mux/client.go Outdated Show resolved Hide resolved
pkg/kds/mux/zone_watch.go Outdated Show resolved Hide resolved
pkg/kds/global/components.go Outdated Show resolved Hide resolved
pkg/kds/mux/zone_watch.go Outdated Show resolved Hide resolved
pkg/kds/service/server.go Show resolved Hide resolved
@michaelbeaumont michaelbeaumont force-pushed the feat/do-zone-healthcheck branch 2 times, most recently from 8cdb4c6 to 0f817ea Compare October 5, 2023 11:31
Signed-off-by: Mike Beaumont <mjboamail@gmail.com>
@michaelbeaumont michaelbeaumont force-pushed the feat/do-zone-healthcheck branch 2 times, most recently from 81e8212 to 4ed9872 Compare October 5, 2023 14:24
Signed-off-by: Mike Beaumont <mjboamail@gmail.com>
Signed-off-by: Mike Beaumont <mjboamail@gmail.com>
Signed-off-by: Mike Beaumont <mjboamail@gmail.com>
@michaelbeaumont michaelbeaumont force-pushed the feat/do-zone-healthcheck branch 2 times, most recently from 7b6ae86 to 848d7aa Compare October 6, 2023 12:53
@michaelbeaumont michaelbeaumont marked this pull request as ready for review October 6, 2023 13:07
@michaelbeaumont michaelbeaumont requested a review from a team as a code owner October 6, 2023 13:07
@michaelbeaumont michaelbeaumont requested review from bartsmykla and jakubdyszkiewicz and removed request for a team October 6, 2023 13:07
@michaelbeaumont michaelbeaumont changed the title feat: introduce zone health checks feat(kuma-cp): introduce zone health checks Oct 6, 2023
Signed-off-by: Mike Beaumont <mjboamail@gmail.com>
Signed-off-by: Mike Beaumont <mjboamail@gmail.com>
Signed-off-by: Mike Beaumont <mjboamail@gmail.com>
Copy link
Contributor

@jakubdyszkiewicz jakubdyszkiewicz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could introduce tests at least for ZoneWatch. Any ideas for E2E test?

pkg/kds/service/server.go Outdated Show resolved Hide resolved
Signed-off-by: Mike Beaumont <mjboamail@gmail.com>
@michaelbeaumont
Copy link
Contributor Author

I think we could introduce tests at least for ZoneWatch. Any ideas for E2E test?

I think e2e will be somewhat difficult because of the nature of isolating failure to the app level health check. The always possible, but ugly option is to introduce an option like "don't send health check pings" to the CP but IMO we should avoid this artificial, public test interface

@slonka mentioned potentially sending SIGSTOP to make the process stop sending pings but keep TCP connections open. The problem that I see is that HTTP2 PING ACKs also stop, which will trigger the gRPC keep alive to fail.

@jakubdyszkiewicz
Copy link
Contributor

but IMO we should avoid this artificial, public test interface

I agree.

We could put Kong in the middle (we have test infra to deploy KIC) which apparently does not proxy PING/GOAWAY and then do SIGKILL on Zone CP. I wonder if this would work.

@slonka
Copy link
Contributor

slonka commented Oct 10, 2023

We could put Kong in the middle (we have test infra to deploy KIC) which apparently does not proxy PING/GOAWAY and then do SIGKILL on Zone CP. I wonder if this would work.

It should work 👍

@michaelbeaumont
Copy link
Contributor Author

OK, if we have some infrastructure for starting up kong gateway it becomes significantly less work

Signed-off-by: Mike Beaumont <mjboamail@gmail.com>
Signed-off-by: Mike Beaumont <mjboamail@gmail.com>
…thcheck

Signed-off-by: Mike Beaumont <mjboamail@gmail.com>
Signed-off-by: Mike Beaumont <mjboamail@gmail.com>
Signed-off-by: Mike Beaumont <mjboamail@gmail.com>
Signed-off-by: Mike Beaumont <mjboamail@gmail.com>
… timing out

Signed-off-by: Mike Beaumont <mjboamail@gmail.com>
Signed-off-by: Mike Beaumont <mjboamail@gmail.com>
@michaelbeaumont michaelbeaumont merged commit baa72b6 into kumahq:master Oct 10, 2023
5 checks passed
@michaelbeaumont michaelbeaumont deleted the feat/do-zone-healthcheck branch October 10, 2023 17:08
@michaelbeaumont michaelbeaumont added the ci/run-full-matrix PR: Runs all possible e2e test combination (expensive use carefully) label Oct 11, 2023
slonka added a commit that referenced this pull request Oct 11, 2023
slonka added a commit that referenced this pull request Oct 11, 2023
…8017)

Revert "feat(kuma-cp): introduce zone health checks (#7821)"

This reverts commit baa72b6.

---

Signed-off-by: slonka <slonka@users.noreply.github.com>
michaelbeaumont added a commit to michaelbeaumont/kuma that referenced this pull request Oct 12, 2023
This reverts commit 386ab53.

Signed-off-by: Mike Beaumont <mjboamail@gmail.com>
michaelbeaumont added a commit that referenced this pull request Oct 13, 2023
* feat(kuma-cp): reintroduce zone health checks (#7821)

This reverts commit 386ab53.

* feat: don't error in zone if global CP doesn't support health check

Signed-off-by: Mike Beaumont <mjboamail@gmail.com>
@lahabana lahabana changed the title feat(kuma-cp): introduce zone health checks feat(kds): introduce zone health checks Nov 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci/run-full-matrix PR: Runs all possible e2e test combination (expensive use carefully)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants