NE-585 Expose HealthCheck Interval #952

candita · 2021-11-05T23:56:31Z

Expose the `ROUTER_BACKEND_CHECK_INTERVAL` environment variable in HAProxy's template and make it configurable, so that administrators can customize the length of time between consecutive healthchecks of backend services.

This is already configurable via a route annotation called router.openshift.io/haproxy.health.check.interval, but
exposing the healthcheck interval at a global scope is desired for efficient administration of routes. HAProxy allows
setting the healthcheck globally as well as per-route, and both options will be addressed as a part of this proposal.

https://issues.redhat.com/browse/NE-585

aravindhp · 2021-11-08T17:03:12Z

/uncc

openshift-bot · 2021-12-06T22:38:29Z

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

candita · 2021-12-06T22:41:59Z

/remove-lifecycle-stale

candita · 2021-12-13T17:50:27Z

/remove-lifecycle stale

Miciah · 2021-12-16T16:36:11Z

enhancements/ingress/haproxy-healthcheckinterval.md

+### Non-Goals
+
+Although the frequency of the healthcheck interval is a cause of concern for some customers, this has not been
+identified as a cause for concern for all customers, chiefly because the default is 10 seconds, which is documented in


> chiefly because the default is 10 seconds

Is this true? Please clarify what you mean by "the default". HAProxy's default is 2 seconds according to http://cbonte.github.io/haproxy-dconv/2.2/configuration.html#5.2-inter, and OpenShift router's default is 5 seconds per https://github.com/openshift/router/blob/ee343aef5c2eb9fdc079c1e23bf8e908c56f4d97/images/router/haproxy/conf/haproxy-config.template#L638.

The complete timeout is the sum of the read timeout + connect timeout. See:
https://github.com/openshift/enhancements/pull/952/files#diff-0b69ed9fb9fcb39200a3b4bbcb5da3f654f92af1530453835f5eaabdd5083097R97-R101

If `timeout check` is set, HAProxy "uses min(timeout connect, inter) as a connect timeout for check and timeout check as an additional read timeout".

If `timeout check` is not set, HAProxy "uses inter for the complete check timeout (connect + read)".

We set `timeout check 5000ms` in the HAProxy config for plain HTTP backends, backends with TLS terminated at the edge, secure backends with re-encryption, and passthrough backends.

So `timeout connect 5s` is global, but the total timeout is the sum of `timeout connect` and `timeout check` when the latter is set.

> The complete timeout is the sum of the read timeout + connect timeout.

That makes sense. However, the text in the enhancement seems to be saying that the default healthcheck interval is 10 seconds:

> Although the frequency of the healthcheck interval is a cause of concern for some customers, this has not been
> identified as a cause for concern for all customers, chiefly because the default is 10 seconds

If we use all the defaults, we have `timeout connect 5s` + `timeout check 5s` = 10s for plain HTTP backends, backends with TLS terminated at the edge, secure backends with re-encryption, and passthrough backends. Do you mean I have to qualify that it applies only to those backend types?

No, I mean that the total timeout might be 10 seconds, but the interval is 5 seconds, whereas the enhancement seems to be saying that the interval is 10 seconds.

timeout check is always set for those backend types, so we don't use check inter's value

I see where I misunderstood here: I was conflating timeouts with the interval. Thanks for persisting with me. Will post an update.
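
To make the distinction between check timeouts and the check interval concrete, here is a minimal Python sketch of the two HAProxy rules quoted in this thread. The function name and the millisecond representation are illustrative assumptions, not part of HAProxy or the router, and the total is a worst case that assumes both the connect and read phases use their full budget.

```python
from typing import Optional, Tuple


def check_timeouts_ms(inter_ms: int,
                      timeout_connect_ms: int,
                      timeout_check_ms: Optional[int] = None) -> Tuple[int, int]:
    """Return (worst_case_total_check_timeout_ms, check_interval_ms).

    Hypothetical helper that only restates the two rules quoted above.
    """
    if timeout_check_ms is None:
        # "uses inter for the complete check timeout (connect + read)"
        return inter_ms, inter_ms

    # "uses min(timeout connect, inter) as a connect timeout for check
    #  and timeout check as an additional read timeout"
    connect_ms = min(timeout_connect_ms, inter_ms)
    return connect_ms + timeout_check_ms, inter_ms


# Values discussed in this thread: inter 5s, timeout connect 5s, timeout check 5000ms.
total_ms, interval_ms = check_timeouts_ms(5000, 5000, 5000)
print(total_ms, interval_ms)
```

With the values discussed here, the worst-case check timeout is 10 seconds while the interval between checks stays at 5 seconds, which is the distinction the review comment is drawing.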

Miciah · 2021-12-16T16:57:57Z

enhancements/ingress/haproxy-healthcheckinterval.md

+
+Use of the new `healthCheckInterval` in the `tuningOptions` will change the frequency of healthchecks
+that HAProxy performs on its backends.  There are scenarios where this could improve or compromise the
+performance of HAProxy.


We should describe healthchecks a little bit to help people diagnose issues. Namely, these healthchecks are TCP SYN probes, and HAProxy closes the connection with a RST as soon as it gets a SYN,ACK response. Excessive healthchecks therefore show up as SYN packet storms.
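
As a rough aid for diagnosing such SYN storms, the following Python sketch estimates the aggregate probe rate. It assumes that every router replica probes every backend server once per interval, independently of the other replicas; the function name and the example numbers are hypothetical and do not come from this PR.

```python
def syn_probes_per_second(backend_servers: int,
                          router_replicas: int,
                          interval_seconds: float) -> float:
    """Estimate healthcheck SYN packets per second across the cluster.

    Assumption: each router replica sends one SYN per backend server
    per healthcheck interval.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return backend_servers * router_replicas / interval_seconds


# Hypothetical cluster: 1000 backend servers behind 2 router replicas.
for interval in (5.0, 1.0):
    rate = syn_probes_per_second(1000, 2, interval)
    print(f"interval={interval}s -> about {rate:.0f} SYN/s")
```

Under these assumptions, shortening the interval from 5 seconds to 1 second multiplies the probe traffic by five, which is the kind of effect the enhancement text could warn about.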

Miciah · 2021-12-16T19:13:24Z

enhancements/ingress/haproxy-healthcheckinterval.md

+- We could enable the automatic addition of route annotations for healthcheck interval, based
+on a configuration in the ingress controller spec.  This would cause mutating routes, which is not
+  acceptable.


Specifically, this could be done using Gatekeeper or Kyverno.

But it's not acceptable, right?

Apparently not since this PR got prioritized. ¯\_(ツ)_/¯. Might be worth mentioning them as options though.

openshift-bot · 2022-01-13T23:14:01Z

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

openshift-bot · 2022-01-21T05:09:30Z

Stale enhancement proposals rot after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

candita · 2022-01-21T15:15:19Z

/lifecycle frozen

openshift-ci · 2022-01-21T15:15:20Z

@candita: The lifecycle/frozen label cannot be applied to Pull Requests.

In response to this:

/lifecycle frozen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

candita · 2022-01-21T15:15:50Z

/remove-lifecycle rotten

Miciah

I have a few minor comments, nothing too important.
/approve
/lgtm
/hold
in case you want to address any of the last few comments with this PR.

Miciah · 2022-03-01T22:02:34Z

enhancements/ingress/haproxy-healthcheckinterval.md

+`healthCheckInterval` must be set as a string representing time values.  The time value format is an integer optionally
+followed by a time unit (e.g. "ms", "s", "m", etc.). If no unit is specified, the value is measured in
+milliseconds. More information on the time format can be found in the
+[HAProxy documentation](https://github.com/haproxy/haproxy/blob/v2.2.0/doc/configuration.txt).


The API uses the metav1.Duration value format. For a citation, we could use https://pkg.go.dev/k8s.io/apimachinery/pkg/apis/meta/v1#Duration and https://pkg.go.dev/time#ParseDuration. The most important difference from the HAProxy format is that the unit is required.
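
The difference between the two formats can be illustrated with a simplified Python sketch. Both parsers below are hypothetical stand-ins written for this comparison: the first follows the HAProxy rule described in the quoted text (a bare integer means milliseconds), and the second approximates Go's `time.ParseDuration` rule that every component needs a unit. Signs and some edge cases that Go accepts are deliberately left out.

```python
import re

# HAProxy time units, expressed in milliseconds.
_HAPROXY_UNITS_MS = {"us": 0.001, "ms": 1, "s": 1000,
                     "m": 60_000, "h": 3_600_000, "d": 86_400_000}

# Units accepted by Go's time.ParseDuration, expressed in milliseconds.
_GO_UNITS_MS = {"ns": 1e-6, "us": 1e-3, "µs": 1e-3, "ms": 1,
                "s": 1000, "m": 60_000, "h": 3_600_000}
_GO_COMPONENT = re.compile(r"(\d+(?:\.\d+)?)(ns|us|µs|ms|s|m|h)")


def parse_haproxy_time_ms(value: str) -> float:
    """HAProxy style: an integer with an optional unit; no unit means ms."""
    match = re.fullmatch(r"(\d+)(us|ms|s|m|h|d)?", value)
    if match is None:
        raise ValueError(f"invalid HAProxy time value {value!r}")
    number, unit = match.groups()
    return int(number) * _HAPROXY_UNITS_MS[unit or "ms"]


def parse_go_style_duration_ms(value: str) -> float:
    """Go style (simplified): one or more number+unit components."""
    if value == "0":  # Go accepts a bare "0" as a special case
        return 0.0
    position, total = 0, 0.0
    while position < len(value):
        match = _GO_COMPONENT.match(value, position)
        if match is None:
            raise ValueError(f"missing or unknown unit in duration {value!r}")
        total += float(match.group(1)) * _GO_UNITS_MS[match.group(2)]
        position = match.end()
    if position == 0:
        raise ValueError("empty duration")
    return total


print(parse_haproxy_time_ms("5000"))        # bare integer read as milliseconds
print(parse_go_style_duration_ms("5s"))     # unit is mandatory here
```

In this sketch, `"5000"` is a valid HAProxy value but raises `ValueError` in the Go-style parser, which is the difference the enhancement text should call out.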

Miciah · 2022-03-01T22:04:25Z

enhancements/ingress/haproxy-healthcheckinterval.md

+### API Extensions
+
+This proposal will modify the `IngressController` API by adding a new variable called `HealthCheckInterval` to the
+[`IngressControllerTuningOptions`](https://github.com/openshift/api/blob/master/operator/v1/types_ingress.go#L1193) struct


Better to use permalinks, or just omit the link (people reading this can figure out which repository to grep).

Miciah · 2022-03-01T22:09:02Z

enhancements/ingress/haproxy-healthcheckinterval.md

+Unit tests will be added to test the propagation of the `healthCheckInterval` setting and its interaction with the router
+annotation `router.openshift.io/haproxy.health.check.interval`.
+
+E2E tests can validate that the environment variable is present and the HAProxy template is properly constructed.


I would think the unit test and E2E test would be swapped: a unit test in cluster-ingress-operator could verify that desiredRouterDeployment set the environment variable correctly, and an E2E test could verify that router.openshift.io/haproxy.health.check.interval took precedence over healthCheckInterval. I might be misunderstanding this or looking at it the wrong way though, so if it makes sense to test as described here, that's fine as long as we have reasonable test coverage.
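
The precedence that such an E2E test would check can be sketched as follows. This is an illustrative Python model, not the router's template logic: the function name is hypothetical, the fallback string is an assumed spelling of the 5-second router default cited earlier in this review, and the sketch skips any validation of the values.

```python
from typing import Mapping

ANNOTATION = "router.openshift.io/haproxy.health.check.interval"
ENV_VAR = "ROUTER_BACKEND_CHECK_INTERVAL"
ROUTER_DEFAULT = "5000ms"  # assumed spelling of the 5-second router default


def effective_check_interval(route_annotations: Mapping[str, str],
                             router_env: Mapping[str, str]) -> str:
    """Pick the healthcheck interval for one route.

    Order modeled here: route annotation, then the router-wide
    environment variable, then the built-in default.
    """
    annotation = route_annotations.get(ANNOTATION)
    if annotation:
        return annotation
    return router_env.get(ENV_VAR) or ROUTER_DEFAULT


# A route with the annotation overrides the global setting.
print(effective_check_interval({ANNOTATION: "2s"}, {ENV_VAR: "20s"}))
# A route without the annotation inherits the global setting.
print(effective_check_interval({}, {ENV_VAR: "20s"}))
```

A unit test in cluster-ingress-operator would then only need to confirm that the environment variable is set on the router deployment, while the E2E test exercises the override behavior against a real route.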

Miciah · 2022-03-01T22:25:35Z

enhancements/ingress/haproxy-healthcheckinterval.md

+Use of the new `healthCheckInterval` in the `tuningOptions` will change the frequency of healthchecks
+that HAProxy performs on its backends.  There are scenarios where this could either improve or compromise the
+performance of HAProxy.  Increasing the healthcheck interval too much can result in increased 500-level HTTP responses,
+due to backend servers that are no longer available, but haven't yet been detected as such.  Decreasing the healthcheck


I would expect the connection from HAProxy to an unavailable backend server to time out and HAProxy to then try a different backend server, resulting in added latency but usually not an HTTP 5xx response, unless multiple backend servers are down (see http://cbonte.github.io/haproxy-dconv/2.2/configuration.html#4-retries and http://cbonte.github.io/haproxy-dconv/2.2/configuration.html#4-option%20redispatch). Is that not the case?

openshift-ci · 2022-03-01T22:29:30Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Miciah

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [Miciah]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Miciah · 2022-03-02T00:04:21Z

/lgtm

openshift-ci · 2022-03-02T00:13:33Z

@candita: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

candita · 2022-03-02T14:31:22Z

/unhold

openshift-ci bot requested review from aravindhp and sjenning November 5, 2021 23:57

candita force-pushed the NE-585-ExposeHealthCheckInterval branch from e9d4862 to 5f3ae68 Compare November 6, 2021 20:31

openshift-ci bot removed the request for review from aravindhp November 8, 2021 17:03

openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 6, 2021

openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 13, 2021

Miciah reviewed Dec 16, 2021

openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 13, 2022

openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 21, 2022

openshift-ci bot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 21, 2022

candita marked this pull request as draft January 21, 2022 15:16

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 21, 2022

candita force-pushed the NE-585-ExposeHealthCheckInterval branch 2 times, most recently from eb3522a to 92f9212 Compare February 8, 2022 23:36

candita marked this pull request as ready for review February 8, 2022 23:40

openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 8, 2022

openshift-ci bot requested review from jsafrane and LalatenduMohanty February 8, 2022 23:40

candita force-pushed the NE-585-ExposeHealthCheckInterval branch from 92f9212 to b580f4a Compare February 10, 2022 19:07

Miciah reviewed Mar 1, 2022

openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 1, 2022

openshift-ci bot assigned Miciah Mar 1, 2022

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 1, 2022

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 1, 2022

candita mentioned this pull request Mar 1, 2022

NE-585 Make ROUTER_BACKEND_CHECK_INTERVAL configurable openshift/api#1127

Merged

candita force-pushed the NE-585-ExposeHealthCheckInterval branch from b580f4a to 47b82ad Compare March 1, 2022 23:51

openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Mar 1, 2022

NE-585 Expose HealthCheck Interval

fd2546e

candita force-pushed the NE-585-ExposeHealthCheckInterval branch from 47b82ad to fd2546e Compare March 2, 2022 00:01

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 2, 2022

openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 2, 2022

openshift-merge-robot merged commit 20183fa into openshift:master Mar 2, 2022
