Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NE-585 Expose HealthCheck Interval #952

Conversation

candita
Copy link
Contributor

@candita candita commented Nov 5, 2021

Expose and make configurable the ROUTER_BACKEND_CHECK_INTERVAL environment variable in HAProxy's template so that administrators may customize the length of time between subsequent healthchecks of backend services.

This is already configurable via a route annotation called router.openshift.io/haproxy.health.check.interval, but
exposing the healthcheck interval at a global scope is desired for efficient administration of routes. HAProxy allows
setting the healthcheck globally as well as per-route, and both options will be addressed as a part of this proposal.

https://issues.redhat.com/browse/NE-585

@candita candita force-pushed the NE-585-ExposeHealthCheckInterval branch from e9d4862 to 5f3ae68 Compare November 6, 2021 20:31
@aravindhp
Copy link
Contributor

/uncc

@openshift-ci openshift-ci bot removed the request for review from aravindhp November 8, 2021 17:03
@openshift-bot
Copy link

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 6, 2021
@candita
Copy link
Contributor Author

candita commented Dec 6, 2021

/remove-lifecycle-stale

@candita
Copy link
Contributor Author

candita commented Dec 13, 2021

/remove-lifecycle stale

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 13, 2021
enhancements/ingress/haproxy-healthcheckinterval.md Outdated Show resolved Hide resolved
enhancements/ingress/haproxy-healthcheckinterval.md Outdated Show resolved Hide resolved
### Non-Goals

Although the frequency of the healthcheck interval is a cause of concern for some customers, this has not been
identified as a cause for concern for all customers, chiefly because the default is 10 seconds, which is documented in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

chiefly because the default is 10 seconds

Is this true? Please clarify what you mean by "the default". HAProxy's default is 2 seconds according to http://cbonte.github.io/haproxy-dconv/2.2/configuration.html#5.2-inter, and OpenShift router's default is 5 seconds per https://github.com/openshift/router/blob/ee343aef5c2eb9fdc079c1e23bf8e908c56f4d97/images/router/haproxy/conf/haproxy-config.template#L638.

Copy link
Contributor Author

@candita candita Feb 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The complete timeout is the sum of the read timeout + connect timeout. See:
https://github.com/openshift/enhancements/pull/952/files#diff-0b69ed9fb9fcb39200a3b4bbcb5da3f654f92af1530453835f5eaabdd5083097R97-R101

  • if timeout check is set, HAProxy "uses min(timeout connect, inter) as a connect timeout for check and timeout check as an additional read timeout"
  • if timeout check is not set, HAProxy "uses inter for the complete check timeout (connect + read)"

We use the timeout check 5000ms in the haproxy config for plain http backend, backend with TLS terminated at the edge, secure backend with re-encryption, and for passthrough.

So timeout connect 5s is global, but the total timeout is the sum of timeout connect and timeout check when the latter is set.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The complete timeout is the sum of the read timeout + connect timeout.

That makes sense. However, the text in the enhancement seems to be saying that the default healthcheck interval is 10 seconds:

Although the frequency of the healthcheck interval is a cause of concern for some customers, this has not been
identified as a cause for concern for all customers, chiefly because the default is 10 seconds

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we used all the defaults, we have timeout connect 5s + timeout check 5s = 10s for plain http backend, backend with TLS terminated at the edge, secure backend with re-encryption, and for passthrough. Do you mean I have to qualify that it is only for those backend types?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I mean that the total timeout might be 10 seconds, but the interval is 5 seconds whereas the enhancement seems to be saying that the interval is 10 seconds.

Copy link
Contributor Author

@candita candita Feb 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

timeout check is always set for those backend types, so we don't use check inter's value

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see where I misunderstood here, conflating timeouts with interval. Thanks for persisting with me. Will post an update

enhancements/ingress/haproxy-healthcheckinterval.md Outdated Show resolved Hide resolved
enhancements/ingress/haproxy-healthcheckinterval.md Outdated Show resolved Hide resolved
enhancements/ingress/haproxy-healthcheckinterval.md Outdated Show resolved Hide resolved

Use of the new `healthCheckInterval` in the `tuningOptions` will change the frequency of healthchecks
that HAProxy performs on its backends. There are scenarios where this could improve or compromise the
performance of HAProxy.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should describe healthchecks a little bit to help people diagnose issues. Namely, these healthchecks are TCP SYN probes, and HAProxy closes the connection with a RST as soon as it gets a SYN,ACK response. Excessive healthchecks therefore show up as SYN packet storms.

Comment on lines 296 to 298
- We could enable the automatic addition of route annotations for healthcheck interval, based
on a configuration in the ingress controller spec. This would cause mutating routes, which is not
acceptable.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specifically, this could be done using Gatekeeper or Kyverno.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it's not acceptable,right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently not since this PR got prioritized. ¯\_(ツ)_/¯. Might be worth mentioning them as options though.

enhancements/ingress/haproxy-healthcheckinterval.md Outdated Show resolved Hide resolved
@openshift-bot
Copy link

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 13, 2022
@openshift-bot
Copy link

Stale enhancement proposals rot after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 21, 2022
@candita
Copy link
Contributor Author

candita commented Jan 21, 2022

/lifecycle frozen

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jan 21, 2022

@candita: The lifecycle/frozen label cannot be applied to Pull Requests.

In response to this:

/lifecycle frozen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@candita
Copy link
Contributor Author

candita commented Jan 21, 2022

/remove-lifecycle rotten

@openshift-ci openshift-ci bot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 21, 2022
@candita candita marked this pull request as draft January 21, 2022 15:16
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 21, 2022
@candita candita force-pushed the NE-585-ExposeHealthCheckInterval branch 2 times, most recently from eb3522a to 92f9212 Compare February 8, 2022 23:36
@candita candita marked this pull request as ready for review February 8, 2022 23:40
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 8, 2022
@candita candita force-pushed the NE-585-ExposeHealthCheckInterval branch from 92f9212 to b580f4a Compare February 10, 2022 19:07
Copy link
Contributor

@Miciah Miciah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few minor comments, nothing too important.
/approve
/lgtm
/hold
in case you want to address any of the last few comments with this PR.

`healthCheckInterval` must be set as a string representing time values. The time value format is an integer optionally
followed by a time unit (e.g. "ms", "s", "m", etc.). If no unit is specified, the value is measured in
milliseconds. More information on the time format can be found in the
[HAProxy documentation](https://github.com/haproxy/haproxy/blob/v2.2.0/doc/configuration.txt).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The API uses the metav1.Duration value format. For a citation, we could use https://pkg.go.dev/k8s.io/apimachinery/pkg/apis/meta/v1#Duration and https://pkg.go.dev/time#ParseDuration. The most important difference from the HAProxy format is that the unit is required.

### API Extensions

This proposal will modify the `IngressController` API by adding a new variable called `HealthCheckInterval` to the
[`IngressControllerTuningOptions`](https://github.com/openshift/api/blob/master/operator/v1/types_ingress.go#L1193) struct
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to use permalinks, or just omit the link (people reading this can figure out which repository to grep).

Comment on lines 206 to 209
Unit tests will be added to test the propagation of the `healthCheckInterval` setting and its interaction with the router
annotation `router.openshift.io/haproxy.health.check.interval`.

E2E tests can validate that the environment variable is present and the HAProxy template is properly constructed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would think the unit test and E2E test would be swapped: a unit test in cluster-ingress-operator could verify that desiredRouterDeployment set the environment variable correctly, and an E2E test could verify that router.openshift.io/haproxy.health.check.interval took precedence over healthCheckInterval. I might be misunderstanding this or looking at it the wrong way though, so if it makes sense to test as described here, that's fine as long as we have reasonable test coverage.

Use of the new `healthCheckInterval` in the `tuningOptions` will change the frequency of healthchecks
that HAProxy performs on its backends. There are scenarios where this could either improve or compromise the
performance of HAProxy. Increasing the healthcheck interval too much can result in increased 500-level HTTP responses,
due to backend servers that are no longer available, but haven't yet been detected as such. Decreasing the healthcheck
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect the connection from HAProxy to an unavailable backend server to time out and HAProxy to then try a different backend server, resulting in added latency but usually not an HTTP 5xx response, unless multiple backend servers are down (see http://cbonte.github.io/haproxy-dconv/2.2/configuration.html#4-retries and http://cbonte.github.io/haproxy-dconv/2.2/configuration.html#4-option%20redispatch). Is that not the case?

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 1, 2022
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 1, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 1, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Miciah

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 1, 2022
@candita candita force-pushed the NE-585-ExposeHealthCheckInterval branch from b580f4a to 47b82ad Compare March 1, 2022 23:51
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Mar 1, 2022
@candita candita force-pushed the NE-585-ExposeHealthCheckInterval branch from 47b82ad to fd2546e Compare March 2, 2022 00:01
@Miciah
Copy link
Contributor

Miciah commented Mar 2, 2022

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 2, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 2, 2022

@candita: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@candita
Copy link
Contributor Author

candita commented Mar 2, 2022

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 2, 2022
@openshift-merge-robot openshift-merge-robot merged commit 20183fa into openshift:master Mar 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants