Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NE-585 Update enhancement after implementation #1054

Merged
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
130 changes: 76 additions & 54 deletions enhancements/ingress/haproxy-healthcheckinterval.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
---
title: global-healthcheck-interval-in-haproxy
title: haproxy-healthcheckinterval
authors:
- "@cholman"
reviewers:
Expand All @@ -15,8 +15,9 @@ approvers:
api-approvers: # in case of new or modified APIs or API extensions (CRDs, aggregated apiservers, webhooks, finalizers)
- "@deads2k"
creation-date: 2021-11-05
last-updated: 2022-03-01
status: implementable
last-updated: 2022-04-26
tracking-link:
- "https://issues.redhat.com/browse/NE-585"
see-also:
replaces:
superseded-by:
Expand All @@ -29,7 +30,7 @@ superseded-by:
- [x] Enhancement is `implementable`
- [x] Design details are appropriately documented from clear requirements
- [x] Test plan is defined
- [ ] Operational readiness criteria is defined
- [x] Operational readiness criteria is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

Expand All @@ -40,7 +41,7 @@ administrators may customize the length of time between consecutive healthchecks

This is already configurable via a route annotation called `router.openshift.io/haproxy.health.check.interval`, but
exposing the healthcheck interval at a global scope is desired for efficient administration of routes. HAProxy allows
setting the healthcheck globally as well as per-route, and both options will be addressed as a part of this proposal.
setting the healthcheck interval globally as well as per-route, and both options will be addressed as a part of this proposal.

## Motivation

Expand All @@ -67,39 +68,40 @@ Performance validation is not in scope for this proposal.

Other parameters may come into play for adjusting the healthcheck interval, but will not be covered in this proposal.
It is worth checking in the future, for the impact of `spread-checks`, `max-spread-checks`, `fastinter`, `downinter`,
and the use of passive health checks rather than active health checks.
and the use of passive healthchecks rather than active healthchecks.

## Proposal

HAProxy configures health check intervals using the
HAProxy configures healthcheck intervals using the
[`check` parameter](http://cbonte.github.io/haproxy-dconv/2.2/configuration.html#5.2-check).
- `check inter <duration>`: sets the interval between two consecutive health checks to the time specified
in `duration`.
- `check inter <duration>`: sets the interval between two consecutive healthchecks to be the time specified
in `<duration>`.

For plain HTTP backends, or backends with TLS termination set to edge, re-encrypt, or passthrough,
the `check inter` duration value may come from (evaluated in order):
For plain HTTP backends, or backends with TLS termination set to `edge`, `re-encrypt`, or `passthrough`,
the `check inter` duration value may be derived from the following sources (evaluated in order):
- the `router.openshift.io/haproxy.health.check.interval` annotation on the route
- a (currently hidden) `ROUTER_BACKEND_CHECK_INTERVAL` environment variable
- 5 seconds as a default if the previous two values are unset

Note that this `check inter` value also serves as a timeout for health checks sent to servers
Note that this `check inter` value also serves as a timeout for healthchecks sent to servers
if `timeout check` is not set, but `timeout check` is currently hard-coded to 5 seconds in the OpenShift HAProxy
configuration. The minimum value for `check inter` will be 5 seconds.
configuration. The minimum value for `check inter` will be 1 second.
(See the Risks and Mitigations section for further discussion.)

`check inter` is set per server for the same backends as `timeout check`, with some
additional constraints. For plain HTTP backends, or backends with TLS termination set to edge, re-encrypt, or
passthrough, it is set only if the endpoint is not idled, and the route has more than one active endpoint.
additional constraints. For plain HTTP backends, or backends with TLS termination set to `edge`, `re-encrypt`, or
`passthrough`, it is set only if the endpoint is not idled, and the route has more than one active endpoint.

This proposal will add a new optional parameter, `HealthCheckInterval` to the `IngressControllerTuningOptions` struct,
to set the duration used in `check inter` at the backend scope for every backend server that currently sets a healthcheck
interval.

### User Stories

#### As a cluster administrator, I need to reduce the frequency of healthchecks to 20 seconds on all backends
#### As a cluster administrator, I need to reduce the frequency of healthchecks on all backends

Edit the `IngressController` specification to add `healthCheckInterval` to `tuningOptions`. The duration of the
Edit the `IngressController` specification to add `healthCheckInterval` to `tuningOptions`. The default value is
currently 5 seconds, so to reduce the frequency, you must enter a value greater than 5 seconds. The duration of the
`healthCheckInterval` must be set as a string representing time values. The time value format is an integer optionally
followed by a time unit suffix (e.g. "ms", "s", "m", etc.). If no unit is specified, the value is measured in
milliseconds. More information on the time format can be found in the description of
Expand All @@ -113,14 +115,14 @@ metadata:
namespace: openshift-ingress-operator
spec:
tuningOptions:
healthCheckInterval: "20s"
healthCheckInterval: 10s
...
```

#### As a cluster administrator, I need to reduce the frequency of healthchecks to 20 seconds on all backends except one
#### As a cluster administrator, I need to reduce the frequency of healthchecks on all backends except one

Edit the `IngressController` specification to add `healthCheckInterval` to `tuningOptions` as shown above. Then set
the route annotation on the exception route to the default 5 seconds:
the route annotation `router.openshift.io/haproxy.health.check.interval` on the exception route to the default 5 seconds:

```yaml
apiVersion: v1
Expand All @@ -145,31 +147,41 @@ interval on all backends in a cluster.
```go
type IngressControllerTuningOptions struct {
...
// healthCheckInterval defines how long the router waits between two consecutive
// health checks on its configured backends. This value is applied globally as
// a default for all routes, but may be overridden per-route by the route annotation
// "router.openshift.io/haproxy.health.check.interval".
//
// If unset, the default healthCheckInterval is 5s.
//
// +kubebuilder:validation:Optional
// +kubebuilder:validation:Format=duration
// +optional
HealthCheckInterval *metav1.Duration `json:”healthCheckInterval,omitempty”`
// healthCheckInterval defines how long the router waits between two consecutive
// healthchecks on its configured backends. This value is applied globally as
// a default for all routes, but may be overridden per-route by the route annotation
// "router.openshift.io/haproxy.health.check.interval".
//
// Setting this to less than 5s can cause excess traffic due to too frequent
// TCP healthchecks and accompanying SYN packet storms. Alternatively, setting
// this too high can result in increased latency, due to backend servers that are no
// longer available, but haven't yet been detected as such.
//
// An empty or zero healthCheckInterval means no opinion and IngressController chooses
// a default, which is subject to change over time.
// Currently the default healthCheckInterval value is 5s.
//
// Currently the minimum allowed value is 1s and the maximum allowed value is
// 2147483647ms (24.85 days). Both are subject to change over time.
//
// +kubebuilder:validation:Optional
// +kubebuilder:validation:Format=duration
// +optional
HealthCheckInterval *metav1.Duration `json:"healthCheckInterval,omitempty"`
}
```
### Implementation Details/Notes/Constraints [optional]
### Implementation Details/Notes/Constraints

To expose the `healthCheckInterval` to HAProxy, the environment variable `ROUTER_BACKEND_CHECK_TIMEOUT` will be added
To expose the `healthCheckInterval` to HAProxy, the environment variable `ROUTER_BACKEND_CHECK_INTERVAL` will be added
to the environment in
[desiredRouterDeployment](https://github.com/openshift/cluster-ingress-operator/blob/master/pkg/operator/controller/ingress/deployment.go#L206):
```go
// desiredRouterDeployment returns the desired router deployment.
func desiredRouterDeployment(ci *operatorv1.IngressController, ingressControllerImage string, ingressConfig *configv1.Ingress, apiConfig *configv1.APIServer, networkConfig *configv1.Network, proxyNeeded bool, haveClientCAConfigmap bool, clientCAConfigmap *corev1.ConfigMap) (*appsv1.Deployment, error) {
...
if ci.Spec.TuningOptions.HealthCheckInterval != nil && ci.Spec.TuningOptions.HealthCheckInterval > 5*time.Second {
env = append(env, corev1.EnvVar{Name: "ROUTER_BACKEND_CHECK_TIMEOUT",
Value: durationToHAProxyTimespec(ci.Spec.TuningOptions.HealthCheckInterval)})
if ci.Spec.TuningOptions.HealthCheckInterval != nil && ci.Spec.TuningOptions.HealthCheckInterval.Duration >= 1*time.Second {
env = append(env, corev1.EnvVar{Name: RouterBackendCheckInterval,
Value: durationToHAProxyTimespec(ci.Spec.TuningOptions.HealthCheckInterval.Duration)})
...
}
```
Expand All @@ -179,31 +191,33 @@ The HAProxy template will not be modified.

One risk of this proposal is that customers may inadvertently cause different problems with a long healthcheck
interval. For example, backend services that would normally be marked as DOWN would be seen as UP for a longer time
period, causing more errors to be seen by their application users. To mitigate this issue the healthcheck can be
adjusted to a shorter interval, or set back to the default.
period, causing more errors to be seen by their application users. To mitigate this issue the healthcheck interval can be
shortened, or set back to the default.

Furthermore, a change to the health check `inter` duration can have an impact on the `timeout connect` for healthchecks,
which is the maximum time to wait for a connection attempt to succeed.
Furthermore, a change to the healthcheck interval (`check inter`) can have an impact on the
maximum time to wait for a connection attempt to succeed (`timeout connect`) for healthchecks.
We globally set the `timeout connect` to be 5 seconds by default. (The default can be overridden as environment variable
`ROUTER_DEFAULT_CONNECT_TIMEOUT` but only with a custom router.) We also hard-code `timeout check` to 5 seconds and
apply to the same backend scope as `check inter`. As
[documented in the HAProxy configuration manual](http://cbonte.github.io/haproxy-dconv/2.2/configuration.html#4.2-timeout%20check),
when `timeout check` is set, HAProxy:
>uses min(`timeout connect`, `inter`) as a connect timeout for check and `timeout check` as an additional read timeout
>uses min(`timeout connect`, `inter`) as a connect timeout for check

This is likely to be a problem only if the `inter` duration was accidentally set to lower than 5 seconds, in which case
the healthchecks could time out before the expected 5 seconds. To mitigate this, the validation on the healthcheck
interval value (`healthCheckInterval`) will have a minimum of 5 seconds.
This is likely to be a problem only if the `inter` duration is accidentally set to less than 5 seconds, in which case
the healthchecks could time out before the expected 5 seconds. The godoc for the `healthCheckInterval` setting has already
been updated with a warning that setting it to less than 5 seconds can cause issues with excessive traffic and SYN packet
storms.

## Design Details

### Open Questions [optional]
### Open Questions

Is there ever a reason to allow users to set the healthcheck interval lower than 5 seconds?
Is there ever a reason to allow users to set the healthcheck interval lower than 5 seconds? We don't know the answer
to that question, but a note in the godoc ensures that users know why this is a bad idea.

### Test Plan

Unit testing can validate that `desiredRouterDeployment` sets the `ROUTER_BACKEND_CHECK_TIMEOUT` environment variable
Unit testing can validate that `desiredRouterDeployment` sets the `ROUTER_BACKEND_CHECK_INTERVAL` environment variable
correctly.

E2E testing can verify that the annotation `router.openshift.io/haproxy.health.check.interval` takes precedence
Expand Down Expand Up @@ -267,22 +281,30 @@ N/A
#### Support Procedures

If the frequency of healthchecks compromises the performance of HAProxy, and the revert to default
values doesn't fix it, that is evidence of another issue.
values doesn't fix it, that is evidence of another issue.

## Implementation History

## Drawbacks
### Drawbacks

This enhancement is customer-driven and is not proven to have the effects that the customer desires.

Additionally, because the route annotation can override the global healthcheck interval setting, this enhancement does not provide
defense against revised healthcheck intervals for individual routes.

Finally, this enhancement serves to obscure the configured healthcheck interval from namespace administrators. There is
a risk that this will
prompt them to add route annotations to gain insight into healthcheck intervals, thus duplicating or overriding the
efforts of the cluster administrator to set a global healthcheck interval for all routes.

## Implementation History

## Alternatives

- This customer request is similar to the HSTS customer request, in that users can set the value already
via route annotation but administrators want to make it mandatory on all routes, or to be global. We
could use the same approach we used for HSTS, by adding an admission controller that can be configured
by administrators to enforce a minimum or maximum health check interval configuration on each backend.
could have used the same approach we used for HSTS, by adding an admission controller that can be configured
by administrators to enforce a minimum or maximum healthcheck interval configuration on each backend.
However, the healthcheck interval is probably easier for administrators to understand as a tuning option,
and is not a security-critical change to routes like HSTS was.
and is not a security-critical change to routes like HSTS is.


- We could enable the automatic addition of route annotations for healthcheck interval, based
Expand Down