diff --git a/enhancements/ingress/haproxy-healthcheckinterval.md b/enhancements/ingress/haproxy-healthcheckinterval.md index 89cb257359..bec95275ca 100644 --- a/enhancements/ingress/haproxy-healthcheckinterval.md +++ b/enhancements/ingress/haproxy-healthcheckinterval.md @@ -1,5 +1,5 @@ --- -title: global-healthcheck-interval-in-haproxy +title: haproxy-healthcheckinterval authors: - "@cholman" reviewers: @@ -15,8 +15,9 @@ approvers: api-approvers: # in case of new or modified APIs or API extensions (CRDs, aggregated apiservers, webhooks, finalizers) - "@deads2k" creation-date: 2021-11-05 -last-updated: 2022-03-01 -status: implementable +last-updated: 2022-04-26 +tracking-link: + - "https://issues.redhat.com/browse/NE-585" see-also: replaces: superseded-by: @@ -29,7 +30,7 @@ superseded-by: - [x] Enhancement is `implementable` - [x] Design details are appropriately documented from clear requirements - [x] Test plan is defined -- [ ] Operational readiness criteria is defined +- [x] Operational readiness criteria is defined - [ ] Graduation criteria for dev preview, tech preview, GA - [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/) @@ -40,7 +41,7 @@ administrators may customize the length of time between consecutive healthchecks This is already configurable via a route annotation called `router.openshift.io/haproxy.health.check.interval`, but exposing the healthcheck interval at a global scope is desired for efficient administration of routes. HAProxy allows -setting the healthcheck globally as well as per-route, and both options will be addressed as a part of this proposal. +setting the healthcheck interval globally as well as per-route, and both options will be addressed as a part of this proposal. ## Motivation @@ -67,29 +68,29 @@ Performance validation is not in scope for this proposal. Other parameters may come into play for adjusting the healthcheck interval, but will not be covered in this proposal. It is worth checking in the future, for the impact of `spread-checks`, `max-spread-checks`, `fastinter`, `downinter`, -and the use of passive health checks rather than active health checks. +and the use of passive healthchecks rather than active healthchecks. ## Proposal -HAProxy configures health check intervals using the +HAProxy configures healthcheck intervals using the [`check` parameter](http://cbonte.github.io/haproxy-dconv/2.2/configuration.html#5.2-check). -- `check inter `: sets the interval between two consecutive health checks to the time specified -in `duration`. +- `check inter `: sets the interval between two consecutive healthchecks to be the time specified +in ``. -For plain HTTP backends, or backends with TLS termination set to edge, re-encrypt, or passthrough, -the `check inter` duration value may come from (evaluated in order): +For plain HTTP backends, or backends with TLS termination set to `edge`, `re-encrypt`, or `passthrough`, +the `check inter` duration value may be derived from the following sources (evaluated in order): - the `router.openshift.io/haproxy.health.check.interval` annotation on the route - a (currently hidden) `ROUTER_BACKEND_CHECK_INTERVAL` environment variable - 5 seconds as a default if the previous two values are unset -Note that this `check inter` value also serves as a timeout for health checks sent to servers +Note that this `check inter` value also serves as a timeout for healthchecks sent to servers if `timeout check` is not set, but `timeout check` is currently hard-coded to 5 seconds in the OpenShift HAProxy -configuration. The minimum value for `check inter` will be 5 seconds. +configuration. The minimum value for `check inter` will be 1 second. (See the Risks and Mitigations section for further discussion.) `check inter` is set per server for the same backends as `timeout check`, with some -additional constraints. For plain HTTP backends, or backends with TLS termination set to edge, re-encrypt, or -passthrough, it is set only if the endpoint is not idled, and the route has more than one active endpoint. +additional constraints. For plain HTTP backends, or backends with TLS termination set to `edge`, `re-encrypt`, or +`passthrough`, it is set only if the endpoint is not idled, and the route has more than one active endpoint. This proposal will add a new optional parameter, `HealthCheckInterval` to the `IngressControllerTuningOptions` struct, to set the duration used in `check inter` at the backend scope for every backend server that currently sets a healthcheck @@ -97,9 +98,10 @@ interval. ### User Stories -#### As a cluster administrator, I need to reduce the frequency of healthchecks to 20 seconds on all backends +#### As a cluster administrator, I need to reduce the frequency of healthchecks on all backends -Edit the `IngressController` specification to add `healthCheckInterval` to `tuningOptions`. The duration of the +Edit the `IngressController` specification to add `healthCheckInterval` to `tuningOptions`. The default value is +currently 5 seconds, so to reduce the frequency, you must enter a value greater than 5 seconds. The duration of the `healthCheckInterval` must be set as a string representing time values. The time value format is an integer optionally followed by a time unit suffix (e.g. "ms", "s", "m", etc.). If no unit is specified, the value is measured in milliseconds. More information on the time format can be found in the description of @@ -113,14 +115,14 @@ metadata: namespace: openshift-ingress-operator spec: tuningOptions: - healthCheckInterval: "20s" + healthCheckInterval: 10s ... ``` -#### As a cluster administrator, I need to reduce the frequency of healthchecks to 20 seconds on all backends except one +#### As a cluster administrator, I need to reduce the frequency of healthchecks on all backends except one Edit the `IngressController` specification to add `healthCheckInterval` to `tuningOptions` as shown above. Then set -the route annotation on the exception route to the default 5 seconds: +the route annotation `router.openshift.io/haproxy.health.check.interval` on the exception route to the default 5 seconds: ```yaml apiVersion: v1 @@ -145,31 +147,41 @@ interval on all backends in a cluster. ```go type IngressControllerTuningOptions struct { ... - // healthCheckInterval defines how long the router waits between two consecutive - // health checks on its configured backends. This value is applied globally as - // a default for all routes, but may be overridden per-route by the route annotation - // "router.openshift.io/haproxy.health.check.interval". - // - // If unset, the default healthCheckInterval is 5s. - // - // +kubebuilder:validation:Optional - // +kubebuilder:validation:Format=duration - // +optional - HealthCheckInterval *metav1.Duration `json:ā€¯healthCheckInterval,omitemptyā€¯` + // healthCheckInterval defines how long the router waits between two consecutive + // healthchecks on its configured backends. This value is applied globally as + // a default for all routes, but may be overridden per-route by the route annotation + // "router.openshift.io/haproxy.health.check.interval". + // + // Setting this to less than 5s can cause excess traffic due to too frequent + // TCP healthchecks and accompanying SYN packet storms. Alternatively, setting + // this too high can result in increased latency, due to backend servers that are no + // longer available, but haven't yet been detected as such. + // + // An empty or zero healthCheckInterval means no opinion and IngressController chooses + // a default, which is subject to change over time. + // Currently the default healthCheckInterval value is 5s. + // + // Currently the minimum allowed value is 1s and the maximum allowed value is + // 2147483647ms (24.85 days). Both are subject to change over time. + // + // +kubebuilder:validation:Optional + // +kubebuilder:validation:Format=duration + // +optional + HealthCheckInterval *metav1.Duration `json:"healthCheckInterval,omitempty"` } ``` -### Implementation Details/Notes/Constraints [optional] +### Implementation Details/Notes/Constraints -To expose the `healthCheckInterval` to HAProxy, the environment variable `ROUTER_BACKEND_CHECK_TIMEOUT` will be added +To expose the `healthCheckInterval` to HAProxy, the environment variable `ROUTER_BACKEND_CHECK_INTERVAL` will be added to the environment in [desiredRouterDeployment](https://github.com/openshift/cluster-ingress-operator/blob/master/pkg/operator/controller/ingress/deployment.go#L206): ```go // desiredRouterDeployment returns the desired router deployment. func desiredRouterDeployment(ci *operatorv1.IngressController, ingressControllerImage string, ingressConfig *configv1.Ingress, apiConfig *configv1.APIServer, networkConfig *configv1.Network, proxyNeeded bool, haveClientCAConfigmap bool, clientCAConfigmap *corev1.ConfigMap) (*appsv1.Deployment, error) { ... -if ci.Spec.TuningOptions.HealthCheckInterval != nil && ci.Spec.TuningOptions.HealthCheckInterval > 5*time.Second { - env = append(env, corev1.EnvVar{Name: "ROUTER_BACKEND_CHECK_TIMEOUT", - Value: durationToHAProxyTimespec(ci.Spec.TuningOptions.HealthCheckInterval)}) +if ci.Spec.TuningOptions.HealthCheckInterval != nil && ci.Spec.TuningOptions.HealthCheckInterval.Duration >= 1*time.Second { + env = append(env, corev1.EnvVar{Name: RouterBackendCheckInterval, + Value: durationToHAProxyTimespec(ci.Spec.TuningOptions.HealthCheckInterval.Duration)}) ... } ``` @@ -179,31 +191,33 @@ The HAProxy template will not be modified. One risk of this proposal is that customers may inadvertently cause different problems with a long healthcheck interval. For example, backend services that would normally be marked as DOWN would be seen as UP for a longer time -period, causing more errors to be seen by their application users. To mitigate this issue the healthcheck can be -adjusted to a shorter interval, or set back to the default. +period, causing more errors to be seen by their application users. To mitigate this issue the healthcheck interval can be +shortened, or set back to the default. -Furthermore, a change to the health check `inter` duration can have an impact on the `timeout connect` for healthchecks, -which is the maximum time to wait for a connection attempt to succeed. +Furthermore, a change to the healthcheck interval (`check inter`) can have an impact on the +maximum time to wait for a connection attempt to succeed (`timeout connect`) for healthchecks. We globally set the `timeout connect` to be 5 seconds by default. (The default can be overridden as environment variable `ROUTER_DEFAULT_CONNECT_TIMEOUT` but only with a custom router.) We also hard-code `timeout check` to 5 seconds and apply to the same backend scope as `check inter`. As [documented in the HAProxy configuration manual](http://cbonte.github.io/haproxy-dconv/2.2/configuration.html#4.2-timeout%20check), when `timeout check` is set, HAProxy: ->uses min(`timeout connect`, `inter`) as a connect timeout for check and `timeout check` as an additional read timeout +>uses min(`timeout connect`, `inter`) as a connect timeout for check -This is likely to be a problem only if the `inter` duration was accidentally set to lower than 5 seconds, in which case -the healthchecks could time out before the expected 5 seconds. To mitigate this, the validation on the healthcheck -interval value (`healthCheckInterval`) will have a minimum of 5 seconds. +This is likely to be a problem only if the `inter` duration is accidentally set to less than 5 seconds, in which case +the healthchecks could time out before the expected 5 seconds. The godoc for the `healthCheckInterval` setting has already +been updated with a warning that setting it to less than 5 seconds can cause issues with excessive traffic and SYN packet +storms. ## Design Details -### Open Questions [optional] +### Open Questions -Is there ever a reason to allow users to set the healthcheck interval lower than 5 seconds? +Is there ever a reason to allow users to set the healthcheck interval lower than 5 seconds? We don't know the answer +to that question, but a note in the godoc ensures that users know why this is a bad idea. ### Test Plan -Unit testing can validate that `desiredRouterDeployment` sets the `ROUTER_BACKEND_CHECK_TIMEOUT` environment variable +Unit testing can validate that `desiredRouterDeployment` sets the `ROUTER_BACKEND_CHECK_INTERVAL` environment variable correctly. E2E testing can verify that the annotation `router.openshift.io/haproxy.health.check.interval` takes precedence @@ -267,22 +281,30 @@ N/A #### Support Procedures If the frequency of healthchecks compromises the performance of HAProxy, and the revert to default -values doesn't fix it, that is evidence of another issue. +values doesn't fix it, that is evidence of another issue. -## Implementation History - -## Drawbacks +### Drawbacks This enhancement is customer-driven and is not proven to have the effects that the customer desires. +Additionally, because the route annotation can override the global healthcheck interval setting, this enhancement does not provide +defense against revised healthcheck intervals for individual routes. + +Finally, this enhancement serves to obscure the configured healthcheck interval from namespace administrators. There is +a risk that this will +prompt them to add route annotations to gain insight into healthcheck intervals, thus duplicating or overriding the +efforts of the cluster administrator to set a global healthcheck interval for all routes. + +## Implementation History + ## Alternatives - This customer request is similar to the HSTS customer request, in that users can set the value already via route annotation but administrators want to make it mandatory on all routes, or to be global. We -could use the same approach we used for HSTS, by adding an admission controller that can be configured -by administrators to enforce a minimum or maximum health check interval configuration on each backend. +could have used the same approach we used for HSTS, by adding an admission controller that can be configured +by administrators to enforce a minimum or maximum healthcheck interval configuration on each backend. However, the healthcheck interval is probably easier for administrators to understand as a tuning option, -and is not a security-critical change to routes like HSTS was. +and is not a security-critical change to routes like HSTS is. - We could enable the automatic addition of route annotations for healthcheck interval, based