NE-585 Expose HealthCheck Interval #952
/uncc
Inactive enhancement proposals go stale after 28d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting /remove-lifecycle stale. If this proposal is safe to close now please do so with /close. /lifecycle stale
/remove-lifecycle-stale
/remove-lifecycle stale
### Non-Goals

Although the frequency of the healthcheck interval is a cause of concern for some customers, this has not been identified as a cause for concern for all customers, chiefly because the default is 10 seconds, which is documented in
chiefly because the default is 10 seconds
Is this true? Please clarify what you mean by "the default". HAProxy's default is 2 seconds according to http://cbonte.github.io/haproxy-dconv/2.2/configuration.html#5.2-inter, and OpenShift router's default is 5 seconds per https://github.com/openshift/router/blob/ee343aef5c2eb9fdc079c1e23bf8e908c56f4d97/images/router/haproxy/conf/haproxy-config.template#L638.
The complete timeout is the sum of the read timeout + connect timeout. See:
https://github.com/openshift/enhancements/pull/952/files#diff-0b69ed9fb9fcb39200a3b4bbcb5da3f654f92af1530453835f5eaabdd5083097R97-R101
- if `timeout check` is set, HAProxy "uses min(`timeout connect`, `inter`) as a connect timeout for check and `timeout check` as an additional read timeout"
- if `timeout check` is not set, HAProxy "uses `inter` for the complete check timeout (connect + read)"
We use `timeout check 5000ms` in the HAProxy config for plain HTTP backends, backends with TLS terminated at the edge, secure backends with re-encryption, and passthrough. So `timeout connect 5s` is global, but the total timeout is the sum of `timeout connect` and `timeout check` when the latter is set.
The complete timeout is the sum of the read timeout + connect timeout.
That makes sense. However, the text in the enhancement seems to be saying that the default healthcheck interval is 10 seconds:
Although the frequency of the healthcheck interval is a cause of concern for some customers, this has not been
identified as a cause for concern for all customers, chiefly because the default is 10 seconds
If we used all the defaults, we have `timeout connect 5s` + `timeout check 5s` = 10s for plain HTTP backends, backends with TLS terminated at the edge, secure backends with re-encryption, and passthrough. Do you mean I have to qualify that it is only for those backend types?
No, I mean that the total timeout might be 10 seconds, but the interval is 5 seconds whereas the enhancement seems to be saying that the interval is 10 seconds.
`timeout check` is always set for those backend types, so we don't use `check inter`'s value.
I see where I misunderstood here, conflating timeouts with interval. Thanks for persisting with me. Will post an update.
Use of the new `healthCheckInterval` in the `tuningOptions` will change the frequency of healthchecks that HAProxy performs on its backends. There are scenarios where this could improve or compromise the performance of HAProxy.
We should describe healthchecks a little bit to help people diagnose issues. Namely, these healthchecks are TCP SYN probes, and HAProxy closes the connection with a RST as soon as it gets a SYN,ACK response. Excessive healthchecks therefore show up as SYN packet storms.
- We could enable the automatic addition of route annotations for healthcheck interval, based on a configuration in the ingress controller spec. This would cause mutating routes, which is not acceptable.
Specifically, this could be done using Gatekeeper or Kyverno.
But it's not acceptable, right?
Apparently not since this PR got prioritized. ¯\_(ツ)_/¯. Might be worth mentioning them as options though.
Inactive enhancement proposals go stale after 28d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting /remove-lifecycle stale. If this proposal is safe to close now please do so with /close. /lifecycle stale
Stale enhancement proposals rot after 7d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting /remove-lifecycle rotten. If this proposal is safe to close now please do so with /close. /lifecycle rotten
/lifecycle frozen
/remove-lifecycle rotten
I have a few minor comments, nothing too important.
/approve
/lgtm
/hold
in case you want to address any of the last few comments with this PR.
`healthCheckInterval` must be set as a string representing time values. The time value format is an integer optionally followed by a time unit (e.g. "ms", "s", "m", etc.). If no unit is specified, the value is measured in milliseconds. More information on the time format can be found in the [HAProxy documentation](https://github.com/haproxy/haproxy/blob/v2.2.0/doc/configuration.txt).
The API uses the `metav1.Duration` value format. For a citation, we could use https://pkg.go.dev/k8s.io/apimachinery/pkg/apis/meta/v1#Duration and https://pkg.go.dev/time#ParseDuration. The most important difference from the HAProxy format is that the unit is required.
### API Extensions

This proposal will modify the `IngressController` API by adding a new variable called `HealthCheckInterval` to the [`IngressControllerTuningOptions`](https://github.com/openshift/api/blob/master/operator/v1/types_ingress.go#L1193) struct.
Better to use permalinks, or just omit the link (people reading this can figure out which repository to grep).
Unit tests will be added to test the propagation of the `healthCheckInterval` setting and its interaction with the router annotation `router.openshift.io/haproxy.health.check.interval`.

E2E tests can validate that the environment variable is present and the HAProxy template is properly constructed.
I would think the unit test and E2E test would be swapped: a unit test in cluster-ingress-operator could verify that `desiredRouterDeployment` set the environment variable correctly, and an E2E test could verify that `router.openshift.io/haproxy.health.check.interval` took precedence over `healthCheckInterval`. I might be misunderstanding this or looking at it the wrong way though, so if it makes sense to test as described here, that's fine as long as we have reasonable test coverage.
Use of the new `healthCheckInterval` in the `tuningOptions` will change the frequency of healthchecks that HAProxy performs on its backends. There are scenarios where this could either improve or compromise the performance of HAProxy. Increasing the healthcheck interval too much can result in increased 500-level HTTP responses, due to backend servers that are no longer available, but haven't yet been detected as such. Decreasing the healthcheck
I would expect the connection from HAProxy to an unavailable backend server to time out and HAProxy to then try a different backend server, resulting in added latency but usually not an HTTP 5xx response, unless multiple backend servers are down (see http://cbonte.github.io/haproxy-dconv/2.2/configuration.html#4-retries and http://cbonte.github.io/haproxy-dconv/2.2/configuration.html#4-option%20redispatch). Is that not the case?
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: Miciah. The full list of commands accepted by this bot can be found here. The pull request process is described here. Approvers can indicate their approval by writing /approve in a comment.
/lgtm
@candita: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/unhold
Expose and make configurable the ROUTER_BACKEND_CHECK_INTERVAL environment variable in HAProxy's template so that administrators may customize the length of time between subsequent healthchecks of backend services.
This is already configurable via a route annotation called `router.openshift.io/haproxy.health.check.interval`, but exposing the healthcheck interval at a global scope is desired for efficient administration of routes. HAProxy allows setting the healthcheck globally as well as per-route, and both options will be addressed as a part of this proposal.
https://issues.redhat.com/browse/NE-585