
feat(health): add check endpoint and loop control #2575

Open
wants to merge 1 commit into base: master

Conversation

jfroy (Contributor) commented Nov 8, 2024

On Kubernetes, it makes more sense to use a liveness probe than the health server loop (i.e. only have one loop). This patch introduces a flag to disable the health server loop, and a new /check/ endpoint for such probes.

When the connection is saturated, health checks can take a long time and therefore fail if the timeout is too short. Gradually increasing the timeout, as done in the health server loop, is not all that useful because the upper bound on the timeout is ultimately what you are willing to tolerate before declaring the connection unhealthy. So a static probe with a long timeout and a failure count, as implemented in Kubernetes, will be more stable (i.e. less flapping), especially if more than one consecutive failure is allowed.

The above argument aside, the two health/probe loops also do not work well together because they can get out of phase. Kubernetes probes are usually needed anyway to sequence containers within a pod.
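
For illustration only, here is a minimal sketch of what such a /check/ handler could look like. Everything in it is hypothetical (checkHealth, the checker type and its timeout field are stand-ins, not names from the patch or from gluetun):

```go
// Hypothetical sketch of a /check/ handler for liveness probes: run one
// health check (TCP dial + TLS handshake) bound to the request context,
// with a static timeout, and map the result to an HTTP status code.
package health

import (
	"context"
	"net/http"
	"time"
)

type checker struct {
	checkHealth func(ctx context.Context) error // stand-in for the existing health check
	timeout     time.Duration                   // static timeout, e.g. matched to the probe's timeoutSeconds
}

func (c *checker) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	ctx, cancel := context.WithTimeout(r.Context(), c.timeout)
	defer cancel()

	if err := c.checkHealth(ctx); err != nil {
		http.Error(w, err.Error(), http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}
```

The failure tolerance then lives entirely in the Kubernetes probe configuration (timeoutSeconds, failureThreshold, periodSeconds), so a single slow check does not immediately restart the container.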

qdm12 (Owner) commented Nov 18, 2024

  1. A 10 second timeout for a simple TCP dial + TLS handshake should be plenty for a working connection, even if bottlenecked, what do you think?
  2. In your code you address two things: a liveness endpoint doing a health check (TCP dial + TLS) that times out when the request gets canceled, AND disabling the auto-healing. I'm quite opposed to the second one, so let's focus on the first one (I can explain more if you explain more why you need it disabled).
  3. The HTTP request to this new liveness endpoint: would it ever be canceled? Something to check, otherwise it might leave a bunch of goroutines hanging (a small memory leak).
  4. Another solution, such that there is only one endpoint, would be to pass a timeout URL query parameter to the healthcheck endpoint; the handler would keep waiting on the "health loop" for a nil error, and would only return the last error once the given timeout has elapsed (see the sketch after this list). Would that be possible from the orchestrator point of view? Alternatively, just pass a liveness=true parameter such that the HTTP handler hangs instead of returning an error; that way we can keep the health loop independent of how we use it externally.
  5. Just in case you did not notice, you can adjust the timeout values for the auto-healing.
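
A rough sketch of the point-4 idea, purely illustrative: it assumes the health loop exposes its latest result through a hypothetical LastError method, which is not gluetun's actual API.

```go
// Hypothetical sketch of the single-endpoint alternative: the handler does not
// run its own check, it waits on the existing health loop's status until it
// reports healthy or the requested timeout elapses.
package health

import (
	"net/http"
	"time"
)

// lastErrorer stands in for the health loop's status; the name is illustrative.
type lastErrorer interface {
	LastError() error
}

func livenessHandler(loop lastErrorer) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		timeout := 10 * time.Second
		if s := r.URL.Query().Get("timeout"); s != "" {
			if d, err := time.ParseDuration(s); err == nil {
				timeout = d
			}
		}

		deadline := time.NewTimer(timeout)
		defer deadline.Stop()
		tick := time.NewTicker(500 * time.Millisecond)
		defer tick.Stop()

		for {
			select {
			case <-tick.C:
				if loop.LastError() == nil { // health loop reports healthy
					w.WriteHeader(http.StatusOK)
					return
				}
			case <-deadline.C: // timeout elapsed: report the last error
				if err := loop.LastError(); err != nil {
					http.Error(w, err.Error(), http.StatusServiceUnavailable)
				} else {
					w.WriteHeader(http.StatusOK)
				}
				return
			case <-r.Context().Done(): // probe canceled: stop waiting
				return
			}
		}
	}
}
```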

jfroy (Contributor, Author) commented Nov 18, 2024

  1. A 10 second timeout for a simple TCP dial + TLS handshake should be plenty for a working connection, even if bottlenecked, what do you think?

My intuition agrees, but I've seen this fail when thousands of TCP connections are going through the VPN and the link is fairly loaded (80%+ of expected throughput).

  2. In your code you address two things: a liveness endpoint doing a health check (TCP dial + TLS) that times out when the request gets canceled, AND disabling the auto-healing. I'm quite opposed to the second one, so let's focus on the first one (I can explain more if you explain more why you need it disabled).

The liveness endpoint uses the same code as the health loop on purpose, to avoid code duplication and maintain as much of the behavior of the loop as possible.

As I wrote in the commit message, when using a K8S liveness probe, the gluetun health loop and the K8S probe loop sort of interfere with each other -- they can easily get out of phase. So before this patch, you can end up in a situation where the gluetun loop fails and sets the status error, the K8S liveness probe comes in and samples the error (thus failing the probe), the gluetun loop then succeeds and then fails again, the K8S probe comes in and samples an error again, and so on. This can lead to K8S considering the container failed, even though the health server is merely flapping because the connection is loaded. Because the gluetun loop has an adaptive timeout, it's not possible to prevent the two loops from going out of phase. Even if you were to change the gluetun loop to have a fixed period and matched it to the K8S probe, they would still eventually drift out of phase.

Disabling the gluetun health loop only makes sense when using a K8S liveness probe. When using gluetun from just Docker, it is indeed a bad idea to disable it. Perhaps the environment variable could be named something more specific ("disable for kubernetes", "disable I know what I am doing", "disable footguns are fun", etc). Alternatively, perhaps using the liveness endpoint could automatically disable the gluetun loop.

  3. The HTTP request to this new liveness endpoint: would it ever be canceled? Something to check, otherwise it might leave a bunch of goroutines hanging (a small memory leak).

Ah yes, very good point. I will fix the code. I don't think cancellation is likely, but it's just better to handle it.
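
For reference, a hedged sketch of how cancellation typically propagates into the check itself with the standard library (this is not gluetun's actual code; the address and server name are hypothetical parameters):

```go
// Cancellation-aware TCP dial + TLS handshake: if ctx is canceled (the probe
// request was aborted) or its deadline passes, both the dial and the
// handshake return promptly instead of leaving a goroutine blocked.
package health

import (
	"context"
	"crypto/tls"
	"net"
)

func dialAndHandshake(ctx context.Context, addr, serverName string) error {
	var dialer net.Dialer
	conn, err := dialer.DialContext(ctx, "tcp", addr) // respects ctx cancellation
	if err != nil {
		return err
	}

	tlsConn := tls.Client(conn, &tls.Config{ServerName: serverName})
	defer tlsConn.Close()                // also closes the underlying TCP connection
	return tlsConn.HandshakeContext(ctx) // respects ctx cancellation (Go 1.17+)
}
```

Wired into the /check/ handler sketched earlier, a canceled or timed-out probe request then tears everything down and no goroutine is left hanging.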

  4. Another solution, such that there is only one endpoint, would be to pass a timeout URL query parameter to the healthcheck endpoint; the handler would keep waiting on the "health loop" for a nil error, and would only return the last error once the given timeout has elapsed. Would that be possible from the orchestrator point of view? Alternatively, just pass a liveness=true parameter such that the HTTP handler hangs instead of returning an error; that way we can keep the health loop independent of how we use it externally.

That's an interesting alternative, but overall it feels better not to have two loops.

  5. Just in case you did not notice, you can adjust the timeout values for the auto-healing.

Yeah, and I did try that before writing this patch, but I always got into the unstable regime I described above.
