Agones health check shouldn't fail during game server container image pull #2966
Initially, my thinking on this bug was that part of the problem is there's no "right" order to start the sidecar vs the game server. There are pros and cons to each:
Frankly I'm leaning towards the following:
cc @roberthbailey @markmandel, who were involved in #2355 and #2351.
I personally wouldn't go down the path of deciding whether / in what order to start the sidecar. I think we need a way to know what state the gameserver container is in -- much like the (hacky) approach we take for health checking (we set annotations on the GameServer). agones/pkg/gameservers/health.go Lines 252 to 296 in 71c90a4
Maybe we take a similar approach (or allow the sidecar to see Pods as well as GameServers?). As an interesting alternative, the sidecar can create and patch events, and events will let us know if a container is pulling -- that may be a good way to do this. Maybe rather than allowing visibility into Pods, we just allow visibility into events? (Is that better or worse?)
🤔 I spent a while looking into this at some point, and there's really no clean way for one container to learn that another is up:
The last option, "something network based", is by far the cleanest approach, since each side of the process has flexibility on when to claim it has started. That simplicity and flexibility are among the reasons k8s does it that way, too: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/
I've thought about this more, and I am back to thinking that this should be the kubelet's job.
As long as you have liveness probes from kubelet, I think I'm missing the problem with a long initialDelaySeconds. [1]
[1] Sidebar: I think there is currently nothing technically stopping the game server from going unhealthy in the meantime.
While I agree with the statement, the Agones health check doesn't really have much intrinsic value if there is no action taken upon failure. Setting initialDelaySeconds to a large value only postpones when any action can be taken. Here's an example:
The second time around, the image is already cached, so the delay is purely artificial. This time period, where the Agones health check cannot act because of the forced artificial delay before it can consider failing a GameServer, is problematic.
Thanks for the detailed reply, @mtcode!
I'm arguing this condition should be covered by the kubelet liveness check instead. If liveness probes fail, kubelet will kill the container (and then either restart it or not, depending on the restartPolicy).
This statement might actually be true. I'm not seeing a lot of value in Agones health checks over kubelet probes, and Agones container management will always be inherently racy compared to kubelet. I'm actually wondering if it would be better just to advocate for kubelet probes instead and let Agones pick up the failed container, but I realize this is... kind of a radical position to take. I will do a little more research to understand why we have our own health check system.
Ok, I had an internal discussion with @markmandel and now I get what's going on. The sidecar is proxying the liveness probe anyway, to avoid the game server having to establish its own probes. Let me think on this, but I still think a network-based solution is about the only way forward.
Okay! After talking about this longer with @markmandel and @mtcode, I think we might have a plan! Sorry for the confusion above; I really didn't understand that the Agones sidecar was proxying the liveness probe. Here's the thinking, hat tip to @markmandel for connecting the dots. Background:
Solution:
[1] Sidebar: There's some nuance as to why the way we currently do it is a little off and not totally guaranteed, but it generally works fine because the game server binaries take a while to pull. |
Thanks for the comprehensive writeup!
Question on this point. My thought here was that we pass the initialDelaySeconds down to the Pod's configured health check: agones/pkg/gameservers/controller.go Lines 699 to 712 in dc592e9
That way the SDK itself doesn't even need to track or be aware of the initialDelaySeconds. Or are we saying the same thing?
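If the initialDelaySeconds were passed down to the Pod's injected liveness probe as suggested, the resulting kubelet-side configuration might look roughly like this (the port and timing values are illustrative assumptions, not taken from the Agones source):

```yaml
livenessProbe:
  httpGet:
    path: /gshealthz        # served by the Agones SDK sidecar
    port: 8080
  initialDelaySeconds: 60   # copied down from the GameServer health spec
  periodSeconds: 5
  failureThreshold: 3
```

With this shape, kubelet itself enforces the startup grace period, so the SDK never needs to know the value.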
This doesn't sound entirely correct to me. While the first pull onto a node can be slow, subsequent game server pods that start on the same machine shouldn't incur any pull time, as the container image will be cached. So in cases where you have lots of game servers per machine (like in the simple game server load tests we run) most of the game server binaries will have next to 0 pull time at startup.
I'm a little confused here - however, let's take a more concrete example with:
and a proposed flow, for example with a game server that takes 25s to start up:
The point here is that the GS might still have other conditions where
I confirmed with our internal sig-node team previously that the way we are currently doing it is less guaranteed than the original blog post and may still race the container startup. I think we mostly don't see it because the SDK starts quickly. |
I was able to repro this by creating a large image. This took about 2m12s to pull on a GKE Autopilot cluster, even with Image Streaming. Unfortunately this was only good for about one pull, as Image Streaming successfully caches the image afterwards, for every node in the project. So I'll need to test with it disabled, or re-push each time (either works, really).
Implements googleforgames#2966 (comment):
* Remove the InitialDelaySeconds from the game server container configuration. The SDK will be available prior to the game server starting.
* Rework how InitialDelaySeconds works in the SDK: rather than starting the timer in Run(), start the timer on the first /gshealthz, the URL for the kubelet liveness probe for the game server. kubelet will not send a liveness probe until after the container has started, so we can use the first /gshealthz to indicate the container is actually running.
* We still need the concept of InitialDelaySeconds to handle the case where, after container creation, the game server takes a while to initialize before calling Health(). This is more-or-less what the field meant prior to googleforgames#2355, so this PR is more returning it to that state.
@markmandel and I talked about this more yesterday, and settled on a different model:
So describing a bit more thoroughly:
See googleforgames#2966 (comment):
* We remove any knowledge in the SDK of InitialDelaySeconds.
* We remove the runHealth goroutine from main and shift this responsibility to the /gshealthz handler.
Along the way:
* I noted that the FailureThreshold doesn't need to be enforced on both the kubelet and SDK side, so in the injected liveness probe, I dropped it to 1. Previously we were waiting for more probes than we needed to. In practice this is not terribly relevant since the SDK pushes it into Unhealthy.
* I was glancing at how time was used through the SDK and noticed one place where we don't cast to UTC - adjusted that.
* Rework health check handling of InitialDelaySeconds. See #2966 (comment):
* We remove any knowledge in the SDK of InitialDelaySeconds.
* We remove the runHealth goroutine from main and shift this responsibility to the /gshealthz handler.
* In the injected liveness probe, drop FailureThreshold to 1, since it doesn't need to be enforced on both the kubelet and SDK side. Previously we were waiting for more probes than we needed to; in practice this is not terribly relevant since the SDK pushes it into Unhealthy.
* Close a race if enqueueState is called rapidly before the update can succeed.
* Re-add Autopilot 1.26 to the test matrix (removed in #3059).
Sent revert #3068, will close when we get it back in.
This is a redrive of googleforgames#3046, which was reverted in googleforgames#3068. Rework health check handling of InitialDelaySeconds. See googleforgames#2966 (comment):
* We remove any knowledge in the SDK of InitialDelaySeconds.
* We remove the runHealth goroutine from main and shift this responsibility to the /gshealthz handler.
* In the injected liveness probe, drop FailureThreshold to 1, since it doesn't need to be enforced on both the kubelet and SDK side.
* Close a race if enqueueState is called rapidly before the update can succeed.
* Re-add Autopilot 1.26 to the test matrix (removed in googleforgames#3059).
* Rework game server health initial delay handling. This is a redrive of #3046, which was reverted in #3068. See #2966 (comment):
* We remove any knowledge in the SDK of InitialDelaySeconds.
* We remove the runHealth goroutine from main and shift this responsibility to the /gshealthz handler.
* In the injected liveness probe, drop FailureThreshold to 1, since it doesn't need to be enforced on both the kubelet and SDK side.
* Close a race if enqueueState is called rapidly before the update can succeed.
* Re-add Autopilot 1.26 to the test matrix (removed in #3059).
* Close a consistency race in syncGameServerRequestReadyState: if the SDK and controller win the race to update the Pod with the GameServerReadyContainerIDAnnotation before kubelet even gets a chance to add the running containers to the Pod, the controller may update the Pod with an empty annotation, which then confuses further runs.
* Fixes TestPlayerConnectWithCapacityZero flakes. May fully fix #2445 as well.
Thank you! I see that this was released in Agones v1.31.0.
@mtcode Yup! Give it a whirl, feedback quite welcome!
What happened:
The gameserver-sidecar health check fails while the game server container image is being pulled.
What you expected to happen:
The health check should not fail at this time, allowing image pull to complete without terminating the game server.
How to reproduce it (as minimally and precisely as possible):
Given the following example config, the health check will fail 105 seconds after start if the game server isn't healthy.
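The original config embed did not survive extraction. A hypothetical GameServer spec with the described semantics might look like the following; the name, image, port, and timing values are placeholders, chosen so that failure occurs at roughly 60 + 3 × 15 = 105 seconds after start:

```yaml
apiVersion: agones.dev/v1
kind: GameServer
metadata:
  name: example-gameserver        # placeholder name
spec:
  health:
    initialDelaySeconds: 60       # grace period before health is enforced
    periodSeconds: 15             # how often a Health() ping is expected
    failureThreshold: 3           # missed periods tolerated before Unhealthy
  ports:
    - name: default
      containerPort: 7654
  template:
    spec:
      containers:
        - name: example-gameserver
          image: example/game-server:latest   # placeholder image
```

The key point is that the 105-second budget starts ticking when the sidecar starts, not when the game server container finishes pulling its image.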
Unfortunately, this doesn't account for the amount of time that it takes for the container image to be pulled, which can exceed 105 seconds in some cases. For example, if an image pull takes 3 minutes, the health check will fail and attempt to terminate the pod after 105 seconds.
Anything else we need to know?:
This behavior traces back to Agones v1.19, when #2355 was merged, which altered the order of container startup so that Agones starts first and the game server starts second. This causes the delay timer to start earlier than it previously did.
A workaround is to increase initialDelaySeconds to a larger value: the longest we expect an image pull to take. This prevents the health check from failing and terminating the game server, but configuring a delay that large introduces a blind spot in monitoring. If image pulls take less than the delay, such as when the image already exists locally, then there is no health monitoring for the remaining duration until initialDelaySeconds expires.
Environment:
Kubernetes version (use kubectl version): v1.23