
Agones health check shouldn't fail during game server container image pull #2966

Closed
mtcode opened this issue Feb 13, 2023 · 16 comments · Fixed by #3046 or #3072

mtcode commented Feb 13, 2023

What happened:

The gameserver-sidecar health check fails while the game server container image is being pulled.

What you expected to happen:

The health check should not fail at this time, allowing image pull to complete without terminating the game server.

How to reproduce it (as minimally and precisely as possible):

Given the following example config, the health check will fail 105 seconds after start if the game server isn't healthy.

health:
    failureThreshold: 3
    initialDelaySeconds: 60
    periodSeconds: 15

Unfortunately, this doesn't account for the amount of time that it takes for the container image to be pulled, which can exceed 105 seconds in some cases. For example, if an image pull takes 3 minutes, the health check will fail and attempt to terminate the pod after 105 seconds.
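(For reference, the 105 seconds falls directly out of the config above: 60s initialDelaySeconds + 3 failures × 15s periodSeconds = 105s.)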

Anything else we need to know?:

This behavior traces back to Agones v1.19, when #2355 was merged, altering the order of container startup so that Agones starts first and the game server starts second. This causes the delay timer to start earlier than it did previously.

A workaround is to increase the initialDelaySeconds to a larger value: the longest we expect image pull to take. This prevents the health check from failing and terminating the game server, but configuring the delay that large introduces a blind spot in monitoring. If an image pull takes less than the delay, such as when the image already exists locally, then there is no health monitoring during the remaining time until initialDelaySeconds expires.
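For illustration, a workaround config sized for a 5-minute worst-case pull would look like this (the 300 is illustrative):

health:
    failureThreshold: 3
    initialDelaySeconds: 300
    periodSeconds: 15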

Environment:

  • Agones version: v1.28
  • Kubernetes version (use kubectl version): v1.23
  • Cloud provider or hardware configuration: AWS
  • Install method (yaml/helm): Helm
  • Troubleshooting guide log(s):
  • Others:
@mtcode mtcode added the kind/bug These are bugs. label Feb 13, 2023
@markmandel markmandel added the area/user-experience Pertaining to developers trying to use Agones, e.g. SDK, installation, etc label Feb 13, 2023
@zmerlynn zmerlynn self-assigned this Mar 14, 2023

zmerlynn commented Mar 14, 2023

Initially, my thinking on this bug was that initialDelaySeconds was absolutely the right thing to use here and that the "blind" healthcheck period wasn't super important. I've been really going back and forth, though.

Part of the problem is that there's no "right" order to start the sidecar vs. the game server. There are pros and cons to each:

  1. (Current) Sidecar starts before game server:
    • (Pro) The SDK is reachable when the game server starts.
    • (Con) Health settings have to include the time for the game server to start healthchecks, which may include cold start delays for e.g. image pulls or game server binary start.
    • (Con) Complicates Autopilot resource adjustment because the game server is second in the manifest.
  2. (Before #2355, "Move SDK sidecar to first position in container list") Game server starts before sidecar:
    • (Pro) Health checks flow immediately when game server is ready.
    • (Con) The SDK may not be reachable when the game server starts. This is actually documented in the REST client documentation.

Frankly I'm leaning towards the following:

cc @roberthbailey @markmandel who were involved in #2355 and #2351

@markmandel (Collaborator)

I personally wouldn't go down the path of deciding whether / in what order to start the sidecar.

I think we need a way to know what state the gameserver container is in -- much like the (hacky) approach we already use for health checking (we set annotations on the GameServer):

func (hc *HealthController) skipUnhealthyGameContainer(gs *agonesv1.GameServer, pod *corev1.Pod) (bool, error) {
    if !metav1.IsControlledBy(pod, gs) {
        // This is not the Pod we are looking for 🤖
        return false, nil
    }
    // If the GameServer is before Ready, both annotation values should be "".
    // If the GameServer is past Ready, both annotations should be exactly the same.
    // If the annotations are different, then the data between the GameServer and the Pod is out of sync,
    // in which case, send it back to the queue to try again.
    gsReadyContainerID := gs.ObjectMeta.Annotations[agonesv1.GameServerReadyContainerIDAnnotation]
    if pod.ObjectMeta.Annotations[agonesv1.GameServerReadyContainerIDAnnotation] != gsReadyContainerID {
        return false, workerqueue.NewDebugError(errors.Errorf("pod and gameserver %s data are out of sync, retrying", gs.ObjectMeta.Name))
    }
    if gs.IsBeforeReady() {
        hc.baseLogger.WithField("gs", gs.ObjectMeta.Name).WithField("state", gs.Status.State).Debug("skipUnhealthyGameContainer: Is Before Ready. Checking failed container")
        // If the reason for failure was a container failure, then we can skip moving to Unhealthy.
        // Otherwise, we know it was one of the other reasons (eviction, lack of ports), so we should definitely go to Unhealthy.
        return hc.failedContainer(pod), nil
    }
    // Finally, we need to check if the failed container happened after the gameserver was ready or before.
    for _, cs := range pod.Status.ContainerStatuses {
        if cs.Name == gs.Spec.Container {
            if cs.State.Terminated != nil {
                hc.baseLogger.WithField("gs", gs.ObjectMeta.Name).WithField("podStatus", pod.Status).Debug("skipUnhealthyGameContainer: Container is terminated, returning false")
                return false, nil
            }
            if cs.LastTerminationState.Terminated != nil {
                // If the current container is running, and is the ready container, then we know this is some
                // other pod update, and we previously had a restart before we got to being Ready, and therefore
                // shouldn't move to Unhealthy.
                check := cs.ContainerID == gsReadyContainerID
                if !check {
                    hc.baseLogger.WithField("gs", gs.ObjectMeta.Name).WithField("gsMeta", gs.ObjectMeta).WithField("podStatus", pod.Status).Debug("skipUnhealthyGameContainer: Container crashed after Ready, returning false")
                }
                return check, nil
            }
            break
        }
    }
    hc.baseLogger.WithField("gs", gs.ObjectMeta.Name).WithField("gsMeta", gs.ObjectMeta).WithField("podStatus", pod.Status).Debug("skipUnhealthyGameContainer: Game Container has not crashed, game container may be healthy")
    return false, nil
}

Maybe we take a similar approach (or allow the sidecar to see Pods as well as GameServers?)

As an interesting alternative: the sidecar can create and patch events! And events will let us know if a container is pulling - that may be a good way to do this. Maybe rather than allowing visibility into Pods, we just allow visibility into events? (Is that better / worse?)

@zmerlynn (Collaborator)

🤔 I spent a while looking into this at some point, and there's really no clean way for one container to learn that another is up:

  • You can jump through hoops so the SDK can monitor the game server container as you're describing, but those approaches are mostly inherently racy, or have other drawbacks like load on kube-apiserver, which we already induce a lot of.
  • You can do something hacky with a shared process namespace to see the other container come up, i.e. start the sidecar, and have the sidecar wait for a new PID to appear (see the sketch below).
  • Or you can do something network based between the game server and the SDK. But once you go down this route, you quickly realize "something network based" is basically health/readiness checks.

The last, "something network based" is by far the cleanest approach, since each side of the process is allowed flexbility on when to claim it's started or not. The simplicity and flexibility are one of the reasons k8s does it that way, too: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/


zmerlynn commented Mar 15, 2023

I've thought about this more, and I am back to thinking that InitialDelaySeconds is the correct abstraction to use here. Here's my argument:

  • The game server container can already use liveness probes from kubelet. kubelet knows exactly when the container starts and already has rich configuration for liveness and startup probes (see the sketch after this list).
  • initialDelaySeconds is only relevant until the first healthcheck is received. So it should be thought of not as a forced delay but rather as "how long to wait before Agones will start considering failing a GameServer". Other processes, like kubelet container restarts, can still restart the container, and there are plenty of valid use cases where I can imagine a long initialDelaySeconds and expect a flapping container, even outside of image pulls - e.g. crashing because some other dependency hasn't started.
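On the first point, here's a sketch of what kubelet-native probes on the game server container could look like (the endpoint, port, and timings are hypothetical; the probe fields themselves are standard Kubernetes):

startupProbe:
  httpGet:
    path: /healthz        # hypothetical game server endpoint
    port: 7777            # hypothetical
  periodSeconds: 10
  failureThreshold: 30    # up to 30 × 10s = 5m of slow start; kubelet only probes after the container starts, so pull time never counts
livenessProbe:
  httpGet:
    path: /healthz
    port: 7777
  periodSeconds: 10
  failureThreshold: 3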

A workaround is to increase the initialDelaySeconds to a larger value: the longest we expect image pull to take. This prevents the health check from failing and terminating the game server, but configuring the delay that large introduces a blind spot in monitoring.

As long as you have liveness probes from kubelet (which is already aware of container pull/start), what's the blind spot in monitoring? Remember that we are most likely talking about a container that has not yet called Ready() [1], so if something were to fail, it's likely to just result in the container restarting pre-Ready() due to failed liveness checks, after the long pull. If it succeeds, no harm no foul.

I think I'm missing the problem with a long initialDelaySeconds. Can someone describe the scenario with a timeline?

[1] Sidebar: I think there is currently nothing technically stopping the game server from going Ready() without a single call to Health(), but a typical game server is going to start its health-ping routine before calling Ready(). We should perhaps close this possibility, though, and call touchHealthLastUpdated on Ready() as well.


mtcode commented Mar 16, 2023

initialDelaySeconds is only relevant until the first healthcheck is received. So it should be thought of not as a forced delay but rather as "how long to wait before Agones will start considering failing a GameServer".

While I agree with the statement, the Agones health check doesn't really have much intrinsic value if no action is taken upon failure. Setting initialDelaySeconds to a large value means the Agones health check won't act for that entire duration, even if there is an actual internal liveness issue that the kubelet won't detect - that is the blind spot I referred to.

Here's an example:

initialDelaySeconds is set to 300 to allow an additional 5 minutes for images to be pulled from a container repository, and prevent the Agones health check from evicting game servers while they are in ImagePullBackoff. The Agones sidecar starts while the main container is waiting for its image to be pulled, and the delay prevents the health check from running (and failing) and the pod from being killed. After 5 minutes, the main container starts. Now if there are any health issues, Agones will act on them and restart the game server as necessary. All is well.

The second time around, initialDelaySeconds is still set to 300, but the image already exists locally in the cluster, so both the sidecar and the main container start immediately. However, maybe there's an internal issue and the game server becomes unhealthy without crashing: its health endpoint returns error codes, or maybe doesn't respond at all. Now we have to wait 5 minutes before the health check starts failing and the pod gets evicted.

This period where the Agones health check cannot act, because of the forced artificial delay before it may consider failing a GameServer, is the problem.

@zmerlynn (Collaborator)

Thanks for the detailed reply, @mtcode!

However, maybe there's an internal issue and the game server becomes unhealthy, without crashing, and its health endpoint returns error codes, or maybe the health endpoint doesn't respond at all. Now, we have to wait 5 minutes before the health check starts failing and the pod gets evicted.

I'm arguing this condition should be covered by the kubelet liveness check instead. If liveness probes fail, kubelet will kill the container (and then either restart it or not, depending on the restartPolicy), then agones-controller will see the container termination and the GameServer will go Unhealthy - which is what we want.

While I agree with the statement, the Agones health check doesn't really have much intrinsic value if there is no action taken upon failure.

This statement might actually be true. I'm not seeing a lot of value for Agones health checks over kubelet probes, and Agones container management will always be inherently racy compared to kubelet. I'm actually wondering if it would be better just to advocate for kubelet probes instead and let Agones pick up the failed container, but I realize this is kind of a radical position to take. I will do a little more research to understand why we have our own health check system.

@zmerlynn (Collaborator)

Ok, I had an internal discussion with @markmandel and now I get what's going on. The sidecar is proxying the liveness probe anyway, to avoid the game server having to establish its own probes. Let me think on this, but I still think a network-based solution is about the only way forward.

@zmerlynn (Collaborator)

Okay! After talking about this longer with @markmandel and @mtcode, I think we might have a plan! Sorry for the confusion above, I really didn't understand that the Agones sidecar was proxying /gshealthz for the game server, which absolutely makes sense.

Here's the thinking, hat tip to @markmandel for connecting the dots. Background:

  • We spin up the sidecar first, and the game server second. [1] So the SDK is running by the time the GS is up.
  • On the game server container, today we tack on a liveness probe to /gshealthz - this liveness probe is served by the SDK server. Code here:
    return gs.ApplyToPodContainer(pod, gs.Spec.Container, func(c corev1.Container) corev1.Container {
        if c.LivenessProbe == nil {
            c.LivenessProbe = &corev1.Probe{
                ProbeHandler: corev1.ProbeHandler{
                    HTTPGet: &corev1.HTTPGetAction{
                        Path: "/gshealthz",
                        Port: intstr.FromInt(8080),
                    },
                },
                InitialDelaySeconds: gs.Spec.Health.InitialDelaySeconds,
                PeriodSeconds:       gs.Spec.Health.PeriodSeconds,
                FailureThreshold:    gs.Spec.Health.FailureThreshold,
            }
        }
        return c
    })
  • But on line 708 of that code, we set initialDelaySeconds. kubelet waits for image pull and container start, then just sleeps initialDelaySeconds before sending any probes (documented here). But in our setup, this results in a huge gap (as @mtcode insisted, and was correct): with a large initialDelaySeconds, kubelet doesn't even start monitoring until well after the container start.

Solution:

  • We don't need the initialDelaySeconds on the container configuration. We haven't needed it since #2355 ("Move SDK sidecar to first position in container list") changed the container ordering. Why not? Because the game server's liveness probes are going to the SDK server, which is already up! (See the sketch after this list.)
  • And, in fact, we can use the kubelet liveness probe to determine whether the container is up or not. Rather than have the SDK server start its own initialDelaySeconds when the SDK starts, we can instead have the SDK wait for the first kubelet probe. When we first see /gshealthz, we can start the SDK initialDelaySeconds and then wait for the game server to initialize.
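Concretely, for the first bullet, the injected probe could simply omit InitialDelaySeconds (a sketch; the remaining values still flow from GameServer.Spec.Health):

livenessProbe:
  httpGet:
    path: /gshealthz
    port: 8080
  periodSeconds: 15       # Health.PeriodSeconds
  failureThreshold: 3     # Health.FailureThreshold
  # no initialDelaySeconds - the SDK server is already up and serving /gshealthz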

[1] Sidebar: There's some nuance as to why the way we currently do it is a little off and not totally guaranteed, but it generally works fine because the game server binaries take a while to pull.

@markmandel (Collaborator)

Thanks for the comprehensive writeup!

When we first see /gshealthz, we can start the SDK initialDelaySeconds and then wait for the game server to initialize.

Question on this point. My thought here was that we pass the initialDelaySeconds down to the Pod's configured health check:

return gs.ApplyToPodContainer(pod, gs.Spec.Container, func(c corev1.Container) corev1.Container {
    if c.LivenessProbe == nil {
        c.LivenessProbe = &corev1.Probe{
            ProbeHandler: corev1.ProbeHandler{
                HTTPGet: &corev1.HTTPGetAction{
                    Path: "/gshealthz",
                    Port: intstr.FromInt(8080),
                },
            },
            InitialDelaySeconds: gs.Spec.Health.InitialDelaySeconds,
            PeriodSeconds:       gs.Spec.Health.PeriodSeconds,
            FailureThreshold:    gs.Spec.Health.FailureThreshold,
        }
    }
    return c
})

That way the SDK itself doesn't even need to track or be aware of the initialDelaySeconds, since it's baked into that initial ping on /gshealthz from the kubelet. It becomes a matter of "just start health checking on the first hit on /gshealthz, whenever that happens."

Or are we saying the same thing?

@roberthbailey (Member)

[1] Sidebar: There's some nuance as to why the way we currently do it is a little off and not totally guaranteed, but it generally works fine because the game server binaries take a while to pull.

This doesn't sound entirely correct to me. While the first pull onto a node can be slow, subsequent game server pods that start on the same machine shouldn't incur any pull time as the container image will be cached. So in cases where you have lots of game servers per machine (like in the simple game server load tests we run) most of the game server binaries will have next to 0 pull time at startup.

@zmerlynn (Collaborator)

That way the SDK itself doesn't even need to track or be aware of the initialDelaySeconds, since it's baked into that initial ping on /gshealthz from the kubelet. It becomes a matter of "just start health checking on the first hit on /gshealthz, whenever that happens."

I'm a little confused - InitialDelaySeconds is already baked into that healthcheck, and that's the issue. Since the game server container starts second, we can assume that the SDK should already be started (which means we don't need to tell the kubelet to delay checking) - so I think it should not be in the healthcheck config for the container.

However, initialDelaySeconds still serves a purpose - even measuring startup from the time the container starts to the time of the initial healthcheck, the game server may have a slow startup. Let's take a more concrete example with:

health:
    initialDelaySeconds: 30
    periodSeconds: 5
    failureThreshold: 3

and the proposed flow, with a game server that takes 25s to start up:

time    event
0       SDK starts
2m      game server container is pulled and starts
2m      kubelet sends check to /gshealthz; SDK records container start time and replies ok
2m5s    kubelet sends check to /gshealthz; SDK replies ok
2m10s   kubelet sends check to /gshealthz; SDK replies ok
2m15s   kubelet sends check to /gshealthz; SDK replies ok
2m20s   kubelet sends check to /gshealthz; SDK replies ok
2m25s   game server calls Health() and continues; SDK keeps replying ok after this

The point here is that the GS might still have other conditions where initialDelaySeconds makes sense, but it also makes sense to track it in case the game server hangs. Above, if we had gone to 2m30s without hearing from the gameserver, we would have started failing kubelet healthchecks.
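To make the proposal concrete, here is a minimal sketch of the idea on the SDK side (illustrative only, not the actual Agones implementation; the variable wiring and the freshness window are invented):

package main

import (
    "net/http"
    "sync"
    "time"
)

var (
    initialDelay = 30 * time.Second // would come from Health.InitialDelaySeconds (hypothetical wiring)
    healthWindow = 15 * time.Second // would come from Health.PeriodSeconds (hypothetical wiring)

    mu            sync.Mutex
    containerSeen time.Time // zero until the first kubelet probe arrives
    lastHealth    time.Time // would be updated whenever the game server calls Health()
)

// gshealthz treats the first kubelet probe as "the game server container has
// started" and measures initialDelay from that moment, not from SDK startup.
func gshealthz(w http.ResponseWriter, _ *http.Request) {
    mu.Lock()
    defer mu.Unlock()
    now := time.Now()
    if containerSeen.IsZero() {
        containerSeen = now // first probe: container just started
    }
    inGrace := now.Sub(containerSeen) < initialDelay
    fresh := !lastHealth.IsZero() && now.Sub(lastHealth) < healthWindow
    if inGrace || fresh {
        w.WriteHeader(http.StatusOK)
        return
    }
    // Past the grace period with no recent Health() ping: fail the probe.
    w.WriteHeader(http.StatusInternalServerError)
}

func main() {
    http.HandleFunc("/gshealthz", gshealthz)
    _ = http.ListenAndServe(":8080", nil)
}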

This doesn't sound entirely correct to me. While the first pull onto a node can be slow, subsequent game server pods that start on the same machine shouldn't incur any pull time as the container image will be cached. So in cases where you have lots of game servers per machine (like in the simple game server load tests we run) most of the game server binaries will have next to 0 pull time at startup.

I confirmed with our internal sig-node team previously that the way we are currently doing it is less guaranteed than the original blog post suggested and may still race the container startup. I think we mostly don't see it because the SDK starts quickly.

@zmerlynn (Collaborator)

I was able to repro this by creating a large simple-game-server image. In fact the pull time was comically large with a 3GB image: 69eeaab

This took about 2m12 on a GKE Autopilot cluster, even with Image Streaming: Successfully pulled image "gcr.io/zml-gke-dev/simple-game-server:0.15-big" in 2m12.508554656s (2m12.508571775s including waiting)

Unfortunately this was good for about one pull, as Image Streaming successfully caches it afterwards, for every node in the project. So I'll need to test with it disabled, or re-push each time (either works, really).

zmerlynn added a commit to zmerlynn/agones that referenced this issue Mar 28, 2023
Implements googleforgames#2966 (comment):

* Remove the InitialDelaySeconds from the game server container
configuration. The SDK will be available prior to the game server
starting.

* Rework how InitialDelaySeconds works in the SDK: Rather than
starting the timer in Run(), start the timer on first /gshealthz, the
URL for the kubelet liveness probe for the game server. kubelet will
not send a liveness probe until after the container has started, so we
can use the first /gshealthz to indicate the container is actually
running.
  * We still need the concept of InitialDelaySeconds to handle the
case that, after container creation, the game server takes a while
to initialize before calling Health(). This is more-or-less what
the field meant prior to googleforgames#2355, so this PR is more returning it to
that state.

zmerlynn commented Mar 31, 2023

@markmandel and I talked about this more yesterday, and settled on a different model:

  • (echoing a comment on #3046, "Rework game server health initial delay handling"): I missed the point he made above that we could just rely on kubelet to drive the initial timing, and simply not check until then
  • I noticed we could actually further simplify and remove our own go routine that was periodically monitoring the health state. Instead we can rely on the /gshealthz signal from kubelet.

So describing a bit more thoroughly:

  • We remove any knowledge in the SDK of InitialDelaySeconds
  • We remove the runHealth goroutine from main and shift this responsibility to the /gshealthz handler
  • Along the way, I noted that the FailureThreshold doesn't need to be enforced on both the kubelet and SDK side, so in the injected liveness probe, I dropped it to 1. Previously we were waiting for more probes than we needed. In practice this is not terribly relevant since the SDK pushes it into Unhealthy.
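For illustration, the injected liveness probe then ends up along these lines (timings hypothetical; failureThreshold: 1 is the change described in the last bullet):

livenessProbe:
  httpGet:
    path: /gshealthz
    port: 8080
  periodSeconds: 15
  failureThreshold: 1     # the SDK enforces the configured FailureThreshold itself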

zmerlynn added a commit to zmerlynn/agones that referenced this issue Mar 31, 2023
See googleforgames#2966 (comment):

* We remove any knowledge in the SDK of InitialDelaySeconds
* We remove the runHealth goroutine from main and shift this
responsibility to the /gshealthz handler

Along the way:

*  I noted that the FailureThreshold doesn't need to be enforced on
both the kubelet and SDK side, so in the injected liveness probe, I
dropped that to 1. Previously we were waiting more probes than we
needed to. In practice this is not terribly relevant since the SDK
pushes it into Unhealthy.

* I was glancing at how time was used through the SDK and noticed one
place where we don't cast to UTC - adjusted that.
zmerlynn added a commit that referenced this issue Apr 4, 2023
* Rework health check handling of InitialDelaySeconds

See #2966 (comment):

* We remove any knowledge in the SDK of InitialDelaySeconds
* We remove the runHealth goroutine from main and shift this
responsibility to the /gshealthz handler

Along the way:

*  I noted that the FailureThreshold doesn't need to be enforced on
both the kubelet and SDK side, so in the injected liveness probe, I
dropped that to 1. Previously we were waiting more probes than we
needed to. In practice this is not terribly relevant since the SDK
pushes it into Unhealthy.

* Close race if enqueueState is called rapidly before update can succeed

* Re-add Autopilot 1.26 to test matrix (removed in #3059)
@zmerlynn zmerlynn reopened this Apr 4, 2023

zmerlynn commented Apr 4, 2023

Sent revert #3068; will close when we get it back in.

zmerlynn added a commit to zmerlynn/agones that referenced this issue Apr 5, 2023
This is a redrive of googleforgames#3046, which was reverted in googleforgames#3068

Rework health check handling of InitialDelaySeconds. See
googleforgames#2966 (comment):

* We remove any knowledge in the SDK of InitialDelaySeconds

* We remove the runHealth goroutine from main and shift this
responsibility to the /gshealthz handler

Along the way:

*  I noted that the FailureThreshold doesn't need to be enforced on
both the kubelet and SDK side, so in the injected liveness probe, I
dropped that to 1. Previously we were waiting more probes than we
needed to. In practice this is not terribly relevant since the SDK
pushes it into Unhealthy.

* Close race if enqueueState is called rapidly before update can succeed

* Re-add Autopilot 1.26 to test matrix (removed in googleforgames#3059)
zmerlynn added a commit that referenced this issue Apr 6, 2023
* Rework game server health initial delay handling

This is a redrive of #3046, which was reverted in #3068

Rework health check handling of InitialDelaySeconds. See
#2966 (comment):

* We remove any knowledge in the SDK of InitialDelaySeconds

* We remove the runHealth goroutine from main and shift this
responsibility to the /gshealthz handler

Along the way:

*  I noted that the FailureThreshold doesn't need to be enforced on
both the kubelet and SDK side, so in the injected liveness probe, I
dropped that to 1. Previously we were waiting more probes than we
needed to. In practice this is not terribly relevant since the SDK
pushes it into Unhealthy.

* Close race if enqueueState is called rapidly before update can succeed

* Re-add Autopilot 1.26 to test matrix (removed in #3059)

* Close consistency race in syncGameServerRequestReadyState:
If the SDK and controller win the race to update the Pod with the
GameServerReadyContainerIDAnnotation before kubelet even gets a chance
to add the running containers to the Pod, the controller may update
the pod with an empty annotation, which then confuses further runs.

* Fixes TestPlayerConnectWithCapacityZero flakes

May fully fix #2445 as well
@Kalaiselvi84 Kalaiselvi84 added this to the 1.31.0 milestone Apr 10, 2023
Kalaiselvi84 pushed a commit to Kalaiselvi84/agones that referenced this issue Apr 11, 2023
Kalaiselvi84 pushed a commit to Kalaiselvi84/agones that referenced this issue Apr 11, 2023

mtcode commented Apr 12, 2023

Thank you! I see that this was released in Agones v1.31.0.

@zmerlynn (Collaborator)

@mtcode Yup! Give it a whirl, feedback quite welcome!
