
GameServer stuck in Shutdown state preventing a rolling update from completing #2360

Closed
runtarm opened this issue Nov 4, 2021 · 2 comments · Fixed by #2550
Labels: kind/bug
Milestone: 1.23.0

Comments


runtarm commented Nov 4, 2021

What happened:

I've been seeing a GameServer get stuck in the Shutdown state from time to time. The DeletionTimestamp was set and the underlying Pod was already gone, but the finalizer still existed on the GameServer object, which blocked deletion of the GameServer.

When this happens, it prevents the GameServerSet that the GameServer belongs to from scaling down, which in turn keeps a rolling update of the Fleet from completing, i.e. mixed versions of GameServers remain available for allocation.

What you expected to happen:

A GameServer in the Shutdown state should always be removed once its underlying Pod is gone.

How to reproduce it (as minimally and precisely as possible):

We suspect this is caused by a race condition; we don't yet have a way to reproduce it deterministically.

Anything else we need to know?:

On one occasion when it happened, I saw the event from this line emitted, but the log entry that should follow it was not there.

Our theory: there is a race condition where the Pod deletion event arrives before the podLister cache is updated, so the controller skips removing the finalizer at this line.
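
To illustrate the kind of check we believe is being raced, here is a rough sketch (hypothetical names and structure, not the actual Agones controller code): the controller consults the podLister, and if the stale cache still returns the Pod it assumes the Pod is not gone yet and leaves the finalizer in place.

    // Rough sketch only -- hypothetical names, not the actual Agones code.
    // A stale podLister can still return the Pod even though it has already been deleted.
    if _, err := c.podLister.Pods(gs.ObjectMeta.Namespace).Get(gs.ObjectMeta.Name); err == nil {
        // The cache says the Pod still exists, so finalizer removal is skipped this cycle.
        return gs, nil
    }
    // Only reached once the cache also reflects the Pod deletion; the finalizer is removed here.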

Normally, this race shouldn't be a problem, since a periodic resync ensures states are eventually consistent. But the condition here may prevent that resync from going through.
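
For context, the condition today presumably only re-enqueues on a state change, roughly like the sketch below (illustrative, using the identifiers from the suggestion that follows, not the exact Agones code). Because a resync delivers an update event in which the state has not changed, a terminating GameServer already in Shutdown is never re-enqueued and the finalizer removal is never retried:

    // Illustrative sketch of the presumed existing condition.
    // On a resync, oldGs.Status.State == newGs.Status.State for a GameServer already in
    // Shutdown, so it is never re-enqueued even though its finalizer still needs removing.
    if oldGs.Status.State != newGs.Status.State {
        c.enqueueGameServerBasedOnState(newGs)
    }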

A suggestion from @markmandel is to change the condition to something more akin to:

    if oldGs.Status.State != newGs.Status.State || !newGs.ObjectMeta.DeletionTimestamp.IsZero() {
        c.enqueueGameServerBasedOnState(newGs)
    }
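
With the additional check on newGs.ObjectMeta.DeletionTimestamp, every update event for a GameServer that is being deleted, including the periodic resync, would re-enqueue it even if its state has not changed, so a missed Pod deletion event would be retried on the next resync instead of leaving the finalizer in place indefinitely.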

Environment:

  • Agones version: 1.16
  • Kubernetes version (use kubectl version): 1.19
  • Cloud provider or hardware configuration: GKE
  • Install method (yaml/helm): yaml
  • Troubleshooting guide log(s): (see above)
  • Others:
runtarm added the kind/bug label on Nov 4, 2021
markmandel (Collaborator) commented:

I think the e2e test TestAutoscalerStressCreate is being hit by this issue, which is why I think that test has been flaky (and it should be a good way to verify that this issue is resolved).

WVerlaek (Contributor) commented Apr 21, 2022

We just ran into this as well: three GameServers had been stuck in the Shutdown state since 2 April, on Agones 1.18. The Pods were gone and the deletion timestamps were set on the GameServers, but deletion was blocked by the Agones finalizers still being present. I think the suggested change looks sensible.

markmandel added a commit to markmandel/agones that referenced this issue Apr 21, 2022
I believe this closes the issue with Shutdown GameServers very rarely getting stuck (very hard to test in an e2e test).

Closes googleforgames#2360
roberthbailey pushed a commit that referenced this issue Apr 22, 2022
I believe this closes the issue with Shutdown GameServers very rarely getting stuck (very hard to test in an e2e test).

Closes #2360
Thiryn pushed a commit to Thiryn/agones that referenced this issue Apr 25, 2022
I believe this closes the issue with Shutdown GameServers very rarely getting stuck (very hard to test in an e2e test).

Closes googleforgames#2360
SaitejaTamma added this to the 1.23.0 milestone on May 3, 2022