
GameServer stuck in Shutdown state preventing a rolling update from completing #2360

Closed
runtarm opened this issue Nov 4, 2021 · 2 comments · Fixed by #2550
Labels: kind/bug
Milestone: 1.23.0

Comments


runtarm commented Nov 4, 2021

What happened:

I've been seeing a GameServer get stuck in the Shutdown state from time to time. The DeletionTimestamp was set and the underlying Pod was already gone, but the finalizer still existed on the GameServer object, which blocked deletion of the GameServer.

When this happens, it prevents the GameServerSet that the GameServer belongs to from scaling down, which in turn keeps a rolling update of the Fleet from completing, i.e. mixed versions of GameServers remain available for allocation.

What you expected to happen:

A GameServer in the Shutdown state should always be removed once its underlying Pod is gone.

How to reproduce it (as minimally and precisely as possible):

We suspect this is caused by a race condition; we don't yet have a way to reproduce it deterministically.

Anything else we need to know?:

On one occasion when it happened, I saw the event from this line emitted, but the log entry that should follow it was not there.

Our theory: there is a race condition where the Pod deletion event arrives before the podLister cache is updated, so the controller skips removing the finalizer at this line.
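
To illustrate the kind of check we believe is being raced, here is a rough sketch (hypothetical names and structure, not the actual Agones controller code): the controller consults the podLister, and if the stale cache still returns the Pod it assumes the Pod is not gone yet and leaves the finalizer in place.

    // Rough sketch only -- hypothetical names, not the actual Agones code.
    // A stale podLister can still return the Pod even though it has already been deleted.
    if _, err := c.podLister.Pods(gs.ObjectMeta.Namespace).Get(gs.ObjectMeta.Name); err == nil {
        // The cache says the Pod still exists, so finalizer removal is skipped this cycle.
        return gs, nil
    }
    // Only reached once the cache also reflects the Pod deletion; the finalizer is removed here.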

Normally, this race shouldn't be a problem, since a periodic resync ensures states are eventually consistent. But the condition here may prevent that resync from going through.
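
For context, the condition today presumably only re-enqueues on a state change, roughly like the sketch below (illustrative, using the identifiers from the suggestion that follows, not the exact Agones code). Because a resync delivers an update event in which the state has not changed, a terminating GameServer already in Shutdown is never re-enqueued and the finalizer removal is never retried:

    // Illustrative sketch of the presumed existing condition.
    // On a resync, oldGs.Status.State == newGs.Status.State for a GameServer already in
    // Shutdown, so it is never re-enqueued even though its finalizer still needs removing.
    if oldGs.Status.State != newGs.Status.State {
        c.enqueueGameServerBasedOnState(newGs)
    }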

A suggestion from @markmandel is to change the condition to something more akin to:

    if oldGs.Status.State != newGs.Status.State || !newGs.ObjectMeta.DeletionTimestamp.IsZero() {
        c.enqueueGameServerBasedOnState(newGs)
    }
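
With the additional check on newGs.ObjectMeta.DeletionTimestamp, every update event for a GameServer that is being deleted, including the periodic resync, would re-enqueue it even if its state has not changed, so a missed Pod deletion event would be retried on the next resync instead of leaving the finalizer in place indefinitely.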

Environment:

  • Agones version: 1.16
  • Kubernetes version (use kubectl version): 1.19
  • Cloud provider or hardware configuration: GKE
  • Install method (yaml/helm): yaml
  • Troubleshooting guide log(s): (see above)
  • Others:
runtarm added the kind/bug label on Nov 4, 2021
markmandel (Collaborator) commented:

I think the e2e test TestAutoscalerStressCreate is being hit by this issue, which is why I think that test has been flaky (and it should be a good way to verify that this issue is resolved).

WVerlaek (Contributor) commented Apr 21, 2022

We just ran into this as well: three GameServers had been stuck in the Shutdown state since 2 April, on Agones 1.18. The Pods were gone and the deletion timestamps were set on the GameServers, but deletion was blocked by the Agones finalizers still being present. I think the suggested change looks sensible.

markmandel added a commit to markmandel/agones that referenced this issue Apr 21, 2022
I believe this closes the issue with Shutdown GameServers very rarely getting stuck (very hard to test in an e2e test).

Closes googleforgames#2360
roberthbailey pushed a commit that referenced this issue Apr 22, 2022
I believe this closes the issue with Shutdown GameServers very rarely getting stuck (very hard to test in an e2e test).

Closes #2360
Thiryn pushed a commit to Thiryn/agones that referenced this issue Apr 25, 2022
I believe this closes the issue with Shutdown GameServers very rarely getting stuck (very hard to test in an e2e test).

Closes googleforgames#2360
SaitejaTamma added this to the 1.23.0 milestone on May 3, 2022