GameServer stuck in Shutdown state preventing a rolling update from completing #2360
Comments
I think the e2e test
We just ran into this as well: three GameServers had been stuck in the Shutdown state since 2 April, on Agones version 1.18. The Pods were gone and deletion timestamps were set on the GameServers, but deletion was blocked by the Agones finalizers still being present. I think the suggested change looks sensible.
I believe this closes the issue with Shutdown GameServers very rarely getting stuck (very hard to test in an e2e test). Closes googleforgames#2360
What happened:
I've been seeing a GameServer get stuck in the Shutdown state from time to time. The DeletionTimestamp was set and the underlying Pod was already gone, but the finalizer still existed on the GameServer object, which blocked deletion of the GameServer.
When this happens, it prevents the GameServerSet (that the GameServer belongs to) from scaling down, which in turn causes a rolling update of a Fleet to never complete, i.e. mixed versions of GameServers are available for allocation.
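For reference, a minimal diagnostic sketch that lists GameServers showing this symptom (deletion timestamp set, finalizers still present), assuming the Agones versioned clientset at agones.dev/agones/pkg/client/clientset/versioned and a kubeconfig at the default location:

```go
package main

import (
	"context"
	"fmt"

	"agones.dev/agones/pkg/client/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig (assumed to be at the default ~/.kube/config path).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}

	agonesClient, err := versioned.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// List every GameServer and flag the ones showing the symptom described above:
	// a deletion timestamp is set, but finalizers are still present.
	list, err := agonesClient.AgonesV1().GameServers(metav1.NamespaceAll).List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	for _, gs := range list.Items {
		if gs.ObjectMeta.DeletionTimestamp != nil && len(gs.ObjectMeta.Finalizers) > 0 {
			fmt.Printf("stuck: %s/%s state=%s finalizers=%v\n",
				gs.ObjectMeta.Namespace, gs.ObjectMeta.Name, gs.Status.State, gs.ObjectMeta.Finalizers)
		}
	}
}
```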
What you expected to happen:
The GameServer in the Shutdown state should always be removed after its underlying Pod is gone.
How to reproduce it (as minimally and precisely as possible):
We suspect this is caused by a race condition. No idea how to reproduce it deterministically yet.
Anything else we need to know?:
One time when it happened, I saw the event from this line emitted, but the following log entry was not there.
Our theory: there is a race condition where a Pod deletion event arrived before the podLister cache was updated, so the controller skipped removing the finalizer at this line.
Normally, the race condition shouldn't be a problem, since there is a periodic resync to ensure states are eventually consistent. But the condition here may prevent the resync from going through.
A suggestion from @markmandel is to change the condition to something more akin to:
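A rough, hypothetical sketch of that direction only (simplified names, not the actual Agones controller code or the literal proposal): the guard around finalizer removal could treat a Pod that is already terminating the same as a missing Pod, so a stale lister cache entry cannot block cleanup indefinitely.

```go
package gameservers

import (
	agonesv1 "agones.dev/agones/pkg/apis/agones/v1"
	corev1 "k8s.io/api/core/v1"
)

// shouldRemoveFinalizer is a hypothetical helper, not the actual controller
// code: it sketches a guard that releases the finalizer not only when the Pod
// is missing from the lister cache, but also when the Pod is still cached yet
// already marked for deletion.
func shouldRemoveFinalizer(gs *agonesv1.GameServer, pod *corev1.Pod) bool {
	if gs.ObjectMeta.DeletionTimestamp == nil {
		// The GameServer is not being deleted, so there is nothing to release.
		return false
	}
	if pod == nil {
		// The Pod is gone (or was never found): safe to remove the finalizer.
		return true
	}
	// The Pod still appears in the (possibly stale) cache, but it is already
	// terminating; waiting for a further Pod event may never resolve this.
	return pod.ObjectMeta.DeletionTimestamp != nil
}
```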
Environment:
Kubernetes version (use kubectl version): 1.19