Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flaky: TestGameServerRestartBeforeReadyCrash #2445

Closed
markmandel opened this issue Jan 21, 2022 · 17 comments · Fixed by #2812
Closed

Flaky: TestGameServerRestartBeforeReadyCrash #2445

markmandel opened this issue Jan 21, 2022 · 17 comments · Fixed by #2812
Assignees
Labels
area/tests Unit tests, e2e tests, anything to make sure things don't break good first issue These are great first issues. If you are looking for a place to start, start here! help wanted We would love help on these issues. Please come help us! kind/bug These are bugs.
Milestone

Comments

@markmandel
Copy link
Contributor

What happened:

The test TestGameServerRestartBeforeReadyCrash is quite flaky!

Short version:

--- FAIL: TestGameServerRestartBeforeReadyCrash (213.29s)
    gameserver_test.go:353: 
        	Error Trace:	gameserver_test.go:353
        	Error:      	Received unexpected error:
        	            	timed out waiting for the condition
        	Test:       	TestGameServerRestartBeforeReadyCrash

What you expected to happen:

The test to pass consistently.

How to reproduce it (as minimally and precisely as possible):

Run builds.

Anything else we need to know?:

Long log:

time="2022-01-21 05:46:12.582" level=info msg="Waiting for us to have an address to send things to" test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:46:15.786" level=info msg="Waiting for states to match" awaitingState=Scheduled currentState=Scheduled gs=game-servertf8gs test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:46:15.786" level=info msg="GameServer created" gs=game-servertf8gs test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:46:15.786" level=info msg="Dialing UDP message to address" address="35.247.116.52:7735" test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:46:15.786" level=info msg="crashing, and waiting to see restart" test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:46:18.285" level=info msg="sending message" fields.msg=CRASH gs=game-servertf8gs state=Scheduled test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:46:21.030" level=info msg="sending message" fields.msg=CRASH gs=game-servertf8gs state=Scheduled test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:46:23.832" level=info msg="sending message" fields.msg=CRASH gs=game-servertf8gs state=Scheduled test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:46:26.831" level=info msg="sending message" fields.msg=CRASH gs=game-servertf8gs state=Scheduled test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:46:29.633" level=info msg="successfully crashed. Moving on!" test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:46:32.769" level=info msg="Waiting for states to match" awaitingState=Unhealthy currentState=Scheduled gs=game-servertf8gs test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:46:36.168" level=info msg="Waiting for states to match" awaitingState=Unhealthy currentState=Scheduled gs=game-servertf8gs test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:46:39.369" level=info msg="Waiting for states to match" awaitingState=Unhealthy currentState=Scheduled gs=game-servertf8gs test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:46:41.567" level=info msg="Waiting for states to match" awaitingState=Unhealthy currentState=Scheduled gs=game-servertf8gs test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:46:41.568" level=info msg="marking GameServer as ready" test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:46:44.433" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Scheduled test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:46:47.431" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:46:50.034" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:46:53.035" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:46:56.237" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:46:59.235" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:47:01.834" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:47:04.434" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:47:07.232" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:47:10.433" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:47:13.834" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:47:17.034" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:47:19.633" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:47:22.634" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:47:26.032" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:47:29.035" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:47:31.449" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:47:34.447" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:47:37.235" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:47:40.233" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:47:43.234" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:47:46.235" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:47:49.451" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:47:52.232" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:47:54.633" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:47:57.469" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:48:00.033" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:48:02.641" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:48:05.033" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:48:07.033" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:48:10.233" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:48:13.033" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:48:15.633" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:48:18.633" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:48:21.633" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:48:24.432" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:48:27.236" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:48:30.235" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:48:32.832" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:48:35.635" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:48:38.432" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:48:41.234" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:48:44.433" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:48:47.436" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:48:50.234" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:48:52.833" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:48:55.067" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:48:57.834" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:00.633" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:03.030" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:05.236" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:07.434" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:09.633" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:11.432" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:13.234" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:15.433" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:17.234" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:19.034" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:21.032" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:22.234" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:24.033" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:25.834" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:27.432" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:28.832" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:30.232" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:31.434" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:32.831" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:33.634" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:34.433" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:35.434" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:36.033" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:36.433" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:37.033" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:37.835" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:38.432" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:38.834" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:39.233" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:39.834" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:40.162" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:40.635" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:41.032" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:41.234" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:41.834" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:42.161" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:42.632" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:43.034" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:43.434" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:43.837" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:44.161" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:44.634" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:45.037" level=info msg="sending message" fields.msg=READY gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:45.037" level=info msg="crashing again, should be unhealthy" test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:45.260" level=info msg="checking final crash state" gs=game-servertf8gs state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
time="2022-01-21 05:49:45.260" level=info msg="Unhealthy! We are done!" test=TestGameServerRestartBeforeReadyCrash
--- FAIL: TestGameServerRestartBeforeReadyCrash (213.29s)
        	Test:       	TestGameServerRestartBeforeReadyCrash

Environment: N/A

  • Agones version: dev
  • Kubernetes version (use kubectl version): 1.21.0
  • Cloud provider or hardware configuration: GKE
  • Install method (yaml/helm): helm
  • Troubleshooting guide log(s): See above
  • Others: N/A
@markmandel markmandel added kind/bug These are bugs. area/tests Unit tests, e2e tests, anything to make sure things don't break help wanted We would love help on these issues. Please come help us! good first issue These are great first issues. If you are looking for a place to start, start here! labels Jan 21, 2022
@markmandel
Copy link
Contributor Author

markmandel commented Apr 23, 2022

This has happened in a few builds lately, but I think the line numbers may have changed - or not

22: e2e-feature-gates
#2547 (comment)

--- FAIL: TestGameServerRestartBeforeReadyCrash (223.08s)
    gameserver_test.go:353: 
        	Error Trace:	gameserver_test.go:353
        	Error:      	Received unexpected error:
        	            	timed out waiting for the condition
        	Test:       	TestGameServerRestartBeforeReadyCrash

@markmandel
Copy link
Contributor Author

Just tested this locally, looks like this is still an issue.

root@8308eef5623c:/go/src/agones.dev/agones/test/e2e# go test -race -run TestGameServerRestartBeforeReadyCrash -count 100
....
INFO[2022-08-26 19:45:35.856] sending message                               fields.msg=CRASH gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:36.057] sending message                               fields.msg=CRASH gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:36.258] sending message                               fields.msg=CRASH gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:36.461] sending message                               fields.msg=CRASH gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:36.666] sending message                               fields.msg=CRASH gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:36.862] sending message                               fields.msg=CRASH gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:37.060] sending message                               fields.msg=CRASH gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:37.260] sending message                               fields.msg=CRASH gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:37.452] sending message                               fields.msg=CRASH gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:37.658] successfully crashed. Moving on!              test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:37.691] Waiting for states to match                   awaitingState=Unhealthy currentState=Scheduled gs=game-serverj55pk test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:38.728] Waiting for states to match                   awaitingState=Unhealthy currentState=Scheduled gs=game-serverj55pk test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:39.731] Waiting for states to match                   awaitingState=Unhealthy currentState=Scheduled gs=game-serverj55pk test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:40.732] Waiting for states to match                   awaitingState=Unhealthy currentState=Scheduled gs=game-serverj55pk test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:41.733] Waiting for states to match                   awaitingState=Unhealthy currentState=Scheduled gs=game-serverj55pk test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:42.729] Waiting for states to match                   awaitingState=Unhealthy currentState=Scheduled gs=game-serverj55pk test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:42.772] Waiting for states to match                   awaitingState=Unhealthy currentState=Scheduled gs=game-serverj55pk test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:42.772] marking GameServer as ready                   test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:42.846] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:43.129] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:43.333] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:43.536] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:43.740] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:43.930] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:44.138] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:44.330] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:44.531] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:44.730] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:44.930] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:45.133] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:45.334] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:45.539] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:45.720] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:45.941] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:46.136] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:46.342] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:46.536] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:46.733] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:46.940] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:47.120] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:47.325] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:47.528] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:47.737] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:47.937] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:48.130] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:48.336] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:48.531] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:48.744] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:48.931] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:49.135] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:49.333] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:49.531] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:49.722] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:49.937] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:50.148] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:50.344] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:50.539] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:50.737] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:50.935] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:51.138] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:51.337] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:51.533] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:51.728] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:51.933] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:52.133] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:52.342] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:52.533] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:52.731] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:52.936] sending message                               fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:53.126] sending message                               fields.msg=READY gs=game-serverj55pk state=RequestReady test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:53.330] sending message                               fields.msg=READY gs=game-serverj55pk state=RequestReady test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:53.533] sending message                               fields.msg=READY gs=game-serverj55pk state=RequestReady test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:53.720] ready! Moving On!                             test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:53.720] crashing again, should be unhealthy           test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:53.806] checking final crash state                    gs=game-serverj55pk state=Ready test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:53.806] sending message                               fields.msg=CRASH gs=game-serverj55pk state=Ready test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:54.086] checking final crash state                    gs=game-serverj55pk state=Ready test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:54.087] sending message                               fields.msg=CRASH gs=game-serverj55pk state=Ready test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:54.292] checking final crash state                    gs=game-serverj55pk state=Ready test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:54.293] sending message                               fields.msg=CRASH gs=game-serverj55pk state=Ready test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:54.496] checking final crash state                    gs=game-serverj55pk state=Ready test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:54.497] sending message                               fields.msg=CRASH gs=game-serverj55pk state=Ready test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:54.694] checking final crash state                    gs=game-serverj55pk state=Ready test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:54.694] sending message                               fields.msg=CRASH gs=game-serverj55pk state=Ready test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:54.899] checking final crash state                    gs=game-serverj55pk state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:54.900] Unhealthy! We are done!                       test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:55.016] Waiting for us to have an address to send things to  test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:55.082] Waiting for states to match                   awaitingState=Scheduled currentState=Creating gs=game-serverw57fc test=TestGameServerRestartBeforeReadyCrash
panic: test timed out after 10m0s

goroutine 4655 [running]:
testing.(*M).startAlarm.func1()
        /usr/local/go/src/testing/testing.go:1788 +0xbb
created by time.goFunc
        /usr/local/go/src/time/sleep.go:180 +0x4a

goroutine 1 [chan receive]:
testing.tRunner.func1()
        /usr/local/go/src/testing/testing.go:1225 +0x635
testing.tRunner(0xc00017eb60, 0xc000157980)
        /usr/local/go/src/testing/testing.go:1265 +0x269
testing.runTests(0xc000490680, {0x3299400, 0x46, 0x46}, {0x1204310, 0xc00052a1c0, 0x32abd00})
        /usr/local/go/src/testing/testing.go:1596 +0x7cb
testing.(*M).Run(0xc000490680)
        /usr/local/go/src/testing/testing.go:1504 +0x9d2
agones.dev/agones/test/e2e.TestMain(0x400)
        /go/src/agones.dev/agones/test/e2e/main_test.go:94 +0x9e8
main.main()
        _testmain.go:183 +0x265

goroutine 4 [chan receive]:
k8s.io/klog/v2.(*loggingT).flushDaemon(0x0)
        /go/src/agones.dev/agones/vendor/k8s.io/klog/v2/klog.go:1181 +0x8b
created by k8s.io/klog/v2.init.0
        /go/src/agones.dev/agones/vendor/k8s.io/klog/v2/klog.go:420 +0x1c5

goroutine 30 [IO wait]:
internal/poll.runtime_pollWait(0x7f5b3133a7d8, 0x72)
        /usr/local/go/src/runtime/netpoll.go:229 +0x89
internal/poll.(*pollDesc).wait(0xc00011e118, 0xc000012000, 0x0)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0xbd
internal/poll.(*pollDesc).waitRead(...)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc00011e100, {0xc000012000, 0x44ee, 0x44ee})
        /usr/local/go/src/internal/poll/fd_unix.go:167 +0x419
net.(*netFD).Read(0xc00011e100, {0xc000012000, 0x44ee, 0x44ee})
        /usr/local/go/src/net/fd_posix.go:56 +0x51
net.(*conn).Read(0xc000306010, {0xc000012000, 0x44ee, 0x44ee})
        /usr/local/go/src/net/net.go:183 +0xb1
crypto/tls.(*atLeastReader).Read(0xc0001319b0, {0xc000012000, 0x44ee, 0x44ee})
        /usr/local/go/src/crypto/tls/conn.go:777 +0x86
bytes.(*Buffer).ReadFrom(0xc0000ae278, {0x255b700, 0xc0001319b0})
        /usr/local/go/src/bytes/buffer.go:204 +0x113
crypto/tls.(*Conn).readFromUntil(0xc0000ae000, {0x255e220, 0xc000306010}, 0x5)
        /usr/local/go/src/crypto/tls/conn.go:799 +0x1df
crypto/tls.(*Conn).readRecordOrCCS(0xc0000ae000, 0x0)
        /usr/local/go/src/crypto/tls/conn.go:606 +0x3fe
crypto/tls.(*Conn).readRecord(...)
        /usr/local/go/src/crypto/tls/conn.go:574
crypto/tls.(*Conn).Read(0xc0000ae000, {0xc00028b000, 0x1000, 0x0})
        /usr/local/go/src/crypto/tls/conn.go:1277 +0x29c
bufio.(*Reader).Read(0xc000700540, {0xc00027e3c0, 0x9, 0x9})
        /usr/local/go/src/bufio/bufio.go:227 +0x4db
io.ReadAtLeast({0x255b560, 0xc000700540}, {0xc00027e3c0, 0x9, 0x9}, 0x9)
        /usr/local/go/src/io/io.go:328 +0xde
io.ReadFull(...)
        /usr/local/go/src/io/io.go:347
golang.org/x/net/http2.readFrameHeader({0xc00027e3c0, 0x9, 0x9}, {0x255b560, 0xc000700540})
        /go/src/agones.dev/agones/vendor/golang.org/x/net/http2/frame.go:237 +0x96
golang.org/x/net/http2.(*Framer).ReadFrame(0xc00027e380)
        /go/src/agones.dev/agones/vendor/golang.org/x/net/http2/frame.go:498 +0x108
golang.org/x/net/http2.(*clientConnReadLoop).run(0xc0005d3f78)
        /go/src/agones.dev/agones/vendor/golang.org/x/net/http2/transport.go:2101 +0x1f3
golang.org/x/net/http2.(*ClientConn).readLoop(0xc0004b1080)
        /go/src/agones.dev/agones/vendor/golang.org/x/net/http2/transport.go:1997 +0xb5
created by golang.org/x/net/http2.(*Transport).newClientConn
        /go/src/agones.dev/agones/vendor/golang.org/x/net/http2/transport.go:725 +0x14cb

goroutine 4703 [select]:
k8s.io/apimachinery/pkg/util/wait.poller.func1.1()
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:708 +0x2ed
created by k8s.io/apimachinery/pkg/util/wait.poller.func1
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:691 +0x12c

goroutine 4701 [select]:
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext({0x2589408, 0xc0000420c8}, 0xc000323260, 0x18)
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:658 +0x189
k8s.io/apimachinery/pkg/util/wait.poll({0x2589408, 0xc0000420c8}, 0x1, 0x198f601, 0x4a1045)
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:594 +0xe6
k8s.io/apimachinery/pkg/util/wait.PollImmediateWithContext({0x2589408, 0xc0000420c8}, 0xc0003f7540, 0x0, 0x0)
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:526 +0x66
k8s.io/apimachinery/pkg/util/wait.PollImmediate(0xc0002edab0, 0x22a41e0, 0x4)
        /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:512 +0x71
agones.dev/agones/test/e2e/framework.(*Framework).WaitForGameServerState(0xc000491f00, 0xc00017ed00, 0xc0000d5400, {0x22a968d, 0x9}, 0x0)
        /go/src/agones.dev/agones/test/e2e/framework/framework.go:257 +0x2dc
agones.dev/agones/test/e2e.TestGameServerRestartBeforeReadyCrash(0xc00017ed00)
        /go/src/agones.dev/agones/test/e2e/gameserver_test.go:278 +0x705
testing.tRunner(0xc00017ed00, 0x2366730)
        /usr/local/go/src/testing/testing.go:1259 +0x230
created by testing.(*T).Run
        /usr/local/go/src/testing/testing.go:1306 +0x727
exit status 2
FAIL    agones.dev/agones/test/e2e      600.514s

@zmerlynn
Copy link
Collaborator

zmerlynn commented Nov 2, 2022

I saw this fail recently as well. Grabbing as a good first issue.

@zmerlynn zmerlynn self-assigned this Nov 2, 2022
@zmerlynn
Copy link
Collaborator

zmerlynn commented Nov 2, 2022

Re: #2445 (comment), the problem here (and I ran into it as well), is that -count <X> is still constrained by -timeout, as evidenced by:

INFO[2022-08-26 19:45:55.082] Waiting for states to match                   awaitingState=Scheduled currentState=Creating gs=game-serverw57fc test=TestGameServerRestartBeforeReadyCrash
panic: test timed out after 10m0s

I'm trying this locally with -count 100 -timeout 0 now, but I saw a flake here yesterday, too: #2782 (comment) .

@markmandel
Copy link
Contributor Author

I feel like this one is a nightmare to replicate.

@zmerlynn
Copy link
Collaborator

zmerlynn commented Nov 2, 2022

Breadcrumbs: In the case of #2445 (comment) and #2782 (comment) both, we see this kind of odd pattern:

Step #23 - "e2e-stable": time="2022-11-01 15:59:22.597" level=info msg="checking final crash state" gs=game-serverxsdhc state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
Step #23 - "e2e-stable": time="2022-11-01 15:59:22.598" level=info msg="Unhealthy! We are done!" test=TestGameServerRestartBeforeReadyCrash
Step #23 - "e2e-stable": --- FAIL: TestGameServerRestartBeforeReadyCrash (205.04s)
Step #23 - "e2e-stable":     gameserver_test.go:354:
Step #23 - "e2e-stable":                Error Trace:    gameserver_test.go:354
Step #23 - "e2e-stable":                Error:          Received unexpected error:
Step #23 - "e2e-stable":                                timed out waiting for the condition
Step #23 - "e2e-stable":                Test:           TestGameServerRestartBeforeReadyCrash
Step #23 - "e2e-stable": time="2022-11-01 15:59:23.025" level=info msg="Checking Fleet Ready replicas" expected=1 fleet=simple-fleet-1.08n4sb fleetStatus="{Replicas:2 ReadyReplicas:0 ReservedReplicas:0 AllocatedReplicas:2 Players:<nil>}" test="TestFleetRollingUpdate/Use_fleet_Patch_true_25%_cycle_true"

note that we're saying Unhealthy! We are done!, but then failing on a timeout. I included the line after it to understand the timing, which is very tight. Ok, but the Unhealthy! We are done! is from:

logger.Info("Unhealthy! We are done!")

but the error trace is line 354 above it:

assert.NoError(t, err)

ETA: This is presumably because assert.NoError only calls t.Errorf - so we see a timeout and then continue on.

@zmerlynn
Copy link
Collaborator

zmerlynn commented Nov 2, 2022

I think in both cases I can see this pattern (** for emphasis):

Step #23 - "e2e-stable": time="2022-11-01 15:56:08.199" level=info msg="successfully crashed. Moving on!" test=TestGameServerRestartBeforeReadyCrash
Step #23 - "e2e-stable": time="2022-11-01 15:56:10.391" level=info msg="Waiting for states to match" awaitingState=Unhealthy currentState=Scheduled gs=game-serverxsdhc test=TestGameServerRestartBeforeReadyCrash
Step #23 - "e2e-stable": time="2022-11-01 15:56:14.192" level=info msg="Waiting for states to match" awaitingState=Unhealthy currentState=Scheduled gs=game-serverxsdhc test=TestGameServerRestartBeforeReadyCrash
Step #23 - "e2e-stable": time="2022-11-01 15:56:16.592" level=info msg="Waiting for states to match" awaitingState=Unhealthy currentState=Scheduled gs=game-serverxsdhc test=TestGameServerRestartBeforeReadyCrash
Step #23 - "e2e-stable": time="2022-11-01 15:56:19.391" level=info msg="Waiting for states to match" awaitingState=Unhealthy currentState=Scheduled gs=game-serverxsdhc test=TestGameServerRestartBeforeReadyCrash
Step #23 - "e2e-stable": time="2022-11-01 15:56:19.391" level=info msg="marking GameServer as ready" test=TestGameServerRestartBeforeReadyCrash
Step #23 - "e2e-stable": time="2022-11-01 15:56:22.198" level=info msg="sending message" fields.msg=READY gs=game-serverxsdhc state=Scheduled test=TestGameServerRestartBeforeReadyCrash
Step #23 - "e2e-stable": time="2022-11-01 15:56:24.399" level=info msg="sending message" fields.msg=READY gs=game-serverxsdhc state=Scheduled test=TestGameServerRestartBeforeReadyCrash
** Step #23 - "e2e-stable": time="2022-11-01 15:56:27.397" level=info msg="sending message" fields.msg=READY gs=game-serverxsdhc state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
Step #23 - "e2e-stable": time="2022-11-01 15:56:29.998" level=info msg="sending message" fields.msg=READY gs=game-serverxsdhc state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
Step #23 - "e2e-stable": time="2022-11-01 15:56:33.199" level=info msg="sending message" fields.msg=READY gs=game-serverxsdhc state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
Step #23 - "e2e-stable": time="2022-11-01 15:56:36.398" level=info msg="sending message" fields.msg=READY gs=game-serverxsdhc state=Unhealthy test=TestGameServerRestartBeforeReadyCrash

In particular, it looks like during the flakes, the GameServer is always transitioning Scheduled -> Unhealthy.

@zmerlynn
Copy link
Collaborator

zmerlynn commented Nov 3, 2022

I have not been able to repro locally yet (hundreds of runs over the afternoon) - this may take some logs analysis.

I scoured the build logs using the hyper-advanced technique of:

gsutil -m cp gs://agones-build-logs/* .
egrep "Test:.*TestGameServerRestartBeforeReadyCrash" *.txt

Unfortunately, build retention is short enough that I only see the failure in #2782 (comment).

Going to follow up on the CI clusters and see if I can maybe poke at logs there.

@zmerlynn
Copy link
Collaborator

zmerlynn commented Nov 3, 2022

Another dupe: #2790 (comment)

@zmerlynn
Copy link
Collaborator

zmerlynn commented Nov 3, 2022

So, I've been doing some logs analysis on #2782 (comment). What I see is that we crash the container twice, and due to backoff, there's a pretty big gap in health checks:

image

It's not clear this gap matters, but it's certainly curious. But then it does get to RequestReady state:

image

.. and fails just as quickly?

FWIW, the pod update there seems to be .. possibly irrelevant, not sure:
image

Anyways, still looking, and now I have another dupe to look at tomorrow.

@zmerlynn
Copy link
Collaborator

zmerlynn commented Nov 3, 2022

I have a theory as to what's happening here. I'll distill it down to a time-sequence first, then show the logs. I think what's happening is:

time event
0 container c14f starts
1 kubelet updates Pod status to show container c14f as running (restartCount=0)
2 container c14f crashes (forced by CRASH message from e2e)
3 kubelet restarts container, container 7c7c starts
4 e2e sends another CRASH
5 kubelet updates Pod status to shows container 7c7c as running (restartCount=1)
6 e2e recognizes crash has succeeded and stops sending CRASH, moves on to send READY
7 container 7c7c crashes (forced by CRASH message at t=4)
8 kubelet restarts container, container 2e5d starts
9 container 2e5d receives READY, moving GameServer to RequestReady
10 controller sees RequestReady and moves GameServer to Ready, with GameServerReadyContainerIDAnnotation as 7c7c. (Reminder, controller has not seen a Pod update yet)
11 kubelet updates Pod status to shows container 2e5d as running (restartCount=2)
12 controller sees 2e5d != 7c7c and assumes Pod died after calling Ready().

We see echoes of this in the logs for #2790 (comment) (I am eliding the first container, c14f - the only importance of that container is that the e2e moved on from CRASH to READY):

  • 2022-11-03T00:46:03.552890Z: 7c7c dies
SyncLoop (PLEG): event for pod" pod="1667436145/game-server5m66g" event=&{ID:cc19f829-800c-47b6-b93e-619e07cddd66 Type:ContainerDied Data:7c7ce1cfe5f4717e5fbf14e68806a59a8babde538834ffbb015dcbb1ecb02157}
  • 2022-11-03T00:46:14.622672Z: 2e5d started
SyncLoop (PLEG): event for pod" pod="1667436145/game-server5m66g" event=&{ID:cc19f829-800c-47b6-b93e-619e07cddd66 Type:ContainerStarted Data:2e5db813bc74279ad454d1974637aa2f4c16f32ea77c942f300e6480db0bd7b4}
  • 2022-11-03T00:46:19.171202006Z: We see a Syncing RequestReady State. Note that this is the last log in that message - there's also a SDK.Ready() complete event recorded during the same second (46:19) that appears out of order in Logs Explorer due to lack of precision in the event log.

  • 2022-11-03T00:46:19.191195Z: just after SDK.Ready we see a Pod update that still shows 7c7c as running - this is why I am guessing the sequence around t=10 above:

          {
            "ready": true,
            "lastState": {
              "terminated": {
                "exitCode": 1,
                "reason": "Error",
                "finishedAt": "2022-11-03T00:45:57Z",
                "startedAt": "2022-11-03T00:45:55Z",
                "containerID": "containerd://c14f0b62e3cff080fd9662f1478fd06eef363a878dd732c45dfabf92017e56bd"
              }
            },
            "name": "game-server",
            "image": "gcr.io/agones-images/simple-game-server:0.14",
            "imageID": "gcr.io/agones-images/simple-game-server@sha256:564578aa131ae856bdb7aef3f389a34fc73e7e396b5838ae9bc65f80909b8b96",
            "containerID": "containerd://7c7ce1cfe5f4717e5fbf14e68806a59a8babde538834ffbb015dcbb1ecb02157",
            "state": {
              "running": {
                "startedAt": "2022-11-03T00:45:59Z"
              }
            },
            "started": true,
            "restartCount": 1
          }
  • 2022-11-03T00:46:20.772518856Z: agones-controller sees the container 2e5d for the first time and that 7c7c has failed:
{
  "insertId": "7tw4phgwr0rveiqq",
  "jsonPayload": {
    "containerStatuses": [
      {
        "imageID": "us-docker.pkg.dev/agones-images/ci/agones-sdk@sha256:73413d1cf61febccf1305f22195309c01b8be402cbde7f9fb2762b0f7fea543c",
        "image": "us-docker.pkg.dev/agones-images/ci/agones-sdk:1.28.0-66f21f3",
        "name": "agones-gameserver-sidecar",
        "started": true,
        "lastState": {},
        "ready": true,
        "restartCount": 0,
        "state": {
          "running": {
            "startedAt": "2022-11-03T00:45:55Z"
          }
        },
        "containerID": "containerd://197488cfc52f5cc6683c0c96d7c92bdc517675627b3e465c64ad02a5fb602216"
      },
      {
        "imageID": "gcr.io/agones-images/simple-game-server@sha256:564578aa131ae856bdb7aef3f389a34fc73e7e396b5838ae9bc65f80909b8b96",
        "containerID": "containerd://2e5db813bc74279ad454d1974637aa2f4c16f32ea77c942f300e6480db0bd7b4",
        "state": {
          "running": {
            "startedAt": "2022-11-03T00:46:14Z"
          }
        },
        "name": "game-server",
        "restartCount": 2,
        "image": "gcr.io/agones-images/simple-game-server:0.14",
        "ready": true,
        "lastState": {
          "terminated": {
            "containerID": "containerd://7c7ce1cfe5f4717e5fbf14e68806a59a8babde538834ffbb015dcbb1ecb02157",
            "reason": "Error",
            "finishedAt": "2022-11-03T00:46:00Z",
            "exitCode": 1,
            "startedAt": "2022-11-03T00:45:59Z"
          }
        },
        "started": true
      }
    ],
    "gs": "game-server5m66g",
    "message": "Container Failed",
    "container": "game-server",
    "source": "*gameservers.HealthController"
  },
  "resource": {
    "type": "k8s_container",
    "labels": {
      "location": "us-west1-c",
      "project_id": "agones-images",
      "namespace_name": "agones-system",
      "container_name": "agones-controller",
      "pod_name": "agones-controller-6b94677bc6-rnbwq",
      "cluster_name": "e2e-test-cluster"
    }
  },
  "timestamp": "2022-11-03T00:46:20.772518856Z",
  "severity": "DEBUG",
  "labels": {
    "k8s-pod/agones_dev/role": "controller",
    "k8s-pod/heritage": "Helm",
    "k8s-pod/release": "agones",
    "k8s-pod/app": "agones",
    "k8s-pod/pod-template-hash": "6b94677bc6",
    "compute.googleapis.com/resource_name": "gke-e2e-test-cluster-agones-system-79cf09e3-bgs7"
  },
  "logName": "projects/agones-images/logs/stderr",
  "receiveTimestamp": "2022-11-03T00:46:25.762176239Z"
}

@markmandel
Copy link
Contributor Author

This is some super interesting analysis!

The thing I'm not 100% sure on with the analysis, is that this is the order of updates on the Pod first, then the GameServer for the GameServerReadyContainerIDAnnotation

podCopy.ObjectMeta.Annotations[agonesv1.GameServerReadyContainerIDAnnotation] = gsCopy.ObjectMeta.Annotations[agonesv1.GameServerReadyContainerIDAnnotation]
if _, err = c.podGetter.Pods(pod.ObjectMeta.Namespace).Update(ctx, podCopy, metav1.UpdateOptions{}); err != nil {
return gs, errors.Wrapf(err, "error updating ready annotation on Pod: %s", pod.ObjectMeta.Name)
}
}
gsCopy.Status.State = agonesv1.GameServerStateReady
gs, err = c.gameServerGetter.GameServers(gs.ObjectMeta.Namespace).Update(ctx, gsCopy, metav1.UpdateOptions{})
if err != nil {
return gs, errors.Wrapf(err, "error setting Ready, Port and address on GameServer %s Status", gs.ObjectMeta.Name)
}

So if the Pod is currently out of sync, the Update call should fail, since there is a newer generation in K8s.

Also, in the health controller, if the annotations are out of sync with each other (which I don't think should ever happen? Is that what you are definitely seeing here?) the health controller will return an error, to kick it into a retry, rather than move it to Unhealthy. 🤔

if pod.ObjectMeta.Annotations[agonesv1.GameServerReadyContainerIDAnnotation] != gsReadyContainerID {
return false, workerqueue.NewDebugError(errors.Errorf("pod and gameserver %s data are out of sync, retrying", gs.ObjectMeta.Name))
}

markmandel pushed a commit that referenced this issue Nov 8, 2022
Narrow the race in #2445 by running GameServerRestartBeforeReadyCrash serially. See #2445 (comment) for a detailed analysis.

Does not fix the issue - this is stopgap until we understand how to fix it.
@zmerlynn
Copy link
Collaborator

| So if the Pod is currently out of sync, the Update call should fail, since there is a newer generation in K8s.

Hmm. I think that as long as step 10 and 11 occur in this order (per earlier comment):

time event
7 container 7c7c crashes (forced by CRASH message at t=4)
8 kubelet restarts container, container 2e5d starts
9 container 2e5d receives READY, moving GameServer to RequestReady
10 controller sees RequestReady and moves GameServer to Ready, with GameServerReadyContainerIDAnnotation as 7c7c. (Reminder, controller has not seen a Pod update yet)
11 kubelet updates Pod status to shows container 2e5d as running (restartCount=2)

then kubelet basically loses the race to update the pod (it's update in step 10, I didn't make that explicit).

| Also, in the health controller, if the annotations are out of sync with each other (which I don't think should ever happen? Is that what you are definitely seeing here?) the health controller will return an error, to kick it into a retry

I haven't seen the error from line 256, so I assume not.

@zmerlynn
Copy link
Collaborator

zmerlynn commented Nov 11, 2022

Wait! Yes, the second screenshot in #2445 (comment) had exactly that error message: pod and gameserver game-serverxsdhc data are out of sync, retrying

@zmerlynn
Copy link
Collaborator

zmerlynn commented Nov 11, 2022

Oh, except that error is also exactly what you expect after the container restarts, that's right. (Trying to page this all back in.) (ignore me, I think I'm confused.)

@zmerlynn
Copy link
Collaborator

zmerlynn commented Nov 12, 2022

The health.go:256 error message is present, but I think that's expected in between the update to the Pod and the update to the GameServer.

This going to be a bit spammy with pictures, I hope they help. In order, I see:


  • game-server updates to RequestReady (I surmise this by callerSuppliedUserAgent being sdk-server/v0.0.0 [...], but I didn't include that in the screen shot) (this is step 9 in the sequence above):
    image

  • I think this is the updates from line 855 and line 861. Note that in the middle we see some red herring updates from other controllers, including the out of sync, retrying. This is expected between updates. This is step 10 in the sequence above:
    image

  • Right after, we signal SDK.Ready():
    image

  • But then right after, I believe this is kubelet updating the pod (SyncLoop UPDATE is basically saying "I found an update during my sync loop"), after which we notice Container Failed. This is step 11 in the sequence above:
    image


I realize I am inferring a lot from the SyncLoop, but unfortunately we don't keep audit logs on Pod.Status updates, so it's hard to draw a strong conclusion. We know that line 855/861 result in synchronized (but bad) state of GameServerReadyContainerIDAnnotation because we see the skipUnhealthyGameContainer: Container crashed after Ready, returning false message that happens after the HealthController verifies the pod/GS match:

hc.baseLogger.WithField("gs", gs.ObjectMeta.Name).WithField("gsMeta", gs.ObjectMeta).WithField("podStatus", pod.Status).Debug("skipUnhealthyGameContainer: Container crashed after Ready, returning false")

So I'm pretty confident in this analysis. I think if kubelet loses the race, we can't actually use the container ID the way it's being used here.

I question the value of this feature and would like to return to an earlier question I asked on chat: Is this solving a real problem? Do customers often have workloads that flap before Ready(), i.e. a "settling period"? Or is this just kind of a nice to have feature? If it's the latter, I'd really recommend we just move this to "any termination => Unhealthy" and eliminate this logic. It's really complicated to get right.

If it's a really necessary feature for some usecase, I think there's a better way to handle this, but it's invasive: The game server container can generate a random cookie/nonce and communicate that with Ready() and on healthchecks. On healthcheck, the SDK checks the cookie and fail healthchecks immediately if there's any mismatch. Basically, take any of the container analysis out of it and just make this strictly a conversation between the game server and the sidecar - that should be a lot more bulletproof at detecting that the game server that called Ready() is also the one that's currently running.

ETA: Another option to keep the feature might be to implement some delay to RequestReady. We could implement a speed-bump (even 1s would probably close it) at

s.enqueueState(agonesv1.GameServerStateRequestReady)

@zmerlynn
Copy link
Collaborator

Also, this race only really exists because the game server in this case was able to call Ready() so quickly. Another way to close the race from the unit test perspective is to add a speed bump to simple-game-server (it could even be done for only this test). With a minor delay, I expect kubelet will always win the race.

If we fix it this way, we are effectively agreeing that in a narrow set of circumstances, the GameServer may land in Unhealthy unnecessarily - i.e. we accept that the race exists but in 99.9% of cases it just doesn't matter. That's an ok compromise, I suspect.

chiayi pushed a commit to chiayi/agones that referenced this issue Nov 17, 2022
nodepools and regional clusters

Updates to release checklist. (googleforgames#2772)

* Updates to release checklist.

Adding items that showed up in the recent release that were not written
down or required better clarification.

* Review updates, and some other small tweaks.

Co-authored-by: Robert Bailey <robertbailey@google.com>

Release 1.27.0 (googleforgames#2776)

* Release 1.27.0

* Update FAQ on ExternalDNS (googleforgames#2773)

The feature flag it points to have been moved to stable, so the link
is not useful any more.

Also removed notes on ipv6, since they aren't 100% accurate, as we were
discussing in googleforgames#2767.

* Updates to release checklist. (googleforgames#2772)

* Updates to release checklist.

Adding items that showed up in the recent release that were not written
down or required better clarification.

* Review updates, and some other small tweaks.

Co-authored-by: Robert Bailey <robertbailey@google.com>

* Release-changes

* Review comment

* Review changes

Co-authored-by: Mark Mandel <markmandel@google.com>
Co-authored-by: Robert Bailey <robertbailey@google.com>

Version updates (googleforgames#2778)

Players in-game metric for when PlayerTracking is enabled (googleforgames#2765)

* Check for DeletionTimestamp of fleet and gameserverset before scaling

* Add metric to track player count in gameservers

* check PlayerStatus is not nil

* Update metrics available in docs

* Wrong relref path

* typo

* Change name for players in game metric to player connected. Add player capacity metric. Hide docs until next agones release.

* Duplicate metrics table

* add gameserver player tracking metrics to fleetViews

Co-authored-by: Mark Mandel <markmandel@google.com>

Remove generation for swagger Go code and Add static swagger codes for test (googleforgames#2757)

Co-authored-by: Mark Mandel <markmandel@google.com>

Updated allocation yaml files under examples/ to use selectors

Show how to set graceful termination in a game server that is safe to (googleforgames#2780)

evict.

Avoid retry from allocateFromLocalCluster under context kill. (googleforgames#2783)

* Version updates

* issue-2736-changes

Co-authored-by: Mark Mandel <markmandel@google.com>

Bring SDK base image to debian:bullseye (googleforgames#2769)

* Bring SDK base image to debian:bullseye

The upgrade to gRPC solved one issue, and I also added a limit to number
of processes that could run for `make -j` otherwise the whole thing
would fall over (also would crash my dev machine!).

Closes googleforgames#2224

* Force refresh of cpp cache on Cloud Build.

* Fixes for CI:

* Revert CI cache increment (don't think we need it)
* Add shell to cpp image for debugging.
* Fix formatting issue that is breaking CI.

Co-authored-by: Robert Bailey <robertbailey@google.com>

Update health-checking.md (googleforgames#2785)

Fixed spell error: spec.health.failureTheshold to spec.health.failureThreshold

Updated allocation yaml files under examples/ to use selectors (googleforgames#2787)

Cleanup of load tests (googleforgames#2784)

* issue-2744 updated changes with new description
* 2744 review changes

Sync Pod host ports back to GameServer in GCP (googleforgames#2782)

This is the start of the implementation for googleforgames#2777:

* Most of this is mechanical and implements a thin cloud product
abstraction layer in pkg/cloud, instantiated with New(product). The
product abstraction provides a single function so far:
SyncPodPortsToGameServer.

* SyncPodPortsToGameServer is inserted as a hook while syncing
IP/ports, to let different cloud providers handle port allocation
slightly differently (in this case, GKE Autopilot)

* In GKE Autopilot, we look for a JSON string like
`{"min":7000,"max":8000,"portsAssigned":{"7001":7737,"7002":7738}}`
as an indication that the host ports were reassigned (per policy).
As a side note to anyone watching, this is currently an unreleased
feature. If we see this, we use the provided mapping to map the
host ports in the GameServer.Spec.

With this change, it's possible to launch a GameServer and get a
healthy GameServer Pod by adding the following annotation:

```
annotations:
  cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
  autopilot.gke.io/host-port-assignment: '{"min": 7000, "max": 8000}'
```

If this PR causes any issues, the cloud product auto detection can
be disabled by setting `agones.cloudProduct=generic`, or forced to
GKE Autopilot using `agones.cloudProduct=gke-autopilot`.

In a future PR, I will add the host-port-assignment annotation
automatically on Autopilot

Co-authored-by: Mark Mandel <markmandel@google.com>

Update gke terraform files to allow autoscaling

Fix (not really) problems reported by VSCode (googleforgames#2790)

VSCode reports `main redeclared` between allocationload.go and
runscenario.go due to the fact that they both look like `package main`
binaries in the same directory, similar e.g. [this poster on a
different
project](https://stackoverflow.com/questions/66970531/vs-code-go-main-redeclared-in-this-block)

To fix it, it's easy enough to just give these binaries their own
package path and fix up the calling scripts.

Along the way, fix a lint complaint in runscenario.go

Add location variable for cluster location argument

Minor fix

changed default of location var to empty string

GameServerRestartBeforeReadyCrash: Run serially (googleforgames#2791)

Narrow the race in googleforgames#2445 by running GameServerRestartBeforeReadyCrash serially. See googleforgames#2445 (comment) for a detailed analysis.

Does not fix the issue - this is stopgap until we understand how to fix it.

Enable fieldalignment linter, then mostly ignore it (googleforgames#2795)

Enable the fieldalignment linter by enabling all `govet` checks
except shadowing. Ignore large swaths of code (tests, cmd/, APIs),
and nolint'd existing complaints that seemed irrelevant.

Along the way:

* removed existing nolint:maligned, as `maligned` is no more.
* disabled `structcheck` and `deadcode` as they are deprecated (and I
think have been subsumed by other linters?)
* changed `gameServerCacheEntry` to `gameServerCache`. It is the
cache, not just an entry.
* fixed alignment of `gameServerSetCacheEntry`.

Add fswatch library to watch and batch filesystem events, use in allocator (googleforgames#2792)

This pull refactors the fsnotify code in allocator/main out to a
shared library, and in that shared library implements a batched
notification processor.

Closes googleforgames#1816: This takes a slightly different approach than specified
in the issue, instead choosing to just delay processing until after a
batch processing period. I chose 1s - it's far longer than necessary,
but still much shorter than it takes for the secret changes to
propagate to the container anyways.

I considered the approach in googleforgames#1816 of trying to parse the actual
events, but it's too fiddly to get exactly right: e.g. maybe you only
refresh on "write", but then "chmod" could make the file readable
whereas it wasn't before, "rename" could expose a file that wasn't
there before, etc.

Cloud product: Split port allocators, implement Autopilot port allocation/policies (googleforgames#2789)

In the Agones on GKE Autopilot implementation, we have no need for the
port allocator - the informer/etc. is an unnecessary moving piece.
This PR allows for cloud products to provide their own port allocation
implementation, and implements the GKE Autopilot "allocator". We do
this by:

* Splitting portallocator off to its own package. It was basically
self-sufficient anyways, except it was a little too friendly with
controller_test.go. I solved that by introducing a TestInterface for
controller_test.go to upcast to.

* Allow cloud product implementations to define their own port
allocator.

* Defining a new port allocator for GKE that does a simple per-port
HostPort allocation, and adds the host-port-assignment annotation to
the pod template.

* Extend cloudproduct again to add a GameServer validator

* And in Autopilot, reject if the PortPolicy is not `Dynamic`

Release: Note to switch away from `agones-images` (googleforgames#2809)

Since we have few guardrails on accidentally touching `agones-images`
project, adding a note in the release checklist to switch back to a
local development project after running a release.

Flake: TestControllerGameServerCount (googleforgames#2805)

Made it deterministic in the test, and got rod of the potential race
conditions.

Also fix it such that the util function for generating GameServer names
always produce a unique name.

Closes googleforgames#2804

Co-authored-by: Robert Bailey <robertbailey@google.com>

Remove Windows FAQ Entry (googleforgames#2811)

The contents are no longer accurate, and are covered in the installation
section now.

Makefile changes for adding location variable

added autoscale parameters to Makefile and README

Markdown fix in readme

Changed LOCATION to always be set with ZONE as default

use  only if the variable has a value

fixed extraneous characters

update gke terraform exmaple module

Update Node.js dependencies and package (googleforgames#2815)

* Update all dependencies and Node,js to LTS version

* Update other docker images that use Node.js

Added autoscale to example cluster and added to website docs

Added defaults and feature expiry

Remove zone from gke/variable.tf file.
roberthbailey added a commit that referenced this issue Nov 18, 2022
This closes the race in #2445 by introducing a larger delay before
we mark the game server pod as Ready(). This change admits the
possibility that in some circumstances, if the game server initializes
too quickly and kubelet loses a race to update the pod, we may perceive
the game server as having crashed when it did not.

Co-authored-by: Robert Bailey <robertbailey@google.com>
@mangalpalli mangalpalli added this to the 1.28.0 milestone Nov 29, 2022
zmerlynn added a commit to zmerlynn/agones that referenced this issue Apr 5, 2023
If the SDK and controller win the race to update the Pod with the
GameServerReadyContainerIDAnnotation before kubelet even gets a chance
to add the running containers to the Pod, the controller may update
the pod with an empty annotation, which then confuses further runs.

Fixes TestPlayerConnectWithCapacityZero flakes

May fully fix googleforgames#2445 as well
zmerlynn added a commit to zmerlynn/agones that referenced this issue Apr 6, 2023
If the SDK and controller win the race to update the Pod with the
GameServerReadyContainerIDAnnotation before kubelet even gets a chance
to add the running containers to the Pod, the controller may update
the pod with an empty annotation, which then confuses further runs.

Fixes TestPlayerConnectWithCapacityZero flakes

May fully fix googleforgames#2445 as well
zmerlynn added a commit that referenced this issue Apr 6, 2023
* Rework game server health initial delay handling

This is a redrive of #3046, which was reverted in #3068

Rework health check handling of InitialDelaySeconds. See
#2966 (comment):

* We remove any knowledge in the SDK of InitialDelaySeconds

* We remove the runHealth goroutine from main and shift this
responsibility to the /gshealthz handler

Along the way:

*  I noted that the FailureThreshold doesn't need to be enforced on
both the kubelet and SDK side, so in the injected liveness probe, I
dropped that to 1. Previously we were waiting more probes than we
needed to. In practice this is not terribly relevant since the SDK
pushes it into Unhealthy.

* Close race if enqueueState is called rapidly before update can succeed

* Re-add Autopilot 1.26 to test matrix (removed in #3059)

* Close consistency race in syncGameServerRequestReadyState:
If the SDK and controller win the race to update the Pod with the
GameServerReadyContainerIDAnnotation before kubelet even gets a chance
to add the running containers to the Pod, the controller may update
the pod with an empty annotation, which then confuses further runs.

* Fixes TestPlayerConnectWithCapacityZero flakes

May fully fix #2445 as well
Kalaiselvi84 pushed a commit to Kalaiselvi84/agones that referenced this issue Apr 11, 2023
* Rework game server health initial delay handling

This is a redrive of googleforgames#3046, which was reverted in googleforgames#3068

Rework health check handling of InitialDelaySeconds. See
googleforgames#2966 (comment):

* We remove any knowledge in the SDK of InitialDelaySeconds

* We remove the runHealth goroutine from main and shift this
responsibility to the /gshealthz handler

Along the way:

*  I noted that the FailureThreshold doesn't need to be enforced on
both the kubelet and SDK side, so in the injected liveness probe, I
dropped that to 1. Previously we were waiting more probes than we
needed to. In practice this is not terribly relevant since the SDK
pushes it into Unhealthy.

* Close race if enqueueState is called rapidly before update can succeed

* Re-add Autopilot 1.26 to test matrix (removed in googleforgames#3059)

* Close consistency race in syncGameServerRequestReadyState:
If the SDK and controller win the race to update the Pod with the
GameServerReadyContainerIDAnnotation before kubelet even gets a chance
to add the running containers to the Pod, the controller may update
the pod with an empty annotation, which then confuses further runs.

* Fixes TestPlayerConnectWithCapacityZero flakes

May fully fix googleforgames#2445 as well
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/tests Unit tests, e2e tests, anything to make sure things don't break good first issue These are great first issues. If you are looking for a place to start, start here! help wanted We would love help on these issues. Please come help us! kind/bug These are bugs.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants