Flaky: TestGameServerRestartBeforeReadyCrash #2445
Comments
This has happened in a few builds lately, but I think the line numbers may have changed - or not 22: e2e-feature-gates
|
Just tested this locally; it looks like this is still an issue.
root@8308eef5623c:/go/src/agones.dev/agones/test/e2e# go test -race -run TestGameServerRestartBeforeReadyCrash -count 100
....
INFO[2022-08-26 19:45:35.856] sending message fields.msg=CRASH gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:36.057] sending message fields.msg=CRASH gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:36.258] sending message fields.msg=CRASH gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:36.461] sending message fields.msg=CRASH gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:36.666] sending message fields.msg=CRASH gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:36.862] sending message fields.msg=CRASH gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:37.060] sending message fields.msg=CRASH gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:37.260] sending message fields.msg=CRASH gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:37.452] sending message fields.msg=CRASH gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:37.658] successfully crashed. Moving on! test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:37.691] Waiting for states to match awaitingState=Unhealthy currentState=Scheduled gs=game-serverj55pk test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:38.728] Waiting for states to match awaitingState=Unhealthy currentState=Scheduled gs=game-serverj55pk test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:39.731] Waiting for states to match awaitingState=Unhealthy currentState=Scheduled gs=game-serverj55pk test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:40.732] Waiting for states to match awaitingState=Unhealthy currentState=Scheduled gs=game-serverj55pk test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:41.733] Waiting for states to match awaitingState=Unhealthy currentState=Scheduled gs=game-serverj55pk test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:42.729] Waiting for states to match awaitingState=Unhealthy currentState=Scheduled gs=game-serverj55pk test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:42.772] Waiting for states to match awaitingState=Unhealthy currentState=Scheduled gs=game-serverj55pk test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:42.772] marking GameServer as ready test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:42.846] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:43.129] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:43.333] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:43.536] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:43.740] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:43.930] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:44.138] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:44.330] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:44.531] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:44.730] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:44.930] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:45.133] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:45.334] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:45.539] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:45.720] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:45.941] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:46.136] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:46.342] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:46.536] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:46.733] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:46.940] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:47.120] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:47.325] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:47.528] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:47.737] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:47.937] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:48.130] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:48.336] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:48.531] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:48.744] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:48.931] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:49.135] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:49.333] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:49.531] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:49.722] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:49.937] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:50.148] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:50.344] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:50.539] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:50.737] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:50.935] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:51.138] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:51.337] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:51.533] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:51.728] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:51.933] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:52.133] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:52.342] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:52.533] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:52.731] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:52.936] sending message fields.msg=READY gs=game-serverj55pk state=Scheduled test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:53.126] sending message fields.msg=READY gs=game-serverj55pk state=RequestReady test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:53.330] sending message fields.msg=READY gs=game-serverj55pk state=RequestReady test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:53.533] sending message fields.msg=READY gs=game-serverj55pk state=RequestReady test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:53.720] ready! Moving On! test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:53.720] crashing again, should be unhealthy test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:53.806] checking final crash state gs=game-serverj55pk state=Ready test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:53.806] sending message fields.msg=CRASH gs=game-serverj55pk state=Ready test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:54.086] checking final crash state gs=game-serverj55pk state=Ready test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:54.087] sending message fields.msg=CRASH gs=game-serverj55pk state=Ready test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:54.292] checking final crash state gs=game-serverj55pk state=Ready test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:54.293] sending message fields.msg=CRASH gs=game-serverj55pk state=Ready test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:54.496] checking final crash state gs=game-serverj55pk state=Ready test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:54.497] sending message fields.msg=CRASH gs=game-serverj55pk state=Ready test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:54.694] checking final crash state gs=game-serverj55pk state=Ready test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:54.694] sending message fields.msg=CRASH gs=game-serverj55pk state=Ready test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:54.899] checking final crash state gs=game-serverj55pk state=Unhealthy test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:54.900] Unhealthy! We are done! test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:55.016] Waiting for us to have an address to send things to test=TestGameServerRestartBeforeReadyCrash
INFO[2022-08-26 19:45:55.082] Waiting for states to match awaitingState=Scheduled currentState=Creating gs=game-serverw57fc test=TestGameServerRestartBeforeReadyCrash
panic: test timed out after 10m0s
goroutine 4655 [running]:
testing.(*M).startAlarm.func1()
/usr/local/go/src/testing/testing.go:1788 +0xbb
created by time.goFunc
/usr/local/go/src/time/sleep.go:180 +0x4a
goroutine 1 [chan receive]:
testing.tRunner.func1()
/usr/local/go/src/testing/testing.go:1225 +0x635
testing.tRunner(0xc00017eb60, 0xc000157980)
/usr/local/go/src/testing/testing.go:1265 +0x269
testing.runTests(0xc000490680, {0x3299400, 0x46, 0x46}, {0x1204310, 0xc00052a1c0, 0x32abd00})
/usr/local/go/src/testing/testing.go:1596 +0x7cb
testing.(*M).Run(0xc000490680)
/usr/local/go/src/testing/testing.go:1504 +0x9d2
agones.dev/agones/test/e2e.TestMain(0x400)
/go/src/agones.dev/agones/test/e2e/main_test.go:94 +0x9e8
main.main()
_testmain.go:183 +0x265
goroutine 4 [chan receive]:
k8s.io/klog/v2.(*loggingT).flushDaemon(0x0)
/go/src/agones.dev/agones/vendor/k8s.io/klog/v2/klog.go:1181 +0x8b
created by k8s.io/klog/v2.init.0
/go/src/agones.dev/agones/vendor/k8s.io/klog/v2/klog.go:420 +0x1c5
goroutine 30 [IO wait]:
internal/poll.runtime_pollWait(0x7f5b3133a7d8, 0x72)
/usr/local/go/src/runtime/netpoll.go:229 +0x89
internal/poll.(*pollDesc).wait(0xc00011e118, 0xc000012000, 0x0)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0xbd
internal/poll.(*pollDesc).waitRead(...)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc00011e100, {0xc000012000, 0x44ee, 0x44ee})
/usr/local/go/src/internal/poll/fd_unix.go:167 +0x419
net.(*netFD).Read(0xc00011e100, {0xc000012000, 0x44ee, 0x44ee})
/usr/local/go/src/net/fd_posix.go:56 +0x51
net.(*conn).Read(0xc000306010, {0xc000012000, 0x44ee, 0x44ee})
/usr/local/go/src/net/net.go:183 +0xb1
crypto/tls.(*atLeastReader).Read(0xc0001319b0, {0xc000012000, 0x44ee, 0x44ee})
/usr/local/go/src/crypto/tls/conn.go:777 +0x86
bytes.(*Buffer).ReadFrom(0xc0000ae278, {0x255b700, 0xc0001319b0})
/usr/local/go/src/bytes/buffer.go:204 +0x113
crypto/tls.(*Conn).readFromUntil(0xc0000ae000, {0x255e220, 0xc000306010}, 0x5)
/usr/local/go/src/crypto/tls/conn.go:799 +0x1df
crypto/tls.(*Conn).readRecordOrCCS(0xc0000ae000, 0x0)
/usr/local/go/src/crypto/tls/conn.go:606 +0x3fe
crypto/tls.(*Conn).readRecord(...)
/usr/local/go/src/crypto/tls/conn.go:574
crypto/tls.(*Conn).Read(0xc0000ae000, {0xc00028b000, 0x1000, 0x0})
/usr/local/go/src/crypto/tls/conn.go:1277 +0x29c
bufio.(*Reader).Read(0xc000700540, {0xc00027e3c0, 0x9, 0x9})
/usr/local/go/src/bufio/bufio.go:227 +0x4db
io.ReadAtLeast({0x255b560, 0xc000700540}, {0xc00027e3c0, 0x9, 0x9}, 0x9)
/usr/local/go/src/io/io.go:328 +0xde
io.ReadFull(...)
/usr/local/go/src/io/io.go:347
golang.org/x/net/http2.readFrameHeader({0xc00027e3c0, 0x9, 0x9}, {0x255b560, 0xc000700540})
/go/src/agones.dev/agones/vendor/golang.org/x/net/http2/frame.go:237 +0x96
golang.org/x/net/http2.(*Framer).ReadFrame(0xc00027e380)
/go/src/agones.dev/agones/vendor/golang.org/x/net/http2/frame.go:498 +0x108
golang.org/x/net/http2.(*clientConnReadLoop).run(0xc0005d3f78)
/go/src/agones.dev/agones/vendor/golang.org/x/net/http2/transport.go:2101 +0x1f3
golang.org/x/net/http2.(*ClientConn).readLoop(0xc0004b1080)
/go/src/agones.dev/agones/vendor/golang.org/x/net/http2/transport.go:1997 +0xb5
created by golang.org/x/net/http2.(*Transport).newClientConn
/go/src/agones.dev/agones/vendor/golang.org/x/net/http2/transport.go:725 +0x14cb
goroutine 4703 [select]:
k8s.io/apimachinery/pkg/util/wait.poller.func1.1()
/go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:708 +0x2ed
created by k8s.io/apimachinery/pkg/util/wait.poller.func1
/go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:691 +0x12c
goroutine 4701 [select]:
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext({0x2589408, 0xc0000420c8}, 0xc000323260, 0x18)
/go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:658 +0x189
k8s.io/apimachinery/pkg/util/wait.poll({0x2589408, 0xc0000420c8}, 0x1, 0x198f601, 0x4a1045)
/go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:594 +0xe6
k8s.io/apimachinery/pkg/util/wait.PollImmediateWithContext({0x2589408, 0xc0000420c8}, 0xc0003f7540, 0x0, 0x0)
/go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:526 +0x66
k8s.io/apimachinery/pkg/util/wait.PollImmediate(0xc0002edab0, 0x22a41e0, 0x4)
/go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:512 +0x71
agones.dev/agones/test/e2e/framework.(*Framework).WaitForGameServerState(0xc000491f00, 0xc00017ed00, 0xc0000d5400, {0x22a968d, 0x9}, 0x0)
/go/src/agones.dev/agones/test/e2e/framework/framework.go:257 +0x2dc
agones.dev/agones/test/e2e.TestGameServerRestartBeforeReadyCrash(0xc00017ed00)
/go/src/agones.dev/agones/test/e2e/gameserver_test.go:278 +0x705
testing.tRunner(0xc00017ed00, 0x2366730)
/usr/local/go/src/testing/testing.go:1259 +0x230
created by testing.(*T).Run
/usr/local/go/src/testing/testing.go:1306 +0x727
exit status 2
FAIL agones.dev/agones/test/e2e 600.514s
|
I saw this fail recently as well. Grabbing as a good first issue. |
Re: #2445 (comment), the problem here (and I ran into it as well), is that
I'm trying this locally with |
I feel like this one is a nightmare to replicate. |
Breadcrumbs: In the case of #2445 (comment) and #2782 (comment) both, we see this kind of odd pattern:
note that we're saying agones/test/e2e/gameserver_test.go Line 363 in dc3eb37
but the error trace is line 354 above it: agones/test/e2e/gameserver_test.go Line 354 in dc3eb37
ETA: This is presumably because |
I think in both cases I can see this pattern (** for emphasis):
In particular, it looks like during the flakes, the |
I have not been able to repro locally yet (hundreds of runs over the afternoon) - this may take some logs analysis. I scoured the build logs using the hyper-advanced technique of:
Unfortunately, build retention is short enough that I only see the failure in #2782 (comment). Going to follow up on the CI clusters and see if I can maybe poke at logs there. |
Another dupe: #2790 (comment) |
So, I've been doing some log analysis on #2782 (comment). What I see is that we crash the container twice, and due to backoff, there's a pretty big gap in health checks:
It's not clear this gap matters, but it's certainly curious. But then it does get to .. and fails just as quickly?
FWIW, the pod update there seems to be .. possibly irrelevant, not sure:
Anyways, still looking, and now I have another dupe to look at tomorrow. |
I have a theory as to what's happening here. I'll distill it down to a time-sequence first, then show the logs. I think what's happening is:
We see echoes of this in the logs for #2790 (comment) (I am eliding the first container,
|
This is some super interesting analysis! The thing I'm not 100% sure on with the analysis, is that this is the order of updates on the Pod first, then the GameServer for the agones/pkg/gameservers/controller.go Lines 854 to 864 in 3c38876
So if the Pod is currently out of sync, the Update call should fail, since there is a newer generation in K8s.
Also, in the health controller, if the annotations are out of sync with each other (which I don't think should ever happen? Is that what you are definitely seeing here?) the health controller will return an error, to kick it into a retry, rather than move it to agones/pkg/gameservers/health.go Lines 255 to 257 in f9c333d
|
Narrow the race in #2445 by running GameServerRestartBeforeReadyCrash serially. See #2445 (comment) for a detailed analysis. Does not fix the issue - this is stopgap until we understand how to fix it.
| So if the Pod is currently out of sync, the Update call should fail, since there is a newer generation in K8s.
Hmm. I think that as long as steps 10 and 11 occur in this order (per earlier comment):
then
| Also, in the health controller, if the annotations are out of sync with each other (which I don't think should ever happen? Is that what you are definitely seeing here?) the health controller will return an error, to kick it into a retry
I haven't seen the error from line 256, so I assume not. |
Wait! Yes, the second screenshot in #2445 (comment) had exactly that error message: |
|
The health.go:256 error message is present, but I think that's expected in between the update to the
This is going to be a bit spammy with pictures; I hope they help. In order, I see:
I realize I am inferring a lot from the agones/pkg/gameservers/health.go Line 279 in d5cf2b0
So I'm pretty confident in this analysis. I think if
I question the value of this feature and would like to return to an earlier question I asked on chat: Is this solving a real problem? Do customers often have workloads that flap before
If it's a really necessary feature for some use case, I think there's a better way to handle this, but it's invasive: The game server container can generate a random cookie/nonce and communicate that with
ETA: Another option to keep the feature might be to implement some delay to agones/pkg/sdkserver/sdkserver.go Line 443 in d5cf2b0
|
Also, this race only really exists because the game server in this case was able to call
If we fix it this way, we are effectively agreeing that in a narrow set of circumstances, the |
This closes the race in #2445 by introducing a larger delay before we mark the game server pod as Ready(). This change admits the possibility that in some circumstances, if the game server initializes too quickly and kubelet loses a race to update the pod, we may perceive the game server as having crashed when it did not. Co-authored-by: Robert Bailey <robertbailey@google.com>
If the SDK and controller win the race to update the Pod with the GameServerReadyContainerIDAnnotation before kubelet even gets a chance to add the running containers to the Pod, the controller may update the pod with an empty annotation, which then confuses further runs. Fixes TestPlayerConnectWithCapacityZero flakes May fully fix googleforgames#2445 as well
* Rework game server health initial delay handling
This is a redrive of #3046, which was reverted in #3068.
Rework health check handling of InitialDelaySeconds. See #2966 (comment):
* We remove any knowledge in the SDK of InitialDelaySeconds
* We remove the runHealth goroutine from main and shift this responsibility to the /gshealthz handler
Along the way:
* I noted that the FailureThreshold doesn't need to be enforced on both the kubelet and SDK side, so in the injected liveness probe, I dropped that to 1. Previously we were waiting more probes than we needed to. In practice this is not terribly relevant since the SDK pushes it into Unhealthy.
* Close race if enqueueState is called rapidly before update can succeed
* Re-add Autopilot 1.26 to test matrix (removed in #3059)
* Close consistency race in syncGameServerRequestReadyState: If the SDK and controller win the race to update the Pod with the GameServerReadyContainerIDAnnotation before kubelet even gets a chance to add the running containers to the Pod, the controller may update the pod with an empty annotation, which then confuses further runs.
* Fixes TestPlayerConnectWithCapacityZero flakes
May fully fix #2445 as well
What happened:
The test TestGameServerRestartBeforeReadyCrash is quite flaky!
Short version:
What you expected to happen:
The test to pass consistently.
How to reproduce it (as minimally and precisely as possible):
Run builds.
Anything else we need to know?:
Long log:
Environment: N/A
kubectl version: 1.21.0