-
Notifications
You must be signed in to change notification settings - Fork 826
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix for health check race condition #2351
Conversation
Build Succeeded 👏 Build Id: d2702608-cdfe-4f78-9c15-9c202fa36835 The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
|
14418e3
to
e1be871
Compare
Build Succeeded 👏 Build Id: 5ffee1c4-57fa-49e6-a510-8beef192b424 The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
|
e1be871
to
4b7e12b
Compare
Build Failed 😱 Build Id: ea9a49ab-9959-4c01-8871-505619ddad0d To get permission to view the Cloud Build view, join the agones-discuss Google Group. |
4b7e12b
to
93e5a8f
Compare
Build Failed 😱 Build Id: 456cadbe-706b-4aa5-962a-8e164072639d To get permission to view the Cloud Build view, join the agones-discuss Google Group. |
|
Build Failed 😱 Build Id: 887524df-56a0-401c-8e73-9296be1d1626 To get permission to view the Cloud Build view, join the agones-discuss Google Group. |
So now one can grep the logs to get all the logs for a specific e2e test, and now we start to see something strange:
So time to go go hunting through the debug logs we added to the controller I think, and find out why these gameservers are randomly moving to Unhealthy. I'm wondering if something has changed in one of our hacks somewhere (i.e. a string match), and it's being construed as a crash that isn't what we think it is. The GameServer from agones/test/e2e/gameserverallocation_test.go Lines 375 to 415 in 296fbb7
|
Build Failed 😱 Build Id: d93d0720-070a-44db-8edf-0f03f917f3e9 To get permission to view the Cloud Build view, join the agones-discuss Google Group. |
Digging into the controller logs for Just looking at Events, I can see it going from
Looking just before the move to
Looking before that, I can see the message {
"insertId": "9l7sghonrqe6wf83",
"jsonPayload": {
"containerStatuses": [
{
"ready": true,
"containerID": "containerd://4d1c06167addfb1f287f7110a0b2d6594aebd4f3fbb49e99e17702031190980e",
"restartCount": 0,
"state": {
"running": {
"startedAt": "2021-10-30T04:24:59Z"
}
},
"imageID": "gcr.io/agones-images/agones-sdk@sha256:949b060db929eb5ffcf66c961ce259073c00fe38dcfe04c8efe2e96f5ef662fd",
"name": "agones-gameserver-sidecar",
"image": "gcr.io/agones-images/agones-sdk:1.19.0-93e5a8f",
"started": true,
"lastState": {}
},
{
"started": false,
"imageID": "gcr.io/agones-images/simple-game-server@sha256:ac3e5a8388677453834f567fe1561eaec0bb3dab4df6243933064823f5c144b6",
"state": {
"terminated": {
"reason": "Error",
"containerID": "containerd://317fbe44e5a7446bb1b1e4417969ebf562ee9e8236375d0253927841ea7e5b4a",
"exitCode": 1,
"startedAt": "2021-10-30T04:24:32Z",
"finishedAt": "2021-10-30T04:25:02Z"
}
},
"name": "game-server",
"lastState": {},
"ready": false,
"containerID": "containerd://317fbe44e5a7446bb1b1e4417969ebf562ee9e8236375d0253927841ea7e5b4a",
"image": "gcr.io/agones-images/simple-game-server:0.4",
"restartCount": 0
}
],
"message": "Container Failed",
"source": "*gameservers.HealthController",
"container": "game-server",
"gs": "game-server26bwl"
},
"resource": {
"type": "k8s_container",
"labels": {
"location": "us-west1-c",
"namespace_name": "agones-system",
"cluster_name": "e2e-test-cluster",
"project_id": "agones-images",
"pod_name": "agones-controller-7cbc78d57b-7z48d",
"container_name": "agones-controller"
}
},
"timestamp": "2021-10-30T04:25:04.779309764Z",
"severity": "DEBUG",
"labels": {
"k8s-pod/heritage": "Helm",
"compute.googleapis.com/resource_name": "gke-e2e-test-cluster-agones-system-578206ef-4cpx",
"k8s-pod/release": "agones",
"k8s-pod/agones_dev/role": "controller",
"k8s-pod/pod-template-hash": "7cbc78d57b",
"k8s-pod/app": "agones"
},
"logName": "projects/agones-images/logs/stderr",
"receiveTimestamp": "2021-10-30T04:25:06.624093989Z"
} So the next area of investigation: Why did it crash? (or think it crashed?) Going to need to pull up it's logs. |
Pulling up the logs for the simple-gameserver and I can see it crashes because it initially fails to connect to the SDK. https://cloudlogging.app.goo.gl/Rcs7V3uHnq99DpMLA Crashing here, since it can't connect to the SDK server: There are two questions I have at this point:
Okay, that at least gives us something to go on. |
8f8a3bd
to
bec9cbf
Compare
Build Failed 😱 Build Id: 5f64ce57-4526-41df-a508-f4a829c97c61 To get permission to view the Cloud Build view, join the agones-discuss Google Group. |
bec9cbf
to
ea95f09
Compare
Build Failed 😱 Build Id: 453f9900-20f6-4a54-bbeb-7b86911d9271 To get permission to view the Cloud Build view, join the agones-discuss Google Group. |
This article is interesting: https://medium.com/@marko.luksa/delaying-application-start-until-sidecar-is-ready-2ec2d21a7b74 I just checked and it appears that we add the sdk sidecar after the game container in the pods we create for a game server. I wonder if switching the order would help mitigate the issue with the sidecar not starting for such a long time. |
The other thing I'm looking into is whether there are any differences in GKE's default cluster settings that have changed between 1.20 and 1.21 that might affect us. So far it looks like the only thing I've seen is the default network settings are different:
But nothing about container runtimes or node images. |
Sounds like a good idea in general. I figure the delay is either (1) first image download or (2) overloaded system (3) combination therein Agones should handle this anyway, without moving the GameServer to Unhealthy, so it seems like a legitimate bug. |
Is back again. I think this might actually be the best test to dig into the flakiness. Dumping logs from the test:
This looks like a good example of the test issue, because the GameServer should be able to go to Ready even if it has restarted beforehand. Dumping logs for the GameServer from the controller: And yep, can see that just before the GameServer gets marked as Unhealthy, the system sees the game server container to be terminated (even though it must be running, since it marked the GameServer as Ready. (stripped some non-relevant fields)
The GameServer also thinks that what it sees as the currently terminated container is the one that has marked it as Ready
This is a little odd, as we do have a check in the code to check that the state is Running. But race conditions do happen. Let's see what the logs for the Pod look like (stripping Health Pings): (READY repeats until shutdown, since the GameServer gets stuck in Unhealthy). An interesting thing, since we retry CRASHing, we actually crash the container twice (which we see in the container status values) -- this probably reads much like the SDK Server starting up slow and forcing a crash. So feeling pretty confident that I've got a handle on the issue. 🤞🏻 |
Reminder: Running tests lots: (I always forget the timeout) |
Build Succeeded 👏 Build Id: 72c24792-c67f-4bee-83a8-fe9623a97f74 The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
|
178c0ce
to
d20ed4d
Compare
Build Failed 😱 Build Id: fc0977b6-1b5a-48ea-b268-50f9306677ec To get permission to view the Cloud Build view, join the agones-discuss Google Group. |
So this failure seems to be something different?
Looks like it's Ready, so I'm not sure why the connection is being refused. Need to have a look into the logs for the game server, see if anything weird is going on there. |
d20ed4d
to
d61df70
Compare
Build Failed 😱 Build Id: 64d0ec03-7407-4544-97fb-f5a9bea15dd4 To get permission to view the Cloud Build view, join the agones-discuss Google Group. |
Build Succeeded 👏 Build Id: 0a91a8e1-c582-4980-a25f-078827ee5f49 The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
|
b787e6f
to
6eddb2e
Compare
Build Succeeded 👏 Build Id: dbafa2b0-c787-4b4c-b3bc-d267cfbbce3d The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
|
Added some more unit tests, and made some minor tweaks to a few things. I'm punting the connection issue to another PR. Assuming this otherwise passes on this run, I'll cleanup the PR description and the git history -- but it should be good to review now. |
Build Succeeded 👏 Build Id: 5e7470a7-bf27-4b24-882e-87b09e9a7238 The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
|
The Pod and GameServer data in the Informer cache are eventually consistent and can result in data races. This fix ensures that they are in sync at the time that the GameServer is moved to Ready, so that Health checking can behave correctly if a crash occurs after the fact. This also includes a combination of smaller fixes/debug for e2e flakiness, as we narrowed down the problem. * Retries on UDP packet sending * Failing on UDP packet sending dumps gameserver details * Drop parallelism on e2e tests from 32 to 16.
daffc50
to
1f532c1
Compare
Build Succeeded 👏 Build Id: 706d3421-b8e6-493f-9e9c-4ec9ad15eaa9 The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: markmandel, roberthbailey The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What type of PR is this?
/kind bug
What this PR does / Why we need it:
The Pod and GameServer data in the Informer cache are eventually consistent and can result in data races.
This fix ensures that they are in sync at the time that the GameServer is moved to Ready, so that Health checking can behave correctly if a crash occurs after the fact.
Which issue(s) this PR fixes:
N/A
Special notes for your reviewer:
This also includes a combination of smaller fixes/debug for e2e flakiness,
as we narrowed down the problem.