Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revert "Rework game server health initial delay handling (#3046)" #3068

Merged
merged 1 commit into from
Apr 4, 2023

Conversation

zmerlynn
Copy link
Collaborator

@zmerlynn zmerlynn commented Apr 4, 2023

Seeing a high incidence of SDK sidecar failures on Autopilot e2es, looking at the logs. Reverting for now as it seems related to this PR.

This reverts commit 60ed8cd / #3046.

@gongmax
Copy link
Collaborator

gongmax commented Apr 4, 2023

This revert also disable tests on AP 1.26. Is that intended?

@google-oss-prow google-oss-prow bot added the lgtm label Apr 4, 2023
@zmerlynn
Copy link
Collaborator Author

zmerlynn commented Apr 4, 2023

This revert also disable tests on AP 1.26. Is that intended?

Yup. It is definitely the case that without #3046, autopilot 1.26 is struggling. However, it seems like with #3046, we have some other problem.

…mes#3046)"

Seeing a high incidence of SDK sidecar failures on Autopilot e2es,
looking at the logs. Reverting for now as it seems related to this PR.

This reverts commit 60ed8cd.
@google-oss-prow google-oss-prow bot added the lgtm label Apr 4, 2023
@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: gongmax, zmerlynn
Once this PR has been reviewed and has the lgtm label, please ask for approval from markmandel. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: 37c2896d-46ef-4d42-bcae-26ba4e3534eb

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/googleforgames/agones.git pull/3068/head:pr_3068 && git checkout pr_3068
  • helm install agones ./install/helm/agones --namespace agones-system --agones.image.release=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.31.0-9897003-amd64

@zmerlynn zmerlynn merged commit 7d20367 into googleforgames:main Apr 4, 2023
@zmerlynn zmerlynn deleted the revert-3046 branch April 4, 2023 22:32
zmerlynn added a commit to zmerlynn/agones that referenced this pull request Apr 5, 2023
This is a redrive of googleforgames#3046, which was reverted in googleforgames#3068

Rework health check handling of InitialDelaySeconds. See
googleforgames#2966 (comment):

* We remove any knowledge in the SDK of InitialDelaySeconds

* We remove the runHealth goroutine from main and shift this
responsibility to the /gshealthz handler

Along the way:

*  I noted that the FailureThreshold doesn't need to be enforced on
both the kubelet and SDK side, so in the injected liveness probe, I
dropped that to 1. Previously we were waiting more probes than we
needed to. In practice this is not terribly relevant since the SDK
pushes it into Unhealthy.

* Close race if enqueueState is called rapidly before update can succeed

* Re-add Autopilot 1.26 to test matrix (removed in googleforgames#3059)
zmerlynn added a commit to zmerlynn/agones that referenced this pull request Apr 5, 2023
This is a redrive of googleforgames#3046, which was reverted in googleforgames#3068

Rework health check handling of InitialDelaySeconds. See
googleforgames#2966 (comment):

* We remove any knowledge in the SDK of InitialDelaySeconds

* We remove the runHealth goroutine from main and shift this
responsibility to the /gshealthz handler

Along the way:

*  I noted that the FailureThreshold doesn't need to be enforced on
both the kubelet and SDK side, so in the injected liveness probe, I
dropped that to 1. Previously we were waiting more probes than we
needed to. In practice this is not terribly relevant since the SDK
pushes it into Unhealthy.

* Close race if enqueueState is called rapidly before update can succeed

* Re-add Autopilot 1.26 to test matrix (removed in googleforgames#3059)
zmerlynn added a commit to zmerlynn/agones that referenced this pull request Apr 5, 2023
This is a redrive of googleforgames#3046, which was reverted in googleforgames#3068

Rework health check handling of InitialDelaySeconds. See
googleforgames#2966 (comment):

* We remove any knowledge in the SDK of InitialDelaySeconds

* We remove the runHealth goroutine from main and shift this
responsibility to the /gshealthz handler

Along the way:

*  I noted that the FailureThreshold doesn't need to be enforced on
both the kubelet and SDK side, so in the injected liveness probe, I
dropped that to 1. Previously we were waiting more probes than we
needed to. In practice this is not terribly relevant since the SDK
pushes it into Unhealthy.

* Close race if enqueueState is called rapidly before update can succeed

* Re-add Autopilot 1.26 to test matrix (removed in googleforgames#3059)
zmerlynn added a commit to zmerlynn/agones that referenced this pull request Apr 6, 2023
This is a redrive of googleforgames#3046, which was reverted in googleforgames#3068

Rework health check handling of InitialDelaySeconds. See
googleforgames#2966 (comment):

* We remove any knowledge in the SDK of InitialDelaySeconds

* We remove the runHealth goroutine from main and shift this
responsibility to the /gshealthz handler

Along the way:

*  I noted that the FailureThreshold doesn't need to be enforced on
both the kubelet and SDK side, so in the injected liveness probe, I
dropped that to 1. Previously we were waiting more probes than we
needed to. In practice this is not terribly relevant since the SDK
pushes it into Unhealthy.

* Close race if enqueueState is called rapidly before update can succeed

* Re-add Autopilot 1.26 to test matrix (removed in googleforgames#3059)
zmerlynn added a commit that referenced this pull request Apr 6, 2023
* Rework game server health initial delay handling

This is a redrive of #3046, which was reverted in #3068

Rework health check handling of InitialDelaySeconds. See
#2966 (comment):

* We remove any knowledge in the SDK of InitialDelaySeconds

* We remove the runHealth goroutine from main and shift this
responsibility to the /gshealthz handler

Along the way:

*  I noted that the FailureThreshold doesn't need to be enforced on
both the kubelet and SDK side, so in the injected liveness probe, I
dropped that to 1. Previously we were waiting more probes than we
needed to. In practice this is not terribly relevant since the SDK
pushes it into Unhealthy.

* Close race if enqueueState is called rapidly before update can succeed

* Re-add Autopilot 1.26 to test matrix (removed in #3059)

* Close consistency race in syncGameServerRequestReadyState:
If the SDK and controller win the race to update the Pod with the
GameServerReadyContainerIDAnnotation before kubelet even gets a chance
to add the running containers to the Pod, the controller may update
the pod with an empty annotation, which then confuses further runs.

* Fixes TestPlayerConnectWithCapacityZero flakes

May fully fix #2445 as well
@Kalaiselvi84 Kalaiselvi84 added the kind/cleanup Refactoring code, fixing up documentation, etc label Apr 10, 2023
@Kalaiselvi84 Kalaiselvi84 added this to the 1.31.0 milestone Apr 10, 2023
Kalaiselvi84 pushed a commit to Kalaiselvi84/agones that referenced this pull request Apr 11, 2023
…mes#3046)" (googleforgames#3068)

Seeing a high incidence of SDK sidecar failures on Autopilot e2es,
looking at the logs. Reverting for now as it seems related to this PR.

This reverts commit 60ed8cd.
Kalaiselvi84 pushed a commit to Kalaiselvi84/agones that referenced this pull request Apr 11, 2023
* Rework game server health initial delay handling

This is a redrive of googleforgames#3046, which was reverted in googleforgames#3068

Rework health check handling of InitialDelaySeconds. See
googleforgames#2966 (comment):

* We remove any knowledge in the SDK of InitialDelaySeconds

* We remove the runHealth goroutine from main and shift this
responsibility to the /gshealthz handler

Along the way:

*  I noted that the FailureThreshold doesn't need to be enforced on
both the kubelet and SDK side, so in the injected liveness probe, I
dropped that to 1. Previously we were waiting more probes than we
needed to. In practice this is not terribly relevant since the SDK
pushes it into Unhealthy.

* Close race if enqueueState is called rapidly before update can succeed

* Re-add Autopilot 1.26 to test matrix (removed in googleforgames#3059)

* Close consistency race in syncGameServerRequestReadyState:
If the SDK and controller win the race to update the Pod with the
GameServerReadyContainerIDAnnotation before kubelet even gets a chance
to add the running containers to the Pod, the controller may update
the pod with an empty annotation, which then confuses further runs.

* Fixes TestPlayerConnectWithCapacityZero flakes

May fully fix googleforgames#2445 as well
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/cleanup Refactoring code, fixing up documentation, etc lgtm size/L
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants