Mark webhook and controller as safe-to-evict #4124

imjasonh · 2021-07-28T16:24:29Z

The safe-to-evict annotation tells the cluster autoscaler whether the
pod can be evicted to allow the node it's on to scale down.
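
As a rough illustration of that behavior, here is a minimal sketch that models only this one annotation. The annotation key is the one quoted in the knative/serving cross-reference further down this thread; the function names and the dict-based pod representation are hypothetical, and the real cluster autoscaler weighs many other factors that are not modeled here.

```python
# Annotation key as quoted in the knative/serving issue that references this PR.
SAFE_TO_EVICT = "cluster-autoscaler.kubernetes.io/safe-to-evict"


def blocks_scale_down(annotations):
    """Return True if a pod's annotations explicitly forbid eviction.

    `annotations` is a plain dict standing in for a pod's metadata.annotations
    (an assumption made to keep this sketch self-contained).
    """
    return annotations.get(SAFE_TO_EVICT) == "false"


def node_can_scale_down(pods):
    """In this simplified model, a node is a scale-down candidate only if
    none of the pods scheduled on it forbids eviction."""
    return not any(blocks_scale_down(annotations) for annotations in pods)


if __name__ == "__main__":
    before = [{SAFE_TO_EVICT: "false"}]  # previous setting described in this PR
    after = [{SAFE_TO_EVICT: "true"}]    # assumed value after this change
    print(node_can_scale_down(before), node_can_scale_down(after))
```

Under this model, a single pod annotated with "false" is enough to pin its node, which is the scale-down problem this PR addresses.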

This was set to false (by me!) 2 years ago in fc6ef39
to prevent service unreliability during scale-down events. If no
webhook replicas are available, users can't create/update/delete
Tekton objects; if no controller replicas are available, status updates
from Pod events, etc., won't be processed.

Unfortunately, blocking node eviction means the node that the pod(s) get
scheduled to can't be scaled down. Furthermore, the nodes can't be fully
drained when updating the cluster. This can leave a cluster in a
mid-upgrade state that can make issues difficult to diagnose and reason
about.

With this change, a cluster scale-down event might cause temporary
service unreliability with the default single-replica configuration. As
with #3787, if a user/operator wants to prevent this, they should
configure more replicas for HA.

/kind bug

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

Docs included if any changes are user facing
[n/a] Tests included if any functionality added or changed
Follows the commit message standard
Meets the Tekton contributor standards (including
functionality, content, code)
Release notes block below has been filled in or deleted (only if no user facing changes)

Release Notes

By default, the webhook and controller are now marked as safe-to-evict, so the cluster autoscaler can evict them when scaling down a node. See docs/enabling-ha.md for more details.

@vdemeester

imjasonh · 2021-07-28T16:31:04Z

/test tekton-pipeline-unit-tests

imjasonh · 2021-07-28T16:38:08Z

/test tekton-pipeline-unit-tests

imjasonh · 2021-07-28T16:55:25Z

/test pull-tekton-pipeline-alpha-integration-tests

tekton-robot · 2021-07-29T04:35:54Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [vdemeester]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

pierretasci · 2021-07-29T22:51:48Z

/assign
/lgtm

vdemeester · 2021-08-05T05:50:49Z

/cc @dibyom as I think triggers does the same.

dibyom · 2021-08-05T18:44:39Z

Yeah we should port this to triggers as well

tekton-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. labels Jul 28, 2021

tekton-robot requested review from dibyom and dlorenc July 28, 2021 16:25

tekton-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jul 28, 2021

vdemeester approved these changes Jul 29, 2021

tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 29, 2021

tekton-robot assigned pierretasci Jul 29, 2021

tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 29, 2021

tekton-robot merged commit 5350069 into tektoncd:main Jul 29, 2021

bobcatfish mentioned this pull request Aug 10, 2021

Mark webhook and controller as safe-to-evict tektoncd/triggers#1179

Merged

4 tasks

dprotaso mentioned this pull request May 23, 2023

cluster-autoscaler.kubernetes.io/safe-to-evict: "false" annotation prevents GKE nodepool to scale down knative/serving#13984

Closed
