Mark webhook and controller as safe-to-evict #4124
Conversation
The safe-to-evict annotation tells the cluster autoscaler whether the pod can be evicted to allow the node it's on to scale down. This was set to false (by me!) 2 years ago in tektoncd@fc6ef39 to prevent service unreliability during scale-down events. If no webhook replicas are available, users can't create/update/delete Tekton objects; if no controller replicas are available, status updates from Pod events, etc., won't be processed.

Unfortunately, blocking node eviction means the node that the pod(s) get scheduled to can't be scaled down. Furthermore, the nodes can't be fully drained when updating the cluster. This can leave a cluster in a mid-upgrade state that makes issues difficult to diagnose and reason about.

With this change, a cluster scale-down event might cause temporary service unreliability with the default single-replica configuration. As with tektoncd#3787, if a user/operator wants to prevent this, they should configure more replicas for HA.
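As a rough sketch (not the actual diff in this PR), the change amounts to something like the following in the webhook/controller Deployment pod templates. The Deployment name, labels, namespace, and image below are illustrative placeholders; `cluster-autoscaler.kubernetes.io/safe-to-evict` is the standard cluster-autoscaler annotation.

```yaml
# Illustrative sketch only; not the exact manifest touched by this PR.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tekton-pipelines-webhook        # same idea applies to the controller Deployment
  namespace: tekton-pipelines
spec:
  replicas: 1                           # operators who want HA should raise this (see #3787)
  selector:
    matchLabels:
      app: tekton-pipelines-webhook
  template:
    metadata:
      labels:
        app: tekton-pipelines-webhook
      annotations:
        # "true" tells the cluster autoscaler the pod may be evicted, so its node can be
        # scaled down or drained; the previous "false" value pinned the node.
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    spec:
      containers:
        - name: webhook
          image: example.com/tekton/webhook:latest   # placeholder image
```

Setting the annotation explicitly to "true" (as the PR title suggests) tells the autoscaler the pod may be evicted even if it would otherwise be treated as blocking scale-down.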
/test tekton-pipeline-unit-tests
/test pull-tekton-pipeline-alpha-integration-tests
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: vdemeester. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/assign
/cc @dibyom as I think triggers does the same.
Yeah, we should port this to triggers as well.