conventions: Clarify when workload disruption is allowed #554
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: smarterclayton. The full list of commands accepted by this bot can be found here; the pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
During normal operation, workload disruption is not allowed (such as for CA rotation). Describe the boundaries of disruption, and provide guidelines about what availability components must maintain during normal operation.
Force-pushed from 9629321 to bbe260c
/assign @derekwaynecarr
Seems sane overall, just two comments.
@@ -129,6 +129,17 @@ it is used to manage all elements of the cluster.
* Components that support workloads directly must not disrupt end-user workloads during upgrade or reconfiguration
  * E.g. the upgrade of a network plugin must serve pod traffic without disruption (although tiny increases in latency are allowed)
  * All components that currently disrupt end-user workloads must prioritize addressing those issues, and new components may not be introduced that add disruption
* The platform should not disrupt workloads (reboot nodes) during normal operation
  * If an admin requests a change to the system that has the clear expectation of disruption, the system may cause workload disruption (for example, an upgrade or machine configuration change should trigger a rolling reboot because that is expected)
  * If an admin configures an optional mechanism like machine health checks or automatic upgrades, they are explicitly opting in to workload disruption and this constraint does not apply
(I'd like to see us move towards encouraging automatic updates by default personally)
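As an aside for implementers (not part of this diff): the usual mechanism for keeping "expected" disruption like rolling reboots within bounds is the Kubernetes Eviction API, which rejects evictions that would violate a PodDisruptionBudget. A minimal sketch, assuming a working kubeconfig and a configured client-go clientset; the pod and namespace names are placeholders:

```go
// Sketch: drain-style eviction that honors PodDisruptionBudgets.
// Assumes a working kubeconfig; the names are illustrative only.
package main

import (
	"context"
	"fmt"

	policyv1 "k8s.io/api/policy/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	// Request an eviction rather than deleting the pod directly; the API
	// server rejects the request (HTTP 429) if it would violate a
	// PodDisruptionBudget, which is what keeps a rolling reboot bounded.
	eviction := &policyv1.Eviction{
		ObjectMeta: metav1.ObjectMeta{Name: "example-pod", Namespace: "default"},
	}
	err = clientset.PolicyV1().Evictions("default").Evict(context.TODO(), eviction)
	switch {
	case err == nil:
		fmt.Println("eviction accepted")
	case apierrors.IsTooManyRequests(err):
		fmt.Println("blocked by a PodDisruptionBudget; retry later")
	default:
		panic(err)
	}
}
```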
* During normal operation all APIs and behaviors should remain available and responsive for all single machine failures
* Components that are leader-elected must transfer leadership within a reasonable interval after a single machine failure:
  * Critical components such as the scheduler and core controllers for all workload APIs - 15s
  * Important components for machine and system management that are responsible for recovering from failures - 30-60s
Hm, the shorter that window, though, the more risk of split-brain/Byzantine failure problems, right?
All components in kube have to have etcd consistency (split brain isn't the problem) and are required to handle stale caches (any cache can be stale even without a failure). All core components that have short intervals have to tolerate racing controllers (etcd is the only safe coherence point), and the more important the loop, the more important it is for operations to be predictable. In general yes, none of our leader election actually provides strong isolation, since we don't use lease keys the way you would need to. Election is an optimization, not a protection.
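To make the intervals above concrete: a minimal, hypothetical client-go leader-election setup tuned for the 15s class of components might look like the following. The timing values are illustrative, not normative; the convention only bounds the observable transfer interval, and the lease/controller names are placeholders:

```go
// Sketch: client-go leader election tuned so leadership transfers within
// roughly 15s of a machine failure. Values are illustrative, not normative.
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	hostname, _ := os.Hostname()
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "example-controller", Namespace: "kube-system"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: hostname},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock: lock,
		// A failed leader's lease expires after LeaseDuration, so a standby
		// can acquire it within ~15s of a single machine failure.
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		// Release the lease on clean shutdown so transfer is immediate
		// rather than waiting for the lease to expire.
		ReleaseOnCancel: true,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Run the controller loops here. Per the comment above,
				// election is an optimization, not mutual exclusion: every
				// write must still be safe against a racing ex-leader.
				<-ctx.Done()
			},
			OnStoppedLeading: func() { os.Exit(0) },
		},
	})
}
```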
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now please do so with /close. /lifecycle stale
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle rotten. If this issue is safe to close now please do so with /close. /lifecycle rotten
Rotten issues close after 30d of inactivity. Reopen the issue by commenting /reopen. /close
@openshift-bot: Closed this PR. In response to this: /close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.