
conventions: Clarify when workload disruption is allowed #554

Closed

Conversation

smarterclayton (Contributor)

During normal operation, workload disruption is not allowed (such as for CA rotation). Describe the boundaries of disruption, and provide guidelines about the level of availability components must maintain during normal operation.

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Dec 1, 2020
@smarterclayton (Contributor, Author)

/assign @derekwaynecarr

@cgwalters (Member) left a comment:


Seems sane overall, just two comments.

@@ -129,6 +129,17 @@ it is used to manage all elements of the cluster.
* Components that support workloads directly must not disrupt end-user workloads during upgrade or reconfiguration
* E.g. the upgrade of a network plugin must serve pod traffic without disruption (although tiny increases in latency are allowed)
* All components that currently disrupt end-user workloads must prioritize addressing those issues, and new components may not be introduced that add disruption
* The platform should not disrupt workloads (reboot nodes) during normal operation
* If an admin requests a change to the system that has the clear expectation of disruption, the system may cause workload disruption (for example, an upgrade or machine configuration change should trigger a rolling reboot because that is expected)
* If an admin configures an optional mechanism like machine health checks or automatic upgrades, they are explicitly opting in to workload disruption and this constraint does not apply
cgwalters (Member):

(I'd like to see us move towards encouraging automatic updates by default personally)
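
To make the "expected disruption" bullets above concrete: when a machine configuration change triggers a rolling reboot, node drains honor PodDisruptionBudgets, so a component that must stay available through that disruption can declare its tolerance explicitly. Below is a minimal sketch using client-go and the policy/v1 API; the function, names, namespace, and labels are hypothetical illustrations, not part of this PR.

```go
package conventions

import (
	"context"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"k8s.io/client-go/kubernetes"
)

// ensurePDB (hypothetical) declares that at most one replica of the operand
// may be evicted at a time, so an expected disruption such as the rolling
// reboot from a machine configuration change drains nodes without taking
// the component below its required availability.
func ensurePDB(ctx context.Context, client kubernetes.Interface) error {
	maxUnavailable := intstr.FromInt(1)
	pdb := &policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "example-operand", // hypothetical name
			Namespace: "example-namespace",
		},
		Spec: policyv1.PodDisruptionBudgetSpec{
			MaxUnavailable: &maxUnavailable,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"app": "example-operand"},
			},
		},
	}
	_, err := client.PolicyV1().PodDisruptionBudgets(pdb.Namespace).Create(ctx, pdb, metav1.CreateOptions{})
	return err
}
```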

* During normal operation all APIs and behaviors should remain available and responsive for all single machine failures
* Components that are leader-elected must transfer leadership within a reasonable interval after a single machine failure:
* Critical components such as scheduler and core controllers for all workload APIs - 15s
* Important components for machine and system management that are responsible for recovering from failures - 30-60s
cgwalters (Member):

Hm the shorter that window though the more risk of split brain/byzantine failure problems right?

smarterclayton (Contributor, Author):

All components in kube have to have etcd consistency (split brain isn't the problem) and are required to handle stale caches (any cache can be stale even without any failure). All core components that have short intervals have to tolerate racing controllers (etcd is the only safe coherence spot), and the more important the loop the more important it is for operations to be predictable. In general yes, none of our leader election actually provides strong isolation since we don't use lease keys the way you would need to. Election is an optimization, not a protection.
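
For reference, the 15s handoff target in the hunk above maps onto the standard client-go leader-election knobs: a crashed leader is replaced roughly one LeaseDuration after its last renewal, and a cleanly cancelled leader can release the lease immediately. A minimal sketch assuming the Lease-based lock from k8s.io/client-go; the function, lock name, and namespace are hypothetical.

```go
package conventions

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// runElected (hypothetical) runs fn only while holding the Lease. With these
// timings a silent leader is replaced roughly LeaseDuration (15s) after its
// last renewal, and a cleanly cancelled leader hands off immediately.
func runElected(ctx context.Context, client kubernetes.Interface, id string, fn func(context.Context)) {
	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Name:      "example-controller", // hypothetical lock name
			Namespace: "example-namespace",
		},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock,
		LeaseDuration:   15 * time.Second, // how long candidates wait before taking over from a non-renewing leader
		RenewDeadline:   10 * time.Second, // the leader abdicates if it cannot renew within this window
		RetryPeriod:     2 * time.Second,  // how often acquire/renew is attempted
		ReleaseOnCancel: true,             // release the lease on clean shutdown so handoff is immediate
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: fn,
			OnStoppedLeading: func() {
				// Leadership lost: stop work promptly. Correctness still comes
				// from etcd's optimistic concurrency, not from holding the lease.
			},
		},
	})
}
```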
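
And to the point that election is an optimization rather than a protection: correctness comes from etcd's optimistic concurrency, so a controller racing another writer (or acting on a stale cache) should expect conflicts and retry from a fresh read. A minimal sketch using client-go's RetryOnConflict helper; annotateNode and its arguments are hypothetical.

```go
package conventions

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// annotateNode (hypothetical) shows the read-mutate-write pattern guarded by
// resourceVersion. If another controller wrote first, the apiserver returns
// a conflict and the closure re-reads and retries, so a racing peer or a
// stale cache cannot silently clobber state.
func annotateNode(ctx context.Context, client kubernetes.Interface, nodeName, key, value string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if node.Annotations == nil {
			node.Annotations = map[string]string{}
		}
		node.Annotations[key] = value
		_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
		return err
	})
}
```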

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-ci-robot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) on Mar 15, 2021
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-ci-robot added the lifecycle/rotten label (Denotes an issue or PR that has aged beyond stale and will be auto-closed.) and removed the lifecycle/stale label on Apr 15, 2021
@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci bot (Contributor) commented May 15, 2021

@openshift-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci bot closed this May 15, 2021