conventions: Clarify when workload disruption is allowed #554
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: smarterclayton. The full list of commands accepted by this bot can be found here; the pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
During normal operation, workload disruption is not allowed (such as for CA rotation). Describe the boundaries of disruption, and provide guidelines about what availability components must maintain during normal operation.
Force-pushed from 9629321 to bbe260c
/assign @derekwaynecarr
Seems sane overall, just two comments.
@@ -129,6 +129,17 @@ it is used to manage all elements of the cluster.
* Components that support workloads directly must not disrupt end-user workloads during upgrade or reconfiguration
  * E.g. the upgrade of a network plugin must serve pod traffic without disruption (although tiny increases in latency are allowed)
  * All components that currently disrupt end-user workloads must prioritize addressing those issues, and new components may not be introduced that add disruption
* The platform should not disrupt workloads (reboot nodes) during normal operation
  * If an admin requests a change to the system that has the clear expectation of disruption, the system may cause workload disruption (for example, an upgrade or machine configuration change should trigger a rolling reboot because that is expected)
  * If an admin configures an optional mechanism like machine health checks or automatic upgrades, they are explicitly opting in to workload disruption and this constraint does not apply
(I'd like to see us move towards encouraging automatic updates by default personally)
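As an aside for implementers (not part of this diff): the usual mechanism for keeping "expected" disruption like rolling reboots within bounds is the Kubernetes Eviction API, which rejects evictions that would violate a PodDisruptionBudget. A minimal sketch, assuming a working kubeconfig and a configured client-go clientset; the pod and namespace names are placeholders:

```go
// Sketch: drain-style eviction that honors PodDisruptionBudgets.
// Assumes a working kubeconfig; the names are illustrative only.
package main

import (
	"context"
	"fmt"

	policyv1 "k8s.io/api/policy/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	// Request an eviction rather than deleting the pod directly; the API
	// server rejects the request (HTTP 429) if it would violate a
	// PodDisruptionBudget, which is what keeps a rolling reboot bounded.
	eviction := &policyv1.Eviction{
		ObjectMeta: metav1.ObjectMeta{Name: "example-pod", Namespace: "default"},
	}
	err = clientset.PolicyV1().Evictions("default").Evict(context.TODO(), eviction)
	switch {
	case err == nil:
		fmt.Println("eviction accepted")
	case apierrors.IsTooManyRequests(err):
		fmt.Println("blocked by a PodDisruptionBudget; retry later")
	default:
		panic(err)
	}
}
```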
* During normal operation all APIs and behaviors should remain available and responsive for all single machine failures
* Components that are leader-elected must transfer leadership within a reasonable interval after a single machine failure:
  * Critical components such as the scheduler and core controllers for all workload APIs - 15s
  * Important components for machine and system management that are responsible for recovering from failures - 30-60s
Hm, the shorter that window, though, the more risk of split-brain/Byzantine failure problems, right?
All components in kube have to have etcd consistency (split brain isn't the problem) and are required to handle stale caches (any cache can be stale even without a failure). All core components that have short intervals have to tolerate racing controllers (etcd is the only safe coherence point), and the more important the loop, the more important it is for operations to be predictable. In general yes, none of our leader election actually provides strong isolation, since we don't use lease keys the way you would need to. Election is an optimization, not a protection.
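To make the intervals above concrete: a minimal, hypothetical client-go leader-election setup tuned for the 15s class of components might look like the following. The timing values are illustrative, not normative; the convention only bounds the observable transfer interval, and the lease/controller names are placeholders:

```go
// Sketch: client-go leader election tuned so leadership transfers within
// roughly 15s of a machine failure. Values are illustrative, not normative.
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	hostname, _ := os.Hostname()
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "example-controller", Namespace: "kube-system"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: hostname},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock: lock,
		// A failed leader's lease expires after LeaseDuration, so a standby
		// can acquire it within ~15s of a single machine failure.
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		// Release the lease on clean shutdown so transfer is immediate
		// rather than waiting for the lease to expire.
		ReleaseOnCancel: true,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Run the controller loops here. Per the comment above,
				// election is an optimization, not mutual exclusion: every
				// write must still be safe against a racing ex-leader.
				<-ctx.Done()
			},
			OnStoppedLeading: func() { os.Exit(0) },
		},
	})
}
```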
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now please do so with /close. /lifecycle stale
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle rotten. If this issue is safe to close now please do so with /close. /lifecycle rotten
Rotten issues close after 30d of inactivity. Reopen the issue by commenting /reopen. /close
@openshift-bot: Closed this PR. In response to this: /close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.