
Add Reduced Reboots enhancement #643

Closed
wants to merge 3 commits

Conversation

@sdodson (Member) commented Feb 11, 2021

This enhancement builds upon the MCO node constraints (https://github.com/openshift/enhancements/blob/master/enhancements/update/eus-upgrades-mvp.md#mco---enforce-openshifts-defined-host-component-version-skew-policies) to enable host component version skew of N-2, which allows reboots to be skipped across multiple minor version upgrades.

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign sdodson after the PR has been reviewed.
You can assign the PR to them by writing /assign @sdodson in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sdodson (Member Author) commented Feb 11, 2021

CC @darkmuggle @rphillips @crawford @dcbw @miabbott @mrunalp @zvonkok @wking @vrutkovs
PTAL. I know we still need to reach consensus on the MCO node constraints work, but hopefully we can find a path forward on that while we discuss how we'd move on to the next objective of reducing reboots.

@dhellmann (Contributor)

/label priority/important-soon

@openshift-ci-robot

@dhellmann: The label(s) /label priority/important-soon cannot be applied. These labels are supported: platform/aws, platform/azure, platform/baremetal, platform/google, platform/libvirt, platform/openstack, ga, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, px-approved, docs-approved, qe-approved, downstream-change-needed

In response to this:

/label priority/important-soon

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dhellmann (Contributor)

/priority important-soon

@openshift-ci-robot openshift-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Feb 11, 2021
We will achieve this by delivering upgrade jobs which pause the Worker MachineConfigPool
then upgrade from 4.x to 4.x+1 to 4.x+2. We will run conformance tests from 4.x
after upgrading to 4.x+1 in order to ensure that we continue to provide a baseline
feature set, then again after upgrading to 4.x+2, finally after unpausing the
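For illustration, pausing and unpausing the worker pool comes down to patching the pool's spec.paused field. The sketch below shows that sequence, assuming the default pool name; the upgrade and conformance steps in between are placeholders, not part of this excerpt.

```shell
# Sketch only: pause the worker pool, hop two minor versions, then unpause.
# Pool name is the default; upgrade/conformance steps are placeholders.

# Pause the worker MachineConfigPool so its nodes skip the intermediate reboot.
oc patch machineconfigpool/worker --type merge --patch '{"spec":{"paused":true}}'

# ...upgrade 4.x -> 4.x+1, run the 4.x conformance suite,
# ...upgrade 4.x+1 -> 4.x+2, run the suite again...

# Unpause the pool; the MCO then rolls each worker node through a single reboot.
oc patch machineconfigpool/worker --type merge --patch '{"spec":{"paused":false}}'
```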
Member

I'm not clear on why we couldn't run the 4.x+1 suite on the cluster once the control-plane reached 4.x+1. It seems unlikely to me that the test suite is tightly tied to kubelet-side features, and that when it is, it requires 4.x+1 compute nodes. The distinction isn't critical, because leaving a pool paused long enough to port workloads to new APIs is unwise, but running the tests that match the current control plane seems convenient if it works.

Member Author

Yeah, it's not a straightforward question to answer. We know that under our current upgrade process most operators have to tolerate some incomplete upgrade state without exploding, because the MCO updates nodes (kubelets) toward the very end of the upgrade, but dodging problems there isn't quite the same as running a full test suite. At the same time it's not truly a 4.x+1 cluster, so if we do choose to run the tests we should not be surprised if some portion which leverages new features fails.

I have a feeling that where we'll run into problems is when we upgrade the control plane to 4.x+2 but we still have kubelets at 4.x. This is where we'll likely run into operators which have always assumed a baseline feature set of 4.x+1.

Perhaps the node team has an opinion here? @rphillips @ehashman @harche

Building upon the EUS-to-EUS upgrade MVP work (https://github.com/openshift/enhancements/blob/master/enhancements/update/eus-upgrades-mvp.md#mco---enforce-openshifts-defined-host-component-version-skew-policies)
to allow the MCO to enforce host constraints, we will broaden those constraints to
enable host component version skew.


The MCO will set Upgradeable=False whenever any MachineConfigPool has one or more nodes present which fall outside of a defined list of constraints.

Where can I see the complete list of constraints?

The MCO is not responsible for defining these constraints

Reading both enhancements I see now the following constraints:

  1. MachineConfigPool is paused, MCO sets Upgradeable=False?
  2. OpenShift defines the Kubelet version skew to (N-1 or N-2)
  3. OLM inclusive range
  4. OLM maxKube or maxOCP -> OLM sets Upgradeable=False

Are other entities/operators also allowed to set Upgradeable=False based on self-defined constraints, like "my kernel module will not build on kernel version x.y.z that is coming with update 4.y.z"? For some drivers I may know this; for others I cannot know it.

@zvonkok zvonkok Feb 17, 2021

Another constraint to consider would be maxUnavailable: 0 in an MCP?

Member Author

We're iterating on the design with the MCO team; the original proposal to examine node-specific details like kubelet version or kernel version may not be accepted, in favor of something more abstract like requiring that the version of the MachineConfig templates currently applied to nodes be greater than or equal to some version. At a very high level we're just seeking to ensure that we enforce whatever host component version skew policy we come up with, by setting Upgradeable=False to prevent upgrades that would violate those policies.

In 1) above, the MCO would only set Upgradeable=False when those constraints would be violated by a minor version upgrade; within this context we don't actually care whether the pools are paused, just that they're at some minimum version.

All operators are able to set Upgradeable=False; it's the primary mechanism operators use to assert that the cluster is not in a state which allows minor version upgrades.
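For anyone tracing where that signal shows up, it lands in the Upgradeable condition on the operator's ClusterOperator status. A minimal sketch for reading it, using machine-config as an example operator name:

```shell
# Sketch: read the Upgradeable condition from one ClusterOperator's status.
# "machine-config" is only an example name.
oc get clusteroperator machine-config \
  -o jsonpath='{.status.conditions[?(@.type=="Upgradeable")].status}{"\n"}'
```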


@sdodson That is fine; for such information we have an RFE for machine-os-content to provide it via annotations as part of the extension system. I'm just trying to understand the high-level constraints and the best way to "guard" special resources in a cluster from unintended upgrades.


The annotations are needed to support #357 https://issues.redhat.com/browse/GRPA-2713

@zvonkok zvonkok Feb 17, 2021

Where exactly is Upgradeable=False set if one of the constraints is not met? In the conditions of the ClusterOperator object for the MCO?


@sdodson By all operators do you mean operators that are managed via CVO and OLM? I had a question on my enhancement if OLM managed operators are allowed to create a ClusterOperator object to "control" upgradability.

Member

I had a question on my enhancement if OLM managed operators are allowed to create a ClusterOperator object to "control" upgradability.

I didn't think this was possible before, but it turns out I was wrong: the CVO uses an unfiltered list of all ClusterOperator objects, regardless of whether they are from CVO manifests or added by third parties, when it calculates whether minor bumps are blocked.
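A quick way to see the full, unfiltered set the CVO consults is to list every ClusterOperator currently reporting Upgradeable=False. A sketch, assuming jq is available:

```shell
# Sketch: list all ClusterOperators reporting Upgradeable=False,
# including any objects added by third parties.
oc get clusteroperators -o json \
  | jq -r '.items[]
      | select(.status.conditions[]? | .type == "Upgradeable" and .status == "False")
      | .metadata.name'
```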


@wking Thanks for pointing this out; I've started to read the code as well and am doing some tests with custom MCPs and custom ClusterOperator objects.

1. Starting with a 4.8 cluster, pause Worker MachineConfigPool
1. Upgrade to 4.9
1. Upgrade to 4.10
1. Unpause Worker MachineConfigPool
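As a rough sketch, the two upgrade hops in steps 2 and 3 could look like the following from the CLI; channel names are illustrative, and the real procedure should follow the documented update path (the pause and unpause steps are shown earlier).

```shell
# Sketch of the two minor-version hops between the pause and unpause steps.
# Channel names are illustrative; verify the recommended update path first.
oc patch clusterversion version --type merge --patch '{"spec":{"channel":"fast-4.9"}}'
oc adm upgrade --to-latest

# Wait for the 4.9 control plane to settle, then repeat for 4.10.
oc patch clusterversion version --type merge --patch '{"spec":{"channel":"fast-4.10"}}'
oc adm upgrade --to-latest
```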

What about custom MachineConfigPools?


One example would be real-time kernel workers that are in a separate MCP with additional MachineConfigs.

Member

All pools should inherit from worker, so if users have custom pools, these would be updated when the MCO generates a new rendered config for worker on every step. Custom pools need to be paused as well.
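For completeness, a custom pool is paused with the same kind of patch as the worker pool; a sketch using a hypothetical realtime pool name:

```shell
# Sketch: pause a custom pool the same way as the worker pool.
# "realtime" is a hypothetical pool name.
oc patch machineconfigpool/realtime --type merge --patch '{"spec":{"paused":true}}'
```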

Member Author

Do those inherit from the Worker pool? This does make me wonder if we need to define our policies on a per-pool basis, especially if we're considering that the control plane pool must be updated.

@darkmuggle another point to consider here.

@zvonkok zvonkok Feb 17, 2021

OK, just for clarification: if I pause a custom MCP, this property will not be back-propagated to the parent worker MCP. The MCO will roll out the upgrade on the worker MCP, and since my custom MCP is paused it will wait until it is unpaused?

Member

Custom pools need to be paused as well

I don't think any pools need to be paused. Folks interested in reducing reboots can just pause the pools in which they want to reduce reboots, for the short time it takes to make a few consecutive update hops. I don't understand how pool-pausing interacts between the worker pool and its custom descendants; maybe the MCO folks can speak to that.


Users could have self-defined constraints that are not yet applied or implemented in a piece of software. Either an operator is preventing the upgrade or an admin wishes to do some "day 2" customizations before upgrading, knowing that an upgrade could break something.


What about workloads that run longer (days, weeks) and cannot be checkpointed; wouldn't this also be a reason to pause an MCP? PodDisruptionBudgets with minAvailable=100% would prevent the drain from finishing, but the Node would already be cordoned. The workload may need additional Pods to be scheduled on the node to finish the task.
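For reference, the kind of PodDisruptionBudget described here can be created as follows; the name and selector are purely illustrative.

```shell
# Sketch: a PDB that never permits voluntary eviction, so a drain will hang
# once the node is cordoned. Name and selector are illustrative.
oc create poddisruptionbudget long-job-pdb \
  --selector=app=long-running-job \
  --min-available=100%
```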

Member

PodDisruptionBudgets with minAvailable=100% would prevent the drain from finishing, but the Node would already be cordoned.

There is a balance between wasting resources by not scheduling work on the node while you wait for the slow pod to wrap up, and getting stuck in an endless drain because the new work you let onto the node ends up also being guarded by a PDB or other mechanism and taking longer than the initial slow pod. Currently we weigh in favor of quickest-possible-drain at the expense of underutilizing the available capacity. Folks who want to minimize wasted capacity can set maxUnavailable: 1 (the default) on their MachineConfigPools, although the net effect is just spreading the waste out over a longer wall-clock duration as the nodes roll one after the other. That spreading may still be useful if the overall pool doesn't have much extra capacity to spare. And if folks feel really strongly, they can arrange to fill each node with workloads that will all wrap up around the same point in time, or twiddle tolerations to allow faster work onto the cordoned node until the slow work gets closer to wrapping up.

The workload may need additional Pods to be scheduled on the node to finish the task.

It's up to the actor managing that workload to either say "you know what, we should clear this slow work off so that poor node can update" or "I am unwilling to abandon this slow work, so I'm going to set sufficient tolerations and PDB guards on these new Pods so they can go join the slow worker on the cordoned node and help it push through to completion". I don't think that's something generic OpenShift tooling that doesn't understand the details of your workload can help you with out of the box. In both of these cases, pausing the pool is one way to reduce the likelihood of cordons and node reboots. But MachineConfigs are not the only reason your nodes might get cordoned or rolled. I think it's better for folks who have really expensive, slow work to write their own really smart controllers that can shepherd the work through a generic Kube-node environment, instead of using the coarse pool-pausing knob to try to orchestrate the dance they want to see.


@wking Right, we do not have to lengthen this discussion. Either you want to update or you do not want to. If you want to update, then make a plan for how to fit the upgrade procedure into your daily/weekly business. We can offer the tools and mechanics, but "you" should do "your" homework.


1. Should we make these between specific named versions, i.e. 4.6-4.8 and 4.8-4.10,
or should this be a standard N-2 rule, i.e. 4.6-4.8, 4.7-4.9, 4.8-4.10?


What happens with an MCP which is "never" unpaused? Will the MCO force an upgrade if the MCP violates the N-1 or N-2 constraint?

Member Author

No, we'd just inhibit upgrades that fall outside of our defined policy. That can still be forced around, but anytime an admin chooses to force an upgrade they become responsible for the disaster they create.

Member

Never unpausing a pool will sooner or later fail the nodes out as the kubelets stop getting Kube-API CA rotations, even if the admins don't force an update. We should be very clear that pool pauses are acceptable practice for a few hours, maybe into days, but that the whole time pools are paused, you are not getting CA rotations, bumped pull secrets, RHCOS bugfixes, and all the other good stuff that comes with having an unpaused pool. Folks who want to leave a pool paused for longer should be considering all of those consequences, and deciding for themselves if the reduced disruption is worth the risk.


## Design Details

### Open Questions [optional]

To prevent an upgrade of specific nodes, the high-value assets of a cluster with special resources and constraints (kernel, OS, ...), one could create a custom MCP and set either paused: true or maxUnavailable: 0; this would prevent the MCO from updating that MCP for the next one or two releases, depending on the constraint.

An operator that does preflight checks on a newer kernel or OS could use the new information that is rolled out to the other nodes to check whether the special resource would work after the update.

@sdodson sdodson (Member Author) Feb 17, 2021

I think if an operator has tighter constraints than can be expressed by a minimum version of MachineConfig, kubelet, or kernel, then that logic should live outside of the MCO. The MCO cares about this only because it can affect the host component versions.


@sdodson Understood, this is related to #357. My question around this is how SRO can prevent upgrades and guard the special resources from unwanted upgrades. Some kmods are sensitive to any kernel version change, and some kmods (kABI whitelisted symbols) only care about OS major changes (8.x -> 9.x).

@zvonkok commented Feb 22, 2021

@wking @sdodson I have the following use-case for upgrading. In my case, I can safely upgrade the operator but not the node. If I set Upgradeable=False, neither the operator nor the node will be updated. I would like to have an updated operator but not an updated node. Should we keep it simple here and just say the operator sets Upgradeable=False and does not do any special-casing with pausing an MCP, but lets the cluster upgrade all operators?

The benefit of pausing a special MCP is that all operators and all other MCPs would be up to date. If the admin is "brave" they can still unpause if they are willing to take the risk of having an incompatible "driver".

authors:
- "@sdodson"
reviewers:
- @darkmuggle
@darkmuggle darkmuggle (Contributor) Feb 23, 2021

Suggested change
- @darkmuggle

Contributor

Alex will be driving the MCO side of this discussion.

Comment on lines +73 to +79
Kubernetes defines a version skew policy (https://kubernetes.io/docs/setup/release/version-skew-policy/#kubelet)
which allows kubelet N-2 to be compatible with kube-apiserver version N. At
this point in time OpenShift is not comfortable with the level of upstream testing
and its intersection with the specific features of OpenShift. We should work to
define and implement upstream testing changes which give us an appropriate level
of confidence that N-2 version skew issues would be identified in the community
whenever possible.
Contributor

Note that this is not currently verified in upstream CI -- I raised this for the SIG Architecture agenda tomorrow to ensure that this has test coverage or to clarify the support policy.

@brenton (Contributor) commented Apr 9, 2021

@crawford, @derekwaynecarr, anything blocking this from merging on your end? On the OTA side we're ready for this to merge.

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 8, 2021
@sdodson (Member Author) commented Jul 20, 2021

This was meant to be follow-on work from the MVP work outlined in #762 and #585; however, the actual implementation has moved forward, so I'm going ahead and closing this. I will follow up with changes to existing content for anything that needs to be pulled over.

/close

@openshift-ci bot (Contributor) commented Jul 20, 2021

@sdodson: Closed this PR.

In response to this:

This was meant to be follow-on work from the MVP work outlined in #762 and #585; however, the actual implementation has moved forward, so I'm going ahead and closing this. I will follow up with changes to existing content for anything that needs to be pulled over.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot closed this Jul 20, 2021
Labels

- lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.
- priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.