
Add Reduced Reboots enhancement #643

Closed
wants to merge 3 commits

Conversation

@sdodson (Member) commented Feb 11, 2021

This enhancement builds upon the MCO node constraints (https://github.com/openshift/enhancements/blob/master/enhancements/update/eus-upgrades-mvp.md#mco---enforce-openshifts-defined-host-component-version-skew-policies) to enable host component version skew of N-2, which allows reboots to be skipped across multiple minor version upgrades.

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign sdodson after the PR has been reviewed.
You can assign the PR to them by writing /assign @sdodson in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sdodson (Member Author) commented Feb 11, 2021

CC @darkmuggle @rphillips @crawford @dcbw @miabbott @mrunalp @zvonkok @wking @vrutkovs
PTAL. I know we still need to reach consensus on the MCO node constraints work, but hopefully we can find a path forward on that while we discuss how we'd move on to the next objective of reducing reboots.

@dhellmann (Contributor)

/label priority/important-soon

@openshift-ci-robot

@dhellmann: The label(s) /label priority/important-soon cannot be applied. These labels are supported: platform/aws, platform/azure, platform/baremetal, platform/google, platform/libvirt, platform/openstack, ga, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, px-approved, docs-approved, qe-approved, downstream-change-needed

In response to this:

/label priority/important-soon

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dhellmann (Contributor)

/priority important-soon

@openshift-ci-robot openshift-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Feb 11, 2021
We will achieve this by delivering upgrade jobs which pause the Worker MachineConfigPool
then upgrade from 4.x to 4.x+1 to 4.x+2. We will run conformance tests from 4.x
after upgrading to 4.x+1 in order to ensure that we continue to provide a baseline
feature set, then again after upgrading to 4.x+2, finally after unpausing the
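For illustration, pausing and unpausing the worker pool comes down to patching the pool's spec.paused field. The sketch below shows that sequence, assuming the default pool name; the upgrade and conformance steps in between are placeholders, not part of this excerpt.

```shell
# Sketch only: pause the worker pool, hop two minor versions, then unpause.
# Pool name is the default; upgrade/conformance steps are placeholders.

# Pause the worker MachineConfigPool so its nodes skip the intermediate reboot.
oc patch machineconfigpool/worker --type merge --patch '{"spec":{"paused":true}}'

# ...upgrade 4.x -> 4.x+1, run the 4.x conformance suite,
# ...upgrade 4.x+1 -> 4.x+2, run the suite again...

# Unpause the pool; the MCO then rolls each worker node through a single reboot.
oc patch machineconfigpool/worker --type merge --patch '{"spec":{"paused":false}}'
```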
Member

I'm not clear on why we couldn't run the 4.x+1 suite on the cluster once the control-plane reached 4.x+1. It seems unlikely to me that the test suite is tightly tied to kubelet-side features, and that when it is, it requires 4.x+1 compute nodes. The distinction isn't critical, because leaving a pool paused long enough to port workloads to new APIs is unwise, but running the tests that match the current control plane seems convenient if it works.

Member Author

Yeah, it's not a straightforward question to answer. We know that under our current upgrade process most operators have to tolerate some incomplete upgrade state without exploding, because the MCO updates nodes (kubelets) toward the very end of the upgrade, but dodging problems there isn't quite the same as running a full test suite. At the same time it's not truly a 4.x+1 cluster, so if we do choose to run the tests we should not be surprised if some portion which leverages new features fails.

I have a feeling that where we'll run into problems is when we upgrade the control plane to 4.x+2 but we still have kubelets at 4.x. This is where we'll likely run into operators which have always assumed a baseline feature set of 4.x+1.

Perhaps the node team has an opinion here? @rphillips @ehashman @harche

Building upon the EUS-to-EUS upgrade MVP work (https://github.com/openshift/enhancements/blob/master/enhancements/update/eus-upgrades-mvp.md#mco---enforce-openshifts-defined-host-component-version-skew-policies)
to allow the MCO to enforce host constraints, we will broaden those constraints to
enable host component version skew.


The MCO will set Upgradeable=False whenever any MachineConfigPool has one or more nodes present which fall outside of a defined list of constraints.

Where can I see the complete list of constraints?

The MCO is not responsible for defining these constraints

Reading both enhancements I see now the following constraints:

  1. MachineConfigPool is paused, MCO sets Upgradeable=False?
  2. OpenShift defines the Kubelet version skew to (N-1 or N-2)
  3. OLM inclusive range
  4. OLM maxKube or maxOCP -> OLM sets Upgradeable=False

Are other entities/operators also allowed to set Upgradeable=False based on self-defined constraints, like "my kernel module will not build on kernel version x.y.z that is coming with update 4.y.z"? For some drivers I may know this; for others I cannot know it.

@zvonkok zvonkok Feb 17, 2021

Another constraint to consider would be maxUnavailable: 0 in an MCP?

Member Author

We're iterating on the design with the MCO team; the original proposal to examine node-specific details like kubelet version or kernel version may not be accepted, in favor of something more abstract like requiring that the version of the MachineConfig templates currently applied to nodes be greater than or equal to some version. At a very high level we're just seeking to ensure that we enforce whatever host component version skew policy we come up with, by setting Upgradeable=False to prevent upgrades that would violate those policies.

In 1) above, the MCO would only set Upgradeable=False when those constraints would be violated by a minor version upgrade; within this context we don't actually care whether the pools are paused, just that they're at some minimum version.

All operators are able to set Upgradeable=False; it's the primary mechanism operators use to assert that the cluster is not in a state which allows minor version upgrades.
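For anyone tracing where that signal shows up, it lands in the Upgradeable condition on the operator's ClusterOperator status. A minimal sketch for reading it, using machine-config as an example operator name:

```shell
# Sketch: read the Upgradeable condition from one ClusterOperator's status.
# "machine-config" is only an example name.
oc get clusteroperator machine-config \
  -o jsonpath='{.status.conditions[?(@.type=="Upgradeable")].status}{"\n"}'
```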


@sdodson That is fine; for such information we have an RFE for machine-os-content to provide it via annotations as part of the extension system. I'm just trying to understand the high-level constraints and the best way to "guard" special resources in a cluster from unintended upgrades.


The annotations are needed to support #357 https://issues.redhat.com/browse/GRPA-2713

@zvonkok zvonkok Feb 17, 2021

Where exactly is Upgradeable=False set if one of the constraints is not met? In the conditions of the ClusterOperator object for the MCO?


@sdodson By all operators do you mean operators that are managed via CVO and OLM? I had a question on my enhancement if OLM managed operators are allowed to create a ClusterOperator object to "control" upgradability.

Member

I had a question on my enhancement if OLM managed operators are allowed to create a ClusterOperator object to "control" upgradability.

I didn't think this was possible before, but it turns out I was wrong: the CVO uses an unfiltered list of all ClusterOperator objects, regardless of whether they are from CVO manifests or added by third parties, when it calculates whether minor bumps are blocked.
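A quick way to see the full, unfiltered set the CVO consults is to list every ClusterOperator currently reporting Upgradeable=False. A sketch, assuming jq is available:

```shell
# Sketch: list all ClusterOperators reporting Upgradeable=False,
# including any objects added by third parties.
oc get clusteroperators -o json \
  | jq -r '.items[]
      | select(.status.conditions[]? | .type == "Upgradeable" and .status == "False")
      | .metadata.name'
```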


@wking Thanks for pointing this out; I've started to read the code as well and am doing some tests with custom MCPs and custom ClusterOperator objects.

1. Starting with a 4.8 cluster, pause Worker MachineConfigPool
1. Upgrade to 4.9
1. Upgrade to 4.10
1. Unpause Worker MachineConfigPool
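As a rough sketch, the two upgrade hops in steps 2 and 3 could look like the following from the CLI; channel names are illustrative, and the real procedure should follow the documented update path (the pause and unpause steps are shown earlier).

```shell
# Sketch of the two minor-version hops between the pause and unpause steps.
# Channel names are illustrative; verify the recommended update path first.
oc patch clusterversion version --type merge --patch '{"spec":{"channel":"fast-4.9"}}'
oc adm upgrade --to-latest

# Wait for the 4.9 control plane to settle, then repeat for 4.10.
oc patch clusterversion version --type merge --patch '{"spec":{"channel":"fast-4.10"}}'
oc adm upgrade --to-latest
```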

What about custom MachineConfigPools?


One example would be real-time kernel workers that are in a separate MCP with additional MachineConfigs.

Member

All pools should inherit from worker, so if users have custom pools, these would be updated when the MCO generates a new rendered config for worker on every step. Custom pools need to be paused as well.
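For completeness, a custom pool is paused with the same kind of patch as the worker pool; a sketch using a hypothetical realtime pool name:

```shell
# Sketch: pause a custom pool the same way as the worker pool.
# "realtime" is a hypothetical pool name.
oc patch machineconfigpool/realtime --type merge --patch '{"spec":{"paused":true}}'
```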

Member Author

Do those inherit from the Worker pool? This does make me wonder if we need to define our policies on a per-pool basis, especially if we're considering that the control plane pool must be updated.

@darkmuggle another point to consider here.

@zvonkok zvonkok Feb 17, 2021

OK, just for clarification: if I pause a custom MCP, this property will not be back-propagated to the parent worker MCP. The MCO will roll out the upgrade on the worker MCP, and since my custom MCP is paused it will wait until it is unpaused?

Member

Custom pools need to be paused as well

I don't think any pools need to be paused. Folks interested in reducing reboots can just pause the pools in which they want to reduce reboots, for the short time it takes to make a few consecutive update hops. I don't understand how pool-pausing interacts between the worker pool and its custom descendants; maybe the MCO folks can speak to that.


Users could have self-defined constraints that are not yet applied or implemented in a piece of software. Either an operator is preventing the upgrade or an admin wishes to do some "day 2" customizations before upgrading, knowing that an upgrade could break something.


What about workloads that run longer (days, weeks) and cannot be checkpointed; wouldn't this also be a reason to pause an MCP? PodDisruptionBudgets with minAvailable=100% would prevent the drain from finishing, but the Node would already be cordoned. The workload may need additional Pods to be scheduled on the node to finish the task.
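For reference, the kind of PodDisruptionBudget described here can be created as follows; the name and selector are purely illustrative.

```shell
# Sketch: a PDB that never permits voluntary eviction, so a drain will hang
# once the node is cordoned. Name and selector are illustrative.
oc create poddisruptionbudget long-job-pdb \
  --selector=app=long-running-job \
  --min-available=100%
```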

Member

PodDisruptionBudgets with minAvailable=100% would prevent the drain from finishing, but the Node would already be cordoned.

There is a balance between wasting resources by not scheduling work on the node while you wait for the slow pod to wrap up, and getting stuck in an endless drain because the new work you let onto the node ends up also being guarded by a PDB or other mechanism and taking longer than the initial slow pod. Currently we weigh in favor of quickest-possible-drain at the expense of underutilizing the available capacity. Folks who want to minimize wasted capacity can set maxUnavailable: 1 (the default) on their MachineConfigPools, although the net effect is just spreading the waste out over a longer wall-clock duration as the nodes roll one after the other. That spreading may still be useful if the overall pool doesn't have much extra capacity to spare. And if folks feel really strongly, they can arrange to fill each node with workloads that will all wrap up around the same point in time, or twiddle tolerations to allow faster work onto the cordoned node until the slow work gets closer to wrapping up.

The workload may need additional Pods to be scheduled on the node to finish the task.

It's up to the actor managing that workload to either say "you know what, we should clear this slow work off so that poor node can update" or "I am unwilling to abandon this slow work, so I'm going to set sufficient tolerations and PDB guards on these new Pods so they can go join the slow worker on the cordoned node and help it push through to completion". I don't think that's something generic OpenShift tooling that doesn't understand the details of your workload can help you with out of the box. In both of these cases, pausing the pool is one way to reduce the likelihood of cordons and node reboots. But MachineConfigs are not the only reason your nodes might get cordoned or rolled. I think it's better for folks who have really expensive, slow work to write their own really smart controllers that can shepherd the work through a generic Kube-node environment, instead of using the coarse pool-pausing knob to try to orchestrate the dance they want to see.


@wking Right, we do not have to lengthen this discussion. Either you want to update or you do not want to. If you want to update, then make a plan for how to fit the upgrade procedure into your daily/weekly business. We can offer the tools and mechanics, but "you" should do "your" homework.


1. Should we make these between specific named versions, i.e. 4.6-4.8 and 4.8-4.10,
or should this be a standard N-2 rule, i.e. 4.6-4.8, 4.7-4.9, 4.8-4.10?


What happens with an MCP which is "never" unpaused? Will the MCO force an upgrade if the MCP violates the N-1 or N-2 constraint?

Member Author

No, we'd just inhibit upgrades that fall outside of our defined policy. That can still be forced around, but anytime an admin chooses to force an upgrade they become responsible for the disaster they create.

Member

Never unpausing a pool will sooner or later fail the nodes out as the kubelets stop getting Kube-API CA rotations, even if the admins don't force an update. We should be very clear that pool pauses are acceptable practice for a few hours, maybe into days, but that the whole time pools are paused, you are not getting CA rotations, bumped pull secrets, RHCOS bugfixes, and all the other good stuff that comes with having an unpaused pool. Folks who want to leave a pool paused for longer should be considering all of those consequences, and deciding for themselves if the reduced disruption is worth the risk.


## Design Details

### Open Questions [optional]

To prevent an upgrade of specific nodes, the high-value assets of a cluster with special resources and constraints (kernel, OS, ...), one could create a custom MCP and set either paused: true or maxUnavailable: 0; this would prevent the MCO from updating that MCP for the next one or two releases, depending on the constraint.

An operator that does preflight checks on a newer kernel or OS could use the new information that is rolled out to the other nodes to check whether the special resource would work after the update.

@sdodson sdodson (Member Author) Feb 17, 2021

I think if an operator has tighter constraints than can be expressed by a minimum version of MachineConfig, kubelet, or kernel, then that logic should live outside of the MCO. The MCO cares about this only because it can affect the host component versions.


@sdodson Understood, this is related to #357. My question around this is how SRO can prevent upgrades and guard the special resources from unwanted upgrades. Some kmods are sensitive to any kernel version change, and some kmods (kABI whitelisted symbols) only care about OS major changes (8.x -> 9.x).

@zvonkok commented Feb 22, 2021

@wking @sdodson I have the following use-case for upgrading. In my case, I can safely upgrade the operator but not the node. If I set Upgradeable=False, neither the operator nor the node will be updated. I would like to have an updated operator but not an updated node. Should we keep it simple here and just say the operator sets Upgradeable=False and does not do any special-casing with pausing an MCP, but lets the cluster upgrade all operators?

The benefit of pausing a special MCP is that all operators and all other MCPs would be up to date. If the admin is "brave" they can still unpause if they are willing to take the risk of having an incompatible "driver".

authors:
- "@sdodson"
reviewers:
- @darkmuggle
@darkmuggle darkmuggle (Contributor) Feb 23, 2021

Suggested change
- @darkmuggle

Contributor

Alex will be driving the MCO side of this discussion.

Comment on lines +73 to +79
Kubernetes defines a version skew policy (https://kubernetes.io/docs/setup/release/version-skew-policy/#kubelet)
which allows kubelet N-2 to be compatible with kube-apiserver version N. At
this point in time OpenShift is not comfortable with the level of upstream testing
and its intersection with the specific features of OpenShift. We should work to
define and implement upstream testing changes which give us an appropriate level
of confidence that N-2 version skew issues would be identified in the community
whenever possible.
Contributor

Note that this is not currently verified in upstream CI -- I raised this for the SIG Architecture agenda tomorrow to ensure that this has test coverage or to clarify the support policy.

@brenton (Contributor) commented Apr 9, 2021

@crawford, @derekwaynecarr, anything blocking this from merging on your end? On the OTA side we're ready for this to merge.

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 8, 2021
@sdodson (Member Author) commented Jul 20, 2021

This was meant to be follow-on work from the MVP work outlined in #762 and #585; however, the actual implementation has moved forward, so I'm going ahead and closing this. I will follow up with changes to existing content for anything that needs to be pulled over.

/close

@openshift-ci bot (Contributor) commented Jul 20, 2021

@sdodson: Closed this PR.

In response to this:

This was meant to be follow-on work from the MVP work outlined in #762 and #585; however, the actual implementation has moved forward, so I'm going ahead and closing this. I will follow up with changes to existing content for anything that needs to be pulled over.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot closed this Jul 20, 2021
Labels

- lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.
- priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.