
node-controller: Support an annotation to hold updates #2163

Closed
wants to merge 1 commit

Conversation

cgwalters
Member

Today the MCO arbitrarily chooses a node to update from the candidates.
We want to allow admins to avoid specific nodes entirely.

(Aside: This replaces the defunct etcd-specific ordering code that
we didn't end up using)

Add an annotation `machineconfiguration.openshift.io/hold` that allows
an external controller (and/or human) to avoid specific nodes.

Related: #2059

@cgwalters
Member Author

Only compile tested locally.

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 15, 2020
@crawford
Contributor

High-level direction looks good to me. Can you split this into multiple commits since this is also pulling out unrelated functionality? This will need an update to the docs.

How does an admin know that machines in the cluster are held? What does the MCO's ClusterOperator object look like when a machine is in the held state? We'll need to make it easy for an admin to know that machines have been held, otherwise people are going to forget and it's going to cause problems with upgrades.

@kikisdeliveryservice
Contributor

kikisdeliveryservice commented Oct 15, 2020

How does an admin know that machines in the cluster are held? What does the MCO's ClusterOperator object look like when a machine is in the held state? We'll need to make it easy for an admin to know that machines have been held, otherwise people are going to forget and it's going to cause problems with upgrades.

I actually think pausing nodes indefinitely (and allowing them to skip an update) undermines the point of pools entirely (whereas specific ordering does not). If the point of a pool is knowing that all members of the pool have a certain rendered-config, skipping an update makes the pool status and information meaningless. If you can do that, what does it mean to be a member of a pool? If a node doesn't want the updates given to a specific pool, why is it a member?

As a UI, the MCO thinks in pools for its status, so how can we reflect a pool as upgraded if all members don't have the same config? In the long term, how do we troubleshoot in a general sense when the state of a node is no longer easily known? If there's an MCO issue, do I have to dig through annotations + rendered configs? What if a node is paused on a required update path?

If the hold is indefinite (and not just a hold this node for last or something) I think we really do start to hit some fundamental questions/concerns and some long-term confusion/debugging risks.

Note: If I'm misunderstanding what the intention of "hold" is please let me know =)

@ashcrow ashcrow requested review from runcom and removed request for ashcrow October 15, 2020 19:22
@kikisdeliveryservice
Contributor

kikisdeliveryservice commented Oct 15, 2020

Also adding a hold (:drum: ) since the MCO team has some questions/concerns that need to be addressed and would also like more background/use case info.

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 15, 2020
@cgwalters
Member Author

cgwalters commented Oct 15, 2020

If the point of a pool is knowing that all members of the pool have a certain rendered-config, skipping an update makes the pool status and information meaningless.

This doesn't skip an update - the node isn't updated and the pool wouldn't "skip" the node. The node would still count as not updated.

The pool would progress until the only nodes remaining are "held" and at that point I think the status should be the same as if the pool was paused.

@kikisdeliveryservice
Contributor

The pool would progress until the only nodes remaining are "held" and at that point I think the status should be the same as if the pool was paused.

I guess this brings me to the question/possible solution of: If we're holding multiple nodes of a pool, why not designate them into a custom pool to preserve the UI/general sense of the MCO dealing in pools? Is there something preventing that or some problem with custom pools that we could address?

I'm starting to compile a list of our questions because I feel like there's some color we're missing in this.

@cgwalters
Member Author

If we're holding multiple nodes of a pool, why not designate them into a custom pool to preserve the UI/general sense of the MCO dealing in pools? Is there something preventing that or some problem with custom pools that we could address?

  • In some cases one might want this for the control plane nodes, and you can't change the pool for those.
  • Switching the pool a node is in currently causes a drain/reboot (because the rendered MC name changes; we might be able to fix that).
  • A number of cases would involve avoiding just one or two nodes; making a pool for a single node is awkward at best, and a pool is a fairly "permanent" construct where this is more of a transient thing IMO.

@cgwalters
Member Author

(Will circle back to this on Monday, but) To expand on the logic here, we're just changing the node controller's `getCandidateMachines()`.

Nothing else calls that function - in particular, when we go to calculate the pool status, notice:

func calculateStatus(pool *mcfgv1.MachineConfigPool, nodes []*corev1.Node) mcfgv1.MachineConfigPoolStatus {
	machineCount := int32(len(nodes))

	updatedMachines := getUpdatedMachines(pool.Spec.Configuration.Name, nodes)
	updatedMachineCount := int32(len(updatedMachines))

	readyMachines := getReadyMachines(pool.Spec.Configuration.Name, nodes)
	readyMachineCount := int32(len(readyMachines))

	unavailableMachines := getUnavailableMachines(nodes)
	unavailableMachineCount := int32(len(unavailableMachines))

...

	allUpdated := updatedMachineCount == machineCount &&
		readyMachineCount == machineCount &&
		unavailableMachineCount == 0

Now we do need to patch this logic, probably to add a "held node" count to the pool status, but the important thing to notice here is that held machines don't count as updated, so the pool won't count as updated until they're un-held. Another way to look at this is that holding a machine is much like marking it unschedulable - the node is still there and still counts.
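To make that concrete, here is a minimal sketch (not the actual diff in this PR; the helper name and the `"true"` value check are assumptions, only the annotation key comes from the PR description) of how candidate selection could skip held nodes while leaving the status math above untouched:

```go
package node

import corev1 "k8s.io/api/core/v1"

// holdAnnotation is the annotation key proposed in this PR.
const holdAnnotation = "machineconfiguration.openshift.io/hold"

// filterHeldNodes is a hypothetical helper: getCandidateMachines could run
// its candidates through it so held nodes are never picked for an update,
// while calculateStatus still sees them and they still count toward machineCount.
func filterHeldNodes(nodes []*corev1.Node) []*corev1.Node {
	var candidates []*corev1.Node
	for _, node := range nodes {
		if node.Annotations[holdAnnotation] == "true" {
			continue // held: skipped as a candidate, but the node still exists in the pool
		}
		candidates = append(candidates, node)
	}
	return candidates
}
```

Because held nodes stay in the node list passed to calculateStatus, updatedMachineCount stays below machineCount and allUpdated stays false until they're un-held.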

@yuqi-zhang
Contributor

yuqi-zhang commented Oct 20, 2020

So here's a general concern I have with this: we should really only allow updating to one version at a time, and since we don't have the concept of queued updates, paused nodes should block the NEXT update because we can't guarantee edges in the future, which somewhat invalidates this path.

Let me give an example:

Today let's say your worker nodes (named A, B, C, D) are on rendered-1, and you add an MC to create rendered-2. Let's say A starts the update first. While A is updating you add a new MC that causes rendered-3. The worker pool will target that immediately, so A, after it finishes the update, will go to rendered-3, BUT because B, C, and D have not yet started, they will skip the in-between step.

Basically:
A: rendered-1 -> rendered-2 -> rendered-3
BCD: rendered-1 -> rendered-3

At a glance that might seem like behaviour we want, and it's fine-ish for our regular operations. But now let's add pausing to the picture and use a more concrete example with actual versions.

Same nodes, installed on 4.4, and now you're upgrading to 4.5. Let's say you pause D, and A, B, and C finish the upgrade without a problem. Now you have
ABC: 4.5
D: 4.4

Now you want to start another upgrade to 4.6. There is no edge from 4.4 to 4.6. A, B, and C, although on 4.5 and able to upgrade, should not. Otherwise we're going to leave D in a limbo where it can never be un-held without risking a failed update. We also don't have a queue to tell D to eventually go 4.4 -> 4.5 -> 4.6, so held nodes would block all updates until no held nodes remain on the previous version, which somewhat defeats the point of node holds because you can't drift more than one version.

Now of course we could add a lot of complex logic here (edge checking, queued updates, or something similar), but if we do go down that path I wouldn't recommend a quick-and-dirty annotation like this; it deserves a more fleshed-out design.

Maybe my assessment is not entirely correct, but I just wanted to add this for discussion.

@yuqi-zhang
Contributor

I guess my comment doesn't really apply to this PR specifically. It's more about the use case where a few paused nodes block all updates forever; I feel we may want a better solution for that case.

@cgwalters
Member Author

This discussion is strongly related to the "worker pool blocking cluster updates across major versions" thread that came up (not sure if it's on GH or just in the internal jira dark matter).

Remember that today the worker pool does not block upgrades. We have had in the past (and will continue to have) potential failure scenarios where a worker is stuck on e.g. 4.4 because of some failure to drain while the control plane is all the way on 4.6, and we end up trying to jump the worker (if the drain gets unstuck) straight from 4.4 to 4.6.

Or the admin could have just straight up paused the worker pool and forgotten about it.

IOW this capability is not introducing a new possible failure mode. (I would probably agree it makes it more likely though)

And yes I think the MCO should learn about major version updates vs minor.

@yuqi-zhang
Contributor

I agree, I'm not against this PR, just that "pausing" seems like a paradigm we should not be supporting for extended lengths of time. You can pause a few nodes for a short time while they're running critical workloads, but you should not be holding them for weeks/months to drift the cluster.

@cgwalters
Member Author

but you should not be holding them for weeks/months to drift the cluster.

I think it's a bit more nuanced than that. Remember the cluster doesn't update on its own, so it's only drift if the admin starts an `oc adm upgrade`. Or to say this another way, a worker node sitting there for months in hold isn't a problem if the control plane isn't upgrading either and the admin is OK with the tradeoff of not applying node security updates. (In some cases admins may wave their hands and say that there are no untrusted workloads, or in other cases those workloads are all isolated as VMs, etc.)

@crawford
Contributor

Catching up a bit...

@kikisdeliveryservice I completely agree with your argument and I think @yuqi-zhang does a good job of spelling out just how nasty things can get. The tentative agreement is that the customers who make use of this feature will enjoy the responsibility of ensuring that the myriad versions of RHCOS that make up their cluster and the updates they apply remain a subgraph of the published Cincinnati graph. The overwhelming majority of cluster administrators will not be able to effectively follow these constraints, so in practice, this holding mechanism will only be useful for temporary actions on a machine (e.g. debugging) and really should not be used across updates. Come to think of it, we should include telemetry and an alerting rule so that administrators are reminded when a machine remains paused for too long.

@openshift-merge-robot
Contributor

@cgwalters: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/okd-e2e-aws | 9bd2397 | link | /test okd-e2e-aws |
| ci/prow/e2e-aws-serial | 9bd2397 | link | /test e2e-aws-serial |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@cgwalters
Member Author

and the updates they apply remain a subgraph of the published Cincinnati graph.

Note that today the worker pool isn't gating, so arbitrary worker node skew can and definitely does happen. IOW there's no relationship today between Cincinnati and the worker pool. There were discussions about fixing that so that the MCO wouldn't allow upgrading the control plane across majors while workers were at N-1 but it doesn't exist today.

I'd agree explicitly supporting it this way makes it more likely for a few nodes to skew, but failure to drain can also cause skew (and that's come up in a few bug reports).

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 16, 2021
@geoberle

I agree, I'm not against this PR, just that "pausing" seems like a paradigm we should not be supporting for extended lengths of time. You can pause a few nodes for a short time while they're running critical workloads, but you should not be holding them for weeks/months to drift the cluster.

I know this is mostly a technical discussion between peers, but I hope you'll allow me to bring in a cluster admin perspective.

We have exactly the situation that our users start critical jobs that run for several hours, and we would like not to interrupt them. On OpenShift 4 it is not apparent how to achieve that. Having multiple machine config pools and controlling upgrades via pause was an initial idea of ours, but it felt a bit like misuse so we discarded it.

Holding those few nodes and tainting them against new pods at the beginning of a cluster upgrade sounds manageable in our case. Since critical workloads are properly labeled, we would automate the hold/unhold actions during cluster upgrades or machineconfig changes.

Since the risk of skew was discussed here as well: node drains are not always reliable, so looking out for blocking pods and incomplete updates on pools is something we need as part of day 2 operations anyway.
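To make the hold/unhold automation mentioned above concrete, here is a rough client-go sketch (the taint key and the annotation value are illustrative; only the annotation key comes from this PR):

```go
package upgradehold

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const holdAnnotation = "machineconfiguration.openshift.io/hold"

// holdAndTaint marks a node so the node controller would skip it and new pods
// stop landing on it while a critical job finishes; a matching unhold/untaint
// step would run once the labeled jobs complete.
func holdAndTaint(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if node.Annotations == nil {
		node.Annotations = map[string]string{}
	}
	node.Annotations[holdAnnotation] = "true"
	node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
		Key:    "example.com/critical-job-running", // hypothetical taint key
		Effect: corev1.TaintEffectNoSchedule,
	})
	_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```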

@cgwalters
Member Author

I know this is mostly a technical discussion between peers, but I hope you'll allow me to bring in a cluster admin perspective.

That's fine, our discussions are public partially for this reason!

We have exactly the situation that our users start critical jobs that run for several hours, and we would like not to interrupt them. On OpenShift 4 it is not apparent how to achieve that.

Broadly speaking, https://kubernetes.io/docs/tasks/run-application/configure-pdb/
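For example, a PodDisruptionBudget covering the job pods blocks the voluntary eviction that the MCO's drain performs before rebooting a node. A sketch using the policy/v1 Go types (the name, namespace, and label selector are made up):

```go
package example

import (
	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// criticalJobPDB disallows any voluntary disruption of pods labeled as
// critical jobs, so a node drain waits until those pods finish on their own.
func criticalJobPDB() *policyv1.PodDisruptionBudget {
	zero := intstr.FromInt(0)
	return &policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: "critical-jobs", Namespace: "batch-jobs"},
		Spec: policyv1.PodDisruptionBudgetSpec{
			MaxUnavailable: &zero,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"workload": "critical"},
			},
		},
	}
}
```

With this in place, the drain retries eviction until the job pods complete rather than interrupting them.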

@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 25, 2021
@cgwalters
Member Author

/remove-lifecycle rotten
came up again today

@openshift-ci-robot openshift-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Apr 9, 2021
@openshift-ci
Contributor

openshift-ci bot commented May 21, 2021

@cgwalters: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-aws-serial | 9bd2397 | link | /test e2e-aws-serial |
| ci/prow/e2e-agnostic-upgrade | 9bd2397 | link | /test e2e-agnostic-upgrade |
| ci/prow/images | 9bd2397 | link | /test images |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@cgwalters
Member Author

We're having a big debate on a related issue in coreos/zincati#245
and one thing I recently realized is that systemd has a nice API to block reboots.

I think the MCO should honor `systemd-inhibit --mode=block`: basically, if an admin uses ssh to add a node reboot blocker via systemd, the MCD should reflect that state into the node via an annotation, and then the node controller knows to automatically avoid it.
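A rough sketch of how the MCD side could detect that, assuming logind's `ListInhibitors` D-Bus call and the godbus library (the filtering logic is illustrative, not what the MCD actually does today):

```go
package main

import (
	"fmt"
	"strings"

	"github.com/godbus/dbus/v5"
)

// inhibitor mirrors logind's ListInhibitors reply entries: (what, who, why, mode, uid, pid).
type inhibitor struct {
	What, Who, Why, Mode string
	UID, PID             uint32
}

// listRebootBlockers returns active "block" mode inhibitors covering shutdown,
// which the MCD could then reflect into a node annotation for the node controller.
func listRebootBlockers() ([]string, error) {
	conn, err := dbus.SystemBus()
	if err != nil {
		return nil, err
	}
	defer conn.Close()

	var inhibitors []inhibitor
	obj := conn.Object("org.freedesktop.login1", "/org/freedesktop/login1")
	if err := obj.Call("org.freedesktop.login1.Manager.ListInhibitors", 0).Store(&inhibitors); err != nil {
		return nil, err
	}

	var blockers []string
	for _, in := range inhibitors {
		if in.Mode == "block" && strings.Contains(in.What, "shutdown") {
			blockers = append(blockers, fmt.Sprintf("%s: %s", in.Who, in.Why))
		}
	}
	return blockers, nil
}
```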

Doing this via systemd has a lot of nice properties:

@openshift-ci
Contributor

openshift-ci bot commented May 25, 2021

@cgwalters: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 25, 2021
@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 23, 2021
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 23, 2021
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci
Contributor

openshift-ci bot commented Oct 23, 2021

@openshift-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
