Propose new controller for pausing MHC during cluster upgrades #834
Conversation
Signed-off-by: Marc Sluiter <msluiter@redhat.com>
/cc @beekhof @JoelSpeed
/uncc
I really like the concept and I think this is an important feature to have!
### Goals

- Automatically pause all machineHealthChecks during cluster upgrades
Does OCP upgrade nodes one by one?
If yes, I'd say a perfect solution would know whether a specific node is unhealthy due to the upgrade or due to something else, so we won't miss remediations that can still take place during an upgrade (see the lease idea in the alternatives comment).
Let's try walking before we attempt to run
It doesn't contradict that.
It helps to know the vision / end goal so we can:
- know the limitations of the proposed solution
- take steps in the right direction
Added this to non-goals and alternatives.
I think you actually care about the machineConfigPool being in a progressing state rather than the upgrade state. A full hour of the upgrade has nothing to do with host reboots, and if we deliver on paused-worker-pool upgrades, most nodes in a cluster won't even reboot some of the time.
I think you should instead watch for nodes that have been cordoned and ignore them in some manner. That seems more broadly applicable.
@JoelSpeed do you have an opinion on this?
I think you actually care about the machineConfigPool being in a progressing state rather than the upgrade state.
I agree, if we can track the MCP upgrade rather than the whole cluster upgrade, that will be a smaller time window. Perhaps the easiest way to do this is to actually track the `Progressing` state of the ClusterOperator for machine config?
I think you should instead watch for nodes that have been cordoned and ignore them in some manner. That seems more broadly applicable.
This seems more like the node maintenance lease idea that has floated around before, kubernetes/enhancements#1411, this is definitely something that could be more intuitive and avoid the dependency on upgrade specific watching, but is a much larger piece of work.
In terms of just watching cordoned nodes, to me this makes sense as a signal that a node is going into maintenance of some description. Need to think a bit more about if this would cause other issues 🤔
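To make the ClusterOperator idea above concrete, here is a minimal sketch of the check, assuming the controller keys off the `machine-config` ClusterOperator and uses the generated openshift/client-go config clientset; the wiring and the choice of operator to watch are illustrative assumptions, not part of the proposal:

```go
// Sketch: pause MHCs while the machine-config ClusterOperator reports Progressing=True.
package main

import (
	"context"
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
)

// isProgressing reports whether the condition list contains Progressing=True.
func isProgressing(conds []configv1.ClusterOperatorStatusCondition) bool {
	for _, c := range conds {
		if c.Type == configv1.OperatorProgressing {
			return c.Status == configv1.ConditionTrue
		}
	}
	return false
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := configclient.NewForConfigOrDie(cfg)

	// Assumption from this thread: the "machine-config" ClusterOperator is
	// Progressing=True while MachineConfigPools are rolling out new configs.
	co, err := client.ConfigV1().ClusterOperators().Get(context.TODO(), "machine-config", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("would pause MHCs:", isProgressing(co.Status.Conditions))
}
```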
### Upgrade / Downgrade Strategy

The new controller lives in the machine-api-operator image so the upgrades will be driven by the CVO which will fetch …
Is there anything special that needs to be done when the upgraded node is the one that runs this new controller?
E.g. if it's the first node to be upgraded, there's a race between this controller pausing all MHCs and whatever performs the upgrade rebooting the node.
If a reboot takes place before this controller has paused all MHCs, and another node is upgraded at the same time (is that possible?), then once MHC comes up, together with this new controller, there's another race, and MHC might remediate the other node before this new controller was able to pause all MHCs.
Maybe I've gone too far :)
I can't follow tbh... why does it make a difference if the new controller runs on an updated node or an old one? What kind of race is there? 🤔 Maybe remediation isn't paused in time for all nodes during the upgrade which introduces this new controller, but is that a problem?
I'll try to describe the flow (which is probably a super corner case, it's just an example):
- an MHC CR is created by an admin
- an upgrade starts
- the node that runs the `machine-api-controllers` pod (node A) is the first to be upgraded, and is rebooted before the proposed controller had a chance to pause all MHCs
- the `machine-api-controllers` pod is scheduled on another node (node B)
- the MHC controller comes up (on node B) before the proposed controller, and tries to remediate node A even though it's only down for the upgrade
I don't think there is anything we can do about it. I also think it isn't an issue, because the same can happen today already? 🤔
Let's include a goal along the lines of "avoiding worst case behaviour in time for 4.9".
… and fixed some typos etc Signed-off-by: Marc Sluiter <msluiter@redhat.com>
Let's include a goal along the lines of "avoiding worst case behaviour in time for 4.9"
And indicate somewhere that
- we are limited by the scope of the pause functionality available upstream
- we will look to improve the feature in future releases
Done.
@dhellmann @hardys Can you please take a look at this enhancement?
During cluster upgrades nodes get temporarily unhealthy, which might trigger remediation when a machineHealthCheck resource is targeting those nodes, and they don't get healthy before the configured timeout. This remediation delays the cluster upgrade significantly especially on bare metal clusters, and causes needless interruption of customer workloads while they are transferred to other nodes.
Is this a problem we've seen in practice? Do we know the cause?
This came up with a customer case, so we have seen it in the wild.
The issue was that the MCO reboot took longer than the configured timeout for the MHC. I don't know the full details of that MHC, but I'm assuming they had a reasonable timeout configured. We should double check this and add some context here.
Is that just misconfiguration? I configured an MHC without properly understanding the upper bound of the time required to reboot a host and for the Kubelet to start posting Ready. Should our docs around adding MHCs make it clear you'll probably want to measure the time it takes your hosts to perform an MCO-initiated reboot before setting these values?
Do we really need different rules applied during upgrades versus non-upgrade times? If I apply an MC change that triggers a reboot, that wouldn't be during an upgrade but would trigger the same behavior.
Do we really need different rules applied during upgrades versus non-upgrade times? If I apply an MC change that triggers a reboot, that wouldn't be during an upgrade but would trigger the same behavior.
I think I'm kind of leaning in this direction myself. If your hosts typically take 30 minutes to reboot, set the MHC timeout to 45 minutes. IMO, MHC isn't meant for 'emergency' situations where a host has to be replaced ASAP. It's more of a nice-to-have when you have programmable infra, so if a host dies in the middle of the night, it will be replaced and you won't slowly lose machines over time.
Do we really need different rules applied during upgrades versus non-upgrade times? If I apply an MC change that triggers a reboot, that wouldn't be during an upgrade but would trigger the same behavior.
Fair point 👍🏼 We could catch this by looking at the machineConfigPool's `MachineConfigPoolUpdating` condition though (suggested at another place already) instead of the cluster version's status, right?
If your hosts typically take 30 minutes to reboot, set the MHC timeout to 45 minutes. IMO, MHC isn't meant for 'emergency' situations
That's also a fair point, but I know we have users / customers who want fencing as fast as possible.
@beekhof @JoelSpeed WDYT?
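For reference, a sketch of the per-pool check suggested here, assuming the machine-config-operator `v1` API types; the function is illustrative only:

```go
package mhcpause

import (
	mcfgv1 "github.com/openshift/machine-config-operator/pkg/apis/machineconfiguration.openshift.io/v1"
	corev1 "k8s.io/api/core/v1"
)

// poolIsUpdating reports whether a MachineConfigPool is currently rolling out a
// new rendered config (condition Updating=True). This would also cover reboots
// triggered by an MC change outside of a full cluster upgrade.
func poolIsUpdating(pool *mcfgv1.MachineConfigPool) bool {
	for _, cond := range pool.Status.Conditions {
		if cond.Type == mcfgv1.MachineConfigPoolUpdating {
			return cond.Status == corev1.ConditionTrue
		}
	}
	return false
}
```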
MHC isn't meant for 'emergency' situations where a host has to be replaced ASAP. It's more of a nice-to-have when you have programmable infra, so if a host dies in the middle of the night, it will be replaced and you won't slowly lose machines over time.
This is only true for the compute capacity of the unhealthy node, but not for the workload that was assigned to that node.
If you have stateful workloads running on the unhealthy node, you'll have downtime until remediation takes place.
45 minutes of downtime for a user application is a lot, IMO, and we saw this need coming from customers comparing OCP to other products.
45 minutes of downtime for a user application is a lot, IMO, and we saw this need coming from customers comparing OCP to other products.
If MHC is disabled for one reason or another during upgrades, this situation could persist indefinitely (e.g., you're upgrading host 1, host 2 dies, there's a problem upgrading host 1). Similarly, PDBs can prevent MHC from successfully deleting a failed node indefinitely.
Let's try to put some context around the MHC use case / user story. If you weren't running MHC, and you had a random worker go down in the middle of the night, you'd need to wait X minutes to be paged, for someone to log in, etc., and decide to remediate the node manually. That's probably going to take 30 minutes or so, and that's only if you decide to page for a single node failure or have 24/7 ops.
MHC is not a cluster-saving or application-saving utility. It's to slowly replace failed nodes. Pretty much everywhere but bare metal won't encounter this issue anyhow (other platforms reboot within a couple of minutes), so like I said elsewhere, if we're going to add something like this, it should be opt-in.
### Risks and Mitigations

Nothing we are aware of today
Hosts that are not successfully rebooted as part of the upgrade won't be remediated.
Is there a race condition between the new controller noticing that the upgrade is in progress and a machine appearing to go unhealthy and remediation being triggered? How likely is the remediation controller to win that race?
Hosts that are not successfully rebooted as part of the upgrade won't be remediated.
Depends on the status of the clusterVersion (or machineConfigPool, as proposed elsewhere): will it continue to indicate an ongoing upgrade forever in case not all hosts rebooted successfully? 🤔
I'm not sure I understand the race... do you mean a cluster upgrade starts and a host gets unhealthy at the same time for another reason? Or because of the cluster upgrade? The latter won't be an issue IMHO, because remediation only starts after some configured timeout, so there is enough time to pause the MHC.
The value of the annotation is supposed to be empty. This means that the annotation might already be set by someone else when the new controller wants to set it as well. That introduces the question of what to do in this case after the cluster upgrade: remove the annotation, or keep it? In the latter case: how to keep track of whether the controller set the annotation or not (e.g. in another annotation?)
If a user has already paused remediation, I don't think we want an upgrade to remove that setting. I think we should use a secondary annotation to indicate that this automated controller has added the pause so that it knows to only remove the pause if it sees both annotations.
The value of the annotation is supposed to be empty.
Is this documented somewhere? Is this enforced anywhere?
I had imagined we would set the value to the owner of the pause, as it were, so the upgrade controller would put its name as the value, and then only remove the annotation if the value matches (this would allow others to pause using their own value as well)
I think we should use a secondary annotation to indicate that this automated controller has added the pause
ack, agree, will update
The value of the annotation is supposed to be empty.
Is this documented somewhere? Is this enforced anywhere?
No and no, but that was the answer in the sig cluster api Slack when I asked about it.
https://kubernetes.slack.com/archives/C8TSNPY4T/p1626101734178700
also see #827 (comment)
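A minimal sketch of the two-annotation bookkeeping converged on in this sub-thread, assuming the upstream pause annotation `cluster.x-k8s.io/paused` from #827 and the openshift/api machine v1beta1 types; the marker annotation name is hypothetical:

```go
package mhcpause

import (
	machinev1 "github.com/openshift/api/machine/v1beta1"
)

const (
	// pauseAnnotation is the annotation the MHC controller honours (see #827);
	// its value is expected to be empty by convention, not enforcement.
	pauseAnnotation = "cluster.x-k8s.io/paused"
	// pausedByUpgradeAnnotation is a hypothetical marker recording that this
	// controller, not a user, added the pause.
	pausedByUpgradeAnnotation = "machine.openshift.io/mhc-paused-for-upgrade"
)

// pauseForUpgrade adds the pause annotation unless someone else already set it,
// and records ownership in the marker annotation. It returns true if the object
// was modified and needs to be updated on the API server.
func pauseForUpgrade(mhc *machinev1.MachineHealthCheck) bool {
	if mhc.Annotations == nil {
		mhc.Annotations = map[string]string{}
	}
	if _, exists := mhc.Annotations[pauseAnnotation]; exists {
		// Paused by a user or another controller: leave it alone.
		return false
	}
	mhc.Annotations[pauseAnnotation] = ""
	mhc.Annotations[pausedByUpgradeAnnotation] = ""
	return true
}

// unpauseAfterUpgrade removes the pause only when both annotations are present,
// i.e. only when this controller was the one that paused the MHC.
func unpauseAfterUpgrade(mhc *machinev1.MachineHealthCheck) bool {
	if _, ours := mhc.Annotations[pausedByUpgradeAnnotation]; !ours {
		return false
	}
	delete(mhc.Annotations, pauseAnnotation)
	delete(mhc.Annotations, pausedByUpgradeAnnotation)
	return true
}
```

An alternative, per the owner-value idea above, would be to write the controller's name as the annotation value and only remove the annotation when the value matches.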
### Test Plan

TODO (need to check existing e2e tests, and if we extend them with a test case for this, and how much extra load to CI it would add)
Can we add a test to an existing job?
We can add this to the existing MHC tests that exist in cluster-api-actuator-pkg
### Graduation Criteria

TODO I have no good idea yet
- if this is alpha, beta or stable (do we need this at all, it doesn't introduce a new API)?
The annotation is technically an API, right?
Agreed, the annotation is an API IMO
Yes, but this enhancement is not about adding that annotation, it's only about using it: #827
The new controller lives in the machine-api-operator image, so the upgrades will be driven by the CVO which will fetch the right image version as usual.
The controller does not offer any API or configuration. The controller can be replaced with an older or newer version without side effects or risks.
If health checks are paused on a cluster before the upgrade to this version is started, then when the new controller comes online and sees the upgrade complete it will erase the pause annotations. I don't think we want that, because it won't leave the cluster in the same state as it was when we started and the change could cause unwanted degradation in service. See the comment above about the 2-annotation approach.
If health checks are paused on a cluster before the upgrade to this version is started
When everything goes as planned, it is very unlikely that anyone has already added the pause annotation to machineHealthChecks, because the machine-healthcheck-controller version which respects the pause annotation (and the relevant doc update) will only be introduced with the same OCP / machine-api-operator version as this new controller. But I will keep this in mind in case this does not go as planned, thanks.
The controller can't stop ongoing external remediation, which is executed by an external remediation controller. For this to happen we would either need to introduce and implement a pause flag on the external remediation CRD, or the external remediation controller needs to look for the pause annotation on the machineHealthCheck resource.
What would happen if the remediation controller indicates that the cluster is not upgradeable when it starts trying to fix a host?
Sorry, I can't follow. What do you mean by "the remediation controller indicates that the cluster is not upgradeable"?
## Alternatives

- Not having this feature at all: cluster admins could pause machineHealthChecks manually, but that is unneeded work, and too easy to forget.
It could be scripted, right?
How widely used is the machine health check controller? Is that an optional component or is it always present when we have the machine API?
MHC comes by default in an OCP cluster, but it isn't configured unless a customer creates an MHC object in the cluster, so it's technically off by default.
API. The goal is to align them over time.
- Track the status of nodes in a more fine-grained way and continue to remediate nodes that currently are not upgrading: this would be a perfect solution and might be investigated more deeply in the future. For now, with respect to timeline, we want to concentrate on the simple solution of pausing MHCs completely.
The drawback of doing this on a node-by-node basis is that the remediation is going to be another source of requests to drain nodes, and there may not be capacity to do that. We would therefore have 2 controllers potentially blocking each other from taking action and the cluster wouldn't be upgradeable or fixable.
### Goals

- Automatically pause all machineHealthChecks during cluster upgrades
- Automatically unpause all machineHealthChecks when cluster upgrade finished
Does this imply unpausing MHCs that were paused by users? Can we be careful about the wording to make sure that it's obvious we won't touch MHCs manually paused by a user?
See comments at other places. Yes, we need to track whether the MHC was paused already, probably with another annotation. Will update.
We should implement kubernetes/enhancements#1411 instead. In fact, there's already a POC from the Node Maintenance Operator folks.
I'm of the opinion we shouldn't disable MHC during upgrades; there may indeed be node failures we want to remediate, especially if upgrades take place over long periods of time (I can imagine a scenario where a sufficiently large cluster is almost always in an upgrading state).
@michaelgugino: @slintes is one of the NMO folks. Unfortunately customers are being severely impacted by this and we need a solution ASAP, ideally before the 4.9 feature freeze.
Co-authored-by: Joel Speed <Joel.speed@hotmail.co.uk>
The new controller will add the pause annotation on the machineHealthChecks when the clusterVersion's `Progressing` …
This functionality should be opt-in; we don't want to drastically change default behavior like this. If users are having a problem with MHC too eagerly deleting instances during upgrades, they can opt in (or adjust the MHC, or pause it themselves). Making this behavior the default is not a good fit IMO.
Since we don't know who set the annotation on the MHC, we'll need another small annotation on the MHC to indicate to ourselves that we paused it for the upgrade and the annotation didn't already exist.
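For the trigger quoted above ("when the clusterVersion's `Progressing` …"), a sketch of that condition check; whether to key off the ClusterVersion, the machine-config ClusterOperator, or the MachineConfigPools is still being debated in this thread:

```go
package mhcpause

import (
	configv1 "github.com/openshift/api/config/v1"
)

// clusterIsUpgrading reports whether the ClusterVersion resource (the singleton
// named "version") currently has Progressing=True, i.e. a payload is being applied.
func clusterIsUpgrading(cv *configv1.ClusterVersion) bool {
	for _, cond := range cv.Status.Conditions {
		if cond.Type == configv1.OperatorProgressing {
			return cond.Status == configv1.ConditionTrue
		}
	}
	return false
}
```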
… or unpaused during upgrades Signed-off-by: Marc Sluiter <msluiter@redhat.com>
… the cluster Signed-off-by: Marc Sluiter <msluiter@redhat.com>
/uncc
Inactive enhancement proposals go stale after 28d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting /remove-lifecycle stale. If this proposal is safe to close now please do so with /close. /lifecycle stale
Stale enhancement proposals rot after 7d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting /remove-lifecycle rotten. If this proposal is safe to close now please do so with /close. /lifecycle rotten
Rotten enhancement proposals close after 7d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Reopen the proposal by commenting /reopen. Mark the proposal as fresh by commenting /remove-lifecycle rotten. /close
@openshift-bot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Propose to add a new controller to machine-api-operator, which will pause all MachineHealthChecks (see #827) during cluster upgrades.