baremetal: Propose BMC-less remediation enhancement (AKA poison pill) #547
Conversation
Suggesting an opt-in remediation feature, which doesn't rely on BMC credentials. Each node will take care of rebooting itself in case of a failure. Signed-off-by: Nir <niry@redhat.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: n1r1. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
One thing I don't quite understand is the tradeoff between keeping this peer-to-peer and keeping the source of truth in etcd. It seems like etcd needs to be canonical, so wouldn't we at least require that, e.g., the control plane nodes are involved in the final decision to power off a machine?
This work can also be useful for clusters **with** BMC credentials.
If there’s a network outage between the node running CAPBM (which expects to use BMC commands like power-off etc.) and the unhealthy node, the power-cycle attempt cannot succeed. Self health checks and remediation can resolve such cases, however that work/scenario is out of scope for
If we are not scoping this into the work, why are network outages where BMC credentials exist discussed?
We are not accounting for situations where BMC credentials are available but couldn't be used (that's at least how I read this).
Using poison pill in BMC clusters is possible as an escalation path (e.g. first try to reboot with BMC; if that didn't work, poison pill kicks in).
To simplify the proposal we wanted to avoid outlining how this escalation path could work and how the coordination should be done.
We mention the benefit for BMC clusters just as something we might want to do in the future, or a way to expand the feature.
* To allow the healthy majority to recover workloads after a pre-determined and finite interval
* To allow a node to take actions to remediate itself in case of a failure
* To allow a node to detect its health status when master<->node communication is partially failed
* To avoid false positives caused by failure of the control plane
Do we have a list of what these might be?
I think this boils down to an inaccessible etcd (no matter what caused it).
So if the control plane is not responding for some reason, we don't want to automatically assume that the node is unhealthy. That's why we contact the peers,
and if we see that this is a wide failure, i.e. most peers can't access etcd, we assume it's a control plane failure and we don't reboot.
### Non-Goals

* Self healing in combination with existing reprovisioning or reboot based remediation mechanisms in the Machine Health Check controller
* Recovering from all types of failure
I think we need to better articulate what failures this will address (for documentation, testing and user experience evaluation).
## Proposal

Utilizing a DaemonSet to have a privileged critical pod running on each node. This pod will periodically check the
Would it be better to run these as static pods? https://kubernetes.io/docs/tasks/configure-pod-container/static-pod/
What would be the added value?
1. Connectivity issues between the node and the control plane nodes (which doesn’t stem from local problem in the node)
1. Resource starvation, preventing our DaemonSet from initiating the check
1. Authentication failure to API (ServiceAccount deleted?)
1. RBAC disallows read operation on Machine (Role deleted or changed
Suggested change: `1. RBAC disallows read operation on Machine (Role deleted or changed)` (adds the missing closing parenthesis).
We only want to remediate if the problem is local (#1 and #4). Otherwise, we could create a "remediation storm" where all nodes are trying to remediate themselves, even if the problem is not theirs.
#1 from #2 - using other peers: if other peers can access etcd, we know that it's not #2; if other peers can't access etcd, we assume it's #2. If the failing node can't reach its peers, nor access etcd, we assume it's #1.
#1 from #3 - using other peers to differentiate: if other peers can reach etcd, we assume it's #3; if the suspected node can't reach etcd, nor any of its peers, we assume it's #1.
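For illustration, a minimal Go sketch of that decision logic. The type names, reply values and thresholds here are assumptions; the actual operator may structure this differently:

```go
// Sketch only: deciding what to do when this node cannot read its own
// Machine CR, based on what its peers report.
package poisonpill

// PeerReply is what a peer reports when asked about this node's health.
type PeerReply int

const (
	ReplyNoContact PeerReply = iota // the peer did not answer at all
	ReplyAPIError                   // the peer answered, but it cannot reach etcd/API either
	ReplyHealthy                    // the peer reached etcd and sees this node as healthy
	ReplyUnhealthy                  // the peer reached etcd and sees this node as unhealthy
)

// shouldReboot is called on a node that failed to read its own Machine CR.
func shouldReboot(replies []PeerReply) bool {
	answered, apiErrors := 0, 0
	for _, r := range replies {
		switch r {
		case ReplyUnhealthy:
			return true // peers confirm we are marked unhealthy -> remediate
		case ReplyHealthy:
			return false // etcd is reachable from peers and we are not marked unhealthy
		case ReplyAPIError:
			answered++
			apiErrors++
		case ReplyNoContact:
			// no information from this peer
		}
	}
	switch {
	case answered > 0 && apiErrors == answered:
		return false // peers can't reach etcd either -> assume a control plane/etcd failure, don't reboot
	case answered == 0:
		return true // no peer reachable and no API access -> assume the problem is local, reboot
	default:
		return false // inconclusive -> keep retrying rather than reboot
	}
}
```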
If we can detect #1, #3 and #4 why aren't the nodes throwing alerts that can be used for other debugging purposes?
If it's #1, the node probably won't be able to send alerts anywhere (e.g. if it has a local network failure).
To identify #3 we had to introduce communication between the peers; we couldn't identify this without them.
As for #4, there are some existing node conditions such as mem/disk pressure that are already reported. We use a watchdog device since we must make sure that either the poison pill pod is running or the machine is not running any workloads. Otherwise we're risking running stateful sets with run-once semantics on two different nodes.
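To make the watchdog point concrete, here is a minimal illustration of the semantics, assuming a standard Linux watchdog device at `/dev/watchdog`; this is not the actual operator code:

```go
// Illustration only: as long as this loop keeps writing, the watchdog timer is
// reset; if the poison pill pod dies or is starved, the writes stop, the timer
// expires and the hardware (or softdog) resets the machine.
package main

import (
	"os"
	"time"
)

func main() {
	wd, err := os.OpenFile("/dev/watchdog", os.O_WRONLY, 0)
	if err != nil {
		panic(err) // no watchdog device available on this node
	}
	defer wd.Close() // without the "magic close" character the device typically stays armed

	for nodeConsideredHealthy() {
		wd.Write([]byte("k")) // any write "kicks" (feeds) the watchdog
		time.Sleep(10 * time.Second)
	}
	// Stop feeding: the watchdog will reset the machine shortly after.
}

// nodeConsideredHealthy is a placeholder for the API/peer checks discussed above.
func nodeConsideredHealthy() bool { return true }
```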
The semantics of the watchdog device automatically handle #4 and to distinguish between the remaining three situations, the node will ask its peers whether they think it is healthy.

The possible responses from a peer are:
This will have impacts on east-west network traffic within the cluster. How much impact are we expecting?
Assuming 256 bytes per request, the expected traffic volume is up to 256 * nodes_count bytes.
So given a 1000-node cluster we're talking about up to 256KB overall (layer 7 volume) per unhealthy node.
If the control plane is really down, each node will contact its peers, which means a 256MB volume for the whole cluster, but in that case the cluster is already broken, so I don't think this 256MB will harm anything.
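For reference, the same back-of-the-envelope numbers spelled out; this is illustrative only and assumes the worst case where a node queries every peer:

```go
package main

import "fmt"

func main() {
	const requestBytes = 256 // assumed payload per health query
	const nodes = 1000

	perUnhealthyNode := requestBytes * nodes          // one node asks every peer: ~256 kB
	wholeClusterWorstCase := perUnhealthyNode * nodes // control plane down, every node asks: ~256 MB

	fmt.Printf("per unhealthy node: ~%d kB\n", perUnhealthyNode/1000)
	fmt.Printf("control plane down, whole cluster: ~%d MB\n", wholeClusterWorstCase/1_000_000)
}
```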
1. Mark the node as unschedulable (this prevents the node from running new workload after reboot)
1. Add the current time to an annotation
1. The unhealthy node reboots itself
1. After *timeout-to-assume-node-rebooted* (either configurable or calculated by software) - delete the unhealthy node (to signal the cluster that the workload is no longer running there, and it is safe to run it elsewhere)
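A hedged client-go sketch of these steps; the annotation key, timing and error handling are assumptions for illustration, not the proposal's actual implementation:

```go
package remediation

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// rebootedAtAnnotation is a hypothetical key used to record when remediation started.
const rebootedAtAnnotation = "poison-pill.example.com/reboot-requested-at"

func remediate(ctx context.Context, c kubernetes.Interface, nodeName string, assumeRebootedAfter time.Duration) error {
	// Steps 1+2: cordon the node so it gets no new workload after reboot, and
	// record the current time in an annotation.
	patch := []byte(fmt.Sprintf(
		`{"spec":{"unschedulable":true},"metadata":{"annotations":{"%s":"%s"}}}`,
		rebootedAtAnnotation, time.Now().UTC().Format(time.RFC3339)))
	if _, err := c.CoreV1().Nodes().Patch(ctx, nodeName, types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
		return err
	}

	// Step 3: the unhealthy node reboots itself (driven by the watchdog / poison pill pod).

	// Step 4: after the timeout we assume the reboot happened and delete the Node
	// object, signalling that its workload can safely be rescheduled elsewhere.
	time.Sleep(assumeRebootedAfter)
	return c.CoreV1().Nodes().Delete(ctx, nodeName, metav1.DeleteOptions{})
}
```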
Do we also want to pair this with a set of re-tries? Or a buffer +/- 20%?
hmm can you elaborate? retries for which operation?
## Drawbacks

TBD
We mention a few above with false negatives; should we explore / outline more of those here?
## Alternatives

Instead of contacting other peers, we could use a shared storage to signal the health status for a peer.
etcd? If this happens and we lose connectivity to etcd, what issue does this cause?
etcd is the source of truth; it holds the information about the health status of the node.
To avoid saturating the network, the node will pick *min(cluster_size/10, cluster_size-1)* random nodes, and ask these nodes if they think it’s healthy using HTTP.
Each pod will listen on a specific node-port, and will run an HTTP web server that will be able to tell other peers if
Can we do it on multiple networks? Assuming (for example) there are multiple NICs, it might be that the 'external' network is unavailable, but the 'internal' is.
Good point.
I guess it's possible and we should probably try all available IPs.
I'll update the proposal accordingly
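A sketch of what "try all available IPs" could look like; the port, path and timeout are assumptions, not part of the proposal:

```go
package peerhealth

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"time"
)

// askPeer queries a peer's health endpoint on every address we know for it
// (e.g. one per NIC/network), so an outage on one network does not hide a
// peer that is still reachable on another.
func askPeer(ctx context.Context, peerIPs []string, nodeName string, port int) (string, error) {
	client := &http.Client{Timeout: 5 * time.Second}
	var lastErr error
	for _, ip := range peerIPs {
		url := fmt.Sprintf("http://%s:%d/health/%s", ip, port, nodeName) // hypothetical endpoint
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
		if err != nil {
			lastErr = err
			continue
		}
		resp, err := client.Do(req)
		if err != nil {
			lastErr = err
			continue // this address may sit on the broken network; try the next one
		}
		body, err := io.ReadAll(resp.Body)
		resp.Body.Close()
		if err != nil {
			lastErr = err
			continue
		}
		return string(body), nil
	}
	return "", fmt.Errorf("peer unreachable on all addresses: %v", lastErr)
}
```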
Failure to read Machine CR could stem from one of the following:
1. The node is unhealthy and therefore can’t read the Machine CR (e.g. local network failure)
1. There’s an etcd failure
Do we risk losing quorum by rebooting nodes as proposed here? cc @hexfusion
We can reboot only if the etcd-quorum-guard PDB allows it, but the watchdog doesn't respect PDBs, obviously.
Maybe at this stage it would be simpler to install only on workers (and lose the opportunity to remediate masters).
We can revisit this once we gain more experience and confidence.
cc @beekhof
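For the etcd-quorum-guard idea, a hedged sketch of what such a check could look like; the PDB name and namespace are assumptions, and as noted above the watchdog itself cannot consult this:

```go
package remediation

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// canDisruptControlPlane checks whether the etcd quorum guard PDB still allows
// one more disruption before a control plane node is allowed to fence itself.
func canDisruptControlPlane(ctx context.Context, c kubernetes.Interface) (bool, error) {
	pdb, err := c.PolicyV1().PodDisruptionBudgets("openshift-etcd").
		Get(ctx, "etcd-quorum-guard", metav1.GetOptions{})
	if err != nil {
		return false, err // can't tell; err on the side of not rebooting a master
	}
	return pdb.Status.DisruptionsAllowed > 0, nil
}
```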
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting `/remove-lifecycle stale`. If this issue is safe to close now please do so with `/close`.
/lifecycle stale
/remove-lifecycle stale
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting `/remove-lifecycle stale`. If this issue is safe to close now please do so with `/close`.
/lifecycle stale
We released the poison pill operator as a day-2 operator which can be installed using OLM. See upstream release:
/close
@n1r1: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Proposing an opt-in remediation feature, which doesn't rely on BMC credentials.
Each node will take care of rebooting itself in case of a failure.
implementation PoC