
baremetal: Propose BMC-less remediation enhancement (AKA poison pill) #547

Closed
wants to merge 1 commit

Conversation


@n1r1 n1r1 commented Nov 24, 2020

Proposing an opt-in remediation feature that doesn't rely on BMC credentials.
Each node will take care of rebooting itself in case of a failure.

implementation PoC


Signed-off-by: Nir <niry@redhat.com>
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: n1r1
To complete the pull request process, please assign jsafrane after the PR has been reviewed.
You can assign the PR to them by writing /assign @jsafrane in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@cgwalters
Member

One thing I don't quite understand is the tradeoffs in having this be peer-to-peer versus keeping the source of truth in etcd. It seems like etcd needs to be canonical, so wouldn't we at least require e.g. the control plane nodes are involved in the final decision to power off a machine or so?

This work can also be useful for clusters **with** BMC credentials.
If there’s a network outage between the node running CAPBM (which expects to use BMC commands like power-off etc.) and the
unhealthy node, the power-cycle attempt cannot
succeed. Self health checks and remediation can resolve such cases, however that work/scenario is out of scope for


If we are not scoping this into the work, why are network outages where BMC credentials are available discussed?
We are not accounting for situations where BMC credentials are available but couldn't be used (that's at least how I read this).

Author

Using poison pill in BMC clusters is possible as an escalation path (e.g. first try to reboot with BMC, and if that doesn't work, poison pill kicks in).
To simplify the proposal we wanted to avoid outlining how this escalation path can work and how this coordination should be done.
We mention the benefit for BMC clusters just as something we might want to do in the future, or as a way to expand the feature.

* To allow the healthy majority to recover workloads after a pre-determined and finite interval
* To allow a node to take actions to remediate itself in case of a failure
* To allow a node to detect its health status when master<->node communication is partially failed
* To avoid false positives caused by failure of the control plane


Do we have a list of what these might be?

Author

I think this boils down to inaccessible etcd (no matter what caused it).
So if the control plane is not responding for some reason, we don't want to automatically assume that the node is unhealthy. That's why we contact the peers.
If we see that this is a wide failure, i.e. most peers can't access etcd, we assume it's a control plane failure and we don't reboot.

### Non-Goals

* Self healing in combination with existing reprovisioning or reboot based remediation mechanisms in the Machine Health Check controller
* Recovering from all types of failure


I think we need to better articulate what failures this will address (for documentation, testing and user experience evaluation).


## Proposal

Utilizing a DaemonSet to have a privileged critical pod running on each node. This pod will periodically check the


Would it be better to run these as static pods? https://kubernetes.io/docs/tasks/configure-pod-container/static-pod/

Author

What would be the added value?

1. Connectivity issues between the node and the control plane nodes (which doesn’t stem from local problem in the node)
1. Resource starvation, preventing our DaemonSet from initiating the check
1. Authentication failure to API (ServiceAccount deleted?)
1. RBAC disallows read operation on Machine (Role deleted or changed


Suggested change
1. RBAC disallows read operation on Machine (Role deleted or changed
1. RBAC disallows read operation on Machine (Role deleted or changed)

1. RBAC disallows read operation on Machine (Role deleted or changed

We only want to remediate if the problem is local (#1 and #4). Otherwise, we could create a "remediation storm" where
all nodes are trying to remediate themselves, even if the problem is not theirs.


How will we differentiate #1 from #2, or #1 from #3, to know what remediation tasks to execute?
If we can detect #1, #3 and #4 why aren't the nodes throwing alerts that can be used for other debugging purposes?

Author

#1 from #2 - using other peers. If other peers can access etcd, we know it's not #2. If other peers can't access etcd either, we assume it's #2. If the failing node can't reach its peers nor access etcd, we assume it's #1.

#1 from #3 - using other peers to differentiate. If other peers can reach etcd, we assume it's #3; if the suspected node can't reach etcd nor any of its peers, we assume it's #1.
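
To make the differentiation above concrete, here is a minimal sketch of how such a decision could look. It is illustrative only, not the PoC's actual code; the helper names, answer types, and thresholds are assumptions.

```go
// Illustrative sketch of the peer-based differentiation described above.
// Helper names, answer types, and thresholds are hypothetical, not from the PoC.
package peercheck

type peerAnswer int

const (
	answerHealthy     peerAnswer = iota // peer read etcd and thinks we are healthy
	answerUnhealthy                     // peer read etcd and thinks we are unhealthy
	answerAPIError                      // peer itself cannot reach etcd/the API (case #2)
	answerUnreachable                   // peer did not respond at all
)

// canReadMachineCR and askPeer would wrap the API client and the HTTP peer
// protocol in a real implementation; they are stubbed here for illustration.
func canReadMachineCR() bool         { return false }
func askPeer(addr string) peerAnswer { return answerUnreachable }

// shouldReboot decides whether the node should remediate itself.
func shouldReboot(peers []string) bool {
	if canReadMachineCR() {
		return false // the API path works, no self-remediation needed
	}
	var healthy, unhealthy, apiErrors, unreachable int
	for _, p := range peers {
		switch askPeer(p) {
		case answerHealthy:
			healthy++
		case answerUnhealthy:
			unhealthy++
		case answerAPIError:
			apiErrors++
		case answerUnreachable:
			unreachable++
		}
	}
	switch {
	case unreachable == len(peers):
		return true // can't reach etcd or any peer: assume a local failure (#1)
	case apiErrors > healthy+unhealthy:
		return false // most peers can't reach etcd either: control-plane failure (#2), don't reboot
	case unhealthy > 0:
		return true // peers read etcd and report us unhealthy: remediate
	default:
		return false // peers say we are healthy: likely an auth/RBAC issue (#3/#4)
	}
}
```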

Author

If we can detect #1, #3 and #4 why aren't the nodes throwing alerts that can be used for other debugging purposes?

If it's #1, the node probably won't be able to send alerts anywhere (e.g. if it has a local network failure).
To identify #3 we had to introduce communication between the peers; we couldn't identify this without them.
As for #4, there are some existing node conditions such as mem/disk pressure that are already reported. We use a watchdog device since we must make sure that either the poison pill pod is running or the machine is not running any workloads. Otherwise we risk running stateful sets with run-once semantics on two different nodes.
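
For illustration, the watchdog mechanism referred to here boils down to periodically "petting" a Linux watchdog device and letting the kernel/hardware reset the machine if the petting stops. A minimal sketch, assuming the standard /dev/watchdog interface; the device path and interval are placeholders.

```go
// Minimal sketch of feeding a Linux watchdog device. If the pod stops writing
// (crashes, is starved, or deliberately stops), the watchdog fires and the
// machine is reset. The device path and interval are illustrative.
package main

import (
	"log"
	"os"
	"time"
)

func main() {
	wd, err := os.OpenFile("/dev/watchdog", os.O_WRONLY, 0)
	if err != nil {
		log.Fatalf("cannot open watchdog device: %v", err)
	}
	defer wd.Close()

	for {
		// Each write "pets" the watchdog and postpones the reset.
		if _, err := wd.Write([]byte{0}); err != nil {
			log.Printf("failed to pet watchdog: %v", err)
		}
		time.Sleep(10 * time.Second) // must stay below the watchdog timeout
	}
}
```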

The semantics of the watchdog device automatically handle #4 and to distinguish between the remaining three situations,
the node will ask its peers whether they think it is healthy.

The possible responses from a peer are:


This will have an impact on east-west network traffic within the cluster. How much impact are we expecting?

Author

Assuming 256 bytes per request, the expected traffic volume is up to 256*nodes_count bytes.
So given a 1000-node cluster we're talking about up to 256KB overall (layer-7 volume) per unhealthy node.

If the control plane is really down, each node will contact its peers, which means about 256MB of traffic for the whole cluster, but in that case the cluster is already broken, so I don't think this 256MB will harm anything.

1. Mark the node as unschedulable (this prevents the node from running new workloads after reboot)
1. Add the current time to an annotation
1. Unhealthy node reboots itself
1. After *timeout-to-assume-node-rebooted* (either configurable or calculated by software) - delete the unhealthy node (to signal the cluster that the workload is no longer running there, and it is safe to run it elsewhere)
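
A rough sketch of how steps 1, 2 and 4 above could be driven with client-go; the annotation key and the blocking wait are placeholders and simplifications, not part of the proposal.

```go
// Illustrative sketch of steps 1, 2 and 4 above using client-go.
// The annotation key and the timeout handling are placeholders.
package remediate

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const rebootedAtAnnotation = "poison-pill.example.com/reboot-requested-at" // hypothetical key

func fenceNode(ctx context.Context, c kubernetes.Interface, nodeName string, assumeRebootedAfter time.Duration) error {
	node, err := c.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}

	// 1. Mark the node unschedulable so it gets no new workloads after reboot.
	node.Spec.Unschedulable = true

	// 2. Record when remediation was requested.
	if node.Annotations == nil {
		node.Annotations = map[string]string{}
	}
	node.Annotations[rebootedAtAnnotation] = time.Now().UTC().Format(time.RFC3339)

	if _, err := c.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{}); err != nil {
		return err
	}

	// 3. The unhealthy node reboots itself (not shown here).

	// 4. After the timeout, assume the node has rebooted and delete the Node
	//    object, signalling that its workloads can safely run elsewhere.
	time.Sleep(assumeRebootedAfter)
	return c.CoreV1().Nodes().Delete(ctx, nodeName, metav1.DeleteOptions{})
}
```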


Do we also want to pair this with a set of re-tries? Or a buffer +/- 20%?

Author

hmm can you elaborate? retries for which operation?


## Drawbacks

TBD


We mention a few above with false negatives; should we explore/outline more of those here?


## Alternatives

Instead of contacting other peers, we could use a shared storage to signal the health status for a peer.


etcd? If this happens and we lose connectivity to etcd, what issue does this cause?

@n1r1
Author

n1r1 commented Dec 20, 2020

One thing I don't quite understand is the tradeoffs in having this be peer-to-peer versus keeping the source of truth in etcd. It seems like etcd needs to be canonical, so wouldn't we at least require e.g. the control plane nodes are involved in the final decision to power off a machine or so?

etcd is the source of truth; it holds the information about the health status of the node.
The peer-to-peer protocol kicks in only if a node cannot access etcd for some reason.
The peer will check the health status in etcd and report it back to the node that couldn't access etcd itself. In some ways, you can think of it as a proxy to etcd.
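
A minimal sketch of that "proxy to etcd" idea: the peer exposes an HTTP endpoint, checks the asking node's health against the API server (backed by etcd), and answers. The endpoint path, port, and the health lookup are assumptions for illustration only.

```go
// Sketch of the peer side: answer "is node X healthy?" by consulting the
// API server (i.e. etcd) on behalf of a node that cannot reach it.
// Endpoint path, port, and the health lookup are illustrative only.
package main

import (
	"fmt"
	"log"
	"net/http"
)

// isNodeHealthy would read the Node/Machine status via the Kubernetes API
// in a real implementation; stubbed here.
func isNodeHealthy(nodeName string) (healthy bool, apiReachable bool) {
	return true, true
}

func main() {
	http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		nodeName := r.URL.Query().Get("node")
		healthy, apiReachable := isNodeHealthy(nodeName)
		switch {
		case !apiReachable:
			// This peer cannot reach etcd either; the caller can treat this
			// as a possible control-plane failure.
			http.Error(w, "api unreachable", http.StatusServiceUnavailable)
		case healthy:
			fmt.Fprintln(w, "healthy")
		default:
			fmt.Fprintln(w, "unhealthy")
		}
	})
	log.Fatal(http.ListenAndServe(":30300", nil)) // node-port value is a placeholder
}
```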


To avoid saturating the network, the node will pick *min(cluster_size/10, cluster_size-1)* random nodes, and ask these
nodes if they think it’s healthy using HTTP.
Each pod will listen on a specific node-port, and will run an HTTP web server that will be able to tell other peers if

Can we do it on multiple networks? Assuming (for example) there are multiple NICs, it might be that the 'external' network is unavailable, but the 'internal' is.

Author

Good point.
I guess it's possible and we should probably try all available IPs.
I'll update the proposal accordingly.
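
Following up on the multi-NIC point, a sketch of the asking side that simply tries every known address of a peer; the port and endpoint mirror the hypothetical server sketch above.

```go
// Sketch of querying a peer over each of its reported addresses in turn,
// so that an outage on one network does not hide a working path on another.
// Port and endpoint are the same placeholders used in the server sketch.
package peerclient

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// askPeer tries every known IP of the peer and returns the first answer.
func askPeer(peerIPs []string, myNodeName string) (string, error) {
	client := &http.Client{Timeout: 2 * time.Second}
	var lastErr error
	for _, ip := range peerIPs {
		url := fmt.Sprintf("http://%s:30300/health?node=%s", ip, myNodeName)
		resp, err := client.Get(url)
		if err != nil {
			lastErr = err
			continue // this NIC/network may be down; try the next address
		}
		body, err := io.ReadAll(resp.Body)
		resp.Body.Close()
		if err != nil {
			lastErr = err
			continue
		}
		return string(body), nil
	}
	return "", fmt.Errorf("peer unreachable on all addresses: %w", lastErr)
}
```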


Failure to read Machine CR could stem from one of the following:
1. The node is unhealthy and therefore can’t read the Machine CR (e.g. local network failure)
1. There’s an etcd failure
Contributor

Do we risk losing quorum by rebooting nodes as proposed here? cc @hexfusion

Author

We can reboot only if the etcd-quorum-guard PDB allows it, but the watchdog obviously doesn't respect PDBs.

Maybe at this stage it would be simpler to install only on workers (and lose the opportunity to remediate masters).
We can revisit this once we gain more experience and confidence.

cc @beekhof
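
To illustrate the PDB check mentioned above, a sketch that reads the etcd-quorum-guard PodDisruptionBudget and backs off if no disruptions are allowed. The namespace and PDB name are assumptions; note that this only *reads* the PDB status (real enforcement happens through the eviction API), and that the watchdog itself bypasses it entirely, as noted above.

```go
// Sketch of consulting a PodDisruptionBudget before rebooting a control-plane
// node. Namespace and PDB name are assumptions; adjust for the actual cluster.
package remediate

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func etcdDisruptionAllowed(ctx context.Context, c kubernetes.Interface) (bool, error) {
	pdb, err := c.PolicyV1().PodDisruptionBudgets("openshift-etcd").
		Get(ctx, "etcd-quorum-guard", metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	// DisruptionsAllowed > 0 means taking one more etcd member down should
	// not break quorum according to the PDB.
	return pdb.Status.DisruptionsAllowed > 0, nil
}
```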

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 11, 2021
@n1r1
Author

n1r1 commented Apr 11, 2021

/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 11, 2021
@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 10, 2021
@n1r1
Author

n1r1 commented Jul 11, 2021

We released the poison pill operator as a day-2 operator, which can be installed using OLM. See the upstream release:
https://operatorhub.io/operator/poison-pill-operator

/close

@openshift-ci
Contributor

openshift-ci bot commented Jul 11, 2021

@n1r1: Closed this PR.

In response to this:

We released the poison pill operator as a day-2 operator, which can be installed using OLM. See the upstream release:
https://operatorhub.io/operator/poison-pill-operator

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot closed this Jul 11, 2021