baremetal: Propose BMC-less remediation enhancement (AKA poison pill) #547

`enhancements/baremetal/baremetal-poison-pill-remediation.md` (189 additions, 0 deletions)

---
title: baremetal-poison-pill-remediation
authors:
- "@n1r1"
- "@abeekhof"
reviewers:
- TBD
approvers:
- TBD
creation-date: 2020-11-15
last-updated: 2020-11-15
status: implementable
see-also:
- https://github.com/openshift/enhancements/blob/master/enhancements/machine-api/machine-health-checking.md
replaces:
- None
superseded-by:
- None
---

# Baremetal Poison Pill Remediation

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

Existing baremetal remediation strategies utilize BMC credentials to power-cycle and/or reprovision the host.
However, there are also environments that either do not include BMCs, or that have policies in place preventing
them from being used. Such environments would also benefit from the ability to safely recover affected workloads
and restore cluster capacity (where possible).

This proposal describes an alternate mechanism for a node in a cluster
to detect its health status and take actions to remediate itself in case of a failure. While not all remediation events
can result in the node returning to a healthy state, the proposal does allow surviving parts of the cluster to assume
the node has reached a safe state so that its workloads can be automatically recovered.


This work can also be useful for clusters **with** BMC credentials.
If there’s a network outage between the node running CAPBM (which expects to use BMC commands like power-off) and the
unhealthy node, the power-cycle attempt cannot succeed. Self health checks and remediation can resolve such cases;
however, that scenario is out of scope for this proposal.

> **Review comment:** If we are not scoping this into the work, why are network outages where BMC credentials are available discussed?
> We are not accounting for situations where BMC credentials are available but couldn't be used (that's at least how I read this).
>
> **Author:** Using poison pill in BMC clusters is possible as an escalation path (e.g. first try to reboot with BMC; if that didn't work, poison pill kicks in).
> To simplify the proposal we wanted to avoid outlining how this escalation path could work and how the coordination should be done.
> We mention the benefit for BMC clusters just as something we might want to do in the future, or a way to expand the feature.

## Motivation

Some clusters don’t have BMC credentials, and we still want automated remediation for failing nodes.
This proposal aims:
1. To allow stateful workloads running on an unhealthy node to be rescheduled on other healthy nodes after a finite time
1. To restore compute capacity by remediating the unhealthy node

### Goals

* To allow remediation of bare metal nodes without BMC credentials
* To utilize the MHC logic for deciding if a node is unhealthy and whether remediation is appropriate
* To allow the healthy majority to recover workloads after a pre-determined and finite interval
* To allow a node to take actions to remediate itself in case of a failure
* To allow a node to detect its health status when master<->node communication has partially failed
* To avoid false positives caused by failure of the control plane

> **Review comment:** Do we have a list of what these might be?
>
> **Author:** I think this boils down to an inaccessible etcd (no matter what caused it).
> If the control plane is not responding for some reason, we don't want to automatically assume that the node is unhealthy. That's why we contact the peers,
> and if we see that this is a wide failure, i.e. most peers can't access etcd, we assume it's a control plane failure and we don't reboot.


### Non-Goals

* Self healing in combination with existing reprovisioning or reboot based remediation mechanisms in the Machine Health Check controller
* Recovering from all types of failure
* Having a distributed mesh of health checks where each node is checking health for its peers
* Creation of fake Machine objects for UPI clusters

> **Review comment:** I think we need to better articulate what failures this will address (for documentation, testing and user experience evaluation).

## Proposal

A DaemonSet will run a privileged critical pod on each node. This pod will periodically check the
health status of the node it’s running on, and tickle a [watchdog device](https://en.wikipedia.org/wiki/Watchdog_timer).

> **Review comment:** Would it be better to run these as static pods? https://kubernetes.io/docs/tasks/configure-pod-container/static-pod/
>
> **Author:** What would be the added value?

The health check will be performed by looking for the `host.metal3.io/external-remediation` annotation that MHC adds to the unhealthy
Machine CR.

If the unhealthy annotation is found, the node will remediate itself by using the watchdog device to
trigger a reset of the node. Watchdog devices, especially the hardware kind, are considered more reliable and predictable than the
`reboot` command.

If the Machine CR is not accessible after 3 consecutive attempts, the node needs external assistance to determine its
health status.
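
The following Go sketch (not the actual implementation) illustrates the agent loop described above: check the Machine CR for the annotation, tickle `/dev/watchdog`, and fall back to a peer check after three consecutive read failures. The `machineHasUnhealthyAnnotation` and `askPeers` helpers and the 10-second period are assumptions for illustration only.

```go
// A sketch only: the helpers below stand in for real API access.
package main

import (
	"log"
	"os"
	"time"
)

// machineHasUnhealthyAnnotation is a placeholder for reading this node's Machine CR
// via the api-server and checking for the host.metal3.io/external-remediation annotation.
func machineHasUnhealthyAnnotation() (bool, error) { return false, nil }

// askPeers is a placeholder for the peer-based check described later in this proposal.
func askPeers() bool { return false }

func main() {
	// If the agent stops writing to the watchdog device, the hardware resets the node.
	wd, err := os.OpenFile("/dev/watchdog", os.O_WRONLY, 0)
	if err != nil {
		log.Fatalf("no watchdog device available: %v", err)
	}
	defer wd.Close()

	failures := 0
	for range time.Tick(10 * time.Second) {
		unhealthy, err := machineHasUnhealthyAnnotation()
		if err != nil {
			failures++
			if failures >= 3 {
				// Machine CR unreadable 3 times in a row: ask peers for help.
				unhealthy = askPeers()
			}
		} else {
			failures = 0
		}
		if unhealthy {
			// Stop tickling the watchdog and let it reset the node.
			log.Print("unhealthy: letting the watchdog expire")
			return
		}
		// Healthy as far as we can tell: tickle (feed) the watchdog.
		if _, err := wd.Write([]byte(".")); err != nil {
			log.Printf("failed to feed watchdog: %v", err)
		}
	}
}
```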

Failure to read the Machine CR could stem from one of the following:
1. The node is unhealthy and therefore can’t read the Machine CR (e.g. local network failure)
1. There’s an etcd failure
1. Connectivity issues between the node and the control plane nodes (which don’t stem from a local problem in the node)
1. Resource starvation, preventing our DaemonSet from initiating the check
1. Authentication failure to the API (ServiceAccount deleted?)
1. RBAC disallows read operations on Machine (Role deleted or changed)

> **Review comment (contributor):** Do we risk losing quorum by rebooting nodes as proposed here? cc @hexfusion
>
> **Author:** We can reboot only if the etcd-quorum-guard PDB allows it, but the watchdog doesn't respect PDBs, obviously.
>
> Maybe at this stage it would be simpler to install only on workers (and lose the opportunity to remediate masters).
> We can revisit this once we gain more experience and confidence.
>
> cc @beekhof


We only want to remediate if the problem is local (#1 and #4). Otherwise, we could create a "remediation storm" in which
all nodes try to remediate themselves, even though the problem is not theirs.

> **Review comment:** How will we differentiate #1 from #2, or #1 from #3, to know what remediation tasks to execute?
> If we can detect #1, #3 and #4, why aren't the nodes throwing alerts that can be used for other debugging purposes?
>
> **Author:** #1 from #2 - using other peers. If other peers can access etcd, we know that it's not #2. If other peers can't access etcd, we assume it's #2. If the failing node can't reach its peers, nor access etcd, we assume it's #1.
>
> #1 from #3 - using other peers to differentiate. If other peers can reach etcd, we assume it's #3; if the suspected node can't reach etcd, nor any of its peers, we assume it's #1.
>
> **Author:**
> > If we can detect #1, #3 and #4 why aren't the nodes throwing alerts that can be used for other debugging purposes?
>
> If it's #1, the node probably won't be able to send alerts anywhere (e.g. if it has a local network failure).
> To identify #3 we had to introduce communication between the peers; we couldn't identify it without them.
> As for #4, there are some existing node conditions such as mem/disk pressure that are already reported. We use a watchdog device since we must make sure that either the poison pill pod is running or the machine is not running any workloads. Otherwise we risk running stateful sets with run-once semantics on two different nodes.

The semantics of the watchdog device automatically handle #4, and to distinguish between the remaining three situations,
the node will ask its peers whether they think it is healthy.

The possible responses from a peer are:
1. You're unhealthy (the annotation exists)
1. You're healthy (the annotation does not exist)
1. The peer also cannot contact the api-server
1. No response (timeout or any TCP/HTTP error)

> **Review comment:** This will have impacts on east-west network traffic within the cluster. How much impact are we expecting?
>
> **Author:** Assuming 256 bytes per request, the expected traffic volume is up to 256*nodes_count bytes.
> So given a 1000-node cluster, we're talking about up to 256KB overall (layer 7 volume) per unhealthy node.
>
> If the control plane is really down, each node will contact its peers, which means 256MB of volume for the whole cluster, but in that case the cluster is already broken, so I don't think this 256MB will harm anything.

To avoid saturating the network, the node will pick *min(cluster_size/10, cluster_size-1)* random nodes, and ask these
nodes over HTTP whether they think it’s healthy.
Each pod will listen on a specific node port and run an HTTP web server that can tell other peers whether
the machine is healthy or not.

> **Review comment:** Can we do it on multiple networks? Assuming (for example) there are multiple NICs, it might be that the 'external' network is unavailable, but the 'internal' is.
>
> **Author:** Good point.
> I guess it's possible and we should probably try all available IPs.
> I'll update the proposal accordingly.
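
A minimal sketch of the peer-side HTTP endpoint is shown below, under the assumption of a hypothetical `hasUnhealthyAnnotation` helper; the port, path and response payload are illustrative and not fixed by this proposal.

```go
// A sketch only: hasUnhealthyAnnotation stands in for the real Machine CR lookup.
package main

import (
	"log"
	"net/http"
)

// hasUnhealthyAnnotation is a placeholder for reading the asking node's Machine CR
// through the api-server.
func hasUnhealthyAnnotation(nodeName string) (bool, error) { return false, nil }

func main() {
	http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		node := r.URL.Query().Get("node") // the asking node identifies itself
		unhealthy, err := hasUnhealthyAnnotation(node)
		switch {
		case err != nil:
			// Response #3: this peer cannot contact the api-server either.
			http.Error(w, "api-error", http.StatusServiceUnavailable)
		case unhealthy:
			w.Write([]byte("unhealthy")) // response #1: the annotation exists
		default:
			w.Write([]byte("healthy")) // response #2: the annotation does not exist
		}
	})
	// The listen address is illustrative; the real agent would use the agreed node port.
	log.Fatal(http.ListenAndServe(":30100", nil))
}
```
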
If two or more peers responded with a different result, we take an optimistic approach - one result of
*"you’re healthy"* is enough to determine that the node is healthy.
If all randomized peers returned *"you’re unhealthy"* - the node considers itself unhealthy.
If all randomized peers returned an error (#3 or #4), another set of random nodes will be used to determine the health
status.
The node considers itself healthy if more than 50% of the nodes responded with #3.
The node considers itself unhealthy if it does not receive any peer responses.
The above algorithm for determining the health state of a node is demonstrated in the following flow chart:

![Machine health check](./baremetal-poison-pill-health-flow.png)

The state required to run this algorithm will be stored in memory, including node addresses, retry counts and so on.
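
The following Go sketch is one possible reading of the rules above and the flow chart; the response names, the "ask another batch" verdict and the exact thresholds are an interpretation, not the final implementation.

```go
// A sketch only: an interpretation of the decision rules, not the final implementation.
package poisonpill

import "math/rand"

// PeerResponse mirrors the four possible answers listed above.
type PeerResponse int

const (
	PeerSaysHealthy   PeerResponse = iota // the annotation does not exist
	PeerSaysUnhealthy                     // the annotation exists
	PeerApiError                          // the peer cannot contact the api-server either
	PeerNoResponse                        // timeout or any TCP/HTTP error
)

type Verdict int

const (
	Healthy Verdict = iota
	Unhealthy
	AskAnotherBatch // every peer errored out; retry with a different random set
)

// pickPeers selects min(cluster_size/10, cluster_size-1) random peer addresses.
func pickPeers(peers []string) []string {
	if len(peers) == 0 {
		return nil
	}
	n := len(peers) / 10
	if n > len(peers)-1 {
		n = len(peers) - 1
	}
	if n < 1 {
		n = 1
	}
	shuffled := append([]string(nil), peers...)
	rand.Shuffle(len(shuffled), func(i, j int) { shuffled[i], shuffled[j] = shuffled[j], shuffled[i] })
	return shuffled[:n]
}

// evaluate applies the optimistic rules: one "healthy" answer wins, a majority of
// "can't reach the api-server" answers implies a control-plane problem (do not reboot),
// and silence from every peer means the node's own network is broken.
func evaluate(responses []PeerResponse) Verdict {
	healthy, unhealthy, apiErr, silent := 0, 0, 0, 0
	for _, r := range responses {
		switch r {
		case PeerSaysHealthy:
			healthy++
		case PeerSaysUnhealthy:
			unhealthy++
		case PeerApiError:
			apiErr++
		case PeerNoResponse:
			silent++
		}
	}
	switch {
	case healthy > 0:
		return Healthy
	case unhealthy > 0:
		return Unhealthy
	case apiErr*2 > len(responses):
		return Healthy // assume a control plane failure, not a local one
	case silent == len(responses):
		return Unhealthy // nobody answered: assume a local network failure
	default:
		return AskAnotherBatch
	}
}
```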

Remediation will be done by using a watchdog device to trigger a reset of the node.

In the absence of any watchdog device, including the software watchdog (softdog), a software reboot will be issued.

While reboot might not fix the health issue of the node, it will allow the cluster to assume that the workload on the failing node is no longer running, such that stateful applications scheduled on that node can be rescheduled to other healthy nodes.

Steps taken by the poison pill pod (on all nodes in the cluster) for any unhealthy node:
1. Mark the node as unschedulable (this prevents the node from running new workloads after reboot)
1. Add the current time to an annotation
1. The unhealthy node reboots itself
1. After *timeout-to-assume-node-rebooted* (either configurable or calculated by the software) - delete the unhealthy node (to signal the cluster that the workload is no longer running there, and it is safe to run it elsewhere)
1. Recreate the node object, without the unschedulable taint

> **Review comment:** Do we also want to pair this with a set of re-tries? Or a buffer +/- 20%?
>
> **Author:** Hmm, can you elaborate? Retries for which operation?

The first step of marking the node as unschedulable is important: the reboot might fix the health issue, and the scheduler could then assign workloads to the node after it rebooted, which could result in other nodes deleting the rebooted node while it is running some workload. To avoid that, the other nodes will delete it and report it as unhealthy only after marking it unschedulable, which ensures it will not get any workloads after the reboot. A minimal sketch of these steps is shown below.
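
The sketch uses client-go and assumes a plain `kubernetes.Interface` is available; the annotation name, the blocking wait and the lack of retries are simplifications, and the reboot itself (step 3) is handled by the watchdog on the unhealthy node.

```go
// A sketch only, not the actual controller: no retries, back-off or leader election.
package poisonpill

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// rebootStartedAnnotation is an illustrative name, not defined by this proposal.
const rebootStartedAnnotation = "poison-pill.example.com/reboot-started"

// remediate cordons the unhealthy node, records when remediation started, waits long
// enough to assume the node has rebooted, then deletes and recreates the Node object
// so the scheduler can recover the workloads elsewhere.
func remediate(ctx context.Context, c kubernetes.Interface, nodeName string, assumeRebooted time.Duration) error {
	node, err := c.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}

	// Steps 1+2: mark unschedulable and record the current time.
	node.Spec.Unschedulable = true
	if node.Annotations == nil {
		node.Annotations = map[string]string{}
	}
	node.Annotations[rebootStartedAnnotation] = time.Now().Format(time.RFC3339)
	if node, err = c.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{}); err != nil {
		return err
	}

	// Keep a backup so the Node object can be recreated after deletion.
	backup := node.DeepCopy()

	// Step 3 happens on the unhealthy node itself (watchdog-triggered reboot).
	// Step 4: after the timeout, delete the Node to signal the workloads are gone.
	time.Sleep(assumeRebooted)
	if err := c.CoreV1().Nodes().Delete(ctx, nodeName, metav1.DeleteOptions{}); err != nil {
		return err
	}

	// Step 5: recreate the Node without the unschedulable flag.
	recreated := &corev1.Node{ObjectMeta: backup.ObjectMeta, Spec: backup.Spec}
	recreated.ResourceVersion = ""
	recreated.UID = ""
	recreated.Spec.Unschedulable = false
	_, err = c.CoreV1().Nodes().Create(ctx, recreated, metav1.CreateOptions{})
	return err
}
```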

### User Stories [optional]

#### Story 1
As a non-BMC baremetal cluster admin, I would like to have automated self-healing nodes. This will make my cluster more robust and resilient to failures.

#### Story 2
As a non-BMC baremetal cluster admin, I would like to minimize downtime for my applications, even if there's a node failure.

### Implementation Details/Notes/Constraints [optional]

This enhancement relies on the existence of MHC, which currently works only with IPI and the assisted installer.
We might want to enable MHC for UPI clusters as well, perhaps by faking Machine CRs for the nodes.

### Risks and Mitigations

If the unhealthy node doesn't reboot in time, there is a risk that the node will be deleted while the workload is still running.
To mitigate this we prefer a hardware watchdog, which ensures that the reboot will take place if the poison pill pod is not running.

There could be cases where this algorithm declares a node unhealthy while it is actually healthy (a false positive).
For example, if someone makes a network configuration mistake which blocks communication between the nodes, all nodes will consider themselves unhealthy and remediation will be triggered.
In this case, the cluster is mostly in an unusable state anyway, so no significant harm is done.


## Design Details

### Upgrade / Downgrade Strategy

N/A. This is a new component.

### Version Skew Strategy

Downgrading from a version with poison pill to a version without it during the remediation process can result in a node that doesn't exist in the api-server.
As part of the remediation process, the poison pill agents keep a backup of the node YAML.
They are expected to use that backup after the node has been deleted, to re-create the node.
If a downgrade happens during that process, there will be no poison pill agents to re-create the node.

In addition, a node might consult its peers for its health status. If they are downgraded to a version without poison pill,
they won't respond, and we might get a false positive.

## Drawbacks

TBD

> **Review comment:** We mention a few above with the false positives; should we explore / outline more of those here?


## Alternatives

Instead of contacting other peers, we could use a shared storage to signal the health status for a peer.

> **Review comment:** etcd? If this happens and we lose connectivity to etcd, what issue does this cause?


## Infrastructure Needed [optional]

A new repo might be needed to host the new poison pill codebase.