enhancements/two-node-fencing/tnf.md (106 changes: 50 additions & 56 deletions)

@@ -155,8 +155,8 @@ At a glance, here are the components we are proposing to change:
| Component | Change |
| ----------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------- |
| [Feature Gates](#feature-gate-changes) | Add a new `DualReplicaTopology` feature which can be enabled via the `CustomNoUpgrade` feature set |
| [OpenShift API](#openshift-api-changes) | Add `DualReplica` as a new value for `ControlPlaneTopology` |
| [ETCD Operator](#etcd-operator-changes) | Add a mode to stop managing the etcd container, a new scaling strategy, and new TNF controller for initializing pacemaker |
| [OpenShift API](#openshift-api-changes)                            | Add `DualReplica` as a new value for `ControlPlaneTopology` and a new `PacemakerCluster` CRD for CEO health checking               |
| [ETCD Operator](#etcd-operator-changes)                            | Add an external etcd mode, a new scaling strategy, a new TNF controller for initializing pacemaker, and a pacemaker health checker |
| [Install Config](#install-config-changes) | Update install config API to accept fencing credentials in the control plane for `platform: None` and `platform: Baremetal` |
| [Installer](#installer-changes) | Populate the nodes with initial pacemaker configuration when deploying with 2 control-plane nodes and no arbiter |
| [MCO](#mco-changes) | Add an MCO extension for installing pacemaker and corosync in RHCOS; MachineConfigPool maxUnavailable set to 1 |
@@ -317,6 +317,9 @@ In the future, it may be possible to lower the privilege level of the TNF controller
to run without root privileges. We are working with the RHEL-HA team to identify the specific set of commands that we use, so that we can narrow the required scope and make progress towards this goal. This remains a long-term
objective for both teams.

##### The PacemakerCluster Health Check
See [Status Propagation with PacemakerCluster Health Check](#status-propagation-with-pacemakercluster-health-check) for details.

#### Install Config Changes

In order to initialize pacemaker with valid fencing credentials, the credentials will be consumed by the installer via the install config and created on the cluster as a secret.
@@ -382,53 +385,8 @@ sshKey: ''
```

Unfortunately, the Bare Metal Operator already has an API that accepts BMC credentials as part of configuring BareMetalHost CRDs. Adding BMC credentials to the BareMetalHost CRD allows the Baremetal
Operator to manage the power status of that host via ironic. This is **strictly incompatible** with TNF because both the Bare Metal Operator and the pacemaker fencing agent will have control over the
machine state.

This example shows an **invalid** install configuration that the installer will reject for TNF.
```
apiVersion: v1
baseDomain: example.com
compute:
- name: worker
  replicas: 0
controlPlane:
  name: master
  replicas: 2
  fencing:
    credentials:
    - hostname: <control-0-hostname>
      address: https://<redfish-api-url>
      username: <username>
      password: <password>
    - hostname: <control-1-hostname>
      address: https://<redfish-api-url>
      username: <username>
      password: <password>
metadata:
  name: <cluster-name>
platform:
  baremetal:
    apiVIPs:
    - <api_ip>
    ingressVIPs:
    - <wildcard_ip>
    hosts:
    - name: openshift-cp-0
      role: master
      bmc:
        address: ipmi://<out_of_band_ip>
        username: <username>
        password: <password>
    - name: openshift-cp-1
      role: master
      bmc:
        address: ipmi://<out_of_band_ip>
        username: <username>
        password: <password>
pullSecret: ''
sshKey: ''
```
Operator to manage the power status of that host via ironic. To work around this, we detach the control-plane nodes from ironic once they are provisioned by adding the detached annotation
(`baremetalhost.metal3.io/detached: ""`).
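As a minimal sketch (the host name is reused from the examples above; the namespace is an assumption), a detached control-plane host might look like the following:

```
# Illustrative only: a provisioned control-plane BareMetalHost carrying the detached
# annotation, so that ironic no longer manages its power state (pacemaker fencing does).
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: openshift-cp-0
  namespace: openshift-machine-api   # assumed namespace
  annotations:
    baremetalhost.metal3.io/detached: ""
```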

##### Why don't we reuse the existing APIs in the `Baremetal` platform?
Reusing the existing APIs tightly couples separate outcomes that are important to distinguish for the end user.
@@ -708,6 +666,42 @@ This collection of diagrams collects a series of scenarios where both nodes fail

![Diagrams of Multi-Node Failure Scenarios](etcd-flowchart-both-nodes-reboot-scenarios.svg)

#### Status Propagation with PacemakerCluster Health Check
An important goal of Two Node OpenShift with Fencing is to give the user an early warning when the cluster enters a state where automatic recovery from quorum loss is not possible. To provide this
warning, pacemaker health information needs to be available in the cluster. For example, a cluster administrator might rotate their BMC password without updating the fencing secret
in the cluster. This would be caught by the pacemaker monitoring checks, but something in the cluster needs to propagate that information to the user directly.

To achieve this, we plan on using two new controllers in CEO. The first is a status collector which syncs every 30 seconds to gather the current state of pacemaker via `sudo pcs status xml`.
This is parsed to create a `PacemakerCluster` status object, a singleton resource created by CEO when the transition to etcd running externally is completed.
Additionally, it creates events for the following (an illustrative event is sketched below, after the review discussion):
- Error events when kubelet, etcd, or the fencing agent on a node enters an unhealthy state
> **Contributor:** I assume these error events then happen every 30s while the node is unhealthy?
>
> **Contributor Author:** The intent is that it works as follows - every 30 seconds we scan pacemaker for updates related to resources and fencing. If a new event is present (e.g. etcd/fencing agent/kubelet was started or stopped, a node was fenced), check if it has already been posted, and post it if it hasn't been.
>
> The latter part of this implementation is a little tricky. The naive way to do it is to use a `{node-name}-{resource-name}-{timestamp-hash}` kind of scheme for the event names. Then I can just blindly try to create them every 30s and ignore the 409s.
>
> The nicer way to do it is probably to get the last n (probably 2-5) minutes' worth of events, filter out the ones created by the status checker, and make sure my names don't conflict prior to creation.
>
> Bottom line is - one "action" captured by pacemaker should equate to one "event" recorded by the api-server.
>
> We don't plan on taking action based on events - those will be taken based on the API conditions. The events are just here to allow a cluster admin to reconstruct a timeline of what might have happened if we've degraded CEO due to pacemaker being unhealthy.
>
> **Contributor:** In other APIs, we see events emitted regularly over a period. An `oc describe` will say "x times over x time period" next to the events as it aggregates them. I don't think you necessarily need to do the deduplication you describe.
>
> I assume that as an end user, I'd be able to see "this status has cleared" when there's an error, because a newer event would have come through that shows things returning to normal?
>
> **Contributor Author:** I think in general, the events we'll capture will likely not represent error conditions. But let's say you had an etcd-node0-stop event prior to a reboot or something, and the status starts reporting that the `PacemakerCluster` is unhealthy: ClusterUnhealthy because NodeUnhealthy, NodeUnhealthy because EtcdUnhealthy. You have the event that tells you that etcd is stopped. There should be an etcd-node0-start event to match everything becoming healthy again.
>
> That said, we can also add events for "etcd is down" that would work like regularly emitted events. I think the conditions probably already cover that sufficiently though, yeah? Everything else is just a record of "this thing happened at exactly this time".
>
> I don't know if there is a way to detect "fencing completed successfully" events, as an example. We have a record of when the reboot signal was sent and succeeded, but no new event is expected when the node comes back up healthy (besides the resource start events).
>
> **Contributor Author:** To phrase it differently: if the collection method is "list the things that happened in the pacemaker cluster in the last 5 minutes", then you can potentially end up with some strange windows where you're repeating both events that say the cluster is healthy and events that say the cluster is not.
>
> If the collection method is "list the things that happened in the pacemaker cluster in the last 30 seconds" (running every 30s), you have no duplicate events, but you could miss an event if you tried to run the status collector during a node reboot and it had to be rescheduled on the other node (which can take several minutes).
>
> The goal of this API is to try to provide pre-warnings for cluster configuration issues and an accurate reconstructed timeline for when things happened. So I think the former design is better for the latter goal. Both solve the first one just fine.

- Warning events when kubelet, etcd, or the fencing agent on a node is started or stopped
- Warning events when fencing actions are taken by the cluster

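As a rough illustration of the event-naming idea discussed in the review thread above (the node name, hash, reason, and message are all hypothetical), a deterministically named event might look like this:

```
# Hypothetical sketch: naming the Event {node-name}-{resource-name}-{timestamp-hash}
# lets the collector blindly re-create it every 30 seconds and ignore 409 Conflict responses.
apiVersion: v1
kind: Event
metadata:
  name: master-0-etcd-1a2b3c4d
  namespace: openshift-etcd
type: Warning
reason: ResourceStopped
message: "pacemaker reports that the etcd resource on master-0 was stopped"
involvedObject:
  kind: Node
  name: master-0
```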
The `PacemakerCluster` resource provides the cluster with key information that CEO can use to determine the overall health and threats to etcd. It consists of:
- A `lastUpdated` timestamp tracking when the status was last updated
- Cluster-level conditions tracking overall health, in-service status (not in maintenance mode), and whether the node count is as expected
- A list of `PacemakerClusterNodeStatus` objects representing the state of the nodes registered by pacemaker

The `PacemakerClusterNodeStatus` consists of:
- The name and IP addresses of the node (pacemaker allows multiple IP addresses for Corosync communication; the first address is used for etcd peer URLs)
- Node-level conditions tracking online status, in-service (not in maintenance mode), standby, pending, clean state, and membership
- Resource status for kubelet, etcd, and fencing agent (each with conditions tracking health, in-service, managed, enabled, operational, active, started, and schedulable states)

For full API details, see the [openshift/api pull request](https://github.com/openshift/api/pull/2544).

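A condensed sketch of what such a status object might look like follows; the field and condition names are paraphrased from the description above, the API group/version is assumed, and the linked openshift/api pull request remains authoritative.

```
# Illustrative only - see the openshift/api pull request for the real schema.
apiVersion: etcd.openshift.io/v1alpha1   # assumed group/version
kind: PacemakerCluster
metadata:
  name: cluster                          # singleton created by CEO
status:
  lastUpdated: "2025-01-01T12:00:00Z"
  conditions:                            # cluster-level: health, in-service, node count
  - type: Healthy
    status: "True"
  nodes:                                 # one PacemakerClusterNodeStatus per node
  - name: master-0
    addresses:
    - 192.0.2.10                         # the first address is used for etcd peer URLs
    conditions:                          # online, in-service, standby, pending, clean, membership
    - type: Online
      status: "True"
    resources:                           # kubelet, etcd, and fencing agent status
      etcd:
        conditions:
        - type: Healthy
          status: "True"
      # kubelet and fencingAgent carry the same condition shape (omitted for brevity)
```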
The `PacemakerCluster` object is used by a new pacemaker healthcheck controller to inform the status of CEO. The healthcheck controller is responsible for applying conditions to
the cluster-etcd-operator to reflect when pacemaker is unhealthy and at risk of not being able to automatically recover from quorum loss events. Specifically, it sets the CEO's
status to degraded if one of the following conditions is true:
- One or more of the nodes has an unhealthy kubelet, etcd, or fencing agent
- The `PacemakerCluster` status object is stale (hasn't been updated in the last 5 minutes)

Both of these conditions indicate that the cluster administrator should take action to restore these health checks or services to ensure the continued healthy operation of their cluster. The risk of
ignoring this is that automatic quorum recovery might not be active in the cluster.
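For illustration, the resulting operator status might resemble the following; the reason and message strings are hypothetical, and the controller's actual wording is defined in CEO.

```
# Hypothetical example of a Degraded condition set when pacemaker is unhealthy or the
# PacemakerCluster status is stale.
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  name: etcd
status:
  conditions:
  - type: Degraded
    status: "True"
    reason: PacemakerClusterUnhealthy    # e.g. PacemakerStatusStale for the 5-minute case
    message: >-
      The fencing agent on node master-0 is unhealthy; automatic recovery from
      quorum loss may not be possible until the fencing credentials are fixed.
```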

The only time the contents of `PacemakerCluster` are used outside of operator status reporting is during a node replacement event. In this situation, we need to match the node being removed to a node
registered by pacemaker. This ensures that CEO can enforce replacing the correct (failed) node in pacemaker as well as the cluster.

#### Running Two Node OpenShift with Fencing with a Failed Node

An interesting aspect of TNF is that should a node fail and remain in a failed state, the cluster recovery operation will allow the survivor to restart etcd as a cluster-of-one and resume normal
@@ -716,9 +710,8 @@ aspects:

1. Operators that deploy to multiple nodes will become degraded.
2. Operations that would violate pod-disruption budgets will not work.
3. Lifecycle operations that would violate the `MaxUnavailable` setting of the control-plane
[MachineConfigPool](https://docs.openshift.com/container-platform/4.17/updating/understanding_updates/understanding-openshift-update-duration.html#factors-affecting-update-duration_openshift-update-duration)
cannot proceed. This includes MCO node reboots and cluster upgrades.
3. Lifecycle operations that would violate the `MaxUnavailable` setting of the control-plane [MachineConfigPool](https://docs.openshift.com/container-platform/4.17/updating/understanding_updates/understanding-openshift-update-duration.html#factors-affecting-update-duration_openshift-update-duration) cannot proceed. This includes MCO node reboots and cluster upgrades (a sketch of this setting follows below).

In short - it is not recommended that users allow their clusters to remain in this semi-operational state long term. It is intended to help ensure that the api-server and workloads remain available as much as
possible, but it is not sufficient for the operation of a healthy cluster long term.
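To make point 3 above concrete, here is a minimal sketch of the control-plane pool with the `maxUnavailable` value this enhancement sets, assuming the default pool name `master`:

```
# Illustrative fragment: with maxUnavailable set to 1 and one node already down,
# the MCO cannot drain or reboot the surviving control-plane node.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: master
spec:
  maxUnavailable: 1
```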
@@ -840,12 +833,13 @@ Disadvantages:

Pacemaker will be running as a system daemon and reporting errors about its various agents to the system journal. The question is, what is the best way to expose these to a cluster admin? A simple
example would be an issue where pacemaker discovers that its fencing agent can no longer talk to the BMC. What is the best way to raise this error to the cluster admin, such that they can see that
their cluster may be at risk of failure if no action is taken to resolve the problem? In our current design, we'd likely need to explore what kinds of errors we can bubble up through existing
cluster health APIs to see if something suitable can be reused.
their cluster may be at risk of failure if no action is taken to resolve the problem?

For situations where we recognize a risk to etcd health if no action is taken, we plan on monitoring the pacemaker status via the TNF controller and setting CEO to degraded with a message to
explain the action(s) needed. This has the added benefit of ensuring that the installer fails during deployment if we cannot properly set up etcd under pacemaker.

See [Status Propagation with PacemakerCluster Health Check](#status-propagation-with-pacemakercluster-health-check) for more details.

## Test Plan

**Note:** *Section not required until targeted at a release.*
@@ -869,7 +863,7 @@ The initial release of TNF should aim to build a regression baseline.
| Test | Kubelet failure [^2] | A new TNF test to detect if the cluster recovers if kubelet fails. |
| Test | Failure in etcd [^2] | A new TNF test to detect if the cluster recovers if etcd fails. |
| Test | Valid PDBs | A new TNF test to verify that PDBs are set to the correct configuration |
| Test | Conformant recovery | A new TNF test to verify recovery times for failure events are within the creteria defined in the requirements |
| Test | Conformant recovery | A new TNF test to verify recovery times meet or beat requirements if requirements are set. |
| Test | Fencing health check | A new TNF test to verify that the [Fencing Health Check](#fencing-health-check) process is successful |
| Test | Replacing a control-plane node | A new TNF test to verify that you can replace a control-plane node in a 2-node cluster |
| Test | Certificate rotation with an unhealthy node | A new TNF test to verify certificate rotation on a cluster with an unhealthy node that rejoins after the rotation |
@@ -983,4 +977,4 @@ upgrade testing. This will be something to keep a close eye on when evaluating u

## Infrastructure Needed [optional]

Bare-metal systems will be needed from Beaker to test and gather performance metrics.