Conversation

@jaypoulz
Contributor

@jaypoulz jaypoulz commented Nov 6, 2025

  • Updated warning against baremetal platform including BMC block
  • Updated test section to note that we'll skip requirements criteria if no requirements are provided
  • Added a new block that explains the PacemakerCluster API, the status collector, and the health check controller

@openshift-ci
Contributor

openshift-ci bot commented Nov 6, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign joelanford for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

- A list of nodes currently registered in pacemaker
- A list of recent events recorded by the pacemaker resources
- A list of recent fencing events performed by pacemaker
- A dump of the full pacemaker XML. This is kept so that if the XML API changes in a way that breaks the other fields, we can quickly deliver a fix that parses the XML directly.
Contributor

This doesn't feel like the right place to put this. To me, at least, this isn't really a field a user would care about. Who is this API for?

Contributor Author

When I built this, I made the API for CEO.
In my newest revision, I've tried to make it useful for an end-user, but also, coincidentally, for CEO.
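
For illustration, here is a minimal Go sketch of the status shape described in the quoted list above. The field and type names are assumptions for this sketch, not the actual API:

```go
package pacemaker

// PacemakerClusterStatus is a minimal sketch of the status fields described
// above. Field and type names here are illustrative assumptions, not the
// final API shape.
type PacemakerClusterStatus struct {
	// Nodes lists the nodes currently registered in pacemaker.
	Nodes []PacemakerClusterNodeStatus `json:"nodes,omitempty"`
	// RecentEvents lists recent events recorded by the pacemaker resources.
	RecentEvents []PacemakerEvent `json:"recentEvents,omitempty"`
	// FencingEvents lists recent fencing events performed by pacemaker.
	FencingEvents []PacemakerFencingEvent `json:"fencingEvents,omitempty"`
	// RawXML is a dump of the full pacemaker XML, kept as an escape hatch so a
	// fix can parse the XML directly if a pacemaker change breaks the parsed fields.
	RawXML string `json:"rawXML,omitempty"`
}

// Placeholder types for the per-node and per-event entries.
type PacemakerClusterNodeStatus struct{}
type PacemakerEvent struct{}
type PacemakerFencingEvent struct{}
```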

- A dump of the full pacemaker XML. This is kept so that if the XML API changes in a way that breaks the other fields, we can quickly deliver a fix that parses the XML directly.

Once the PacemakerCluster object is populated, it is handled on the CEO side by a new pacemaker healthcheck controller. This controller evaluates the status report and creates events in CEO for the following:
Contributor

Why can the writer of PacemakerCluster not produce the events? It seems that component has all of the relevant information needed to write them.

Contributor Author

In my head it was a cleaner separation of concerns. But I've walked that back in favor of moving that functionality to the status collector.

- Warnings for fencing events that have happened on the cluster

More importantly, it also sets the CEO's status to degraded if one of the following conditions is true:
- Not all resources and nodes are in their expected / healthy state
Contributor

Is it correct for CEO to go degraded? I thought I saw kubelet listed? Wouldn't some other component be responsible for alerting when a kubelet on a control plane node is down? It doesn't really feel like a CEO issue to report.

Contributor Author

Most other components that rely on multiple replicas will be degraded at the same time. The obvious one is the API server. In fact, CEO already reports degraded when kubelet is down because it doesn't have all of the endpoints it thinks it's supposed to have (one per control-plane node).

The reason we include the kubelet behavior in the pacemaker status is that pacemaker ensures kubelet is started before etcd. That means that for etcd to be healthy, kubelet must be healthy. We could ignore the state of the kubelet resource when reporting the state of pacemaker, but as I mentioned before, the etcd member controller is going to report degraded anyway, so it's just extra information that explains why pacemaker is unhealthy.
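
A tiny Go sketch of that reasoning, with illustrative names only: because pacemaker orders kubelet before etcd, kubelet health is folded into the node's overall health.

```go
package pacemaker

// resourceHealth is an illustrative view of the three resources pacemaker
// manages per node in this design: kubelet, etcd, and the fencing agent.
type resourceHealth struct {
	Kubelet      bool
	Etcd         bool
	FencingAgent bool
}

// nodeHealthy reflects the ordering above: because pacemaker starts kubelet
// before etcd, etcd can only be healthy when kubelet is, so all three
// resources must be healthy for the node to be reported healthy.
func nodeHealthy(r resourceHealth) bool {
	return r.Kubelet && r.Etcd && r.FencingAgent
}

// clusterHealthy requires every registered node to be healthy.
func clusterHealthy(nodes map[string]resourceHealth) bool {
	for _, r := range nodes {
		if !nodeHealthy(r) {
			return false
		}
	}
	return true
}
```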


More importantly, it also sets the CEO's status to degraded if one of the following conditions is true:
- Not all resources and nodes are in their expected / healthy state
- The pacemakercluster status object is stale (hasn't been updated in the last 5 minutes)
Contributor

Needs admin intervention in a fairly prompt manner?

Contributor Author

We don't know for sure. We can only give admins instructions if we know the state of pacemaker. If we haven't received a status, this means that CEO's status collector cronjob has stopped posting them or what's being posted is being rejected by the API.

In either case, the cluster could be in a state where it could fail without recovering automatically. The goal is to raise this in a way where the cluster admin knows that something could be wrong.
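
As a rough sketch of that staleness condition (the function and field names are assumptions; the 5-minute window comes from the quoted text above):

```go
package pacemaker

import "time"

// staleThreshold is the window from the text above: if the PacemakerCluster
// status has not been updated within it, the healthcheck controller reports
// CEO as Degraded so a cluster admin takes a look.
const staleThreshold = 5 * time.Minute

// statusIsStale is a minimal check; lastUpdated would come from the
// PacemakerCluster status object (the exact field name is an assumption).
func statusIsStale(lastUpdated, now time.Time) bool {
	return now.Sub(lastUpdated) > staleThreshold
}
```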

@jaypoulz jaypoulz changed the title Updated TNF EP to address some drift from original requirements. OCPEDGE-2215: Updated TNF EP to address some drift from original requirements. Nov 6, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Nov 6, 2025
@openshift-ci-robot

openshift-ci-robot commented Nov 6, 2025

@jaypoulz: This pull request references OCPEDGE-2215 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

  • Updated warning against baremetal platform including BMC block
  • Updated test section to note that we'll skip requirements criteria if no requirements are provided
  • Added a new block that explains the PacemakerCluster API, the status collector, and the health check controller

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@jaypoulz jaypoulz force-pushed the tnf branch 3 times, most recently from de575ed to c3c1fa5 on December 5, 2025 20:46
To achieve this, we plan on using two new controllers in CEO. The first is a status collector, which syncs every 30 seconds to gather the current state of pacemaker via `sudo pcs status xml`.
This is parsed to create a `PacemakerCluster` status object, a singleton resource created by CEO when the transition to etcd running externally is completed.
Additionally, it creates events for the following:
- Error events when kubelet, etcd, or the fencing agent on a node enters an unhealthy state
Contributor

I assume these error events then happen every 30s while the node is unhealthy?

Contributor Author

The intent is that it works as follows: every 30 seconds, we scan pacemaker for updates related to resources and fencing. If a new event is present (e.g. etcd/fencingAgent/kubelet was started or stopped, or a node was fenced), we check whether it has already been posted, and post it if it hasn't been.

The latter part of this implementation is a little tricky. The naive way to do it is to use a {node-name}-{resource-name}-{timestamp-hash} kind of scheme for the event names. Then I can just blindly try to create them every 30s and ignore the 409s.

The nicer way to do it is probably to get the last n (probably 2-5) minutes' worth of events, filter out the ones created by the status checker, and make sure my names don't conflict prior to creation.

The bottom line is that one "action" captured by pacemaker should equate to one "event" recorded by the api-server.

We don't plan on taking action based on events; actions will be taken based on the API conditions. The events are just there to allow a cluster admin to reconstruct a timeline of what might have happened if we've degraded CEO due to pacemaker being unhealthy.
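
A minimal Go sketch of the naive approach described above: build a deterministic {node}-{resource}-{timestamp-hash} name and ignore AlreadyExists (409) on create. The namespace, reason, and helper names are assumptions for this sketch.

```go
package pacemaker

import (
	"context"
	"crypto/sha256"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// eventName builds the {node-name}-{resource-name}-{timestamp-hash} style name,
// so one pacemaker action maps to exactly one Event object.
func eventName(node, resource, timestamp string) string {
	sum := sha256.Sum256([]byte(timestamp))
	return fmt.Sprintf("%s-%s-%x", node, resource, sum[:4])
}

// postEvent blindly creates the Event every sync and treats 409 AlreadyExists
// as success. InvolvedObject and other fields are omitted for brevity.
func postEvent(ctx context.Context, client kubernetes.Interface, node, resource, timestamp, message string) error {
	ev := &corev1.Event{
		ObjectMeta: metav1.ObjectMeta{
			Name:      eventName(node, resource, timestamp),
			Namespace: "openshift-etcd", // assumption
		},
		Reason:  "PacemakerResourceStateChange", // assumption
		Message: message,
		Type:    corev1.EventTypeWarning,
	}
	if _, err := client.CoreV1().Events(ev.Namespace).Create(ctx, ev, metav1.CreateOptions{}); err != nil && !apierrors.IsAlreadyExists(err) {
		return err
	}
	return nil
}
```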

Contributor

In other APIs, we see events emitted regularly over a period. An `oc describe` will say x times over y time period next to the events as it aggregates. I don't think you necessarily need to do the deduplication you describe.

I assume that as an end user, I'd be able to see "this status has cleared" when there's an error because a newer event would have come through that shows things returning to normal?
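
For reference, a sketch of what relying on client-go's EventRecorder, as described above, could look like; the recorder's correlator aggregates repeated identical events into a single Event with a count, so no hand-rolled deduplication is needed. The component and reason names are assumptions.

```go
package pacemaker

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes/scheme"
	typedcorev1 "k8s.io/client-go/kubernetes/typed/core/v1"
	"k8s.io/client-go/tools/record"
)

// newRecorder wires up a client-go EventRecorder. The caller keeps the
// broadcaster alive for the lifetime of the status collector.
func newRecorder(events typedcorev1.EventInterface) (record.EventBroadcaster, record.EventRecorder) {
	broadcaster := record.NewBroadcaster()
	broadcaster.StartRecordingToSink(&typedcorev1.EventSinkImpl{Interface: events})
	recorder := broadcaster.NewRecorder(scheme.Scheme, corev1.EventSource{Component: "pacemaker-status-collector"})
	return broadcaster, recorder
}

// Repeated calls such as
//   recorder.Eventf(obj, corev1.EventTypeWarning, "EtcdStopped", "etcd on %s is stopped", nodeName)
// are aggregated, so `oc describe` shows "N times over M minutes" rather than
// N separate Event objects.
```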

Contributor Author

I think, in general, the events we'll capture will likely not represent error conditions. But let's say you had an etcd-node0-stop event prior to a reboot or something, and the status starts reporting that the pacemakercluster is unhealthy: ClusterUnhealthy because NodeUnhealthy, NodeUnhealthy because EtcdUnhealthy. You have the event that tells you that etcd is stopped. There should be an etcd-node0-start event to match everything becoming healthy again.

That said, we can also add events for "etcd is down" that would work like the regularly emitted events you describe. I think the conditions probably already cover that sufficiently though, yeah? Everything else is just a record that this thing happened at exactly this time.

I don't know if there is a way to detect "fencing completed successfully" events, as an example. We have a record of when the reboot signal was sent and succeeded, but no new event is expected when the node comes back up healthy (besides the resource start events).

Contributor Author

To phrase it differently:
If the collection method is "list the things that happened in the pacemaker cluster in the last 5 minutes", then you can potentially end up with some strange windows where you're repeating both events that say the cluster is healthy and events that say the cluster is not.

If the collection method is "(running every 30s) list the things that happened in the pacemaker cluster in the last 30 seconds", you have no duplicate events, but you could miss an event if you tried to run the status collector during a node reboot and it had to be rescheduled on the other node (which can take several minutes).

The goal of this API is to try to provide pre-warnings for cluster configuration issues and an accurate reconstructed timeline for when things happened. So I think the former design is better for the latter goal. Both solve the first one just fine.
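
A small sketch of the 5-minute-window variant, assuming a parsed history-entry type; overlap between consecutive scans is handled by the deterministic event names sketched earlier.

```go
package pacemaker

import "time"

// historyEntry is an illustrative view of one resource or fencing record
// parsed out of `pcs status xml`.
type historyEntry struct {
	Node      string
	Resource  string
	Timestamp time.Time
}

// collectWindow is deliberately larger than the 30s sync interval, so a
// collector reschedule or node reboot does not drop entries; duplicates across
// overlapping scans collapse onto the same deterministic event name.
const collectWindow = 5 * time.Minute

// recentEntries keeps only the entries inside the lookback window.
func recentEntries(all []historyEntry, now time.Time) []historyEntry {
	var recent []historyEntry
	for _, e := range all {
		if now.Sub(e.Timestamp) <= collectWindow {
			recent = append(recent, e)
		}
	}
	return recent
}
```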

- A list of `PacemakerClusterNodeStatus` objects representing the state of the nodes registered by pacemaker

The `PacemakerClusterNodeStatus` consists of:
- The name and IP address of the node
Contributor

Let's make sure the IP address API we build reflects that of the Node object, i.e. has the ability to provide multiple addresses and specify the type of each address.
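
A hypothetical shape for that, mirroring corev1.NodeAddress so a node can report multiple typed addresses; the type name is an assumption for this sketch.

```go
package pacemaker

import corev1 "k8s.io/api/core/v1"

// PacemakerClusterNodeAddress mirrors corev1.NodeAddress: a typed address so a
// node can report several (Hostname, InternalIP, ExternalIP, ...).
type PacemakerClusterNodeAddress struct {
	// Type is the address type, e.g. InternalIP or Hostname.
	Type corev1.NodeAddressType `json:"type"`
	// Address is the address value.
	Address string `json:"address"`
}
```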

- Updated warning against baremetal platform including BMC block
- Updated test section to note that we'll skip requirements criteria if no requirements are provided
- Added a new block that explains the PacemakerCluster API, the status collector, and the health check controller
@openshift-ci
Contributor

openshift-ci bot commented Dec 17, 2025

@jaypoulz: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@JoelSpeed
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 22, 2025