Conversation

@jaypoulz
Contributor

@jaypoulz jaypoulz commented Nov 6, 2025

  • Updated warning against baremetal platform including BMC block
  • Updated test section to note that we'll skip requirements criteria if no requirements are provided
  • Added a new block that explains the PacemakerCluster API, the status collector, and the health check controller

@openshift-ci
Contributor

openshift-ci bot commented Nov 6, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign joelanford for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

- A list of nodes currently registered in pacemaker
- A list of recent events recorded by the pacemaker resources
- A list of recent fencing events performed by pacemaker
- A dump of the full pacemaker XML. This is kept so that if the XML API changes in a way that breaks the other fields, we can quickly deliver a fix that parses the XML directly.
Contributor

This doesn't feel like the right place to put this. To me, at least, this isn't really a field a user would care about. Who is this API for?

Contributor Author

When I built this, I made the API for CEO.
In my newest revision, I've tried to make it useful for an end-user, but also, coincidentally, for CEO.
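
For illustration, here is a minimal Go sketch of the status shape described in the quoted list above. The field and type names are assumptions for this sketch, not the actual API:

```go
package pacemaker

// PacemakerClusterStatus is a minimal sketch of the status fields described
// above. Field and type names here are illustrative assumptions, not the
// final API shape.
type PacemakerClusterStatus struct {
	// Nodes lists the nodes currently registered in pacemaker.
	Nodes []PacemakerClusterNodeStatus `json:"nodes,omitempty"`
	// RecentEvents lists recent events recorded by the pacemaker resources.
	RecentEvents []PacemakerEvent `json:"recentEvents,omitempty"`
	// FencingEvents lists recent fencing events performed by pacemaker.
	FencingEvents []PacemakerFencingEvent `json:"fencingEvents,omitempty"`
	// RawXML is a dump of the full pacemaker XML, kept as an escape hatch so a
	// fix can parse the XML directly if a pacemaker change breaks the parsed fields.
	RawXML string `json:"rawXML,omitempty"`
}

// Placeholder types for the per-node and per-event entries.
type PacemakerClusterNodeStatus struct{}
type PacemakerEvent struct{}
type PacemakerFencingEvent struct{}
```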

- A dump of the full pacemaker XML. This is kept so that if the XML API changes in a way that breaks the other fields, we can quickly deliver a fix that parses the XML directly.

Once the PacemakerCluster object is populated, it is handled on the CEO side by a new pacemaker healthcheck controller. This controller evaluates the status report and creates events in CEO for the following:
Contributor

Why can the writer of PacemakerCluster not produce the events? It seems that component has all of the relevant information needed to write them.

Contributor Author

In my head it was a cleaner separation of concerns. But I've walked that back in favor of moving that functionality to the status collector.

- Warnings for fencing events that have happened on the cluster

More importantly, it also sets the CEO's status to degraded if one of the following conditions is true:
- Not all resources and nodes are in their expected / healthy state
Contributor

Is it correct for CEO to go degraded? I thought I saw kubelet listed? Wouldn't some other component be responsible for alerting when a kubelet on a control plane node is down? It doesn't really feel like a CEO issue to report.

Contributor Author

Most other components that rely on multiple replicas will be degraded at the same time. The obvious one is the API server. In fact, CEO already reports degraded when kubelet is down because it doesn't have all of the endpoints it thinks it's supposed to have (one per control-plane node).

The reason we include the kubelet behavior in the pacemaker status is that pacemaker ensures kubelet is started before etcd. That means that for etcd to be healthy, kubelet must be healthy. We could ignore the state of the kubelet resource when reporting the state of pacemaker, but as I mentioned before, the etcd member controller is going to report degraded anyway, so it's just extra information that explains why pacemaker is unhealthy.
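
A tiny Go sketch of that reasoning, with illustrative names only: because pacemaker orders kubelet before etcd, kubelet health is folded into the node's overall health.

```go
package pacemaker

// resourceHealth is an illustrative view of the three resources pacemaker
// manages per node in this design: kubelet, etcd, and the fencing agent.
type resourceHealth struct {
	Kubelet      bool
	Etcd         bool
	FencingAgent bool
}

// nodeHealthy reflects the ordering above: because pacemaker starts kubelet
// before etcd, etcd can only be healthy when kubelet is, so all three
// resources must be healthy for the node to be reported healthy.
func nodeHealthy(r resourceHealth) bool {
	return r.Kubelet && r.Etcd && r.FencingAgent
}

// clusterHealthy requires every registered node to be healthy.
func clusterHealthy(nodes map[string]resourceHealth) bool {
	for _, r := range nodes {
		if !nodeHealthy(r) {
			return false
		}
	}
	return true
}
```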


More importantly, it also sets the CEO's status to degraded if one of the following conditions is true:
- Not all resources and nodes are in their expected / healthy state
- The pacemakercluster status object is stale (hasn't been updated in the last 5 minutes)
Contributor

Needs admin intervention in a fairly prompt manner?

Contributor Author

We don't know for sure. We can only give admins instructions if we know the state of pacemaker. If we haven't received a status, this means that CEO's status collector cronjob has stopped posting them or what's being posted is being rejected by the API.

In either case, the cluster could be in a state where it could fail without recovering automatically. The goal is to raise this in a way where the cluster admin knows that something could be wrong.
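
As a rough sketch of that staleness condition (the function and field names are assumptions; the 5-minute window comes from the quoted text above):

```go
package pacemaker

import "time"

// staleThreshold is the window from the text above: if the PacemakerCluster
// status has not been updated within it, the healthcheck controller reports
// CEO as Degraded so a cluster admin takes a look.
const staleThreshold = 5 * time.Minute

// statusIsStale is a minimal check; lastUpdated would come from the
// PacemakerCluster status object (the exact field name is an assumption).
func statusIsStale(lastUpdated, now time.Time) bool {
	return now.Sub(lastUpdated) > staleThreshold
}
```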

@jaypoulz jaypoulz changed the title Updated TNF EP to address some drift from original requirements. OCPEDGE-2215: Updated TNF EP to address some drift from original requirements. Nov 6, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Nov 6, 2025
@openshift-ci-robot

openshift-ci-robot commented Nov 6, 2025

@jaypoulz: This pull request references OCPEDGE-2215 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

  • Updated warning against baremetal platform including BMC block
  • Updated test section to note that we'll skip requirements criteria if no requirements are provided
  • Added a new block that explains the PacemakerCluster API, the status collector, and the health check controller

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@jaypoulz jaypoulz force-pushed the tnf branch 3 times, most recently from de575ed to c3c1fa5 on December 5, 2025 20:46
To achieve this, we plan on using two new controllers in CEO. The first is a status collector, which syncs every 30 seconds to gather the current state of pacemaker via `sudo pcs status xml`.
This is parsed to create a `PacemakerCluster` status object, a singleton resource created by CEO when the transition to etcd running externally is completed.
Additionally, it creates events for the following:
- Error events when kubelet, etcd, or the fencing agent on a node enters an unhealthy state
Contributor

I assume these error events then happen every 30s while the node is unhealthy?

Contributor Author

The intent is that it works as follows: every 30 seconds, we scan pacemaker for updates related to resources and fencing. If a new event is present (e.g. etcd/fencingAgent/kubelet was started or stopped, or a node was fenced), we check whether it has already been posted, and post it if it hasn't been.

The latter part of this implementation is a little tricky. The naive way to do it is to use a {node-name}-{resource-name}-{timestamp-hash} kind of scheme for the event names. Then I can just blindly try to create them every 30s and ignore the 409s.

The nicer way to do it is probably to get the last n (probably 2-5) minutes' worth of events, filter out the ones created by the status checker, and make sure my names don't conflict prior to creation.

The bottom line is that one "action" captured by pacemaker should equate to one "event" recorded by the api-server.

We don't plan on taking action based on events; actions will be taken based on the API conditions. The events are just there to allow a cluster admin to reconstruct a timeline of what might have happened if we've degraded CEO due to pacemaker being unhealthy.
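
A minimal Go sketch of the naive approach described above: build a deterministic {node}-{resource}-{timestamp-hash} name and ignore AlreadyExists (409) on create. The namespace, reason, and helper names are assumptions for this sketch.

```go
package pacemaker

import (
	"context"
	"crypto/sha256"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// eventName builds the {node-name}-{resource-name}-{timestamp-hash} style name,
// so one pacemaker action maps to exactly one Event object.
func eventName(node, resource, timestamp string) string {
	sum := sha256.Sum256([]byte(timestamp))
	return fmt.Sprintf("%s-%s-%x", node, resource, sum[:4])
}

// postEvent blindly creates the Event every sync and treats 409 AlreadyExists
// as success. InvolvedObject and other fields are omitted for brevity.
func postEvent(ctx context.Context, client kubernetes.Interface, node, resource, timestamp, message string) error {
	ev := &corev1.Event{
		ObjectMeta: metav1.ObjectMeta{
			Name:      eventName(node, resource, timestamp),
			Namespace: "openshift-etcd", // assumption
		},
		Reason:  "PacemakerResourceStateChange", // assumption
		Message: message,
		Type:    corev1.EventTypeWarning,
	}
	if _, err := client.CoreV1().Events(ev.Namespace).Create(ctx, ev, metav1.CreateOptions{}); err != nil && !apierrors.IsAlreadyExists(err) {
		return err
	}
	return nil
}
```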

Contributor

In other APIs, we see events emitted regularly over a period. An `oc describe` will say x times over y time period next to the events as it aggregates. I don't think you necessarily need to do the deduplication you describe.

I assume that as an end user, I'd be able to see "this status has cleared" when there's an error because a newer event would have come through that shows things returning to normal?
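
For reference, a sketch of what relying on client-go's EventRecorder, as described above, could look like; the recorder's correlator aggregates repeated identical events into a single Event with a count, so no hand-rolled deduplication is needed. The component and reason names are assumptions.

```go
package pacemaker

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes/scheme"
	typedcorev1 "k8s.io/client-go/kubernetes/typed/core/v1"
	"k8s.io/client-go/tools/record"
)

// newRecorder wires up a client-go EventRecorder. The caller keeps the
// broadcaster alive for the lifetime of the status collector.
func newRecorder(events typedcorev1.EventInterface) (record.EventBroadcaster, record.EventRecorder) {
	broadcaster := record.NewBroadcaster()
	broadcaster.StartRecordingToSink(&typedcorev1.EventSinkImpl{Interface: events})
	recorder := broadcaster.NewRecorder(scheme.Scheme, corev1.EventSource{Component: "pacemaker-status-collector"})
	return broadcaster, recorder
}

// Repeated calls such as
//   recorder.Eventf(obj, corev1.EventTypeWarning, "EtcdStopped", "etcd on %s is stopped", nodeName)
// are aggregated, so `oc describe` shows "N times over M minutes" rather than
// N separate Event objects.
```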

Contributor Author

I think, in general, the events we'll capture will likely not represent error conditions. But let's say you had an etcd-node0-stop event prior to a reboot or something, and the status starts reporting that the pacemakercluster is unhealthy: ClusterUnhealthy because NodeUnhealthy, NodeUnhealthy because EtcdUnhealthy. You have the event that tells you that etcd is stopped. There should be an etcd-node0-start event to match everything becoming healthy again.

That said, we can also add events for "etcd is down" that would work like the regularly emitted events you describe. I think the conditions probably already cover that sufficiently though, yeah? Everything else is just a record that this thing happened at exactly this time.

I don't know if there is a way to detect "fencing completed successfully" events, as an example. We have a record of when the reboot signal was sent and succeeded, but no new event is expected when the node comes back up healthy (besides the resource start events).

Contributor Author

To phrase it differently:
If the collection method is "list the things that happened in the pacemaker cluster in the last 5 minutes", then you can potentially end up with some strange windows where you're repeating both events that say the cluster is healthy and events that say the cluster is not.

If the collection method is "(running every 30s) list the things that happened in the pacemaker cluster in the last 30 seconds", you have no duplicate events, but you could miss an event if you tried to run the status collector during a node reboot and it had to be rescheduled on the other node (which can take several minutes).

The goal of this API is to try to provide pre-warnings for cluster configuration issues and an accurate reconstructed timeline for when things happened. So I think the former design is better for the latter goal. Both solve the first one just fine.
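
A small sketch of the 5-minute-window variant, assuming a parsed history-entry type; overlap between consecutive scans is handled by the deterministic event names sketched earlier.

```go
package pacemaker

import "time"

// historyEntry is an illustrative view of one resource or fencing record
// parsed out of `pcs status xml`.
type historyEntry struct {
	Node      string
	Resource  string
	Timestamp time.Time
}

// collectWindow is deliberately larger than the 30s sync interval, so a
// collector reschedule or node reboot does not drop entries; duplicates across
// overlapping scans collapse onto the same deterministic event name.
const collectWindow = 5 * time.Minute

// recentEntries keeps only the entries inside the lookback window.
func recentEntries(all []historyEntry, now time.Time) []historyEntry {
	var recent []historyEntry
	for _, e := range all {
		if now.Sub(e.Timestamp) <= collectWindow {
			recent = append(recent, e)
		}
	}
	return recent
}
```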

- A list of `PacemakerClusterNodeStatus` objects representing the state of the nodes registered by pacemaker

The `PacemakerClusterNodeStatus` consists of:
- The name and IP address of the node
Contributor

Let's make sure the IP address API we build reflects that of the Node object, i.e. has the ability to provide multiple addresses and specify the type of each address.
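
A hypothetical shape for that, mirroring corev1.NodeAddress so a node can report multiple typed addresses; the type name is an assumption for this sketch.

```go
package pacemaker

import corev1 "k8s.io/api/core/v1"

// PacemakerClusterNodeAddress mirrors corev1.NodeAddress: a typed address so a
// node can report several (Hostname, InternalIP, ExternalIP, ...).
type PacemakerClusterNodeAddress struct {
	// Type is the address type, e.g. InternalIP or Hostname.
	Type corev1.NodeAddressType `json:"type"`
	// Address is the address value.
	Address string `json:"address"`
}
```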

- Updated warning against baremetal platform including BMC block
- Updated test section to note that we'll skip requirements criteria if no requirements are provided
- Added a new block that explains the PacemakerCluster API, the status collector, and the health check controller
@openshift-ci
Contributor

openshift-ci bot commented Dec 17, 2025

@jaypoulz: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@JoelSpeed
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 22, 2025