
baremetal: Propose BMC-less remediation enhancement (AKA poison pill) #547

Closed
wants to merge 1 commit

Conversation


@n1r1 n1r1 commented Nov 24, 2020

Proposing an opt-in remediation feature that doesn't rely on BMC credentials.
Each node will take care of rebooting itself in case of a failure.

implementation PoC


Signed-off-by: Nir <niry@redhat.com>
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: n1r1
To complete the pull request process, please assign jsafrane after the PR has been reviewed.
You can assign the PR to them by writing /assign @jsafrane in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@cgwalters
Member

One thing I don't quite understand is the tradeoffs in having this be peer-to-peer versus keeping the source of truth in etcd. It seems like etcd needs to be canonical, so wouldn't we at least require e.g. the control plane nodes are involved in the final decision to power off a machine or so?

This work can also be useful for clusters **with** BMC credentials.
If there’s a network outage between the node running CAPBM (which expects to use BMC commands like power-off etc.) and the
unhealthy node, the power-cycle attempt cannot
succeed. Self health checks and remediation can resolve such cases, however that work/scenario is out of scope for


If we are not scoping this into the work, why are network outages where BMC credentials are available discussed?
We are not accounting for situations where BMC credentials are available but couldn't be used (that's at least how I read this).

Author

Using poison pill in BMC clusters is possible as an escalation path (e.g. first try to reboot with BMC, and if that doesn't work, poison pill kicks in).
To simplify the proposal we wanted to avoid outlining how this escalation path can work and how this coordination should be done.
We mention the benefit for BMC clusters just as something we might want to do in the future, or as a way to expand the feature.

* To allow the healthy majority to recover workloads after a pre-determined and finite interval
* To allow a node to take actions to remediate itself in case of a failure
* To allow a node to detect its health status when master<->node communication is partially failed
* To avoid false positives caused by failure of the control plane


Do we have a list of what these might be?

Author

I think this boils down to inaccessible etcd (no matter what caused it).
So if the control plane is not responding for some reason, we don't want to automatically assume that the node is unhealthy. That's why we contact the peers.
If we see that this is a wide failure, i.e. most peers can't access etcd, we assume it's a control plane failure and we don't reboot.

### Non-Goals

* Self healing in combination with existing reprovisioning or reboot based remediation mechanisms in the Machine Health Check controller
* Recovering from all types of failure


I think we need to better articulate what failures this will address (for documentation, testing and user experience evaluation).


## Proposal

Utilizing a DaemonSet to have a privileged critical pod running on each node. This pod will periodically check the


Would it be better to run these as static pods? https://kubernetes.io/docs/tasks/configure-pod-container/static-pod/

Author

What would be the added value?

1. Connectivity issues between the node and the control plane nodes (which doesn’t stem from local problem in the node)
1. Resource starvation, preventing our DaemonSet from initiating the check
1. Authentication failure to API (ServiceAccount deleted?)
1. RBAC disallows read operation on Machine (Role deleted or changed


Suggested change
1. RBAC disallows read operation on Machine (Role deleted or changed
1. RBAC disallows read operation on Machine (Role deleted or changed)

1. RBAC disallows read operation on Machine (Role deleted or changed

We only want to remediate if the problem is local (#1 and #4). Otherwise, we could create a "remediation storm" where
all nodes are trying to remediate themselves, even if the problem is not theirs.


How will we differentiate #1 from #2, or #1 from #3, to know what remediation tasks to execute?
If we can detect #1, #3 and #4 why aren't the nodes throwing alerts that can be used for other debugging purposes?

Author

#1 from #2 - using other peers. If other peers can access etcd, we know it's not #2. If other peers can't access etcd either, we assume it's #2. If the failing node can't reach its peers nor access etcd, we assume it's #1.

#1 from #3 - using other peers to differentiate. If other peers can reach etcd, we assume it's #3; if the suspected node can't reach etcd nor any of its peers, we assume it's #1.
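
To make the differentiation above concrete, here is a minimal sketch of how such a decision could look. It is illustrative only, not the PoC's actual code; the helper names, answer types, and thresholds are assumptions.

```go
// Illustrative sketch of the peer-based differentiation described above.
// Helper names, answer types, and thresholds are hypothetical, not from the PoC.
package peercheck

type peerAnswer int

const (
	answerHealthy     peerAnswer = iota // peer read etcd and thinks we are healthy
	answerUnhealthy                     // peer read etcd and thinks we are unhealthy
	answerAPIError                      // peer itself cannot reach etcd/the API (case #2)
	answerUnreachable                   // peer did not respond at all
)

// canReadMachineCR and askPeer would wrap the API client and the HTTP peer
// protocol in a real implementation; they are stubbed here for illustration.
func canReadMachineCR() bool         { return false }
func askPeer(addr string) peerAnswer { return answerUnreachable }

// shouldReboot decides whether the node should remediate itself.
func shouldReboot(peers []string) bool {
	if canReadMachineCR() {
		return false // the API path works, no self-remediation needed
	}
	var healthy, unhealthy, apiErrors, unreachable int
	for _, p := range peers {
		switch askPeer(p) {
		case answerHealthy:
			healthy++
		case answerUnhealthy:
			unhealthy++
		case answerAPIError:
			apiErrors++
		case answerUnreachable:
			unreachable++
		}
	}
	switch {
	case unreachable == len(peers):
		return true // can't reach etcd or any peer: assume a local failure (#1)
	case apiErrors > healthy+unhealthy:
		return false // most peers can't reach etcd either: control-plane failure (#2), don't reboot
	case unhealthy > 0:
		return true // peers read etcd and report us unhealthy: remediate
	default:
		return false // peers say we are healthy: likely an auth/RBAC issue (#3/#4)
	}
}
```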

Author

If we can detect #1, #3 and #4 why aren't the nodes throwing alerts that can be used for other debugging purposes?

If it's #1, the node probably won't be able to send alerts anywhere (e.g. if it has a local network failure).
To identify #3 we had to introduce communication between the peers; we couldn't identify this without them.
As for #4, there are some existing node conditions such as mem/disk pressure that are already reported. We use a watchdog device since we must make sure that either the poison pill pod is running or the machine is not running any workloads. Otherwise we risk running stateful sets with run-once semantics on two different nodes.
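
For illustration, the watchdog mechanism referred to here boils down to periodically "petting" a Linux watchdog device and letting the kernel/hardware reset the machine if the petting stops. A minimal sketch, assuming the standard /dev/watchdog interface; the device path and interval are placeholders.

```go
// Minimal sketch of feeding a Linux watchdog device. If the pod stops writing
// (crashes, is starved, or deliberately stops), the watchdog fires and the
// machine is reset. The device path and interval are illustrative.
package main

import (
	"log"
	"os"
	"time"
)

func main() {
	wd, err := os.OpenFile("/dev/watchdog", os.O_WRONLY, 0)
	if err != nil {
		log.Fatalf("cannot open watchdog device: %v", err)
	}
	defer wd.Close()

	for {
		// Each write "pets" the watchdog and postpones the reset.
		if _, err := wd.Write([]byte{0}); err != nil {
			log.Printf("failed to pet watchdog: %v", err)
		}
		time.Sleep(10 * time.Second) // must stay below the watchdog timeout
	}
}
```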

The semantics of the watchdog device automatically handle #4 and to distinguish between the remaining three situations,
the node will ask its peers whether they think it is healthy.

The possible responses from a peer are:


This will have an impact on east-west network traffic within the cluster. How much impact are we expecting?

Author

Assuming 256 bytes per request, the expected traffic volume is up to 256*nodes_count bytes.
So given a 1000-node cluster we're talking about up to 256KB overall (layer-7 volume) per unhealthy node.

If the control plane is really down, each node will contact its peers, which means about 256MB of traffic for the whole cluster, but in that case the cluster is already broken, so I don't think this 256MB will harm anything.

1. Mark the node as unschedulable (this prevents the node from running new workloads after reboot)
1. Add the current time to an annotation
1. Unhealthy node reboots itself
1. After *timeout-to-assume-node-rebooted* (either configurable or calculated by software) - delete the unhealthy node (to signal the cluster that the workload is no longer running there, and it is safe to run it elsewhere)
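
A rough sketch of how steps 1, 2 and 4 above could be driven with client-go; the annotation key and the blocking wait are placeholders and simplifications, not part of the proposal.

```go
// Illustrative sketch of steps 1, 2 and 4 above using client-go.
// The annotation key and the timeout handling are placeholders.
package remediate

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const rebootedAtAnnotation = "poison-pill.example.com/reboot-requested-at" // hypothetical key

func fenceNode(ctx context.Context, c kubernetes.Interface, nodeName string, assumeRebootedAfter time.Duration) error {
	node, err := c.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}

	// 1. Mark the node unschedulable so it gets no new workloads after reboot.
	node.Spec.Unschedulable = true

	// 2. Record when remediation was requested.
	if node.Annotations == nil {
		node.Annotations = map[string]string{}
	}
	node.Annotations[rebootedAtAnnotation] = time.Now().UTC().Format(time.RFC3339)

	if _, err := c.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{}); err != nil {
		return err
	}

	// 3. The unhealthy node reboots itself (not shown here).

	// 4. After the timeout, assume the node has rebooted and delete the Node
	//    object, signalling that its workloads can safely run elsewhere.
	time.Sleep(assumeRebootedAfter)
	return c.CoreV1().Nodes().Delete(ctx, nodeName, metav1.DeleteOptions{})
}
```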


Do we also want to pair this with a set of re-tries? Or a buffer +/- 20%?

Author

hmm can you elaborate? retries for which operation?


## Drawbacks

TBD


We mention a few above with false negatives; should we explore/outline more of those here?


## Alternatives

Instead of contacting other peers, we could use a shared storage to signal the health status for a peer.


etcd? If this happens and we lose connectivity to etcd, what issue does this cause?

@n1r1
Author

n1r1 commented Dec 20, 2020

One thing I don't quite understand is the tradeoffs in having this be peer-to-peer versus keeping the source of truth in etcd. It seems like etcd needs to be canonical, so wouldn't we at least require e.g. the control plane nodes are involved in the final decision to power off a machine or so?

etcd is the source of truth; it holds the information about the health status of the node.
The peer-to-peer protocol kicks in only if a node cannot access etcd for some reason.
The peer will check the health status in etcd and report it back to the node that couldn't access etcd itself. In some ways, you can think of it as a proxy to etcd.
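
A minimal sketch of that "proxy to etcd" idea: the peer exposes an HTTP endpoint, checks the asking node's health against the API server (backed by etcd), and answers. The endpoint path, port, and the health lookup are assumptions for illustration only.

```go
// Sketch of the peer side: answer "is node X healthy?" by consulting the
// API server (i.e. etcd) on behalf of a node that cannot reach it.
// Endpoint path, port, and the health lookup are illustrative only.
package main

import (
	"fmt"
	"log"
	"net/http"
)

// isNodeHealthy would read the Node/Machine status via the Kubernetes API
// in a real implementation; stubbed here.
func isNodeHealthy(nodeName string) (healthy bool, apiReachable bool) {
	return true, true
}

func main() {
	http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		nodeName := r.URL.Query().Get("node")
		healthy, apiReachable := isNodeHealthy(nodeName)
		switch {
		case !apiReachable:
			// This peer cannot reach etcd either; the caller can treat this
			// as a possible control-plane failure.
			http.Error(w, "api unreachable", http.StatusServiceUnavailable)
		case healthy:
			fmt.Fprintln(w, "healthy")
		default:
			fmt.Fprintln(w, "unhealthy")
		}
	})
	log.Fatal(http.ListenAndServe(":30300", nil)) // node-port value is a placeholder
}
```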


To avoid saturating the network, the node will pick *min(cluster_size/10, cluster_size-1)* random nodes, and ask these
nodes if they think it’s healthy using HTTP.
Each pod will listen on a specific node-port, and will run an HTTP web server that will be able to tell other peers if

Can we do it on multiple networks? Assuming (for example) there are multiple NICs, it might be that the 'external' network is unavailable, but the 'internal' is.

Author

Good point.
I guess it's possible and we should probably try all available IPs.
I'll update the proposal accordingly.
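
Following up on the multi-NIC point, a sketch of the asking side that simply tries every known address of a peer; the port and endpoint mirror the hypothetical server sketch above.

```go
// Sketch of querying a peer over each of its reported addresses in turn,
// so that an outage on one network does not hide a working path on another.
// Port and endpoint are the same placeholders used in the server sketch.
package peerclient

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// askPeer tries every known IP of the peer and returns the first answer.
func askPeer(peerIPs []string, myNodeName string) (string, error) {
	client := &http.Client{Timeout: 2 * time.Second}
	var lastErr error
	for _, ip := range peerIPs {
		url := fmt.Sprintf("http://%s:30300/health?node=%s", ip, myNodeName)
		resp, err := client.Get(url)
		if err != nil {
			lastErr = err
			continue // this NIC/network may be down; try the next address
		}
		body, err := io.ReadAll(resp.Body)
		resp.Body.Close()
		if err != nil {
			lastErr = err
			continue
		}
		return string(body), nil
	}
	return "", fmt.Errorf("peer unreachable on all addresses: %w", lastErr)
}
```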


Failure to read Machine CR could stem from one of the following:
1. The node is unhealthy and therefore can’t read the Machine CR (e.g. local network failure)
1. There’s an etcd failure
Contributor

Do we risk losing quorum by rebooting nodes as proposed here? cc @hexfusion

Author

We can reboot only if the etcd-quorum-guard PDB allows it, but the watchdog obviously doesn't respect PDBs.

Maybe at this stage it would be simpler to install only on workers (and lose the opportunity to remediate masters).
We can revisit this once we gain more experience and confidence.

cc @beekhof
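
To illustrate the PDB check mentioned above, a sketch that reads the etcd-quorum-guard PodDisruptionBudget and backs off if no disruptions are allowed. The namespace and PDB name are assumptions; note that this only *reads* the PDB status (real enforcement happens through the eviction API), and that the watchdog itself bypasses it entirely, as noted above.

```go
// Sketch of consulting a PodDisruptionBudget before rebooting a control-plane
// node. Namespace and PDB name are assumptions; adjust for the actual cluster.
package remediate

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func etcdDisruptionAllowed(ctx context.Context, c kubernetes.Interface) (bool, error) {
	pdb, err := c.PolicyV1().PodDisruptionBudgets("openshift-etcd").
		Get(ctx, "etcd-quorum-guard", metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	// DisruptionsAllowed > 0 means taking one more etcd member down should
	// not break quorum according to the PDB.
	return pdb.Status.DisruptionsAllowed > 0, nil
}
```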

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 11, 2021
@n1r1
Author

n1r1 commented Apr 11, 2021

/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 11, 2021
@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 10, 2021
@n1r1
Author

n1r1 commented Jul 11, 2021

We released the poison pill operator as a day-2 operator, which can be installed using OLM. See the upstream release:
https://operatorhub.io/operator/poison-pill-operator

/close

@openshift-ci
Contributor

openshift-ci bot commented Jul 11, 2021

@n1r1: Closed this PR.

In response to this:

We released the poison pill operator as a day-2 operator, which can be installed using OLM. See the upstream release:
https://operatorhub.io/operator/poison-pill-operator

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot closed this Jul 11, 2021