---
layout: default
title: Failure Detection
nav_order: 3
---

# Detecting Node Failures

{: .no_toc }

## Table of contents

{: .no_toc .text-delta }

1. TOC
{:toc}

A Node that remains in an unready state for 5 minutes is an obvious sign that a failure has occurred. However, depending on your particular physical environment, workloads, and tolerance for risk, other criteria or thresholds may be more appropriate.

## Node Healthcheck Controller

Generally Available
{: .label .label-green }


The Node Healthcheck Controller checks each Node's set of NodeConditions against the criteria and thresholds defined for it in NodeHealthCheck CRs. If the Node is deemed to be in a failed state, and remediation is appropriate, the controller will instantiate a RemediationRequest template (defined as part of the CR) that specifies the mechanism/controller to be used for recovery.
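For illustration, a NodeHealthCheck CR might look like the following sketch. The field names follow the medik8s NodeHealthCheck API, but the name, selector, thresholds, and remediation template shown here are illustrative assumptions rather than recommended values:

```yaml
apiVersion: remediation.medik8s.io/v1alpha1
kind: NodeHealthCheck
metadata:
  name: nodehealthcheck-sample              # hypothetical name
spec:
  selector:                                 # which Nodes this check watches
    matchExpressions:
      - key: node-role.kubernetes.io/worker
        operator: Exists
  unhealthyConditions:                      # criteria and thresholds for "failed"
    - type: Ready
      status: "False"
      duration: 300s                        # unready for 5 minutes
    - type: Ready
      status: Unknown
      duration: 300s
  remediationTemplate:                      # template instantiated on failure
    apiVersion: self-node-remediation.medik8s.io/v1alpha1
    kind: SelfNodeRemediationTemplate
    namespace: self-node-remediation        # hypothetical namespace
    name: self-node-remediation-resource-deletion-template
```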

Should the Node recover on its own, the Node Healthcheck controller removes the instantiated RemediationRequest. In all other respects, the RemediationRequest is owned by the target remediation mechanism and will persist until that controller is satisfied that remediation is complete. For some mechanisms that may mean the Node has entered a safe state (e.g. the underlying "hardware" has been deprovisioned); for others it may mean the Node coming back online (e.g. after a reboot).
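As an example of what gets instantiated, a remediation CR created from the template above might look roughly like this. The kind and fields assume the Self Node Remediation mechanism; the exact shape depends on the remediation provider:

```yaml
apiVersion: self-node-remediation.medik8s.io/v1alpha1
kind: SelfNodeRemediation
metadata:
  name: worker-1                            # conventionally named after the unhealthy Node
spec:
  remediationStrategy: ResourceDeletion     # assumed strategy; provider-specific
```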

Remediation is not always the correct response to a failure. Especially in larger clusters, we want to protect against failures that appear to take out large portions of compute capacity but are really the result of failures on or near the control plane.

For this reason, the healthcheck CR includes the ability to define a percentage or an absolute number of Nodes that can be considered candidates for concurrent remediation.
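A minimal sketch of such a limit, assuming the minHealthy field of the medik8s NodeHealthCheck API, which accepts either a percentage or an absolute count:

```yaml
spec:
  minHealthy: 51%    # remediate only while a majority of selected Nodes remain healthy
  # or, as an absolute count of Nodes:
  # minHealthy: 10
```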

## Background

See the FAQ.