Sample health checks for Nomad node problem detector (NNPD)
NOTE: These are not real health checks. They only serve as a reference for how your actual
health checks should be defined.
Nomad-node-problem-detector (NNPD) is a system that scans for problems
on Nomad client nodes and takes bad nodes out of the scheduling pool, so that
Nomad doesn't schedule any new jobs on them.
If the problem is transient and resolves itself after some time, NNPD puts the node back
in the scheduling pool during the next scanning cycle.
NNPD is composed of two main components:
- Detector
- Aggregator
- Detector runs on every Nomad client node and scans through a set of pre-defined health checks.
- This repo (nomad-health-checks) is just a sample repo showing how these health checks should be defined.
- This repo is mostly used by the Nomad-node-problem-detector (NNPD) repo for its integration tests.
- These are not real health checks and only serve as a reference for how your actual health checks should be defined.
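As a rough illustration of what "scanning through pre-defined health checks" could look like, here is a minimal Python sketch. It is not NNPD's actual implementation or health check format: the `HealthCheck` callable shape, the `CheckResult` type, and the `run_checks` function are all hypothetical names chosen for this example. It assumes each check reports a healthy/unhealthy flag plus a message, and that a check which crashes is treated as unhealthy.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical shape of a health check: a callable returning (healthy, message).
HealthCheck = Callable[[], Tuple[bool, str]]


@dataclass
class CheckResult:
    name: str
    healthy: bool
    message: str


def run_checks(checks: Dict[str, HealthCheck]) -> List[CheckResult]:
    """Run every registered check once and collect the results."""
    results = []
    for name, check in checks.items():
        try:
            healthy, message = check()
        except Exception as exc:
            # Assumption: a check that raises counts as a detected problem.
            healthy, message = False, f"check raised: {exc}"
        results.append(CheckResult(name, healthy, message))
    return results


# Two made-up checks, for illustration only.
def always_ok() -> Tuple[bool, str]:
    return True, "ok"


def always_fails() -> Tuple[bool, str]:
    raise RuntimeError("boom")


results = run_checks({"always_ok": always_ok, "always_fails": always_fails})
```

In this sketch, a detector would send `results` to the aggregator at the end of each scan cycle. How NNPD actually transports results is not covered here.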
Aggregator is the central component (the master) to which every detector (node) reports its problems.
Based on those results, the aggregator either takes the node out of the scheduling pool (bad node),
puts the node back in the scheduling pool (good node), or does nothing if the node's state has not changed.
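The three outcomes described above can be sketched as a small decision function. This is a hedged illustration, not the aggregator's real code: the `Action` enum and the `decide` function are hypothetical, and the sketch assumes the aggregator knows whether a node is currently in the scheduling pool and whether its latest report contains any problems.

```python
from enum import Enum


class Action(Enum):
    REMOVE_FROM_POOL = "remove_from_pool"  # bad node: stop scheduling new jobs on it
    RETURN_TO_POOL = "return_to_pool"      # node recovered: allow scheduling again
    NO_OP = "no_op"                        # no state change


def decide(in_pool: bool, has_problems: bool) -> Action:
    """Pick an action from the node's current state and its latest report."""
    if in_pool and has_problems:
        return Action.REMOVE_FROM_POOL
    if not in_pool and not has_problems:
        return Action.RETURN_TO_POOL
    # Healthy node already in the pool, or bad node already removed.
    return Action.NO_OP
```

For example, a node that was removed in an earlier cycle and now reports no problems would get `Action.RETURN_TO_POOL`, which matches the transient-problem case described earlier.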