
Roblox/nomad-node-problem-detector

Nomad-node-problem-detector (NNPD)

"A distributed system is a collection of autonomous compute nodes (sometimes unreliable) that appears to its users as a single coherent reliable system"

The goal of Nomad-node-problem-detector (a.k.a. NNPD) is to abstract these node problems away from the user, so that the user experience is more reliable when using the Nomad orchestration system.

Motivation

When a user submits a job ( job --> task_groups(N) --> tasks(N) ), each task in the job needs a task driver (e.g. docker, java, QEMU, containerd) to execute it. In the current architecture, if a task driver such as docker is unhealthy on a Nomad client node and one of the tasks in the job requires docker, the Nomad scheduler will not schedule this job on that particular Nomad client.

Question: What is the definition of a task driver being unhealthy?
Answer: A task driver executes a Fingerprint operation every X seconds (configurable within the task driver) and reports its HealthState to the Nomad client. The Nomad client reports this HealthState to the scheduler, which can then schedule jobs based on the health states of the task drivers running on each Nomad client node.

An example fingerprinting operation for docker task driver.

However,

  • If I need to add a custom health check in docker task driver, I would have to modify the fingerprinting operation here, add this new check and open a PR to hashicorp/nomad repo. This custom health check could also be specific to my environment, so adding it to the upstream hashicorp/nomad repo might not be possible. NNPD decouples this from the hashicorp/nomad codebase, and provides a framework for adding custom health checks easily.
  • Nomad clients could be running under CPU, memory or disk pressure at various times. NNPD constantly monitors the nodes for these situations and takes them out of the scheduling pool while they are under CPU, memory or disk pressure. It also puts the nodes back into the scheduling pool when the pressure is relieved.
  • The scheduler is only concerned with the task driver health state when scheduling jobs. However, there could be additional problems on the node, e.g. ntp service down, kernel issues, corrupted file systems. Checks for these can be integrated with NNPD, so that a node is taken out of the scheduling pool when it is unhealthy.

In a nutshell, NNPD provides a blackbox (a framework) where we can dump all our node problems. If a node runs into one of these problems, NNPD takes the node out of the scheduling pool, so no new jobs get scheduled on the faulty node until the problem is fixed. If the issue is transient and the node recovers, NNPD moves the node back into the scheduling pool, so new jobs can be scheduled on it again.

NOTE: NNPD, as the name Nomad-node-problem-detector suggests, is only concerned with problems happening on the node. Problems external to the node (e.g. docker registry down) should not be added to NNPD; otherwise it might take all the nodes out of the scheduling pool.

Architecture

NNPD is composed of two main components:

  • Detector: is responsible for scanning through the node health checks, and exposing the node health at /v1/nodehealth HTTP endpoint. It also exposes a /v1/health endpoint, which tells if the detector itself is healthy or unhealthy.

    Detector relies on an external health check repo, which is used for defining the node health checks.
    A sample health check repo is provided for reference: https://github.com/shishir-a412ed/nomad-health-checks

    NOTE: The sample health check repo does not contain real health checks; it only provides a reference for defining your own health checks.

  • Aggregator: is responsible for getting the node health (/v1/nodehealth) for each node running detector. Based on the node health results, aggregator will mark the node as eligible or ineligible for scheduling.

NNPD is packaged as a single Go binary, which can be run in either detector or aggregator mode.
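
The division of labour between the two components can be sketched in shell. This is only an illustration under stated assumptions, not NNPD's code: `eligibility_command` is a hypothetical helper, the node ID is made up, and the "true"/"false" verdict stands in for whatever the aggregator derives from a node's /v1/nodehealth response (the response format is not described here). The aggregator is configured with a Nomad HTTP API address, so the `nomad node eligibility` CLI command printed below is just a readable stand-in for the equivalent update.

```shell
# eligibility_command NODE_ID HEALTHY
# Prints the Nomad CLI command matching the aggregator's decision for one node.
# HEALTHY is a hypothetical "true"/"false" verdict derived from /v1/nodehealth.
eligibility_command() (
    node_id=$1
    healthy=$2
    if [ "$healthy" = "true" ]; then
        printf 'nomad node eligibility -enable %s\n' "$node_id"
    else
        printf 'nomad node eligibility -disable %s\n' "$node_id"
    fi
)

# Example: a node whose checks failed would be taken out of the scheduling pool.
eligibility_command 4d2ba53b false
```

The helper only prints the command, so it is safe to run on a machine without Nomad installed.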

Building Nomad-node-problem-detector (NNPD)

$ git clone git@github.com:Roblox/nomad-node-problem-detector.git
$ cd nomad-node-problem-detector
$ make build      # builds the npd binary
$ make install    # installs the npd binary in /usr/local/bin

NOTE: The binary name is npd even though the application is called nnpd.

Setup health checks repo

As mentioned in the Architecture section, detector relies on an external health check repo for determining the node health (/v1/nodehealth). A separate github repository can be defined for your health checks. This sample repo can be used as a reference.

At the root of the health check repo, a master config (config.json) will be defined. It has two main fields:

  • type: Directory name where the actual health check (health_check) is located.
  • health_check: Name of the health check script file.

For example, in the sample config.json, type docker and health_check docker_health_check.sh mean that docker_health_check.sh is located under the docker directory in the health checks repo.
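
How the two fields combine into a script location can be sketched as follows. `health_check_path` is a hypothetical helper written for this illustration, and the root directory used in the example is the detector's default --root-dir; the exact lookup NNPD performs may differ.

```shell
# health_check_path ROOT TYPE HEALTH_CHECK
# Prints where the script named by a config.json entry is expected to live:
# <root>/<type>/<health_check>.
health_check_path() {
    printf '%s/%s/%s\n' "$1" "$2" "$3"
}

# Example with the values from the sample config.json.
health_check_path /var/lib/nnpd docker docker_health_check.sh
# prints: /var/lib/nnpd/docker/docker_health_check.sh
```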

Deploy

Prerequisite:

  • npd should be installed on all Nomad client nodes.
  • detector should be deployed before aggregator.

Deploy detector

detector can be deployed using either an artifactory based job or a docker prestart hook based job.

NOTE: You only need to deploy detector using one of the modes, not both.

In either deployment mode (artifactory or docker prestart hook), the job first unpacks the health check repo onto the Nomad client filesystem under the Nomad allocation directory, and then starts the detector daemon. The detector can then scan (and execute) these health checks and expose the node health (/v1/nodehealth) for the aggregator.

Deploy detector (artifactory mode)

$ nomad job plan detector-artifact.nomad
$ nomad job run detector-artifact.nomad
$ nomad job status detector

Deploy detector (docker prestart hook mode)

How to deploy detector using docker prestart hook

Deploy aggregator

Official aggregator docker image: shm32/npd-aggregator:1.1.0
You can find the aggregator nomad job spec here

$ nomad job plan aggregator.nomad
$ nomad job run aggregator.nomad
$ nomad job status aggregator

Rolling upgrades

So, you were able to deploy detector and aggregator successfully, and the NNPD system is up and running.

Detector upgrade (artifactory mode)

Question: How do I add a new health check, and do a rolling upgrade on detector?

Detector upgrade (docker prestart hook mode)

  • git clone <your_health_check_repo>
  • Add your new health check in the locally cloned copy.
  • Don't forget to update your master config (config.json).
    hint: Use npd config generate --root-dir <dir> to update your master config.
  • Follow these instructions to upgrade your detector using docker prestart hook mode.

Authentication

You can enable token-based authentication for the detector HTTP endpoints (/v1/health/ and /v1/nodehealth/) by starting the detector with the --auth flag.

DETECTOR_HTTP_TOKEN=<your_token> environment variable must be set when deploying aggregator and detector jobs.
aggregator will use DETECTOR_HTTP_TOKEN to set the token in the authorization header when making the HTTP requests.
detector will use DETECTOR_HTTP_TOKEN for validating against the incoming token in the authorization header.

$ DETECTOR_HTTP_TOKEN=<your_token> npd detector --auth

The token is base64 encoded, so if you are trying things out using curl, you need to encode the token first before passing it in the authorization header.

$ echo -n <your_token> | base64
# Note down your base64 encoded token.
$ curl -H "Authorization: Basic <base64_encoded_token>" http://localhost:8083/v1/nodehealth/
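
The two manual steps above can be folded into a small helper. This is a minimal sketch: `basic_auth_header` is a hypothetical function name, and the only assumption about the detector is the header shape already shown in the curl command above.

```shell
# basic_auth_header TOKEN
# Prints the Authorization header for a raw (not yet encoded) token.
basic_auth_header() (
    # tr strips the line wrapping that some base64 implementations add.
    encoded=$(printf '%s' "$1" | base64 | tr -d '\n')
    printf 'Authorization: Basic %s\n' "$encoded"
)

# Usage (assumes a detector started with --auth on the default port):
#   curl -H "$(basic_auth_header "$DETECTOR_HTTP_TOKEN")" http://localhost:8083/v1/nodehealth/
```

Using printf instead of echo avoids a trailing newline ending up in the encoded token on shells where echo -n is not supported.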

NOTE: In order to keep NNPD performant and lightweight, TLS is not supported at this point.

Commands and Flags

Aggregator - Run npd in aggregator mode

npd aggregator --help for more info.

| Option | Type | Required | Default | Description |
|---|---|---|---|---|
| aggregation-cycle-time | string | no | 15s | Time (in seconds) to wait between each aggregation cycle. |
| debug | bool | no | false | Enable debug logging. |
| detector-port | string | no | :8083 | Detector HTTP server port. |
| detector-datacenter | []string | no | N/A | List of datacenters where detector is running. If no datacenters are provided, aggregator will only reach out to nodes in $NOMAD_DC datacenter. |
| enforce-health-check | []string | no | N/A | Health checks in this list will be enforced, i.e. the node will be taken out of the scheduling pool if the health check fails. |
| nomad-server | string | no | http://localhost:4646 | HTTP API address of a Nomad server or agent. |
| node-attribute | []string | no | N/A | Aggregator will filter nodes based on these attributes. E.g. if you set os.name=ubuntu, aggregator will only reach out to ubuntu nodes in the cluster. |
| threshold-percentage | int | no | 85 | If the number of eligible nodes goes below the threshold, npd will stop marking nodes as ineligible. |
| prometheus-server-port | int | no | 3000 | The port used to expose aggregator metrics in the prometheus format. |
| prometheus-server-addr | string | no | 0.0.0.0 | The address to bind the aggregator metrics exporter. |
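
The threshold-percentage safeguard can be illustrated with a little integer arithmetic. This is a sketch under assumptions: `can_mark_ineligible` is a hypothetical helper, the node counts are made up, and whether NNPD uses a strict or non-strict comparison at the boundary is not stated here (the sketch treats exactly 85% as still allowed).

```shell
# can_mark_ineligible ELIGIBLE TOTAL [THRESHOLD]
# Succeeds when the eligible share of nodes is still at or above THRESHOLD
# percent (default 85, matching the flag's default).
can_mark_ineligible() (
    eligible=$1
    total=$2
    threshold=${3:-85}
    [ "$total" -gt 0 ] || exit 1
    # Compare eligible/total >= threshold/100 without division.
    [ $((eligible * 100)) -ge $((threshold * total)) ]
)

# Example with hypothetical node counts: 16 of 20 eligible is 80%.
if can_mark_ineligible 16 20; then
    echo "ok to mark more nodes ineligible"
else
    echo "below threshold: stop marking nodes ineligible"
fi
```

The point of such a guard is that a bad health check rolled out everywhere cannot drain the whole cluster.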

Detector - Run nomad node problem detector HTTP server

npd detector --help for more info.

| Option | Type | Required | Default | Description |
|---|---|---|---|---|
| detector-cycle-time | string | no | 3s | Time (in seconds) to wait between each detector cycle. |
| port | string | no | :8083 | Address to listen on for detector HTTP server. NOTE: If your detector is listening on a non-default port, don't forget to start your aggregator with the --detector-port flag. This will inform aggregator which detector port to reach out to. |
| prometheus-metrics-path | string | no | /v1/metrics/ | Set the path that is used by the metrics endpoint to expose detector metrics in the prometheus format. |
| auth | bool | no | false | If set to true, DETECTOR_HTTP_TOKEN=<your_token> must be set as an environment variable when starting detector. |
| root-dir | string | no | /var/lib/nnpd | Location of health checks. |
| cpu-limit | string | no | 85 | CPU threshold in percentage. |
| memory-limit | string | no | 80 | Memory threshold in percentage. |
| disk-limit | string | no | 90 | Disk threshold in percentage. |
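
To make the percentage limits concrete, here is a rough sketch of a memory pressure check. It is not how NNPD measures usage (that is not described here): both helper names are hypothetical, the used share is assumed to be (MemTotal - MemAvailable) / MemTotal from /proc/meminfo-style input, and reaching the limit exactly is assumed to count as pressure.

```shell
# memory_used_percent
# Reads /proc/meminfo-style text on stdin and prints the used percentage,
# rounded down, computed as (MemTotal - MemAvailable) / MemTotal.
memory_used_percent() {
    awk '/^MemTotal:/ {total=$2} /^MemAvailable:/ {avail=$2}
         END { if (total > 0) printf "%d\n", (total - avail) * 100 / total }'
}

# over_limit USED LIMIT
# Succeeds when usage has reached the limit (both are whole percentages).
over_limit() {
    [ "$1" -ge "$2" ]
}

# Usage on a Linux node, with the default memory-limit of 80:
#   used=$(memory_used_percent < /proc/meminfo)
#   over_limit "$used" 80 && echo "memory pressure"
```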

Config - Run config and health checks related commands.

npd config --help for more info.

There are two subcommands in npd config command:

  • npd config generate - Generates the config.

    | Option | Type | Required | Default | Description |
    |---|---|---|---|---|
    | root-dir | string | no | pwd (present working directory) | Location of health checks. |

  • npd config build - Copy your health checks into a docker image.

    | Option | Type | Required | Default | Description |
    |---|---|---|---|---|
    | image | string | yes | N/A | Fully qualified docker image name. |
    | root-dir | string | no | pwd (present working directory) | Location of health checks. |

Tests

vagrant up starts a local vagrant VM named nnpd, which has all the dependencies required to run the integration tests (e.g. nomad, golang) already installed.

To run the tests locally in the vagrant VM:

$ vagrant up
$ vagrant ssh nnpd
$ sudo make test

Cleanup

$ make clean

This will delete your npd binary.

$ vagrant destroy

This will destroy your vagrant VM.

License

Copyright 2021 Roblox Corporation

Licensed under the Apache License, Version 2.0 (the "License"). For more information read the License.