domain | shortname | name | status | editor |
---|---|---|---|---|
github.com | 12/CHECK | LDR1 HA Health Checks | raw | Maxim Medved <max.medved@seagate.com> |
Some of the issues on the cluster can be detected by looking at systemd unit states. Some can't. This is why we have LDR1 HA Health Checks.
The primary goal of LDR1 HA is to make the LDR1 data stack and LDR1 management stack highly available. To do this, LDR1 HA monitors system state and, in case of hardware/software failures, changes the location and/or configuration of services so that they run on the part of the system that hasn't failed. Basic error detection in Pacemaker uses the resource agent monitor operation. For systemd units this is just monitoring the unit state: if the systemd unit fails, Pacemaker assumes that the resource has failed and acts accordingly. Most of our services are systemd services, so the only thing Pacemaker knows about them is whether their systemd unit is active or not. This kind of monitoring doesn't cover situations when the service itself is not functioning properly: it may be hung, deadlocked, etc. If a custom resource agent for our components does a similar health check - like "the process is running" - then it has the same issues as the systemd unit.
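As a minimal illustration of this gap (not part of the current implementation; the unit name and port are hypothetical), a systemd unit can report "active" while the service behind it no longer answers requests, so a functional probe is needed in addition to the unit state:

```python
# Sketch only: contrast "the unit is active" with "the service actually responds".
import socket
import subprocess

def unit_is_active(unit: str) -> bool:
    # This is effectively all that systemd-based monitoring in Pacemaker sees.
    return subprocess.run(["systemctl", "is-active", "--quiet", unit]).returncode == 0

def service_responds(host: str, port: int, timeout: float = 2.0) -> bool:
    # Functional probe: can we actually open a connection to the service?
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # A hung or deadlocked process keeps the unit "active" but fails the probe.
    print("unit active:", unit_is_active("haproxy"))          # hypothetical unit
    print("service responds:", service_responds("127.0.0.1", 80))  # hypothetical port
```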
There is another issue with system monitoring: if LDR1 HA starts watching all hardware and software components, it also needs to know exactly how they communicate and what they depend on. Such dependencies are not the same as the startup/shutdown dependencies which Pacemaker uses to start and stop the services, and without real communication dependencies Pacemaker can't know when components are failing because of communication issues. Adding communication dependencies to our current implementation is not even close to an easy task.
To make LDR1 HA able to detect whether components or the entire data/management stack are functioning, a set of health checks was introduced. The purpose of health checks is to detect when a component or a set of components stops functioning and then provide this information to LDR1 HA, so LDR1 HA can decide how to recover from this situation.
Health checks are implemented as checks of functionality. For one check this is the functionality of a single hardware component or network link; for another check this is the functionality of the entire data stack.
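As a rough sketch of the common shape such checks could take (the ids mirror the check-ids in the table below; the registration mechanism itself is hypothetical and not part of the current implementation), each check is a callable returning pass/fail, regardless of whether it covers a single link or the whole stack:

```python
# Hypothetical common shape for health checks, not the actual implementation.
from typing import Callable, Dict

Check = Callable[[], bool]

# check-id -> check function; ids mirror the table, functions here are stubs.
CHECKS: Dict[str, Check] = {}

def register(check_id: str, check: Check) -> None:
    CHECKS[check_id] = check

def run(check_id: str) -> bool:
    # LDR1 HA would call this and base its failover decision on the result.
    return CHECKS[check_id]()

# Placeholder registrations for illustration only.
register("internal-data", lambda: True)  # e.g. ping the other server over the data network
register("data1", lambda: True)          # e.g. S3 request to HAProxy on a single server
```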
implementation | integration | check-id | what it checks | how it checks | action if the check fails | components involved |
---|---|---|---|---|---|---|
EOS-4870 | EOS-4882 | data1 | data stack on a single server | S3 request to HAProxy | failover to the server where the test passes | Motr, S3 server |
EOS-4871 | EOS-4883 | data2 | data stack on both servers | S3 request to HAProxy | run data1 on each server | Motr, S3 server |
EOS-4872 | EOS-4884 | mgmt | management stack on a single server | ? | failover to another server | CSM |
EOS-4874 | EOS-4886 | external-data | connection to the outside world over data network | ping default gw for data network | ? | - |
EOS-4875 | EOS-4887 | external-mgmt | connection to the outside world over management network | ping default gw for management network | failover to another server if this is the server where CSM is running | - |
EOS-4877 | EOS-4889 | internal-data | network connectivity with other server over data network | ping other server over data network | run test on one server, failover to another server if it fails | - |
EOS-4878 | EOS-4890 | internal-mgmt | network connectivity with other server over management network | ping other server over management network | run test on one server, failover to another server if it fails | - |
EOS-6586 | EOS-6587 | cross-data | cross-server connection for data | ping other server over cross-server connection | choose one server, do failover | - |
EOS-4876 | EOS-4888 | bmc | BMC availability over network | ipmitool power status | failover to the server where BMC works | - |
EOS-4873 | EOS-4885 | saslink | SAS link | ? | ? | - |
EOS-4879 | EOS-4891 | consul | Consul | get Consul leader | choose one server, do failover | Hare |
EOS-4880 | EOS-4892 | sspl1 | SSPL end-to-end test on a single server | ? | failover to another server | SSPL |
EOS-4881 | EOS-4893 | sspl2 | SSPL end-to-end test on both servers | ? | ? | SSPL |
EOS-6588 | EOS-6589 | uds | UDS service | ? | failover to another server | UDS |
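The following is a minimal sketch of how a few of the checks from the table could be probed (addresses, credentials, and timeouts are hypothetical and would come from cluster configuration in practice); it is illustrative only, not the actual implementation:

```python
# Sketches of the "how it checks" column for ping-based checks, bmc, and consul.
import subprocess
import urllib.request

def ping_ok(addr: str, count: int = 1, timeout_s: int = 2) -> bool:
    # external-*/internal-*/cross-data: ping the default gateway or the other server.
    cmd = ["ping", "-c", str(count), "-W", str(timeout_s), addr]
    return subprocess.run(cmd, capture_output=True).returncode == 0

def bmc_ok(bmc_addr: str, user: str, password: str) -> bool:
    # bmc: "ipmitool power status" against the BMC over the network.
    cmd = ["ipmitool", "-I", "lanplus", "-H", bmc_addr,
           "-U", user, "-P", password, "power", "status"]
    return subprocess.run(cmd, capture_output=True).returncode == 0

def consul_has_leader(consul_addr: str = "http://127.0.0.1:8500") -> bool:
    # consul: the status/leader endpoint returns a non-empty address only when
    # the Consul cluster has an elected leader.
    try:
        with urllib.request.urlopen(consul_addr + "/v1/status/leader", timeout=2) as resp:
            return resp.read().strip(b'"\n') != b""
    except OSError:
        return False
```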
There are two options for integrating health checks into Pacemaker:
- Run the health check in the resource agent monitor function. This works if the component has a single custom resource agent. Example: if the SSPL component has its own resource agent, the health check can be done in the SSPL resource agent's monitor function. This way the SSPL resource agent will fail if SSPL stops functioning even if all SSPL processes are alive.
- Make a special resource agent that can invoke a health check function (the resource agent defines a resource type). Add resources of this type, then add dependencies on such a resource with the required failover logic. This might be required for checks whose actions depend on other checks. Example: the data2 check needs the data1 check to help with the decision. A minimal sketch of such an agent is shown after this list.
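Below is a minimal sketch of the second option, assuming the agent follows OCF conventions and uses a hypothetical data1-style probe as its health check; a complete agent would also implement meta-data, validate-all, and proper "not running" reporting, so this is illustrative only:

```python
#!/usr/bin/env python3
# Sketch of a check-only resource agent: the monitor action runs a health check
# and maps the result onto OCF exit codes, so Pacemaker treats a failed check
# as a failed resource and applies the constraints attached to it.
import socket
import sys

OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7  # a full agent returns this when the resource is intentionally stopped

def health_check() -> bool:
    # Hypothetical functional probe: can the local HAProxy S3 endpoint be reached?
    try:
        with socket.create_connection(("127.0.0.1", 443), timeout=2):
            return True
    except OSError:
        return False

def main() -> int:
    action = sys.argv[1] if len(sys.argv) > 1 else "monitor"
    if action in ("start", "stop"):
        # The check resource has nothing to start or stop.
        return OCF_SUCCESS
    if action == "monitor":
        return OCF_SUCCESS if health_check() else OCF_ERR_GENERIC
    return OCF_ERR_GENERIC

if __name__ == "__main__":
    sys.exit(main())
```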