[system-health] Add support for monitoring system health #4835

Merged
merged 65 commits into from
Oct 12, 2020
f3d3fb5
system health first commit
Junchao-Mellanox Jun 4, 2020
63623a7
system health daemon first commit
Junchao-Mellanox Jun 4, 2020
e988130
Finish healthd
Junchao-Mellanox Jun 5, 2020
7ed33df
Changes due to lower layer logic change
Junchao-Mellanox Jun 8, 2020
fd301e6
Get ASIC temperature from TEMPERATURE_INFO table
Junchao-Mellanox Jun 9, 2020
77d57cc
Add system health make rule and service files
Junchao-Mellanox Jun 9, 2020
ae00266
fix bugs found during manual test
Junchao-Mellanox Jun 9, 2020
ad8a740
Change make file to install system-health library to host
Junchao-Mellanox Jun 10, 2020
cf861fe
Set system LED to blink on bootup time
Junchao-Mellanox Jun 11, 2020
7eb6082
Caught exceptions in system health checker to make it more robust
Junchao-Mellanox Jun 11, 2020
91c43f0
fix issue that fan/psu presence will always be true
Junchao-Mellanox Jun 11, 2020
509fa5c
fix issue for external checker
Junchao-Mellanox Jun 11, 2020
d88515d
move system-health service to right after rc-local service
Junchao-Mellanox Jun 11, 2020
a198cc5
Set system-health service start after database service
Junchao-Mellanox Jun 15, 2020
30b4668
Get system up time via /proc/uptime
Junchao-Mellanox Jun 16, 2020
8fea891
Provide more information in stat for CLI to use
Junchao-Mellanox Jun 16, 2020
0134052
fix typo
Junchao-Mellanox Jun 16, 2020
f1def48
Set default category to External for external checker
Junchao-Mellanox Jun 17, 2020
7123b8e
If external checker reported OK, save it to stat too
Junchao-Mellanox Jun 17, 2020
d68a43c
Trim string for external checker output
Junchao-Mellanox Jun 17, 2020
b24c6f8
fix issue: PSU voltage check always return OK
Junchao-Mellanox Jun 18, 2020
d9d125d
Add unit test cases for system health library
Junchao-Mellanox Jun 23, 2020
465efa7
Fix LGTM warnings
Junchao-Mellanox Jun 23, 2020
8ca8a26
Merge branch 'master' into system-health
Junchao-Mellanox Jun 24, 2020
cd17e6b
fix demo comments: 1. get boot up timeout from monit configuration fi…
Junchao-Mellanox Jun 28, 2020
3fbff53
Remove boot_timeout configuration because it will get from monit conf…
Junchao-Mellanox Jun 28, 2020
a9dcb26
Fix argument miss
Junchao-Mellanox Jun 28, 2020
da272cc
fix unit test failure
Junchao-Mellanox Jun 28, 2020
622cb3e
fix issue: summary status is not correct
Junchao-Mellanox Jun 28, 2020
084c2e2
Fix format issues found in code review
Junchao-Mellanox Jul 6, 2020
f84cdd9
rename th to threshold to make it clearer
Junchao-Mellanox Jul 6, 2020
0a5ed17
Merge branch 'master' into system-health
Junchao-Mellanox Jul 31, 2020
0c1b6ff
Fix review comment: 1. add a .dep file for system health; 2. deprecat…
Junchao-Mellanox Aug 3, 2020
e1c62f7
Fix unit test failure
Junchao-Mellanox Aug 3, 2020
1092779
Fix LGTM alert
Junchao-Mellanox Aug 3, 2020
866c0d3
Fix LGTM alert
Junchao-Mellanox Aug 4, 2020
c237886
Merge branch 'master' into system-health
Junchao-Mellanox Aug 4, 2020
a05ca87
Merge branch 'master' into system-health
Junchao-Mellanox Aug 6, 2020
fbfd654
Merge branch 'system-health' of github.com:Junchao-Mellanox/sonic-bui…
Junchao-Mellanox Aug 6, 2020
7dc033b
Fix review comments
Junchao-Mellanox Aug 10, 2020
3c722e1
Fix review comment
Junchao-Mellanox Aug 10, 2020
911b6aa
Merge branch 'master' into system-health
Junchao-Mellanox Aug 10, 2020
035cec9
1. Add relevant comments for system health; 2. rename external_checke…
Junchao-Mellanox Aug 12, 2020
30235fc
Merge branch 'system-health' of github.com:Junchao-Mellanox/sonic-bui…
Junchao-Mellanox Aug 12, 2020
183ddcc
Ignore check for unknown service type
Junchao-Mellanox Aug 12, 2020
451a395
Fix unit test issue
Junchao-Mellanox Aug 12, 2020
011b3af
Rename user define checker to user defined checker
Junchao-Mellanox Aug 14, 2020
a30d9b5
Rename user_define_checkers to user_defined_checkers for configuratio…
Junchao-Mellanox Aug 14, 2020
001141c
Renmae file user_define_checker.py -> user_defined_checker.py
Junchao-Mellanox Aug 14, 2020
8ad6dc7
Fix typo
Junchao-Mellanox Aug 14, 2020
12eef05
Adjust import order for config.py
Junchao-Mellanox Aug 17, 2020
14808cf
Adjust import order for src/system-health/health_checker/hardware_che…
Junchao-Mellanox Aug 17, 2020
610fb49
Adjust import order for src/system-health/scripts/healthd
Junchao-Mellanox Aug 17, 2020
6d0ae4c
Adjust import orders in src/system-health/tests/test_system_health.py
Junchao-Mellanox Aug 17, 2020
aece158
Fix typo
Junchao-Mellanox Aug 17, 2020
8812061
Add new line after import
Junchao-Mellanox Aug 18, 2020
8ea2ab5
If system health configuration file not exist, healthd should exit
Junchao-Mellanox Sep 7, 2020
d4c2df4
Merge branch 'master' into system-health
Junchao-Mellanox Sep 9, 2020
9de4127
Fix indent and enable pytest coverage
Junchao-Mellanox Sep 10, 2020
c9f09b0
Fix typo
Junchao-Mellanox Sep 10, 2020
ae9a476
Fix typo
Junchao-Mellanox Sep 15, 2020
78a2dc6
Remove global logger and use log functions inherited from super class
Junchao-Mellanox Sep 15, 2020
cb7f5d2
Change info level logger to notice level
Junchao-Mellanox Sep 15, 2020
fe2f1be
Merge branch 'master' into system-health
Junchao-Mellanox Sep 22, 2020
466f983
Merge branch 'master' into system-health
Junchao-Mellanox Oct 9, 2020
@@ -1,11 +1,11 @@
{
"services_to_ignore": [],
"devices_to_ignore": ["psu.voltage", "psu.temperature"],
-"external_checkers": [],
+"user_defined_checkers": [],
"polling_interval": 60,
"led_color": {
-"fault": "orange",
-"normal": "green",
-"booting": "orange_blink"
+"fault": "orange",
+"normal": "green",
+"booting": "orange_blink"
}
}
@@ -1,11 +1,11 @@
{
"services_to_ignore": [],
"devices_to_ignore": ["psu.voltage"],
-"external_checkers": [],
+"user_defined_checkers": [],
"polling_interval": 60,
"led_color": {
-"fault": "orange",
-"normal": "green",
-"booting": "orange_blink"
+"fault": "orange",
+"normal": "green",
+"booting": "orange_blink"
}
}
@@ -1,11 +1,11 @@
{
"services_to_ignore": [],
"devices_to_ignore": ["psu","asic","fan"],
-"external_checkers": [],
+"user_defined_checkers": [],
"polling_interval": 60,
"led_color": {
-"fault": "orange",
-"normal": "green",
-"booting": "orange_blink"
+"fault": "orange",
+"normal": "green",
+"booting": "orange_blink"
}
}
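As a rough illustration of how a configuration file like the ones above could be consumed, the sketch below parses a JSON document, turns the list entries into sets, and looks up an LED color with a fallback to defaults. The helper names `parse_health_config` and `led_color_for` are hypothetical and not part of this change; the default colors mirror `DEFAULT_LED_CONFIG` in `health_checker/config.py`.

```python
import json

# Defaults mirror DEFAULT_LED_CONFIG in health_checker/config.py from this change.
DEFAULT_LED_CONFIG = {'fault': 'red', 'normal': 'green', 'booting': 'orange_blink'}


def parse_health_config(text):
    """Hypothetical helper: parse a system health JSON document into plain values."""
    data = json.loads(text)

    def as_set(key):
        # Only list values are accepted; anything else is treated as "not configured".
        value = data.get(key)
        return set(value) if isinstance(value, list) else None

    return {
        'interval': data.get('polling_interval', 60),
        'ignore_services': as_set('services_to_ignore'),
        'ignore_devices': as_set('devices_to_ignore'),
        'user_defined_checkers': as_set('user_defined_checkers'),
        'led_color': data.get('led_color', {}),
    }


def led_color_for(config, status):
    """Hypothetical helper: platform override first, then the built-in default."""
    return config['led_color'].get(status, DEFAULT_LED_CONFIG[status])


SAMPLE = '''
{
    "services_to_ignore": [],
    "devices_to_ignore": ["psu.voltage", "psu.temperature"],
    "user_defined_checkers": [],
    "polling_interval": 60,
    "led_color": {"fault": "orange", "normal": "green", "booting": "orange_blink"}
}
'''

config = parse_health_config(SAMPLE)
print(led_color_for(config, 'fault'))
```

With the sample above, the platform file overrides the default fault color; an empty JSON object would fall back to the built-in defaults for every status.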
10 changes: 10 additions & 0 deletions files/build_templates/sonic_debian_extension.j2
@@ -172,6 +172,12 @@ sudo cp {{platform_common_py2_wheel_path}} $FILESYSTEM_ROOT/$PLATFORM_COMMON_PY2
sudo https_proxy=$https_proxy LANG=C chroot $FILESYSTEM_ROOT pip install $PLATFORM_COMMON_PY2_WHEEL_NAME
sudo rm -rf $FILESYSTEM_ROOT/$PLATFORM_COMMON_PY2_WHEEL_NAME

# Install system-health Python 2 package
SYSTEM_HEALTH_PY2_WHEEL_NAME=$(basename {{system_health_py2_wheel_path}})
sudo cp {{system_health_py2_wheel_path}} $FILESYSTEM_ROOT/$SYSTEM_HEALTH_PY2_WHEEL_NAME
sudo https_proxy=$https_proxy LANG=C chroot $FILESYSTEM_ROOT pip install $SYSTEM_HEALTH_PY2_WHEEL_NAME
sudo rm -rf $FILESYSTEM_ROOT/$SYSTEM_HEALTH_PY2_WHEEL_NAME

# Install sonic-platform-common Python 3 package
PLATFORM_COMMON_PY3_WHEEL_NAME=$(basename {{platform_common_py3_wheel_path}})
sudo cp {{platform_common_py3_wheel_path}} $FILESYSTEM_ROOT/$PLATFORM_COMMON_PY3_WHEEL_NAME
@@ -283,6 +289,10 @@ sudo mkdir -p $FILESYSTEM_ROOT/etc/systemd/system/syslog.socket.d
sudo cp $IMAGE_CONFIGS/syslog/override.conf $FILESYSTEM_ROOT/etc/systemd/system/syslog.socket.d/override.conf
sudo cp $IMAGE_CONFIGS/syslog/host_umount.sh $FILESYSTEM_ROOT/usr/bin/

# Copy system-health files
sudo LANG=C cp $IMAGE_CONFIGS/system-health/system-health.service $FILESYSTEM_ROOT_USR_LIB_SYSTEMD_SYSTEM
echo "system-health.service" | sudo tee -a $GENERATED_SERVICE_FILE

# Copy logrotate.d configuration files
sudo cp -f $IMAGE_CONFIGS/logrotate/logrotate.d/* $FILESYSTEM_ROOT/etc/logrotate.d/

11 changes: 11 additions & 0 deletions files/image_config/system-health/system-health.service
@@ -0,0 +1,11 @@
[Unit]
Description=SONiC system health monitor
Requires=database.service updategraph.service
After=database.service updategraph.service

[Service]
ExecStart=/usr/local/bin/healthd
Restart=always

[Install]
WantedBy=multi-user.target
8 changes: 8 additions & 0 deletions rules/system-health.dep
@@ -0,0 +1,8 @@
SPATH := $($(SYSTEM_HEALTH)_SRC_PATH)
DEP_FILES := $(SONIC_COMMON_FILES_LIST) rules/system-health.mk rules/system-health.dep
DEP_FILES += $(SONIC_COMMON_BASE_FILES_LIST)
DEP_FILES += $(shell git ls-files $(SPATH))

$(SYSTEM_HEALTH)_CACHE_MODE := GIT_CONTENT_SHA
$(SYSTEM_HEALTH)_DEP_FLAGS := $(SONIC_COMMON_FLAGS_LIST)
$(SYSTEM_HEALTH)_DEP_FILES := $(DEP_FILES)
9 changes: 9 additions & 0 deletions rules/system-health.mk
@@ -0,0 +1,9 @@
# system health python2 wheel

SYSTEM_HEALTH = system_health-1.0-py2-none-any.whl
$(SYSTEM_HEALTH)_SRC_PATH = $(SRC_PATH)/system-health
$(SYSTEM_HEALTH)_PYTHON_VERSION = 2
$(SYSTEM_HEALTH)_DEPENDS = $(SONIC_PY_COMMON_PY2) $(SWSSSDK_PY2) $(SONIC_CONFIG_ENGINE)
SONIC_PYTHON_WHEELS += $(SYSTEM_HEALTH)

export system_health_py2_wheel_path="$(addprefix $(PYTHON_WHEELS_PATH)/,$(SYSTEM_HEALTH))"
3 changes: 2 additions & 1 deletion slave.mk
@@ -819,7 +819,8 @@ $(addprefix $(TARGET_PATH)/, $(SONIC_INSTALLERS)) : $(TARGET_PATH)/% : \
$(addprefix $(PYTHON_WHEELS_PATH)/,$(REDIS_DUMP_LOAD_PY2)) \
$(addprefix $(PYTHON_WHEELS_PATH)/,$(SONIC_PLATFORM_API_PY2)) \
$(addprefix $(PYTHON_WHEELS_PATH)/,$(SONIC_YANG_MODELS_PY3)) \
-$(addprefix $(PYTHON_WHEELS_PATH)/,$(SONIC_YANG_MGMT_PY))
+$(addprefix $(PYTHON_WHEELS_PATH)/,$(SONIC_YANG_MGMT_PY)) \
+$(addprefix $(PYTHON_WHEELS_PATH)/,$(SYSTEM_HEALTH))
$(HEADER)
# Pass initramfs and linux kernel explicitly. They are used for all platforms
export debs_path="$(IMAGE_DISTRO_DEBS_PATH)"
8 changes: 8 additions & 0 deletions src/system-health/.gitignore
@@ -0,0 +1,8 @@
*/deb_dist/
*/dist/
*/build/
*/*.tar.gz
*/*.egg-info
*/.cache/
*.pyc
*/__pycache__/
2 changes: 2 additions & 0 deletions src/system-health/health_checker/__init__.py
@@ -0,0 +1,2 @@
from . import hardware_checker
from . import service_checker
144 changes: 144 additions & 0 deletions src/system-health/health_checker/config.py
@@ -0,0 +1,144 @@
import json
import os

from sonic_py_common import device_info


class Config(object):
    """
    Manage configuration of system health.
    """

    # Default system health check interval
    DEFAULT_INTERVAL = 60

    # Default boot up timeout. After a reboot, system health waits for this timeout before it starts working.
    DEFAULT_BOOTUP_TIMEOUT = 300

    # Default LED configuration. Different platforms have different LED capabilities; this configuration allows
    # vendors to override the default behavior.
    DEFAULT_LED_CONFIG = {
        'fault': 'red',
        'normal': 'green',
        'booting': 'orange_blink'
    }

    # System health configuration file name
    CONFIG_FILE = 'system_health_monitoring_config.json'

    # Monit service configuration file path
    MONIT_CONFIG_FILE = '/etc/monit/monitrc'

    # Monit service start delay configuration entry
    MONIT_START_DELAY_CONFIG = 'with start delay'

    def __init__(self):
        """
        Constructor. Initialize all configuration entries to default values in case there is no configuration file.
        """
        self.platform_name = device_info.get_platform()
        self._config_file = os.path.join('/usr/share/sonic/device/', self.platform_name, Config.CONFIG_FILE)
        self._last_mtime = None
        self.config_data = None
        self.interval = Config.DEFAULT_INTERVAL
        self.ignore_services = None
        self.ignore_devices = None
        self.user_defined_checkers = None

    def config_file_exists(self):
        return os.path.exists(self._config_file)

    def load_config(self):
        """
        Load the configuration file from disk.
            1. If there is no configuration file, current config entries are reset to default values
            2. Only read the configuration file if its modification time has changed, for better performance
            3. If there are any format issues in the configuration file, current config entries are reset to
               default values
        :return:
        """
        if not self.config_file_exists():
            if self._last_mtime is not None:
                self._reset()
            return

        # Compare the modification time only; the full os.stat() result also includes the access time.
        mtime = os.stat(self._config_file).st_mtime
        if mtime != self._last_mtime:
            try:
                self._last_mtime = mtime
                with open(self._config_file, 'r') as f:
                    self.config_data = json.load(f)

                self.interval = self.config_data.get('polling_interval', Config.DEFAULT_INTERVAL)
                self.ignore_services = self._get_list_data('services_to_ignore')
                self.ignore_devices = self._get_list_data('devices_to_ignore')
                self.user_defined_checkers = self._get_list_data('user_defined_checkers')
            except Exception:
                self._reset()

    def _reset(self):
        """
        Reset current configuration entries to default values
        :return:
        """
        self._last_mtime = None
        self.config_data = None
        self.interval = Config.DEFAULT_INTERVAL
        self.ignore_services = None
        self.ignore_devices = None
        self.user_defined_checkers = None

    def get_led_color(self, status):
        """
        Get desired LED color according to the input status
        :param status: System health status
        :return: String LED color
        """
        if self.config_data and 'led_color' in self.config_data:
            if status in self.config_data['led_color']:
                return self.config_data['led_color'][status]

        return self.DEFAULT_LED_CONFIG[status]

    def get_bootup_timeout(self):
        """
        Get boot up timeout from monit configuration file.
            1. If monit configuration file does not exist, return default value
            2. If there is any exception while parsing monit config, return default value
            3. If no start delay entry is found, return default value
        :return: Integer timeout value
        """
        if not os.path.exists(Config.MONIT_CONFIG_FILE):
            return self.DEFAULT_BOOTUP_TIMEOUT

        try:
            with open(Config.MONIT_CONFIG_FILE) as f:
                lines = f.readlines()
                for line in lines:
                    line = line.strip()
                    if not line:
                        continue

                    pos = line.find('#')
                    if pos == 0:
                        continue

                    # Strip a trailing comment only if one is present. find() returns -1 when there is
                    # no '#', and slicing with -1 would drop the last character of the line.
                    if pos != -1:
                        line = line[:pos]
                    pos = line.find(Config.MONIT_START_DELAY_CONFIG)
                    if pos != -1:
                        return int(line[pos + len(Config.MONIT_START_DELAY_CONFIG):].strip())
        except Exception:
            return self.DEFAULT_BOOTUP_TIMEOUT

        # No start delay entry found in the monit configuration
        return self.DEFAULT_BOOTUP_TIMEOUT

    def _get_list_data(self, key):
        """
        Get list type configuration data by key and remove duplicate elements.
        :param key: Key of the configuration entry
        :return: A set of configuration data if key exists and its value is a list, otherwise None
        """
        if key in self.config_data:
            data = self.config_data[key]
            if isinstance(data, list):
                return set(data)
        return None
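
One detail in `get_bootup_timeout` worth illustrating is comment handling while scanning for the `with start delay` entry: `str.find('#')` returns -1 when a line has no comment, so the result cannot be used as a slice end without a check. The standalone sketch below sidesteps this with `str.partition`. The name `parse_start_delay` and the sample text are hypothetical and only meant to show the parsing step, not the code in this change.

```python
def parse_start_delay(monitrc_text, default=300):
    """Hypothetical helper: return the 'with start delay' value from monitrc text, else default."""
    marker = 'with start delay'
    for raw in monitrc_text.splitlines():
        # Keep only the part before '#'. partition() returns the whole line
        # as the first element when there is no '#', so nothing is cut off.
        line = raw.partition('#')[0].strip()
        pos = line.find(marker)
        if pos == -1:
            continue
        try:
            return int(line[pos + len(marker):].strip())
        except ValueError:
            # The entry is present but its value is not an integer.
            return default
    return default


# Illustrative input only; not copied from a real device.
SAMPLE = """
# with start delay 10
set daemon 60
   with start delay 240   # trailing comment
"""

print(parse_start_delay(SAMPLE))
```

The default of 300 matches `DEFAULT_BOOTUP_TIMEOUT` above; a caller that wants a different fallback can pass `default` explicitly.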