Adjust system health HLD due to output of 'monit summary -B' command change #887

Junchao-Mellanox · 2021-10-25T01:33:44Z

Why I did this?

Command monit summary -B can no longer display the status for each critical process, system-health should not depend on it and need find a way to monitor the status of critical processes. The PR is to adjust the HLD and describe the new implementation.

Related PRs:

PR title	state	context
[sonic-buildimage] [system-health] No longer check critical process/service status via monit
[sonic-mgmt] [test_system_health.py] Adjust test cases of system health

…change

liat-grozovik · 2021-11-11T10:30:51Z

@yozhao101 could you please review and approve the HLD? code review of the listed PR is already in progress but IMO we should approve HLD before it.

yozhao101 · 2021-11-13T06:34:58Z

doc/system_health_monitoring/system-health-HLD.md

-By default any above services or file systems is not in good status will be considered as fault condition.
-
-### 1.2 Peripheral devices status which could impact the system health status
+System health monitor is intended to monitor both critical services/processes and peripheral device status and leverage system log, system status LED to and CLI command output to indicate the system status.


is intended to monitor both critical services/processes and peripheral device status ---> is designed to monitor critical services, critical processes and peripheral device status?

and leverage system log, system status LED to ---> by analyzing system log, system status LED and?

Actually, system health does not analyze "sytstem log, system status LED", instead, it triggers error log to system log and let user know what is happening, it also changes system status LED color according to the system status.

doc/system_health_monitoring/system-health-HLD.md

yozhao101 · 2021-11-13T07:03:05Z

doc/system_health_monitoring/system-health-HLD.md

@@ -156,12 +151,14 @@ Considering that different vendors platform may have different LED color capabil

 ## 2. System health monitor service business logic

-System health monitor daemon will running on the host, periodically(every 60s) check the "monit summary" command output and PSU, fan, thermal status which stored in the state DB, if anything wrong with the services monitored by monit or peripheral devices, system status LED will be set to fault status. When fault condition relieved, system status will be set to normal status. 
+System health monitor daemon will running on the host, periodically(every 60s) check critical services/processes status, the "monit summary" command output and PSU, fan, thermal status which stored in the state DB, if anything wrong with them, system status LED will be set to fault status. When fault condition relieved, system status will be set to normal status.


will running ---> will run?

, periodically(every 60s) check critical services/processes status, the "monit summary" command output and PSU, fan, thermal status which stored in the state DB, ---> and periodically (every 60 seconds) check critical services, processes status, output of the command "monit summary", PSU, Fan, and thermal status which is stored in the state DB.?

if anything wrong with them ---> If anything is abnormal?

How do we define the term anything?

anything could be: services like swss, syncd; processes like orchagent; hardware like ASIC, PSU; and any other thing that is monitored by user defined checker. So, basically anything means that any object which is monitored by system health.

doc/system_health_monitoring/system-health-HLD.md

yozhao101 · 2021-11-13T07:09:02Z

doc/system_health_monitoring/system-health-HLD.md


-If monit service is not avalaible, will consider system in fault condition.
-FAN/PSU/ASIC data not available will also considered as fault conditon.
+Empty FEATURE table will be considered as fault condition.


How do we define the term fault condition? This term appears frequently in this document.

fault condition means that the system status is bad, and system health service shall put an error log to syslog and mark the overall system status as "Not OK".

doc/system_health_monitoring/system-health-HLD.md

yozhao101 · 2021-11-13T07:18:16Z

doc/system_health_monitoring/system-health-HLD.md

+
+### 1.1 Monitor critical services/processes
+
+#### 1.1.1 Monitor critical services


We have a script container_checker which is run by Monit and it will check whether the expected running containers are actually running or not. Can we borrow that script to monitor the critical services please?

System health borrows some of the code in "container_checker". What do you mean by borrow?

keboliu · 2021-11-17T01:49:13Z

@yozhao101 would you please check the feedback from Junchao?

…onit (#9068) HLD updated here: sonic-net/SONiC#887 #### Why I did it Command `monit summary -B` can no longer display the status for each critical process, system-health should not depend on it and need find a way to monitor the status of critical processes. The PR is to address that. monit is still used by system-health to do file system check as well as customize check. #### How I did it 1. Get container names from FEATURE table 2. For each container, collect critical process names from file critical_processes 3. Use “docker exec -it <container_name> bash -c ‘supervisorctl status’” to get processes status inside container, parse the output and check if any critical processes exit #### How to verify it 1. Add unit test case to cover it 2. Adjust sonic-mgmt cases to cover it 3. Manual test

liat-grozovik · 2021-11-24T07:56:36Z

@yozhao101 could you please check recent updates and let us know if there is any further outstanding issue or this can be signoff?

Junchao-Mellanox · 2021-12-06T06:19:50Z

Hi @qiluo-msft , could you please review and sign-off?

Adjust system health HLD due to output of 'monit summary -B' command …

e40c669

…change

Junchao-Mellanox force-pushed the update-system-health branch from ebfe564 to e40c669 Compare October 25, 2021 01:34

keboliu previously approved these changes Oct 25, 2021

View reviewed changes

Junchao-Mellanox mentioned this pull request Oct 29, 2021

[system-health] No longer check critical process/service status via monit sonic-net/sonic-buildimage#9068

Merged

5 tasks

liat-grozovik requested a review from yozhao101 November 11, 2021 10:31