-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adjust system health HLD due to output of 'monit summary -B' command change #887
Adjust system health HLD due to output of 'monit summary -B' command change #887
Conversation
ebfe564
to
e40c669
Compare
@yozhao101 could you please review and approve the HLD? code review of the listed PR is already in progress but IMO we should approve HLD before it. |
By default any above services or file systems is not in good status will be considered as fault condition. | ||
|
||
### 1.2 Peripheral devices status which could impact the system health status | ||
System health monitor is intended to monitor both critical services/processes and peripheral device status and leverage system log, system status LED to and CLI command output to indicate the system status. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is intended to monitor both critical services/processes and peripheral device status
---> is designed to monitor critical services, critical processes and peripheral device status
?
and leverage system log, system status LED to
---> by analyzing system log, system status LED and
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, system health does not analyze "sytstem log, system status LED", instead, it triggers error log to system log and let user know what is happening, it also changes system status LED color according to the system status.
@@ -156,12 +151,14 @@ Considering that different vendors platform may have different LED color capabil | |||
|
|||
## 2. System health monitor service business logic | |||
|
|||
System health monitor daemon will running on the host, periodically(every 60s) check the "monit summary" command output and PSU, fan, thermal status which stored in the state DB, if anything wrong with the services monitored by monit or peripheral devices, system status LED will be set to fault status. When fault condition relieved, system status will be set to normal status. | |||
System health monitor daemon will running on the host, periodically(every 60s) check critical services/processes status, the "monit summary" command output and PSU, fan, thermal status which stored in the state DB, if anything wrong with them, system status LED will be set to fault status. When fault condition relieved, system status will be set to normal status. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will running
---> will run
?
, periodically(every 60s) check critical services/processes status, the "monit summary" command output and PSU, fan, thermal status which stored in the state DB,
---> and periodically (every 60 seconds) check critical services, processes status, output of the command "monit summary", PSU, Fan, and thermal status which is stored in the state DB.
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if anything wrong with them
---> If anything is abnormal
?
How do we define the term anything
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
anything could be: services like swss, syncd; processes like orchagent; hardware like ASIC, PSU; and any other thing that is monitored by user defined checker. So, basically anything means that any object which is monitored by system health.
|
||
If monit service is not avalaible, will consider system in fault condition. | ||
FAN/PSU/ASIC data not available will also considered as fault conditon. | ||
Empty FEATURE table will be considered as fault condition. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do we define the term fault condition
? This term appears frequently in this document.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fault condition
means that the system status is bad, and system health service shall put an error log to syslog and mark the overall system status as "Not OK".
|
||
### 1.1 Monitor critical services/processes | ||
|
||
#### 1.1.1 Monitor critical services |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a script container_checker
which is run by Monit and it will check whether the expected running containers are actually running or not. Can we borrow that script to monitor the critical services please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
System health borrows some of the code in "container_checker". What do you mean by borrow?
@yozhao101 would you please check the feedback from Junchao? |
…onit (#9068) HLD updated here: sonic-net/SONiC#887 #### Why I did it Command `monit summary -B` can no longer display the status for each critical process, system-health should not depend on it and need find a way to monitor the status of critical processes. The PR is to address that. monit is still used by system-health to do file system check as well as customize check. #### How I did it 1. Get container names from FEATURE table 2. For each container, collect critical process names from file critical_processes 3. Use “docker exec -it <container_name> bash -c ‘supervisorctl status’” to get processes status inside container, parse the output and check if any critical processes exit #### How to verify it 1. Add unit test case to cover it 2. Adjust sonic-mgmt cases to cover it 3. Manual test
@yozhao101 could you please check recent updates and let us know if there is any further outstanding issue or this can be signoff? |
Hi @qiluo-msft , could you please review and sign-off? |
Why I did this?
Command monit summary -B can no longer display the status for each critical process, system-health should not depend on it and need find a way to monitor the status of critical processes. The PR is to adjust the HLD and describe the new implementation.
Related PRs: