Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[system-health] Add support for monitoring system health #4835

Merged
merged 65 commits into from
Oct 12, 2020

Conversation

Junchao-Mellanox
Copy link
Collaborator

@Junchao-Mellanox Junchao-Mellanox commented Jun 23, 2020

- Why I did it

System health feature requires a service to collect system status and manage the system status LED.

- How I did it

  1. Add a health-check python library at src/system-health/health_check. This library is called by healthd and CLI to get system status information
  2. Add a healthd daemon to run as a service on host side. The daemon collects system status information at a configurable interval and puts results to state DB.
  3. Add make files and service file

- How to verify it

Manual test on SN2700. Regression test on SN2700, SN4600C, SN4700, SN3800, SN2100.

- Description for the changelog

- A picture of a cute animal (not mandatory but encouraged)

@lgtm-com

This comment has been minimized.

@lgtm-com

This comment has been minimized.

@lgtm-com

This comment has been minimized.

src/system-health/scripts/healthd Outdated Show resolved Hide resolved
src/system-health/scripts/healthd Outdated Show resolved Hide resolved
src/system-health/scripts/healthd Outdated Show resolved Hide resolved
jleveque
jleveque previously approved these changes Sep 15, 2020
Copy link
Contributor

@jleveque jleveque left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Please wait for other reviewers.

keboliu
keboliu previously approved these changes Sep 15, 2020
@keboliu
Copy link
Collaborator

keboliu commented Sep 17, 2020

@sujinmkang, does this look ok for you? if yes, would please approve?

sujinmkang
sujinmkang previously approved these changes Sep 21, 2020
@keboliu
Copy link
Collaborator

keboliu commented Sep 22, 2020

retest this please

@Junchao-Mellanox
Copy link
Collaborator Author

retest vs please

sujinmkang
sujinmkang previously approved these changes Sep 25, 2020
liat-grozovik
liat-grozovik previously approved these changes Oct 4, 2020
Copy link
Collaborator

@liat-grozovik liat-grozovik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and it comes with tests :-)

@liat-grozovik
Copy link
Collaborator

@Junchao-Mellanox could you please resolve conflicts and once all tests pass we can merge

@Junchao-Mellanox
Copy link
Collaborator Author

@liat-grozovik resolved.

@liat-grozovik liat-grozovik merged commit 1c97a03 into sonic-net:master Oct 12, 2020
@Junchao-Mellanox Junchao-Mellanox deleted the system-health branch October 13, 2020 03:40
santhosh-kt pushed a commit to santhosh-kt/sonic-buildimage that referenced this pull request Feb 25, 2021
)

* system health first commit

* system health daemon first commit

* Finish healthd

* Changes due to lower layer logic change

* Get ASIC temperature from TEMPERATURE_INFO table

* Add system health make rule and service files

* fix bugs found during manual test

* Change make file to install system-health library to host

* Set system LED to blink on bootup time

* Caught exceptions in system health checker to make it more robust

* fix issue that fan/psu presence will always be true

* fix issue for external checker

* move system-health service to right after rc-local service

* Set system-health service start after database service

* Get system up time via /proc/uptime

* Provide more information in stat for CLI to use

* fix typo

* Set default category to External for external checker

* If external checker reported OK, save it to stat too

* Trim string for external checker output

* fix issue: PSU voltage check always return OK

* Add unit test cases for system health library

* Fix LGTM warnings

* fix demo comments: 1. get boot up timeout from monit configuration file; 2. set system led in library instead of daemon

* Remove boot_timeout configuration because it will get from monit config file

* Fix argument miss

* fix unit test failure

* fix issue: summary status is not correct

* Fix format issues found in code review

* rename th to threshold to make it clearer

* Fix review comment: 1. add a .dep file for system health; 2. deprecated daemon_base and uses sonic-py-common instead

* Fix unit test failure

* Fix LGTM alert

* Fix LGTM alert

* Fix review comments

* Fix review comment

* 1. Add relevant comments for system health; 2. rename external_checker to user_define_checker

* Ignore check for unknown service type

* Fix unit test issue

* Rename user define checker to user defined checker

* Rename user_define_checkers to user_defined_checkers for configuration file

* Renmae file user_define_checker.py -> user_defined_checker.py

* Fix typo

* Adjust import order for config.py

Co-authored-by: Joe LeVeque <jleveque@users.noreply.github.com>

* Adjust import order for src/system-health/health_checker/hardware_checker.py

Co-authored-by: Joe LeVeque <jleveque@users.noreply.github.com>

* Adjust import order for src/system-health/scripts/healthd

Co-authored-by: Joe LeVeque <jleveque@users.noreply.github.com>

* Adjust import orders in src/system-health/tests/test_system_health.py

* Fix typo

* Add new line after import

* If system health configuration file not exist, healthd should exit

* Fix indent and enable pytest coverage

* Fix typo

* Fix typo

* Remove global logger and use log functions inherited from super class

* Change info level logger to notice level

Co-authored-by: Joe LeVeque <jleveque@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants