Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

system-health.service takes 100s to go down and delays warm shutdown causing LAGs flap #14964

Closed
vaibhavhd opened this issue May 5, 2023 · 3 comments

Comments

@vaibhavhd
Copy link
Contributor

vaibhavhd commented May 5, 2023

Description

Steps to reproduce the issue:

  1. Install 202205 image - latest 202205 image is good for repro.
  2. Warm reboot
  3. Repro rate: 1 in 10 iterations.

PR causing it: #4835

Describe the results you received:

Issue seen in 202205 branch, but may be also impacting other branches too.

Good case : system-health.service takes ~2s to shutdown. IMO 2s is also too long in this time critical path. But, it is better than bad case which takes ~100s.

Apr 28 07:04:45.463212 str-msn2700-01 INFO systemd[1]: Stopping SONiC system health monitor...
Apr 28 07:04:45.463953 str-msn2700-01 NOTICE healthd: Caught SIGTERM - exiting...
Apr 28 07:04:45.465299 str-msn2700-01 NOTICE healthd: Caught SIGTERM - exiting...
Apr 28 07:04:45.471012 str-msn2700-01 NOTICE healthd: Caught SIGTERM - exiting...
Apr 28 07:04:46.452616 str-msn2700-01 NOTICE healthd: Caught SIGTERM - exiting...
Apr 28 07:04:47.553187 str-msn2700-01 INFO systemd[1]: system-health.service: Succeeded.
Apr 28 07:04:47.553998 str-msn2700-01 INFO systemd[1]: Stopped SONiC system health monitor.

Apr 28 07:04:50.873897 str-msn2700-01 NOTICE admin: Rebooting with /sbin/kexec -e to SONiC-OS-20220531.26 ...

Bad case: system-health.service takes 100s to go down and keeps the system in waiting state. This causes warmreboot to fail as LACP session window of 90s is exceeded and LAGs flap and traffic is dropped.

Apr 28 07:53:44.494314 str-msn2700-01 INFO systemd[1]: Stopping SONiC system health monitor...
Apr 28 07:53:44.511256 str-msn2700-01 NOTICE healthd: Caught SIGTERM - exiting...
Apr 28 07:53:44.515118 str-msn2700-01 NOTICE healthd: Caught SIGTERM - exiting...
Apr 28 07:53:44.517966 str-msn2700-01 NOTICE healthd: Caught SIGTERM - exiting...
Apr 28 07:53:44.538264 str-msn2700-01 NOTICE healthd: Caught SIGTERM - exiting...
Apr 28 07:53:45.540920 str-msn2700-01 NOTICE healthd: Caught SIGTERM - exiting...
Apr 28 07:53:56.592671 str-msn2700-01 NOTICE healthd: Caught SIGTERM - exiting...
Apr 28 07:53:57.594171 str-msn2700-01 NOTICE healthd: message repeated 2 times: [ Caught SIGTERM - exiting...]
Apr 28 07:55:26.634031 str-msn2700-01 WARNING systemd[1]: system-health.service: State 'final-sigterm' timed out. Killing.
Apr 28 07:55:26.634317 str-msn2700-01 NOTICE systemd[1]: system-health.service: Killing process 4528 (healthd) with signal SIGKILL.
Apr 28 07:55:26.634317 str-msn2700-01 NOTICE systemd[1]: system-health.service: Killing process 4528 (healthd) with signal SIGKILL.
Apr 28 07:55:26.634483 str-msn2700-01 NOTICE systemd[1]: system-health.service: Killing process 4544 (healthd) with signal SIGKILL.
Apr 28 07:55:26.634483 str-msn2700-01 NOTICE systemd[1]: system-health.service: Killing process 4544 (healthd) with signal SIGKILL.
Apr 28 07:55:26.634629 str-msn2700-01 NOTICE systemd[1]: system-health.service: Killing process 4548 (healthd) with signal SIGKILL.
Apr 28 07:55:26.634629 str-msn2700-01 NOTICE systemd[1]: system-health.service: Killing process 4548 (healthd) with signal SIGKILL.
Apr 28 07:55:26.634799 str-msn2700-01 NOTICE systemd[1]: system-health.service: Killing process 4532 (healthd) with signal SIGKILL.
Apr 28 07:55:26.634799 str-msn2700-01 NOTICE systemd[1]: system-health.service: Killing process 4532 (healthd) with signal SIGKILL.
Apr 28 07:55:26.634945 str-msn2700-01 NOTICE systemd[1]: system-health.service: Killing process 4551 (healthd) with signal SIGKILL.
Apr 28 07:55:26.634945 str-msn2700-01 NOTICE systemd[1]: system-health.service: Killing process 4551 (healthd) with signal SIGKILL.
Apr 28 07:55:26.635088 str-msn2700-01 NOTICE systemd[1]: system-health.service: Killing process 4852 (healthd) with signal SIGKILL.
Apr 28 07:55:26.635088 str-msn2700-01 NOTICE systemd[1]: system-health.service: Killing process 4852 (healthd) with signal SIGKILL.
Apr 28 07:55:26.646842 str-msn2700-01 WARNING systemd[1]: system-health.service: Failed with result 'timeout'.
Apr 28 07:55:26.648759 str-msn2700-01 INFO systemd[1]: Stopped SONiC system health monitor.

Apr 28 07:55:28.882753 str-msn2700-01 NOTICE admin: Rebooting with /sbin/kexec -e to SONiC-OS-20220531.26 ...

Describe the results you expected:

system-health.service should go down as fast as possible (~0.1s).

Output of show version:

(paste your output here)

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

syslog_healthd_fail.log

@vaibhavhd vaibhavhd changed the title system-health.service cannot be killed with SIGTERM : causes warmreboot to fail system-health.service takes 100s to go down and delays warm shutdown causing LAGs flap May 5, 2023
@vaibhavhd
Copy link
Contributor Author

Syslog containing both good and bad case runs is attached.

@sg893052
Copy link
Contributor

This PR (#15212) will address the system-health service taking 100s to go down issue.

@adyeung
Copy link
Collaborator

adyeung commented Aug 10, 2023

All fixes merged

@adyeung adyeung closed this as completed Aug 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants