[services] Restart SwSS service upon unexpected critical process exit #2845
Review thread on these lines of the diff:
vrfmgrd
nbrmgrd
vxlanmgrd
intfsyncd
intfsyncd is no longer present.
Thanks! I meant to delete that line, but forgot. It won't cause any issues, but I'll open a new PR to remove it soon.
PR here: #2850
Commit message:
[services] Restart SwSS service upon unexpected critical process exit (sonic-net#2845)
* [service] Restart SwSS Docker container if orchagent exits unexpectedly
* Configure systemd to stop restarting swss if it attempts to restart more than 3 times in 20 minutes
* Move supervisor-proc-exit-listener script
* [docker-dhcp-relay] Enhance wait_for_intf.sh.j2 to utilize STATEDB
* Ensure dependent services stop/start/restart with SwSS
* Change 'StartLimitInterval' to 'StartLimitIntervalSec', as Stretch installs systemd 232 (>= v230)
* Also update journald.conf options
* Remove 'PartOf' option from unit files
* Add '$(SUPERVISOR_PROC_EXIT_LISTENER_SCRIPT)' to new shared docker-orchagent makefile
* Make supervisor-proc-exit-listener script read from 'critical_processes' file inside container
* Update critical_processes file for swss container
Change-Id: Ifd2383a4a3f6edfdf4d1ceffbd60e879673d7647
…cal process in syncd container exits unexpectedly (#3534) Add the same mechanism I developed for the SwSS service in #2845 to the syncd service. However, in order to cause the SwSS service to also exit and restart in this situation, I developed a docker-wait-any program which the SwSS service uses to wait for either the swss or syncd containers to exit.
- What I did
Restart SwSS service (and also restart dependent services) if any critical processes running in the swss container exit abnormally.
- How I did it
* Add a `supervisor-proc-exit-listener` event listener plugin for Supervisor in the SwSS Docker container, which loads a list of critical processes to monitor for unexpected exits.
* Configure systemd to stop restarting swss if it attempts to restart more than 3 times in 20 minutes. This limit can be cleared with `systemctl reset-failed` [we should probably also call this command in `config load_minigraph` before restarting services].
* Ensure dependent services stop and restart with SwSS (when calling `systemctl stop swss.service` and `systemctl restart swss.service`). However, this will not cause them to start with SwSS (when calling `systemctl start swss.service`). That functionality is enabled with the addition of the "WantedBy=" option.
* In the docker-dhcp-relay `wait_for_intf.sh.j2` script, instead of relying on `ip` commands, now check STATE_DB for interface entries with "state" == "ok".
* The `supervisor-proc-exit-listener` script resides in files/scripts/ so that the same script can be installed in multiple Docker containers. To add this solution to another container, one simply needs to do the following:
  * Add a `/etc/supervisor/critical_processes` file to the container specifying all critical processes, one per line.
- How it Works
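As a rough illustration of the mechanism, here is a minimal sketch of a Supervisor event listener that reacts to unexpected exits of critical processes. It is not the script from this PR: it follows Supervisor's documented event listener protocol (`READY`, a header line, a payload of `len` characters, then `RESULT 2\nOK`), and the function names, the use of `os.getppid()` to find Supervisor, and the sample event are assumptions made for the example.

```python
import io
import os
import signal


def parse_tokens(line):
    """Parse a 'key:value key:value' line into a dict."""
    return dict(token.split(":", 1) for token in line.split())


def load_critical_processes(text):
    """Return the set of process names, one per line, skipping blank lines."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def is_unexpected_critical_exit(headers, payload, critical):
    """True if the event reports an unexpected exit of a critical process."""
    if headers.get("eventname") != "PROCESS_STATE_EXITED":
        return False
    info = parse_tokens(payload)
    return info.get("expected") == "0" and info.get("processname") in critical


def terminate_supervisor():
    """Assumption: the listener is a child of supervisord, so signal the parent."""
    os.kill(os.getppid(), signal.SIGTERM)


def run_listener(stdin, stdout, critical, on_critical_exit):
    """Handle events until stdin is closed."""
    while True:
        stdout.write("READY\n")  # tell Supervisor we can accept an event
        stdout.flush()
        header_line = stdin.readline()
        if not header_line:
            break
        headers = parse_tokens(header_line)
        # 'len' is a byte count; for ASCII payloads it equals the character count.
        payload = stdin.read(int(headers["len"]))
        if is_unexpected_critical_exit(headers, payload, critical):
            on_critical_exit()
        stdout.write("RESULT 2\nOK")  # acknowledge the event
        stdout.flush()


# Demo with an in-memory event instead of a real Supervisor (hypothetical values).
payload = "processname:orchagent groupname:orchagent from_state:RUNNING expected:0 pid:42"
header = ("ver:3.0 server:supervisor serial:1 pool:listener poolserial:1 "
          "eventname:PROCESS_STATE_EXITED len:%d\n" % len(payload))
fired = []
out = io.StringIO()
run_listener(io.StringIO(header + payload), out, {"orchagent"},
             lambda: fired.append(True))
print(fired)
```

In a real container the listener would presumably be started by Supervisor with `sys.stdin` and `sys.stdout`, the contents of `/etc/supervisor/critical_processes` passed through `load_critical_processes`, and `terminate_supervisor` as the callback.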
* If a critical process exits unexpectedly, `supervisor-proc-exit-listener` will send a `SIGTERM` signal to Supervisor, causing it to exit as well.
- How to verify it
Send a signal to one of the critical processes to cause it to appear to exit abnormally (e.g., `pkill -11 orchagent`). Ensure the swss, syncd, teamd, snmp, dhcp_relay, radv and telemetry services get restarted per the above details.
NOTE: My updates to systemd dependencies in this PR also fix #2752.
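When verifying repeatedly, keep in mind the restart limit from the commit message (systemd stops restarting swss after more than 3 restart attempts in 20 minutes). The sketch below is a simplified fixed-window model of such a limit, not systemd's implementation. The class and method names are hypothetical, and mapping "3 times" to a burst of 3 (in the spirit of `StartLimitBurst=`) is an assumption.

```python
class StartLimit:
    """Simplified fixed-window start limit (illustrative model only)."""

    def __init__(self, burst, interval_sec):
        self.burst = burst
        self.interval_sec = interval_sec
        self.window_start = None
        self.count = 0

    def allow_start(self, now_sec):
        """Return True if a start at time now_sec is within the limit."""
        if self.window_start is None or now_sec - self.window_start > self.interval_sec:
            # No window yet, or the window has elapsed: open a new one.
            self.window_start = now_sec
            self.count = 1
            return True
        if self.count < self.burst:
            self.count += 1
            return True
        return False

    def reset_failed(self):
        """Model of clearing the counter, as `systemctl reset-failed` does."""
        self.window_start = None
        self.count = 0


limit = StartLimit(burst=3, interval_sec=20 * 60)
# Four start attempts, five minutes apart (times in seconds).
results = [limit.allow_start(t) for t in (0, 300, 600, 900)]
print(results)  # [True, True, True, False]

limit.reset_failed()
after_reset = limit.allow_start(960)
print(after_reset)  # True
```

Under this model the fourth start inside the 20-minute window is refused, which is when a manual `systemctl reset-failed` would be needed before the service can be started again.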