Multiple restart of swss during config load fails to start swss #3244
Comments
@antony-rheneus: Could you please explain your issue more? The new settings (StartLimitInterval = 1200sec, StartLimitBurst = 3) now mean that if systemd has already restarted swss 3 times in the past 20 minutes, it should stop trying to restart the service and mark it as failed. Under what circumstances are you encountering this? It should never occur under normal operation. |
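To make the effect of those two settings concrete, here is a minimal sketch of a start-rate limit. It is a simplified model written for illustration, not systemd's actual code: the `StartLimiter` class and `simulate` helper are hypothetical names, and the assumption is that a window opens at the first start and up to `burst` starts are allowed until `interval_sec` has elapsed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StartLimiter:
    """Simplified, hypothetical model of a start-rate limit."""
    interval_sec: float
    burst: int
    window_start: Optional[float] = None
    count: int = 0

    def allow_start(self, now: float) -> bool:
        # Open a fresh window on the first start, or once the old window expired.
        if self.window_start is None or now - self.window_start > self.interval_sec:
            self.window_start = now
            self.count = 1
            return True
        # Inside the window: allow starts until the burst budget is used up.
        if self.count < self.burst:
            self.count += 1
            return True
        return False


def simulate(interval_sec, burst, start_times):
    """Return, for each start time (in seconds), whether the start is allowed."""
    limiter = StartLimiter(interval_sec, burst)
    return [limiter.allow_start(t) for t in start_times]


# Four intentional restarts (e.g. config pushes), five minutes apart.
pushes = [0, 300, 600, 900]
new_settings = simulate(1200, 3, pushes)  # StartLimitInterval=1200, StartLimitBurst=3
old_defaults = simulate(10, 5, pushes)    # default interval of 10 s, burst of 5
```

Under this model, the fourth restart inside the 20-minute window is refused with the new settings, while the old defaults allow all four because each restart falls outside the previous 10-second window.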
@jleveque We are having the same issue. Configuration that can't yet be applied incrementally can't be pushed more often than the start limit allows within the time window; once the limit is hit, swss does not start again. |
@nikos-github: To clarify, you have a need to perform > 3 configuration pushes (and thus >3 SwSS restarts, due to lack of incremental config) within 20 minutes? Is this correct? |
@jleveque, if you run the test suite, this fails because the test suite changes configs multiple times and tests them one by one. Continuous restarting is invalid only when an issue in the application made it exit or dump core; in that case an endless loop in the same state has to be avoided. We cannot add a prevention for this in generic infrastructure that also affects intentional restarts. |
@jleveque Could you give some pointers on why this was changed from the default systemd service values? Any insight into the analysis behind these new values would help me understand the reason for the change. |
@jleveque That is correct. Currently not all configuration pertaining to sonic can be applied incrementally or without restarting swss. Keep in mind that users may also push configuration through our software at different times which when applied will force a swss restart. I don't think there is a deterministic way to predict how many times swss should be allowed to restart and in what interval. |
@antony-rheneus: This was changed from the default values once we added the 'auto-restart-upon-critical-process-crash' feature (#2845). This is to prevent SONiC from indefinitely restarting the service if there is something causing one of the critical processes to crash consistently. |
Can we add "systemctl reset-failed swss" to sonic-utilities, wherever config load/load_minigraph is called, to reset the restart counter? |
The counter also needs to be cleared for the config reload operation. This is not only for test suites, but also to allow pushing/changing new configurations as often as desired, without restrictions. |
We can. This is something I was already considering adding. I'll look into creating a PR. |
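The suggestion above can be sketched as follows. This is only an illustration under stated assumptions, not the content of any actual sonic-utilities change: `restart_service_clearing_limit` is a hypothetical helper name, and the `run` parameter exists only so the command runner can be swapped out. The one real command relied on is `systemctl reset-failed`, which clears a unit's failed state along with its start rate limit counter.

```python
import subprocess


def restart_service_clearing_limit(service="swss", run=subprocess.check_call):
    """Hypothetical helper: clear the unit's start-limit state, then restart it.

    `run` is injectable (an assumption for illustration) so the helper can be
    exercised without touching a real systemd instance.
    """
    # Reset the failed state and start rate limit counter for the unit.
    run(["systemctl", "reset-failed", service])
    # An intentional restart now begins with a clean counter.
    run(["systemctl", "restart", service])
```

Calling the reset immediately before each intentional restart keeps the start limit in place for genuine crash loops while leaving operator-driven config loads unrestricted.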
@antony-rheneus, @avi-milner: PR here: sonic-net/sonic-utilities#607. Please review. |
Should be addressed by sonic-net/sonic-utilities#607 |
Hi @jleveque, can you please fix this so the failed counter is always reset for config load commands? I have opened sonic-net/sonic-utilities#616 |
https://github.com/Azure/sonic-buildimage/blob/67463f18b2ea396c1b3bab87575f803376a8046e/files/build_templates/swss.service.j2#L13
@jleveque, the interval has been increased from the default 10 sec to 1200 sec, but the burst has been decreased from the default 5 to 3.
Since the burst is too low for such a long interval, swss was not started by systemd.
Can we revert the burst, or increase it?